Short text similarity algorithm based on the edit distance and thesaurus

Publication date: 14.12.2016

Technical Transactions, 2016, Fundamental Sciences Issue 1-NP 2016, pp. 159 - 173

https://doi.org/10.4467/2353737XCT.16.149.5760

Authors

Artur Niewiarowski
Institute of Computer Science, Faculty of Physics, Mathematics and Computer Science of Cracow University of Technology

Abstract

This paper proposes a method for comparing short texts that uses the Levenshtein distance algorithm and a thesaurus to analyse the terms contained in the texts, instead of the popular methods that rely on a glossary of grammatical variations. The tested texts contain a variety of nouns and verbs together with grammatical or orthographical mistakes. The similarity of such texts is estimated using the proposed algorithm. The described technique is compared with the cosine distance, Dice distance and Jaccard distance methods built on term frequency. The proposed approach is competitive with well-known stemming and lemmatization algorithms.
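
The general idea in the abstract can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, whose details are not given on this page. It assumes a normalised Levenshtein similarity between terms, a tiny hypothetical thesaurus (`THESAURUS`), and a simple averaged best-match score between two texts. The function names and the scoring formula are all illustrative assumptions. A set-based Jaccard similarity is included only as a simplified stand-in for the term-based baselines the abstract mentions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def term_similarity(a: str, b: str) -> float:
    """Normalise edit distance to [0, 1]; 1.0 means identical terms."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical toy thesaurus (an assumption for illustration only).
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "quick": {"fast", "rapid"},
}


def expand(term: str) -> set[str]:
    """Return the term plus every member of any synonym group containing it."""
    variants = {term}
    for head, synonyms in THESAURUS.items():
        group = {head} | synonyms
        if term in group:
            variants |= group
    return variants


def best_match(term: str, candidates: list[str]) -> float:
    """Best similarity between any variant of `term` and any variant of a candidate."""
    return max(
        (term_similarity(v, w)
         for v in expand(term)
         for c in candidates
         for w in expand(c)),
        default=0.0,
    )


def text_similarity(text_a: str, text_b: str) -> float:
    """Average of best-match scores in both directions (an assumed scoring rule)."""
    terms_a, terms_b = text_a.lower().split(), text_b.lower().split()
    if not terms_a or not terms_b:
        return 0.0
    forward = sum(best_match(t, terms_b) for t in terms_a) / len(terms_a)
    backward = sum(best_match(t, terms_a) for t in terms_b) / len(terms_b)
    return (forward + backward) / 2


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Set-based Jaccard similarity over exact terms, used here as a baseline."""
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


if __name__ == "__main__":
    a, b = "quick car", "fast automobil"  # second text contains a misspelling
    print("edit distance + thesaurus:", text_similarity(a, b))
    print("Jaccard on exact terms:   ", jaccard_similarity(a, b))
```

In this sketch the misspelled "automobil" still matches "car" through the thesaurus entry "automobile", whereas the exact-term Jaccard baseline finds no overlap between the two texts.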

Information

Article type: Original article

Article status: Open

Licence: None

Percentage share of authors:

Artur Niewiarowski (Author) - 100%

Article corrections:

-

Publication languages:

English