TY - JOUR
TI - Short text similarity algorithm based on the edit distance and thesaurus
AU - Niewiarowski, Artur
AB - This paper proposes a method for comparing short texts that uses the Levenshtein distance algorithm and a thesaurus to analyse the terms contained in the texts, instead of the popular methods that rely on a glossary of grammatical variations. The tested texts contain a variety of nouns and verbs together with grammatical or orthographical mistakes. The similarity of such texts is estimated with the proposed algorithm. The described technique is compared with three measures built on term frequency: cosine distance, Dice distance and Jaccard distance. The proposed method is competitive with well-known stemming and lemmatization algorithms.
VL - 2016
IS - Nauki Podstawowe Zeszyt 1-NP 2016
PY - 2016
SN - 0011-4561
C1 - 2353-737X
SP - 159
EP - 173
DO - 10.4467/2353737XCT.16.149.5760
UR - https://ejournals.eu/czasopismo/czasopismo-techniczne/artykul/short-text-similarity-algorithm-based-on-the-edit-distance-and-thesaurus
KW - Levenshtein distance algorithm
KW - the edit distance
KW - thesaurus
KW - the measure of texts similarity
KW - plagiarism detection
KW - text mining
KW - Natural Language Processing
KW - Natural Language Understanding
KW - stemming
KW - lemmatization