Short text similarity algorithm based on the edit distance and thesaurus

Artur Niewiarowski

Abstract

This paper proposes a method for comparing short texts that uses the Levenshtein distance algorithm and a thesaurus to analyse the terms contained in the texts, instead of the popular methods that rely on a glossary of grammatical variations. The tested texts contain a variety of nouns and verbs together with grammatical or orthographical mistakes. The proposed algorithm is used to estimate the similarity of such texts. The described technique is compared with the cosine distance, the Dice distance and the Jaccard distance, each built on the term frequency method. The proposed method is competitive with well-known stemming and lemmatization algorithms.
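The general idea of combining an edit distance with a thesaurus can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, whose details are not given in the abstract. The synonym groups, the function names (`expand`, `term_similarity`, `text_similarity`), the length-based normalisation and the best-match averaging are all assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_similarity(a: str, b: str) -> float:
    """Edit distance normalised by the longer word (an assumed normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical toy thesaurus: each set is a group of interchangeable terms.
SYNONYM_GROUPS = [
    {"car", "automobile", "vehicle"},
    {"quick", "fast", "rapid"},
]


def expand(term: str, groups=SYNONYM_GROUPS) -> set:
    """Return the term together with every synonym listed for it."""
    result = {term}
    for group in groups:
        if term in group:
            result |= group
    return result


def term_similarity(a: str, b: str, groups=SYNONYM_GROUPS) -> float:
    """Best edit-distance similarity over all synonym pairs of the two terms."""
    return max(word_similarity(x, y)
               for x in expand(a, groups)
               for y in expand(b, groups))


def text_similarity(text1: str, text2: str, groups=SYNONYM_GROUPS) -> float:
    """Average best-match term similarity, computed in both directions."""
    terms1, terms2 = text1.lower().split(), text2.lower().split()
    if not terms1 or not terms2:
        return 1.0 if terms1 == terms2 else 0.0

    def directed(source, target):
        return sum(max(term_similarity(s, t, groups) for t in target)
                   for s in source) / len(source)

    return (directed(terms1, terms2) + directed(terms2, terms1)) / 2


if __name__ == "__main__":
    # "automobil" is a deliberate misspelling of a synonym of "car".
    print(text_similarity("the quick car", "the fast automobil"))
```

In this sketch a misspelled term can still match through the thesaurus, because the edit distance is computed against every synonym of the other term and not only against the term itself. No stemming or lemmatization dictionary is involved.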

Keywords

Levenshtein distance algorithm, the edit distance, thesaurus, the measure of texts similarity, plagiarism detection, text mining, Natural Language Processing, Natural Language Understanding, stemming, lemmatization

Information

Information: Czasopismo Techniczne, 2016, Nauki Podstawowe Zeszyt 1-NP 2016, pp. 159-173

DOI: https://doi.org/10.4467/2353737XCT.16.149.5760

Article type: Original research article

Titles:

Polish:

Short text similarity algorithm based on the edit distance and thesaurus

English:

Short text similarity algorithm based on the edit distance and thesaurus

Authors

Artur Niewiarowski

Institute of Computer Science, Faculty of Physics, Mathematics and Computer Science of Cracow University of Technology

Published: 14.12.2016

Article status: Open

License: None

Author contribution percentages:

Artur Niewiarowski (Author) - 100%

Article corrections:

-

Publication languages:

English

Number of views: 1657

Number of downloads: 1300
