Experiments with language combinatorics in text classification: lessons learned and future implications

Michal Ptaszynski; Fumito Masui

Abstract

This paper presents a meta-analysis of experiments performed with language combinatorics (LC), a novel method for language model generation and feature extraction based on combinatorial manipulations of sentence elements (e.g., words). In recent years, LC has been applied to a number of text classification tasks, such as affect analysis, cyberbullying detection, and future reference extraction. We summarize two of the most extensive experiments and discuss general implications for future implementations of the combinatorial language model.
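The abstract describes LC only at a high level, so the following is a minimal sketch of one possible reading of "combinatorial manipulations of sentence elements", not the authors' implementation. It assumes that patterns are ordered combinations of a sentence's words and that a gap between non-adjacent words is marked with `*`. The function name `ordered_patterns`, the `max_len` parameter, and the wildcard convention are illustrative assumptions.

```python
from itertools import combinations


def ordered_patterns(tokens, max_len=None):
    """Return every ordered combination of tokens as a pattern string.

    Assumption (not stated in the abstract): elements keep their original
    order, and a '*' marks a gap wherever skipped tokens lie between two
    chosen ones.
    """
    n = len(tokens)
    max_len = n if max_len is None else min(max_len, n)
    patterns = set()
    for k in range(1, max_len + 1):
        # combinations() yields index tuples in increasing order,
        # so the original word order is preserved.
        for idx in combinations(range(n), k):
            parts = [tokens[idx[0]]]
            for prev, cur in zip(idx, idx[1:]):
                if cur - prev > 1:
                    parts.append("*")  # at least one token was skipped here
                parts.append(tokens[cur])
            patterns.add(" ".join(parts))
    return patterns


if __name__ == "__main__":
    for pattern in sorted(ordered_patterns(["what", "a", "day"])):
        print(pattern)
```

A sentence of n distinct words yields up to 2^n - 1 such patterns, so the count grows exponentially with sentence length; capping the pattern length with a parameter such as the hypothetical `max_len` is one way to keep the output manageable.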

Keywords

language combinatorics, natural language processing, text classification

Information

Information: Czasopismo Techniczne, 2017, Volume 11 Year 2017 (114), pp. 183-197

DOI: https://doi.org/10.4467/2353737XCT.17.199.7428

Article type: Original research article

Authors

Michal Ptaszynski

Department of Computer Science, Kitami Institute of Technology, Japan

Fumito Masui

Department of Computer Science, Kitami Institute of Technology, Japan

Published: 22.11.2017

Article status: Open

License: None

Author contribution share:

Michal Ptaszynski (Author) - 50%

Fumito Masui (Author) - 50%

Article corrections:

-

Publication languages:

English
