Document content mining for authors’ identification task

Szymon Łukasik; Marcin Haręza; Marcin Kaczor

Abstract

This paper deals with automatic authorship attribution through document content analysis. The approach is based on selecting sets of suitable features that capture an author's specific use of grammatical structures, punctuation, and vocabulary, and then applying a chosen classification algorithm. The contribution first gives an overview of the text characteristics that can be employed for this purpose, then presents the results of computational experiments covering feature selection and classifier performance on the author identification problem. The paper concludes with a discussion and proposals for further research.
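The two-step pipeline described in the abstract (style features first, a classifier second) can be illustrated with a minimal sketch. It is not the authors' implementation: the feature lists `FUNCTION_WORDS` and `PUNCTUATION`, the helper names, and the nearest-centroid classifier are all assumptions chosen for brevity, since the abstract does not specify which features or which classifier the paper uses.

```python
import math
import string
from collections import Counter

# Hypothetical feature set (an assumption, not taken from the paper):
# a few English function words plus a few punctuation marks.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "but"]
PUNCTUATION = [",", ";", ":", "!", "?", "-"]


def extract_features(text):
    """Map a text to relative frequencies of function words and punctuation."""
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    word_counts = Counter(tokens)
    char_counts = Counter(text)
    n_words = max(len(tokens), 1)  # avoid division by zero on empty input
    n_chars = max(len(text), 1)
    word_part = [word_counts[w] / n_words for w in FUNCTION_WORDS]
    punct_part = [char_counts[p] / n_chars for p in PUNCTUATION]
    return word_part + punct_part


def centroid(vectors):
    """Component-wise mean of equally long feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train(samples):
    """samples: list of (author, text) pairs. Returns {author: centroid}."""
    by_author = {}
    for author, text in samples:
        by_author.setdefault(author, []).append(extract_features(text))
    return {author: centroid(vecs) for author, vecs in by_author.items()}


def predict(model, text):
    """Return the author whose centroid is nearest in Euclidean distance."""
    x = extract_features(text)
    return min(model, key=lambda author: math.dist(model[author], x))
```

A model would be built with `train([(author, text), ...])` and queried with `predict(model, new_text)`. The nearest-centroid rule stands in for whatever classifier is chosen; any classifier that accepts fixed-length numeric vectors could replace it without changing the feature extraction step.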

Keywords

author identification, feature selection, classification

Information

Technical Transactions, 2013, Automatic Control Issue 1-AC (2) 2013, pp. 3-15

DOI: https://doi.org/10.4467/2353737XCT.14.001.1989

Article type: Original article

Authors

Szymon Łukasik

Department of Automatic Control and Information Technology, Faculty of Electrical and Computer Engineering, Cracow University of Technology; Systems Research Institute, Polish Academy of Sciences

Marcin Haręza

Department of Automatic Control and Information Technology, Faculty of Electrical and Computer Engineering, Cracow University of Technology

Marcin Kaczor

Department of Automatic Control and Information Technology, Faculty of Electrical and Computer Engineering, Cracow University of Technology

Published at: 2013

Article status: Open

Licence: None

Percentage share of authors:

Szymon Łukasik (Author) - 33%

Marcin Haręza (Author) - 33%

Marcin Kaczor (Author) - 34%

Publication languages:

English
