Success Rates in Most-frequent-word-based Authorship Attribution. A Case Study of 1000 Polish Novels from Ignacy Krasicki to Jerzy Pilch

Jan Rybicki

Publication date: 21.09.2015

Studies in Polish Linguistics, Volume 10 (2015), Issue 2, pp. 87-104

https://doi.org/10.4467/23005920SPL.15.004.3561

Abstract

The success rate of authorship attribution by multivariate analysis of most-frequent-word frequencies is studied in a 1000-novel corpus of Polish literary works from the late 18th to the early 21st century. The results are examined for possible influences of the number of authors and/or the number of texts to be attributed. Also, the success rates achieved in this study are compared to those obtained in earlier studies for smaller corpora, too small perhaps to produce regular patterns. This study shows that text sets of this size confirm the intuitive predictions as to those influences: 1) the more authors, the less successful attribution; 2) for the same number of authors, the number of texts to be attributed does not influence success rate.

This article examines the success rate of authorship attribution based on multivariate analysis of the most frequent words in a corpus of 1000 Polish novels written between the end of the 18th and the beginning of the 21st century. The influence of the number of authors and/or texts on the results is assessed. The success rate of attribution in this study is compared with results from earlier studies that used smaller corpora, which may therefore not have exhibited regular patterns in this respect. The study shows that intuitive assumptions hold in large text collections: 1) the more authors, the harder successful attribution becomes; 2) for the same number of authors, the number of texts has no effect on attribution success.
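The abstract does not say which multivariate procedure was used, so the following is only a minimal sketch of one common most-frequent-word approach, a Burrows-style Delta with nearest-neighbour classification. The function names (`attribute`, `success_rate`), the whitespace tokenizer, and the default of 100 words are illustrative assumptions, not the article's setup.

```python
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def most_frequent_words(token_lists, n):
    # Pool all training texts and keep the n most frequent word forms.
    pooled = Counter()
    for tokens in token_lists:
        pooled.update(tokens)
    ranked = sorted(pooled.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def relative_freqs(tokens, vocab):
    # Relative frequency of each vocabulary word in one text.
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def attribute(train, text, n_mfw=100):
    """Return the author of the training text nearest to `text`.

    train: list of (author, text) pairs with known authorship.
    Distance: mean absolute difference of z-scored word frequencies
    (a Burrows-style Delta; an assumed choice, not the article's).
    """
    train_tokens = [tokenize(t) for _, t in train]
    vocab = most_frequent_words(train_tokens, n_mfw)
    rows = [relative_freqs(toks, vocab) for toks in train_tokens]

    # Column-wise mean and standard deviation over the training texts;
    # a zero deviation is replaced by 1.0 to avoid division by zero.
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]

    def zscores(row):
        return [(x - m) / s for x, m, s in zip(row, means, sds)]

    target = zscores(relative_freqs(tokenize(text), vocab))

    def delta(row):
        z = zscores(row)
        return sum(abs(a - b) for a, b in zip(z, target)) / len(z)

    best = min(range(len(train)), key=lambda i: delta(rows[i]))
    return train[best][0]


def success_rate(train, test, n_mfw=100):
    # Fraction of (author, text) test pairs attributed to the right author.
    hits = sum(attribute(train, text, n_mfw) == author for author, text in test)
    return hits / len(test)
```

Under these assumptions, a study like the one described would call `success_rate` repeatedly while varying the number of authors in `train` and the number of texts in `test`, then compare the resulting fractions.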

Keywords

multivariate analysis, authorship attribution, Polish literature, stylometry

Information

Article type: Original article

Authors

Jan Rybicki

Jagiellonian University in Kraków, Gołębia 24, 31-007 Kraków, Poland

Article status: Open

Licence: None

Percentage share of authors:

Jan Rybicki (Author) - 100%

Publication languages:

English
