Studies in Polish Linguistics

Style-Markers in Authorship Attribution. A Cross-Language Study of the Authorial Fingerprint

Publication date: 15.10.2011

Studies in Polish Linguistics, Volume 6 (2011), Issue 1, pp. 99–114

Authors

Maciej Eder
Polish Academy of Sciences, Warsaw, Poland
University of the National Education Commission, Krakow
ul. Podchorążych 2, 30-084 Kraków, Poland

Abstract

The present study addresses one of the theoretical problems of computer-assisted authorship attribution, namely the question of which traceable features of language can betray the authorial uniqueness (a stylistic fingerprint) of literary texts. A number of recent approaches show that, apart from lexical measures (especially those relying on the frequencies of the most frequent words), some other features of written language are also considerably effective as discriminators of authorial style. However, there have been no attempts to compare the attribution potential of these features. The aim of the present study, then, was to examine the effectiveness of several style-markers in authorship attribution. The style-markers chosen for the empirical investigation are those that can be retrieved from a non-lemmatized corpus of plain text files, such as the most frequent words, word bi-grams, different letter sequences, and markers of different kinds combined in one sample. Equally important, however, was to compare the usefulness of the chosen style-markers across four languages: English, Polish, German, and Latin. The results confirmed the high attribution effectiveness of word-based style-markers in the English corpus, but the alternative markers proved to be usually more effective in the other languages.
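
As an illustration only (this is not the study's own code), the style-markers named in the abstract could be extracted from a plain-text sample roughly as follows. The function names, the tokenization rule, the choice to count letter sequences within words, and the sample sentence are all assumptions made for this sketch.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word forms (no lemmatization).

    Assumption: a "word" is any run of Unicode letters.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def most_frequent_words(text, top=3):
    """Relative frequencies of the `top` most frequent word forms."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return [(word, count / len(tokens)) for word, count in counts.most_common(top)]


def word_bigrams(text):
    """Counts of adjacent word pairs."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


def letter_ngrams(text, n=3):
    """Counts of overlapping letter sequences of length n inside each word."""
    counts = Counter()
    for token in tokenize(text):
        counts.update(token[i:i + n] for i in range(len(token) - n + 1))
    return counts


# Hypothetical sample text, used only to exercise the functions.
sample = "the cat sat on the mat and the dog sat on the rug"
print(most_frequent_words(sample))
print(word_bigrams(sample).most_common(2))
print(letter_ngrams(sample, n=2).most_common(2))
```

Because none of these features requires lemmatization or tagging, the same functions can be applied unchanged to texts in any of the four languages, which is what makes a cross-language comparison of this kind feasible.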

Information

Article type: Original research article

Article status: Open

Licence: None

Percentage share of authors:

Maciej Eder (Author) - 100%

Publication languages:

English
