Pronouns in English-language fiction

This project uses the 'Syntactic Ngrams' corpus, which is based on the Google Books corpus and contains parse fragments with Penn Treebank part-of-speech (POS) tags. The aim is to discover and validate the reflexive pronouns tagged in that corpus, and to compare their frequencies between the general English corpus and its fiction partition.

Results

Candidate Reflexive Pronouns

Candidate pronouns are obtained by projecting over the respective Syntactic Ngrams verbargs files, keeping pronouns (PRP) that occur as direct objects (dobj), and filtering them with a regex (r'sel[fvl]') that approximates reflexivity. The resulting lists are ranked by frequency and truncated at the point where both include the gender-neutral singular reflexive pronoun 'themself' (see Methodology for more details; candidates-eng.txt and candidates-fiction.txt for the full listings):

English	English Fiction
**himself**: 8979386	**himself**: 3207256
**themselves**: 5578605	**herself**: 1408594
itself: 3409264	myself: 1296374
myself: 3014093	**themselves**: 931156
**herself**: 2329877	yourself: 636914
yourself: 1623751	itself: 495199
ourselves: 1404367	ourselves: 255542
oneself: 219794	yourselves: 41846
yourselves: 143277	oneself: 27295
thyself: 77279	thyself: 19193
himselfe: 4020	hisself: 2384
hisself: 3088	yerself: 1106
ourself: 1380	meself: 724
yerself: 1238	yoursell: 719
meself: 1079	imself: 533
yoursell: 866	ourself: 356
imself: 414	~~mademoiselle: 316~~
**themself**: 316	isself: 304
yo'self: 291	yo'self: 281
herselfe: 243	himselfe: 190
~~mademoiselle: 237~~	heself: 165
isself: 221	erself: 156
myselfe: 208	himsell: 155
mysell: 202	hetself: 154
erself: 149	mysell: 142
himsell: 144	yoursells: 88
heself: 136	hymself: 83
itselfe: 136	**themself**: 82

Bold font draws attention to a few features of these lists:

the relative prominence of 'himself' and 'herself' (interestingly much more balanced in fiction);
the relative prominence of the ungendered plural pronoun 'themselves' compared with these;
the relative prominence of the ungendered singular pronoun 'themself' (interestingly much less prominent in fiction, at least historically).
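
To make the first observation concrete, the following minimal sketch computes how many times more frequent 'himself' is than 'herself' in each list. The counts are copied from the table above; the `COUNTS` dictionary and the `ratio` helper are illustrative names and are not part of the repository's scripts.

```python
# Direct-object counts copied from the candidate table above.
COUNTS = {
    "eng": {"himself": 8979386, "herself": 2329877, "themselves": 5578605},
    "fiction": {"himself": 3207256, "herself": 1408594, "themselves": 931156},
}


def ratio(corpus: str, a: str, b: str) -> float:
    """Return how many times more frequent pronoun `a` is than pronoun `b`."""
    return COUNTS[corpus][a] / COUNTS[corpus][b]


for corpus in COUNTS:
    # Print the himself/herself ratio for each partition.
    print(corpus, round(ratio(corpus, "himself", "herself"), 2))
```

On these counts the ratio is roughly 3.85 in the general English list and roughly 2.28 in the fiction list, which is the sense in which fiction is "more balanced".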

It is also worth noting the relative prominence of dialectal pronouns, such as 'isself' (a variant of the already non-standard 'hisself', used in my own native dialect), and of American dialect forms such as 'yo'self'.

Strikethrough marks a false positive: although the regex for reflexive forms does a relatively good job of cleaning up pronouns misidentified by the POS tagger, the word 'mademoiselle' slips through because it contains the same character sequence. Less troublingly, the word 'ltself' appears in the full listings, most likely an OCR error for 'itself'.
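
The behaviour of the filter can be checked with a short sketch. The pattern r'sel[fvl]' is the one quoted above. The `STRICTER` pattern is a hypothetical alternative, shown only to illustrate how anchoring at the end of the word would exclude 'mademoiselle'; it is not what the project uses, and it may miss forms that appear in the full listings.

```python
import re

# The filter quoted in this README: matches 'self', 'selv' or 'sell' anywhere.
REFLEXIVE = re.compile(r"sel[fvl]")

# Hypothetical stricter variant (an assumption, not the project's regex):
# the reflexive ending must close the word.
STRICTER = re.compile(r"sel(?:f|fe|l|ls|ves)$")


def looks_reflexive(word: str, pattern: re.Pattern = REFLEXIVE) -> bool:
    """Return True if the word matches the given reflexive-form pattern."""
    return pattern.search(word) is not None


words = ["himself", "themselves", "yoursell", "him", "mademoiselle", "ltself"]

# Everything except 'him' passes the original filter, including the
# false positive 'mademoiselle' and the OCR variant 'ltself'.
print([w for w in words if looks_reflexive(w)])

# The anchored variant drops 'mademoiselle' but still keeps 'ltself'.
print([w for w in words if looks_reflexive(w, STRICTER)])
```

Neither pattern can tell 'ltself' apart from a genuine form, so OCR variants still need a manual check.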

Methodology

The first challenge in approaching this problem is that the link to the Syntactic Ngrams corpus is broken. The corpus is, however, still available from a cached copy in the Internet Archive.
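
Once the verbargs files are retrieved, the projection described under Results could look roughly like the sketch below. It is not the code in candidates.py. It assumes that each line holds tab-separated fields (head word, syntactic ngram, total count, then per-year counts) and that each token of the ngram has the form word/POS/dep-label/head-index. The function name `tally_candidates` and the sample lines are invented for illustration.

```python
import re
from collections import Counter
from typing import Iterable

REFLEXIVE = re.compile(r"sel[fvl]")  # the filter quoted under Results


def tally_candidates(lines: Iterable[str]) -> Counter:
    """Sum total counts of PRP tokens in dobj position that pass the filter."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines that do not fit the assumed layout
        ngram, total = fields[1], int(fields[2])
        for token in ngram.split(" "):
            # Split from the right so a '/' inside the word itself is kept.
            parts = token.rsplit("/", 3)
            if len(parts) != 4:
                continue
            word, pos, dep, _head = parts
            if pos == "PRP" and dep == "dobj" and REFLEXIVE.search(word):
                counts[word] += total
    return counts


# Invented sample lines in the assumed format (not real corpus data).
sample = [
    "hurt\thurt/VBD/ROOT/0 himself/PRP/dobj/1\t120\t1990,50\t2000,70",
    "saw\tsaw/VBD/ROOT/0 him/PRP/dobj/1\t300\t1990,300",
    "hurt\tshe/PRP/nsubj/2 hurt/VBD/ROOT/0 herself/PRP/dobj/2\t80\t2000,80",
]

for word, n in tally_candidates(sample).most_common():
    print(f"{word}: {n}")
```

Because the function accepts any iterable of lines, it can be fed from an open file handle as easily as from an in-memory list.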

Acknowledgements

The Syntactic Ngrams corpus is documented in Goldberg and Orwant (ACL 2013) [1].

This analysis was aided by the authors of Hoyle et al. (ACL 2019) [2].

References

[1] Yoav Goldberg and Jon Orwant (2013). A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 241–247, Atlanta, Georgia, USA. Association for Computational Linguistics.

[2] Alexander Miserlis Hoyle, Lawrence Wolf-Sonkin, Hanna Wallach, Isabelle Augenstein, and Ryan Cotterell (2019). Unsupervised Discovery of Gendered Language through Latent-Variable Modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1706–1716, Florence, Italy. Association for Computational Linguistics.

Licenses

This text and its embedded resources are made available under CC-BY 4.0.

All code is made available using the MIT license (contained in the repository).

BarryNorton/pronouns

Folders and files

Latest commit

History

Repository files navigation

Pronouns in English-language fiction

Results

Candidate Reflexive Pronouns

Methodology

Acknowledgements

References

Licenses

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages