This project uses the 'Syntactic Ngrams' corpus, based on the Google Books corpus and containing parse fragments with Penn-Treebank part-of-speech (POS) tags, to discover and validate the reflexive pronouns contained and tagged therein, and to compare the frequency results between the general English corpus and the partition for fiction.
Projecting over the respective Syntactic Ngrams verbargs files and filtering with a regex (r'sel[fvl]') to approximate reflexivity yields the following candidate lists of pronouns (PRP) occurring as direct objects (dobj). The lists are ranked by frequency and truncated so that they still include the gender-neutral singular reflexive pronoun 'themself' (see Methodology for more details, and candidates-eng.txt and candidates-fiction.txt for the full listings):
English | English Fiction |
---|---|
himself: 8979386 | himself: 3207256 |
themselves: 5578605 | herself: 1408594 |
itself: 3409264 | myself: 1296374 |
myself: 3014093 | themselves: 931156 |
herself: 2329877 | yourself: 636914 |
yourself: 1623751 | itself: 495199 |
ourselves: 1404367 | ourselves: 255542 |
oneself: 219794 | yourselves: 41846 |
yourselves: 143277 | oneself: 27295 |
thyself: 77279 | thyself: 19193 |
himselfe: 4020 | hisself: 2384 |
hisself: 3088 | yerself: 1106 |
ourself: 1380 | meself: 724 |
yerself: 1238 | yoursell: 719 |
meself: 1079 | imself: 533 |
yoursell: 866 | ourself: 356 |
imself: 414 | |
themself: 316 | isself: 304 |
yo'self: 291 | yo'self: 281 |
herselfe: 243 | himselfe: 190 |
heself: 165 | |
isself: 221 | erself: 156 |
myselfe: 208 | himsell: 155 |
mysell: 202 | hetself: 154 |
erself: 149 | mysell: 142 |
himsell: 144 | yoursells: 88 |
heself: 136 | hymself: 83 |
itselfe: 136 | themself: 82 |
Bold font draws attention to a few features of these lists:
- the relative prominence of 'himself' and 'herself' (interestingly much more balanced in fiction);
- the prominence of the ungendered plural pronoun 'themselves' relative to these two;
- the relative prominence of the ungendered singular pronoun 'themself' (interestingly much less prominent in fiction, at least historically).
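One way to make these comparisons concrete is to compute ratios directly from the counts in the table. The following is a minimal sketch: the helper `ratio` and the dictionary names are illustrative assumptions and are not taken from the repository code.

```python
# Counts copied from the table above (direct-object PRP candidates).
ENGLISH = {
    "himself": 8979386,
    "herself": 2329877,
    "themselves": 5578605,
    "themself": 316,
}
FICTION = {
    "himself": 3207256,
    "herself": 1408594,
    "themselves": 931156,
    "themself": 82,
}


def ratio(counts, numerator, denominator):
    """Return how many times more frequent `numerator` is than `denominator`."""
    return counts[numerator] / counts[denominator]


for name, counts in (("English", ENGLISH), ("English Fiction", FICTION)):
    print(name)
    print(f"  himself / herself:    {ratio(counts, 'himself', 'herself'):.2f}")
    print(f"  themselves / himself: {ratio(counts, 'themselves', 'himself'):.2f}")
    print(f"  themself / himself:   {ratio(counts, 'themself', 'himself'):.2e}")
```

On these figures, 'himself' is roughly 3.9 times as frequent as 'herself' in the general corpus and roughly 2.3 times as frequent in fiction.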
It is also worth noting the relative prominence of dialectal pronouns, such as 'isself' (a variant of the already non-standard 'hisself', used in my own native dialect), as well as American dialect forms like 'yo'self'.
Strikethrough marks misidentified entries: although the regex for reflexive forms filters out most of the tokens that the POS tagger wrongly labels as pronouns, the word 'mademoiselle' slips through because it contains the matching substring 'sell'. Less troublingly, the word 'ltself' appears in the full listings (this is likely due to flawed OCR).
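The candidate extraction described above (PRP tokens labelled dobj, filtered by r'sel[fvl]') can be sketched roughly as follows. This is not the repository's actual code: the function name `count_candidates`, the case folding, and the sample lines are assumptions made for illustration. It also assumes the line layout described in the Syntactic Ngrams documentation, namely tab-separated fields (head word, ngram, total count, per-year counts) with each ngram token written as word/POS/dep-label/head-index.

```python
import re
from collections import Counter

# The filter quoted above; it approximates reflexivity by surface form only.
REFLEXIVE = re.compile(r"sel[fvl]")


def count_candidates(lines):
    """Sum total counts of PRP/dobj tokens whose form matches REFLEXIVE.

    Assumed line layout: head<TAB>ngram<TAB>total_count<TAB>per-year counts,
    with each ngram token written as word/POS/dep-label/head-index.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip malformed lines
        total = int(fields[2])
        for token in fields[1].split(" "):
            # rsplit keeps any '/' inside the word itself intact
            parts = token.rsplit("/", 3)
            if len(parts) != 4:
                continue
            word, pos, dep, _head = parts
            word = word.lower()  # assumption: case is folded before counting
            if pos == "PRP" and dep == "dobj" and REFLEXIVE.search(word):
                counts[word] += total
    return counts


# Made-up sample lines for illustration only (not real corpus data).
SAMPLE = [
    "saw\thimself/PRP/dobj/2 saw/VBD/ROOT/0\t10\t1990,10",
    "saw\thim/PRP/dobj/2 saw/VBD/ROOT/0\t5\t1990,5",
    "hurt\tshe/PRP/nsubj/2 hurt/VBD/ROOT/0 herself/PRP/dobj/2\t7\t2000,7",
    "saw\tHimself/PRP/dobj/2 saw/VBD/ROOT/0\t3\t1991,3",
    "met\tmademoiselle/PRP/dobj/2 met/VBD/ROOT/0\t2\t1900,2",
]

candidates = count_candidates(SAMPLE)
for word, n in candidates.most_common():
    print(f"{word}: {n}")
```

Because the filter looks only at the surface form, a mistagged 'mademoiselle' passes it (via 'sell') just as 'yoursell' does, while a plain 'him' is excluded.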
The first challenge in approaching this problem is that the original link to the Syntactic Ngrams corpus is broken. A cached copy is, however, still available via the Internet Archive.
The Syntactic Ngrams corpus is documented in Goldberg and Orwant (ACL 2013) [1].
This analysis was aided by the authors of Hoyle et al. (ACL 2019) [2].
[1] Yoav Goldberg and Jon Orwant (2013). A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 241–247, Atlanta, Georgia, USA. Association for Computational Linguistics.
[2] Alexander Miserlis Hoyle, Lawrence Wolf-Sonkin, Hanna Wallach, Isabelle Augenstein, and Ryan Cotterell (2019). Unsupervised Discovery of Gendered Language through Latent-Variable Modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1706–1716, Florence, Italy. Association for Computational Linguistics.
This text and its embedded resources are made available under CC-BY 4.0.
All code is made available using the MIT license (contained in the repository).