Historical Analysis of Species References in American English

Data and analysis for a study that uses the two largest available corpora of American English, Google Books and COHA (the Corpus of Historical American English), to investigate relations between ecology and language.

There are two parts: an examination of the frequency of species' common names, followed by aspect-level sentiment analysis of concordance lines.

Research findings are published in the journal Corpora.

Files

google_books_data:

  • getngrams.py - Python script to retrieve the data behind the trajectories plotted on the Google Ngram Viewer. (For usage see google-ngrams)
  • species_ngram_NA_normalized.csv & species_ngram_BR_normalized.csv - data collected by running getngrams.py on each species name, normalized with respect to the average. (NA: American, BR: British)
  • NA_google_data_normalized.csv & BR_google_data_normalized.csv - summaries of species_ngram_NA_normalized.csv and species_ngram_BR_normalized.csv, respectively. (Used for plotting)
  • rural_pop_US.csv & rural_pop_BR.csv - percentage of the population that is rural
  • COHA_normalized.csv - frequency data from the Corpus of Historical American English
  • figure1.py - plotting/statistical analysis for North American frequency and population data (1800-2000)
  • figure2.py - plotting/statistical analysis for North American frequency and population data (1900-2000)
  • figure3.py - plotting/statistical analysis for British frequency and population data (1800-2000)
  • figure4.py - plotting/statistical analysis for British frequency and population data (1850-1920)
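
The "normalized with respect to the average" step mentioned above can be read as dividing each yearly value by the mean of its series. The following is a minimal sketch of that reading, not the repository's code: the column names `year` and `frequency`, the sample values, and the helper `normalize_by_mean` are all assumptions made for illustration, and the real CSV layout may differ.

```python
import csv
import io
from statistics import mean


def normalize_by_mean(values):
    """Divide every value by the mean of the series (hypothetical helper)."""
    avg = mean(values)
    if avg == 0:
        raise ValueError("cannot normalize a series whose mean is zero")
    return [v / avg for v in values]


# Made-up sample data; the column names are assumptions, not the real file layout.
SAMPLE = """year,frequency
1800,0.000002
1900,0.000004
2000,0.000006
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
years = [int(row["year"]) for row in rows]
normalized = normalize_by_mean([float(row["frequency"]) for row in rows])

for year, value in zip(years, normalized):
    print(year, round(value, 3))
```

After this kind of normalization, each series averages to 1, so species with very different raw frequencies can be plotted on a common scale.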

coha_data:

  • speciesname.csv (caribou.csv, elm.csv, etc.) - key word in context (kwic) lines from the Corpus of Historical American English
  • COHA_normalized.csv - frequency data from the Corpus of Historical American English
  • coha_content.csv - data for the composition of the COHA corpus
  • sentiment.py - sentiment analysis script: uses Natural Language Toolkit (NLTK) and SentiWordNet for aspect-level sentiment of adjective/target-noun pairs
  • annotated.csv - output of sentiment.py, manually annotated
  • final.csv - retained kwic lines from annotated.csv
  • figure5.py - plotting/statistical analysis of sentiment analysis results
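
As a rough illustration of aspect-level sentiment for adjective/target-noun pairs, the sketch below pairs each adjective with the target noun that directly follows it and scores the adjective. It is not a copy of sentiment.py: the pairing rule (adjective immediately before the noun), the averaging over senses, and the function names `adjective_target_pairs`, `swn_polarity`, and `score_line` are assumptions. The SentiWordNet lookup needs NLTK with its `wordnet` and `sentiwordnet` data installed, so the usage example passes a toy scoring function with a made-up score instead.

```python
def adjective_target_pairs(tagged_tokens, target):
    """Return (adjective, noun) pairs where an adjective directly precedes the target.

    tagged_tokens is a list of (word, Penn Treebank tag) tuples,
    for example the output of nltk.pos_tag.
    """
    pairs = []
    for (word, tag), (next_word, _) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag.startswith("JJ") and next_word.lower() == target.lower():
            pairs.append((word.lower(), next_word.lower()))
    return pairs


def swn_polarity(adjective):
    """Mean (positive - negative) SentiWordNet score over the adjective's senses."""
    # Imported lazily so the rest of the sketch runs without NLTK data.
    from nltk.corpus import sentiwordnet as swn

    senses = list(swn.senti_synsets(adjective, "a"))
    if not senses:
        return 0.0
    return sum(s.pos_score() - s.neg_score() for s in senses) / len(senses)


def score_line(tagged_tokens, target, polarity=swn_polarity):
    """Score every adjective/target pair found in one concordance line."""
    return [
        (adj, noun, polarity(adj))
        for adj, noun in adjective_target_pairs(tagged_tokens, target)
    ]


# Usage with a toy lexicon; the score 0.5 is invented for the example.
tagged = [("the", "DT"), ("stately", "JJ"), ("elm", "NN"), ("stood", "VBD")]
toy_scores = {"stately": 0.5}
print(score_line(tagged, "elm", polarity=lambda adj: toy_scores.get(adj, 0.0)))
```

Passing the scoring function as a parameter keeps the pairing logic testable on its own; dropping the `polarity` argument switches to the SentiWordNet lookup.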

Prerequisites

Python Packages and Modules

pandas, numpy, scipy, matplotlib, nltk
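
A quick way to confirm that the listed packages are importable before running the scripts is sketched below. The helper `missing_packages` is hypothetical and not part of the repository. The NLTK data names in the comment are an assumption about what sentiment.py needs, based on its use of SentiWordNet; other NLTK resources may also be required.

```python
from importlib.util import find_spec

REQUIRED = ["pandas", "numpy", "scipy", "matplotlib", "nltk"]


def missing_packages(names):
    """Return the top-level package names that cannot be found on this system."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are available.")
    # SentiWordNet lookups additionally need NLTK data, typically fetched with:
    #   import nltk
    #   nltk.download("wordnet")
    #   nltk.download("sentiwordnet")
```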

Authors

Acknowledgments