This project aims to find markers within the text data that can be used to predict the age group of bloggers.
For now, I plan to explore sentence organization, punctuation, and word choice to form generalizations about each age group.
I also plan to train a classification model and see whether it can help highlight language differences between the groups.
The dataset I am using for this project is the Blog Authorship Corpus.
It is a collection of 681,288 posts from 19,320 bloggers, and contains blog posts along with some user metadata, including age.
The existing data source can be found here
The goal of this project is to find markers within the text data that can be used to predict the age group of bloggers. After exploratory data analysis (EDA), I'd like to explore sentence construction, word choice, and the use of punctuation, and whether these factors can suggest that "Someone who is X years old wrote this."
I also hope to consider some more technical features, such as word frequency, n-grams, and part-of-speech tags. I will also train and test a model on the dataset to check that it actually makes use of these features.
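As a rough illustration of the word-frequency and n-gram features mentioned above, here is a minimal sketch that uses only the Python standard library. The `tokenize` and `ngram_counts` helpers and the sample post are hypothetical, made up for this example; the real project may well use a dedicated NLP library instead, and part-of-speech tagging is not covered here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count contiguous n-grams; each key is a tuple of n tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Made-up sample text, not taken from the corpus.
post = "i can't wait for school to end. i can't wait for summer!"
tokens = tokenize(post)
unigrams = ngram_counts(tokens, 1)  # word frequencies, keyed by 1-tuples
bigrams = ngram_counts(tokens, 2)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

Counts like these could be computed per age group and compared, or normalized by post length and passed to a classifier as features.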
The linguistic analysis will include the extraction of a set of linguistic features from the text data that may be indicative of age group.
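To make the idea of a linguistic feature set more concrete, the sketch below computes a few simple style measures for a single post: average sentence length and per-word punctuation rates. The function name `style_features`, the choice of features, and the sample text are all assumptions for illustration; the sentence splitter is deliberately naive and would mishandle abbreviations and ellipses.

```python
import re


def style_features(text):
    """Return a few simple style measures for one post."""
    # Naive sentence split on runs of ., !, or ? (an assumption, not a real tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)

    def per_word(count):
        # Guard against empty posts to avoid dividing by zero.
        return count / n_words if n_words else 0.0

    return {
        "avg_sentence_len": n_words / len(sentences) if sentences else 0.0,
        "exclaim_per_word": per_word(text.count("!")),
        "question_per_word": per_word(text.count("?")),
        "comma_per_word": per_word(text.count(",")),
    }


# Made-up sample text, not taken from the corpus.
features = style_features("omg!! that was so fun. are you going again?")
print(features)
```

Each post would yield one such dictionary, which can then be averaged within an age group or used as a row in a feature matrix.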
I should end up with a presentation that answers the linguistic question, ideally with clear examples, graphs, and data.
It shouldn't be too different from other presentations.