- Download Reddit data from Academic Torrents, either per-subreddit files or the entire Reddit dump.
  - Subreddit data is in .zst format: one file for submissions and another for comments.
  - The entire Reddit dump is about 50 GB per month. Use `combine_folder_multiprocess.py` to extract subreddit data for each month, then combine across months.
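The dump files are zstandard-compressed, newline-delimited JSON, so extracting one subreddit (or, in the next step, one time window) comes down to a line-by-line filter. The following is a minimal sketch of such a filter, assuming the lines have already been decompressed. `keep_record` and `filter_lines` are hypothetical helper names, not functions from the scripts named in this README.

```python
import json
from datetime import datetime, timezone


def keep_record(record, subreddit, start, end):
    """Return True if the record belongs to `subreddit` and falls in [start, end)."""
    if str(record.get("subreddit", "")).lower() != subreddit.lower():
        return False
    # created_utc is a Unix timestamp; some older dumps store it as a string.
    created = datetime.fromtimestamp(int(record["created_utc"]), tz=timezone.utc)
    return start <= created < end


def filter_lines(lines, subreddit, start, end):
    """Yield parsed records from NDJSON lines that pass keep_record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the whole file
        if keep_record(record, subreddit, start, end):
            yield record


# Illustrative input only; real lines come from a decompressed dump file.
sample = [
    '{"subreddit": "Teachers", "created_utc": 1672531200, "id": "a1"}',
    '{"subreddit": "nursing", "created_utc": 1672531200, "id": "b2"}',
    '{"subreddit": "Teachers", "created_utc": "1609459200", "id": "c3"}',
]
window_start = datetime(2022, 12, 1, tzinfo=timezone.utc)
window_end = datetime(2024, 9, 1, tzinfo=timezone.utc)
kept = list(filter_lines(sample, "teachers", window_start, window_end))
print([r["id"] for r in kept])
```

Writing the filter as a generator keeps memory use flat, which matters when a single monthly file is tens of gigabytes.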
- Use `filter_file.py` to extract subreddit data (submissions and comments) for specific time periods from the .zst files.
- Use `combine_submission_comments.py` to combine submissions and comments for each subreddit and write them as a CSV with the following schema:
  - Comments: Score, created_utc, author, permalink, link_id, body
  - Submissions: Score, created_utc, title, id, author, permalink, selftext
  - Submissions + Comments: id, title, selftext, body
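The combined file can be pictured as a join between comments and their parent submission. The sketch below relies on the Reddit convention that a comment's `link_id` is the submission `id` prefixed with `t3_`. The function name `combine` and the choice to join all comment bodies into a single cell are illustrative assumptions, not necessarily what `combine_submission_comments.py` does.

```python
import csv
import io
from collections import defaultdict


def combine(submissions, comments, sep="\n"):
    """Join each submission with the bodies of its comments.

    Returns rows following the combined schema: id, title, selftext, body.
    """
    bodies = defaultdict(list)
    for comment in comments:
        # link_id looks like "t3_abc123"; drop the type prefix to get the post id.
        post_id = comment["link_id"].split("_", 1)[-1]
        bodies[post_id].append(comment["body"])

    rows = []
    for sub in submissions:
        rows.append({
            "id": sub["id"],
            "title": sub["title"],
            "selftext": sub["selftext"],
            "body": sep.join(bodies.get(sub["id"], [])),
        })
    return rows


# Illustrative records, not real data.
submissions = [{"id": "abc", "title": "AI at work", "selftext": "Post text"}]
comments = [
    {"link_id": "t3_abc", "body": "First comment"},
    {"link_id": "t3_abc", "body": "Second comment"},
    {"link_id": "t3_zzz", "body": "Comment on another post"},
]
rows = combine(submissions, comments)

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "title", "selftext", "body"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```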
- Use `generation.py` and `prompt_generation_AI_work.txt` to generate text from the combined submissions and comments using LLMs, with one LLM call per post and all of its associated comments. This generates structured output in the `output_raw` folder according to the following schema:

      {
        "anecdotes": [
          {
            "quote": "Full quote of a personal experience explicitly mentioning AI's impact on work",
            "summary": "Brief summary of the anecdote, with context if needed"
          }
        ],
        "media_reports": [
          {
            "quote": "Full quote discussing a media report about AI's impact on work",
            "summary": "Brief summary of the media report discussion, with context if needed"
          }
        ],
        "opinions": [
          {
            "quote": "Full quote expressing an opinion about AI's impact on work",
            "summary": "Brief summary of the opinion, with context if needed"
          }
        ],
        "other": [
          {
            "quote": "Full quote of other relevant content explicitly about AI's impact on work",
            "summary": "Brief summary of the other content, with context if needed"
          }
        ]
      }
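Pulling the anecdotes out of one such response could look like the sketch below. It assumes each response is available as a JSON string matching the schema above. `parse_anecdotes` is a hypothetical helper, not the code in `extract_anecdotes.py`.

```python
import json


def parse_anecdotes(raw_output):
    """Return the list of anecdote dicts from one LLM response, or [] if unusable."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return []  # the model returned something that is not valid JSON
    if not isinstance(data, dict):
        return []
    anecdotes = data.get("anecdotes", [])
    if not isinstance(anecdotes, list):
        return []
    # Keep only well-formed entries that carry both expected keys.
    return [
        a for a in anecdotes
        if isinstance(a, dict) and "quote" in a and "summary" in a
    ]


# Illustrative response, not real model output.
raw = json.dumps({
    "anecdotes": [{"quote": "I lost two clients to AI tools.", "summary": "Lost clients"}],
    "media_reports": [],
    "opinions": [{"quote": "AI is overhyped.", "summary": "Skeptical opinion"}],
    "other": [],
})
print(parse_anecdotes(raw))
print(parse_anecdotes("not json"))
```

Returning an empty list for malformed responses lets a batch run continue past the occasional bad output instead of stopping.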
- Use `extract_anecdotes.py` to extract only the anecdotes from the LLM output.
- Use `classification.py` and `prompt_classification_AI_work.txt` to generate labels for each of the anecdotes. The labels are:
  - Labor market risks (1)
  - Global AI divide (2)
  - Market concentration risks and single points of failure (3)
  - Risks to the environment (4)
  - Risks to privacy (5)
  - Copyright infringement (6)
  - Anecdotes unrelated to the impact of AI on work (7)

  The LLM output is structured as follows:

      { "categories": [1, 7, 1, 7] }
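Pairing labels with anecdotes can be sketched as below, under the assumption that the i-th entry of `categories` labels the i-th anecdote sent in the request. That alignment and the helper name `attach_labels` are assumptions, not something this README states. A length check guards against the model returning too few or too many labels.

```python
import json

# Label names as listed in this README.
LABELS = {
    1: "Labor market risks",
    2: "Global AI divide",
    3: "Market concentration risks and single points of failure",
    4: "Risks to the environment",
    5: "Risks to privacy",
    6: "Copyright infringement",
    7: "Anecdotes unrelated to the impact of AI on work",
}


def attach_labels(anecdotes, raw_output):
    """Return one row per anecdote with its category number and label name."""
    categories = json.loads(raw_output)["categories"]
    if len(categories) != len(anecdotes):
        raise ValueError(
            f"expected {len(anecdotes)} labels, got {len(categories)}"
        )
    rows = []
    for anecdote, category in zip(anecdotes, categories):
        if category not in LABELS:
            raise ValueError(f"unknown category: {category}")
        rows.append({**anecdote, "category": category, "label": LABELS[category]})
    return rows


# Illustrative anecdotes, not real data.
anecdotes = [
    {"quote": "q1", "summary": "s1"},
    {"quote": "q2", "summary": "s2"},
]
labeled_rows = attach_labels(anecdotes, '{ "categories": [1, 7] }')
for row in labeled_rows:
    print(row["category"], row["label"])
```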
- Use `combine_anecdotes_labels.py` to combine the anecdotes and their respective labels in a CSV.
- Use `extract_anecdotes_category.py` to extract the anecdotes with their respective categories into separate CSV files.
Statistics:
| Subreddit | Cat 1 | Cat 2 | Cat 3 | Cat 4 | Cat 5 | Cat 6 | Cat 7 | Total |
|----------------------|-------|-------|-------|-------|-------|-------|-------|-------|
| ArtistLounge | 500 | 2 | 1 | 0 | 1 | 104 | 686 | 1294 |
| Ask_Lawyers | 25 | 0 | 0 | 0 | 1 | 2 | 32 | 60 |
| creativewriting | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| developersIndia | 2272 | 15 | 16 | 0 | 15 | 4 | 3918 | 6240 |
| education | 45 | 1 | 0 | 0 | 1 | 0 | 171 | 218 |
| freelanceWriters | 875 | 5 | 2 | 0 | 3 | 21 | 295 | 1201 |
| Journalism | 141 | 0 | 1 | 0 | 1 | 2 | 153 | 298 |
| medicine | 193 | 1 | 7 | 0 | 0 | 1 | 965 | 1167 |
| musicians | 49 | 0 | 0 | 0 | 0 | 4 | 150 | 203 |
| Music | 77 | 0 | 1 | 0 | 0 | 12 | 172 | 262 |
| nursing | 445 | 0 | 5 | 0 | 11 | 2 | 2568 | 3031 |
| paralegal | 191 | 0 | 0 | 0 | 0 | 0 | 334 | 525 |
| Poetry | 1 | 0 | 0 | 0 | 0 | 1 | 7 | 9 |
| Screenwriting | 189 | 1 | 0 | 0 | 1 | 6 | 303 | 500 |
| softwaredevelopment | 28 | 0 | 0 | 0 | 0 | 0 | 67 | 95 |
| SoftwareEngineering | 163 | 0 | 0 | 0 | 0 | 0 | 182 | 345 |
| Teachers | 866 | 3 | 0 | 0 | 13 | 23 | 2804 | 3709 |
| VoiceActing | 210 | 1 | 0 | 0 | 1 | 11 | 78 | 301 |
| writers | 184 | 2 | 0 | 0 | 2 | 15 | 326 | 529 |
| writing | 192 | 0 | 1 | 0 | 2 | 16 | 485 | 696 |
| Total | 6647 | 31 | 34 | 0 | 52 | 224 | 13696 | 20684 |
- Use `classification_labor_csv.py` and `prompt_classification_AI_work_labor.txt` to generate labels for each of the labor anecdotes. The labels are:
  - Job Displacement: Anecdotes discussing people losing work or being laid off due to AI tools. (1)
  - Career Transitions: Anecdotes about people adapting their careers in response to AI, including reskilling or changing roles. (2)
  - AI-enhanced Work: Anecdotes where people are adopting AI tools to enhance productivity and streamline their workflows. (3)
  - Other: Not an anecdote, doesn't explicitly concern the impact of AI on work, or is speculation. (4)

  The LLM output is structured as follows:

      { "category": 1 }
- Use `combine_labor_anecdotes_labels.py` to combine the labor anecdotes and their respective labels in a CSV.
- Use `combine_profession_anecdotes.py` to group anecdotes into three professions. The professions and their subreddit mappings are listed below, along with the number of anecdotes in each category.
| Industry | Stakeholders | Subreddits |
|---|---|---|
| Creatives | Writers (freelancers, screenwriting, creative writers, poets, journalists), Musicians, Artists, Actors | r/freelanceWriters, r/screenwriting, r/creativewriting, r/Poetry, r/Writers, r/Writing, r/Journalism, r/Music, r/Musicians, r/ArtistLounge, r/VoiceActing |
| Professionals | Lawyers, Doctors, Nurses, Software Engineers | r/Ask_Lawyers, r/Paralegal, r/Nursing, r/Medicine, r/SoftwareEngineering, r/SoftwareDevelopment, r/DevelopersIndia |
| Educators | Teachers | r/Teachers, r/Education |

| Industry | Job Displacement (1) | Career Transitions (2) | AI-enhanced Work (3) | Total |
|---|---|---|---|---|
| Creatives | 773 | 119 | 330 | 1222 |
| Professionals | 589 | 389 | 482 | 1460 |
| Educators | 63 | 60 | 332 | 455 |
| Overall | 1425 | 568 | 1144 | 3137 |
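The grouping in these tables can be expressed as a lookup from subreddit to industry followed by a tally. The following is a minimal sketch. The dictionary reflects the mapping table, while the names `SUBREDDIT_TO_INDUSTRY` and `tally_by_industry` and the `(subreddit, category)` input shape are illustrative assumptions. Subreddit names are lowercased before lookup because their capitalisation varies across this README, and category 4 ("Other") is skipped, which is inferred from the totals covering only categories 1 to 3.

```python
from collections import Counter

INDUSTRY_SUBREDDITS = {
    "Creatives": [
        "freelanceWriters", "screenwriting", "creativewriting", "Poetry",
        "Writers", "Writing", "Journalism", "Music", "Musicians",
        "ArtistLounge", "VoiceActing",
    ],
    "Professionals": [
        "Ask_Lawyers", "Paralegal", "Nursing", "Medicine",
        "SoftwareEngineering", "SoftwareDevelopment", "DevelopersIndia",
    ],
    "Educators": ["Teachers", "Education"],
}

# Invert the mapping so each lowercased subreddit name points to its industry.
SUBREDDIT_TO_INDUSTRY = {
    name.lower(): industry
    for industry, names in INDUSTRY_SUBREDDITS.items()
    for name in names
}

LABOR_LABELS = {1: "Job Displacement", 2: "Career Transitions", 3: "AI-enhanced Work"}


def tally_by_industry(labeled):
    """Count (industry, label) pairs, skipping anything outside categories 1-3."""
    counts = Counter()
    for subreddit, category in labeled:
        if category not in LABOR_LABELS:
            continue
        industry = SUBREDDIT_TO_INDUSTRY[subreddit.lower()]
        counts[(industry, LABOR_LABELS[category])] += 1
    return counts


# Illustrative input, not real data.
labeled = [
    ("Teachers", 3), ("nursing", 1), ("writing", 1),
    ("Teachers", 4), ("Music", 2), ("education", 3),
]
counts = tally_by_industry(labeled)
for (industry, label), n in sorted(counts.items()):
    print(industry, label, n)
```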
- Top 40k subreddits: 2005 to 12-2023
  - https://www.reddit.com/r/pushshift/s/18wjxKJUB9
  - https://academictorrents.com/details/56aa49f9653ba545f48df2e33679f014d2829c10
  - Torrent file: `reddit-subreddit-2005-2023.torrent`
- Monthly dumps 2005-06 to 2024-06: https://academictorrents.com/details/20520c420c6c846f555523babc8c059e9daa8fc5
  - Torrent file: `reddit-monthly-2005-2024.torrent`
- Monthly dump statistics, schema of posts and comments data: https://docs.google.com/spreadsheets/d/1umjeU3eIi1V0m3efY2Hq1mbm2eczU2ct-bVJyB0RigE/htmlview
- Data processing scripts debugging: https://www.reddit.com/r/pushshift/s/owMJQkjPM5
- qBittorrent v4.6.6 works
- PRAW documentation: https://praw.readthedocs.io/en/latest/code_overview/models/comment.html
Acknowledgement: Grateful to u/Watchful1 for the Pushshift processing scripts (https://github.com/Watchful1/PushshiftDumps/blob/master/scripts) and subreddit data for Dec 2022 - Aug 2024.