Many fields now expect current and prospective employees to maintain a robust social media presence. However, it can be difficult to decide what to write or tweet about. It would be useful if a finance or investment professional could easily generate sample text to keep up their social media presence while emulating their favorite professionals.
This repo builds an NLP text generator trained on some of the most prominent venture capitalist Twitter accounts. A user can generate a text sample with the click of a button. Readability of the output may vary; humorous interpretation is welcome. Try it out on HuggingFace.
You'll need a Twitter API token in order to use the 01_twitter_user.py file from Mehran Shakarami's AI Spectrum. Go to developer.twitter.com and request Elevated access. Once you have your credentials, create a new file and store them in separate configuration variables.
Once you have this, just change the username and limit variables in 01_twitter_user.py to download the desired number of tweets as a .csv file. The cleaned file used for my modeling can be found here.
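As a rough sketch of that step, the snippet below shows what the credential variables and download might look like using tweepy; the variable names, the config.py suggestion, and the example handle are illustrative rather than copied from 01_twitter_user.py.

```python
import tweepy
import pandas as pd

# Illustrative credentials; in practice keep these in a separate file
# (e.g. config.py) that is excluded from version control.
API_KEY = "..."
API_KEY_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

username = "some_vc_account"  # handle to scrape (hypothetical)
limit = 3000                  # number of tweets to download

auth = tweepy.OAuthHandler(API_KEY, API_KEY_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Page through the user's timeline; tweet_mode="extended" keeps the full text.
tweets = tweepy.Cursor(api.user_timeline,
                       screen_name=username,
                       count=200,
                       tweet_mode="extended").items(limit)

df = pd.DataFrame([[t.created_at, t.full_text] for t in tweets],
                  columns=["created_at", "text"])
df.to_csv(f"{username}_tweets.csv", index=False)
```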
Use the 02_cleaning.ipynb notebook to clean the .csv file of Twitter data for model input. It removes usernames, URLs, quotes, and the 'RT :' string. It also saves a new column indicating whether the tweet was originally a retweet, in case that's important to you.
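For reference, the cleaning logic amounts to something like the sketch below (pandas plus regular expressions); the exact patterns, filenames, and column names in 02_cleaning.ipynb may differ.

```python
import re
import pandas as pd

df = pd.read_csv("tweets.csv")  # hypothetical raw export from the previous step

# Record retweet status before stripping the 'RT :' prefix,
# so the information survives cleaning.
df["is_retweet"] = df["text"].str.startswith("RT")

def clean_tweet(text: str) -> str:
    text = re.sub(r"RT\s*@?\w*\s*:", "", text)    # drop the 'RT :' prefix
    text = re.sub(r"@\w+", "", text)              # remove @usernames
    text = re.sub(r"http\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r'["“”]', "", text)             # remove quote characters
    return text.strip()

df["clean_text"] = df["text"].apply(clean_tweet)
df.to_csv("tweets_clean.csv", index=False)
```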
03_eda.ipynb vectorizes the data and checks the most common single words and two-word phrases to confirm that the content has the venture capital topic skew we were looking for.
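A minimal version of that check, assuming scikit-learn's CountVectorizer and the hypothetical filenames and column names from the cleaning sketch above:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("tweets_clean.csv")  # cleaned file from the previous step

# Count unigrams and bigrams, dropping common English stop words.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
counts = vectorizer.fit_transform(df["clean_text"].fillna(""))

totals = counts.sum(axis=0).A1             # total occurrences of each term
terms = vectorizer.get_feature_names_out()
top_terms = sorted(zip(terms, totals), key=lambda x: -x[1])[:20]
print(top_terms)  # expect VC-flavored terms if the topic skew is right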
This project first used Max Woolf's gpt-2-simple, which also has a useful Google Colab notebook to help set it up: gpt-2-simple. Using that Colab notebook, I fine-tuned the pre-trained model with the 124M parameter configuration and my Twitter text data, then downloaded the fine-tuned model locally (size: ~500 MB). Fine-tuning took over 45 minutes per run, and each text generation request then took 3-4 minutes.
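In outline, the gpt-2-simple workflow from that Colab notebook looks roughly like this; the dataset filename, step count, and run name below are illustrative, not the exact values used.

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")  # fetch the pre-trained 124M weights

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="tweets_clean.txt",  # hypothetical training file
              model_name="124M",
              steps=1000,                  # illustrative step count
              run_name="run1")

# Generate from the fine-tuned checkpoint; this is the part that took
# a few minutes per request when run locally.
gpt2.generate(sess, run_name="run1", length=60, temperature=0.7, nsamples=1)
```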
Later on, it was pointed out that gpt-2-simple has a successor called aitextgen. It also has an easy-to-use Google Colab notebook, but this time fine-tuning takes less than 20 minutes and each text generation request is fulfilled in less than 10 seconds, even locally without a GPU.
The 04_aitextgen.ipynb notebook is largely taken from that same Colab notebook with a few tweaks and is only used to generate text from an already fine-tuned model stored locally. It works well on Colab or locally. If you want to fine-tune the model on other text, please use the original Colab notebook linked in aitextgen's README.
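Loading a locally stored, fine-tuned model and generating from it takes only a few lines with aitextgen; the folder name below is an assumption based on the default used in its Colab notebook.

```python
from aitextgen import aitextgen

# Load the fine-tuned model saved to a local folder (e.g. "trained_model").
ai = aitextgen(model_folder="trained_model")

# Generate a few short samples; this runs in seconds even on CPU.
ai.generate(n=3, max_length=60, temperature=0.7)
```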
If you'd like to use the models I fine-tuned, they can be found in Google Drive here.
gpt-2-simple runs:
- run1: the original fine-tuning, which didn't remove the 'RT' strings or quote symbols; used only 300 tweets per user
- run2: second run using 300 posts/user; took out 'RT' and quotes, and the output looks cleaner
- run3: third run, this time using 3,000 posts/user; results are more matter-of-fact and less funny
- run4: excluded tweets with fewer than 20 characters to help the output be longer each time
aitextgen runs:
- run1.1: uses 300 tweets/user in .csv format to avoid strange spacing
- run2.1: uses 300 tweets/user in .txt format
- run3.1: uses 3,000 tweets/user in .csv format
- run4.1: uses 3,000 tweets/user in .txt format
In my opinion, run1.1 has the funniest output, while run3.1 is the most coherent. Use whichever you'd like.
The Streamlit app loads the pre-trained, fine-tuned run1.1 model to generate a text sample at the click of a button. You can access a working version of it here.
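The core of such an app can be quite small. A hedged sketch, assuming the run1.1 weights live in a local trained_model folder and recent Streamlit/aitextgen versions:

```python
import streamlit as st
from aitextgen import aitextgen

@st.cache_resource  # load the model once and reuse it across reruns
def load_model():
    return aitextgen(model_folder="trained_model")  # assumed path to run1.1 weights

st.title("VC Tweet Generator")
ai = load_model()

if st.button("Generate text"):
    sample = ai.generate_one(max_length=60, temperature=0.7)
    st.write(sample)
```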
Twitter: The users chosen for this analysis are kept anonymous. Some proper names are still present in the tweets after cleaning, but those names are well known and often talked about in venture capital.
Shoutout to Niraj Saran for pointing me to resource #2 below.
- The gpt-2-simple library and Colab notebook for providing a pre-trained model and simple model training and fine-tuning.
- The aitextgen library that improves on gpt-2-simple by leaps and bounds.
- Mehran Shakarami's AI Spectrum for a great, easy-to-use script for scraping tweets.
- StackOverflow help for the following regex: