A simple Python script that transforms a corpus of documents into text vectors and saves them as .tsv files suitable for visualization. It uses the Rosette API's /text-embedding endpoint and the BBC News Corpus. Note that the corpus is free for research purposes only.
- Clone the repo and open the files in your favorite text editor or Python IDE.
- Download the raw text files zip, bbc-fulltext.zip, from http://mlg.ucd.ie/datasets/bbc.html and extract it into the project root folder. You should get a folder called "bbc".
- Run visualize-embeddings.py via your Python IDE or command line (replace ROSAPI_KEY with your Rosette API key):
  $ python visualize-embeddings.py --key ROSAPI_KEY
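For readers adapting the script, here is a minimal sketch of how a --key flag like the one above could be read with Python's argparse. The function name parse_args and the help text are hypothetical; the actual script may handle its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Read the Rosette API key from the command line (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Turn a corpus of documents into .tsv text vectors."
    )
    # --key mirrors the flag shown in the run command; making it required
    # is an assumption of this sketch.
    parser.add_argument("--key", required=True, help="your Rosette API key")
    return parser.parse_args(argv)
```

Calling parse_args(["--key", "ROSAPI_KEY"]) returns a namespace whose key attribute holds the string that was passed in.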
You'll see that the script parses the raw text files of the corpus into a list of documents. Each document consists of three fields:
- category
- headline
- content
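The parsing step can be sketched roughly as follows. This assumes the extracted corpus is laid out as bbc/<category>/<file>.txt, with the headline on the first non-empty line of each file and the article body after it. The function names parse_document and load_corpus are hypothetical and not taken from the script.

```python
from pathlib import Path


def parse_document(path):
    """Split one raw text file into category, headline, and content."""
    path = Path(path)
    # errors="replace" guards against the occasional non-UTF-8 byte.
    text = path.read_text(encoding="utf-8", errors="replace")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return {
        "category": path.parent.name,            # folder name, e.g. "tech"
        "headline": lines[0] if lines else "",   # first non-empty line
        "content": " ".join(lines[1:]),          # remaining lines, joined
    }


def load_corpus(root="bbc"):
    """Parse every .txt file found under root/<category>/."""
    return [parse_document(p) for p in sorted(Path(root).glob("*/*.txt"))]
```

To try this on your own data, arrange the files in one folder per category and pass that root folder to load_corpus.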
The script then creates two files:
- embeddings.tsv: a TSV file where each line contains the text vector for a document's content field.
- metadata.tsv: a TSV file where each line contains a document's metadata (i.e. category and headline).
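As an illustration of the two output files, the sketch below writes one tab-separated vector per line to embeddings.tsv and one category/headline pair per line to metadata.tsv. It assumes the vectors have already been fetched from the API as lists of floats. The function name write_projector_files is hypothetical, and the header row in metadata.tsv is an assumption based on the Embedding Projector's convention for multi-column metadata.

```python
def write_projector_files(documents, vectors,
                          embeddings_path="embeddings.tsv",
                          metadata_path="metadata.tsv"):
    """Write vectors and their metadata as two line-aligned TSV files."""
    if len(documents) != len(vectors):
        raise ValueError("need exactly one vector per document")

    def clean(value):
        # Tabs and newlines inside a field would break the TSV layout.
        return " ".join(str(value).split())

    with open(embeddings_path, "w", encoding="utf-8") as f:
        for vector in vectors:
            f.write("\t".join(str(x) for x in vector) + "\n")

    with open(metadata_path, "w", encoding="utf-8") as f:
        f.write("category\theadline\n")  # header row for the two columns
        for doc in documents:
            f.write(clean(doc["category"]) + "\t" + clean(doc["headline"]) + "\n")
```

Line N of embeddings.tsv and line N+1 of metadata.tsv (after the header) describe the same document, so the two files must be written in the same order.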
To visualize the embeddings, load them into Google TensorFlow's Embedding Projector. Turn on color coding by category to really see the vectors in action. You can see our projection at this link.
Try replacing the BBC News corpus with your own data. And if you find anything interesting, we'd love to hear about it! Find us at [email protected].