Rosette API Text Embeddings Visualization Sample Code

A simple Python script that transforms a corpus of documents into text vectors and saves them in .tsv format, ready for visualization. It uses the Rosette API's /text-embedding endpoint and the BBC News Corpus. Note that the corpus is only free for research purposes.

Getting started

  1. Clone the repo and open the files in your favorite text editor or Python IDE.

  2. Download the raw text archive, bbc-fulltext.zip, from http://mlg.ucd.ie/datasets/bbc.html and extract it into the project root folder. You should get a folder called "bbc".

  3. Run visualize-embeddings.py from your Python IDE or the command line (replace ROSAPI_KEY with your Rosette API key):

     $ python visualize-embeddings.py --key ROSAPI_KEY
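
For readers adapting the script, the --key flag shown above could be read with Python's standard argparse module. This is only a minimal sketch: the function name parse_args and the help text are illustrative assumptions, not taken from visualize-embeddings.py.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; argv=None means argparse reads sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Turn a corpus into embeddings.tsv and metadata.tsv"
    )
    # --key matches the flag used in the command above; everything else
    # here (description, help text) is an assumption for illustration.
    parser.add_argument("--key", required=True, help="Rosette API key")
    return parser.parse_args(argv)
```

Calling parse_args(["--key", "ROSAPI_KEY"]) returns a namespace whose key attribute holds the string passed on the command line.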
    

You'll see that the script parses the raw text files of the corpus into a list of documents. Each document consists of three fields:

  • category
  • headline
  • content
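
The parsing step described above can be sketched roughly as follows. The sketch assumes the extracted archive is laid out as bbc/<category>/<number>.txt and that the first non-empty line of each file is the headline; both are assumptions about the corpus, and load_documents is a hypothetical helper rather than the function used in visualize-embeddings.py.

```python
from pathlib import Path


def load_documents(root="bbc"):
    """Return a list of dicts with 'category', 'headline' and 'content' keys."""
    documents = []
    # Assumption: one sub-folder per category, one .txt file per article.
    for path in sorted(Path(root).glob("*/*.txt")):
        # Assumption: files may not be clean UTF-8, so undecodable bytes
        # are replaced instead of raising an error.
        text = path.read_text(encoding="utf-8", errors="replace")
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        if not lines:
            continue  # skip empty files
        documents.append({
            "category": path.parent.name,    # folder name
            "headline": lines[0],            # first non-empty line
            "content": " ".join(lines[1:]),  # remaining text, joined
        })
    return documents
```

Joining the body into a single string keeps each document on one logical line, which is convenient when the content is later sent off for embedding.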

The script then creates two files:

  • embeddings.tsv: a TSV file where each line contains the text vector for a document's content field.
  • metadata.tsv: a TSV file where each line contains a document's metadata (i.e. category and headline).
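
Writing the two files might look like the sketch below. It assumes the vectors have already been fetched from the /text-embedding endpoint and are plain lists of floats, one per document; write_projector_files is a hypothetical helper, not the script's own code. The metadata file starts with a header row because the Embedding Projector expects column labels when the metadata has more than one column, while the vectors file has no header.

```python
def _clean(value):
    """Collapse tabs and newlines so a field cannot break the TSV layout."""
    return " ".join(str(value).split())


def write_projector_files(documents, vectors,
                          embeddings_path="embeddings.tsv",
                          metadata_path="metadata.tsv"):
    """Write one vector per line and the matching metadata row per line."""
    if len(documents) != len(vectors):
        raise ValueError("documents and vectors must have the same length")

    # Vectors file: tab-separated numbers, no header row.
    with open(embeddings_path, "w", encoding="utf-8") as f:
        for vector in vectors:
            f.write("\t".join(str(x) for x in vector) + "\n")

    # Metadata file: header row first, then rows in the same order
    # as the vectors so line i describes vector i.
    with open(metadata_path, "w", encoding="utf-8") as f:
        f.write("category\theadline\n")
        for doc in documents:
            f.write(_clean(doc["category"]) + "\t" + _clean(doc["headline"]) + "\n")
```

Keeping both files in the same order matters: the projector matches metadata rows to vectors purely by position.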

To visualize the embeddings, load both files into TensorFlow's Embedding Projector. Turn on color coding by category to see how the vectors are distributed across categories. You can see our projection at this link.

Customize for your data

Try replacing the BBC News corpus with your own data. And if you find anything interesting, we'd love to hear about it! Find us at [email protected].