#Introduction
This repository contains an example Python script that demonstrates how to convert results from the Rosette API's named entity extraction to the data format used in the CoNLL 2003 shared task for named entity extraction.
##The Annotated Data Model
To convert the named entity annotations we take advantage of Rosette's A(nnotated) D(ata) M(odel) via the Python bindings. The following is a sample ADM you might receive when you set the `"output"` parameter to `"rosette"` and make an `entities` call to the Rosette API:
{
"data": "New York City or NYC is the most populous city in the United States.\n",
"attributes": {
"entities": {
"items": [
{
"headMentionIndex": 0,
"mentions": [
{
"source": "gazetteer",
"subsource": "/data/roots/rex/data/gazetteer/eng/accept/gaz-LE.bin",
"normalized": "New York City",
"startOffset": 0,
"endOffset": 13
},
{
"source": "gazetteer",
"subsource": "/data/roots/rex/data/gazetteer/eng/accept/gaz-LE.bin",
"normalized": "NYC",
"startOffset": 17,
"endOffset": 20
}
],
"confidence": 0.501718114501715,
"type": "LOCATION",
"entityId": "Q60"
},
{
"headMentionIndex": 0,
"mentions": [
{
"source": "gazetteer",
"subsource": "/data/roots/rex/data/gazetteer/eng/accept/gaz-LE.bin",
"normalized": "United States",
"startOffset": 54,
"endOffset": 67
}
],
"confidence": 0.08375498050536179,
"type": "LOCATION",
"entityId": "Q30"
}
],
"type": "list",
"itemType": "entities"
},
"token": {
"items": [
{
"text": "New",
"startOffset": 0,
"endOffset": 3
},
{
"text": "York",
"startOffset": 4,
"endOffset": 8
},
{
"text": "City",
"startOffset": 9,
"endOffset": 13
},
{
"text": "or",
"startOffset": 14,
"endOffset": 16
},
{
"text": "NYC",
"startOffset": 17,
"endOffset": 20
},
{
"text": "is",
"startOffset": 21,
"endOffset": 23
},
{
"text": "the",
"startOffset": 24,
"endOffset": 27
},
{
"text": "most",
"startOffset": 28,
"endOffset": 32
},
{
"text": "populous",
"startOffset": 33,
"endOffset": 41
},
{
"text": "city",
"startOffset": 42,
"endOffset": 46
},
{
"text": "in",
"startOffset": 47,
"endOffset": 49
},
{
"text": "the",
"startOffset": 50,
"endOffset": 53
},
{
"text": "United",
"startOffset": 54,
"endOffset": 60
},
{
"text": "States",
"startOffset": 61,
"endOffset": 67
},
{
"text": ".",
"startOffset": 67,
"endOffset": 68
}
],
"type": "list",
"itemType": "token"
},
"scriptRegion": {
"items": [
{
"script": "Latn",
"startOffset": 0,
"endOffset": 69
}
],
"type": "list",
"itemType": "scriptRegion"
},
"languageDetection": {
"detectionResults": [
{
"confidence": 0.981137482980466,
"script": "Latn",
"language": "eng",
"encoding": "UTF-16BE"
}
],
"type": "languageDetection",
"startOffset": 0,
"endOffset": 69
},
"sentence": {
"items": [
{
"startOffset": 0,
"endOffset": 69
}
],
"type": "list",
"itemType": "sentence"
}
},
"responseHeaders": {
"X-RosetteAPI-Concurrency": "2",
"transfer-encoding": "chunked",
"Strict-Transport-Security": "max-age=63072000; includeSubdomains; preload",
"Server": "openresty",
"Connection": "keep-alive",
"X-RosetteAPI-Request-Id": "a53453af-7c40-4bd3-8849-513405f7cba0",
"Content-Encoding": "gzip",
"Vary": "Accept-Encoding",
"X-RosetteAPI-App-Id": "1409612466626",
"Date": "Tue, 29 Nov 2016 21:31:11 GMT",
"Content-Type": "application/json"
},
"version": "1.1.0",
"documentMetadata": {
"processedBy": [
"[email protected]",
"[email protected]",
"[email protected]"
],
"res-docid": [
"res-document-964ec8f4-f361-494f-828b-0bc746decdc0"
]
}
}
From this result we can access all the information we need to pull out the entity extractions and format them in the way we want.
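As a minimal sketch of how the entity mentions might be pulled out of an ADM like the one above, the following treats the parsed JSON as a plain Python dictionary. The helper name `iter_mentions` and the trimmed `adm` sample are illustrative assumptions; they are not part of the script or of the Rosette Python bindings.

```python
def iter_mentions(adm):
    """Yield (start, end, type, text) for each entity mention, ordered by start offset."""
    text = adm["data"]
    entities = adm["attributes"]["entities"]["items"]
    # Flatten the entity -> mentions nesting into one list of spans.
    mentions = [
        (mention["startOffset"], mention["endOffset"], entity["type"])
        for entity in entities
        for mention in entity["mentions"]
    ]
    for start, end, entity_type in sorted(mentions):
        yield start, end, entity_type, text[start:end]


# A trimmed copy of the sample ADM (only the fields this sketch reads).
adm = {
    "data": "New York City or NYC is the most populous city in the United States.\n",
    "attributes": {"entities": {"items": [
        {"type": "LOCATION", "mentions": [
            {"startOffset": 0, "endOffset": 13},
            {"startOffset": 17, "endOffset": 20},
        ]},
        {"type": "LOCATION", "mentions": [
            {"startOffset": 54, "endOffset": 67},
        ]},
    ]}},
}

for mention in iter_mentions(adm):
    print(mention)
```

Sorting by start offset puts the mentions in document order, which is convenient when walking the token list from left to right.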
##rosette_to_conll2003.py
This script traverses the words, sentences, and named entities identified in the ADM to produce CoNLL 2003-style output with one token per line.
###Installing Dependencies with Virtualenv
The script is written for Python 3. If you are alright with installing external Python packages globally, you may skip this section.
You can install the dependencies using `virtualenv` so that you don't alter your global site packages. The process for installing the dependencies using `virtualenv` is as follows for `bash` or similar shells:
Ensure your `virtualenv` is up to date.
$ pip install -U virtualenv
Note: You may need to use `pip3` depending on your Python installation.
`cd` into the directory where the `rosette_to_conll2003.py` script exists and create a Python virtual environment (this is the same location as this README):
$ virtualenv .
Activate the virtual environment:
$ source bin/activate
Once you've activated the virtual environment you can proceed to install the requirements safely without affecting your global site packages.
###Installing the Dependencies
You can install the dependencies via `pip` (or `pip3` depending on your installation of Python 3) as follows using the provided `requirements.txt`:
$ pip install -r requirements.txt
###Usage
Once you've installed the dependencies you can run the script as follows:
$ ./rosette_to_conll2003.py -h
usage: rosette_to_conll2003.py [-h] [-k KEY] [-u URL] [-l LANGUAGE] input
Get Rosette API named entity results in CoNLL 2003-style BIO format
positional arguments:
input A plain-text document to process
optional arguments:
-h, --help show this help message and exit
-k KEY, --key KEY Rosette API Key (default: None)
-u URL, --url URL Alternative API URL (default:
https://api.rosette.com/rest/v1/)
-l LANGUAGE, --language LANGUAGE
A three-letter (ISO 639-2 T) code that will override
Rosette language detection (default: None)
If you do not use the `--key` option the script will prompt you to type in your Rosette API key before running. If you find yourself running the script repeatedly, it may be convenient to set your Rosette API key as an environment variable in your shell:
$ export ROSETTE_USER_KEY=<your user key>
Then you can add your key as an option with `-k $ROSETTE_USER_KEY`.
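For readers adapting the script, the key lookup could also be done inside Python. The following is a hypothetical sketch, not how `rosette_to_conll2003.py` is documented to behave: the function name `get_api_key` and the fallback order (command-line value, then the `ROSETTE_USER_KEY` environment variable, then an interactive prompt) are assumptions.

```python
import getpass
import os


def get_api_key(cli_key=None, env_var="ROSETTE_USER_KEY"):
    """Return the key from the CLI option, else the environment, else a prompt."""
    if cli_key:
        return cli_key
    env_key = os.environ.get(env_var)
    if env_key:
        return env_key
    # getpass hides the typed key so it does not echo to the terminal.
    return getpass.getpass("Enter your Rosette API key: ")
```

With this order, an explicit `-k` value always wins, and the prompt only appears when neither the option nor the environment variable is set.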
###Example
The CoNLL 2003 data format has 4 fields separated by spaces:
Field | Description |
---|---|
1 | A word token |
2 | A part-of-speech (POS) tag |
3 | A syntactic chunk tag |
4 | A named entity tag |
The following is a sample sentence annotated in the CoNLL 2003 format:
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
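As a small illustration of the four-field layout, the sample sentence can be split into named fields with a few lines of Python. The names `ConllToken` and `parse_conll_line` are made up for this sketch.

```python
from typing import NamedTuple


class ConllToken(NamedTuple):
    word: str
    pos: str
    chunk: str
    ne: str


def parse_conll_line(line):
    """Split one space-separated CoNLL 2003 line into its four fields."""
    word, pos, chunk, ne = line.split()
    return ConllToken(word, pos, chunk, ne)


sample = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""

tokens = [parse_conll_line(line) for line in sample.splitlines() if line.strip()]
# Keep only the tokens that carry a named entity tag.
entities = [(token.word, token.ne) for token in tokens if token.ne != "O"]
print(entities)
```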
The CoNLL 2003 format uses so-called BIO, or B(eginning) I(nside) O(utside), tags to indicate the relative position of word tokens within named entity boundaries. Tags for tokens that are part of a named entity are suffixed with a named entity type: `LOC`, `ORG`, `PER`, or `MISC`. In the output of this script, the first word within a named entity is prefixed with `B-` because it is at the beginning of the mention. Subsequent tokens within a named entity are prefixed with `I-`, indicating they are inside the entity mention. All other tokens that are outside of an entity mention are tagged as `O`.
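The tagging rule described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `bio_tags` and the tuple layouts are assumptions, and it assumes every mention starts and ends on a token boundary, as in the sample ADM.

```python
def bio_tags(tokens, mentions):
    """Assign a BIO tag to each token.

    tokens:   list of (start, end, text) tuples in document order
    mentions: list of (start, end, entity_type) tuples
    """
    tags = []
    for start, end, _text in tokens:
        tag = "O"  # default: the token is outside every mention
        for m_start, m_end, m_type in mentions:
            if m_start <= start and end <= m_end:
                # First token of the mention gets B-, the rest get I-.
                prefix = "B-" if start == m_start else "I-"
                tag = prefix + m_type
                break
        tags.append(tag)
    return tags


tokens = [
    (0, 3, "New"), (4, 8, "York"), (9, 13, "City"),
    (14, 16, "or"), (17, 20, "NYC"), (21, 23, "is"),
]
mentions = [(0, 13, "LOCATION"), (17, 20, "LOCATION")]

for (_, _, text), tag in zip(tokens, bio_tags(tokens, mentions)):
    print(text, tag)
```

The nested loop is quadratic in the worst case, which is fine for a single document; a single pass over sorted mentions would scale better for large inputs.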
Note: In this example we will ignore the second field. You can get POS tags from the Rosette API via the `morphology/parts-of-speech` endpoint, but that is a separate API call, and we are only concerned with the named entity tags here. Rosette does not currently offer syntactic chunking, so we will also ignore the third field (though we do offer dependency parsing). In the fourth and final field, we use Rosette named entity tags, which form a larger, more informative set than the four tags used in the CoNLL 2003 shared task.
You can view the example text, `example/ny.txt`, as follows:
$ cat example/ny.txt
New York City or NYC is the most populous city in the United States.
You can run the script on the example file as follows:
$ ./rosette_to_conll2003.py example/ny.txt
Enter your Rosette API key:
-DOCSTART- -X- O O
New B-LOCATION
York I-LOCATION
City I-LOCATION
or O
NYC B-LOCATION
is O
the O
most O
populous O
city O
in O
the O
United B-LOCATION
States I-LOCATION
. O
To translate Rosette API named entity tags to CoNLL 2003 named entity tags, use the `--use-conll-ne-tags` option:
$ ./rosette_to_conll2003.py --use-conll-ne-tags example/ny.txt
Enter your Rosette API key:
-DOCSTART- -X- O O
New B-LOC
York I-LOC
City I-LOC
or O
NYC B-LOC
is O
the O
most O
populous O
city O
in O
the O
United B-LOC
States I-LOC
. O
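One way such a translation might work is a small lookup table applied to the type part of each tag. Only the `LOCATION` to `LOC` mapping is confirmed by the output above; the `ORGANIZATION` and `PERSON` entries, the fallback to `MISC`, and the names `ROSETTE_TO_CONLL` and `to_conll_tag` are assumptions made for this sketch and may differ from what the script does.

```python
# Hypothetical mapping; only LOCATION -> LOC is shown in the example output.
ROSETTE_TO_CONLL = {
    "LOCATION": "LOC",
    "ORGANIZATION": "ORG",
    "PERSON": "PER",
}


def to_conll_tag(tag, default="MISC"):
    """Translate a tag such as 'B-LOCATION' to 'B-LOC'; leave 'O' untouched."""
    if tag == "O":
        return tag
    # Split 'B-LOCATION' into the BIO prefix and the entity type.
    prefix, _, entity_type = tag.partition("-")
    return prefix + "-" + ROSETTE_TO_CONLL.get(entity_type, default)


for tag in ["B-LOCATION", "I-LOCATION", "O", "B-PRODUCT"]:
    print(tag, "->", to_conll_tag(tag))
```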