ACE2005-toolkit
├── ace_2005 (the ACE2005 raw data)
│   ├── data
│   │   └── ...
│   ├── docs
│   │   └── ...
│   ├── dtd
│   │   └── ...
│   └── index.html
├── cache_data (empty before run)
│   ├── Arabic/
│   ├── Chinese/
│   └── English/
├── filelist (train/dev/test doc files)
│   ├── ace.ar.dev
│   ├── ace.ar.test
│   ├── ace.ar.train
│   ├── ace.en.dev
│   ├── ace.en.test
│   ├── ace.en.train
│   ├── ace.zh.dev
│   ├── ace.zh.test
│   └── ace.zh.train
│
├── output (final output, empty before run)
│   ├── BIO (BIO output)
│   │   ├── train/
│   │   ├── test/
│   │   └── dev/
│   └── ...
├── udpipe (udpipe files)
│   ├── arabic-padt-ud-2.5-191206
│   ├── chinese-gsd-ud-2.5-191206
│   └── english-ewt-ud-2.5-191206
├── ace_parser.py
├── extract.py
├── format.py
├── transform.py
├── udpipe.py
├── requirements.txt
└── run.sh
- Download the ACE2005 raw data and rename the directory to `ace_2005`;
- Install the requirements with `pip install -r requirements.txt`;
- Start preprocessing with `bash run.sh en` (`en` can be replaced by `zh` or `ar`);
- Enter `n` to split the data according to filelist, or enter `y` followed by the `train/dev/test rate` (e.g. `0.8 0.1 0.1`) to split the data by sentences;
- Enter `y` to transform the data into BIO-style format. The transformed data will be in `output/BIO/`, and each train (test or dev) split will be transformed into 4 BIO-style json files (`token`, `entity_BIO`, `event_trigger_BIO` and `event_argument_BIO`);
- The final output will be in the directory `output/`.
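The sentence-level split by `train/dev/test rate` can be pictured with the sketch below. It is an illustration only, not the toolkit's own code: the function name `split_by_rate`, the fixed seed and the shuffle before splitting are all assumptions.

```python
import random


def split_by_rate(sentences, rates=(0.8, 0.1, 0.1), seed=0):
    """Split sentences into train/dev/test parts according to the given rates.

    Hypothetical helper: the toolkit may order or shuffle sentences differently.
    """
    if abs(sum(rates) - 1.0) > 1e-6:
        raise ValueError("rates must sum to 1")
    items = list(sentences)
    random.Random(seed).shuffle(items)  # assumed: shuffle before splitting
    n_train = int(len(items) * rates[0])
    n_dev = int(len(items) * rates[1])
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # whatever remains goes to test
    return train, dev, test


train, dev, test = split_by_rate(range(10))
print(len(train), len(dev), len(test))
```

Giving the remainder to the test part keeps every sentence in exactly one split even when the rates do not divide the sentence count evenly.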
The output is saved as separate files in `output/`, and each file can be loaded with `json.loads()`. After loading, the data is a Python `list` in which each item is a Python `dict`:
{
  "sentence": "Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region.",
  "tokens": [
    "Orders",
    "went",
    "out",
    "today",
    "to",
    "deploy",
    "17,000",
    "U.S.",
    "Army",
    "soldiers",
    "in",
    "the",
    "Persian",
    "Gulf",
    "region",
    "."
  ],
  "golden-entity-mentions": [
    {
      "entity-id": "CNN_CF_20030303.1900.02-E4-186",
      "entity-type": "GPE:Nation",
      "text": "U.S",
      "sent_id": "4",
      "position": [
        7,
        7
      ],
      "head": {
        "text": "U.S",
        "position": [
          7,
          7
        ]
      }
    },
    ...
  ],
  "golden-event-mentions": [
    {
      "event-id": "CNN_CF_20030303.1900.02-EV1-1",
      "event_type": "Movement:Transport",
      "arguments": [
        {
          "text": "17,000 U.S. Army soldiers",
          "sent_id": "4",
          "position": [
            6,
            9
          ],
          "role": "Artifact",
          "entity-id": "CNN_CF_20030303.1900.02-E25-1"
        },
        {
          "text": "the Persian Gulf region",
          "sent_id": "4",
          "position": [
            11,
            15
          ],
          "role": "Destination",
          "entity-id": "CNN_CF_20030303.1900.02-E76-191"
        }
      ],
      "text": "Orders went out today to deploy 17,000 U.S. Army soldiers\nin the Persian Gulf region",
      "sent_id": "4",
      "position": [
        0,
        15
      ],
      "trigger": {
        "text": "deploy",
        "position": [
          5,
          5
        ]
      }
    },
    ...
  ],
  "golden-relation-mentions": [
    {
      "relation-id": "CNN_CF_20030303.1900.02-R1-1",
      "relation-type": "ORG-AFF:Employment",
      "text": "17,000 U.S. Army soldiers",
      "sent_id": "4",
      "position": [
        6,
        9
      ],
      "arguments": [
        {
          "text": "17,000 U.S. Army soldiers",
          "sent_id": "4",
          "position": [
            6,
            9
          ],
          "role": "Arg-1",
          "entity-id": "CNN_CF_20030303.1900.02-E25-1"
        },
        {
          "text": "U.S. Army",
          "sent_id": "4",
          "position": [
            7,
            8
          ],
          "role": "Arg-2",
          "entity-id": "CNN_CF_20030303.1900.02-E66-157"
        }
      ]
    },
    ...
  ]
}
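A loaded record of this shape can be walked with ordinary dict and list access. The sketch below is a minimal illustration, not part of the toolkit: `iter_event_arguments` is a hypothetical helper, and the inline sample is trimmed from the record shown here so that the block runs without any ACE2005 data.

```python
import json


def iter_event_arguments(sentence):
    """Yield (event_type, trigger_text, role, argument_text) for one sentence dict."""
    for event in sentence.get("golden-event-mentions", []):
        trigger = event["trigger"]["text"]
        for arg in event.get("arguments", []):
            yield event["event_type"], trigger, arg["role"], arg["text"]


# Trimmed inline sample. A real file from output/ would be read the same way:
# json.loads(<file contents>) returns a list of such dicts.
sample = json.loads("""
[
  {
    "golden-event-mentions": [
      {
        "event_type": "Movement:Transport",
        "trigger": {"text": "deploy", "position": [5, 5]},
        "arguments": [
          {"text": "17,000 U.S. Army soldiers", "role": "Artifact"},
          {"text": "the Persian Gulf region", "role": "Destination"}
        ]
      }
    ]
  }
]
""")

rows = [row for sentence in sample for row in iter_event_arguments(sentence)]
for row in rows:
    print(row)
```

Note that the event type is stored under `event_type` (underscore), while entities and relations use the hyphenated keys `entity-type` and `relation-type`.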
The output files contain all the golden annotations for entities, events and relations.
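To show how the entity annotations relate to BIO tags, here is a minimal sketch that derives one tag per token from `golden-entity-mentions`. It is not the code in `transform.py`: the helper name `entity_bio_tags` is hypothetical, `position` is assumed to hold inclusive `[start, end]` token indices (as the example record suggests), and overlapping mentions are simply skipped, which may differ from what the toolkit does.

```python
def entity_bio_tags(sentence):
    """Return one BIO tag per token, built from golden-entity-mentions.

    Assumptions: "position" is an inclusive [start, end] token span, and a
    mention overlapping an already tagged span is skipped.
    """
    tags = ["O"] * len(sentence["tokens"])
    for mention in sentence.get("golden-entity-mentions", []):
        start, end = mention["position"]
        if any(tag != "O" for tag in tags[start:end + 1]):
            continue  # simplification: first mention wins on overlap
        label = mention["entity-type"]
        tags[start] = "B-" + label
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + label
    return tags


example = {
    "tokens": ["Orders", "went", "out", "today", "to", "deploy", "17,000",
               "U.S.", "Army", "soldiers", "in", "the", "Persian", "Gulf",
               "region", "."],
    "golden-entity-mentions": [
        {"entity-type": "GPE:Nation", "text": "U.S", "position": [7, 7]},
    ],
}
tags = entity_bio_tags(example)
print(list(zip(example["tokens"], tags))[6:9])
```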
You can edit the file names listed in `filelist/` to change which files belong to `train/dev/test`; by default we use a `529/30/40` division.
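After editing the filelist, it can help to check how many documents ended up in each split. The sketch below assumes that each `ace.<lang>.<split>` file names one document per line; that layout is an assumption, and `count_split_documents` is a hypothetical helper rather than part of the toolkit.

```python
from pathlib import Path


def count_split_documents(filelist_dir, lang="en"):
    """Count non-empty lines in ace.<lang>.train, ace.<lang>.dev and ace.<lang>.test.

    Assumes one document name per line (not confirmed by the toolkit docs).
    """
    counts = {}
    for split in ("train", "dev", "test"):
        path = Path(filelist_dir) / f"ace.{lang}.{split}"
        lines = path.read_text(encoding="utf-8").splitlines()
        counts[split] = sum(1 for line in lines if line.strip())
    return counts
```

Calling `count_split_documents("filelist", "en")` from the repository root would then return a dict with one count per split, under the one-document-per-line assumption.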
For any questions, contact us at [email protected].