We have defined a set of lightweight, task-specific schemas to simplify programmatic access to common nusantara-nlp
datasets. The applicable schema should be implemented for each dataset, in addition to a schema that preserves the original dataset format.
- Knowledge Base (KB)
  - Named entity recognition (NER)
  - Named entity disambiguation/normalization/linking (NED)
  - Event extraction (EE)
  - Relation extraction (RE)
  - Coreference resolution (COREF)
- Question Answering (QA)
  - Question answering (QA)
- Textual Entailment (TE)
  - Textual entailment (TE)
- Text Pairs (PAIRS)
  - Semantic Similarity (STS)
- Text to Text (T2T)
  - Paraphrasing (PARA)
  - Translation (TRANSL)
  - Summarization (SUM)
- Text (TEXT)
  - Text classification (TXTCLASS)
The Knowledge Base (KB) schema is a simple container format with minimal nesting that supports a range of common knowledge base construction / information extraction tasks.
- Named entity recognition (NER)
- Named entity disambiguation/normalization/linking (NED)
- Event extraction (EE)
- Relation extraction (RE)
- Coreference resolution (COREF)
{
"id": "ABCDEFG",
"document_id": "XXXXXX",
"passages": [...],
"entities": [...],
"events": [...],
"coreferences": [...],
"relations": [...]
}
Schema Notes
- `id` fields appear at the top (i.e. document) level and in every sub-component (`passages`, `entities`, `events`, `coreferences`, `relations`). They can be set in any fashion that makes every `id` field in a dataset unique (including `id` fields in different splits like train/validation/test).
- `document_id` should be a dataset-provided document id. If the dataset does not provide one, it can be set equal to the top-level `id`.
- `offsets` contain character offsets into the string that would be created from `" ".join([passage["text"] for passage in passages])`.
- `offsets` and `text` are always lists to support discontinuous spans. For continuous spans, they have the form `offsets=[(lo,hi)], text=["text span"]`. For discontinuous spans, they have the form `offsets=[(lo1,hi1), (lo2,hi2), ...], text=["text span 1", "text span 2", ...]`.
- The `normalized` sub-component may contain one or more normalized links to database entity identifiers.
- `passages` captures document structure such as named sections.
- `entities`, `events`, `coreferences`, and `relations` may be empty fields depending on the dataset and specific task.
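The offset convention in the notes above can be sketched as follows. This is a minimal illustration, assuming end-exclusive offsets (Python slice style, which the notes do not state explicitly) and that each passage's `text` is a single string, as the join expression implies. The helper `resolve_spans` and the two-passage document are hypothetical and not taken from any dataset.

```python
def resolve_spans(passages, offsets):
    """Return the text spans that `offsets` select from the joined passage text."""
    # Build the reference string exactly as described in the schema notes.
    document_text = " ".join([passage["text"] for passage in passages])
    # Assumption: each (lo, hi) pair is end-exclusive, like a Python slice.
    return [document_text[lo:hi] for lo, hi in offsets]


# Hypothetical two-passage document used only for illustration.
passages = [
    {"id": "p0", "text": "Budi tinggal di Jakarta."},
    {"id": "p1", "text": "Dia bekerja di Bandung."},
]

# Continuous span: a single (lo, hi) pair.
continuous = resolve_spans(passages, [(16, 23)])

# Discontinuous span: several pairs, possibly crossing passage boundaries.
discontinuous = resolve_spans(passages, [(0, 4), (40, 47)])

print(continuous)
print(discontinuous)
```

A check like this can be used when writing a dataloader: the spans recovered from `offsets` should equal the `text` list stored on the same entity.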
- Schema Template
- Examples: SmSA
{
"id": "0",
"text": "meski masa kampanye sudah selesai , bukan berati habis pula upaya mengerek tingkat kedipilihan elektabilitas .",
"labels": [
"neutral"
]
}
- Schema Template
- Examples: BaPOS
{
"id": "0",
"tokens": [
"Seorang",
"penduduk",
"yang",
"tinggal",
"dekat",
"tempat",
"kejadian",
"mengatakan",
",",
"dia",
"mendengar",
"suara",
"tabrakan",
"yang",
"keras",
"dan",
"melihat",
"mobil",
"ambulan",
"membawa",
"orang-orang",
"yang",
"berlumuran",
"darah",
"."
],
"labels": [
"B-NND",
"B-NN",
"B-SC",
"B-VB",
"B-JJ",
"B-NN",
"B-NN",
"B-VB",
"B-Z",
"B-PRP",
"B-VB",
"B-NN",
"B-NN",
"B-SC",
"B-JJ",
"B-CC",
"B-VB",
"B-NN",
"B-NN",
"B-VB",
"B-NN",
"B-SC",
"B-VB",
"B-NN",
"B-Z"
]
}
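In the example above, `tokens` and `labels` run in parallel, one label per token. A minimal sketch of consuming such a record, assuming that one-to-one alignment always holds (it does for the example shown, but the source does not state it as a rule); `pair_tokens_with_labels` is a hypothetical helper, and the record below is a shortened copy of the example.

```python
def pair_tokens_with_labels(example):
    """Zip tokens with their labels, refusing records whose lengths differ."""
    tokens, labels = example["tokens"], example["labels"]
    if len(tokens) != len(labels):
        raise ValueError(
            f"example {example['id']}: {len(tokens)} tokens but {len(labels)} labels"
        )
    return list(zip(tokens, labels))


# First three tokens and labels of the record shown above.
example = {
    "id": "0",
    "tokens": ["Seorang", "penduduk", "yang"],
    "labels": ["B-NND", "B-NN", "B-SC"],
}

pairs = pair_tokens_with_labels(example)
for token, label in pairs:
    print(token, label)
```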
- Schema Template
- Examples: MQP
{
"id": "0",
"document_id": "NULL",
"text_1": "Am I over weight (192.9) for my age (39)?",
"text_2": "I am a 39 y/o male currently weighing about 193 lbs. Do you think I am overweight?",
"label": 1
}
- Schema Template
- Examples: TyDiQA-ID
{
"id": "0",
"document_id": "24267510",
"question_id": "55031181e9bde69634000014",
"question": "Is RANKL secreted from the cells?",
"type": "yesno",
"choices": [],
"context": "Osteoprotegerin (OPG) is a soluble secreted factor that acts as a decoy receptor for receptor activator of NF-\u03baB ligand (RANKL)",
"answer": ["yes"]
}
- Schema Template
- Examples: ParaMed
{
"id": "0",
"text_1": "Pleasing God doesn't mean that we must busy ourselves with a new set of \"spiritual\" activities\n",
"text_2": "Menyenangkan Allah tidaklah berarti bahwa kita harus menyibukkan diri sendiri dengan berbagai aktivitas rohani\n",
"text_1_name": "eng",
"text_2_name": "ind"
}
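The `text_1_name` and `text_2_name` fields identify what each text is (language codes in the example above). A minimal sketch of turning such a record into a mapping keyed by those names; `to_named_texts` is a hypothetical helper, the record is invented for illustration, and stripping the trailing newline is an assumption about what a consumer would want rather than part of the schema.

```python
def to_named_texts(example):
    """Map each text's name (e.g. a language code) to its text."""
    return {
        example["text_1_name"]: example["text_1"].strip(),
        example["text_2_name"]: example["text_2"].strip(),
    }


# Hypothetical record following the field layout of the example above.
example = {
    "id": "0",
    "text_1": "Good morning\n",
    "text_2": "Selamat pagi\n",
    "text_1_name": "eng",
    "text_2_name": "ind",
}

named = to_named_texts(example)
print(named["eng"], "->", named["ind"])
```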
- Schema Template
- Examples: CC100
{
"id": "0",
"text": "Placeholder text. Will change to a real example soon."
}
- Schema Template
- Examples: TITML-IDN
{
"id": "01-001",
"path": ".cache/huggingface/datasets/downloads/extracted/ecbf4ad46b3db9b85aa9108272c39dc75a268b4c0b92f2827866ef17dea97585/01/01-001.wav",
"audio": {
"path": ".cache/huggingface/datasets/downloads/extracted/ecbf4ad46b3db9b85aa9108272c39dc75a268b4c0b92f2827866ef17dea97585/01/01-001.wav",
"array": array([-0.0005188 , -0.00018311, -0.00021362, ..., -0.00018311, -0.00033569, -0.00015259], dtype=float32),
"sampling_rate": 16000
},
"text": "hai selamat pagi apa kabar",
"speaker": "01",
"metadata": {"speaker_age": 25, "speaker_gender": "female"}
}
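The `audio` field in the example above pairs a sample `array` with its `sampling_rate`, which is enough to derive the clip length. A minimal sketch, assuming a one-dimensional (mono) sample array; `audio_duration_seconds` is a hypothetical helper, and the record uses a plain list of zeros in place of real audio.

```python
def audio_duration_seconds(example):
    """Clip length in seconds: number of samples divided by samples per second."""
    audio = example["audio"]
    # Assumption: `array` is a 1-D sequence of samples (mono audio).
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical record: 32,000 silent samples at 16 kHz.
example = {
    "id": "demo",
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "text": "hai selamat pagi apa kabar",
}

duration = audio_duration_seconds(example)
print(duration)
```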