CLSE: Corpus of Linguistically Significant Entities

Description

The Corpus of Linguistically Significant Entities (CLSE) is a dataset of named entities annotated by expert linguists. It covers 34 languages and 74 semantic types, supporting applications ranging from airline ticketing to video games. The corpus aims to facilitate the creation of more linguistically diverse NLG datasets.

For more details, see the docs/ directory and the paper.

License

The contents of this repository are licensed under CC-BY.

Paper

Please cite the following paper when using this dataset:

@inproceedings{clse2022,
  title={CLSE: Corpus of Linguistically Significant Entities},
  author={Chuklin, Aleksandr and Zhao, Justin and Kale, Mihir},
  booktitle={Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2022) at EMNLP 2022},
  year={2022}
}

https://arxiv.org/abs/2211.02423