CLSE: Corpus of Linguistically Significant Entities

Description

The Corpus of Linguistically Significant Entities (CLSE) is a dataset of named entities annotated by expert linguists. It covers 34 languages and 74 semantic types, supporting applications that range from airline ticketing to video games. The aim of the corpus is to facilitate the creation of more linguistically diverse NLG datasets.

For more details, see the docs/ directory and the paper.

License

The contents of this repository are licensed under CC-BY.

Paper

Make sure to cite the following paper when using this dataset:

@inproceedings{clse2022,
  title={CLSE: Corpus of Linguistically Significant Entities},
  author={Chuklin, Aleksandr and Zhao, Justin and Kale, Mihir},
  booktitle={Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2022) at EMNLP 2022},
  year={2022}
}

https://arxiv.org/abs/2211.02423
