This repository contains the models and datasets described in the article titled "Universal Grammatical Dependencies for Portuguese with CINTIL Data, LX Processing and CLARIN support", presented at LREC 2022.
You can download the code and datasets in two ways:
- clone this repository via Git
- download zip files from the PORTULAN CLARIN repository:
  - CINTIL-UPos dataset
  - CINTIL-UDep dataset
  - CINTIL-USuite dataset
  - LX-UTagger Transformer model
  - LX-UDParser NLP4J model
  - LX-USuite Transformer models
However, the recommended and easiest way to use LX-UTagger, LX-UDParser, and LX-USuite is through the PORTULAN CLARIN workbench.
The following data is available in this repository:
- The directory `cintil-upos` contains CINTIL-UPos, the CINTIL corpus annotated with Universal POS tags. See `cintil-upos/README.md`.
- The directory `cintil-udep` contains CINTIL-UDep, the CINTIL treebank annotated with Universal Dependencies. See `cintil-udep/README.md`.
- The directory `cintil-usuite` contains CINTIL-USuite, the CINTIL corpus annotated with Universal POS tags, lemmas and Universal features. See `cintil-usuite/README.md`.
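As a rough illustration of how such annotations can be consumed, the sketch below reads sentences in CoNLL-U, the ten-column tab-separated format commonly used for Universal Dependencies data. It assumes the datasets are distributed in that format, which this README does not state; check the per-directory README files for the actual layout. The function name `read_conllu` and the inline sample sentence are invented for illustration and are not taken from CINTIL.

```python
from typing import Dict, Iterable, Iterator, List

# The ten column names defined by the CoNLL-U format.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dicts.

    Hypothetical helper: assumes well-formed CoNLL-U input.
    """
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line marks the end of a sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment (e.g. "# text = ..."); skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        # Last sentence when the input does not end with a blank line.
        yield sentence


# Made-up sample for demonstration only; not an excerpt from the datasets.
SAMPLE = (
    "# text = Ela chegou.\n"
    "1\tEla\tele\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tchegou\tchegar\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

To read a file instead of the inline sample, pass an open file handle (opened with `encoding="utf-8"`) to `read_conllu`; it iterates line by line, so large files are not loaded into memory at once.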
The following models are available in this repository:
- The directory `lx-udparser` contains the LX-UDParser NLP4J model. See `lx-udparser/README.md` for more information.
- The directory `lx-utagger` contains the LX-UTagger Transformer model. See `lx-utagger/README.md` for more information.
- The directory `lx-usuite` contains the LX-USuite code, which is a wrapper for three token classifiers (LX-UTagger, LX-UFeaturizer and LX-NeuralLemmatizer). See `lx-usuite/README.md` for more information.
- The directory `lx-ufeaturizer` contains the LX-UFeaturizer Transformer models. See `lx-ufeaturizer/README.md` for more information.
- The directory `lx-neurallemmatizer` contains the LX-NeuralLemmatizer Transformer models. See `lx-neurallemmatizer/README.md` for more information.
Whichever version of these models and datasets you use, please cite the following reference when mentioning them:
António Branco, João Ricardo Silva, Luís Gomes and João Rodrigues, 2022, "Universal Grammatical Dependencies for Portuguese with CINTIL Data, LX Processing and CLARIN support", In Proceedings, 13th Conference on Language Resources and Evaluation (LREC2022) (pdf).
BibTeX:
@InProceedings{branco-EtAl:2022:LREC,
author = {Branco, António and Silva, João Ricardo and Gomes, Luís and António Rodrigues, João},
title = {Universal Grammatical Dependencies for Portuguese with CINTIL Data, LX Processing and CLARIN support},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {5617--5626},
url = {https://aclanthology.org/2022.lrec-1.603}
}
The models and datasets in this repository are made available under the Creative Commons BY-NC-ND license (Attribution-NonCommercial-NoDerivatives 4.0 International).
See `LICENSE.txt` for the full text.