Training Models for New Language Support #1343
Regarding the language code, the documentation on adding a new language
starts with an explanation of how to add a new language code:
https://stanfordnlp.github.io/stanza/new_language.html
I added the new code to the dev branch, if you want to use that instead of
1.7.0. I also added some debugging output to run_tokenizer which will
report where the tokenized files are being written. Hopefully that helps
you figure out where the mismatch is happening. If not, let us know what
the output is and we'll try to figure it out.
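Roughly, the preparation step that generates those .toklabels files looks like the following (UD_Classical_Armenian-Test is just a placeholder here for whatever your treebank directory under $UDBASE is actually called):
source config.sh
# reads the train/dev/test .conllu files from $UDBASE/UD_Classical_Armenian-Test
# and writes the *-ud-train.toklabels / *-ud-dev.toklabels files under $DATA_ROOT/tokenize
python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_Classical_Armenian-Test
# run_tokenizer will only find its inputs after the prepare step has succeeded
python3 -m stanza.utils.training.run_tokenizer UD_Classical_Armenian-Test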
|
Thank you very much, I managed to train the tokenizer, lemmatizer, POS tagger and dependency parser. Now, I cannot deploy them. Whenever I try to run the trained models, for some reason, the English models come forward. How can I run my models from the saved models folder? Also, since we have quite good results and a whole pipeline ready, how can we add XCL to the official languages and make our models public? Thanks |
I'd have to see what you mean by the English models coming forward. I assume
you mean you are trying to build a pipeline with Pipeline("en") as opposed
to Pipeline("xcl")? In general, you can give it the path to the models, e.g.
the saved_models/... folders, with the "tokenize_model_path",
"lemma_model_path", etc. attributes. You will also need to give the POS and
dependency parsers the pretrained embeddings path (if you used them) with
"pos_pretrain_path".
We are always happy to add support for more languages! The easiest way is
to send us a reference to the data sources used, any code changes or
conversion scripts needed to make those data sources compatible with
Stanza, and a reference to the embeddings.
The new version 1.8.0 has the xcl language code, btw.
|
Please let us know if we can help host the models - the main thing we need to be able to rebuild them going forward is the code changes and links to the data sources. |
Thank you for your response! I am currently retraining the models with some additional data! I guess in an hour or so they will be ready and I will try to execute them. If I still run into problems I will get back to you! |
Are you able to sync the dev branch and try that? I just ran into this
issue with someone looking to add their models for Albanian:
#1360
Here is the format for the Pipeline you need to use:
#1360 (comment)
|
Thank you for the suggestions. |
The easiest way is if you're able to turn your dataset into a UD treebank: https://universaldependencies.org/ If that's not an option, but you have some other repo for the data, please let us know. Any code changes or new scripts you needed to convert the data to the Stanza input formats would also be very helpful. |
Thanks for the information. Actually, our dataset is available in UD and the models are trained on the latest released version of UD Classical Armenian. We have the four models I mentioned earlier (tokenizer, lemmatizer, POS tagger and dependency parser), so we wanted to know if it is possible to make those models part of the official Stanza package. Thanks |
Sounds good. Yes, we can definitely add that as a model. Are there word vectors or other resources we need to make those models aside from the UD dataset? |
Thank you for your quick answer. Yes there are word vectors that have been used for the training (and deployment too). We will provide those too. |
Unless there's anything unusual about the training, we'll be happy to use those word vectors to rebuild the models as part of our updates for new UD releases. If there's something specific we have to do, please let us know or make a PR |
Thank you! How would you like me to send you the vectors? And what is the expected new release date? Thanks |
Probably late June, on account of other work commitments. People have posted them in Box or Dropbox, for example, in the past; I may even be able to find some storage at the university which we can share. How big are they? |
Thank you for your prompt response. |
Ultimately, like nearly all open source projects, we are dependent on users observing their license terms. (Stanza is licensed under the generous Apache license, which does allow commercial use, but there are still license terms.) We are happy to label the models with their license and to point out that the models are restricted to non-commercial use, but we aren’t in a position to control usage (again, any more than a typical open source project). It’d be part of our offering as a non-profit organization, and we wouldn’t be charging anyone for their use. |
Thank you! Actually, what you said makes sense. We will just ask to label them with the relevant license and that's it. The vectors file is not that big; I can even send it via email. What would you be more comfortable with? |
For sending us the word vectors, email, dropbox, or anything that works for you will work great. Thanks! |
Can I please ask for an email? I will send them right away. |
oh, sorry, can you not see it from my account? |
I can see it, just wasn't sure if that was the right address to send. |
For the word vectors, is there a citation of some kind or other links I should put on this page? https://stanfordnlp.github.io/stanza/word_vectors.html#word-vector-sources |
Also, does |
Hi. There is no specific paper about the word vectors, but we mention them briefly in our paper: |
Sounds good. I will add that information to the documentation. https://stanfordnlp.github.io/stanza/word_vectors.html#classical-armenian |
I am trying to train a pipeline for a new language (xcl). My goal is to train the full pipeline (tokenizer, tagger, parser, lemmatizer, and morphological parser) for this language, starting with the tokenizer. I've followed the instructions provided in the official documentation and GitHub repository closely but have encountered several issues that hinder my progress.
Here are the steps I've taken and the issues encountered:
1. Setting Up Environment and Data: After organizing my .conllu files for training and validation as per the guidelines, I set the environment variables in config.sh and sourced it (a rough sketch of those variables is at the end of this post). My data is for the language code xcl, which is not recognized by Stanza, so I used HY (Armenian) as a temporary workaround.
2. Training the Tokenizer: When attempting to train the tokenizer using the command
python3 -m stanza.utils.training.run_tokenizer HY_Classical_Armenian
I encountered a FileNotFoundError related to missing .toklabels files, which should have been generated during the data preparation step. The exact error message was:
FileNotFoundError: [Errno 2] No such file or directory: 'data/tokenize/hy_classical_armenian-ud-train.toklabels'
This indicates that either the preparation step was missed or did not complete successfully, or there's a mismatch in the expected directory structure or naming convention. However, following the instructions from the documentation and GitHub, it wasn't clear how to proceed with the preparation step for a language not yet recognized by Stanza.
- Could you provide more detailed instructions or clarification on how to correctly prepare the data for model training, especially for new languages not currently supported by Stanza? This includes generating the necessary .toklabels files.
- Is there a recommended approach to adding support for entirely new languages, ensuring that all necessary preprocessing and setup steps are covered?
- Any advice on troubleshooting or steps I might have overlooked would be greatly appreciated. I am especially interested in any scripts or commands specific to preparing data for new languages.
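For reference, the environment variables I set in config.sh look roughly like this (the values are placeholders for my local paths; the variable names are the ones the training scripts read, as far as I can tell):
# config.sh (sketch; the paths below are placeholders)
export UDBASE=/path/to/ud-treebanks      # directory containing the UD_*/ treebank folders with the .conllu files
export DATA_ROOT=data                    # prepared training files end up under $DATA_ROOT/tokenize, $DATA_ROOT/pos, ...
export WORDVEC_DIR=/path/to/wordvec     # pretrained word vectors used by the POS tagger and dependency parser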