Issues Training Models for New Language Support #1343

LilitKharatyan · 2024-02-14T17:50:53Z

I am trying to train a pipeline for a new language (xcl). My goal is to train the full pipeline (tokenizer, tagger, parser, lemmatizer, and morphological parser) for this language, starting with the tokenizer. I've followed the instructions provided in the official documentation and GitHub repository closely but have encountered several issues that hinder my progress.

Here are the steps I've taken and the issues encountered:

Setting Up Environment and Data: After organizing my .conllu files for training and validation as per the guidelines, I set the environment variables in config.sh and sourced it. My data is for the language code xcl, which is not recognized by Stanza, so I used HY (Armenian) as a temporary workaround.
Training the Tokenizer: When attempting to train the tokenizer using the command
python3 -m stanza.utils.training.run_tokenizer HY_Classical_Armenian

I encountered a FileNotFoundError related to missing .toklabels files, which should have been generated during the data preparation step. The exact error message was:

FileNotFoundError: [Errno 2] No such file or directory: 'data/tokenize/hy_classical_armenian-ud-train.toklabels'

This indicates that either the preparation step was missed or did not complete successfully, or there's a mismatch in the expected directory structure or naming convention. However, following the instructions from the documentation and GitHub, it wasn't clear how to proceed with the preparation step for a language not yet recognized by Stanza.

Could you provide more detailed instructions or clarification on how to correctly prepare the data for model training, especially for new languages not currently supported by Stanza? This includes generating the necessary .toklabels files.
Is there a recommended approach to adding support for entirely new languages, ensuring that all necessary preprocessing and setup steps are covered?
Any advice on troubleshooting or steps I might have overlooked would be greatly appreciated. I am especially interested in any scripts or commands specific to preparing data for new languages.

The text was updated successfully, but these errors were encountered:

AngledLuffa · 2024-02-14T20:23:10Z

Regarding the language code, the documentation on adding a new language starts with an explanation on how to add a new language code: https://stanfordnlp.github.io/stanza/new_language.html I added the new code to the dev branch, if you want to use that instead of 1.7.0 I also added some debugging output to run_tokenizer which will report where the tokenized files are being written. Hopefully that helps you figure out where the mismatch is happening. If not, let us know what the output is and we'll try to figure it out.

…

On Wed, Feb 14, 2024 at 9:51 AM Lilit Kharatyan ***@***.***> wrote: I am trying to train a pipeline for a new language (xcl). My goal is to train the full pipeline (tokenizer, tagger, parser, lemmatizer, and morphological parser) for this language, starting with the tokenizer. I've followed the instructions provided in the official documentation and GitHub repository closely but have encountered several issues that hinder my progress. Here are the steps I've taken and the issues encountered: 1. Setting Up Environment and Data: After organizing my .conllu files for training and validation as per the guidelines, I set the environment variables in config.sh and sourced it. My data is for the language code xcl, which is not recognized by Stanza, so I used HY (Armenian) as a temporary workaround. 2. Training the Tokenizer: When attempting to train the tokenizer using the command python3 -m stanza.utils.training.run_tokenizer HY_Classical_Armenian I encountered a FileNotFoundError related to missing .toklabels files, which should have been generated during the data preparation step. The exact error message was: FileNotFoundError: [Errno 2] No such file or directory: 'data/tokenize/hy_classical_armenian-ud-train.toklabels' This indicates that either the preparation step was missed or did not complete successfully, or there's a mismatch in the expected directory structure or naming convention. However, following the instructions from the documentation and GitHub, it wasn't clear how to proceed with the preparation step for a language not yet recognized by Stanza. - Could you provide more detailed instructions or clarification on how to correctly prepare the data for model training, especially for new languages not currently supported by Stanza? This includes generating the necessary .toklabels files. - Is there a recommended approach to adding support for entirely new languages, ensuring that all necessary preprocessing and setup steps are covered? - Any advice on troubleshooting or steps I might have overlooked would be greatly appreciated. I am especially interested in any scripts or commands specific to preparing data for new languages. — Reply to this email directly, view it on GitHub <#1343>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA2AYWOFGNUCTKJ4TYN575DYTT2QXAVCNFSM6AAAAABDIW5PZ6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGEZTIOBWGY2DQMA> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

LilitKharatyan · 2024-02-26T08:59:17Z

Thank you very much, I managed to train tokenizer, lemmatizer, POS and dependency parser. Now, I cannot deploy them. Whenever I try to run the trained models, for some reason, English models come forward. How can I run my models from the saved models folder. Alos since we have quite good results and a whole pipeline ready, how can we add XCL to the official languages, and make our models public? Thanks

AngledLuffa · 2024-02-26T21:09:04Z

I'd have to see what you mean by the English models come forward. I assume you mean you are trying to build a pipeline with Pipeline("en") as opposed to Pipeline("xcl")? In general, you can give it the path to the models, eg the saved_models/... folders, with the "tokenize_model_path", "lemma_model_path", etc attributes. You will also need to give the POS and dependency parsers the pretrained embeddings path (if you used them) with "pos_pretrain_path" We are always happy to add support for more languages! The easiest way is to send us a reference to the data sources used, any code changes or conversion scripts needed to make those data sources compatible with stanza, and a reference to the embeddings. The new version 1.8.0 has the xcl language code, btw

…

On Mon, Feb 26, 2024 at 12:59 AM Lilit Kharatyan ***@***.***> wrote: Thank you very much, I managed to train tokenizer, lemmatizer, POS and dependency parser. Now, I cannot deploy them. Whenever I try to run the trained models, for some reason, English models come forward. How can I run my models from the saved models folder. Alos since we have quite good results and a whole pipeline ready, how can we add XCL to the official languages, and make our models public? Thanks — Reply to this email directly, view it on GitHub <#1343 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA2AYWINYJWD7JS6VSW3MXLYVRFHDAVCNFSM6AAAAABDIW5PZ6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNRTGYYTKMRXHA> . You are receiving this because you commented.Message ID: ***@***.***>

AngledLuffa · 2024-03-03T21:43:20Z

Please let us know if we can help host the models - the main thing we need to be able to rebuild them going forward is the code changes and links to the data sources.

LilitKharatyan · 2024-03-04T09:31:46Z

Thank you for your response! I am currently retraining the models with some additional data! I guess in an hour or so they will be ready and I will try to execute them. If I still run into problems I will get back to you!

AngledLuffa · 2024-03-18T19:13:57Z

Are you able to sync the dev branch and try that? I just ran into this issue with someone looking to add their models for Albanian: #1360 Here is the format for the Pipeline you need to use: #1360 (comment)

LilitKharatyan · 2024-05-20T13:45:32Z

Thank you for the suggestions.
Now we have tokenizer, lemmatizer, POS Tagger and Dependency parser for classical Armenian, with pretty good results. Can you please guide us how (if) we can add them to the list of languages Stanza supports? If so, what do we need to provide for that? Thank you!

AngledLuffa · 2024-05-21T16:19:55Z

The easiest way is if you're able to turn your dataset into a UD treebank:

https://universaldependencies.org/

If that's not an option, but you have some other repo for the data, please let us know. Any code changes or new scripts you needed to convert the data to the Stanza input formats would also be very helpful.

LilitKharatyan · 2024-05-22T10:16:26Z

Thanks for the information. Actually, our dataset is available in UD and the models are trained on the last released dataset of UD Classical Armenian. We have the four models I mentioned earlier (tokenizer, lemmatizer, POS Tagger and Dependency parser) so we wanted to know if it is possible to make those models part of the official package of Stanza. Thanks

AngledLuffa · 2024-05-22T16:26:03Z

Sounds good. Yes, we can definitely add that as a model. Are there word vectors or other resources we need to make those models aside from the UD dataset?

LilitKharatyan · 2024-05-22T17:10:36Z

Thank you for your quick answer. Yes there are word vectors that have been used for the training (and deployment too). We will provide those too.

AngledLuffa · 2024-05-22T17:27:21Z

Unless there's anything unusual about the training, we'll be happy to use those word vectors to rebuild the models as part of our updates for new UD releases. If there's something specific we have to do, please let us know or make a PR

LilitKharatyan · 2024-05-23T05:18:46Z

Thank you! How would you like me to send you the vectors? And what is the expected new release date? Thanks

AngledLuffa · 2024-05-23T05:25:46Z

Probably late June on account of other work commitments, and people have posted them in box or dropbox for example in the past

I may even be able to find some storage at the university which we can share. How big are they?

LilitKharatyan · 2024-05-28T09:27:36Z

Thank you for your prompt response.
We have a specific question regarding the licensing of AI models. Our vectors have been trained on a dataset to which we were granted access under the condition that both the dataset and any AI models derived from it can only be used for non-profit purposes.
Could you please clarify how licensing is regulated on your side? Additionally, do you believe this usage restriction can be effectively controlled? We prefer to license our models under the CC BY-NC-ND 4.0 license. Is this possible within your framework?
Thank you for your assistance.

manning · 2024-05-28T21:23:18Z

Ultimately, like nearly all open source projects, we are dependent on users observing their license terms. (Stanza is licensed under the generous Apache license, which does allow commercial use, but there are still license terms.) We are happy to label the models with their license and to point out that the models are restricted to non-commercial use, but we aren’t in a position to control usage (again, any more than a typical open source project). It’d be part of our offering as a non-profit organization, and we wouldn’t be charging anyone for their use.
Based on that, it’s up to you whether sharing the word vectors you trained in this way would violate your license agreement or not. We doubt that there will be commercial use of Classical Armenian models, but if you really wanted to regulate access, you’d need to keep the models and only make them available to people who have (say) filled out a license agreement.
You could write back to ask the original dataset owners whether or not this use case would violate the agreement or seems okay to them. Some licensors are more strict than others...

LilitKharatyan · 2024-06-06T16:33:58Z

Thank you! Actually what you said makes sense. We will just ask to label them with the relevant license and that's it. The vector's file is not that big, I can even send it via email. What would you be more comfortable with?

AngledLuffa · 2024-06-16T06:43:08Z

For sending us the word vectors, email, dropbox, or anything that works for you will work great. Thanks!

LilitKharatyan · 2024-06-17T10:55:37Z

Can I please ask for an email? I will send them right away.

AngledLuffa · 2024-06-17T17:49:44Z

oh, sorry, can you not see it from my account?

[email protected]

LilitKharatyan · 2024-06-18T09:51:13Z

I can see it, just wasn't sure if that was the right address to send.
I just sent the vectors to you. If anything please let me know. Thanks

AngledLuffa · 2024-06-25T07:17:16Z

For the word vectors, is there a citation of some kind or other links I should put on this page?

https://stanfordnlp.github.io/stanza/word_vectors.html#word-vector-sources

AngledLuffa · 2024-06-25T15:38:07Z

Also, does caval refer to the university group in Australia, or is there someone else I should be crediting for this?

LilitKharatyan · 2024-06-28T11:48:17Z

Hi. There is not a specific paper about the word vectors but we mention them briefly in our paper:
Caval refers to the following project of the University of Wurzburg.
In case there are more questions, please let us know

AngledLuffa · 2024-06-28T16:19:42Z

Sounds good. I will add that information to the documentation.

https://stanfordnlp.github.io/stanza/word_vectors.html#classical-armenian

LilitKharatyan added the question label Feb 14, 2024

AngledLuffa added a commit that referenced this issue Jun 23, 2024

Make Caval the default set of models for XCL. #1343

8b103ff

Jemoka pushed a commit that referenced this issue Jul 16, 2024

Make Caval the default set of models for XCL. #1343

02b92f1

AngledLuffa closed this as completed Sep 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues Training Models for New Language Support #1343

Issues Training Models for New Language Support #1343

LilitKharatyan commented Feb 14, 2024

AngledLuffa commented Feb 14, 2024 via email

LilitKharatyan commented Feb 26, 2024

AngledLuffa commented Feb 26, 2024 via email

AngledLuffa commented Mar 3, 2024

LilitKharatyan commented Mar 4, 2024

AngledLuffa commented Mar 18, 2024 via email

LilitKharatyan commented May 20, 2024

AngledLuffa commented May 21, 2024

LilitKharatyan commented May 22, 2024

AngledLuffa commented May 22, 2024

LilitKharatyan commented May 22, 2024

AngledLuffa commented May 22, 2024

LilitKharatyan commented May 23, 2024

AngledLuffa commented May 23, 2024

LilitKharatyan commented May 28, 2024

manning commented May 28, 2024 •

edited

Loading

LilitKharatyan commented Jun 6, 2024

AngledLuffa commented Jun 16, 2024

LilitKharatyan commented Jun 17, 2024

AngledLuffa commented Jun 17, 2024

LilitKharatyan commented Jun 18, 2024

AngledLuffa commented Jun 25, 2024

AngledLuffa commented Jun 25, 2024

LilitKharatyan commented Jun 28, 2024

AngledLuffa commented Jun 28, 2024

Issues Training Models for New Language Support #1343

Issues Training Models for New Language Support #1343

Comments

LilitKharatyan commented Feb 14, 2024

AngledLuffa commented Feb 14, 2024 via email

LilitKharatyan commented Feb 26, 2024

AngledLuffa commented Feb 26, 2024 via email

AngledLuffa commented Mar 3, 2024

LilitKharatyan commented Mar 4, 2024

AngledLuffa commented Mar 18, 2024 via email

LilitKharatyan commented May 20, 2024

AngledLuffa commented May 21, 2024

LilitKharatyan commented May 22, 2024

AngledLuffa commented May 22, 2024

LilitKharatyan commented May 22, 2024

AngledLuffa commented May 22, 2024

LilitKharatyan commented May 23, 2024

AngledLuffa commented May 23, 2024

LilitKharatyan commented May 28, 2024

manning commented May 28, 2024 • edited Loading

LilitKharatyan commented Jun 6, 2024

AngledLuffa commented Jun 16, 2024

LilitKharatyan commented Jun 17, 2024

AngledLuffa commented Jun 17, 2024

LilitKharatyan commented Jun 18, 2024

AngledLuffa commented Jun 25, 2024

AngledLuffa commented Jun 25, 2024

LilitKharatyan commented Jun 28, 2024

AngledLuffa commented Jun 28, 2024

manning commented May 28, 2024 •

edited

Loading