
New model for unsupported language (Albanian: sq) #1360

Open
rahonalab opened this issue Mar 2, 2024 · 33 comments

@rahonalab

Sorry for the double bug report.
Can you please tell me the right procedure for loading a model for a language that is not currently supported, i.e., Albanian (sq)?
I have tried the following two things:

  • I have created a full resources.json file in a new directory and loaded it, telling stanza not to download a new resources file:
    pipeline = stanza.Pipeline("sq", dir="DIR_TO_THE_MODEL", download_method=None)
    It doesn't work:

2024-03-02 15:25:18 WARNING: Unsupported language: sq.
Traceback (most recent call last):
  File "/tools/ud-stanza-other.py", line 149, in <module>
    main()
  File "/tools/ud-stanza-other.py", line 105, in main
    nlp = stanza.Pipeline(**config, logging_level="DEBUG")
  File "/usr/local/lib/python3.12/site-packages/stanza/pipeline/core.py", line 268, in __init__
    logger.info(f'Loading these models for language: {lang} ({lang_name}):\n{load_table}')
UnboundLocalError: cannot access local variable 'lang_name' where it is not associated with a value

  • I have initialized a custom config and passed it to the pipeline:

    config = {
        # Language code for the language to build the Pipeline in
        'lang': 'sq',
        # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
        # You only need model paths if you have a specific model outside of stanza_resources
        'tokenize_model_path': '/corpus/models/stanza/sq/tokenize/sq_nel_tokenizer.pt',
        'pos_model_path': '/corpus/models/stanza/sq/pos/sq_nel_tagger.pt',
        'lemma_model_path': '/corpus/models/stanza/sq/lemma/sq_nel_lemmatizer.pt',
        'depparse_model_path': '/corpus/models/stanza/sq/depparse/sq_nel_parser.pt',
        'pos_pretrain_path': '/corpus/models/stanza/sq/pretrain/sq_fasttext.pretrain.pt',
        'depparse_pretrain_path': '/corpus/models/stanza/sq/pretrain/sq_fasttext.pretrain.pt',
    }
    But, again, it doesn't work:

2024-03-02 16:00:25 WARNING: Unsupported language: sq.
Traceback (most recent call last):
  File "/tools/ud-stanza-other.py", line 149, in <module>
    main()
  File "/tools/ud-stanza-other.py", line 105, in main
    nlp = stanza.Pipeline(**config, logging_level="DEBUG")
  File "/usr/local/lib/python3.12/site-packages/stanza/pipeline/core.py", line 268, in __init__
    logger.info(f'Loading these models for language: {lang} ({lang_name}):\n{load_table}')
UnboundLocalError: cannot access local variable 'lang_name' where it is not associated with a value

As a workaround, I have used the code of a supported language, but that's not ideal, as it might load other models...

Thanks!

@rahonalab rahonalab added the bug label Mar 2, 2024
@AngledLuffa
Collaborator

Random request: this is really hard to read. Please check the formatting on the stack traces next time.

@AngledLuffa
Collaborator

Try adding allow_unknown_language=True to the Pipeline construction:

pipeline = stanza.Pipeline("sq", dir="DIR_TO_THE_MODEL", download_method=None, allow_unknown_language=True)
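For a fully custom model directory, the same flag can also be carried in a config dict alongside the other options. A minimal sketch (the directory name is the placeholder from the report above; nothing here downloads or loads actual models):

```python
# Sketch of a Pipeline config for a language stanza has no official models for.
# 'allow_unknown_language' and 'download_method' are the options discussed in
# this thread; 'DIR_TO_THE_MODEL' is a placeholder for a local model directory.
config = {
    'lang': 'sq',
    'dir': 'DIR_TO_THE_MODEL',        # local stanza_resources-style directory
    'download_method': None,          # do not try to fetch a resources file
    'allow_unknown_language': True,   # skip the known-language check
}
# nlp = stanza.Pipeline(**config)    # requires the trained models to exist on disk
print(sorted(config))
```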

AngledLuffa added a commit that referenced this issue Mar 2, 2024
… a note on how to develop a new language's Pipeline. Related to #1360
@rahonalab
Author

Many thanks @AngledLuffa, now it works. And sorry about the awful formatting :(

@rahonalab
Author

rahonalab commented Mar 6, 2024

Unfortunately, the new option and/or the dev branch don't seem to work. If I load models using the config dictionary, I get the following:

2024-03-06 10:26:43 INFO: Using device: cuda
2024-03-06 10:26:43 INFO: Loading: tokenize
2024-03-06 10:26:43 DEBUG: With settings:
2024-03-06 10:26:43 DEBUG: {'model_path': '/corpus/saved_models/tokenize/sq_nel_tokenizer.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:00 DEBUG: Building Adam with lr=0.002000, betas=(0.9, 0.9), eps=0.000000, weight_decay=0.0
2024-03-06 10:27:01 INFO: Loading: mwt
2024-03-06 10:27:01 DEBUG: With settings:
2024-03-06 10:27:01 DEBUG: {'model_path': '/corpus/saved_models/mwt/sq_nel_mwt_expander.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:01 DEBUG: Building an attentional Seq2Seq model...
2024-03-06 10:27:01 DEBUG: Using a Bi-LSTM encoder
2024-03-06 10:27:01 DEBUG: Using soft attention for LSTM.
2024-03-06 10:27:01 DEBUG: Finetune all embeddings.
2024-03-06 10:27:01 DEBUG: Building Adam with lr=0.001000, betas=(0.9, 0.999), eps=0.000000
2024-03-06 10:27:01 INFO: Loading: pos
2024-03-06 10:27:01 DEBUG: With settings:
2024-03-06 10:27:01 DEBUG: {'model_path': '/corpus/saved_models/pos/sq_nel_nocharlm_tagger.pt', 'pretrain_path': '/corpus/saved_models/pretrain/fasttextwiki.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:01 DEBUG: Loading pretrain /corpus/saved_models/pretrain/fasttextwiki.pt
2024-03-06 10:27:02 DEBUG: Loaded pretrain from /corpus/saved_models/pretrain/fasttextwiki.pt
2024-03-06 10:27:03 DEBUG: Building Adam with lr=0.003000, betas=(0.9, 0.95), eps=0.000001
2024-03-06 10:27:03 INFO: Loading: lemma
2024-03-06 10:27:03 DEBUG: With settings:
2024-03-06 10:27:03 DEBUG: {'model_path': '/corpus/saved_models/lemma/sq_nel_nocharlm_lemmatizer.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:03 DEBUG: Building an attentional Seq2Seq model...
2024-03-06 10:27:03 DEBUG: Using a Bi-LSTM encoder
2024-03-06 10:27:03 DEBUG: Using soft attention for LSTM.
2024-03-06 10:27:03 DEBUG: Using POS in encoder
2024-03-06 10:27:03 DEBUG: Finetune all embeddings.
2024-03-06 10:27:03 DEBUG: Running seq2seq lemmatizer with edit classifier...
2024-03-06 10:27:03 DEBUG: Building Adam with lr=0.001000, betas=(0.9, 0.999), eps=0.000000
2024-03-06 10:27:03 INFO: Loading: depparse
2024-03-06 10:27:03 DEBUG: With settings:
2024-03-06 10:27:03 DEBUG: {'model_path': '/corpus/saved_models/depparse/sq_nel_nocharlm_parser_checkpoint.pt', 'pretrain_path': '/corpus/saved_models/pretrain/fasttextwiki.pt', 'lang': 'mine', 'mode': 'predict'}
2024-03-06 10:27:03 DEBUG: Reusing pretrain /corpus/saved_models/pretrain/fasttextwiki.pt
2024-03-06 10:27:04 DEBUG: Building Adam with lr=0.003000, betas=(0.9, 0.95), eps=0.000001
2024-03-06 10:27:05 INFO: Done loading processors!
Reading: /corpus/texts/100Years_Albanian.txt
Starting parser...
endminiciep+ string found
Parsing miniciep+
2024-03-06 10:27:19 DEBUG: 6 batches created.
2024-03-06 10:27:22 DEBUG: 450 batches created.
2024-03-06 10:27:22 DEBUG: 127 batches created.
Traceback (most recent call last):
  File "/tools/ud-stanza-ciep.py", line 119, in <module>
    main()
  File "/tools/ud-stanza-ciep.py", line 114, in main
    parseciep(nlp, file_content, filename, args.target, args.miniciep)
  File "/tools/parsing/stanza_parser.py", line 80, in parseciep
    miniciep = nlp(preparetext(splitciep[0]))
  File "/opt/conda/lib/python3.10/site-packages/stanza/pipeline/core.py", line 480, in __call__
    return self.process(doc, processors)
  File "/opt/conda/lib/python3.10/site-packages/stanza/pipeline/core.py", line 431, in process
    doc = process(doc)
  File "/opt/conda/lib/python3.10/site-packages/stanza/pipeline/depparse_processor.py", line 57, in process
    raise ValueError("POS not run before depparse!")
ValueError: POS not run before depparse!

But the pos processor is actually loaded!
Bonus question: what is the difference between

depparse

│   ├── sq_nel_nocharlm_parser_checkpoint.pt
│   └── sq_nel_nocharlm_parser.pt

@AngledLuffa
Collaborator

AngledLuffa commented Mar 6, 2024

The _checkpoint files include the optimizer and its most recent state, even if the dev scores of the latest model didn't go up and therefore the main save file wasn't updated. You'll notice that the non-checkpoint file is much smaller than the checkpoint file... that's the optimizer. The checkpoint lets you restart a training run that was interrupted in the middle, although if it was interrupted while saving the checkpoint file, you're probably out of luck (something we should address).

I can see that you're loading the POS model before the depparse. Sanity check first: is the POS model labeling either upos or xpos? If somehow it was trained to only label the features, I could see it throwing this kind of error. Otherwise, from the code it really looks like this particular error should only happen if both upos and xpos are missing for a word.

        if any(word.upos is None and word.xpos is None for sentence in document.sentences for word in sentence.words):
            raise ValueError("POS not run before depparse!")

If the POS model should be working, what happens if you run the pipeline without the depparse and print out the results? Are there any sentences for which the POS is actually missing?

I wonder if that can happen if the POS model has blank tags in the dataset it's learning from

@rahonalab
Author

rahonalab commented Mar 6, 2024

Many thanks for the detailed answer! This is really strange: I have tried to load the pipeline as I do in the script and it worked correctly on a few sentences. I have also tried to pass the script a small txt file with some sentences and it worked too.
But then I try to work on these txt files, as I did in the past, and it throws the error. I assume there's something in these sentences, like an unknown word, that triggers the error; how can I circumvent it? The model I am using is highly experimental, so I expect that it misses a lot of things. But, again, this is strange: I have trained models on very small data in the past and they worked correctly on this dataset I am trying to parse.

@AngledLuffa
Collaborator

The model I am using is highly experimental, so I expect that it misses a lot of things

If it "misses" things to be incorrect, that's one thing. But I do very much wonder why it would label anything None.

Are you able to send the data + the data you are trying to test on, or maybe just send the model and the test data? I'd really like to see it in action myself to debug this issue.

Another possible debugging step would be to examine the output of just the tokenizer and the POS w/o any of the subsequent models and check for any words which are missing both xpos and upos.
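That check can be scripted over the output of a tokenize,pos-only pipeline. A minimal sketch, where find_untagged is our own helper (not a stanza API); with a real stanza Document you would feed it [(w.text, w.upos, w.xpos) for s in doc.sentences for w in s.words]:

```python
def find_untagged(words):
    """Return (index, text) for every word missing both UPOS and XPOS.

    `words` is any iterable of (text, upos, xpos) triples, e.g. built from a
    stanza Document produced by a pipeline that ran only tokenize and pos.
    """
    return [(i, text)
            for i, (text, upos, xpos) in enumerate(words)
            if upos is None and xpos is None]

# Toy example mimicking the bug in this thread: "the" got neither tag.
words = [("the", None, None), ("car", "NOUN", "NN")]
print(find_untagged(words))   # -> [(0, 'the')]
```

Any hit from this helper is a word that would trigger the "POS not run before depparse!" error.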

@AngledLuffa
Collaborator

Fascinating. I ran an experiment on English with DET/DT replaced with blanks. Apparently, giving the tagger empty tags for the POS tag results in it labeling words with None as tags. This must be what's happening to you - there are entries in your training data which don't have either UPOS or XPOS.

Is this something you want to fix on your end?

Maybe the tagger is supposed to ignore those items, or learn to tag them with _... not sure which would be more productive

@AngledLuffa
Collaborator

... to be more precise, it IS learning to tag words w/o tags with _, and then the pipeline itself treats that the same as a blank tag.

@rahonalab
Author

Fascinating. I ran an experiment on English with DET/DT replaced with blanks. Apparently, giving the tagger empty tags for the POS tag results in it labeling words with None as tags. This must be what's happening to you - there are entries in your training data which don't have either UPOS or XPOS.

Is this something you want to fix on your end?

The thing is, I have already used these data to train a model two or three times last November and it worked fine. I have just added a few sentences for teaching the parser to recognize MWTs like Albanian ta = të + e.
I'll try to run the parser without depparse and let you know...

@AngledLuffa
Collaborator

It will successfully train a tagger even if there are empty tags. However, it has learned to recognize some words as having the empty tag, and that's the label the tagger gives those words. Did I express that clearly? I did the following experiment. Instead of sentences such as this one in English, where "the" gets the tags DET and DT:

22      which   which   PRON    WDT     PronType=Rel    26      obj     20:ref  _
23      they    they    PRON    PRP     Case=Nom|Number=Plur|Person=3|PronType=Prs      26      nsubj   26:nsubj        _
24      should  should  AUX     MD      VerbForm=Fin    26      aux     26:aux  _
25      have    have    AUX     VB      VerbForm=Inf    26      aux     26:aux  _
26      left    leave   VERB    VBN     Tense=Past|VerbForm=Part        20      acl:relcl       20:acl:relcl    _
27      in      in      ADP     IN      _       29      case    29:case _
28      the     the     DET       DT       Definite=Def|PronType=Art       29      det     29:det  _
29      car     car     NOUN    NN      Number=Sing     26      obl     26:obl:in       SpaceAfter=No

I changed the tags on all instances of "the" to _, so:

22      which   which   PRON    WDT     PronType=Rel    26      obj     20:ref  _
23      they    they    PRON    PRP     Case=Nom|Number=Plur|Person=3|PronType=Prs      26      nsubj   26:nsubj        _
24      should  should  AUX     MD      VerbForm=Fin    26      aux     26:aux  _
25      have    have    AUX     VB      VerbForm=Inf    26      aux     26:aux  _
26      left    leave   VERB    VBN     Tense=Past|VerbForm=Part        20      acl:relcl       20:acl:relcl    _
27      in      in      ADP     IN      _       29      case    29:case _
28      the     the     _       _       Definite=Def|PronType=Art       29      det     29:det  _
29      car     car     NOUN    NN      Number=Sing     26      obl     26:obl:in       SpaceAfter=No

Now the tagger I trained labels the with blank tags, which would trigger this error in the dependency parser, since it isn't expecting to receive blank tags.

I think it might make more sense to either throw an error when training a tagger on a partially complete file, or possibly treat single blank tags as masked out. Learning to recognize the blank tag doesn't seem very useful...

In the meantime, if you find and eliminate those blank tags from your dataset, I believe this error will go away.
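A quick way to hunt for those blank tags in a CoNLL-U training file; a sketch (blank_tag_lines is our own helper, not part of stanza), assuming the standard 10-column CoNLL-U layout where UPOS and XPOS are columns 4 and 5:

```python
def blank_tag_lines(conllu_text):
    """Yield (line_number, form) for token lines where both UPOS and XPOS are '_'."""
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        if not line or line.startswith('#'):
            continue                      # skip comments and blank lines
        cols = line.split('\t')
        if len(cols) != 10 or not cols[0].isdigit():
            continue                      # skip MWT ranges (7-8) and empty nodes (1.1)
        if cols[3] == '_' and cols[4] == '_':
            yield lineno, cols[1]

# Toy fragment modeled on the modified English example above.
sample = (
    "# text = the car\n"
    "1\tthe\tthe\t_\t_\tDefinite=Def\t2\tdet\t_\t_\n"
    "2\tcar\tcar\tNOUN\tNN\tNumber=Sing\t0\troot\t_\t_\n"
)
print(list(blank_tag_lines(sample)))   # -> [(2, 'the')]
```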

@rahonalab
Author

OK, I have successfully parsed a file with just the POS tagging. Indeed, there are some tokens without UPOS. Actually, just one, i.e., the stupid " punctuation 🔝
I have the same error in the training data; I'll correct it and the error will likely go away.
Many thanks again. May I comment that it is probably overkill to stop an entire parse for a blank UPOS? 🙌

@AngledLuffa
Collaborator

Many thanks again. May I comment that it is probably overkill to stop an entire parse for a blank UPOS? 🙌

Indeed. I just need to figure out what the right approach is. The two leading candidates in my mind are to stop the tagger from training if there are blank UPOS, so as to give the user a chance to go back and fix the issue, or to treat the blanks as unlabeled tokens in the tagger which don't get a label of any kind.

The second one is more appealing to me ideologically, but the problem is that in a case similar to yours where maybe all the punctuation was unlabeled, then they would all get tagged with the most likely known tag at test time (perhaps NOUN, for example).

If you have an alternate suggestion, happy to hear it.

@rahonalab
Author

I have corrected the dataset, retrained the model and now the parser works fine.
You might insert something in the dataset preparation process, telling the user that they are training a model on 'wrong' data...

AngledLuffa added a commit that referenced this issue Mar 7, 2024
…ly labeled training data was causing problems when training a non-UD dataset #1360
@AngledLuffa
Collaborator

AngledLuffa commented Apr 20, 2024

This error message is now part of the 1.8.2 release. Is there anything else you need addressed?

@rahonalab
Author

Great! Thank you, everything looks good!

Jemoka pushed a commit that referenced this issue Jul 16, 2024
… a note on how to develop a new language's Pipeline. Related to #1360
Jemoka pushed a commit that referenced this issue Jul 16, 2024
…ly labeled training data was causing problems when training a non-UD dataset #1360
@AngledLuffa
Collaborator

@rahonalab I'm wondering - there is only a very small Albanian UD dataset on universaldependencies.org, and I don't see any planned Albanian expansions. Can I ask what dataset you used for this? If there is any publicly available data (larger than the UD dataset) we could add this language as a standard language to Stanza.

@rahonalab
Author

Hello! I have used two datasets which we plan to release as UD treebanks soon. I'll keep you posted.

@AngledLuffa
Collaborator

That would be excellent! Looking forward to it.

@rahonalab
Author

Hello, the first of the two datasets for Albanian has been released in UD 2.15:

https://github.com/UniversalDependencies/UD_Albanian-STAF

It's a bit tiny (200 sentences, 3.3K tokens), but I hope it can already serve for training a model.

I am not responsible for the second dataset, but here's a paper describing it: https://aclanthology.org/2024.clib-1.7.pdf

@AngledLuffa
Collaborator

Thanks for the heads up! Do you know if the second treebank you cited will also be part of UD? If you don't know, I can contact the authors.

Furthermore, do you have any thoughts on the interoperability of the two treebanks? Are we able to just add the training data from the two of them together, or will there be significant differences in the annotation schemes? My guess would be they will be interoperable, since from reading your work it appears you have used models trained from their dataset to bootstrap the annotation of your dataset. Such a situation would be ideal, as we can easily combine the two datasets in that case.

@rahonalab
Author

Hello! Yes, they have plans for a UD release.
No, unfortunately the treebank described in the paper is not interoperable with STAF.
As they describe in the article, they take several annotation choices which are not UD-compliant for UPOS, morphological features and dependency relations.
It is probably possible to automatically 'correct' their dataset to the UD guidelines, but for STAF I decided to work manually on each sentence. By contrast, STAF should be, to a certain extent, interoperable with the other UD treebank, TSA: I have just added several other annotations for morphological features and syntactic relations - see https://universaldependencies.org/sq/index.html for further information.

@AngledLuffa
Collaborator

Excellent, thanks for the heads up. It's possible to train the tagger on UPOS and XPOS from both treebanks, but just the features from your treebank, so that's what I'll do for Albanian. If'n the other treebank gets added, I'll add that to the mix as well.

Incidentally, you might very well be able to update their treebank with your newer feature scheme by starting with such a tagger, silver tagging their dataset with your feature versions, and then hand correcting them. 60 sentences might not be too many sentences for such a project, and often other treebank maintainers are happy to get improved annotation schemes.

What do you mean by syntactic relations - do you mean there are dependency types which appear in your treebank but don't appear in the other treebank? That would be harder to make use of with our current model, although perhaps we could make a version of the dependency parser which has the same input layers but two prediction heads, and therefore can train the bottom and middle layers so it is possible to learn from different dependency annotation schemes.

Do you have a recommendation for word vectors to use for these models? Fasttext has word vectors: https://fasttext.cc/docs/en/crawl-vectors.html but frequently I have found that a dedicated project to building embeddings will produce something that performs better on downstream tasks than those embeddings.

Also, if you can think of NER, sentiment, coref, or (doubtful) constituency datasets for Albanian, we can build models for that as well.

@rahonalab
Author

Thank you! I have used fasttext for training Albanian models.

@AngledLuffa
Collaborator

I find that the fasttext vectors are better than random initialization, although not by a huge amount.

Would you clarify what you mean by the syntactic relations are different - is it that the dependency trees have different dependency types? In that case, the trees probably shouldn't mix together, right?

@rahonalab
Author

rahonalab commented Nov 17, 2024

Yes, STAF and TSA differ in a few dependency types. The new sub-dependency types introduced in STAF are listed in the documentation, while the difference between the two treebanks is in the treatment of the clitic pronouns. In Albanian, indirect object and, to a lesser extent, direct object are marked twice: on the nominal argument and on a co-referring pronoun. STAF annotates both for obj/iobj, while TSA annotates the former for obj/iobj and the latter for expl - see 'Clitic Doubling' in TSA's paper.
However, STAF has basically been validated on the basis of TSA, in the sense that the UD validation script follows TSA plus some morphological features and new sub-dependency types, so they should be interoperable to some extent.

@AngledLuffa
Collaborator

Ultimately our depparse doesn't have any capacity to learn from two different labeling schemes for trees. We can add it (as we did for the POS and NER tags) but in the meantime I'll make tokenize, mwt, lemma, and pos from both treebanks and just make the depparse from STAF

@AngledLuffa
Collaborator

How about the MWT? I notice that the STAF dataset has MWT labeled, whereas TSA does not. I think some of the tokens are the same across datasets, though. For example:

STAF, MWT label on t'i:

# text = - Kanë filluar, kanë filluar t'i rehabilitojnë, - qeshi dhe ai.
# sent_id = STAF__21
7-8     t'i     _       _       _       _       _       _       _       end_char=2108|start_char=2105
7       të      të      PART    _       _       9       mark    _       _
8       i       ai      PRON    _       Case=Acc|Gender=Masc|Number=Plur|Person=3|PronType=Prs  9       obj     _       _

TSA, no MWT label on t'i:

# text = Ky lloj i familjes është i përhapur në vendet ku femrat kanë burime për t'i rritur fëmijët.
14      për     për     ADP     _       _       17      mark    _       _
15      t'      t'      PART    _       _       17      mark    _       SpaceAfter=No
16      i       i       PRON    _       Case=Acc|Gender=Masc|Number=Plur|PronType=Emp   17      expl    _       _
17      rritur  rris    VERB    _       Aspect=Perf|Tense=Past|VerbForm=Inf|Voice=Act   12      advcl   _       _
18      fëmijët fëmijë  NOUN    _       Case=Acc|Definite=Ind|Gender=Fem|Number=Plur    17      obj     _       SpaceAfter=No

There's also this in TSA, although it doesn't show up in STAF:

# text = Paragjykimet s'i përshtaten me lehtësi informacionit ose përvojave të reja.
1       Paragjykimet    paragjykim      NOUN    _       Case=Nom|Definite=Def|Gender=Masc|Number=Plur   4       nsubj   _       _
2       s'      s'      PART    _       Polarity=Neg    4       advmod  _       SpaceAfter=No
3       i       i       PRON    _       Case=Dat|Gender=Masc|Number=Sing        4       expl    _       _

@rahonalab
Author

Yes, there is also support for MWTs in the STAF treebank. I was able to train the MWT processor on the unreleased data and it worked pretty well.

@AngledLuffa
Collaborator

That's good to hear - my concern here is that the TSA treebank doesn't have MWT, and apparently does have tokens which could have been labeled MWT. Therefore it probably shouldn't be combined with STAF for training MWT

... although I wonder if it'd be a simple and useful improvement to just add them. For example, the couple I linked above look like obvious candidates. Would you like to file such an issue? I can also look into figuring out what I can based on examples like the above, where s'i and t'i are separate words but not marked as an MWT

@rahonalab
Author

Do you want me to open an issue on TSA? I can, but I am not sure that the treebank is still maintained…

@AngledLuffa
Collaborator

I did try, but I'm not sure we'll hear back any time soon either. Still I believe if the treebank is truly unmaintained, it's possible for someone else to offer to help with it - otherwise eventually the treebank will fall behind some updated requirement and not be published any more.

As it stands now it's a bit less clear how much we can use of that data - without MWT, the tokenization also can't be combined or it will be learning not to mark things as MWT. So that leaves pretty much the lemmas and the UPOS, which is at least something, I suppose.

One thing that would help a lot, if you have the time, would be to look for other MWTs in that treebank aside from s'i and t'i. We can make a branch with those changes as a PR, and I can use that branch to build the models while waiting to hear back from the original maintainers.

@AngledLuffa
Collaborator

I took a stab at the MWT change:

UniversalDependencies/UD_Albanian-TSA#8

I looked for all the ones in STAF, and all the words with SpaceAfter=no, and this is what I came up with. Would you verify if I've made the changes correctly (including one possible lemma change)?
