Ukrainian language support in Flair #2985

Open
5 tasks done
alanakbik opened this issue Nov 8, 2022 · 7 comments

@alanakbik
Collaborator

alanakbik commented Nov 8, 2022

This issue tracks the progress of adding support for the Ukrainian language from lang-uk to Flair. We would like to add:

  • Ukrainian Flair embeddings trained by @dchaplinsky and available here: forward and backward. Should be made loadable with embeddings = FlairEmbeddings('uk-forward') and embeddings = FlairEmbeddings('uk-backward')
  • Ukrainian NER by @dchaplinsky, available here. Should be made loadable with tagger = SequenceTagger.load('ner-ukrainian')
  • Ukrainian part-of-speech tagger by @dchaplinsky, available here. Should be made loadable with tagger = SequenceTagger.load('pos-ukrainian')
  • Ukrainian NER dataset described here. Loadable as corpus = NER_UKRAINIAN(). Should be integrated only once version 2.0 is complete.
  • Ukrainian Universal Dependency Treebank, loadable as corpus = UD_UKRAINIAN().
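A minimal sketch of how the identifiers proposed in the list above might be used once they are registered. The names 'uk-forward', 'uk-backward', 'ner-ukrainian' and 'pos-ukrainian' are taken from this issue; whether they resolve depends on the installed Flair release, so treat this as an assumption rather than a confirmed API. The helper names `tag_ukrainian` and `stacked_ukrainian_embeddings` are made up for illustration.

```python
# Identifiers proposed in this issue; they may not resolve in every Flair release.
PROPOSED_EMBEDDINGS = ("uk-forward", "uk-backward")
PROPOSED_TAGGERS = {"ner": "ner-ukrainian", "pos": "pos-ukrainian"}


def tag_ukrainian(text: str, task: str = "ner"):
    """Tag `text` with the proposed Ukrainian tagger for `task` ('ner' or 'pos')."""
    # Imported lazily so the constants above can be inspected without Flair installed.
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load(PROPOSED_TAGGERS[task])
    sentence = Sentence(text)
    tagger.predict(sentence)  # annotates the sentence in place
    return sentence


def stacked_ukrainian_embeddings():
    """Combine the proposed forward and backward embeddings into one embedding."""
    from flair.embeddings import FlairEmbeddings, StackedEmbeddings

    return StackedEmbeddings([FlairEmbeddings(name) for name in PROPOSED_EMBEDDINGS])
```

Calling `tag_ukrainian("...", task="pos")` would download the model on first use, so it needs network access and a Flair version that knows the identifier.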
@dchaplinsky

This is the code for the NER corpus I've used:
https://github.com/lang-uk/flair-ner/blob/main/train_base.py#L32

and the code for the POS corpus:
https://github.com/lang-uk/flair-pos/blob/main/train_grid.py#L21

I'll check whether I have a fixed split for NER hosted somewhere else.

@stefan-it
Member

stefan-it commented Dec 6, 2022

Really cool idea!

I had to do a lot of manual preprocessing steps to get NER working when evaluating the ELECTRA model:

https://github.com/stefan-it/ukrainian-electra/blob/main/download_prepare_data_ner.sh

@dchaplinsky

Oh, @stefan-it, thanks for reminding me. I totally forgot about the fixed split.

On a separate topic: would you like to try training ELECTRA on better-quality Ukrainian texts?

@stefan-it
Member

Hey @dchaplinsky, I currently have access to TPUs, so if you have texts available I would love to pretrain another model 🤗

@dchaplinsky

Yes I do! Could you contact me at chaplinsky[dot]dmitry on gmail?

@dchaplinsky

Hi @alanakbik and @stefan-it

I've just uploaded two bigger models for the Ukrainian language:
https://huggingface.co/lang-uk/flair-uk-forward-large
https://huggingface.co/lang-uk/flair-uk-backward-large

They have hidden_size=2048 (in contrast to 1024 for the original ones) and were trained on my data plus data from Stefan (54 GB in total).

I've also trained a downstream NER model on them and got a nice 1.5% improvement over the previous one. I will publish it shortly.
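For anyone who wants to try the two larger models before they get a short name in Flair, here is a rough sketch of loading them straight from the Hugging Face Hub. The repository ids are the ones linked above. The checkpoint filename is a hypothetical placeholder (the thread does not state it), so check each repository's file list and pass the real name. The helper `load_large_embeddings` is made up for illustration.

```python
# Repository ids as linked in this comment.
LARGE_MODEL_REPOS = {
    "forward": "lang-uk/flair-uk-forward-large",
    "backward": "lang-uk/flair-uk-backward-large",
}

# Hypothetical placeholder: replace with the actual checkpoint name in each repo.
CHECKPOINT_FILENAME = "best-lm.pt"


def load_large_embeddings(direction: str, filename: str = CHECKPOINT_FILENAME):
    """Download one of the large language models and wrap it as FlairEmbeddings."""
    # Imported lazily so the constants can be inspected without these packages installed.
    from flair.embeddings import FlairEmbeddings
    from huggingface_hub import hf_hub_download

    # Fetch the checkpoint into the local Hugging Face cache and get its path.
    local_path = hf_hub_download(repo_id=LARGE_MODEL_REPOS[direction], filename=filename)
    # FlairEmbeddings accepts a path to a local language-model checkpoint.
    return FlairEmbeddings(local_path)
```

With both directions loaded, they can be combined with `StackedEmbeddings` in the same way as the smaller models.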

@stale

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix ("This will not be worked on") label on Sep 17, 2023