Two questions about applying your amazing model to Traditional Chinese dataset #1
Comments
Hi Marcus,

You are correct. You first train the BERT_NER model using the bert_ner_trainer.py file. After this is trained, you use bert_ner_trainer_data_dist_kl.py to fine-tune those weights. With around 60k sentences, I would recommend a high batch size to make the fine-tuning as effective as possible.

Yes, exactly. The goal of the model is to be able to leverage unlabelled data to improve accuracy. If you can get this "raw dataset" into the same format as your labelled examples, then hopefully this approach will be successful for you.

Let me know if you have any more questions!

Thanks,
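For anyone following along, a rough sketch of that two-step workflow might look like the following. This is not the repo's own launcher: the trainer scripts' actual entry points and any CLI arguments are not shown in this thread, so the invocations below are placeholders.

```python
# Hypothetical two-step run; the trainer scripts' real entry points / CLI
# arguments are not shown in this thread, so these calls are placeholders.
import subprocess

# Step 1: train the supervised BERT_NER baseline. It should end up saving
# weights under the name later referenced as model_start_weights_filename
# (e.g. 'BERT_NER_final').
subprocess.run(["python", "bert_ner_trainer.py"], check=True)

# Step 2: fine-tune those weights with the data-distribution KL trainer,
# which reads its settings from bert_ner_data_dist_kl_config.yaml
# (a high batch size is recommended for a ~60k-sentence dataset).
subprocess.run(["python", "bert_ner_trainer_data_dist_kl.py"], check=True)
```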
Hi Adam,

Thanks for your prompt response. Just one follow-up question. If my approach is:

Do you think my approach is correct? Thanks a lot.

Marcus
Hi Marcus,

Yes, this is the correct approach. Please note that in the bert_ner_trainer_data_dist_kl.py file, you will also pass the 40% of your raw dataset as the "labelled_ds" argument and the remaining 60% as the "unlabelled_ds" argument. Also, do not forget that you will need to estimate the true probability distribution of your labels and place that in the "bert_ner_data_dist_kl_config.yaml" config file.

Good luck!
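As an illustration of that comment, here is a sketch (not the repo's code; the `(tokens, labels)` representation and variable names are assumptions) of the 40/60 split and the label-distribution estimate that would go into the config file:

```python
import random
from collections import Counter

# Toy stand-in for the ~60k labelled sentences: (tokens, labels) pairs.
sentences = [
    (["張", "三", "在", "台", "北"], ["B-PER", "E-PER", "O", "B-LOC", "E-LOC"]),
    (["某", "公", "司"], ["B-ORG", "M-ORG", "E-ORG"]),
] * 100

random.seed(0)
random.shuffle(sentences)
split = int(0.4 * len(sentences))
labelled_ds, unlabelled_ds = sentences[:split], sentences[split:]  # 40% / 60%

# Estimate the label distribution from the labelled portion; these
# probabilities are what would be placed in bert_ner_data_dist_kl_config.yaml.
counts = Counter(tag for _, tags in labelled_ds for tag in tags)
total = sum(counts.values())
label_dist = {tag: n / total for tag, n in sorted(counts.items())}
print(label_dist)
```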
Hi Adam, sure, thanks a lot. Will do so in the middle of this week.

Thanks,
Marcus
Hi Adam,

I would like to share the progress of my work on your amazing library. I am training the BERT_NER model using the bert_ner_trainer.py file with 'https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/2' on Google Colab.

My dataset looks something like this: I have about 22 NER tags, including:
org - firm/organization

Configs:
word_id_field: 'word_id'
EPOCHS: 20

Let's see the result once it finishes.

Marcus
Also, I use the 'BMES' label format instead of the 'BIO' format.
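For readers unfamiliar with the difference, here is a tiny side-by-side of the two tagging schemes on a made-up six-character organisation name (the tokens and tag names are only illustrative):

```python
# BIO vs BMES on the same entity span (example tokens/tags are made up).
tokens    = ["台", "灣", "積", "體", "電", "路"]
bio_tags  = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG"]
bmes_tags = ["B-ORG", "M-ORG", "M-ORG", "M-ORG", "M-ORG", "E-ORG"]

for tok, bio, bmes in zip(tokens, bio_tags, bmes_tags):
    print(f"{tok}\tBIO={bio}\tBMES={bmes}")
```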
Hi Marcus,

Your loss seems to be NaN from the beginning, which implies the input to the model is wrong. Have you made sure that the tokenizer which generates the word ids is compatible with Chinese? (This is the most likely issue.) I suspect you may also need to pull a different version of the BERT layer from TF Hub in both the preprocessor and the model. After that, I would recommend checking the batches being passed to the model and ensuring they seem sensible.

Thanks,
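To make that advice concrete, here is a quick sanity check one could run on a single batch before training. This is a sketch only; the array shapes, the padding convention, and the example ids are assumptions, not taken from the repo.

```python
import numpy as np

def check_batch(word_ids, labels, vocab_size, num_tags):
    """word_ids, labels: integer arrays of shape (batch, seq_len)."""
    word_ids, labels = np.asarray(word_ids), np.asarray(labels)
    assert word_ids.min() >= 0 and word_ids.max() < vocab_size, \
        "word ids outside the tokenizer's vocab - wrong tokenizer/BERT layer?"
    assert labels.min() >= 0 and labels.max() < num_tags, \
        "label ids outside the tag set (your padding convention may differ)"
    # Chinese text run through an English-only vocab tends to collapse to
    # [UNK]; flag batches where a single id dominates.
    top_fraction = np.bincount(word_ids.ravel()).max() / word_ids.size
    if top_fraction > 0.5:
        print("Warning: >50% of word ids are identical - possible [UNK] flood")

# Toy call: 21128 is the Chinese BERT vocab size, 22 matches the tag count
# mentioned above, and the ids themselves are placeholders.
check_batch(word_ids=[[101, 2769, 1762, 102]], labels=[[0, 3, 0, 0]],
            vocab_size=21128, num_tags=22)
```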
OK, I may try the Chinese BERT base model from TensorFlow Hub first. Let's see if it makes a difference.
Thanks for your amazing work. I appreciate this model very much.
I have a Chinese dataset of ~60k sentences. The NER labels are done, but 50% of them need further polishing, as the manual NER labelling introduces some noise.
Based on your instructions in "bert_ner_data_dist_kl_config.yaml":
model_start_weights_filename: 'BERT_NER_final'
So, in order to train a bert_ner_data_dist_kl model, we have to train a BERT_NER baseline model first, and then leverage the result of that baseline model to train the bert_ner_data_dist_kl model.
Is my understanding correct?
Additionally, the purpose of your semi-supervised-bert-ner is to tackle the issue of limited labelled NER data for training a NER model, right?
In my case, my firm offers a huge "raw dataset", but labelled data is limited.
Thanks a lot.
Marcus