
Two questions about applying your amazing model to Traditional Chinese dataset #1

marcusau opened this issue Jul 10, 2020 · 9 comments

@marcusau

Thanks for your amazing work. I appreciate this model very much.

I have a Chinese dataset of ~60k sentences. The NER labels are done, but about 50% of them still need further polishing, since the manual NER labelling introduced some noise.

Based on your instructions in "bert_ner_data_dist_kl_config.yaml":

model_start_weights_filename: 'BERT_NER_final'

So, in order to train a bert_ner_data_dist_kl model, we have to train a bert_ner_baseline model first, and then use the weights from that baseline model to train the bert_ner_data_dist_kl model.

Is my understanding correct?

Additionally, the purpose of your semi-supervised-bert-ner is to tackle the issue of having only limited labelled NER data for training an NER model, right?

In my case, my firm provides a huge "raw" dataset, but the labelled data is limited.

Thanks a lot.

Marcus

@AdamStein97
Owner

Hi Marcus,

You are correct. You first train the BERT_NER model using the bert_ner_trainer.py file. Once that is trained, you use bert_ner_trainer_data_dist_kl.py to fine-tune those weights. With around 60k sentences, I would recommend a high batch size to make the fine-tuning as effective as possible.

Yes, exactly. The goal of the model is to be able to leverage unlabelled data to improve accuracy. If you can get this "raw dataset" into the same format as your labelled examples, then hopefully this approach will be successful for you.

Let me know if you have any more questions!

Thanks,
Adam
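
For anyone following along, here is a minimal sketch of the data-distribution KL idea as it can be inferred from the config names in this thread: on unlabelled batches, the batch-averaged predicted tag distribution is pulled towards an estimated prior over tags. The function name and signature below are illustrative only, not taken from the repo.

```python
import tensorflow as tf

def data_dist_kl_loss(logits, tag_prior, mask):
    """Hypothetical KL term: batch-averaged predicted tag distribution vs. a prior.

    logits:    [batch, seq_len, num_tags] raw model outputs
    tag_prior: [num_tags] estimated true probability of each tag
    mask:      [batch, seq_len] with 1 for real tokens, 0 for padding
    """
    probs = tf.nn.softmax(logits, axis=-1)
    mask = tf.cast(mask, probs.dtype)[..., tf.newaxis]
    # Average per-token tag probabilities over all non-padding tokens in the batch
    avg_probs = tf.reduce_sum(probs * mask, axis=[0, 1]) / tf.reduce_sum(mask)
    prior = tf.cast(tag_prior, avg_probs.dtype)
    eps = 1e-8  # numerical stability
    return tf.reduce_sum(avg_probs * tf.math.log((avg_probs + eps) / (prior + eps)))
```

Presumably the labelled batches still use an ordinary cross-entropy loss, and a term of this kind is what lets the unlabelled data contribute during fine-tuning.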

@marcusau
Author

Hi Adam,

Thanks for your prompt response. Just one follow-up question. If my approach is:

  1. use 40% of my raw dataset (~25k sentences) to train the BERT_NER model using the bert_ner_trainer.py file
     (because I have high confidence in the quality of the NER labels on this 40% of the data), and

  2. use the full dataset (~60k sentences) with bert_ner_trainer_data_dist_kl.py to fine-tune these weights,

Do you think my approach is correct?

Thanks a lot.

Marcus

@AdamStein97
Owner

Hi Marcus,

Yes, this is the correct approach. Please note that in the bert_ner_trainer_data_dist_kl.py file, you will also pass the 40% of your raw dataset as the "labelled_ds" argument and the remaining 60% as the "unlabelled_ds" argument.

Also do not forget that you will need to estimate the true probability distribution of your labels and place that in the "bert_ner_data_dist_kl_config.yaml" config file.

Good luck!
Adam
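
A rough way to estimate that label distribution from the labelled data, assuming the preprocessed CSV layout Marcus shows later in this thread (columns sentence_id, Word, word_id, tag_id); the exact key and format that "bert_ner_data_dist_kl_config.yaml" expects may differ:

```python
import pandas as pd

# Empirical tag distribution over the labelled portion of the data.
df = pd.read_csv("preprocessed_ner_dataset.csv")
counts = df["tag_id"].value_counts().sort_index()  # token count per tag id
dist = counts / counts.sum()                       # normalise to probabilities
print(dist.round(6).tolist())  # paste into the config as the label distribution
```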

@marcusau
Author

Hi Adam,

Sure, thanks a lot.

I will do so in the middle of this week.

Thanks.

Marcus

@marcusau
Author

Hi Adam,

I would like to share the progress of my work on your amazing library.

I am training the BERT_NER model using the bert_ner_trainer.py file with 'https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/2' on Google Colab.

My dataset looks something like this:
sentence_id,Word,word_id,tag_id
0,[CLS],101,0
0,花,5709,0
0,旗,3186,0
0,發,4634,22
0,表,6134,22
0,報,1841,3
0,告,1440,3
0,指,2900,22
0,,,8024,22
0,中,704,0
0,升,1285,0
0,控,2971,0
0,股,5500,0
0,旗,3186,22
0,下,678,22
0,L,154,0
0,e,147,0
0,x,166,0
0,u,163,0
0,s,161,0
0,新,3173,22
0,款,3621,22
0,M,155,3
0,P,158,3
0,V,164,3
0,L,154,3
0,M,155,3
0,3,124,3
0,0,121,3
0,0,121,3
0,市,2356,15
0,場,1842,15
0,接,2970,22
0,受,1358,22
0,度,2428,22
0,極,3513,22
0,佳,881,22
0,,,8024,22
0,有,3300,22

I have 22 NER tags:
{"0": "ORG", "1": "LOC", "2": "FAC", "3": "PRODUCT", "4": "LANGUAGE", "5": "NORP", "6": "WORK_OF_ART", "7": "QUANTITY", "8": "PERSON", "9": "LAW", "10": "EVENT", "11": "TITLE", "12": "TIME", "13": "IDIOM", "14": "ENGLISH", "15": "J", "16": "FIN", "17": "TERM", "18": "UNIT", "19": "CONCEPT", "20": "POLICY", "21": "SLOGAN"}

ORG = firm/organization
FIN = financial instruments, e.g. stock indices, bonds, options, etc.
CONCEPT = ideas, concepts
TERM = professional terms outside the financial scope, e.g. XX accounting standards
UNIT = units, e.g. kg, lots of stock, etc.
SLOGAN = many Chinese listed companies and policies are described with slogans; this is a characteristic of the China equity market
POLICY = government policies or schemes
LAW = rules or laws
J = short names of some stocks, people, or policies


Configs:
csv_filename: 'preprocessed_ner_dataset.csv'
max_seq_length: 216
BATCH_SIZE: 128
BUFFER_SIZE: 2048
test_set_batches: 75
labelled_train_batches: 20
categories: 22

word_id_field: 'word_id'
mask_field: 'mask'
segment_id_field: 'segment_id'
tag_id_field: 'tag_id'

EPOCHS: 20
latent_dim: 32
rate: 0.0
mlp_dims: [256, 128, 64]
lr: 0.001
model_save_weights_name: 'BERT_NER'

Let's see the results once training finishes.

Marcus
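
Before training, a quick sanity check on the preprocessed CSV might save some debugging later; the column names come from the sample above, and the check itself is only a suggestion, not part of the repo. The distinct tag ids should line up with the "categories" value in the config, and the longest sentence should fit inside "max_seq_length".

```python
import pandas as pd

df = pd.read_csv("preprocessed_ner_dataset.csv")

# The set of tag ids actually present should be consistent with `categories`
# (i.e. the number of output classes the model is built with).
print("distinct tag ids:", sorted(df["tag_id"].unique()))

# No sentence should be longer than the configured max_seq_length (216 here).
lengths = df.groupby("sentence_id").size()
print("longest sentence:", int(lengths.max()), "tokens")
```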

@marcusau
Author

Also, I am using the 'BMES' label format instead of the 'BIO' format.
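
For readers unfamiliar with the two schemes: BMES splits what BIO calls "inside" into middle and end positions and marks single-character entities separately. A tiny illustration, hypothetically prefixing the ORG tag from the sample data above with positional markers (the actual tag ids in the CSV are unprefixed):

```python
# The same two-character ORG entity (花旗) under the two labelling schemes.
tokens = ["花", "旗", "發", "表"]
bio    = ["B-ORG", "I-ORG", "O", "O"]
bmes   = ["B-ORG", "E-ORG", "O", "O"]  # a one-character entity would get S-ORG
```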

@marcusau
Author

Hi Adam,

I may need your help with my first training run using bert_ner_trainer.py.

For my initial training exercise, everything is run on Google Colab, but the training result is strange and below expectations.

[screenshot of training logs: loss is NaN and validation accuracy plateaus around 10%]

I don't know what mistake I have made with the dataset. The validation accuracy is capped at 10% even when running for 100 epochs, and the accuracy excluding the 'O' tag is essentially 0%.

Here is a sample of my 'processed_dataset.csv' along with the parameters I used for training:

[screenshot of the preprocessed CSV, same format as the sample shown earlier]

I do think my preprocessed dataset format is correct and strictly follows your requirements.

The parameters I used are:

In config.yaml:

csv_filename: 'preprocessed_ner_dataset.csv'
max_seq_length: 128 (----> I changed this to 128 to fit the news articles from my data source)
BATCH_SIZE: 128
BUFFER_SIZE: 2048
test_set_batches: 75
labelled_train_batches: 22
categories: 22 (----> there are 22 NER categories in my dataset)

word_id_field: 'word_id'
mask_field: 'mask'
segment_id_field: 'segment_id'
tag_id_field: 'tag_id'

In bert_ner.yaml:

EPOCHS: 20
latent_dim: 32
rate: 0.0
mlp_dims: [256, 128, 64]
lr: 0.001
model_save_weights_name: 'BERT_NER'

For the pretrained BERT model, I used multilingual BERT from TF Hub.

Please give me some hints about any mistakes I might have made.

Thanks a lot.

Marcus

@AdamStein97
Owner

Hi Marcus,

Your loss seems to be NaN from the beginning, which implies the input to the model is wrong. Have you made sure that the tokenizer which generates the word ids is compatible with Chinese? (This is the most likely issue.) I suspect you may also need to pull a different version of the BERT layer from TF Hub in both the preprocessor and the model.

Following that, I would recommend checking the batches being passed to the model and making sure they look sensible.

Thanks,
Adam
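
One crude way to act on that suggestion: check whether the word ids in the CSV map back to the original Chinese characters under the vocabulary of the BERT module actually being loaded. VOCAB_PATH below is a placeholder for the vocab.txt file that ships with whichever TF Hub BERT model is in use.

```python
import pandas as pd

VOCAB_PATH = "vocab.txt"  # placeholder: vocab file of the BERT module in use

with open(VOCAB_PATH, encoding="utf-8") as f:
    id_to_token = [line.rstrip("\n") for line in f]

df = pd.read_csv("preprocessed_ner_dataset.csv").head(50)
for word, word_id in zip(df["Word"], df["word_id"]):
    token = id_to_token[word_id] if word_id < len(id_to_token) else "<out of range>"
    flag = "" if token == str(word) else "   <-- mismatch"
    print(f"{word!r:>6}  id={word_id:<6} vocab says {token!r}{flag}")
```

With the matching vocabulary the characters should come back unchanged; widespread mismatches or out-of-range ids would point to the tokenizer/model mismatch Adam describes.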

@marcusau
Author

OK, I may try the Chinese BERT base model from TensorFlow Hub first. Let's see if that makes a difference.
Thanks
