
Two questions about applying your amazing model to Traditional Chinese dataset #1

marcusau opened this issue Jul 10, 2020 · 9 comments

@marcusau

Thanks for your amazing work. I appreciate this model very much.

I have a Chinese dataset of ~60k sentences. The NER labels are done, but about 50% of them still need further polishing, since the manual NER labelling introduced some noise.

Based on your instructions in "bert_ner_data_dist_kl_config.yaml":

model_start_weights_filename: 'BERT_NER_final'

So, in order to train a bert_ner_data_dist_kl model, we have to train a bert_ner_baseline model first, and then use the weights from that baseline model to train the bert_ner_data_dist_kl model.

Is my understanding correct?

Additionally, the purpose of your semi-supervised-bert-ner is to tackle the issue of having only limited labelled NER data for training an NER model, right?

In my case, my firm provides a huge "raw" dataset, but the labelled data is limited.

Thanks a lot.

Marcus

@AdamStein97
Owner

Hi Marcus,

You are correct. You first train the BERT_NER model using the bert_ner_trainer.py file. Once that is trained, you use bert_ner_trainer_data_dist_kl.py to fine-tune those weights. With around 60k sentences, I would recommend a high batch size to make the fine-tuning as effective as possible.

Yes, exactly. The goal of the model is to be able to leverage unlabelled data to improve accuracy. If you can get this "raw dataset" into the same format as your labelled examples, then hopefully this approach will be successful for you.

Let me know if you have any more questions!

Thanks,
Adam
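
For anyone following along, here is a minimal sketch of the data-distribution KL idea as it can be inferred from the config names in this thread: on unlabelled batches, the batch-averaged predicted tag distribution is pulled towards an estimated prior over tags. The function name and signature below are illustrative only, not taken from the repo.

```python
import tensorflow as tf

def data_dist_kl_loss(logits, tag_prior, mask):
    """Hypothetical KL term: batch-averaged predicted tag distribution vs. a prior.

    logits:    [batch, seq_len, num_tags] raw model outputs
    tag_prior: [num_tags] estimated true probability of each tag
    mask:      [batch, seq_len] with 1 for real tokens, 0 for padding
    """
    probs = tf.nn.softmax(logits, axis=-1)
    mask = tf.cast(mask, probs.dtype)[..., tf.newaxis]
    # Average per-token tag probabilities over all non-padding tokens in the batch
    avg_probs = tf.reduce_sum(probs * mask, axis=[0, 1]) / tf.reduce_sum(mask)
    prior = tf.cast(tag_prior, avg_probs.dtype)
    eps = 1e-8  # numerical stability
    return tf.reduce_sum(avg_probs * tf.math.log((avg_probs + eps) / (prior + eps)))
```

Presumably the labelled batches still use an ordinary cross-entropy loss, and a term of this kind is what lets the unlabelled data contribute during fine-tuning.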

@marcusau
Author

Hi Adam,

Thanks for your prompt response. Just one follow-up question. If my approach is:

  1. use 40% of my raw dataset (~25k sentences) to train the BERT_NER model using the bert_ner_trainer.py file
     (because I have high confidence in the quality of the NER labels on this 40% of the data), and

  2. use the full dataset (~60k sentences) with bert_ner_trainer_data_dist_kl.py to fine-tune these weights,

Do you think my approach is correct?

Thanks a lot.

Marcus

@AdamStein97
Owner

Hi Marcus,

Yes, this is the correct approach. Please note that in the bert_ner_trainer_data_dist_kl.py file, you will also pass the 40% of your raw dataset as the "labelled_ds" argument and the remaining 60% as the "unlabelled_ds" argument.

Also do not forget that you will need to estimate the true probability distribution of your labels and place that in the "bert_ner_data_dist_kl_config.yaml" config file.

Good luck!
Adam
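
A rough way to estimate that label distribution from the labelled data, assuming the preprocessed CSV layout Marcus shows later in this thread (columns sentence_id, Word, word_id, tag_id); the exact key and format that "bert_ner_data_dist_kl_config.yaml" expects may differ:

```python
import pandas as pd

# Empirical tag distribution over the labelled portion of the data.
df = pd.read_csv("preprocessed_ner_dataset.csv")
counts = df["tag_id"].value_counts().sort_index()  # token count per tag id
dist = counts / counts.sum()                       # normalise to probabilities
print(dist.round(6).tolist())  # paste into the config as the label distribution
```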

@marcusau
Author

Hi Adam,

Sure, thanks a lot.

I will do so in the middle of this week.

Thanks.

Marcus

@marcusau
Author

Hi Adam,

I would like to share the progress of my work on your amazing library.

I am training the BERT_NER model using the bert_ner_trainer.py file with 'https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/2' on Google Colab.

My dataset looks something like this:
sentence_id,Word,word_id,tag_id
0,[CLS],101,0
0,花,5709,0
0,旗,3186,0
0,發,4634,22
0,表,6134,22
0,報,1841,3
0,告,1440,3
0,指,2900,22
0,,,8024,22
0,中,704,0
0,升,1285,0
0,控,2971,0
0,股,5500,0
0,旗,3186,22
0,下,678,22
0,L,154,0
0,e,147,0
0,x,166,0
0,u,163,0
0,s,161,0
0,新,3173,22
0,款,3621,22
0,M,155,3
0,P,158,3
0,V,164,3
0,L,154,3
0,M,155,3
0,3,124,3
0,0,121,3
0,0,121,3
0,市,2356,15
0,場,1842,15
0,接,2970,22
0,受,1358,22
0,度,2428,22
0,極,3513,22
0,佳,881,22
0,,,8024,22
0,有,3300,22

I have 22 NER tags:
{"0": "ORG", "1": "LOC", "2": "FAC", "3": "PRODUCT", "4": "LANGUAGE", "5": "NORP", "6": "WORK_OF_ART", "7": "QUANTITY", "8": "PERSON", "9": "LAW", "10": "EVENT", "11": "TITLE", "12": "TIME", "13": "IDIOM", "14": "ENGLISH", "15": "J", "16": "FIN", "17": "TERM", "18": "UNIT", "19": "CONCEPT", "20": "POLICY", "21": "SLOGAN"}

ORG = firm/organization
FIN = financial instruments, e.g. stock indices, bonds, options, etc.
CONCEPT = ideas, concepts
TERM = professional terms outside the financial scope, e.g. XX accounting standards
UNIT = units, e.g. kg, lots of stock, etc.
SLOGAN = many Chinese listed companies and policies are described with slogans; this is a characteristic of the China equity market
POLICY = government policies or schemes
LAW = rules or laws
J = short names of some stocks, people, or policies


Configs:
csv_filename: 'preprocessed_ner_dataset.csv'
max_seq_length: 216
BATCH_SIZE: 128
BUFFER_SIZE: 2048
test_set_batches: 75
labelled_train_batches: 20
categories: 22

word_id_field: 'word_id'
mask_field: 'mask'
segment_id_field: 'segment_id'
tag_id_field: 'tag_id'

EPOCHS: 20
latent_dim: 32
rate: 0.0
mlp_dims: [256, 128, 64]
lr: 0.001
model_save_weights_name: 'BERT_NER'

Let's see the results once training finishes.

Marcus
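
Before training, a quick sanity check on the preprocessed CSV might save some debugging later; the column names come from the sample above, and the check itself is only a suggestion, not part of the repo. The distinct tag ids should line up with the "categories" value in the config, and the longest sentence should fit inside "max_seq_length".

```python
import pandas as pd

df = pd.read_csv("preprocessed_ner_dataset.csv")

# The set of tag ids actually present should be consistent with `categories`
# (i.e. the number of output classes the model is built with).
print("distinct tag ids:", sorted(df["tag_id"].unique()))

# No sentence should be longer than the configured max_seq_length (216 here).
lengths = df.groupby("sentence_id").size()
print("longest sentence:", int(lengths.max()), "tokens")
```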

@marcusau
Author

Also, I am using the 'BMES' label format instead of the 'BIO' format.
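
For readers unfamiliar with the two schemes: BMES splits what BIO calls "inside" into middle and end positions and marks single-character entities separately. A tiny illustration, hypothetically prefixing the ORG tag from the sample data above with positional markers (the actual tag ids in the CSV are unprefixed):

```python
# The same two-character ORG entity (花旗) under the two labelling schemes.
tokens = ["花", "旗", "發", "表"]
bio    = ["B-ORG", "I-ORG", "O", "O"]
bmes   = ["B-ORG", "E-ORG", "O", "O"]  # a one-character entity would get S-ORG
```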

@marcusau
Author

Hi Adam,

I may need your help with my first training run using bert_ner_trainer.py.

For my initial training exercise, everything is run on Google Colab, but the training result is strange and below expectations.

[screenshot of training logs: loss is NaN and validation accuracy plateaus around 10%]

I don't know what mistake I have made with the dataset. The validation accuracy is capped at 10% even when running for 100 epochs, and the accuracy excluding the 'O' tag is essentially 0%.

Here is a sample of my 'processed_dataset.csv' along with the parameters I used for training:

[screenshot of the preprocessed CSV, same format as the sample shown earlier]

I do think my preprocessed dataset format is correct and strictly follows your requirements.

The parameters I used are:

In config.yaml:

csv_filename: 'preprocessed_ner_dataset.csv'
max_seq_length: 128 (----> I changed this to 128 to fit the news articles from my data source)
BATCH_SIZE: 128
BUFFER_SIZE: 2048
test_set_batches: 75
labelled_train_batches: 22
categories: 22 (----> there are 22 NER categories in my dataset)

word_id_field: 'word_id'
mask_field: 'mask'
segment_id_field: 'segment_id'
tag_id_field: 'tag_id'

In bert_ner.yaml:

EPOCHS: 20
latent_dim: 32
rate: 0.0
mlp_dims: [256, 128, 64]
lr: 0.001
model_save_weights_name: 'BERT_NER'

For the pretrained BERT model, I used multilingual BERT from TF Hub.

Please give me some hints about any mistakes I might have made.

Thanks a lot.

Marcus

@AdamStein97
Owner

Hi Marcus,

Your loss seems to be NaN from the beginning, which implies the input to the model is wrong. Have you made sure that the tokenizer which generates the word ids is compatible with Chinese? (This is the most likely issue.) I suspect you may also need to pull a different version of the BERT layer from TF Hub in both the preprocessor and the model.

Following that, I would recommend checking the batches being passed to the model and making sure they look sensible.

Thanks,
Adam
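
One crude way to act on that suggestion: check whether the word ids in the CSV map back to the original Chinese characters under the vocabulary of the BERT module actually being loaded. VOCAB_PATH below is a placeholder for the vocab.txt file that ships with whichever TF Hub BERT model is in use.

```python
import pandas as pd

VOCAB_PATH = "vocab.txt"  # placeholder: vocab file of the BERT module in use

with open(VOCAB_PATH, encoding="utf-8") as f:
    id_to_token = [line.rstrip("\n") for line in f]

df = pd.read_csv("preprocessed_ner_dataset.csv").head(50)
for word, word_id in zip(df["Word"], df["word_id"]):
    token = id_to_token[word_id] if word_id < len(id_to_token) else "<out of range>"
    flag = "" if token == str(word) else "   <-- mismatch"
    print(f"{word!r:>6}  id={word_id:<6} vocab says {token!r}{flag}")
```

With the matching vocabulary the characters should come back unchanged; widespread mismatches or out-of-range ids would point to the tokenizer/model mismatch Adam describes.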

@marcusau
Author

OK, I may try the Chinese BERT base model from TensorFlow Hub first. Let's see if that makes a difference.
Thanks
