Query related to data iterators for Seq2Seq translation using bert-gpt2 #256
Thanks @huzecong for the reply. Here is what I did:

```python
import texar.torch as tx  # texar-pytorch

tokenizer_gpt2 = tx.data.GPT2Tokenizer(
    pretrained_model_name='gpt2-small')
tokenizer_bert = tx.data.BERTTokenizer(
    pretrained_model_name='bert-base-uncased')

def token_transform_bert(arr):
    # Rejoin the whitespace-split tokens and re-tokenize with BERT's WordPiece.
    arr_str = ' '.join(arr)
    ret_arr = tokenizer_bert.map_text_to_token(arr_str)
    return ret_arr

def token_transform_gpt2(arr):
    # Rejoin the whitespace-split tokens and re-tokenize with GPT-2's BPE.
    arr_str = ' '.join(arr)
    ret_arr = tokenizer_gpt2.map_text_to_token(arr_str)
    return ret_arr
```
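A quick sanity check of the transforms (the exact subword outputs below are illustrative and depend on the pretrained vocabularies):

```python
# Assumes tokenizer_bert / tokenizer_gpt2 from the snippet above are in scope.
print(token_transform_bert(['The', 'unaffable', 'cat']))
# e.g. ['the', 'un', '##aff', '##able', 'cat']  (uncased WordPiece)
print(token_transform_gpt2(['The', 'cat', 'sat']))
# e.g. ['The', 'Ġcat', 'Ġsat']  ('Ġ' marks a preceding space in GPT-2 BPE)
```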
```python
data_hparams = {
    'train': {
        'source_dataset': {
            'files': 'exp/train_src.txt',
            'vocab_file': 'exp/bert_vocab.txt',
            'max_seq_length': 40,
            'bos_token': '[CLS]',
            'eos_token': '[SEP]',
            'other_transformations': [token_transform_bert],
        },
        'target_dataset': {
            'files': 'exp/train_tgt.txt',
            'vocab_file': 'exp/gpt2_vocab.txt',
            'max_seq_length': 40,
            'bos_token': '<|endoftext|>',
            'eos_token': '<|endoftext|>',
            'other_transformations': [token_transform_gpt2],
        },
        'batch_size': 40,
        'allow_smaller_final_batch': True,
        'shuffle': True,
        'num_parallel_calls': 3,
    },
    'test': {
        'source_dataset': {
            'files': 'exp/test_src.txt',
            'vocab_file': 'exp/bert_vocab.txt',
            'max_seq_length': 40,
            'bos_token': '[CLS]',
            'eos_token': '[SEP]',
            'other_transformations': [token_transform_bert],
        },
        'target_dataset': {
            'files': 'exp/test_tgt.txt',
            'vocab_file': 'exp/gpt2_vocab.txt',
            'max_seq_length': 40,
            'bos_token': '<|endoftext|>',
            'eos_token': '<|endoftext|>',
            'other_transformations': [token_transform_gpt2],
        },
        'batch_size': 12,
    },
    'valid': {
        'source_dataset': {
            'files': 'exp/valid_src.txt',
            'vocab_file': 'exp/bert_vocab.txt',
            'max_seq_length': 40,
            'bos_token': '[CLS]',
            'eos_token': '[SEP]',
            'other_transformations': [token_transform_bert],
        },
        'target_dataset': {
            'files': 'exp/valid_tgt.txt',
            'vocab_file': 'exp/gpt2_vocab.txt',
            'max_seq_length': 40,
            'bos_token': '<|endoftext|>',
            'eos_token': '<|endoftext|>',
            'other_transformations': [token_transform_gpt2],
        },
        'batch_size': 12,
    },
}
```
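These hparams are then consumed roughly as follows (a minimal sketch, assuming texar-pytorch; the `device` setup and the loop body are illustrative):

```python
import torch
import texar.torch as tx

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# PairedTextData builds both vocabularies and yields paired batches.
train_data = tx.data.PairedTextData(hparams=data_hparams['train'], device=device)
iterator = tx.data.DataIterator(train_data)

for batch in iterator:
    src_ids = batch['source_text_ids']  # BERT-side token ids
    tgt_ids = batch['target_text_ids']  # GPT-2-side token ids
```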
After this, an exception was raised saying that these special tokens already exist in the vocabulary, so I had to remove that check from the `Vocab` class in `vocabulary.py`. I also monkey-patched `paired_text_data.py`, since there was no way to pass `pad` and `unk` tokens to `PairedTextData`:

```python
# Inside paired_text_data.py, where the source/target vocabularies are built:
self._src_vocab = Vocab(src_hparams.vocab_file,
                        bos_token=src_hparams.bos_token,
                        eos_token=src_hparams.eos_token,
                        pad_token='[PAD]',
                        unk_token='[UNK]')
self._tgt_vocab = Vocab(tgt_hparams["vocab_file"],
                        bos_token=tgt_bos_token,
                        eos_token=tgt_eos_token,
                        pad_token='<|endoftext|>',
                        unk_token='<|endoftext|>')
```

I think: …
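For anyone who would rather not edit the installed sources, the same override can be expressed as a runtime monkey-patch. This is only a sketch: it assumes texar-pytorch's module layout (`paired_text_data` resolving `Vocab` as a module-level name) and shows the BERT-side defaults only; the GPT-2 side would need `'<|endoftext|>'` instead, e.g. by dispatching on the vocab filename.

```python
from texar.torch.data.data import paired_text_data

_OrigVocab = paired_text_data.Vocab  # the Vocab class PairedTextData uses

class _PatchedVocab(_OrigVocab):
    """Vocab with pad/unk defaults matching a BERT-style vocab file."""
    def __init__(self, filename, **kwargs):
        kwargs.setdefault('pad_token', '[PAD]')
        kwargs.setdefault('unk_token', '[UNK]')
        super().__init__(filename, **kwargs)

# Rebind the name inside paired_text_data so PairedTextData builds its
# vocabularies with the patched defaults.
paired_text_data.Vocab = _PatchedVocab
```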
Thank you for your feedback! These are all valuable suggestions, and I think we could add them. We're actually discussing the possibility of deprecating the …
Yes, I think we should support this feature. Since …
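For concreteness, the feature could take the shape of extra keys in the dataset hparams; the `pad_token`/`unk_token` keys below are purely hypothetical and not part of the current API:

```python
# Hypothetical hparams if PairedTextData accepted pad/unk overrides:
source_dataset_hparams = {
    'files': 'exp/train_src.txt',
    'vocab_file': 'exp/bert_vocab.txt',
    'bos_token': '[CLS]',
    'eos_token': '[SEP]',
    'pad_token': '[PAD]',  # hypothetical key, not currently supported
    'unk_token': '[UNK]',  # hypothetical key, not currently supported
}
```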
(From the original issue description:)

Hi, while trying to use the following snippet: …

In this example …

TIA