Training error for DistilBERT (pages 575-582) #42
-
Hi Sebastian, I want to share my findings about an error I encountered while trying to fine-tune DistilBERT for sentiment classification (pages 575-582) and how I solved it, and it would be great to get feedback on whether I understood and resolved the problem correctly. First of all, I installed the required packages, including transformers 4.9.1. After that, I followed the code from ch16-part3-bert.ipynb with one small modification: because I don't have internet access on the server with the GPU, I downloaded the model and tokenizer manually, together with the additional required files, and used them to load the model/tokenizer. When I ran this code:
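(Roughly the following; I'm reconstructing the cell here, so the local path and the `train_texts`/`valid_texts`/`test_texts` variables are just how it looks in my copy of the notebook and may differ slightly:)

```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

# load tokenizer and model from the manually downloaded files instead of the Hub
local_path = './models/distilbert-base-uncased/'
tokenizer = DistilBertTokenizerFast.from_pretrained(local_path)
model = DistilBertForSequenceClassification.from_pretrained(local_path)

# tokenize the train/validation/test texts as in the original notebook
# (no explicit max_length yet)
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)
```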
I saw the following warning:
And the sample encoding has the following attributes:
I decided to follow the next cells, and when I ran the training loop using device='cuda', I saw the following error:
I switched to device='cpu' to see a more detailed description of the error and got the following:
It seems that the embedding lookup somehow failed, so I decided to check the embedding dimensions of our pre-trained model.
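This is roughly how I checked it (the attribute paths below are for DistilBertForSequenceClassification in transformers 4.x; they may differ in other versions):

```python
# vocabulary size stored in the model config
print(model.config.vocab_size)                                        # 30522
# token embedding matrix: vocab_size x hidden_dim
print(model.distilbert.embeddings.word_embeddings.weight.shape)       # torch.Size([30522, 768])
# position embedding matrix: max sequence length x hidden_dim
print(model.distilbert.embeddings.position_embeddings.weight.shape)   # torch.Size([512, 768])
# vocabulary size known to the tokenizer
print(tokenizer.vocab_size)                                           # 30522
```

The word-embedding table matches the tokenizer's vocabulary, but the position-embedding table only has 512 rows, so inputs longer than 512 tokens push the embedding lookup out of range.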
The model and tokenizer vocabulary sizes are both equal to 30522, so I decided to go back to the tokenizer's first warning and follow its recommendation to specify the max_length argument for the tokenizer.
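Concretely, I re-ran the tokenization with max_length set explicitly (again a sketch; the notebook tokenizes the three splits the same way):

```python
max_length = 512  # DistilBERT's maximum input length (size of its position-embedding table)

train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True, max_length=max_length)
```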
And after that, the training loop completed successfully. Thank you.
-
Yeah, so the way I understand it is: if you initialize a general tokenizer, it wouldn't know what max_length you need, and you have to specify it manually. In this case, we use a tokenizer specific to the model we want to fine-tune, so it sets the appropriate max_length value. Sure, we could hard-code max_length=512, but this would not work for arbitrary models anymore (although I think 99% of the HF models have a model max_length of 512).
-
Hi Sebastian, I've made additional experiments starting from the original notebook, and it seems that the problem is not in the tokenizer but in the loading method (from local disk vs. from the Hugging Face Hub). To get all the required files into a local directory, I followed the recommendations on the model's page and downloaded the files using the following commands (the list of files could be reduced for our case, but I decided to download all of them for this experiment):
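(I won't paste the exact commands from the model page here; as an equivalent sketch, the same files can be fetched programmatically with huggingface_hub, assuming a reasonably recent version of that package, and the file list below is just what I recall from the model page:)

```python
import os
import shutil
from huggingface_hub import hf_hub_download

# files listed on the distilbert-base-uncased model page (the exact list there may differ)
files = ['config.json', 'pytorch_model.bin', 'tokenizer.json',
         'tokenizer_config.json', 'vocab.txt']

os.makedirs('distilbert-base-uncased', exist_ok=True)
for fname in files:
    # download to the local HF cache, then copy into our own folder
    cached_path = hf_hub_download(repo_id='distilbert-base-uncased', filename=fname)
    shutil.copy(cached_path, os.path.join('distilbert-base-uncased', fname))
```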
After that, I copied the whole "distilbert-base-uncased" folder into the "ch16/models" subfolder. Next, I changed the paths for loading the model and tokenizer from the local directory; the modified code for the tokenizer is the following:
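(In essence, the only change is pointing from_pretrained at the local directory instead of the Hub model id; a sketch, with the output of printing the tokenizer shown below:)

```python
from transformers import DistilBertTokenizerFast

# load the tokenizer from the local directory instead of 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained('./models/distilbert-base-uncased/')
print(tokenizer)
```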
PreTrainedTokenizerFast(name_or_path='./models/distilbert-base-uncased/', vocab_size=30522, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
As we can see, model_max_len is not limited "to the maximum acceptable input length for the model", as the description of the truncation argument says it should be:
So my guess is that the model type cannot be detected when the tokenizer is loaded from a local directory using the process and configuration described above. P.S. After all these experiments, I found the following GitHub issue: "[Bug] tokenizer.model_max_length is different when loading model from shortcut or local path". It confirms the difference in behaviour between the loading methods, especially for older models where some parameters were not yet present in tokenizer_config.json (this is described in the issue).
And I followed the recommendations from the issue (load the local pre-trained tokenizer with the model_max_length argument specified, save it, and reload it again without any additional arguments):
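(A sketch of the three steps; the tokenizer class and paths follow the notebook and the printed outputs below:)

```python
from transformers import DistilBertTokenizerFast

# 1) load the local tokenizer with model_max_length given explicitly
tokenizer = DistilBertTokenizerFast.from_pretrained(
    './models/distilbert-base-uncased/', model_max_length=512)
print(tokenizer)

# 2) save it; this writes an extended tokenizer_config.json that includes model_max_length
tokenizer.save_pretrained('./models/distilbert-base-uncased_saved/')

# 3) reload it without any additional arguments; model_max_len is now 512
tokenizer = DistilBertTokenizerFast.from_pretrained('./models/distilbert-base-uncased_saved/')
print(tokenizer)
```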
PreTrainedTokenizerFast(name_or_path='./models/distilbert-base-uncased/', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
PreTrainedTokenizerFast(name_or_path='./models/distilbert-base-uncased_saved/', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
As we can see, the model type and input size limits are now detected automatically based on the extended tokenizer configuration. The tokenizer_config.json for the saved model is also much more detailed:
Thank you.