Training error for DistilBERT (pages 575-582) #42
-
Hi Sebastian, I want to share my findings about an error I encountered while trying to fine-tune DistilBERT for sentiment classification (pages 575-582) and how I solved it, and it would be great to get feedback on whether I understood and resolved the problem correctly. First of all, I installed the required packages, including transformers 4.9.1. After that, I followed the code from ch16-part3-bert.ipynb with one small modification: because I don't have internet access on the server with the GPU, I downloaded the model and tokenizer manually, together with the additional required files, and used them to load the model/tokenizer. When I ran this code:
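(Roughly the following; I'm reconstructing the cell here, so the local path and the `train_texts`/`valid_texts`/`test_texts` variables are just how it looks in my copy of the notebook and may differ slightly:)

```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

# load tokenizer and model from the manually downloaded files instead of the Hub
local_path = './models/distilbert-base-uncased/'
tokenizer = DistilBertTokenizerFast.from_pretrained(local_path)
model = DistilBertForSequenceClassification.from_pretrained(local_path)

# tokenize the train/validation/test texts as in the original notebook
# (no explicit max_length yet)
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)
```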
I saw the following warning:
And the sample encoding has the following attributes:
I decided to follow the next cells, and when I ran the training loop using device='cuda', I saw the following error:
I switched to device='cpu' to see a more detailed description of the error and got the following:
It seems that the embedding lookup somehow failed, so I decided to check the embedding dimensions of our pre-trained model.
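This is roughly how I checked it (the attribute paths below are for DistilBertForSequenceClassification in transformers 4.x; they may differ in other versions):

```python
# vocabulary size stored in the model config
print(model.config.vocab_size)                                        # 30522
# token embedding matrix: vocab_size x hidden_dim
print(model.distilbert.embeddings.word_embeddings.weight.shape)       # torch.Size([30522, 768])
# position embedding matrix: max sequence length x hidden_dim
print(model.distilbert.embeddings.position_embeddings.weight.shape)   # torch.Size([512, 768])
# vocabulary size known to the tokenizer
print(tokenizer.vocab_size)                                           # 30522
```

The word-embedding table matches the tokenizer's vocabulary, but the position-embedding table only has 512 rows, so inputs longer than 512 tokens push the embedding lookup out of range.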
The model and tokenizer vocabulary sizes are both equal to 30522, so I decided to go back to the tokenizer's first warning and follow its recommendation to specify the max_length argument for the tokenizer.
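Concretely, I re-ran the tokenization with max_length set explicitly (again a sketch; the notebook tokenizes the three splits the same way):

```python
max_length = 512  # DistilBERT's maximum input length (size of its position-embedding table)

train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True, max_length=max_length)
```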
And after that, the training loop completed successfully. Thank you.
-
Yeah, so the way I understand it is: if you initialize a general tokenizer, it wouldn't know what max_length you need, and you have to specify it manually. In this case, we use a tokenizer specific to the model we want to fine-tune, so it sets the appropriate max_length value. Sure, we could hard-code max_length=512, but this would not work for arbitrary models anymore (although I think 99% of the HF models have a model max_length of 512).
-
Hi Sebastian, I've made additional experiments starting from the original notebook, and it seems that the problem is not in the tokenizer but in the loading method (from local disk vs. from the Hugging Face Hub). To get all the required files into a local directory, I followed the recommendations on the model's page and downloaded the files using the following commands (the list of files could be reduced for our case, but I decided to download all of them for this experiment):
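(I won't paste the exact commands from the model page here; as an equivalent sketch, the same files can be fetched programmatically with huggingface_hub, assuming a reasonably recent version of that package, and the file list below is just what I recall from the model page:)

```python
import os
import shutil
from huggingface_hub import hf_hub_download

# files listed on the distilbert-base-uncased model page (the exact list there may differ)
files = ['config.json', 'pytorch_model.bin', 'tokenizer.json',
         'tokenizer_config.json', 'vocab.txt']

os.makedirs('distilbert-base-uncased', exist_ok=True)
for fname in files:
    # download to the local HF cache, then copy into our own folder
    cached_path = hf_hub_download(repo_id='distilbert-base-uncased', filename=fname)
    shutil.copy(cached_path, os.path.join('distilbert-base-uncased', fname))
```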
After that, I copied the whole "distilbert-base-uncased" folder into the "ch16/models" subfolder. Next, I changed the paths for loading the model and tokenizer from the local directory; the modified code for the tokenizer is the following:
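(In essence, the only change is pointing from_pretrained at the local directory instead of the Hub model id; a sketch, with the output of printing the tokenizer shown below:)

```python
from transformers import DistilBertTokenizerFast

# load the tokenizer from the local directory instead of 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained('./models/distilbert-base-uncased/')
print(tokenizer)
```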
PreTrainedTokenizerFast(name_or_path='./models/distilbert-base-uncased/', vocab_size=30522, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
As we can see, model_max_len is not limited "to the maximum acceptable input length for the model", as the description of the truncation argument says it should be:
So my guess is that the model type cannot be detected when the tokenizer is loaded from a local directory using the process and configuration described above. P.S. After all these experiments, I found the following GitHub issue: "[Bug] tokenizer.model_max_length is different when loading model from shortcut or local path". It confirms the difference in behaviour between the loading methods, especially for older models where some parameters were not yet present in tokenizer_config.json (this is described in the issue).
And I followed the recommendations from the issue (load the local pre-trained tokenizer with the model_max_length argument specified, save it, and reload it again without any additional arguments):
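(A sketch of the three steps; the tokenizer class and paths follow the notebook and the printed outputs below:)

```python
from transformers import DistilBertTokenizerFast

# 1) load the local tokenizer with model_max_length given explicitly
tokenizer = DistilBertTokenizerFast.from_pretrained(
    './models/distilbert-base-uncased/', model_max_length=512)
print(tokenizer)

# 2) save it; this writes an extended tokenizer_config.json that includes model_max_length
tokenizer.save_pretrained('./models/distilbert-base-uncased_saved/')

# 3) reload it without any additional arguments; model_max_len is now 512
tokenizer = DistilBertTokenizerFast.from_pretrained('./models/distilbert-base-uncased_saved/')
print(tokenizer)
```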
PreTrainedTokenizerFast(name_or_path='./models/distilbert-base-uncased/', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
PreTrainedTokenizerFast(name_or_path='./models/distilbert-base-uncased_saved/', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
As we can see, the model type and input size limits are now detected automatically based on the extended tokenizer configuration. The tokenizer_config.json for the saved model is also much more detailed:
Thank you.