[DataCollatorForCompletionOnlyLM] Warn on identical `eos_token_id` and `pad_token_id` #988

MustSave · 2023-11-14T05:57:45Z

What does this PR do?

This PR displays a warning message when the values of pad_token_id and eos_token_id are identical. This is to prevent unexpected behavior during multi-turn training.

After the multi-turn data training with DataCollatorForCompletionOnlyLM, I encountered an issue where the model continued generating outputs even after the assistant's turn had been completed. This issue was due to the equivalence of the tokenizer's eos token and pad token by default in some model, resulting in the eos token not being properly trained.

For instance, in the torch_call() function within data_collator.py, the pad_token_id is converted to ignore_id(-100).

if self.mlm:
    batch["input_ids"], batch["labels"] = self.torch_mask_tokens(
        batch["input_ids"], special_tokens_mask=special_tokens_mask
    )
else:
    labels = batch["input_ids"].clone()
    if self.tokenizer.pad_token_id is not None:
        labels[labels == self.tokenizer.pad_token_id] = -100
    batch["labels"] = labels

If the multi-turn data is formatted as shown below, and eos_token_id and pad_token_id are identical, the eos_token would not be properly trained. This could lead to a scenario where the model continuously generates both user and assistant turns without recognizing the end of sequence (eos) token.

<s>### User: What is up?
### Assistant: Hello! How can I help you today?</s>
### User: Goodbye
### Assistant: Goodbye! If you have any more questions in the future, don't hesitate to ask.</s>

Suggestion

In case of Vicuna, they set pad_token to unk_token. Would it be a solution?

…id are identical Display a warning message if the and values are the same in order to prevent unintended behavior during multi-turn training.

HuggingFaceDocBuilderDev · 2023-11-14T18:22:38Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

younesbelkada

Makes a lot of sense , thank you!
Related: #976

…id are identical (huggingface#988) Display a warning message if the and values are the same in order to prevent unintended behavior during multi-turn training.

[DataCollatorForCompletionOnlyLM] warn if eos_token_id and pad_token_…

1284d65

…id are identical Display a warning message if the and values are the same in order to prevent unintended behavior during multi-turn training.

MustSave changed the title ~~[DataCollator] Warn on identical eos_token_id and pad_token_id~~ [DataCollatorForCompletionOnlyLM] Warn on identical eos_token_id and pad_token_id Nov 14, 2023

younesbelkada approved these changes Nov 14, 2023

View reviewed changes

younesbelkada merged commit be3faa7 into huggingface:main Nov 14, 2023
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DataCollatorForCompletionOnlyLM] Warn on identical `eos_token_id` and `pad_token_id` #988

[DataCollatorForCompletionOnlyLM] Warn on identical `eos_token_id` and `pad_token_id` #988

MustSave commented Nov 14, 2023 •

edited

Loading

HuggingFaceDocBuilderDev commented Nov 14, 2023

younesbelkada left a comment

[DataCollatorForCompletionOnlyLM] Warn on identical eos_token_id and pad_token_id #988

[DataCollatorForCompletionOnlyLM] Warn on identical eos_token_id and pad_token_id #988

Conversation

MustSave commented Nov 14, 2023 • edited Loading

What does this PR do?

Suggestion

HuggingFaceDocBuilderDev commented Nov 14, 2023

younesbelkada left a comment

Choose a reason for hiding this comment

[DataCollatorForCompletionOnlyLM] Warn on identical `eos_token_id` and `pad_token_id` #988

[DataCollatorForCompletionOnlyLM] Warn on identical `eos_token_id` and `pad_token_id` #988

MustSave commented Nov 14, 2023 •

edited

Loading