Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataCollatorForCompletionOnlyLM] Warn on identical eos_token_id and pad_token_id #988

Merged

Conversation

MustSave
Copy link
Contributor

@MustSave MustSave commented Nov 14, 2023

What does this PR do?

This PR displays a warning message when the values of pad_token_id and eos_token_id are identical. This is to prevent unexpected behavior during multi-turn training.

After the multi-turn data training with DataCollatorForCompletionOnlyLM, I encountered an issue where the model continued generating outputs even after the assistant's turn had been completed. This issue was due to the equivalence of the tokenizer's eos token and pad token by default in some model, resulting in the eos token not being properly trained.

For instance, in the torch_call() function within data_collator.py, the pad_token_id is converted to ignore_id(-100).

if self.mlm:
    batch["input_ids"], batch["labels"] = self.torch_mask_tokens(
        batch["input_ids"], special_tokens_mask=special_tokens_mask
    )
else:
    labels = batch["input_ids"].clone()
    if self.tokenizer.pad_token_id is not None:
        labels[labels == self.tokenizer.pad_token_id] = -100
    batch["labels"] = labels

If the multi-turn data is formatted as shown below, and eos_token_id and pad_token_id are identical, the eos_token would not be properly trained. This could lead to a scenario where the model continuously generates both user and assistant turns without recognizing the end of sequence (eos) token.

<s>### User: What is up?
### Assistant: Hello! How can I help you today?</s>
### User: Goodbye
### Assistant: Goodbye! If you have any more questions in the future, don't hesitate to ask.</s>

Suggestion

In case of Vicuna, they set pad_token to unk_token. Would it be a solution?

…id are identical

Display a warning message if the  and  values are the same in order to prevent unintended behavior during multi-turn training.
@MustSave MustSave changed the title [DataCollator] Warn on identical eos_token_id and pad_token_id [DataCollatorForCompletionOnlyLM] Warn on identical eos_token_id and pad_token_id Nov 14, 2023
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Copy link
Contributor

@younesbelkada younesbelkada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes a lot of sense , thank you!
Related: #976

@younesbelkada younesbelkada merged commit be3faa7 into huggingface:main Nov 14, 2023
9 checks passed
lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024
…id are identical (huggingface#988)

Display a warning message if the  and  values are the same in order to prevent unintended behavior during multi-turn training.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants