[DataCollatorForCompletionOnlyLM] warn if eos_token_id and pad_token_id are identical (huggingface#988)

Display a warning message if the pad_token_id and eos_token_id values are the same, in order to prevent unintended behavior during multi-turn training.
MustSave authored and Andrew Lapp committed May 10, 2024
1 parent 51a34fb commit 35bc5a3
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions trl/trainer/utils.py
@@ -99,6 +99,14 @@ def __init__(
            # The user already provides the token ids
            self.response_token_ids = response_template

        if not self.mlm and self.instruction_template and self.tokenizer.pad_token_id == self.tokenizer.eos_token_id:
            warnings.warn(
                "The pad_token_id and eos_token_id values of this tokenizer are identical. "
                "If you are planning for multi-turn training, "
                "it can result in the model continuously generating questions and answers without eos token. "
                "To avoid this, set the pad_token_id to a different value."
            )

        self.ignore_index = ignore_index

    def torch_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:
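
To illustrate why the warning matters, here is a minimal sketch of one plausible mechanism. It assumes the collator sets every label equal to pad_token_id to the ignore index, as the parent language-modeling collator does when mlm is off. The token ids and the helper names (`mask_padding`, `make_row`) are hypothetical and are not part of trl.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def mask_padding(input_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, ignoring every position equal to pad_token_id."""
    return [ignore_index if tok == pad_token_id else tok for tok in input_ids]


# Hypothetical token ids, chosen only for illustration.
EOS = 2
PAD_SHARED = 2    # pad_token_id == eos_token_id (the case the warning targets)
PAD_DISTINCT = 0  # pad_token_id differs from eos_token_id


def make_row(pad_id):
    """Two turns, each closed by EOS, followed by two padding positions."""
    return [11, 12, EOS, 13, 14, EOS, pad_id, pad_id]


labels_shared = mask_padding(make_row(PAD_SHARED), PAD_SHARED)
labels_distinct = mask_padding(make_row(PAD_DISTINCT), PAD_DISTINCT)

# With a shared id, the EOS positions are masked together with the padding,
# so no EOS token is left as a training target.
print(labels_shared)
# With a distinct pad id, only the trailing padding is masked.
print(labels_distinct)
```

In the shared case no EOS label survives, so the model gets no training signal for ending a turn. One common remedy, offered here as a suggestion rather than as what this commit prescribes, is to register a dedicated pad token with `tokenizer.add_special_tokens({"pad_token": "[PAD]"})` and then call `model.resize_token_embeddings(len(tokenizer))`; the `[PAD]` string is an illustrative choice.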