
data_collator error: Could not find response key in token IDs tensor #989

Closed
Sosycs opened this issue Nov 14, 2023 · 2 comments

Comments

@Sosycs

Sosycs commented Nov 14, 2023

Hello everyone,
I am currently fine-tuning Llama 2 on my own dataset and using DataCollatorForCompletionOnlyLM.

The structure of my text column is:
<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: Geology is the study of the Earths solid material and structures and the processes that create them. Some ideas geologists might consider include how rocks and landforms are created or the composition of rocks, minerals, or various landforms. Geologists consider how natural processes create and destroy materials on Earth, and how humans can use Earth materials as resources, among other topics. Geologists study rocks in the field to learn what they can from them. Question: Earth science is the study of Options:(A) solid Earth (B) Earths oceans (C) Earths atmosphere (D) all of the above Answer: [/INST] D </s>

My formatting function is:

def format(sample):
    s_prompt="Please select the correct answer from the given multiple Options based on the given Context:"
    system_prompt = f"<s>[INST] <<SYS>>\n {s_prompt}\n<</SYS>>\n\n"
    user_prompt = f"{sample['prompt']}"
    model_answer = f"\n[/INST] {sample['response']} </s>"

    # join all the parts together
    prompt = "".join([i for i in [system_prompt, user_prompt, model_answer] if i is not None])
    return prompt
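
For illustration, applying the function to a hypothetical sample (field values invented for this sketch) produces a prompt whose tail contains the response span the collator has to find:

# Hypothetical sample for illustration only
sample = {"prompt": "Context: ... Question: ... Options: (A) ... Answer:", "response": "D"}
print(format(sample))
# -> "<s>[INST] <<SYS>>\n Please select the correct answer ... <</SYS>>\n\n ... Answer:\n[/INST] D </s>"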

My Training code:

# Trainer (Fine-tuning)
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
    data_collator=DataCollatorForCompletionOnlyLM("[/INST]", tokenizer=tokenizer),
)
trainer.train()
trainer.model.save_pretrained(output_dir)

I have tried multiple response templates but always get the error:
RuntimeError: Could not find response key [835, 4007, 22137, 29901] in token IDs tensor([ 1, 835, ...])
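
For reference, a minimal check (a sketch, using the tokenizer loaded above) that compares how the template tokenizes on its own versus inside the formatted text:

# The collator searches each tokenized example for the template's standalone
# token IDs; Llama 2's SentencePiece tokenizer can split "[/INST]" differently
# depending on the character that precedes it, so the two encodings may not match.
template_ids = tokenizer.encode("[/INST]", add_special_tokens=False)
in_context_ids = tokenizer.encode("Answer:\n[/INST] D", add_special_tokens=False)
print(template_ids)
print(in_context_ids)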

Can you please guide me to the correct one?

@younesbelkada
Contributor

Hi @Sosycs
Thanks a lot for the issue. Sometimes you also need to prepend \n; can you try passing "\n[/INST]" to the data collator?
Alternatively, you can pass the token IDs directly to the collator. Please have a look at this section of the docs: https://huggingface.co/docs/trl/sft_trainer#using-tokenids-directly-for-responsetemplate and let me know if this helps.
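
A sketch of that second option, following the linked docs section (the exact slice depends on how the tokenizer splits the surrounding context, so print the IDs and verify before relying on it):

# Encode the template together with the newline that precedes it in the
# formatted text, then drop the tokens that only encode that leading context
# and pass the remaining IDs to the collator so the match is exact.
response_template_with_context = "\n[/INST]"
response_template_ids = tokenizer.encode(
    response_template_with_context, add_special_tokens=False
)[2:]  # the offset is an assumption here; verify it by inspecting the printed IDs
data_collator = DataCollatorForCompletionOnlyLM(
    response_template_ids, tokenizer=tokenizer
)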

@Sosycs
Author

Sosycs commented Nov 16, 2023

Closed; to be discussed in #981.
