
data_collator error: Could not find response key in token IDs tensor #989

Closed
Sosycs opened this issue Nov 14, 2023 · 2 comments

Comments

@Sosycs

Sosycs commented Nov 14, 2023

Hello everyone,
I am currently fine-tuning Llama 2 on my own dataset and using DataCollatorForCompletionOnlyLM.

The structure of my text column is:
<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: Geology is the study of the Earths solid material and structures and the processes that create them. Some ideas geologists might consider include how rocks and landforms are created or the composition of rocks, minerals, or various landforms. Geologists consider how natural processes create and destroy materials on Earth, and how humans can use Earth materials as resources, among other topics. Geologists study rocks in the field to learn what they can from them. Question: Earth science is the study of Options:(A) solid Earth (B) Earths oceans (C) Earths atmosphere (D) all of the above Answer: [/INST] D </s>

My formatting function is:

def format(sample):
    s_prompt="Please select the correct answer from the given multiple Options based on the given Context:"
    system_prompt = f"<s>[INST] <<SYS>>\n {s_prompt}\n<</SYS>>\n\n"
    user_prompt = f"{sample['prompt']}"
    model_answer = f"\n[/INST] {sample['response']} </s>"

    # join all the parts together
    prompt = "".join([i for i in [system_prompt, user_prompt, model_answer] if i is not None])
    return prompt
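
For illustration, applying the function to a hypothetical sample (field values invented for this sketch) produces a prompt whose tail contains the response span the collator has to find:

# Hypothetical sample for illustration only
sample = {"prompt": "Context: ... Question: ... Options: (A) ... Answer:", "response": "D"}
print(format(sample))
# -> "<s>[INST] <<SYS>>\n Please select the correct answer ... <</SYS>>\n\n ... Answer:\n[/INST] D </s>"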

My Training code:

# Trainer (Fine-tuning)
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
    data_collator=DataCollatorForCompletionOnlyLM("[/INST]", tokenizer=tokenizer),
)
trainer.train()
trainer.model.save_pretrained(output_dir)

I have tried multiple response templates but always get the error:
RuntimeError: Could not find response key [835, 4007, 22137, 29901] in token IDs tensor([ 1, 835, ...])
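
For reference, a minimal check (a sketch, using the tokenizer loaded above) that compares how the template tokenizes on its own versus inside the formatted text:

# The collator searches each tokenized example for the template's standalone
# token IDs; Llama 2's SentencePiece tokenizer can split "[/INST]" differently
# depending on the character that precedes it, so the two encodings may not match.
template_ids = tokenizer.encode("[/INST]", add_special_tokens=False)
in_context_ids = tokenizer.encode("Answer:\n[/INST] D", add_special_tokens=False)
print(template_ids)
print(in_context_ids)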

Can you please guide me to the correct one?

@younesbelkada
Contributor

Hi @Sosycs
Thanks a lot for the issue. Sometimes you also need to prepend \n; can you try passing "\n[/INST]" to the data collator?
Alternatively, you can pass the token IDs directly to the collator. Please have a look at this section of the docs: https://huggingface.co/docs/trl/sft_trainer#using-tokenids-directly-for-responsetemplate and let me know if this helps.
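
A sketch of that second option, following the linked docs section (the exact slice depends on how the tokenizer splits the surrounding context, so print the IDs and verify before relying on it):

# Encode the template together with the newline that precedes it in the
# formatted text, then drop the tokens that only encode that leading context
# and pass the remaining IDs to the collator so the match is exact.
response_template_with_context = "\n[/INST]"
response_template_ids = tokenizer.encode(
    response_template_with_context, add_special_tokens=False
)[2:]  # the offset is an assumption here; verify it by inspecting the printed IDs
data_collator = DataCollatorForCompletionOnlyLM(
    response_template_ids, tokenizer=tokenizer
)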

@Sosycs
Author

Sosycs commented Nov 16, 2023

Closed; to be discussed in #981.
