[Usage] tokenization mismatch when finetuning v1.5-7b #661
Comments
Same problem, I found:

The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head (with an automatically added BOS token). I tried to fix this WARNING by:

```python
cur_len = 1 + 1  # 1 for bos, and 1 for compensating in the first round
...
round_len = len(tokenizer_image_token(rou, tokenizer)) - 2 + 1  # -2 for the extra tokens in tokenizing "USER", +1 for the missing "</s>"
...
round_len = len(tokenizer(rou).input_ids) - 2 + 1
```
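For anyone who wants to see the discrepancy directly, here is a minimal sketch (the checkpoint id and example strings are assumptions, the token ids in the comments are the ones reported above, and the exact output depends on your transformers/tokenizers version, which is the behavioral change discussed further down):

```python
from transformers import AutoTokenizer

# Assumption: the public LLaVA v1.5-7b checkpoint; point this at your own model path.
tok = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.5-7b", use_fast=False)

# Head of the prompt: a BOS token is prepended and "USER" is split into two pieces,
# e.g. [1, 3148, 1001] as reported above.
print(tok("USER: hi").input_ids)

# Middle of the conversation: a later round follows "</s>" with no leading space,
# so "USER" may collapse into a single, different piece, e.g. [11889].
print(tok("</s>USER: hi").input_ids)
```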
Awesome! I tested your change, and it did work. So the problem is caused by both "USER" and the missing "</s>".
@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". |
Yes, this will lead to different tokenization results with the LLaMA tokenizer.
For the above case, can the tokenizer correctly separate "No" (or other words) before "</s>"?
Hi, we have temporarily set the tokenizer version to "tokenizers>=0.12.1,<0.14" until we figure out what has changed in 0.14. You may run …
@yuyq96 Thanks for the fix, I'll take a look into this issue. Could this fix cause issues with earlier tokenizer versions? I feel that there were some behavioral changes in the tokenizer.
Thanks, downgrading tokenizers to 0.12.1 and transformers to 4.31.0 solved the problem. I also tried inserting spaces before and after …
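For anyone replicating this downgrade, a small sanity-check sketch (the version pins are simply the ones reported in the comment above) to confirm that the pinned versions are the ones actually being imported:

```python
import tokenizers
import transformers

# The version combination reported to work earlier in this thread.
print("transformers:", transformers.__version__)  # expected: 4.31.0
print("tokenizers:", tokenizers.__version__)      # expected: 0.12.1
```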
@haotian-liu In my experiment, setting "use_fast=True" for the tokenizer works, with transformers==4.34.1 and tokenizers==0.14.1.
@zzzzzzrc I tried setting "use_fast=True" and it works. But I'm not sure whether it will affect the final performance or not. Do you have any suggestions?
Has this been fixed?
Setting use_fast=True works for my case.
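For reference, a minimal sketch of the use_fast workaround discussed above, assuming the tokenizer is created through transformers.AutoTokenizer the way the training script does (the checkpoint id and the other arguments are placeholders to adjust to your setup):

```python
from transformers import AutoTokenizer

# Assumption: model_name_or_path points at your base checkpoint.
model_name_or_path = "liuhaotian/llava-v1.5-7b"

tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    model_max_length=2048,   # match your training config
    padding_side="right",
    use_fast=True,           # the workaround discussed above; the repo default is use_fast=False
)
```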
@haotian-liu In round_len = len(tokenizer(rou).input_ids), the tokenizer adds a BOS token (the BOS of Vicuna) for each round, so I wonder whether the round_len calculation is right? Thanks
I encountered the "tokenization mismatch" issue during fine-tuning as well. Upon investigation, I found that it was primarily caused by empty strings in the "value" field of QA turns, e.g. {"from": "human", "value": ""}, in the dataset. As a result, the prompt ended up containing the string "xxx USER: ASSISTANT: xxxx", which led to the "tokenization mismatch" during tokenization. I'm not sure if this experience is useful, but I thought I'd share it.
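If you suspect the same cause, a quick sketch for scanning a LLaVA-style annotation file for empty "value" fields (the file name and field names are assumptions based on the comment above):

```python
import json

# Assumption: a LLaVA-style annotation file, i.e. a list of samples, each holding a
# "conversations" list of {"from": ..., "value": ...} turns; adjust the path/keys as needed.
with open("train.json") as f:
    data = json.load(f)

bad = [
    (i, turn)
    for i, sample in enumerate(data)
    for turn in sample.get("conversations", [])
    if not turn.get("value", "").strip()
]

print(f"{len(bad)} empty turn(s) found")
for i, turn in bad[:10]:
    print(i, turn)  # e.g. {'from': 'human', 'value': ''}
```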
Hi, I am training LLaVA with Qwen2 and got the same mismatch. I am just wondering, will it affect the training? And how can it be fixed for other tokenizers, not just LLaMA?
Hi, I have the same issue. Have you solved it? |
Same when using LoRA to finetune v1.6-34b.
I have fixed the issue. You just need to make sure the inputs and targets are properly masked.
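For context, here is a minimal sketch of what "properly masked" usually means for this kind of supervised fine-tuning: everything except the assistant replies gets IGNORE_INDEX (-100) so that only the answers contribute to the loss. The helper name and the span representation below are illustrative, not the repo's exact code:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are ignored by torch.nn.CrossEntropyLoss

def mask_targets(input_ids: torch.Tensor, answer_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Keep labels only inside the (start, end) token spans of the assistant replies."""
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in answer_spans:
        labels[start:end] = input_ids[start:end]
    return labels
```

If the cur_len / round_len bookkeeping discussed earlier drifts by even one token, the reconstructed spans no longer line up with the actual token sequence, which is exactly what the tokenization-mismatch warning is flagging; decoding the unmasked positions and checking that they are exactly the assistant replies is a quick sanity check.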
Can you share your tokenizer settings? |
Same when finetuning 1.5b.
Describe the issue
Issue:
I have found some threads reporting the tokenization mismatch problem, but I am still confused. I downloaded the v1.5-7b weights from https://huggingface.co/liuhaotian/llava-v1.5-7b/tree/main and finetuned on the datasets in the paper. I adapted the command line to make it run on V100 GPUs.
tokenizers.__version__ == '0.14.1'
Command:
Screenshots: