convert : handle tokenizer merges format from transformers 4.45 #9696
Conversation
Nice catch! I didn't even think about going down this path to check! 🔥
What's the reproduction flow? I only pushed the tokenizer to https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-tokenizer but I have a local copy with the model as well, and it converts successfully.
Is it only failing for sentencepiece-based tokenizers?
@pcuenca This isn't about failure, but about warnings and missing merges. When taking https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-tokenizer and converting it with
And with
But when I convert it from this branch, I get
And in the dump:
Notice that the
No, since those don't have merges.
I was considering those warnings to be failures, as the resulting model could not be used for generation. But I was not seeing them, and thus I was very confused. I have been able to reproduce the issue now with a local copy of Llama 3.1.
True, but I was reading the code and trying to guess when the code path would trigger. I'll debug more closely to try to understand why it worked with Llama 3.2 and not with Llama 3.1. Thanks! Edit: this was user error, I've now reproduced the problem with Llama 3.2 as well. Llama 2 works because it's sentencepiece-based. Apologies for the noise!
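A quick way to tell which serialization a given `tokenizer.json` uses is to inspect the `model.merges` field directly. This is a hypothetical helper, not part of the repo; the file path is an example:

```python
import json

def merges_format(tokenizer_json_path: str) -> str:
    """Return "list" for the transformers >= 4.45 serialization
    (each merge is a 2-element list) or "string" for the older
    space-separated form. Assumes a BPE tokenizer.json."""
    with open(tokenizer_json_path, encoding="utf-8") as f:
        data = json.load(f)
    merges = data["model"]["merges"]
    if not merges:
        raise ValueError("tokenizer.json has no merges")
    return "list" if isinstance(merges[0], list) else "string"

# Example (path is hypothetical):
# merges_format("Llama-3.2-1B-Instruct-tokenizer/tokenizer.json")
```

Running this against the two test repos linked in the description should report `"list"` for the Llama 3.2 tokenizer and distinguish the Qwen variant only once its `merges.txt` is removed, since the converter prefers that file when present.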
Can this be merged... please |
@farris I can confirm that just cloning this branch and running the conversion works without any other modifications, and the resulting models are immediately usable by llama.cpp. This model was trained in
(it's fine-tuned for specific tokens, but I just wanted to show it works)
As noted in #9692, transformers 4.45 introduces a new serialization for merges: each merge is a list of 2 strings instead of a single string with 2 space-separated parts. This needs to be handled in gguf-py/gguf/vocab.py, because gguf.SpecialVocab(dir_model, load_merges=True) otherwise ignores the merges with a warning.

Note that this is only a problem when there is no merges.txt, so it should be reproducible with https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-tokenizer, but not with https://huggingface.co/pcuenq/Qwen2.5-0.5B-Instruct-with-new-merges-serialization unless merges.txt is renamed or removed.
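The fix described above can be sketched as a small normalization step: accept both serializations and convert the new list-of-two-strings form into the single space-separated string that GGUF metadata expects. This is a minimal illustration of the idea, not the actual patch to vocab.py:

```python
# Sketch (assumed behavior, not the real gguf-py code): normalize
# both merges serializations found in tokenizer.json to the
# "token_a token_b" string form used by GGUF.
def normalize_merges(merges):
    normalized = []
    for merge in merges:
        if isinstance(merge, str):
            # old format: "token_a token_b"
            normalized.append(merge)
        elif isinstance(merge, (list, tuple)) and len(merge) == 2:
            # transformers >= 4.45 format: ["token_a", "token_b"]
            normalized.append(f"{merge[0]} {merge[1]}")
        else:
            raise ValueError(f"unexpected merge entry: {merge!r}")
    return normalized

old_style = ["h e", "he llo"]
new_style = [["h", "e"], ["he", "llo"]]
assert normalize_merges(old_style) == normalize_merges(new_style)
```

A real implementation would also need to be careful with tokens that themselves contain spaces, which is one reason the list form was introduced in the first place; this sketch ignores that edge case.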