
convert : handle tokenizer merges format from transformers 4.45 #9696

Merged
merged 2 commits into master from compilade/convert-merges-pairs-to-old
Oct 3, 2024

Conversation

compilade
Collaborator

As noted in #9692, transformers 4.45 introduces a new serialization for merges, which is a list of 2 strings instead of a single string with 2 space-separated parts.

This needs to be handled in gguf-py/gguf/vocab.py, because gguf.SpecialVocab(dir_model, load_merges=True) otherwise ignores the merges with a warning:

WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.

Note that this is only a problem when there is no merges.txt, so it should be reproducible with https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-tokenizer, but not with https://huggingface.co/pcuenq/Qwen2.5-0.5B-Instruct-with-new-merges-serialization unless merges.txt is renamed or removed.
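The format change itself is easy to smooth over; the following is a minimal sketch of the idea (not the exact code in gguf-py/gguf/vocab.py), assuming the merges list comes from the "model" section of a Hugging Face tokenizer.json:

```python
# Sketch: normalize both merges serializations to the old string form.
# The actual handling in gguf-py/gguf/vocab.py may differ in detail.

def normalize_merges(merges):
    """Convert merges to the old "token1 token2" string form.

    transformers < 4.45 serializes each merge as a single string with two
    space-separated parts; transformers 4.45 serializes each merge as a
    list of two strings.
    """
    return [" ".join(m) if isinstance(m, list) else m for m in merges]

# Old format passes through unchanged; new format is joined back together.
print(normalize_merges(["h e", "l l"]))            # ['h e', 'l l']
print(normalize_merges([["h", "e"], ["l", "l"]]))  # ['h e', 'l l']
```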


@github-actions github-actions bot added the python python script changes label Sep 30, 2024
@compilade compilade added the bugfix (fixes an issue or bug) and Review Complexity : Low (trivial changes to code that most beginner devs, or those who want a break, can tackle; e.g. UI fix) labels Sep 30, 2024
Collaborator

@Vaibhavs10 Vaibhavs10 left a comment


Nice catch! I didn't even think about going down this path to check! 🔥

@pcuenca
Contributor

pcuenca commented Sep 30, 2024

What's the reproduction flow? I only pushed the tokenizer to https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-tokenizer but I have a local copy with the model as well, and it converts successfully using convert_hf_to_gguf.py – no merges.txt in the directory.

@pcuenca
Contributor

pcuenca commented Sep 30, 2024

Is it only failing for sentencepiece-based tokenizers?

@compilade
Collaborator Author

compilade commented Sep 30, 2024

@pcuenca This isn't about failure, but about warnings and missing tokenizer.ggml.merges.

When taking https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-tokenizer and converting it with python3 convert_hf_to_gguf.py --vocab-only /path/to/model_dir from master, I get

WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.

And with gguf-dump:

     23: STRING     |        1 | tokenizer.ggml.model = 'gpt2'
     24: STRING     |        1 | tokenizer.ggml.pre = 'llama-bpe'
     25: [STRING]   |   128256 | tokenizer.ggml.tokens
     26: [INT32]    |   128256 | tokenizer.ggml.token_type
     27: UINT32     |        1 | tokenizer.ggml.bos_token_id = 128000
     28: UINT32     |        1 | tokenizer.ggml.eos_token_id = 128009

But when I convert it from this branch, I get

INFO:gguf.vocab:Adding 280147 merge(s).

And in the dump:

     23: STRING     |        1 | tokenizer.ggml.model = 'gpt2'
     24: STRING     |        1 | tokenizer.ggml.pre = 'llama-bpe'
     25: [STRING]   |   128256 | tokenizer.ggml.tokens
     26: [INT32]    |   128256 | tokenizer.ggml.token_type
     27: [STRING]   |   280147 | tokenizer.ggml.merges
     28: UINT32     |        1 | tokenizer.ggml.bos_token_id = 128000
     29: UINT32     |        1 | tokenizer.ggml.eos_token_id = 128009

Notice that the tokenizer.ggml.merges field is only present when the new format is properly handled, at least when tokenizer.json uses that format.

Is it only failing for sentencepiece-based tokenizers?

No, since those don't have merges, only BPE tokenizers do.
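Whether a tokenizer.json carries merges at all can be read off its "model" section; this is an illustrative sketch (the key layout follows the Hugging Face tokenizers format, and the helper name is made up here, not part of gguf-py):

```python
# Sketch: BPE tokenizers store a "merges" list under "model", while
# Unigram (sentencepiece-style) tokenizers have no merges at all.

def tokenizer_has_merges(tokenizer_json: dict) -> bool:
    model = tokenizer_json.get("model") or {}
    return model.get("type") == "BPE" and "merges" in model

print(tokenizer_has_merges({"model": {"type": "BPE", "merges": ["h e"]}}))  # True
print(tokenizer_has_merges({"model": {"type": "Unigram", "vocab": []}}))   # False
```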

@pcuenca
Contributor

pcuenca commented Sep 30, 2024

I was considering those warnings failures, as the resulting model could not be used for generation. But I was not seeing them, and thus I was very confused. I have been able to reproduce the issue now with a local copy of Llama 3.1, but both Llama 3.2 and Llama 2 worked fine for me in my tests (with the tokenizer exported using the new format in all three cases).

No, since those don't have merges, only BPE tokenizers do.

True, but I was reading the code and trying to guess when the code path would trigger.

I'll debug more closely to try to understand why it worked with Llama 3.2 and not with Llama 3.1. Thanks!

Edit: this was user error, I've now reproduced the problem with Llama 3.2 as well. Llama 2 works because it's sentencepiece-based. Apologies for the noise!

@farris

farris commented Oct 2, 2024

Can this be merged... please

@xmaayy

xmaayy commented Oct 2, 2024

@farris I can confirm that just cloning this branch and running the conversion works without any other modifications, and the resulting models are immediately usable by llama.cpp, ollama, or any other GGUF-compatible service.

This model was trained with transformers>=4.45

> python convert_hf_to_gguf.py --outtype q8_0 --outfile ../pyfim-1b-Q8_0.gguf ..
INFO:hf-to-gguf:Loading model: ..
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:../pyfim-1b-Q8_0.gguf: n_tensors = 147, total_size = 1.3G
Writing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1.31G/1.31G [00:07<00:00, 184Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to ../pyfim-1b-Q8_0.gguf

(It's fine-tuned for specific tokens, but I just wanted to show it works.)

> ollama create pyfim-1B -f Modelfile
transferring model data 100%
using existing layer sha256:a999a1c4c48fc6c4b8012a97642e6b9358c867c6ce9f678bc30906f4d426d623
using existing layer sha256:f22452e1689a8c191ff573072fc9c472d1e3fb5aae458d084de3983b610b4617
creating new layer sha256:d8f39b85b2c2b1c650535973d03f4002970b723ddb76cfad3f86322ebd17275b
creating new layer sha256:6eb2ba3a9d8a26c55aa1978e8a78183f812b3e34707bafb11c9c7edd37bbaf98
writing manifest
success

> ollama run pyfim-1B
>>> Hello
emangosal
<", :,0.               0e-5
>

@compilade compilade added the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label Oct 3, 2024
@ggerganov ggerganov merged commit e3c355b into master Oct 3, 2024
11 checks passed
@ggerganov ggerganov deleted the compilade/convert-merges-pairs-to-old branch October 3, 2024 14:22
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024