
convert : handle tokenizer merges format from transformers 4.45 #9696

Merged
merged 2 commits into master from compilade/convert-merges-pairs-to-old
Oct 3, 2024

Conversation

compilade
Collaborator

As noted in #9692, transformers 4.45 introduces a new serialization for merges, which is a list of 2 strings instead of a single string with 2 space-separated parts.

This needs to be handled in gguf-py/gguf/vocab.py, because gguf.SpecialVocab(dir_model, load_merges=True) otherwise ignores the merges with a warning:

WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.

Note that this is only a problem when there is no merges.txt, so it should be reproducible with https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-tokenizer, but not with https://huggingface.co/pcuenq/Qwen2.5-0.5B-Instruct-with-new-merges-serialization unless merges.txt is renamed or removed.
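The format change itself is easy to smooth over; the following is a minimal sketch of the idea (not the exact code in gguf-py/gguf/vocab.py), assuming the merges list comes from the "model" section of a Hugging Face tokenizer.json:

```python
# Sketch: normalize both merges serializations to the old string form.
# The actual handling in gguf-py/gguf/vocab.py may differ in detail.

def normalize_merges(merges):
    """Convert merges to the old "token1 token2" string form.

    transformers < 4.45 serializes each merge as a single string with two
    space-separated parts; transformers 4.45 serializes each merge as a
    list of two strings.
    """
    return [" ".join(m) if isinstance(m, list) else m for m in merges]

# Old format passes through unchanged; new format is joined back together.
print(normalize_merges(["h e", "l l"]))            # ['h e', 'l l']
print(normalize_merges([["h", "e"], ["l", "l"]]))  # ['h e', 'l l']
```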


@github-actions github-actions bot added the python python script changes label Sep 30, 2024
@compilade compilade added the bugfix (fixes an issue or bug) and Review Complexity : Low (trivial changes to code that most beginner devs, or those who want a break, can tackle; e.g. UI fix) labels Sep 30, 2024
Collaborator

@Vaibhavs10 Vaibhavs10 left a comment


Nice catch! I didn't even think about going down this path to check! 🔥

@pcuenca
Contributor

pcuenca commented Sep 30, 2024

What's the reproduction flow? I only pushed the tokenizer to https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-tokenizer but I have a local copy with the model as well, and it converts successfully using convert_hf_to_gguf.py – no merges.txt in the directory.

@pcuenca
Contributor

pcuenca commented Sep 30, 2024

Is it only failing for sentencepiece-based tokenizers?

@compilade
Collaborator Author

compilade commented Sep 30, 2024

@pcuenca This isn't about failure, but about warnings and missing tokenizer.ggml.merges.

When taking https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-tokenizer and converting it with python3 convert_hf_to_gguf.py --vocab-only /path/to/model_dir from master, I get

WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.

And with gguf-dump:

     23: STRING     |        1 | tokenizer.ggml.model = 'gpt2'
     24: STRING     |        1 | tokenizer.ggml.pre = 'llama-bpe'
     25: [STRING]   |   128256 | tokenizer.ggml.tokens
     26: [INT32]    |   128256 | tokenizer.ggml.token_type
     27: UINT32     |        1 | tokenizer.ggml.bos_token_id = 128000
     28: UINT32     |        1 | tokenizer.ggml.eos_token_id = 128009

But when I convert it from this branch, I get

INFO:gguf.vocab:Adding 280147 merge(s).

And in the dump:

     23: STRING     |        1 | tokenizer.ggml.model = 'gpt2'
     24: STRING     |        1 | tokenizer.ggml.pre = 'llama-bpe'
     25: [STRING]   |   128256 | tokenizer.ggml.tokens
     26: [INT32]    |   128256 | tokenizer.ggml.token_type
     27: [STRING]   |   280147 | tokenizer.ggml.merges
     28: UINT32     |        1 | tokenizer.ggml.bos_token_id = 128000
     29: UINT32     |        1 | tokenizer.ggml.eos_token_id = 128009

Notice that the tokenizer.ggml.merges field is only present when the new format is properly handled, at least when tokenizer.json uses that format.

Is it only failing for sentencepiece-based tokenizers?

No, since those don't have merges, only BPE tokenizers do.
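Whether a tokenizer.json carries merges at all can be read off its "model" section; this is an illustrative sketch (the key layout follows the Hugging Face tokenizers format, and the helper name is made up here, not part of gguf-py):

```python
# Sketch: BPE tokenizers store a "merges" list under "model", while
# Unigram (sentencepiece-style) tokenizers have no merges at all.

def tokenizer_has_merges(tokenizer_json: dict) -> bool:
    model = tokenizer_json.get("model") or {}
    return model.get("type") == "BPE" and "merges" in model

print(tokenizer_has_merges({"model": {"type": "BPE", "merges": ["h e"]}}))  # True
print(tokenizer_has_merges({"model": {"type": "Unigram", "vocab": []}}))   # False
```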

@pcuenca
Contributor

pcuenca commented Sep 30, 2024

I was considering those warnings failures, as the resulting model could not be used for generation. But I was not seeing them, and thus I was very confused. I have been able to reproduce the issue now with a local copy of Llama 3.1, but both Llama 3.2 and Llama 2 worked fine for me in my tests (with the tokenizer exported using the new format in all three cases).

No, since those don't have merges, only BPE tokenizers do.

True, but I was reading the code and trying to guess when the code path would trigger.

I'll debug more closely to try to understand why it worked with Llama 3.2 and not with Llama 3.1. Thanks!

Edit: this was user error, I've now reproduced the problem with Llama 3.2 as well. Llama 2 works because it's sentencepiece-based. Apologies for the noise!

@farris

farris commented Oct 2, 2024

Can this be merged... please

@xmaayy

xmaayy commented Oct 2, 2024

@farris I can confirm that just cloning this branch and running the conversion works without any other modifications, and the resulting models are immediately usable by llama.cpp, ollama, or any other GGUF-compatible service.

This model was trained with transformers>=4.45

> python convert_hf_to_gguf.py --outtype q8_0 --outfile ../pyfim-1b-Q8_0.gguf ..
INFO:hf-to-gguf:Loading model: ..
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:../pyfim-1b-Q8_0.gguf: n_tensors = 147, total_size = 1.3G
Writing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1.31G/1.31G [00:07<00:00, 184Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to ../pyfim-1b-Q8_0.gguf

(It's fine-tuned for specific tokens, but I just wanted to show it works.)

> ollama create pyfim-1B -f Modelfile
transferring model data 100%
using existing layer sha256:a999a1c4c48fc6c4b8012a97642e6b9358c867c6ce9f678bc30906f4d426d623
using existing layer sha256:f22452e1689a8c191ff573072fc9c472d1e3fb5aae458d084de3983b610b4617
creating new layer sha256:d8f39b85b2c2b1c650535973d03f4002970b723ddb76cfad3f86322ebd17275b
creating new layer sha256:6eb2ba3a9d8a26c55aa1978e8a78183f812b3e34707bafb11c9c7edd37bbaf98
writing manifest
success

> ollama run pyfim-1B
>>> Hello
emangosal
<", :,0.               0e-5
>

@compilade compilade added the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label Oct 3, 2024
@ggerganov ggerganov merged commit e3c355b into master Oct 3, 2024
11 checks passed
@ggerganov ggerganov deleted the compilade/convert-merges-pairs-to-old branch October 3, 2024 14:22
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024