
Mergekit produces broken Tokenizers #469

Closed
RedrixHD opened this issue Dec 7, 2024 · 2 comments

RedrixHD commented Dec 7, 2024

Greetings. I'm new to mergekit, so I still don't understand most of the technical terms.
I created a merge with the following config:

```yaml
models:
  - model: TheDrummer/UnslopNemo-12B-v4.1
  - model: inflatebot/MN-12B-Mag-Mell-R1
base_model: TheDrummer/UnslopNemo-12B-v4.1
merge_method: slerp
dtype: bfloat16
tokenizer_source: "union"
chat_template: "chatml"
parameters:
  t: [0, 0.5, 1, 0.5, 0]
```

The tokenizer of one of the parents, UnslopNemo-12B-v4.1, is broken too, so I used the tokenizer from the previous version, UnslopNemo-v2, instead. I also tried simply using the other model's tokenizer; it breaks as well. Oobabooga's Text Generation WebUI can't run the merge at all and fails while initializing the tokenizer. I've merged other models too, and their tokenizers are all broken.
Tokenizers that work are around 9MB, while broken ones are around 17MB.
This Reddit thread describes the same problem and suggests it might be related to Transformers: https://www.reddit.com/r/LocalLLaMA/comments/1gwyuyg/beware_of_broken_tokenizers_learned_of_this_while/

In the new tokenizer configuration I noticed a pad_to_multiple_of: entry. Could that fix it, maybe? Or is the cause something completely different?

cg123 (Collaborator) commented Dec 7, 2024

Hey!

This isn't actually an issue with mergekit - in huggingface/tokenizers#909 the serialization format for merges was changed. If you upgrade tokenizers and transformers in your webui environment it should be able to load these new-format tokenizers.

Alternatively, you can downgrade transformers to 4.44.2 and tokenizers to 0.19.1 in your mergekit environment; that will make mergekit output tokenizers in the old format. The trade-off is that you then won't be able to merge models that already use the new tokenizer format. There's no perfect option until the entire ecosystem supports the new one, unfortunately.
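If you want to check which format a merged tokenizer ended up in, you can inspect how the merges are serialized in its tokenizer.json: the old format stores each BPE merge as a single space-joined string, while the new format (from huggingface/tokenizers#909) stores it as a pair of tokens. A minimal sketch (the helper name is mine, not part of mergekit or tokenizers):

```python
import json
from pathlib import Path

def merges_format(tokenizer_json_path):
    """Report whether a tokenizer.json serializes BPE merges in the
    old style (space-joined strings) or the new style (token pairs)."""
    data = json.loads(Path(tokenizer_json_path).read_text(encoding="utf-8"))
    merges = data.get("model", {}).get("merges", [])
    if not merges:
        return "no merges found"
    # Old format: "tok_a tok_b"; new format: ["tok_a", "tok_b"]
    return "old (space-joined strings)" if isinstance(merges[0], str) else "new (token pairs)"
```

A tokenizer reported as "new (token pairs)" will only load in environments with a sufficiently recent tokenizers/transformers, matching the upgrade-or-downgrade choice described above.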

RedrixHD (Author) commented Dec 7, 2024

Thank you for your reply!

RedrixHD closed this as completed Dec 7, 2024