
Mergekit produces broken Tokenizers #469

Closed
RedrixHD opened this issue Dec 7, 2024 · 2 comments

RedrixHD commented Dec 7, 2024

Greetings. I'm new to mergekit, so I still don't understand most of the technical terms.
I created a merge with the following config:

```yaml
models:
  - model: TheDrummer/UnslopNemo-12B-v4.1
  - model: inflatebot/MN-12B-Mag-Mell-R1
base_model: TheDrummer/UnslopNemo-12B-v4.1
merge_method: slerp
dtype: bfloat16
tokenizer_source: "union"
chat_template: "chatml"
parameters:
  t: [0, 0.5, 1, 0.5, 0]
```

The tokenizer of one of the parents, UnslopNemo-12B-v4.1, is broken too, so I used the tokenizer from the previous version, UnslopNemo-v2, instead. I also tried simply using the other model's tokenizer; it breaks as well. Oobabooga's Text Generation WebUI can't run the merge at all and fails while initializing the tokenizer. I've merged other models too, and their tokenizers are all broken.
Tokenizers that work are around 9MB, while broken ones are around 17MB.
This Reddit thread describes the same problem and suggests it might be related to Transformers: https://www.reddit.com/r/LocalLLaMA/comments/1gwyuyg/beware_of_broken_tokenizers_learned_of_this_while/

In the new tokenizer configuration I noticed a pad_to_multiple_of: entry. Could that fix it, maybe? Or is the cause something completely different?

cg123 (Collaborator) commented Dec 7, 2024

Hey!

This isn't actually an issue with mergekit - in huggingface/tokenizers#909 the serialization format for merges was changed. If you upgrade tokenizers and transformers in your webui environment it should be able to load these new-format tokenizers.

Alternatively, you can downgrade transformers to 4.44.2 and tokenizers to 0.19.1 in your mergekit environment; that will make mergekit output tokenizers in the old format. The trade-off is that you then won't be able to merge models that already use the new tokenizer format. There's no perfect option until the entire ecosystem supports the new one, unfortunately.
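If you want to check which format a merged tokenizer ended up in, you can inspect how the merges are serialized in its tokenizer.json: the old format stores each BPE merge as a single space-joined string, while the new format (from huggingface/tokenizers#909) stores it as a pair of tokens. A minimal sketch (the helper name is mine, not part of mergekit or tokenizers):

```python
import json
from pathlib import Path

def merges_format(tokenizer_json_path):
    """Report whether a tokenizer.json serializes BPE merges in the
    old style (space-joined strings) or the new style (token pairs)."""
    data = json.loads(Path(tokenizer_json_path).read_text(encoding="utf-8"))
    merges = data.get("model", {}).get("merges", [])
    if not merges:
        return "no merges found"
    # Old format: "tok_a tok_b"; new format: ["tok_a", "tok_b"]
    return "old (space-joined strings)" if isinstance(merges[0], str) else "new (token pairs)"
```

A tokenizer reported as "new (token pairs)" will only load in environments with a sufficiently recent tokenizers/transformers, matching the upgrade-or-downgrade choice described above.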

RedrixHD (Author) commented Dec 7, 2024

Thank you for your reply!

RedrixHD closed this as completed Dec 7, 2024