
Broken tokenizer in Yi-34B merge #428

Closed
Asherathe opened this issue Sep 30, 2024 · 3 comments · Fixed by #430

Comments

@Asherathe

I've been trying to merge two Yi-34B-based builds using Arcee's hosted mergekit. The merge appears to succeed, with no errors shown, but no matter which tokenizer source I use, the resulting tokenizer seems broken and I'm unable to convert to GGUF. I know there used to be a bug related to this, but I thought it had been fixed.

This is the most recent YAML I used:

base_model: TeeZee/Kyllene-34B-v1.1
chat_template: auto
dtype: float16
merge_method: ties
models:
- model: TeeZee/Kyllene-34B-v1.1
  parameters:
    density: 0.5
    weight: 0.5
- model: Doctor-Shotgun/Nous-Capybara-limarpv3-34B
  parameters:
    density: 0.5
    weight: 0.5
parameters:
  int8_mask: true
  normalize: false
tokenizer_source: base
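For anyone hitting the same symptom, here is a minimal sketch (function and token names are hypothetical, not part of mergekit) of the kind of sanity check that catches out-of-range token IDs in a merged vocab before attempting GGUF conversion:

```python
def find_out_of_range_tokens(vocab, vocab_size):
    """Return the {token: id} entries whose id falls outside [0, vocab_size).

    vocab:      mapping of token string -> integer id (e.g. from tokenizer.json)
    vocab_size: the model's declared vocab size (e.g. from config.json)
    """
    return {tok: idx for tok, idx in vocab.items()
            if idx < 0 or idx >= vocab_size}

# Hypothetical example: a 64000-entry vocab with two stray added tokens,
# mirroring the kind of off-by-the-end IDs reported in this thread.
vocab = {f"tok{i}": i for i in range(64000)}
vocab["extra_token_a"] = 64000
vocab["extra_token_b"] = 64001

bad = find_out_of_range_tokens(vocab, vocab_size=64000)
print(sorted(bad.values()))  # [64000, 64001]
```

If this returns anything, the GGUF conversion script will reject the model, so it is worth running right after the merge finishes.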
Naozumi520 commented Oct 3, 2024

Hi! What do you mean by a broken tokenizer? I'm not sure whether mine is broken: after merging, I see tokens like "<|unused115|>" and "<|unused026|>" in the generated text.

@Asherathe (Author)

> Hi! What do you mean by a broken tokenizer? I'm not sure whether mine is broken: after merging, I see tokens like "<|unused115|>" and "<|unused026|>" in the generated text.

I was unable to convert the model to GGUF and quantize it because of an error about token IDs being out of range: the tokenizer contained tokens numbered 64000 and 64001 when the maximum valid ID was 63999.
I was finally able to fix the problem by redoing the merge with the added parameter "embed_slerp=true".

I too see a lot of unused tokens in the config, but I don't know if that's anything to worry about. So far, I haven't seen these show up in generated text.
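For reference, here is the workaround above expressed as a config fragment. This is a sketch only: I'm assuming embed_slerp sits in the top-level parameters block alongside the other options, so check the mergekit tokenizer documentation before relying on this placement.

```yaml
tokenizer_source: base
parameters:
  int8_mask: true
  normalize: false
  embed_slerp: true  # workaround reported in this thread for out-of-range token IDs
```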

@cg123 (Collaborator) commented Oct 5, 2024

After merging in #430, I'm able to merge the config you posted and successfully quantize the output model. Please do let me know if this recurs or you run into any similar problems!
