
Broken tokenizer in Yi-34B merge #428

Closed
Asherathe opened this issue Sep 30, 2024 · 3 comments · Fixed by #430

Comments

@Asherathe

I've been trying to merge two Yi-34B-based builds using Arcee's hosted mergekit. The merge appears to succeed, with no errors shown, but no matter which tokenizer source I use, the resulting tokenizer seems broken and I'm unable to convert to GGUF. I know there used to be a bug related to this, but I thought it had been fixed.

This is the most recent YAML I used:

base_model: TeeZee/Kyllene-34B-v1.1
chat_template: auto
dtype: float16
merge_method: ties
models:
- model: TeeZee/Kyllene-34B-v1.1
  parameters:
    density: 0.5
    weight: 0.5
- model: Doctor-Shotgun/Nous-Capybara-limarpv3-34B
  parameters:
    density: 0.5
    weight: 0.5
parameters:
  int8_mask: true
  normalize: false
tokenizer_source: base
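For anyone hitting the same symptom, here is a minimal sketch (function and token names are hypothetical, not part of mergekit) of the kind of sanity check that catches out-of-range token IDs in a merged vocab before attempting GGUF conversion:

```python
def find_out_of_range_tokens(vocab, vocab_size):
    """Return the {token: id} entries whose id falls outside [0, vocab_size).

    vocab:      mapping of token string -> integer id (e.g. from tokenizer.json)
    vocab_size: the model's declared vocab size (e.g. from config.json)
    """
    return {tok: idx for tok, idx in vocab.items()
            if idx < 0 or idx >= vocab_size}

# Hypothetical example: a 64000-entry vocab with two stray added tokens,
# mirroring the kind of off-by-the-end IDs reported in this thread.
vocab = {f"tok{i}": i for i in range(64000)}
vocab["extra_token_a"] = 64000
vocab["extra_token_b"] = 64001

bad = find_out_of_range_tokens(vocab, vocab_size=64000)
print(sorted(bad.values()))  # [64000, 64001]
```

If this returns anything, the GGUF conversion script will reject the model, so it is worth running right after the merge finishes.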
Naozumi520 commented Oct 3, 2024

Hi! What do you mean by a broken tokenizer? I'm not sure whether mine is broken: after merging, I see tokens like "<|unused115|>" and "<|unused026|>" in the generated text.

@Asherathe (Author)

> Hi! What do you mean by a broken tokenizer? I'm not sure whether mine is broken: after merging, I see tokens like "<|unused115|>" and "<|unused026|>" in the generated text.

I was unable to convert the model to GGUF and quantize it because of an error about token IDs being out of range: the tokenizer contained tokens numbered 64000 and 64001 when the maximum valid ID was 63999.
I was finally able to fix the problem by redoing the merge with the added parameter "embed_slerp=true".

I too see a lot of unused tokens in the config, but I don't know if that's anything to worry about. So far, I haven't seen these show up in generated text.
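For reference, here is the workaround above expressed as a config fragment. This is a sketch only: I'm assuming embed_slerp sits in the top-level parameters block alongside the other options, so check the mergekit tokenizer documentation before relying on this placement.

```yaml
tokenizer_source: base
parameters:
  int8_mask: true
  normalize: false
  embed_slerp: true  # workaround reported in this thread for out-of-range token IDs
```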

@cg123 (Collaborator) commented Oct 5, 2024

After merging in #430, I'm able to merge the config you posted and successfully quantize the output model. Please do let me know if this recurs or you run into any similar problems!
