Qwen2.5 14B models are ... sometimes? ... having their token vocabulary truncated down to 'actual'? #425
Comments
I can look at adding an option for padding the size up to the nearest multiple of 32 if that's causing an issue.
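For reference, a minimal sketch of the round-up arithmetic (plain Python; the function name is just illustrative):

```python
import math

def pad_vocab_size(actual_vocab: int, multiple: int) -> int:
    """Round a vocabulary size up to the nearest multiple."""
    return math.ceil(actual_vocab / multiple) * multiple

# Qwen2.5's real tokenizer size is 151665. Padding to a multiple of 32
# gives 151680, while a multiple of 512 lands exactly on the 152064
# that ships in config.json.
print(pad_vocab_size(151665, 32))   # 151680
print(pad_vocab_size(151665, 512))  # 152064
```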
That would be a helpful option -- it's causing some downstream effects in other tooling (like tripping unsloth patching that isn't fully calibrated to the model type, for some reason) and preventing merges with other Qwen 2.5 models.
I've added this option in #465.
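For a merged checkpoint that has already been produced at the truncated size, one workaround sketch uses the Hugging Face transformers API directly (the path is a placeholder, and this is independent of whatever the option in #465 is called):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/merged-model"  # hypothetical local path to the merged checkpoint
model = AutoModelForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

# resize_token_embeddings pads the embedding matrix (and the lm_head, when
# present) up to the requested multiple; 512 takes 151665 back to 152064.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=512)

model.save_pretrained(path + "-padded")
tokenizer.save_pretrained(path + "-padded")
```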
Early indications are that it's working! Merging two models that were at the truncated size brought it back up to 152064, and the result evaluates well. If those extra rows were just padding in the first place, it should be fine.
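One way to sanity-check the "just padding" assumption is to inspect the embedding rows beyond the tokenizer's real vocabulary (a sketch, again with a placeholder path):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/merged-model"  # hypothetical
model = AutoModelForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

emb = model.get_input_embeddings().weight
extra = emb[len(tokenizer):]          # rows past the real vocabulary
print(emb.shape[0], len(tokenizer))   # e.g. 152064 vs 151665
print(extra.abs().max().item())       # near zero if the rows really are padding
```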
Have you merged any glm4 models?
I have not tried merging any glm4 models. It looks like they have a `padded_vocab_size` rather than a `vocab_size`?
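A quick way to see which field a given config actually exposes; the repo id here is only an assumed example, and GLM-4 checkpoints ship custom modeling code, hence `trust_remote_code`:

```python
from transformers import AutoConfig

# "THUDM/glm-4-9b-chat" is used purely as an assumed example repo.
cfg = AutoConfig.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
print(getattr(cfg, "padded_vocab_size", None))  # present on GLM-style configs?
print(getattr(cfg, "vocab_size", None))         # present on most other configs
```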
Actual example of a merge that produced this issue:
Additional relevant information is that if I get the tokenizer vocab size with `tokenizer_vocab_size = len(tokenizer)` from ... any Qwen 2.5 14B model, I get the `151665` number rather than the `152064` number that's in the config.json.

I don't fully understand why it's trimming the vocabulary size and embedding layer down in this merge method but none of the others, but it's annoying for compatibility, and specifying the `tokenizer_source` doesn't seem to address the issue (presumably because the tokenizer doesn't actually have 152064 worth of vocabulary).
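A minimal reproduction of the size mismatch, assuming the hub id Qwen/Qwen2.5-14B-Instruct stands in for "any Qwen 2.5 14B model":

```python
from transformers import AutoConfig, AutoTokenizer

repo = "Qwen/Qwen2.5-14B-Instruct"  # assumed example checkpoint
cfg = AutoConfig.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

print(cfg.vocab_size)   # 152064 -- padded embedding-row count from config.json
print(len(tokenizer))   # 151665 -- tokens the tokenizer actually defines
```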