
Include eos_token and bos_token from tokenizer_config.json for chat templating #5040

Open · abetlen opened this issue Jan 19, 2024 · 12 comments
Labels: enhancement, help wanted

Comments

@abetlen
Collaborator

abetlen commented Jan 19, 2024

Feature Description

Include (at minimum) the eos_token and bos_token keys from the huggingface tokenizer_config.json as gguf metadata keys.

Motivation

For chat models these differ from the normal eos and bos tokens and are required to stop the model from generating user message tokens.

For example, to correctly stop generation for chatml chat models you need to watch for the <|im_end|> token specified in the tokenizer_config.json linked below.

https://huggingface.co/mlabonne/NeuralBeagle14-7B/blob/main/tokenizer_config.json
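For reference, the relevant keys in a chatml-style tokenizer_config.json look roughly like this (abridged illustration, not a verbatim copy of the linked file; the bos_token value is an assumption for a Mistral-based model):

```json
{
  "bos_token": "<s>",
  "eos_token": "<|im_end|>",
  ...
}
```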

However, the gguf metadata doesn't include this token.


Using the chat_format alone without the eos_token causes generation to continue incorrectly past the end of the assistant response.


Possible Implementation

I'm not too familiar with the gguf-py package, but I think adding these to the SpecialVocab class and updating the _try_load_from_tokenizer_json method would be a start. I'm not sure what the correct gguf keys for these should be, so as to not create confusion.
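As a very rough sketch of the conversion-side piece (the helper and the gguf key names here are made up for illustration, not the final keys):

```python
import json

# Hypothetical helper, illustration only: pull the chat-template special tokens
# out of tokenizer_config.json so they could be written as extra gguf metadata.
def load_chat_special_tokens(path: str = "tokenizer_config.json") -> dict[str, str]:
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    tokens: dict[str, str] = {}
    for key in ("bos_token", "eos_token"):
        entry = config.get(key)
        # Entries are either a plain string or a dict with a "content" field.
        if isinstance(entry, dict):
            entry = entry.get("content")
        if isinstance(entry, str):
            # The metadata key name below is made up; the real key is still to be decided.
            tokens[f"tokenizer.chat.{key}"] = entry
    return tokens
```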

abetlen added the enhancement label Jan 19, 2024
@slaren
Collaborator

slaren commented Jan 19, 2024

So if I understand correctly, the issue is that the tokenizer_config.json file of HF models may contain the keys eos_token and bos_token that are meant to be used with the chat template, and these are not the same tokens as the ones defined in config.json with the keys bos_token_id and eos_token_id. Is there any documentation about any of this?

@vriesdemichael
Contributor

vriesdemichael commented Jan 22, 2024

Okay so the documentation is not exactly clear on this subject.

Some models have a clear mapping, with eos_token_id/bos_token_id in generation_config.json matching both the bos_token/eos_token keys and the added tokens in tokenizer_config.json.

Others, such as phi-2, do not:
config.json ("eos_token_id": null, "bos_token_id": null)
generation_config.json (no bos/eos token info)
tokenizer_config.json (sets eos_token and bos_token)
special_tokens_map.json (adds bos/eos/unk)

It looks like the tokenizer.apply_chat_template function expects all template variables to be present in the tokenizer config.
The docstring in the tokenizer init implies that bos_token_id and bos_token should match.
However, the bos_token will only be loaded if it is present in the tokenizer_config.json (I think). If the key is not present, the apply_chat_template function will not work.

I guess that means some models are doomed to have this shortcoming because of the way they were configured.
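A quick way to see this divergence for yourself (the repo id is just the phi-2 example discussed above):

```python
import json
from huggingface_hub import hf_hub_download

repo = "microsoft/phi-2"  # example discussed above
for name in ("config.json", "generation_config.json",
             "tokenizer_config.json", "special_tokens_map.json"):
    try:
        with open(hf_hub_download(repo, name), encoding="utf-8") as f:
            cfg = json.load(f)
        # Print only the token-related keys from each file.
        print(name, {k: v for k, v in cfg.items() if "token" in k})
    except Exception as err:  # not every repo ships every file
        print(name, "not available:", err)
```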

@vriesdemichael
Contributor

The problem in this issue is partly due to the logic in the GGUF writer.

The eos_token in the tokenizer config is not always written to the GGUF metadata because of a conditional:

```python
for typ in self.special_token_types:
    add_entry = tokenizer_config.get(f'add_{typ}_token')
    if isinstance(add_entry, bool):
        self.add_special_token[typ] = add_entry
    if not added_tokens:
        # We will need this to get the content for the token, so if it's empty
        # may as well just give up.
        continue
    entry = tokenizer_config.get(f'{typ}_token')
    if isinstance(entry, str):
        tc_content = entry
    elif isinstance(entry, dict):
        entry_content = entry.get('content')
        if not isinstance(entry_content, str):
            continue
        tc_content = entry_content
    else:
        continue
    # We only need the first match here.
    maybe_token_id = next(
        (atok.get('id') for atok in added_tokens if atok.get('content') == tc_content),
        None,
    )
    self._set_special_token(typ, maybe_token_id)
```

When add_eos_token is set to false, the eos_token key is skipped because of the continue statement.
I think the assumption was made that when add_eos_token is false, the eos_token would be useless.
This config description is ambiguous.
This is what I make of it based on the llama tokenizer: The eos_token is added at the end of the templated input when add_eos_token is set to true.
But with the description in the doc strings, I can imagine that some model creators interpreted it as whether to add the eos_token at all.

For this issue it would help if the conditional was skipped.
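To make that concrete, here is a rough sketch (not the actual gguf-py code; the helper name is made up) of a lookup that always records the textual token and only uses add_{typ}_token to decide whether it should be auto-appended:

```python
# Hypothetical, illustrative only: record the textual {typ}_token regardless of
# add_{typ}_token, and resolve its id from added_tokens when possible.
def extract_special_token(tokenizer_config: dict, added_tokens: list, typ: str):
    entry = tokenizer_config.get(f'{typ}_token')
    if isinstance(entry, dict):      # e.g. "eos_token": {"content": "<|im_end|>", ...}
        entry = entry.get('content')
    if not isinstance(entry, str):   # key absent or malformed
        return None, None
    token_id = next(
        (atok.get('id') for atok in added_tokens if atok.get('content') == entry),
        None,
    )
    return entry, token_id           # keep the string even if no id was found
```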

@Rocketknight1

Hi all, I'm the developer at Hugging Face who designed the original spec and implementation for our chat templates - I got linked here from abetlen/llama-cpp-python#1096. Just to clarify the situation with special tokens like bos_token and eos_token: chat templates read these from the tokenizer, which loads them from tokenizer_config.json, rather than generation_config.json.

Chat template handling all happens inside the apply_chat_template function. The key line is here - the chat template is rendered with the conversation history so far, the add_generation_prompt bool, and tokenizer.special_tokens_map, which we splat into kwargs.

If you follow the definition for special_tokens_map, you'll see that it contains the following keys:

"bos_token"
"eos_token"
"unk_token"
"sep_token"
"pad_token"
"cls_token"
"mask_token"
"additional_special_tokens"

With the exception of additional_special_tokens, which is rarely used, all of these should just be a string. If you add those string values to the GGUF metadata you should have everything you need to render the templates correctly. Note that the string values you see here should always be consistent with the token IDs like tokenizer.bos_token_id. This is because attributes like bos_token_id are actually @property methods, and read self.bos_token to determine their value when called.
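A minimal illustration of what this looks like through the public transformers API (the repo id is just the one from the issue description; any chat-tuned model with a chat_template behaves the same way):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlabonne/NeuralBeagle14-7B")

# These string values come from tokenizer_config.json and are what the
# template sees; the *_id attributes are derived from them.
print(tokenizer.special_tokens_map)
print(tokenizer.eos_token, tokenizer.eos_token_id)

messages = [{"role": "user", "content": "Hello!"}]
# The template is rendered with the conversation, add_generation_prompt,
# and the special token strings.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```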

@teleprint-me
Contributor

teleprint-me commented Jan 24, 2024

I think the real issue here is, as @vriesdemichael eloquently pointed out, that some models don't have a definitively unique BOS or EOS token.

We could read from the tokenizer_config.json to get some idea of what the tokens might have been intended to be, but this raises an entirely new problem.

For example,

  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"

Phi-2 has the same token defined for BOS, EOS, and UNK. This isn't unique to Phi either.

Maybe it's better to read it from the tokenizer.json? Another issue arises when the script relies upon tokenizer.model and tokenizer_config.json isn't available.

For example, the GGUFWriter writes the vocab to the output model file in convert.py.

A thought I had is to simply use <s>, </s>, and <unk> as fallbacks if they're identical or undefined.
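A quick sketch of that fallback idea (the token strings and the rule for when to fall back are just the suggestion above, not anything gguf-py currently does):

```python
# Hypothetical fallback: if BOS/EOS/UNK are missing or all collapse to the same
# string (as with phi-2's "<|endoftext|>"), substitute sentencepiece-style defaults.
DEFAULTS = {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"}

def with_fallbacks(tokens: dict) -> dict:
    values = [tokens.get(k) for k in DEFAULTS]
    degenerate = any(v is None for v in values) or len(set(values)) == 1
    return dict(DEFAULTS) if degenerate else {k: tokens[k] for k in DEFAULTS}

print(with_fallbacks({"bos_token": "<|endoftext|>",
                      "eos_token": "<|endoftext|>",
                      "unk_token": "<|endoftext|>"}))
# -> {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
```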

@slaren and @ggerganov would have more of an idea of how this is processed under the hood though. I'm still trying to figure it out.

ggerganov added the help wanted label Jan 25, 2024
@Rocketknight1

Hi @teleprint-me, I wrote a bit about this here: abetlen/llama-cpp-python#1096 (comment)

The tl;dr is that models should set tokenizer.eos_token to be the end-of-generation token, but many don't, and in some cases the end-of-generation string isn't added as a special token, which means it's tokenized as multiple tokens (like ["<|", "im_", "end", "|>"]). In those cases, there's not much you can do.

The principled approach would be to just read tokenizer.eos_token and use that, and if that doesn't work then just give up. A hacky solution that would probably work well in practice would be to search the chat template string for common ending strings like [/INST] or </s> or <|im_end|> and use those if they're present.
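A rough sketch of that hacky fallback (the candidate list below is just the examples from this comment, and the idea of scanning the chat template text is the suggestion above, not an existing API):

```python
# Hacky fallback: scan the raw chat template text for common end-of-turn markers.
COMMON_STOP_STRINGS = ["<|im_end|>", "</s>", "[/INST]"]

def guess_stop_strings(chat_template: str) -> list[str]:
    return [s for s in COMMON_STOP_STRINGS if s in chat_template]

# e.g. guess_stop_strings(tokenizer.chat_template) -> ['<|im_end|>'] for a chatml model
```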

@vriesdemichael
Contributor

@Rocketknight1 Thank you for that clarification!

It would be incredibly helpful if the transformers lib offered some sort of expected structure for these various config files.
I get that some level of flexibility is required, but it would be nice for 3rd-party implementations like this project to have a guarantee that certain fields are co-dependent, required, optional, etc.

Most config formats do this with a JSON schema, which also makes it very easy to validate any given config file.

@vriesdemichael
Contributor

Still doesn't fix the problem with misconfigured models. But I guess that is best left to the model creators.

@Rocketknight1

I think that might be difficult, given the diversity of models in the library!

However, I do have something that might be interesting to you - we're working on allowing models to store lists of arbitrary stop strings. When this feature is implemented, and when (if) users start adding it to their repos, you can hopefully just read that field to resolve the issue here! huggingface/transformers#28932

@teleprint-me
Contributor

@vriesdemichael

Implementations typically use sentencepiece, which requires the user to define BOS, EOS, UNK, PAD, etc.

Users won't always (and often don't) define these tokens. The phi model by Microsoft is a perfect example of this, where only the <|endoftext|> custom token is defined.

This isn't a fix huggingface can take care of, although I can see potential streamlining and guidance helping with the issue.

The only reasonable expectation I can think of is suggesting users utilize some kind of consistent interface, but this cannot and will not be guaranteed (I can see some users rolling their eyes at this while others will comply).

This doesn't account for the variety of implementations there are for tokenizers either.

@vriesdemichael
Contributor

I understand the complexity and definitely don't expect older models to adapt.

I merely suggest taking the reins for future implementations, as huggingface has done with chat templates in the tokenizer.

At some point in the future, a reliable set of params enforced by hf would be desirable for downstream implementations such as llama.cpp, gptq, etc.

Before derailing this discussion any further: I believe the problems that can be solved in the gguf-py code are fixed, which means this issue can be closed.


This issue is stale because it has been open for 30 days with no activity.
