
Include eos_token and bos_token from tokenizer_config.json for chat templating #5040

Open · abetlen opened this issue Jan 19, 2024 · 12 comments
Labels: enhancement, help wanted

Comments

@abetlen
Collaborator

abetlen commented Jan 19, 2024

Feature Description

Include (at minimum) the eos_token and bos_token keys from the huggingface tokenizer_config.json as gguf metadata keys.

Motivation

For chat models these differ from the normal eos and bos tokens and are required to stop the model from generating user message tokens.

For example, to correctly stop generation for chatml chat models you need to watch for the <|im_end|> token specified in the tokenizer_config.json linked below.

https://huggingface.co/mlabonne/NeuralBeagle14-7B/blob/main/tokenizer_config.json
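For reference, the relevant keys in a chatml-style tokenizer_config.json look roughly like this (abridged illustration, not a verbatim copy of the linked file; the bos_token value is an assumption for a Mistral-based model):

```json
{
  "bos_token": "<s>",
  "eos_token": "<|im_end|>",
  ...
}
```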

However, the gguf metadata doesn't include this token.


Using the chat_format alone without the eos_token causes generation to continue incorrectly past the end of the assistant response.


Possible Implementation

I'm not too familiar with the gguf-py package, but I think adding these to the SpecialVocab class and updating the _try_load_from_tokenizer_json method would be a start. I'm not sure what the correct gguf keys for these should be, so as to not create confusion.
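As a very rough sketch of the conversion-side piece (the helper and the gguf key names here are made up for illustration, not the final keys):

```python
import json

# Hypothetical helper, illustration only: pull the chat-template special tokens
# out of tokenizer_config.json so they could be written as extra gguf metadata.
def load_chat_special_tokens(path: str = "tokenizer_config.json") -> dict[str, str]:
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    tokens: dict[str, str] = {}
    for key in ("bos_token", "eos_token"):
        entry = config.get(key)
        # Entries are either a plain string or a dict with a "content" field.
        if isinstance(entry, dict):
            entry = entry.get("content")
        if isinstance(entry, str):
            # The metadata key name below is made up; the real key is still to be decided.
            tokens[f"tokenizer.chat.{key}"] = entry
    return tokens
```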

abetlen added the enhancement label Jan 19, 2024
@slaren
Collaborator

slaren commented Jan 19, 2024

So if I understand correctly, the issue is that the tokenizer_config.json file of HF models may contain the keys eos_token and bos_token that are meant to be used with the chat template, and these are not the same tokens as the ones defined in config.json with the keys bos_token_id and eos_token_id. Is there any documentation about any of this?

@vriesdemichael
Contributor

vriesdemichael commented Jan 22, 2024

Okay so the documentation is not exactly clear on this subject.

Some models have a clear mapping, with eos_token_id/bos_token_id in generation_config.json matching both the bos_token/eos_token keys and the added tokens in tokenizer_config.json.

Others, such as phi-2, do not:
config.json ("eos_token_id": null, "bos_token_id": null)
generation_config.json (no bos/eos token info)
tokenizer_config.json (sets eos_token and bos_token)
special_tokens_map.json (adds bos/eos/unk)

It looks like the tokenizer.apply_chat_template function expects all template variables to be present in the tokenizer config.
The docstring in the tokenizer init implies that bos_token_id and bos_token should match.
However, the bos_token will only be loaded if it is present in the tokenizer_config.json (I think). If the key is not present, the apply_chat_template function will not work.

I guess that means some models are doomed to have this shortcoming because of the way they were configured.
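A quick way to see this divergence for yourself (the repo id is just the phi-2 example discussed above):

```python
import json
from huggingface_hub import hf_hub_download

repo = "microsoft/phi-2"  # example discussed above
for name in ("config.json", "generation_config.json",
             "tokenizer_config.json", "special_tokens_map.json"):
    try:
        with open(hf_hub_download(repo, name), encoding="utf-8") as f:
            cfg = json.load(f)
        # Print only the token-related keys from each file.
        print(name, {k: v for k, v in cfg.items() if "token" in k})
    except Exception as err:  # not every repo ships every file
        print(name, "not available:", err)
```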

@vriesdemichael
Contributor

The problem in this issue is partly due to the logic in the GGUF writer.

The eos_token in the tokenizer config is not always written to the GGUF metadata because of a conditional:

```python
for typ in self.special_token_types:
    add_entry = tokenizer_config.get(f'add_{typ}_token')
    if isinstance(add_entry, bool):
        self.add_special_token[typ] = add_entry
    if not added_tokens:
        # We will need this to get the content for the token, so if it's empty
        # may as well just give up.
        continue
    entry = tokenizer_config.get(f'{typ}_token')
    if isinstance(entry, str):
        tc_content = entry
    elif isinstance(entry, dict):
        entry_content = entry.get('content')
        if not isinstance(entry_content, str):
            continue
        tc_content = entry_content
    else:
        continue
    # We only need the first match here.
    maybe_token_id = next(
        (atok.get('id') for atok in added_tokens if atok.get('content') == tc_content),
        None,
    )
    self._set_special_token(typ, maybe_token_id)
```

When add_eos_token is set to false, the eos_token key is skipped because of the continue statement.
I think the assumption was made that when add_eos_token is false, the eos_token would be useless.
This config description is ambiguous.
This is what I make of it based on the llama tokenizer: The eos_token is added at the end of the templated input when add_eos_token is set to true.
But with the description in the doc strings, I can imagine that some model creators interpreted it as whether to add the eos_token at all.

For this issue it would help if the conditional was skipped.
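To make that concrete, here is a rough sketch (not the actual gguf-py code; the helper name is made up) of a lookup that always records the textual token and only uses add_{typ}_token to decide whether it should be auto-appended:

```python
# Hypothetical, illustrative only: record the textual {typ}_token regardless of
# add_{typ}_token, and resolve its id from added_tokens when possible.
def extract_special_token(tokenizer_config: dict, added_tokens: list, typ: str):
    entry = tokenizer_config.get(f'{typ}_token')
    if isinstance(entry, dict):      # e.g. "eos_token": {"content": "<|im_end|>", ...}
        entry = entry.get('content')
    if not isinstance(entry, str):   # key absent or malformed
        return None, None
    token_id = next(
        (atok.get('id') for atok in added_tokens if atok.get('content') == entry),
        None,
    )
    return entry, token_id           # keep the string even if no id was found
```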

@Rocketknight1

Hi all, I'm the developer at Hugging Face who designed the original spec and implementation for our chat templates - I got linked here from abetlen/llama-cpp-python#1096. Just to clarify the situation with special tokens like bos_token and eos_token: chat templates read these from the tokenizer, which loads them from tokenizer_config.json, rather than generation_config.json.

Chat template handling all happens inside the apply_chat_template function. The key line is here - the chat template is rendered with the conversation history so far, the add_generation_prompt bool, and tokenizer.special_tokens_map, which we splat into kwargs.

If you follow the definition for special_tokens_map, you'll see that it contains the following keys:

"bos_token"
"eos_token"
"unk_token"
"sep_token"
"pad_token"
"cls_token"
"mask_token"
"additional_special_tokens"

With the exception of additional_special_tokens, which is rarely used, all of these should just be a string. If you add those string values to the GGUF metadata you should have everything you need to render the templates correctly. Note that the string values you see here should always be consistent with the token IDs like tokenizer.bos_token_id. This is because attributes like bos_token_id are actually @property methods, and read self.bos_token to determine their value when called.
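A minimal illustration of what this looks like through the public transformers API (the repo id is just the one from the issue description; any chat-tuned model with a chat_template behaves the same way):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlabonne/NeuralBeagle14-7B")

# These string values come from tokenizer_config.json and are what the
# template sees; the *_id attributes are derived from them.
print(tokenizer.special_tokens_map)
print(tokenizer.eos_token, tokenizer.eos_token_id)

messages = [{"role": "user", "content": "Hello!"}]
# The template is rendered with the conversation, add_generation_prompt,
# and the special token strings.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```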

@teleprint-me
Contributor

teleprint-me commented Jan 24, 2024

I think the real issue here is, as @vriesdemichael eloquently pointed out, that some models don't have a definitively unique BOS or EOS token.

We could read from the tokenizer_config.json to get some idea of what the tokens might have been intended to be, but this raises an entirely new problem.

For example,

  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"

Phi-2 has the same token defined for BOS, EOS, and UNK. This isn't unique to Phi either.

Maybe it's better to read it from the tokenizer.json? Another issue arises when the script relies upon tokenizer.model and tokenizer_config.json isn't available.

For example, the GGUFWriter writes the vocab to the output model file in convert.py.

A thought I had is to simply use <s>, </s>, and <unk> as fallbacks if they're identical or undefined.
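A quick sketch of that fallback idea (the token strings and the rule for when to fall back are just the suggestion above, not anything gguf-py currently does):

```python
# Hypothetical fallback: if BOS/EOS/UNK are missing or all collapse to the same
# string (as with phi-2's "<|endoftext|>"), substitute sentencepiece-style defaults.
DEFAULTS = {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"}

def with_fallbacks(tokens: dict) -> dict:
    values = [tokens.get(k) for k in DEFAULTS]
    degenerate = any(v is None for v in values) or len(set(values)) == 1
    return dict(DEFAULTS) if degenerate else {k: tokens[k] for k in DEFAULTS}

print(with_fallbacks({"bos_token": "<|endoftext|>",
                      "eos_token": "<|endoftext|>",
                      "unk_token": "<|endoftext|>"}))
# -> {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
```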

@slaren and @ggerganov would have more of an idea of how this is processed under the hood though. I'm still trying to figure it out.

ggerganov added the help wanted label Jan 25, 2024
@Rocketknight1

Hi @teleprint-me, I wrote a bit about this here: abetlen/llama-cpp-python#1096 (comment)

The tl;dr is that models should set tokenizer.eos_token to be the end-of-generation token, but many don't, and in some cases the end-of-generation string isn't added as a special token, which means it's tokenized as multiple tokens (like ["<|", "im_", "end", "|>"]). In those cases, there's not much you can do.

The principled approach would be to just read tokenizer.eos_token and use that, and if that doesn't work then just give up. A hacky solution that would probably work well in practice would be to search the chat template string for common ending strings like [/INST] or </s> or <|im_end|> and use those if they're present.
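A rough sketch of that hacky fallback (the candidate list below is just the examples from this comment, and the idea of scanning the chat template text is the suggestion above, not an existing API):

```python
# Hacky fallback: scan the raw chat template text for common end-of-turn markers.
COMMON_STOP_STRINGS = ["<|im_end|>", "</s>", "[/INST]"]

def guess_stop_strings(chat_template: str) -> list[str]:
    return [s for s in COMMON_STOP_STRINGS if s in chat_template]

# e.g. guess_stop_strings(tokenizer.chat_template) -> ['<|im_end|>'] for a chatml model
```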

@vriesdemichael
Contributor

@Rocketknight1 Thank you for that clarification!

It would be incredibly helpful if the transformers lib offered some sort of expected structure for these various config files.
I get that some level of flexibility is required, but it would be nice for 3rd-party implementations like this project to have a guarantee that certain fields are co-dependent, required, optional, etc.

Most config formats do this with a JSON schema, which also makes it very easy to validate any given config file.

@vriesdemichael
Contributor

Still doesn't fix the problem with misconfigured models. But I guess that is best left to the model creators.

@Rocketknight1

I think that might be difficult, given the diversity of models in the library!

However, I do have something that might be interesting to you - we're working on allowing models to store lists of arbitrary stop strings. When this feature is implemented, and when (if) users start adding it to their repos, you can hopefully just read that field to resolve the issue here! huggingface/transformers#28932

@teleprint-me
Contributor

@vriesdemichael

Implementations typically use sentencepiece, which requires the user to define BOS, EOS, UNK, PAD, etc.

Users won't always (and often don't) define these tokens. The phi model by Microsoft is a perfect example of this, where only the <|endoftext|> custom token is defined.

This isn't a fix huggingface can take care of, although I can see potential streamlining and guidance helping with the issue.

The only reasonable expectation I can think of is suggesting users utilize some kind of consistent interface, but this cannot and will not be guaranteed (I can see some users rolling their eyes at this while others will comply).

This doesn't account for the variety of implementations there are for tokenizers either.

@vriesdemichael
Contributor

I understand the complexity and definitely don't expect older models to adapt.

I merely suggest taking the reins for future implementations, as huggingface has done with chat templates in the tokenizer.

At some point in the future, a reliable set of params enforced by hf would be desirable for downstream implementations such as llama.cpp, gptq, etc.

Before derailing this discussion any further: I believe the problems that can be solved in the gguf-py code are fixed, which means this issue can be closed.


This issue is stale because it has been open for 30 days with no activity.
