How do I replace spare tokens? #31475

Open
kouyakamada opened this issue Jun 18, 2024 · 13 comments
Labels
Feature request Request for a new feature

Comments

@kouyakamada

kouyakamada commented Jun 18, 2024

System Info

I want to SFT Mistral-v0.3 with my own chat template.
So I followed this comment and replaced some [control_n] tokens with special tokens for the chat template.
However, the new tokens were simply appended and the vocabulary size increased.
Is there any way to replace existing vocabulary entries instead?

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

tokenizer.json

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    {
      "id": 10,
      "content": "<|system|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 11,
      "content": "<|user|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 12,
      "content": "<|assistant|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 13,
      "content": "<|eot|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

tokenizer_config.json

{
  "add_bos_token": true,
  "add_eos_token": false,
  "add_prefix_space": true,
  "added_tokens_decoder": {
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    "10": {
          "content": "<|system|>",
          "lstrip": false,
          "normalized": false,
          "rstrip": false,
          "single_word": false,
          "special": true
        },
        "11": {
          "content": "<|user|>",
          "lstrip": false,
          "normalized": false,
          "rstrip": false,
          "single_word": false,
          "special": true
        },
        "12": {
          "content": "<|assistant|>",
          "lstrip": false,
          "normalized": false,
          "rstrip": false,
          "single_word": false,
          "special": true
        },
        "13": {
          "content": "<|eot|>",
          "lstrip": false,
          "normalized": false,
          "rstrip": false,
          "single_word": false,
          "special": true
        },
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
}

test code

from pprint import pprint
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_dir)
pprint(tokenizer.added_tokens_decoder)

output

~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 768: AddedToken("[control_766]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 769: AddedToken("[control_767]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 770: AddedToken("[control_768]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32768: AddedToken("<|system|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32769: AddedToken("<|user|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32770: AddedToken("<|assistant|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32771: AddedToken("<|eot|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)}

Expected behavior

[control_n] tokens can be replaced with arbitrary new tokens without increasing the vocabulary size.

@ArthurZucker
Collaborator

cc @itazap, that would indeed be a good addition! More and more people pre-allocate some tokens, and we don't have a way to replace a token.

@ArthurZucker ArthurZucker added the Feature request Request for a new feature label Jun 19, 2024
@ArthurZucker
Collaborator

PS: you can already replace tokens directly in the vocab and the added_vocab (since these tokens are part of both).
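
A minimal sketch of that manual replacement, assuming the tokenizer was first saved locally with save_pretrained and that tokenizer.json has the usual BPE layout (a top-level "added_tokens" list plus "model" -> "vocab"); the path and token strings below are placeholders:

import json

# Hypothetical example: rename "[control_8]" (id 10) to "<|system|>" in place.
path = "path/to/saved_tokenizer/tokenizer.json"  # placeholder path
with open(path, encoding="utf-8") as f:
    data = json.load(f)

old, new = "[control_8]", "<|system|>"

# 1) "added_tokens": special/added tokens are listed here with their ids.
for entry in data["added_tokens"]:
    if entry["content"] == old:
        entry["content"] = new

# 2) "model" -> "vocab": the same string maps to the same id in the base vocabulary.
data["model"]["vocab"][new] = data["model"]["vocab"].pop(old)

with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

tokenizer_config.json's "added_tokens_decoder" needs the same rename, as described further down the thread.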

@lee-onidas

Hey @ArthurZucker,

I tried replacing a token in the vocab (not the added_tokens) in the tokenizer.json file. But when I try to load the tokenizer back up with new_tokenizer = AutoTokenizer.from_pretrained('path/to/tokenizer'), I get the following error: "Exception: data did not match any variant of untagged enum ModelWrapper at line 356367 column 3"

Do you know what the problem might be?

@lee-onidas

You can ignore this, sorry. I found the issue: if you change the vocab in any way, you also need to update the merges accordingly.
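
A rough sketch of that merges check, assuming the same tokenizer.json layout as above. Control tokens usually do not appear in any merge rule, but renaming or removing an ordinary vocab entry can leave dangling merges, which is what triggers the ModelWrapper error; the helper name below is made up:

import json

def prune_dangling_merges(tokenizer_json_path, removed_token):
    """Drop merge rules that still refer to a token removed from the vocab (illustrative only)."""
    with open(tokenizer_json_path, encoding="utf-8") as f:
        data = json.load(f)

    def refers_to(merge):
        # Merges are serialized either as "left right" strings or as [left, right] pairs.
        left, right = merge if isinstance(merge, list) else merge.split(" ", 1)
        return removed_token in (left, right, left + right)

    data["model"]["merges"] = [m for m in data["model"]["merges"] if not refers_to(m)]

    with open(tokenizer_json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)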

@ArthurZucker
Collaborator

huggingface/tokenizers#1570 should help

@itshuey

itshuey commented Aug 25, 2024

PS: you can already replace directly in the vocab and the added_vocab (since there tokens are part of both)

Hi @ArthurZucker, do you mind elaborating on this? I'm experiencing the same issue as OP after modifying tokenizer.json and tokenizer_config.json. After loading the local tokenizer, I am unable to reassign or delete any entries in tokenizer.vocab manually. For example, del tokenizer.vocab['<token-to-replace>'] does not have any effect. I'm also unsure how to modify added_vocab.

I would love to see this feature go through!

@itazap
Collaborator

itazap commented Aug 26, 2024

Hi @itshuey, can you please share the model and token you are attempting this with (in a short snippet would be great!) so I can take a look? 😊

@itshuey

itshuey commented Aug 26, 2024

Sure @itazap, thank you. I am using the Mistral-7B-v0.3 tokenizer.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.3')
>>> tokenizer.vocab['[control_8]']
10
>>> del tokenizer.vocab['[control_8]']
>>> tokenizer.vocab['[control_8]']
10

Reassignment is also futile.

@itazap
Collaborator

itazap commented Aug 26, 2024

Hi @itshuey, indeed this isn't fully supported with AutoTokenizer because it reads the tokenizer.model file, which can't be modified manually. However, you should be able to remove/update tokens if you use

tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)

instead of AutoTokenizer. Let me know if this works for your use case! 😊

@itshuey

itshuey commented Aug 27, 2024

I'm getting this error when I try to instantiate the PreTrainedTokenizerFast object

>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LlamaTokenizer'.
The class this function is called from is 'PreTrainedTokenizerFast'.

Regardless, I get the same result from trying to modify tokenizer.vocab:

>>> tokenizer.vocab['[control_8]']
10
>>> del tokenizer.vocab['[control_8]']
>>> tokenizer.vocab['[control_8]']
10

Taking the tip into account, I tried using LlamaTokenizer to load my tokenizer. Since get_vocab relies on sentencepiece, I tried to use tokenizer.sp_model.set_vocabulary(), but I couldn't figure out what the valid_vocab list parameter was meant to be. I am hoping there's a transformers-based solution to replace unused control tokens associated with specific IDs (in my case 10) without increasing the vocabulary size.

@itazap
Collaborator

itazap commented Aug 27, 2024

Sorry, I wasn't very clear. You are correct that .vocab cannot be modified programmatically, and using del or pop will not work. Updating or deleting tokens would be a new feature and would be supported by the change Arthur linked. The error text you are getting is only a warning; it should not fail.

For now, you can do this by:

  1. Loading the model with PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')
  2. Saving the model and locating the model folder with tokenizer.json and tokenizer_config.json files.
  3. Manually editing the tokenizer.json file (note: there are 2 changes needed in this json: in "added_tokens" and in "vocab") and the tokenizer_config.json file (1 change in "added_tokens_decoder"); a sketch of the tokenizer_config.json edit is included at the end of this comment.
  4. Loading the model from the local folder you modified.
tokenizer = PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')
tokenizer.vocab['[control_8]'] # 10
tokenizer.save_pretrained(temp_folder)

# Pause here and edit both tokenizer.json and tokenizer_config.json. Example below changed control_8 to control_NEW
tokenizer_reloaded = PreTrainedTokenizerFast.from_pretrained(temp_folder)
tokenizer_reloaded.vocab['[control_NEW]'] # 10
tokenizer_reloaded.added_tokens_decoder[10] # 'AddedToken("[control_NEW]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)'
tokenizer_reloaded.vocab['[control_8]'] # raises KeyError

Please let me know if you are able to reproduce!
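
For completeness, a small sketch of the tokenizer_config.json half of step 3 (the tokenizer.json half is sketched earlier in this thread); the folder name and the replacement token are placeholders:

import json
import os

config_path = os.path.join("temp_folder", "tokenizer_config.json")  # folder written by save_pretrained
with open(config_path, encoding="utf-8") as f:
    config = json.load(f)

# "added_tokens_decoder" is keyed by the token id as a string, so only the
# "content" field changes; id 10 keeps its slot and the vocabulary does not grow.
config["added_tokens_decoder"]["10"]["content"] = "[control_NEW]"

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)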

@itshuey

itshuey commented Aug 27, 2024

Thank you for the clarification @itazap, modifying the configs worked perfectly! After using save_pretrained with my PreTrainedTokenizerFast tokenizer, I was able to load it locally (with the proper overwritten tokens) via AutoTokenizer as well. Really appreciate your help with this!
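
A quick way to double-check that round trip, assuming the locally edited folder from the steps above (the folder name and the replacement token are placeholders):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("temp_folder")  # folder containing the edited json files
print(tok.convert_ids_to_tokens(10))  # expected: the replacement token, e.g. "[control_NEW]"
print(len(tok))                       # vocab size should be unchanged (32768 for Mistral-7B-v0.3)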

@itazap
Collaborator

itazap commented Aug 28, 2024

Awesome, I'm glad it worked! Thanks for your patience 🤗
