How do I replace spare tokens? #31475

Open
kouyakamada opened this issue Jun 18, 2024 · 13 comments
Labels
Feature request Request for a new feature

Comments

@kouyakamada

kouyakamada commented Jun 18, 2024

System Info

I want to SFT Mistral-v0.3 with my own chat template.
So I followed this comment and replaced some [control_n] tokens with special tokens for the chat template.
However, the new tokens were simply appended and the vocabulary size increased.
Is there any way to replace existing vocabulary entries instead?

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

tokenizer.json

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    {
      "id": 10,
      "content": "<|system|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 11,
      "content": "<|user|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 12,
      "content": "<|assistant|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 13,
      "content": "<|eot|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

tokenizer_config.json

{
  "add_bos_token": true,
  "add_eos_token": false,
  "add_prefix_space": true,
  "added_tokens_decoder": {
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    "10": {
          "content": "<|system|>",
          "lstrip": false,
          "normalized": false,
          "rstrip": false,
          "single_word": false,
          "special": true
        },
        "11": {
          "content": "<|user|>",
          "lstrip": false,
          "normalized": false,
          "rstrip": false,
          "single_word": false,
          "special": true
        },
        "12": {
          "content": "<|assistant|>",
          "lstrip": false,
          "normalized": false,
          "rstrip": false,
          "single_word": false,
          "special": true
        },
        "13": {
          "content": "<|eot|>",
          "lstrip": false,
          "normalized": false,
          "rstrip": false,
          "single_word": false,
          "special": true
        },
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
}

test code

from pprint import pprint
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_dir)
pprint(tokenizer.added_tokens_decoder)

output

~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 768: AddedToken("[control_766]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 769: AddedToken("[control_767]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 770: AddedToken("[control_768]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32768: AddedToken("<|system|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32769: AddedToken("<|user|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32770: AddedToken("<|assistant|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 32771: AddedToken("<|eot|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)}

Expected behavior

[control_n] tokens can be replaced with arbitrary new tokens without increasing the vocabulary size.

@ArthurZucker
Collaborator

cc @itazap, that would indeed be a good addition! More and more people pre-allocate some tokens, and we don't have a way to replace a token.

@ArthurZucker ArthurZucker added the Feature request Request for a new feature label Jun 19, 2024
@ArthurZucker
Collaborator

PS: you can already replace tokens directly in the vocab and the added_vocab (since these tokens are part of both).
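
A minimal sketch of that manual replacement, assuming the tokenizer was first saved locally with save_pretrained and that tokenizer.json has the usual BPE layout (a top-level "added_tokens" list plus "model" -> "vocab"); the path and token strings below are placeholders:

import json

# Hypothetical example: rename "[control_8]" (id 10) to "<|system|>" in place.
path = "path/to/saved_tokenizer/tokenizer.json"  # placeholder path
with open(path, encoding="utf-8") as f:
    data = json.load(f)

old, new = "[control_8]", "<|system|>"

# 1) "added_tokens": special/added tokens are listed here with their ids.
for entry in data["added_tokens"]:
    if entry["content"] == old:
        entry["content"] = new

# 2) "model" -> "vocab": the same string maps to the same id in the base vocabulary.
data["model"]["vocab"][new] = data["model"]["vocab"].pop(old)

with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

tokenizer_config.json's "added_tokens_decoder" needs the same rename, as described further down the thread.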

@lee-onidas

Hey @ArthurZucker,

I tried replacing a token in the vocab (not the added_tokens) in the tokenizer.json file. But when I try to load the tokenizer back up with new_tokenizer = AutoTokenizer.from_pretrained('path/to/tokenizer'), I get the following error: "Exception: data did not match any variant of untagged enum ModelWrapper at line 356367 column 3"

Do you know what the problem might be?

@lee-onidas

You can ignore this, sorry. I found the issue: if you change the vocab in any way, you also need to update the merges accordingly.
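
A rough sketch of that merges check, assuming the same tokenizer.json layout as above. Control tokens usually do not appear in any merge rule, but renaming or removing an ordinary vocab entry can leave dangling merges, which is what triggers the ModelWrapper error; the helper name below is made up:

import json

def prune_dangling_merges(tokenizer_json_path, removed_token):
    """Drop merge rules that still refer to a token removed from the vocab (illustrative only)."""
    with open(tokenizer_json_path, encoding="utf-8") as f:
        data = json.load(f)

    def refers_to(merge):
        # Merges are serialized either as "left right" strings or as [left, right] pairs.
        left, right = merge if isinstance(merge, list) else merge.split(" ", 1)
        return removed_token in (left, right, left + right)

    data["model"]["merges"] = [m for m in data["model"]["merges"] if not refers_to(m)]

    with open(tokenizer_json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)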

@ArthurZucker
Collaborator

huggingface/tokenizers#1570 should help

@itshuey

itshuey commented Aug 25, 2024

PS: you can already replace directly in the vocab and the added_vocab (since there tokens are part of both)

Hi @ArthurZucker, do you mind elaborating on this? I'm experiencing the same issue as OP after modifying tokenizer.json and tokenizer_config.json. After loading the local tokenizer, I am unable to reassign or delete any entries in tokenizer.vocab manually. For example, del tokenizer.vocab['<token-to-replace>'] does not have any effect. I'm also unsure how to modify added_vocab.

I would love to see this feature go through!

@itazap
Collaborator

itazap commented Aug 26, 2024

Hi @itshuey, can you please share the model and token you are attempting this with (in a short snippet would be great!) so I can take a look? 😊

@itshuey

itshuey commented Aug 26, 2024

Sure @itazap, thank you. I am using the Mistral-7B-v0.3 tokenizer.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.3')
>>> tokenizer.vocab['[control_8]']
10
>>> del tokenizer.vocab['[control_8]']
>>> tokenizer.vocab['[control_8]']
10

Reassignment is also futile.

@itazap
Collaborator

itazap commented Aug 26, 2024

Hi @itshuey, indeed this isn't fully supported with AutoTokenizer because it reads the tokenizer.model file, which can't be modified manually. However, you should be able to remove/update tokens if you use

tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)

instead of AutoTokenizer. Let me know if this works for your use case! 😊

@itshuey

itshuey commented Aug 27, 2024

I'm getting this error when I try to instantiate the PreTrainedTokenizerFast object

>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LlamaTokenizer'.
The class this function is called from is 'PreTrainedTokenizerFast'.

Regardless, I get the same result from trying to modify tokenizer.vocab:

>>> tokenizer.vocab['[control_8]']
10
>>> del tokenizer.vocab['[control_8]']
>>> tokenizer.vocab['[control_8]']
10

Taking the tip into account, I tried using LlamaTokenizer to load my tokenizer. Since get_vocab relies on sentencepiece, I tried to use tokenizer.sp_model.set_vocabulary(), but I couldn't figure out what the valid_vocab list parameter was meant to be. I am hoping there's a transformers-based solution to replace unused control tokens associated with specific IDs (in my case 10) without increasing the vocabulary size.

@itazap
Collaborator

itazap commented Aug 27, 2024

Sorry, I wasn't very clear. You are correct that .vocab cannot be modified programmatically, and using del or pop will not work. Updating or deleting tokens would be a new feature and would be supported by the change Arthur linked. The error text you are getting is only a warning; it should not fail.

For now, you can do this by:

  1. Loading the model with PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')
  2. Saving the model and locating the model folder with tokenizer.json and tokenizer_config.json files.
  3. Manually editing the tokenizer.json file (note: there are 2 changes needed in this json: in "added_tokens" and in "vocab") and the tokenizer_config.json file (1 change in "added_tokens_decoder"); a sketch of the tokenizer_config.json edit is included at the end of this comment.
  4. Loading the model from the local folder you modified.
tokenizer = PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')
tokenizer.vocab['[control_8]'] # 10
tokenizer.save_pretrained(temp_folder)

# Pause here and edit both tokenizer.json and tokenizer_config.json. Example below changed control_8 to control_NEW
tokenizer_reloaded = PreTrainedTokenizerFast.from_pretrained(temp_folder)
tokenizer_reloaded.vocab['[control_NEW]'] # 10
tokenizer_reloaded.added_tokens_decoder[10] # 'AddedToken("[control_NEW]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)'
tokenizer_reloaded.vocab['[control_8]'] # raises KeyError

Please let me know if you are able to reproduce!
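
For completeness, a small sketch of the tokenizer_config.json half of step 3 (the tokenizer.json half is sketched earlier in this thread); the folder name and the replacement token are placeholders:

import json
import os

config_path = os.path.join("temp_folder", "tokenizer_config.json")  # folder written by save_pretrained
with open(config_path, encoding="utf-8") as f:
    config = json.load(f)

# "added_tokens_decoder" is keyed by the token id as a string, so only the
# "content" field changes; id 10 keeps its slot and the vocabulary does not grow.
config["added_tokens_decoder"]["10"]["content"] = "[control_NEW]"

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)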

@itshuey

itshuey commented Aug 27, 2024

Thank you for the clarification @itazap, modifying the configs worked perfectly! After using save_pretrained with my PreTrainedTokenizerFast tokenizer, I was able to load it locally (with the proper overwritten tokens) via AutoTokenizer as well. Really appreciate your help with this!
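
A quick way to double-check that round trip, assuming the locally edited folder from the steps above (the folder name and the replacement token are placeholders):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("temp_folder")  # folder containing the edited json files
print(tok.convert_ids_to_tokens(10))  # expected: the replacement token, e.g. "[control_NEW]"
print(len(tok))                       # vocab size should be unchanged (32768 for Mistral-7B-v0.3)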

@itazap
Collaborator

itazap commented Aug 28, 2024

Awesome, I'm glad it worked! Thanks for your patience 🤗
