
Adding tokens to a tokenizer with subword support? #1637

Open
noamgat opened this issue Sep 27, 2024 · 1 comment

Comments


noamgat commented Sep 27, 2024

Hi,
When I add an out-of-vocabulary character to a tokenizer, I only get the new token ID when the character is encoded as a whole word, not when it appears as a subword. Is there a parameter I need to add to the call for it to also work in subwords?

Example:

from transformers import AutoTokenizer
from tokenizers import AddedToken

model_id = 'TheBloke/Llama-2-7b-Chat-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Register an out-of-vocabulary character as a new token
new_char = '筹'
tokenizer.add_tokens(AddedToken(new_char, single_word=False, lstrip=True))

print(tokenizer.encode(new_char))             # the character on its own
print(tokenizer.encode(new_char + new_char))  # repeated, so the second occurrence is a subword
print(tokenizer.encode('"' + new_char))       # preceded by another character

And the output would be

[1, 32000]
[1, 32000, 234, 176, 188]
[1, 376, 234, 176, 188]

What do I need to modify in my add_tokens call so that I get the desired 32000 token twice in the second example and once in the third?
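
For reference, the extra ids in those outputs are the tokenizer falling back to UTF-8 byte tokens for the character. A quick way to check this (a sketch, reusing the tokenizer from above; the mapping assumes the standard Llama byte-fallback vocabulary):

print(tokenizer.convert_ids_to_tokens([234, 176, 188]))
# expected: ['<0xE7>', '<0xAD>', '<0xB9>'], i.e. the UTF-8 bytes of '筹'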

@stephantul

Hey!

Tokenizers generally differentiate between tokens that occur at the start of a word and tokens that occur in the middle of one.
In your case, the added token only matches at the start of a word because, internally, its content is prefixed with the marker the tokenizer uses for word starts. For example, when the second occurrence is preceded by a space, the token is found twice:

tokenizer.encode(f"{new_char} {new_char}")
[1, 32000, 32000]
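
One way to see this directly is to look at the tokenized strings rather than the ids (a sketch using the same tokenizer; the outputs shown are what I would expect, not verified):

print(tokenizer.tokenize(new_char + new_char))
# e.g. ['筹', '<0xE7>', '<0xAD>', '<0xB9>']: only the word-initial occurrence matches
print(tokenizer.tokenize(f"{new_char} {new_char}"))
# e.g. ['筹', '筹']: both occurrences start a word, so both match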

One workaround is to set the normalized flag to False; this will make it work the way you expect:

from transformers import AutoTokenizer
from tokenizers import AddedToken

model_id = 'TheBloke/Llama-2-7b-Chat-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

new_char = '筹'
# normalized=False keeps the token's content out of the normalizer,
# so it can match anywhere in the text instead of only at word starts
tokenizer.add_tokens(AddedToken(new_char, single_word=False, lstrip=True, normalized=False, rstrip=True))

print(tokenizer.encode(new_char))
print(tokenizer.encode(new_char + new_char))
print(tokenizer.encode('"' + new_char))

Results in:

[1, 32000]
[1, 32000, 32000]
[1, 376, 32000]
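
As a quick sanity check (a sketch; the exact id depends on the base vocabulary size), you can confirm that the added token now surfaces as a single token inside arbitrary text:

print(tokenizer.convert_tokens_to_ids(new_char))     # 32000 for this model
print(tokenizer.tokenize('abc' + new_char + 'def'))  # the new character should appear as its own token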
