
Adding tokens to a tokenizer with subword support? #1637

Open
noamgat opened this issue Sep 27, 2024 · 1 comment

Comments


noamgat commented Sep 27, 2024

Hi,
When I add an out-of-vocabulary character to a tokenizer, I only get the new token ID when the character is encoded as a whole word, not when it appears as a subword. Is there a parameter I need to add to the call for it to also work in subwords?

Example:

from transformers import AutoTokenizer
from tokenizers import AddedToken

model_id = 'TheBloke/Llama-2-7b-Chat-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Register an out-of-vocabulary character as a new token
new_char = '筹'
tokenizer.add_tokens(AddedToken(new_char, single_word=False, lstrip=True))

print(tokenizer.encode(new_char))             # the character on its own
print(tokenizer.encode(new_char + new_char))  # repeated, so the second occurrence is a subword
print(tokenizer.encode('"' + new_char))       # preceded by another character

And the output would be

[1, 32000]
[1, 32000, 234, 176, 188]
[1, 376, 234, 176, 188]

What do I need to modify in my add_tokens call so that I get the desired 32000 token twice in the second example and once in the third?
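
For reference, the extra ids in those outputs are the tokenizer falling back to UTF-8 byte tokens for the character. A quick way to check this (a sketch, reusing the tokenizer from above; the mapping assumes the standard Llama byte-fallback vocabulary):

print(tokenizer.convert_ids_to_tokens([234, 176, 188]))
# expected: ['<0xE7>', '<0xAD>', '<0xB9>'], i.e. the UTF-8 bytes of '筹'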

@stephantul

Hey!

Tokenizers generally differentiate between tokens that occur at the start of a word and tokens that occur in the middle of one.
In your case, the added token only matches at the start of a word because, internally, its content is prefixed with the marker the tokenizer uses for word starts. For example, when the second occurrence is preceded by a space, the token is found twice:

tokenizer.encode(f"{new_char} {new_char}")
[1, 32000, 32000]
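
One way to see this directly is to look at the tokenized strings rather than the ids (a sketch using the same tokenizer; the outputs shown are what I would expect, not verified):

print(tokenizer.tokenize(new_char + new_char))
# e.g. ['筹', '<0xE7>', '<0xAD>', '<0xB9>']: only the word-initial occurrence matches
print(tokenizer.tokenize(f"{new_char} {new_char}"))
# e.g. ['筹', '筹']: both occurrences start a word, so both match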

One workaround is to set the normalized flag to False; this will make it work the way you expect:

from transformers import AutoTokenizer
from tokenizers import AddedToken

model_id = 'TheBloke/Llama-2-7b-Chat-GPTQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

new_char = '筹'
# normalized=False keeps the token's content out of the normalizer,
# so it can match anywhere in the text instead of only at word starts
tokenizer.add_tokens(AddedToken(new_char, single_word=False, lstrip=True, normalized=False, rstrip=True))

print(tokenizer.encode(new_char))
print(tokenizer.encode(new_char + new_char))
print(tokenizer.encode('"' + new_char))

Results in:

[1, 32000]
[1, 32000, 32000]
[1, 376, 32000]
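
As a quick sanity check (a sketch; the exact id depends on the base vocabulary size), you can confirm that the added token now surfaces as a single token inside arbitrary text:

print(tokenizer.convert_tokens_to_ids(new_char))     # 32000 for this model
print(tokenizer.tokenize('abc' + new_char + 'def'))  # the new character should appear as its own token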
