Hi,

When I add an out-of-vocabulary character to a tokenizer, I am only able to get the new token ID when I encode it as a whole word, but not as a subword. Is there a parameter that I need to add to the call for it to also work in subwords?
Example: I add 筹 as a new token and encode three strings in which it occurs as a whole word and as a subword (a sketch of the setup follows below). The output contains the new ID 32000 only in the whole-word case.

What do I need to modify in my add_tokens call to get the desired token 32000 twice in the second example and once in the third?
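A minimal sketch of that setup, assuming a LLaMA-style tokenizer with a 32000-entry vocabulary; the checkpoint name and the sample strings are illustrative guesses, not the originals:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any SentencePiece-based model with a
# 32000-entry vocab would show the same pattern.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.add_tokens(["筹"])  # the new token gets the next free ID: 32000

print(tokenizer.encode("筹", add_special_tokens=False))    # contains 32000: whole word works
print(tokenizer.encode("筹筹", add_special_tokens=False))  # 32000 appears once, not the desired twice
print(tokenizer.encode("ab筹", add_special_tokens=False))  # 32000 absent, desired once
```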
Tokenizers generally differentiate between tokens occurring at the start of a word and tokens occurring in the middle of a word. In your case, the token 筹 matches only at the beginning of a word because internally it is prefixed by a special marker indicating the start of a word. For example, here it is found twice:
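A minimal sketch of that behavior, again with illustrative strings and a LLaMA-style fast tokenizer: the added 筹 is found at the two word-initial positions but not mid-word. The normalized=False knob at the end is offered as an assumption about the fix, not something confirmed in this thread:

```python
from transformers import AutoTokenizer
from tokenizers import AddedToken

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.add_tokens(["筹"])
new_id = tokenizer.convert_tokens_to_ids("筹")

# 筹 opens the first and third words but sits mid-word in the second,
# so only the two word-initial occurrences produce the new ID.
ids = tokenizer.encode("筹abc ab筹c 筹xyz", add_special_tokens=False)
print(ids.count(new_id))  # 2

# Assumption: adding the token with normalized=False keeps the raw
# character as the match pattern (no start-of-word prefix is applied
# during normalization), so it should match mid-word as well.
tok2 = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tok2.add_tokens([AddedToken("筹", normalized=False)])
ids2 = tok2.encode("筹abc ab筹c 筹xyz", add_special_tokens=False)
print(ids2.count(tok2.convert_tokens_to_ids("筹")))  # 3
```

Whether normalized=False alone is enough can depend on the tokenizer and library version, so it is worth checking the counts on your own model.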