`is_pretokenized` doesn't seem to be respected in some cases. The same code given below works in 0.20.0.
```python
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordPiece

m = WordPiece({'F': 0, '<eos>': 1})
t = Tokenizer(m)
t.pre_tokenizer = pre_tokenizers.Split('', 'isolated')
t.encode(['<eos>'], is_pretokenized=True).ids
```
Expected to run without any issue, but it raises the exception:

```
Exception: WordPiece error: Missing [UNK] token from the vocabulary
```
It seems to ignore the `is_pretokenized` flag and wants to apply the `pre_tokenizer` to the `<eos>` token.
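For comparison, a minimal sketch of what I'd expect when `is_pretokenized=True` (an assumption: the pre-tokenized word should go straight to the model): with no `pre_tokenizer` set at all, the same input encodes cleanly, because `'<eos>'` is already in the vocabulary and no `[UNK]` fallback is needed.

```python
# Sketch (assumption: with is_pretokenized=True the model should see
# '<eos>' as a single word). Without a pre_tokenizer the lookup succeeds,
# since '<eos>' is in the vocabulary and no [UNK] token is required.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece

m = WordPiece({'F': 0, '<eos>': 1})
t = Tokenizer(m)
print(t.encode(['<eos>'], is_pretokenized=True).ids)  # expected: [1]
```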