I do `spm.SentencePieceTrainer.train` with the following parameters:
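The exact call is not reproduced here; roughly it looks like the sketch below (the corpus path and flag values are placeholders, and `model_type='bpe'` is assumed since BPE-Dropout is the goal):

```python
import sentencepiece as spm

# Rough sketch of the training call; file names and values are placeholders.
spm.SentencePieceTrainer.train(
    input='corpus_ru.txt',      # plain-text training corpus
    model_prefix='spm_16000',   # writes spm_16000.model and spm_16000.vocab
    vocab_size=16000,
    model_type='bpe',           # BPE model, so that BPE-Dropout can be applied
)
```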
After that I create a `T5Tokenizer` from the resulting model and save it for later use. Then I try to use it with BPE-Dropout in the following way:
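Roughly along these lines (a sketch; the paths are placeholders, and the sampling options are shown going in through `sp_model_kwargs`):

```python
from transformers import T5Tokenizer

# Wrap the trained SentencePiece model in a T5Tokenizer and save it.
tokenizer = T5Tokenizer(vocab_file='spm_16000.model')
tokenizer.save_pretrained('tokenizer_16000')

# Reload it with subword sampling enabled in the underlying SentencePiece model.
tokenizer_16000 = T5Tokenizer.from_pretrained(
    'tokenizer_16000',
    sp_model_kwargs={'enable_sampling': True, 'alpha': 0.1},
)
```

(For a BPE model, `alpha` acts as the dropout probability.)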
Statistically, about 10% of the time I get an `<unk>` token at the beginning of the encoded sequence, even though the very next token is exactly the start of the sequence. I wrote the test sequence myself, so there are no hidden symbols whatsoever:

```python
for i in range(10):
    encoded_text = tokenizer_16000('Прапорщик Задов опять здесь.')
    print(tokenizer_16000.convert_ids_to_tokens(encoded_text['input_ids']))
```

outputs:
This never happens with vocab size 8000 and all the same parameters. Why does `<unk>` appear there?
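For reference, a small check to see at which layer the `<unk>` shows up, comparing the raw SentencePiece pieces with how the saved tokenizer maps them (illustrative only; it reuses `spm_16000.model` and `tokenizer_16000` from the sketches above):

```python
import sentencepiece as spm

# Load the raw SentencePiece model directly (placeholder path from the sketch above).
sp = spm.SentencePieceProcessor(model_file='spm_16000.model')
text = 'Прапорщик Задов опять здесь.'

for _ in range(10):
    # Sample pieces with dropout straight from the SentencePiece model.
    pieces = sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1)
    # Map the same pieces through both vocabularies to see where <unk> comes from.
    sp_ids = [sp.piece_to_id(p) for p in pieces]
    hf_ids = tokenizer_16000.convert_tokens_to_ids(pieces)
    print(pieces, sp_ids, hf_ids)
```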