BPE Dropout tokenizer generates unk at the beginning of sequence #1071

Open · AnnaLebedeva opened this issue Dec 1, 2024 · 0 comments
I run spm.SentencePieceTrainer.train with the following parameters (the full call is sketched right after the list):

save_directory: /tokenizer
input: data.txt
vocab_size: 16000
model_type: bpe
pad_id: 0
eos_id: 1
unk_id: 2
bos_id: -1
input_sentence_size: 10000000
shuffle_input_sentence: true
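
For reference, the full training call looks roughly like this. Note that save_directory is not a SentencePiece training flag (it comes from my config wrapper), so the model_prefix below is an assumption:

import sentencepiece as spm

# Roughly the training call behind the parameter list above;
# model_prefix is assumed to stand in for save_directory.
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='tokenizer/tokenizer',  # assumed mapping of save_directory
    vocab_size=16000,
    model_type='bpe',
    pad_id=0,
    eos_id=1,
    unk_id=2,
    bos_id=-1,
    input_sentence_size=10000000,
    shuffle_input_sentence=True,
)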

After that I create a T5Tokenizer from it and save it for later use:

hf_tokenizer = T5Tokenizer('tokenizer/tokenizer.model', extra_ids=0, legacy=False)
hf_tokenizer.save_pretrained('tokenizer_directory')
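
As a sanity check (not part of the repro itself), the id layout above matches T5's usual convention, so the wrapper's special-token ids should come out as pad=0, eos=1, unk=2:

# Verify the special-token ids survived the round-trip; with the
# training flags above these should print 0 1 2.
print(hf_tokenizer.pad_token_id, hf_tokenizer.eos_token_id, hf_tokenizer.unk_token_id)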

Then I try to use it with BPE-dropout as follows:

tokenizer_16000 = AutoTokenizer.from_pretrained(
    'tokenizer_directory',
    use_fast=False,
    sp_model_kwargs={
        'enable_sampling': True,
        'alpha': 0.1,
    },
)

Statistically, about 10% of the time I get an <unk> token at the beginning of the encoded sequence, even though the next token is exactly the start of the sequence. I typed the test sentence myself, so there are no hidden symbols or anything like that:

for i in range(10):
    encoded_text = tokenizer_16000('Прапорщик Задов опять здесь.')
    print(tokenizer_16000.convert_ids_to_tokens(encoded_text['input_ids']))

outputs:

16000
['▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['▁', 'П', 'ра', 'пор', 'щик', '▁', 'З', 'ад', 'ов', '▁оп', 'я', 'ть', '▁здесь', '.', '</s>']
['<unk>', '▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁оп', 'ят', 'ь', '▁здесь', '.', '</s>']
['▁Пр', 'ап', 'о', 'р', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['▁', 'П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['▁', 'П', 'ра', 'пор', 'щик', '▁З', 'ад', 'о', 'в', '▁оп', 'ят', 'ь', '▁здесь', '.', '</s>']
['<unk>', '▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁оп', 'ят', 'ь', '▁здесь', '.', '</s>']
['▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'о', 'в', '▁опять', '▁здесь', '.', '</s>']
['▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['ра', 'пор', 'щ', 'ик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']

This never happens with vocab size 8000 and all the same parameters. Why does <unk> appear there?
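
For what it's worth, the same sampling kwargs can be passed straight to the raw SentencePiece model (this is where the HF wrapper forwards sp_model_kwargs), which should show whether the <unk> comes from the model itself or from the T5Tokenizer wrapper; a minimal sketch:

import sentencepiece as spm

# Sample directly from the raw model, bypassing the HF T5Tokenizer wrapper,
# to check whether <unk> ever shows up at position 0 here too.
sp = spm.SentencePieceProcessor(
    model_file='tokenizer/tokenizer.model',
    enable_sampling=True,
    alpha=0.1,
)
for _ in range(10):
    print(sp.encode('Прапорщик Задов опять здесь.', out_type=str))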
