
Unable to load t5-small tokenizer saved with latest packages in older versions #31139

Closed · 2 of 4 tasks

jpmann opened this issue May 30, 2024 · 2 comments
jpmann commented May 30, 2024

System Info

Step  Transformers  Tokenizers  SentencePiece
1     4.40.0        0.19.1      0.1.99
2     4.34.1        0.14.1      0.1.99

Who can help?

@ArthurZucker @you

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Step 1: Save the t5-small tokenizer with the latest packages

import transformers
print(transformers.__version__)  # 4.40.0

import tokenizers
print(tokenizers.__version__)  # 0.19.1

import sentencepiece
print(sentencepiece.__version__) # 0.1.99

from transformers import AutoTokenizer
t5_tok = AutoTokenizer.from_pretrained("t5-small")
print(t5_tok)
t5_tok.save_pretrained("t5_small_xr_4_40_0")
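
For context, this writes the tokenizer files into the output directory; a quick way to list them (the expected output below is illustrative, since the exact file set may vary slightly by version):

import os
print(sorted(os.listdir("t5_small_xr_4_40_0")))
# e.g. ['special_tokens_map.json', 'spiece.model', 'tokenizer.json', 'tokenizer_config.json']

The tokenizer.json file is the one that matters below.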

Step 2: Load the tokenizer saved in step #1 with the older packages

import transformers
print(transformers.__version__)  # 4.34.1

import tokenizers
print(tokenizers.__version__)  # 0.14.1

import sentencepiece
print(sentencepiece.__version__) # 0.1.99

from transformers import AutoTokenizer
t5_tok = AutoTokenizer.from_pretrained("t5_small_xr_4_40_0")
print(t5_tok)

Expected behavior

Step #2 fails with an error:

[image: error traceback]

I see that tokenizer.json looks different when saved under the step #1 and step #2 environments (see the System Info section for environment details):

[image: tokenizer.json diff]

When I looked further into it, I saw that add_prefix_space, which is present in the older format, is no longer there, and two new fields, prepend_scheme and split, were introduced. I believe this change in the serialization contract caused the failure.
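
For reference, a minimal sketch of how the two files can be compared (it assumes a second copy saved under the step #2 environment in t5_small_xr_4_34_1, a hypothetical directory name):

import json

# Compare the pre_tokenizer sections of the two tokenizer.json files;
# this is where the add_prefix_space vs. prepend_scheme/split difference shows up.
with open("t5_small_xr_4_40_0/tokenizer.json") as f:
    new_pre = json.load(f)["pre_tokenizer"]
with open("t5_small_xr_4_34_1/tokenizer.json") as f:
    old_pre = json.load(f)["pre_tokenizer"]
print(set(new_pre) - set(old_pre))  # expected: {'prepend_scheme', 'split'}
print(set(old_pre) - set(new_pre))  # expected: {'add_prefix_space'}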

A couple of questions in this context:

  1. Are the newer changes to the tokenizers + transformers packages expected to be backward compatible?
  2. Will there be any tokenization differences between the following two cases?
    Saving the tokenizer with the env in step #1 and tokenizing a dataset
    Saving the tokenizer with the env in step #2 and tokenizing a dataset

I will run tests for question #2 (a minimal sketch below) and share the output here, but if the answer is already known, please let me know.
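
A minimal sketch of the test I have in mind (the sample sentence is illustrative; run it once under each environment, against the tokenizer saved in that same environment, and diff the printed IDs):

from transformers import AutoTokenizer

# Load the tokenizer saved under the current environment and print the
# token IDs for a fixed sample; compare the output across environments.
tok = AutoTokenizer.from_pretrained("t5_small_xr_4_40_0")  # or the step #2 save dir
print(tok("Hello world, this is a test sentence.")["input_ids"])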

@ArthurZucker (Collaborator) commented

Hey! What you are asking for is forward compatibility, not backward compatibility.
The issue lies with the tokenizers version, not transformers, and as such this is expected. You can probably hack it to use tokenizers 0.19 with an older version of transformers.
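
For concreteness, a sketch of a safer fallback (the slow-tokenizer route and the directory name are assumptions, not part of the reply above): saving the slow SentencePiece tokenizer sidesteps tokenizer.json entirely, since its on-disk format (spiece.model plus the JSON configs) is stable across these versions.

from transformers import AutoTokenizer

# Fallback sketch: save the slow (SentencePiece-based) tokenizer, which
# writes spiece.model plus config JSONs and no tokenizer.json, so older
# transformers versions can load and convert it themselves.
t5_slow = AutoTokenizer.from_pretrained("t5-small", use_fast=False)
t5_slow.save_pretrained("t5_small_slow")  # hypothetical directory name

# Under the older environment:
# tok = AutoTokenizer.from_pretrained("t5_small_slow")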

To answer your questions:


jpmann commented Jun 11, 2024

Thanks @ArthurZucker, this helps.

jpmann closed this as completed Jun 11, 2024