
Unable to load t5-small tokenizer saved with latest packages in older versions #31139

Closed · 2 of 4 tasks

jpmann opened this issue May 30, 2024 · 2 comments
jpmann commented May 30, 2024

System Info

Step  Transformers  Tokenizers  SentencePiece
1     4.40.0        0.19.1      0.1.99
2     4.34.1        0.14.1      0.1.99

Who can help?

@ArthurZucker @you

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Step 1: Save the t5-small tokenizer with the latest packages

import transformers
print(transformers.__version__)  # 4.40.0

import tokenizers
print(tokenizers.__version__)  # 0.19.1

import sentencepiece
print(sentencepiece.__version__) # 0.1.99

from transformers import AutoTokenizer
t5_tok = AutoTokenizer.from_pretrained("t5-small")
print(t5_tok)
t5_tok.save_pretrained("t5_small_xr_4_40_0")
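
For context, this writes the tokenizer files into the output directory; a quick way to list them (the expected output below is illustrative, since the exact file set may vary slightly by version):

import os
print(sorted(os.listdir("t5_small_xr_4_40_0")))
# e.g. ['special_tokens_map.json', 'spiece.model', 'tokenizer.json', 'tokenizer_config.json']

The tokenizer.json file is the one that matters below.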

Step 2: Load the tokenizer saved in step #1 with the older packages

import transformers
print(transformers.__version__)  # 4.34.1

import tokenizers
print(tokenizers.__version__)  # 0.14.1

import sentencepiece
print(sentencepiece.__version__) # 0.1.99

from transformers import AutoTokenizer
t5_tok = AutoTokenizer.from_pretrained("t5_small_xr_4_40_0")
print(t5_tok)

Expected behavior

Step #2 fails with an error:

[image: error traceback]

I see that tokenizer.json looks different when saved under the step #1 and step #2 environments (see the System Info section for environment details):

[image: tokenizer.json diff]

When I looked further into it, I saw that add_prefix_space, which is present in the older format, is no longer there, and two new fields, prepend_scheme and split, were introduced. I believe this change in the serialization contract caused the failure.
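
For reference, a minimal sketch of how the two files can be compared (it assumes a second copy saved under the step #2 environment in t5_small_xr_4_34_1, a hypothetical directory name):

import json

# Compare the pre_tokenizer sections of the two tokenizer.json files;
# this is where the add_prefix_space vs. prepend_scheme/split difference shows up.
with open("t5_small_xr_4_40_0/tokenizer.json") as f:
    new_pre = json.load(f)["pre_tokenizer"]
with open("t5_small_xr_4_34_1/tokenizer.json") as f:
    old_pre = json.load(f)["pre_tokenizer"]
print(set(new_pre) - set(old_pre))  # expected: {'prepend_scheme', 'split'}
print(set(old_pre) - set(new_pre))  # expected: {'add_prefix_space'}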

A couple of questions in this context:

  1. Are the newer changes to the tokenizers + transformers packages expected to be backward compatible?
  2. Will there be any tokenization differences between the following two cases?
    Saving the tokenizer with the env in step #1 and tokenizing a dataset
    Saving the tokenizer with the env in step #2 and tokenizing a dataset

I will run tests for question #2 (a minimal sketch below) and share the output here, but if the answer is already known, please let me know.
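
A minimal sketch of the test I have in mind (the sample sentence is illustrative; run it once under each environment, against the tokenizer saved in that same environment, and diff the printed IDs):

from transformers import AutoTokenizer

# Load the tokenizer saved under the current environment and print the
# token IDs for a fixed sample; compare the output across environments.
tok = AutoTokenizer.from_pretrained("t5_small_xr_4_40_0")  # or the step #2 save dir
print(tok("Hello world, this is a test sentence.")["input_ids"])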

@ArthurZucker (Collaborator) commented

Hey! What you are asking for is forward compatibility, not backward compatibility.
The issue lies with the tokenizers version, not transformers, and as such this is expected. You can probably hack it to use tokenizers 0.19 with an older version of transformers.
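
For concreteness, a sketch of a safer fallback (the slow-tokenizer route and the directory name are assumptions, not part of the reply above): saving the slow SentencePiece tokenizer sidesteps tokenizer.json entirely, since its on-disk format (spiece.model plus the JSON configs) is stable across these versions.

from transformers import AutoTokenizer

# Fallback sketch: save the slow (SentencePiece-based) tokenizer, which
# writes spiece.model plus config JSONs and no tokenizer.json, so older
# transformers versions can load and convert it themselves.
t5_slow = AutoTokenizer.from_pretrained("t5-small", use_fast=False)
t5_slow.save_pretrained("t5_small_slow")  # hypothetical directory name

# Under the older environment:
# tok = AutoTokenizer.from_pretrained("t5_small_slow")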

To answer your questions:


jpmann commented Jun 11, 2024

Thanks @ArthurZucker, this helps.

jpmann closed this as completed Jun 11, 2024