
[Wav2Vec2] Fix tokenizer set lang #26349

Merged
merged 6 commits into huggingface:main from wav2vec2-doctest on Oct 4, 2023

Conversation

sanchit-gandhi
Contributor

What does this PR do?

PR #23909 removed unique_no_split_tokens as an attribute of the Wav2Vec2 tokenizer; however, the attribute is still required by the .set_target_lang method to set the language:

self.unique_no_split_tokens.append(token)

Thus, calling .set_target_lang currently throws an error:

from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/mms-1b-all")
tokenizer.set_target_lang("spa")

Output:

Traceback (most recent call last):
  File "/Users/sanchitgandhi/transformers/debug_tokenizer.py", line 4, in <module>
    tokenizer.set_target_lang("spa")
  File "/Users/sanchitgandhi/transformers/src/transformers/models/wav2vec2/tokenization_wav2vec2.py", line 237, in set_target_lang
    self.unique_no_split_tokens.append(token)
AttributeError: 'Wav2Vec2CTCTokenizer' object has no attribute 'unique_no_split_tokens'

This PR reinstates unique_no_split_tokens as an attribute of the tokenizer. Note that this behaviour should already be covered by an existing test that calls:

tokenizer.set_target_lang("spa")
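
For illustration, a rough sketch of what re-instating the attribute might look like inside set_target_lang (illustrative only, not necessarily the exact merged lines; the review below ends up favouring add_tokens instead):

# illustrative sketch, not the exact merged change:
# make sure the attribute exists again on the tokenizer instance ...
self.unique_no_split_tokens = []

# ... so that set_target_lang can keep registering multi-character
# vocab entries as tokens that must not be split at tokenization
for token in self.encoder.keys():
    if len(token) > 1:
        self.unique_no_split_tokens.append(token)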

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Sep 22, 2023

The documentation is not available anymore as the PR was closed or merged.

@andergisomon

My Space suddenly had a runtime error because of this.

@sanchit-gandhi
Contributor Author

Sorry to hear that @andergisomon - note that running transformers from the latest PyPI release should bypass this error, since the regression is only on main at the moment:

pip uninstall transformers -y
pip install --upgrade transformers

Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks, I'd rather remove the addition of these tokens to unique_no_split_tokens, as they won't be used (#26322 was on its way).

The issue is that adding them to self.unique_no_split_tokens now simply does nothing. A fix that would keep the previous behaviour is to do self.add_tokens(self.encoder.keys()) instead. It's a bit inefficient but will make sure they are not split!
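
For clarity, a rough sketch of how that suggestion might sit inside set_target_lang (a concrete list is passed here for illustration; the exact form is up to the PR):

# register every vocab entry with add_tokens in a single call,
# instead of appending to unique_no_split_tokens one entry at a time
self.add_tokens(list(self.encoder.keys()))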

@sanchit-gandhi
Contributor Author

Thanks for the review! I've applied the suggestions from your comment - does it look good to you?

Collaborator

@ArthurZucker ArthurZucker left a comment


Let's simplify a bit and then it's good to go

Comment on lines 230 to 232
for token in self.encoder.keys():
    if len(token) > 1:
        self.unique_no_split_tokens.append(token)
        self.add_tokens(token)
Collaborator


add_tokens itself loops over the tokens, so I'd recommend adding them all in a single call:

Suggested change
-for token in self.encoder.keys():
-    if len(token) > 1:
-        self.unique_no_split_tokens.append(token)
-        self.add_tokens(token)
+self.add_tokens(self.encoder.keys())

as an internal check makes sure len(token) > 1

Contributor Author

@sanchit-gandhi sanchit-gandhi Sep 29, 2023


But we need lstrip=rstrip=True here, so we need to use AddedToken(lstrip=True, rstrip=True) as we had before. If that's ok with you, I'll merge.
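
For reference, a rough sketch of what the final shape of these lines might be with that change (assuming AddedToken as re-exported by transformers; it can also be imported from tokenizers):

from transformers import AddedToken

# wrap each multi-character vocab entry in an AddedToken with
# lstrip/rstrip so surrounding whitespace is stripped and the
# token is not split at tokenization
for token in self.encoder.keys():
    if len(token) > 1:
        self.add_tokens(AddedToken(token, lstrip=True, rstrip=True))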

Collaborator

@ArthurZucker ArthurZucker left a comment


Fine with me!

@sanchit-gandhi sanchit-gandhi merged commit 2d8ee98 into huggingface:main Oct 4, 2023
3 checks passed
@sanchit-gandhi sanchit-gandhi deleted the wav2vec2-doctest branch October 4, 2023 16:34
blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
* fix wav2vec2 doctest

* suggestion

* fix

* final fix

* revert since we need AddedTokens
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023
* fix wav2vec2 doctest

* suggestion

* fix

* final fix

* revert since we need AddedTokens