[Wav2Vec2] Fix tokenizer set lang #26349
Conversation
The documentation is not available anymore as the PR was closed or merged.

My space suddenly had a runtime error because of this.

Sorry to hear that @andergisomon - note that running
Thanks, I'd rather remove the addition of these tokens to unique_no_split_tokens, as they won't be used (#26322 was on its way). The issue is that adding them to self.unique_no_split_tokens just does nothing. A fix that would keep the previous behaviour is to do self.add_tokens(self.encoder.keys()) instead. It's a bit inefficient, but it will make sure they are not split!
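To make the trade-off concrete, here is a minimal sketch of the two options (method body only; self is the Wav2Vec2 tokenizer, and the comments reflect the behaviour described in this thread, not the merged code):

```python
# Option 1: keep appending to unique_no_split_tokens. Since #23909 removed
# the attribute, tokenization no longer consults it, so the append is a
# no-op at best (and an AttributeError on current main, where it is gone).
for token in self.encoder.keys():
    if len(token) > 1:
        self.unique_no_split_tokens.append(token)

# Option 2 (suggested): register the vocab entries through the public API so
# multi-character tokens are treated as unsplittable. A bit inefficient, but
# it preserves the previous behaviour. (list() wrapping added defensively,
# since add_tokens expects a list of tokens.)
self.add_tokens(list(self.encoder.keys()))
```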
Thanks for the review! Applied the suggestions from your comment - does it look good to you?
Let's simplify a bit and then it's good to go.
```python
for token in self.encoder.keys():
    if len(token) > 1:
        self.unique_no_split_tokens.append(token)
        self.add_tokens(token)
```
add_tokens loops over the tokens, so I'd recommend adding them using:
```diff
-for token in self.encoder.keys():
-    if len(token) > 1:
-        self.unique_no_split_tokens.append(token)
-        self.add_tokens(token)
+self.add_tokens(self.encoder.keys())
```
as an internal check makes sure len(token) > 1.
But we need lstrip=rstrip=True here, so we need to use AddedToken(lstrip=True, rstrip=True) as we had before. If that's ok with you, I'll merge.
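For reference, a sketch of where the discussion lands (the parameter choices follow the comment above and the "revert since we need AddedTokens" commit; the exact merged code may differ):

```python
from transformers import AddedToken

# Keep the per-token loop, but register each multi-character vocab entry as
# an AddedToken with lstrip=rstrip=True so surrounding whitespace is
# stripped, matching the tokenizer's behaviour before #23909.
for token in self.encoder.keys():
    if len(token) > 1:
        self.add_tokens(AddedToken(token, lstrip=True, rstrip=True))
```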
Force-pushed from dbbdf3f to 7d62f6e
Fine with me!
* fix wav2vec2 doctest
* suggestion
* fix
* final fix
* revert since we need AddedTokens
What does this PR do?
The PR #23909 removed unique_no_split_tokens as an attribute of the Wav2Vec2 tokenizer; however, the set_target_lang method requires it to set the language:

transformers/src/transformers/models/wav2vec2/tokenization_wav2vec2.py, line 237 in dcbfd93
Thus, calling set_target_lang currently throws an error. Output:
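A minimal reproduction (sketch: the checkpoint name is illustrative; any MMS-style multi-lingual Wav2Vec2 tokenizer triggers it):

```python
from transformers import Wav2Vec2CTCTokenizer

# Illustrative multi-lingual checkpoint whose tokenizer nests one vocab per language.
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/mms-1b-all")

# set_target_lang swaps in the vocabulary for the requested language and then
# appends multi-character tokens to self.unique_no_split_tokens. With that
# attribute removed by #23909, this line raises an AttributeError.
tokenizer.set_target_lang("fra")
```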
This PR reinstates unique_no_split_tokens as an attribute of the tokenizer. Note that this should already be tested for in the following test:

transformers/tests/models/wav2vec2/test_tokenization_wav2vec2.py, line 798 in dcbfd93
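For context, a regression test for this path can be sketched roughly as follows (illustrative only; the actual test at the line referenced above may differ):

```python
import unittest

from transformers import Wav2Vec2CTCTokenizer


class Wav2Vec2SetTargetLangTest(unittest.TestCase):
    def test_set_target_lang(self):
        # Illustrative MMS-style checkpoint with per-language vocabularies.
        tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/mms-1b-all")

        # Switching the language must not raise once unique_no_split_tokens
        # is reinstated, and the tokenizer should record the new target.
        tokenizer.set_target_lang("fra")
        self.assertEqual(tokenizer.target_lang, "fra")
```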