Nit-added-tokens #26538
Conversation
The documentation is not available anymore as the PR was closed or merged.
@@ -2382,8 +2387,8 @@ def save_pretrained(
        tokenizer_config = copy.deepcopy(self.init_kwargs)

        # TODO: Ensure the modified attributes (those are also in the __init__ kwargs) will give identical tokenizers
        # target_keys = self.init_kwargs.keys()
-       target_keys = ["model_max_length", "clean_up_tokenization_spaces", "additional_special_tokens"]
+       target_keys = list(self.init_kwargs.keys())
When saving, we should overwrite the init_kwargs with the content of self. I don't know why this was not the case before.
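A rough sketch of what this hunk does inside save_pretrained; the hasattr/getattr loop below is assumed from the surrounding method, not quoted from the PR:

tokenizer_config = copy.deepcopy(self.init_kwargs)
# Overwrite every key that is also a live attribute on self, instead of only
# a hardcoded subset, so the saved config reflects the current tokenizer state.
target_keys = list(self.init_kwargs.keys())
for k in target_keys:
    if hasattr(self, k):
        tokenizer_config[k] = getattr(self, k)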
@@ -2227,7 +2232,7 @@ def _from_pretrained(
        if added_tokens_file is not None:
            with open(added_tokens_file, encoding="utf-8") as added_tokens_handle:
                added_tok_encoder = json.load(added_tokens_handle)
-               # legacy: we have to init with (rstrip=True, lstrip=True)
+               # legacy: we have to init with (rstrip=True, lstrip=True) (if the token is new? Failing test)
Might have to update this. The tests are brittle and the default is biting us.
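For readers unfamiliar with these flags, a small illustration of what the legacy defaults do; the model name and token are placeholders, not taken from the PR:

from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# With lstrip=True / rstrip=True, the whitespace around the token is swallowed
# when it is matched, which is the legacy behavior the comment above refers to.
tokenizer.add_tokens(AddedToken("<new_token>", lstrip=True, rstrip=True))
print(tokenizer.tokenize("hello <new_token> world"))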
if str(token) in additional_special_tokens:
    # at this point, if the token is in `additional_special_tokens` as a str, it should be updated
    additional_special_tokens.remove(str(token))
Only use the default legacy values for AddedToken if the token is not already in the added tokens decoder
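A hedged sketch of that guard; variable names mirror the diff where possible (AddedToken as exported by transformers), and the exact control flow in the PR may differ:

for content, index in added_tok_encoder.items():
    if index in self.added_tokens_decoder:
        # the token is already known: keep its existing AddedToken (and its strip flags)
        token = self.added_tokens_decoder[index]
    else:
        # the token is new: only then fall back to the legacy defaults
        token = AddedToken(content, rstrip=True, lstrip=True)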
LGTM
A small benchmark on the facebook/nllb-moe-54b tokenizer:

from transformers import AutoTokenizer
import time

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")

start = time.time(); tokenizer.get_added_vocab(); print(time.time() - start)
>>> 0.17021536827087402

start = time.time(); {k.content: v for v, k in sorted(tokenizer.added_tokens_decoder.items(), key=lambda item: item[0])}; print(time.time() - start)
>>> 0.0054759979248046875

start = time.time(); tokenizer.added_tokens_decoder; print(time.time() - start)
>>> 0.0007669925689697266

will update rust to make …
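The faster variant from the benchmark can be wrapped as a small helper; this is a sketch of the idea, not necessarily what the PR ships:

def get_added_vocab_fast(tokenizer):
    # Build the content -> id mapping straight from added_tokens_decoder,
    # avoiding the much slower get_added_vocab() path timed above.
    return {
        token.content: index
        for index, token in sorted(tokenizer.added_tokens_decoder.items(), key=lambda item: item[0])
    }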
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
if str(token) in additional_special_tokens:
    # at this point the token is in `additional_special_tokens` as a str, let's add the AddedToken info
    additional_special_tokens.remove(str(token))
    if token.special and token not in additional_special_tokens:
        additional_special_tokens.append(token)
cc @LysandreJik here
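For context, a minimal illustration of what this replacement achieves; the token value is made up:

from transformers import AddedToken

additional_special_tokens = ["<extra_0>"]
token = AddedToken("<extra_0>", lstrip=False, rstrip=False, special=True)

if str(token) in additional_special_tokens:
    # swap the bare string for the AddedToken carrying the strip/special info
    additional_special_tokens.remove(str(token))
    if token.special and token not in additional_special_tokens:
        additional_special_tokens.append(token)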
* fix stripping
* nits
* fix another test
* styling
* fix?
* update
* revert bad merge
* found the bug
* YES SIR
* is that change really required?
* make fast even faster
* re order functions
What does this PR do?
Fixes #26500, fixes #26536