-
Notifications
You must be signed in to change notification settings - Fork 27.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nit-added-tokens #26538
Merged
+46
−27
Merged
Nit-added-tokens #26538
Changes from 11 commits
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
32be323
fix stripping
ArthurZucker 603d4e9
Merge branch 'main' of https://github.com/huggingface/transformers
ArthurZucker cb4e48a
nits
ArthurZucker e9bc0e6
fix another test
ArthurZucker fa93ed3
styling
ArthurZucker f031f5e
fix?
ArthurZucker ec9530b
Merge branch 'main' of https://github.com/huggingface/transformers in…
ArthurZucker fb80bf9
update
ArthurZucker 2260c85
Merge branch 'main' of github.com:huggingface/transformers into nit-a…
ArthurZucker a9b8845
revert bad merge
ArthurZucker 3807fcb
Merge branch 'nit-added-tokens' of https://github.com/ArthurZucker/tr…
ArthurZucker 339ce67
found the bug
ArthurZucker d093b5c
YES SIR
ArthurZucker c12a2f9
is that change really required?
ArthurZucker 02922e1
make fast even faster
ArthurZucker 93152be
re order functions
ArthurZucker File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2199,6 +2199,11 @@ def _from_pretrained( | |
|
||
if isinstance(token, AddedToken): | ||
added_tokens_decoder[int(idx)] = token | ||
if str(token) in additional_special_tokens: | ||
# at this point if the token is in `additional_special_tokens` as an str, should be updated | ||
additional_special_tokens.remove(str(token)) | ||
if token.special and token not in additional_special_tokens: | ||
additional_special_tokens.append(token) | ||
else: | ||
raise ValueError( | ||
f"Found a {token.__class__} in the saved `added_tokens_decoder`, should be a dictionary." | ||
|
@@ -2228,10 +2233,13 @@ def _from_pretrained( | |
with open(added_tokens_file, encoding="utf-8") as added_tokens_handle: | ||
added_tok_encoder = json.load(added_tokens_handle) | ||
# legacy: we have to init with (rstrip=True, lstrip=True) | ||
strip = True if "Fast" not in cls.__name__ else False | ||
added_tokens_decoder = { | ||
index: AddedToken(token, rstrip=strip, lstrip=strip) for token, index in added_tok_encoder.items() | ||
} | ||
rstrip = lstrip = True if "Fast" not in cls.__name__ else False | ||
for token, index in added_tok_encoder.items(): | ||
if index in added_tokens_decoder and "Fast" not in cls.__name__: | ||
continue | ||
added_tokens_decoder = { | ||
index: AddedToken(token, rstrip=rstrip, lstrip=lstrip) for token, index in added_tok_encoder.items() | ||
} | ||
# end legacy | ||
|
||
# slow -> fast, non-legacy: we need to make sure the `added_tokens_decoder` is used to add tokens if the `fast` was not properly saved! | ||
|
@@ -2382,8 +2390,8 @@ def save_pretrained( | |
tokenizer_config = copy.deepcopy(self.init_kwargs) | ||
|
||
# TODO: Ensure the modified attributes (those are also in the __init__ kwargs) will give identical tokenizers | ||
# target_keys = self.init_kwargs.keys() | ||
target_keys = ["model_max_length", "clean_up_tokenization_spaces", "additional_special_tokens"] | ||
target_keys = list(self.init_kwargs.keys()) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. when saving; we should overwrite the init_kwargs with the content of self. Don't know why it was not the case before |
||
target_keys += ["model_max_length", "clean_up_tokenization_spaces", "additional_special_tokens"] | ||
for k in target_keys: | ||
if hasattr(self, k): | ||
tokenizer_config[k] = getattr(self, k) | ||
|
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only use the default legacy values for AddedToken if the token is not already in the added tokens decoder