Nit-added-tokens #26538 (Merged)
Changes from all commits (16 commits)
- 32be323 fix stripping (ArthurZucker)
- 603d4e9 Merge branch 'main' of https://github.com/huggingface/transformers (ArthurZucker)
- cb4e48a nits (ArthurZucker)
- e9bc0e6 fix another test (ArthurZucker)
- fa93ed3 styling (ArthurZucker)
- f031f5e fix? (ArthurZucker)
- ec9530b Merge branch 'main' of https://github.com/huggingface/transformers in… (ArthurZucker)
- fb80bf9 update (ArthurZucker)
- 2260c85 Merge branch 'main' of github.com:huggingface/transformers into nit-a… (ArthurZucker)
- a9b8845 revert bad merge (ArthurZucker)
- 3807fcb Merge branch 'nit-added-tokens' of https://github.com/ArthurZucker/tr… (ArthurZucker)
- 339ce67 found the bug (ArthurZucker)
- d093b5c YES SIR (ArthurZucker)
- c12a2f9 is that change really required? (ArthurZucker)
- 02922e1 make fast even faster (ArthurZucker)
- 93152be re order functions (ArthurZucker)
@@ -846,15 +846,26 @@ def __init__(self, verbose=True, **kwargs):

        # We directly set the hidden value to allow initialization with special tokens
        # which are not yet in the vocabulary. Necessary for serialization/de-serialization
        # TODO clean this up at some point (probably by switching to fast tokenizers)

        for key, value in kwargs.items():
            if value is None:
                continue
            if key in self.SPECIAL_TOKENS_ATTRIBUTES:
                if key == "additional_special_tokens":
                    # TODO THIS IS NASTY! Will always reset tokens to default rstrip and lstrip because self.set_attr on strings
                    # will not check the addedtokens decoder. WILL FIX TOMORROW
                    assert isinstance(value, (list, tuple)), f"Value {value} is not a list or tuple"
                    assert all(
                        isinstance(t, (str, AddedToken)) for t in value
                    ), "One of the tokens is not a string or an AddedToken"
                    if hasattr(self, "added_tokens_encoder"):
                        extended_token = []
                        for token in value:
                            if isinstance(token, str) and str(token) in self.added_tokens_encoder:
                                extended_token.append(self.added_tokens_decoder[self.added_tokens_encoder[str(token)]])
                            else:
                                extended_token.append(token)
                        value = extended_token
                    setattr(self, key, value)
                elif isinstance(value, (str)):
                    value = AddedToken(value, normalized=False, special=True)
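The interesting part of this hunk is the `if hasattr(self, "added_tokens_encoder"):` block: strings passed through `additional_special_tokens` that are already registered as added tokens get swapped for the corresponding `AddedToken` objects, so their `lstrip`/`rstrip`/`special` flags are not silently reset to defaults. Below is a minimal, self-contained sketch of that lookup; the `AddedToken` dataclass is a hypothetical stand-in for `tokenizers.AddedToken`, not the real transformers internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddedToken:
    """Hypothetical stand-in for tokenizers.AddedToken (illustration only)."""
    content: str
    lstrip: bool = False
    rstrip: bool = False
    special: bool = False


# Pretend the tokenizer already registered "<extra_id_0>" with rstrip=True.
added_tokens_decoder = {32000: AddedToken("<extra_id_0>", rstrip=True, special=True)}
added_tokens_encoder = {tok.content: idx for idx, tok in added_tokens_decoder.items()}

# Incoming `additional_special_tokens` kwarg: one known string, one new AddedToken.
value = ["<extra_id_0>", AddedToken("<new_token>", special=True)]

extended_token = []
for token in value:
    if isinstance(token, str) and token in added_tokens_encoder:
        # Reuse the registered AddedToken so lstrip/rstrip/special survive.
        extended_token.append(added_tokens_decoder[added_tokens_encoder[token]])
    else:
        extended_token.append(token)
value = extended_token

print(value[0].rstrip)  # True: the flag is preserved instead of being reset
```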
@@ -1674,14 +1685,6 @@ def _set_processor_class(self, processor_class: str):

        """Sets processor class as an attribute."""
        self._processor_class = processor_class

    @property
    def added_tokens_encoder(self) -> Dict[str, int]:
        """
        Returns the sorted mapping from string to index. The added tokens encoder is cached for performance
        optimisation in `self._added_tokens_encoder` for the slow tokenizers.
        """
        return {k.content: v for v, k in sorted(self._added_tokens_decoder.items(), key=lambda item: item[0])}

    @property
    def added_tokens_decoder(self) -> Dict[int, AddedToken]:
        raise NotImplementedError()
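For reference, the `added_tokens_encoder` property removed in this hunk builds a content-to-index mapping from the index-to-`AddedToken` decoder, sorted by index. A small standalone sketch of that same dict comprehension, again with a hypothetical stand-in class rather than the real `AddedToken`:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddedToken:  # hypothetical stand-in, illustration only
    content: str
    special: bool = False


# index -> AddedToken, deliberately out of order
_added_tokens_decoder = {
    32001: AddedToken("<extra_id_1>", special=True),
    32000: AddedToken("<extra_id_0>", special=True),
}

# content -> index, sorted by index (mirrors the removed property's comprehension)
added_tokens_encoder = {
    tok.content: idx
    for idx, tok in sorted(_added_tokens_decoder.items(), key=lambda item: item[0])
}

print(added_tokens_encoder)  # {'<extra_id_0>': 32000, '<extra_id_1>': 32001}
```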
@@ -2196,9 +2199,13 @@ def _from_pretrained(

            for idx, token in init_kwargs["added_tokens_decoder"].items():
                if isinstance(token, dict):
                    token = AddedToken(**token)

                if isinstance(token, AddedToken):
                    added_tokens_decoder[int(idx)] = token
                    if str(token) in additional_special_tokens:
                        # at this point the token is in `additional_special_tokens` as an str, let's add the AddedToken info
                        additional_special_tokens.remove(str(token))
                        if token.special and token not in additional_special_tokens:
                            additional_special_tokens.append(token)
                else:
                    raise ValueError(
                        f"Found a {token.__class__} in the saved `added_tokens_decoder`, should be a dictionary."
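The intent of the added lines: when a token from the saved `added_tokens_decoder` also appears in `additional_special_tokens` as a bare string, the string is replaced by the fully specified `AddedToken`. A hedged, self-contained sketch of that merge follows; the stand-in dataclass uses `.content` where the real code relies on `str(token)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddedToken:  # hypothetical stand-in, illustration only
    content: str
    lstrip: bool = False
    rstrip: bool = False
    special: bool = False


# Values as they might come out of a saved tokenizer config.
additional_special_tokens = ["<extra_id_0>"]  # still a plain string here
saved = {32000: {"content": "<extra_id_0>", "rstrip": True, "special": True}}

added_tokens_decoder = {}
for idx, token in saved.items():
    if isinstance(token, dict):
        token = AddedToken(**token)
    added_tokens_decoder[int(idx)] = token
    if token.content in additional_special_tokens:
        # Replace the bare string with the AddedToken carrying the full info.
        additional_special_tokens.remove(token.content)
        if token.special and token not in additional_special_tokens:
            additional_special_tokens.append(token)

print(additional_special_tokens)
# [AddedToken(content='<extra_id_0>', lstrip=False, rstrip=True, special=True)]
```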
@@ -2381,9 +2388,8 @@ def save_pretrained(

        tokenizer_config = copy.deepcopy(self.init_kwargs)

        # TODO: Ensure the modified attributes (those are also in the __init__ kwargs) will give identical tokenizers
        # target_keys = self.init_kwargs.keys()
        target_keys = ["model_max_length", "clean_up_tokenization_spaces", "additional_special_tokens"]
        target_keys = list(self.init_kwargs.keys())
        target_keys += ["model_max_length", "clean_up_tokenization_spaces", "additional_special_tokens"]
        for k in target_keys:
            if hasattr(self, k):
                tokenizer_config[k] = getattr(self, k)

Review comment on `target_keys = list(self.init_kwargs.keys())`: when saving, we should overwrite the `init_kwargs` with the content of `self`; don't know why it was not the case before.
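The rewritten `target_keys` means every `__init__` kwarg, not just the three hard-coded keys, is refreshed from the live tokenizer attributes before being written to the config. A small illustrative sketch of what that overwrite changes, using a dummy class rather than the transformers API:

```python
import copy


class DummyTokenizer:  # hypothetical, illustration only
    def __init__(self, model_max_length=512, custom_arg="old"):
        self.init_kwargs = {"model_max_length": model_max_length, "custom_arg": custom_arg}
        self.model_max_length = model_max_length
        self.custom_arg = custom_arg
        self.clean_up_tokenization_spaces = True
        self.additional_special_tokens = []


tok = DummyTokenizer()
tok.model_max_length = 1024  # attributes mutated after __init__
tok.custom_arg = "new"

tokenizer_config = copy.deepcopy(tok.init_kwargs)

# New behaviour: every init kwarg plus the three well-known keys is overwritten from `self`.
target_keys = list(tok.init_kwargs.keys())
target_keys += ["model_max_length", "clean_up_tokenization_spaces", "additional_special_tokens"]
for k in target_keys:
    if hasattr(tok, k):
        tokenizer_config[k] = getattr(tok, k)

print(tokenizer_config["custom_arg"])        # 'new'  (previously the stale 'old' from init would be saved)
print(tokenizer_config["model_max_length"])  # 1024
```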
Review comment: cc @LysandreJik here