[Tokenizer] Fix slow and fast serialization #26570

Status: Merged (114 commits, Oct 18, 2023)

Changes from 65 commits

Commits (114)
303a82c
fix
ArthurZucker Oct 3, 2023
cbf179a
Merge branch 'main' of github.com:huggingface/transformers into fix-main
ArthurZucker Oct 3, 2023
01e18db
last attempt
ArthurZucker Oct 3, 2023
08a560a
current work
ArthurZucker Oct 4, 2023
23c9513
fix forward compatibility
ArthurZucker Oct 4, 2023
0ae13ed
save all special tokens
ArthurZucker Oct 5, 2023
d887f68
Merge branch 'fix-main' of github.com:ArthurZucker/transformers into …
ArthurZucker Oct 5, 2023
72ff80e
current state
ArthurZucker Oct 5, 2023
b7b7d13
revert additional changes
ArthurZucker Oct 5, 2023
36d5303
updates
ArthurZucker Oct 5, 2023
ae93856
remove tokenizer.model
ArthurZucker Oct 5, 2023
88ea352
add a test and the fix
ArthurZucker Oct 5, 2023
ca98fbd
nit
ArthurZucker Oct 5, 2023
3c22fbb
revert one more break
ArthurZucker Oct 5, 2023
dc93d5e
fix typefield issue
ArthurZucker Oct 5, 2023
00997e9
quality
ArthurZucker Oct 5, 2023
6143634
more tests
ArthurZucker Oct 5, 2023
907591f
fix fields for FC
ArthurZucker Oct 5, 2023
5df5a83
Merge branch 'fix-main' of github.com:ArthurZucker/transformers into …
ArthurZucker Oct 5, 2023
66ecb9e
Merge branch 'fix-main' of github.com:ArthurZucker/transformers into …
ArthurZucker Oct 5, 2023
0e7bd61
more nits?
ArthurZucker Oct 5, 2023
381a0ec
Merge branch 'fix-main' of github.com:ArthurZucker/transformers into …
ArthurZucker Oct 6, 2023
bf75334
new additional changes
ArthurZucker Oct 6, 2023
fafbbed
how
ArthurZucker Oct 6, 2023
c6de7b2
some updates
ArthurZucker Oct 6, 2023
9a6e750
simplify all
ArthurZucker Oct 7, 2023
8c4ec2c
more nits
ArthurZucker Oct 7, 2023
621ebae
revert some things to original
ArthurZucker Oct 7, 2023
6a6095e
nice
ArthurZucker Oct 7, 2023
e0e5dea
nits
ArthurZucker Oct 7, 2023
92c7754
a small hack
ArthurZucker Oct 7, 2023
9fbbafe
more nits
ArthurZucker Oct 7, 2023
25e2df9
ahhaha
ArthurZucker Oct 7, 2023
2b18cc2
Merge branch 'main' of github.com:huggingface/transformers into fix-main
ArthurZucker Oct 7, 2023
078c94e
fixup
ArthurZucker Oct 7, 2023
ef1e598
update
ArthurZucker Oct 9, 2023
9bf12a8
make test run on ci
ArthurZucker Oct 11, 2023
e6d0381
use subtesting
ArthurZucker Oct 11, 2023
112e4b1
update
ArthurZucker Oct 11, 2023
f794a91
Update .circleci/create_circleci_config.py
ArthurZucker Oct 11, 2023
65aa232
updates
ArthurZucker Oct 11, 2023
8ea095b
Merge branch 'fix-main' of github.com:ArthurZucker/transformers into …
ArthurZucker Oct 11, 2023
efc5e7b
fixup
ArthurZucker Oct 11, 2023
aa569b7
nits
ArthurZucker Oct 11, 2023
5ad55f3
replace typo
ArthurZucker Oct 11, 2023
1c22269
fix the test
ArthurZucker Oct 11, 2023
3b93653
nits
ArthurZucker Oct 11, 2023
a2e977a
Merge branch 'main' of github.com:huggingface/transformers into fix-main
ArthurZucker Oct 11, 2023
1acf2dd
update
ArthurZucker Oct 11, 2023
2dde542
None max dif pls
ArthurZucker Oct 11, 2023
9ebf76e
a partial fix
ArthurZucker Oct 11, 2023
6d2c00e
had to revert one thing
ArthurZucker Oct 11, 2023
e4bcb5e
test the fast
ArthurZucker Oct 11, 2023
3d4bffd
updates
ArthurZucker Oct 11, 2023
8bcb345
fixup
ArthurZucker Oct 11, 2023
d9e5fad
and more nits
ArthurZucker Oct 11, 2023
fc34148
more fixes
ArthurZucker Oct 12, 2023
8389094
update
ArthurZucker Oct 12, 2023
78f1ac4
Oupsy :eye:
ArthurZucker Oct 12, 2023
62eb816
Merge branch 'main' of github.com:huggingface/transformers into fix-main
ArthurZucker Oct 12, 2023
5c1ae9c
nits
ArthurZucker Oct 12, 2023
df8ab6f
fix marian
ArthurZucker Oct 12, 2023
677fcb2
on our way to heaven
ArthurZucker Oct 12, 2023
5a3407e
Update src/transformers/models/t5/tokenization_t5.py
ArthurZucker Oct 12, 2023
856a43d
fixup
ArthurZucker Oct 12, 2023
a3cb498
Update src/transformers/tokenization_utils_fast.py
ArthurZucker Oct 12, 2023
62cf2d0
Update src/transformers/tokenization_utils_base.py
ArthurZucker Oct 12, 2023
fe8bba0
fix phobert
ArthurZucker Oct 13, 2023
be68fc2
skip some things, test more
ArthurZucker Oct 13, 2023
814d978
nits
ArthurZucker Oct 13, 2023
f969713
fixup
ArthurZucker Oct 13, 2023
56b0619
fix deberta
ArthurZucker Oct 13, 2023
f2a5447
update
ArthurZucker Oct 13, 2023
5d7bdab
update
ArthurZucker Oct 13, 2023
49dd8b2
more updates
ArthurZucker Oct 13, 2023
3a03c77
skip one test
ArthurZucker Oct 13, 2023
707a688
more updates
ArthurZucker Oct 13, 2023
bbfc382
fix camembert
ArthurZucker Oct 13, 2023
b6b8aed
can't test this one
ArthurZucker Oct 13, 2023
dac7b89
more good fixes
ArthurZucker Oct 14, 2023
b4ca44e
kind of a major update
ArthurZucker Oct 14, 2023
5245825
fixup
ArthurZucker Oct 14, 2023
0724ebf
more fixups
ArthurZucker Oct 14, 2023
066854a
fix pegasus and mpnet
ArthurZucker Oct 15, 2023
f646ab8
remove skipped tests
ArthurZucker Oct 15, 2023
53e2390
fix phoneme tokenizer if self.verbose
ArthurZucker Oct 15, 2023
e0a967f
fix individual models
ArthurZucker Oct 15, 2023
a353871
update common tests
ArthurZucker Oct 15, 2023
fbc4c4f
update testing files
ArthurZucker Oct 15, 2023
64a6bc4
all over again
ArthurZucker Oct 15, 2023
4219b32
nits
ArthurZucker Oct 15, 2023
48b937a
skip test for markup lm
ArthurZucker Oct 15, 2023
d1a4537
fixups
ArthurZucker Oct 15, 2023
60173aa
fix order of addition in fast by sorting the added tokens decoder
ArthurZucker Oct 16, 2023
8402602
proper defaults for deberta
ArthurZucker Oct 16, 2023
d782bbd
correct default for fnet
ArthurZucker Oct 16, 2023
05ab2c2
nits on add tokens, string initialized to special if special
ArthurZucker Oct 16, 2023
bd6c5a5
skip irrelevant herbert tests
ArthurZucker Oct 16, 2023
8a267d3
main fixes
ArthurZucker Oct 16, 2023
7bda15e
update test added_tokens_serialization
ArthurZucker Oct 16, 2023
ac75cd3
the fix for bart like models and class instanciating
ArthurZucker Oct 16, 2023
640885e
update bart
ArthurZucker Oct 16, 2023
45801c0
nit!
ArthurZucker Oct 16, 2023
14c576f
update idefix test
ArthurZucker Oct 16, 2023
2a78cf9
fix whisper!
ArthurZucker Oct 16, 2023
6f28584
some fixup
ArthurZucker Oct 16, 2023
c12656b
fixups
ArthurZucker Oct 16, 2023
8f8c3f1
revert some of the wrong chanegs
ArthurZucker Oct 16, 2023
de51ef7
fixup
ArthurZucker Oct 16, 2023
0f0a3fe
fixup
ArthurZucker Oct 16, 2023
4b693b9
Merge branch 'main' of github.com:huggingface/transformers into fix-main
ArthurZucker Oct 18, 2023
4b82043
skip marian
ArthurZucker Oct 18, 2023
340df3d
skip the correct tests
ArthurZucker Oct 18, 2023
f9fb43d
skip for tf and flax as well
ArthurZucker Oct 18, 2023
1 change: 1 addition & 0 deletions .circleci/create_circleci_config.py

@@ -127,6 +127,7 @@ def to_dict(self):
             },
         ]
         steps.extend([{"run": l} for l in self.install_steps])
+        steps.extend([{"run": "pip install pytest-subtests"}])
         steps.append(
             {
                 "save_cache": {
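For context: pytest-subtests provides a `subtests` fixture whose `subtests.test(...)` context manager reports each iteration as its own test, so a failing checkpoint does not abort the rest of a loop. A minimal sketch of the pattern, assuming a hypothetical checkpoint list and assertion (not the PR's actual test code):

```python
# Illustrative only; requires `pip install pytest-subtests` (the CI change above).
import tempfile

from transformers import AutoTokenizer

CHECKPOINTS = ["t5-small", "facebook/nllb-200-distilled-600M"]  # hypothetical list


def test_added_tokens_serialization(subtests):
    for name in CHECKPOINTS:
        # Each subtest passes or fails independently of the others.
        with subtests.test(msg=f"round-trip {name}"):
            tokenizer = AutoTokenizer.from_pretrained(name)
            with tempfile.TemporaryDirectory() as tmp:
                tokenizer.save_pretrained(tmp)
                reloaded = AutoTokenizer.from_pretrained(tmp)
            assert tokenizer.added_tokens_decoder == reloaded.added_tokens_decoder
```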
4 changes: 2 additions & 2 deletions src/transformers/models/camembert/tokenization_camembert.py

@@ -145,9 +145,9 @@ def __init__(
         # In this case it is recommended to properly set the tokens by hand.
         self._added_tokens_decoder = {
             0: AddedToken("<s>NOTUSED"),
-            1: AddedToken(pad_token),
+            1: AddedToken(pad_token, special=True) if isinstance(pad_token, str) else pad_token,
             2: AddedToken("</s>NOTUSED"),
-            3: AddedToken(unk_token),
+            3: AddedToken(unk_token, special=True) if isinstance(unk_token, str) else unk_token,
             4: AddedToken("<unk>NOTUSED"),
         }
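The `isinstance` guard matters because `pad_token` and `unk_token` can already arrive as `AddedToken` instances (for example when a saved tokenizer is reloaded and its `tokenizer_config.json` restores them), and re-wrapping would discard their stored `lstrip`/`rstrip`/`normalized` flags. A minimal sketch of the pattern, which recurs in the DeBERTa-v2, NLLB, and T5 changes below (the helper name is made up):

```python
from tokenizers import AddedToken


def as_special_token(token):
    # Wrap plain strings as special tokens, but pass through tokens that are
    # already AddedToken so their lstrip/rstrip/normalized flags survive.
    return AddedToken(token, special=True) if isinstance(token, str) else token


pad = as_special_token("<pad>")                             # plain string in
mask = as_special_token(AddedToken("<mask>", lstrip=True))  # keeps lstrip=True
```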
6 changes: 5 additions & 1 deletion src/transformers/models/deberta_v2/tokenization_deberta_v2.py

@@ -138,7 +138,11 @@ def __init__(
         self._tokenizer = SPMTokenizer(
             vocab_file, None, split_by_punct=split_by_punct, sp_model_kwargs=self.sp_model_kwargs
         )
-        unk_token = AddedToken(unk_token, normalized=True, lstrip=False, rstrip=False)
+        unk_token = (
+            AddedToken(unk_token, normalized=True, lstrip=False, rstrip=False)
+            if isinstance(unk_token, str)
+            else unk_token
+        )
         super().__init__(
             do_lower_case=do_lower_case,
             bos_token=bos_token,
4 changes: 2 additions & 2 deletions src/transformers/models/marian/tokenization_marian.py

@@ -148,9 +148,9 @@ def __init__(

         self.separate_vocabs = separate_vocabs
         self.encoder = load_json(vocab)
-        if unk_token not in self.encoder:
+        if str(unk_token) not in self.encoder:
             raise KeyError("<unk> token must be in the vocab")
-        assert pad_token in self.encoder
+        assert str(pad_token) in self.encoder

         if separate_vocabs:
             self.target_encoder = load_json(target_vocab_file)
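The `str(...)` coercion is needed because the vocab loaded from JSON has plain string keys, while `unk_token` and `pad_token` may now be `AddedToken` objects; a membership test against the raw object can then fail even though the content is present. A small illustration of the assumed behavior (vocab contents are made up):

```python
from tokenizers import AddedToken

encoder = {"<unk>": 0, "<pad>": 1}  # JSON vocab: plain string keys
unk = AddedToken("<unk>", special=True)

print(unk in encoder)       # can be False: the AddedToken is not the str key
print(str(unk) in encoder)  # True: str() yields the token's content
```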
6 changes: 5 additions & 1 deletion src/transformers/models/nllb/tokenization_nllb.py

@@ -144,7 +144,11 @@ def __init__(
         **kwargs,
     ):
         # Mask token behave like a normal word, i.e. include the space before it
-        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
+        mask_token = (
+            AddedToken(mask_token, normalized=True, lstrip=True, rstrip=False, special=True)
+            if isinstance(mask_token, str)
+            else mask_token
+        )

         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
         self.legacy_behaviour = legacy_behaviour
6 changes: 5 additions & 1 deletion src/transformers/models/nllb/tokenization_nllb_fast.py

@@ -155,7 +155,11 @@ def __init__(
         **kwargs,
     ):
         # Mask token behave like a normal word, i.e. include the space before it
-        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
+        mask_token = (
+            AddedToken(mask_token, normalized=True, lstrip=True, rstrip=False, special=True)
+            if isinstance(mask_token, str)
+            else mask_token
+        )
         self.legacy_behaviour = legacy_behaviour

         _additional_special_tokens = FAIRSEQ_LANGUAGE_CODES.copy()
10 changes: 6 additions & 4 deletions src/transformers/models/t5/tokenization_t5.py

@@ -153,9 +153,9 @@ def __init__(
         legacy=None,
         **kwargs,
     ) -> None:
-        pad_token = AddedToken(pad_token, rstrip=True, lstrip=True)
-        unk_token = AddedToken(unk_token, rstrip=True, lstrip=True)
-        eos_token = AddedToken(eos_token, rstrip=True, lstrip=True)
+        pad_token = AddedToken(pad_token, rstrip=True, lstrip=True) if isinstance(pad_token, str) else pad_token
+        unk_token = AddedToken(unk_token, rstrip=True, lstrip=True) if isinstance(unk_token, str) else unk_token
+        eos_token = AddedToken(eos_token, rstrip=True, lstrip=True) if isinstance(eos_token, str) else eos_token

         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

@@ -167,7 +167,9 @@ def __init__(

         if additional_special_tokens is not None:
             extra_tokens = [x for x in additional_special_tokens if "<extra_id_" in str(x)]
-            if extra_ids > 0 and extra_ids != len(extra_tokens):
+            if len(extra_tokens) < 1:
+                additional_special_tokens += [f"<extra_id_{i}>" for i in range(extra_ids)]
+            elif extra_ids > 0 and extra_ids != len(extra_tokens):
                 raise ValueError(
                     f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are"
                     " provided to T5Tokenizer. In this case the additional_special_tokens must include the extra_ids"
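The new `len(extra_tokens) < 1` branch backfills the `<extra_id_*>` sentinels when the caller supplies `additional_special_tokens` that contain none of them, where the old code raised whenever the counts disagreed. A sketch of the intended behavior under the usual T5 defaults (checkpoint name illustrative):

```python
from transformers import T5Tokenizer

# Previously this raised because the user tokens did not include the
# <extra_id_0> ... <extra_id_99> sentinels; now the sentinels are appended.
tok = T5Tokenizer.from_pretrained(
    "t5-small",
    additional_special_tokens=["<my_token>"],  # no <extra_id_*> among them
    extra_ids=100,
)
print("<extra_id_0>" in tok.additional_special_tokens)  # expected: True
```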
22 changes: 13 additions & 9 deletions src/transformers/tokenization_utils.py

@@ -348,19 +348,20 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):

     def __init__(self, **kwargs):
         # 1. Init the parent class
-        super().__init__(**kwargs)

         self.tokens_trie = Trie()

         # 2. init `_added_tokens_decoder` if child class did not
         if not hasattr(self, "_added_tokens_decoder"):
             self._added_tokens_decoder: Dict[int, AddedToken] = {}
-        # 3. if a `added_tokens_decoder` is passed, we are loading from a saved tokenizer, we overwrite
-        if "added_tokens_decoder" in kwargs:
-            # overwriting the class's added_tokens_decoder. This is the source of truth!
-            self._added_tokens_decoder.update(kwargs.get("added_tokens_decoder"))

+        # 3. if a `added_tokens_decoder` is passed, we are loading from a saved tokenizer, we overwrite
+        self._added_tokens_decoder.update(kwargs.pop("added_tokens_decoder", {}))
         self._added_tokens_encoder: Dict[str, int] = {k.content: v for v, k in self._added_tokens_decoder.items()}

+        # 4 init the parent class
+        super().__init__(**kwargs)
+
         # 4. If some of the special tokens are not part of the vocab, we add them, at the end.
         # the order of addition is the same as self.SPECIAL_TOKENS_ATTRIBUTES following `tokenizers`
         self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
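Reordering `__init__` so that `super().__init__(**kwargs)` runs after `_added_tokens_decoder` is populated makes the decoder loaded from disk the source of truth before special tokens are re-added at step 4. A hedged sketch of the round-trip invariant this is meant to guarantee (checkpoint choice illustrative):

```python
import tempfile

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("camembert-base", use_fast=False)
tok.add_tokens(["<new_tok>"], special_tokens=True)

with tempfile.TemporaryDirectory() as tmp:
    tok.save_pretrained(tmp)
    reloaded = AutoTokenizer.from_pretrained(tmp, use_fast=False)

# Indices and flags of added tokens must survive the save/load round-trip.
assert tok.added_tokens_decoder == reloaded.added_tokens_decoder
```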
@@ -459,6 +460,7 @@ def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_to
         added_tokens = 0
         if new_tokens is None:
             return added_tokens
+        # TODO this is fairly slow to improve!
         current_vocab = self.get_vocab().copy()
         new_idx = len(current_vocab)  # only call this once, len gives the last index + 1
         for token in new_tokens:
@@ -467,9 +469,12 @@ def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_to
             if str(token) == "":
                 continue
             if isinstance(token, str):
                 if token in self._added_tokens_encoder:
                     continue
                 # for legacy AddedTokens strip left and right by default
                 # TODO this will be remove to have the same default behavior as rust
-                token = AddedToken(token, normalized=False, rstrip=True, lstrip=True)
+                token = AddedToken(token, normalized=not special_tokens, rstrip=True, lstrip=True)
+            else:
+                token = AddedToken(token, normalized=False, rstrip=True, lstrip=True)
             if special_tokens:
                 token.special = True
             if token in self._added_tokens_decoder:
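With these defaults, a plain string passed to `_add_tokens` is stripped on both sides and is only normalized when it is not being added as a special token. A short sketch of the observable difference through the public `add_tokens` API (token strings are made up):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

tok.add_tokens(["[NEW_SPECIAL]"], special_tokens=True)  # normalized=False here
tok.add_tokens(["newword"])                             # normalized=True here

# Special added tokens skip the lower-casing applied to everything else
# (see the do_lower_case branch in tokenize() below).
print(tok.tokenize("[NEW_SPECIAL] newword"))
```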
@@ -550,7 +555,7 @@ def tokenize(self, text: TextInput, **kwargs) -> List[str]:
             logger.warning(f"Keyword arguments {kwargs} not recognized.")

         if hasattr(self, "do_lower_case") and self.do_lower_case:
-            # convert non-special tokens to lowercase
+            # convert non-special tokens to lowercase. Might be super slow as well?
             escaped_special_toks = [re.escape(s_tok) for s_tok in (self.all_special_tokens)]
             escaped_special_toks += [
                 re.escape(s_tok.content)
@@ -564,7 +569,7 @@ def tokenize(self, text: TextInput, **kwargs) -> List[str]:
             no_split_token = []
             tokens = [text]
         else:
-            no_split_token = set(self._added_tokens_encoder.keys())  # don't split on any of the added tokens
+            no_split_token = self._added_tokens_encoder.keys()  # don't split on any of the added tokens
             # "This is something<special_token_1> else"
             tokens = self.tokens_trie.split(text)
@@ -588,7 +593,6 @@ def tokenize(self, text: TextInput, **kwargs) -> List[str]:
             elif tok_extended.single_word and right and right[0] != " ":
                 tokens[i + 1] = token + tokens[i + 1]
                 tokens[i] = ""
-
             else:
                 raise ValueError(
                     f"{tok_extended} cannot be tokenized because it was not properly added"
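For reference, `tokens_trie.split` is what cuts added tokens out as whole pieces before the model-specific tokenization runs on the remaining segments. A minimal sketch using the same `Trie` class (example string taken from the comment in the diff):

```python
from transformers.tokenization_utils import Trie

trie = Trie()
trie.add("<special_token_1>")

# Matched added tokens come back as their own segments, untouched.
print(trie.split("This is something<special_token_1> else"))
# expected: ['This is something', '<special_token_1>', ' else']
```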