
Refactor metaspace #1476

Merged: 18 commits from fix-meta-space-yet-again into main (Mar 30, 2024)
Conversation

@ArthurZucker (Collaborator) commented Mar 22, 2024

Improve the performance of Metaspace, but also just fix it.

import time

times = []
for i in [10, 1e2, 1e3, 1e4, 1e5]:
    start = time.time()
    # `tokenizer` is assumed to be an already-loaded slow (Python) tokenizer
    tokenizer.tokenize("<REPR_END>inform<s>. Hey<unk>.       ."*int(i))
    times += [time.time()-start]

['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]

vs 


['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]

The pre-tokenizer no longer splits, so runs of whitespace stay together (the single '▁▁▁▁▁▁' piece above) and tokenization gets noticeably faster.
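As a rough sketch of what that means at the pre-tokenizer level (a minimal example assuming the post-refactor Python API, where Metaspace takes prepend_scheme and split instead of add_prefix_space):

from tokenizers.pre_tokenizers import Metaspace

text = "Hey    there"

# With split=True every replacement character starts a new piece, so a run
# of spaces becomes many single-"▁" pieces (the pre-refactor behaviour).
print(Metaspace(replacement="▁", prepend_scheme="always", split=True).pre_tokenize_str(text))

# With split=False the run stays in one piece, so the model (e.g. BPE) can
# merge it into a single token such as the "▁▁▁▁▁▁" above.
print(Metaspace(replacement="▁", prepend_scheme="always", split=False).pre_tokenize_str(text))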


@ArthurZucker ArthurZucker changed the title version = "0.15.3-dev-0" Refactor metaspace Mar 22, 2024
@ArthurZucker ArthurZucker marked this pull request as ready for review March 22, 2024 05:16
("how", (26, 29)),
("▁are", (29, 35)),
("▁you", (35, 41))
("how▁are▁you", (26, 41))
@ArthurZucker (Collaborator Author):

this means Metaspace does not necessarily split on whitespace. Seems to work for BPE / Llama idk
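A hedged illustration of the snippet above from Python (the split flag and prepend_scheme value are assumptions about the new API; the exact offsets depend on the surrounding text, so they are omitted here):

from tokenizers.pre_tokenizers import Metaspace

pre = Metaspace(replacement="▁", prepend_scheme="never", split=False)
# Previously this returned one (piece, offsets) pair per word, e.g.
# ("how", ...), ("▁are", ...), ("▁you", ...); with split=False the words come
# back as a single piece spanning the whole range, like ("how▁are▁you", (26, 41)).
print(pre.pre_tokenize_str("how are you"))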

@ArthurZucker (Collaborator Author):

Just gotta add Python tests here and it will be good to go

ArthurZucker and others added 10 commits March 29, 2024 14:57
Improve the performance of Metaspace, but also just fix it.

(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]
@Narsil Narsil force-pushed the fix-meta-space-yet-again branch from a261516 to e3ae520 Compare March 29, 2024 13:58
@ArthurZucker ArthurZucker left a comment

LGTM, let me just check with huggingface/transformers#28881, one sec

@ArthurZucker (Collaborator Author):

We "could" do a deprecation cycle for add_prefix_space but since we are gonna do a major, no need I guess

@@ -73,22 +73,27 @@ mod tests {

#[test]
fn decoder_serialization() {
let json = r#"{"type":"Sequence","decoders":[{"type":"ByteFallback"},{"type":"Metaspace","replacement":"▁","prepend_scheme":"always","split":true}]}"#;
let oldjson = r#"{"type":"Sequence","decoders":[{"type":"ByteFallback"},{"type":"Metaspace","replacement":"▁","add_prefix_space":true,"prepend_scheme":"always"}]}"#;
@ArthurZucker (Collaborator Author):

good, we test backwards compatibility (BC) with the previous serialization

@ArthurZucker ArthurZucker force-pushed the fix-meta-space-yet-again branch from d60edfa to 680e163 Compare March 30, 2024 07:19
@Narsil Narsil merged commit 0906971 into main Mar 30, 2024
12 checks passed
@Narsil Narsil deleted the fix-meta-space-yet-again branch March 30, 2024 09:27
@scriptator

@ArthurZucker does it sound plausible to you that the breaking change of this pull request causes the problem I described in huggingface/text-embeddings-inference#265? If yes, do you see any way to circumvent that problem besides upgrading tokenizers everywhere?

@ArthurZucker (Collaborator Author)

Yes, I think you pretty much have to update the tokenizers version.
This is a breaking change, which is why we had a jump in versioning.

@scriptator

Thanks for the quick response. Hopefully TEI will do that soon.

scriptator added a commit to scriptator/text-embeddings-inference that referenced this pull request May 15, 2024
This is necessary in order to load models whose tokenizers have been
created by a version after the breaking change
huggingface/tokenizers#1476 (i.e. >= v0.19.0)
@scriptator

One more question: Is the new version suited for loading older models (i.e. those saved before the breaking change)? I know for sure that it does not lead to a crash but what about the quality of model responses given that the tokenization has been changed?

@ArthurZucker (Collaborator Author) commented Jun 7, 2024

I don't think it is backward compatible. See this:

# tokenizers 0.19
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")

vs

# tokenizers <0.19
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[2], line 2
      1 from tokenizers import Tokenizer
----> 2 tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 960 column 3

which is why we did a major release
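For code that has to cope with whichever tokenizers version is installed, a defensive sketch (purely illustrative; the repo name is taken from the snippet above and the version bound from earlier in the thread):

from tokenizers import Tokenizer

try:
    tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")
except Exception as err:
    # tokenizers < 0.19 cannot parse the new Metaspace serialization and fails
    # with the "untagged enum" error shown above.
    raise RuntimeError(
        "This tokenizer.json requires tokenizers >= 0.19; please upgrade."
    ) from err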
