Refactor metaspace #1476
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
("how", (26, 29)), | ||
("▁are", (29, 35)), | ||
("▁you", (35, 41)) | ||
("how▁are▁you", (26, 41)) |
this means Metaspace does not necessarily split on whitespace. Seems to work for BPE / Llama idk
Just gotta add python tests here and will be good to go
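For reference, a minimal sketch of what such a Python test could look like. This is an assumption, not the test actually added in this PR: it presumes the tokenizers >= 0.19 `Metaspace` pre-tokenizer with its `split` flag, and the expected offsets are inferred from the diff above.

```python
# Hypothetical test sketch; names and expected values are assumptions.
from tokenizers.pre_tokenizers import Metaspace

def test_metaspace_split_behaviour():
    splitting = Metaspace(replacement="▁", prepend_scheme="always", split=True)
    merged = Metaspace(replacement="▁", prepend_scheme="always", split=False)

    # With split=True each whitespace-separated word becomes its own piece.
    assert splitting.pre_tokenize_str("how are you") == [
        ("▁how", (0, 3)),
        ("▁are", (3, 7)),
        ("▁you", (7, 11)),
    ]
    # With split=False the whole span stays together, as in the diff above.
    assert merged.pre_tokenize_str("how are you") == [("▁how▁are▁you", (0, 11))]
```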
Improve the performance of Metaspace, but also just fix it.

```
(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]

(transformers) ➜ transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]
```
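The script behind that log is not shown in this thread, so purely as an assumed reconstruction: a timing loop over inputs with increasingly long runs of spaces (the pathological case for the old Metaspace) would produce lists of timings like the ones above. The script path and checkpoint name are placeholders; the body is hypothetical.

```python
# Assumed reconstruction of a micro-benchmark like ../scripts/gemma-dummy.py;
# not the actual script from the PR.
import time
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")  # assumed stand-in checkpoint

timings = []
for n in (10, 100, 1_000, 10_000, 100_000):
    text = "inform. Hey. " + " " * n + " ."
    start = time.perf_counter()
    tok.encode(text)
    timings.append(time.perf_counter() - start)
print(timings)  # five growing timings, analogous to the lists in the log
```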
Force-pushed from a261516 to e3ae520
LGTM, let me just check with huggingface/transformers#28881, one sec
bindings/node/src/decoders.rs (Outdated)
We "could" do a deprecation cycle for add_prefix_space
but since we are gonna do a major, no need I guess
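For illustration, such a deprecation shim could have looked roughly like the hypothetical helper below. It is not part of the library, and the `add_prefix_space` → `prepend_scheme` mapping is inferred from the serialized fields visible in the test diff that follows.

```python
# Hypothetical deprecation shim (never shipped; the PR removes the flag
# outright since the next release is a major one).
import warnings
from tokenizers.decoders import Metaspace

def metaspace_decoder(add_prefix_space=None, prepend_scheme="always"):
    if add_prefix_space is not None:
        warnings.warn(
            "add_prefix_space is deprecated; use prepend_scheme instead",
            DeprecationWarning,
        )
        prepend_scheme = "always" if add_prefix_space else "never"
    return Metaspace(replacement="▁", prepend_scheme=prepend_scheme)
```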
```diff
@@ -73,22 +73,27 @@ mod tests {
     #[test]
     fn decoder_serialization() {
         let json = r#"{"type":"Sequence","decoders":[{"type":"ByteFallback"},{"type":"Metaspace","replacement":"▁","add_prefix_space":true,"prepend_scheme":"always"}]}"#;
         let oldjson = r#"{"type":"Sequence","decoders":[{"type":"ByteFallback"},{"type":"Metaspace","replacement":"▁","add_prefix_space":true,"prepend_scheme":"always"}]}"#;
```
Good, we test BC with the previous serialization.
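A Python-side analogue of that BC check might look like the sketch below. The tokenizer skeleton wrapped around the decoder is an assumption for illustration; only the decoder JSON (with the legacy `add_prefix_space` field) comes from the test above.

```python
# Hedged sketch: check that tokenizers >= 0.19 still deserializes the old
# Metaspace decoder format. The surrounding skeleton is assumed, not taken
# from the repo's tests.
import json
from tokenizers import Tokenizer

old_decoder = {
    "type": "Sequence",
    "decoders": [
        {"type": "ByteFallback"},
        {
            "type": "Metaspace",
            "replacement": "▁",
            "add_prefix_space": True,  # legacy field kept for BC
            "prepend_scheme": "always",
        },
    ],
}
skeleton = {
    "version": "1.0",
    "truncation": None,
    "padding": None,
    "added_tokens": [],
    "normalizer": None,
    "pre_tokenizer": None,
    "post_processor": None,
    "decoder": old_decoder,
    "model": {"type": "WordLevel", "vocab": {"<unk>": 0}, "unk_token": "<unk>"},
}
tok = Tokenizer.from_str(json.dumps(skeleton))
print(type(tok.decoder))  # loads without the untagged-enum error
```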
Force-pushed from d60edfa to 680e163
@ArthurZucker does it sound plausible to you that the breaking change in this pull request causes the problem I described in huggingface/text-embeddings-inference#265? If yes, do you see any way to circumvent that problem besides upgrading tokenizers?
Yes, I think you pretty much have to update the tokenizers version.
Thx for the quick response. Hopefully TEI will do that soon.
This is necessary in order to load models whose tokenizers were created by a version after the breaking change in huggingface/tokenizers#1476 (i.e. >= v0.19.0).
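If it helps, here is a quick standard-library check that the installed `tokenizers` meets that bound (nothing TEI-specific is assumed here):

```python
# Verify the installed tokenizers can read post-#1476 tokenizer files.
from importlib.metadata import version

major, minor = (int(x) for x in version("tokenizers").split(".")[:2])
assert (major, minor) >= (0, 19), "tokenizers >= 0.19 required"
```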
One more question: is the new version suited for loading older models (i.e. those saved before the breaking change)? I know for sure that it does not crash, but what about the quality of model responses, given that the tokenization has changed?
I don't think it is backward compatible. See this:

```python
# tokenizers 0.19
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")
```

vs

```python
# tokenizers <0.19
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")
```

```
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[2], line 2
      1 from tokenizers import Tokenizer
----> 2 tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 960 column 3
```

which is why we did a major release.
Improve the performance of Metaspace, but also just fix it. The pre-tokenizer no longer splits on whitespace.