
Refactor metaspace #1476

Merged: 18 commits from fix-meta-space-yet-again into main (Mar 30, 2024)
Conversation

@ArthurZucker (Collaborator) commented Mar 22, 2024

Improve the performance of Metaspace, but also just fix it.

import time

times = []
for i in [10, 1e2, 1e3, 1e4, 1e5]:
    start = time.time()
    # `tokenizer` is assumed to be an already-loaded slow (Python) tokenizer
    tokenizer.tokenize("<REPR_END>inform<s>. Hey<unk>.       ."*int(i))
    times += [time.time()-start]

['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]

vs 


['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]

The pre-tokenizer no longer splits, so runs of whitespace stay together (the single '▁▁▁▁▁▁' piece above) and tokenization gets noticeably faster.
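As a rough sketch of what that means at the pre-tokenizer level (a minimal example assuming the post-refactor Python API, where Metaspace takes prepend_scheme and split instead of add_prefix_space):

from tokenizers.pre_tokenizers import Metaspace

text = "Hey    there"

# With split=True every replacement character starts a new piece, so a run
# of spaces becomes many single-"▁" pieces (the pre-refactor behaviour).
print(Metaspace(replacement="▁", prepend_scheme="always", split=True).pre_tokenize_str(text))

# With split=False the run stays in one piece, so the model (e.g. BPE) can
# merge it into a single token such as the "▁▁▁▁▁▁" above.
print(Metaspace(replacement="▁", prepend_scheme="always", split=False).pre_tokenize_str(text))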


@ArthurZucker ArthurZucker changed the title version = "0.15.3-dev-0" Refactor metaspace Mar 22, 2024
@ArthurZucker ArthurZucker marked this pull request as ready for review March 22, 2024 05:16
("how", (26, 29)),
("▁are", (29, 35)),
("▁you", (35, 41))
("how▁are▁you", (26, 41))
@ArthurZucker (Collaborator Author):

this means Metaspace does not necessarily split on whitespace. Seems to work for BPE / Llama idk
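A hedged illustration of the snippet above from Python (the split flag and prepend_scheme value are assumptions about the new API; the exact offsets depend on the surrounding text, so they are omitted here):

from tokenizers.pre_tokenizers import Metaspace

pre = Metaspace(replacement="▁", prepend_scheme="never", split=False)
# Previously this returned one (piece, offsets) pair per word, e.g.
# ("how", ...), ("▁are", ...), ("▁you", ...); with split=False the words come
# back as a single piece spanning the whole range, like ("how▁are▁you", (26, 41)).
print(pre.pre_tokenize_str("how are you"))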

@ArthurZucker (Collaborator Author):

Just gotta add Python tests here and it will be good to go

ArthurZucker and others added 10 commits March 29, 2024 14:57
Improve the performance of Metaspace, but also just fix it.

(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (14999 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', '▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
['▁inform', '<s>', '.', '▁Hey', '<unk>', '.', '▁', '▁', '▁', '▁', '▁', '▁', '▁.']
[0.0006330013275146484, 0.0014591217041015625, 0.015890836715698242, 0.18584918975830078, 2.1726326942443848]
(transformers) ➜  transformers git:(refactor-default-llama) ✗ python ../scripts/gemma-dummy.py
Token indices sequence length is longer than the specified maximum sequence length for this model (10000 > 2048). Running this sequence through the model will result in indexing errors
['<REPR_END>', 'in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
['in', 'form', '<s>', '.', '▁Hey', '<unk>', '.', '▁▁▁▁▁▁', '▁.']
[0.0008409023284912109, 0.0008909702301025391, 0.00882411003112793, 0.10214710235595703, 1.187899112701416]
@Narsil Narsil force-pushed the fix-meta-space-yet-again branch from a261516 to e3ae520 Compare March 29, 2024 13:58
@ArthurZucker ArthurZucker left a comment

LGTM, let me just check with huggingface/transformers#28881, one sec

@ArthurZucker (Collaborator Author):

We "could" do a deprecation cycle for add_prefix_space but since we are gonna do a major, no need I guess

@@ -73,22 +73,27 @@ mod tests {

#[test]
fn decoder_serialization() {
let json = r#"{"type":"Sequence","decoders":[{"type":"ByteFallback"},{"type":"Metaspace","replacement":"▁","prepend_scheme":"always","split":true}]}"#;
let oldjson = r#"{"type":"Sequence","decoders":[{"type":"ByteFallback"},{"type":"Metaspace","replacement":"▁","add_prefix_space":true,"prepend_scheme":"always"}]}"#;
@ArthurZucker (Collaborator Author):

good, we test backwards compatibility (BC) with the previous serialization

@ArthurZucker ArthurZucker force-pushed the fix-meta-space-yet-again branch from d60edfa to 680e163 Compare March 30, 2024 07:19
@Narsil Narsil merged commit 0906971 into main Mar 30, 2024
12 checks passed
@Narsil Narsil deleted the fix-meta-space-yet-again branch March 30, 2024 09:27
@scriptator

@ArthurZucker does it sound plausible to you that the breaking change of this pull request causes the problem I described in huggingface/text-embeddings-inference#265? If yes, do you see any way to circumvent that problem besides upgrading tokenizers everywhere?

@ArthurZucker (Collaborator Author)

Yes, I think you pretty much have to update the tokenizers version.
This is a breaking change, which is why we had a jump in versioning.

@scriptator

Thanks for the quick response. Hopefully TEI will do that soon.

scriptator added a commit to scriptator/text-embeddings-inference that referenced this pull request May 15, 2024
This is necessary in order to load models whose tokenizers have been
created by a version after the breaking change
huggingface/tokenizers#1476 (i.e. >= v0.19.0)
@scriptator

One more question: Is the new version suited for loading older models (i.e. those saved before the breaking change)? I know for sure that it does not lead to a crash but what about the quality of model responses given that the tokenization has been changed?

@ArthurZucker (Collaborator Author) commented Jun 7, 2024

I don't think it is backward compatible. See this:

# tokenizers 0.19
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")

vs

# tokenizers <0.19
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[2], line 2
      1 from tokenizers import Tokenizer
----> 2 tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 960 column 3

which is why we did a major release
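For code that has to cope with whichever tokenizers version is installed, a defensive sketch (purely illustrative; the repo name is taken from the snippet above and the version bound from earlier in the thread):

from tokenizers import Tokenizer

try:
    tok = Tokenizer.from_pretrained("ArthurZ/new-t5-base")
except Exception as err:
    # tokenizers < 0.19 cannot parse the new Metaspace serialization and fails
    # with the "untagged enum" error shown above.
    raise RuntimeError(
        "This tokenizer.json requires tokenizers >= 0.19; please upgrade."
    ) from err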
