Efficient Replace normalizer #1413
Conversation
I'll have a look, sounds really interesting!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
(I am planning on removing the normalizer of the Llama and Mistral families in favor of a pre_tokenizer, but will still check this out!)
That's fair; until then, this fix will let people tokenize sequences as long as they want.
I see this got closed due to inactivity. Do you need anything from me in order to merge this? Profiling, documentation?
Sorry, I'll take some time to review!
Reviewing this will be my priority!
LGTM. Sorry I've just seen this, it seems like quite an important change indeed.
I just got pinged internally on this.

```python
from matplotlib import pyplot as plt
import time
from tqdm import tqdm
from tokenizers import Tokenizer
from transformers import AutoTokenizer
import numpy as np

lens = np.arange(0, 100000, 100)
with open("data/big.txt") as f:
    TEXT = f.read()

# Fast (Rust) Mistral tokenizer
times_fast = []
tokenizer = Tokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
for ll in tqdm(lens):
    text = TEXT[:ll]
    start = time.perf_counter()
    tokenizer.encode(text)
    times_fast.append(time.perf_counter() - start)

# Fast (Rust) GPT-2 tokenizer
times_gpt2 = []
tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
for ll in tqdm(lens):
    text = TEXT[:ll]
    start = time.perf_counter()
    tokenizer.encode(text)
    times_gpt2.append(time.perf_counter() - start)

# Slow (SentencePiece) Mistral tokenizer
times = []
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)
for ll in tqdm(lens):
    text = TEXT[:ll]
    start = time.perf_counter()
    tokenizer.encode(text)
    times.append(time.perf_counter() - start)

plt.plot(lens, times_fast)
plt.plot(lens, times_gpt2)
plt.plot(lens, times)
plt.legend(["mistral (tokenizers)", "gpt2 (tokenizers)", "mistral (spm)"])
plt.xlabel("Length in chars")
plt.ylabel("tokenization time (seconds)")
plt.show()
```
Thanks a lot @rlrs! Really sorry about the delay, and shoutout to you for this clean piece of work!
The existing Replace normalizer, used for example in the Llama and Mistral tokenizers, is implemented very inefficiently.
This results in normalization taking orders of magnitude longer than it should, making it very time-consuming to tokenize long sequences. I've seen a few issues that probably refer to this, for example huggingface/transformers#25873.
This PR replaces the existing implementation -- which seems to scale quadratically with sequence length and number of matches -- with an implementation that scales linearly, while (hopefully) retaining the exact same semantics.
In my benchmarks with real long-sequence data, tokenizing with the Llama tokenizer is more than two orders of magnitude faster.
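For intuition, here is a minimal Python sketch of why the two approaches scale so differently (this is not the Rust code in this PR; `replace_quadratic`, `replace_linear`, and the space-to-"▁" example pattern are purely illustrative). Splicing the string at every match copies the remaining text each time, so the work grows roughly with the number of matches times the text length, whereas building the output in a single pass over the matches copies each character once and is linear.

```python
import re

def replace_quadratic(text: str, pattern: str, content: str) -> str:
    # Splice the string at every match: each splice copies the tail of the
    # string, so total work is roughly O(num_matches * len(text)).
    out = text
    offset = 0
    for m in re.finditer(pattern, text):
        start, end = m.start() + offset, m.end() + offset
        out = out[:start] + content + out[end:]
        offset += len(content) - (m.end() - m.start())
    return out

def replace_linear(text: str, pattern: str, content: str) -> str:
    # Single pass: collect the unmatched gaps and the replacement text,
    # then join once at the end, so total work is roughly O(len(text)).
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])
        pieces.append(content)
        last = m.end()
    pieces.append(text[last:])
    return "".join(pieces)

# Both produce the same result; only the scaling behavior differs.
text = "hello world " * 2_000
assert replace_quadratic(text, " ", "▁") == replace_linear(text, " ", "▁")
```

The single-pass shape is the point of the fix: the cost no longer depends on how many matches the normalizer has to rewrite, only on the length of the input.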