Efficient Replace normalizer #1413
Conversation
I'll have a look, sounds really interesting!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
(I am planning on removing the normalizer of the Llama and Mistral families in favor of a pre_tokenizer, but will still check this out!)
That's fair; until then, this fix will let people tokenize sequences as long as they want.
I see this got closed due to inactivity. Do you need anything from me in order to merge this? Profiling, documentation?
Sorry, I'll take some time to review!
Reviewing this will be my priority!
LGTM. Sorry I've just seen this, it seems like quite an important change indeed.
I just got pinged internally on this.

```python
from matplotlib import pyplot as plt
import time
from tqdm import tqdm
from tokenizers import Tokenizer
from transformers import AutoTokenizer
import numpy as np

lens = np.arange(0, 100000, 100)
with open("data/big.txt") as f:
    TEXT = f.read()

# Fast (Rust) Mistral tokenizer
times_fast = []
tokenizer = Tokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
for ll in tqdm(lens):
    text = TEXT[:ll]
    start = time.perf_counter()
    tokenizer.encode(text)
    times_fast.append(time.perf_counter() - start)

# Fast (Rust) GPT-2 tokenizer
times_gpt2 = []
tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
for ll in tqdm(lens):
    text = TEXT[:ll]
    start = time.perf_counter()
    tokenizer.encode(text)
    times_gpt2.append(time.perf_counter() - start)

# Slow (SentencePiece) Mistral tokenizer
times = []
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)
for ll in tqdm(lens):
    text = TEXT[:ll]
    start = time.perf_counter()
    tokenizer.encode(text)
    times.append(time.perf_counter() - start)

plt.plot(lens, times_fast)
plt.plot(lens, times_gpt2)
plt.plot(lens, times)
plt.legend(["mistral (tokenizers)", "gpt2 (tokenizers)", "mistral (spm)"])
plt.xlabel("Length in chars")
plt.ylabel("tokenization time (seconds)")
plt.show()
```
Thanks a lot @rlrs! Really sorry about the delay, and shoutout to you for this clean piece of work!
The existing Replace normalizer, used for example in the Llama and Mistral tokenizers, is implemented very inefficiently.
This results in normalization taking orders of magnitude longer than it should, making it very time-consuming to tokenize long sequences. I've seen a few issues that probably refer to this, for example huggingface/transformers#25873.
This PR replaces the existing implementation -- which seems to scale quadratically with sequence length and number of matches -- with an implementation that scales linearly, while (hopefully) retaining the exact same semantics.
In my benchmarks with real long-sequence data, tokenizing with the Llama tokenizer is more than two orders of magnitude faster.
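For intuition, here is a minimal Python sketch of why the two approaches scale so differently (this is not the Rust code in this PR; `replace_quadratic`, `replace_linear`, and the space-to-"▁" example pattern are purely illustrative). Splicing the string at every match copies the remaining text each time, so the work grows roughly with the number of matches times the text length, whereas building the output in a single pass over the matches copies each character once and is linear.

```python
import re

def replace_quadratic(text: str, pattern: str, content: str) -> str:
    # Splice the string at every match: each splice copies the tail of the
    # string, so total work is roughly O(num_matches * len(text)).
    out = text
    offset = 0
    for m in re.finditer(pattern, text):
        start, end = m.start() + offset, m.end() + offset
        out = out[:start] + content + out[end:]
        offset += len(content) - (m.end() - m.start())
    return out

def replace_linear(text: str, pattern: str, content: str) -> str:
    # Single pass: collect the unmatched gaps and the replacement text,
    # then join once at the end, so total work is roughly O(len(text)).
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])
        pieces.append(content)
        last = m.end()
    pieces.append(text[last:])
    return "".join(pieces)

# Both produce the same result; only the scaling behavior differs.
text = "hello world " * 2_000
assert replace_quadratic(text, " ", "▁") == replace_linear(text, " ", "▁")
```

The single-pass shape is the point of the fix: the cost no longer depends on how many matches the normalizer has to rewrite, only on the length of the input.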