
Why is the tokenizer slower than tiktoken? #1519

Open
BigBinnie opened this issue Apr 29, 2024 · 8 comments

Comments

@BigBinnie

Hi, I tried the GPT-2 tokenizers from HF and tiktoken, and found that tiktoken is faster than HF. Could you explain why this might happen?

[Screenshot attached: timing comparison, "Screen Shot 2024-04-29 at 6 43 14 PM"]
@ArthurZucker
Collaborator

Hey, could you share a reproducer?
Part of it is that we keep track of the offsets and a lot of other information that tiktoken does not.
We could compute that only when asked for, which could potentially improve speed.
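For context, the "offsets" are the character spans each token maps back to in the original text. A minimal sketch of that extra bookkeeping, assuming the gpt2 checkpoint:

from transformers import GPT2TokenizerFast

# Fast tokenizers record, for each token, the character span it came from.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
enc = tok("hello world", return_offsets_mapping=True)
print(enc["offset_mapping"])  # [(0, 5), (5, 11)]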


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 31, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 5, 2024
@ArthurZucker ArthurZucker reopened this Jun 5, 2024
@ArthurZucker
Collaborator

ArthurZucker commented Jun 5, 2024

It's high on my priority list to do benchmarks and improve our code if needed!

@github-actions github-actions bot removed the Stale label Jun 6, 2024
@BigBinnie
Author

For HF, we use

import time
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()

For tiktoken, we just initialize the tokenizer with tiktoken; everything else is the same:

import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt-2")

Please let me know if you need any other information.

@ArthurZucker
Collaborator

You are using GPT2Tokenizer, which is the slow (pure Python) one. Use GPT2TokenizerFast 😅
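For reference, the change is a one-line swap; a minimal sketch, reusing the snippet from above:

import time
from transformers import GPT2TokenizerFast

# Same API as GPT2Tokenizer, but backed by the Rust tokenizers library.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()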


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jul 22, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 28, 2024
@ArthurZucker ArthurZucker reopened this Jul 31, 2024
@ArthurZucker
Collaborator

We actually dove into this a bit:

  1. Rayon parallelism is kinda broken
  2. there is concurrency contention on the cache for GPT2
  3. there are memory allocations that are also slowing things down

With Fast encode #1560, I was able to get performance similar to tiktoken; I'll keep you posted 😉
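For anyone who wants to check the numbers locally, a rough benchmark sketch; the input text is a placeholder, and tiktoken's "gpt2" encoding corresponds to the HF gpt2 checkpoint:

import time
import tiktoken
from transformers import GPT2TokenizerFast

hf = GPT2TokenizerFast.from_pretrained("gpt2")
tt = tiktoken.get_encoding("gpt2")
text = "some long document " * 10000  # placeholder input

start = time.perf_counter()
hf_ids = hf.encode(text)
print(f"HF fast:  {time.perf_counter() - start:.3f}s ({len(hf_ids)} tokens)")

start = time.perf_counter()
tt_ids = tt.encode(text)
print(f"tiktoken: {time.perf_counter() - start:.3f}s ({len(tt_ids)} tokens)")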

@ArthurZucker
Collaborator

One thing, though: tiktoken forces the splitting of very long sequences. If you split them into a batch yourself, you will already get much better performance; see the sketch below.
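A rough sketch of that idea, assuming the gpt2 checkpoint: chunk the long string at fixed character offsets and encode the pieces as one batch. The chunk size is an arbitrary placeholder, and splitting mid-word can change tokens at the chunk boundaries.

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = "some very long document " * 100000  # placeholder input
chunk_size = 10000  # characters per chunk, arbitrary

# Encode many medium-sized chunks as a batch instead of one giant sequence.
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
batch_ids = tokenizer(chunks)["input_ids"]
token_ids = [tid for piece in batch_ids for tid in piece]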

@github-actions github-actions bot removed the Stale label Aug 1, 2024