
Why is the tokenizer slower than tiktoken? #1519

Open
BigBinnie opened this issue Apr 29, 2024 · 8 comments

Comments

@BigBinnie

Hi, I tried the GPT-2 tokenizers from HF and tiktoken, and found that tiktoken is faster than HF. Could you explain why this might happen?

[Screenshot attached: timing comparison, "Screen Shot 2024-04-29 at 6 43 14 PM"]
@ArthurZucker
Collaborator

Hey, could you share a reproducer?
Part of it is that we keep track of the offsets and a lot of other information that tiktoken does not.
We could compute that only when asked for, which could potentially improve speed.
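For context, the "offsets" are the character spans each token maps back to in the original text. A minimal sketch of that extra bookkeeping, assuming the gpt2 checkpoint:

from transformers import GPT2TokenizerFast

# Fast tokenizers record, for each token, the character span it came from.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
enc = tok("hello world", return_offsets_mapping=True)
print(enc["offset_mapping"])  # [(0, 5), (5, 11)]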


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 31, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 5, 2024
@ArthurZucker ArthurZucker reopened this Jun 5, 2024
@ArthurZucker
Collaborator

ArthurZucker commented Jun 5, 2024

It's high on my priority list to do benchmarks and improve our code if needed!

@github-actions github-actions bot removed the Stale label Jun 6, 2024
@BigBinnie
Author

For HF, we use

import time
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()

For tiktoken, we just initialize the tokenizer with tiktoken; everything else is the same:

import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt-2")

Please let me know if you need any other information.

@ArthurZucker
Collaborator

You are using GPT2Tokenizer, which is the slow (pure Python) one. Use GPT2TokenizerFast 😅
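For reference, the change is a one-line swap; a minimal sketch, reusing the snippet from above:

import time
from transformers import GPT2TokenizerFast

# Same API as GPT2Tokenizer, but backed by the Rust tokenizers library.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()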


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jul 22, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 28, 2024
@ArthurZucker ArthurZucker reopened this Jul 31, 2024
@ArthurZucker
Collaborator

We actually dove into this a bit:

  1. Rayon parallelism is kinda broken
  2. there is concurrency contention on the cache for GPT2
  3. there are memory allocations that are also slowing things down

With Fast encode #1560, I was able to get performance similar to tiktoken; I'll keep you posted 😉
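For anyone who wants to check the numbers locally, a rough benchmark sketch; the input text is a placeholder, and tiktoken's "gpt2" encoding corresponds to the HF gpt2 checkpoint:

import time
import tiktoken
from transformers import GPT2TokenizerFast

hf = GPT2TokenizerFast.from_pretrained("gpt2")
tt = tiktoken.get_encoding("gpt2")
text = "some long document " * 10000  # placeholder input

start = time.perf_counter()
hf_ids = hf.encode(text)
print(f"HF fast:  {time.perf_counter() - start:.3f}s ({len(hf_ids)} tokens)")

start = time.perf_counter()
tt_ids = tt.encode(text)
print(f"tiktoken: {time.perf_counter() - start:.3f}s ({len(tt_ids)} tokens)")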

@ArthurZucker
Collaborator

One thing, though: tiktoken forces the splitting of very long sequences. If you split them into a batch yourself, you will already get much better performance; see the sketch below.
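A rough sketch of that idea, assuming the gpt2 checkpoint: chunk the long string at fixed character offsets and encode the pieces as one batch. The chunk size is an arbitrary placeholder, and splitting mid-word can change tokens at the chunk boundaries.

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = "some very long document " * 100000  # placeholder input
chunk_size = 10000  # characters per chunk, arbitrary

# Encode many medium-sized chunks as a batch instead of one giant sequence.
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
batch_ids = tokenizer(chunks)["input_ids"]
token_ids = [tid for piece in batch_ids for tid in piece]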

@github-actions github-actions bot removed the Stale label Aug 1, 2024