Why the tokenizer is slower than tiktoken? #1519
Comments
Hey, could you share a reproducer?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
It's high on my priority list to run benchmarks and improve our code if needed!
For HF, we use
For tiktoken, we just initialize the tokenizer with tiktoken; everything else is the same.
Please let me know if you need any other information.
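To make comparisons like this reproducible, a minimal timing harness helps. The sketch below uses a stand-in `tokenize` callable (here just `str.split`, purely a placeholder); in an actual comparison you would pass the HF tokenizer's encode method and tiktoken's encode method in turn:

```python
import time

def benchmark(tokenize, texts, repeats=3):
    """Time a tokenizer callable over a list of texts; return the best wall-clock run."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenize(text)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in tokenizer: whitespace split. Replace with the real
# HF or tiktoken encode function when comparing the libraries.
texts = ["hello world, this is a benchmark"] * 1000
elapsed = benchmark(str.split, texts)
print(f"{elapsed:.4f}s for {len(texts)} texts")
```

Taking the best of several repeats reduces noise from warm-up and background load, which matters when the two tokenizers are close in speed.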
You are using
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
We actually dug into this a bit:

One thing, though, is that tiktoken forces the split of very long sequences. If you split them into batches yourself, you are already going to get quite a lot better performance.
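The splitting trick above can be sketched with a plain chunking helper (the `max_chars` parameter and the helper itself are illustrative, not part of either library). Cutting at fixed character offsets can split a word across chunks, so treat this as a rough sketch:

```python
def split_long_text(text, max_chars=4096):
    """Split a very long string into bounded-size chunks so the tokenizer
    never sees one huge input; the chunks can then be encoded as a batch."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chunks = split_long_text("x" * 10000, max_chars=4096)
print(len(chunks))  # 3 chunks: two of 4096 chars, one of 1808
```

With HF tokenizers, passing the resulting list in a single call lets the library tokenize the chunks in batch rather than one giant sequence at a time.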
Hi, I tried the GPT2 tokenizer from both HF and tiktoken, and I found that tiktoken is faster than HF. Could you explain why this might happen?