If split_special_tokens==True, fast_tokenizer is slower than slow_tokenizer #1700

gongel opened this issue Dec 12, 2024 · 0 comments
gongel commented Dec 12, 2024

```python
from transformers import LlamaTokenizer, LlamaTokenizerFast
import time

tokenizer1 = LlamaTokenizer.from_pretrained("./Llama-2-7b-chat-hf", split_special_tokens=True)      # slow tokenizer
tokenizer2 = LlamaTokenizerFast.from_pretrained("./Llama-2-7b-chat-hf", split_special_tokens=True)  # fast tokenizer
print(tokenizer1, tokenizer2)

s_time = time.time()
for i in range(1000):
    tokenizer1.tokenize("你好,where are you?" * 100)
print(f"slow: {time.time() - s_time}")

s_time = time.time()
for i in range(1000):
    tokenizer2.tokenize("你好,where are you?" * 100)
print(f"fast: {time.time() - s_time}")
```

Output:

```
slow: 0.6021890640258789
fast: 0.7353882789611816
```
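For context on what the flag being benchmarked above does: `split_special_tokens=True` tells the tokenizer to treat special tokens (e.g. `<s>`) as ordinary text and split them, instead of keeping them atomic. The following is a minimal, self-contained sketch of that semantics with a toy whitespace/regex tokenizer; it is an illustration of the documented behavior, not the `transformers` implementation, and `SPECIAL_TOKENS` and `tokenize` are names invented for the demo.

```python
import re

# Toy special-token vocabulary for the demo (assumption, not Llama's real list).
SPECIAL_TOKENS = ["<s>", "</s>"]

def tokenize(text, split_special_tokens=False):
    """Toy tokenizer illustrating the `split_special_tokens` semantics.

    With split_special_tokens=False, special tokens are kept atomic;
    with True, they are treated as ordinary text and split further.
    """
    if not split_special_tokens:
        # Protect special tokens by splitting the text around them first.
        pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
        pieces = re.split(pattern, text)
    else:
        pieces = [text]
    tokens = []
    for piece in pieces:
        if piece in SPECIAL_TOKENS:
            tokens.append(piece)  # kept as a single token
        else:
            # Crude word/symbol split, standing in for real subword tokenization.
            tokens.extend(re.findall(r"<|>|/|\w+|\S", piece))
    return [t for t in tokens if t]

print(tokenize("<s> hello", split_special_tokens=False))  # ['<s>', 'hello']
print(tokenize("<s> hello", split_special_tokens=True))   # ['<', 's', '>', 'hello']
```

The extra work in the `True` path (scanning special-token text with the regular tokenization rules rather than matching it in one step) is the kind of per-call overhead that could plausibly explain the fast tokenizer losing its usual edge here, though the issue itself does not identify the cause.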
