❓ The question
The tokenizer wrapper causes unintended behavior when the base tokenizer has a BOS token (as the Llama tokenizers do). In particular, the wrapper's call to the base tokenizer's `encode` function adds BOS tokens even when the wrapper is called with `add_special_tokens=False`. The issue is that the base tokenizer's `encode` defaults to `add_special_tokens=True`, and that default is what takes effect here.
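For concreteness, here is a minimal sketch of the failure mode. The `TokenizerWrapper` class below is a hypothetical stand-in, not the actual wrapper in this repo, but it reproduces the same problem: the flag is never forwarded to the underlying Hugging Face tokenizer.

```python
from transformers import AutoTokenizer

class TokenizerWrapper:
    """Hypothetical stand-in for the wrapper's encode path."""

    def __init__(self, base_tokenizer):
        self.base_tokenizer = base_tokenizer

    def encode(self, text, add_special_tokens=False):
        # Bug: add_special_tokens is dropped here, so the HF default
        # (add_special_tokens=True) applies inside the base tokenizer.
        return self.base_tokenizer.encode(text)

tok = TokenizerWrapper(AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf"))
ids = tok.encode("hello", add_special_tokens=False)
print(ids)  # BOS (id 1 for Llama-2) is prepended despite the flag
```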
This specific call should be fairly easy to fix (see the sketch below), but properly handling tokenizers with BOS tokens would require broader changes to the wrapper.
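The narrow fix would be to forward the caller's choice instead of relying on the Hugging Face default. Against the hypothetical wrapper sketched above, the corrected `encode` would look something like this:

```python
def encode(self, text, add_special_tokens=False):
    # Forward the flag so the HF default (True) no longer applies;
    # any BOS/EOS handling the wrapper wants can then be done explicitly.
    return self.base_tokenizer.encode(
        text, add_special_tokens=add_special_tokens
    )
```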
This also raised a question for me: why is this wrapper needed in the first place, instead of using the Hugging Face library directly? I wanted to better understand the motivation before making changes.