❓ The question
The tokenizer wrapper causes unintended behavior when the base tokenizer has a BOS token (as the Llama tokenizers do). In particular, the wrapper's call to the base tokenizer's `encode` function adds BOS tokens even when the wrapper is called with `add_special_tokens=False`. The issue is that the base tokenizer's `encode` defaults to `add_special_tokens=True`, and that default is what takes effect here.
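For concreteness, here is a minimal sketch of the failure mode. The `TokenizerWrapper` class below is a hypothetical stand-in, not the actual wrapper in this repo, but it reproduces the same problem: the flag is never forwarded to the underlying Hugging Face tokenizer.

```python
from transformers import AutoTokenizer

class TokenizerWrapper:
    """Hypothetical stand-in for the wrapper's encode path."""

    def __init__(self, base_tokenizer):
        self.base_tokenizer = base_tokenizer

    def encode(self, text, add_special_tokens=False):
        # Bug: add_special_tokens is dropped here, so the HF default
        # (add_special_tokens=True) applies inside the base tokenizer.
        return self.base_tokenizer.encode(text)

tok = TokenizerWrapper(AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf"))
ids = tok.encode("hello", add_special_tokens=False)
print(ids)  # BOS (id 1 for Llama-2) is prepended despite the flag
```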
This specific call should be fairly easy to fix (see the sketch below), but properly handling tokenizers with BOS tokens would require broader changes to the wrapper.
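The narrow fix would be to forward the caller's choice instead of relying on the Hugging Face default. Against the hypothetical wrapper sketched above, the corrected `encode` would look something like this:

```python
def encode(self, text, add_special_tokens=False):
    # Forward the flag so the HF default (True) no longer applies;
    # any BOS/EOS handling the wrapper wants can then be done explicitly.
    return self.base_tokenizer.encode(
        text, add_special_tokens=add_special_tokens
    )
```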
This also raised a question for me: why is this wrapper needed in the first place, instead of using the Hugging Face library directly? I wanted to better understand the motivation before making changes.