What is the default tokenizer behaviour? #1314
@Narsil would you know the answer here? Thanks!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I don't know for sure here, but based on the docs, it does seem that:
There is no preprocessing by default in TGI (especially not at the chat level, although we're finally adding it: #1427). By default, TGI uses the tokenizer as-is and does exactly what the tokenizer is configured to do. Yes, currently requests should be formatted client side. That is what enabled fast turnaround on various prompting techniques, and different behavior for instruct models versus bare-bones ones, without requiring TGI to change in that regard. Things seem more or less stable for now, which is why we're adding it now.
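For illustration, here is a minimal sketch of what "formatted client side" can look like: the chat template is applied locally and TGI only receives the rendered prompt via its `/generate` endpoint. The model name, server URL, and generation parameters are placeholder assumptions.

```python
# Sketch of client-side formatting: the chat template is applied locally,
# so TGI only ever sees the final prompt string.
import requests
from transformers import AutoTokenizer

# Placeholder model; use whichever instruct model your TGI server is running.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [{"role": "user", "content": "What is the capital of France?"}]

# Render the instruct prompt client-side; TGI does no chat-level preprocessing.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# TGI's /generate endpoint takes the fully formatted prompt as-is.
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": prompt, "parameters": {"max_new_tokens": 64}},
)
print(resp.json()["generated_text"])
```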
Appreciate that, @Narsil! To be clear, when you say "TGI will use the tokenizer as-is", you mean:
However:
Correct?
Yes. (And for the chat template: yes, we do not use it, except if you use the OpenAI-compatible endpoint /api/completions/xx.)
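For contrast with the client-side flow above, a sketch of going through the OpenAI-compatible route instead, where TGI applies the model's chat template server-side. The path and payload below assume the standard OpenAI chat-completions spec; the exact route may differ across TGI versions (the comment above mentions /api/completions/xx).

```python
# Sketch: OpenAI-style chat request, where TGI applies the chat template itself.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed route; check your TGI version
    json={
        "model": "tgi",  # placeholder; a TGI instance serves one model
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```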
System Info
N/A
Information
Tasks
Reproduction
I'm trying to understand whether special tokens (e.g. BOS and EOS) are added during tokenization and stripped during decoding.
Encoding:
Decoding:
Is this understanding correct?
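A quick way to check this against a plain `transformers` tokenizer (which is what TGI loads as-is); the model name below is just an example, and exact BOS/EOS conventions vary per model:

```python
# Inspect default special-token handling on encode/decode.
from transformers import AutoTokenizer

# Example Llama-style tokenizer (ungated test copy); swap in your own model.
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

text = "Hello world"

# Encoding: add_special_tokens=True is the default, which for Llama-style
# tokenizers prepends BOS; EOS is generally not appended on encode.
with_special = tokenizer.encode(text)                               # e.g. [1, 15043, 3186]
without_special = tokenizer.encode(text, add_special_tokens=False)  # e.g. [15043, 3186]

# Decoding: skip_special_tokens controls whether BOS/EOS appear in the output.
print(tokenizer.decode(with_special))                               # e.g. "<s> Hello world"
print(tokenizer.decode(with_special, skip_special_tokens=True))     # e.g. "Hello world"
```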
Expected behavior
If possible, could the default tokenization strategy be described in the README so users know what to expect?