
What is the default tokenizer behaviour? #1314

Closed
2 of 4 tasks
RonanKMcGovern opened this issue Dec 5, 2023 · 6 comments

Comments

@RonanKMcGovern

System Info

N/A

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I'm trying to understand whether special tokens (i.e. BOS and EOS) are added and suppressed on tokenization and decoding.

Encoding:

  • I searched for add_special_tokens in the repo and don't see it being set to true anywhere when tokenizing. So, it seems that no EOS tokens are automatically added.

Decoding:

  • I searched for skip_special_tokens, and it seems, here on line 541, that BOS and EOS are indeed suppressed.

Is this understanding correct?
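To illustrate the semantics being asked about, here is a toy sketch (not TGI code): Hugging Face tokenizers take an add_special_tokens flag at encode time and a skip_special_tokens flag at decode time, and this miniature tokenizer mimics the defaults described above — no special tokens added on encode, special tokens stripped on decode.

```python
# Toy tokenizer mimicking the encode/decode flags in question.
# Vocabulary and token IDs are made up for illustration.

BOS, EOS = 1, 2
VOCAB = {"hello": 10, "world": 11}
INV = {v: k for k, v in VOCAB.items()}
INV.update({BOS: "<s>", EOS: "</s>"})

def encode(text, add_special_tokens=False):
    # Default: no BOS/EOS wrapped around the token IDs.
    ids = [VOCAB[w] for w in text.split()]
    return [BOS] + ids + [EOS] if add_special_tokens else ids

def decode(ids, skip_special_tokens=True):
    # Default: BOS/EOS are dropped from the decoded text.
    toks = [INV[i] for i in ids
            if not (skip_special_tokens and i in (BOS, EOS))]
    return " ".join(toks)

ids = encode("hello world")
print(ids)                          # [10, 11] — no specials added
print(decode([BOS, 10, 11, EOS]))   # "hello world" — specials suppressed
```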

Expected behavior

If possible, could the default tokenization strategy be described in the README so users know what to expect?

@RonanKMcGovern
Author

@Narsil would you know the answer here? thanks


github-actions bot commented Jan 7, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jan 7, 2024
@RonanKMcGovern
Author

I don't know for sure here but based on the docs, it does seem that:

  • There are no BOS or EOS tokens added by default in prompt preparation. This means requests must be formatted appropriately AND/OR a chat template provided.
  • BOS and EOS tokens are removed by default in the response provided by the API.
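If that reading is right, "formatted appropriately" means applying the instruction template client-side before calling TGI. A hedged sketch — the `[INST]` wrapper is the Llama-2-chat convention and is only an example (other instruct models use different templates; check the model card), and the payload shape follows TGI's `/generate` endpoint:

```python
import json

def format_llama2_prompt(user_message: str) -> str:
    # Llama-2-chat style instruction markers, added client-side.
    # BOS is typically handled by the tokenizer config, so it is
    # deliberately not included here.
    return f"[INST] {user_message} [/INST]"

# TGI's /generate endpoint accepts a JSON body with "inputs" and "parameters".
payload = {
    "inputs": format_llama2_prompt("What is the capital of France?"),
    "parameters": {"max_new_tokens": 64},
}
body = json.dumps(payload)

# e.g. requests.post("http://localhost:8080/generate", data=body,
#                    headers={"Content-Type": "application/json"})
```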

@github-actions github-actions bot removed the Stale label Jan 9, 2024
@Narsil
Collaborator

Narsil commented Jan 10, 2024

There is no preprocessing by default in TGI (especially not chat level, although we're finally adding it: #1427).

By default, TGI uses the tokenizer as-is and does exactly what the tokenizer is configured to do. Yes, currently requests should be formatted client-side. That is what enabled fast turnaround on various prompting techniques, and different behavior for instruct models vs. barebones ones, without requiring changes to TGI. Things seem more or less stable now, which is why we're finally adding it.
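Since TGI defers to the tokenizer's own configuration, what gets prepended at encode time is governed by the model's `tokenizer_config.json` rather than by TGI itself. An illustrative fragment — the values shown are typical for Llama-2-style models, and your model's file may differ:

```json
{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": "<s>",
  "eos_token": "</s>"
}
```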

@RonanKMcGovern
Author

Appreciate that @Narsil !

To be clear, when you say "TGI will use the tokenizer as-is", you mean:

  • Any tokenizer.chat_template is NOT used.
  • No BOS/special tokens are added when tokenizing.

However:

  • "skip_special_tokens" IS being set to True when decoding.

Correct?

@Narsil
Collaborator

Narsil commented Jan 17, 2024

Yes.

(And for the chat template, yes we do not use it, except if you use the OpenAI compatible endpoint /api/completions/xx)
