
What is the default tokenizer behaviour? #1314

Closed
2 of 4 tasks
RonanKMcGovern opened this issue Dec 5, 2023 · 6 comments

Comments

@RonanKMcGovern

System Info

N/A

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I'm trying to understand whether special tokens (i.e. BOS and EOS) are added and suppressed on tokenization and decoding.

Encoding:

  • I searched for add_special_tokens in the repo and don't see it being set to true anywhere when tokenizing. So, it seems that no EOS tokens are automatically added.

Decoding:

  • I searched for skip_special_tokens, and it seems, here on line 541, that BOS and EOS are indeed suppressed.

Is this understanding correct?
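To illustrate the semantics being asked about, here is a toy sketch (not TGI code): Hugging Face tokenizers take an add_special_tokens flag at encode time and a skip_special_tokens flag at decode time, and this miniature tokenizer mimics the defaults described above — no special tokens added on encode, special tokens stripped on decode.

```python
# Toy tokenizer mimicking the encode/decode flags in question.
# Vocabulary and token IDs are made up for illustration.

BOS, EOS = 1, 2
VOCAB = {"hello": 10, "world": 11}
INV = {v: k for k, v in VOCAB.items()}
INV.update({BOS: "<s>", EOS: "</s>"})

def encode(text, add_special_tokens=False):
    # Default: no BOS/EOS wrapped around the token IDs.
    ids = [VOCAB[w] for w in text.split()]
    return [BOS] + ids + [EOS] if add_special_tokens else ids

def decode(ids, skip_special_tokens=True):
    # Default: BOS/EOS are dropped from the decoded text.
    toks = [INV[i] for i in ids
            if not (skip_special_tokens and i in (BOS, EOS))]
    return " ".join(toks)

ids = encode("hello world")
print(ids)                          # [10, 11] — no specials added
print(decode([BOS, 10, 11, EOS]))   # "hello world" — specials suppressed
```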

Expected behavior

If possible, could the default tokenization strategy be described in the README so users know what to expect?

@RonanKMcGovern
Author

@Narsil would you know the answer here? thanks


github-actions bot commented Jan 7, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jan 7, 2024
@RonanKMcGovern
Author

I don't know for sure here but based on the docs, it does seem that:

  • There are no BOS or EOS tokens added by default in prompt preparation. This means requests must be formatted appropriately AND/OR a chat template provided.
  • BOS and EOS tokens are removed by default in the response provided by the API.
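If that reading is right, "formatted appropriately" means applying the instruction template client-side before calling TGI. A hedged sketch — the `[INST]` wrapper is the Llama-2-chat convention and is only an example (other instruct models use different templates; check the model card), and the payload shape follows TGI's `/generate` endpoint:

```python
import json

def format_llama2_prompt(user_message: str) -> str:
    # Llama-2-chat style instruction markers, added client-side.
    # BOS is typically handled by the tokenizer config, so it is
    # deliberately not included here.
    return f"[INST] {user_message} [/INST]"

# TGI's /generate endpoint accepts a JSON body with "inputs" and "parameters".
payload = {
    "inputs": format_llama2_prompt("What is the capital of France?"),
    "parameters": {"max_new_tokens": 64},
}
body = json.dumps(payload)

# e.g. requests.post("http://localhost:8080/generate", data=body,
#                    headers={"Content-Type": "application/json"})
```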

@github-actions github-actions bot removed the Stale label Jan 9, 2024
@Narsil
Collaborator

Narsil commented Jan 10, 2024

There is no preprocessing by default in TGI (especially not chat level, although we're finally adding it: #1427).

By default, TGI uses the tokenizer as-is and does exactly what the tokenizer is configured to do. Yes, currently requests should be formatted client-side. That is what enabled fast turnaround on various prompting techniques, and different behavior for instruct models vs. barebones ones, without requiring changes to TGI. Things seem more or less stable now, which is why we're finally adding it.
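Since TGI defers to the tokenizer's own configuration, what gets prepended at encode time is governed by the model's `tokenizer_config.json` rather than by TGI itself. An illustrative fragment — the values shown are typical for Llama-2-style models, and your model's file may differ:

```json
{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": "<s>",
  "eos_token": "</s>"
}
```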

@RonanKMcGovern
Author

Appreciate that @Narsil !

To be clear, when you say "TGI will use the tokenizer as-is", you mean:

  • Any tokenizer.chat_template is NOT used.
  • No BOS/special tokens are added when tokenizing.

However:

  • "skip_special_tokens" IS being set to True when decoding.

Correct?

@Narsil
Collaborator

Narsil commented Jan 17, 2024

Yes.

(And for the chat template, yes we do not use it, except if you use the OpenAI compatible endpoint /api/completions/xx)
