fix: refactor post_processor logic and add test #2137

Merged 3 commits into main from tokenizer-post-processor on Jun 27, 2024
Conversation

@drbh (Collaborator) commented on Jun 27, 2024

This PR updates the logic for adding a post_processor to a tokenizer and should resolve the CI failures seen in https://github.com/huggingface/text-generation-inference/actions/runs/9698059633/job/26765404563.

This PR aims to match the functionality of transformers/models/llama/tokenization_llama_fast.py and fixes an issue where the single and pair templates were not correctly constructed and applied. This resulted in a missing bos_token and, subsequently, an error when accessing slots via batch.slots[batch.slot_indices]. A test is also added for that specific case.
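
For context, a minimal sketch of the kind of TemplateProcessing the referenced Python override builds, written against the tokenizers crate. The bos token string "<s>" and id 1 are illustrative assumptions, not values taken from this PR:

    // Sketch only: mirrors the transformers LlamaTokenizerFast behavior,
    // which prepends the bos token to both the single and pair templates.
    use tokenizers::processors::template::TemplateProcessing;

    fn llama_style_post_processor() -> TemplateProcessing {
        TemplateProcessing::builder()
            // single sequence: "<s> $A"
            .try_single("<s>:0 $A:0")
            .unwrap()
            // pair of sequences: "<s> $A <s> $B"
            .try_pair("<s>:0 $A:0 <s>:1 $B:1")
            .unwrap()
            // map each special token used in the templates to its id
            .special_tokens(vec![("<s>", 1)])
            .build()
            .unwrap()
    }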

    if let Some(class) = &tokenizer_config.tokenizer_class {
        if class == "LlamaTokenizer" || class == "LlamaTokenizerFast" {
            tracing::info!("Overriding LlamaTokenizer with TemplateProcessing to follow python override defined in https://github.com/huggingface/transformers/blob/4aa17d00690b7f82c95bb2949ea57e22c35b4336/src/transformers/models/llama/tokenization_llama_fast.py#L203-L205");
            if let Some(post_processor) = create_post_processor(tokenizer, &tokenizer_config) {
Collaborator:

We need to override only if the current post_processor is None; I forgot that condition.

@drbh (Author):

Updated and added in the latest commit.
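
A minimal sketch of that guard, assuming the tokenizers crate's get_post_processor accessor (not the exact PR code):

    // Apply the Llama-style override only when the tokenizer has no
    // post_processor configured yet.
    if tokenizer.get_post_processor().is_none() {
        if let Some(post_processor) = create_post_processor(&tokenizer, &tokenizer_config) {
            tokenizer.with_post_processor(post_processor);
        }
    }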

    pub fn create_post_processor(
        tokenizer: &Tokenizer,
        tokenizer_config: &HubTokenizerConfig,
    ) -> Option<TemplateProcessing> {
Collaborator:

Why not return a Result and replace all those unwrap() calls with ??

@drbh (Author):

Great point! Updated in the latest commit.
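
A sketch of what the Result-based version looks like; the error type and the hard-coded templates are assumptions for illustration:

    use tokenizers::processors::template::TemplateProcessing;

    // Returning a Result lets template-construction failures propagate
    // with `?` instead of panicking on unwrap().
    pub fn create_post_processor() -> Result<TemplateProcessing, Box<dyn std::error::Error>> {
        let post_processor = TemplateProcessing::builder()
            .try_single("<s>:0 $A:0")?
            .try_pair("<s>:0 $A:0 <s>:1 $B:1")?
            .special_tokens(vec![("<s>", 1)])
            .build()?;
        Ok(post_processor)
    }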

    if let Some(class) = &tokenizer_config.tokenizer_class {
        if class == "LlamaTokenizer" || class == "LlamaTokenizerFast" {
            tracing::info!("Overriding LlamaTokenizer with TemplateProcessing to follow python override defined in https://github.com/huggingface/transformers/blob/4aa17d00690b7f82c95bb2949ea57e22c35b4336/src/transformers/models/llama/tokenization_llama_fast.py#L203-L205");
            if let Some(post_processor) = create_post_processor(tokenizer, &tokenizer_config) {
                tokenizer.with_post_processor(post_processor);
Collaborator:

Let's log only when we actually override; here we would log even if the function fails.

@drbh (Author):

Updated; the log is now emitted only after create_post_processor succeeds.
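
A sketch of the reordered flow (abbreviated log message; not the exact PR code):

    // Log the override only once create_post_processor has succeeded, so a
    // failure no longer produces a misleading "Overriding..." message.
    if let Ok(post_processor) = create_post_processor(tokenizer, &tokenizer_config) {
        tracing::info!("Overriding LlamaTokenizer with TemplateProcessing ...");
        tokenizer.with_post_processor(post_processor);
    }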


let mut single = String::new();
let mut pair = String::new();
let mut special_tokens = Vec::new();
Collaborator:

Can we keep an actual Vec? Building a String here is really a crutch.

@drbh (Author):

Yep, updated!
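
A sketch of the Vec-based construction; try_single and try_pair accept anything convertible into a Template, including a Vec of pieces. Token names and ids are illustrative assumptions:

    use tokenizers::processors::template::TemplateProcessing;

    fn build_processor() -> Result<TemplateProcessing, Box<dyn std::error::Error>> {
        // Keep the template pieces in Vecs instead of concatenating a String.
        let mut single: Vec<String> = Vec::new();
        let mut pair: Vec<String> = Vec::new();
        let mut special_tokens: Vec<(String, u32)> = Vec::new();

        // Illustrative assumption: bos token "<s>" with id 1.
        single.extend(["<s>:0".to_string(), "$A:0".to_string()]);
        pair.extend([
            "<s>:0".to_string(),
            "$A:0".to_string(),
            "<s>:1".to_string(),
            "$B:1".to_string(),
        ]);
        special_tokens.push(("<s>".to_string(), 1));

        Ok(TemplateProcessing::builder()
            .try_single(single)?
            .try_pair(pair)?
            .special_tokens(special_tokens)
            .build()?)
    }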

@Narsil (Collaborator) left a review:

LGTM

@Narsil merged commit 74b0231 into main on Jun 27, 2024. 9 checks passed.
@Narsil deleted the tokenizer-post-processor branch on Jun 27, 2024 at 21:16.
glegendre01 pushed a commit that referenced this pull request on Jul 2, 2024:

* fix: refactor post_processor logic and add test
* fix: remove dev comment
* fix: adjust when post_processor is overridden and improve create_post_processor
yuanwu2017 pushed a commit to yuanwu2017/tgi-gaudi that referenced this pull request on Sep 26, 2024:

* fix: refactor post_processor logic and add test
* fix: remove dev comment
* fix: adjust when post_processor is overridden and improve create_post_processor