fix: refactor post_processor logic and add test #2137
Conversation
router/src/main.rs
Outdated
```rust
if let Some(class) = &tokenizer_config.tokenizer_class {
    if class == "LlamaTokenizer" || class == "LlamaTokenizerFast" {
        tracing::info!("Overriding LlamaTokenizer with TemplateProcessing to follow python override defined in https://github.com/huggingface/transformers/blob/4aa17d00690b7f82c95bb2949ea57e22c35b4336/src/transformers/models/llama/tokenization_llama_fast.py#L203-L205");
        if let Some(post_processor) = create_post_processor(tokenizer, &tokenizer_config) {
```
We need to override only if the current post_processor is None; forgot that condition.
updated and added in latest commit
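A minimal sketch of that guard (assuming the `tokenizers` crate's `get_post_processor()` accessor; not the exact committed diff):

```rust
// Override only when the tokenizer has no post-processor configured yet.
if tokenizer.get_post_processor().is_none() {
    if let Some(post_processor) = create_post_processor(tokenizer, &tokenizer_config) {
        tokenizer.with_post_processor(post_processor);
    }
}
```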
router/src/main.rs
Outdated
```rust
pub fn create_post_processor(
    tokenizer: &Tokenizer,
    tokenizer_config: &HubTokenizerConfig,
) -> Option<TemplateProcessing> {
```
Why not return a `Result` and turn all those `unwrap()` calls into `?`?
great point! updated in latest
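Sketched, the `Result`-based version could look like this (the error type and the placeholder template values are assumptions for illustration; the real function derives them from `tokenizer_config`):

```rust
use tokenizers::processors::template::TemplateProcessing;

pub fn create_post_processor(
    tokenizer: &Tokenizer,
    tokenizer_config: &HubTokenizerConfig,
) -> Result<TemplateProcessing, tokenizers::processors::template::TemplateProcessingBuilderError> {
    // Placeholder construction; the real function builds these from the
    // configured bos/eos tokens and their vocabulary ids.
    let single: Vec<String> = vec!["<s>:0".into(), "$A:0".into()];
    let pair: Vec<String> = vec!["<s>:0".into(), "$A:0".into(), "<s>:1".into(), "$B:1".into()];
    let special_tokens: Vec<(String, u32)> = vec![("<s>".into(), 1)];

    // `?` propagates template parsing and builder errors instead of
    // panicking via unwrap().
    let post_processor = TemplateProcessing::builder()
        .try_single(single)?
        .try_pair(pair)?
        .special_tokens(special_tokens)
        .build()?;

    Ok(post_processor)
}
```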
```rust
if let Some(class) = &tokenizer_config.tokenizer_class {
    if class == "LlamaTokenizer" || class == "LlamaTokenizerFast" {
        tracing::info!("Overriding LlamaTokenizer with TemplateProcessing to follow python override defined in https://github.com/huggingface/transformers/blob/4aa17d00690b7f82c95bb2949ea57e22c35b4336/src/transformers/models/llama/tokenization_llama_fast.py#L203-L205");
        if let Some(post_processor) = create_post_processor(tokenizer, &tokenizer_config) {
            tokenizer.with_post_processor(post_processor);
```
Let's log only when we actually override; here we would log even if the function fails.
updated and moved after `create_post_processor` succeeds
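i.e., roughly (a sketch with the log message shortened; the error handling shown is illustrative, not the exact committed code):

```rust
if class == "LlamaTokenizer" || class == "LlamaTokenizerFast" {
    match create_post_processor(tokenizer, &tokenizer_config) {
        Ok(post_processor) => {
            // Log only once the post-processor was actually built.
            tracing::info!("Overriding LlamaTokenizer with TemplateProcessing");
            tokenizer.with_post_processor(post_processor);
        }
        Err(err) => tracing::warn!("Could not build TemplateProcessing override: {err}"),
    }
}
```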
```rust
let mut single = String::new();
let mut pair = String::new();
let mut special_tokens = Vec::new();
```
Can we keep the actual vec? `String` is really a crutch.
yep, updated!
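For reference, the Vec-based construction could look roughly like this inside the `Result`-returning `create_post_processor` (a sketch with simplified field shapes; the real PR also handles the `add_bos_token`/`add_eos_token` flags and the eos pieces):

```rust
let mut single: Vec<String> = Vec::new();
let mut pair: Vec<String> = Vec::new();
let mut special_tokens: Vec<(String, u32)> = Vec::new();

// Prepend the bos token to both templates when the config provides one.
if let Some(bos) = &tokenizer_config.bos_token {
    let bos_id = tokenizer
        .token_to_id(bos)
        .expect("bos_token should be in the vocabulary");
    special_tokens.push((bos.clone(), bos_id));
    single.push(format!("{bos}:0"));
    pair.push(format!("{bos}:0"));
}

// "$A"/"$B" are TemplateProcessing placeholders for the two input
// sequences; the ":0"/":1" suffixes are the type ids of each piece.
single.push("$A:0".to_string());
pair.push("$A:0".to_string());
pair.push("$B:1".to_string());

let post_processor = TemplateProcessing::builder()
    .try_single(single)?
    .try_pair(pair)?
    .special_tokens(special_tokens)
    .build()?;
```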
LGTM
* fix: refactor post_processor logic and add test
* fix: remove dev comment
* fix: adjust when post_processor is overridden and improve create_post_processor
This PR updates the logic for adding a `post_processor` to a tokenizer and should resolve the issues in CI: https://github.com/huggingface/text-generation-inference/actions/runs/9698059633/job/26765404563

This PR aims to match the functionality in `transformers/models/llama/tokenization_llama_fast.py` and fixes an issue where the `single` and `pair` templates were not correctly constructed and applied. This resulted in a missing `bos_token` and subsequently an issue with accessing slots in `batch.slots[batch.slot_indices]`. A test is also added for that specific case.
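The shape of the added test, sketched (the `HubTokenizerConfig` fields and the expected templates here are assumptions for illustration, not a verbatim copy; `<s>` has id 1 in the Llama vocabulary):

```rust
#[test]
fn test_create_post_processor() {
    // Hypothetical Llama-style config: bos should be prepended.
    let tokenizer_config = HubTokenizerConfig {
        bos_token: Some("<s>".to_string()),
        eos_token: Some("</s>".to_string()),
        ..Default::default()
    };

    // Requires the `http` feature of `tokenizers` to fetch from the Hub.
    let tokenizer =
        Tokenizer::from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", None).unwrap();
    let post_processor = create_post_processor(&tokenizer, &tokenizer_config).unwrap();

    // The single template must start with the bos token; its absence is
    // what broke batch.slots[batch.slot_indices] downstream.
    let expected = TemplateProcessing::builder()
        .try_single("<s>:0 $A:0")
        .unwrap()
        .try_pair("<s>:0 $A:0 <s>:1 $B:1")
        .unwrap()
        .special_tokens(vec![("<s>".to_string(), 1)])
        .build()
        .unwrap();

    assert_eq!(post_processor, expected);
}
```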