Add support for chat templates (#408)
* Add basic support for chat templates

* Cleanup

* JSDoc improvements

* Support conversion of user-defined functions

* Cleanup

* Fix function creation

* Add unit tests for templates

* Cleanup

* Improve JSDoc

* Add missing return types

* Add chat templates docs to table of contents

* Add support for logical negation

* Fix nested logical negation

* Add unit tests for logical operators

* Add loop variables

* Add support for `RuntimeValue` built-in functions

* Add unit tests for string instance methods

* Fix conversion of normal function to `FunctionValue`

* Update object method unit tests

* Save chat template to tokenizer_config.json during conversion

* Fix `raise_exception` error

* Add `!=` operator for booleans

* Remember to increment loop index

* Cleanup for loop evaluator

* Use `is` helper function

* Add support for text nodes

i.e., non-Jinja statements/expressions

* Add auto-generated templating tests

* Update unit tests

* Remove unused function

* Add default chat templates

* Use repo with up-to-date tokenizer config

* Temporarily disable zephyr test

* Delete templates.test.js

* Move Jinja functionality to `@huggingface/jinja`

* Fix template cache type

* Update chat template unit tests

* Update `@huggingface/jinja` version

* Fix default llama2 system prompt usage

* Add unit test for llama2 w/o chat template set

* Update jinja version

* Update jinja version

* Add unit test for user-defined chat templates

Example from https://discuss.huggingface.co/t/issue-with-llama-2-chat-template-and-out-of-date-documentation/61645/3

* Add `AddedToken` for improved tokenization

* Add example usage for chat templates

* Add 'first' Metaspace pretokenizer prepend scheme

* Formatting

* Update wav2vec2 converter special tokens whitespace split

* Fix Metaspace pretokenizer split criteria

* Update inputs of `PreTokenizerSequence`

* Improve Metaspace pretokenizer

* Update llama tokenizer tests

* Improve handling of legacy llama tokenizer

* Re-enable SPM tests

* Add static tokenizer test cases

* Add llama2 static tests

* Allow user to override legacy tokenizer behaviour in `.from_pretrained`

* Add legacy tokenizer unit tests

* Bump jinja version to 0.1.0
xenova authored Dec 18, 2023
1 parent 6129e45 commit d4f7cd5
Showing 8 changed files with 733 additions and 111 deletions.
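To illustrate what the chat-template feature in this commit does, here is a minimal plain-Python sketch: it flattens a list of `{role, content}` messages into a single prompt string. The template format below (ChatML-like `<|role|>` markers) is purely illustrative; the actual implementation renders a model's own Jinja template via `@huggingface/jinja`.

```python
# Sketch of what a chat template does: turn a list of
# {role, content} messages into one flat prompt string.
# The <|role|> format here is illustrative, not any model's real template.

def apply_chat_template(messages, add_generation_prompt=True):
    """Render chat messages into a single prompt string."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}\n")
    if add_generation_prompt:
        # Leave the prompt open for the model to continue as the assistant.
        parts.append("<|assistant|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = apply_chat_template(messages)
print(prompt)
```

In the library itself, the template string is model-specific and is evaluated by the Jinja engine rather than hard-coded as above.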
12 changes: 12 additions & 0 deletions package-lock.json


3 changes: 3 additions & 0 deletions package.json
@@ -44,6 +44,9 @@
   "optionalDependencies": {
     "onnxruntime-node": "1.14.0"
   },
+  "peerDependencies": {
+    "@huggingface/jinja": "^0.1.0"
+  },
   "devDependencies": {
     "@types/jest": "^29.5.1",
     "catharsis": "github:xenova/catharsis",
7 changes: 7 additions & 0 deletions scripts/convert.py
@@ -283,6 +283,13 @@ def main():
         # Load tokenizer
         tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
 
+        # To avoid inserting all chat templates into tokenizers.js, we save the chat template
+        # to the tokenizer_config.json file, and load it when the tokenizer is loaded.
+        if getattr(tokenizer, 'chat_template', None) is None and \
+                getattr(tokenizer, 'use_default_system_prompt', False):
+            # No chat template specified, and we use the default
+            setattr(tokenizer, 'chat_template', tokenizer.default_chat_template)
+
     except KeyError:
         pass # No Tokenizer

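The effect of the `convert.py` change above is that the resolved chat template is persisted into `tokenizer_config.json` and read back at load time. A stdlib-only sketch of that save/load round trip (the single `chat_template` field matches the comment in the diff; the template string itself is a placeholder):

```python
import json
import os
import tempfile

# Sketch: persist a chat template into tokenizer_config.json and read it
# back, mirroring the round trip the converter sets up.
chat_template = (
    "{% for message in messages %}{{ message['content'] }}{% endfor %}"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokenizer_config.json")
    with open(path, "w") as f:
        json.dump({"chat_template": chat_template}, f)
    with open(path) as f:
        loaded = json.load(f)["chat_template"]
```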
4 changes: 2 additions & 2 deletions scripts/extra/wav2vec2.py
@@ -20,8 +20,8 @@ def generate_tokenizer_json(tokenizer):
             "id": v,
             "content": k,
             "single_word": False,
-            "lstrip": False,
-            "rstrip": False,
+            "lstrip": True,
+            "rstrip": True,
             "normalized": False,
             "special": True
         }
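The `lstrip`/`rstrip` flip in the wav2vec2 converter above means the special tokens now absorb surrounding whitespace when matched, instead of leaving it attached to neighbouring word pieces. A loose plain-Python analogy of that matching behaviour (the token and text below are illustrative, not the library's actual matching code):

```python
import re

def split_on_added_token(text, content, lstrip, rstrip):
    """Split text around an added token, optionally consuming
    whitespace on either side (rough analogy of AddedToken's
    lstrip/rstrip flags)."""
    pattern = re.escape(content)
    if lstrip:
        pattern = r"\s*" + pattern
    if rstrip:
        pattern = pattern + r"\s*"
    # Capturing group keeps the matched token (plus any consumed
    # whitespace) as its own segment.
    return re.split(f"({pattern})", text)

# With lstrip/rstrip enabled, the spaces around <pad> travel with the
# token rather than leaking into the adjacent segments.
parts = split_on_added_token("HELLO <pad> WORLD", "<pad>", True, True)
```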
