Add support for chat templates (#408)
* Add basic support for chat templates

* Cleanup

* JSDoc improvements

* Support conversion of user-defined functions

* Cleanup

* Fix function creation

* Add unit tests for templates

* Cleanup

* Improve JSDoc

* Add missing return types

* Add chat templates docs to table of contents

* Add support for logical negation

* Fix nested logical negation

* Add unit tests for logical operators

* Add loop variables

* Add support for `RuntimeValue` built-in functions

* Add unit tests for string instance methods

* Fix conversion of normal function to `FunctionValue`

* Update object method unit tests

* Save chat template to tokenizer_config.json during conversion

* Fix `raise_exception` error

* Add `!=` operator for booleans

* Remember to increment loop index

* Cleanup for loop evaluator

* Use `is` helper function

* Add support for text nodes

i.e., non-Jinja statements/expressions

* Add auto-generated templating tests

* Update unit tests

* Remove unused function

* Add default chat templates

* Use repo with up-to-date tokenizer config

* Temporarily disable zephyr test

* Delete templates.test.js

* Move Jinja functionality to `@huggingface/jinja`

* Fix template cache type

* Update chat template unit tests

* Update `@huggingface/jinja` version

* Fix default llama2 system prompt usage

* Add unit test for llama2 w/o chat template set

* Update jinja version

* Update jinja version

* Add unit test for user-defined chat templates

Example from https://discuss.huggingface.co/t/issue-with-llama-2-chat-template-and-out-of-date-documentation/61645/3

* Add `AddedToken` for improved tokenization

* Add example usage for chat templates

* Add 'first' Metaspace pretokenizer prepend scheme

* Formatting

* Update wav2vec2 converter special tokens whitespace split

* Fix Metaspace pretokenizer split criteria

* Update inputs of `PreTokenizerSequence`

* Improve Metaspace pretokenizer

* Update llama tokenizer tests

* Improve handling of legacy llama tokenizer

* Re-enable SPM tests

* Add static tokenizer test cases

* Add llama2 static tests

* Allow user to override legacy tokenizer behaviour in `.from_pretrained`

* Add legacy tokenizer unit tests

* Bump jinja version to 0.1.0
xenova authored Dec 18, 2023
1 parent 6129e45 commit d4f7cd5
Showing 8 changed files with 733 additions and 111 deletions.
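To illustrate what the chat-template feature in this commit does, here is a minimal plain-Python sketch: it flattens a list of `{role, content}` messages into a single prompt string. The template format below (ChatML-like `<|role|>` markers) is purely illustrative; the actual implementation renders a model's own Jinja template via `@huggingface/jinja`.

```python
# Sketch of what a chat template does: turn a list of
# {role, content} messages into one flat prompt string.
# The <|role|> format here is illustrative, not any model's real template.

def apply_chat_template(messages, add_generation_prompt=True):
    """Render chat messages into a single prompt string."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}\n")
    if add_generation_prompt:
        # Leave the prompt open for the model to continue as the assistant.
        parts.append("<|assistant|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = apply_chat_template(messages)
print(prompt)
```

In the library itself, the template string is model-specific and is evaluated by the Jinja engine rather than hard-coded as above.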
12 changes: 12 additions & 0 deletions package-lock.json


3 changes: 3 additions & 0 deletions package.json
@@ -44,6 +44,9 @@
   "optionalDependencies": {
     "onnxruntime-node": "1.14.0"
   },
+  "peerDependencies": {
+    "@huggingface/jinja": "^0.1.0"
+  },
   "devDependencies": {
     "@types/jest": "^29.5.1",
     "catharsis": "github:xenova/catharsis",
7 changes: 7 additions & 0 deletions scripts/convert.py
@@ -283,6 +283,13 @@ def main():
         # Load tokenizer
         tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
 
+        # To avoid inserting all chat templates into tokenizers.js, we save the chat template
+        # to the tokenizer_config.json file, and load it when the tokenizer is loaded.
+        if getattr(tokenizer, 'chat_template', None) is None and \
+                getattr(tokenizer, 'use_default_system_prompt', False):
+            # No chat template specified, and we use the default
+            setattr(tokenizer, 'chat_template', tokenizer.default_chat_template)
+
     except KeyError:
         pass # No Tokenizer

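The effect of the `convert.py` change above is that the resolved chat template is persisted into `tokenizer_config.json` and read back at load time. A stdlib-only sketch of that save/load round trip (the single `chat_template` field matches the comment in the diff; the template string itself is a placeholder):

```python
import json
import os
import tempfile

# Sketch: persist a chat template into tokenizer_config.json and read it
# back, mirroring the round trip the converter sets up.
chat_template = (
    "{% for message in messages %}{{ message['content'] }}{% endfor %}"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokenizer_config.json")
    with open(path, "w") as f:
        json.dump({"chat_template": chat_template}, f)
    with open(path) as f:
        loaded = json.load(f)["chat_template"]
```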
4 changes: 2 additions & 2 deletions scripts/extra/wav2vec2.py
@@ -20,8 +20,8 @@ def generate_tokenizer_json(tokenizer):
             "id": v,
             "content": k,
             "single_word": False,
-            "lstrip": False,
-            "rstrip": False,
+            "lstrip": True,
+            "rstrip": True,
             "normalized": False,
             "special": True
         }
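The `lstrip`/`rstrip` flip in the wav2vec2 converter above means the special tokens now absorb surrounding whitespace when matched, instead of leaving it attached to neighbouring word pieces. A loose plain-Python analogy of that matching behaviour (the token and text below are illustrative, not the library's actual matching code):

```python
import re

def split_on_added_token(text, content, lstrip, rstrip):
    """Split text around an added token, optionally consuming
    whitespace on either side (rough analogy of AddedToken's
    lstrip/rstrip flags)."""
    pattern = re.escape(content)
    if lstrip:
        pattern = r"\s*" + pattern
    if rstrip:
        pattern = pattern + r"\s*"
    # Capturing group keeps the matched token (plus any consumed
    # whitespace) as its own segment.
    return re.split(f"({pattern})", text)

# With lstrip/rstrip enabled, the spaces around <pad> travel with the
# token rather than leaking into the adjacent segments.
parts = split_on_added_token("HELLO <pad> WORLD", "<pad>", True, True)
```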
