Add date_string when applying tokenizer chat template #1474

Merged 5 commits into main on Aug 22, 2024

Conversation

@snarayan21 (Contributor) commented Aug 21, 2024

Llama 3.1 models accept a date_string that sets the current date in the chat template. If we don't set it, it defaults to 26 Jul 2024, which is problematic for date-aware prompts. To remedy this, we now pass the current date as date_string.

Added unit tests and tested manually on gated Llama 3.1 and Llama 3 models (see below).
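A minimal sketch of the idea: compute the current date in the format the Llama 3.1 template expects and forward it as date_string when applying the chat template. The helper name get_date_string and the message contents are illustrative, not the PR's actual code; the commented usage requires access to the gated tokenizer.

```python
# Sketch: pass today's date as date_string instead of relying on the
# template's baked-in default ("26 Jul 2024" for Llama 3.1).
from datetime import datetime


def get_date_string() -> str:
    # Llama 3.1 chat templates render the date as e.g. "22 Aug 2024".
    return datetime.now().strftime('%d %b %Y')


# Usage (requires access to the gated Llama 3.1 tokenizer):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(
#     'meta-llama/Meta-Llama-3.1-8B-Instruct')
# ids = tokenizer.apply_chat_template(
#     [{'role': 'user', 'content': 'Hello!'}],
#     date_string=get_date_string(),  # extra kwargs reach the Jinja template
# )
```

Extra keyword arguments to apply_chat_template are forwarded to the Jinja template, which is why supplying date_string overrides the default.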

pytest -s -v tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string
===================================================================================== test session starts =====================================================================================
platform darwin -- Python 3.10.5, pytest-8.3.2, pluggy-1.5.0 -- /Users/saaketh.narayan/.pyenv/versions/3.10.5/bin/python3.10
cachedir: .pytest_cache
rootdir: /Users/saaketh.narayan/Desktop/llm-foundry
configfile: pyproject.toml
plugins: cov-5.0.0, anyio-4.3.0, split-0.9.0, pytest_codeblocks-0.17.0
collected 10 items

tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[True-EleutherAI/gpt-neox-20b] [W ProcessGroupGloo.cpp:751] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
PASSED
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51.0k/51.0k [00:00<00:00, 3.06MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 13.9MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 73.0/73.0 [00:00<00:00, 343kB/s]
PASSED
tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[True-meta-llama/Meta-Llama-3.1-8B-Instruct] SKIPPED (Llama 3.1 Instruct models use date_string in chat template, so ...)
tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[True-meta-llama/Meta-Llama-3.1-70B-Instruct] SKIPPED (Llama 3.1 Instruct models use date_string in chat template, so...)
tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[True-meta-llama/Meta-Llama-3.1-405B-Instruct] SKIPPED (Llama 3.1 Instruct models use date_string in chat template, s...)
tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[False-EleutherAI/gpt-neox-20b] No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
PASSED
tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[False-meta-llama/Meta-Llama-3-8B-Instruct] PASSED
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 27.2MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 12.2MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 2.03MB/s]
PASSED
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 19.3MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 11.3MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 2.05MB/s]
PASSED
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 20.1MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 15.4MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 3.75MB/s]
PASSED

Manual test for the modified test_multi_turn_chat_slicing using the gated Llama 3.1 8B tokenizer:

tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-True-EleutherAI/gpt-neox-20b] [W ProcessGroupGloo.cpp:751] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-True-HuggingFaceH4/zephyr-7b-beta] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-True-t5-base] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-True-meta-llama/Meta-Llama-3.1-8B-Instruct] SKIPPED (Llama 3.1 Instruct models use date_string in chat template ...)
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-False-EleutherAI/gpt-neox-20b] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-False-HuggingFaceH4/zephyr-7b-beta] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-False-t5-base] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-False-meta-llama/Meta-Llama-3.1-8B-Instruct] SKIPPED (Llama 3.1 Instruct models use date_string in chat template...)
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-True-EleutherAI/gpt-neox-20b] No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-True-HuggingFaceH4/zephyr-7b-beta] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-True-t5-base] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-True-meta-llama/Meta-Llama-3.1-8B-Instruct] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-False-EleutherAI/gpt-neox-20b] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-False-HuggingFaceH4/zephyr-7b-beta] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-False-t5-base] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-False-meta-llama/Meta-Llama-3.1-8B-Instruct] PASSED

@snarayan21 snarayan21 requested a review from a team as a code owner August 21, 2024 20:49
@snarayan21 snarayan21 requested a review from dakinggg August 21, 2024 22:39
@snarayan21 snarayan21 requested a review from dakinggg August 22, 2024 02:05
@snarayan21 snarayan21 merged commit e235f42 into main Aug 22, 2024
9 checks passed