Add date_string when applying tokenizer chat template #1474

Merged 5 commits into main on Aug 22, 2024

Conversation

@snarayan21 (Contributor) commented Aug 21, 2024

Llama 3.1 models accept a date_string that sets the current date in the chat template. If we don't set it, it defaults to 26 Jul 2024, which is problematic for date-aware prompts. To remedy this, we now pass the current date as date_string.

Added unit tests and tested manually on gated Llama 3.1 and Llama 3 models (see below).
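A minimal sketch of the idea: compute the current date in the format the Llama 3.1 template expects and forward it as date_string when applying the chat template. The helper name get_date_string and the message contents are illustrative, not the PR's actual code; the commented usage requires access to the gated tokenizer.

```python
# Sketch: pass today's date as date_string instead of relying on the
# template's baked-in default ("26 Jul 2024" for Llama 3.1).
from datetime import datetime


def get_date_string() -> str:
    # Llama 3.1 chat templates render the date as e.g. "22 Aug 2024".
    return datetime.now().strftime('%d %b %Y')


# Usage (requires access to the gated Llama 3.1 tokenizer):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(
#     'meta-llama/Meta-Llama-3.1-8B-Instruct')
# ids = tokenizer.apply_chat_template(
#     [{'role': 'user', 'content': 'Hello!'}],
#     date_string=get_date_string(),  # extra kwargs reach the Jinja template
# )
```

Extra keyword arguments to apply_chat_template are forwarded to the Jinja template, which is why supplying date_string overrides the default.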

pytest -s -v tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string
===================================================================================== test session starts =====================================================================================
platform darwin -- Python 3.10.5, pytest-8.3.2, pluggy-1.5.0 -- /Users/saaketh.narayan/.pyenv/versions/3.10.5/bin/python3.10
cachedir: .pytest_cache
rootdir: /Users/saaketh.narayan/Desktop/llm-foundry
configfile: pyproject.toml
plugins: cov-5.0.0, anyio-4.3.0, split-0.9.0, pytest_codeblocks-0.17.0
collected 10 items

tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[True-EleutherAI/gpt-neox-20b] [W ProcessGroupGloo.cpp:751] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
PASSED
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51.0k/51.0k [00:00<00:00, 3.06MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 13.9MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 73.0/73.0 [00:00<00:00, 343kB/s]
PASSED
tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[True-meta-llama/Meta-Llama-3.1-8B-Instruct] SKIPPED (Llama 3.1 Instruct models use date_string in chat template, so ...)
tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[True-meta-llama/Meta-Llama-3.1-70B-Instruct] SKIPPED (Llama 3.1 Instruct models use date_string in chat template, so...)
tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[True-meta-llama/Meta-Llama-3.1-405B-Instruct] SKIPPED (Llama 3.1 Instruct models use date_string in chat template, s...)
tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[False-EleutherAI/gpt-neox-20b] No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
PASSED
tests/tokenizers/test_tokenizer.py::test_tokenizer_date_string[False-meta-llama/Meta-Llama-3-8B-Instruct] PASSED
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 27.2MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 12.2MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 2.03MB/s]
PASSED
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 19.3MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 11.3MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 2.05MB/s]
PASSED
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 20.1MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 15.4MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 3.75MB/s]
PASSED

Manual test for the modified test_multi_turn_chat_slicing using the gated Llama 3.1 8B tokenizer:

tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-True-EleutherAI/gpt-neox-20b] [W ProcessGroupGloo.cpp:751] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-True-HuggingFaceH4/zephyr-7b-beta] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-True-t5-base] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-True-meta-llama/Meta-Llama-3.1-8B-Instruct] SKIPPED (Llama 3.1 Instruct models use date_string in chat template ...)
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-False-EleutherAI/gpt-neox-20b] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-False-HuggingFaceH4/zephyr-7b-beta] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-False-t5-base] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[True-False-meta-llama/Meta-Llama-3.1-8B-Instruct] SKIPPED (Llama 3.1 Instruct models use date_string in chat template...)
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-True-EleutherAI/gpt-neox-20b] No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-True-HuggingFaceH4/zephyr-7b-beta] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-True-t5-base] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-True-meta-llama/Meta-Llama-3.1-8B-Instruct] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-False-EleutherAI/gpt-neox-20b] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-False-HuggingFaceH4/zephyr-7b-beta] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-False-t5-base] PASSED
tests/data/test_template_tokenization.py::test_multi_turn_chat_slicing[False-False-meta-llama/Meta-Llama-3.1-8B-Instruct] PASSED

@snarayan21 snarayan21 requested a review from a team as a code owner August 21, 2024 20:49
@snarayan21 snarayan21 requested a review from dakinggg August 21, 2024 22:39
@snarayan21 snarayan21 requested a review from dakinggg August 22, 2024 02:05
@snarayan21 snarayan21 merged commit e235f42 into main Aug 22, 2024
9 checks passed