Ap/fix mistral template #183

Closed
wants to merge 10 commits into from

Conversation

aldopareja
Member

This builds on top of #169. It changes the way we treat system, assistant, and user tokens, assuming they are always lists (which could be empty or single-element). We lose testing of the alternation between user and assistant, but Mistral's Jinja template enforces this; perhaps we could do the same with Granite.
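For reference, a minimal sketch of the kind of alternation check a Mistral-style Jinja chat template performs (the exact template text used in this PR may differ; this is only illustrative):

```python
# Illustrative only: the core of a Mistral-style Jinja chat template that
# rejects conversations whose user/assistant turns do not alternate.
# `raise_exception` is the helper Hugging Face exposes inside chat templates.
ALTERNATION_CHECK = (
    "{% for message in messages %}"
    "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
    "{{ raise_exception('Conversation roles must alternate user/assistant/...') }}"
    "{% endif %}"
    "{% endfor %}"
)
```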

- Introduce TokenInfo class to manage special tokens with add_to_tokenizer flag
- Update SpecialTokens dataclass to use TokenInfo for each token
- Modify chat templates (ibm_generic_tmpl.py and mistral_tmpl.py) to use new TokenInfo structure
- Refactor unmask_message_content function to handle token sequences instead of single tokens
- Update get_sp_token to return token lists instead of single integers
- Adjust data processing and tokenizer setup to work with new token structure
- Improve error handling and add assertions for eos and pad tokens
- Enhance chat template for Mistral to support system messages and strict role alternation

These changes provide more flexibility in defining and using special tokens,
especially for models like Mistral that use multi-token sequences for roles.
The new structure allows for better control over which tokens are added to the
tokenizer while still being recognized in the input.
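As a rough illustration of the structure described above (only `TokenInfo`, `SpecialTokens`, `add_to_tokenizer`, and the list-returning `get_sp_token` come from this PR's description; the field names, defaults, and the encode call are assumptions):

```python
from dataclasses import dataclass, field


@dataclass
class TokenInfo:
    token: str
    add_to_tokenizer: bool = False  # register with the tokenizer, or only recognize it in input


@dataclass
class SpecialTokens:
    system: TokenInfo = field(default_factory=lambda: TokenInfo(""))
    user: TokenInfo = field(default_factory=lambda: TokenInfo(""))
    assistant: TokenInfo = field(default_factory=lambda: TokenInfo(""))
    eos: TokenInfo = field(default_factory=lambda: TokenInfo("", add_to_tokenizer=True))
    pad: TokenInfo = field(default_factory=lambda: TokenInfo("", add_to_tokenizer=True))


def get_sp_token(tokenizer, sp_string: str) -> list[int]:
    # A role marker like "[INST]" may encode to several ids for Mistral,
    # so the result is always a list rather than a single integer.
    return tokenizer.encode(sp_string, add_special_tokens=False) if sp_string else []
```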
@aldopareja aldopareja requested a review from Maxusmusti August 27, 2024 07:03
@aldopareja
Member Author

@Maxusmusti @JamesKunstle @RobotSail this fixes #182

- Update unmask_message_content function for better token handling
- Modify print_masked_samples to use a dedicated mask token (see the sketch after this list)
- Adjust tokenizer setup to use AutoTokenizer and chat template
- Update special token handling in setup_tokenizer
- Minor code cleanup and optimization
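A minimal sketch of how masked positions could be rendered when printing samples (the function name comes from the bullets above; the mask string and the rest are assumptions, not the code in this PR):

```python
MASK_TOKEN = "<MASK>"  # hypothetical placeholder string for masked positions


def print_masked_samples(samples, tokenizer, num_samples: int = 2) -> None:
    """Print a few samples, replacing label-masked positions (-100) with MASK_TOKEN."""
    for sample in samples[:num_samples]:
        pieces = []
        for tok_id, label in zip(sample["input_ids"], sample["labels"]):
            pieces.append(MASK_TOKEN if label == -100 else tokenizer.decode([tok_id]))
        print("".join(pieces))
```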
@aldopareja
Member Author

fixes #184

@Maxusmusti
Contributor

@aldo-pareja can you rebase this on current main? (which now includes #169)

Enhance training pipeline and add license headers

- Add SPDX-License-Identifier headers to source files
- Improve error handling and validation in data processing
- Implement checkpoint saving at end of each epoch
- Optimize multipack sampler for better GPU utilization
- Refactor DeepSpeed configuration for ZeRO Stage 3 (see the sketch after this list)
- Add support for Mixtral sparse MoE models
- Improve logging and error messages throughout
- Fix various minor bugs and typos
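For context, a minimal sketch of a DeepSpeed ZeRO Stage 3 configuration of the kind the list above refers to (values are placeholders, not the settings used in this PR):

```python
# Illustrative ZeRO Stage 3 settings; the actual values belong to the PR's
# DeepSpeed refactor and are not reproduced here.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```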
@mergify mergify bot added the ci-failure label Aug 29, 2024
@Maxusmusti
Contributor

Closing as these changes are introduced in #213

@Maxusmusti Maxusmusti closed this Sep 24, 2024