Conversation style appears tied to the dataset rather than the model #2096

Open
RonanKMcGovern opened this issue Nov 30, 2024 · 4 comments

@RonanKMcGovern

If I'm not mistaken, the conversation style that applies during a fine-tune is defined by the dataset defaults, rather than by the tokenizer being used (docs here).

What happens if the tokenizer+model do not have the tokens required for a given conversation style? Are those special tokens created? I assume not.

Is there an option whereby one can:

  • default to using tokenizer.chat_template for the conversation style? (most models on huggingface have this defined)

I'm guessing one issue here is that, since tokenizer.chat_template is not known in advance, it's hard to control the loss mask on the prompt vs. the completions?

So maybe that's the dilemma? Either one can:
a) load a default conversation style from the model/tokenizer, but then it's hard to implement loss masks, or
b) load the default conversation style based on the dataset choice, but then there's a risk of token incompatibilities with the model/tokenizer being trained.

The practical task I'm interested in is fine-tuning Llama 3 and Qwen 2.5 using conversation styles that match their chat templates (so as to minimise the re-training/over-writing that I'm doing).

@felipemello1
Contributor

@RdoubleA do you mind taking a look?

@ebsmothers
Contributor

@RonanKMcGovern thanks for creating the issue. Rafi will probably have the best answer here, but I can weigh in as well. First, one thing worth clarifying: the use of conversation_style is primarily to get data from its raw format into an intermediate format that can be understood by torchtune. E.g. you can see the input and output formats of ShareGPTToMessages here -- note that nothing is actually being tokenized, nor do we do anything involving special tokens.
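
Roughly, the first step looks something like this (a rough sketch -- exact import paths, keys, and constructor arguments may differ slightly between torchtune versions):

```python
from torchtune.data import ShareGPTToMessages

# Transform a raw ShareGPT-style sample into torchtune Message objects.
transform = ShareGPTToMessages()

sample = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "Paris."},
    ]
}

# Only restructuring happens here: no tokenization, no special tokens.
out = transform(sample)
for msg in out["messages"]:
    print(msg.role, msg.content)
```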

What happens if the tokenizer+model do not have the tokens required for a given conversation style? Are those special tokens created? I assume not.

Your understanding is correct. In general if we are formatting the prompt in a certain way I'm not sure it would be easy to infer what formatting comes from special tokens vs what doesn't. Also:

Is there an option whereby one can default to using tokenizer.chat_template for the conversation style? (most models on huggingface have this defined)

This we don't currently support (mainly because it is hard to map 1:1 from Hugging Face to torchtune tokenization logic, though this is likely something we can work to improve).

The practical task I'm interested in is fine-tuning llama 3 and qwen 2.5 using conversation styles that match their chat templates (so as to minimise the re-training/over-writing that I'm doing).

There are different entry points depending on the degree of customization you're looking for, but most of them should be accessible directly through the tokenizer (i.e. not through the dataset, so you do have access to any special tokens).

  1. As a first step, you may just be able to use the tokenizer as-is -- all our model tokenizers have a method tokenize_messages which will take in a list of messages (i.e. those returned by the conversation_style mentioned above) and tokenize them with the default formatting and special tokens for that model. E.g. for Llama3 you can just use llama3_tokenizer, and you can see here that its tokenize_messages method will iterate over the messages and apply any formatting with special tokens that is unique to Llama 3 (see e.g. the standard Llama 3 chat header added here). A short sketch of this usage follows the list below.

  2. If you want to do some formatting of the prompt you can use a predefined prompt template or write your own (check this page in our live docs for how to do this). This then plugs into the tokenizer to format each message before it gets tokenized. You can also combine this with the special_tokens_path tokenizer argument (e.g. see in the Qwen 2.5 tokenizer API reference here). Combining both of these should get you custom formatting with any additional special tokens, but just in case..

  3. Finally, if you find that the existing tokenizer class is just not cutting it or there's some additional customization you need, you can always modify the tokenizer yourself! Probably the quickest way to do this is to just inherit from whatever model tokenizer class you'd like to modify (e.g. Llama3Tokenizer here) and override tokenize_messages or any other methods you need to customize. But this should be more of a last resort -- the hope is that through appropriate customization of prompt_template and/or special_tokens in (2) you never need to do this for an existing model.
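
For (1), a minimal sketch, assuming a local Llama 3 tokenizer.model file (paths and exact return types may vary a bit across torchtune versions):

```python
from torchtune.data import Message
from torchtune.models.llama3 import llama3_tokenizer

tokenizer = llama3_tokenizer(path="/path/to/llama3/original/tokenizer.model")

messages = [
    Message(role="user", content="Summarise torchtune in one sentence."),
    Message(role="assistant", content="A PyTorch-native library for fine-tuning LLMs."),
]

# tokenize_messages applies Llama 3's own chat formatting and special tokens
# and returns the token ids together with a per-position loss mask.
tokens, mask = tokenizer.tokenize_messages(messages)
print(len(tokens), len(mask))
```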

P.S. sorry I am realizing that some of the model tokenizer classes are not linked in our API reference and they contain a lot of important info (e.g. the details about Llama3 here). So will make sure we update that!

@RonanKMcGovern
Author

RonanKMcGovern commented Dec 1, 2024 via email

@ebsmothers
Contributor

@RonanKMcGovern to address some of these comments individually:

For background, I'm coming to this from a transformers/unsloth perspective where I'm used to:
a) models having a tokenizer.chat_template,
b) starting either i) with data that is formatted as an array of system/user/assistant messages OR ii) with data that is in columns, say of "question" and "answer", or just "text", and
c) either the trainer automatically applying the chat template as in b.i. OR me setting up a formatting function for b.ii. (which may just make use of the chat_template) to convert data in columns into a single text string.

b) is good as a starting point for torchtune too. In both i) and ii) the flow is generally the same: given a raw data format, first apply a transform specific to your particular dataset to get the data into a standard format recognized by torchtune (i.e. a list of Message objects), then apply any custom formatting or model-specific logic (including tokenization). So for case i) you would probably want to use either ShareGPTToMessages or OpenAIToMessages for the first step, depending on your exact input format (and of course if it's in a non-standard format you can always write your own version of these).

I would split case ii) further: the case of multiple columns (e.g. "question" and "answer") should fall under instruct_dataset. This comes ready-made with a transform into the message format: InputOutputToMessages. In this case you would just need to pass column_map={"input": "question", "output": "answer"}. And the case of a single "text" column should fit nicely with our text_completion_dataset.
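
For the multi-column case, a rough sketch (the dataset name is just a placeholder, and builder argument names may differ slightly between torchtune versions):

```python
from torchtune.datasets import instruct_dataset
from torchtune.models.llama3 import llama3_tokenizer

tokenizer = llama3_tokenizer(path="/path/to/llama3/original/tokenizer.model")

ds = instruct_dataset(
    tokenizer=tokenizer,
    source="my_org/my_qa_dataset",  # placeholder Hugging Face dataset
    column_map={"input": "question", "output": "answer"},
    train_on_input=False,           # mask the prompt out of the loss
    split="train",
)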

Then (c) basically corresponds to the second step -- apply custom formatting, tokenization, etc.: anything that's unique to the model. I think I already covered this in my last comment, but lmk if there's more you're unclear on here.

If using a chat_template, that typically will automatically have an EOS-type token at the end of assistant responses, which will ensure the model knows when to stop (and this token is typically not masked). I note in the torchtune docs that bos and eos typically are masked, and then it seems there is an option to add an additional ending token, which I presume is not masked.

Yes, this is all determined by the tokenizer. We do always mask BOS and EOS, and in e.g. Llama3 there are tokens like EOT and EOM that will not be masked.

Lastly, there is the matter of whether one trains on completions or not. This can be a bit messy with transformers/unsloth and often requires identifying the portion of the chat template that delineates the assistant response's start, and using that to find where to position the loss mask.

Yes, at least for our chat- and instruct-style datasets we have the flag train_on_input. When set to True, the prompt tokens are also included in the loss during training (I assume this is what you're referring to, but let me know if I'm misunderstanding your comment).
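
Purely as an illustration of what the flag controls (this is not torchtune's internal code; -100 is just the conventional cross-entropy ignore index):

```python
IGNORE_IDX = -100  # conventional cross-entropy ignore index

def build_labels(tokens, prompt_mask, train_on_input):
    """prompt_mask[i] is True where token i belongs to the prompt."""
    if train_on_input:
        return list(tokens)  # prompt tokens also contribute to the loss
    return [IGNORE_IDX if is_prompt else tok
            for tok, is_prompt in zip(tokens, prompt_mask)]

tokens      = [11, 12, 13, 21, 22, 23]
prompt_mask = [True, True, True, False, False, False]
print(build_labels(tokens, prompt_mask, train_on_input=False))
# -> [-100, -100, -100, 21, 22, 23]
```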

Something I think is worth some thought is ensuring that the final tokenizer.json does have a tokenizer.chat_template that is consistent with how the training was done. This (unless I'm mistaken) is what vLLM uses to apply a chat template for inference.

Yeah this is a good point. Right now we do not really maintain a standard mapping between our tokenizer and Hugging Face chat templates, but we are planning on providing cleaner integration with vLLM soon so this is likely something we will need to support.

Anyway, I'm keen to try out torchtune mostly for two reasons: i) to see training speed, ii) since it allows for multi-GPU quite easily (DDP and FSDP are messier on transformers and not possible on unsloth).

For training speed, I would recommend running with torch compile (set compile=True) and also sample packing (set dataset.packed=True) to get faster training. If you're doing sample packing, you will also need to set tokenizer.max_seq_len to determine the packed sequence length. The right value will depend on your hardware, batch size, etc., but I'd recommend starting at 2048 and increasing by powers of 2.
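
As a sketch of how those knobs fit together (placeholder paths and dataset name again; note that compile=True is a recipe-level config flag, not a dataset argument, and exact argument names may vary by version):

```python
from torchtune.datasets import instruct_dataset
from torchtune.models.llama3 import llama3_tokenizer

# max_seq_len sets the packed sequence length; start at 2048 and increase
# by powers of 2 as memory allows.
tokenizer = llama3_tokenizer(
    path="/path/to/llama3/original/tokenizer.model",
    max_seq_len=2048,
)

ds = instruct_dataset(
    tokenizer=tokenizer,
    source="my_org/my_qa_dataset",  # placeholder Hugging Face dataset
    column_map={"input": "question", "output": "answer"},
    packed=True,                    # pack samples up to max_seq_len
)
```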

One last side-note. There was a major issue in transformers/unsloth whereby gradient accumulation was naively adding gradients without properly normalising for the number of unmasked tokens and length. This led to major errors in the loss when accumulating gradients.

Yes, thanks for mentioning this. We fixed this in #1917 (this also fixes the same problem for distributed training, where the accumulation of the number of tokens seen needs to be taken over all ranks to properly normalize the loss).
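
To illustrate the idea behind that fix (a simplified sketch, not the actual change in #1917; loss_fn is assumed to be something like nn.CrossEntropyLoss(reduction="sum", ignore_index=-100)):

```python
import torch
import torch.distributed as dist

IGNORE_IDX = -100

def accumulation_window(model, micro_batches, loss_fn, distributed=False):
    """Accumulate gradients over micro_batches, normalizing by the total
    number of unmasked tokens rather than by per-batch means."""
    total_loss = 0.0
    num_tokens = torch.tensor(0.0)

    for batch in micro_batches:
        logits = model(batch["tokens"])
        labels = batch["labels"]
        # Sum (not mean) the per-token loss so we can normalize once over
        # the whole accumulation window.
        total_loss = total_loss + loss_fn(
            logits.view(-1, logits.size(-1)), labels.view(-1)
        )
        num_tokens += (labels != IGNORE_IDX).sum()

    if distributed:
        # The token count must be summed across all ranks so the loss is
        # normalized by the global number of unmasked tokens.
        dist.all_reduce(num_tokens)
        world_size = dist.get_world_size()
    else:
        world_size = 1

    # DDP averages gradients over ranks, so scale by world_size before
    # dividing by the global token count.
    (total_loss * world_size / num_tokens).backward()
```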
