Conversation style appears tied to the dataset rather than the model #2096

Open
RonanKMcGovern opened this issue Nov 30, 2024 · 4 comments

@RonanKMcGovern

If I'm not mistaken, the conversation style that applies during a fine-tune is defined by the dataset defaults, rather than by the tokenizer being used (docs here).

What happens if the tokenizer+model do not have the tokens required for a given conversation style? Are those special tokens created? I assume not.

Is there an option whereby one can:

  • default to using tokenizer.chat_template for the conversation style? (most models on huggingface have this defined)

I'm guessing one issue here is that, since tokenizer.chat_template is not known in advance, it's hard to control the loss mask on the prompt vs. the completions?

So maybe that's the dilemma? Either one can:
a) load a default conversation style from the model/tokenizer, but then it's hard to implement loss masks, or
b) load the default conversation style based on the dataset choice, but then there's a risk of token incompatibilities with the model/tokenizer being trained.

The practical task I'm interested in is fine-tuning Llama 3 and Qwen 2.5 using conversation styles that match their chat templates (so as to minimise the re-training/over-writing that I'm doing).

@felipemello1
Contributor

@RdoubleA do you mind taking a look?

@ebsmothers
Contributor

@RonanKMcGovern thanks for creating the issue. Rafi will probably have the best answer here, but I can weigh in as well. First, one thing worth clarifying: the use of conversation_style is primarily to get data from its raw format into an intermediate format that can be understood by torchtune. E.g. you can see the input and output formats of ShareGPTToMessages here -- note that nothing is actually being tokenized, nor do we do anything involving special tokens.
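
Roughly, the first step looks something like this (a rough sketch -- exact import paths, keys, and constructor arguments may differ slightly between torchtune versions):

```python
from torchtune.data import ShareGPTToMessages

# Transform a raw ShareGPT-style sample into torchtune Message objects.
transform = ShareGPTToMessages()

sample = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "Paris."},
    ]
}

# Only restructuring happens here: no tokenization, no special tokens.
out = transform(sample)
for msg in out["messages"]:
    print(msg.role, msg.content)
```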

What happens if the tokenizer+model do not have the tokens required for a given conversation style? Are those special tokens created? I assume not.

Your understanding is correct. In general if we are formatting the prompt in a certain way I'm not sure it would be easy to infer what formatting comes from special tokens vs what doesn't. Also:

Is there an option whereby one can default to using tokenizer.chat_template for the conversation style? (most models on huggingface have this defined)

This we don't currently support (mainly because it is hard to map 1:1 from Hugging Face to torchtune tokenization logic, though this is likely something we can work to improve).

The practical task I'm interested in is fine-tuning llama 3 and qwen 2.5 using conversation styles that match their chat templates (so as to minimise the re-training/over-writing that I'm doing).

There are different entry points depending on the degree of customization you're looking for, but most of them should be accessible directly through the tokenizer (i.e. not through the dataset, so you do have access to any special tokens).

  1. As a first step, you may just be able to use the tokenizer as-is -- all our model tokenizers have a method tokenize_messages which will take in a list of messages (i.e. those returned by the conversation_style mentioned above) and tokenize them with the default formatting and special tokens for that model. E.g. for Llama3 you can just use llama3_tokenizer, and you can see here that its tokenize_messages method will iterate over the messages and apply any formatting with special tokens that is unique to Llama 3 (see e.g. the standard Llama 3 chat header added here). A short sketch of this usage follows the list below.

  2. If you want to do some formatting of the prompt you can use a predefined prompt template or write your own (check this page in our live docs for how to do this). This then plugs into the tokenizer to format each message before it gets tokenized. You can also combine this with the special_tokens_path tokenizer argument (e.g. see in the Qwen 2.5 tokenizer API reference here). Combining both of these should get you custom formatting with any additional special tokens, but just in case..

  3. Finally, if you find that the existing tokenizer class is just not cutting it or there's some additional customization you need, you can always modify the tokenizer yourself! Probably the quickest way to do this is to just inherit from whatever model tokenizer class you'd like to modify (e.g. Llama3Tokenizer here) and override tokenize_messages or any other methods you need to customize. But this should be more of a last resort -- the hope is that through appropriate customization of prompt_template and/or special_tokens in (2) you never need to do this for an existing model.
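
For (1), a minimal sketch, assuming a local Llama 3 tokenizer.model file (paths and exact return types may vary a bit across torchtune versions):

```python
from torchtune.data import Message
from torchtune.models.llama3 import llama3_tokenizer

tokenizer = llama3_tokenizer(path="/path/to/llama3/original/tokenizer.model")

messages = [
    Message(role="user", content="Summarise torchtune in one sentence."),
    Message(role="assistant", content="A PyTorch-native library for fine-tuning LLMs."),
]

# tokenize_messages applies Llama 3's own chat formatting and special tokens
# and returns the token ids together with a per-position loss mask.
tokens, mask = tokenizer.tokenize_messages(messages)
print(len(tokens), len(mask))
```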

P.S. sorry I am realizing that some of the model tokenizer classes are not linked in our API reference and they contain a lot of important info (e.g. the details about Llama3 here). So will make sure we update that!

@RonanKMcGovern
Author

RonanKMcGovern commented Dec 1, 2024 via email

@ebsmothers
Contributor

@RonanKMcGovern to address some of these comments individually:

For background, I'm coming to this from a transformers/unsloth perspective where I'm used to:
a) models having a tokenizer.chat_template,
b) starting either i) with data that is formatted as an array of system/user/assistant messages OR ii) with data that is in columns, say of "question" and "answer", or just "text", and
c) either the trainer automatically applying the chat template as in b.i. OR me setting up a formatting function for b.ii. (which may just make use of the chat_template) to convert data in columns into a single text string.

b) is good as a starting point for torchtune too. In both i) and ii) the flow is generally the same: given a raw data format, first apply a transform specific to your particular dataset to get the data into a standard format recognized by torchtune (i.e. a list of Message objects), then apply any custom formatting or model-specific logic (including tokenization). So for case i) you would probably want to use either ShareGPTToMessages or OpenAIToMessages for the first step, depending on your exact input format (and of course if it's in a non-standard format you can always write your own version of these).

I would split case ii) further: the case of multiple columns (e.g. "question" and "answer") should fall under instruct_dataset. This comes ready-made with a transform into the message format: InputOutputToMessages. In this case you would just need to pass column_map={"input": "question", "output": "answer"}. And the case of a single "text" column should fit nicely with our text_completion_dataset.
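
For the multi-column case, a rough sketch (the dataset name is just a placeholder, and builder argument names may differ slightly between torchtune versions):

```python
from torchtune.datasets import instruct_dataset
from torchtune.models.llama3 import llama3_tokenizer

tokenizer = llama3_tokenizer(path="/path/to/llama3/original/tokenizer.model")

ds = instruct_dataset(
    tokenizer=tokenizer,
    source="my_org/my_qa_dataset",  # placeholder Hugging Face dataset
    column_map={"input": "question", "output": "answer"},
    train_on_input=False,           # mask the prompt out of the loss
    split="train",
)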

Then (c) basically corresponds to the second step -- apply custom formatting, tokenization, etc.: anything that's unique to the model. I think I already covered this in my last comment, but lmk if there's more you're unclear on here.

If using a chat_template, that typically will automatically have an EOS-type token at the end of assistant responses, which will ensure the model knows when to stop (and this token is typically not masked). I note in the torchtune docs that bos and eos typically are masked, and then it seems there is an option to add an additional ending token, which I presume is not masked.

Yes, this is all determined by the tokenizer. We do always mask BOS and EOS, and in e.g. Llama3 there are tokens like EOT and EOM that will not be masked.

Lastly, there is the matter of whether one trains on completions or not. This can be a bit messy with transformers/unsloth and often requires identifying the portion of the chat template that delineates the assistant response's start, and using that to find where to position the loss mask.

Yes, at least for our chat- and instruct-style datasets we have the flag train_on_input. When set to True, the prompt tokens are also included in the loss during training (I assume this is what you're referring to, but let me know if I'm misunderstanding your comment).
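
Purely as an illustration of what the flag controls (this is not torchtune's internal code; -100 is just the conventional cross-entropy ignore index):

```python
IGNORE_IDX = -100  # conventional cross-entropy ignore index

def build_labels(tokens, prompt_mask, train_on_input):
    """prompt_mask[i] is True where token i belongs to the prompt."""
    if train_on_input:
        return list(tokens)  # prompt tokens also contribute to the loss
    return [IGNORE_IDX if is_prompt else tok
            for tok, is_prompt in zip(tokens, prompt_mask)]

tokens      = [11, 12, 13, 21, 22, 23]
prompt_mask = [True, True, True, False, False, False]
print(build_labels(tokens, prompt_mask, train_on_input=False))
# -> [-100, -100, -100, 21, 22, 23]
```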

Something I think is worth some thought is ensuring that the final tokenizer.json does have a tokenizer.chat_template that is consistent with how the training was done. This (unless I'm mistaken) is what vLLM uses to apply a chat template for inference.

Yeah this is a good point. Right now we do not really maintain a standard mapping between our tokenizer and Hugging Face chat templates, but we are planning on providing cleaner integration with vLLM soon so this is likely something we will need to support.

Anyway, I'm keen to try out torchtune mostly for two reasons: i) to see training speed, ii) since it allows for multi-GPU quite easily (DDP and FSDP are messier on transformers and not possible on unsloth).

For training speed, I would recommend running with torch compile (set compile=True) and also sample packing (set dataset.packed=True) to get faster training. If you're doing sample packing, you will also need to set tokenizer.max_seq_len to determine the packed sequence length. The right value will depend on your hardware, batch size, etc., but I'd recommend starting at 2048 and increasing by powers of 2.
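
As a sketch of how those knobs fit together (placeholder paths and dataset name again; note that compile=True is a recipe-level config flag, not a dataset argument, and exact argument names may vary by version):

```python
from torchtune.datasets import instruct_dataset
from torchtune.models.llama3 import llama3_tokenizer

# max_seq_len sets the packed sequence length; start at 2048 and increase
# by powers of 2 as memory allows.
tokenizer = llama3_tokenizer(
    path="/path/to/llama3/original/tokenizer.model",
    max_seq_len=2048,
)

ds = instruct_dataset(
    tokenizer=tokenizer,
    source="my_org/my_qa_dataset",  # placeholder Hugging Face dataset
    column_map={"input": "question", "output": "answer"},
    packed=True,                    # pack samples up to max_seq_len
)
```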

One last side-note. There was a major issue in transformers/unsloth whereby gradient accumulation was naively adding gradients without properly normalising for the number of unmasked tokens and length. This led to major errors in the loss when accumulating gradients.

Yes, thanks for mentioning this. We fixed this in #1917 (this also fixes the same problem for distributed training, where the accumulation of the number of tokens seen needs to be taken over all ranks to properly normalize the loss).
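
To illustrate the idea behind that fix (a simplified sketch, not the actual change in #1917; loss_fn is assumed to be something like nn.CrossEntropyLoss(reduction="sum", ignore_index=-100)):

```python
import torch
import torch.distributed as dist

IGNORE_IDX = -100

def accumulation_window(model, micro_batches, loss_fn, distributed=False):
    """Accumulate gradients over micro_batches, normalizing by the total
    number of unmasked tokens rather than by per-batch means."""
    total_loss = 0.0
    num_tokens = torch.tensor(0.0)

    for batch in micro_batches:
        logits = model(batch["tokens"])
        labels = batch["labels"]
        # Sum (not mean) the per-token loss so we can normalize once over
        # the whole accumulation window.
        total_loss = total_loss + loss_fn(
            logits.view(-1, logits.size(-1)), labels.view(-1)
        )
        num_tokens += (labels != IGNORE_IDX).sum()

    if distributed:
        # The token count must be summed across all ranks so the loss is
        # normalized by the global number of unmasked tokens.
        dist.all_reduce(num_tokens)
        world_size = dist.get_world_size()
    else:
        world_size = 1

    # DDP averages gradients over ranks, so scale by world_size before
    # dividing by the global token count.
    (total_loss * world_size / num_tokens).backward()
```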
