return mask of user messages when calling `tokenizer.apply_chat_template(c, tokenize=True)`
#28950
Comments
Hi @yonigottesman - this would be a useful feature, but how do you plan to implement it?
indeed, great feature! possible approach: in `apply_chat_template`, loop through the messages, tokenize each one separately, and track which tokens come from which role.
@geronimi73 something like that could work, but there are several edge cases! Firstly, some tokenizers introduce additional spaces, in which case the outputs might be slightly different if you loop through messages separately, and secondly some tokenizers like LLaMA insert the system message into the first user message, which means that we can't safely assume that the ordering of messages in the dict will match the ordering of tokens in the output.
from what i've seen the order of the messages is always preserved, even with the llama insertion. do you have an example of a jinja template where the order of the messages is changed? i haven't found one (in the few examples i looked at)
true. but I think this is exactly the way people help themselves right now: loop through the messages, tokenize each message separately, and set labels according to the message role.
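For reference, a minimal sketch of that workaround (illustrative helper, not code from this thread). It assumes the per-message pieces concatenate to the same tokens as templating the whole conversation, which, as the next reply points out, does not hold for every template:

```python
# Sketch of the current workaround: template and tokenize each message on its
# own, and mask non-assistant tokens with -100.
def build_labels(tokenizer, messages):
    input_ids, labels = [], []
    for message in messages:
        ids = tokenizer.apply_chat_template([message], tokenize=True)
        input_ids.extend(ids)
        if message["role"] == "assistant":
            labels.extend(ids)            # train on assistant tokens
        else:
            labels.extend([-100] * len(ids))  # ignore user/system tokens in the loss
    return input_ids, labels
```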
Hi @geronimi73 - the example in the docs you linked is not actually the full LLaMA template! We simplified it for that document. Here's the full LLaMA 2 template, with linebreaks/indentation added. Note that the system message is actually injected into the middle of the first user message!
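The template attached to that comment isn't preserved in this thread. As a rough, illustrative sketch of the structure being described (not the verbatim LLaMA 2 template; the `<<SYS>>` newlines are dropped for brevity, and `bos_token`/`eos_token` are assumed to be passed in as template variables), note how the system turn is folded into the first user turn:

```python
# Rough sketch of LLaMA-2-style template logic: the system message is spliced
# into the first user message instead of being emitted as its own segment.
llama2_like_template = (
    "{% if messages[0]['role'] == 'system' %}"
    "{% set system_message = messages[0]['content'] %}"
    "{% set loop_messages = messages[1:] %}"
    "{% else %}"
    "{% set loop_messages = messages %}"
    "{% endif %}"
    "{% for message in loop_messages %}"
    "{% if loop.first and system_message is defined %}"
    "{% set content = '<<SYS>> ' + system_message + ' <</SYS>> ' + message['content'] %}"
    "{% else %}"
    "{% set content = message['content'] %}"
    "{% endif %}"
    "{% if message['role'] == 'user' %}"
    "{{ bos_token + '[INST] ' + content + ' [/INST]' }}"
    "{% elif message['role'] == 'assistant' %}"
    "{{ ' ' + content + ' ' + eos_token }}"
    "{% endif %}"
    "{% endfor %}"
)
```

Because the system text ends up inside the first `[INST]` block, a naive message-by-message tokenization cannot recover clean per-message token spans here.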
I think a different approach is needed here. I know this is a bit of a hack, but bear with me a second as I think this is really important :) We can introduce a new keyword to the jinja template that marks which parts of the rendered text are assistant-generated. Here is how the new chat template of phi3 would look:
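The template attached to this comment isn't preserved. Going by the follow-up PR linked at the end of the thread (#30650), the keyword that was eventually used is `{% generation %}`; a rough sketch of a phi-3-style template using it (token boundaries simplified, newlines omitted) could look like this:

```python
# Sketch of a phi-3-style chat template where everything the model is supposed
# to generate is wrapped in {% generation %} ... {% endgeneration %}.
# A stock jinja environment would reject the unknown tag; the extension
# sketched under the next comment is what makes it render.
phi3_like_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}"
    "{{ '<|system|>' + message['content'] + '<|end|>' }}"
    "{% elif message['role'] == 'user' %}"
    "{{ '<|user|>' + message['content'] + '<|end|>' }}"
    "{% elif message['role'] == 'assistant' %}"
    "{{ '<|assistant|>' }}"
    "{% generation %}"
    "{{ message['content'] + '<|end|>' }}"
    "{% endgeneration %}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|assistant|>' }}{% endif %}"
)
```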
Here is the jinja2 extension to add the new keyword:
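The attached code isn't preserved in this thread; a minimal sketch of such an extension (names are illustrative, not the exact implementation that was eventually merged) could be:

```python
from jinja2 import nodes
from jinja2.ext import Extension


class GenerationTracker(Extension):
    """Registers a {% generation %} ... {% endgeneration %} block tag and
    collects the text rendered inside each such block."""

    tags = {"generation"}

    def __init__(self, environment):
        super().__init__(environment)
        # Text of every rendered {% generation %} block, in template order.
        # Note: state lives on the environment's extension instance, so a real
        # implementation would reset or scope it per render.
        self.generation_texts = []

    def parse(self, parser):
        # The parser is positioned on the "generation" token; consume it.
        lineno = next(parser.stream).lineno
        # Collect the template body up to the matching {% endgeneration %}.
        body = parser.parse_statements(["name:endgeneration"], drop_needle=True)
        # Route the rendered body through _record so we can capture it.
        node = nodes.CallBlock(self.call_method("_record"), [], [], body)
        return node.set_lineno(lineno)

    def _record(self, caller):
        text = caller()
        self.generation_texts.append(text)
        return text
```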
The rendering will be done with this extension registered on the jinja2 environment:
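Presumably along these lines (a sketch reusing the `GenerationTracker` and `phi3_like_template` names from the sketches above; the real PR tracks offsets during rendering rather than searching afterwards):

```python
from jinja2.sandbox import ImmutableSandboxedEnvironment

# transformers renders chat templates in a sandboxed jinja environment; the
# only change needed here is registering the extension.
env = ImmutableSandboxedEnvironment(
    trim_blocks=True, lstrip_blocks=True, extensions=[GenerationTracker]
)
template = env.from_string(phi3_like_template)
rendered = template.render(messages=messages, add_generation_prompt=False)

# Locate each recorded {% generation %} block inside the rendered string.
tracker = env.extensions[GenerationTracker.identifier]
spans, cursor = [], 0
for text in tracker.generation_texts:
    start = rendered.find(text, cursor)
    spans.append((start, start + len(text)))
    cursor = start + len(text)
```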
When this is done, the recorded character spans can be mapped back to token positions and returned as a mask next to the input ids.
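A sketch of that last step using the tokenizer's offset mapping (a fast tokenizer is assumed, since slow tokenizers cannot return offsets; `tokenizer` and `spans` are the names from the sketches above):

```python
# Tokenize the rendered conversation and mark every token whose character
# span falls inside one of the recorded {% generation %} spans.
encoding = tokenizer(rendered, return_offsets_mapping=True, add_special_tokens=False)
assistant_mask = [
    1 if any(s <= tok_start and tok_end <= e for s, e in spans) else 0
    for tok_start, tok_end in encoding["offset_mapping"]
]
input_ids = encoding["input_ids"]
# assistant_mask[i] == 1  ->  token i was rendered inside a generation block
```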
Now I know this is not such a trivial solution, but given we are not going to swap jinja with something else, I think it's not so bad. @Rocketknight1 what do you think? p.s. these are the messages I used to check this code works:
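The exact message list isn't preserved in this thread; a conversation along these lines exercises the relevant cases (a system turn plus alternating user/assistant turns):

```python
# Illustrative test conversation, not the exact messages from the comment.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "And of Spain?"},
    {"role": "assistant", "content": "Madrid."},
]
```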
@yonigottesman this is really cool! We'll definitely have to do some testing and iterate on it, but the way you've gotten the template to track the information we need is really nice. @lewtun - This should let us keep track of which roles generated which blocks of text in the rendered output. Are there other things you wanted added to chat templates in a similar vein that we might include in this PR? @xenova - how do you think this would interact with the chat template handling in transformers.js?
like I said, I am willing to work on this PR. Handling the new keyword on the transformers.js side would be needed too, but that would be work on that repo...
@yonigottesman we're very happy to have you open this PR! We'll definitely need to check with maintainers for other libraries to make sure that we can support it there (or at least ignore it without breaking the templates). However, I think opening a PR quickly is a good start - let us know whenever you're ready!
let's continue this conversation here: #30650
Feature request
when training a chat model I want to ignore labels that are "user" generated and only compute the loss on the "assistant" messages. The call `tokenizer.apply_chat_template(c, tokenize=True)` should also return a list of 0s and 1s, with 1 marking tokens that come from a "user" message. I can then create the `labels` of this input by marking all tokens generated by the user with -100. This is similar to the behavior of `DataCollatorForCompletionOnlyLM`, but with that class we search for the `instruction_template`, which is not easy to find in a multi-message conversation.
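For illustration, with such a mask (hypothetical names: `input_ids` and a parallel `user_mask` with 1 for user tokens; the exact return format is what this issue is proposing), building the labels is a one-liner:

```python
# Hypothetical usage of the proposed apply_chat_template output:
# ignore user tokens in the loss, train only on the rest.
labels = [-100 if is_user else tok for tok, is_user in zip(input_ids, user_mask)]
```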
Motivation

anyone training a conversational model should probably do this, and it's hard to do it together with `apply_chat_template`. In most cases people manually construct the chat string and set the -100 labels by hand (see fastchat llama).
If the proposal is accepted I will work on this and submit a PR.