Feature request
Many TRL trainers support the OpenAI spec for conversational datasets, where we have roles like system, user, and assistant in a list of messages as follows:
messages = [
{"role": "system", "content": "You are AGI"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "What is my purpose?"},
]
However, many Hub datasets use the ShareGPT format, where the list of messages is stored in a conversations field and uses the following roles:
system
human (same as user in the OpenAI spec)
gpt (same as assistant in the OpenAI spec)
Here's an example:
conversations = [
{"from": "system", "value": "You are AGI"},
{"from": "human", "value": "Hello"},
{"from": "gpt", "value": "What is my purpose?"},
]
Would it make sense to include a mapping within TRL that detects the ShareGPT format and maps it to the OpenAI spec?
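For illustration, here is a minimal sketch of what such a detection-and-mapping step could look like. The helper names and the detection heuristic are assumptions made for this issue, not an existing TRL API:

def is_sharegpt(example) -> bool:
    """Heuristically detect the ShareGPT format: a "conversations" column whose
    entries use "from"/"value" keys instead of "role"/"content".
    Note: illustrative helper only, not part of TRL."""
    conversations = example.get("conversations")
    return (
        isinstance(conversations, list)
        and len(conversations) > 0
        and "from" in conversations[0]
        and "value" in conversations[0]
    )

def sharegpt_to_openai(example):
    """Map ShareGPT roles to the OpenAI spec and store the result in a "messages" column.
    Note: illustrative helper only, not part of TRL."""
    role_map = {"system": "system", "human": "user", "gpt": "assistant"}
    example["messages"] = [
        {"role": role_map[m["from"]], "content": m["value"]} for m in example["conversations"]
    ]
    return example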
Motivation
Currently, TRL users need to manually format datasets like this into the OpenAI spec, using logic like the following:
from datasets import load_dataset

sharegpt_role_mapping = {"system": "system", "human": "user", "gpt": "assistant"}


def create_messages(
    x,
    system_column: str = None,
    prompt_column: str = None,
    completion_column: str = None,
    share_gpt_column: str = None,
):
    """Create messages in H4 format"""
    if prompt_column is not None and completion_column is not None:
        x["messages"] = []
        if system_column is not None:
            x["messages"].append({"role": "system", "content": x[system_column]})
        x["messages"].extend(
            [{"role": "user", "content": x[prompt_column]}, {"role": "assistant", "content": x[completion_column]}]
        )
    elif share_gpt_column is not None:
        x["messages"] = []
        for msg in x[share_gpt_column]:
            x["messages"].append({"role": sharegpt_role_mapping[msg["from"]], "content": msg["value"]})
    # No need to format messages if they are already in the right format
    elif "messages" in x:
        return x
    else:
        raise ValueError("Dataset does not have the expected columns.")
    return x


ds = load_dataset(script_args.dataset_name)
ds = ds.map(
    create_messages,
    fn_kwargs={
        "system_column": script_args.system_column,
        "prompt_column": script_args.prompt_column,
        "completion_column": script_args.completion_column,
        "share_gpt_column": script_args.sharegpt_column,
    },
    num_proc=script_args.num_proc,
)
Although not a big deal, it is a bit annoying and limits the ability to mix and match datasets via the CLI. It would be nice if this could work by default.
Your contribution
Happy to open a PR, but want to first gauge if we think this is sufficiently useful vs people just rolling their own scripts.
> Would it make sense to include a mapping within TRL that detects the ShareGPT format and maps it to the OpenAI spec?
I think so.
For context, trainers are expected to support conversational datasets; see #2071.
We can support ShareGPT at several levels:
Trainers: the trainers would accept both the OpenAI spec and ShareGPT.
Scripts: the example scripts would convert the dataset into the OpenAI spec format if needed. TRL would provide a util function to convert to the OpenAI format before passing the dataset to the trainer.
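As a rough sketch of the script-level option, assuming a conversion util along the lines of the sharegpt_to_openai helper sketched above (the util name and the dataset id below are placeholders, not an existing TRL function or dataset):

from datasets import load_dataset

# Placeholder dataset id, used only for illustration.
ds = load_dataset("some-org/some-sharegpt-dataset", split="train")

# Convert only when the ShareGPT "conversations" column is present; datasets that
# already follow the OpenAI spec pass through untouched.
if "conversations" in ds.column_names:
    ds = ds.map(sharegpt_to_openai, remove_columns=["conversations"])

# ds now exposes a "messages" column in the OpenAI spec and can be handed to a TRL trainer.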