Feature request
Many TRL trainers support the OpenAI spec for conversational datasets, where we have roles like system, user, and assistant in a list of messages as follows:
messages = [
{"role": "system", "content": "You are AGI"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "What is my purpose?"},
]
However, many Hub datasets use the ShareGPT format, where the list of messages is stored in a conversations field and uses the following roles:
system
human (same as user in the OpenAI spec)
gpt (same as assistant in the OpenAI spec)
Here's an example:
conversations = [
{"from": "system", "value": "You are AGI"},
{"from": "human", "value": "Hello"},
{"from": "gpt", "value": "What is my purpose?"},
]
Would it make sense to include a mapping within TRL that detects the ShareGPT format and maps it to the OpenAI spec?
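For illustration, here is a minimal sketch of what such a detection-and-mapping step could look like. The helper names and the detection heuristic are assumptions made for this issue, not an existing TRL API:

def is_sharegpt(example) -> bool:
    """Heuristically detect the ShareGPT format: a "conversations" column whose
    entries use "from"/"value" keys instead of "role"/"content".
    Note: illustrative helper only, not part of TRL."""
    conversations = example.get("conversations")
    return (
        isinstance(conversations, list)
        and len(conversations) > 0
        and "from" in conversations[0]
        and "value" in conversations[0]
    )

def sharegpt_to_openai(example):
    """Map ShareGPT roles to the OpenAI spec and store the result in a "messages" column.
    Note: illustrative helper only, not part of TRL."""
    role_map = {"system": "system", "human": "user", "gpt": "assistant"}
    example["messages"] = [
        {"role": role_map[m["from"]], "content": m["value"]} for m in example["conversations"]
    ]
    return example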
Motivation
Currently, TRL users need to manually format datasets like this into the OpenAI spec, using logic like the following:
from datasets import load_dataset

sharegpt_role_mapping = {"system": "system", "human": "user", "gpt": "assistant"}


def create_messages(
    x,
    system_column: str = None,
    prompt_column: str = None,
    completion_column: str = None,
    share_gpt_column: str = None,
):
    """Create messages in H4 format"""
    if prompt_column is not None and completion_column is not None:
        x["messages"] = []
        if system_column is not None:
            x["messages"].append({"role": "system", "content": x[system_column]})
        x["messages"].extend(
            [{"role": "user", "content": x[prompt_column]}, {"role": "assistant", "content": x[completion_column]}]
        )
    elif share_gpt_column is not None:
        x["messages"] = []
        for msg in x[share_gpt_column]:
            x["messages"].append({"role": sharegpt_role_mapping[msg["from"]], "content": msg["value"]})
    # No need to format messages if they are already in the right format
    elif "messages" in x:
        return x
    else:
        raise ValueError("Dataset does not have the expected columns.")
    return x


ds = load_dataset(script_args.dataset_name)
ds = ds.map(
    create_messages,
    fn_kwargs={
        "system_column": script_args.system_column,
        "prompt_column": script_args.prompt_column,
        "completion_column": script_args.completion_column,
        "share_gpt_column": script_args.sharegpt_column,
    },
    num_proc=script_args.num_proc,
)
Although not a big deal, it is a bit annoying and limits the ability to mix and match datasets via the CLI. It would be nice if this could work by default.
Your contribution
Happy to open a PR, but want to first gauge if we think this is sufficiently useful vs people just rolling their own scripts.
> Would it make sense to include a mapping within TRL that detects the ShareGPT format and maps it to the OpenAI spec?
I think so.
For context, trainers are expected to support conversational datasets; see #2071.
We can support ShareGPT at several levels:
Trainers: the trainers would accept both the OpenAI spec and ShareGPT.
Scripts: the example scripts would convert the dataset into the OpenAI spec format if needed. TRL would provide a util function to convert to the OpenAI format before passing the dataset to the trainer.
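As a rough sketch of the script-level option, assuming a conversion util along the lines of the sharegpt_to_openai helper sketched above (the util name and the dataset id below are placeholders, not an existing TRL function or dataset):

from datasets import load_dataset

# Placeholder dataset id, used only for illustration.
ds = load_dataset("some-org/some-sharegpt-dataset", split="train")

# Convert only when the ShareGPT "conversations" column is present; datasets that
# already follow the OpenAI spec pass through untouched.
if "conversations" in ds.column_names:
    ds = ds.map(sharegpt_to_openai, remove_columns=["conversations"])

# ds now exposes a "messages" column in the OpenAI spec and can be handed to a TRL trainer.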