Add support for ShareGPT-formatted datasets #2083

Open
lewtun opened this issue Sep 19, 2024 · 3 comments
Labels
🗃️ data (Related to data) · ✨ enhancement (New feature or request)

Comments

@lewtun
Member

lewtun commented Sep 19, 2024

Feature request

Many TRL trainers support the OpenAI spec for conversational datasets, where we have roles like system, user, and assistant in a list of messages as follows:

messages = [
    {"role": "system", "content": "You are AGI"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "What is my purpose?"},
]

However, many Hub datasets use the ShareGPT format, where the list of messages is stored in a conversations field and uses the following roles:

  • system
  • human (same as user in the OpenAI spec)
  • gpt (same as assistant in the OpenAI spec)

Here's an example:

conversations = [
    {"from": "system", "value": "You are AGI"},
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "What is my purpose?"},
]

Would it make sense to include a mapping within TRL that detects the ShareGPT format and maps it to the OpenAI spec?

Motivation

Currently, TRL users need to manually format datasets like this into the OpenAI spec, using logic like the following:

from datasets import load_dataset

sharegpt_role_mapping = {"system": "system", "human": "user", "gpt": "assistant"}


def create_messages(
    x,
    system_column: str = None,
    prompt_column: str = None,
    completion_column: str = None,
    share_gpt_column: str = None,
):
    """Create messages in H4 format"""
    if prompt_column is not None and completion_column is not None:
        x["messages"] = []
        if system_column is not None:
            x["messages"].append({"role": "system", "content": x[system_column]})
        x["messages"].extend(
            [{"role": "user", "content": x[prompt_column]}, {"role": "assistant", "content": x[completion_column]}]
        )
    elif share_gpt_column is not None:
        x["messages"] = []
        for msg in x[share_gpt_column]:
            x["messages"].append({"role": sharegpt_role_mapping[msg["from"]], "content": msg["value"]})
    # No need to format messages if they are already in the right format
    elif "messages" in x:
        return x
    else:
        raise ValueError("Dataset does not have the expected columns.")
    return x

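# `script_args` here holds the script's parsed CLI arguments (dataset name, column names, num_proc)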
ds = load_dataset(script_args.dataset_name)

ds = ds.map(
    create_messages,
    fn_kwargs={
        "system_column": script_args.system_column,
        "prompt_column": script_args.prompt_column,
        "completion_column": script_args.completion_column,
        "share_gpt_column": script_args.sharegpt_column,
    },
    num_proc=script_args.num_proc,
)

Although not a big deal, it is a bit annoying and limits the ability to mix and match datasets via the CLI. It would be nice if this could work by default.

Your contribution

Happy to open a PR, but want to first gauge if we think this is sufficiently useful vs people just rolling their own scripts.

@qgallouedec
Member

qgallouedec commented Sep 19, 2024

> I think this will help you.
> https://mega.co.nz/...
> I put the necessary dlls in the archive

Please don't use external links. Reported for security reasons.

huggingface deleted a comment Sep 19, 2024
@qgallouedec
Member

qgallouedec commented Sep 19, 2024

> Would it make sense to include a mapping within TRL that detects the ShareGPT format and maps it to the OpenAI spec?

I think so.

For context, trainers are expected to support conversational datasets, see #2071.

We can support ShareGPT at several levels:

  1. Trainers: the trainers would accept both the OpenAI spec and the ShareGPT format.
  2. Scripts: the example scripts would convert the dataset into the OpenAI spec format if needed; TRL would provide a util function to do the conversion before the dataset is passed to the trainer.

I'm more aligned with option 2.

We could add the following line to our scripts:

dataset = dataset.map(maybe_convert_to_sharegpt, remove_columns=dataset.column_names)
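
For illustration, here is a rough sketch of what such a util could look like (the conversations column check and the role mapping are assumptions, not a final API), written per example so it composes with datasets.map and remove_columns as in the line above:

sharegpt_to_openai_role = {"system": "system", "human": "user", "gpt": "assistant"}


def maybe_convert_to_sharegpt(example):
    """Map a ShareGPT-style example to the OpenAI spec; pass through examples that already comply."""
    if "messages" in example:
        # Already in the OpenAI spec, nothing to convert
        return {"messages": example["messages"]}
    if "conversations" in example:
        # ShareGPT format: a list of {"from": ..., "value": ...} turns
        return {
            "messages": [
                {"role": sharegpt_to_openai_role[turn["from"]], "content": turn["value"]}
                for turn in example["conversations"]
            ]
        }
    raise ValueError("Example is neither in the OpenAI spec nor in the ShareGPT format.")

Datasets that are already in the OpenAI spec would pass through untouched, so the same map call could be applied unconditionally in the example scripts.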

@lewtun
Member Author

lewtun commented Sep 19, 2024

I like option 2 as well - it simplifies maintenance of the core trainer logic. I'll implement something for the example scripts.

qgallouedec added the ✨ enhancement and 🗃️ data labels Oct 4, 2024