Provide Descriptions (READMEs) for trl-lib/dataset #2470

Closed
Kallinteris-Andreas opened this issue Dec 13, 2024 · 1 comment · Fixed by #2491
Labels
🗃️ data Related to data 📚 documentation Improvements or additions to documentation ✨ enhancement New feature or request 👶 good first issue Good for newcomers 🙋 help from community wanted Open invitation for community members to contribute

Comments


Kallinteris-Andreas commented Dec 13, 2024

Feature request

TRL includes a few datasets in HF/trl-lib, but those datasets do not include any information about themselves.

For example, trl-lib/Capybara does not have a README.md. It would be useful to include minimal information such as:

  • Who made it? The best info I could find is that NousResearch (the makers of the Hermes models) have a model named Capybara. Was this dataset used to train that model, or is it something else?
  • What is it for: SFT, reward model training, or RLHF (DPO/PPO)? (From my limited understanding, different dataset types are used for each of those processes.)
  • What is it intended to accomplish (domain adaptation? bias reduction?), that is, what is it intended to improve?

Motivation

Extra information is always useful. It is essential for evaluating the impact of the training process.

Your contribution

I am not sure what I can do to help, as I am not familiar with those datasets.

Member

qgallouedec commented Dec 13, 2024

Yes, that's a good point!

All datasets in hf.co/trl-lib are taken from an original dataset. We should at least indicate this dataset in the readme with something like:

This dataset is a processed version of [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) with this [script](https://github.com/huggingface/trl/blob/main/examples/datasets/ultrafeedback.py).

To do this, we should add to each script in https://github.com/huggingface/trl/blob/main/examples/datasets a model card that we push, like in

from huggingface_hub import HfApi, ModelCard

# Note: ORGANIZATION is defined elsewhere in the original script.

# Template for the card; {model_class_name} is filled in per model.
MODEL_CARD = """
---
library_name: transformers
tags: [trl]
---
# Tiny {model_class_name}
This is a minimal model built for unit tests in the [TRL](https://github.com/huggingface/trl) library.
"""
api = HfApi()


def push_to_hub(model, tokenizer, suffix=None):
    model_class_name = model.__class__.__name__
    content = MODEL_CARD.format(model_class_name=model_class_name)
    model_card = ModelCard(content)
    repo_id = f"{ORGANIZATION}/tiny-{model_class_name}"
    if suffix is not None:
        repo_id += f"-{suffix}"
    # Only push when the repo does not exist yet.
    if api.repo_exists(repo_id):
        print(f"Model {repo_id} already exists, skipping")
    else:
        model.push_to_hub(repo_id)
        tokenizer.push_to_hub(repo_id)
        model_card.push_to_hub(repo_id)

We could also add the type/format of dataset with a link to the relevant section in this page of the documentation: https://huggingface.co/docs/trl/en/dataset_formats
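
For illustration, here is a rough sketch of how the same pattern might look for a dataset script. This is not the implementation that was merged. The names `DATASET_CARD`, `render_dataset_card`, and `push_dataset_card` are hypothetical, the card layout is an assumption, and the dataset type passed in the usage example is a placeholder rather than a verified value. The only library call used is `DatasetCard` from `huggingface_hub`, imported lazily so the template can be rendered without it.

```python
# Hypothetical card template; the layout is an assumption, not TRL's actual card.
DATASET_CARD = """---
tags: [trl]
---

# {name} Dataset

This dataset is a processed version of [{source_id}](https://huggingface.co/datasets/{source_id}) with this [script]({script_url}).

- **Type**: {dataset_type}
- **Format**: see [dataset formats](https://huggingface.co/docs/trl/en/dataset_formats)
"""


def render_dataset_card(name, source_id, script_url, dataset_type):
    """Fill the template with the source dataset, processing script, and type."""
    return DATASET_CARD.format(
        name=name,
        source_id=source_id,
        script_url=script_url,
        dataset_type=dataset_type,
    )


def push_dataset_card(repo_id, content):
    """Push the rendered card as the README of a dataset repo on the Hub."""
    # Imported here so rendering works even without huggingface_hub installed.
    from huggingface_hub import DatasetCard

    DatasetCard(content).push_to_hub(repo_id)


if __name__ == "__main__":
    card = render_dataset_card(
        name="UltraFeedback",
        source_id="openbmb/UltraFeedback",
        script_url="https://github.com/huggingface/trl/blob/main/examples/datasets/ultrafeedback.py",
        dataset_type="preference",  # placeholder value, not verified
    )
    print(card)
```

Each script in examples/datasets could then call something like `push_dataset_card(repo_id, card)` right after pushing the processed dataset, so the README always names the original dataset and the script that produced it.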

@qgallouedec qgallouedec added 📚 documentation Improvements or additions to documentation ✨ enhancement New feature or request 🗃️ data Related to data 🙋 help from community wanted Open invitation for community members to contribute 👶 good first issue Good for newcomers labels Dec 13, 2024