Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fixing #2149] load_from_disk for rl tpye training #2193

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

leeparkuky
Copy link

@leeparkuky leeparkuky commented Dec 15, 2024

Here’s a draft for the PR message:


Description

This PR enhances the dataset loading functionality by introducing additional checks for loading datasets from disk. If the provided dataset path exists, it now attempts to load the dataset using load_from_disk. Additionally, it verifies if a specific split exists when dealing with a DatasetDict and raises a clear error message if the split is not found. If the path does not exist or an error occurs, the existing behavior of loading the dataset using load_dataset is retained.

Motivation and Context

This change is required to handle cases where datasets are stored locally on disk, improving flexibility and robustness. It solves potential issues when users provide a valid path to a local dataset that was previously not being handled correctly. The new implementation also adds split validation for DatasetDict, preventing silent failures when accessing invalid splits.

How has this been tested?

The changes have been tested by:

  1. Loading a dataset from a valid local path using load_from_disk.
  2. Verifying split existence when the dataset is of type DatasetDict.
  3. Raising appropriate errors when an invalid split is provided.
  4. Ensuring fallback to the original load_dataset when the path does not exist or errors occur.

Test scenarios include both valid and invalid paths, as well as valid and missing dataset splits.

Screenshots (if appropriate)

N/A

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Social Handles (Optional)


Let me know if you need further refinements!

@winglian
Copy link
Collaborator

@leeparkuky Thanks for the PR! I saw this PR and I think the way to tackle this is via #2204. I refactored out the dataset loading from remote or disk into its own function that's independent of the SFT transforms. Once this is merged, we can update this PR to simply use the new function provided that should handle the cases that you fixed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants