[Fixing #2149] load_from_disk for rl type training #2193

leeparkuky · 2024-12-15T20:16:59Z


Description

This PR adds support for loading datasets from disk. If the provided dataset path exists, the loader first tries load_from_disk. When the result is a DatasetDict, it checks that the requested split exists and raises a clear error if the split is not found. If the path does not exist, or load_from_disk fails, the loader falls back to the existing load_dataset behavior.
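A minimal sketch of the flow described above, not the actual diff in rl.py. The function name load_rl_dataset and the two injectable loader parameters are hypothetical and exist only so the logic can be exercised without real data. It also assumes that the selected split is returned after validation, and that any exception from load_from_disk triggers the fallback; the PR text does not spell out either detail.

```python
import os
from typing import Any, Callable, Optional


def load_rl_dataset(
    path: str,
    split: Optional[str] = None,
    load_from_disk_fn: Optional[Callable[[str], Any]] = None,
    load_dataset_fn: Optional[Callable[..., Any]] = None,
) -> Any:
    """Hypothetical helper: prefer a dataset saved on disk, else fall back."""
    if load_from_disk_fn is None or load_dataset_fn is None:
        # Imported lazily so the loaders can be replaced with fakes in tests.
        import datasets

        load_from_disk_fn = load_from_disk_fn or datasets.load_from_disk
        load_dataset_fn = load_dataset_fn or datasets.load_dataset

    loaded = None
    if os.path.exists(path):
        try:
            loaded = load_from_disk_fn(path)
        except Exception:
            # Assumption: any failure here means the path is not a
            # save_to_disk directory, so fall through to load_dataset.
            loaded = None

    if loaded is None:
        # Existing behavior: let load_dataset resolve the path or hub name.
        return load_dataset_fn(path, split=split)

    # DatasetDict subclasses dict, so a plain dict check covers it.
    if isinstance(loaded, dict) and split is not None:
        if split not in loaded:
            raise ValueError(
                f"Split '{split}' not found in dataset at '{path}'. "
                f"Available splits: {sorted(loaded)}"
            )
        return loaded[split]
    return loaded
```

Split validation sits outside the try block on purpose: a missing split should surface as an error rather than silently trigger the load_dataset fallback.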

Motivation and Context

Datasets stored locally on disk were previously not handled correctly, even when the user provided a valid path. This change handles that case, improving flexibility and robustness. The new implementation also adds split validation for DatasetDict, preventing silent failures when an invalid split is requested.

How has this been tested?

The changes have been tested by:

Loading a dataset from a valid local path using load_from_disk.
Verifying split existence when the dataset is of type DatasetDict.
Raising appropriate errors when an invalid split is provided.
Ensuring fallback to the original load_dataset when the path does not exist or errors occur.

Test scenarios include both valid and invalid paths, as well as valid and missing dataset splits.

Screenshots (if appropriate)

N/A

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Social Handles (Optional)


winglian · 2024-12-19T16:28:55Z

@leeparkuky Thanks for the PR! I think the way to tackle this is via #2204, where I refactored dataset loading from remote or disk into its own function that's independent of the SFT transforms. Once that is merged, we can update this PR to simply use the new function, which should handle the cases that you fixed here.

leeparkuky added 3 commits December 15, 2024 15:08

Update rl.py

1a69c60

Update rl.py

fac4495

Update rl.py

c2ff442

leeparkuky mentioned this pull request Dec 15, 2024

load_from_disk for rl type training #2192

Open

5 tasks

winglian requested review from NanoCode012 and bursteratom December 17, 2024 14:22
