Clarification on dataset mixer #157

deep-diver · 2024-04-18T09:01:48Z

From the README in /scripts:

datasets_mixer:
    dataset_1: 0.5  # Use 50% of the training examples
    dataset_2: 0.66 # Use 66% of the training examples
    dataset_3: 0.10 # Use 10% of the training examples
dataset_splits:
- train_xxx         # The training splits to mix
- test_xxx          # The test splits to mix

From the comments, it looks like ONLY training samples from dataset_1, dataset_2, and dataset_3 are considered. There is no explanation of how each dataset contributes to the test_xxx split.

However, the actual implementation seems to look for the test_xxx split in every dataset specified:

alignment-handbook/src/alignment/data.py

Lines 225 to 230 in 70769f9

    
        if "train" in split:
            raw_train_datasets.append(dataset)
        elif "test" in split:
            raw_val_datasets.append(dataset)
        else:
            raise ValueError(f"Split type {split} not recognized as one of test or train.")

Could you please explain the relationships between multiple datasets and splits?
Thank you.


shabie · 2024-04-21T13:39:42Z

From the comments, it looks like ONLY training samples from dataset_1, dataset_2, and dataset_3 are considered. There is no explanation of how each dataset contributes to the test_xxx split.

Each dataset should have separate train and test splits. The docstring makes this clear: the expectation is that the split names start with train_ and test_ respectively. The percentages then sample that fraction of the datapoints from each train split. The corresponding test split is taken in full, since subsampling for validation seems pointless (unless validation is super expensive, then yeah, maybe).

If the confusion was that the data mixer automatically uses the "unused" part of the train split as a test dataset (like sklearn allows us to do), then no, that doesn't happen here. I like that, because changing the percentages of the mix can never cause test data to be mistakenly used for training.
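To make the behaviour described above concrete, here is a minimal sketch using plain Python lists in place of real datasets. It is not the alignment-handbook implementation: the function name `mix_datasets_sketch`, the toy data, and the choice to keep the first rows of each train split (rather than a random sample) are all assumptions made for illustration.

```python
import random


def mix_datasets_sketch(datasets, mixer, shuffle=True, seed=42):
    """Hypothetical stand-in for the mixer described in this thread.

    datasets: maps a dataset name to {"train": [...], "test": [...]}
    mixer:    maps a dataset name to the fraction of its train split to keep
    """
    train, test = [], []
    for name, frac in mixer.items():
        if frac < 0:
            raise ValueError("Dataset fractions cannot be negative.")
        splits = datasets[name]
        # Only the train split is subsampled by the fraction
        # (assumption: the first rows are kept).
        n_keep = int(frac * len(splits["train"]))
        train.extend(splits["train"][:n_keep])
        # The test split is taken in full, regardless of the fraction.
        test.extend(splits["test"])
    if shuffle:
        rng = random.Random(seed)
        rng.shuffle(train)
        rng.shuffle(test)
    return {"train": train, "test": test}


# Toy data: two datasets, each with its own train and test split.
toy = {
    "dataset_1": {"train": list(range(100)), "test": list(range(10))},
    "dataset_2": {"train": list(range(40)), "test": list(range(4))},
}
mixed = mix_datasets_sketch(toy, {"dataset_1": 0.5, "dataset_2": 0.25})
print(len(mixed["train"]), len(mixed["test"]))
```

With these toy sizes the mixed train split holds 50 + 10 rows and the mixed test split holds all 10 + 4 test rows. None of the unused train rows end up in the test split.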

Anyhow, all this is based on my understanding of the code. Hope it helps or if I am wrong, please correct me :)

deep-diver · 2024-04-22T07:39:26Z

Thank you @shabie

I think it could be common to have the test dataset in a single repo while the training datasets come from multiple sources.

At least this is my use case.
To do this, I ended up merging multiple datasets into a single one myself. I'm just hoping it could be done in the alignment handbook too.

JIElite · 2024-06-11T22:16:46Z

If we set a dataset's ratio in the mixer to 0.0, what will happen to the test set?

Will it use the full test set for evaluation,
or will it not use anything from that dataset?

deep-diver · 2024-06-11T23:59:16Z

@JIElite

AFAIK, the ratio doesn't have any impact on the test split.
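As a rough illustration of that claim, the sketch below only does the size arithmetic. It assumes the behaviour described earlier in this thread (train splits scaled by the ratio, test splits taken in full) and has not been checked against the library for the 0.0 case. The helper name `mixed_split_sizes` and the split sizes are hypothetical.

```python
def mixed_split_sizes(split_sizes, mixer):
    """Return (train_rows, test_rows) under the assumed mixing behaviour."""
    # Train rows: each dataset contributes int(ratio * size) rows.
    train_rows = sum(
        int(frac * split_sizes[name]["train"]) for name, frac in mixer.items()
    )
    # Test rows: every dataset listed in the mixer contributes its full test split.
    test_rows = sum(split_sizes[name]["test"] for name in mixer)
    return train_rows, test_rows


# Hypothetical split sizes for two datasets.
sizes = {
    "dataset_1": {"train": 1000, "test": 100},
    "dataset_2": {"train": 500, "test": 50},
}
train_rows, test_rows = mixed_split_sizes(
    sizes, {"dataset_1": 1.0, "dataset_2": 0.0}
)
print(train_rows, test_rows)
```

Under this assumption, a ratio of 0.0 drops all 500 train rows of dataset_2, while its 50 test rows still count toward evaluation.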

JIElite · 2024-06-12T11:03:19Z

@deep-diver
Thanks for the reply.
So it will still use the test set for evaluation, right? Even if we set the ratio to 0.0?
