Clarification on dataset mixer #157
Comments
Each dataset should have a separate
If the confusion was that the datamixer automatically uses the "unused" part of the
Anyhow, all this is based on my understanding of the code. Hope it helps, or if I am wrong, please correct me :)
Thank you @shabie. I think it is common to have a test dataset in a single repo while the training data comes from multiple sources. At least, this is my use case.
If we set a dataset's ratio in the mixer to 0.0, what will happen to the test set?
AFAIK, the ratio doesn't have any impact on the test split.
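To make the behavior discussed in this thread concrete, here is a minimal sketch in plain Python. It is not the code from `alignment-handbook`: `mix_splits_sketch` is a hypothetical helper, the datasets are plain lists, and silently skipping a dataset that has no test split is an assumption (the real loader may raise instead). It only illustrates the claim that the ratio subsamples the training split and leaves test splits untouched.

```python
import random
from typing import Dict, List, Mapping, Sequence


def mix_splits_sketch(
    sources: Mapping[str, Mapping[str, Sequence[str]]],
    fractions: Mapping[str, float],
    seed: int = 42,
) -> Dict[str, List[str]]:
    """Hypothetical helper: subsample train splits by ratio, keep test splits whole."""
    if any(frac < 0 for frac in fractions.values()):
        raise ValueError("Fractions must be non-negative.")

    train: List[str] = []
    test: List[str] = []
    for name, frac in fractions.items():
        splits = sources[name]
        # The ratio only limits how many *training* rows are taken.
        train_rows = list(splits.get("train", []))
        train.extend(train_rows[: int(frac * len(train_rows))])
        # Test rows are taken in full, regardless of the ratio.
        # Assumption: a dataset without a test split is simply skipped.
        test.extend(splits.get("test", []))

    # Shuffle the mixed training rows with a fixed seed for reproducibility.
    random.Random(seed).shuffle(train)
    return {"train": train, "test": test}


sources = {
    "dataset_1": {"train": ["a1", "a2", "a3", "a4"], "test": ["ta1", "ta2"]},
    "dataset_2": {"train": ["b1", "b2"]},  # no test split
    "dataset_3": {"train": ["c1", "c2"], "test": ["tc1"]},
}
fractions = {"dataset_1": 0.5, "dataset_2": 1.0, "dataset_3": 0.0}

mixed = mix_splits_sketch(sources, fractions)
print(sorted(mixed["train"]))  # ['a1', 'a2', 'b1', 'b2']
print(mixed["test"])           # ['ta1', 'ta2', 'tc1']
```

In this sketch, `dataset_3` has a ratio of 0.0, so it contributes no training rows, yet its test rows still end up in the mixed test split.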
@deep-diver
Regarding the README in `/scripts`: from the comments, it looks like ONLY training samples from `dataset_1`, `dataset_2`, and `dataset_3` are considered. There is no explanation of how each dataset contributes to the `test_xxx` split.
However, the actual implementation seems to search for the `test_xxx` split in all of the specified datasets:
alignment-handbook/src/alignment/data.py
Lines 225 to 230 in 70769f9
Could you please explain the relationships between multiple datasets and splits?
Thank you.