-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove task
arg in load_dataset
in image-classification example
#28408
Remove task
arg in load_dataset
in image-classification example
#28408
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix. Indeed, the tasks
were deprecated long ago and we should definitely stop promoting its use.
Just some comment below about a possible edge case.
) | ||
|
||
# Rename image and label columns if needed (e.g. Cifar10) | ||
if "img" in dataset["train"].features: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that if data_args.dataset_name
is None and data_args.train_dir
is None, then the dataset
dict will not have a "train"
key, and the line above will raise a KeyError
.
This could be avoided by replacing the line above with:
if "img" in dataset["train"].features: | |
if "img" in list(dataset.column_names.values())[0]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or:
if "img" in dataset["train"].features: | |
if "img" in next(iter(dataset.column_names.values())): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe more readable ?
if "img" in dataset["train"].features: | |
if "img" in (dataset["train"].features if "train" in dataset else dataset["validation"].features): |
and also compatible with your suggestion at huggingface/datasets#6571
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually it seems this script always do training no ? in this case you can assume "train" is always present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it's probably mostly used for training, but I'm going to add the suggestion anyway in case.
# Rename image and label columns if needed (e.g. Cifar10) | ||
if "img" in dataset["train"].features: | ||
dataset = dataset.rename_column("img", "image") | ||
if "label" in dataset["train"].features: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as above.
) | ||
# See more about loading custom images at | ||
# https://huggingface.co/docs/datasets/v2.0.0/en/image_process#imagefolder. | ||
|
||
# Rename image and label columns if needed (e.g. Cifar10) | ||
if "img" in dataset["train"].features: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as above.
# Rename image and label columns if needed (e.g. Cifar10) | ||
if "img" in dataset["train"].features: | ||
dataset = dataset.rename_column("img", "image") | ||
if "label" in dataset["train"].features: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as above.
Related to this PR, I opened an issue in |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
I think we should also add |
if data_args.image_column_name != "image" and data_args.image_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features): | ||
dataset = dataset.rename_column(data_args.image_column_name, "image") | ||
if data_args.label_column_name != "labels" and data_args.label_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features): | ||
dataset = dataset.rename_column(data_args.label_column_name, "labels") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with the audio-classification
task, I think we should do the following:
if data_args.image_column_name != "image" and data_args.image_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features): | |
dataset = dataset.rename_column(data_args.image_column_name, "image") | |
if data_args.label_column_name != "labels" and data_args.label_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features): | |
dataset = dataset.rename_column(data_args.label_column_name, "labels") | |
dataset_column_names = dataset["train"].column_names if "train" in dataset else dataset["validation"].column_names | |
if data_args.image_column_name not in dataset_column_names: | |
if "img" in dataset_column_names: | |
logger.info(f"Renaming column img to {data_args.image_column_name}") | |
dataset = dataset.rename_column("img", data_args.image_column_name) | |
else: | |
raise ValueError( | |
f"--image_column_name {data_args.image_column_name} not found in dataset '{data_args.dataset_name}'. " | |
"Make sure to set `--image_column_name` to the correct image column - one of " | |
f"{', '.join(dataset_column_names)}." | |
) | |
if data_args.label_column_name not in dataset["train"].column_names: | |
raise ValueError( | |
f"--label_column_name {data_args.label_column_name} not found in dataset '{data_args.dataset_name}'. " | |
"Make sure to set `--label_column_name` to the correct text column - one of " | |
f"{', '.join(dataset_column_names)}." | |
) |
(Then, we later need to rename label
to labels
either via rename_column
or inside the train_transforms
/val_transforms
so that we don't break the script)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the goal is to have something generic, we should precisely avoid using "img" in the example.
data_args.image_column_name
should not be the name of the new column, we already know it is "image" like in the task template. It should be the name of the column to rename so that users can easily run the script if the image column name is not "img" or "image".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, let's make sure you rebase and the CIs are green
@@ -197,7 +198,7 @@ accelerate test | |||
that will check everything is ready for training. Finally, you can launch training with | |||
|
|||
```bash | |||
accelerate launch run_image_classification_trainer.py | |||
accelerate launch run_image_classification_no_trainer.py --image_column_name img |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the default should not need this no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set the default to image
because that's the name of the image column for most image datasets. But Cifar10, which is used in this example, has it named img
for some reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it thanks
Thanks @regisss |
…uggingface#28408) * Remove `task` arg in `load_dataset` in image-classification example * Manage case where "train" is not in dataset * Add new args to manage image and label column names * Similar to audio-classification example * Fix README * Update tests
…uggingface#28408) * Remove `task` arg in `load_dataset` in image-classification example * Manage case where "train" is not in dataset * Add new args to manage image and label column names * Similar to audio-classification example * Fix README * Update tests
What does this PR do?
The
task
argument is now deprecated in thedatasets.load_dataset
method. This PR removes it and adds the renaming logic needed to deal with datasets like Cifar10 (thetask
attribute of datasets used to help with that).Internal discussion here: https://huggingface.slack.com/archives/C034N0A7H09/p1704447848692889
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.