Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove task arg in load_dataset in image-classification example #28408

Merged

Conversation

regisss
Copy link
Contributor

@regisss regisss commented Jan 9, 2024

What does this PR do?

The task argument is now deprecated in the datasets.load_dataset method. This PR removes it and adds the renaming logic needed to deal with datasets like Cifar10 (the task attribute of datasets used to help with that).
Internal discussion here: https://huggingface.slack.com/archives/C034N0A7H09/p1704447848692889

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. Indeed, the tasks were deprecated long ago and we should definitely stop promoting its use.

Just some comment below about a possible edge case.

)

# Rename image and label columns if needed (e.g. Cifar10)
if "img" in dataset["train"].features:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if data_args.dataset_name is None and data_args.train_dir is None, then the dataset dict will not have a "train" key, and the line above will raise a KeyError.

This could be avoided by replacing the line above with:

Suggested change
if "img" in dataset["train"].features:
if "img" in list(dataset.column_names.values())[0]:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or:

Suggested change
if "img" in dataset["train"].features:
if "img" in next(iter(dataset.column_names.values())):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe more readable ?

Suggested change
if "img" in dataset["train"].features:
if "img" in (dataset["train"].features if "train" in dataset else dataset["validation"].features):

and also compatible with your suggestion at huggingface/datasets#6571

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it seems this script always do training no ? in this case you can assume "train" is always present

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it's probably mostly used for training, but I'm going to add the suggestion anyway in case.

# Rename image and label columns if needed (e.g. Cifar10)
if "img" in dataset["train"].features:
dataset = dataset.rename_column("img", "image")
if "label" in dataset["train"].features:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above.

)
# See more about loading custom images at
# https://huggingface.co/docs/datasets/v2.0.0/en/image_process#imagefolder.

# Rename image and label columns if needed (e.g. Cifar10)
if "img" in dataset["train"].features:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above.

# Rename image and label columns if needed (e.g. Cifar10)
if "img" in dataset["train"].features:
dataset = dataset.rename_column("img", "image")
if "label" in dataset["train"].features:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above.

@albertvillanova
Copy link
Member

albertvillanova commented Jan 9, 2024

Related to this PR, I opened an issue in datasets to improve the user experience when using DatasetDict.column_names:

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@mariosasko
Copy link
Contributor

I think we should also add image_column_name (default image) and label_column_name (default label) to the example's DataTrainingArguments to be consistent with audio-classification as renaming img to image is not general enough.

@regisss regisss requested a review from mariosasko January 10, 2024 09:32
Comment on lines 280 to 283
if data_args.image_column_name != "image" and data_args.image_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features):
dataset = dataset.rename_column(data_args.image_column_name, "image")
if data_args.label_column_name != "labels" and data_args.label_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features):
dataset = dataset.rename_column(data_args.label_column_name, "labels")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with the audio-classification task, I think we should do the following:

Suggested change
if data_args.image_column_name != "image" and data_args.image_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features):
dataset = dataset.rename_column(data_args.image_column_name, "image")
if data_args.label_column_name != "labels" and data_args.label_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features):
dataset = dataset.rename_column(data_args.label_column_name, "labels")
dataset_column_names = dataset["train"].column_names if "train" in dataset else dataset["validation"].column_names
if data_args.image_column_name not in dataset_column_names:
if "img" in dataset_column_names:
logger.info(f"Renaming column img to {data_args.image_column_name}")
dataset = dataset.rename_column("img", data_args.image_column_name)
else:
raise ValueError(
f"--image_column_name {data_args.image_column_name} not found in dataset '{data_args.dataset_name}'. "
"Make sure to set `--image_column_name` to the correct image column - one of "
f"{', '.join(dataset_column_names)}."
)
if data_args.label_column_name not in dataset["train"].column_names:
raise ValueError(
f"--label_column_name {data_args.label_column_name} not found in dataset '{data_args.dataset_name}'. "
"Make sure to set `--label_column_name` to the correct text column - one of "
f"{', '.join(dataset_column_names)}."
)

(Then, we later need to rename label to labels either via rename_column or inside the train_transforms/val_transforms so that we don't break the script)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goal is to have something generic, we should precisely avoid using "img" in the example.
data_args.image_column_name should not be the name of the new column, we already know it is "image" like in the task template. It should be the name of the column to rename so that users can easily run the script if the image column name is not "img" or "image".

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's make sure you rebase and the CIs are green

@@ -197,7 +198,7 @@ accelerate test
that will check everything is ready for training. Finally, you can launch training with

```bash
accelerate launch run_image_classification_trainer.py
accelerate launch run_image_classification_no_trainer.py --image_column_name img
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default should not need this no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set the default to image because that's the name of the image column for most image datasets. But Cifar10, which is used in this example, has it named img for some reason.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it thanks

@ArthurZucker
Copy link
Collaborator

Thanks @regisss

@ArthurZucker ArthurZucker merged commit 0cdcd7a into huggingface:main Jan 16, 2024
8 checks passed
@regisss regisss deleted the remove_task_image_classification_example branch January 16, 2024 08:57
wgifford pushed a commit to wgifford/transformers that referenced this pull request Jan 21, 2024
…uggingface#28408)

* Remove `task` arg in `load_dataset` in image-classification example

* Manage case where "train" is not in dataset

* Add new args to manage image and label column names

* Similar to audio-classification example

* Fix README

* Update tests
AjayP13 pushed a commit to AjayP13/transformers that referenced this pull request Jan 22, 2024
…uggingface#28408)

* Remove `task` arg in `load_dataset` in image-classification example

* Manage case where "train" is not in dataset

* Add new args to manage image and label column names

* Similar to audio-classification example

* Fix README

* Update tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants