-
Notifications
You must be signed in to change notification settings - Fork 685
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🔨 Dataset Additions: CSV data, custom validation set, dataset filtering and splitting support #2239
base: feature/v2
Are you sure you want to change the base?
Conversation
…mes to test filter and split classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the huge effort.
Here's an inital (partial) round of review.
In general, the logic might still be a bit hard to follow for newcomers. This is inevitable because of the many scenarios that we need to cover, but at least we should make sure that it is thoroughly covered in the documentation.
src/anomalib/data/base/datamodule.py
Outdated
@property | ||
def category(self) -> str: | ||
"""Get the category of the datamodule.""" | ||
return self._category | ||
|
||
@category.setter | ||
def category(self, category: str) -> None: | ||
"""Set the category of the datamodule.""" | ||
self._category = category |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these used anywhere? It might be a bit confusing because not all datasets consist of multiple categories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this part is used for saving the images to filesystem. Maybe we could address this part in another PR, as the scope will expand
mapping = { | ||
"none": SplitMode.AUTO, | ||
"from_dir": SplitMode.PREDEFINED, | ||
"synthetic": SplitMode.SYNTHETIC, | ||
"same_as_test": SplitMode.AUTO, | ||
"from_train": SplitMode.AUTO, | ||
"from_test": SplitMode.AUTO, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mapping will not lead to the exact same behaviour between the new version and legacy versions. Not sure how big of an issue this is, but it's something to be aware of. Maybe we could include it in the warning message.
src/anomalib/data/base/datamodule.py
Outdated
|
||
# Check validation set | ||
if hasattr(self, "val_data") and not (self.val_data.has_normal and self.val_data.has_anomalous): | ||
msg = "Validation set should contain both normal and abnormal images." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be too strict. Some users may not have access to abnormal images at training time, but may still benefit from running a validation sequence on normal images for adaptive thresholding. (The adaptive threshold value in this case will default to the highest anomaly score predicted over the normal validation images, which turns out to be a not-too-bad estimate in absence of anomalous samples).
src/anomalib/data/base/datamodule.py
Outdated
|
||
# Check test set | ||
if hasattr(self, "test_data") and not (self.test_data.has_normal and self.test_data.has_anomalous): | ||
msg = "Test set should contain both normal and abnormal images." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may also be too strict. In some papers the pixel-level performance is reported over only the anomalous images of the test set. While this may not be the best practice, I think we should support it for those users that want to use this approach.
src/anomalib/data/base/datamodule.py
Outdated
) | ||
elif self.val_split_mode == SplitMode.SYNTHETIC: | ||
logger.info("Generating synthetic val set.") | ||
self.val_data = SyntheticAnomalyDataset.from_dataset(self.train_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to split the dataset first. Otherwise the training set and the validation set will consist of the same images.
src/anomalib/data/base/datamodule.py
Outdated
) | ||
elif self.test_split_mode == SplitMode.SYNTHETIC: | ||
logger.info("Generating synthetic test set.") | ||
self.test_data = SyntheticAnomalyDataset.from_dataset(self.train_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. We need to split the train set first, to ensure that the train and test sets are mutually exclusive.
Co-authored-by: Dick Ameln <[email protected]>
Signed-off-by: Samet Akcay <[email protected]>
Signed-off-by: Samet Akcay <[email protected]>
Signed-off-by: Samet Akcay <[email protected]>
…m:samet-akcay/anomalib into feature/add-custom-validation-set-support
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2239 +/- ##
==========================================
- Coverage 80.90% 80.74% -0.16%
==========================================
Files 248 250 +2
Lines 10859 11232 +373
==========================================
+ Hits 8785 9069 +284
- Misses 2074 2163 +89 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the massive effort. At risk of adding more work. I do have a few comments.
src/anomalib/data/video/avenue.py
Outdated
# Avenue dataset does not provide a validation set | ||
# Auto behaviour is to clone the test set as validation set. | ||
if self.val_split_mode == SplitMode.AUTO: | ||
self.val_data = self.test_data.clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably inform users about the selection. Maybe logger.info("Using testing data for validation")
assert all(filtered_samples.iloc[i]["image_path"] == f"image_{indices[i]}.jpg" for i in range(len(indices))) | ||
|
||
|
||
def test_filter_by_ratio(sample_classification_dataframe: pd.DataFrame) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also test edge cases like ratio = 0, and 1?
"""Test filtering by count.""" | ||
dataset_filter = DatasetFilter(sample_segmentation_dataframe) | ||
count = 50 | ||
filtered_samples = dataset_filter.apply(by=count, seed=42) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should we also test label_aware filter by count?
src/anomalib/data/base/dataset.py
Outdated
return copy.deepcopy(self) | ||
|
||
# Alias for copy method | ||
clone = copy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the advantage of defining this here?
Signed-off-by: Samet Akcay <[email protected]>
Signed-off-by: Samet Akcay <[email protected]>
Signed-off-by: Samet Akcay <[email protected]>
Signed-off-by: Samet Akcay <[email protected]>
Signed-off-by: Samet Akcay <[email protected]>
Signed-off-by: Samet Akcay <[email protected]>
- Updated validation split modes to 'auto' and set ratios to 'null' in avenue.yaml, btech.yaml, datumaro.yaml, shanghaitech.yaml, and ucsd_ped.yaml. - Changed test split modes to 'predefined' and set test split ratios to 'null' in btech.yaml, kolektor.yaml, mvtec.yaml, and visa.yaml. - Adjusted the val_split_ratio in folder.yaml to '0.5' for consistency. These changes standardize the configuration settings for validation and test splits across various datasets, enhancing maintainability and clarity.
… feature/add-custom-validation-set-support
Signed-off-by: Samet Akcay <[email protected]>
📝 Description
This PR introduces the following changes:
CSV Data Support
### New Splitting Mechanism via
SplitMode
TestSplitMode
andValSplitMode
has some duplication and overall is a bit confusing to use. Instead, we introduceSplitMode
to standardise the splitting mechanism across each subset.Dataset Filtering
Dataset Splitting
✨ Changes
Select what type of change your PR is:
✅ Checklist
Before you submit your pull request, please make sure you have completed the following steps:
For more information about code review checklists, see the Code Review Checklist.