🔨 Dataset Additions: CSV data, custom validation set, dataset filtering and splitting support #2239

samet-akcay · 2024-08-08T12:28:10Z

📝 Description

Addresses:

This PR introduces the following changes:

CSV Data Support

With this PR, users can provide a CSV file for their custom datasets. For example:

        >>> data_module = CSV(
        ...     name="custom_format_dataset",
        ...     csv_path="path/to/sample_dataset.csv",
        ...     task=TaskType.CLASSIFICATION,
        ...     sep=";",
        ...     extension=[".jpg", ".png"],
        ... )
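To make the `sep` and `extension` arguments concrete, here is a minimal stand-alone sketch of reading such a file with the standard library. The column names (`image_path`, `label`, `split`) and the helper `read_samples` are assumptions made for illustration only; the description above does not state the schema the CSV datamodule expects, nor exactly how it interprets `extension`.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical file content; the real column layout is not given in this PR description.
SAMPLE_CSV = """image_path;label;split
images/good_000.png;normal;train
images/bad_000.jpg;abnormal;test
notes/readme.txt;normal;train
"""


def read_samples(csv_text, sep=";", extensions=(".jpg", ".png")):
    """Parse delimiter-separated rows and keep only rows whose path has an allowed extension."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=sep)
    return [
        row
        for row in reader
        if PurePosixPath(row["image_path"]).suffix.lower() in extensions
    ]


samples = read_samples(SAMPLE_CSV)
```

In this sketch the `readme.txt` row is dropped because its extension is not in the allowed list, which is one plausible reading of what the `extension` argument is for.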

New Splitting Mechanism via SplitMode

The existing splitting mechanism is spread across TestSplitMode and ValSplitMode, which duplicates logic and is confusing to use. Instead, we introduce SplitMode to standardise the splitting mechanism across all subsets.
This PR also provides a backward-compatibility layer that maps the old splitting keys to the new ones.

from enum import Enum


class SplitMode(str, Enum):
    SYNTHETIC = "synthetic"
    PREDEFINED = "predefined"
    AUTO = "auto"

        >>> resolve_split_mode(TestSplitMode.NONE)  # Legacy input (deprecated)
        DeprecationWarning: The split mode TestSplitMode.NONE is deprecated. Use 'SplitMode.AUTO' instead.
        SplitMode.AUTO

        >>> resolve_split_mode(TestSplitMode.SYNTHETIC)  # Legacy input (deprecated)
        DeprecationWarning: The split mode TestSplitMode.SYNTHETIC is deprecated. Use 'SplitMode.SYNTHETIC' instead.
        SplitMode.SYNTHETIC

        >>> resolve_split_mode(ValSplitMode.FROM_TRAIN)  # Legacy input (deprecated)
        DeprecationWarning: The split mode ValSplitMode.FROM_TRAIN is deprecated. Use 'SplitMode.AUTO' instead.
        SplitMode.AUTO

        >>> resolve_split_mode(SplitMode.PREDEFINED)  # Current input (preferred)
        SplitMode.PREDEFINED
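The backward-compatibility layer shown above could look roughly like the following. This is a sketch, not the code in `src/anomalib/data/utils/split.py`: the key-to-mode mapping is the one quoted in the review further down, while the two legacy enums are hypothetical stand-ins defined here so the example runs on its own, and the handling of plain strings is an assumption.

```python
import warnings
from enum import Enum


class SplitMode(str, Enum):
    SYNTHETIC = "synthetic"
    PREDEFINED = "predefined"
    AUTO = "auto"


class TestSplitMode(str, Enum):  # hypothetical stand-in for the legacy enum
    NONE = "none"
    FROM_DIR = "from_dir"
    SYNTHETIC = "synthetic"


class ValSplitMode(str, Enum):  # hypothetical stand-in for the legacy enum
    NONE = "none"
    SAME_AS_TEST = "same_as_test"
    FROM_TRAIN = "from_train"
    FROM_TEST = "from_test"
    SYNTHETIC = "synthetic"


# Legacy key -> new mode, as quoted in the review of split.py.
LEGACY_TO_NEW = {
    "none": SplitMode.AUTO,
    "from_dir": SplitMode.PREDEFINED,
    "synthetic": SplitMode.SYNTHETIC,
    "same_as_test": SplitMode.AUTO,
    "from_train": SplitMode.AUTO,
    "from_test": SplitMode.AUTO,
}


def resolve_split_mode(mode):
    """Return a SplitMode, warning when a legacy value had to be translated."""
    if isinstance(mode, SplitMode):
        return mode  # already the preferred type, no warning
    if isinstance(mode, Enum):
        key, shown = mode.value, f"{type(mode).__name__}.{mode.name}"
    else:
        key = shown = str(mode).lower()
        try:
            return SplitMode(key)  # plain string that is already a valid new key
        except ValueError:
            pass
    if key not in LEGACY_TO_NEW:
        raise ValueError(f"Unknown split mode: {shown}")
    new_mode = LEGACY_TO_NEW[key]
    warnings.warn(
        f"The split mode {shown} is deprecated. Use 'SplitMode.{new_mode.name}' instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_mode
```

Because several legacy keys collapse onto `SplitMode.AUTO`, the translation is lossy, so the warning text is a natural place to tell users that behaviour may differ from the legacy mode.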

Dataset Filtering

We propose a new dataset filter object that makes it easy to filter datasets. For example:

            # Apply filters via the apply method:
            >>> dataset.filter.apply("normal")  # label
            >>> dataset.filter.apply(100)       # count
            >>> dataset.filter.apply(0.5)       # ratio
            >>> dataset.filter.apply({"label": "normal", "count": 100})  # multiple filters
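The type-based dispatch in the calls above (string for label, integer for count, float for ratio, dict for several filters at once) can be sketched as follows. This is an illustration under assumptions, not the `DatasetFilter` implementation: it works on a plain list of dicts rather than a dataframe, and the function names and the rounding rule for ratios are invented for the example.

```python
import random


def filter_by_label(samples, label):
    """Keep samples whose label matches."""
    return [s for s in samples if s["label"] == label]


def filter_by_count(samples, count, rng):
    """Randomly keep at most `count` samples."""
    return rng.sample(samples, min(count, len(samples)))


def filter_by_ratio(samples, ratio, rng):
    """Randomly keep a fraction of the samples."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"ratio must be in [0, 1], got {ratio}")
    return filter_by_count(samples, round(len(samples) * ratio), rng)


def apply_filter(samples, by, seed=None):
    """Pick the filter from the type of `by`; a dict chains filters in key order."""
    rng = random.Random(seed)
    if isinstance(by, str):
        return filter_by_label(samples, by)
    if isinstance(by, bool):  # bool is a subclass of int, so reject it explicitly
        raise TypeError("bool is not a valid filter")
    if isinstance(by, int):
        return filter_by_count(samples, by, rng)
    if isinstance(by, float):
        return filter_by_ratio(samples, by, rng)
    if isinstance(by, dict):
        result = samples
        for key, value in by.items():
            if key == "label":
                result = filter_by_label(result, value)
            elif key == "count":
                result = filter_by_count(result, value, rng)
            elif key == "ratio":
                result = filter_by_ratio(result, value, rng)
            else:
                raise KeyError(f"Unknown filter key: {key}")
        return result
    raise TypeError(f"Unsupported filter type: {type(by).__name__}")


samples = [
    {"image_path": f"image_{i}.jpg", "label": "normal" if i < 6 else "abnormal"}
    for i in range(10)
]
normal_only = apply_filter(samples, "normal")
two_normals = apply_filter(samples, {"label": "normal", "count": 2}, seed=42)
```

One design consequence worth documenting is that `1` (count) and `1.0` (ratio) mean very different things, so the int/float distinction needs to be explicit in the user guide.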

Dataset Splitting

In addition to dataset filtering, this PR introduces dataset splitting via create_subset:

            # Create a subset based on label values:
            >>> normal_dataset, abnormal_dataset = dataset.create_subset("label")

            # Create a subset based on specific sample indices:
            >>> train_set, val_set, test_set = dataset.create_subset([[0, 2, 3], [1, 4], [5]])

            # Create a subset based on specific split ratios:
            >>> train_set, val_set, test_set = dataset.create_subset([0.6, 0.2, 0.2], seed=42)

            # Create a subset based on the number of samples:
            >>> dataset.create_subset(100)

            # Create a subset based on custom criteria:
            >>> dataset.create_subset({"label": "normal", "count": 100})
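As a rough illustration of the first three splitting modes above (by label, by explicit indices, by ratios), the following stand-alone sketch operates on a plain list of dicts. The helper names and the way leftover samples are distributed when ratios do not divide evenly are assumptions for this example, not the behaviour of the PR's `create_subset`.

```python
import random


def split_by_label(samples):
    """Return (normal, abnormal) subsets."""
    normal = [s for s in samples if s["label"] == "normal"]
    abnormal = [s for s in samples if s["label"] != "normal"]
    return normal, abnormal


def split_by_indices(samples, index_groups):
    """Return one subset per group of indices."""
    return [[samples[i] for i in group] for group in index_groups]


def split_by_ratios(samples, ratios, seed=None):
    """Shuffle once, then cut into consecutive, mutually exclusive chunks."""
    if abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    sizes = [int(len(shuffled) * r) for r in ratios]
    # Assumption: samples lost to rounding go to the earliest subsets.
    for i in range(len(shuffled) - sum(sizes)):
        sizes[i % len(sizes)] += 1
    subsets, start = [], 0
    for size in sizes:
        subsets.append(shuffled[start:start + size])
        start += size
    return subsets


def create_subset(samples, criteria, seed=None):
    """Dispatch on the shape of `criteria`, mirroring the calls shown above."""
    if criteria == "label":
        return split_by_label(samples)
    if isinstance(criteria, list) and criteria:
        if all(isinstance(c, (list, tuple)) for c in criteria):
            return split_by_indices(samples, criteria)
        if all(isinstance(c, float) for c in criteria):
            return split_by_ratios(samples, criteria, seed=seed)
    raise TypeError(f"Unsupported criteria: {criteria!r}")
```

Shuffling once and slicing consecutive chunks guarantees that the resulting subsets never share a sample, which is the property the review comments below ask for when deriving validation and test sets from the training set.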

✨ Changes

Select what type of change your PR is:

🐞 Bug fix (non-breaking change which fixes an issue)
🔨 Refactor (non-breaking change which refactors the code base)
🚀 New feature (non-breaking change which adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📚 Documentation update
🔒 Security update

✅ Checklist

Before you submit your pull request, please make sure you have completed the following steps:

📋 I have summarized my changes in the CHANGELOG and followed the guidelines for my type of change (skip for minor changes, documentation updates, and test enhancements).
📚 I have made the necessary updates to the documentation (if applicable).
🧪 I have written tests that support my changes and prove that my fix is effective or my feature works (if applicable).

For more information about code review checklists, see the Code Review Checklist.


djdameln

Thanks for the huge effort.

Here's an initial (partial) round of review.

In general, the logic might still be a bit hard to follow for newcomers. This is inevitable because of the many scenarios that we need to cover, but at least we should make sure that it is thoroughly covered in the documentation.

djdameln · 2024-08-08T17:03:11Z

src/anomalib/data/base/datamodule.py

+    @property
+    def category(self) -> str:
+        """Get the category of the datamodule."""
+        return self._category
+
+    @category.setter
+    def category(self, category: str) -> None:
+        """Set the category of the datamodule."""
+        self._category = category


Are these used anywhere? It might be a bit confusing because not all datasets consist of multiple categories.

samet-akcay · 2024-08-13

I think this part is used for saving the images to the filesystem. Maybe we could address it in another PR, since the scope would otherwise expand.

djdameln · 2024-08-08T17:16:15Z

src/anomalib/data/utils/split.py

+    mapping = {
+        "none": SplitMode.AUTO,
+        "from_dir": SplitMode.PREDEFINED,
+        "synthetic": SplitMode.SYNTHETIC,
+        "same_as_test": SplitMode.AUTO,
+        "from_train": SplitMode.AUTO,
+        "from_test": SplitMode.AUTO,
+    }


This mapping will not lead to the exact same behaviour between the new version and legacy versions. Not sure how big of an issue this is, but it's something to be aware of. Maybe we could include it in the warning message.

djdameln · 2024-08-08T17:23:52Z

src/anomalib/data/base/datamodule.py

+
+        # Check validation set
+        if hasattr(self, "val_data") and not (self.val_data.has_normal and self.val_data.has_anomalous):
+            msg = "Validation set should contain both normal and abnormal images."


This may be too strict. Some users may not have access to abnormal images at training time, but may still benefit from running a validation sequence on normal images for adaptive thresholding. (In this case the adaptive threshold value defaults to the highest anomaly score predicted over the normal validation images, which turns out to be a not-too-bad estimate in the absence of anomalous samples.)

djdameln · 2024-08-08T17:26:23Z

src/anomalib/data/base/datamodule.py

+
+        # Check test set
+        if hasattr(self, "test_data") and not (self.test_data.has_normal and self.test_data.has_anomalous):
+            msg = "Test set should contain both normal and abnormal images."


This may also be too strict. In some papers the pixel-level performance is reported over only the anomalous images of the test set. While this may not be the best practice, I think we should support it for those users that want to use this approach.
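One way to relax the two checks discussed above is to log a warning by default and raise only when the caller opts in. The sketch below is a hypothetical helper, not code from this PR; the function name `check_subset` and the `strict` flag are assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def check_subset(name, has_normal, has_anomalous, strict=False):
    """Return True if the subset has both classes; otherwise warn, or raise when strict."""
    if has_normal and has_anomalous:
        return True
    msg = f"{name} set should contain both normal and abnormal images."
    if strict:
        raise ValueError(msg)
    # Non-strict mode keeps normal-only validation and anomalous-only test sets usable.
    logger.warning(msg)
    return False
```

With this shape, a normal-only validation set still runs (with a logged warning), while users who want the hard guarantee can pass `strict=True`.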

djdameln · 2024-08-08T17:57:14Z

src/anomalib/data/base/datamodule.py

+            )
+        elif self.val_split_mode == SplitMode.SYNTHETIC:
+            logger.info("Generating synthetic val set.")
+            self.val_data = SyntheticAnomalyDataset.from_dataset(self.train_data)


I think we need to split the dataset first. Otherwise the training set and the validation set will consist of the same images.

djdameln · 2024-08-08T18:00:23Z

src/anomalib/data/base/datamodule.py

+            )
+        elif self.test_split_mode == SplitMode.SYNTHETIC:
+            logger.info("Generating synthetic test set.")
+            self.test_data = SyntheticAnomalyDataset.from_dataset(self.train_data)


Same as above. We need to split the train set first, to ensure that the train and test sets are mutually exclusive.
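The fix suggested in the two comments above amounts to holding out part of the training set before generating synthetic anomalies. A minimal sketch, assuming a list of file names in place of a dataset and a hypothetical `make_synthetic` stand-in for `SyntheticAnomalyDataset.from_dataset`:

```python
import random


def split_disjoint(samples, holdout_ratio, seed=None):
    """Shuffle and cut the samples into (remaining, holdout) with no overlap."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_ratio)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def make_synthetic(samples):
    """Hypothetical stand-in: tag each held-out sample as a synthetic source image."""
    return [f"{s}+synthetic" for s in samples]


all_train = [f"image_{i}.png" for i in range(10)]
train, holdout = split_disjoint(all_train, holdout_ratio=0.2, seed=42)
val = make_synthetic(holdout)  # built only from images removed from the training set
```

The key point is the order of operations: the split happens first, so no image used to build the synthetic subset remains in the training set.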


codecov · 2024-08-22T11:46:42Z

Codecov Report

Attention: Patch coverage is 79.12525% with 105 lines in your changes missing coverage. Please review.

Project coverage is 80.74%. Comparing base (2bd2842) to head (c575d6a).
Report is 5 commits behind head on main.

Files	Patch %	Lines
src/anomalib/data/base/datamodule.py	53.76%	43 Missing ⚠️
src/anomalib/data/image/csv.py	72.94%	23 Missing ⚠️
src/anomalib/data/utils/filter.py	86.81%	12 Missing ⚠️
src/anomalib/data/utils/split.py	90.47%	12 Missing ⚠️
src/anomalib/data/base/dataset.py	77.27%	10 Missing ⚠️
src/anomalib/data/image/folder.py	76.19%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2239      +/-   ##
==========================================
- Coverage   80.90%   80.74%   -0.16%     
==========================================
  Files         248      250       +2     
  Lines       10859    11232     +373     
==========================================
+ Hits         8785     9069     +284     
- Misses       2074     2163      +89


ashwinvaidya17

Thanks for the massive effort. At the risk of adding more work, I do have a few comments.

ashwinvaidya17 · 2024-09-05T13:57:27Z

src/anomalib/data/video/avenue.py

+        # Avenue dataset does not provide a validation set
+        # Auto behaviour is to clone the test set as validation set.
+        if self.val_split_mode == SplitMode.AUTO:
+            self.val_data = self.test_data.clone()


We should probably inform users about the selection. Maybe logger.info("Using testing data for validation")

ashwinvaidya17 · 2024-09-05T14:17:12Z

tests/unit/data/utils/test_filter.py

+    assert all(filtered_samples.iloc[i]["image_path"] == f"image_{indices[i]}.jpg" for i in range(len(indices)))
+
+
+def test_filter_by_ratio(sample_classification_dataframe: pd.DataFrame) -> None:


Should we also test edge cases like ratio = 0, and 1?
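A test for the boundary ratios could look like the sketch below. It uses a self-contained `filter_by_ratio` written for this example rather than the PR's `DatasetFilter`, and it assumes that a ratio of 0 yields an empty result and that out-of-range ratios raise `ValueError`; whether the real filter should behave that way is a decision for the authors.

```python
import random


def filter_by_ratio(samples, ratio, seed=None):
    """Randomly keep round(len(samples) * ratio) samples (illustrative stand-in)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"ratio must be in [0, 1], got {ratio}")
    count = round(len(samples) * ratio)
    return random.Random(seed).sample(samples, count)


def test_filter_by_ratio_edge_cases():
    samples = [f"image_{i}.jpg" for i in range(10)]
    # ratio == 0 keeps nothing
    assert filter_by_ratio(samples, 0.0) == []
    # ratio == 1 keeps everything (order may differ because of sampling)
    assert sorted(filter_by_ratio(samples, 1.0, seed=42)) == sorted(samples)
    # values outside [0, 1] are rejected
    for bad_ratio in (-0.1, 1.1):
        try:
            filter_by_ratio(samples, bad_ratio)
        except ValueError:
            continue
        raise AssertionError(f"ratio {bad_ratio} should have been rejected")


test_filter_by_ratio_edge_cases()
```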

ashwinvaidya17 · 2024-09-05T14:18:34Z

tests/unit/data/utils/test_filter.py

+    """Test filtering by count."""
+    dataset_filter = DatasetFilter(sample_segmentation_dataframe)
+    count = 50
+    filtered_samples = dataset_filter.apply(by=count, seed=42)


Should we also test the label_aware filter by count?

ashwinvaidya17 · 2024-09-05T14:20:02Z

src/anomalib/data/base/dataset.py

+        return copy.deepcopy(self)
+
+    # Alias for copy method
+    clone = copy


What's the advantage of defining this here?

- Updated validation split modes to 'auto' and set ratios to 'null' in avenue.yaml, btech.yaml, datumaro.yaml, shanghaitech.yaml, and ucsd_ped.yaml.
- Changed test split modes to 'predefined' and set test split ratios to 'null' in btech.yaml, kolektor.yaml, mvtec.yaml, and visa.yaml.
- Adjusted the val_split_ratio in folder.yaml to '0.5' for consistency.

These changes standardize the configuration settings for validation and test splits across various datasets, enhancing maintainability and clarity.


samet-akcay added 15 commits August 7, 2024 13:19

re-order data __init__

c35804f

Added dataset filter and split

5780541

Add the base dataset and datamodule implementations

a139f4f

Add conftest to create sample classification and segmentation dataframes to test filter and split classes

11492a0

Edited video datamodule

1ad059c

Add csv data documentation

e631aa0

Fix a bug in datamodule to address when train/val/test datasets are provided

9e53c4c

Fix a bug in resolve_split_mode function

f45c7ec

Add CSV dataset and datamodules

28a6244

add relative import to anomalib.data

e46bb37

Add clone and copy method to anomalib dataset

3aa0ac2

Add csv tests

1565bbf

Change csv dataset assignment logic

591e2ed

Update the mvtec logic

ad8271b

Reflect the changes in image datasets.

a19b7f1

samet-akcay requested review from ashwinvaidya17 and djdameln as code owners August 8, 2024 12:28

samet-akcay changed the title ~~🔨 Dataset refactor: Add CSV Data support and custom validation set support~~ 🔨 Dataset Additions: CSV data, custom validation set, dataset filtering and splitting support Aug 8, 2024

djdameln reviewed Aug 8, 2024

samet-akcay and others added 11 commits August 13, 2024 05:49

Add clear cache option to avoid using old samples

7bc8439

Modify split_by_ratio logic

374d360

Modify _process_train_only_scenario

df64670

Update the Folder datamodule

751ed3c

Update src/anomalib/data/base/datamodule.py

ceb9675

Co-authored-by: Dick Ameln <[email protected]>

pre-commit

e7e955c

Remove unused old code

64fd3f3

Refactor CSV dataset assignment logic

4dc7d7e

Update folder notebook

8a7d87a

Add CSV notebook

73fcd41

chore: Add CSV datamodule notebook and update folder notebook

3b91977

samet-akcay added 5 commits August 13, 2024 15:18

Add csv link to the sidebar in docs.

a21332a

Add more informative deprecation warning in resolvesplitmode. Change logic in split-by-ratio

5e0a088

Address PR comments.

a992f75

Fix the csv tests

5e7de57

Merge branch 'main' into feature/add-custom-validation-set-support

5eaa516

samet-akcay added the Ready for Review label Aug 20, 2024

samet-akcay added 8 commits August 20, 2024 12:48

refactor: Update folder.yaml configuration file

f7603a1

remove normal_test_dir arg from tests

74ccb06

remove normal_test_dir arg from tests

9f58d18

Reflect the new changes in video datamodules

b871b9e

Signed-off-by: Samet Akcay <[email protected]>

Modify the tests to conform the new datamodule format

ebfaf7f

Signed-off-by: Samet Akcay <[email protected]>

Modify the tests to conform the new datamodule format

ce8b5fa

Signed-off-by: Samet Akcay <[email protected]>

Merge branch 'feature/add-custom-validation-set-support' of github.com:samet-akcay/anomalib into feature/add-custom-validation-set-support

b3fc488

Write the csv file to the notebooks directory.

c575d6a

ashwinvaidya17 reviewed Sep 5, 2024

samet-akcay added this to the v2.0.0 milestone Oct 22, 2024

samet-akcay mentioned this pull request Oct 22, 2024

🎯 [EPIC] Design and Implement the New AnomalibModule for v2.0 #2364

Open

12 tasks

samet-akcay mentioned this pull request Nov 5, 2024

🚀 Add PreProcessor to AnomalyModule #2358

Merged

9 tasks

samet-akcay added 2 commits November 27, 2024 08:57

Resolve merge conflicts

8e905e8

Signed-off-by: Samet Akcay <[email protected]>

Resolve merge conflicts

c44f377

Signed-off-by: Samet Akcay <[email protected]>

samet-akcay changed the base branch from main to feature/v2 November 27, 2024 10:09

samet-akcay added 6 commits November 27, 2024 13:30

Update the tests

ddede9e

Signed-off-by: Samet Akcay <[email protected]>

Add the new split logic to kolektor

489e96c

Signed-off-by: Samet Akcay <[email protected]>

fix pre-commit

e98bada

Signed-off-by: Samet Akcay <[email protected]>

Modify make_kolektor_dataset function

17b6923

Signed-off-by: Samet Akcay <[email protected]>

Merge branch 'feature/v2' of github.com:openvinotoolkit/anomalib into feature/add-custom-validation-set-support

5973e20

samet-akcay requested a review from djdameln November 27, 2024 16:51

Properly assign val data transforms

529e0ad

Signed-off-by: Samet Akcay <[email protected]>
