From fdda01492052f613215b64f83ea930266c0040b0 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Mon, 10 Jul 2023 19:07:16 +0200
Subject: [PATCH 1/8] re-word explanation of NCl in rst

---
 doc/under_sampling.rst | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index 9f2795430..af56cc90a 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -347,10 +347,30 @@ place. The class can be used as::
 Our implementation offer to set the number of seeds to put in the set :math:`C`
 originally by setting the parameter ``n_seeds_S``.
 
-:class:`NeighbourhoodCleaningRule` will focus on cleaning the data than
-condensing them :cite:`laurikkala2001improving`. Therefore, it will used the
-union of samples to be rejected between the :class:`EditedNearestNeighbours`
-and the output a 3 nearest neighbors classifier. The class can be used as::
+Neighbourhood Cleaning Rule
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The :class:`NeighbourhoodCleaningRule` is another "cleaning" algorithm. It removes
+samples from the majority class that are closest to the boundary with the minority
+:cite:`laurikkala2001improving`.
+
+The :class:`NeighbourhoodCleaningRule` expands on the cleaning performed by
+:class:`EditedNearestNeighbours` by eliminating additional majority class samples if
+they are among the 3 closest neighbours of a sample from the minority class.
+
+The procedure for the :class:`NeighbourhoodCleaningRule` is as follows:
+
+1. Remove observations from the majority class with edited nearest neighbors (ENN).
+2. Remove additional samples from the majority class if they are one of the k closest
+neighbors of a minority sample, where all or most of those neighbors are not minority.
+
+To carry out step 2 there is one condition: a sample will only be removed if its class
+has a minimum number of observations. The minimum number of observations is regulated
+by the `threshold_cleaning` parameter. In the original article
+:cite:`laurikkala2001improving`, samples would be removed if the class had at
+least half as many observations as those in the minority class.
+
+The class can be used as::
 
   >>> from imblearn.under_sampling import NeighbourhoodCleaningRule
   >>> ncr = NeighbourhoodCleaningRule(n_neighbors=11)

From ff4fe04c55d827a060ee511e8d2312717bcf1a0c Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Mon, 10 Jul 2023 19:20:58 +0200
Subject: [PATCH 2/8] re-word docstrings

---
 .../_neighbourhood_cleaning_rule.py | 23 +++++++++++--------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
index 188ba32f3..5ea40b342 100644
--- a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
+++ b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
@@ -29,7 +29,8 @@
 class NeighbourhoodCleaningRule(BaseCleaningSampler):
     """Undersample based on the neighbourhood cleaning rule.
 
-    This class uses ENN and a k-NN to remove noisy samples from the datasets.
+    This class uses ENN and a k-NN to remove noisy samples from the majority class or
+    classes.
 
     Read more in the :ref:`User Guide `.
 
@@ -46,7 +47,8 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
         If ``int``, size of the neighbourhood to consider to compute the
         K-nearest neighbors. If object, an estimator that inherits from
         :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
-        find the nearest-neighbors. By default, it will be a 3-NN.
+        find the nearest-neighbors. By default, it explores the 3 closest
+        neighbors.
 
     kind_sel : {{"all", "mode"}}, default='all'
         Strategy to use in order to exclude samples in the ENN sampling.
@@ -65,13 +67,14 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
         `"all"` strategy.
 
     threshold_cleaning : float, default=0.5
-        Threshold used to whether consider a class or not during the cleaning
-        after applying ENN. A class will be considered during cleaning when:
+        Threshold used to determine if further samples will be removed from a certain
+        majority class or not during the cleaning step that follows the ENN. Further
+        samples will be removed during the second step when:
 
         Ci > C x T ,
 
-        where Ci and C is the number of samples in the class and the data set,
-        respectively and theta is the threshold.
+        where Ci is the number of samples in the class, C is the number of samples in
+        the data set, and T is the threshold.
 
     {n_jobs}
 
@@ -79,18 +82,18 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
     ----------
     sampling_strategy_ : dict
         Dictionary containing the information to sample the dataset. The keys
-        corresponds to the class labels from which to sample and the values
+        correspond to the class labels from which to sample and the values
         are the number of samples to sample.
 
     edited_nearest_neighbours_ : estimator object
         The edited nearest neighbour object used to make the first resampling.
 
     nn_ : estimator object
-        Validated K-nearest Neighbours object created from `n_neighbors` parameter.
+        Validated K-nearest Neighbours object created from the `n_neighbors` parameter.
 
     classes_to_clean_ : list
-        The classes considered with under-sampling by `nn_` in the second cleaning
-        phase.
+        The classes that statisfy the condition for further under-sampling in the
+        second cleaning phase.
 
     sample_indices_ : ndarray of shape (n_new_samples,)
         Indices of the samples selected.
From d9d7613482222a56f5c1efe2a91b78e055e385a2 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Mon, 10 Jul 2023 19:23:55 +0200
Subject: [PATCH 3/8] cosmetic edits

---
 .../_prototype_selection/_neighbourhood_cleaning_rule.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
index 5ea40b342..9458eeace 100644
--- a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
+++ b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
@@ -68,7 +68,7 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
 
     threshold_cleaning : float, default=0.5
         Threshold used to determine if further samples will be removed from a certain
-        majority class or not during the cleaning step that follows the ENN. Further
+        majority class during the cleaning step that follows the ENN. Additional
         samples will be removed during the second step when:
 
         Ci > C x T ,

From 01af8df1ae39aea2628e627cefddc2b21825825a Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Mon, 10 Jul 2023 19:25:30 +0200
Subject: [PATCH 4/8] final touch

---
 .../_prototype_selection/_neighbourhood_cleaning_rule.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
index 9458eeace..a3cc9290e 100644
--- a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
+++ b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
@@ -92,7 +92,7 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
         Validated K-nearest Neighbours object created from the `n_neighbors` parameter.
 
     classes_to_clean_ : list
-        The classes that statisfy the condition for further under-sampling in the
+        The classes that satisfy the condition for further under-sampling during the
         second cleaning phase.
 
     sample_indices_ : ndarray of shape (n_new_samples,)

From c26157a028872db13dd1ce756ceebae49958dacf Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 13:37:16 +0200
Subject: [PATCH 5/8] reword

Co-authored-by: Guillaume Lemaitre
---
 .../_prototype_selection/_neighbourhood_cleaning_rule.py | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
index a3cc9290e..a420ed461 100644
--- a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
+++ b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
@@ -29,8 +29,7 @@
 class NeighbourhoodCleaningRule(BaseCleaningSampler):
     """Undersample based on the neighbourhood cleaning rule.
 
-    This class uses ENN and a k-NN to remove noisy samples from the majority class or
-    classes.
+    This class uses ENN and a k-NN to remove noisy samples from the majority class(es).
 
     Read more in the :ref:`User Guide `.
From 8d4508f7a65340db4b70692d013a778d4339c216 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 13:37:58 +0200
Subject: [PATCH 6/8] reword

Co-authored-by: Guillaume Lemaitre
---
 .../_prototype_selection/_neighbourhood_cleaning_rule.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
index a420ed461..e3db50bf2 100644
--- a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
+++ b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
@@ -72,7 +72,7 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
 
         Ci > C x T ,
 
-        where Ci is the number of samples in the class, C is the number of samples in
+        where Ci is the number of samples in the class to be under-sampled, C is the number of samples in
         the data set, and T is the threshold.
 
     {n_jobs}

From 9bbc4a4fe5f526924db885aaee8dadec9abab33a Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 13:48:38 +0200
Subject: [PATCH 7/8] reword as per suggestions

---
 doc/under_sampling.rst | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/doc/under_sampling.rst b/doc/under_sampling.rst
index af56cc90a..8c2f91f70 100644
--- a/doc/under_sampling.rst
+++ b/doc/under_sampling.rst
@@ -351,12 +351,11 @@ Neighbourhood Cleaning Rule
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The :class:`NeighbourhoodCleaningRule` is another "cleaning" algorithm. It removes
-samples from the majority class that are closest to the boundary with the minority
-:cite:`laurikkala2001improving`.
+samples from the majority class that are the closest to the boundary they form with
+the samples of the minority class :cite:`laurikkala2001improving`.
 
 The :class:`NeighbourhoodCleaningRule` expands on the cleaning performed by
-:class:`EditedNearestNeighbours` by eliminating additional majority class samples if
-they are among the 3 closest neighbours of a sample from the minority class.
+:class:`EditedNearestNeighbours` by eliminating additional majority class samples.
 
 The procedure for the :class:`NeighbourhoodCleaningRule` is as follows:
 
@@ -366,9 +365,9 @@ neighbors of a minority sample, where all or most of those neighbors are not min
 
 To carry out step 2 there is one condition: a sample will only be removed if its class
 has a minimum number of observations. The minimum number of observations is regulated
-by the `threshold_cleaning` parameter. In the original article
-:cite:`laurikkala2001improving`, samples would be removed if the class had at
-least half as many observations as those in the minority class.
+by the `threshold_cleaning` parameter. A sample will only be removed from the target
+class if it has at least as many observations as threshold times the number of samples
+in the minority class.
 
 The class can be used as::
 
From 72b94d03d87434e4dfa41cfae0e1d73537205a11 Mon Sep 17 00:00:00 2001
From: Soledad Galli
Date: Tue, 11 Jul 2023 15:39:05 +0200
Subject: [PATCH 8/8] fix linting

Co-authored-by: Guillaume Lemaitre
---
 .../_prototype_selection/_neighbourhood_cleaning_rule.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
index e3db50bf2..ee13b4c17 100644
--- a/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
+++ b/imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py
@@ -72,8 +72,8 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
 
         Ci > C x T ,
 
-        where Ci is the number of samples in the class to be under-sampled, C is the number of samples in
-        the data set, and T is the threshold.
+        where Ci is the number of samples in the class to be under-sampled, C
+        is the number of samples in the data set, and T is the threshold.
 
     {n_jobs}
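
Note on `threshold_cleaning`: the docstring in this series states the second-step condition as Ci > C x T. The sketch below illustrates that check only; it is not the library's implementation. The helper name `classes_to_clean` and its `reference_count` argument are hypothetical. The docstring defines C as the number of samples in the data set, while the user guide text in patch 7/8 compares against the size of the minority class, so the sketch defaults to the docstring's reading and lets the caller pass a different reference count.

```python
from collections import Counter


def classes_to_clean(y, threshold_cleaning=0.5, reference_count=None):
    """Return the non-minority classes whose size passes Ci > C x T.

    Hypothetical helper for illustration; not part of imblearn.
    """
    counts = Counter(y)
    # The smallest class is treated as the minority and is never cleaned.
    minority = min(counts, key=counts.get)
    # Assumption: C defaults to the data set size, as the docstring words it.
    c = len(y) if reference_count is None else reference_count
    return sorted(
        label
        for label, n_samples in counts.items()
        if label != minority and n_samples > c * threshold_cleaning
    )


y = [0] * 10 + [1] * 30 + [2] * 60
print(classes_to_clean(y))                      # C = 100, so only class 2 passes
print(classes_to_clean(y, reference_count=10))  # C = 10, classes 1 and 2 pass
```

With the default threshold of 0.5 and C taken as the data set size, a class must hold more than half of all samples to be cleaned in step 2, so at most one class can qualify. Taking C as the minority class size makes the condition much easier to meet.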