DOC improve documentation of NCR #1017

Open
wants to merge 8 commits into base: master
27 changes: 23 additions & 4 deletions doc/under_sampling.rst
@@ -347,10 +347,29 @@ place. The class can be used as::
Our implementation offers the option to set the number of seeds initially put in
the set :math:`C` through the parameter ``n_seeds_S``.

:class:`NeighbourhoodCleaningRule` will focus on cleaning the data than
condensing them :cite:`laurikkala2001improving`. Therefore, it will used the
union of samples to be rejected between the :class:`EditedNearestNeighbours`
and the output a 3 nearest neighbors classifier. The class can be used as::
Neighbourhood Cleaning Rule
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`NeighbourhoodCleaningRule` is another "cleaning" algorithm. It removes
samples from the majority class that are the closest to the boundary they form with
the samples of the minority class :cite:`laurikkala2001improving`.

The :class:`NeighbourhoodCleaningRule` expands on the cleaning performed by
:class:`EditedNearestNeighbours` by eliminating additional majority class samples.

The procedure for the :class:`NeighbourhoodCleaningRule` is as follows:

1. Remove observations from the majority class with edited nearest neighbors (ENN).
2. Remove additional samples from the majority class if they are one of the k closest
Member

Since we are repeating the same sentence as above, I would remove the paragraph above and keep only the bullet point sequence.

neighbors of a minority sample, where all or most of those neighbors are not minority.
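
The two steps listed above can be sketched on a toy 1-D data set. This is a minimal illustration in plain Python, not the library's implementation: the helper names `k_nearest` and `toy_ncr` are hypothetical, step 1 uses a simple majority vote among the neighbors as a stand-in for ENN, "most" is taken to mean more than half of the k neighbors, and no class-size condition is applied.

```python
from collections import Counter


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: abs(points[j] - points[i]))
    return others[:k]


def toy_ncr(points, labels, k=3):
    """Return the indices kept after the two cleaning steps (toy version)."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    n = len(points)

    # Step 1 (ENN-like, assumed majority vote): drop a majority sample
    # when at most half of its k neighbours share its label.
    step1 = set()
    for i in range(n):
        if labels[i] == minority:
            continue
        neigh = k_nearest(points, i, k)
        same = sum(labels[j] == labels[i] for j in neigh)
        if same <= k // 2:
            step1.add(i)

    # Step 2: for each minority sample whose neighbours are mostly
    # non-minority, drop the non-minority samples among those neighbours.
    step2 = set()
    for i in range(n):
        if labels[i] != minority:
            continue
        neigh = k_nearest(points, i, k)
        non_minority = [j for j in neigh if labels[j] != minority]
        if len(non_minority) > k // 2:
            step2.update(non_minority)

    # Both steps are evaluated on the original data; their removals are unioned.
    removed = step1 | step2
    return [i for i in range(n) if i not in removed]


# Toy data: an isolated "b" at 3.0 and a stray "a" at 11.5 inside the "b" cluster.
points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 11.0, 12.0, 11.5]
labels = ["a", "a", "a", "b", "a", "a", "b", "b", "b", "a"]
kept = toy_ncr(points, labels, k=3)
```

In this toy run, step 1 discards the stray majority point inside the minority cluster, while step 2 discards the majority points that surround the isolated minority point.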

Step 2 is subject to one condition: samples are removed from a class only if that
class has a minimum number of observations. This minimum is controlled by the
``threshold_cleaning`` parameter: samples are removed from the target class only if
it has at least as many observations as ``threshold_cleaning`` times the number of
samples in the minority class.
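
The class-size condition, as worded in the paragraph above, can be sketched as follows. This is an illustrative helper under stated assumptions, not the library's code: the name `eligible_classes` is hypothetical, and the comparison (at least threshold times the minority class size) simply follows the paragraph's wording.

```python
from collections import Counter


def eligible_classes(y, threshold):
    """Non-minority classes with at least threshold * (minority size) samples."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    cutoff = threshold * counts[minority]
    return sorted(
        label
        for label, n_samples in counts.items()
        if label != minority and n_samples >= cutoff
    )


# Toy labels: 10 samples of "a", 3 of "b" (the minority), 4 of "c".
y = ["a"] * 10 + ["b"] * 3 + ["c"] * 4

lenient = eligible_classes(y, threshold=0.5)  # cutoff = 1.5 samples
strict = eligible_classes(y, threshold=1.5)   # cutoff = 4.5 samples, value chosen only to illustrate
```

With the lenient threshold both non-minority classes qualify for the second cleaning step; with the stricter one only the largest class does.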

The class can be used as::

>>> from imblearn.under_sampling import NeighbourhoodCleaningRule
>>> ncr = NeighbourhoodCleaningRule(n_neighbors=11)
@@ -29,7 +29,7 @@
class NeighbourhoodCleaningRule(BaseCleaningSampler):
"""Undersample based on the neighbourhood cleaning rule.

This class uses ENN and a k-NN to remove noisy samples from the datasets.
This class uses ENN and a k-NN to remove noisy samples from the majority class(es).

Read more in the :ref:`User Guide <condensed_nearest_neighbors>`.

@@ -46,7 +46,8 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
If ``int``, size of the neighbourhood to consider to compute the
K-nearest neighbors. If object, an estimator that inherits from
:class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to
find the nearest-neighbors. By default, it will be a 3-NN.
find the nearest-neighbors. By default, it explores the 3 closest
neighbors.

kind_sel : {{"all", "mode"}}, default='all'
Strategy to use in order to exclude samples in the ENN sampling.
@@ -65,32 +66,33 @@ class NeighbourhoodCleaningRule(BaseCleaningSampler):
`"all"` strategy.

threshold_cleaning : float, default=0.5
Threshold used to whether consider a class or not during the cleaning
after applying ENN. A class will be considered during cleaning when:
Threshold used to determine if further samples will be removed from a certain
majority class during the cleaning step that follows the ENN. Additional
samples will be removed during the second step when:

Ci > C x T ,

where Ci and C is the number of samples in the class and the data set,
respectively and theta is the threshold.
where Ci is the number of samples in the class to be under-sampled, C
is the number of samples in the data set, and T is the threshold.

{n_jobs}

Attributes
----------
sampling_strategy_ : dict
Dictionary containing the information to sample the dataset. The keys
corresponds to the class labels from which to sample and the values
correspond to the class labels from which to sample and the values
are the number of samples to sample.

edited_nearest_neighbours_ : estimator object
The edited nearest neighbour object used to make the first resampling.

nn_ : estimator object
Validated K-nearest Neighbours object created from `n_neighbors` parameter.
Validated K-nearest Neighbours object created from the `n_neighbors` parameter.

classes_to_clean_ : list
The classes considered with under-sampling by `nn_` in the second cleaning
phase.
The classes that satisfy the condition for further under-sampling during the
second cleaning phase.

sample_indices_ : ndarray of shape (n_new_samples,)
Indices of the samples selected.