Adjustment of the validation of the number of target neighbors #353
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Before the actual optimization process, it is checked whether the parameters are valid. In the lines 175 - 177 it is checked if the chosen k is valid in the context of the training data. According to the definition of LMNN by Weinberger et al. each class must have at least k+1 elements, so that there are at least k target neighbors for each data point. In the implementation, however, it is only checked whether
self.n_neighbors<= required_k
(in fact the code checks the opposite in order to throw an error), whererequired_k
is the number of elements of the smallest class. This check indicates that the choice of k is valid for a class that has exactly k elements, which shouldn’t be the case.However, this leads to selecting a point as its own target neighbor, if this small class. For the determination of the target neighbors, a distance matrix of all points within the class is computed. To prevent that the point itself is recognized as nearest neighbor, the diagonal of this matrix is set to infinity. If a class has only k elements, all elements of the class are chosen as target neighbors, including the current point itself (even if it has a distance of infinity to itself according to the distance matrix). This results in each point of such a class effectively having one target neighbor less than classes with more training data, which can have unintended influences on the final transformation depending on the dataset used.
To prevent this, it is sufficient to adjust the validation so that
self.n_neighbors < required_k
must apply.