PermutationImportance error with XGBoost and NaNs - `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` (with a fix) #262

ianozsvald · 2018-05-03T21:04:22Z

Using the current version of XGBoost and ELI5 if I add NaN values to X, whilst show_weights works fine PermutationImportance throws an error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

To recreate:

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance

%load_ext watermark
%watermark -d -m -v -p numpy,sklearn,eli5,xgboost,pandas
#2018-05-03 
#CPython 3.6.5
#IPython 6.3.1
#numpy 1.14.2
#sklearn 0.19.1
#eli5 0.8
#xgboost 0.71
#pandas 0.22.0
#compiler   : GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
#system     : Linux
#release    : 4.9.91-040991-generic
#machine    : x86_64
#processor  : x86_64
#CPU cores  : 8
#interpreter: 64bit

# 8 items of data, pairs of useless feature and predictive feature
X_np = np.array([[np.nan, 1,], [0, 1], [0, 1], [0, 1], [0, 1], [0, 2,], [0, 2,], [0, 2,], [0, 2,], [0, 2]])
y_np = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# if we have 10 items (prepared above) XGBClassifier won't fit (but RandomForestClassifer does)
# so the score is 0. If we concatenate to make "more data" (30 items in total) then XGBClassifier
# fits with 100% 
X = np.concatenate((X_np, X_np, X_np))
y = np.concatenate((y_np, y_np, y_np))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print("X y shapes:", X_train.shape, y_train.shape, X_test.shape, y_test.shape) # (15, 2) (15,) (15, 2) (15,)

est = XGBClassifier()
est.fit(X_train, y_train)
print("Classifier score (should be 1.0):", est.score(X_test, y_test))

perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
#X y shapes: (15, 2) (15,) (15, 2) (15,)
#Classifier score (should be 1.0): 1.0

~/anaconda3/envs/debug_xgb_pandas_eli5/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45 
     46 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The call to check_array is using sklearn's constraints and disallows NaN. XGBoost is ok with NaN. My modification (monkey patched here for easy testing) is to call check_array(X, force_all_finite=False):

from sklearn.metrics.scorer import check_scoring  # type: ignore
from sklearn.utils import check_array, check_random_state  # type: ignore

def fit(self, X, y, groups=None, **fit_params):
    # type: (...) -> PermutationImportance
    """Compute ``feature_importances_`` attribute and optionally
    fit the base estimator.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The training input samples.

    y : array-like, shape (n_samples,)
        The target values (integers that correspond to classes in
        classification, real numbers in regression).

    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set.

    **fit_params : Other estimator specific parameters

    Returns
    -------
    self : object
        Returns self.
    """
    self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)

    if self.cv != "prefit" and self.refit:
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y, **fit_params)

    X = check_array(X, force_all_finite=False) 
    #X = check_array(X)

    if self.cv not in (None, "prefit"):
        si = self._cv_scores_importances(X, y, groups=groups, **fit_params)
    else:
        si = self._non_cv_scores_importances(X, y)
    scores, results = si
    self.scores_ = np.array(scores)
    self.results_ = results
    self.feature_importances_ = np.mean(results, axis=0)
    self.feature_importances_std_ = np.std(results, axis=0)
    return self

PermutationImportance.fit = fit
perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
# no errors, reports perm results just fine

It might be wise to try testing for the use of XGB vs sklearn and then force_all_finite could be flipped to preserve the sklearn interpretation?

The text was updated successfully, but these errors were encountered:

lstmemery · 2018-05-04T17:10:13Z

It might be wise to try testing for the use of XGB vs sklearn and then force_all_finite could be flipped to preserve the sklearn interpretation?

If you go down that route, you should also check to see if the model is also a pipeline with an imputer.

ianozsvald · 2018-05-04T17:46:30Z

Or maybe the easy first step is to pass an argument to PermutationImportance to set this flag True or False?

ianozsvald · 2018-08-12T17:33:25Z

I'll note that with a fresh install of a conda environment, I still get the above issue and using the work-around I posted, it works ok. These are my versions using watermark:

2018-08-12 

CPython 3.6.6
IPython 6.5.0

numpy 1.15.0
matplotlib 2.2.2
sklearn 0.19.1
xgboost 0.72.1
seaborn 0.9.0
pandas 0.23.4
eli5 0.8

stefansimik · 2018-12-08T17:48:05Z

I have completely the same problem, is there any fix or solution?

ihopethiswillfi · 2018-12-14T17:42:54Z

I'm looking at this as well as I'm having the same issue.

I don't understand the case against having check_array(X, force_all_finite=False) as default and hardcoded?

It's pretty obvious that the provided data to the model has to be similar as what the model was trained on. So I don't see why we need to do any input validation here.

I can make a PR but I'd like to hear thoughts from a contributor on this.

Matgrb · 2020-05-22T08:41:53Z

I will pick up the issue

lopuhin added the help wanted label Mar 4, 2019

Matgrb mentioned this issue May 22, 2020

PermutationImportance does not allow NaNs #380

Closed

Matgrb linked a pull request May 22, 2020 that will close this issue

Allow for Nans in input of Permutation Importance #381

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PermutationImportance error with XGBoost and NaNs - `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` (with a fix) #262

PermutationImportance error with XGBoost and NaNs - `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` (with a fix) #262

ianozsvald commented May 3, 2018

lstmemery commented May 4, 2018

ianozsvald commented May 4, 2018

ianozsvald commented Aug 12, 2018

stefansimik commented Dec 8, 2018

ihopethiswillfi commented Dec 14, 2018

Matgrb commented May 22, 2020 •

edited

Loading

PermutationImportance error with XGBoost and NaNs - ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). (with a fix) #262

PermutationImportance error with XGBoost and NaNs - ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). (with a fix) #262

Comments

ianozsvald commented May 3, 2018

lstmemery commented May 4, 2018

ianozsvald commented May 4, 2018

ianozsvald commented Aug 12, 2018

stefansimik commented Dec 8, 2018

ihopethiswillfi commented Dec 14, 2018

Matgrb commented May 22, 2020 • edited Loading

PermutationImportance error with XGBoost and NaNs - `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` (with a fix) #262

PermutationImportance error with XGBoost and NaNs - `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` (with a fix) #262

Matgrb commented May 22, 2020 •

edited

Loading