PermutationImportance error with XGBoost and NaNs - ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). (with a fix) #262

Open
ianozsvald opened this issue May 3, 2018 · 6 comments · May be fixed by #381

Comments

@ianozsvald

With the current versions of XGBoost and ELI5, if X contains NaN values, show_weights works fine but PermutationImportance throws an error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

To recreate:

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance

%load_ext watermark
%watermark -d -m -v -p numpy,sklearn,eli5,xgboost,pandas
#2018-05-03 
#CPython 3.6.5
#IPython 6.3.1
#numpy 1.14.2
#sklearn 0.19.1
#eli5 0.8
#xgboost 0.71
#pandas 0.22.0
#compiler   : GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
#system     : Linux
#release    : 4.9.91-040991-generic
#machine    : x86_64
#processor  : x86_64
#CPU cores  : 8
#interpreter: 64bit

# 10 rows of data: column 0 is a useless feature (with one NaN), column 1 is predictive
X_np = np.array([[np.nan, 1,], [0, 1], [0, 1], [0, 1], [0, 1], [0, 2,], [0, 2,], [0, 2,], [0, 2,], [0, 2]])
y_np = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# With only the 10 rows prepared above, XGBClassifier won't fit (but RandomForestClassifier does),
# so the score is 0. Concatenating three copies to make "more data" (30 rows in total) lets
# XGBClassifier fit with 100% accuracy.
X = np.concatenate((X_np, X_np, X_np))
y = np.concatenate((y_np, y_np, y_np))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print("X y shapes:", X_train.shape, y_train.shape, X_test.shape, y_test.shape) # (15, 2) (15,) (15, 2) (15,)

est = XGBClassifier()
est.fit(X_train, y_train)
print("Classifier score (should be 1.0):", est.score(X_test, y_test))

perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
#X y shapes: (15, 2) (15,) (15, 2) (15,)
#Classifier score (should be 1.0): 1.0

~/anaconda3/envs/debug_xgb_pandas_eli5/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45 
     46 

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

PermutationImportance.fit calls check_array with sklearn's default constraints, which disallow NaN, even though XGBoost handles NaN natively. My modification (monkey-patched here for easy testing) is to call check_array(X, force_all_finite=False):

from sklearn.base import clone  # type: ignore  # used below when cv != "prefit" and refit is set
from sklearn.metrics.scorer import check_scoring  # type: ignore
from sklearn.utils import check_array, check_random_state  # type: ignore

def fit(self, X, y, groups=None, **fit_params):
    # type: (...) -> PermutationImportance
    """Compute ``feature_importances_`` attribute and optionally
    fit the base estimator.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The training input samples.

    y : array-like, shape (n_samples,)
        The target values (integers that correspond to classes in
        classification, real numbers in regression).

    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set.

    **fit_params : Other estimator specific parameters

    Returns
    -------
    self : object
        Returns self.
    """
    self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)

    if self.cv != "prefit" and self.refit:
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y, **fit_params)

    X = check_array(X, force_all_finite=False) 
    #X = check_array(X)

    if self.cv not in (None, "prefit"):
        si = self._cv_scores_importances(X, y, groups=groups, **fit_params)
    else:
        si = self._non_cv_scores_importances(X, y)
    scores, results = si
    self.scores_ = np.array(scores)
    self.results_ = results
    self.feature_importances_ = np.mean(results, axis=0)
    self.feature_importances_std_ = np.std(results, axis=0)
    return self

PermutationImportance.fit = fit
perm = PermutationImportance(est)
perm.fit(X_test, y_test)
eli5.show_weights(perm)
# no errors, reports perm results just fine

It might be wise to try testing for the use of XGB vs sklearn and then force_all_finite could be flipped to preserve the sklearn interpretation?
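
One way to picture the "XGB vs sklearn" test suggested above is a small heuristic on the estimator's package name. This is only a sketch: the helper name `estimator_allows_nan` and the `NAN_TOLERANT_PACKAGES` set are hypothetical and not part of ELI5, and inspecting the module name is an assumption about how such a check could be done rather than how the library does it.

```python
# Hypothetical allow-list of top-level packages whose estimators accept NaN.
# Only "xgboost" is taken from this thread; extend it at your own risk.
NAN_TOLERANT_PACKAGES = {"xgboost"}


def estimator_allows_nan(estimator):
    """Guess whether an estimator handles NaN inputs natively.

    Heuristic only: it looks at the top-level package that defines the
    estimator's class, e.g. "xgboost" for a class in "xgboost.sklearn".
    """
    top_level_package = type(estimator).__module__.split(".")[0]
    return top_level_package in NAN_TOLERANT_PACKAGES


# Intended use inside the patched fit (left as a comment, since it needs sklearn):
# X = check_array(X, force_all_finite=not estimator_allows_nan(self.estimator))
```

With this, sklearn estimators would keep the strict check, while anything from an allow-listed package would skip it.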

@lstmemery

It might be wise to try testing for the use of XGB vs sklearn and then force_all_finite could be flipped to preserve the sklearn interpretation?

If you go down that route, you should also check whether the model is a pipeline that includes an imputer.
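
A rough sketch of that pipeline check, using duck typing so it does not need sklearn imported. It assumes the pipeline exposes a `steps` list of `(name, estimator)` pairs, as sklearn's `Pipeline` does, and it treats any step whose class name ends in "Imputer" as an imputer. The function name `pipeline_has_imputer` is hypothetical, and nested pipelines are not handled.

```python
def pipeline_has_imputer(estimator):
    """Return True if the estimator looks like a pipeline with an imputer step.

    Heuristic only: relies on a `steps` attribute holding (name, step) pairs
    and on the step's class name ending in "Imputer".
    """
    steps = getattr(estimator, "steps", None)
    if not steps:
        # Not a pipeline (or an empty one): no imputer to rely on.
        return False
    return any(type(step).__name__.endswith("Imputer") for _name, step in steps)
```

If this returns True, NaN values in X would be filled in before reaching the final model, so the strict finiteness check could be relaxed for that estimator as well.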

@ianozsvald
Author

Or maybe the easy first step is to pass an argument to PermutationImportance to set this flag True or False?
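
To make the idea concrete, here is a library-free stand-in showing how such a flag could be threaded from the constructor to the validation step. Everything in it is hypothetical: `PermutationImportanceSketch` is not the ELI5 class, `check_finite` is not sklearn's `check_array`, and the parameter is named `force_all_finite` only to mirror the argument discussed above.

```python
import math


def check_finite(X, force_all_finite=True):
    """Stand-in validation: reject NaN/inf values unless the flag is off."""
    if force_all_finite:
        for row in X:
            for value in row:
                if not math.isfinite(value):
                    raise ValueError("Input contains NaN or infinity.")
    return X


class PermutationImportanceSketch:
    """Hypothetical stand-in; only shows where the flag would be used."""

    def __init__(self, estimator, force_all_finite=True):
        self.estimator = estimator
        # Default True keeps the strict sklearn-style behaviour.
        self.force_all_finite = force_all_finite

    def fit(self, X, y):
        # The real fit would go on to compute importances; omitted here.
        self.X_ = check_finite(X, force_all_finite=self.force_all_finite)
        return self
```

Keeping the default at True would preserve today's behaviour, and XGBoost users could opt out with `force_all_finite=False`.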

@ianozsvald
Author

Note that with a fresh conda environment I still hit the issue above, and the workaround I posted still fixes it. These are my versions, from watermark:

2018-08-12 

CPython 3.6.6
IPython 6.5.0

numpy 1.15.0
matplotlib 2.2.2
sklearn 0.19.1
xgboost 0.72.1
seaborn 0.9.0
pandas 0.23.4
eli5 0.8

@stefansimik

I have exactly the same problem. Is there a fix or workaround?

@ihopethiswillfi

I'm looking at this too, as I'm having the same issue.

I don't understand the case against hardcoding check_array(X, force_all_finite=False) as the default.

The data passed in obviously has to look like what the model was trained on, so I don't see why we need any input validation here.

I can make a PR but I'd like to hear thoughts from a contributor on this.

@Matgrb

Matgrb commented May 22, 2020

I will pick up the issue

@Matgrb Matgrb linked a pull request May 22, 2020 that will close this issue