score_samples returns square matrix when label is one-dimensional #182

DRudel opened this issue Jun 11, 2018 · 8 comments
DRudel commented Jun 11, 2018

When an Earth model is fit with labels that have a single dimension (and are of shape (n, 1)), score_samples does not behave as expected.

The problem is that in the code below, y_hat is returned as a vector of shape (n,):

        X, y, sample_weight, output_weight, missing = self._scrub(
            X, y, None, None, missing)
        y_hat = self.predict(X, missing=missing)
        residual = 1 - (y - y_hat) ** 2 / y**2
        return residual

Since y_hat has shape (n,) and y has shape (n,1), broadcasting rules will force the errors (y - y_hat) to be of shape (n,n).
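
To make the shape blow-up concrete, here is a minimal sketch (the values are arbitrary; only the shapes matter):

    import numpy as np

    n = 3
    y = np.arange(1, n + 1, dtype=float).reshape(-1, 1)  # shape (n, 1), as _scrub returns it
    y_hat = np.arange(1, n + 1, dtype=float)             # shape (n,), as predict returns it

    errors = y - y_hat
    print(errors.shape)  # (3, 3) -- broadcasting produced a square matrix, not (n, 1)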

DRudel commented Jun 11, 2018

I have mitigated this with the following patch:

        X, y, sample_weight, output_weight, missing = self._scrub(
            X, y, None, None, missing)
        if y.shape[1] == 1:
            y = y.ravel()  # collapse (n, 1) labels to (n,) so they broadcast against y_hat
        y_hat = self.predict(X, missing=missing)
        residual = 1 - (y - y_hat) ** 2 / y**2
        return residual

DRudel commented Jun 11, 2018

Looks like this is more compatible with the rest of the code:

        X, y, sample_weight, output_weight, missing = self._scrub(
            X, y, None, None, missing)
        y_hat = self.predict(X, missing=missing)
        if y_hat.ndim == 1:
            y_hat = y_hat.reshape(-1, 1)  # match the (n, 1) shape that _scrub gives y
        residual = 1 - (y - y_hat) ** 2 / y**2
        return residual
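
A quick check of the shapes, just to illustrate what the reshape buys (toy values):

    import numpy as np

    y = np.array([[1.0], [4.0]])   # shape (2, 1), as returned by _scrub
    y_hat = np.array([1.0, 4.0])   # shape (2,), as returned by predict

    print((y - y_hat).shape)                 # (2, 2) -- the current behavior
    print((y - y_hat.reshape(-1, 1)).shape)  # (2, 1) -- with the patch applied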

jcrudy commented Jun 11, 2018

@DRudel That's an annoying behavior. I tried my best to model the behavior of py-earth after that of scikit-learn, but I'm not sure if I succeeded in this case. I'll check what scikit-learn does and if it doesn't match this behavior I'll make a change for the next release. In the meantime you can of course reshape the output in your own code. I'll comment here when I've decided how to proceed.

DRudel commented Jun 12, 2018

@jcrudy Thanks for the response. A couple of notes:

Reshaping the output really is not a tenable solution because the mathematical operations end up providing a result that is mostly nonsense due to broadcasting.

By way of example, suppose that y and y_hat are equal except for shape:

    import numpy as np

    my_1d_array = np.array([1, 4], 'float')
    my_2d_array = np.array([[1], [4]], 'float')

Then calculate the error term:

    errors = my_2d_array - my_1d_array

This turns out to be [[0, -3], [3, 0]], and due to further broadcasting in the element-wise division, we have

    print(1 - (errors ** 2) / (my_2d_array ** 2))

which prints [[1, -8], [0.4375, 1]]. That cannot be reshaped to recover the expected per-sample scores of [1, 1] (a perfect score for both samples, since y and y_hat are equal).

Also, if y is a long vector you run into memory and computation problems, since broadcasting materializes an n x n matrix in the subtraction and division.

This brings me to my second point: I don't think you should be using y**2 for the denominator for this score_samples function. The appropriate denominator for a term-by-term scoring of error terms would be:

    1 - (y - y_hat)**2 / np.var(y)

The current formula can greatly over-estimate the poorness of fit for values where the correct value is near 0. For example, if you have standardized data (unit variance) and the predicted value for y at a particular point is 0.11 but the actual value is 0.01, the current formula assigns a score of -99 to a point that is only off by 0.1 of a standard deviation.

To confirm the fidelity of the formula I suggest above, consider:

    import numpy as np
    from sklearn.metrics import r2_score

    vector_1 = np.array([[2], [3], [1], [0]], 'float')
    vector_2 = np.array([[2], [1], [3], [2]], 'float')

    print(1 - (vector_1 - vector_2)**2 / np.var(vector_1))
    print(r2_score(vector_1, vector_2))

Using the suggested formula, the returned values for score_samples would be [1, -2.2, -2.2, -2.2], for an average value of -1.4, which is the value r2_score returns for the whole data set.
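
Pulling that together, here is a standalone sketch of the per-sample scoring I am proposing (this is not py-earth's actual method; the function name is just illustrative, and it only handles a single output column):

    import numpy as np

    def per_sample_r2(y_true, y_pred):
        # Per-sample decomposition of r2_score: 1 - squared error / variance of y_true.
        y_true = np.asarray(y_true, dtype=float).reshape(-1, 1)
        y_pred = np.asarray(y_pred, dtype=float).reshape(-1, 1)
        return 1 - (y_true - y_pred) ** 2 / np.var(y_true)

    vector_1 = np.array([[2], [3], [1], [0]], 'float')
    vector_2 = np.array([[2], [1], [3], [2]], 'float')
    print(per_sample_r2(vector_1, vector_2))         # [[1.], [-2.2], [-2.2], [-2.2]]
    print(per_sample_r2(vector_1, vector_2).mean())  # -1.4, the same value r2_score gives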

jcrudy commented Jun 12, 2018

@DRudel I actually didn't realize score_samples was py-earth code! I assumed it was yours or from some library. Sorry. You are quite correct that reshaping is no use. A workaround for the moment is to write your own score_samples, which it seems you're quite capable of doing. You can also monkey-patch the Earth class if you need your corrected score_samples to replace the current method. Another option, I suppose, is to reshape your input before calling score_samples. It will obviously depend on your specific situation which of these options makes the most sense.

Regarding your mathematical argument, it seems reasonable and I suspect you are correct. I'm actually not familiar with the purpose of the score_samples method, though, so I'm gonna have to look into it a little. I would welcome any additional exposition or links on the subject if you can conveniently provide them.

In conclusion, the data reshaping issue, I'm now convinced, is a definite bug. The mathematical issue I regard as very likely a bug. Thanks for pointing these things out! Marking this issue as a bug. I will update here as more information is available.

jcrudy added the bug label on Jun 12, 2018
DRudel commented Jun 12, 2018

@jcrudy, regrettably, in its current state reshaping the input also doesn't help. Regardless of the shape of y sent to score_samples, the scrub function will return a 2-d array, so the y from which y_hat is subtracted will be 2-d no matter what shape y had.

To be honest, I don't know that py-earth needs a score_samples() method to be compatible with sklearn. scikit-learn generally uses score_samples() not (as you might think) to find the errors for individual data points, but rather to compute things like densities or probabilities in unsupervised learners.
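
For instance, KernelDensity.score_samples returns a per-sample log-density rather than any kind of per-sample error (a small illustration; the data here are made up):

    import numpy as np
    from sklearn.neighbors import KernelDensity

    X = np.random.RandomState(0).normal(size=(100, 1))
    kde = KernelDensity(bandwidth=0.5).fit(X)
    print(kde.score_samples(X[:5]))  # one log-density value per sample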

With regard to the suggested formula, it basically just pulls apart the formula for r2_score, which I had the impression py-earth was trying to do on a term-by-term basis.

Sorry for the rambling nature of the rest of this message. I did not have time to write a shorter version.

At its core, r2_score is one minus the ratio of the sum of squared errors to the sum of squared differences from the mean:

    numerator = (weight * (y_true - y_pred) ** 2).sum(axis=0,
                                                      dtype=np.float64)
    denominator = (weight * (y_true - np.average(
        y_true, axis=0, weights=sample_weight)) ** 2).sum(axis=0,
                                                          dtype=np.float64)

The above code is from sklearn's implementation of r2_score.

Note that the numerator is essentially the sum of (y - y_hat)**2 and the denominator is just the sum of (y - y_bar)**2, where y_bar is the average value of y. To get the r2_score you take 1 - numerator / denominator (except you may do something slightly different if you are combining several coordinates).

Since it is common to center your data (so y_bar = 0), I got the impression that this was what py-earth was attempting to do on an element-by-element basis, as in that case the denominator becomes the sum of y**2.

The problem arises when you move from the group values (sum of stuff divided by sum of other stuff) to element-by-element values. It doesn't work to just strip the "sum of" from both the top and the bottom for the same reason that "sum of (a/b)" is not equal to (sum of a) / (sum of b) for sequences a and b.

What you want is to have the average value of the individual scores match the overall score you get from r2_score function, or at least that seemed like the natural thing to do. To make this tenable, when you pull apart the group aggregation into individual elements you need the denominator to be constant across all those individual expressions.

To do that you can move from "sum of ..." to "average of ..." (which doesn't change the overall ratio), and then you have [average of (y - y_hat)**2] / [average of (y - y_bar)**2]. But the value of "average of (y - y_bar)**2" is exactly the variance of the whole set Y, so you have [average of (y - y_hat)**2] / (variance of Y).

And now you have an expression with a constant denominator. The "variance of Y" (as a group of data points) is the same no matter which individual y you are looking at. So it is easy to pull the expression apart and determine how much each individual term contributes to the average. The "score" for each individual term is then 1 - (error ** 2) / (variance of Y), and the average of these scores is r2_score, as desired.
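
A quick numerical check of that equivalence on arbitrary data (just to illustrate the algebra; nothing py-earth-specific here):

    import numpy as np
    from sklearn.metrics import r2_score

    rng = np.random.RandomState(0)
    y_true = rng.normal(size=100)
    y_pred = y_true + rng.normal(scale=0.5, size=100)

    per_sample = 1 - (y_true - y_pred) ** 2 / np.var(y_true)
    print(per_sample.mean())         # equals ...
    print(r2_score(y_true, y_pred))  # ... the overall r2_score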

jcrudy commented Jun 12, 2018

@DRudel Thanks for the explanation. That makes sense to me. I guess score_samples probably hasn't ever worked, for years, and somehow you are the first person to notice! It certainly seems like it might not be a very important method, but the fact that you're using it indicates it must be useful for something. Do you have a particular use case?

In any case, I think at this point it makes more sense to fix it than to remove it. I would welcome a pull request for it if you feel like making one. Otherwise, it's on my to-do list and I'll probably get to it in the next few weeks.

DRudel commented Jun 13, 2018

@jcrudy I have put a pull request in.

You were interested in use cases for score_samples. I'm happy to provide two. (I think it is a pity that scikit-learn does not support the method more broadly.)

One use case is Reinforcement Learning. Frequently in that domain one has a series of models {M_0, M_1, M_2, ...} (or a sequence of snapshots of the same model evolving over time using online learning).

One common tack is to generate M_{n+1} by prioritized training, wherein data points that had the greatest error when evaluated by model M_n are given higher priority. The hope is that there is more to learn from these points. In the case of a MARS-type algorithm, these points may be particularly useful in uncovering knots that might otherwise have been dismissed as statistical noise. To accomplish this, it is useful to know which points had the greatest standardized error, which is essentially what score_samples does.

The second use case once again involves uncovering signal that might otherwise look like noise. Standard methods of measuring error for multidimensional output combine the error in all dimensions into a single value. Unfortunately, because of how distances behave in many dimensions, this can make rectifiable error in a regression look like simple noise. Here is an example.

Imagine a case where you have 10 output dimensions with a standard deviation of 4 in each dimension. This means the typical point is about the square root of 160, roughly 12.7 units, away from the midpoint of all points. Now imagine that you have a model that tends to be off by 2 units on average in each coordinate. The typical prediction is then about the square root of 40, roughly 6.3 units, away from the target.

Now, imagine that there was a prospective change to the model that cut the typical error by 20% along one coordinate but did nothing to the other coordinates. Even though this change reduces the error by 20% in that one coordinate, to a multidimensional evaluator it can look like noise: instead of being off by the square root of 40 on average, the prediction is off by roughly the square root of 38.6. The model may well decide the extra term is not worth adding because it does not decrease the multidimensional error sufficiently.
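
Spelling out that arithmetic (using the same hypothetical numbers):

    import numpy as np

    per_coord_error = np.full(10, 2.0)            # typical error of 2 in each of 10 coordinates
    print(np.sqrt(np.sum(per_coord_error ** 2)))  # ~6.32: typical distance from the target

    improved = per_coord_error.copy()
    improved[0] *= 0.8                            # cut the error 20% in one coordinate only
    print(np.sqrt(np.sum(improved ** 2)))         # ~6.21: the combined error barely moves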

Long story short, it can be valuable to know the error in each coordinate separately, and typical r2_score does not give you access to that. But if you know the error in each coordinate for each sample, you can compare two models and see whether substantial improvement is being made in a particular coordinate even if it does not change the r2_score much.
