score_samples returns square matrix when label is one-dimensional #182
When an Earth model is fit with labels that represent a single output but are passed with shape (n, 1), score_samples does not behave as expected. The problem is that in the code below, y_hat is returned as a vector of shape (n,). Since y_hat has shape (n,) and y has shape (n, 1), broadcasting rules force the errors (y - y_hat) to be of shape (n, n).
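(The snippet originally quoted here did not survive extraction; below is a minimal, runnable sketch of the failing pattern — the names and values are illustrative, not py-earth's exact source:)

```python
import numpy as np

y = np.ones((5, 1))        # labels passed with shape (n, 1)
y_hat = np.ones(5)         # predict() returns shape (n,)

errors = y - y_hat         # broadcasting: (5, 1) - (5,) -> (5, 5)
print(errors.shape)        # (5, 5) -- an n x n matrix instead of n residuals
```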
I have mitigated this by the following patch:
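(The patch itself was lost in extraction; what follows is a minimal sketch of such a mitigation, assuming it flattens y to match the shape of the predictions — illustrative, not the actual patch:)

```python
import numpy as np

def mitigated_score_samples(model, X, y):
    """Hypothetical wrapper: flatten y before the arithmetic so that
    no (n, n) broadcast can occur."""
    y = np.asarray(y).reshape(-1)         # (n, 1) -> (n,)
    y_hat = model.predict(X)              # (n,)
    return 1 - (y - y_hat) ** 2 / y ** 2  # element-wise, shape (n,)
```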
Looks like this is more compatible with the rest of the code:
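(Also reconstructed; a sketch that instead reshapes the predictions to match y, which may sit more naturally with code that keeps y two-dimensional — the internals shown are assumptions:)

```python
import numpy as np

def score_samples_patched(model, X, y):
    """Hypothetical fix: give y_hat the same shape as y, rather than
    flattening y."""
    y = np.asarray(y)
    y_hat = model.predict(X).reshape(y.shape)  # (n,) -> (n, 1) when needed
    return 1 - (y - y_hat) ** 2 / y ** 2       # element-wise, same shape as y
```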
@DRudel That's an annoying behavior. I tried my best to model the behavior of py-earth after that of scikit-learn, but I'm not sure if I succeeded in this case. I'll check what scikit-learn does, and if it doesn't match this behavior, I'll make a change for the next release. In the meantime you can of course reshape the output in your own code. I'll comment here when I've decided how to proceed.
@jcrudy Thanks for the response. A couple of notes.

First, reshaping the output is not really a tenable solution, because the mathematical operations end up producing a result that is mostly nonsense due to broadcasting. By way of example, suppose that y and y_hat are equal except for shape:
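(The original arrays were lost; these values are assumed so as to reproduce the numbers quoted below:)

```python
import numpy as np

y = np.array([[1], [4]])    # shape (2, 1)
y_hat = np.array([1, 4])    # shape (2,) -- same values, different shape
```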
Then calculate the error term:
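(Continuing the reconstructed example:)

```python
errors = y - y_hat          # (2, 1) - (2,) broadcasts to (2, 2)
print(errors)               # [[ 0 -3]
                            #  [ 3  0]]
```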
This turns out to be [[0, -3], [3, 0]], and due to further broadcasting in the element-wise division, we have:
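(Continuing the sketch; the exact expression used in the lost snippet is unknown, so the sign convention here is chosen to reproduce the values quoted next:)

```python
scores = errors ** 2 / y ** 2 - 1   # y**2 has shape (2, 1), broadcasting again
print(scores)                       # [[-1.      8.    ]
                                    #  [-0.4375 -1.    ]]
```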
With the result of [[-1, 8], [-0.4375, -1]], which cannot be reshaped to get the expected result of [0, 0]. Also, if y is a long vector, you run into memory and computation problems in generating the n x n matrix produced by the division in the denominator.

This brings me to my second point: I don't think you should be using y**2 as the denominator in this score_samples function. The appropriate denominator for a term-by-term scoring of error terms would be the variance of y:
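(Rendered as one line; np.var is the population variance, i.e. the mean squared deviation from the mean:)

```python
denominator = np.var(y)    # mean((y - y.mean()) ** 2), constant across samples
```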
The current formula can greatly over-estimate the poorness of fit for values where the correct value is near 0. For example, if you have standardized data (unit variance) and the predicted value for y at a particular point is 0.11 but the actual value is 0.01, the current formula assigns a score of -99 to a point that is only off by 0.1 of a standard deviation. To confirm the fidelity of the formula I suggest above, consider:
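(The original example was lost; here is one dataset, constructed for illustration, that reproduces the values quoted below:)

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([0.0, 2.0, 4.0, 6.0])       # np.var(y) == 5.0
y_hat = np.array([0.0, -2.0, 0.0, 2.0])  # errors of 0, 4, 4, 4

scores = 1 - (y - y_hat) ** 2 / np.var(y)
print(scores)              # [ 1.  -2.2 -2.2 -2.2]
print(scores.mean())       # -1.4
print(r2_score(y, y_hat))  # -1.4
```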
Using the suggested formula, the returned values for score_samples would be [1, -2.2, -2.2, -2.2], for an average value of -1.4, which is the value returned by r2_score for the whole data set.
@DRudel I actually didn't realize score_samples behaved this way. Regarding your mathematical argument, it seems reasonable and I suspect you are correct. I'm actually not familiar with the purpose of the score_samples method in scikit-learn.

In conclusion, the data reshaping issue, I'm now convinced, is a definite bug. The mathematical issue I regard as very likely a bug. Thanks for pointing these things out! Marking this issue as a bug. I will update here as more information is available.
@jcrudy, regrettably, in its current state reshaping the input also doesn't help. Regardless of the shape of y sent to score_samples, the scrub function will return a 2-D array, so the actual y value subtracted from y_hat will be 2-D no matter what the shape of y. To be honest, I don't know that py-earth needs a score_samples() method to be compatible with sklearn. Scikit-learn generally uses score_samples() not (as you might think) to find the errors for individual data points, but rather to compute things like density or probabilities in unsupervised learners.

With regard to the suggested formula, it basically just pulls apart the formula for r2_score, which I had the impression py-earth was trying to do on a term-by-term basis. Sorry for the rambling nature of the rest of this message; I did not have time to write a shorter version.

At its core, r2_score is built from the average squared error divided by the average squared difference from the midpoint:
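(The quoted excerpt was lost; below are the relevant lines from scikit-learn's r2_score, lightly condensed, and reconstructed from memory — exact names may vary between versions:)

```python
numerator = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
denominator = (weight * (y_true - np.average(
    y_true, axis=0, weights=sample_weight)) ** 2).sum(axis=0, dtype=np.float64)
```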
The above code is from sklearn's implementation. Note that the numerator is essentially the sum of (y - y_hat)**2 and the denominator is just the sum of (y - y_bar)**2, where y_bar is the average value of y. To get the r2_score you take 1 - numerator / denominator (except you may do something slightly different if you are combining several coordinates). Since it is common to center your data (so y_bar = 0), I got the impression that this was what py-earth was attempting to do on an element-by-element basis, since in that case the denominator becomes the sum of y**2.

The problem arises when you move from the group values (sum of stuff divided by sum of other stuff) to element-by-element values. It doesn't work to just strip the "sum of" from both the top and the bottom, for the same reason that "sum of (a/b)" is not equal to (sum of a) / (sum of b) for sequences a and b.

What you want is for the average value of the individual scores to match the overall score you get from the r2_score function, or at least that seemed like the natural thing to do. To make this tenable, when you pull apart the group aggregation into individual elements, you need the denominator to be constant across all those individual expressions. To do that, you can move from "sum of ..." to "average of ..." (which doesn't change the overall ratio), and then you have [average of (y - y_hat)**2] / [average of (y - y_bar)**2]. Now you have an expression with a constant denominator: the variance of y (as a group of data points) is the same no matter which individual y you are looking at. So it is easy to pull the expression apart and determine how much each individual term contributes to the average. The "score" for each individual term is then 1 - (error**2) / (variance of y), and the average is r2_score, as desired.
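(A minimal sketch of the per-sample score this argument leads to; a hypothetical helper, not py-earth's current code:)

```python
import numpy as np

def per_sample_r2(y, y_hat):
    """Per-sample scores whose mean recovers r2_score: the denominator,
    np.var(y), is constant across samples."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return 1 - (y - y_hat) ** 2 / np.var(y)
```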
@DRudel Thanks for the explanation. That makes sense to me. I guess the current formula doesn't achieve that. In any case, I think at this point it makes more sense to fix score_samples than to remove it. I would welcome a pull request for it if you feel like making one. Otherwise, it's on my to-do list and I'll probably get to it in the next few weeks.
@jcrudy I have put a pull request in. You were interested in use cases for score_samples; I'm happy to provide two. (I think it is a pity that scikit-learn does not support the method more broadly.)

One use case is reinforcement learning. Frequently in that domain one has a series of models {M_0, M_1, M_2, ...} (or a sequence of snapshots of the same model evolving over time under online learning). One common tack is to generate M_{n+1} by prioritized training, wherein data points that had the greatest error when evaluated by model M_n are given higher priority. The hope is that there is more to learn from these points. In the case of a MARS-type algorithm, these points may be particularly useful in uncovering knots that might otherwise have been dismissed as statistical noise. To accomplish this, it is useful to know which points had the greatest standardized error, which is essentially what score_samples provides.

The second use case once again involves uncovering signal that might otherwise look like noise. Standard methods of measuring error for multidimensional output combine all the error in all the dimensions into a single value. Unfortunately, due to the ramifications of multidimensional geometry, this can make rectifiable error in a regression look like simple noise. Here is an example (the arithmetic is spelled out in the sketch below). Imagine a case where you have 10 dimensions with a standard deviation of 4 in each dimension. This means the typical point is sqrt(160) ≈ 12.6 units away from the midpoint of all points. Now imagine you have a model that tends to be off by 2 units on average in each coordinate, so the typical prediction is about sqrt(40) ≈ 6.3 units away from the target. Now imagine a prospective change to the model that cut the typical error by 20% along one coordinate but did nothing to the other coordinates. Even though this change reduces the error by 20% in terms of that one coordinate, to a multidimensional evaluator it can look like noise: instead of being on average sqrt(40) units away, the model is off by about sqrt(38.6) units. The model may well decide the extra term is not worth adding because it does not decrease the multidimensional error sufficiently.

Long story short, it can be valuable to know the error in each coordinate separately, and the typical r2_score does not give you access to that. But if you know the error in each coordinate for each sample, you can compare two models and see whether substantial improvement is being made in a particular coordinate even if it does not change the r2_score much.
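(The arithmetic from the example above, spelled out; the dimensions and magnitudes are the ones assumed in the text:)

```python
import numpy as np

per_axis_std = np.full(10, 4.0)            # 10 dimensions, std 4 each
print(np.sqrt((per_axis_std ** 2).sum()))  # ~12.6: typical distance from midpoint

per_axis_err = np.full(10, 2.0)            # typical error of 2 per coordinate
print(np.sqrt((per_axis_err ** 2).sum()))  # ~6.3: typical multidimensional error

per_axis_err[0] *= 0.8                     # cut one coordinate's error by 20%
print(np.sqrt((per_axis_err ** 2).sum()))  # ~6.2: nearly invisible in aggregate
```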