In this section we develop a small amount of theory that provides a framework for developing models.
We first consider the case of quantitative output. Let's assume $$X \in \mathbb{R}^p$$ and $$Y \in \mathbb{R}$$ with joint distribution $$Pr(X,Y)$$. We seek a function $$f(X)$$ to predict $$Y$$ given the input $$X$$. This theory requires a loss function $$L(Y,f(X))$$ for penalizing errors in prediction, and by far the most common and convenient is squared error loss: $$L(Y,f(X))=(Y-f(X))^2$$. This leads us to a criterion for choosing $$f$$,
$$
EPE(f) = E(Y-f(X))^2 = \int [y-f(x)]^2 Pr(dx,dy),
$$
the expected prediction error. In most cases we observe $$X$$ but not $$Y$$, so it is natural to factor the joint distribution into the marginal of $$X$$ and the conditional of $$Y$$ given $$X$$. Conditioning on $$X$$, we can write EPE as
$$
EPE(f) = E_X E_{Y|X}\left([Y-f(X)]^2 \mid X\right),
$$
and we see that it suffices to minimize EPE point-wise:
$$
f(x) = \operatorname{argmin}_c E_{Y|X}\left([Y-c]^2 \mid X=x\right).
$$
The solution is
$$
f(x) = E(Y|X=x),
$$
the conditional expectation, also known as the regression function. Thus the best prediction of $$Y$$ at any point $$X = x$$ is the conditional mean, when best is measured by average squared error.
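To make the point-wise minimization concrete, here is a minimal numerical sketch (the simulated data and the grid search over constants are assumptions for illustration only, not part of the text):

```python
import numpy as np

# A quick numerical check on simulated data: among all constants c, the
# empirical average squared error (y - c)^2 is minimized at the sample mean,
# mirroring f(x) = E(Y | X = x) minimizing EPE point-wise.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=10_000)   # draws of Y at a fixed x

cs = np.linspace(0.0, 4.0, 401)                   # candidate constants c
avg_sq_err = [np.mean((y - c) ** 2) for c in cs]  # empirical E[(Y - c)^2]

best_c = cs[int(np.argmin(avg_sq_err))]
print(best_c, y.mean())  # best_c matches the sample mean up to grid resolution
```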
Nearest-neighbor methods attempt to implement this recipe directly using the training data. At each point $$x$$ we have:
$$
\hat{f}(x) = \operatorname{Ave}(y_i \mid x_i \in N_k(x)).
$$
There are two approximations here:
expectation is approximated by averaging over sample data;
conditioning at a point is relaxed to conditioning on some region "close" to the target point.
For a large training sample size $$N$$, as $$N, k \rightarrow \infty$$ with $$k/N \rightarrow 0$$, $$\hat{f}(x) \rightarrow E(Y \mid X=x)$$.
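A minimal sketch of this estimator, assuming Euclidean distance and synthetic data (the function name `knn_regress` is made up for this illustration):

```python
import numpy as np

def knn_regress(x0, X, y, k):
    """Sketch of the k-NN estimate: average the y_i whose x_i are the
    k training points closest to the query x0 (Euclidean distance)."""
    dists = np.linalg.norm(X - x0, axis=1)   # distances to every x_i
    neighbors = np.argsort(dists)[:k]        # indices of the k nearest x_i
    return y[neighbors].mean()               # Ave(y_i | x_i in N_k(x0))

# Tiny usage example on synthetic one-dimensional data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)
print(knn_regress(np.array([1.0]), X, y, k=15))  # roughly sin(1) ~ 0.84
```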
Now let's consider how the linear model fits this framework. We assume $$f(x) \approx x^T\beta$$. Plugging this into the EPE, we can solve for $$\beta$$ theoretically:
$$
EPE(\beta) = E(Y - X^T\beta)^2.
$$
Since $$X^T\beta$$ is a scalar, $$X^T\beta\, X = XX^T\beta$$.
Therefore, $$\frac{\partial EPE(\beta)}{\partial \beta} = -2\left(E(XY) - E(XX^T)\beta\right)$$. Setting this derivative to zero, we get:
$$
\beta = [E(XX^T)]^{-1}E(XY).
$$
Least squares replaces the expectation above by an average over the training data.
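For concreteness, here is a minimal sketch of that replacement on synthetic data (the variable names and simulated setup are illustrative, not from the text):

```python
import numpy as np

# Minimal sketch of the plug-in step: replace E(XX^T) and E(XY) by their
# sample averages over the training data, which yields the familiar least
# squares solution beta_hat = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
N, p = 200, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# The 1/N factors cancel, so the sample analogue of [E(XX^T)]^{-1} E(XY) is:
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```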
So both $$k$$-nearest neighbors and least squares end up approximating the conditional expectation by averages, but they differ in their model assumptions:
Least squares assumes $$f(x)$$ is well approximated by a globally linear function.
The $$k$$-nearest neighbors assumes $$f(x)$$ is well approximated by a locally constant function.
If we replace the squared error loss with the $$L_1$$ loss, $$E|Y-f(X)|$$, the solution is the conditional median,
$$
\hat{f}(x) = \operatorname{median}(Y \mid X=x),
$$
which is a different measure of location, and its estimates are more robust than those for the conditional mean. However, the $$L_1$$ criterion has discontinuities in its derivatives, which has hindered its widespread use.
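As a quick sanity check of the $$L_1$$ claim, here is a numerical sketch on skewed simulated data (the setup is an assumption for illustration, not from the text):

```python
import numpy as np

# Small check on skewed synthetic data: among constants c, the average
# absolute error |y - c| is minimized near the sample median, whereas the
# average squared error is minimized at the sample mean.
rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=10_000)        # skewed draws of Y

cs = np.linspace(0.0, 3.0, 601)
avg_abs_err = [np.mean(np.abs(y - c)) for c in cs]

best_c = cs[int(np.argmin(avg_abs_err))]
print(best_c, np.median(y), y.mean())  # best_c tracks the median, not the mean
```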
If the output is categorical, we can use the zero-one loss function. The expected prediction error is
$$
EPE = E[L(G, \hat{G}(X))],
$$
where again the expectation is taken with respect to the joint distribution $$Pr(G,X)$$. Again we condition, and can write EPE as