

#### Training for maximum margin
Suppose \(\z_1\upto \z_n\) are \(n\)
training examples in \(\reals^p\) given to us with labels \(y_1\upto y_n\)
respectively (each label is either \(+1\) or \(-1\)). Let \(\w\in\reals^p\) be a vector
and \(b\) a number, and define

$$\gamma_i(\w,b) = \w^T\z_i - b.$$


Therefore, the (signed) distances of the \(n\) points to the plane \(\w^T\x-b=0\)
are respectively \(\gamma_1/||\w|| \upto \gamma_n/||\w||\). In addition,
let

\begin{equation}
\label{eq:gamma}
\gamma(\w,b) = \min_{1\le i\le n} y_i \gamma_i(\w,b)
\end{equation}

so that the smallest distance between the examples and the hyperplane
is \(\gamma(\w,b)/||\w||\) (provided every example is classified correctly). This is called the \emph{margin} of the classifier
\(\w^T\x-b=0\).
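
To make these definitions concrete, here is a small numerical sketch (the points, labels, \(\w\), and \(b\) below are made-up illustrative values, not anything from the text) that computes the \(\gamma_i\)'s, \(\gamma(\w,b)\), and the resulting margin:

```python
import numpy as np

# Made-up toy training set in R^2 with labels +1/-1.
Z = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

# A candidate hyperplane w^T x - b = 0.
w = np.array([1.0, 1.0])
b = 0.0

gamma_i = Z @ w - b                 # gamma_i(w, b) = w^T z_i - b
gamma = np.min(y * gamma_i)         # gamma(w, b): smallest y_i * gamma_i over the examples
margin = gamma / np.linalg.norm(w)  # distance of the closest example to the plane

print(gamma_i)   # [ 4.  4. -2. -2.]
print(gamma)     # 2.0
print(margin)    # about 1.414
```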

From our training data, we want to obtain that plane \(\w^T\x-b=0\) which
classifies all examples correctly, but in addition has the largest
margin. This plane is what we will learn from the training examples, and
what we will use to predict on the test examples.

So for training, we first set up an optimization. Note that \(\gamma\)
is some complicated function of \(\w\) and \(b\). Different values of
\(\w\) and \(b\) yield potentially different orientations and
intercepts of the separating hyperplane, and their margin is
determined by different examples (i.e., the minimizer
in~\eqref{eq:gamma} is different). Even though we may not have
\(\gamma(\w,b)\) in a simple form, we can still ask for

\begin{align*}
\w^*,b^* &= \arg\max_{\w,b} \frac{\gamma(\w,b)}{||\w||}\\
\text{subject to } & y_i(\w^T \z_i -b) \ge 0 \text{ for all } 1\le i\le n.
\end{align*}

In the optimization above, the first line asks to maximize the margin,
while the constraints (there are \(n\) of them) ensure that each
example is classified properly.
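
Although \(\gamma(\w,b)\) has no simple closed form, it is easy to evaluate for any particular \(\w\) and \(b\), so the objective above can at least be probed numerically. The sketch below does a crude random search over candidate planes, reusing the made-up toy data from the previous snippet (redefined here so the block runs on its own); it is purely illustrative, not how the optimization is actually solved.

```python
import numpy as np

Z = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

def objective(w, b):
    """gamma(w, b)/||w||, or -infinity if some example is misclassified."""
    vals = y * (Z @ w - b)
    if np.any(vals < 0):               # a constraint y_i (w^T z_i - b) >= 0 fails
        return -np.inf
    return vals.min() / np.linalg.norm(w)

rng = np.random.default_rng(0)
best_val, best_wb = -np.inf, None
for _ in range(10_000):                # crude random search, for illustration only
    w, b = rng.normal(size=2), rng.normal()
    val = objective(w, b)
    if val > best_val:
        best_val, best_wb = val, (w, b)

print(best_val, best_wb)
```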

So far so good, but we don't really want to compute \(\gamma(\w,b)\) or
try expressing it in any closed/numerical form. But there is a simple
conceptual way around it. Suppose \(\w\) and \(b\) classified all examples
such that every example, \(\z_1\upto \z_n\) satisfied

\begin{equation}
\label{eq:constraints}
y_i(\w^T \z_i -b) \ge \nu, \qquad 1\le i\le n.
\end{equation}

For a given \(\w\) and \(b\), since \(\gamma(\w,b)/||\w||\) happens to be the
distance of the closest point to the plane \(\w^T \x -b =0\), we could
satisfy all \(n\) constraints of~\eqref{eq:constraints} above for every value of \(\nu\) in the range \(0 \le \nu \le \gamma(\w,b)\) and for no
other.
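
This claim is easy to check numerically. Continuing with the same made-up toy data and candidate plane, every \(\nu\) up to \(\gamma(\w,b)\) satisfies all the constraints, and anything larger fails:

```python
import numpy as np

Z = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

gamma = np.min(y * (Z @ w - b))        # here gamma = 2.0

for nu in [0.0, 1.0, gamma, gamma + 0.1]:
    ok = bool(np.all(y * (Z @ w - b) >= nu))
    print(f"nu = {nu:.1f}: all constraints hold? {ok}")
# Every nu <= gamma works; nu = gamma + 0.1 already violates the closest example.
```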

Therefore, we ask to find the maximum number \(\nu\) such that all the
constraints in~\eqref{eq:constraints} are satisfied.
Note the shift now---we treat \(\nu\) as just a number (not a
function of \(\w\) and \(b\)) and ask for the combination of the number
\(\nu\), the vector \(\w\), and the number \(b\) that solves

\begin{align*}
\w^*,b^*,\nu^* &= \arg\max_{\nu,\w,b} \frac{\nu}{||\w||}\\
\text{subject to } & y_i(\w^T \z_i -b) \ge \nu \text{ for all } 1\le i\le n.
\end{align*}

We can make one more simplification. There is no distinction between
the plane \(\w^T\x-b=0\) and the plane
\(k(\w^T\x-b) =0\) for any real number \(k\ne 0\) (because
if \(\x\) satisfies the equation \({\w}^T\x-{b}=0\), it
automatically satisfies the other and vice versa). So all the
candidates (\(k{\w}, k{b}\)), \(k\ne 0\), yield exactly the
same plane (and hence same margin). We may choose just one candidate among
these while searching for the optimum. To make our life simpler, we
can choose \(k\) such that

\[
\min_{1\le i\le n} y_i k({\w}^T\z_i - {b}) = 1
\]

or equivalently, given any \({\w}\) and \({b}\), we scale them by
\(k=\frac1\gamma\), where \(\gamma\) is as defined
in~\eqref{eq:gamma}, to get \(\tilde{\w}\) and \(\tilde{b}\), and
optimize over only the \(\tilde{\w}\) and \(\tilde{b}\).
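
A short sketch of this normalization on the same made-up toy data: scaling by \(k=1/\gamma\) makes the closest example sit at value exactly \(1\), and the margin of the scaled plane is \(1/||\tilde{\w}||\):

```python
import numpy as np

Z = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

gamma = np.min(y * (Z @ w - b))          # gamma(w, b) as above
w_t, b_t = w / gamma, b / gamma          # scale (w, b) by k = 1/gamma

print(np.min(y * (Z @ w_t - b_t)))       # 1.0 by construction
print(1 / np.linalg.norm(w_t))           # margin of the normalized plane, about 1.414
print(gamma / np.linalg.norm(w))         # same margin computed from the unscaled plane
```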

Then, we will have \[ \min_{1\le i\le n} y_i(\tilde{\w}^T\z_i -
\tilde{b}) = 1 \] and the margin of the hyperplane \(\tilde{\w}^T
\x-\tilde{b}=0\) is \(1/||\tilde{\w}||\).
So we can rewrite our training goal to be the optimization

\begin{align*}
\w^*,b^*,\nu^* &= \arg\max_{\nu,\tilde{\w},\tilde{b}} \frac{1}{||\tilde{\w}||}\\
\text{subject to } & y_i(\tilde{\w}^T \z_i -\tilde{b}) \ge 1
\text{ for all } 1\le i\le n.
\end{align*}

Clearly, the \(\nu\)'s are now superfluous---they don't exist in either the
objective or the constraints---we can discard them.
In the above, the \(\tilde\w\) and \(\tilde b\) are just dummy variables;
we can call them by any other name and nothing will really change. Furthermore,
maximizing \(1/||\w||\) is the same as minimizing \(||\w||\), which is in turn
the same as minimizing \(\half ||\w||^2\). We can therefore write our training
objective as obtaining the hyperplane \((\w^*)^T \x-b^*=0\), where

\begin{align}
\nonumber
\w^*,b^* &= \arg\min_{\w,b} \half ||\w||^2\\
\text{subject to } & y_i(\w^T \z_i -b) \ge 1
\text{ for all } 1\le i\le n.
\end{align}
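
The resulting problem is a quadratic program with linear constraints, so a generic convex solver can handle it directly. Below is a minimal sketch using the cvxpy library (not something the text itself uses) on the same made-up toy data; it returns the maximum-margin hyperplane, assuming the data are linearly separable.

```python
import cvxpy as cp
import numpy as np

Z = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
n, p = Z.shape

w = cp.Variable(p)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w^T z_i - b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [y[i] * (Z[i] @ w - b) >= 1 for i in range(n)]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                 # the learned hyperplane w*^T x - b* = 0
print(1 / np.linalg.norm(w.value))      # the achieved (maximum) margin
```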

You may wonder why we transformed maximizing \(1/||\w||\) to minimizing
\(\half ||\w||^2\). The reason is that we want our objectives and
constraints to be \emph{convex} functions. We will have a little
digression here to define convex functions and sets, but practically
every large constrained optimization we can solve is convex (or we