

#### Training for maximum margin
Suppose \(\z_1\upto \z_n\) are \(n\)
training examples in \(\reals^p\) given to us with labels \(y_1\upto y_n\)
respectively (each label is either \(+1\) or \(-1\)). Let \(\w\in\reals^p\) be a vector
and \(b\) a number, and define

$$\gamma_i(\w,b) = \w^T\z_i - b.$$


Therefore, the (signed) distances of the \(n\) points to the plane \(\w^T\x-b=0\)
are respectively \(\gamma_1/||\w|| \upto \gamma_n/||\w||\). In addition,
let

\begin{equation}
\label{eq:gamma}
\gamma(\w,b) = \min_{1\le i\le n} y_i \gamma_i(\w,b)
\end{equation}

so that the smallest distance between the examples and the hyperplane
is \(\gamma(\w,b)/||\w||\) (provided every example is classified correctly). This is called the \emph{margin} of the classifier
\(\w^T\x-b=0\).
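
To make these definitions concrete, here is a small numerical sketch (the points, labels, \(\w\), and \(b\) below are made-up illustrative values, not anything from the text) that computes the \(\gamma_i\)'s, \(\gamma(\w,b)\), and the resulting margin:

```python
import numpy as np

# Made-up toy training set in R^2 with labels +1/-1.
Z = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

# A candidate hyperplane w^T x - b = 0.
w = np.array([1.0, 1.0])
b = 0.0

gamma_i = Z @ w - b                 # gamma_i(w, b) = w^T z_i - b
gamma = np.min(y * gamma_i)         # gamma(w, b): smallest y_i * gamma_i over the examples
margin = gamma / np.linalg.norm(w)  # distance of the closest example to the plane

print(gamma_i)   # [ 4.  4. -2. -2.]
print(gamma)     # 2.0
print(margin)    # about 1.414
```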

From our training data, we want to obtain that plane \(\w^T\x-b=0\) which
classifies all examples correctly, but in addition has the largest
margin. This plane is what we will learn from the training examples, and
what we will use to predict on the test examples.

So for training, we first set up an optimization. Note that \(\gamma\)
is some complicated function of \(\w\) and \(b\). Different values of
\(\w\) and \(b\) yield potentially different orientations and
intercepts of the separating hyperplane, and their margin is
determined by different examples (i.e., the minimizer
in~\eqref{eq:gamma} is different). Even though we may not have
\(\gamma(\w,b)\) in a simple form, we can still ask for

\begin{align*}
\w^*,b^* &= \arg\max_{\w,b} \frac{\gamma(\w,b)}{||\w||}\\
\text{subject to } & y_i(\w^T \z_i -b) \ge 0 \text{ for all } 1\le i\le n.
\end{align*}

In the optimization above, the first line asks to maximize the margin,
while the constraints (there are \(n\) of them) ensure that each
example is classified properly.
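
Although \(\gamma(\w,b)\) has no simple closed form, it is easy to evaluate for any particular \(\w\) and \(b\), so the objective above can at least be probed numerically. The sketch below does a crude random search over candidate planes, reusing the made-up toy data from the previous snippet (redefined here so the block runs on its own); it is purely illustrative, not how the optimization is actually solved.

```python
import numpy as np

Z = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

def objective(w, b):
    """gamma(w, b)/||w||, or -infinity if some example is misclassified."""
    vals = y * (Z @ w - b)
    if np.any(vals < 0):               # a constraint y_i (w^T z_i - b) >= 0 fails
        return -np.inf
    return vals.min() / np.linalg.norm(w)

rng = np.random.default_rng(0)
best_val, best_wb = -np.inf, None
for _ in range(10_000):                # crude random search, for illustration only
    w, b = rng.normal(size=2), rng.normal()
    val = objective(w, b)
    if val > best_val:
        best_val, best_wb = val, (w, b)

print(best_val, best_wb)
```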

So far so good, but we don't really want to compute \(\gamma(\w,b)\) or
try expressing it in any closed/numerical form. But there is a simple
conceptual way around it. Suppose \(\w\) and \(b\) classified all examples
such that every example, \(\z_1\upto \z_n\) satisfied

\begin{equation}
\label{eq:constraints}
y_i(\w^T \z_i -b) \ge \nu, \qquad 1\le i\le n.
\end{equation}

For a given \(\w\) and \(b\), since \(\gamma(\w,b)/||\w||\) happens to be the
distance of the closest point to the plane \(\w^T \x -b =0\), we could
satisfy all \(n\) constraints of~\eqref{eq:constraints} above for every value of \(\nu\) in the range \(0 \le \nu \le \gamma(\w,b)\) and for no
other.
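
This claim is easy to check numerically. Continuing with the same made-up toy data and candidate plane, every \(\nu\) up to \(\gamma(\w,b)\) satisfies all the constraints, and anything larger fails:

```python
import numpy as np

Z = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

gamma = np.min(y * (Z @ w - b))        # here gamma = 2.0

for nu in [0.0, 1.0, gamma, gamma + 0.1]:
    ok = bool(np.all(y * (Z @ w - b) >= nu))
    print(f"nu = {nu:.1f}: all constraints hold? {ok}")
# Every nu <= gamma works; nu = gamma + 0.1 already violates the closest example.
```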

Therefore, we ask to find the maximum number \(\nu\) such that all the
constraints in~\eqref{eq:constraints} are satisfied.
Note the shift now---we treat \(\nu\) as just a number (not a
function of \(\w\) and \(b\)) and ask for the combination of the number
\(\nu\), the vector \(\w\), and the number \(b\) that solves

\begin{align*}
\w^*,b^*,\nu^* &= \arg\max_{\nu,\w,b} \frac{\nu}{||\w||}\\
\text{subject to } & y_i(\w^T \z_i -b) \ge \nu \text{ for all } 1\le i\le n.
\end{align*}

We can make one more simplification. There is no distinction between
the plane \(\w^T\x-b=0\) and the plane
\(k(\w^T\x-b) =0\) for any real number \(k\ne 0\) (because
if \(\x\) satisfies the equation \({\w}^T\x-{b}=0\), it
automatically satisfies the other and vice versa). So all the
candidates (\(k{\w}, k{b}\)), \(k\ne 0\), yield exactly the
same plane (and hence same margin). We may choose just one candidate among
these while searching for the optimum. To make our life simpler, we
can choose \(k\) such that

\[
\min_{1\le i\le n} y_i k({\w}^T\z_i - {b}) = 1
\]

or equivalently, given any \({\w}\) and \({b}\), we scale them by
\(k=\frac1\gamma\), where \(\gamma\) is as defined
in~\eqref{eq:gamma}, to get \(\tilde{\w}\) and \(\tilde{b}\), and
optimize over only the \(\tilde{\w}\) and \(\tilde{b}\).
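
A short sketch of this normalization on the same made-up toy data: scaling by \(k=1/\gamma\) makes the closest example sit at value exactly \(1\), and the margin of the scaled plane is \(1/||\tilde{\w}||\):

```python
import numpy as np

Z = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

gamma = np.min(y * (Z @ w - b))          # gamma(w, b) as above
w_t, b_t = w / gamma, b / gamma          # scale (w, b) by k = 1/gamma

print(np.min(y * (Z @ w_t - b_t)))       # 1.0 by construction
print(1 / np.linalg.norm(w_t))           # margin of the normalized plane, about 1.414
print(gamma / np.linalg.norm(w))         # same margin computed from the unscaled plane
```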

Then, we will have \[ \min_{1\le i\le n} y_i(\tilde{\w}^T\z_i -
\tilde{b}) = 1 \] and the margin of the hyperplane \(\tilde{\w}^T
\x-\tilde{b}=0\) is \(1/||\tilde{\w}||\).
So we can rewrite our training goal to be the optimization

\begin{align*}
\w^*,b^*,\nu^* &= \arg\max_{\nu,\tilde{\w},\tilde{b}} \frac{1}{||\tilde{\w}||}\\
\text{subject to } & y_i(\tilde{\w}^T \z_i -\tilde{b}) \ge 1
\text{ for all } 1\le i\le n.
\end{align*}

Clearly, the \(\nu\)'s are now superfluous---they don't exist in either the
objective or the constraints---we can discard them.
In the above, the \(\tilde\w\) and \(\tilde b\) are just dummy variables;
we can call them by any other name and nothing will really change. Furthermore,
maximizing \(1/||\w||\) is the same as minimizing \(||\w||\), which is in turn
the same as minimizing \(\half ||\w||^2\). We can therefore write our training
objective as obtaining the hyperplane \((\w^*)^T \x-b^*=0\), where

\begin{align}
\nonumber
\w^*,b^* &= \arg\min_{\w,b} \half ||\w||^2\\
\text{subject to } & y_i(\w^T \z_i -b) \ge 1
\text{ for all } 1\le i\le n.
\end{align}
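
The resulting problem is a quadratic program with linear constraints, so a generic convex solver can handle it directly. Below is a minimal sketch using the cvxpy library (not something the text itself uses) on the same made-up toy data; it returns the maximum-margin hyperplane, assuming the data are linearly separable.

```python
import cvxpy as cp
import numpy as np

Z = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
n, p = Z.shape

w = cp.Variable(p)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w^T z_i - b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [y[i] * (Z[i] @ w - b) >= 1 for i in range(n)]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                 # the learned hyperplane w*^T x - b* = 0
print(1 / np.linalg.norm(w.value))      # the achieved (maximum) margin
```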

You may wonder why we transformed maximizing \(1/||\w||\) to minimizing
\(\half ||\w||^2\). The reason is that we want our objectives and
constraints to be \emph{convex} functions. We will have a little
digression here to define convex functions and sets, but practically
every large constrained optimization we can solve is convex (or we