From b90670173a0e3afe8f46491c5d07a5c0ce1fa85b Mon Sep 17 00:00:00 2001
From: nsanthan
Date: Tue, 9 Apr 2024 15:49:47 -1000
Subject: [PATCH] Update reading-svm.md

---
 morea/kernels/reading-svm.md | 84 ++++++++++++++++++------------------
 1 file changed, 42 insertions(+), 42 deletions(-)

diff --git a/morea/kernels/reading-svm.md b/morea/kernels/reading-svm.md
index 9ea0ffd..0e251db 100644
--- a/morea/kernels/reading-svm.md
+++ b/morea/kernels/reading-svm.md
@@ -34,16 +34,16 @@ first note the following exercise.

#### Training for maximum margin

-Suppose \(\z_1\upto \z_n\) are \(n\)
-training examples in \(\reals^p\) given to us with labels \(y_1\upto y_n\)
-respectively (each label is either \(+1\) or \(-1\)). Let \(\w\in\reals^p\)
-and \(b\) be a number, and define
+Suppose \\(\z_1\upto \z_n\\) are \\(n\\)
+training examples in \\(\reals^p\\) given to us with labels \\(y_1\upto y_n\\)
+respectively (each label is either \\(+1\\) or \\(-1\\)). Let \\(\w\in\reals^p\\)
+and \\(b\\) be a number, and define

$$\gamma_i(\w,b) = \w^T\z_i - b.$$

-Therefore, the distances of the \(n\) points to the plane \(\w^T\x-b=0\)
-are respectively \(\gamma_1/||\w|| \upto \gamma_n/||\w||\). In addition,
+Therefore, the distances of the \\(n\\) points to the plane \\(\w^T\x-b=0\\)
+are respectively \\(\gamma_1/||\w|| \upto \gamma_n/||\w||\\). In addition,
let
\begin{equation}
@@ -52,21 +52,21 @@ let
\end{equation}
so that the smallest distance between the examples and the hyperplane
-is \(\gamma(\w,b)/||\w||\). This is called the \emph{margin} of the classifier
-\(\w^T\x-b=0\).
+is \\(\gamma(\w,b)/||\w||\\). This is called the \emph{margin} of the classifier
+\\(\w^T\x-b=0\\).

-From our training data, we want to obtain that plane \(\w^T\x-b=0\) which
+From our training data, we want to obtain that plane \\(\w^T\x-b=0\\) which
classifies all examples correctly, but in addition has the largest
margin. This plane is what we will learn from the training examples, and
what we will use to predict on the test examples.

-So for training, we first set up an optimization. Note that \(\gamma\)
-is some complicated function of \(\w\) and \(b\). Different values of
-\(\w\) and \(b\) yield potentially different orientations and
+So for training, we first set up an optimization. Note that \\(\gamma\\)
+is some complicated function of \\(\w\\) and \\(b\\). Different values of
+\\(\w\\) and \\(b\\) yield potentially different orientations and
intercepts of the separating hyperplane, and their margin is determined
by different examples (i.e., the minimizer
in~\eqref{eq:gamma} is different). Even though we may not have
-\(\gamma(\w,b)\) in a simple form, we can still ask for
+\\(\gamma(\w,b)\\) in a simple form, we can still ask for

\begin{align*}
\w^*,b^* &= \arg\max_{\w,b} \frac{\gamma(\w,b)}{||\w||}\\
@@ -74,29 +74,29 @@ in\textasciitilde{}\eqref{eq:gamma} is different). Even though we may not have
\end{align*}

In the optimization above, the first line asks to maximize the margin,
-while the constraints (there are \(n\) of them) ensure that each
+while the constraints (there are \\(n\\) of them) ensure that each
example is classified properly.

-So far so good, but we don't really want to compute \(\gamma(\w,b)\) or
+So far so good, but we don't really want to compute \\(\gamma(\w,b)\\) or
try expressing it in any closed/numerical form. But there is a simple
-conceptual way around it. Suppose \(\w\) and \(\b\) classified all examples
-such that every example, \(\z_1\upto \z_n\) satisfied
+conceptual way around it. Suppose \\(\w\\) and \\(b\\) classified all examples
+such that every example, \\(\z_1\upto \z_n\\) satisfied

\begin{equation}
\label{eq:constraints}
y_i(\w^T \z_i -b) \ge \nu, \qquad 1\le i\le n.
\end{equation}

-For a given \(\w\) and \(b\), since \(\gamma(\w,b)/||\w||\) happens to be the
-distance of the closest point to the plane \(\w^T \x -b =0\), we could
-satisfy all \(n\) constraints of\textasciitilde{}\eqref{eq:constraints} above for every value of \(\nu\) in the range \(0 \le\nu \le \gamma(\w,b)\) and for no
+For a given \\(\w\\) and \\(b\\), since \\(\gamma(\w,b)/||\w||\\) happens to be the
+distance of the closest point to the plane \\(\w^T \x -b =0\\), we could
+satisfy all \\(n\\) constraints of~\eqref{eq:constraints} above for every value of \\(\nu\\) in the range \\(0 \le\nu \le \gamma(\w,b)\\) and for no
other.

-Therefore, we ask to find the maximum number \(\nu\) such that all the
+Therefore, we ask to find the maximum number \\(\nu\\) such that all the
constraints in~\eqref{eq:constraints} are satisfied.
-Note the shift now---we treat \(\nu\) as just a number (not a
-function of \(\w\) and \(b\)) and see which is the largest combination
-of the number \(\nu\), the vector \(\w\) and \(b\) that satisfies
+Note the shift now---we treat \\(\nu\\) as just a number (not a
+function of \\(\w\\) and \\(b\\)) and see which is the largest combination
+of the number \\(\nu\\), the vector \\(\w\\) and \\(b\\) that satisfies

\begin{align*}
\w^*,b^*,\nu^* &= \arg\max_{\nu,\w,b} \frac{\nu}{||\w||}\\
@@ -104,27 +104,27 @@ of the number \(\nu\), the vector \(\w\) and \(b\) that satisfies
\end{align*}

We can make one more simplification. There is no distinction between
-the plane \(\w^T\x-b=0\) and the plane
-\(k(\w^T\x-b) =0\) for any real number \(k\ne 0\) (because
-if \(\x\) satisfies the equation \({\w}^T\x-{b}=0\), it
+the plane \\(\w^T\x-b=0\\) and the plane
+\\(k(\w^T\x-b) =0\\) for any real number \\(k\ne 0\\) (because
+if \\(\x\\) satisfies the equation \\({\w}^T\x-{b}=0\\), it
automatically satisfies the other and vice versa). So all the
-candidates (\(k{\w}, k{b}\)), \(k\ne 0\), yield exactly the
+candidates (\\(k{\w}, k{b}\\)), \\(k\ne 0\\), yield exactly the
same plane (and hence same margin). We may choose just one candidate
among these while searching for the optimum. To make our life simpler, we
-can choose \(k\) such that
+can choose \\(k\\) such that

\[ \min_{1\le i\le n} k({\w}^T\z_i - {b}) = 1 \]

-or equivalently, given any \({\w}\) and \({b}\), we scale it by
-\(k=\frac1\gamma\), where \(\gamma\) is as defined as
-in~\eqref{eq:gamma}, to get \(\tilde{\w}\) and \(\tilde{b}\), and
-optimize over only the \(\tilde{\w}\) and \(\tilde{b}\).
+or equivalently, given any \\({\w}\\) and \\({b}\\), we scale it by
+\\(k=\frac1\gamma\\), where \\(\gamma\\) is as defined
+in~\eqref{eq:gamma}, to get \\(\tilde{\w}\\) and \\(\tilde{b}\\), and
+optimize over only the \\(\tilde{\w}\\) and \\(\tilde{b}\\).
Then, we will have \[ \min_{1\le i\le n} (\tilde{\w}^T\z_i -
-\tilde{b}) = 1 \] and the margin of the hyperplane \(\tilde{\w}^T
-\x-\tilde{b}=0\) is \(1/||\tilde{\w}||\).
+\tilde{b}) = 1 \] and the margin of the hyperplane \\(\tilde{\w}^T
+\x-\tilde{b}=0\\) is \\(1/||\tilde{\w}||\\).

So we can rewrite our training goal to be the optimization
\begin{align*}
@@ -133,13 +133,13 @@ So we can rewrite our training goal to be the optimization
\text{ for all } 1\le i\le n.
\end{align*}

-Clearly, the \(\nu\) 's are now superflous---they don't exist in either the
+Clearly, the \\(\nu\\)'s are now superfluous---they don't exist in either the
objective or the constraints---we can discard them.

-In the above, the \(\tilde\w\) and \(\tilde b\) are just dummy variables,
+In the above, the \\(\tilde\w\\) and \\(\tilde b\\) are just dummy variables;
we can call them by any other name and nothing will really change. Furthermore,
-maximizing \(1/||\w||\) is the same as minimizing \(||\w||\), which is in turn
-the same as minimizing \(\half ||\w||^2\). We can therefore write our training
-objective as obtaining the hyperplane \((\w^*)^T \x-b^*=0\), where
+maximizing \\(1/||\w||\\) is the same as minimizing \\(||\w||\\), which is in turn
+the same as minimizing \\(\half ||\w||^2\\). We can therefore write our training
+objective as obtaining the hyperplane \\((\w^*)^T \x-b^*=0\\), where

\begin{align}
\nonumber
@@ -149,8 +149,8 @@ objective as obtaining the hyperplane \((\w^*)^T \x-b^*=0\), where
\text{ for all } 1\le i\le n.
\end{align}

-You may wonder why we transformed maximizing \(1/||\w||\) to minimizing
-\(\half ||\w||^2\). The reason is that we want our objectives and
+You may wonder why we transformed maximizing \\(1/||\w||\\) to minimizing
+\\(\half ||\w||^2\\). The reason is that we want our objectives and
constraints to be \emph{convex} functions. We will have a little
digression here to define convex functions and sets, but practically
every large constrained optimization we can solve is convex (or we
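To see the final training objective in action, the following is a minimal numerical sketch (not from the reading itself) of the hard-margin problem: minimize one half the squared norm of \\(\w\\) subject to \\(y_i(\w^T\z_i - b)\ge 1\\) for all training points. It assumes the `numpy` and `cvxpy` packages are available; the toy data, random seed, and variable names are illustrative only.

```python
# Sketch of the hard-margin objective derived above:
#   minimize (1/2) ||w||^2   subject to   y_i (w^T z_i - b) >= 1 for all i.
# Assumes numpy and cvxpy are installed; the toy data below is made up.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
# Two linearly separable clusters in R^2 with labels +1 and -1.
Z = np.vstack([rng.normal(+2.0, 0.5, size=(20, 2)),
               rng.normal(-2.0, 0.5, size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

n, p = Z.shape
w = cp.Variable(p)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))   # (1/2) ||w||^2
constraints = [cp.multiply(y, Z @ w - b) >= 1]      # y_i (w^T z_i - b) >= 1
cp.Problem(objective, constraints).solve()

print("w* =", w.value, " b* =", b.value)
print("margin = 1/||w*|| =", 1.0 / np.linalg.norm(w.value))
```

The solver returns the maximizers \\(\w^*\\) and \\(b^*\\) as `w.value` and `b.value`, and the printed quantity \\(1/||\w^*||\\) is the margin discussed above. A library routine such as scikit-learn's `SVC(kernel="linear")` with a large `C` approximates the same hard-margin solution.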