---
title: "Support Vector Machines"
published: true
morea_id: reading-svm
morea_summary: "Support Vector Machines"
morea_summary: "Primal formulation"
morea_type: reading
morea_labels:
---

We will formulate an optimization problem for training that not only
tries to get a separating hyperplane, but also one that will ensure
that the examples are as far away from it as possible.


#### Training for maximum margin
Note the shift now---we treat \\(\nu\\) as just a number (not a
function of \\(\w\\) and \\(b\\)) and see which is the largest combination
of the number \\(\nu\\), the vector \\(\w\\) and \\(b\\) that satisfies

$$\w^*,b^*,\nu^* = \arg\max_{\nu,\w,b} \frac{\nu}{||\w||}$$
subject to \\(y_i(\w^T \z_i -b) \ge \nu \text{ for all } 1\le i\le n.\\)
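
To make these quantities concrete, here is a small numerical sketch (the toy data, the chosen hyperplane, and all variable names are our own, and it assumes `numpy` is available): for a fixed separating hyperplane \\(\w^T \x - b = 0\\), the largest \\(\nu\\) satisfying every constraint is \\(\min_i y_i(\w^T \z_i - b)\\), and \\(\nu/||\w||\\) is then the distance from the closest example to the hyperplane.

```python
# Sketch only: toy data and names are ours, not from the reading.
import numpy as np

Z = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # examples z_i
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels y_i

w = np.array([1.0, 1.0])   # some hyperplane w^T x - b = 0 that separates the data
b = 0.0

# Largest nu satisfying y_i (w^T z_i - b) >= nu for every i:
nu = np.min(y * (Z @ w - b))
print("nu =", nu, " nu/||w|| =", nu / np.linalg.norm(w))

# For a separating hyperplane this matches the distance of the closest example:
distances = np.abs(Z @ w - b) / np.linalg.norm(w)
print("closest distance =", distances.min())
```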

We can make one more simplification. There is no distinction between
and the margin of the hyperplane \\(\tilde{\w}^T
\x-\tilde{b}=0\\) is \\(1/||\tilde{\w}||\\).
So we can rewrite our training goal to be the optimization

$$\w^*,b^*,\nu^* = \arg\max_{\nu,\tilde{\w},b} \frac{1}{||\tilde{\w}||}$$
subject to \\(y_i(\tilde{\w}^T \z_i -\tilde{b}) \ge 1\\) for all \\(1\le i\le n.\\)

Clearly, the \\(\nu\\)'s are now superfluous---they don't exist in either the
maximizing \\(1/||\w||\\) is the same as minimizing \\(||\w||\\), which is in turn
the same as minimizing \\(\half ||\w||^2\\). We can therefore write our training
objective as obtaining the hyperplane \\((\w^*)^T \x-b^*=0\\), where

$$ \w^*,b^* = \arg\min_{\w,b} \half{||\w||^2} \tag*{(2)}$$
subject to \\(y_i(\w^T \z_i -{b}) \ge 1 \\) for all \\(1\le i\le n.\\)

You may wonder why we transformed maximizing \\(1/||\w||\\) to minimizing
just fake the steps of a convex optimization if we are stuck with
non-convex optimization). Often, even convex optimization does not
look that way to begin with---we need to tweak the formulation as
above to get to the correct form.
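
As a quick numerical sanity check of the constrained problem (2) (not part of the derivation), here is a minimal sketch that solves it with the `cvxpy` package on a toy linearly separable dataset; the data, the variable names, and the choice of solver library are our own assumptions.

```python
# Minimal sketch of solving (2) numerically; assumes numpy and cvxpy are installed.
import numpy as np
import cvxpy as cp

Z = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # examples z_i
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels y_i

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2) ||w||^2   subject to   y_i (w^T z_i - b) >= 1 for all i
constraints = [cp.multiply(y, Z @ w - b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

margins = y * (Z @ w.value - b.value)
print("w* =", w.value, " b* =", b.value)
print("min_i y_i (w^T z_i - b) =", margins.min())     # ~1: some constraint is tight
print("geometric margin 1/||w*|| =", 1.0 / np.linalg.norm(w.value))
```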

#### Lagrangian for the SVM problem
To write the Lagrangian for this problem, we rewrite each inequality
constraint above so that it looks like \\(f_i(\w,b) \le 0\\), namely

$$1-y_i(\w^T \z_i -{b}) \le 0.$$

Each inequality gets its own Lagrange multiplier \\(\lambda_i\\), so our Lagrangian
is (letting \\(\Lambda = (\lambda_1\upto \lambda_n)\\))

$$\cL(\w,b, \Lambda) =
\half{||\w||^2} + \sum_{i=1}^n \lambda_i \Paren{1-y_i(\w^T \z_i -{b})}.$$
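
For reference, here is a direct transcription of this Lagrangian into code (a sketch only; the function name, the argument names, and the use of `numpy` are our own choices).

```python
# L(w, b, Lambda) = 0.5 ||w||^2 + sum_i lam_i * (1 - y_i (w^T z_i - b))
import numpy as np

def lagrangian(w, b, lam, Z, y):
    slack = 1.0 - y * (Z @ w - b)    # the constraints rewritten as f_i(w, b) <= 0
    return 0.5 * np.dot(w, w) + np.dot(lam, slack)
```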

Now consider the following problem for a specific choice of \\(\w\\) and \\(b\\),

$$\max_{\Lambda \ge 0}\cL(\w,b, \Lambda) ,$$

where \\(\Lambda\ge 0\\) is shorthand for \\(\lambda_1\ge 0, \lambda_2\ge
0,\cdots,\lambda_n\ge 0\\). Now if \\(\w\\) and \\(b\\) satisfy all constraints,
we will have for all \\(1\le i\le n\\) that

$$1-y_i(\w^T \z_i -{b}) \le 0,$$

therefore

$$\cL(\w,b, \Lambda)
=
\half{||\w||^2} + \sum_{i=1}^n \lambda_i \Paren{1-y_i(\w^T \z_i -{b})}\\
\le
\half{||\w||^2},
$$

with equality in the last step if we choose \\(\lambda_i=0\\) for all \\(i\\). Therefore, for any \\(\w\\) and \\(b\\) satisfying all constraints,

$$\max_{\Lambda \ge 0}\cL(\w,b, \Lambda) = \half ||\w||^2.$$
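
A quick numerical illustration of this feasible case (toy data and names are ours; assumes `numpy`): when every constraint holds, each term \\(1-y_i(\w^T \z_i -b)\\) is nonpositive, so any \\(\lambda_i>0\\) can only pull the Lagrangian down, and the maximum over \\(\Lambda \ge 0\\) is attained at \\(\Lambda = 0\\).

```python
# Sketch only: (w, b) below satisfies all constraints on this toy data.
import numpy as np

Z = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0

slack = 1.0 - y * (Z @ w - b)
assert np.all(slack <= 0)                  # feasible: every constraint is satisfied

def L(lam):                                # the Lagrangian as a function of Lambda only
    return 0.5 * np.dot(w, w) + np.dot(lam, slack)

rng = np.random.default_rng(0)
print("L at Lambda = 0:      ", L(np.zeros(4)))            # equals 0.5 ||w||^2 = 1.0
print("L at a random Lambda: ", L(rng.uniform(0, 10, 4)))  # never larger than that
```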

On the other hand, suppose \\(\w\\) and \\(b\\) violate even a single
constraint, that is, for some \\(j\\),

$$1-y_j(\w^T \z_j -{b}) > 0.$$

Then to maximize \\(\cL(\w,b, \Lambda)\\) we can let \\(\lambda_j\to \infty\\),
so that

$$\lambda_j(1-y_j(\w^T \z_j -{b})) \to +\infty,$$

and therefore

$$\max_{\Lambda \ge 0}\cL(\w,b, \Lambda) = \infty.$$

Putting it together,

$$
\max_{\Lambda \ge 0}\cL(\w,b, \Lambda)
=
\begin{cases}
\half ||\w||^2 & \w, b \text{ satisfy all $n$ constraints}\\
\infty & \text{else.}
\end{cases}
$$
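
The case analysis translates directly into code. The sketch below (our own names and toy data, assuming `numpy`) implements \\(g(\w,b)\\) and also shows, for an infeasible choice of \\(\w,b\\), how pushing a single multiplier up makes the Lagrangian arbitrarily large.

```python
# Sketch of g(w, b) = max over Lambda >= 0 of L(w, b, Lambda), via the case analysis.
import numpy as np

Z = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def g(w, b):
    slack = 1.0 - y * (Z @ w - b)
    if np.all(slack <= 0):                 # all n constraints satisfied
        return 0.5 * np.dot(w, w)
    return np.inf                          # some constraint violated

print(g(np.array([1.0, 1.0]), 0.0))        # feasible:   0.5 ||w||^2 = 1.0
print(g(np.array([0.1, 0.1]), 0.0))        # infeasible: inf

# Why "inf": for the infeasible (w, b), constraint 0 is violated (slack > 0),
# so increasing lambda_0 alone drives the Lagrangian to +infinity.
w_bad, b_bad = np.array([0.1, 0.1]), 0.0
slack_bad = 1.0 - y * (Z @ w_bad - b_bad)
for lam0 in [1.0, 1e3, 1e6]:
    lam = np.zeros(4)
    lam[0] = lam0
    print(0.5 * np.dot(w_bad, w_bad) + np.dot(lam, slack_bad))
```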

Let us call

$$g(\w,b) \ed \max_{\Lambda \ge 0}\cL(\w,b, \Lambda)$$

for convenience. Now there is at least one \\(\w\\), \\(b\\) that
satisfies all constraints (since the points are linearly
separable), so the smallest value \\(g(\w,b)\\) can take is
not infinity (any \\(\w,b\\) that violates a constraint
can never minimize \\(g(\w,b)\\)). That means that if we look for

$$\arg \min_{\w,b} g(\w,b),$$

the solution must be \\(\w^*,b^*\\) from (2), since we are
minimizing \\(\half ||\w||^2\\), but only over those \\(\w,b\\) that satisfy
every constraint.

Therefore, we can pose (2) as follows:

$$\w^*,b^* = \arg\min_{\w,b} \max_{\Lambda \ge 0} \cL(\w,b, \Lambda). \tag*{(3)}$$

We will call the above the *primal* formulation of the
constrained optimization problem in (2)---where we write the
Lagrangian, and observe that the solution of (2) is obtained
by the minmax formulation in (3).
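
As a small brute-force check of this minmax formulation (our own 1-D toy data and grid; this is only an illustration, not an efficient method), minimizing \\(g(\w,b)\\) over a grid of candidate \\((\w,b)\\) lands on the same maximum-margin hyperplane that the constrained problem (2) gives.

```python
# Sketch only: brute-force minimization of g(w, b) on 1-D toy data.
import numpy as np

z = np.array([-2.0, -1.0, 1.0, 2.0])       # 1-D examples
y = np.array([-1.0, -1.0, 1.0, 1.0])       # labels

def g(w, b):                               # max over Lambda >= 0 of the Lagrangian
    slack = 1.0 - y * (w * z - b)
    return 0.5 * w * w if np.all(slack <= 0) else np.inf

best = min((g(w, b), w, b)
           for w in np.linspace(0.0, 2.0, 201)
           for b in np.linspace(-1.0, 1.0, 201))
print(best)   # roughly (0.5, 1.0, 0.0): w* = 1, b* = 0, objective 1/2

# For this data the constraints at z = -1 and z = +1 read w + b >= 1 and
# w - b >= 1, which force w >= 1; so the constrained minimum of 0.5 w^2
# is 0.5, attained at w = 1, b = 0: exactly what the brute-force search finds.
```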

As the name "primal" suggests, we will also have a *dual*
formulation of the optimization problem in (2). But before
we get to the dual formulation, we take a little segue into elementary
game theory.
