Learning provides an automated way to modify the agent's internal decision mechanisms so as to improve its performance.
It exposes the agent to reality rather than trying to hardcode reality into the agent's program.
More generally, learning is useful for any task where it is difficult to write a program that performs the task but easy to obtain examples of desired behavior.
class: middle
Machine learning
class: middle
.question[How would you write a computer program that recognizes cats from dogs?]
Let $\mathbf{d} \sim p(\mathbf{x}, y)$ be a dataset of $N$ example input-output pairs
$$\mathbf{d} = \\{ (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), ..., (\mathbf{x}_N, y_N) \\},$$
where $\mathbf{x}_i \in \mathbb{R}^d$ are $d$-dimensional vectors representing the input values and $y_i \in \mathcal{Y}$ are the corresponding output values.
From this data, we want to identify a probabilistic model $$p_\theta(y|\mathbf{x})$$ that best explains the data.
Linear regression assumes a linear Gaussian model for the conditional distribution $p(y|\mathbf{x})$, that is
$$p(y|\mathbf{x}) = \mathcal{N}(y | \mathbf{w}^T \mathbf{x} + b, \sigma^2),$$
where $\mathbf{w}$ and $b$ are parameters to determine.
To learn the conditional distribution $p(y|\mathbf{x})$, we maximize the likelihood of the data $\mathbf{d} = \{ (\mathbf{x}_j, y_j) \}$,
$$\prod_{j=1}^N p(y_j|\mathbf{x}_j) = \prod_{j=1}^N \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2}\frac{(y_j-(\mathbf{w}^T \mathbf{x}_j + b))^2}{\sigma^2}\right),$$
w.r.t. $\mathbf{w}$ and $b$.
--
count: false
By setting the derivatives of the log-likelihood to $0$, we arrive at the equivalent problem of minimizing
$$\mathcal{L}(\mathbf{w},b) = \sum_{j=1}^N (y_j - (\mathbf{w}^T \mathbf{x}_j + b))^2.$$
Therefore, minimizing the sum of squared errors corresponds to the MLE solution for a linear fit, assuming Gaussian noise of fixed variance.
class: middle
.center[
Minimizing the negative log-likelihood of a linear Gaussian model reduces to minimizing the sum of squared residuals.]
If we absorb the bias term $b$ into the weight vector $\mathbf{w}$ by adding a constant feature $x_0=1$ to the input vector $\mathbf{x}$, the solution $\mathbf{w}^*$ is given analytically by
$$\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y},$$
where $\mathbf{X}$ is the input matrix made of the stacked input vectors $\mathbf{x}_j$ (including the constant feature) and $\mathbf{y}$ is the output vector made of the output values $y_j$.
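As a quick sanity check (the data below is synthetic and purely illustrative, not from the slides), the closed-form solution can be computed with NumPy:

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus Gaussian noise (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(100)

# Absorb the bias b into w by prepending the constant feature x_0 = 1.
X = np.column_stack([np.ones(100), x])

# Closed-form MLE solution w* = (X^T X)^{-1} X^T y,
# computed via a linear solve rather than an explicit inverse.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```

With this data, `w_star` recovers the bias and slope up to the effect of noise.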
Logistic regression models the conditional as
$$P(Y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}+b),$$
where the sigmoid activation function
$\sigma(x) = \frac{1}{1 + \exp(-x)}$
looks like a soft Heaviside step function.
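A minimal numerical illustration of this squashing behavior (not part of the original slides):

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid: squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

# Far-negative inputs map near 0, zero maps to exactly 0.5,
# far-positive inputs map near 1: a smooth version of the Heaviside step.
values = sigmoid(np.array([-10.0, 0.0, 10.0]))
```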
???
This model is the core building block of deep neural networks!
class: middle
Following the principle of maximum likelihood estimation, we minimize the negative log-likelihood
$$\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^N \left[ y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i) \right],$$
where $\hat{y}_i = \sigma(\mathbf{w}^T \mathbf{x}_i + b)$.
This loss is an estimator of the cross-entropy $$H(p,q) = \mathbb{E}_p[-\log q]$$ for $p=Y|\mathbf{x}_i$ and $q=\hat{Y}|\mathbf{x}_i$.
Unfortunately, there is no closed-form solution for the MLE of $\mathbf{w}$ and $b$.
class: middle
Gradient descent
Let $\mathcal{L}(\theta)$ denote a loss function defined over model parameters $\theta$ (e.g., $\mathbf{w}$ and $b$).
To minimize $\mathcal{L}(\theta)$, gradient descent uses local linear information to iteratively move towards a (local) minimum.
Around $\theta_0$, a first-order approximation of the loss, regularized by a proximal term that penalizes large steps $\epsilon$, can be defined as
$$\hat{\mathcal{L}}(\epsilon; \theta_0) = \mathcal{L}(\theta_0) + \epsilon^T\nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{2\gamma}||\epsilon||^2.$$
class: middle
A minimizer of the approximation $\hat{\mathcal{L}}(\epsilon; \theta_0)$ satisfies
$$\nabla_\epsilon \hat{\mathcal{L}}(\epsilon; \theta_0) = \nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{\gamma} \epsilon = 0,$$
which results in the best improvement for the step $\epsilon = -\gamma \nabla_\theta \mathcal{L}(\theta_0)$.
Therefore, model parameters can be updated iteratively using the update rule
$$\theta_{t+1} = \theta_t -\gamma \nabla_\theta \mathcal{L}(\theta_t),$$
where $\theta_0$ are the initial parameters of the model and $\gamma$ is the learning rate.
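Putting the pieces together, here is a sketch of gradient descent applied to the logistic-regression negative log-likelihood, on assumed toy data (the dataset and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy separable data (assumed for illustration): label is 1 iff x1 + x2 > 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
Xb = np.hstack([np.ones((200, 1)), X])  # constant feature absorbs the bias b

theta = np.zeros(3)   # initial parameters theta_0
gamma = 0.5           # learning rate

for _ in range(500):
    # Gradient of the negative log-likelihood of logistic regression.
    grad = Xb.T @ (sigmoid(Xb @ theta) - y) / len(y)
    # Update rule: theta_{t+1} = theta_t - gamma * grad.
    theta = theta - gamma * grad

accuracy = np.mean((sigmoid(Xb @ theta) > 0.5) == y.astype(bool))
```

On this separable toy problem, the learned linear boundary classifies almost all training points correctly.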
class: center, middle
count: false
class: middle, center
(Step-by-step code example)
class: middle
Example: imitation learning in Pacman
Can we learn to play Pacman only from observations?
Feature vectors $\mathbf{x} = g(s)$ are extracted from the game states $s$. Output values $y$ correspond to actions $a$.
State-action pairs $(\mathbf{x}, y)$ are collected by observing an expert playing.
We want to learn the actions that the expert would take in a given situation. That is, learn the mapping $f:\mathbb{R}^d \to \mathcal{A}$.
This is a multiclass classification problem that can be solved by combining binary classifiers, e.g., in a one-vs-rest fashion.
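A minimal one-vs-rest sketch (the function names `train_binary`, `one_vs_rest`, and `predict` are illustrative, not from the slides): train one binary logistic classifier per action and predict with the most confident one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, t, gamma=0.5, steps=500):
    # One binary logistic-regression classifier, fit by gradient descent.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= gamma * X.T @ (sigmoid(X @ w) - t) / len(t)
    return w

def one_vs_rest(X, y, n_classes):
    # One classifier per class k, trained on the binary targets [y == k].
    return np.stack([train_binary(X, (y == k).astype(float))
                     for k in range(n_classes)])

def predict(W, X):
    # Pick the class whose classifier is most confident.
    return np.argmax(sigmoid(X @ W.T), axis=1)
```

In the Pacman setting, the classes would be the expert's actions and each row of `X` a feature vector $g(s)$.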