From d9a33e5214270fedbc499b2ef1971aa1dd1fe918 Mon Sep 17 00:00:00 2001 From: ZHANG Jing Date: Fri, 30 Aug 2024 17:11:31 +0200 Subject: [PATCH] udpate dl_scratch --- deep_learning_from_scratch.ipynb | 197 ++++++++++++++++++++++--------- 1 file changed, 143 insertions(+), 54 deletions(-) diff --git a/deep_learning_from_scratch.ipynb b/deep_learning_from_scratch.ipynb index 5af4a79..218079f 100644 --- a/deep_learning_from_scratch.ipynb +++ b/deep_learning_from_scratch.ipynb @@ -1802,7 +1802,8 @@ "plt.ylabel(\"accuracy\")\n", "plt.ylim(0, 1.0)\n", "plt.legend(loc='lower right')\n", - "plt.show()\n" + "plt.show()\n", + "# Guess why I only run one epoch? -- Because it's too slow!" ] }, { @@ -1851,7 +1852,7 @@ "* Back propagation of Addition Nodes \n", "Take $z=x+y$ as an example, the analytic derivatives of $x$ and $y$ are: \n", "$$\\frac{\\partial z}{\\partial x} = 1\\\\ \\frac{\\partial z}{\\partial y} = 1$$\n", - "The computational graph is as follows: \n", + "The computational graph of *addition* is as follows: \n", "![Addition nodes](./figures/dlscratch_computegraphchainruleadd.png)" ] }, @@ -1862,7 +1863,7 @@ "* Back propagation of Multiplication Nodes\n", "Take $z=xy$ as an example, the analytic derivatives of $x$ and $y$ are: \n", "$$\\frac{\\partial z}{\\partial x} = y\\\\ \\frac{\\partial z}{\\partial y} = x$$\n", - "The computational graph is as follows: \n", + "The computational graph of *multiplication* is as follows: \n", "![Product nodes](./figures/dlscratch_computegraphchainruleproduct.png)" ] }, @@ -1901,7 +1902,10 @@ ], "source": [ "# multiply layer and add layer\n", - "\n", + "# an example of buying apples: \n", + "# input: number of apples, unit price of an apple, tax of apples.\n", + "# output: total price a customer needs to pay. (forward propagation process)\n", + "# gradient of input variables, gradient of output. 
(back propagation process)\n", "class AddLayer:\n", " def __init__(self):\n", " pass\n", @@ -1980,6 +1984,7 @@ } ], "source": [ + "# add one more fruit -- oranges\n", "apple = 100\n", "apple_num = 2\n", "orange = 150\n", @@ -2021,8 +2026,20 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Implemention of activation function\n", - "ReLU(Rectified Linear Unit): \n", + "### Activation function in back propagation\n", + "Why activation functions are needed becomes clearer after working through back propagation (the process of calculating gradients/derivatives). Without an activation function: \n", + "1. Linear superposition: The output of each layer is the product of the input and the weight matrix plus the bias. Since there is no activation function to introduce nonlinearity, the output of the entire network is only a composition of linear transformations of the input, which is itself a single linear transformation.\n", + "2. Gradient vanishing or exploding: In a multi-layer network, if each layer is linear, the gradients are repeatedly multiplied by the weight matrices during back propagation. If the norm of the weight matrix is greater than 1, the gradient may grow exponentially (gradient explosion), while if the norm of the weight matrix is less than 1, the gradient may decay exponentially (gradient vanishing). This makes weight updates very difficult, thus hindering the training of deep networks.\n", + "3. Constant gradient: In the absence of an activation function, the derivative of each layer with respect to its input is constant, because the derivative of a linear layer depends only on its weights and not on the input. The gradients therefore carry no input-dependent information and are not adjusted for different layers of the network.\n", + "4. Lack of feature learning ability: Activation functions usually help the network learn useful feature representations from the input data. 
Without activation functions, the network cannot do this feature learning because it can only perform simple linear transformations." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "ReLU and Sigmoid \n", + "* ReLU (Rectified Linear Unit) \n", "$$\n", "y = \n", "\\begin{cases} \n", @@ -2038,16 +2055,32 @@ "0 & (x \\leq 0).\n", "\\end{cases}\n", "$$\n", - "computational graph \n", + "computational graph of ReLU function \n", "![computational graph](./figures/dlscratch_computegraphrelu.png)\n", "\n", - "Sigmoid: \n", + "* Sigmoid: \n", "$$y=\\frac{1}{1+\\exp(-x)}$$\n", + "the analytic derivative:\n", + "$$\\frac{\\partial y}{\\partial x} = y(1-y)$$\n", + "The computational graph of the Sigmoid function is a bit complicated, so let's first look at forward propagation, which is easy to understand \n", "![computational graph of sigmoid](./figures/dlscratch_computegraphsigmoid.png) \n", + "The computational graph of the Sigmoid function in back propagation is harder to understand unless a few analytic derivatives are recalled \n", "![computational graph of sigmoid](./figures/dlscratch_computegraphsigmoidback.png) \n", + "1. '/' means $y=\\frac{1}{x}$, so its analytic derivative is: $\\frac{\\partial y}{\\partial x}=-\\frac{1}{x^2}=-y^2$\n", + "2. '+' passes the upstream gradient through unchanged (e.g. $z=x+y$)\n", + "3. 'exp' means $y=e^x$, so its analytic derivative is still: $\\frac{\\partial y}{\\partial x}=e^x \\text{ or } \\exp (x)$, in this case, $\\exp (-x)$\n", + "4. '×' swaps the input values (e.g. 
$z=xy$), so here the upstream gradient is multiplied by $-1$\n", + "\n", "Simplified version \n", - "![computational graph of sigmoid](./figures/dlscratch_computegraphsigmoidbacksimple.png) ![computational graph of sigmoid](./figures/dlscratch_computegraphsigmoidbacksimple1.png) ![gradient of sigmoid](./figures/gradient_sigmoid.png)\n", - "\n" + "![computational graph of sigmoid](./figures/dlscratch_computegraphsigmoidbacksimple.png) ![computational graph of sigmoid](./figures/dlscratch_computegraphsigmoidbacksimple1.png) \n", + "Because the Sigmoid function is $y = \\frac{1}{1+\\exp(-x)}$, the final formula is as follows:\n", + "$$\\frac{\\partial L}{\\partial y} y^2 \\exp(-x) \n", + "= \\frac{\\partial L}{\\partial y} \\frac{1}{(1 + \\exp(-x))^2} \\exp(-x) \n", + "= \\frac{\\partial L}{\\partial y} \\frac{1}{1 + \\exp(-x)} \\frac{\\exp(-x)}{1 + \\exp(-x)} \n", + "= \\frac{\\partial L}{\\partial y} y(1-y)\n", + "$$\n", + "* The formula above is the product of the derivative of the loss function $L$ with respect to $y$ and the derivative of the activation function $y$ with respect to the input $x$\n", + "* Intuitively, the derivative of the sigmoid function tells us that when the input $x$ is very large or very small (strongly positive or strongly negative), the derivative becomes very small (the gradient vanishing problem), and when the input $x$ is close to 0, the derivative is largest (0.25 at $x=0$), which helps the model update the weights during training." ] }, { @@ -2087,22 +2120,46 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Implemention of Affine/Softmax\n", - "![affine](./figures/dlscratch_computegraphaffine.png)\n", - "![affinebatch](./figures/dlscratch_computegraphaffinebatch.png) \n", - "![softmaxlayer](./figures/dlscratch_softmaxlayer.png) \n", + "Softmax with loss \n", + "After back propagation through the softmax function and the cross entropy loss, the derivative looks clean and very intuitive (it is the difference between output and label). Why? 
\n", "![crossentropylayer](./figures/dlscratch_computegraphcrossentropy.png) \n", - "Cross entropy loss and mean square error loss are designed on purpose beause: \n", - "* derivatives\n", - "* derivatives\n" + "Because: \n", + "* Softmax function $y_i=\\frac{\\exp(a_i)}{\\sum_j\\exp(a_j)}$, derivative is $\\frac{\\partial y_j}{\\partial a_i}=y_j(\\delta_{ij}-y_i)$, where $\\delta_{ij}=1$ if $i=j$ and $0$ otherwise (so $\\frac{\\partial y_i}{\\partial a_i}=y_i(1-y_i)$)\n", + "* Cross entropy loss $L = -\\sum_i t_i\\log(y_i)$, derivative is $\\frac{\\partial L}{\\partial y_i}=-\\frac{t_i}{y_i}$\n", + "* Derivative of L and Softmax: $\\frac{\\partial L}{\\partial a_i}=\\sum_j\\frac{\\partial L}{\\partial y_j}\\cdot\\frac{\\partial y_j}{\\partial a_i}=-\\sum_j\\frac{t_j}{y_j}\\cdot y_j(\\delta_{ij}-y_i) = -t_i+y_i\\sum_j t_j = y_i-t_i$, using $\\sum_j t_j=1$ for a one-hot label\n", + "\n", + "Similarly, for the mean square error loss $L = \\sum_i(t_i-y_i)^2$, the derivative is $\\frac{\\partial L}{\\partial y_i}=2(y_i-t_i)$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Affine transformation\n", + "An affine layer is just a **matrix** operation on the input, weights and bias, nothing new. \n", + "The computational graph below is a batch-based affine layer, as a reminder that the layer stores whole matrices, not single elements. \n", + "Many layers in deep learning libraries are implemented on the principle of the affine transformation. 
\n", + "![affinebatch](./figures/dlscratch_computegraphaffinebatch.png) \n", + "With these pieces, we can build a simple network as shown below: affine layers, activation function layers, an output layer, etc. \n", + "![a simple network](./figures/dlscratch_softmaxlayer.png) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### TwoLayerNet case" ] }, { "cell_type": "code", - "execution_count": 90, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ + "import numpy as np\n", + "from collections import OrderedDict\n", + "\n", "class Affine:\n", " def __init__(self, W, b):\n", " self.W =W\n", @@ -2120,14 +2177,14 @@ " x = x.reshape(x.shape[0], -1)\n", " self.x = x\n", "\n", - " out = np.dot(self.x, self.W) + self.b\n", + " out = np.dot(self.x, self.W) + self.b # out=x⋅W+b\n", "\n", " return out\n", "\n", " def backward(self, dout):\n", " dx = np.dot(dout, self.W.T)\n", " self.dW = np.dot(self.x.T, dout)\n", - " self.db = np.sum(dout, axis=0)\n", + " self.db = np.sum(dout, axis=0) # out = x⋅W+b: b is added to every sample, so db sums dout over the batch\n", " \n", " dx = dx.reshape(*self.original_x_shape) # Reshape input data (tensor compatible)\n", " return dx\n", @@ -2147,24 +2204,17 @@ "\n", " def backward(self, dout=1):\n", " batch_size = self.t.shape[0]\n", - " if self.t.size == self.y.size: # When the training data is one-hot vector\n", - " dx = (self.y - self.t) / batch_size\n", + " # when the label is a one-hot vector\n", + " if self.t.size == self.y.size: \n", + " dx = (self.y - self.t) / batch_size # y-t\n", + " # when the label is an integer class index\n", " else:\n", - " dx = self.y.copy()\n", - " dx[np.arange(batch_size), self.t] -= 1\n", + " dx = self.y.copy() # copy so that self.y is not modified\n", + " dx[np.arange(batch_size), self.t] -= 1 # y-t: t is 1 only at the true class index\n", " dx = dx / batch_size\n", " \n", - " return dx" - ] - }, - { - "cell_type": "code", - "execution_count": 100, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "from collections 
import OrderedDict\n", + " return dx\n", + "\n", "def softmax(x):\n", " x = x - np.max(x, axis=-1, keepdims=True)\n", " return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)\n", @@ -2180,6 +2230,7 @@ "\n", " batch_size = y.shape[0]\n", " return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size\n", + "\n", "def numerical_gradient(f, x):\n", " h = 1e-4 # 0.0001\n", " grad = np.zeros_like(x)\n", @@ -2225,7 +2276,7 @@ " self.layers['Relu1'] = Relu()\n", " self.layers['Affine2'] = Affine(self.params['W2'], self.params['b2'])\n", "\n", - " self.lastLayer = SoftmaxWithLoss()\n", + " self.lossLayer = SoftmaxWithLoss()\n", " \n", " def predict(self, x):\n", " for layer in self.layers.values():\n", @@ -2235,7 +2286,7 @@ " \n", " def loss(self, x, t):\n", " y = self.predict(x)\n", - " return self.lastLayer.forward(y, t)\n", + " return self.lossLayer.forward(y, t)\n", " \n", " def accuracy(self, x, t):\n", " y = self.predict(x)\n", @@ -2246,6 +2297,7 @@ " return accuracy\n", " \n", " def compute_numerical_gradient(self, x, t):\n", + " # forward\n", " loss_W = lambda W: self.loss(x, t)\n", " \n", " grads = {}\n", @@ -2258,14 +2310,15 @@ " \n", " def backward_gradient(self, x, t):\n", " # forward\n", - " self.loss(x, t)\n", - "\n", + " y = self.predict(x)\n", + " self.lossLayer.forward(y, t) # loss value is stored\n", + " \n", " # backward\n", - " dout = 1\n", - " dout = self.lastLayer.backward(dout)\n", + " dout = 1 # initial gradient of the loss\n", + " dout = self.lossLayer.backward(dout)\n", " \n", " layers = list(self.layers.values())\n", - " layers.reverse()\n", + " layers.reverse() # back propagation starts from the last layer\n", " for layer in layers:\n", " dout = layer.backward(dout)\n", "\n", @@ -2278,28 +2331,28 @@ }, { "cell_type": "code", - "execution_count": 101, + "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Epoch 1/10 - train acc: 0.1457, test acc: 0.1493\n", - "Epoch 2/10 - train acc: 
0.2401, test acc: 0.2372\n", - "Epoch 3/10 - train acc: 0.3420, test acc: 0.3472\n", - "Epoch 4/10 - train acc: 0.3533, test acc: 0.3568\n", - "Epoch 5/10 - train acc: 0.4632, test acc: 0.4681\n", - "Epoch 6/10 - train acc: 0.4877, test acc: 0.4729\n", - "Epoch 7/10 - train acc: 0.5861, test acc: 0.5919\n", - "Epoch 8/10 - train acc: 0.6213, test acc: 0.6187\n", - "Epoch 9/10 - train acc: 0.6805, test acc: 0.6845\n", - "Epoch 10/10 - train acc: 0.7059, test acc: 0.7096\n" + "Epoch 1/10 - train acc: 0.1988, test acc: 0.2030\n", + "Epoch 2/10 - train acc: 0.1126, test acc: 0.1139\n", + "Epoch 3/10 - train acc: 0.3293, test acc: 0.3260\n", + "Epoch 4/10 - train acc: 0.4533, test acc: 0.4442\n", + "Epoch 5/10 - train acc: 0.5738, test acc: 0.5800\n", + "Epoch 6/10 - train acc: 0.5425, test acc: 0.5475\n", + "Epoch 7/10 - train acc: 0.5732, test acc: 0.5756\n", + "Epoch 8/10 - train acc: 0.5913, test acc: 0.5891\n", + "Epoch 9/10 - train acc: 0.6685, test acc: 0.6704\n", + "Epoch 10/10 - train acc: 0.7092, test acc: 0.7108\n" ] }, { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] @@ -2391,6 +2444,7 @@ ], "source": [ "# gradient check\n", + "# the meaning of numerical gradient is to check if the backward propagation is right, although it is very slow.\n", "import numpy as np\n", "from dataset.mnist import load_mnist\n", "\n", @@ -2416,6 +2470,41 @@ "## Tricks for learning" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Updating parameters (weights&bias)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Initializing weights" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Batch normalization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Normalization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Hyper-parameters" + ] + }, { "cell_type": "markdown", "metadata": {},