diff --git a/AI/Day2/logistic-regression.ipynb b/AI/Day2/logistic-regression.ipynb index 1d0e188..ee49b91 100644 --- a/AI/Day2/logistic-regression.ipynb +++ b/AI/Day2/logistic-regression.ipynb @@ -2,16 +2,17 @@ "cells": [ { "cell_type": "code", - "execution_count": null, + "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", - "from typing import Tuple\n", - "from torch import nn\n", - "import torch\n", - "from tqdm import tqdm" + "import sklearn.datasets\n", + "from sklearn.linear_model import SGDClassifier\n", + "\n", + "EPOCH = 100\n", + "LR = 0.1" ] }, { @@ -25,36 +26,59 @@ ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "x = (np.random.rand(1_000) * 10).round()\n", - "y = x % 2 == 0\n", - "x = torch.tensor(x, dtype=torch.float32)\n", - "y = torch.tensor(y, dtype=torch.float32)" + "Congratulations on building your first machine learning algorithm! You were probably getting really impatient to dive into AI. 
I hope you understand why we wanted to take the time to go through all the basics first, though: as you could probably tell, our Python and NumPy skills are going to prove really useful when building machine learning models.\n", + "\n", + "During the first module of the day, you got the gist of the main aspects of a machine learning pipeline:\n", + "- making a prediction\n", + "- computing the loss\n", + "- computing the gradients\n", + "- updating the weight and bias\n", + "\n", + "You'll find that this basic architecture is behind almost everything we'll be doing for the rest of the week.\n", + "\n", + "You will also find that this basic architecture can be reused, so that the developer can focus entirely on the parts that do change.\n", + "\n", + "For example, here's how **Linear Regression** can be written with one of the most popular ML libraries, **PyTorch**:\n", + "\n", + "```python\n", + "class LinearRegression(nn.Module):\n", + " def __init__(self):\n", + " self.fc = nn.Linear(***,***)\n", + " def forward(self, x):\n", + " return self.fc(x)\n", + "```\n", + "\n", + "And actually, here's the same for **Logistic Regression**, which is what we'll be implementing by hand in this module!\n", + "\n", + "```python\n", + "class LogisticRegression(nn.Module):\n", + " def __init__(self):\n", + " self.fc = nn.Linear(***,***)\n", + " def forward(self, x):\n", + " x = self.fc(x)\n", + " return F.sigmoid(x)\n", + "```\n", + "\n", + "Cool, right? Well, it might look great for Linear Regression, since you already know what's going on behind the scenes...\\\n", + "But unless you already know how Logistic Regression works, the code sample won't tell you anything!\n", + "\n", + "That's why we're taking the time to learn the (boring?) math behind these algorithms. 
It might be annoying at first, but I can assure you that understanding why we use Linear instead of Logistic Regression for certain tasks is much easier once you know how they work." ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "class NeuralNetwork(nn.Module):\n", - " def __init__(self):\n", - " super().__init__()\n", - " self.fc1 = nn.Linear(1000, 1000)\n", + "Before we begin, we're going to use an actual ML library!\n", "\n", - " def forward(self, x):\n", - " x = self.fc1(x)\n", - " x = nn.functional.sigmoid(x)\n", - " return x\n", - " \n", - "model = NeuralNetwork()\n", - "optimizer = torch.optim.SGD(model.parameters(), 0.1)\n", - "loss_fn = nn.BCELoss()" + "The library is called sklearn (scikit-learn), and it is a wonderful set of tools that can help while working on AI!\n", + "\n", + "In fact, sklearn has implementations of many algorithms, including Linear and Logistic Regression!\n", + "\n", + "It also provides us with plenty of tools to quickly generate and manipulate randomized data for training:" ] }, { @@ -63,113 +87,185 @@ "metadata": {}, "outputs": [], "source": [ - "for e in range(10):\n", - " optimizer.zero_grad()\n", - " y_pred = model.forward(x)\n", - " loss = loss_fn(y_pred, y)\n", - " loss.backward()\n", - " optimizer.step()" + "## Using `make_blobs()`, we generate a sample dataset with 1_000 entries, each with two features.\n", + "## With the `centers` parameter, we tell sklearn to separate the data into two main classes.\n", + "## Logistic Regression being a classifier model, we will use it to predict whether a data entry\n", + "## belongs to one class or the other!\n", + "x_train, y_train = sklearn.datasets.make_blobs(n_samples=1_000, n_features=2, centers=2)\n", + "\n", + "## This data doesn't mean anything, like Brad's problem in the last module, but if you're wondering\n", + "## how multiple features would translate into a 
real-world problem, imagine you had data on\n", + "## house prices and sizes, and you needed to predict whether Brad would be willing to buy the\n", + "## house or not. That would mean each data entry would have two features: the price and size of the\n", + "## house. The \"x\" would be an array of [price, size] and \"y\" would be a binary value (either true or false).\n", + "x_train.shape, y_train.shape" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "pred = model(x)\n", - "\n", - "(pred.round() == y).sum()" + "We'll use matplotlib to display our data in a nice way:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 63, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ - "EPOCH = 100\n", - "x_train = (np.random.rand(1_000) * 10).round().reshape(1,-1)\n", - "y_train = np.array(x % 2 == 0, dtype=np.int32)" + "plt.scatter(x_train[:,0], x_train[:,1])\n", + "plt.show()" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each entry is represented by a blue circle, and you can see two clearly separate groups of data." + ] + }, + { + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "global_preds = []\n", - "loss_history = []\n", + "Now, we'll use sklearn's `SGDClassifier` to train a logistic regression model on our generated data.\n", "\n", - "w = np.random.rand()\n", - "b = np.random.rand()\n", + "If you're curious, you might stumble upon `LogisticRegression` while browsing through [sklearn's docs](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning): it is another sklearn model, which implements the eponymous algorithm.\n", "\n", - "N = len(x_train)\n", - "LR = 0.0005\n", + "We use `SGDClassifier` instead because it trains the model with stochastic gradient descent, updating the weights step by step.\n", "\n", - "w,b" + ">SGD stands for 'stochastic gradient descent' btw" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 64, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100.0% accuracy\n" + ] + } + ], + "source": [ + "mdl = SGDClassifier(loss='log_loss', learning_rate='constant', eta0=LR, max_iter=EPOCH)  # 'log_loss' selects logistic regression (named 'log' before sklearn 1.1); 'constant' makes eta0 take effect\n", + "mdl.fit(x_train, y_train)\n", + "\n", + "plt.scatter(x_train[:,0], x_train[:,1], c=mdl.predict(x_train))\n", + "plt.show()\n", + "print(f\"{(mdl.predict(x_train) == y_train).mean()*100}% accuracy\")" + ] + }, + { + "cell_type": "code", + "execution_count": 65, "metadata": {}, "outputs": [], "source": [ - "def forward(x: np.array) -> np.array:\n", - " x = x * w + b\n", - " if x >= 0:\n", - " return 1 / (1 + np.exp(-x))\n", - " else:\n", - " return np.exp(x) / 1 + np.exp(x)\n", + "class MyLogisticRegression:\n", + " def __init__(self, max_iter=EPOCH, lr=LR):\n", + " self.epochs = max_iter\n", + " self.lr = lr\n", "\n", + " def fit(self, x: np.ndarray, y: np.ndarray):\n", + " self.w = np.zeros(x.shape[1])\n", + " self.b = 0\n", "\n", - "for e in tqdm(range(EPOCH)):\n", - " dl_dw = 0.0\n", - " dl_db = 0.0\n", + " for i in range(self.epochs):\n", + " y_pred = self.forward(x)\n", + " loss = self.bce(y_pred, y)\n", + " dw, db = self.backward(x, y)\n", + " self.optimize(dw, db)\n", "\n", - " for x, y in zip(x_train, y_train):\n", - " # Prediction\n", - " pred = forward(x).round()\n", + " def optimize(self, dw: np.ndarray, db: float):\n", + " self.w -= dw * self.lr\n", + " self.b -= db * self.lr\n", "\n", - " # Gradient descent\n", - " dl_dw += (forward(x) - y) * x\n", - " dl_db += forward(x) - y\n", + " def backward(self, x: np.ndarray, y: np.ndarray):\n", + " y_pred = self.forward(x)\n", + " db = np.mean(y_pred - y)\n", + " dw = x.T @ (y_pred - y) / len(y)  # average gradient over the samples\n", + " return dw, db\n", "\n", - " # Getting the average values\n", - " dl_dw *= 1 / N\n", - " dl_db *= 1 / N\n", + " def bce(self, y_pred: np.ndarray, y: np.ndarray):\n", + " return -np.mean(y * np.log(y_pred + 1e-9) + (1 - y) * np.log(1 - y_pred + 
1e-9))\n", "\n", - " # Optimization\n", - " w = w - LR * dl_dw\n", - " b = b - LR * dl_db\n", + " def forward(self, x: np.ndarray):\n", + " y_pred = self.linear(x)\n", + " return np.array([self.sigmoid(val) for val in y_pred])\n", "\n", - " # Logging loss\n", - " total_error = 0.0\n", - " for i in range(N):\n", - " total_error += y_train[i] * np.log(forward(x_train[i])) + (\n", - " 1 - y_train[i]\n", - " ) * np.log(forward(x_train[i]))\n", - " loss_history.append(total_error / N)\n", + " def linear(self, x: np.ndarray):\n", + " return self.w @ x.T + self.b\n", "\n", + " def sigmoid(self, x: np.ndarray):\n", + " if x >= 0:\n", + " z = np.exp(-x)\n", + " return 1 / (1 + z)\n", + " else:\n", + " z = np.exp(x)\n", + " return z / (1 + z)\n", "\n", - "plt.plot(loss_history)" + " def predict(self, x):\n", + " y_pred = self.forward(x)\n", + " return np.array([1 if p > 0.5 else 0 for p in y_pred])" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 66, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100.0% accuracy\n" + ] + } + ], "source": [ - "preds = []\n", + "mdl = MyLogisticRegression()\n", + "mdl.fit(x_train, y_train)\n", "\n", - "for x, y in zip(x_train, y_train):\n", - " print(x, y, forward(x))\n", - " preds.append(round(forward(x)))\n", + "plt.scatter(x_train[:,0], x_train[:,1], c=mdl.predict(x_train))\n", + "plt.show()\n", "\n", - "print(f\"{np.average(preds == y_train)*100:.2f}%\")" + "print(f\"{(mdl.predict(x_train) == y_train).mean()*100}% accuracy\")" ] }, {