This overview contains a dedicated section for each assignment, listing the main topics covered in it together with links to the notebooks where each topic is used. Additionally, each sub-section contains a "Topics Index" which aims to give a more in-depth, intuitive explanation of the most important topics covered in the assignment. This part also lists the core implementation sections in the assignment relevant to each topic, as well as relevant readings (taken from the course notes or elsewhere) that can be useful for gaining more context or a better understanding of the topic.
Assignment 1 - Solutions Directory
- Q1 - k-Nearest Neighbor Classifier
- Q2 - Training a Support Vector Machine
- Q3 - Train a Softmax Classifier
- Q4 - Train a Two Layer Fully Connected Neural Network
- Q5 - Higher Level Representations: Image Features
Assignment 2 - Solutions Directory
- Q1 - Multi Layer Fully Connected Neural Networks
- Q2 - Batch Normalization
- Q3 - Dropout
- Q4 - Convolutional Neural Networks
- Q5 - PyTorch on CIFAR-10
Assignment 3 - Solutions Directory
- Q1 - Network Visualization
- Q2 - Image Captioning with Vanilla RNNs (Recurrent Neural Networks)
- Q3 - Image Captioning with LSTMs (Long Short-Term Memory Networks)
- Q4 - Image Captioning with Transformers
- Q5 - GANs (Generative Adversarial Networks)
- Q6 - Self Supervised (Contrastive) Learning
Topic | Relevant Reading / Figures | Brief Intuitive Explanation | Implemented / Visualization |
---|---|---|---|
kNN Classifier | + cs231n notes + A Few Useful Things to Know About Machine Learning + Recognizing and Learning Object Categories + kNN classification using Neighbourhood Components Analysis + PCA (Principal Component Analysis) and SVD (Singular Value Decomposition) + Random Projections + Approximate Nearest Neighbor - FLANN | Training - This step can be replaced with storing all the available training data. k is a hyperparameter that needs to be tuned on the validation data.<br>Inference - The sample to classify is compared to all available training samples by computing the L1 distance $d_1(I_1, I_2) = \sum_p \lvert I_1^p - I_2^p \rvert$ (or alternatively the L2 distance), and the predicted class is the majority label among the k closest training samples (a minimal NumPy sketch follows the table). | k_nearest_neighbor.py |
SVM Classifier | + cs231n notes - Multiclass SVM Formulation + Inspiration paper for SVM formulation in the assignment + OVA (One-vs-All) SVM Formulation + AVA (All-vs-All) SVM Formulation + Structured SVM Formulation + Deep Learning using Support Vector Machines | Training - The training stage aims to minimize the error of the classifier by tweaking the model weights so that they simultaneously achieve the lowest possible loss over the samples in the training dataset. The loss for a sample is non-zero whenever the score of an incorrect class comes within the margin $\Delta$ of the correct class score. Additionally, a regularization term is commonly used to explicitly limit the complexity of the model and prevent over-fitting. Intuitively, the training stage aims to find support vectors that define decision boundaries between classes such that the distance between the support vector and the closest points of the classes it separates is at least the margin $\Delta$.<br>Activation Function - For the single sample case (at inference) it is defined as a matrix-vector multiplication $s = W x_i + b$.<br>Loss Function - For a single sample the loss function is known as "Hinge Loss" and is computed as $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$.<br>Inference - With the trained classifier, a class is assigned to a test sample based on the scores computed from $s = W x + b$: the class with the highest score is predicted (a vectorized loss sketch follows the table). | linear_svm.py, layers.py |
Softmax Classifier | + cs231n notes - Softmax Classifier | Training - Similarly to the SVM, the training stage aims to minimize the loss function. The main difference between the Softmax and the SVM classifiers arises in the interpretation of the scores obtained from the activation function and in the definition of the loss function.<br>Activation Function - The activation function is defined identically to the SVM for both the training and inference stages. However, the interpretation of the class scores for a sample is different: here the scores are interpreted as unnormalized log probabilities.<br>Loss Function - The loss function for this classifier is known as the "Cross-Entropy Loss" and is defined as $L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$.<br>Inference - With the trained classifier, a class is assigned to a test sample based on the scores computed from $s = W x + b$: the class with the highest (normalized) probability is predicted (a numerically stable loss sketch follows the table). | softmax.py, layers.py |
Backpropagation: Gradient Based Learning | + Optimization: Stochastic Gradient Descent + Backpropagation Intuitions + Automatic differentiation in machine learning: a survey + Vector, Matrix, and Tensor Derivatives + The Matrix Calculus You Need For Deep Learning | Intuition - Gradient based learning implies that an objective function is defined. This objective is usually intended to measure the error of a model prediction. The model prediction is a function of the model parameters and the training data, so the objective function is naturally also a function of the model parameters and the training data. Given that we want the model error to be as low as possible, we aim to minimize the objective that measures how wrong the model prediction is. Since the data is fixed and we don't want to change it, we can only achieve a lower error by tweaking the model parameters. To know exactly how the model parameters should be tweaked to minimize the objective, we use its gradient. The gradient can be interpreted as the direction of steepest ascent of the objective function, but since we are interested in minimizing the objective we use its opposite (negative) value to update the model parameters. This gives the decreasing direction of the objective function, along which the model parameters are updated to reduce the error.<br>Gradient Computation - The gradient of an expression can be computed either numerically (i.e. using the centered difference formula $\frac{f(x + h) - f(x - h)}{2h}$) or analytically (using calculus rules). It is good practice to perform a gradient check after implementing the analytic gradient of an expression: evaluate both the numerical and the analytical gradients and compare the results. We expect the difference (relative error) to be essentially zero, in practice on the order of $10^{-7}$ or smaller (a gradient check sketch follows the table).<br>Gradient Descent - This refers to the idea of computing the gradient of an expression and updating its parameters with the opposite of the gradient value (when the goal is to minimize the objective). Commonly, a scaled gradient value is used for the update; the scaling factor is known as the "learning rate", a hyperparameter that needs to be tuned. | |
FCNN (Fully Connected Neural Networks) - The Fully Connected Layer | + cs231n notes 1 + ImageNet Classification with Deep Convolutional Neural Networks + Delving Deep Into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification + Proof that Neural Networks with at least one hidden layer are universal approximators + Do Deep Nets Really Need to be Deep? + FitNets: Hints for Thin Deep Nets + The Loss Surfaces of Multilayer Networks + deeplearning.net tutorial + ConvNetJS Demo + Single Neuron Dynamics and Computation + cs231n notes 2 + Understanding the difficulty of training deep feedforward neural networks + Dropout: A Simple Way to Prevent Neural Networks from Overfitting + cs231n notes 3 - Contains a lot of valuable information and resources about training models from scratch. + Softmax Classifier vs FCNN: 2D Classification Example | Intuition - FCNNs (Fully Connected Neural Networks) can be seen as an extension of linear classifiers. Whereas a linear classifier computes the class scores with a single matrix multiplication (i.e. $s = W x$), a neural network interleaves several such transformations with non-linearities, e.g. $s = W_2 \max(0, W_1 x)$ for a two-layer network.<br>Data Preprocessing - The most common data preprocessing techniques are centering (mean subtraction) and normalization (either by dividing by the standard deviation or by transforming the data to have values in the range [-1, 1] using an affine transformation). Although not used with CNNs, other forms of data preprocessing are PCA (reducing the data dimensionality by selecting the k dimensions with the most variance) and whitening (dividing the data in its eigenbasis by the eigenvalues to squash the variance to the same scale across all dimensions).<br>Activation Functions - There are multiple activation functions that can be used in this structure, but by far the most common, proven and recommended one is the ReLU activation function, which has the form $f(x) = \max(0, x)$.<br>Loss Function - Multiple loss functions can be used in combination with the FCNN structure, with hinge loss and cross-entropy loss among the simplest examples. More complex objective functions can be defined depending on the task at hand. The loss function commonly contains two components: the data loss and the regularization loss. The data loss penalizes errors with respect to the data while the regularization loss penalizes the complexity of the model parameters.<br>Regularization - A counteraction to prevent the model from overfitting when training neural networks. Regularization can take multiple forms, such as additional terms in the loss function itself (L1 or L2 regularization) or various techniques employed during training such as max norm constraints or dropout. Note that only the model weights are commonly regularized while the bias terms are not. Additionally, all model layers are regularized in a similar manner; it is uncommon to regularize different layers by different amounts.<br>Training - During training, the first step is to compute the forward pass of the model (a sequence of matrix multiplications, each followed by a non-linearity, repeated according to the number of layers; a forward-pass sketch follows the table). The second step is to compute the value of the loss function to measure how wrong the model predictions were for each sample. To prevent overfitting, a regularization term is added to the loss function to penalize large values in the model parameters (similar to the regularization term in the SVM case). The next step is to figure out how to change the model parameters such that the loss function value decreases. This information is inferred from the gradient of the loss function with respect to the model parameters. Since the gradient indicates the direction of steepest ascent of the loss function, we use its opposite value to update the model parameters (i.e. values pointing towards the steepest descent), scaled by the learning rate to control the magnitude of the parameter update. | Two Layer FCNet, Multi Layer FCNet, Affine Layer Implementation (Forward/Backward Pass) |
CNN (Convolutional Neural Networks) - The Convolutional Layer | + cs231n notes + CNN EXPLAINER: Learning Convolutional Neural Networks with Interactive Visualization + A guide to convolution arithmetic for deep learning + Multi-Scale Context Aggregation By Dilated Convolutions - Introduces Dilated Convolutions + Network in Network - Introduces 1x1 Convolutions | Intuition - Compared to an FCNN, this type of model is useful when working with images because it preserves the spatial structure of the image and processes the image data with a reasonable number of model parameters. This is not possible with an FCNN because its design requires the input data to be organized as 1D vectors.<br>Activation Function - The activation functions used for CNNs are similar to the ones used for FCNNs, with the most common one being the ReLU activation function.<br>Loss Function - The loss functions used for CNNs are similar to the ones used for FCNNs, with the most common one being the Cross-Entropy loss function.<br>Forward Pass - During the forward pass the model parameters are combined with the data using the convolution operation and the activation function. In the single sample scenario, a 3D tensor (representing the image) is processed using the convolution operator with another (smaller) 3D tensor (representing the filter or kernel), resulting in a 3D output tensor (the activation volume). The spatial size of the activation volume depends on the properties of the filter and can be computed as $(W - F + 2P)/S + 1$, where $W$ is the input size, $F$ the filter size, $P$ the zero-padding and $S$ the stride (a naive forward-pass sketch follows the table). | Three Layer ConvNet, layers.py |
Pooling Layer | + cs231n notes | Intuition - The pooling layer is mainly responsible for reducing the spatial dimensions of the activation maps through a pooling operation. The most common pooling operation is by far Max Pooling, which reduces a given region to its maximum value. Additional options are Average Pooling or L2-norm Pooling, but they tend to be less effective than Max Pooling. Among the benefits of using pooling layers is the reduction of the spatial size of the representation, which lowers the amount of parameters and computation in subsequent layers and helps control overfitting.<br>Forward Pass - This operation involves sliding a fixed size window over the input data and performing the specified pooling operation at each position. The properties of the pooling operation are defined by the characteristics of the sliding window, such as its size (spatial extent) $F$ and stride $S$. The spatial size of the pooling output can thus be computed as $(W - F)/S + 1$ (a naive forward-pass sketch follows the table).<br>Note - Although the pooling operation is very common in CNNs, there are alternative approaches that have been shown to work without it, such as Striving for Simplicity: The All Convolutional Net. | layers.py |
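
The sketches below illustrate the core computations referenced in the table. They are minimal NumPy examples written under simplifying assumptions; function names, argument names and array shapes are illustrative and do not reproduce the assignments' exact interfaces. First, the kNN inference step: pairwise L1 distances followed by a majority vote over the k nearest neighbours.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict labels for X_test by majority vote over the k nearest
    training samples under the L1 (Manhattan) distance."""
    # Pairwise L1 distances, shape (num_test, num_train).
    # Fully broadcasted for brevity; memory-heavy for large datasets.
    dists = np.abs(X_test[:, None, :] - X_train[None, :, :]).sum(axis=2)
    # Indices of the k closest training samples for every test sample
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote over the neighbours' integer labels
    return np.array([np.bincount(y_train[idx]).argmax() for idx in nearest])
```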
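
A vectorized sketch of the multiclass hinge loss from the SVM row, assuming a margin $\Delta = 1$ and L2 regularization; the signature is illustrative and does not match linear_svm.py exactly.

```python
import numpy as np

def svm_loss(W, X, y, reg=1e-3, delta=1.0):
    """Multiclass SVM (hinge) loss: penalize every incorrect class whose
    score comes within `delta` of the correct class score."""
    N = X.shape[0]
    scores = X @ W                              # class scores, shape (N, C)
    correct = scores[np.arange(N), y][:, None]  # correct-class scores, (N, 1)
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(N), y] = 0                # the correct class is not counted
    data_loss = margins.sum() / N
    reg_loss = reg * np.sum(W * W)              # L2 regularization term
    return data_loss + reg_loss
```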
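
A matching sketch of the cross-entropy (softmax) loss from the Softmax row; the max-subtraction shift is the usual numerical-stability trick, and the signature is again an assumption rather than the assignment's interface.

```python
import numpy as np

def softmax_loss(W, X, y, reg=1e-3):
    """Cross-entropy loss: class scores are interpreted as unnormalized
    log probabilities and normalized with the softmax function."""
    N = X.shape[0]
    scores = X @ W                                   # class scores, (N, C)
    scores -= scores.max(axis=1, keepdims=True)      # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    data_loss = -np.log(probs[np.arange(N), y]).mean()
    return data_loss + reg * np.sum(W * W)           # add L2 regularization
```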
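
For the backpropagation row, a sketch of a centered-difference numerical gradient and the relative-error comparison used for gradient checking; a vanilla gradient descent step is then simply `w -= learning_rate * grad`. The helper names are illustrative.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h; fxph = f(x)       # evaluate f(x + h)
        x[i] = old - h; fxmh = f(x)       # evaluate f(x - h)
        x[i] = old                        # restore original value
        grad[i] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

def relative_error(grad_analytic, grad_numeric):
    """Maximum relative error between analytic and numerical gradients;
    values around 1e-7 or smaller suggest the analytic gradient is correct."""
    num = np.abs(grad_analytic - grad_numeric)
    den = np.maximum(1e-8, np.abs(grad_analytic) + np.abs(grad_numeric))
    return np.max(num / den)
```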
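
A minimal sketch of the two-layer fully connected forward pass described in the FCNN row (affine, ReLU, affine); the weight shapes (D, H) and (H, C) are assumptions.

```python
import numpy as np

def two_layer_forward(X, W1, b1, W2, b2):
    """Forward pass of a two-layer fully connected network:
    affine -> ReLU -> affine, returning the unnormalized class scores."""
    hidden = np.maximum(0, X @ W1 + b1)  # first affine layer + ReLU non-linearity
    scores = hidden @ W2 + b2            # output layer: raw class scores, no non-linearity
    return scores
```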
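
A naive single-image convolution forward pass for the CNN row, directly using the output-size formula $(W - F + 2P)/S + 1$; the channel-first layout and argument names are assumptions, and the loops are written for clarity rather than speed.

```python
import numpy as np

def conv_forward_naive(x, w, b, stride=1, pad=1):
    """Naive conv forward pass for one image.
    x: (C, H, W) input, w: (F, C, HH, WW) filters, b: (F,) biases."""
    C, H, W = x.shape
    F, _, HH, WW = w.shape
    # Output spatial size from the formula (W - F + 2P) / S + 1
    H_out = (H - HH + 2 * pad) // stride + 1
    W_out = (W - WW + 2 * pad) // stride + 1
    x_pad = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero-pad H and W only
    out = np.zeros((F, H_out, W_out))
    for f in range(F):                 # every filter produces one activation map
        for i in range(H_out):
            for j in range(W_out):
                region = x_pad[:, i * stride:i * stride + HH,
                                  j * stride:j * stride + WW]
                out[f, i, j] = np.sum(region * w[f]) + b[f]
    return out
```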
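
Finally, a naive max-pooling forward pass for the Pooling Layer row, with the output size given by $(W - F)/S + 1$; the channel-first layout is again an assumption.

```python
import numpy as np

def max_pool_forward(x, pool=2, stride=2):
    """Naive max-pooling forward pass for one activation volume x of
    shape (C, H, W); each window is reduced to its maximum value."""
    C, H, W = x.shape
    H_out = (H - pool) // stride + 1    # output size: (W - F) / S + 1
    W_out = (W - pool) // stride + 1
    out = np.zeros((C, H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            window = x[:, i * stride:i * stride + pool,
                          j * stride:j * stride + pool]
            out[:, i, j] = window.max(axis=(1, 2))
    return out
```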