When working with modern Deep Learning frameworks such as TensorFlow or PyTorch, it can sometimes be difficult to see what is actually going on under the hood. This repository contains easily understandable implementations of a Dense Layer, LeakyReLU & Sigmoid activation layers, as well as a Softmax-CrossEntropy loss, all for batched inputs. Batched input means that the implementation can process multiple images simultaneously, a feature that other educational implementations often omit for simplicity. In my opinion, the added complexity of describing the computations with batched tensors instead of for loops is well worth it, as it more closely mirrors how current Deep Learning frameworks work.
This implementation deliberately does not use any 3rd-party libraries or code (except for Catch2 as the unit-test framework). As an exercise, all components, including the matrix operations, have been written in C++ from scratch. Besides providing educational value, this project was also an exercise for me to keep my C++ from getting all too rusty (pun very much intended ... 🦀).
The achieved test accuracy is on par with TensorFlow; see the section on the performance comparison with TensorFlow below.
Building and running should be easy for anyone with a C++17 compatible toolchain.
- cmake >= 3.10
- make (or ninja)
- C++ compiler (C++17 support required)
git clone --recursive https://github.com/baurst/mlp_from_scratch_cpp.git
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release # for debug build use -DCMAKE_BUILD_TYPE=Debug
make -j4
# to run the tests
./test/tests
# download MNIST dataset as .csv (or alternatively from kaggle.com: https://www.kaggle.com/oddrationale/mnist-in-csv)
wget https://pjreddie.com/media/files/mnist_train.csv
wget https://pjreddie.com/media/files/mnist_test.csv
# train the classifier by providing the paths to the train and test datasets as arguments
./src/main mnist_train.csv mnist_test.csv
For each train step, the train method of the MLP is called with a mini-batch of flattened images and corresponding labels, called input & target.
float MLP::train(const Mat2D<float>& input, const Mat2D<float>& target,
const Loss& loss_obj, const float learning_rate) {
First, the forward pass is performed, i.e. the activations of each layer for the mini-batch "input" are stored like so:
const auto activations = this->forward(input);
const auto logits = activations.back();
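Conceptually, `forward` just threads the mini-batch through the layers while recording every intermediate result. The following is only a rough sketch (signature and details are inferred from the calling code above; the real implementation lives in this repository):

```cpp
// Sketch of the forward pass (details inferred from the calling code above):
// activations[0] is the network input, activations[k + 1] the output of layer k,
// so activations.back() holds the logits that are fed to the loss.
std::vector<Mat2D<float>> MLP::forward(const Mat2D<float>& input) {
  std::vector<Mat2D<float>> activations;
  activations.push_back(input);
  for (const auto& layer : this->layers) {
    activations.push_back(layer->forward(activations.back()));
  }
  return activations;
}
```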
In the backward pass, we propagate the gradient of the loss w.r.t. its input back through the network to update the weights. As the name suggests, backpropagation starts at the back of the computation graph, in our case at the loss layer. During the backward pass, each layer updates its trainable variables with the gradient (scaled by the learning rate) and returns the gradient of the loss w.r.t. its inputs:
auto grad = loss_obj.loss_grad(logits, target);
for (int32_t layer_idx = this->layers.size() - 1; layer_idx >= 0;
--layer_idx) {
const auto layer_input = activations[layer_idx];
grad = this->layers[layer_idx]->backward(layer_input, grad, learning_rate);
}
(For details on how each layer implements the backward method, see below or ./src/layer/layer.cpp.) The weight update for this mini-batch is now complete. Finally, the loss is returned for logging purposes.
const auto loss = loss_obj.loss(logits, target);
const auto avg_loss = loss.reduce_mean();
return avg_loss;
}
(Apologies to dark-mode users and to anyone whose Markdown viewer does not render the math below nicely.)
To optimize the network we need to compute the weight updates and propagate the gradient backwards.
Mat2D<float> DenseLayer::backward(const Mat2D<float>& input,
const Mat2D<float>& gradients_output,
const float learning_rate) {
The output (activation) of the Dense Layer is computed as

$$Y = XW + B$$

where $W$ and $B$ are the weight and bias matrices of the dense layer, $X$ is the batched input and $Y$ the batched output. Here $b$ denotes the batch size (the number of samples processed simultaneously), $i$ the number of input features per sample and $o$ the number of neurons in the layer, so $X$ has shape $b \times i$, $W$ has shape $i \times o$, $B$ has shape $1 \times o$ (broadcast over the batch) and $Y$ has shape $b \times o$.
During training, a scalar loss L is computed that we would like to minimize. For gradient based optimization we need to compute the gradients of the loss w.r.t. each layer input X, weights W and biases B for each layer in the network.
To backpropagate the loss gradient from the output of the layer (`gradients_output`) back to the previous layer (called `grad_input` from here on), we apply the chain rule. Each entry of `gradients_output` contains the partial derivative of the loss w.r.t. one entry of the output matrix $Y$. Consequently, we need to compute `grad_input`, which contains the partial derivatives of the loss w.r.t. the entries of the input matrix $X$:

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \, W^T$$
This conveniently translates to C++ like so:
const auto grad_input =
gradients_output.dot_product(this->weights.transpose());
To update the weights $W$ of the layer itself, the chain rule gives

$$\frac{\partial L}{\partial W} = X^T \, \frac{\partial L}{\partial Y}$$

or in code:
const auto grad_weights = input.transpose().dot_product(gradients_output);
and for the bias term the gradient is summed over the batch dimension,

$$\frac{\partial L}{\partial B} = \sum_{n=1}^{b} \left( \frac{\partial L}{\partial Y} \right)_{n,:}$$

which translates to
const auto grad_biases = gradients_output.reduce_sum_axis(0);
These gradients can then be scaled with the learning rate and subtracted from the current weight/bias matrices for simple gradient descent optimization.
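Written out, this is the plain gradient descent update, with $\eta$ denoting the learning rate:

$$W \leftarrow W - \eta \frac{\partial L}{\partial W}, \qquad B \leftarrow B - \eta \frac{\partial L}{\partial B}$$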
const auto weight_update = grad_weights.hadamard_product(learning_rate);
const auto bias_update = grad_biases.hadamard_product(learning_rate);
this->weights = this->weights.minus(weight_update);
this->biases = this->biases.minus(bias_update);
Finally, the gradient at the output of the previous layer is returned,
return grad_input;
}
so that the backpropagation algorithm can be continued all the way to the input of the network.
By the way: the activation functions are treated as separate layers, which in turn implement the forward & backward methods.
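For illustration, here is what a LeakyReLU layer boils down to, written as a self-contained sketch using plain `std::vector` instead of the repository's `Mat2D` type (so this is not the actual implementation; see ./src/layer/layer.cpp for that):

```cpp
#include <cstddef>
#include <vector>

// Simplified, stand-alone sketch of a LeakyReLU activation "layer".
// It has no trainable parameters, so its backward pass only scales the
// incoming gradient element-wise and needs no learning rate.
struct LeakyReLUSketch {
  float alpha = 0.1f;

  // forward: f(x) = x for x > 0, alpha * x otherwise
  std::vector<float> forward(const std::vector<float>& input) const {
    std::vector<float> output(input.size());
    for (std::size_t idx = 0; idx < input.size(); ++idx) {
      output[idx] = input[idx] > 0.0f ? input[idx] : alpha * input[idx];
    }
    return output;
  }

  // backward: dL/dx = dL/dy * f'(x), where f'(x) = 1 for x > 0, alpha otherwise
  std::vector<float> backward(const std::vector<float>& input,
                              const std::vector<float>& gradients_output) const {
    std::vector<float> grad_input(input.size());
    for (std::size_t idx = 0; idx < input.size(); ++idx) {
      const float local_grad = input[idx] > 0.0f ? 1.0f : alpha;
      grad_input[idx] = gradients_output[idx] * local_grad;
    }
    return grad_input;
  }
};
```

The Dense Layer's backward pass shown above follows the same pattern, except that it additionally uses the learning rate to update its weights and biases.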
As of now, the network achieves a test accuracy of 94.2% on the MNIST test set, using two Dense Layers with 50 and 25 neurons respectively, Leaky ReLU activation (alpha=0.1), SGD with learning rate 0.01 (with decay) and a batch size of 64. An identical classifier has been implemented using TensorFlow in ./plot/tf_mlp.py, achieving the same test accuracy.
To run the TensorFlow implementation and/or generate the comparison plot, follow the instructions given in ./plot/Readme.md.
Should you wish to add your own layer, simply write a class that inherits from the class Layer, implement the forward and backward pass, and drop it into your model. Have fun tinkering!