Why not add some didactic value to the MNIST ConvNN solution on pp. 473-482 (Chapter 14) by using a model of a size comparable to the densely connected solution? #138
pavlo-yanchenko started this conversation in Ideas
Replies: 1 comment
-
That's a great suggestion @pavlo-yanchenko. Going a step further, I think MNIST is not a great example for comparing MLPs and CNNs from a didactic point of view because MNIST is "so easy." I.e., you can achieve 93% accuracy with a logistic regression model already -- no hidden layers required. The only reason why MNIST is still used is that it is a) included in PyTorch and convenient and b) relatively intuitive. But yeah, not a great example for CNNs vs MLPs.
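(Not from the book -- just a minimal sketch to make the baseline above concrete: a logistic-regression run on MNIST in PyTorch. The batch size, learning rate, and epoch count here are arbitrary illustration choices, and the ~93% figure is from the comment above, not from this exact script.)

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_ds = datasets.MNIST(root="data", train=True, download=True,
                          transform=transforms.ToTensor())
test_ds = datasets.MNIST(root="data", train=False, download=True,
                         transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=1000)

# Logistic regression = a single linear layer; softmax is handled by the loss.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

with torch.no_grad():
    correct = sum((model(x).argmax(1) == y).sum().item() for x, y in test_loader)
print(f"test accuracy: {correct / len(test_ds):.4f}")
```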
-
There are two solutions for the standard MNIST handwritten digit recognition problem in the book.
The first uses a three-layer densely connected network, and the other uses a convolutional network. The text on page 473 implies a comparison between the two (which seems logical and didactic). However, the arbitrarily chosen hyperparameters of the convolutional network don't give readers a chance to fall in love with the beauty of CNNs.
Let me explain what I mean.
The first, densely connected MNIST classifier has 25,818 trainable parameters and can easily be trained on a CPU; my Mac needs only about 10 s per epoch. It achieves 96.69% accuracy.
The second, convolutional solution has a staggering 3,274,634 parameters -- more than 100 times as many as the first. Thanks to the computational efficiency of PyTorch, it takes only about 5 times longer to train (on a CPU). It achieves 99.07% accuracy.
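For readers who want to verify these numbers, the two architectures below are reconstructions that reproduce the quoted parameter counts (25,818 and 3,274,634); the layer details are inferred from those counts, so the book's exact code may differ in minor ways:

```python
import torch.nn as nn

def count_params(model):
    # Count only trainable parameters, matching the figures quoted above.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

mlp = nn.Sequential(                      # densely connected classifier
    nn.Flatten(),
    nn.Linear(28 * 28, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 10),
)

cnn = nn.Sequential(                      # book-style convolutional classifier
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

print(count_params(mlp))  # 25818
print(count_params(cnn))  # 3274634
```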
The example reads as if a 100+ times bigger model buys you about 2.4 percentage points of accuracy (96.69% -> 99.07%) on the MNIST task. Cool.
However, the second model seems to be a good illustration not of the value a convolutional approach can add, but of the rule of thumb from the cs231n course: "you should use as big of a neural network as your computational budget allows and use regularization techniques to control overfitting." In reality, this model is unreasonably, arbitrarily big (it is simply unclear why it needs 1024 units in the classifier), and comparable results can be achieved with a significantly smaller model.
For example, a model with 8 filters in the first convolutional layer, 16 filters in the second, and 32 units in the hidden layer of the classifier (28,874 parameters, i.e., very close to the densely connected solution) can achieve ~99.20% accuracy and demonstrates how convolutional layers help achieve better results at the same model size; a sketch of this architecture follows below. (This model and its results can be reviewed here.)
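For concreteness, here is one way the proposed smaller model could be written in PyTorch; the kernel size, padding, and pooling are assumptions chosen to match the book's CNN and to reproduce the 28,874-parameter count, and the linked notebook remains the authoritative version:

```python
import torch.nn as nn

# Smaller CNN: 8 -> 16 filters, 32-unit hidden layer in the classifier.
small_cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 32), nn.ReLU(),
    nn.Linear(32, 10),
)

n_params = sum(p.numel() for p in small_cnn.parameters() if p.requires_grad)
print(n_params)  # 28874
```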