Why not add some didactic value to the MNIST ConvNN solution on pp. 473-482 (Chapter 14) by using a model of a size comparable to the densely connected solution? #138
pavlo-yanchenko started this conversation in Ideas
Replies: 1 comment
-
That's a great suggestion @pavlo-yanchenko. Going a step further, I think MNIST is not a great example for comparing MLPs and CNNs from a didactic point of view because MNIST is "so easy." I.e., you can achieve 93% accuracy with a logistic regression model already -- no hidden layers required. The only reason why MNIST is still used is that it is a) included in PyTorch and convenient and b) relatively intuitive. But yeah, not a great example for CNNs vs MLPs.
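(Not from the book -- just a minimal sketch to make the baseline above concrete: a logistic-regression run on MNIST in PyTorch. The batch size, learning rate, and epoch count here are arbitrary illustration choices, and the ~93% figure is from the comment above, not from this exact script.)

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_ds = datasets.MNIST(root="data", train=True, download=True,
                          transform=transforms.ToTensor())
test_ds = datasets.MNIST(root="data", train=False, download=True,
                         transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=1000)

# Logistic regression = a single linear layer; softmax is handled by the loss.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

with torch.no_grad():
    correct = sum((model(x).argmax(1) == y).sum().item() for x, y in test_loader)
print(f"test accuracy: {correct / len(test_ds):.4f}")
```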
-
There are two solutions for the standard MNIST handwritten digit recognition problem in the book.
The first uses a three-layer densely connected network, and the other uses a convolutional network. The text on page 473 implies a comparison between the two (which seems logical and didactic). However, the arbitrarily chosen hyperparameters of the convolutional network don't give readers a chance to fall in love with the beauty of CNNs.
Let me explain what I mean.
The first, densely connected MNIST classifier has 25,818 trainable parameters and can easily be trained on a CPU; my Mac needs only about 10 s per epoch. It achieves 96.69% accuracy.
The second, convolutional solution has a staggering 3,274,634 parameters -- more than 100 times as many as the first. Thanks to the computational efficiency of PyTorch, it takes only about 5 times longer to train (on a CPU). It achieves 99.07% accuracy.
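For readers who want to verify these numbers, the two architectures below are reconstructions that reproduce the quoted parameter counts (25,818 and 3,274,634); the layer details are inferred from those counts, so the book's exact code may differ in minor ways:

```python
import torch.nn as nn

def count_params(model):
    # Count only trainable parameters, matching the figures quoted above.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

mlp = nn.Sequential(                      # densely connected classifier
    nn.Flatten(),
    nn.Linear(28 * 28, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 10),
)

cnn = nn.Sequential(                      # book-style convolutional classifier
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

print(count_params(mlp))  # 25818
print(count_params(cnn))  # 3274634
```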
The example reads as if a 100+ times bigger model buys you about 2.4 percentage points of accuracy (96.69% -> 99.07%) on the MNIST task. Cool.
However, the second model seems to be a good illustration not of the value a convolutional approach can add, but of the rule of thumb from the cs231n course: "you should use as big of a neural network as your computational budget allows and use regularization techniques to control overfitting." In reality, this model is unreasonably, arbitrarily big (it is simply unclear why it needs 1024 units in the classifier), and comparable results can be achieved with a significantly smaller model.
For example, a model with 8 filters in the first convolutional layer, 16 filters in the second, and 32 units in the hidden layer of the classifier (28,874 parameters, i.e., very close to the densely connected solution) can achieve ~99.20% accuracy and demonstrates how convolutional layers help achieve better results at the same model size; a sketch of this architecture follows below. (This model and its results can be reviewed here.)
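For concreteness, here is one way the proposed smaller model could be written in PyTorch; the kernel size, padding, and pooling are assumptions chosen to match the book's CNN and to reproduce the 28,874-parameter count, and the linked notebook remains the authoritative version:

```python
import torch.nn as nn

# Smaller CNN: 8 -> 16 filters, 32-unit hidden layer in the classifier.
small_cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 32), nn.ReLU(),
    nn.Linear(32, 10),
)

n_params = sum(p.numel() for p in small_cnn.parameters() if p.requires_grad)
print(n_params)  # 28874
```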