This work is based on "Learning both Weights and Connections for Efficient Neural Networks" by Song Han et al. (NIPS 2015). Note that it only aims to quantify the effect of pruning on latency (within TensorFlow); it is not an optimal result. Some details (e.g. the number of iterations, the adjusted dropout ratio) are therefore simplified.
I applied iterative pruning to a small MNIST CNN model (originally 13 MB), which is available from the TensorFlow tutorials. After pruning a given percentage of the weights, I retrained for just two epochs in each case and obtained compressed models (as small as 2.6 MB with 90% pruned) with a minor loss of accuracy (99.17% -> 98.99% with 90% pruned and retraining). Again, this is not optimal.
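The pruning step removes the weights with the smallest magnitudes and keeps a mask so they stay at zero during retraining. Below is a minimal sketch of that step in plain Python, so it runs without dependencies. The helper name `prune_by_magnitude` and the way the threshold is derived from a target fraction are assumptions for illustration, not the code in train.py (which takes its thresholds from config.py).

```python
def prune_by_magnitude(weights, fraction):
    """Zero out roughly `fraction` of the weights with the smallest magnitude.

    Returns (pruned_weights, mask); mask is 1.0 for kept weights, 0.0 for
    pruned ones. Ties at the threshold are all pruned, so slightly more than
    `fraction` may be removed. Hypothetical helper, for illustration only.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    n_prune = int(len(weights) * fraction)
    if n_prune == 0:
        return list(weights), [1.0] * len(weights)
    magnitudes = sorted(abs(w) for w in weights)
    # Largest magnitude that still gets pruned.
    threshold = magnitudes[n_prune - 1]
    mask = [0.0 if abs(w) <= threshold else 1.0 for w in weights]
    pruned = [w if keep else 0.0 for w, keep in zip(weights, mask)]
    return pruned, mask


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1, 0.9, 0.03]
pruned, mask = prune_by_magnitude(weights, 0.5)
print(pruned)
print(mask)
```

During retraining, the same mask would be multiplied into the weights (or their gradients) after every update so that pruned connections are not revived.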
Because TensorFlow (0.8.0) has limited support for SparseTensor and its operations, this implementation has some limitations. It uses embedding_lookup_sparse to compute sparse matrix-vector multiplication. That op is not designed solely for sparse matrix-vector multiplication, so its performance may be sub-optimal. (I'm not sure.) Also, TensorFlow represents a sparse matrix as <index, value> pairs rather than the typical CSR format, which is more compact and performant. In summary, the limitations are:
- embedding_lookup_sparse doesn't support broadcasting, which prevents users from running the test with the normal test datasets.
- Performance may be somewhat sub-optimal.
- Because a "Sparse Variable" is not supported, manual dense-to-sparse and sparse-to-dense transformation is required.
- Pruning may also be applicable to 4D convolution tensors, but that is a bit tricky.
- The current embedding_lookup_sparse forces an additional matrix transpose, dimension squeeze, and dimension reshape.
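To make the storage point concrete, the sketch below converts a small dense matrix to <index, value> pairs and to CSR, then computes a matrix-vector product from the pairs. It is plain Python and independent of TensorFlow; all function names are hypothetical and the matrix is a toy example.

```python
def dense_to_pairs(matrix):
    """Return ([(row, col), ...], [value, ...]) for the non-zero entries."""
    indices, values = [], []
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            if v != 0.0:
                indices.append((r, c))
                values.append(v)
    return indices, values


def dense_to_csr(matrix):
    """Return (row_ptr, col_idx, values) in CSR form."""
    row_ptr, col_idx, values = [0], [], []
    for row in matrix:
        for c, v in enumerate(row):
            if v != 0.0:
                col_idx.append(c)
                values.append(v)
        # row_ptr[i + 1] marks where row i ends in col_idx / values.
        row_ptr.append(len(values))
    return row_ptr, col_idx, values


def pairs_matvec(indices, values, n_rows, x):
    """Compute y = A x directly from the <index, value> pairs."""
    y = [0.0] * n_rows
    for (r, c), v in zip(indices, values):
        y[r] += v * x[c]
    return y


dense = [
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 3.0],
    [1.0, 0.0, 5.0, 4.0],
]
indices, values = dense_to_pairs(dense)
row_ptr, col_idx, csr_values = dense_to_csr(dense)

# Index entries stored per format: two per non-zero for the pairs,
# one per non-zero plus (rows + 1) row pointers for CSR.
pair_index_count = 2 * len(indices)
csr_index_count = len(col_idx) + len(row_ptr)
print(pair_index_count, csr_index_count)

y = pairs_matvec(indices, values, len(dense), [1.0, 1.0, 1.0, 1.0])
print(y)
```

The gap between the two index counts grows with the number of non-zeros per row, which is why CSR is the more compact choice for large weight matrices.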
model_ckpt_dense: original model
model_ckpt_dense_pruned: 90% pruned-only model
model_ckpt_sparse_retrained: 90% pruned and retrained model
sudo apt-get install python-scipy python-numpy python-matplotlib
To regenerate these sparse models, first edit config.py to set your threshold configuration, and then run training with the second (pruning and retraining) and third (generating the sparse form of the weight data) round options.
./train.py -2 -3
To run inference on a single image (seven.png) and measure its latency,
./deploy_test.py -d -m model_ckpt_dense
./deploy_test_sparse.py -d -m model_ckpt_sparse_retrained
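How the scripts time the forward pass is not described here. As a generic sketch of one way to measure single-call latency, the snippet below uses a few warm-up calls followed by the median of repeated timings. `measure_latency` and `fake_inference` are hypothetical names, and the workload is a stand-in rather than the actual model.

```python
import statistics
import time


def measure_latency(fn, warmup=2, repeats=10):
    """Return the median wall-clock time in seconds of one call to `fn`."""
    # Warm-up calls keep one-time setup cost out of the measurement.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_inference():
    # Stand-in workload; replace with the real single-image forward pass.
    return sum(i * i for i in range(10000))


latency = measure_latency(fake_inference)
print("median latency: %.6f s" % latency)
```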
To test the dense models,
./deploy_test.py -t -m model_ckpt_dense
./deploy_test.py -t -m model_ckpt_dense_pruned
./deploy_test.py -t -m model_ckpt_dense_retrained
To draw a histogram that shows the weight distribution,
# After running train.py (it generates .dat files)
./draw_histogram.py
Results are currently somewhat mediocre or even degraded, due to the indirection and additional storage overhead that come with the sparse matrix form. It may also be because the model is too small (12.49 MB).
Baseline: 12.49 MB
10 % pruned: 21.86 MB
20 % pruned: 19.45 MB
30 % pruned: 17.05 MB
40 % pruned: 14.64 MB
50 % pruned: 12.23 MB
60 % pruned: 9.83 MB
70 % pruned: 7.42 MB
80 % pruned: 5.02 MB
90 % pruned: 2.61 MB
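The sizes above shrink almost linearly with the pruned fraction, so a straight-line fit gives a rough estimate of where the sparse form starts to beat the dense baseline. The sketch below does that fit with a hand-written least-squares formula in plain Python. The linear model is an assumption made for illustration, and the variable names are hypothetical.

```python
# Sizes in MB copied from the list above, keyed by pruned fraction.
sizes = {
    0.1: 21.86, 0.2: 19.45, 0.3: 17.05, 0.4: 14.64, 0.5: 12.23,
    0.6: 9.83, 0.7: 7.42, 0.8: 5.02, 0.9: 2.61,
}
baseline_mb = 12.49

xs = sorted(sizes)
ys = [sizes[x] for x in xs]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least-squares fit: size ~ intercept + slope * pruned_fraction.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Pruned fraction at which the fitted sparse size equals the dense baseline.
break_even = (baseline_mb - intercept) / slope

print("extrapolated sparse size at 0%% pruned: %.2f MB" % intercept)
print("break-even pruned fraction: %.3f" % break_even)
```

Under this fit, the extrapolated sparse size with nothing pruned is roughly twice the dense baseline, and the break-even point falls just under 50% pruned. That matches the list, where the 50% model is the first one smaller than the baseline.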
CPU: Intel Core i5-2500 @ 3.3 GHz, LLC size: 6 MB
Baseline: 0.01118040085 s
10 % pruned: 1.919299984 s
20 % pruned: 0.2325239658 s
30 % pruned: 0.2111079693 s
40 % pruned: 0.1982570648 s
50 % pruned: 0.1691776752 s
60 % pruned: 0.1305227757 s
70 % pruned: 0.116039753 s
80 % pruned: 0.103564167 s
90 % pruned: 0.1058168888 s
GPU: Nvidia Geforce GTX650 @ 1.058 GHz, LLC size: 256 KB
Baseline: 0.1475181845 s
10 % pruned: 0.2954540253 s
20 % pruned: 0.2665398121 s
30 % pruned: 0.2585638046 s
40 % pruned: 0.2090051651 s
50 % pruned: 0.1995279789 s
60 % pruned: 0.1815193653 s
70 % pruned: 0.1436806202 s
80 % pruned: 0.135668993 s
90 % pruned: 0.1218701839 s