diff --git a/README.md b/README.md
index c513818..6071a45 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,9 @@
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![arXiv](https://img.shields.io/badge/cs.CV-%09arXiv%3A2011.14660-red)](https://arxiv.org/abs/2011.14660)
-# SplitNet: Divide and Co-training
+# Divide and Co-training
-SplitNet achieves 98.71% on CIFAR-10, 89.46% on CIFAR-100, and 83.60% on ImageNet (SE-ResNet-101, 64x4d, 320px)
+Divide and co-training achieves 98.71% on CIFAR-10, 89.46% on CIFAR-100, and 83.60% on ImageNet (SE-ResNet-101, 64x4d, 320px)
 by dividing one existing large network into several small ones and co-training.
 ## Table of Contents
@@ -31,31 +31,30 @@ by dividing one existing large network into several small ones and co-training.
 This is the code for the paper
-SplitNet: Divide and Co-training.
+
+Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training.
-The width of a neural network matters since increasing
-the width will necessarily increase the model capacity. However,
-the performance of a network does not improve linearly
-with the width and soon gets saturated. To tackle this problem,
-we propose to increase the number of networks rather
-than purely scaling up the width. To prove it, one large network
-is divided into several small ones, and each of these
-small networks has a fraction of the original one’s parameters.
-We then train these small networks together and make
-them see various views of the same data to learn different
-and complementary knowledge. During this co-training process,
-networks can also learn from each other. As a result,
-small networks can achieve better ensemble performance
+The width of a neural network matters since increasing the width
+will necessarily increase the model capacity.
+However, the performance of a network does not improve linearly
+with the width and soon gets saturated.
+In this case, we argue that increasing the number of networks (ensemble)
+can achieve better accuracy-efficiency trade-offs than purely increasing the width.
+To prove this,
+one large network is divided into several small ones
+in terms of its parameters and regularization components.
+Each of these small networks has a fraction of the original one's parameters.
+We then train these small networks together and make them see various
+views of the same data to increase their diversity.
+During this co-training process,
+networks can also learn from each other.
+As a result, small networks can achieve better ensemble performance
 than the large one with few or no extra parameters or FLOPs.
-This reveals that the number of networks is a new dimension
-of effective model scaling, besides depth/width/resolution.
 Small networks can also achieve faster inference speed
-than the large one by concurrent running on different devices.
-We validate the idea --- increasing the number of
-networks is a new dimension of effective model scaling ---
-with different network architectures on common benchmarks
-through extensive experiments.
+than the large one by running concurrently on different devices.
+We validate our argument with 8 different neural architectures on
+common benchmarks through extensive experiments.
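For readers skimming the patch, here is a minimal, hypothetical sketch (not code from this repository) of what "dividing one large network into several small ones and co-training" means in practice: each small network has a fraction of the original width, each sees its own view of the batch during training, and their softmax outputs are averaged at test time. `SmallNet`, `split_factor`, and the noise-based augmentation below are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    """A narrow classifier whose width is a fraction of the original network's."""
    def __init__(self, in_dim=3 * 32 * 32, width=128, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, width),
            nn.ReLU(inplace=True),
            nn.Linear(width, num_classes),
        )

    def forward(self, x):
        return self.net(x)

split_factor = 2  # number of small networks the "large" one is divided into
models = nn.ModuleList([SmallNet(width=512 // split_factor) for _ in range(split_factor)])

x = torch.randn(8, 3, 32, 32)  # one batch of images
# stand-in for per-network data augmentation: each small network gets its own view
views = [x + 0.1 * torch.randn_like(x) for _ in range(split_factor)]

# co-training forward pass: network i only sees view i
outputs = [m(v) for m, v in zip(models, views)]

# test-time ensemble: average the softmax probabilities of all small networks
ensemble_prob = torch.stack([F.softmax(o, dim=-1) for o in outputs], dim=0).mean(dim=0)
print(ensemble_prob.shape)  # torch.Size([8, 10])
```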
@@ -70,8 +69,8 @@ through extensive experiments.
 ## Features and TODO
-- [x] Support SplitNet with different models, i.e., ResNet, Wide-ResNet, ResNeXt, ResNeXSt, SENet,
-Shake-Shake, DenseNet, PyramidNet (+Shake-Drop), EfficientNet. Also support ResNeSt without SplitNet.
+- [x] Support divide and co-training with different models, i.e., ResNet, Wide-ResNet, ResNeXt, ResNeXSt, SENet,
+Shake-Shake, DenseNet, PyramidNet (+Shake-Drop), EfficientNet.
 - [x] Different data augmentation methods, i.e., mixup, random erasing, auto-augment, rand-augment, cutout
 - [x] Distributed training (tested with multi-GPUs on single machine)
 - [x] Multi-GPUs synchronized BatchNormalization
@@ -197,7 +196,7 @@ wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/
 - Download [The SVHN dataset](http://ufldl.stanford.edu/housenumbers/) (*Format 2: Cropped Digits*), put them in the `dataset/svhn` directory.
-- `cd` to `github` directory and clone the `SplitNet-Divide-and-Co-training` repo.
+- `cd` to the `github` directory and clone the `Divide-and-Co-training` repo.
 For brevity, rename it as `splitnet`.
@@ -291,9 +290,9 @@ Then run
 ## Citations
 ```
-@misc{2020_SplitNet,
+@misc{2020_splitnet,
   author = {Shuai Zhao and Liguang Zhou and Wenxiao Wang and Deng Cai and Tin Lun Lam and Yangsheng Xu},
-  title = {SplitNet: Divide and Co-training},
+  title = {Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training},
   howpublished = {arXiv},
   year = {2020}
 }
diff --git a/miscs/fig1_width.png b/miscs/fig1_width.png
index 64545f0..ef0c442 100644
Binary files a/miscs/fig1_width.png and b/miscs/fig1_width.png differ
diff --git a/miscs/fig2_framework.png b/miscs/fig2_framework.png
index 0b531de..cf34475 100644
Binary files a/miscs/fig2_framework.png and b/miscs/fig2_framework.png differ
diff --git a/miscs/fig3_latency.png b/miscs/fig3_latency.png
index 96f7e89..27be1a1 100644
Binary files a/miscs/fig3_latency.png and b/miscs/fig3_latency.png differ
diff --git a/miscs/res_cifar10.png b/miscs/res_cifar10.png
index 0bf465e..c2696f3 100644
Binary files a/miscs/res_cifar10.png and b/miscs/res_cifar10.png differ
diff --git a/miscs/res_cifar100.png b/miscs/res_cifar100.png
index f810f26..ffc100b 100644
Binary files a/miscs/res_cifar100.png and b/miscs/res_cifar100.png differ
diff --git a/miscs/res_imagenet.png b/miscs/res_imagenet.png
index 09a3e6a..558f780 100644
Binary files a/miscs/res_imagenet.png and b/miscs/res_imagenet.png differ
diff --git a/model/splitnet.py b/model/splitnet.py
index 8ee2b1a..43e8717 100644
--- a/model/splitnet.py
+++ b/model/splitnet.py
@@ -202,6 +202,7 @@ def __init__(self,
         self.models = nn.ModuleList(models)
         self.criterion = criterion
         if args.is_identical_init:
+            print("INFO:PyTorch: Using identical initialization.")
             self._identical_init()

         # data transform - use different transformers for different networks
@@ -222,6 +223,7 @@ def __init__(self,
         self.cot_weight_warm_up_epochs = args.cot_weight_warm_up_epochs
         # self.kl_temperature = args.kl_temperature
         self.cot_loss_choose = args.cot_loss_choose
+        print("INFO:PyTorch: The co-training loss is {}.".format(self.cot_loss_choose))
         self.num_classes = args.num_classes

     def forward(self, x, target=None, mode='train', epoch=0, streams=None):
@@ -335,6 +337,20 @@ def _co_training_loss(self, outputs, loss_choose, epoch=0):
             H_mean = (- p_mean * torch.log(p_mean)).sum(-1).mean()
             H_sep = (- p_all * F.log_softmax(outputs_all, dim=-1)).sum(-1).mean()
             cot_loss = weight_now * (H_mean - H_sep)
+
+        elif loss_choose == 'kl_seperate':
+            outputs_all = torch.stack(outputs, dim=0)
+            # repeat [1,2,3] like [1,1,2,2,3,3] and [2,3,1,3,1,2]
+            outputs_r1 = torch.repeat_interleave(outputs_all, self.split_factor - 1, dim=0)
+            index_list = [j for i in range(self.split_factor) for j in range(self.split_factor) if j != i]
+            outputs_r2 = torch.index_select(outputs_all, dim=0, index=torch.tensor(index_list, dtype=torch.long).cuda())
+            # calculate the KL divergence
+            kl_loss = F.kl_div(F.log_softmax(outputs_r1, dim=-1),
+                               F.softmax(outputs_r2, dim=-1).detach(),
+                               reduction='none')
+            # cot_loss = weight_now * (kl_loss.sum(-1).mean(-1).sum() / (self.split_factor - 1))
+            cot_loss = weight_now * (kl_loss.sum(-1).mean(-1).sum() / (self.split_factor - 1))
+
         else:
             raise NotImplementedError
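For reference, the `kl_seperate` branch added above can be read as the standalone sketch below. It assumes `outputs` is a list of `split_factor` logit tensors of shape `[batch, num_classes]`; the function name `pairwise_kl_cot_loss` and the fixed `weight_now` default are illustrative assumptions, the warm-up schedule is omitted, and the `.cuda()` call is replaced with a device-agnostic index tensor so the snippet also runs on CPU.

```python
import torch
import torch.nn.functional as F

def pairwise_kl_cot_loss(outputs, weight_now=1.0):
    """Sketch of the pairwise KL co-training term (not the repository's API)."""
    split_factor = len(outputs)
    outputs_all = torch.stack(outputs, dim=0)                     # [S, B, C]
    # repeat each network's logits S-1 times: [1,2,3] -> [1,1,2,2,3,3]
    outputs_r1 = torch.repeat_interleave(outputs_all, split_factor - 1, dim=0)
    # for every network i, list all partners j != i: [1,2,3] -> [2,3,1,3,1,2]
    index_list = [j for i in range(split_factor) for j in range(split_factor) if j != i]
    outputs_r2 = torch.index_select(
        outputs_all, dim=0,
        index=torch.tensor(index_list, dtype=torch.long, device=outputs_all.device))
    # F.kl_div(log_p_i, p_j) computes KL(p_j || p_i) with the partner j detached,
    # so gradients only flow into network i for each ordered pair (i, j)
    kl_loss = F.kl_div(F.log_softmax(outputs_r1, dim=-1),
                       F.softmax(outputs_r2, dim=-1).detach(),
                       reduction='none')                          # [S*(S-1), B, C]
    # sum over classes, average over the batch, sum over pairs, normalize by S-1
    return weight_now * (kl_loss.sum(-1).mean(-1).sum() / (split_factor - 1))

# usage with random logits for 3 networks
logits = [torch.randn(8, 10) for _ in range(3)]
print(pairwise_kl_cot_loss(logits))
```

Detaching the partner's softmax turns each pairwise term into a one-way distillation signal: every small network is pulled toward its peers' predictions, while the peers are not pulled back through that particular term.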