├── .gitignore
├── README.md
├── crd
│   ├── __init__.py
│   ├── criterion.py
│   └── memory.py
├── dataset
│   ├── __init__.py
│   ├── cifar100.py
│   └── imagenet.py
├── distiller_zoo
│   ├── AB.py
│   ├── AT.py
│   ├── CC.py
│   ├── FSP.py
│   ├── FT.py
│   ├── FitNet.py
│   ├── KD.py
│   ├── KDSVD.py
│   ├── NST.py
│   ├── PKT.py
│   ├── RKD.py
│   ├── SP.py
│   ├── VID.py
│   └── __init__.py
├── examples
│   ├── __init__.py
│   ├── cifar100.png
│   ├── fig1.png
│   └── imagenet.png
├── helper
│   ├── __init__.py
│   ├── loops.py
│   ├── pretrain.py
│   └── util.py
├── models
│   ├── ShuffleNetv1.py
│   ├── ShuffleNetv2.py
│   ├── __init__.py
│   ├── classifier.py
│   ├── mobilenetv2.py
│   ├── resnet.py
│   ├── resnetv2.py
│   ├── util.py
│   ├── vgg.py
│   └── wrn.py
├── scripts
│   ├── fetch_pretrained_teachers.sh
│   ├── run_cifar_distill.sh
│   └── run_cifar_vanilla.sh
├── supermix.py
├── train_student.py
└── train_teacher.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | data/
3 | output*/
4 | ckpts/
5 | *.pth
6 | *.t7
7 | *.png
8 | *.jpg
9 | tmp*.py
10 |
11 | *.pdf
12 |
13 |
14 | # Byte-compiled / optimized / DLL files
15 | __pycache__/
16 | *.py[cod]
17 | *$py.class
18 |
19 | # C extensions
20 | *.so
21 |
22 | # Distribution / packaging
23 | .Python
24 | build/
25 | develop-eggs/
26 | dist/
27 | downloads/
28 | eggs/
29 | .eggs/
30 | lib/
31 | lib64/
32 | parts/
33 | sdist/
34 | var/
35 | wheels/
36 | *.egg-info/
37 | .installed.cfg
38 | *.egg
39 | MANIFEST
40 |
41 | # PyInstaller
42 | # Usually these files are written by a python script from a template
43 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
44 | *.manifest
45 | *.spec
46 |
47 | # Installer logs
48 | pip-log.txt
49 | pip-delete-this-directory.txt
50 |
51 | # Unit test / coverage reports
52 | htmlcov/
53 | .tox/
54 | .coverage
55 | .coverage.*
56 | .cache
57 | nosetests.xml
58 | coverage.xml
59 | *.cover
60 | .hypothesis/
61 | .pytest_cache/
62 |
63 | # Translations
64 | *.mo
65 | *.pot
66 |
67 | # Django stuff:
68 | *.log
69 | local_settings.py
70 | db.sqlite3
71 |
72 | # Flask stuff:
73 | instance/
74 | .webassets-cache
75 |
76 | # Scrapy stuff:
77 | .scrapy
78 |
79 | # Sphinx documentation
80 | docs/_build/
81 |
82 | # PyBuilder
83 | target/
84 |
85 | # Jupyter Notebook
86 | .ipynb_checkpoints
87 |
88 | # pyenv
89 | .python-version
90 |
91 | # celery beat schedule file
92 | celerybeat-schedule
93 |
94 | # SageMath parsed files
95 | *.sage.py
96 |
97 | # Environments
98 | .env
99 | .venv
100 | env/
101 | venv/
102 | ENV/
103 | env.bak/
104 | venv.bak/
105 |
106 | # Spyder project settings
107 | .spyderproject
108 | .spyproject
109 |
110 | # Rope project settings
111 | .ropeproject
112 |
113 | # mkdocs documentation
114 | /site
115 |
116 | # mypy
117 | .mypy_cache/
118 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # SuperMix: Supervising the Mixing Data Augmentation
2 |
3 | 
4 |
5 |
6 |
7 | **PyTorch implementation of the [SuperMix paper](https://arxiv.org/abs/2003.05034), a supervised method for data augmentation (to appear in CVPR 2021).**
8 |
9 | ## Run SuperMix
10 |
11 | - Arguments are:
12 | * `--dataset`: specify the dataset, choices: `imagenet` or `cifar100`, default: `cifar100`.
13 | * `--model`: specify the supervisor for augmentation. For `cifar100`, all the models in 'models/\_\_init\_\_.py' can be used. For imagenet, all the models in `torchvision.models` can be used.
14 | * `--device`: specify the device, default: `cuda:0`.
15 | * `--save_dir`: the directory to save the mixed images.
16 | * `--input_dir`: the input directory of the imagenet dataset.
17 | * `--bs`: batch size, default: `100`.
18 | * `--aug_size`: number of mixed images to produce, default: `500000`.
19 | * `--k`: number of input images to be mixed, default: `2`.
20 | * `--max_iter`: maximum number of iterations on each batch, default: `50`.
21 | * `--alpha`: alpha value for the Dirichlet distribution, default: `3`.
22 | * `--sigma`: standard deviation of the Gaussian smoothing function, default: `1`.
23 | * `--w`: spatial size of the mixing masks, default: `8`.
24 | * `--lambda_s`: multiplier for the sparsity loss, default: `25`.
25 | * `--tol`: percentage of successful samples in the batch for early termination, default: `70`.
26 | * `--plot`: plot the mixed images after generation, default: `True`
27 |
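As an illustration of what `--k` and `--alpha` control, the sketch below draws `k` mixing weights from a symmetric Dirichlet distribution using only the Python standard library. The function name `sample_mixing_weights` is hypothetical and this is not the code in `supermix.py`; it only shows that such weights are positive and sum to one, and that a larger alpha concentrates them around `1/k`.

```python
import random


def sample_mixing_weights(k=2, alpha=3.0, rng=None):
    """Draw k positive weights that sum to 1 from a symmetric Dirichlet(alpha).

    Hypothetical helper for illustration only; not part of the repository.
    """
    rng = rng or random.Random()
    # A Dirichlet sample is a vector of independent Gamma(alpha, 1) draws
    # divided by their sum.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


# Two images (k=2) with the README's default alpha of 3.
weights = sample_mixing_weights(k=2, alpha=3.0, rng=random.Random(0))
print(weights, sum(weights))
```

Passing a seeded `random.Random` makes the draw reproducible, which can help when comparing runs with different `--alpha` values.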
28 |
29 | ### Run on the ImageNet data
30 | 1. Run supermix.py
31 | ```
32 | python3 supermix.py --dataset imagenet --model resnet34 --save_dir ./outputdir --bs 16 --aug_size 50000 --w 16 --sigma 2
33 | ```
34 | 2. Sample outputs
35 |
36 |
37 |
38 |
39 |
40 |
41 | ### Run on the CIFAR-100 data
42 |
43 | 1. Download the pretrained model by:
44 |
45 | ```
46 | sh scripts/fetch_pretrained_teachers.sh
47 | ```
48 | which saves the models to `save/models`
49 |
50 | 2. Run supermix.py
51 |
52 | ```
53 | python3 supermix.py --dataset cifar100 --model resnet110 --save_dir ./outputdir --bs 64 --aug_size 50000 --w 8 --sigma 1
54 | ```
55 |
56 | 3. Sample outputs
57 |
58 |
59 |
60 |
61 |
62 | ## Evaluating SuperMix for knowledge distillation and object classification
63 |
64 | **Code for the distillation is forked/copied from [the official code of CRD](https://github.com/HobbitLong/RepDistiller)**
65 |
66 | 1. Fetch the pretrained teacher models by:
67 |
68 | ```
69 | sh scripts/fetch_pretrained_teachers.sh
70 | ```
71 | which will download and save the models to `save/models`
72 |
73 | 2. Produce augmented data using SuperMix by:
74 |
75 | ```
76 | python3 supermix.py --dataset cifar100 --model resnet110 --save_dir ./output --bs 128 --aug_size 500000 --w 8 --sigma 1
77 | ```
78 |
79 | 3. Run the distillation
80 | - using cross-entropy (Equation 9 in the paper) by:
81 |
82 | ```
83 | python3 train_student.py --path_t ./save/models/resnet110_vanilla/ckpt_epoch_240.pth --model_s resnet20 --distill kd --model_s resnet8x4 -r 2.0 -a 0 -b 0 --aug_type supermix --aug_dir ./output --trial 1
84 | ```
85 | - using the original distillation objective proposed by Hinton et al. (Equation 8 in the paper) by:
86 |
87 | ```
88 | python3 train_student.py --path_t ./save/models/resnet110_vanilla/ckpt_epoch_240.pth --model_s resnet20 --distill kd --model_s resnet8x4 -r 1.8 -a 0.2 -b 0 --aug_type supermix --aug_dir ./output --trial 1
89 | ```
90 |
91 |
92 | - where the flags are explained as:
93 | - `--path_t`: specify the path of the teacher model
94 | - `--model_s`: specify the student model, see 'models/\_\_init\_\_.py' to check the available model types.
95 | - `--distill`: specify the distillation method
96 | - `-r`: the weight of the cross-entropy loss between logit and ground truth, default: `1`
97 | - `-a`: the weight of the KD loss, default: `None`
98 | - `-b`: the weight of other distillation losses, default: `None`
99 | - `--aug_type`: type of the augmentation, choices: `None`, `supermix`, `mixup`, `cutmix`.
100 | - `--aug_dir`: the directory of augmented images when `supermix` is selected for `aug_type`.
101 | - `--aug_alpha`: alpha for the Dirichlet distribution when `mixup` or `cutmix` is selected for `aug_type`.
102 | - `--trial`: specify the experimental id to differentiate between multiple runs.
103 |
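To make the roles of `-r`, `-a` and `-b` concrete, here is a minimal standard-library sketch. It assumes the three weights combine the losses linearly, which the flag descriptions suggest but this README does not state. The function names and the temperature of `4.0` are assumptions for illustration, not values taken from `train_student.py`.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of a list of logits at a given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Cross-entropy between the student prediction and the ground-truth label."""
    return -math.log(softmax(student_logits)[label])


def kd_divergence(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2
    as in Hinton et al. The default temperature is an assumption."""
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)
    kl = sum(t * math.log(t / s) for s, t in zip(p_s, p_t) if t > 0)
    return kl * temperature ** 2


def total_loss(student_logits, teacher_logits, label, r=1.0, a=0.0, b=0.0, other=0.0):
    """Assumed linear combination: r * CE + a * KD + b * (other distillation loss)."""
    return (r * cross_entropy(student_logits, label)
            + a * kd_divergence(student_logits, teacher_logits)
            + b * other)


student = [2.0, 0.5, -1.0]
teacher = [3.0, 0.0, -2.0]
print(total_loss(student, teacher, label=0, r=2.0, a=0.0))  # cross-entropy only
print(total_loss(student, teacher, label=0, r=1.8, a=0.2))  # cross-entropy plus KD
```

With `-a 0 -b 0` only the cross-entropy term remains, which matches the first command above; the second command adds a KD term with weight `0.2`.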
104 |
105 | 4. (optional) Train teacher networks from scratch. Example commands are in `scripts/run_cifar_vanilla.sh`
106 |
107 | Note: the default setting is for single-GPU training. If you would like to use this repo with multiple GPUs, you might need to tune the learning rate, which empirically needs to be scaled up linearly with the batch size; see [this paper](https://arxiv.org/abs/1706.02677).
108 |
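The linear scaling rule mentioned in the note can be sketched as follows. The helper name and the numbers in the example are hypothetical; the learning rate and batch size you scale from should come from your own single-GPU configuration.

```python
def scale_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: the learning rate grows in proportion to the batch size."""
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


# Hypothetical values: a rate tuned for batch size 64, reused at batch size 256.
print(scale_learning_rate(base_lr=0.05, base_batch_size=64, batch_size=256))
```

Treat the result as a starting point for tuning rather than a guaranteed setting.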
109 | ## Benchmark Results on CIFAR-100:
110 |
111 | Performance is measured by classification accuracy (%)
112 |
113 | | Teacher <br> Student | wrn-40-2 <br> wrn-16-2 | wrn-40-2 <br> wrn-40-1 | resnet56 <br> resnet20 | resnet110 <br> resnet20 | resnet110 <br> resnet32 | resnet32x4 <br> resnet8x4 | vgg13 <br> vgg8 |
114 | |:---------------:|:-----------------:|:-----------------:|:-----------------:|:------------------:|:------------------:|:--------------------:|:-----------:|
115 | | Teacher <br> Student | 75.61 <br> 73.26 | 75.61 <br> 71.98 | 72.34 <br> 69.06 | 74.31 <br> 69.06 | 74.31 <br> 71.14 | 79.42 <br> 72.50 | 74.64 <br> 70.36 |
116 | | KD | 74.92 | 73.54 | 70.66 | 70.67 | 73.08 | 73.33 | 72.98 |
117 | | FitNet | 73.58 | 72.24 | 69.21 | 68.99 | 71.06 | 73.50 | 71.02 |
118 | | AT | 74.08 | 72.77 | 70.55 | 70.22 | 72.31 | 73.44 | 71.43 |
119 | | SP | 73.83 | 72.43 | 69.67 | 70.04 | 72.69 | 72.94 | 72.68 |
120 | | CC | 73.56 | 72.21 | 69.63 | 69.48 | 71.48 | 72.97 | 70.71 |
121 | | VID | 74.11 | 73.30 | 70.38 | 70.16 | 72.61 | 73.09 | 71.23 |
122 | | RKD | 73.35 | 72.22 | 69.61 | 69.25 | 71.82 | 71.90 | 71.48 |
123 | | PKT | 74.54 | 73.45 | 70.34 | 70.25 | 72.61 | 73.64 | 72.88 |
124 | | AB | 72.50 | 72.38 | 69.47 | 69.53 | 70.98 | 73.17 | 70.94 |
125 | | FT | 73.25 | 71.59 | 69.84 | 70.22 | 72.37 | 72.86 | 70.58 |
126 | | FSP | 72.91 | 0.00 | 69.95 | 70.11 | 71.89 | 72.62 | 70.23 |
127 | | NST | 73.68 | 72.24 | 69.60 | 69.53 | 71.96 | 73.30 | 71.53 |
128 | | CRD | 75.48 | 74.14 | 71.16 | 71.46 | 73.48 | 75.51 | 73.94 |
129 | | CRD+KD | 75.64| 74.38| 71.63 | 71.56 | 73.75 | 75.46 | 74.29 |
130 | | ImgNet32| 74.91 | 74.80 | 71.38 | 71.48 | 73.17 | 75.57 | 73.95 |
131 | | MixUp| 76.20| 75.53 | 72.00 | 72.27 | 74.60 | 76.73 | 74.56 |
132 | | CutMix| 76.40 | 75.85 | 72.33 | 72.68 | 74.24 |76.81 | 74.87 |
133 | | SuperMix|**76.93**|**76.11**|**72.64**|**72.75** | **74.80** | **77.16** | **75.38** |
134 | | ImgNet32+KD| 76.52 | 75.70 | 72.22 | 72.23 | 74.24 | 76.46 | 75.02 |
135 | | MixUp+KD| 76.58 | 76.10 | 72.89 | 72.82 | 74.94 | 77.07 | 75.58 |
136 | | CutMix+KD| 76.81 | 76.45 | 72.67 | 72.83 | 74.87 | 76.90 | 75.50 |
137 | | SuperMix+KD| **77.45** |**76.53**| **73.19**| **72.96** | **75.21**| **77.59** | **76.03** |
138 |
139 | ## Questions
140 | If there is a question regarding any part of the code, or it needs further clarification, please create an issue or send me an email: ad0046@mix.wvu.edu.
141 |
142 | ## Citation
143 |
144 | If you found SuperMix helpful for your research, please cite our paper:
145 |
146 | ```
147 | @article{dabouei2020,
148 | title={SuperMix: Supervising the Mixing Data Augmentation},
149 | author={Dabouei, Ali and Soleymani, Sobhan and Taherkhani, Fariborz and Nasrabadi, Nasser M},
150 | journal={arXiv preprint arXiv:2003.05034},
151 | year={2020}
152 | }
153 | ```
154 | Moreover, if you are developing distillation methods, we encourage you to cite CRD, due to their notable contribution by benchmarking the state-of-the-art methods of distillation.
155 | ```
156 | @inproceedings{tian2019crd,
157 | title={Contrastive Representation Distillation},
158 | author={Yonglong Tian and Dilip Krishnan and Phillip Isola},
159 | booktitle={International Conference on Learning Representations},
160 | year={2020}
161 | }
162 | ```
163 |
164 |
--------------------------------------------------------------------------------
/crd/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alldbi/SuperMix/d63d25a6ff387640f4840faed97791b7c5badc5d/crd/__init__.py
--------------------------------------------------------------------------------
/crd/criterion.py:
--------------------------------------------------------------------------------
import torch
from torch import nn
from .memory import ContrastMemory

eps = 1e-7


class CRDLoss(nn.Module):
    """CRD Loss function
    includes two symmetric parts:
    (a) using teacher as anchor, choose positive and negatives over the student side
    (b) using student as anchor, choose positive and negatives over the teacher side

    Args:
        opt.s_dim: the dimension of student's feature
        opt.t_dim: the dimension of teacher's feature
        opt.feat_dim: the dimension of the projection space
        opt.nce_k: number of negatives paired with each positive
        opt.nce_t: the temperature
        opt.nce_m: the momentum for updating the memory buffer
        opt.n_data: the number of samples in the training set, therefore the memory buffer is: opt.n_data x opt.feat_dim
    """
    def __init__(self, opt):
        super(CRDLoss, self).__init__()
        self.embed_s = Embed(opt.s_dim, opt.feat_dim)
        self.embed_t = Embed(opt.t_dim, opt.feat_dim)
        self.contrast = ContrastMemory(opt.feat_dim, opt.n_data, opt.nce_k, opt.nce_t, opt.nce_m)
        self.criterion_t = ContrastLoss(opt.n_data)
        self.criterion_s = ContrastLoss(opt.n_data)

    def forward(self, f_s, f_t, idx, contrast_idx=None):
        """
        Args:
            f_s: the feature of student network, size [batch_size, s_dim]
            f_t: the feature of teacher network, size [batch_size, t_dim]
            idx: the indices of these positive samples in the dataset, size [batch_size]
            contrast_idx: the indices of negative samples, size [batch_size, nce_k]

        Returns:
            The contrastive loss
        """
        f_s = self.embed_s(f_s)
        f_t = self.embed_t(f_t)
        out_s, out_t = self.contrast(f_s, f_t, idx, contrast_idx)
        s_loss = self.criterion_s(out_s)
        t_loss = self.criterion_t(out_t)
        loss = s_loss + t_loss
        return loss


class ContrastLoss(nn.Module):
    """
    contrastive loss, corresponding to Eq (18)
    """
    def __init__(self, n_data):
        super(ContrastLoss, self).__init__()
        self.n_data = n_data

    def forward(self, x):
        bsz = x.shape[0]
        m = x.size(1) - 1

        # noise distribution
        Pn = 1 / float(self.n_data)

        # loss for positive pair
        P_pos = x.select(1, 0)
        log_D1 = torch.div(P_pos, P_pos.add(m * Pn + eps)).log_()

        # loss for K negative pair
        P_neg = x.narrow(1, 1, m)
        log_D0 = torch.div(P_neg.clone().fill_(m * Pn), P_neg.add(m * Pn + eps)).log_()

        loss = - (log_D1.sum(0) + log_D0.view(-1, 1).sum(0)) / bsz

        return loss


class Embed(nn.Module):
    """Embedding module"""
    def __init__(self, dim_in=1024, dim_out=128):
        super(Embed, self).__init__()
        self.linear = nn.Linear(dim_in, dim_out)
        self.l2norm = Normalize(2)

    def forward(self, x):
        # flatten to [batch_size, dim_in], project, then L2-normalize
        x = x.view(x.shape[0], -1)
        x = self.linear(x)
        x = self.l2norm(x)
        return x


class Normalize(nn.Module):
    """normalization layer"""
    def __init__(self, power=2):
        super(Normalize, self).__init__()
        self.power = power

    def forward(self, x):
        norm = x.pow(self.power).sum(1, keepdim=True).pow(1. / self.power)
        out = x.div(norm)
        return out
--------------------------------------------------------------------------------
/crd/memory.py:
--------------------------------------------------------------------------------
import torch
from torch import nn
import math


class ContrastMemory(nn.Module):
    """
    memory buffer that supplies a large number of negative samples.
    """
    def __init__(self, inputSize, outputSize, K, T=0.07, momentum=0.5):
        super(ContrastMemory, self).__init__()
        self.nLem = outputSize
        self.unigrams = torch.ones(self.nLem)
        self.multinomial = AliasMethod(self.unigrams)
        self.multinomial.cuda()
        self.K = K

        # params layout: [K, T, Z_v1, Z_v2, momentum]; Z values of -1 mean "not set yet"
        self.register_buffer('params', torch.tensor([K, T, -1, -1, momentum]))
        stdv = 1. / math.sqrt(inputSize / 3)
        self.register_buffer('memory_v1', torch.rand(outputSize, inputSize).mul_(2 * stdv).add_(-stdv))
        self.register_buffer('memory_v2', torch.rand(outputSize, inputSize).mul_(2 * stdv).add_(-stdv))

    def forward(self, v1, v2, y, idx=None):
        K = int(self.params[0].item())
        T = self.params[1].item()
        Z_v1 = self.params[2].item()
        Z_v2 = self.params[3].item()

        momentum = self.params[4].item()
        batchSize = v1.size(0)
        outputSize = self.memory_v1.size(0)
        inputSize = self.memory_v1.size(1)

        # original score computation
        if idx is None:
            idx = self.multinomial.draw(batchSize * (self.K + 1)).view(batchSize, -1)
            idx.select(1, 0).copy_(y.data)
        # sample
        weight_v1 = torch.index_select(self.memory_v1, 0, idx.view(-1)).detach()
        weight_v1 = weight_v1.view(batchSize, K + 1, inputSize)
        out_v2 = torch.bmm(weight_v1, v2.view(batchSize, inputSize, 1))
        out_v2 = torch.exp(torch.div(out_v2, T))
        # sample
        weight_v2 = torch.index_select(self.memory_v2, 0, idx.view(-1)).detach()
        weight_v2 = weight_v2.view(batchSize, K + 1, inputSize)
        out_v1 = torch.bmm(weight_v2, v1.view(batchSize, inputSize, 1))
        out_v1 = torch.exp(torch.div(out_v1, T))

        # set Z if haven't been set yet
        if Z_v1 < 0:
            self.params[2] = out_v1.mean() * outputSize
            Z_v1 = self.params[2].clone().detach().item()
            print("normalization constant Z_v1 is set to {:.1f}".format(Z_v1))
        if Z_v2 < 0:
            self.params[3] = out_v2.mean() * outputSize
            Z_v2 = self.params[3].clone().detach().item()
            print("normalization constant Z_v2 is set to {:.1f}".format(Z_v2))

        # compute out_v1, out_v2
        out_v1 = torch.div(out_v1, Z_v1).contiguous()
        out_v2 = torch.div(out_v2, Z_v2).contiguous()

        # update memory
        with torch.no_grad():
            l_pos = torch.index_select(self.memory_v1, 0, y.view(-1))
            l_pos.mul_(momentum)
            l_pos.add_(torch.mul(v1, 1 - momentum))
            l_norm = l_pos.pow(2).sum(1, keepdim=True).pow(0.5)
            updated_v1 = l_pos.div(l_norm)
            self.memory_v1.index_copy_(0, y, updated_v1)

            ab_pos = torch.index_select(self.memory_v2, 0, y.view(-1))
            ab_pos.mul_(momentum)
            ab_pos.add_(torch.mul(v2, 1 - momentum))
            ab_norm = ab_pos.pow(2).sum(1, keepdim=True).pow(0.5)
            updated_v2 = ab_pos.div(ab_norm)
            self.memory_v2.index_copy_(0, y, updated_v2)

        return out_v1, out_v2


class AliasMethod(object):
    """
    From: https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/
    """
    def __init__(self, probs):

        if probs.sum() > 1:
            probs.div_(probs.sum())
        K = len(probs)
        self.prob = torch.zeros(K)
        self.alias = torch.LongTensor([0]*K)

        # Sort the data into the outcomes with probabilities
        # that are larger and smaller than 1/K.
        smaller = []
        larger = []
        for kk, prob in enumerate(probs):
            self.prob[kk] = K*prob
            if self.prob[kk] < 1.0:
                smaller.append(kk)
            else:
                larger.append(kk)

        # Loop through and create little binary mixtures that
        # appropriately allocate the larger outcomes over the
        # overall uniform mixture.
        while len(smaller) > 0 and len(larger) > 0:
            small = smaller.pop()
            large = larger.pop()

            self.alias[small] = large
            self.prob[large] = (self.prob[large] - 1.0) + self.prob[small]

            if self.prob[large] < 1.0:
                smaller.append(large)
            else:
                larger.append(large)

        for last_one in smaller+larger:
            self.prob[last_one] = 1

    def cuda(self):
        self.prob = self.prob.cuda()
        self.alias = self.alias.cuda()

    def draw(self, N):
        """ Draw N samples from multinomial """
        K = self.alias.size(0)

        kk = torch.zeros(N, dtype=torch.long, device=self.prob.device).random_(0, K)
        prob = self.prob.index_select(0, kk)
        alias = self.alias.index_select(0, kk)
        # b is whether a random number is greater than q
        b = torch.bernoulli(prob)
        oq = kk.mul(b.long())
        oj = alias.mul((1-b).long())

        return oq + oj
--------------------------------------------------------------------------------
/dataset/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alldbi/SuperMix/d63d25a6ff387640f4840faed97791b7c5badc5d/dataset/__init__.py
--------------------------------------------------------------------------------
/dataset/cifar100.py:
--------------------------------------------------------------------------------
from __future__ import print_function

import os
import torch
import socket
import numpy as np
import torchvision
from helper.util import AugDataset
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from PIL import Image
from helper.util import plot_tensor

"""
mean = {
    'cifar100': (0.5071, 0.4867, 0.4408),
}

std = {
    'cifar100': (0.2675, 0.2565, 0.2761),
}
"""


class Datasubset(torch.utils.data.Dataset):
    def __init__(self, dataset, len):
        self.dataset = dataset
        self.len = len

    def __getitem__(self, i):
        return self.dataset[i % self.len]

    def __len__(self):
        return self.len  # max(len(d) for d in self.datasets)


class ConcatDataset(torch.utils.data.Dataset):
    def __init__(self, *datasets, len, opt):
        self.datasets = datasets
        self.len = len
        self.opt = opt

    def __getitem__(self, i):
        res = []
        for j, d in enumerate(self.datasets):
            l = min(len(d), self.len)
            # print(l)

            # for mixup/cutmix, offset the index into the second dataset
            # so that a sample is paired with a different image
            if (self.opt.aug_type == 'mixup' or self.opt.aug_type == 'cutmix') and j == 1:
                i += self.opt.batch_size * 10
            res.append(d[i % l])

        # return tuple(d[i % len(d)] for d in self.datasets)
        return tuple(res)

    def __len__(self):
        return self.len  # max(len(d) for d in self.datasets)


class DatasetMasked(torch.utils.data.Dataset):
    def __init__(self, dataset, opt):
        self.dataset = dataset
        self.len = len(dataset)
        self.opt = opt

    def __getitem__(self, i):
        res = self.dataset[i]  # 3x32x32
        mask = torch.zeros([32, 32]).type(torch.FloatTensor)

        # set a random square area in the mask to one
        lambda_aug = np.random.beta(self.opt.aug_alpha, self.opt.aug_alpha)

        s_w = int(32 * np.sqrt(1 - lambda_aug))
        if s_w == 32:
            s_w = 31

        rand = torch.randint(0, 32 - s_w, size=[2])
        mask[int(rand[0]):int(rand[0]) + s_w, int(rand[1]):int(rand[1]) + s_w] = 1
        mask = mask.view(1, 32, 32)
        # res.append(mask)
        return res + tuple(mask)  # append the mask to output

    def __len__(self):
        return self.len  # max(len(d) for d in self.datasets)


def get_data_folder():
    """
    return server-dependent path to store the data
    """
    hostname = socket.gethostname()
    if hostname.startswith('visiongpu'):
        data_folder = '/data/vision/phillipi/rep-learn/datasets'
    elif hostname.startswith('yonglong-home'):
        data_folder = '/home/yonglong/Data/data'
    else:
        data_folder = './data/'

    if not os.path.isdir(data_folder):
        os.makedirs(data_folder)

    return data_folder


class CIFAR100Instance(datasets.CIFAR100):
    """CIFAR100Instance Dataset.
    """

    def __getitem__(self, index):

        # if torch.__version__[0] == '0':

        # NOTE: `train_data`/`train_labels`/`test_data`/`test_labels` exist only in
        # older torchvision releases; newer ones expose `data` and `targets` instead.
        if self.train:
            img, target = self.train_data[index], self.train_labels[index]
        else:
            img, target = self.test_data[index], self.test_labels[index]
        # else:
        #     img, target = self.data[index], self.targets[index]

        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img)

        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target, index


def get_cifar100_dataloaders(opt, is_instance=False):
    """
    cifar 100
    """
    data_folder = get_data_folder()

    train_transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)),
    ])
    test_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)),
    ])

    if is_instance:
        train_set = CIFAR100Instance(root=data_folder,
                                     download=True,
                                     train=True,
                                     transform=train_transform)
        n_data = len(train_set)
    else:
        train_set = datasets.CIFAR100(root=data_folder,
                                      download=True,
                                      train=True,
                                      transform=train_transform)

    # prepare the augmentation dataset
    if opt.aug_type is not None:
        train_transform_aug = transforms.Compose([
            transforms.RandomCrop(32, padding=2),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)),
        ])

        if opt.aug_type == 'supermix':
            train_set_aug = torchvision.datasets.ImageFolder(
                root=opt.aug_dir,
                transform=train_transform_aug
            )
            if opt.aug_size == -1:
                # max(len(d) for d in self.datasets)
                opt.aug_size = max(len(train_set), len(train_set_aug))
                opt.aug_size -= opt.aug_size % 100
        elif opt.aug_type == 'mixup':
            train_set_aug = datasets.CIFAR100(root=data_folder,
                                              download=True,
                                              train=True,
                                              transform=train_transform)
            opt.aug_size = 50000
        elif opt.aug_type == 'cutmix':
            train_set_aug = datasets.CIFAR100(root=data_folder,
                                              download=True,
                                              train=True,
                                              transform=train_transform)
            # generate masks for the data
            train_set_aug = DatasetMasked(train_set_aug, opt=opt)
            opt.aug_size = 50000
        #
        # img = train_set.__getitem__(798)
        # plot_tensor([img[0]])
        # exit()

        train_loader = torch.utils.data.DataLoader(
            ConcatDataset(train_set, train_set_aug, len=opt.aug_size, opt=opt), batch_size=opt.batch_size, shuffle=True,
            num_workers=opt.num_workers, pin_memory=True)
    else:
        train_loader = DataLoader(train_set,
                                  batch_size=opt.batch_size,
                                  shuffle=True,
                                  num_workers=opt.num_workers)
        opt.aug_size = 50000

    test_set = datasets.CIFAR100(root=data_folder,
                                 download=True,
                                 train=False,
                                 transform=test_transform)
    test_loader = DataLoader(test_set,
                             batch_size=opt.batch_size,
                             shuffle=False,
                             num_workers=opt.num_workers)

    print("size of the augment set: ", opt.aug_size)

    if is_instance:
        return train_loader, test_loader, n_data
    else:
        return train_loader, test_loader


class CIFAR100InstanceSample(datasets.CIFAR100):
    """
    CIFAR100Instance+Sample Dataset
    """

    def __init__(self, root, train=True,
                 transform=None, target_transform=None,
                 download=False, k=4096, mode='exact', is_sample=True, percent=1.0):
        super().__init__(root=root, train=train, download=download,
                         transform=transform, target_transform=target_transform)
        self.k = k
        self.mode = mode
        self.is_sample = is_sample

        num_classes = 100
        if self.train:
            num_samples = len(self.train_data)
            label = self.train_labels
        else:
            num_samples = len(self.test_data)
            label = self.test_labels

        # indices of the samples belonging to each class
        self.cls_positive = [[] for i in range(num_classes)]
        for i in range(num_samples):
            self.cls_positive[label[i]].append(i)

        # indices of the samples belonging to every other class
        self.cls_negative = [[] for i in range(num_classes)]
        for i in range(num_classes):
            for j in range(num_classes):
                if j == i:
                    continue
                self.cls_negative[i].extend(self.cls_positive[j])

        self.cls_positive = [np.asarray(self.cls_positive[i]) for i in range(num_classes)]
        self.cls_negative = [np.asarray(self.cls_negative[i]) for i in range(num_classes)]

        if 0 < percent < 1:
            n = int(len(self.cls_negative[0]) * percent)
            self.cls_negative = [np.random.permutation(self.cls_negative[i])[0:n]
                                 for i in range(num_classes)]

        self.cls_positive = np.asarray(self.cls_positive)
        self.cls_negative = np.asarray(self.cls_negative)

    def __getitem__(self, index):
        if self.train:
            img, target = self.train_data[index], self.train_labels[index]
        else:
            img, target = self.test_data[index], self.test_labels[index]

        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img)

        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        if not self.is_sample:
            # directly return
            return img, target, index
        else:
            # sample contrastive examples
            if self.mode == 'exact':
                pos_idx = index
            elif self.mode == 'relax':
                pos_idx = np.random.choice(self.cls_positive[target], 1)
                pos_idx = pos_idx[0]
            else:
                raise NotImplementedError(self.mode)
            replace = True if self.k > len(self.cls_negative[target]) else False
            neg_idx = np.random.choice(self.cls_negative[target], self.k, replace=replace)
            sample_idx = np.hstack((np.asarray([pos_idx]), neg_idx))
            return img, target, index, sample_idx


def get_cifar100_dataloaders_sample(batch_size=128, num_workers=8, k=4096, mode='exact',
                                    is_sample=True, percent=1.0):
    """
    cifar 100
    """
    data_folder = get_data_folder()

    train_transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)),
    ])
    test_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)),
    ])

    train_set = CIFAR100InstanceSample(root=data_folder,
                                       download=True,
                                       train=True,
                                       transform=train_transform,
                                       k=k,
                                       mode=mode,
                                       is_sample=is_sample,
                                       percent=percent)
    n_data = len(train_set)
    train_loader = DataLoader(train_set,
                              batch_size=batch_size,
                              shuffle=True,
                              num_workers=num_workers)

    test_set = datasets.CIFAR100(root=data_folder,
                                 download=True,
                                 train=False,
                                 transform=test_transform)
    test_loader = DataLoader(test_set,
                             batch_size=int(batch_size / 2),
                             shuffle=False,
                             num_workers=int(num_workers / 2))

    return train_loader, test_loader, n_data
--------------------------------------------------------------------------------
/dataset/imagenet.py:
--------------------------------------------------------------------------------
1 | """
2 | get data loaders
3 | """
4 | from __future__ import print_function
5 |
6 | import os
7 | import socket
8 | import numpy as np
9 | from torch.utils.data import DataLoader
10 | from torchvision import datasets
11 | from torchvision import transforms
12 |
13 |
14 | def get_data_folder():
15 | """
16 | return server-dependent path to store the data
17 | """
18 | hostname = socket.gethostname()
19 | if hostname.startswith('visiongpu'):
20 | data_folder = '/data/vision/phillipi/rep-learn/datasets/imagenet'
21 | elif hostname.startswith('yonglong-home'):
22 | data_folder = '/home/yonglong/Data/data/imagenet'
23 | else:
24 | data_folder = './data/imagenet'
25 |
26 | if not os.path.isdir(data_folder):
27 | os.makedirs(data_folder)
28 |
29 | return data_folder
30 |
31 |
32 | class ImageFolderInstance(datasets.ImageFolder):
33 | """: Folder datasets which returns the index of the image as well::
34 | """
35 | def __getitem__(self, index):
36 | """
37 | Args:
38 | index (int): Index
39 | Returns:
40 | tuple: (image, target) where target is class_index of the target class.
41 | """
42 | path, target = self.imgs[index]
43 | img = self.loader(path)
44 | if self.transform is not None:
45 | img = self.transform(img)
46 | if self.target_transform is not None:
47 | target = self.target_transform(target)
48 |
49 | return img, target, index
50 |
51 |
52 | class ImageFolderSample(datasets.ImageFolder):
53 | """Folder dataset that returns (img, label, index, contrast_index).
54 | """
55 | def __init__(self, root, transform=None, target_transform=None,
56 | is_sample=False, k=4096):
57 | super().__init__(root=root, transform=transform, target_transform=target_transform)
58 |
59 | self.k = k
60 | self.is_sample = is_sample
61 |
62 | print('stage1 finished!')
63 |
64 | if self.is_sample:
65 | num_classes = len(self.classes)
66 | num_samples = len(self.samples)
67 | label = np.zeros(num_samples, dtype=np.int32)
68 | for i in range(num_samples):
69 | path, target = self.imgs[i]
70 | label[i] = target
71 |
72 | self.cls_positive = [[] for i in range(num_classes)]
73 | for i in range(num_samples):
74 | self.cls_positive[label[i]].append(i)
75 |
76 | self.cls_negative = [[] for i in range(num_classes)]
77 | for i in range(num_classes):
78 | for j in range(num_classes):
79 | if j == i:
80 | continue
81 | self.cls_negative[i].extend(self.cls_positive[j])
82 |
83 | self.cls_positive = [np.asarray(self.cls_positive[i], dtype=np.int32) for i in range(num_classes)]
84 | self.cls_negative = [np.asarray(self.cls_negative[i], dtype=np.int32) for i in range(num_classes)]
85 |
86 | print('dataset initialized!')
87 |
88 | def __getitem__(self, index):
89 | """
90 | Args:
91 | index (int): Index
92 | Returns:
93 | tuple: (image, target, index), plus sample_idx (the positive index followed by k negative indices) when is_sample is True.
94 | """
95 | path, target = self.imgs[index]
96 | img = self.loader(path)
97 | if self.transform is not None:
98 | img = self.transform(img)
99 | if self.target_transform is not None:
100 | target = self.target_transform(target)
101 |
102 | if self.is_sample:
103 | # sample contrastive examples
104 | pos_idx = index
105 | neg_idx = np.random.choice(self.cls_negative[target], self.k, replace=True)
106 | sample_idx = np.hstack((np.asarray([pos_idx]), neg_idx))
107 | return img, target, index, sample_idx
108 | else:
109 | return img, target, index
110 |
111 |
112 | def get_test_loader(dataset='imagenet', batch_size=128, num_workers=8):
113 | """get the test data loader"""
114 |
115 | if dataset == 'imagenet':
116 | data_folder = get_data_folder()
117 | else:
118 | raise NotImplementedError('dataset not supported: {}'.format(dataset))
119 |
120 | normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
121 | std=[0.229, 0.224, 0.225])
122 | test_transform = transforms.Compose([
123 | transforms.Resize(256),
124 | transforms.CenterCrop(224),
125 | transforms.ToTensor(),
126 | normalize,
127 | ])
128 |
129 | test_folder = os.path.join(data_folder, 'val')
130 | test_set = datasets.ImageFolder(test_folder, transform=test_transform)
131 | test_loader = DataLoader(test_set,
132 | batch_size=batch_size,
133 | shuffle=False,
134 | num_workers=num_workers,
135 | pin_memory=True)
136 |
137 | return test_loader
138 |
139 |
140 | def get_dataloader_sample(dataset='imagenet', batch_size=128, num_workers=8, is_sample=False, k=4096):
141 | """Data Loader for ImageNet"""
142 |
143 | if dataset == 'imagenet':
144 | data_folder = get_data_folder()
145 | else:
146 | raise NotImplementedError('dataset not supported: {}'.format(dataset))
147 |
148 | # add data transform
149 | normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
150 | std=[0.229, 0.224, 0.225])
151 | train_transform = transforms.Compose([
152 | transforms.RandomResizedCrop(224),
153 | transforms.RandomHorizontalFlip(),
154 | transforms.ToTensor(),
155 | normalize,
156 | ])
157 | test_transform = transforms.Compose([
158 | transforms.Resize(256),
159 | transforms.CenterCrop(224),
160 | transforms.ToTensor(),
161 | normalize,
162 | ])
163 | train_folder = os.path.join(data_folder, 'train')
164 | test_folder = os.path.join(data_folder, 'val')
165 |
166 | train_set = ImageFolderSample(train_folder, transform=train_transform, is_sample=is_sample, k=k)
167 | test_set = datasets.ImageFolder(test_folder, transform=test_transform)
168 |
169 | train_loader = DataLoader(train_set,
170 | batch_size=batch_size,
171 | shuffle=True,
172 | num_workers=num_workers,
173 | pin_memory=True)
174 | test_loader = DataLoader(test_set,
175 | batch_size=batch_size,
176 | shuffle=False,
177 | num_workers=num_workers,
178 | pin_memory=True)
179 |
180 | print('num_samples', len(train_set.samples))
181 | print('num_class', len(train_set.classes))
182 |
183 | return train_loader, test_loader, len(train_set), len(train_set.classes)
184 |
185 |
186 | def get_imagenet_dataloader(dataset='imagenet', batch_size=128, num_workers=16, is_instance=False):
187 | """
188 | Data Loader for imagenet
189 | """
190 | if dataset == 'imagenet':
191 | data_folder = get_data_folder()
192 | else:
193 | raise NotImplementedError('dataset not supported: {}'.format(dataset))
194 |
195 | normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
196 | std=[0.229, 0.224, 0.225])
197 | train_transform = transforms.Compose([
198 | transforms.RandomResizedCrop(224),
199 | transforms.RandomHorizontalFlip(),
200 | transforms.ToTensor(),
201 | normalize,
202 | ])
203 | test_transform = transforms.Compose([
204 | transforms.Resize(256),
205 | transforms.CenterCrop(224),
206 | transforms.ToTensor(),
207 | normalize,
208 | ])
209 |
210 | train_folder = os.path.join(data_folder, 'train')
211 | test_folder = os.path.join(data_folder, 'val')
212 |
213 | if is_instance:
214 | train_set = ImageFolderInstance(train_folder, transform=train_transform)
215 | n_data = len(train_set)
216 | else:
217 | train_set = datasets.ImageFolder(train_folder, transform=train_transform)
218 |
219 | test_set = datasets.ImageFolder(test_folder, transform=test_transform)
220 |
221 | train_loader = DataLoader(train_set,
222 | batch_size=batch_size,
223 | shuffle=True,
224 | num_workers=num_workers,
225 | pin_memory=True)
226 |
227 | test_loader = DataLoader(test_set,
228 | batch_size=batch_size,
229 | shuffle=False,
230 | num_workers=num_workers//2,
231 | pin_memory=True)
232 |
233 | if is_instance:
234 | return train_loader, test_loader, n_data
235 | else:
236 | return train_loader, test_loader
237 |
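The index bookkeeping in `ImageFolderSample` can be illustrated on a toy label list. The following is a minimal, dependency-free sketch rather than the repository's code: `build_negative_pools` and `sample_contrast_indices` are hypothetical helper names, and the stdlib `random` module stands in for `np.random.choice`.

```python
import random


def build_negative_pools(labels):
    """For each class, collect the indices of all samples from the other classes."""
    num_classes = max(labels) + 1
    positives = [[] for _ in range(num_classes)]
    for idx, lbl in enumerate(labels):
        positives[lbl].append(idx)
    negatives = []
    for c in range(num_classes):
        pool = [i for other in range(num_classes) if other != c
                for i in positives[other]]
        negatives.append(pool)
    return negatives


def sample_contrast_indices(index, labels, negatives, k, rng=random):
    """Return the anchor index followed by k negatives drawn with replacement."""
    pool = negatives[labels[index]]
    return [index] + [rng.choice(pool) for _ in range(k)]


# Toy dataset: five samples spread over three classes (assumed values).
labels = [0, 0, 1, 2, 1]
negatives = build_negative_pools(labels)
sample_idx = sample_contrast_indices(0, labels, negatives, k=4,
                                     rng=random.Random(0))
```

As in the dataset class, the first entry of `sample_idx` is the sample's own index and the remaining `k` entries all belong to other classes.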
--------------------------------------------------------------------------------
/distiller_zoo/AB.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch
4 | import torch.nn as nn
5 |
6 |
7 | class ABLoss(nn.Module):
8 | """Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons
9 | code: https://github.com/bhheo/AB_distillation
10 | """
11 | def __init__(self, feat_num, margin=1.0):
12 | super(ABLoss, self).__init__()
13 | self.w = [2**(i-feat_num+1) for i in range(feat_num)]
14 | self.margin = margin
15 |
16 | def forward(self, g_s, g_t):
17 | bsz = g_s[0].shape[0]
18 | losses = [self.criterion_alternative_l2(s, t) for s, t in zip(g_s, g_t)]
19 | losses = [w * l for w, l in zip(self.w, losses)]
20 | # loss = sum(losses) / bsz
21 | # loss = loss / 1000 * 3
22 | losses = [l / bsz for l in losses]
23 | losses = [l / 1000 * 3 for l in losses]
24 | return losses
25 |
26 | def criterion_alternative_l2(self, source, target):
27 | loss = ((source + self.margin) ** 2 * ((source > -self.margin) & (target <= 0)).float() +
28 | (source - self.margin) ** 2 * ((source <= self.margin) & (target > 0)).float())
29 | return torch.abs(loss).sum()
30 |
--------------------------------------------------------------------------------
/distiller_zoo/AT.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch.nn as nn
4 | import torch.nn.functional as F
5 |
6 |
7 | class Attention(nn.Module):
8 | """Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks
9 | via Attention Transfer
10 | code: https://github.com/szagoruyko/attention-transfer"""
11 | def __init__(self, p=2):
12 | super(Attention, self).__init__()
13 | self.p = p
14 |
15 | def forward(self, g_s, g_t):
16 | return [self.at_loss(f_s, f_t) for f_s, f_t in zip(g_s, g_t)]
17 |
18 | def at_loss(self, f_s, f_t):
19 | s_H, t_H = f_s.shape[2], f_t.shape[2]
20 | if s_H > t_H:
21 | f_s = F.adaptive_avg_pool2d(f_s, (t_H, t_H))
22 | elif s_H < t_H:
23 | f_t = F.adaptive_avg_pool2d(f_t, (s_H, s_H))
24 | else:
25 | pass
26 | return (self.at(f_s) - self.at(f_t)).pow(2).mean()
27 |
28 | def at(self, f):
29 | return F.normalize(f.pow(self.p).mean(1).view(f.size(0), -1))
30 |
--------------------------------------------------------------------------------
/distiller_zoo/CC.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch
4 | import torch.nn as nn
5 |
6 |
7 | class Correlation(nn.Module):
8 | """Correlation Congruence for Knowledge Distillation, ICCV 2019.
9 | The authors kindly shared the code with me. This is a modified version based on
10 | the original authors' implementation. Credit goes to the original authors."""
11 | def __init__(self):
12 | super(Correlation, self).__init__()
13 |
14 | def forward(self, f_s, f_t):
15 | delta = torch.abs(f_s - f_t)
16 | loss = torch.mean((delta[:-1] * delta[1:]).sum(1))
17 | return loss
18 |
19 |
20 | # class Correlation(nn.Module):
21 | # """Similarity-preserving loss"""
22 | # def __init__(self):
23 | # super(Correlation, self).__init__()
24 | #
25 | # def forward(self, f_s, f_t):
26 | # return self.similarity_loss(f_s, f_t)
27 | # # return [self.similarity_loss(f_s, f_t) for f_s, f_t in zip(g_s, g_t)]
28 | #
29 | # def similarity_loss(self, f_s, f_t):
30 | # bsz = f_s.shape[0]
31 | # f_s = f_s.view(bsz, -1)
32 | # f_t = f_t.view(bsz, -1)
33 | #
34 | # G_s = torch.mm(f_s, torch.t(f_s))
35 | # G_s = G_s / G_s.norm(2)
36 | # G_t = torch.mm(f_t, torch.t(f_t))
37 | # G_t = G_t / G_t.norm(2)
38 | #
39 | # G_diff = G_t - G_s
40 | # loss = (G_diff * G_diff).view(-1, 1).sum(0) / (bsz * bsz)
41 | # return loss
42 |
--------------------------------------------------------------------------------
/distiller_zoo/FSP.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import numpy as np
4 | import torch.nn as nn
5 | import torch.nn.functional as F
6 |
7 |
8 | class FSP(nn.Module):
9 | """A Gift from Knowledge Distillation:
10 | Fast Optimization, Network Minimization and Transfer Learning"""
11 | def __init__(self, s_shapes, t_shapes):
12 | super(FSP, self).__init__()
13 | assert len(s_shapes) == len(t_shapes), 'unequal length of feat list'
14 | s_c = [s[1] for s in s_shapes]
15 | t_c = [t[1] for t in t_shapes]
16 | if np.any(np.asarray(s_c) != np.asarray(t_c)):
17 | raise ValueError('num of channels not equal (error in FSP)')
18 |
19 | def forward(self, g_s, g_t):
20 | s_fsp = self.compute_fsp(g_s)
21 | t_fsp = self.compute_fsp(g_t)
22 | loss_group = [self.compute_loss(s, t) for s, t in zip(s_fsp, t_fsp)]
23 | return loss_group
24 |
25 | @staticmethod
26 | def compute_loss(s, t):
27 | return (s - t).pow(2).mean()
28 |
29 | @staticmethod
30 | def compute_fsp(g):
31 | fsp_list = []
32 | for i in range(len(g) - 1):
33 | bot, top = g[i], g[i + 1]
34 | b_H, t_H = bot.shape[2], top.shape[2]
35 | if b_H > t_H:
36 | bot = F.adaptive_avg_pool2d(bot, (t_H, t_H))
37 | elif b_H < t_H:
38 | top = F.adaptive_avg_pool2d(top, (b_H, b_H))
39 | else:
40 | pass
41 | bot = bot.unsqueeze(1)
42 | top = top.unsqueeze(2)
43 | bot = bot.view(bot.shape[0], bot.shape[1], bot.shape[2], -1)
44 | top = top.view(top.shape[0], top.shape[1], top.shape[2], -1)
45 |
46 | fsp = (bot * top).mean(-1)
47 | fsp_list.append(fsp)
48 | return fsp_list
49 |
--------------------------------------------------------------------------------
/distiller_zoo/FT.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch.nn as nn
4 | import torch.nn.functional as F
5 |
6 |
7 | class FactorTransfer(nn.Module):
8 | """Paraphrasing Complex Network: Network Compression via Factor Transfer, NeurIPS 2018"""
9 | def __init__(self, p1=2, p2=1):
10 | super(FactorTransfer, self).__init__()
11 | self.p1 = p1
12 | self.p2 = p2
13 |
14 | def forward(self, f_s, f_t):
15 | return self.factor_loss(f_s, f_t)
16 |
17 | def factor_loss(self, f_s, f_t):
18 | s_H, t_H = f_s.shape[2], f_t.shape[2]
19 | if s_H > t_H:
20 | f_s = F.adaptive_avg_pool2d(f_s, (t_H, t_H))
21 | elif s_H < t_H:
22 | f_t = F.adaptive_avg_pool2d(f_t, (s_H, s_H))
23 | else:
24 | pass
25 | if self.p2 == 1:
26 | return (self.factor(f_s) - self.factor(f_t)).abs().mean()
27 | else:
28 | return (self.factor(f_s) - self.factor(f_t)).pow(self.p2).mean()
29 |
30 | def factor(self, f):
31 | return F.normalize(f.pow(self.p1).mean(1).view(f.size(0), -1))
32 |
--------------------------------------------------------------------------------
/distiller_zoo/FitNet.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch.nn as nn
4 |
5 |
6 | class HintLoss(nn.Module):
7 | """FitNets: Hints for Thin Deep Nets, ICLR 2015"""
8 | def __init__(self):
9 | super(HintLoss, self).__init__()
10 | self.crit = nn.MSELoss()
11 |
12 | def forward(self, f_s, f_t):
13 | loss = self.crit(f_s, f_t)
14 | return loss
15 |
--------------------------------------------------------------------------------
/distiller_zoo/KD.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch.nn as nn
4 | import torch.nn.functional as F
5 |
6 |
7 | class DistillKL(nn.Module):
8 | """Distilling the Knowledge in a Neural Network"""
9 | def __init__(self, T):
10 | super(DistillKL, self).__init__()
11 | self.T = T
12 |
13 | def forward(self, y_s, y_t):
14 | p_s = F.log_softmax(y_s/self.T, dim=1)
15 | p_t = F.softmax(y_t/self.T, dim=1)
16 | loss = F.kl_div(p_s, p_t, reduction='sum') * (self.T**2) / y_s.shape[0]
17 | return loss
18 |
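The arithmetic behind `DistillKL` can be checked by hand on small inputs. The sketch below is a plain-Python illustration, not the repository's tensor implementation: `softmax` and `distill_kl` are hypothetical names, and logits are assumed to be nested lists of floats. It sums the KL terms over classes and over the batch, scales by `T**2`, and divides by the batch size, which is the same reduction the module applies.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-softened softmax of one logit vector."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_kl(student_logits, teacher_logits, T):
    """KL(teacher || student) on softened outputs, scaled by T**2, batch-averaged."""
    total = 0.0
    for y_s, y_t in zip(student_logits, teacher_logits):
        p_s = softmax(y_s, T)
        p_t = softmax(y_t, T)
        total += sum(pt * (math.log(pt) - math.log(ps))
                     for ps, pt in zip(p_s, p_t) if pt > 0)
    return total * T ** 2 / len(student_logits)
```

The loss is zero when student and teacher logits agree and positive otherwise; the `T**2` factor keeps gradient magnitudes comparable across temperatures.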
--------------------------------------------------------------------------------
/distiller_zoo/KDSVD.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch
4 | import torch.nn as nn
5 | import torch.nn.functional as F
6 |
7 |
8 | class KDSVD(nn.Module):
9 | """
10 | Self-supervised Knowledge Distillation using Singular Value Decomposition
11 | original Tensorflow code: https://github.com/sseung0703/SSKD_SVD
12 | """
13 | def __init__(self, k=1):
14 | super(KDSVD, self).__init__()
15 | self.k = k
16 |
17 | def forward(self, g_s, g_t):
18 | v_sb = None
19 | v_tb = None
20 | losses = []
21 | for i, f_s, f_t in zip(range(len(g_s)), g_s, g_t):
22 |
23 | u_t, s_t, v_t = self.svd(f_t, self.k)
24 | u_s, s_s, v_s = self.svd(f_s, self.k + 3)
25 | v_s, v_t = self.align_rsv(v_s, v_t)
26 | s_t = s_t.unsqueeze(1)
27 | v_t = v_t * s_t
28 | v_s = v_s * s_t
29 |
30 | if i > 0:
31 | s_rbf = torch.exp(-(v_s.unsqueeze(2) - v_sb.unsqueeze(1)).pow(2) / 8)
32 | t_rbf = torch.exp(-(v_t.unsqueeze(2) - v_tb.unsqueeze(1)).pow(2) / 8)
33 |
34 | l2loss = (s_rbf - t_rbf.detach()).pow(2)
35 | l2loss = torch.where(torch.isfinite(l2loss), l2loss, torch.zeros_like(l2loss))
36 | losses.append(l2loss.sum())
37 |
38 | v_tb = v_t
39 | v_sb = v_s
40 |
41 | bsz = g_s[0].shape[0]
42 | losses = [l / bsz for l in losses]
43 | return losses
44 |
45 | def svd(self, feat, n=1):
46 | size = feat.shape
47 | assert len(size) == 4
48 |
49 | x = feat.view(-1, size[1], size[2] * size[3]).transpose(-2, -1)
50 | u, s, v = torch.svd(x)
51 |
52 | u = self.removenan(u)
53 | s = self.removenan(s)
54 | v = self.removenan(v)
55 |
56 | if n > 0:
57 | u = F.normalize(u[:, :, :n], dim=1)
58 | s = F.normalize(s[:, :n], dim=1)
59 | v = F.normalize(v[:, :, :n], dim=1)
60 |
61 | return u, s, v
62 |
63 | @staticmethod
64 | def removenan(x):
65 | x = torch.where(torch.isfinite(x), x, torch.zeros_like(x))
66 | return x
67 |
68 | @staticmethod
69 | def align_rsv(a, b):
70 | cosine = torch.matmul(a.transpose(-2, -1), b)
71 | max_abs_cosine, _ = torch.max(torch.abs(cosine), 1, keepdim=True)
72 | mask = torch.where(torch.eq(max_abs_cosine, torch.abs(cosine)),
73 | torch.sign(cosine), torch.zeros_like(cosine))
74 | a = torch.matmul(a, mask)
75 | return a, b
76 |
--------------------------------------------------------------------------------
/distiller_zoo/NST.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch.nn as nn
4 | import torch.nn.functional as F
5 |
6 |
7 | class NSTLoss(nn.Module):
8 | """Like What You Like: Knowledge Distill via Neuron Selectivity Transfer"""
9 | def __init__(self):
10 | super(NSTLoss, self).__init__()
11 | pass
12 |
13 | def forward(self, g_s, g_t):
14 | return [self.nst_loss(f_s, f_t) for f_s, f_t in zip(g_s, g_t)]
15 |
16 | def nst_loss(self, f_s, f_t):
17 | s_H, t_H = f_s.shape[2], f_t.shape[2]
18 | if s_H > t_H:
19 | f_s = F.adaptive_avg_pool2d(f_s, (t_H, t_H))
20 | elif s_H < t_H:
21 | f_t = F.adaptive_avg_pool2d(f_t, (s_H, s_H))
22 | else:
23 | pass
24 |
25 | f_s = f_s.view(f_s.shape[0], f_s.shape[1], -1)
26 | f_s = F.normalize(f_s, dim=2)
27 | f_t = f_t.view(f_t.shape[0], f_t.shape[1], -1)
28 | f_t = F.normalize(f_t, dim=2)
29 |
30 | # set full_loss as False to avoid unnecessary computation
31 | full_loss = True
32 | if full_loss:
33 | return (self.poly_kernel(f_t, f_t).mean().detach() + self.poly_kernel(f_s, f_s).mean()
34 | - 2 * self.poly_kernel(f_s, f_t).mean())
35 | else:
36 | return self.poly_kernel(f_s, f_s).mean() - 2 * self.poly_kernel(f_s, f_t).mean()
37 |
38 | def poly_kernel(self, a, b):
39 | a = a.unsqueeze(1)
40 | b = b.unsqueeze(2)
41 | res = (a * b).sum(-1).pow(2)
42 | return res
--------------------------------------------------------------------------------
/distiller_zoo/PKT.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch
4 | import torch.nn as nn
5 |
6 |
7 | class PKT(nn.Module):
8 | """Probabilistic Knowledge Transfer for deep representation learning
9 | Code from author: https://github.com/passalis/probabilistic_kt"""
10 | def __init__(self):
11 | super(PKT, self).__init__()
12 |
13 | def forward(self, f_s, f_t):
14 | return self.cosine_similarity_loss(f_s, f_t)
15 |
16 | @staticmethod
17 | def cosine_similarity_loss(output_net, target_net, eps=0.0000001):
18 | # Normalize each vector by its norm
19 | output_net_norm = torch.sqrt(torch.sum(output_net ** 2, dim=1, keepdim=True))
20 | output_net = output_net / (output_net_norm + eps)
21 | output_net[output_net != output_net] = 0
22 |
23 | target_net_norm = torch.sqrt(torch.sum(target_net ** 2, dim=1, keepdim=True))
24 | target_net = target_net / (target_net_norm + eps)
25 | target_net[target_net != target_net] = 0
26 |
27 | # Calculate the cosine similarity
28 | model_similarity = torch.mm(output_net, output_net.transpose(0, 1))
29 | target_similarity = torch.mm(target_net, target_net.transpose(0, 1))
30 |
31 | # Scale cosine similarity to 0..1
32 | model_similarity = (model_similarity + 1.0) / 2.0
33 | target_similarity = (target_similarity + 1.0) / 2.0
34 |
35 | # Transform them into probabilities
36 | model_similarity = model_similarity / torch.sum(model_similarity, dim=1, keepdim=True)
37 | target_similarity = target_similarity / torch.sum(target_similarity, dim=1, keepdim=True)
38 |
39 | # Calculate the KL-divergence
40 | loss = torch.mean(target_similarity * torch.log((target_similarity + eps) / (model_similarity + eps)))
41 |
42 | return loss
43 |
--------------------------------------------------------------------------------
/distiller_zoo/RKD.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch
4 | import torch.nn as nn
5 | import torch.nn.functional as F
6 |
7 |
8 | class RKDLoss(nn.Module):
9 | """Relational Knowledge Distillation, CVPR2019"""
10 | def __init__(self, w_d=25, w_a=50):
11 | super(RKDLoss, self).__init__()
12 | self.w_d = w_d
13 | self.w_a = w_a
14 |
15 | def forward(self, f_s, f_t):
16 | student = f_s.view(f_s.shape[0], -1)
17 | teacher = f_t.view(f_t.shape[0], -1)
18 |
19 | # RKD distance loss
20 | with torch.no_grad():
21 | t_d = self.pdist(teacher, squared=False)
22 | mean_td = t_d[t_d > 0].mean()
23 | t_d = t_d / mean_td
24 |
25 | d = self.pdist(student, squared=False)
26 | mean_d = d[d > 0].mean()
27 | d = d / mean_d
28 |
29 | loss_d = F.smooth_l1_loss(d, t_d)
30 |
31 | # RKD Angle loss
32 | with torch.no_grad():
33 | td = (teacher.unsqueeze(0) - teacher.unsqueeze(1))
34 | norm_td = F.normalize(td, p=2, dim=2)
35 | t_angle = torch.bmm(norm_td, norm_td.transpose(1, 2)).view(-1)
36 |
37 | sd = (student.unsqueeze(0) - student.unsqueeze(1))
38 | norm_sd = F.normalize(sd, p=2, dim=2)
39 | s_angle = torch.bmm(norm_sd, norm_sd.transpose(1, 2)).view(-1)
40 |
41 | loss_a = F.smooth_l1_loss(s_angle, t_angle)
42 |
43 | loss = self.w_d * loss_d + self.w_a * loss_a
44 |
45 | return loss
46 |
47 | @staticmethod
48 | def pdist(e, squared=False, eps=1e-12):
49 | e_square = e.pow(2).sum(dim=1)
50 | prod = e @ e.t()
51 | res = (e_square.unsqueeze(1) + e_square.unsqueeze(0) - 2 * prod).clamp(min=eps)
52 |
53 | if not squared:
54 | res = res.sqrt()
55 |
56 | res = res.clone()
57 | res[range(len(e)), range(len(e))] = 0
58 | return res
59 |
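The `pdist` helper relies on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, and the distance loss then divides by the mean of the positive entries. A minimal plain-Python sketch of both steps follows; `pairwise_distances` and `normalize_by_mean` are hypothetical names, and embeddings are assumed to be lists of float vectors rather than tensors.

```python
import math


def pairwise_distances(rows, eps=1e-12):
    """Euclidean distance matrix via the Gram identity, with a zero diagonal."""
    sq = [sum(x * x for x in r) for r in rows]
    n = len(rows)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the diagonal is forced to zero, as in pdist
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            # clamp before the square root to guard against negative round-off
            out[i][j] = math.sqrt(max(sq[i] + sq[j] - 2 * dot, eps))
    return out


def normalize_by_mean(dist):
    """Divide every entry by the mean of the strictly positive distances."""
    positive = [d for row in dist for d in row if d > 0]
    mean = sum(positive) / len(positive)
    return [[d / mean for d in row] for row in dist]


# Three collinear points (assumed toy values).
d = pairwise_distances([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d_norm = normalize_by_mean(d)
```

Normalizing by the mean distance makes the student and teacher matrices comparable even when their embedding scales differ.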
--------------------------------------------------------------------------------
/distiller_zoo/SP.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch
4 | import torch.nn as nn
5 | import torch.nn.functional as F
6 |
7 |
8 | class Similarity(nn.Module):
9 | """Similarity-Preserving Knowledge Distillation, ICCV2019, verified by original author"""
10 | def __init__(self):
11 | super(Similarity, self).__init__()
12 |
13 | def forward(self, g_s, g_t):
14 | return [self.similarity_loss(f_s, f_t) for f_s, f_t in zip(g_s, g_t)]
15 |
16 | def similarity_loss(self, f_s, f_t):
17 | bsz = f_s.shape[0]
18 | f_s = f_s.view(bsz, -1)
19 | f_t = f_t.view(bsz, -1)
20 |
21 | G_s = torch.mm(f_s, torch.t(f_s))
22 | # G_s = G_s / G_s.norm(2)
23 | G_s = torch.nn.functional.normalize(G_s)
24 | G_t = torch.mm(f_t, torch.t(f_t))
25 | # G_t = G_t / G_t.norm(2)
26 | G_t = torch.nn.functional.normalize(G_t)
27 |
28 | G_diff = G_t - G_s
29 | loss = (G_diff * G_diff).view(-1, 1).sum(0) / (bsz * bsz)
30 | return loss
31 |
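The similarity-preserving loss compares batch-by-batch Gram matrices, so student and teacher features may have different widths. The sketch below reproduces the steps (Gram matrix, row-wise L2 normalization, squared difference averaged over `bsz * bsz`) in plain Python; the function names are hypothetical and the inputs are assumed to be already-flattened feature vectors.

```python
import math


def gram(features):
    """Batch-by-batch matrix of inner products between flattened features."""
    return [[sum(a * b for a, b in zip(u, v)) for v in features]
            for u in features]


def l2_normalize_rows(matrix, eps=1e-12):
    """Scale each row to unit L2 norm, clamping tiny norms to eps."""
    out = []
    for row in matrix:
        norm = max(math.sqrt(sum(x * x for x in row)), eps)
        out.append([x / norm for x in row])
    return out


def similarity_loss(f_s, f_t):
    """Sum of squared differences of the normalized Gram matrices over bsz**2."""
    bsz = len(f_s)
    g_s = l2_normalize_rows(gram(f_s))
    g_t = l2_normalize_rows(gram(f_t))
    return sum((t - s) ** 2
               for row_s, row_t in zip(g_s, g_t)
               for s, t in zip(row_s, row_t)) / (bsz * bsz)
```

Because each Gram row is normalized, rescaling all features of one network by a constant leaves the loss unchanged.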
--------------------------------------------------------------------------------
/distiller_zoo/VID.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch
4 | import torch.nn as nn
5 | import torch.nn.functional as F
6 | import numpy as np
7 |
8 |
9 | class VIDLoss(nn.Module):
10 | """Variational Information Distillation for Knowledge Transfer (CVPR 2019),
11 | code from author: https://github.com/ssahn0215/variational-information-distillation"""
12 | def __init__(self,
13 | num_input_channels,
14 | num_mid_channel,
15 | num_target_channels,
16 | init_pred_var=5.0,
17 | eps=1e-5):
18 | super(VIDLoss, self).__init__()
19 |
20 | def conv1x1(in_channels, out_channels, stride=1):
21 | return nn.Conv2d(
22 | in_channels, out_channels,
23 | kernel_size=1, padding=0,
24 | bias=False, stride=stride)
25 |
26 | self.regressor = nn.Sequential(
27 | conv1x1(num_input_channels, num_mid_channel),
28 | nn.ReLU(),
29 | conv1x1(num_mid_channel, num_mid_channel),
30 | nn.ReLU(),
31 | conv1x1(num_mid_channel, num_target_channels),
32 | )
33 | self.log_scale = torch.nn.Parameter(
34 | np.log(np.exp(init_pred_var-eps)-1.0) * torch.ones(num_target_channels)
35 | )
36 | self.eps = eps
37 |
38 | def forward(self, input, target):
39 | # pool for dimension match
40 | s_H, t_H = input.shape[2], target.shape[2]
41 | if s_H > t_H:
42 | input = F.adaptive_avg_pool2d(input, (t_H, t_H))
43 | elif s_H < t_H:
44 | target = F.adaptive_avg_pool2d(target, (s_H, s_H))
45 | else:
46 | pass
47 | pred_mean = self.regressor(input)
48 | pred_var = torch.log(1.0+torch.exp(self.log_scale))+self.eps
49 | pred_var = pred_var.view(1, -1, 1, 1)
50 | neg_log_prob = 0.5*(
51 | (pred_mean-target)**2/pred_var+torch.log(pred_var)
52 | )
53 | loss = torch.mean(neg_log_prob)
54 | return loss
55 |
--------------------------------------------------------------------------------
/distiller_zoo/__init__.py:
--------------------------------------------------------------------------------
1 | from .AB import ABLoss
2 | from .AT import Attention
3 | from .CC import Correlation
4 | from .FitNet import HintLoss
5 | from .FSP import FSP
6 | from .FT import FactorTransfer
7 | from .KD import DistillKL
8 | from .KDSVD import KDSVD
9 | from .NST import NSTLoss
10 | from .PKT import PKT
11 | from .RKD import RKDLoss
12 | from .SP import Similarity
13 | from .VID import VIDLoss
14 |
--------------------------------------------------------------------------------
/examples/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alldbi/SuperMix/d63d25a6ff387640f4840faed97791b7c5badc5d/examples/__init__.py
--------------------------------------------------------------------------------
/examples/cifar100.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alldbi/SuperMix/d63d25a6ff387640f4840faed97791b7c5badc5d/examples/cifar100.png
--------------------------------------------------------------------------------
/examples/fig1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alldbi/SuperMix/d63d25a6ff387640f4840faed97791b7c5badc5d/examples/fig1.png
--------------------------------------------------------------------------------
/examples/imagenet.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alldbi/SuperMix/d63d25a6ff387640f4840faed97791b7c5badc5d/examples/imagenet.png
--------------------------------------------------------------------------------
/helper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alldbi/SuperMix/d63d25a6ff387640f4840faed97791b7c5badc5d/helper/__init__.py
--------------------------------------------------------------------------------
/helper/loops.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function, division
2 |
3 | import sys
4 | import time
5 | import torch
6 | from helper.util import plot_tensor
7 | from .util import AverageMeter, accuracy
8 | import os
9 | import numpy as np
10 |
11 |
12 | def train_vanilla(epoch, train_loader, model, criterion, optimizer, opt, warmup_scheduler):
13 | """vanilla training"""
14 | device = opt.device
15 | model.train()
16 |
17 | batch_time = AverageMeter()
18 | data_time = AverageMeter()
19 | losses = AverageMeter()
20 | top1 = AverageMeter()
21 | top5 = AverageMeter()
22 |
23 | end = time.time()
24 | for idx, (input, target) in enumerate(train_loader):
25 |
26 | if epoch < 5 + 1:
27 | warmup_scheduler.step()
28 |
29 | data_time.update(time.time() - end)
30 |
31 | input = input.float()
32 | input = input.to(device)
33 | target = target.to(device)
34 |
35 | # ===================forward=====================
36 | output = model(input)
37 | loss = criterion(output, target)
38 |
39 | acc1, acc5 = accuracy(output, target, topk=(1, 5))
40 | losses.update(loss.item(), input.size(0))
41 | top1.update(acc1[0], input.size(0))
42 | top5.update(acc5[0], input.size(0))
43 |
44 | # ===================backward=====================
45 | optimizer.zero_grad()
46 | loss.backward()
47 | optimizer.step()
48 |
49 | # ===================meters=====================
50 | batch_time.update(time.time() - end)
51 | end = time.time()
52 |
53 | # tensorboard logger
54 | pass
55 |
56 | # print info
57 | if idx % opt.print_freq == 0:
58 | print('Epoch: [{0}][{1}/{2}]\t'
59 | 'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
60 | 'Data {data_time.val:.3f} ({data_time.avg:.3f})\t'
61 | 'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
62 | 'Acc@1 {top1.val:.3f} ({top1.avg:.3f})\t'
63 | 'Acc@5 {top5.val:.3f} ({top5.avg:.3f})'.format(
64 | epoch, idx, len(train_loader), batch_time=batch_time,
65 | data_time=data_time, loss=losses, top1=top1, top5=top5))
66 | sys.stdout.flush()
67 |
68 | print(' * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'
69 | .format(top1=top1, top5=top5))
70 |
71 | return top1.avg, losses.avg
72 |
73 |
74 | def convert_time(seconds):
75 | seconds = seconds % (24 * 3600)
76 | hour = seconds // 3600
77 | seconds %= 3600
78 | minutes = seconds // 60
79 | seconds %= 60
80 | return [hour, minutes, seconds]
81 |
82 |
83 | def train_distill(epoch, train_loader, val_loader, module_list, criterion_list, optimizer, opt, best_acc, logger,
84 | device, warmup_scheduler, total_t):
85 | """One epoch distillation"""
86 |
87 | t_0 = time.time()
88 | # set modules as train()
89 | for module in module_list:
90 | module.train()
91 | # set teacher as eval()
92 | module_list[-1].eval()
93 |
94 | if opt.distill == 'abound':
95 | module_list[1].eval()
96 | elif opt.distill == 'factor':
97 | module_list[2].eval()
98 |
99 | criterion_cls = criterion_list[0]
100 | criterion_div = criterion_list[1]
101 | criterion_kd = criterion_list[2]
102 |
103 | model_s = module_list[0]
104 | model_t = module_list[-1]
105 |
106 | batch_time = AverageMeter()
107 | losses = AverageMeter()
108 | top1 = AverageMeter()
109 | top5 = AverageMeter()
110 | xentm = AverageMeter()
111 | kdm = AverageMeter()
112 | otherm = AverageMeter()
113 |
114 | end = time.time()
115 |
116 | t_data = time.time()
117 |
118 | ag_time = 0
119 | for idx, data_combined in enumerate(train_loader):
120 |
121 | ag_time += time.time() - t_data
122 |
123 | if epoch < opt.epochs_warmup + 1:
124 | warmup_scheduler.step()
125 |
126 | model_s.train()
127 | model_t.eval()
128 |
129 | if opt.aug_type is None:
130 | data = data_combined
131 | else:
132 | data = data_combined[0]
133 | data_aug = data_combined[1]
134 |
135 | if opt.distill in ['crd']:
136 | input, target, index, contrast_idx = data
137 | else:
138 | input, target, index = data
139 | if opt.aug_type is not None and opt.aug_type != 'cutmix':
140 | input_aug = data_aug[0]
141 |
142 | input = input.float()
143 | input = input.to(device)
144 | target = target.to(device)
145 | index = index.to(device)
146 | bs = input.size(0)
147 |
148 | if opt.distill in ['crd']:
149 | contrast_idx = contrast_idx.to(device)
150 |
151 | if opt.aug_type is not None:
152 | # construct augmentation samples using mixup or cutmix
153 | if opt.aug_type == 'mixup':
154 | input_aug = input_aug.to(device)
155 | # shift samples in the batch to make pairs
156 | idx_aug = torch.arange(bs)
157 | idx_aug[0:bs - 1] = idx_aug[1:bs].clone()
158 | idx_aug[-1] = 0
159 | input_aug_b = input_aug[idx_aug]
160 | if opt.aug_lambda > 0:
161 | # compute mixup samples using fixed lambda
162 | input_aug = opt.aug_lambda * input_aug + (1 - opt.aug_lambda) * input_aug_b
163 | elif opt.aug_lambda == -1:
164 | # compute mixup samples using the beta distribution
165 | lambda_aug = np.random.beta(opt.aug_alpha, opt.aug_alpha, size=[bs, 1, 1, 1])
166 | lambda_aug = torch.from_numpy(lambda_aug).type(torch.FloatTensor).to(opt.device)
167 | input_aug = lambda_aug * input_aug + (1 - lambda_aug) * input_aug_b
168 | elif opt.aug_type == 'cutmix':
169 | input_aug = data_aug[0]
170 | mask = data_aug[2].view(bs, 1, 32, 32)
171 | input_aug, mask = input_aug.to(device), mask.to(device)
172 | # shift samples in the batch to make pairs
173 | idx_aug = torch.arange(bs)
174 | idx_aug[0:bs - 1] = idx_aug[1:bs].clone()
175 | idx_aug[-1] = 0
176 | input_aug_b = input_aug[idx_aug]
177 |
178 | input_aug = mask * input_aug + (1 - mask) * input_aug_b
179 | # for i in range(10):
180 | # plot_tensor([input_aug[i], mask[i]])
181 | input_aug = input_aug.to(device)
182 |
183 | # ===================forward=====================
184 | preact = False
185 | if opt.distill in ['abound']:
186 | preact = True
187 | feat_s, logit_s = model_s(input, is_feat=True, preact=preact)
188 |
189 | # skip the teacher forward pass on natural samples when its predictions are not needed
190 | if not (opt.distill in ['kd'] and opt.alpha == 0):
191 | feat_t, logit_t = model_t(input, is_feat=True, preact=preact)
192 | feat_t = [f.detach() for f in feat_t]
193 |
194 | # compute the predicted label of the teacher for the augmented samples
195 | if opt.aug_type is not None:
196 | logit_aug_t = model_t(input_aug)
197 | logit_aug_s = model_s(input_aug)
198 | pred_lbl_t = logit_aug_t.argmax(1)
199 |
200 | # cls + kl div
201 | loss_cls_nat = criterion_cls(logit_s, target)
202 |
203 | loss_cls_aug = 0
204 | if opt.aug_type is not None:
205 | loss_cls_aug = criterion_cls(logit_aug_s, pred_lbl_t)
206 |
207 | loss_cls = loss_cls_nat + loss_cls_aug
208 |
209 | if opt.alpha > 0:
210 | # if opt.aug_type is not None:
211 | # loss_div = criterion_div(logit_aug_s, logit_aug_t)
212 | # else:
213 | loss_div = criterion_div(logit_s, logit_t)
214 | else:
215 | loss_div = torch.zeros([1])
216 | loss_div = loss_div.to(device)
217 |
218 | # other kd beyond KL divergence
219 | if opt.distill == 'kd':
220 | loss_kd = 0
221 | elif opt.distill == 'hint':
222 | f_s = module_list[1](feat_s[opt.hint_layer])
223 | f_t = feat_t[opt.hint_layer]
224 | loss_kd = criterion_kd(f_s, f_t)
225 | elif opt.distill == 'crd':
226 | f_s = feat_s[-1]
227 | f_t = feat_t[-1]
228 | loss_kd = criterion_kd(f_s, f_t, index, contrast_idx)
229 | elif opt.distill == 'attention':
230 | g_s = feat_s[1:-1]
231 | g_t = feat_t[1:-1]
232 | loss_group = criterion_kd(g_s, g_t)
233 | loss_kd = sum(loss_group)
234 | elif opt.distill == 'nst':
235 | g_s = feat_s[1:-1]
236 | g_t = feat_t[1:-1]
237 | loss_group = criterion_kd(g_s, g_t)
238 | loss_kd = sum(loss_group)
239 | elif opt.distill == 'similarity':
240 | g_s = [feat_s[-2]]
241 | g_t = [feat_t[-2]]
242 | loss_group = criterion_kd(g_s, g_t)
243 | loss_kd = sum(loss_group)
244 | elif opt.distill == 'rkd':
245 | f_s = feat_s[-1]
246 | f_t = feat_t[-1]
247 | loss_kd = criterion_kd(f_s, f_t)
248 | elif opt.distill == 'pkt':
249 | f_s = feat_s[-1]
250 | f_t = feat_t[-1]
251 | loss_kd = criterion_kd(f_s, f_t)
252 | elif opt.distill == 'kdsvd':
253 | g_s = feat_s[1:-1]
254 | g_t = feat_t[1:-1]
255 | loss_group = criterion_kd(g_s, g_t)
256 | loss_kd = sum(loss_group)
257 | elif opt.distill == 'correlation':
258 | f_s = module_list[1](feat_s[-1])
259 | f_t = module_list[2](feat_t[-1])
260 | loss_kd = criterion_kd(f_s, f_t)
261 | elif opt.distill == 'vid':
262 | g_s = feat_s[1:-1]
263 | g_t = feat_t[1:-1]
264 | loss_group = [c(f_s, f_t) for f_s, f_t, c in zip(g_s, g_t, criterion_kd)]
265 | loss_kd = sum(loss_group)
266 | elif opt.distill == 'abound':
267 | # can also add loss to this stage
268 | loss_kd = 0
269 | elif opt.distill == 'fsp':
270 | # can also add loss to this stage
271 | loss_kd = 0
272 | elif opt.distill == 'factor':
273 | factor_s = module_list[1](feat_s[-2])
274 | factor_t = module_list[2](feat_t[-2], is_factor=True)
275 | loss_kd = criterion_kd(factor_s, factor_t)
276 | else:
277 | raise NotImplementedError(opt.distill)
278 |
279 | loss = opt.gamma * loss_cls + opt.alpha * loss_div + opt.beta * loss_kd
280 |
281 | acc1, acc5 = accuracy(logit_s, target, topk=(1, 5))
282 | losses.update(loss.item(), bs)
283 | top1.update(acc1.item(), bs)
284 | top5.update(acc5.item(), bs)
285 | xentm.update(loss_cls.item(), bs)
286 | kdm.update(loss_div.item())
287 | otherm.update(loss_kd.item() if torch.is_tensor(loss_kd) else loss_kd)
288 |
289 | # ===================backward=====================
290 | optimizer.zero_grad()
291 | loss.backward()
292 | optimizer.step()
293 |
294 | # ===================meters=====================
295 | total_t += time.time() - end
296 | batch_time.update(time.time() - end, 1)
297 | end = time.time()
298 |
299 | # print info
300 | if idx % opt.print_freq == 0 and idx > 0:
301 | for param_group in optimizer.param_groups:
302 | lr = param_group['lr']
303 | # compute the remaining time
304 | epoch_remaining = opt.epochs - epoch
305 | # total_iters_remaining = len(train_loader) * (opt.epochs - epoch + 1) - idx
306 | iters_passed = len(train_loader) * (epoch - 1) + idx
307 | iters_remaining = len(train_loader) * (opt.epochs - epoch + 1) - idx
308 |
309 | ert = total_t * iters_remaining / iters_passed
310 | ert = convert_time(ert)
311 |
312 | print(
313 | 'Epoch: %d [%03d, %03d], l_xent: %.4f, l_kd: %.4f, l_other: %.4f, acc: %.2f, lr: %.4f, time: %.1f, ert: %d:%02d:%02d' % (
314 | epoch, idx, len(train_loader), xentm.avg, kdm.avg, otherm.avg, top1.avg, lr,
315 | batch_time.avg * opt.print_freq, ert[0], ert[1], ert[2]))
316 |
317 | if idx % opt.test_freq == 0 and idx > 0:
318 | test_acc, test_acc_top5, test_loss = validate(val_loader, model_s, criterion_cls, opt)
319 | model_s.train()
320 | if test_acc > best_acc:
321 | best_acc = test_acc
322 | state = {
323 | 'epoch': epoch,
324 | 'model': model_s.state_dict(),
325 | 'best_acc': best_acc,
326 | }
327 | save_file = os.path.join(opt.save_folder, '{}.pth'.format(opt.model_s))
328 | torch.save(state, save_file)
329 | print("\nTest acc: %.2f, best: %.2f\n" % (test_acc, best_acc))
330 |
331 | logger.store([epoch, xentm.avg, kdm.avg, otherm.avg, top1.avg, test_acc, best_acc, lr], log=True)
332 |
333 | xentm.reset()
334 | kdm.reset()
335 | top1.reset()
336 | otherm.reset()
337 | batch_time.reset()
338 | end = time.time()
339 |
340 | t_data = time.time()
341 |
342 | return best_acc, total_t
343 |
344 |
345 | def validate(val_loader, model, criterion, opt):
346 | """validation"""
347 | device = opt.device
348 | batch_time = AverageMeter()
349 | losses = AverageMeter()
350 | top1 = AverageMeter()
351 | top5 = AverageMeter()
352 |
353 | # switch to evaluate mode
354 | model.eval()
355 |
356 | with torch.no_grad():
357 | end = time.time()
358 | for idx, (input, target) in enumerate(val_loader):
359 | input = input.float()
360 | input = input.to(device)
361 | target = target.to(device)
362 |
363 | # compute output
364 | output = model(input)
365 | loss = criterion(output, target)
366 |
367 | # measure accuracy and record loss
368 | acc1, acc5 = accuracy(output, target, topk=(1, 5))
369 | losses.update(loss.item(), input.size(0))
370 | top1.update(acc1.item(), input.size(0))
371 | top5.update(acc5.item(), input.size(0))
372 |
373 | # measure elapsed time
374 | batch_time.update(time.time() - end)
375 | end = time.time()
376 |
377 | model.train()
378 | return top1.avg, top5.avg, losses.avg
379 |
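The mixup and cutmix branches of the training loop pair every sample with its successor in the batch (the last sample wraps around to the first) and then blend the two. Below is a minimal pure-Python sketch of that pairing, with plain lists standing in for tensors. The helper names `shifted_pairs` and `mixup_fixed_lambda` are illustrative assumptions and are not part of this repository.

```python
def shifted_pairs(batch_size):
    """Partner index for each sample: i -> i + 1, and the last sample -> 0."""
    return [(i + 1) % batch_size for i in range(batch_size)]


def mixup_fixed_lambda(batch, lam):
    """Blend each sample with its shifted partner using one fixed lambda.

    `batch` is a list of equal-length lists of floats (a stand-in for a tensor).
    """
    partners = shifted_pairs(len(batch))
    mixed = []
    for i, j in enumerate(partners):
        # lam * x_i + (1 - lam) * x_j, applied element by element
        mixed.append([lam * a + (1.0 - lam) * b for a, b in zip(batch[i], batch[j])])
    return mixed


if __name__ == '__main__':
    print(shifted_pairs(4))
    print(mixup_fixed_lambda([[0.0], [2.0]], 0.5))
```

In the loop above, the same blend is either driven by a fixed `opt.aug_lambda` or by per-sample values drawn from a Beta distribution; the sketch covers only the fixed case.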
--------------------------------------------------------------------------------
/helper/pretrain.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function, division
2 |
3 | import time
4 | import sys
5 | import torch
6 | import torch.optim as optim
7 | import torch.backends.cudnn as cudnn
8 | from .util import AverageMeter
9 |
10 |
11 | def init(model_s, model_t, init_modules, criterion, train_loader, opt):
12 | model_t.eval()
13 | model_s.eval()
14 | init_modules.train()
15 |
16 | if torch.cuda.is_available():
17 | model_s.cuda()
18 | model_t.cuda()
19 | init_modules.cuda()
20 | cudnn.benchmark = True
21 |
22 | if opt.model_s in ['resnet8', 'resnet14', 'resnet20', 'resnet32', 'resnet44', 'resnet56', 'resnet110',
23 | 'resnet8x4', 'resnet32x4', 'wrn_16_1', 'wrn_16_2', 'wrn_40_1', 'wrn_40_2'] and \
24 | opt.distill == 'factor':
25 | lr = 0.01
26 | else:
27 | lr = opt.learning_rate
28 | optimizer = optim.SGD(init_modules.parameters(),
29 | lr=lr,
30 | momentum=opt.momentum,
31 | weight_decay=opt.weight_decay)
32 |
33 | batch_time = AverageMeter()
34 | data_time = AverageMeter()
35 | losses = AverageMeter()
36 | for epoch in range(1, opt.init_epochs + 1):
37 | batch_time.reset()
38 | data_time.reset()
39 | losses.reset()
40 | end = time.time()
41 | for idx, data in enumerate(train_loader):
42 | if opt.distill in ['crd']:
43 | input, target, index, contrast_idx = data
44 | else:
45 | input, target, index = data
46 | data_time.update(time.time() - end)
47 |
48 | input = input.float()
49 | if torch.cuda.is_available():
50 | input = input.cuda()
51 | target = target.cuda()
52 | index = index.cuda()
53 | if opt.distill in ['crd']:
54 | contrast_idx = contrast_idx.cuda()
55 |
56 | # ============= forward ==============
57 | preact = (opt.distill == 'abound')
58 | feat_s, _ = model_s(input, is_feat=True, preact=preact)
59 | with torch.no_grad():
60 | feat_t, _ = model_t(input, is_feat=True, preact=preact)
61 | feat_t = [f.detach() for f in feat_t]
62 |
63 | if opt.distill == 'abound':
64 | g_s = init_modules[0](feat_s[1:-1])
65 | g_t = feat_t[1:-1]
66 | loss_group = criterion(g_s, g_t)
67 | loss = sum(loss_group)
68 | elif opt.distill == 'factor':
69 | f_t = feat_t[-2]
70 | _, f_t_rec = init_modules[0](f_t)
71 | loss = criterion(f_t_rec, f_t)
72 | elif opt.distill == 'fsp':
73 | loss_group = criterion(feat_s[:-1], feat_t[:-1])
74 | loss = sum(loss_group)
75 | else:
76 | raise NotImplementedError('Not supported in init training: {}'.format(opt.distill))
77 |
78 | losses.update(loss.item(), input.size(0))
79 |
80 | # ===================backward=====================
81 | optimizer.zero_grad()
82 | loss.backward()
83 | optimizer.step()
84 |
85 | batch_time.update(time.time() - end)
86 | end = time.time()
87 |
88 | # end of epoch
89 | # logger.log_value('init_train_loss', losses.avg, epoch)
90 | print('Epoch: [{0}/{1}]\t'
91 | 'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
92 | 'losses: {losses.val:.3f} ({losses.avg:.3f})'.format(
93 | epoch, opt.init_epochs, batch_time=batch_time, losses=losses))
94 | sys.stdout.flush()
95 |
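The final `else` branch of `init` is meant to reject distillation methods that have no initialization stage. In Python, `NotImplemented` is a constant returned by binary-operator methods and cannot be called, whereas `NotImplementedError` is the exception class intended for this purpose. The short sketch below shows the difference; both function names are hypothetical.

```python
def reject_unsupported(distill):
    """Raise the exception class meant for unsupported code paths."""
    raise NotImplementedError('Not supported in init training: {}'.format(distill))


def call_constant():
    """Calling the NotImplemented constant fails with TypeError instead."""
    return NotImplemented('oops')


if __name__ == '__main__':
    for fn, arg in ((reject_unsupported, ('kd',)), (call_constant, ())):
        try:
            fn(*arg)
        except Exception as exc:
            # Show which exception type each variant actually produces.
            print(type(exc).__name__)
```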
--------------------------------------------------------------------------------
/helper/util.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch
4 | import numpy as np
5 | import datetime, os
6 | from torch.optim.lr_scheduler import _LRScheduler
7 | from torch.utils.data import Dataset
8 | from PIL import Image
9 | import matplotlib.pyplot as plt
10 |
11 |
12 | class AugDataset(Dataset):
13 | def __init__(self, root_dir, size=500000, transforms=None):
14 | for i in range(4):
15 |
16 | dict = np.load(root_dir + str(i + 1), allow_pickle=True)
17 | data = dict['data']
18 | if i == 0:
19 | x = data
20 | else:
21 | x = np.concatenate((x, data), 0)
22 |
23 | x = x.reshape(-1, 3, 32, 32)
24 |
25 | x = x[0:size, ...]
26 |
27 | self.data = x
28 | self.transforms = transforms
29 |
30 | def __len__(self):
31 | return self.data.shape[0]
32 |
33 | def __getitem__(self, idx):
34 | if torch.is_tensor(idx):
35 | idx = idx.tolist()
36 |
37 | # img = self.transforms(self.data[idx, ...])
38 | # print(img)
39 | # exit()
40 | img = self.data[idx, ...]
41 | img = img.transpose(1, 2, 0)
42 | img = Image.fromarray(img)
43 | img = self.transforms(img)
44 | sample = {'image': img, 'target': 0}
45 |
46 | return sample
47 |
48 |
49 | # augset = AugDataset('/media/aldb/DATA1/DATABASE/imagenet32x32/Imagenet32_train/train_data_batch_')
50 | #
51 | # print(len(augset))
52 | # exit()
53 |
54 |
55 | def adjust_learning_rate_new(epoch, optimizer, LUT):
56 | """
57 | new learning rate schedule according to RotNet
58 | """
59 | lr = next((lr for (max_epoch, lr) in LUT if max_epoch > epoch), LUT[-1][1])
60 | for param_group in optimizer.param_groups:
61 | param_group['lr'] = lr
62 |
63 |
64 | def adjust_learning_rate(epoch, opt, optimizer):
65 | """Sets the learning rate to the initial LR multiplied by the decay rate once for every decay epoch already passed"""
66 | steps = np.sum(epoch > np.asarray(opt.lr_decay_epochs))
67 | if steps > 0:
68 | new_lr = opt.learning_rate * (opt.lr_decay_rate ** steps)
69 | for param_group in optimizer.param_groups:
70 | param_group['lr'] = new_lr
71 |
72 |
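`adjust_learning_rate` counts how many entries of `opt.lr_decay_epochs` the current epoch has strictly passed and scales the base rate by `opt.lr_decay_rate` that many times. Here is a dependency-free sketch of the same arithmetic. The function name `step_decay_lr` and the numbers in the usage lines are assumptions chosen for illustration, not values taken from this repository.

```python
def step_decay_lr(epoch, base_lr, decay_epochs, decay_rate):
    """Return base_lr * decay_rate ** (number of decay epochs strictly below `epoch`)."""
    steps = sum(1 for e in decay_epochs if epoch > e)
    return base_lr * (decay_rate ** steps)


if __name__ == '__main__':
    # Hypothetical schedule: decay by 10x after epochs 150, 180 and 210.
    for epoch in (1, 150, 151, 181, 211):
        print(epoch, step_decay_lr(epoch, 0.05, [150, 180, 210], 0.1))
```

Because the comparison is strict (`epoch > e`), the decay takes effect in the epoch after each listed boundary. Unlike this sketch, the function above leaves the optimizer untouched while no boundary has been passed.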
73 | class AverageMeter(object):
74 | """Computes and stores the average and current value"""
75 |
76 | def __init__(self):
77 | self.reset()
78 |
79 | def reset(self):
80 | self.val = 0
81 | self.avg = 0
82 | self.sum = 0
83 | self.count = 0
84 |
85 | def update(self, val, n=1):
86 | self.val = val
87 | self.sum += val * n
88 | self.count += n
89 | self.avg = self.sum / self.count
90 |
91 |
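`AverageMeter.update(val, n)` keeps a sample-weighted mean: each value counts `n` times. Callers that omit `n` get a plain per-batch mean instead, which differs whenever batch sizes vary. The sketch below reproduces that arithmetic in a small stand-alone class; `RunningMean` is a hypothetical name used so the example does not depend on this module.

```python
class RunningMean:
    """Weighted running mean with the same update rule as AverageMeter."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, val, n=1):
        # Each value contributes n times to the sum and the count.
        self.total += val * n
        self.count += n

    @property
    def avg(self):
        return self.total / self.count


if __name__ == '__main__':
    weighted = RunningMean()
    weighted.update(1.0, n=2)   # a batch of 2 samples with mean loss 1.0
    weighted.update(4.0, n=1)   # a batch of 1 sample with loss 4.0
    print(weighted.avg)

    unweighted = RunningMean()
    unweighted.update(1.0)
    unweighted.update(4.0)
    print(unweighted.avg)
```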
92 | def accuracy(output, target, topk=(1,)):
93 | """Computes the accuracy over the k top predictions for the specified values of k"""
94 | with torch.no_grad():
95 | maxk = max(topk)
96 | batch_size = target.size(0)
97 |
98 | _, pred = output.topk(maxk, 1, True, True)
99 | pred = pred.t()
100 | correct = pred.eq(target.view(1, -1).expand_as(pred))
101 |
102 | res = []
103 | for k in topk:
104 | correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
105 | res.append(correct_k.mul_(100.0 / batch_size))
106 | return res
107 |
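`accuracy` reports, for each `k`, the percentage of samples whose true label is among the `k` highest-scoring classes. The following pure-Python sketch uses nested lists in place of tensors to make that definition concrete. The function name `topk_accuracy` and the toy scores are assumptions for illustration.

```python
def topk_accuracy(scores, targets, topk=(1,)):
    """Percentage of rows whose target is among the k largest scores, for each k."""
    results = []
    for k in topk:
        hits = 0
        for row, target in zip(scores, targets):
            # Class indices sorted from highest to lowest score.
            ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
            if target in ranked[:k]:
                hits += 1
        results.append(100.0 * hits / len(targets))
    return results


if __name__ == '__main__':
    toy_scores = [[0.1, 0.9, 0.0], [0.8, 0.05, 0.15]]
    toy_targets = [1, 2]
    print(topk_accuracy(toy_scores, toy_targets, topk=(1, 2)))
```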
108 |
109 | class Logger:
110 | def __init__(self, dir, var_names=None, format=None, args=None):
111 | self.dir = dir
112 | self.var_names = var_names
113 | self.format = format
114 | self.vars = []
115 |
116 | # create the log folder
117 | if not os.path.exists(dir):
118 | os.makedirs(dir)
119 |
120 | file = open(dir + '/log.txt', 'w')
121 | file.write('Log file created on ' + str(datetime.datetime.now()) + '\n\n')
122 |
123 | dict = {}
124 | for arg in vars(args):
125 | dict[arg] = str(getattr(args, arg))
126 |
127 | for d in sorted(dict.keys()):
128 | file.write(d + ' : ' + dict[d] + '\n')
129 | file.write('\n')
130 | file.close()
131 |
132 | def store(self, vars, log=False):
133 | self.vars = self.vars + vars
134 | if log:
135 | self.log()
136 |
137 | def log(self):
138 |
139 | vars = self.vars
140 | file = open(self.dir + '/log.txt', 'a')
141 | st = ''
142 | for i in range(len(vars)):
143 | st += self.var_names[i] + ': ' + self.format[i] % (vars[i]) + ', '
144 | st += 'time: ' + str(datetime.datetime.now()) + '\n'
145 | file.write(st)
146 | file.close()
147 | self.vars = []
148 |
149 |
150 | def count_parameters(model):
151 | return sum(p.numel() for p in model.parameters() if p.requires_grad)
152 |
153 |
154 | def get_teacher_name(model_path):
155 | """parse teacher name"""
156 | segments = model_path.split('/')[-2].split('_')
157 | if segments[0] != 'wrn':
158 | return segments[0]
159 | else:
160 | return segments[0] + '_' + segments[1] + '_' + segments[2]
161 |
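`get_teacher_name` derives the architecture name from the checkpoint's parent directory: it takes the text before the first underscore, except for `wrn` models, whose names span three underscore-separated segments. The sketch below restates that rule under a different name (`teacher_name_from_path`, hypothetical) and applies it to made-up paths; the directory layout shown is an assumption, not something this listing confirms.

```python
def teacher_name_from_path(model_path):
    """Parse the model name from the parent directory of a checkpoint path."""
    segments = model_path.split('/')[-2].split('_')
    if segments[0] != 'wrn':
        return segments[0]
    # Wide ResNet names look like wrn_<depth>_<width>.
    return '_'.join(segments[:3])


if __name__ == '__main__':
    # Hypothetical checkpoint paths, used only to exercise the parser.
    print(teacher_name_from_path('save/models/resnet32x4_vanilla/ckpt_epoch_240.pth'))
    print(teacher_name_from_path('save/models/wrn_40_2_vanilla/ckpt_epoch_240.pth'))
```

Note that the rule needs at least one directory component in the path; a bare file name would raise `IndexError` at `[-2]`.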
162 |
163 | class WarmUpLR(_LRScheduler):
164 | """warmup_training learning rate scheduler
165 | Args:
166 | optimizer: optimizer (e.g. SGD)
167 | total_iters: total number of iterations in the warmup phase
168 | """
169 |
170 | def __init__(self, optimizer, total_iters, last_epoch=-1):
171 | self.total_iters = total_iters
172 | super().__init__(optimizer, last_epoch)
173 |
174 | def get_lr(self):
175 | """we will use the first m batches, and set the learning
176 | rate to base_lr * m / total_iters
177 | """
178 | return [base_lr * self.last_epoch / (self.total_iters + 1e-8) for base_lr in self.base_lrs]
179 |
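`WarmUpLR.get_lr` ramps the learning rate linearly from zero to `base_lr` over `total_iters` scheduler steps, with a small epsilon guarding against division by zero. Here is the same formula as a stand-alone function; `warmup_lr` is a hypothetical name and the numbers in the usage lines are illustrative.

```python
def warmup_lr(base_lr, step, total_iters):
    """Linear warmup: base_lr * step / total_iters, with an epsilon in the denominator."""
    return base_lr * step / (total_iters + 1e-8)


if __name__ == '__main__':
    # Hypothetical setting: ramp to 0.1 over 100 iterations.
    for step in (0, 25, 50, 100):
        print(step, warmup_lr(0.1, step, 100))
```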
180 |
181 | def normalize01(x):
182 | x = (x - x.min()) / (x.max() - x.min())
183 | return x
184 |
185 |
186 | def plot_tensor(tensor_list):
187 | for i, t in enumerate(tensor_list):
188 | t_np = t.detach().cpu().numpy().squeeze()
189 | if len(t_np.shape) == 3:
190 | t_np = t_np.transpose(1, 2, 0)
191 | t_np = normalize01(t_np)
192 | plt.subplot(1, len(tensor_list), i + 1)
193 | plt.imshow(t_np)
194 | plt.show()
195 |
196 |
197 | if __name__ == '__main__':
198 | pass
199 |
--------------------------------------------------------------------------------
/models/ShuffleNetv1.py:
--------------------------------------------------------------------------------
1 | '''ShuffleNet in PyTorch.
2 | See the paper "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices" for more details.
3 | '''
4 | import torch
5 | import torch.nn as nn
6 | import torch.nn.functional as F
7 |
8 |
9 | class ShuffleBlock(nn.Module):
10 | def __init__(self, groups):
11 | super(ShuffleBlock, self).__init__()
12 | self.groups = groups
13 |
14 | def forward(self, x):
15 | '''Channel shuffle: [N,C,H,W] -> [N,g,C/g,H,W] -> [N,C/g,g,H,W] -> [N,C,H,W]'''
16 | N,C,H,W = x.size()
17 | g = self.groups
18 | return x.view(N,g,C//g,H,W).permute(0,2,1,3,4).reshape(N,C,H,W)
19 |
20 |
21 | class Bottleneck(nn.Module):
22 | def __init__(self, in_planes, out_planes, stride, groups, is_last=False):
23 | super(Bottleneck, self).__init__()
24 | self.is_last = is_last
25 | self.stride = stride
26 |
27 | mid_planes = int(out_planes/4)
28 | g = 1 if in_planes == 24 else groups
29 | self.conv1 = nn.Conv2d(in_planes, mid_planes, kernel_size=1, groups=g, bias=False)
30 | self.bn1 = nn.BatchNorm2d(mid_planes)
31 | self.shuffle1 = ShuffleBlock(groups=g)
32 | self.conv2 = nn.Conv2d(mid_planes, mid_planes, kernel_size=3, stride=stride, padding=1, groups=mid_planes, bias=False)
33 | self.bn2 = nn.BatchNorm2d(mid_planes)
34 | self.conv3 = nn.Conv2d(mid_planes, out_planes, kernel_size=1, groups=groups, bias=False)
35 | self.bn3 = nn.BatchNorm2d(out_planes)
36 |
37 | self.shortcut = nn.Sequential()
38 | if stride == 2:
39 | self.shortcut = nn.Sequential(nn.AvgPool2d(3, stride=2, padding=1))
40 |
41 | def forward(self, x):
42 | out = F.relu(self.bn1(self.conv1(x)))
43 | out = self.shuffle1(out)
44 | out = F.relu(self.bn2(self.conv2(out)))
45 | out = self.bn3(self.conv3(out))
46 | res = self.shortcut(x)
47 | preact = torch.cat([out, res], 1) if self.stride == 2 else out+res
48 | out = F.relu(preact)
49 | # out = F.relu(torch.cat([out, res], 1)) if self.stride == 2 else F.relu(out+res)
50 | if self.is_last:
51 | return out, preact
52 | else:
53 | return out
54 |
55 |
56 | class ShuffleNet(nn.Module):
57 | def __init__(self, cfg, num_classes=10):
58 | super(ShuffleNet, self).__init__()
59 | out_planes = cfg['out_planes']
60 | num_blocks = cfg['num_blocks']
61 | groups = cfg['groups']
62 |
63 | self.conv1 = nn.Conv2d(3, 24, kernel_size=1, bias=False)
64 | self.bn1 = nn.BatchNorm2d(24)
65 | self.in_planes = 24
66 | self.layer1 = self._make_layer(out_planes[0], num_blocks[0], groups)
67 | self.layer2 = self._make_layer(out_planes[1], num_blocks[1], groups)
68 | self.layer3 = self._make_layer(out_planes[2], num_blocks[2], groups)
69 | self.linear = nn.Linear(out_planes[2], num_classes)
70 |
71 | def _make_layer(self, out_planes, num_blocks, groups):
72 | layers = []
73 | for i in range(num_blocks):
74 | stride = 2 if i == 0 else 1
75 | cat_planes = self.in_planes if i == 0 else 0
76 | layers.append(Bottleneck(self.in_planes, out_planes-cat_planes,
77 | stride=stride,
78 | groups=groups,
79 | is_last=(i == num_blocks - 1)))
80 | self.in_planes = out_planes
81 | return nn.Sequential(*layers)
82 |
83 | def get_feat_modules(self):
84 | feat_m = nn.ModuleList([])
85 | feat_m.append(self.conv1)
86 | feat_m.append(self.bn1)
87 | feat_m.append(self.layer1)
88 | feat_m.append(self.layer2)
89 | feat_m.append(self.layer3)
90 | return feat_m
91 |
92 | def get_bn_before_relu(self):
93 | raise NotImplementedError('ShuffleNet currently is not supported for "Overhaul" teacher')
94 |
95 | def forward(self, x, is_feat=False, preact=False):
96 | out = F.relu(self.bn1(self.conv1(x)))
97 | f0 = out
98 | out, f1_pre = self.layer1(out)
99 | f1 = out
100 | out, f2_pre = self.layer2(out)
101 | f2 = out
102 | out, f3_pre = self.layer3(out)
103 | f3 = out
104 | out = F.avg_pool2d(out, 4)
105 | out = out.view(out.size(0), -1)
106 | f4 = out
107 | out = self.linear(out)
108 |
109 | if is_feat:
110 | if preact:
111 | return [f0, f1_pre, f2_pre, f3_pre, f4], out
112 | else:
113 | return [f0, f1, f2, f3, f4], out
114 | else:
115 | return out
116 |
117 |
118 | def ShuffleV1(**kwargs):
119 | cfg = {
120 | 'out_planes': [240, 480, 960],
121 | 'num_blocks': [4, 8, 4],
122 | 'groups': 3
123 | }
124 | return ShuffleNet(cfg, **kwargs)
125 |
126 |
127 | if __name__ == '__main__':
128 |
129 | x = torch.randn(2, 3, 32, 32)
130 | net = ShuffleV1(num_classes=100)
131 | import time
132 | a = time.time()
133 | feats, logit = net(x, is_feat=True, preact=True)
134 | b = time.time()
135 | print(b - a)
136 | for f in feats:
137 | print(f.shape, f.min().item())
138 | print(logit.shape)
139 |
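`ShuffleBlock` interleaves channels across groups by viewing the channel axis as `(g, C/g)`, swapping those two axes, and flattening again. The sketch below computes the resulting channel order with plain integers, which makes the permutation easy to inspect without building a tensor. The helper name `shuffled_channel_order` is an assumption for illustration.

```python
def shuffled_channel_order(num_channels, groups):
    """Input-channel index found at each output position after a channel shuffle."""
    if num_channels % groups != 0:
        raise ValueError('num_channels must be divisible by groups')
    per_group = num_channels // groups
    # View as (groups, per_group), transpose to (per_group, groups), then flatten.
    return [g * per_group + j for j in range(per_group) for g in range(groups)]


if __name__ == '__main__':
    print(shuffled_channel_order(6, 3))
    print(shuffled_channel_order(4, 2))
```

Output position `k` holds the input channel `order[k]`, so consecutive output channels come from different groups.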
--------------------------------------------------------------------------------
/models/ShuffleNetv2.py:
--------------------------------------------------------------------------------
1 | '''ShuffleNetV2 in PyTorch.
2 | See the paper "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design" for more details.
3 | '''
4 | import torch
5 | import torch.nn as nn
6 | import torch.nn.functional as F
7 |
8 |
9 | class ShuffleBlock(nn.Module):
10 | def __init__(self, groups=2):
11 | super(ShuffleBlock, self).__init__()
12 | self.groups = groups
13 |
14 | def forward(self, x):
15 | '''Channel shuffle: [N,C,H,W] -> [N,g,C/g,H,W] -> [N,C/g,g,H,W] -> [N,C,H,W]'''
16 | N, C, H, W = x.size()
17 | g = self.groups
18 | return x.view(N, g, C//g, H, W).permute(0, 2, 1, 3, 4).reshape(N, C, H, W)
19 |
20 |
21 | class SplitBlock(nn.Module):
22 | def __init__(self, ratio):
23 | super(SplitBlock, self).__init__()
24 | self.ratio = ratio
25 |
26 | def forward(self, x):
27 | c = int(x.size(1) * self.ratio)
28 | return x[:, :c, :, :], x[:, c:, :, :]
29 |
30 |
31 | class BasicBlock(nn.Module):
32 | def __init__(self, in_channels, split_ratio=0.5, is_last=False):
33 | super(BasicBlock, self).__init__()
34 | self.is_last = is_last
35 | self.split = SplitBlock(split_ratio)
36 | in_channels = int(in_channels * split_ratio)
37 | self.conv1 = nn.Conv2d(in_channels, in_channels,
38 | kernel_size=1, bias=False)
39 | self.bn1 = nn.BatchNorm2d(in_channels)
40 | self.conv2 = nn.Conv2d(in_channels, in_channels,
41 | kernel_size=3, stride=1, padding=1, groups=in_channels, bias=False)
42 | self.bn2 = nn.BatchNorm2d(in_channels)
43 | self.conv3 = nn.Conv2d(in_channels, in_channels,
44 | kernel_size=1, bias=False)
45 | self.bn3 = nn.BatchNorm2d(in_channels)
46 | self.shuffle = ShuffleBlock()
47 |
48 | def forward(self, x):
49 | x1, x2 = self.split(x)
50 | out = F.relu(self.bn1(self.conv1(x2)))
51 | out = self.bn2(self.conv2(out))
52 | preact = self.bn3(self.conv3(out))
53 | out = F.relu(preact)
54 | # out = F.relu(self.bn3(self.conv3(out)))
55 | preact = torch.cat([x1, preact], 1)
56 | out = torch.cat([x1, out], 1)
57 | out = self.shuffle(out)
58 | if self.is_last:
59 | return out, preact
60 | else:
61 | return out
62 |
63 |
64 | class DownBlock(nn.Module):
65 | def __init__(self, in_channels, out_channels):
66 | super(DownBlock, self).__init__()
67 | mid_channels = out_channels // 2
68 | # left
69 | self.conv1 = nn.Conv2d(in_channels, in_channels,
70 | kernel_size=3, stride=2, padding=1, groups=in_channels, bias=False)
71 | self.bn1 = nn.BatchNorm2d(in_channels)
72 | self.conv2 = nn.Conv2d(in_channels, mid_channels,
73 | kernel_size=1, bias=False)
74 | self.bn2 = nn.BatchNorm2d(mid_channels)
75 | # right
76 | self.conv3 = nn.Conv2d(in_channels, mid_channels,
77 | kernel_size=1, bias=False)
78 | self.bn3 = nn.BatchNorm2d(mid_channels)
79 | self.conv4 = nn.Conv2d(mid_channels, mid_channels,
80 | kernel_size=3, stride=2, padding=1, groups=mid_channels, bias=False)
81 | self.bn4 = nn.BatchNorm2d(mid_channels)
82 | self.conv5 = nn.Conv2d(mid_channels, mid_channels,
83 | kernel_size=1, bias=False)
84 | self.bn5 = nn.BatchNorm2d(mid_channels)
85 |
86 | self.shuffle = ShuffleBlock()
87 |
88 | def forward(self, x):
89 | # left
90 | out1 = self.bn1(self.conv1(x))
91 | out1 = F.relu(self.bn2(self.conv2(out1)))
92 | # right
93 | out2 = F.relu(self.bn3(self.conv3(x)))
94 | out2 = self.bn4(self.conv4(out2))
95 | out2 = F.relu(self.bn5(self.conv5(out2)))
96 | # concat
97 | out = torch.cat([out1, out2], 1)
98 | out = self.shuffle(out)
99 | return out
100 |
101 |
102 | class ShuffleNetV2(nn.Module):
103 | def __init__(self, net_size, num_classes=10):
104 | super(ShuffleNetV2, self).__init__()
105 | out_channels = configs[net_size]['out_channels']
106 | num_blocks = configs[net_size]['num_blocks']
107 |
108 | # self.conv1 = nn.Conv2d(3, 24, kernel_size=3,
109 | # stride=1, padding=1, bias=False)
110 | self.conv1 = nn.Conv2d(3, 24, kernel_size=1, bias=False)
111 | self.bn1 = nn.BatchNorm2d(24)
112 | self.in_channels = 24
113 | self.layer1 = self._make_layer(out_channels[0], num_blocks[0])
114 | self.layer2 = self._make_layer(out_channels[1], num_blocks[1])
115 | self.layer3 = self._make_layer(out_channels[2], num_blocks[2])
116 | self.conv2 = nn.Conv2d(out_channels[2], out_channels[3],
117 | kernel_size=1, stride=1, padding=0, bias=False)
118 | self.bn2 = nn.BatchNorm2d(out_channels[3])
119 | self.linear = nn.Linear(out_channels[3], num_classes)
120 |
121 | def _make_layer(self, out_channels, num_blocks):
122 | layers = [DownBlock(self.in_channels, out_channels)]
123 | for i in range(num_blocks):
124 | layers.append(BasicBlock(out_channels, is_last=(i == num_blocks - 1)))
125 | self.in_channels = out_channels
126 | return nn.Sequential(*layers)
127 |
128 | def get_feat_modules(self):
129 | feat_m = nn.ModuleList([])
130 | feat_m.append(self.conv1)
131 | feat_m.append(self.bn1)
132 | feat_m.append(self.layer1)
133 | feat_m.append(self.layer2)
134 | feat_m.append(self.layer3)
135 | return feat_m
136 |
137 | def get_bn_before_relu(self):
138 | raise NotImplementedError('ShuffleNetV2 currently is not supported for "Overhaul" teacher')
139 |
140 | def forward(self, x, is_feat=False, preact=False):
141 | out = F.relu(self.bn1(self.conv1(x)))
142 | # out = F.max_pool2d(out, 3, stride=2, padding=1)
143 | f0 = out
144 | out, f1_pre = self.layer1(out)
145 | f1 = out
146 | out, f2_pre = self.layer2(out)
147 | f2 = out
148 | out, f3_pre = self.layer3(out)
149 | f3 = out
150 | out = F.relu(self.bn2(self.conv2(out)))
151 | out = F.avg_pool2d(out, 4)
152 | out = out.view(out.size(0), -1)
153 | f4 = out
154 | out = self.linear(out)
155 | if is_feat:
156 | if preact:
157 | return [f0, f1_pre, f2_pre, f3_pre, f4], out
158 | else:
159 | return [f0, f1, f2, f3, f4], out
160 | else:
161 | return out
162 |
163 |
164 | configs = {
165 | 0.2: {
166 | 'out_channels': (40, 80, 160, 512),
167 | 'num_blocks': (3, 3, 3)
168 | },
169 |
170 | 0.3: {
171 | 'out_channels': (40, 80, 160, 512),
172 | 'num_blocks': (3, 7, 3)
173 | },
174 |
175 | 0.5: {
176 | 'out_channels': (48, 96, 192, 1024),
177 | 'num_blocks': (3, 7, 3)
178 | },
179 |
180 | 1: {
181 | 'out_channels': (116, 232, 464, 1024),
182 | 'num_blocks': (3, 7, 3)
183 | },
184 | 1.5: {
185 | 'out_channels': (176, 352, 704, 1024),
186 | 'num_blocks': (3, 7, 3)
187 | },
188 | 2: {
189 | 'out_channels': (224, 488, 976, 2048),
190 | 'num_blocks': (3, 7, 3)
191 | }
192 | }
193 |
194 |
195 | def ShuffleV2(**kwargs):
196 | model = ShuffleNetV2(net_size=1, **kwargs)
197 | return model
198 |
199 |
200 | if __name__ == '__main__':
201 | net = ShuffleV2(num_classes=100)
202 | x = torch.randn(3, 3, 32, 32)
203 | import time
204 | a = time.time()
205 | feats, logit = net(x, is_feat=True, preact=True)
206 | b = time.time()
207 | print(b - a)
208 | for f in feats:
209 | print(f.shape, f.min().item())
210 | print(logit.shape)
211 |
--------------------------------------------------------------------------------
/models/__init__.py:
--------------------------------------------------------------------------------
1 | from .resnet import resnet8, resnet14, resnet20, resnet32, resnet44, resnet56, resnet110, resnet8x4, resnet32x4
2 | from .resnetv2 import ResNet50
3 | from .wrn import wrn_16_1, wrn_16_2, wrn_40_1, wrn_40_2
4 | from .vgg import vgg19_bn, vgg16_bn, vgg13_bn, vgg11_bn, vgg8_bn
5 | from .mobilenetv2 import mobile_half
6 | from .ShuffleNetv1 import ShuffleV1
7 | from .ShuffleNetv2 import ShuffleV2
8 |
9 | model_dict = {
10 | 'resnet8': resnet8,
11 | 'resnet14': resnet14,
12 | 'resnet20': resnet20,
13 | 'resnet32': resnet32,
14 | 'resnet44': resnet44,
15 | 'resnet56': resnet56,
16 | 'resnet110': resnet110,
17 | 'resnet8x4': resnet8x4,
18 | 'resnet32x4': resnet32x4,
19 | 'ResNet50': ResNet50,
20 | 'wrn_16_1': wrn_16_1,
21 | 'wrn_16_2': wrn_16_2,
22 | 'wrn_40_1': wrn_40_1,
23 | 'wrn_40_2': wrn_40_2,
24 | 'vgg8': vgg8_bn,
25 | 'vgg11': vgg11_bn,
26 | 'vgg13': vgg13_bn,
27 | 'vgg16': vgg16_bn,
28 | 'vgg19': vgg19_bn,
29 | 'MobileNetV2': mobile_half,
30 | 'ShuffleV1': ShuffleV1,
31 | 'ShuffleV2': ShuffleV2,
32 | }
33 |
--------------------------------------------------------------------------------
/models/classifier.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch.nn as nn
4 |
5 |
6 | #########################################
7 | # ===== Classifiers ===== #
8 | #########################################
9 |
10 | class LinearClassifier(nn.Module):
11 |
12 | def __init__(self, dim_in, n_label=10):
13 | super(LinearClassifier, self).__init__()
14 |
15 | self.net = nn.Linear(dim_in, n_label)
16 |
17 | def forward(self, x):
18 | return self.net(x)
19 |
20 |
21 | class NonLinearClassifier(nn.Module):
22 |
23 | def __init__(self, dim_in, n_label=10, p=0.1):
24 | super(NonLinearClassifier, self).__init__()
25 |
26 | self.net = nn.Sequential(
27 | nn.Linear(dim_in, 200),
28 | nn.Dropout(p=p),
29 | nn.BatchNorm1d(200),
30 | nn.ReLU(inplace=True),
31 | nn.Linear(200, n_label),
32 | )
33 |
34 | def forward(self, x):
35 | return self.net(x)
36 |
--------------------------------------------------------------------------------
/models/mobilenetv2.py:
--------------------------------------------------------------------------------
1 | """
2 | MobileNetV2 implementation used in
3 |
4 | """
5 |
6 | import torch
7 | import torch.nn as nn
8 | import math
9 |
10 | __all__ = ['mobilenetv2_T_w', 'mobile_half']
11 |
12 | BN = None
13 |
14 |
15 | def conv_bn(inp, oup, stride):
16 | return nn.Sequential(
17 | nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
18 | nn.BatchNorm2d(oup),
19 | nn.ReLU(inplace=True)
20 | )
21 |
22 |
23 | def conv_1x1_bn(inp, oup):
24 | return nn.Sequential(
25 | nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
26 | nn.BatchNorm2d(oup),
27 | nn.ReLU(inplace=True)
28 | )
29 |
30 |
31 | class InvertedResidual(nn.Module):
32 | def __init__(self, inp, oup, stride, expand_ratio):
33 | super(InvertedResidual, self).__init__()
34 | self.blockname = None
35 |
36 | self.stride = stride
37 | assert stride in [1, 2]
38 |
39 | self.use_res_connect = self.stride == 1 and inp == oup
40 |
41 | self.conv = nn.Sequential(
42 | # pw
43 | nn.Conv2d(inp, inp * expand_ratio, 1, 1, 0, bias=False),
44 | nn.BatchNorm2d(inp * expand_ratio),
45 | nn.ReLU(inplace=True),
46 | # dw
47 | nn.Conv2d(inp * expand_ratio, inp * expand_ratio, 3, stride, 1, groups=inp * expand_ratio, bias=False),
48 | nn.BatchNorm2d(inp * expand_ratio),
49 | nn.ReLU(inplace=True),
50 | # pw-linear
51 | nn.Conv2d(inp * expand_ratio, oup, 1, 1, 0, bias=False),
52 | nn.BatchNorm2d(oup),
53 | )
54 | self.names = ['0', '1', '2', '3', '4', '5', '6', '7']
55 |
56 | def forward(self, x):
57 | t = x
58 | if self.use_res_connect:
59 | return t + self.conv(x)
60 | else:
61 | return self.conv(x)
62 |
63 |
64 | class MobileNetV2(nn.Module):
65 | """mobilenetV2"""
66 | def __init__(self, T,
67 | feature_dim,
68 | input_size=32,
69 | width_mult=1.,
70 | remove_avg=False):
71 | super(MobileNetV2, self).__init__()
72 | self.remove_avg = remove_avg
73 |
74 | # setting of inverted residual blocks
75 | self.interverted_residual_setting = [
76 | # t, c, n, s
77 | [1, 16, 1, 1],
78 | [T, 24, 2, 1],
79 | [T, 32, 3, 2],
80 | [T, 64, 4, 2],
81 | [T, 96, 3, 1],
82 | [T, 160, 3, 2],
83 | [T, 320, 1, 1],
84 | ]
85 |
86 | # building first layer
87 | assert input_size % 32 == 0
88 | input_channel = int(32 * width_mult)
89 | self.conv1 = conv_bn(3, input_channel, 2)
90 |
91 | # building inverted residual blocks
92 | self.blocks = nn.ModuleList([])
93 | for t, c, n, s in self.interverted_residual_setting:
94 | output_channel = int(c * width_mult)
95 | layers = []
96 | strides = [s] + [1] * (n - 1)
97 | for stride in strides:
98 | layers.append(
99 | InvertedResidual(input_channel, output_channel, stride, t)
100 | )
101 | input_channel = output_channel
102 | self.blocks.append(nn.Sequential(*layers))
103 |
104 | self.last_channel = int(1280 * width_mult) if width_mult > 1.0 else 1280
105 | self.conv2 = conv_1x1_bn(input_channel, self.last_channel)
106 |
107 | # building classifier
108 | self.classifier = nn.Sequential(
109 | # nn.Dropout(0.5),
110 | nn.Linear(self.last_channel, feature_dim),
111 | )
112 |
113 | H = input_size // (32//2)
114 | self.avgpool = nn.AvgPool2d(H, ceil_mode=True)
115 |
116 | self._initialize_weights()
117 | print(T, width_mult)
118 |
119 | def get_bn_before_relu(self):
120 | bn1 = self.blocks[1][-1].conv[-1]
121 | bn2 = self.blocks[2][-1].conv[-1]
122 | bn3 = self.blocks[4][-1].conv[-1]
123 | bn4 = self.blocks[6][-1].conv[-1]
124 | return [bn1, bn2, bn3, bn4]
125 |
126 | def get_feat_modules(self):
127 | feat_m = nn.ModuleList([])
128 | feat_m.append(self.conv1)
129 | feat_m.append(self.blocks)
130 | return feat_m
131 |
132 | def forward(self, x, is_feat=False, preact=False):
133 |
134 | out = self.conv1(x)
135 | f0 = out
136 |
137 | out = self.blocks[0](out)
138 | out = self.blocks[1](out)
139 | f1 = out
140 | out = self.blocks[2](out)
141 | f2 = out
142 | out = self.blocks[3](out)
143 | out = self.blocks[4](out)
144 | f3 = out
145 | out = self.blocks[5](out)
146 | out = self.blocks[6](out)
147 | f4 = out
148 |
149 | out = self.conv2(out)
150 |
151 | if not self.remove_avg:
152 | out = self.avgpool(out)
153 | out = out.view(out.size(0), -1)
154 | f5 = out
155 | out = self.classifier(out)
156 |
157 | if is_feat:
158 | return [f0, f1, f2, f3, f4, f5], out
159 | else:
160 | return out
161 |
162 | def _initialize_weights(self):
163 | for m in self.modules():
164 | if isinstance(m, nn.Conv2d):
165 | n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels #/ m.groups
166 | # print(m.kernel_size[0], m.kernel_size[1], m.in_channels, m.out_channels, m.groups)
167 | m.weight.data.normal_(0, math.sqrt(2. / n))
168 | if m.bias is not None:
169 | m.bias.data.zero_()
170 |
171 | elif isinstance(m, nn.BatchNorm2d):
172 | m.weight.data.fill_(1)
173 | m.bias.data.zero_()
174 | elif isinstance(m, nn.Linear):
175 | n = m.weight.size(1)
176 | m.weight.data.normal_(0, 0.01)
177 | m.bias.data.zero_()
178 | print("initializing done!!!")
179 | # exit()
180 |
181 | def mobilenetv2_T_w(T, W, feature_dim=100):
182 | model = MobileNetV2(T=T, feature_dim=feature_dim, width_mult=W)
183 | return model
184 |
185 |
186 | def mobile_half(num_classes):
187 | return mobilenetv2_T_w(6, 0.5, num_classes)
188 |
189 |
190 | if __name__ == '__main__':
191 | x = torch.randn(2, 3, 32, 32)
192 |
193 | net = mobile_half(100)
194 |
195 | feats, logit = net(x, is_feat=True, preact=True)
196 | for f in feats:
197 | print(f.shape, f.min().item())
198 | print(logit.shape)
199 |
200 | for m in net.get_bn_before_relu():
201 | if isinstance(m, nn.BatchNorm2d):
202 | print('pass')
203 | else:
204 | print('warning')
205 |
206 |
--------------------------------------------------------------------------------
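The stage table in `MobileNetV2.__init__` above determines both the channel widths and the spatial resolution of every stage. As a rough illustration, the sketch below redoes that bookkeeping in plain Python, with no PyTorch involved. `conv_out` and `stage_plan` are hypothetical helper names that do not exist in the repository, and `T = 6` with `width_mult = 0.5` is assumed because that is what `mobile_half` passes.

```python
# Illustrative sketch only: pure-Python bookkeeping, not part of models/mobilenetv2.py.
# (t, c, n, s) rows as listed in MobileNetV2.__init__; T = 6 matches mobile_half.
T = 6
SETTINGS = [
    (1, 16, 1, 1),
    (T, 24, 2, 1),
    (T, 32, 3, 2),
    (T, 64, 4, 2),
    (T, 96, 3, 1),
    (T, 160, 3, 2),
    (T, 320, 1, 1),
]


def conv_out(size, stride, kernel=3, padding=1):
    """Spatial size after a square convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_plan(input_size=32, width_mult=0.5):
    """Hypothetical helper: (channels, strides, resolution) for each stage."""
    size = conv_out(input_size, stride=2)  # conv1 uses stride 2
    plan = []
    for _t, c, n, s in SETTINGS:
        strides = [s] + [1] * (n - 1)  # only the first block of a stage downsamples
        for stride in strides:
            size = conv_out(size, stride)  # the 3x3 depthwise conv carries the stride
        plan.append((int(c * width_mult), strides, size))
    return plan


if __name__ == "__main__":
    for channels, strides, size in stage_plan():
        print(channels, strides, size)
```

For a 32x32 input the last stage ends at a 2x2 map, which agrees with the pooling size `H = input_size // (32//2)` used for `self.avgpool`.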
/models/resnet.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 |
3 | '''Resnet for cifar dataset.
4 | Ported from
5 | https://github.com/facebook/fb.resnet.torch
6 | and
7 | https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py
8 | (c) YANG, Wei
9 | '''
10 | import torch.nn as nn
11 | import torch.nn.functional as F
12 | import math
13 |
14 |
15 | __all__ = ['resnet']
16 |
17 |
18 | def conv3x3(in_planes, out_planes, stride=1):
19 | """3x3 convolution with padding"""
20 | return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
21 | padding=1, bias=False)
22 |
23 |
24 | class BasicBlock(nn.Module):
25 | expansion = 1
26 |
27 | def __init__(self, inplanes, planes, stride=1, downsample=None, is_last=False):
28 | super(BasicBlock, self).__init__()
29 | self.is_last = is_last
30 | self.conv1 = conv3x3(inplanes, planes, stride)
31 | self.bn1 = nn.BatchNorm2d(planes)
32 | self.relu = nn.ReLU(inplace=True)
33 | self.conv2 = conv3x3(planes, planes)
34 | self.bn2 = nn.BatchNorm2d(planes)
35 | self.downsample = downsample
36 | self.stride = stride
37 |
38 | def forward(self, x):
39 | residual = x
40 |
41 | out = self.conv1(x)
42 | out = self.bn1(out)
43 | out = self.relu(out)
44 |
45 | out = self.conv2(out)
46 | out = self.bn2(out)
47 |
48 | if self.downsample is not None:
49 | residual = self.downsample(x)
50 |
51 | out += residual
52 | preact = out
53 | out = F.relu(out)
54 | if self.is_last:
55 | return out, preact
56 | else:
57 | return out
58 |
59 |
60 | class Bottleneck(nn.Module):
61 | expansion = 4
62 |
63 | def __init__(self, inplanes, planes, stride=1, downsample=None, is_last=False):
64 | super(Bottleneck, self).__init__()
65 | self.is_last = is_last
66 | self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
67 | self.bn1 = nn.BatchNorm2d(planes)
68 | self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
69 | padding=1, bias=False)
70 | self.bn2 = nn.BatchNorm2d(planes)
71 | self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
72 | self.bn3 = nn.BatchNorm2d(planes * 4)
73 | self.relu = nn.ReLU(inplace=True)
74 | self.downsample = downsample
75 | self.stride = stride
76 |
77 | def forward(self, x):
78 | residual = x
79 |
80 | out = self.conv1(x)
81 | out = self.bn1(out)
82 | out = self.relu(out)
83 |
84 | out = self.conv2(out)
85 | out = self.bn2(out)
86 | out = self.relu(out)
87 |
88 | out = self.conv3(out)
89 | out = self.bn3(out)
90 |
91 | if self.downsample is not None:
92 | residual = self.downsample(x)
93 |
94 | out += residual
95 | preact = out
96 | out = F.relu(out)
97 | if self.is_last:
98 | return out, preact
99 | else:
100 | return out
101 |
102 |
103 | class ResNet(nn.Module):
104 |
105 | def __init__(self, depth, num_filters, block_name='BasicBlock', num_classes=10):
106 | super(ResNet, self).__init__()
107 | # Model type specifies number of layers for CIFAR-10 model
108 | if block_name.lower() == 'basicblock':
109 | assert (depth - 2) % 6 == 0, 'When using basicblock, depth should be 6n+2, e.g. 20, 32, 44, 56, 110, 1202'
110 | n = (depth - 2) // 6
111 | block = BasicBlock
112 | elif block_name.lower() == 'bottleneck':
113 | assert (depth - 2) % 9 == 0, 'When using bottleneck, depth should be 9n+2, e.g. 20, 29, 47, 56, 110, 1199'
114 | n = (depth - 2) // 9
115 | block = Bottleneck
116 | else:
117 | raise ValueError('block_name should be BasicBlock or Bottleneck')
118 |
119 | self.inplanes = num_filters[0]
120 | self.conv1 = nn.Conv2d(3, num_filters[0], kernel_size=3, padding=1,
121 | bias=False)
122 | self.bn1 = nn.BatchNorm2d(num_filters[0])
123 | self.relu = nn.ReLU(inplace=True)
124 | self.layer1 = self._make_layer(block, num_filters[1], n)
125 | self.layer2 = self._make_layer(block, num_filters[2], n, stride=2)
126 | self.layer3 = self._make_layer(block, num_filters[3], n, stride=2)
127 | self.avgpool = nn.AvgPool2d(8)
128 | self.fc = nn.Linear(num_filters[3] * block.expansion, num_classes)
129 |
130 | for m in self.modules():
131 | if isinstance(m, nn.Conv2d):
132 | nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
133 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
134 | nn.init.constant_(m.weight, 1)
135 | nn.init.constant_(m.bias, 0)
136 |
137 | def _make_layer(self, block, planes, blocks, stride=1):
138 | downsample = None
139 | if stride != 1 or self.inplanes != planes * block.expansion:
140 | downsample = nn.Sequential(
141 | nn.Conv2d(self.inplanes, planes * block.expansion,
142 | kernel_size=1, stride=stride, bias=False),
143 | nn.BatchNorm2d(planes * block.expansion),
144 | )
145 |
146 | layers = list([])
147 | layers.append(block(self.inplanes, planes, stride, downsample, is_last=(blocks == 1)))
148 | self.inplanes = planes * block.expansion
149 | for i in range(1, blocks):
150 | layers.append(block(self.inplanes, planes, is_last=(i == blocks-1)))
151 |
152 | return nn.Sequential(*layers)
153 |
154 | def get_feat_modules(self):
155 | feat_m = nn.ModuleList([])
156 | feat_m.append(self.conv1)
157 | feat_m.append(self.bn1)
158 | feat_m.append(self.relu)
159 | feat_m.append(self.layer1)
160 | feat_m.append(self.layer2)
161 | feat_m.append(self.layer3)
162 | return feat_m
163 |
164 | def get_bn_before_relu(self):
165 | if isinstance(self.layer1[0], Bottleneck):
166 | bn1 = self.layer1[-1].bn3
167 | bn2 = self.layer2[-1].bn3
168 | bn3 = self.layer3[-1].bn3
169 | elif isinstance(self.layer1[0], BasicBlock):
170 | bn1 = self.layer1[-1].bn2
171 | bn2 = self.layer2[-1].bn2
172 | bn3 = self.layer3[-1].bn2
173 | else:
174 | raise NotImplementedError('ResNet unknown block error !!!')
175 |
176 | return [bn1, bn2, bn3]
177 |
178 | def forward(self, x, is_feat=False, preact=False):
179 | x = self.conv1(x)
180 | x = self.bn1(x)
181 | x = self.relu(x) # 32x32
182 | f0 = x
183 |
184 | x, f1_pre = self.layer1(x) # 32x32
185 | f1 = x
186 | x, f2_pre = self.layer2(x) # 16x16
187 | f2 = x
188 | x, f3_pre = self.layer3(x) # 8x8
189 | f3 = x
190 |
191 | x = self.avgpool(x)
192 | x = x.view(x.size(0), -1)
193 | f4 = x
194 | x = self.fc(x)
195 |
196 | if is_feat:
197 | if preact:
198 | return [f0, f1_pre, f2_pre, f3_pre, f4], x
199 | else:
200 | return [f0, f1, f2, f3, f4], x
201 | else:
202 | return x
203 |
204 |
205 | def resnet8(**kwargs):
206 | return ResNet(8, [16, 16, 32, 64], 'basicblock', **kwargs)
207 |
208 |
209 | def resnet14(**kwargs):
210 | return ResNet(14, [16, 16, 32, 64], 'basicblock', **kwargs)
211 |
212 |
213 | def resnet20(**kwargs):
214 | return ResNet(20, [16, 16, 32, 64], 'basicblock', **kwargs)
215 |
216 |
217 | def resnet32(**kwargs):
218 | return ResNet(32, [16, 16, 32, 64], 'basicblock', **kwargs)
219 |
220 |
221 | def resnet44(**kwargs):
222 | return ResNet(44, [16, 16, 32, 64], 'basicblock', **kwargs)
223 |
224 |
225 | def resnet56(**kwargs):
226 | return ResNet(56, [16, 16, 32, 64], 'basicblock', **kwargs)
227 |
228 |
229 | def resnet110(**kwargs):
230 | return ResNet(110, [16, 16, 32, 64], 'basicblock', **kwargs)
231 |
232 |
233 | def resnet8x4(**kwargs):
234 | return ResNet(8, [32, 64, 128, 256], 'basicblock', **kwargs)
235 |
236 |
237 | def resnet32x4(**kwargs):
238 | return ResNet(32, [32, 64, 128, 256], 'basicblock', **kwargs)
239 |
240 |
241 | if __name__ == '__main__':
242 | import torch
243 |
244 | x = torch.randn(2, 3, 32, 32)
245 | net = resnet8x4(num_classes=20)
246 | feats, logit = net(x, is_feat=True, preact=True)
247 |
248 | for f in feats:
249 | print(f.shape, f.min().item())
250 | print(logit.shape)
251 |
252 | for m in net.get_bn_before_relu():
253 | if isinstance(m, nn.BatchNorm2d):
254 | print('pass')
255 | else:
256 | print('warning')
257 |
--------------------------------------------------------------------------------
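The CIFAR `ResNet` constructor above derives the number of blocks per stage from `depth` (6n+2 for `BasicBlock`, 9n+2 for `Bottleneck`) and sizes the final `fc` layer from `num_filters[3] * block.expansion`. The sketch below reproduces only that arithmetic. `blocks_per_stage` and `fc_in_features` are hypothetical names, and the sketch raises `ValueError` where the original uses `assert`.

```python
# Illustrative sketch only: the depth arithmetic from ResNet.__init__, without PyTorch.
# Maps block name -> (conv layers contributed per block across the 3 stages, expansion).
BLOCKS = {"basicblock": (6, 1), "bottleneck": (9, 4)}


def blocks_per_stage(depth, block_name="basicblock"):
    """Hypothetical helper: number of residual blocks in each of the three stages."""
    key = block_name.lower()
    if key not in BLOCKS:
        raise ValueError("block_name should be BasicBlock or Bottleneck")
    step, _expansion = BLOCKS[key]
    if (depth - 2) % step != 0:
        # The original code enforces this with an assert instead.
        raise ValueError("depth should be {}n+2, got {}".format(step, depth))
    return (depth - 2) // step


def fc_in_features(num_filters, block_name="basicblock"):
    """Hypothetical helper: input width of the final linear layer."""
    _step, expansion = BLOCKS[block_name.lower()]
    return num_filters[3] * expansion


if __name__ == "__main__":
    print(blocks_per_stage(8), fc_in_features([32, 64, 128, 256]))  # resnet8x4
    print(blocks_per_stage(110), fc_in_features([16, 16, 32, 64]))  # resnet110
```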
/models/resnetv2.py:
--------------------------------------------------------------------------------
1 | '''ResNet in PyTorch.
2 | For Pre-activation ResNet, see 'preact_resnet.py'.
3 | Reference:
4 | [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
5 | Deep Residual Learning for Image Recognition. arXiv:1512.03385
6 | '''
7 | import torch
8 | import torch.nn as nn
9 | import torch.nn.functional as F
10 |
11 |
12 | class BasicBlock(nn.Module):
13 | expansion = 1
14 |
15 | def __init__(self, in_planes, planes, stride=1, is_last=False):
16 | super(BasicBlock, self).__init__()
17 | self.is_last = is_last
18 | self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
19 | self.bn1 = nn.BatchNorm2d(planes)
20 | self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
21 | self.bn2 = nn.BatchNorm2d(planes)
22 |
23 | self.shortcut = nn.Sequential()
24 | if stride != 1 or in_planes != self.expansion * planes:
25 | self.shortcut = nn.Sequential(
26 | nn.Conv2d(in_planes, self.expansion * planes, kernel_size=1, stride=stride, bias=False),
27 | nn.BatchNorm2d(self.expansion * planes)
28 | )
29 |
30 | def forward(self, x):
31 | out = F.relu(self.bn1(self.conv1(x)))
32 | out = self.bn2(self.conv2(out))
33 | out += self.shortcut(x)
34 | preact = out
35 | out = F.relu(out)
36 | if self.is_last:
37 | return out, preact
38 | else:
39 | return out
40 |
41 |
42 | class Bottleneck(nn.Module):
43 | expansion = 4
44 |
45 | def __init__(self, in_planes, planes, stride=1, is_last=False):
46 | super(Bottleneck, self).__init__()
47 | self.is_last = is_last
48 | self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
49 | self.bn1 = nn.BatchNorm2d(planes)
50 | self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
51 | self.bn2 = nn.BatchNorm2d(planes)
52 | self.conv3 = nn.Conv2d(planes, self.expansion * planes, kernel_size=1, bias=False)
53 | self.bn3 = nn.BatchNorm2d(self.expansion * planes)
54 |
55 | self.shortcut = nn.Sequential()
56 | if stride != 1 or in_planes != self.expansion * planes:
57 | self.shortcut = nn.Sequential(
58 | nn.Conv2d(in_planes, self.expansion * planes, kernel_size=1, stride=stride, bias=False),
59 | nn.BatchNorm2d(self.expansion * planes)
60 | )
61 |
62 | def forward(self, x):
63 | out = F.relu(self.bn1(self.conv1(x)))
64 | out = F.relu(self.bn2(self.conv2(out)))
65 | out = self.bn3(self.conv3(out))
66 | out += self.shortcut(x)
67 | preact = out
68 | out = F.relu(out)
69 | if self.is_last:
70 | return out, preact
71 | else:
72 | return out
73 |
74 |
75 | class ResNet(nn.Module):
76 | def __init__(self, block, num_blocks, num_classes=10, zero_init_residual=False):
77 | super(ResNet, self).__init__()
78 | self.in_planes = 64
79 |
80 | self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
81 | self.bn1 = nn.BatchNorm2d(64)
82 | self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
83 | self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
84 | self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
85 | self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
86 | self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
87 | self.linear = nn.Linear(512 * block.expansion, num_classes)
88 |
89 | for m in self.modules():
90 | if isinstance(m, nn.Conv2d):
91 | nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
92 | elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
93 | nn.init.constant_(m.weight, 1)
94 | nn.init.constant_(m.bias, 0)
95 |
96 | # Zero-initialize the last BN in each residual branch,
97 | # so that the residual branch starts with zeros, and each residual block behaves like an identity.
98 | # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
99 | if zero_init_residual:
100 | for m in self.modules():
101 | if isinstance(m, Bottleneck):
102 | nn.init.constant_(m.bn3.weight, 0)
103 | elif isinstance(m, BasicBlock):
104 | nn.init.constant_(m.bn2.weight, 0)
105 |
106 | def get_feat_modules(self):
107 | feat_m = nn.ModuleList([])
108 | feat_m.append(self.conv1)
109 | feat_m.append(self.bn1)
110 | feat_m.append(self.layer1)
111 | feat_m.append(self.layer2)
112 | feat_m.append(self.layer3)
113 | feat_m.append(self.layer4)
114 | return feat_m
115 |
116 | def get_bn_before_relu(self):
117 | if isinstance(self.layer1[0], Bottleneck):
118 | bn1 = self.layer1[-1].bn3
119 | bn2 = self.layer2[-1].bn3
120 | bn3 = self.layer3[-1].bn3
121 | bn4 = self.layer4[-1].bn3
122 | elif isinstance(self.layer1[0], BasicBlock):
123 | bn1 = self.layer1[-1].bn2
124 | bn2 = self.layer2[-1].bn2
125 | bn3 = self.layer3[-1].bn2
126 | bn4 = self.layer4[-1].bn2
127 | else:
128 | raise NotImplementedError('ResNet unknown block error !!!')
129 |
130 | return [bn1, bn2, bn3, bn4]
131 |
132 | def _make_layer(self, block, planes, num_blocks, stride):
133 | strides = [stride] + [1] * (num_blocks - 1)
134 | layers = []
135 | for i in range(num_blocks):
136 | stride = strides[i]
137 | layers.append(block(self.in_planes, planes, stride, i == num_blocks - 1))
138 | self.in_planes = planes * block.expansion
139 | return nn.Sequential(*layers)
140 |
141 | def forward(self, x, is_feat=False, preact=False):
142 | out = F.relu(self.bn1(self.conv1(x)))
143 | f0 = out
144 | out, f1_pre = self.layer1(out)
145 | f1 = out
146 | out, f2_pre = self.layer2(out)
147 | f2 = out
148 | out, f3_pre = self.layer3(out)
149 | f3 = out
150 | out, f4_pre = self.layer4(out)
151 | f4 = out
152 | out = self.avgpool(out)
153 | out = out.view(out.size(0), -1)
154 | f5 = out
155 | out = self.linear(out)
156 | if is_feat:
157 | if preact:
158 | return [f0, f1_pre, f2_pre, f3_pre, f4_pre, f5], out
159 | else:
160 | return [f0, f1, f2, f3, f4, f5], out
161 | else:
162 | return out
163 |
164 |
165 | def ResNet18(**kwargs):
166 | return ResNet(BasicBlock, [2, 2, 2, 2], **kwargs)
167 |
168 |
169 | def ResNet34(**kwargs):
170 | return ResNet(BasicBlock, [3, 4, 6, 3], **kwargs)
171 |
172 |
173 | def ResNet50(**kwargs):
174 | return ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)
175 |
176 |
177 | def ResNet101(**kwargs):
178 | return ResNet(Bottleneck, [3, 4, 23, 3], **kwargs)
179 |
180 |
181 | def ResNet152(**kwargs):
182 | return ResNet(Bottleneck, [3, 8, 36, 3], **kwargs)
183 |
184 |
185 | if __name__ == '__main__':
186 | net = ResNet18(num_classes=100)
187 | x = torch.randn(2, 3, 32, 32)
188 | feats, logit = net(x, is_feat=True, preact=True)
189 |
190 | for f in feats:
191 | print(f.shape, f.min().item())
192 | print(logit.shape)
193 |
194 | for m in net.get_bn_before_relu():
195 | if isinstance(m, nn.BatchNorm2d):
196 | print('pass')
197 | else:
198 | print('warning')
199 |
--------------------------------------------------------------------------------
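With `is_feat=True`, the `ResNet.forward` above returns six features `f0` to `f5`. The sketch below estimates their per-sample shapes from the layer configuration alone, assuming a square input and the stride pattern set up in `_make_layer`. `make_layer_strides` and `feature_shapes` are hypothetical names, and nothing here calls PyTorch.

```python
# Illustrative sketch only: per-sample feature shapes implied by models/resnetv2.py.
def make_layer_strides(stride, num_blocks):
    """Same stride list that _make_layer builds: only the first block downsamples."""
    return [stride] + [1] * (num_blocks - 1)


def feature_shapes(expansion=1, input_size=32):
    """Hypothetical helper: shapes (C, H, W) of f0..f4 and (C,) of f5.

    expansion is 1 for BasicBlock and 4 for Bottleneck.
    """
    size = input_size  # the stem conv has stride 1, so resolution is unchanged
    shapes = [(64, size, size)]  # f0: output of conv1 + bn1 + relu
    for planes, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
        # 3x3 conv with padding 1: floor((size + 2 - 3) / stride) + 1
        size = (size + 2 - 3) // stride + 1
        shapes.append((planes * expansion, size, size))
    shapes.append((512 * expansion,))  # f5: after adaptive average pooling and flatten
    return shapes


if __name__ == "__main__":
    print(make_layer_strides(2, 3))
    for shape in feature_shapes(expansion=1):
        print(shape)
```

The block counts (for example `[2, 2, 2, 2]` for `ResNet18`) change the depth of each layer but not these shapes, since every block after the first in a layer uses stride 1.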
/models/util.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import torch.nn as nn
4 | import math
5 |
6 |
7 | class Paraphraser(nn.Module):
8 | """Paraphrasing Complex Network: Network Compression via Factor Transfer"""
9 | def __init__(self, t_shape, k=0.5, use_bn=False):
10 | super(Paraphraser, self).__init__()
11 | in_channel = t_shape[1]
12 | out_channel = int(t_shape[1] * k)
13 | self.encoder = nn.Sequential(
14 | nn.Conv2d(in_channel, in_channel, 3, 1, 1),
15 | nn.BatchNorm2d(in_channel) if use_bn else nn.Sequential(),
16 | nn.LeakyReLU(0.1, inplace=True),
17 | nn.Conv2d(in_channel, out_channel, 3, 1, 1),
18 | nn.BatchNorm2d(out_channel) if use_bn else nn.Sequential(),
19 | nn.LeakyReLU(0.1, inplace=True),
20 | nn.Conv2d(out_channel, out_channel, 3, 1, 1),
21 | nn.BatchNorm2d(out_channel) if use_bn else nn.Sequential(),
22 | nn.LeakyReLU(0.1, inplace=True),
23 | )
24 | self.decoder = nn.Sequential(
25 | nn.ConvTranspose2d(out_channel, out_channel, 3, 1, 1),
26 | nn.BatchNorm2d(out_channel) if use_bn else nn.Sequential(),
27 | nn.LeakyReLU(0.1, inplace=True),
28 | nn.ConvTranspose2d(out_channel, in_channel, 3, 1, 1),
29 | nn.BatchNorm2d(in_channel) if use_bn else nn.Sequential(),
30 | nn.LeakyReLU(0.1, inplace=True),
31 | nn.ConvTranspose2d(in_channel, in_channel, 3, 1, 1),
32 | nn.BatchNorm2d(in_channel) if use_bn else nn.Sequential(),
33 | nn.LeakyReLU(0.1, inplace=True),
34 | )
35 |
36 | def forward(self, f_s, is_factor=False):
37 | factor = self.encoder(f_s)
38 | if is_factor:
39 | return factor
40 | rec = self.decoder(factor)
41 | return factor, rec
42 |
43 |
44 | class Translator(nn.Module):
45 | def __init__(self, s_shape, t_shape, k=0.5, use_bn=True):
46 | super(Translator, self).__init__()
47 | in_channel = s_shape[1]
48 | out_channel = int(t_shape[1] * k)
49 | self.encoder = nn.Sequential(
50 | nn.Conv2d(in_channel, in_channel, 3, 1, 1),
51 | nn.BatchNorm2d(in_channel) if use_bn else nn.Sequential(),
52 | nn.LeakyReLU(0.1, inplace=True),
53 | nn.Conv2d(in_channel, out_channel, 3, 1, 1),
54 | nn.BatchNorm2d(out_channel) if use_bn else nn.Sequential(),
55 | nn.LeakyReLU(0.1, inplace=True),
56 | nn.Conv2d(out_channel, out_channel, 3, 1, 1),
57 | nn.BatchNorm2d(out_channel) if use_bn else nn.Sequential(),
58 | nn.LeakyReLU(0.1, inplace=True),
59 | )
60 |
61 | def forward(self, f_s):
62 | return self.encoder(f_s)
63 |
64 |
65 | class Connector(nn.Module):
66 | """Connect for Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons"""
67 | def __init__(self, s_shapes, t_shapes):
68 | super(Connector, self).__init__()
69 | self.s_shapes = s_shapes
70 | self.t_shapes = t_shapes
71 |
72 | self.connectors = nn.ModuleList(self._make_conenctors(s_shapes, t_shapes))
73 |
74 | @staticmethod
75 | def _make_conenctors(s_shapes, t_shapes):
76 | assert len(s_shapes) == len(t_shapes), 'unequal length of feat list'
77 | connectors = []
78 | for s, t in zip(s_shapes, t_shapes):
79 | if s[1] == t[1] and s[2] == t[2]:
80 | connectors.append(nn.Sequential())
81 | else:
82 | connectors.append(ConvReg(s, t, use_relu=False))
83 | return connectors
84 |
85 | def forward(self, g_s):
86 | out = []
87 | for i in range(len(g_s)):
88 | out.append(self.connectors[i](g_s[i]))
89 |
90 | return out
91 |
92 |
93 | class ConnectorV2(nn.Module):
94 | """A Comprehensive Overhaul of Feature Distillation (ICCV 2019)"""
95 | def __init__(self, s_shapes, t_shapes):
96 | super(ConnectorV2, self).__init__()
97 | self.s_shapes = s_shapes
98 | self.t_shapes = t_shapes
99 |
100 | self.connectors = nn.ModuleList(self._make_conenctors(s_shapes, t_shapes))
101 |
102 | def _make_conenctors(self, s_shapes, t_shapes):
103 | assert len(s_shapes) == len(t_shapes), 'unequal length of feat list'
104 | t_channels = [t[1] for t in t_shapes]
105 | s_channels = [s[1] for s in s_shapes]
106 | connectors = nn.ModuleList([self._build_feature_connector(t, s)
107 | for t, s in zip(t_channels, s_channels)])
108 | return connectors
109 |
110 | @staticmethod
111 | def _build_feature_connector(t_channel, s_channel):
112 | C = [nn.Conv2d(s_channel, t_channel, kernel_size=1, stride=1, padding=0, bias=False),
113 | nn.BatchNorm2d(t_channel)]
114 | for m in C:
115 | if isinstance(m, nn.Conv2d):
116 | n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
117 | m.weight.data.normal_(0, math.sqrt(2. / n))
118 | elif isinstance(m, nn.BatchNorm2d):
119 | m.weight.data.fill_(1)
120 | m.bias.data.zero_()
121 | return nn.Sequential(*C)
122 |
123 | def forward(self, g_s):
124 | out = []
125 | for i in range(len(g_s)):
126 | out.append(self.connectors[i](g_s[i]))
127 |
128 | return out
129 |
130 |
131 | class ConvReg(nn.Module):
132 | """Convolutional regression for FitNet"""
133 | def __init__(self, s_shape, t_shape, use_relu=True):
134 | super(ConvReg, self).__init__()
135 | self.use_relu = use_relu
136 | s_N, s_C, s_H, s_W = s_shape
137 | t_N, t_C, t_H, t_W = t_shape
138 | if s_H == 2 * t_H:
139 | self.conv = nn.Conv2d(s_C, t_C, kernel_size=3, stride=2, padding=1)
140 | elif s_H * 2 == t_H:
141 | self.conv = nn.ConvTranspose2d(s_C, t_C, kernel_size=4, stride=2, padding=1)
142 | elif s_H >= t_H:
143 | self.conv = nn.Conv2d(s_C, t_C, kernel_size=(1+s_H-t_H, 1+s_W-t_W))
144 | else:
145 | raise NotImplementedError('student size {}, teacher size {}'.format(s_H, t_H))
146 | self.bn = nn.BatchNorm2d(t_C)
147 | self.relu = nn.ReLU(inplace=True)
148 |
149 | def forward(self, x):
150 | x = self.conv(x)
151 | if self.use_relu:
152 | return self.relu(self.bn(x))
153 | else:
154 | return self.bn(x)
155 |
156 |
157 | class Regress(nn.Module):
158 | """Simple Linear Regression for hints"""
159 | def __init__(self, dim_in=1024, dim_out=1024):
160 | super(Regress, self).__init__()
161 | self.linear = nn.Linear(dim_in, dim_out)
162 | self.relu = nn.ReLU(inplace=True)
163 |
164 | def forward(self, x):
165 | x = x.view(x.shape[0], -1)
166 | x = self.linear(x)
167 | x = self.relu(x)
168 | return x
169 |
170 |
171 | class Embed(nn.Module):
172 | """Embedding module"""
173 | def __init__(self, dim_in=1024, dim_out=128):
174 | super(Embed, self).__init__()
175 | self.linear = nn.Linear(dim_in, dim_out)
176 | self.l2norm = Normalize(2)
177 |
178 | def forward(self, x):
179 | x = x.view(x.shape[0], -1)
180 | x = self.linear(x)
181 | x = self.l2norm(x)
182 | return x
183 |
184 |
185 | class LinearEmbed(nn.Module):
186 | """Linear Embedding"""
187 | def __init__(self, dim_in=1024, dim_out=128):
188 | super(LinearEmbed, self).__init__()
189 | self.linear = nn.Linear(dim_in, dim_out)
190 |
191 | def forward(self, x):
192 | x = x.view(x.shape[0], -1)
193 | x = self.linear(x)
194 | return x
195 |
196 |
197 | class MLPEmbed(nn.Module):
198 | """non-linear embed by MLP"""
199 | def __init__(self, dim_in=1024, dim_out=128):
200 | super(MLPEmbed, self).__init__()
201 | self.linear1 = nn.Linear(dim_in, 2 * dim_out)
202 | self.relu = nn.ReLU(inplace=True)
203 | self.linear2 = nn.Linear(2 * dim_out, dim_out)
204 | self.l2norm = Normalize(2)
205 |
206 | def forward(self, x):
207 | x = x.view(x.shape[0], -1)
208 | x = self.relu(self.linear1(x))
209 | x = self.l2norm(self.linear2(x))
210 | return x
211 |
212 |
213 | class Normalize(nn.Module):
214 | """normalization layer"""
215 | def __init__(self, power=2):
216 | super(Normalize, self).__init__()
217 | self.power = power
218 |
219 | def forward(self, x):
220 | norm = x.pow(self.power).sum(1, keepdim=True).pow(1. / self.power)
221 | out = x.div(norm)
222 | return out
223 |
224 |
225 | class Flatten(nn.Module):
226 | """flatten module"""
227 | def __init__(self):
228 | super(Flatten, self).__init__()
229 |
230 | def forward(self, feat):
231 | return feat.view(feat.size(0), -1)
232 |
233 |
234 | class PoolEmbed(nn.Module):
235 | """pool and embed"""
236 | def __init__(self, layer=0, dim_out=128, pool_type='avg'):
237 | super().__init__()
238 | if layer == 0:
239 | pool_size = 8
240 | nChannels = 16
241 | elif layer == 1:
242 | pool_size = 8
243 | nChannels = 16
244 | elif layer == 2:
245 | pool_size = 6
246 | nChannels = 32
247 | elif layer == 3:
248 | pool_size = 4
249 | nChannels = 64
250 | elif layer == 4:
251 | pool_size = 1
252 | nChannels = 64
253 | else:
254 | raise NotImplementedError('layer not supported: {}'.format(layer))
255 |
256 | self.embed = nn.Sequential()
257 | if layer <= 3:
258 | if pool_type == 'max':
259 | self.embed.add_module('MaxPool', nn.AdaptiveMaxPool2d((pool_size, pool_size)))
260 | elif pool_type == 'avg':
261 | self.embed.add_module('AvgPool', nn.AdaptiveAvgPool2d((pool_size, pool_size)))
262 |
263 | self.embed.add_module('Flatten', Flatten())
264 | self.embed.add_module('Linear', nn.Linear(nChannels*pool_size*pool_size, dim_out))
265 | self.embed.add_module('Normalize', Normalize(2))
266 |
267 | def forward(self, x):
268 | return self.embed(x)
269 |
270 |
271 | if __name__ == '__main__':
272 | import torch
273 |
274 | g_s = [
275 | torch.randn(2, 16, 16, 16),
276 | torch.randn(2, 32, 8, 8),
277 | torch.randn(2, 64, 4, 4),
278 | ]
279 | g_t = [
280 | torch.randn(2, 32, 16, 16),
281 | torch.randn(2, 64, 8, 8),
282 | torch.randn(2, 128, 4, 4),
283 | ]
284 | s_shapes = [s.shape for s in g_s]
285 | t_shapes = [t.shape for t in g_t]
286 |
287 | net = ConnectorV2(s_shapes, t_shapes)
288 | out = net(g_s)
289 | for f in out:
290 | print(f.shape)
291 |
--------------------------------------------------------------------------------
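`ConvReg` above chooses its convolution from the ratio of student to teacher feature heights. The sketch below mirrors that branch order in plain Python and also computes the resulting output height, as a way to check that each branch lands on the teacher's size. `conv_reg_plan` is a hypothetical name, and the output-size formulas are the standard ones for convolution and transposed convolution rather than anything defined in the repository.

```python
# Illustrative sketch only: the branch logic of ConvReg.__init__, without PyTorch.
def conv_reg_plan(s_shape, t_shape):
    """Hypothetical helper: returns (kind, kernel, stride, padding, out_height)."""
    _s_n, _s_c, s_h, _s_w = s_shape
    _t_n, _t_c, t_h, _t_w = t_shape
    if s_h == 2 * t_h:
        # Student map is twice as large: strided 3x3 convolution halves it.
        kind, kernel, stride, padding = "conv", 3, 2, 1
        out_h = (s_h + 2 * padding - kernel) // stride + 1
    elif s_h * 2 == t_h:
        # Student map is half as large: 4x4 transposed convolution doubles it.
        kind, kernel, stride, padding = "deconv", 4, 2, 1
        out_h = (s_h - 1) * stride - 2 * padding + kernel
    elif s_h >= t_h:
        # Otherwise shrink with an unpadded kernel of size 1 + s_h - t_h.
        kind, kernel, stride, padding = "conv", 1 + s_h - t_h, 1, 0
        out_h = s_h - kernel + 1
    else:
        raise NotImplementedError(
            "student size {}, teacher size {}".format(s_h, t_h))
    return kind, kernel, stride, padding, out_h


if __name__ == "__main__":
    print(conv_reg_plan((2, 16, 16, 16), (2, 32, 8, 8)))
    print(conv_reg_plan((2, 16, 8, 8), (2, 32, 16, 16)))
    print(conv_reg_plan((2, 16, 8, 8), (2, 32, 6, 6)))
```

A student map that is smaller than the teacher's by any factor other than two falls through to the error branch, which is why the exception type there matters.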
/models/vgg.py:
--------------------------------------------------------------------------------
1 | '''VGG for CIFAR10. FC layers are removed.
2 | (c) YANG, Wei
3 | '''
4 | import torch.nn as nn
5 | import torch.nn.functional as F
6 | import math
7 |
8 |
9 | __all__ = [
10 | 'VGG', 'vgg11', 'vgg11_bn', 'vgg13', 'vgg13_bn', 'vgg16', 'vgg16_bn',
11 | 'vgg19_bn', 'vgg19',
12 | ]
13 |
14 |
15 | model_urls = {
16 | 'vgg11': 'https://download.pytorch.org/models/vgg11-bbd30ac9.pth',
17 | 'vgg13': 'https://download.pytorch.org/models/vgg13-c768596a.pth',
18 | 'vgg16': 'https://download.pytorch.org/models/vgg16-397923af.pth',
19 | 'vgg19': 'https://download.pytorch.org/models/vgg19-dcbb9e9d.pth',
20 | }
21 |
22 |
23 | class VGG(nn.Module):
24 |
25 | def __init__(self, cfg, batch_norm=False, num_classes=1000):
26 | super(VGG, self).__init__()
27 | self.block0 = self._make_layers(cfg[0], batch_norm, 3)
28 | self.block1 = self._make_layers(cfg[1], batch_norm, cfg[0][-1])
29 | self.block2 = self._make_layers(cfg[2], batch_norm, cfg[1][-1])
30 | self.block3 = self._make_layers(cfg[3], batch_norm, cfg[2][-1])
31 | self.block4 = self._make_layers(cfg[4], batch_norm, cfg[3][-1])
32 |
33 | self.pool0 = nn.MaxPool2d(kernel_size=2, stride=2)
34 | self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
35 | self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
36 | self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
37 | self.pool4 = nn.AdaptiveAvgPool2d((1, 1))
38 | # self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2)
39 |
40 | self.classifier = nn.Linear(512, num_classes)
41 | self._initialize_weights()
42 |
43 | def get_feat_modules(self):
44 | feat_m = nn.ModuleList([])
45 | feat_m.append(self.block0)
46 | feat_m.append(self.pool0)
47 | feat_m.append(self.block1)
48 | feat_m.append(self.pool1)
49 | feat_m.append(self.block2)
50 | feat_m.append(self.pool2)
51 | feat_m.append(self.block3)
52 | feat_m.append(self.pool3)
53 | feat_m.append(self.block4)
54 | feat_m.append(self.pool4)
55 | return feat_m
56 |
57 | def get_bn_before_relu(self):
58 | bn1 = self.block1[-1]
59 | bn2 = self.block2[-1]
60 | bn3 = self.block3[-1]
61 | bn4 = self.block4[-1]
62 | return [bn1, bn2, bn3, bn4]
63 |
64 | def forward(self, x, is_feat=False, preact=False):
65 | h = x.shape[2]
66 | x = F.relu(self.block0(x))
67 | f0 = x
68 | x = self.pool0(x)
69 | x = self.block1(x)
70 | f1_pre = x
71 | x = F.relu(x)
72 | f1 = x
73 | x = self.pool1(x)
74 | x = self.block2(x)
75 | f2_pre = x
76 | x = F.relu(x)
77 | f2 = x
78 | x = self.pool2(x)
79 | x = self.block3(x)
80 | f3_pre = x
81 | x = F.relu(x)
82 | f3 = x
83 | if h == 64:
84 | x = self.pool3(x)
85 | x = self.block4(x)
86 | f4_pre = x
87 | x = F.relu(x)
88 | f4 = x
89 | x = self.pool4(x)
90 | x = x.view(x.size(0), -1)
91 | f5 = x
92 | x = self.classifier(x)
93 |
94 | if is_feat:
95 | if preact:
96 | return [f0, f1_pre, f2_pre, f3_pre, f4_pre, f5], x
97 | else:
98 | return [f0, f1, f2, f3, f4, f5], x
99 | else:
100 | return x
101 |
102 | @staticmethod
103 | def _make_layers(cfg, batch_norm=False, in_channels=3):
104 | layers = []
105 | for v in cfg:
106 | if v == 'M':
107 | layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
108 | else:
109 | conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
110 | if batch_norm:
111 | layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
112 | else:
113 | layers += [conv2d, nn.ReLU(inplace=True)]
114 | in_channels = v
115 | layers = layers[:-1]
116 | return nn.Sequential(*layers)
117 |
118 | def _initialize_weights(self):
119 | for m in self.modules():
120 | if isinstance(m, nn.Conv2d):
121 | n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
122 | m.weight.data.normal_(0, math.sqrt(2. / n))
123 | if m.bias is not None:
124 | m.bias.data.zero_()
125 | elif isinstance(m, nn.BatchNorm2d):
126 | m.weight.data.fill_(1)
127 | m.bias.data.zero_()
128 | elif isinstance(m, nn.Linear):
129 | n = m.weight.size(1)
130 | m.weight.data.normal_(0, 0.01)
131 | m.bias.data.zero_()
132 |
133 |
134 | cfg = {
135 | 'A': [[64], [128], [256, 256], [512, 512], [512, 512]],
136 | 'B': [[64, 64], [128, 128], [256, 256], [512, 512], [512, 512]],
137 | 'D': [[64, 64], [128, 128], [256, 256, 256], [512, 512, 512], [512, 512, 512]],
138 | 'E': [[64, 64], [128, 128], [256, 256, 256, 256], [512, 512, 512, 512], [512, 512, 512, 512]],
139 | 'S': [[64], [128], [256], [512], [512]],
140 | }
141 |
142 |
143 | def vgg8(**kwargs):
144 | """VGG 8-layer model (configuration "S")
145 | Args:
146 | **kwargs: forwarded to the VGG constructor (e.g. num_classes)
147 | """
148 | model = VGG(cfg['S'], **kwargs)
149 | return model
150 |
151 |
152 | def vgg8_bn(**kwargs):
153 | """VGG 8-layer model (configuration "S") with batch normalization
154 | Args:
155 | **kwargs: forwarded to the VGG constructor (e.g. num_classes)
156 | """
157 | model = VGG(cfg['S'], batch_norm=True, **kwargs)
158 | return model
159 |
160 |
161 | def vgg11(**kwargs):
162 | """VGG 11-layer model (configuration "A")
163 | Args:
164 | **kwargs: forwarded to the VGG constructor (e.g. num_classes)
165 | """
166 | model = VGG(cfg['A'], **kwargs)
167 | return model
168 |
169 |
170 | def vgg11_bn(**kwargs):
171 | """VGG 11-layer model (configuration "A") with batch normalization"""
172 | model = VGG(cfg['A'], batch_norm=True, **kwargs)
173 | return model
174 |
175 |
176 | def vgg13(**kwargs):
177 | """VGG 13-layer model (configuration "B")
178 | Args:
179 | **kwargs: forwarded to the VGG constructor (e.g. num_classes)
180 | """
181 | model = VGG(cfg['B'], **kwargs)
182 | return model
183 |
184 |
185 | def vgg13_bn(**kwargs):
186 | """VGG 13-layer model (configuration "B") with batch normalization"""
187 | model = VGG(cfg['B'], batch_norm=True, **kwargs)
188 | return model
189 |
190 |
191 | def vgg16(**kwargs):
192 | """VGG 16-layer model (configuration "D")
193 | Args:
194 | **kwargs: forwarded to the VGG constructor (e.g. num_classes)
195 | """
196 | model = VGG(cfg['D'], **kwargs)
197 | return model
198 |
199 |
200 | def vgg16_bn(**kwargs):
201 | """VGG 16-layer model (configuration "D") with batch normalization"""
202 | model = VGG(cfg['D'], batch_norm=True, **kwargs)
203 | return model
204 |
205 |
206 | def vgg19(**kwargs):
207 | """VGG 19-layer model (configuration "E")
208 | Args:
209 | **kwargs: forwarded to the VGG constructor (e.g. num_classes)
210 | """
211 | model = VGG(cfg['E'], **kwargs)
212 | return model
213 |
214 |
215 | def vgg19_bn(**kwargs):
216 | """VGG 19-layer model (configuration "E") with batch normalization"""
217 | model = VGG(cfg['E'], batch_norm=True, **kwargs)
218 | return model
219 |
220 |
221 | if __name__ == '__main__':
222 | import torch
223 |
224 | x = torch.randn(2, 3, 32, 32)
225 | net = vgg19_bn(num_classes=100)
226 | feats, logit = net(x, is_feat=True, preact=True)
227 |
228 | for f in feats:
229 | print(f.shape, f.min().item())
230 | print(logit.shape)
231 |
232 | for m in net.get_bn_before_relu():
233 | if isinstance(m, nn.BatchNorm2d):
234 | print('pass')
235 | else:
236 | print('warning')
237 |
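The layer-building rule in `_make_layers` above can be illustrated without PyTorch. What follows is a minimal sketch, not part of the repository: `describe_layers` is a hypothetical helper that returns layer names instead of `nn` modules, assuming the same rule as the original (conv, optional batch norm, and ReLU per entry, `'M'` for max pooling, and the last element dropped).

```python
def describe_layers(cfg, batch_norm=False):
    """Return layer names in the order _make_layers would build the modules.

    Hypothetical helper for illustration only; it builds strings, not nn modules.
    """
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append('maxpool')
        else:
            layers.append('conv%d' % v)
            if batch_norm:
                layers.append('bn%d' % v)
            layers.append('relu')
    # Mirror `layers = layers[:-1]`: the last element (normally a ReLU) is
    # dropped, so the block output is the pre-activation feature and
    # forward() applies F.relu itself.
    return layers[:-1]


if __name__ == '__main__':
    # First stage of configuration 'D' with batch normalization.
    print(describe_layers([64, 64], batch_norm=True))
```

Dropping the trailing ReLU is what lets `forward` return both the `f*_pre` (pre-activation) and `f*` (post-activation) features from the same block.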
--------------------------------------------------------------------------------
/models/wrn.py:
--------------------------------------------------------------------------------
1 | import math
2 | import torch
3 | import torch.nn as nn
4 | import torch.nn.functional as F
5 |
6 | """
7 | Original Author: Wei Yang
8 | """
9 |
10 | __all__ = ['wrn']
11 |
12 |
13 | class BasicBlock(nn.Module):
14 | def __init__(self, in_planes, out_planes, stride, dropRate=0.0):
15 | super(BasicBlock, self).__init__()
16 | self.bn1 = nn.BatchNorm2d(in_planes)
17 | self.relu1 = nn.ReLU(inplace=True)
18 | self.conv1 = nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
19 | padding=1, bias=False)
20 | self.bn2 = nn.BatchNorm2d(out_planes)
21 | self.relu2 = nn.ReLU(inplace=True)
22 | self.conv2 = nn.Conv2d(out_planes, out_planes, kernel_size=3, stride=1,
23 | padding=1, bias=False)
24 | self.droprate = dropRate
25 | self.equalInOut = (in_planes == out_planes)
26 | self.convShortcut = (not self.equalInOut) and nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride,
27 | padding=0, bias=False) or None
28 |
29 | def forward(self, x):
30 | if not self.equalInOut:
31 | x = self.relu1(self.bn1(x))
32 | else:
33 | out = self.relu1(self.bn1(x))
34 | out = self.relu2(self.bn2(self.conv1(out if self.equalInOut else x)))
35 | if self.droprate > 0:
36 | out = F.dropout(out, p=self.droprate, training=self.training)
37 | out = self.conv2(out)
38 | return torch.add(x if self.equalInOut else self.convShortcut(x), out)
39 |
40 |
41 | class NetworkBlock(nn.Module):
42 | def __init__(self, nb_layers, in_planes, out_planes, block, stride, dropRate=0.0):
43 | super(NetworkBlock, self).__init__()
44 | self.layer = self._make_layer(block, in_planes, out_planes, nb_layers, stride, dropRate)
45 |
46 | def _make_layer(self, block, in_planes, out_planes, nb_layers, stride, dropRate):
47 | layers = []
48 | for i in range(nb_layers):
49 | layers.append(block(i == 0 and in_planes or out_planes, out_planes, i == 0 and stride or 1, dropRate))
50 | return nn.Sequential(*layers)
51 |
52 | def forward(self, x):
53 | return self.layer(x)
54 |
55 |
56 | class WideResNet(nn.Module):
57 | def __init__(self, depth, num_classes, widen_factor=1, dropRate=0.0):
58 | super(WideResNet, self).__init__()
59 | nChannels = [16, 16*widen_factor, 32*widen_factor, 64*widen_factor]
60 | assert (depth - 4) % 6 == 0, 'depth should be 6n+4'
61 | n = (depth - 4) // 6
62 | block = BasicBlock
63 | # 1st conv before any network block
64 | self.conv1 = nn.Conv2d(3, nChannels[0], kernel_size=3, stride=1,
65 | padding=1, bias=False)
66 | # 1st block
67 | self.block1 = NetworkBlock(n, nChannels[0], nChannels[1], block, 1, dropRate)
68 | # 2nd block
69 | self.block2 = NetworkBlock(n, nChannels[1], nChannels[2], block, 2, dropRate)
70 | # 3rd block
71 | self.block3 = NetworkBlock(n, nChannels[2], nChannels[3], block, 2, dropRate)
72 | # global average pooling and classifier
73 | self.bn1 = nn.BatchNorm2d(nChannels[3])
74 | self.relu = nn.ReLU(inplace=True)
75 | self.fc = nn.Linear(nChannels[3], num_classes)
76 | self.nChannels = nChannels[3]
77 |
78 | for m in self.modules():
79 | if isinstance(m, nn.Conv2d):
80 | n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
81 | m.weight.data.normal_(0, math.sqrt(2. / n))
82 | elif isinstance(m, nn.BatchNorm2d):
83 | m.weight.data.fill_(1)
84 | m.bias.data.zero_()
85 | elif isinstance(m, nn.Linear):
86 | m.bias.data.zero_()
87 |
88 | def get_feat_modules(self):
89 | feat_m = nn.ModuleList([])
90 | feat_m.append(self.conv1)
91 | feat_m.append(self.block1)
92 | feat_m.append(self.block2)
93 | feat_m.append(self.block3)
94 | return feat_m
95 |
96 | def get_bn_before_relu(self):
97 | bn1 = self.block2.layer[0].bn1
98 | bn2 = self.block3.layer[0].bn1
99 | bn3 = self.bn1
100 |
101 | return [bn1, bn2, bn3]
102 |
103 | def forward(self, x, is_feat=False, preact=False):
104 | out = self.conv1(x)
105 | f0 = out
106 | out = self.block1(out)
107 | f1 = out
108 | out = self.block2(out)
109 | f2 = out
110 | out = self.block3(out)
111 | f3 = out
112 | out = self.relu(self.bn1(out))
113 | out = F.avg_pool2d(out, 8)
114 | out = out.view(-1, self.nChannels)
115 | f4 = out
116 | out = self.fc(out)
117 | if is_feat:
118 | if preact:
119 | f1 = self.block2.layer[0].bn1(f1)
120 | f2 = self.block3.layer[0].bn1(f2)
121 | f3 = self.bn1(f3)
122 | return [f0, f1, f2, f3, f4], out
123 | else:
124 | return out
125 |
126 |
127 | def wrn(**kwargs):
128 | """
129 | Constructs a Wide Residual Network.
130 | """
131 | model = WideResNet(**kwargs)
132 | return model
133 |
134 |
135 | def wrn_40_2(**kwargs):
136 | model = WideResNet(depth=40, widen_factor=2, **kwargs)
137 | return model
138 |
139 |
140 | def wrn_40_1(**kwargs):
141 | model = WideResNet(depth=40, widen_factor=1, **kwargs)
142 | return model
143 |
144 |
145 | def wrn_16_2(**kwargs):
146 | model = WideResNet(depth=16, widen_factor=2, **kwargs)
147 | return model
148 |
149 |
150 | def wrn_16_1(**kwargs):
151 | model = WideResNet(depth=16, widen_factor=1, **kwargs)
152 | return model
153 |
154 |
155 | if __name__ == '__main__':
156 | import torch
157 |
158 | x = torch.randn(2, 3, 32, 32)
159 | net = wrn_40_2(num_classes=100)
160 | feats, logit = net(x, is_feat=True, preact=True)
161 |
162 | for f in feats:
163 | print(f.shape, f.min().item())
164 | print(logit.shape)
165 |
166 | for m in net.get_bn_before_relu():
167 | if isinstance(m, nn.BatchNorm2d):
168 | print('pass')
169 | else:
170 | print('warning')
171 |
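The depth and width arithmetic in `WideResNet.__init__` can be checked in isolation. The sketch below is an illustration only: `wrn_layout` is a hypothetical helper, assuming the same `6n+4` depth rule and the same channel list as the constructor above.

```python
def wrn_layout(depth, widen_factor=1):
    """Return (blocks per stage, channel widths) for a WRN-depth-widen_factor.

    Hypothetical helper mirroring the arithmetic in WideResNet.__init__.
    """
    if (depth - 4) % 6 != 0:
        raise ValueError('depth should be 6n+4')
    n = (depth - 4) // 6
    channels = [16, 16 * widen_factor, 32 * widen_factor, 64 * widen_factor]
    return n, channels


if __name__ == '__main__':
    # The four factory functions defined in this file.
    for depth, k in [(16, 1), (16, 2), (40, 1), (40, 2)]:
        print('wrn_%d_%d' % (depth, k), wrn_layout(depth, k))
```

The last entry of the channel list is also the input width of the final `nn.Linear` classifier.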
--------------------------------------------------------------------------------
/scripts/fetch_pretrained_teachers.sh:
--------------------------------------------------------------------------------
1 | # fetch pre-trained teacher models
2 |
3 | mkdir -p save/models/
4 |
5 | cd save/models
6 |
7 | mkdir -p wrn_40_2_vanilla
8 | wget http://shape2prog.csail.mit.edu/repo/wrn_40_2_vanilla/ckpt_epoch_240.pth
9 | mv ckpt_epoch_240.pth wrn_40_2_vanilla/
10 |
11 | mkdir -p resnet56_vanilla
12 | wget http://shape2prog.csail.mit.edu/repo/resnet56_vanilla/ckpt_epoch_240.pth
13 | mv ckpt_epoch_240.pth resnet56_vanilla/
14 |
15 | mkdir -p resnet110_vanilla
16 | wget http://shape2prog.csail.mit.edu/repo/resnet110_vanilla/ckpt_epoch_240.pth
17 | mv ckpt_epoch_240.pth resnet110_vanilla/
18 |
19 | mkdir -p resnet32x4_vanilla
20 | wget http://shape2prog.csail.mit.edu/repo/resnet32x4_vanilla/ckpt_epoch_240.pth
21 | mv ckpt_epoch_240.pth resnet32x4_vanilla/
22 |
23 | mkdir -p vgg13_vanilla
24 | wget http://shape2prog.csail.mit.edu/repo/vgg13_vanilla/ckpt_epoch_240.pth
25 | mv ckpt_epoch_240.pth vgg13_vanilla/
26 |
27 | mkdir -p ResNet50_vanilla
28 | wget http://shape2prog.csail.mit.edu/repo/ResNet50_vanilla/ckpt_epoch_240.pth
29 | mv ckpt_epoch_240.pth ResNet50_vanilla/
30 |
31 | cd ../..
--------------------------------------------------------------------------------
/scripts/run_cifar_distill.sh:
--------------------------------------------------------------------------------
1 | # sample scripts for running the distillation code
2 | # use resnet32x4 and resnet8x4 as an example
3 |
4 | # kd
5 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill kd --model_s resnet8x4 -r 0.1 -a 0.9 -b 0 --trial 1
6 | # FitNet
7 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill hint --model_s resnet8x4 -a 0 -b 100 --trial 1
8 | # AT
9 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill attention --model_s resnet8x4 -a 0 -b 1000 --trial 1
10 | # SP
11 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill similarity --model_s resnet8x4 -a 0 -b 3000 --trial 1
12 | # CC
13 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill correlation --model_s resnet8x4 -a 0 -b 0.02 --trial 1
14 | # VID
15 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill vid --model_s resnet8x4 -a 0 -b 1 --trial 1
16 | # RKD
17 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill rkd --model_s resnet8x4 -a 0 -b 1 --trial 1
18 | # PKT
19 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill pkt --model_s resnet8x4 -a 0 -b 30000 --trial 1
20 | # AB
21 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill abound --model_s resnet8x4 -a 0 -b 1 --trial 1
22 | # FT
23 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill factor --model_s resnet8x4 -a 0 -b 200 --trial 1
24 | # FSP
25 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill fsp --model_s resnet8x4 -a 0 -b 50 --trial 1
26 | # NST
27 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill nst --model_s resnet8x4 -a 0 -b 50 --trial 1
28 | # CRD
29 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill crd --model_s resnet8x4 -a 0 -b 0.8 --trial 1
30 |
31 | # CRD+KD
32 | python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill crd --model_s resnet8x4 -a 1 -b 0.8 --trial 1
33 |
34 | # KDA
35 | python train_student.py --path_t ./save/models/resnet110_vanilla/ckpt_epoch_240.pth --distill kd --model_s resnet20 -r 2 -a 0 -b 0
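The `-r`, `-a` and `-b` flags in the commands above are described in `train_student.py` as the weights for classification, KD, and the other distillation losses. A minimal sketch of how such weights could combine, assuming a plain weighted sum (the actual combination lives in `helper/loops.py`, which is not shown here; `combined_loss` is a hypothetical name):

```python
def combined_loss(loss_cls, loss_div, loss_kd, gamma, alpha, beta):
    """Weighted sum of the three loss terms (assumed form, for illustration).

    gamma -> -r, weight of the classification loss
    alpha -> -a, weight of the KD (divergence) loss
    beta  -> -b, weight of the method-specific distillation loss
    """
    return gamma * loss_cls + alpha * loss_div + beta * loss_kd


if __name__ == '__main__':
    # Weights from the "kd" command above: -r 0.1 -a 0.9 -b 0
    print(combined_loss(1.0, 2.0, 3.0, gamma=0.1, alpha=0.9, beta=0))
```

Under this reading, setting `-b 0` switches the method-specific term off and setting `-a 0` switches the KD term off.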
--------------------------------------------------------------------------------
/scripts/run_cifar_vanilla.sh:
--------------------------------------------------------------------------------
1 | # sample scripts for training vanilla teacher models
2 |
3 | python train_teacher.py --model wrn_40_2
4 |
5 | python train_teacher.py --model resnet56
6 |
7 | python train_teacher.py --model resnet110
8 |
9 | python train_teacher.py --model resnet32x4
10 |
11 | python train_teacher.py --model vgg13
12 |
13 | python train_teacher.py --model ResNet50
14 |
--------------------------------------------------------------------------------
/supermix.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import argparse
4 | from datetime import datetime
5 |
6 | import numpy as np
7 | import torch
8 | import torch.nn as nn
9 | import torchvision
10 | import torchvision.transforms as transforms
11 | import torch.nn.functional as F
12 | from torch.utils.data import DataLoader
13 | from torch.autograd import Variable
14 | import copy
15 | import time
16 | import matplotlib.pyplot as plt
17 | import matplotlib.image as misc  # stand-in for scipy.misc, whose imsave was removed in SciPy 1.2
18 | from helper.util import get_teacher_name
19 | from models import model_dict
20 | import math
21 | import torchvision.datasets as datasets
22 | import torchvision.models as models
23 |
24 |
25 | class Datasubset(torch.utils.data.Dataset):
26 | def __init__(self, dataset, len):
27 | self.dataset = dataset
28 | self.len = len
29 |
30 | def __getitem__(self, i):
31 | return self.dataset[i % self.len]
32 |
33 | def __len__(self):
34 | return self.len
35 |
36 |
37 | def load_teacher(model_path, n_cls):
38 | print('==> loading teacher model')
39 | model_t = get_teacher_name(model_path)
40 | model = model_dict[model_t](num_classes=n_cls)
41 | model.load_state_dict(torch.load(model_path)['model'])
42 | print('==> done')
43 | return model
44 |
45 |
46 | def onehot(y, n_classes=100):
47 | bs = y.size(0)
48 | y = y.type(torch.LongTensor).view(-1, 1)
49 | y_onehot = torch.FloatTensor(bs, n_classes)
50 |
51 | # zero the buffer, then write a 1 at each label index
52 | y_onehot.zero_()
53 | y_onehot.scatter_(1, y, 1)
54 | return y_onehot.cuda()
55 |
56 |
57 | class Smoothing(nn.Module):
58 | def __init__(self):
59 | super(Smoothing, self).__init__()
60 |
61 | def compute_kernels(self, sigma=1, channels=1):
62 | size_denom = 5.
63 | sigma = int(sigma * size_denom)
64 | kernel_size = sigma
65 | mgrid = torch.arange(kernel_size, dtype=torch.float32)
66 | mean = (kernel_size - 1.) / 2.
67 | mgrid = mgrid - mean
68 | mgrid = mgrid * size_denom
69 | kernel = 1. / (sigma * math.sqrt(2. * math.pi)) * \
70 | torch.exp(-(((mgrid - 0.) / (sigma)) ** 2) * 0.5)
71 |
72 | # Make sure sum of values in gaussian kernel equals 1.
73 | kernel = kernel / torch.sum(kernel)
74 |
75 | # Reshape to depthwise convolutional weight
76 | kernelx = kernel.view(1, 1, int(kernel_size), 1).repeat(channels, 1, 1, 1)
77 | kernely = kernel.view(1, 1, 1, int(kernel_size)).repeat(channels, 1, 1, 1)
78 |
79 | return kernelx.cuda(), kernely.cuda(), kernel_size
80 |
81 | def forward(self, input, sigma):
82 | if sigma > 0:
83 | channels = input.size(1)
84 | kx, ky, kernel_size = self.compute_kernels(sigma=sigma, channels=channels)
85 |
86 | # pad the input
87 | padd0 = int(kernel_size // 2)
88 | evenorodd = int(1 - kernel_size % 2)
89 | # self.pad = torch.nn.ConstantPad2d((padd0 - evenorodd, padd0, padd0 - evenorodd, padd0), 0.)
90 |
91 | input = F.pad(input, (padd0 - evenorodd, padd0, padd0 - evenorodd, padd0), 'constant', 0.)
92 | input = F.conv2d(input, weight=kx, groups=channels)
93 | input = F.conv2d(input, weight=ky, groups=channels)
94 | return input
95 |
96 |
97 | smoother = Smoothing().cuda()
98 |
99 |
100 | def normalize01(x):
101 | return (x - x.min()) / (x.max() - x.min())
102 |
103 |
104 | def tensor2img(t, ismask=False):
105 | x = t.cpu().detach().numpy().squeeze()
106 | if len(x.shape) == 3:
107 | x = x.transpose(1, 2, 0)
108 | if ismask:
109 | return x
110 | return normalize01(x)
111 |
112 |
113 | def plott(t_list):
114 | for ti in range(len(t_list)):
115 | x = tensor2img(t_list[ti])
116 | plt.subplot(1, len(t_list), ti + 1)
117 | plt.imshow(x)
118 | plt.show()
119 |
120 |
121 | def kldiv(x, y):
122 | x = F.log_softmax(x, 1)
123 | y = F.softmax(y, 1)
124 | return nn.KLDivLoss(reduction='none')(x, y).sum(1)
125 |
126 |
127 | def kldiv2(x, y):
128 | x = F.log_softmax(x, 1)
129 | return nn.KLDivLoss(reduction='none')(x, y).sum(1)
130 |
131 |
132 | def mask_process(x, upsample_size):
133 | bs = x.size(0)
134 | K = x.size(1)
135 | mask_w = x.size(3)
136 | m1 = x.view(bs * K, 1, mask_w, mask_w)
137 | m1 = F.interpolate(m1, upsample_size, mode='bilinear')
138 | m1 = m1.view(bs, K, 1, upsample_size, upsample_size)
139 | m1 = torch.sigmoid(m1)
140 | sum_masks = m1.sum(1, keepdim=True)
141 | m1 = m1 / sum_masks
142 | return m1
143 |
144 |
145 | def mix_batch(net, data, K, alpha=1, mask_w=16, sigma_grad=2, max_iter=200, toler=0):
146 |
147 |
148 |
149 | # size of the current batch
150 | bs = data.size(0)
151 | # spatial size of the input images
152 | inw = data.size(2)
153 |
154 | # predict the label of the input images
155 | f_data = net(data)
156 | pred_lbl = f_data.argmax(1)
157 |
158 | # generate the shuffle indexes to construct the sets X
159 | idx = list(range(bs))
160 | idx_arr = [idx]
161 | for i in range(K - 1):
162 | idx = idx_arr[-1].copy()
163 | idx[:-1] = idx_arr[-1][1:]
164 | idx[-1] = idx_arr[-1][0]
165 | idx_arr.append(idx)
166 | idx_arr = np.array(idx_arr)
167 |
168 | # construct the K sets and store them in data_X
169 | data_X = torch.zeros([bs, K, 3, inw, inw])
170 | lbl_X = torch.zeros([bs, K])
171 | for i in range(K):
172 | data_X[:, i, ...] = data[idx_arr[i], ...]
173 | lbl_X[:, i] = pred_lbl[idx_arr[i], ...]
174 | data_X = data_X.cuda()
175 |
176 | # construct the target soft labels, Equation 2 in the paper
177 | soft_targets = torch.zeros([bs, opt.n_classes])
178 | for i in range(bs):
179 | lbl_set = lbl_X[i:i + 1, :]
180 | lbl_set = lbl_set.view(K, 1)
181 | lambda_aug = np.random.dirichlet(np.ones(K) * alpha, 1).reshape(K, 1)
182 | lambda_aug = torch.from_numpy(lambda_aug).type(torch.FloatTensor).cuda()
183 | lbl_set_onehot = onehot(lbl_set, opt.n_classes) * lambda_aug
184 | lbl_soft = lbl_set_onehot.sum(0)
185 | soft_targets[i, :] = lbl_soft
186 | soft_targets = soft_targets.cuda()
187 |
188 | # construct the mask variables
189 | mask_init = 0.
190 | mask = torch.ones([bs, K, 1, mask_w, mask_w]).cuda() * mask_init
191 |
192 | loop_i = 0
193 |
194 | _, top2lbl = torch.topk(soft_targets, K, 1)
195 | top2lbl, _ = top2lbl.sort()
196 |
197 | batch_mask = torch.ones([bs]).cuda()
198 |
199 | while batch_mask.sum().item() > toler and loop_i < max_iter:
200 | # define the variable of the mask
201 | m = Variable(mask, requires_grad=True)
202 |
203 | # process the mask variable which will: 1) upsample the mask, 2) normalize it
204 | m_pr = mask_process(m, upsample_size=inw)
205 |
206 | # construct mixed images
207 | mixed_data = m_pr * data_X
208 | mixed_data = mixed_data.sum(1)
209 |
210 | # compute the prediction on mixed images
211 | f_mix = net.forward(mixed_data)
212 |
213 | stdloss = torch.abs(m_pr * (m_pr - 1))
214 | stdloss = stdloss.mean(1).mean(1).mean(1).mean(1)
215 |
216 | # compute the kldiv between the predictions and the target soft labels
217 | kl = kldiv2(f_mix, soft_targets)
218 |
219 | # zero out the loss for successfully mixed samples
220 | kl = (kl + stdloss * opt.lambda_s) * batch_mask
221 |
222 | loss = kl.sum()
223 |
224 | # compute the gradients of the loss w.r.t. the mask variable
225 | grad = torch.autograd.grad(loss, m)[0]
226 |
227 | w_k = copy.deepcopy(grad.data) # bs x K x 1 x mask_w x mask_w
228 |
229 | w_k = w_k.view(bs * K, 1, mask_w, mask_w)
230 | w = smoother(w_k, sigma=sigma_grad)
231 | w = w.view(bs, K, 1, mask_w, mask_w)
232 |
233 | f_k = -1 * kl
234 |
235 | dot = w_k.view(bs, -1) @ w.view(bs, -1).t()
236 | dot = torch.diag(dot)
237 |
238 | pert = torch.abs(f_k) / (dot + 1e-10)
239 |
240 | pert = torch.clamp(pert, 0.0001, 2000)
241 |
242 | r_i = -1 * pert.view(bs, 1, 1, 1, 1).repeat(1, K, 1, 1, 1) * w
243 |
244 | mask = mask + r_i.detach() * batch_mask.view(bs, 1, 1, 1, 1)
245 | mask_pr = mask_process(mask, upsample_size=inw)
246 | check_mix = mask_pr * data_X
247 | check_mix = check_mix.sum(1)
248 |
249 | pred_mix = net.forward(check_mix)
250 |
251 | _, pred_lbl_top2 = torch.topk(pred_mix, K, 1)
252 | pred_lbl_top2, _ = pred_lbl_top2.sort()
253 |
254 | batch_mask = pred_lbl_top2 != top2lbl
255 |
256 | batch_mask = batch_mask.sum(1).type(torch.FloatTensor).cuda()
257 | batch_mask = (batch_mask > 0).type(torch.FloatTensor).cuda()
258 | loop_i += 1
259 |
260 | idx = np.where(batch_mask.detach().cpu().numpy() == 0)[0].reshape(-1)
261 |
262 | check_mix = check_mix[idx, ...]
263 | mask_pr = mask_pr[idx, ...]
264 | pred_mix = pred_mix[idx, ...]
265 | data_X = data_X[idx, ...]
266 |
267 | return check_mix, mask_pr, pred_mix, data_X, loop_i
268 |
269 |
270 | def normalize(x):
271 | return (x - x.min()) / (x.max() - x.min())
272 |
273 |
274 | def plott(t_list):
275 | for ti in range(len(t_list)):
276 | x = tensor2img(t_list[ti])
277 | plt.subplot(1, len(t_list), ti + 1)
278 | plt.imshow(x)
279 | plt.show()
280 |
281 |
282 | def convert_time(seconds):
283 | seconds = seconds % (24 * 3600)
284 | hour = seconds // 3600
285 | seconds %= 3600
286 | minutes = seconds // 60
287 | seconds %= 60
288 | return [hour, minutes, seconds]
289 |
290 |
291 | def augment(opt, data_loader):
292 | model.eval()
293 | counter = 0
294 | total_iter = 0
295 | batch_counter = 0
296 | total_time = 0
297 |
298 | while counter < opt.aug_size:
299 | for batch_index, (images, labels) in enumerate(data_loader):
300 | images, labels = images.to(device), labels.to(device)
301 | bs = images.size(0)
302 |
303 | model.zero_grad()
304 |
305 | t0 = time.time()
306 |
307 | if bs != opt.bs:
308 | break
309 |
310 | # use the data in the batch to generate new data
311 | images_mixed, mask, pred_mix, data_X, iter = mix_batch(model, images, alpha=opt.alpha, K=opt.k,
312 | mask_w=opt.w,
313 | sigma_grad=opt.sigma,
314 | toler=opt.tol, max_iter=opt.max_iter)
315 |
316 | delta_t = time.time() - t0
317 | total_time += delta_t
318 |
319 | # number of generated images
320 | n_suc = images_mixed.size(0)
321 |
322 | # plot the results
323 | if opt.plot and n_suc>0:
324 | n_samples = min(n_suc, 3)
325 |
326 | for p in range(n_samples):
327 | n_cols = opt.k * 2 + 1
328 |
329 | # plot mixed images
330 | plt.subplot(n_samples, n_cols, p * n_cols + 1)
331 | plt.imshow(tensor2img(images_mixed[p, ...]))
332 | plt.axis('off')
333 | plt.title('Mixed')
334 |
335 | # plot input images
336 | for ps in range(opt.k):
337 | plt.subplot(n_samples, n_cols, p * n_cols + 1 + ps + 1)
338 | plt.imshow(tensor2img(data_X[p, ps, ...]))
339 | plt.axis('off')
340 | plt.title('input ' + str(ps))
341 |
342 | # plot mixing masks
343 | for ps in range(opt.k):
344 | plt.subplot(n_samples, n_cols, p * n_cols + 1 + ps + opt.k + 1)
345 | plt.imshow(tensor2img(mask[p, ps, ...], ismask=True), cmap='jet')
346 | plt.axis('off')
347 | plt.title('mask ' + str(ps))
348 |
349 | plt.show()
350 |
351 | for i in range(n_suc):
352 | img = images_mixed[i].detach().cpu().numpy().transpose(1, 2, 0)
353 | img = img * std + mean
354 | img = img * 255
355 |
356 | img = img.astype(np.uint8)
357 |
358 | misc.imsave(opt.save_dir + '/' + str(counter + i) + '.png', img)
359 |
360 | counter += n_suc
361 |
362 | total_iter += iter
363 | batch_counter += 1
364 |
365 | remaining_time = (opt.aug_size - counter) * total_time / (counter+1)
366 | ert = convert_time(remaining_time)
367 |
368 | print(
369 | "iter: %d, n_generated: %d, iters: %02d, ert: %d:%d:%02d" % (
370 | batch_index, counter, iter, ert[0], ert[1],
371 | ert[2]))
372 | if counter >= opt.aug_size:
373 | return 0
374 |
375 |
376 | def eval(device, net):
377 | net.eval()
378 | test_loss = 0.0 # cost function error
379 | correct = 0.0
380 | criterion = nn.CrossEntropyLoss()
381 | for (images, labels) in cifar100_test_loader:
382 | images, labels = images.to(device), labels.to(device)
383 |
384 | outputs = net(images)
385 |
386 | loss = criterion(outputs, labels)
387 | test_loss += loss.item() * images.size()[0]
388 | preds = outputs.argmax(1)
389 | correct += preds.eq(labels).sum()
390 |
391 | acc = correct.float() / len(cifar100_test_loader.dataset)
392 | loss = test_loss / len(cifar100_test_loader.dataset)
393 |
394 | return acc, loss
395 |
396 |
397 | def count_parameters(model):
398 | return sum(p.numel() for p in model.parameters() if p.requires_grad)
399 |
400 |
401 | if __name__ == '__main__':
402 | parser = argparse.ArgumentParser()
403 | parser.add_argument('--dataset', type=str, default='cifar100', help='dataset to augment', choices=['imagenet', 'cifar100'])
404 | parser.add_argument('--model', type=str, default='resnet32',
405 | help='name of the supervisor model to load')
406 | parser.add_argument('--device', type=str, default='cuda:0', help='cuda or cpu')
407 | parser.add_argument('--save_dir', type=str, required=True,
408 | help='output directory to save results')
409 | parser.add_argument('--input_dir', type=str, default='/home/aldb/outputs/imgenet/imgnet_train1',
410 | help='directory of the training set of ImageNet')
411 | parser.add_argument('--bs', type=int, default=100, help='batch size')
412 | parser.add_argument('--aug_size', type=int, default=500000, help='number of images to generate')
413 | parser.add_argument('--k', type=int, default=2, help='number of samples to mix')
414 | parser.add_argument('--max_iter', type=int, default=50, help='maximum number of iterations for each batch')
415 | parser.add_argument('--alpha', type=float, default=3, help='alpha of the Dirichlet distribution')
416 | parser.add_argument('--sigma', type=float, default=1, help='standard deviation for the Gaussian blurring')
417 | parser.add_argument('--w', type=int, default=16, help='width of the mixing masks')
418 | parser.add_argument('--lambda_s', type=float, default=25, help='multiplier of the sparsity loss')
419 | parser.add_argument('--tol', type=int, default=70,
420 | help='tolerance (percent) for the number of unsuccessful samples in the batch')
421 | parser.add_argument('--plot', type=lambda s: s.lower() in ('true', '1', 'yes'), default=True, help='plot the results (true/false)')
422 | opt = parser.parse_args()
423 |
424 | # set the device
425 | device = torch.device(opt.device)
426 |
427 | opt.tol = int(opt.bs * opt.tol / 100)
428 |
429 | if opt.dataset == 'cifar100':
430 |
431 | # mean and std of the training set of cifar100
432 | CIFAR100_MEAN = (0.5070, 0.4865, 0.4409)
433 | CIFAR100_STD = (0.2673, 0.2564, 0.2761)
434 | std = np.array(CIFAR100_STD)
435 | mean = np.array(CIFAR100_MEAN)
436 | std = std.reshape(1, 1, 3)
437 | mean = mean.reshape(1, 1, 3)
438 |
439 | # load the data
440 | transform = transforms.Compose([
441 | transforms.RandomHorizontalFlip(),
442 | transforms.ToTensor(),
443 | transforms.Normalize(CIFAR100_MEAN, CIFAR100_STD)
444 | ])
445 |
446 | transform_test = transforms.Compose([
447 | transforms.ToTensor(),
448 | transforms.Normalize(CIFAR100_MEAN, CIFAR100_STD)
449 | ])
450 |
451 | cifar100_training = torchvision.datasets.CIFAR100(root='./data', train=True, download=True,
452 | transform=transform)
453 |
454 | data_loader = DataLoader(cifar100_training, shuffle=True, num_workers=2, batch_size=opt.bs)
455 |
456 | cifar100_test = torchvision.datasets.CIFAR100(root='./data', train=False, download=True,
457 | transform=transform_test)
458 | cifar100_test_loader = DataLoader(
459 | cifar100_test, shuffle=False, num_workers=2, batch_size=100)
460 |
461 | # load the teacher model
462 | path_t = './save/models/' + opt.model + '_vanilla/ckpt_epoch_240.pth'
463 | model = load_teacher(path_t, 100)
464 | model.eval()
465 | model.to(device)
466 | opt.n_classes = 100
467 |
468 |
469 | elif opt.dataset == 'imagenet':
470 | # mean and std of the training set of ImageNet
471 | mean_imgnet = (0.485, 0.456, 0.406)
472 | std_imgnet = (0.229, 0.224, 0.225)
473 | std = np.array(std_imgnet)
474 | mean = np.array(mean_imgnet)
475 | std = std.reshape(1, 1, 3)
476 | mean = mean.reshape(1, 1, 3)
477 |
478 | train_dataset = datasets.ImageFolder(
479 | opt.input_dir,
480 | transforms.Compose([
481 | transforms.Resize(260),
482 | transforms.RandomCrop(224),
483 | transforms.RandomHorizontalFlip(),
484 | transforms.ToTensor(),
485 | transforms.Normalize(mean=mean_imgnet,
486 | std=std_imgnet),
487 | ]))
488 |
489 | data_loader = torch.utils.data.DataLoader(
490 | train_dataset, batch_size=opt.bs, num_workers=4, pin_memory=True, shuffle=True)
491 |
492 | loader = getattr(models, opt.model)
493 |
494 | model = loader(pretrained=True)
495 | model.eval()
496 | model.to(device)
497 | opt.n_classes = 1000
498 |
499 | opt.save_dir = os.path.join(opt.save_dir, 'data')
500 | if not os.path.exists(opt.save_dir):
501 | os.makedirs(opt.save_dir)
502 |
503 | augment(opt, data_loader)
504 |
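Two small steps of `mix_batch` above can be sketched in plain Python to make them easier to follow: the cyclic index sets that pick which K images are mixed for each sample, and the per-pixel mask normalization done by `mask_process` (sigmoid, then division by the sum over the K masks). This is an illustration under those assumptions, not the repository's implementation; both function names are hypothetical, and the bilinear upsampling step is omitted.

```python
import math


def cyclic_index_sets(bs, K):
    """Row i is range(bs) rotated left by i, as idx_arr is built in mix_batch."""
    return [[(j + i) % bs for j in range(bs)] for i in range(K)]


def normalize_masks(logits):
    """Normalize the K mask values at a single pixel.

    Each value goes through a sigmoid, then the K results are rescaled so
    that they sum to 1, mirroring the last steps of mask_process.
    """
    gates = [1.0 / (1.0 + math.exp(-v)) for v in logits]
    total = sum(gates)
    return [g / total for g in gates]


if __name__ == '__main__':
    # With bs=4 and K=3, sample b mixes images b, b+1 and b+2 (mod 4).
    for row in cyclic_index_sets(4, 3):
        print(row)
    # A zero-initialized mask (mask_init = 0.) gives equal weights 1/K.
    print(normalize_masks([0.0, 0.0]))
```

Because the normalized masks sum to 1 at every pixel, each pixel of the mixed image is a convex combination of the corresponding pixels of the K inputs.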
--------------------------------------------------------------------------------
/train_student.py:
--------------------------------------------------------------------------------
1 | """
2 | the general training framework
3 | """
4 |
5 | from __future__ import print_function
6 |
7 | import os
8 | import argparse
9 | import time
10 |
11 | # import tensorboard_logger as tb_logger
12 | import torch
13 | import torch.optim as optim
14 | import torch.nn as nn
15 | import torch.backends.cudnn as cudnn
16 |
17 | from models import model_dict
18 | from models.util import Embed, ConvReg, LinearEmbed
19 | from models.util import Connector, Translator, Paraphraser
20 |
21 | from dataset.cifar100 import get_cifar100_dataloaders, get_cifar100_dataloaders_sample
22 |
23 | from helper.util import adjust_learning_rate, Logger, count_parameters, get_teacher_name, WarmUpLR, plot_tensor
24 |
25 | from distiller_zoo import DistillKL, HintLoss, Attention, Similarity, Correlation, VIDLoss, RKDLoss
26 | from distiller_zoo import PKT, ABLoss, FactorTransfer, KDSVD, FSP, NSTLoss
27 | from crd.criterion import CRDLoss
28 |
29 | from helper.loops import train_distill as train, validate
30 | from helper.pretrain import init
31 | import numpy as np
32 |
33 |
34 | def parse_option():
35 | parser = argparse.ArgumentParser('argument for training')
36 |
37 | parser.add_argument('--print_freq', type=int, default=5, help='print frequency')
38 | parser.add_argument('--tb_freq', type=int, default=500, help='tb frequency')
39 | parser.add_argument('--save_freq', type=int, default=40, help='save frequency')
40 | parser.add_argument('--batch_size', type=int, default=128, help='batch_size')
41 | parser.add_argument('--device', type=str, default='cuda:1', help='device to train on, e.g. cuda:0 or cpu')
42 | parser.add_argument('--num_workers', type=int, default=2, help='num of workers to use')
43 | parser.add_argument('--epochs', type=int, default=600, help='number of training epochs')
44 | parser.add_argument('--init_epochs', type=int, default=30, help='init training for two-stage methods')
45 |
46 | # optimization
47 | parser.add_argument('--learning_rate', type=float, default=0.1, help='learning rate')
48 | parser.add_argument('--epochs_warmup', type=int, default=5, help='number of epochs for learning rate warm up')
49 | parser.add_argument('--lr_decay_epochs', type=str, default='200, 300, 400, 500', # '150, 250, 350, 450',
50 | help='where to decay lr, can be a list')
51 | parser.add_argument('--lr_decay_rate', type=float, default=0.1, help='decay rate for learning rate')
52 | parser.add_argument('--weight_decay', type=float, default=5e-4, help='weight decay')
53 | parser.add_argument('--momentum', type=float, default=0.9, help='momentum')
54 |
55 | # dataset
56 | parser.add_argument('--dataset', type=str, default='cifar100', choices=['cifar100'], help='dataset')
57 |
58 | # model
59 | parser.add_argument('--model_s', type=str, default='resnet20',
60 | choices=['resnet8', 'resnet14', 'resnet20', 'resnet32', 'resnet44', 'resnet56', 'resnet110',
61 | 'resnet8x4', 'resnet32x4', 'wrn_16_1', 'wrn_16_2', 'wrn_40_1', 'wrn_40_2',
62 | 'vgg8', 'vgg11', 'vgg13', 'vgg16', 'vgg19', 'ResNet50',
63 | 'MobileNetV2', 'ShuffleV1', 'ShuffleV2'])
64 | parser.add_argument('--path_t', type=str, default='./save/models/resnet110_vanilla/ckpt_epoch_240.pth',
65 | help='teacher model snapshot')
66 |
67 | # distillation
68 | parser.add_argument('--distill', type=str, default='kd', choices=['kd', 'hint', 'attention', 'similarity',
69 | 'correlation', 'vid', 'crd', 'kdsvd', 'fsp',
70 | 'rkd', 'pkt', 'abound', 'factor', 'nst'])
71 |
72 | # parser.add_argument('--aug', type=str, default=None,
73 | # help='address of the augmented dataset')
74 |
75 | # augmentation parameters
76 | parser.add_argument('--aug_type', type=str, default='supermix', choices=[None, 'mixup', 'cutmix', 'supermix'],
77 | help='type of augmentation')
78 | parser.add_argument('--aug_dir', type=str, default='/home/aldb/outputs/new2/wrn_40_2_k:3_alpha:3',
79 | help='directory containing the augmented dataset')
80 | parser.add_argument('--aug_size', type=int, default=-1,
81 | help='size of the augmented dataset, -1 means the maximum possible size')
82 | parser.add_argument('--aug_lambda', type=float, default=0.5, help='lambda for mixup, between 0 and 1; -1 samples it from a beta distribution (see --aug_alpha)')
83 | parser.add_argument('--aug_alpha', type=float, default=10000,
84 | help='alpha of the beta distribution used to sample lambda; only active when --aug_lambda is -1')
85 |
86 | parser.add_argument('--trial', type=str, default='augmented', help='trial id')
87 |
88 | parser.add_argument('-r', '--gamma', type=float, default=2, help='weight for classification')
89 | parser.add_argument('-a', '--alpha', type=float, default=0, help='weight balance for KD')
90 | parser.add_argument('-b', '--beta', type=float, default=0, help='weight balance for other losses')
91 |
92 | # KL distillation
93 | parser.add_argument('--kd_T', type=float, default=4, help='temperature for KD distillation')
94 |
95 | # NCE distillation
96 | parser.add_argument('--feat_dim', default=128, type=int, help='feature dimension')
97 | parser.add_argument('--mode', default='exact', type=str, choices=['exact', 'relax'])
98 | parser.add_argument('--nce_k', default=16384, type=int, help='number of negative samples for NCE')
99 | parser.add_argument('--nce_t', default=0.07, type=float, help='temperature parameter for softmax')
100 | parser.add_argument('--nce_m', default=0.5, type=float, help='momentum for non-parametric updates')
101 |
102 | # hint layer
103 | parser.add_argument('--hint_layer', default=2, type=int, choices=[0, 1, 2, 3, 4])
104 |
105 | parser.add_argument('--test_interval', type=int, default=None, help='test interval')
106 | parser.add_argument('--seed', default=1001, type=int, help='random seed')
107 |
108 | opt = parser.parse_args()
109 |
110 | return opt
111 |
112 |
113 | def load_teacher(model_path, n_cls):
114 | print('==> loading teacher model')
115 | model_t = get_teacher_name(model_path)
116 | model = model_dict[model_t](num_classes=n_cls)
117 | model.load_state_dict(torch.load(model_path)['model'])
118 | print('==> done')
119 | return model
120 |
121 |
122 | def build_grid(source_size, target_size):
123 | k = float(target_size) / float(source_size)
124 | direct = torch.linspace(0, k, target_size).unsqueeze(0).repeat(target_size, 1).unsqueeze(-1)
125 | full = torch.cat([direct, direct.transpose(1, 0)], dim=2).unsqueeze(0)
126 | return full.cuda()
127 |
128 |
129 | def random_crop_grid(x, grid):
130 | delta = x.size(2) - grid.size(1)
131 | grid = grid.repeat(x.size(0), 1, 1, 1).cuda()
132 | # Add random shifts by x
133 | grid[:, :, :, 0] = grid[:, :, :, 0] + torch.FloatTensor(x.size(0)).cuda().random_(0, delta).unsqueeze(-1).unsqueeze(
134 | -1).expand(-1, grid.size(1), grid.size(2)) / x.size(2)
135 | # Add random shifts by y
136 | grid[:, :, :, 1] = grid[:, :, :, 1] + torch.FloatTensor(x.size(0)).cuda().random_(0, delta).unsqueeze(-1).unsqueeze(
137 | -1).expand(-1, grid.size(1), grid.size(2)) / x.size(2)
138 | return grid
139 |
140 |
141 | ############
142 |
143 | ############
144 |
145 | def distill(opt):
146 | # refine the opt arguments
147 |
148 | opt.model_path = './save/student_model'
149 |
150 | iterations = opt.lr_decay_epochs.split(',')
151 | opt.lr_decay_epochs = list([])
152 | for it in iterations:
153 | opt.lr_decay_epochs.append(int(it))
154 |
155 | opt.model_t = get_teacher_name(opt.path_t)
156 |
157 | opt.print_freq = int(50000 / opt.batch_size / opt.print_freq)
158 |
159 | opt.model_name = 'S:{}_T:{}_{}_{}/r:{}_a:{}_b:{}_{}_{}_{}_{}_lam:{}_alp:{}_augsize:{}_T:{}'.format(
160 | opt.model_s, opt.model_t,
161 | opt.dataset,
162 | opt.distill,
163 | opt.gamma, opt.alpha, opt.beta,
164 | opt.trial,
165 | opt.device, opt.seed,
166 | opt.aug_type,
167 | opt.aug_lambda,
168 | opt.aug_alpha,
169 | opt.aug_size, opt.kd_T)
170 |
171 | opt.save_folder = os.path.join(opt.model_path, opt.model_name)
172 | if not os.path.isdir(opt.save_folder):
173 | os.makedirs(opt.save_folder)
174 |
175 | opt.learning_rate = opt.learning_rate * opt.batch_size / 128
176 |
177 | # use a 5x smaller learning rate for these 3 models
178 | if opt.model_s in ['MobileNetV2', 'ShuffleV1', 'ShuffleV2']:
179 | opt.learning_rate = opt.learning_rate / 5
180 |
181 | print("learning rate is set to:", opt.learning_rate)
182 |
183 | best_acc = 0
184 | np.random.seed(opt.seed)
185 | torch.manual_seed(opt.seed)
186 |
187 | # dataloader
188 | if opt.dataset == 'cifar100':
189 | if opt.distill in ['crd']:
190 | train_loader, val_loader, n_data = get_cifar100_dataloaders_sample(batch_size=opt.batch_size,
191 | num_workers=opt.num_workers,
192 | k=opt.nce_k,
193 | mode=opt.mode)
194 | else:
195 | train_loader, val_loader, n_data = get_cifar100_dataloaders(opt, is_instance=True)
196 | n_cls = 100
197 | else:
198 | raise NotImplementedError(opt.dataset)
199 |
200 | # set the interval for testing
201 | opt.test_freq = int(50000 / opt.batch_size)
202 |
203 | # rescale the epoch counts by 50000 / aug_size, where 50000 is the original cifar100 training set size
204 | opt.lr_decay_epochs = list(int(i * 50000 / opt.aug_size) for i in opt.lr_decay_epochs)
205 | opt.epochs = int(opt.epochs * 50000 / opt.aug_size)
206 |
207 | print('Decay epochs: ', opt.lr_decay_epochs)
208 | print('Max epochs: ', opt.epochs)
209 |
210 | # set the device
211 | if torch.cuda.is_available():
212 | device = torch.device(opt.device)
213 | else:
214 | device = torch.device('cpu')
215 |
216 | # model
217 | model_t = load_teacher(opt.path_t, n_cls)
218 | model_s = model_dict[opt.model_s](num_classes=n_cls)
219 |
220 | # print(model_s)
221 |
222 | print("Size of the teacher:", count_parameters(model_t))
223 | print("Size of the student:", count_parameters(model_s))
224 |
225 | data = torch.randn(2, 3, 32, 32)
226 | model_t.eval()
227 | model_s.eval()
228 | feat_t, _ = model_t(data, is_feat=True)
229 | feat_s, _ = model_s(data, is_feat=True)
230 |
231 | module_list = nn.ModuleList([])
232 | module_list.append(model_s)
233 | trainable_list = nn.ModuleList([])
234 | trainable_list.append(model_s)
235 |
236 | criterion_cls = nn.CrossEntropyLoss()
237 | criterion_div = DistillKL(opt.kd_T)
238 | if opt.distill == 'kd':
239 | criterion_kd = DistillKL(opt.kd_T)
240 | elif opt.distill == 'hint':
241 | criterion_kd = HintLoss()
242 | regress_s = ConvReg(feat_s[opt.hint_layer].shape, feat_t[opt.hint_layer].shape)
243 | module_list.append(regress_s)
244 | trainable_list.append(regress_s)
245 | elif opt.distill == 'crd':
246 | opt.s_dim = feat_s[-1].shape[1]
247 | opt.t_dim = feat_t[-1].shape[1]
248 | opt.n_data = n_data
249 | criterion_kd = CRDLoss(opt)
250 | module_list.append(criterion_kd.embed_s)
251 | module_list.append(criterion_kd.embed_t)
252 | trainable_list.append(criterion_kd.embed_s)
253 | trainable_list.append(criterion_kd.embed_t)
254 | elif opt.distill == 'attention':
255 | criterion_kd = Attention()
256 | elif opt.distill == 'nst':
257 | criterion_kd = NSTLoss()
258 | elif opt.distill == 'similarity':
259 | criterion_kd = Similarity()
260 | elif opt.distill == 'rkd':
261 | criterion_kd = RKDLoss()
262 | elif opt.distill == 'pkt':
263 | criterion_kd = PKT()
264 | elif opt.distill == 'kdsvd':
265 | criterion_kd = KDSVD()
266 | elif opt.distill == 'correlation':
267 | criterion_kd = Correlation()
268 | embed_s = LinearEmbed(feat_s[-1].shape[1], opt.feat_dim)
269 | embed_t = LinearEmbed(feat_t[-1].shape[1], opt.feat_dim)
270 | module_list.append(embed_s)
271 | module_list.append(embed_t)
272 | trainable_list.append(embed_s)
273 | trainable_list.append(embed_t)
274 | elif opt.distill == 'vid':
275 | s_n = [f.shape[1] for f in feat_s[1:-1]]
276 | t_n = [f.shape[1] for f in feat_t[1:-1]]
277 | criterion_kd = nn.ModuleList(
278 | [VIDLoss(s, t, t) for s, t in zip(s_n, t_n)]
279 | )
280 | # add this as some parameters in VIDLoss need to be updated
281 | trainable_list.append(criterion_kd)
282 | elif opt.distill == 'abound':
283 | s_shapes = [f.shape for f in feat_s[1:-1]]
284 | t_shapes = [f.shape for f in feat_t[1:-1]]
285 | connector = Connector(s_shapes, t_shapes)
286 | # init stage training
287 | init_trainable_list = nn.ModuleList([])
288 | init_trainable_list.append(connector)
289 | init_trainable_list.append(model_s.get_feat_modules())
290 | criterion_kd = ABLoss(len(feat_s[1:-1]))
291 | init(model_s, model_t, init_trainable_list, criterion_kd, train_loader, opt)
292 | # classification
293 | module_list.append(connector)
294 | elif opt.distill == 'factor':
295 | s_shape = feat_s[-2].shape
296 | t_shape = feat_t[-2].shape
297 | paraphraser = Paraphraser(t_shape)
298 | translator = Translator(s_shape, t_shape)
299 | # init stage training
300 | init_trainable_list = nn.ModuleList([])
301 | init_trainable_list.append(paraphraser)
302 | criterion_init = nn.MSELoss()
303 | init(model_s, model_t, init_trainable_list, criterion_init, train_loader, opt)
304 | # classification
305 | criterion_kd = FactorTransfer()
306 | module_list.append(translator)
307 | module_list.append(paraphraser)
308 | trainable_list.append(translator)
309 | elif opt.distill == 'fsp':
310 | s_shapes = [s.shape for s in feat_s[:-1]]
311 | t_shapes = [t.shape for t in feat_t[:-1]]
312 | criterion_kd = FSP(s_shapes, t_shapes)
313 | # init stage training
314 | init_trainable_list = nn.ModuleList([])
315 | init_trainable_list.append(model_s.get_feat_modules())
316 | init(model_s, model_t, init_trainable_list, criterion_kd, train_loader, opt)
317 | # classification training
318 | pass
319 | else:
320 | raise NotImplementedError(opt.distill)
321 |
322 | criterion_list = nn.ModuleList([])
323 | criterion_list.append(criterion_cls) # classification loss
324 | criterion_list.append(criterion_div) # KL divergence loss, original knowledge distillation
325 | criterion_list.append(criterion_kd) # other knowledge distillation loss
326 |
327 | # optimizer
328 | optimizer = optim.SGD(trainable_list.parameters(),
329 | lr=opt.learning_rate,
330 | momentum=opt.momentum,
331 | weight_decay=opt.weight_decay)
332 |
333 | # append teacher after optimizer to avoid weight_decay
334 | module_list.append(model_t)
335 |
336 | if torch.cuda.is_available():
337 | module_list.to(device)
338 | criterion_list.to(device)
339 | cudnn.benchmark = True
340 |
341 | # setup warmup
342 | warmup_scheduler = WarmUpLR(optimizer, len(train_loader) * opt.epochs_warmup)
343 |
344 | # validate teacher accuracy
345 | teacher_acc, _, _ = validate(val_loader, model_t, criterion_cls, opt)
346 | print('teacher accuracy: %.2f \n' % (teacher_acc))
347 |
348 | # create logger
349 | logger = Logger(dir=opt.save_folder,
350 | var_names=['Epoch', 'l_xent', 'l_kd', 'l_other', 'acc_train', 'acc_test', 'acc_test_best', 'lr'],
351 | format=['%02d', '%.4f', '%.4f', '%.4f', '%.2f', '%.2f', '%.2f', '%.6f'], args=opt)
352 |
353 | total_t = 0
354 | # routine
355 | for epoch in range(1, opt.epochs + 1):
356 | adjust_learning_rate(epoch, opt, optimizer)
357 | time1 = time.time()
358 | best_acc, total_t = train(epoch, train_loader, val_loader, module_list, criterion_list, optimizer, opt,
359 | best_acc, logger,
360 | device, warmup_scheduler, total_t)
361 | time2 = time.time()
362 | print('epoch {}, total time {:.2f}'.format(epoch, time2 - time1))
363 | print('Best accuracy: %.2f \n' % (best_acc))
364 |
365 |
366 | if __name__ == '__main__':
367 | opt = parse_option()
368 | distill(opt)
369 |
--------------------------------------------------------------------------------
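The schedule handling inside `distill()` in `train_student.py` (learning-rate scaling by batch size, then epoch rescaling by augmented dataset size) can be illustrated in isolation. The following is a minimal sketch under stated assumptions, not the repository's code: `scale_learning_rate` and `rescale_schedule` are hypothetical helper names, the constants 128, 5 and 50000 are taken from the listing above, and `aug_size` is assumed to have already been resolved to a positive number (the `-1` default would have to be replaced first).

```python
def scale_learning_rate(base_lr, batch_size, ref_batch_size=128, reduce=False):
    """Scale the learning rate linearly with the batch size.

    `reduce=True` mirrors the 5x reduction the listing applies to
    MobileNetV2, ShuffleV1 and ShuffleV2 students.
    """
    lr = base_lr * batch_size / ref_batch_size
    return lr / 5 if reduce else lr


def rescale_schedule(epochs, decay_epochs, aug_size, base_size=50000):
    """Rescale epoch counts by base_size / aug_size.

    A larger augmented dataset means more samples per epoch, so fewer
    epochs are needed to see roughly the same number of samples.
    The multiply-then-divide order matches the listing, which matters
    because int() truncates.
    """
    if aug_size <= 0:
        raise ValueError("aug_size must be a positive number")
    new_epochs = int(epochs * base_size / aug_size)
    new_decay = [int(e * base_size / aug_size) for e in decay_epochs]
    return new_epochs, new_decay


if __name__ == "__main__":
    # Hypothetical run: an augmented set twice the size of CIFAR-100.
    print(scale_learning_rate(0.1, 256))
    print(rescale_schedule(600, [200, 300, 400, 500], 100000))
```

Keeping the two steps as pure functions makes it easy to check a planned schedule before starting a long run.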
/train_teacher.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import os
4 | import argparse
5 | import socket
6 | import time
7 |
8 | import torch
9 | import torch.optim as optim
10 | import torch.nn as nn
11 | import torch.backends.cudnn as cudnn
12 |
13 | from models import model_dict
14 |
15 | from dataset.cifar100 import get_cifar100_dataloaders
16 |
17 | from helper.util import adjust_learning_rate, accuracy, AverageMeter, Logger, WarmUpLR
18 | from helper.loops import train_vanilla as train, validate
19 |
20 |
21 | def parse_option():
22 |
23 | hostname = socket.gethostname()
24 |
25 | parser = argparse.ArgumentParser('argument for training')
26 |
27 | parser.add_argument('--print_freq', type=int, default=50, help='print frequency')
28 | parser.add_argument('--tb_freq', type=int, default=500, help='tb frequency')
29 | parser.add_argument('--save_freq', type=int, default=40, help='save frequency')
30 | parser.add_argument('--batch_size', type=int, default=128, help='batch_size')
31 | parser.add_argument('--num_workers', type=int, default=8, help='num of workers to use')
32 | parser.add_argument('--epochs', type=int, default=600, help='number of training epochs')
33 | parser.add_argument('--device', type=str, default='cuda:0', help='device to train on, e.g. cuda:0 or cpu')
34 |
35 | # optimization
36 | parser.add_argument('--learning_rate', type=float, default=0.02, help='learning rate')
37 | parser.add_argument('--lr_decay_epochs', type=str, default='200, 300, 400, 500', help='where to decay lr, can be a list')
38 | parser.add_argument('--lr_decay_rate', type=float, default=0.1, help='decay rate for learning rate')
39 | parser.add_argument('--weight_decay', type=float, default=5e-4, help='weight decay')
40 | parser.add_argument('--momentum', type=float, default=0.9, help='momentum')
41 | parser.add_argument('--aug', type=str, default=None,
42 | help='address of the augmented dataset')
43 | parser.add_argument('--aug_type', type=str, default=None,
44 | help='type of augmentation')
45 | # model and dataset
46 | parser.add_argument('--model', type=str, default='MobileNetV2',
47 | choices=['resnet8', 'resnet14', 'resnet20', 'resnet32', 'resnet44', 'resnet56', 'resnet110',
48 | 'resnet8x4', 'resnet32x4', 'wrn_16_1', 'wrn_16_2', 'wrn_40_1', 'wrn_40_2',
49 | 'vgg8', 'vgg11', 'vgg13', 'vgg16', 'vgg19',
50 | 'MobileNetV2', 'ShuffleV1', 'ShuffleV2', ])
51 | parser.add_argument('--dataset', type=str, default='cifar100', choices=['cifar100'], help='dataset')
52 |
53 | parser.add_argument('-t', '--trial', type=int, default=0, help='the experiment id')
54 |
55 | opt = parser.parse_args()
56 |
57 | # set a different learning rate for these 3 models
58 | # if opt.model in ['MobileNetV2', 'ShuffleV1', 'ShuffleV2']:
59 | # opt.learning_rate = 0.01
60 |
61 | # set the path according to the environment
62 |
63 | opt.model_path = './save/models_t'
64 |
65 | iterations = opt.lr_decay_epochs.split(',')
66 | opt.lr_decay_epochs = list([])
67 | for it in iterations:
68 | opt.lr_decay_epochs.append(int(it))
69 |
70 | opt.model_name = '{}_{}_lr_{}_decay_{}_trial_{}_cuda'.format(opt.model, opt.dataset, opt.learning_rate,
71 | opt.weight_decay, opt.trial)
72 |
73 | opt.save_folder = os.path.join(opt.model_path, opt.model_name)
74 | if not os.path.isdir(opt.save_folder):
75 | os.makedirs(opt.save_folder)
76 |
77 | return opt
78 |
79 |
80 | def main():
81 | best_acc = 0
82 |
83 | opt = parse_option()
84 |
85 | # dataloader
86 | if opt.dataset == 'cifar100':
87 | train_loader, val_loader = get_cifar100_dataloaders(opt)
88 | n_cls = 100
89 | else:
90 | raise NotImplementedError(opt.dataset)
91 |
92 | # model
93 | model = model_dict[opt.model](num_classes=n_cls)
94 |
95 | # optimizer
96 | optimizer = optim.SGD(model.parameters(),
97 | lr=opt.learning_rate,
98 | momentum=opt.momentum,
99 | weight_decay=opt.weight_decay)
100 |
101 | print("learning rate:", opt.learning_rate)
102 |
103 | criterion = nn.CrossEntropyLoss()
104 |
105 | if torch.cuda.is_available():
106 | model = model.to(opt.device)
107 | criterion = criterion.to(opt.device)
108 | cudnn.benchmark = True
109 |
110 | # setup warmup
111 | warmup_scheduler = WarmUpLR(optimizer, len(train_loader) * 5)
112 |
113 | logger = Logger(dir=opt.save_folder,
114 | var_names=['Epoch', 'l_xent', 'l_kd', 'l_other', 'acc_train', 'acc_test', 'acc_test_best', 'lr'],
115 | format=['%02d', '%.4f', '%.4f', '%.4f', '%.2f', '%.2f', '%.2f', '%.6f'], args=opt)
116 | # routine
117 | for epoch in range(1, opt.epochs + 1):
118 |
119 | adjust_learning_rate(epoch, opt, optimizer)
120 | print("==> training...")
121 |
122 | time1 = time.time()
123 | train_acc, train_loss = train(epoch, train_loader, model, criterion, optimizer, opt, warmup_scheduler)
124 | time2 = time.time()
125 | print('epoch {}, total time {:.2f}'.format(epoch, time2 - time1))
126 |
127 | test_acc, test_acc_top5, test_loss = validate(val_loader, model, criterion, opt)
128 |
129 | for param_group in optimizer.param_groups:
130 | lr = param_group['lr']
131 | logger.store([epoch, train_loss, 0, 0, train_acc, test_acc, best_acc, lr], log=True)
132 |
133 | # save the best model
134 | if test_acc > best_acc:
135 | best_acc = test_acc
136 | state = {
137 | 'epoch': epoch,
138 | 'model': model.state_dict(),
139 | 'best_acc': best_acc,
140 | 'optimizer': optimizer.state_dict(),
141 | }
142 | save_file = os.path.join(opt.save_folder, '{}_best.pth'.format(opt.model))
143 | print('saving the best model!')
144 | torch.save(state, save_file)
145 |
146 | # regular saving
147 | # if epoch % opt.save_freq == 0:
148 | # print('==> Saving...')
149 | # state = {
150 | # 'epoch': epoch,
151 | # 'model': model.state_dict(),
152 | # 'accuracy': test_acc,
153 | # 'optimizer': optimizer.state_dict(),
154 | # }
155 | # save_file = os.path.join(opt.save_folder, 'ckpt_epoch_{epoch}.pth'.format(epoch=epoch))
156 | # torch.save(state, save_file)
157 |
158 | print('best accuracy:', best_acc)
159 |
160 | # save model
161 | state = {
162 | 'opt': opt,
163 | 'model': model.state_dict(),
164 | 'optimizer': optimizer.state_dict(),
165 | }
166 | save_file = os.path.join(opt.save_folder, '{}_last.pth'.format(opt.model))
167 | torch.save(state, save_file)
168 |
169 |
170 | if __name__ == '__main__':
171 | main()
172 |
--------------------------------------------------------------------------------
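`train_teacher.py` parses `--lr_decay_epochs` from a comma-separated string and hands the result to `adjust_learning_rate`, whose body lives in `helper/util.py` and is not part of this listing. The sketch below shows one common way such a step schedule works; it is an assumption, not the repository's implementation. The rate is multiplied by `lr_decay_rate` once for every milestone the current epoch has passed. `parse_decay_epochs` and `step_lr` are hypothetical names.

```python
def parse_decay_epochs(spec):
    """Turn a string such as '200, 300, 400, 500' into a list of ints."""
    return [int(part) for part in spec.split(',') if part.strip()]


def step_lr(epoch, base_lr, decay_epochs, decay_rate):
    """Return the learning rate for a 1-indexed epoch.

    Assumption: the rate drops by `decay_rate` after each milestone,
    i.e. the decay takes effect from epoch milestone + 1 onward.
    """
    steps = sum(1 for milestone in decay_epochs if epoch > milestone)
    return base_lr * (decay_rate ** steps)


if __name__ == "__main__":
    milestones = parse_decay_epochs('200, 300, 400, 500')
    for epoch in (1, 200, 201, 301, 600):
        print(epoch, step_lr(epoch, 0.02, milestones, 0.1))
```

Whether the drop happens at the milestone epoch itself or one epoch later depends on the comparison used (`>` versus `>=`), so it is worth checking against `helper/util.py` before relying on exact epoch boundaries.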