├── .github
│   └── ISSUE_TEMPLATE.md
├── .gitignore
├── Cifar10-fast
│   ├── README.md
│   └── cifar10-fast.py
├── DCGAN
│   └── README.md
├── ImageNet
│   ├── README.md
│   ├── augmentors.py
│   ├── benchmark-dataflow.py
│   ├── benchmark-opencv-resize.py
│   ├── benchmark-tfdata.py
│   ├── dump-lmdb.py
│   └── symbolic_imagenet.py
├── LICENSE
├── MaskRCNN
│   ├── README.md
│   └── maskrcnn.patch
├── README.md
├── ResNet-Horovod
│   ├── README.md
│   ├── imagenet-resnet-horovod.py
│   ├── imagenet_utils.py
│   ├── resnet_model.py
│   ├── serve-data.py
│   └── slurm.script
├── ResNet-MultiGPU
│   ├── README.md
│   ├── resnet-multigpu.py
│   └── tfbench
│       ├── __init__.py
│       ├── convnet_builder.py
│       ├── model.py
│       ├── model_config.py
│       └── resnet_model.py
├── other-wrappers
│   ├── README.md
│   ├── keras.alexnet.py
│   ├── keras.cifar10.py
│   ├── keras.resnet.py
│   ├── keras.vgg.py
│   ├── tensorpack.alexnet.py
│   ├── tensorpack.cifar10.py
│   ├── tensorpack.resnet.py
│   ├── tensorpack.vgg.py
│   ├── tflearn.cifar10.py
│   └── tflearn.vgg.py
├── profile-import
│   ├── README.md
│   ├── import_profiler.py
│   └── profile-import.py
└── tox.ini

/.github/ISSUE_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | If you're asking about an unexpected problem whose root cause you do not know,
2 | please include:
3 | 
4 | ### 1. What you did:
5 | 
6 | (1) **The command you ran:**
7 | 
8 | (2) **Have you made any changes to the code? Paste `git status; git diff` here:**
9 | 
10 | Please try to provide enough information to let others __reproduce__ your issue.
11 | Without reproducing the issue, we may not be able to investigate it.
12 | 
13 | ### 2. What you observed:
14 | 
15 | (1) **Include the ENTIRE logs here:**
16 | 
17 | It's always better to copy-paste what you observed instead of describing it.
18 | 
19 | It's always better to paste **as much as possible**, although sometimes a partial log is OK.
20 | 
21 | Tensorpack typically saves stdout to its training log.
22 | If stderr is relevant, you can run a command with `CMD 2>&1 | tee logs.txt`
23 | to save both stdout and stderr to one file.
24 | 
25 | (2) **Other observations, if any:**
26 | For example, CPU/GPU utilization, output images, tensorboard curves, if relevant to your issue.
27 | 
28 | ### 3. What you expected, if not obvious.
29 | 
30 | If you expect a certain accuracy, we can help only in one of these two conditions:
31 | (1) You're unable to reproduce the accuracy documented in tensorpack examples.
32 | (2) It appears to be a tensorpack bug.
33 | 
34 | Otherwise, how to train a model to a certain accuracy is a machine learning question.
35 | We do not answer machine learning questions, and it is your responsibility to
36 | figure out how to make your models more accurate.
37 | 
38 | ### 4. Your environment:
39 | + Python version:
40 | + TF version: `python -c 'import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)'`.
41 | + Tensorpack version: `python -c 'import tensorpack; print(tensorpack.__version__);'`.
42 |   You can install Tensorpack master by `pip install -U git+https://github.com/ppwwyyxx/tensorpack.git`
43 |   and see if your issue is already solved.
44 | + If you're not using tensorpack under a normal command line shell (e.g.,
45 |   using an IDE or jupyter notebook), please retry under a normal command line shell.
46 | + Hardware information, e.g. number of GPUs used.
47 | 
48 | You may often want to provide extra information related to your issue, but
49 | at the minimum, please try to provide the above information __accurately__ to save effort in the investigation.
50 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | train_log
3 | __pycache__
4 | *.pyc
5 | *.so
6 | cifar-10-batches-py
--------------------------------------------------------------------------------
/Cifar10-fast/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # Train to 94% accuracy on Cifar10 within a minute
3 | 
4 | This script is able to train to 94% accuracy (average over multiple runs)
5 | on Cifar10 within a minute, when run with:
6 | 
7 | * 1 V100 GPU
8 | * TensorFlow 1.14
9 | * Tensorpack @047579df
10 | * CUDA 10, CuDNN 7.6.2
11 | 
12 | The script mostly follows the [cifar10-fast repo](https://github.com/davidcpage/cifar10-fast)
13 | with small modifications on architecture.
14 | 
15 | This sort of "competition" doesn't really involve any innovation, since it's
16 | mainly about recipe tuning and overfitting the test accuracy.
17 | But since someone has already tuned it, it's an interesting exercise to follow.
18 | 
19 | ## To Run:
20 | ```
21 | ./cifar10-fast.py --num-runs 10
22 | ```
23 | 
24 | Time: it takes about 2.2s per epoch over the 24-epoch training.
25 | The first epoch is slower because of CuDNN warmup and XLA compilation.
26 | 
27 | Accuracy: it prints the test accuracy after every run finishes, and the average accuracy in the end.
--------------------------------------------------------------------------------
/Cifar10-fast/cifar10-fast.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | import argparse
4 | import numpy as np
5 | import os
6 | import multiprocessing as mp
7 | import tensorflow as tf
8 | 
9 | from tensorpack import *
10 | from tensorpack.utils import logger
11 | from tensorpack.tfutils.tower import TowerFunc
12 | from tensorpack.tfutils.varreplace import custom_getter_scope
13 | from tensorpack.dataflow import dataset
14 | 
15 | 
16 | BATCH = 512
17 | STEPS_PER_EPOCH = 50000 // 512 + 1
18 | TOTAL_EPOCH = 24
19 | WARMUP_EPOCH = 5.0 / 24 * TOTAL_EPOCH
20 | USE_FP16 = True
21 | DATA_FORMAT = "NCHW"
22 | 
23 | 
24 | def get_inputs(batch):
25 |     return [tf.TensorSpec(
26 |         (batch, 3, 32, 32) if DATA_FORMAT == "NCHW" else (batch, 32, 32, 3),
27 |         tf.float32, 'input'),
28 |         tf.TensorSpec((batch, 10), tf.float32, 'label')]
29 | 
30 | 
31 | def build_graph(image, label):
32 |     if USE_FP16:
33 |         image = tf.cast(image, tf.float16)
34 | 
35 |     def activation(x):
36 |         return tf.nn.leaky_relu(x, alpha=0.1)
37 | 
38 |     def residual(name, x, chan):
39 |         with tf.variable_scope(name):
40 |             x = Conv2D('res1', x, chan, 3)
41 |             x = BatchNorm('bn1', x)
42 |             x = activation(x)
43 |             x = Conv2D('res2', x, chan, 3)
44 |             x = BatchNorm('bn2', x)
45 |             x = activation(x)
46 |             return x
47 | 
48 |     def fp16_getter(getter, *args, **kwargs):
49 |         name = args[0] if len(args) else kwargs['name']
50 |         if not USE_FP16 or (not name.endswith('/W') and not name.endswith('/b')):
51 |             # ignore BN's gamma and beta
52 |             return getter(*args, **kwargs)
53 |         else:
54 |             if kwargs['dtype'] == tf.float16:
55 |                 kwargs['dtype'] = tf.float32
56 |                 ret = getter(*args, **kwargs)
57 |                 return tf.cast(ret, tf.float16)
58 |             else:
59 |                 return getter(*args, **kwargs)
60 | 
61 |     with custom_getter_scope(fp16_getter), \
62 |             argscope(Conv2D, activation=tf.identity, use_bias=False), \
63 |             argscope([Conv2D, MaxPooling, BatchNorm], data_format=DATA_FORMAT), \
64 |             argscope(BatchNorm,
momentum=0.8): 65 | 66 | with tf.variable_scope('prep'): 67 | l = Conv2D('conv', image, 64, 3) 68 | l = BatchNorm('bn', l) 69 | l = activation(l) 70 | 71 | with tf.variable_scope("layer1"): 72 | l = Conv2D('conv', l, 128, 3) 73 | l = MaxPooling('pool', l, 2) 74 | l = BatchNorm('bn', l) 75 | l = activation(l) 76 | l = l + residual('res', l, 128) 77 | 78 | with tf.variable_scope("layer2"): 79 | l = Conv2D('conv', l, 256, 3) 80 | l = MaxPooling('pool', l, 2) 81 | l = BatchNorm('bn', l) 82 | l = activation(l) 83 | 84 | with tf.variable_scope("layer3"): 85 | l = Conv2D('conv', l, 512, 3) 86 | l = MaxPooling('pool', l, 2) 87 | l = BatchNorm('bn', l) 88 | l = activation(l) 89 | l = l + residual('res', l, 512) 90 | 91 | l = tf.reduce_max(l, axis=[2, 3] if DATA_FORMAT == "NCHW" else [1, 2]) 92 | l = FullyConnected('fc', l, 10, use_bias=False) 93 | logits = tf.cast(l * 0.125, tf.float32, name='logits') 94 | 95 | cost = tf.nn.softmax_cross_entropy_with_logits(labels=label, logits=logits) 96 | cost = tf.reduce_sum(cost) 97 | wd_cost = regularize_cost('.*', l2_regularizer(5e-4 * BATCH), name='regularize_loss') 98 | 99 | correct = tf.equal(tf.argmax(logits, axis=1), tf.argmax(label, axis=1), name='correct') 100 | return tf.add_n([cost, wd_cost], name='cost') 101 | 102 | 103 | def get_data(train_or_test): 104 | isTrain = train_or_test == 'train' 105 | ds = dataset.Cifar10(train_or_test) 106 | 107 | cifar10_mean = np.asarray([0.4914, 0.4822, 0.4465], dtype="float32") * 255. 108 | cifar10_invstd = 1.0 / (np.asarray([0.2471, 0.2435, 0.2616], dtype="float32") * 255) 109 | 110 | if isTrain: 111 | augmentors = imgaug.AugmentorList([ 112 | imgaug.RandomCrop((32, 32)), 113 | imgaug.Flip(horiz=True), 114 | imgaug.RandomCutout(8, 8), 115 | ]) 116 | 117 | def mapf(dp): 118 | img, label = dp 119 | img = (img.astype("float32") - cifar10_mean) * cifar10_invstd 120 | 121 | if isTrain: 122 | img = np.pad(img, [(4, 4), (4, 4), (0, 0)], mode='reflect') 123 | img = augmentors.augment(img) 124 | 125 | onehot = np.zeros((10, ), dtype=np.float32) + 0.2 / 9 126 | onehot[label] = 0.8 127 | else: 128 | onehot = np.zeros((10, ), dtype=np.float32) 129 | onehot[label] = 1. 
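        # Label smoothing: during training the true class gets probability 0.8 and
        # the remaining 0.2 is spread evenly over the other 9 classes;
        # validation uses hard one-hot labels.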
130 | 131 | if DATA_FORMAT == "NCHW": 132 | img = img.transpose(2, 0, 1) 133 | return img, onehot 134 | 135 | if not isTrain: 136 | ds = MapData(ds, mapf) 137 | ds = BatchData(ds, BATCH, remainder=False) 138 | return ds 139 | 140 | ds = MultiProcessMapAndBatchDataZMQ(ds, 8, mapf, BATCH, buffer_size=20000) 141 | ds = RepeatedData(ds, -1) 142 | return ds 143 | 144 | def run_once(result_queue): 145 | tf.reset_default_graph() 146 | trainer = SimpleTrainer() 147 | trainer.XLA_COMPILE = True 148 | 149 | with tf.device('/gpu:0'): 150 | gs = tf.train.get_or_create_global_step() 151 | gs = tf.cast(gs, tf.float32) 152 | # 0.0 -> 0.4 in warmup 153 | # 0.4 -> 0.0 in the rest epochs 154 | lr = tf.where(tf.greater(gs, 5 * STEPS_PER_EPOCH), 155 | (TOTAL_EPOCH * STEPS_PER_EPOCH - gs) / ((TOTAL_EPOCH - WARMUP_EPOCH) * STEPS_PER_EPOCH) * 0.4 / BATCH, 156 | gs / (WARMUP_EPOCH * STEPS_PER_EPOCH) * 0.4 / BATCH 157 | ) 158 | 159 | trainer.setup_graph( 160 | get_inputs(BATCH), 161 | #StagingInput(QueueInput(get_data('train'), queue=tf.FIFOQueue(300, [tf.float32, tf.int64]))), 162 | #DummyConstantInput([x.shape for x in get_inputs()]), 163 | StagingInput(TFDatasetInput(get_data('train'))), 164 | build_graph, 165 | lambda: tf.train.MomentumOptimizer(lr, 0.9, use_nesterov=True) 166 | ) 167 | 168 | trainer.train_with_defaults( 169 | callbacks=[ 170 | PeriodicTrigger( 171 | InferenceRunner( 172 | get_data('test'), ClassificationError('correct', 'val_acc'), 173 | # We used static shape in training, in order to allow XLA 174 | # But we want dynamic batch size for inference, therefore 175 | # recreate a tower function with different input signature. 176 | tower_func=TowerFunc(build_graph, get_inputs(None)) 177 | ), every_k_epochs=TOTAL_EPOCH), 178 | RunUpdateOps(), 179 | ], 180 | extra_callbacks=[], # disable all default callbacks 181 | monitors=[ScalarPrinter()], # disable other default monitors 182 | steps_per_epoch=STEPS_PER_EPOCH, 183 | max_epoch=TOTAL_EPOCH 184 | ) 185 | result_queue.put(trainer.monitors.get_latest('val_acc')) 186 | 187 | 188 | if __name__ == '__main__': 189 | parser = argparse.ArgumentParser() 190 | parser.add_argument('--gpu', help='comma separated list of GPU(s) to use.') 191 | parser.add_argument('--num-runs', default=1, type=int) 192 | args = parser.parse_args() 193 | 194 | if args.gpu: 195 | os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu 196 | os.environ["TF_AUTOTUNE_THRESHOLD"] = '1' 197 | 198 | q = mp.Queue() 199 | if args.num_runs == 1: 200 | run_once(q) 201 | logger.info("Val Acc: " + str(q.get())) 202 | else: 203 | val_accs = [] 204 | for k in range(args.num_runs): 205 | proc = mp.Process(target=run_once, args=(q,)) 206 | proc.start() 207 | val_accs.append(q.get()) 208 | proc.join(timeout=5) 209 | proc.terminate() 210 | logger.info("Val Accs: " + str(val_accs)) 211 | logger.info("Mean Val Acc: " + str(np.mean(val_accs))) 212 | -------------------------------------------------------------------------------- /DCGAN/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## DCGAN 3 | Environment: TFv1.3.0-rc1-1302-g593dc8e. Tesla P100. 4 | 5 | Code: [DCGAN-tensorflow](https://github.com/carpedm20/DCGAN-tensorflow/) at commit b13830. 6 | 7 | * DCGAN-tensorflow: 8 | ``` 9 | python main.py --dataset celebA --train --crop 10 | ``` 11 | This command takes 0.36s per iteration, where each iteration is 1 update to D and 2 updates to G. 
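(At 0.36 s per iteration, that is roughly 2.8 such iterations per second.)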
12 | 
13 | * [tensorpack DCGAN examples](https://github.com/tensorpack/tensorpack/blob/master/examples/GAN/DCGAN.py):
14 | 
15 | Modify the code to use `SeparateGANTrainer(..., d_period=2)`, and run with:
16 | ```
17 | python DCGAN.py --data /path/to/img_align_celebA --crop-size 108 --batch 64
18 | ```
19 | 
20 | This script runs at 15.5 it/s, where every two iterations correspond to one iteration in DCGAN-tensorflow.
21 | Therefore this script is roughly 2.8x faster.
--------------------------------------------------------------------------------
/ImageNet/README.md:
--------------------------------------------------------------------------------
1 | Some scripts to test ImageNet reading speed.
2 | 
3 | + With `tensorpack.dataflow` (pure Python loader):
4 | 
5 | Augmentations=[GoogleNetResize, Lighting, Flip]: reaches 5.6k images/s on a DGX1.
6 | 
7 | ```
8 | python benchmark-dataflow.py /path/to/imagenet --batch 128 --name train --aug resizeAndLighting
9 | ```
10 | 
11 | + With `tf.data`:
12 | 
13 | Augmentations=[GoogleNetResize, Flip]: 11k images/s on a DGX1.
14 | As a reference, a DGX1 with 8 V100s can train ResNet-50 in fp32 at 2.6k images/s.
15 | ```
16 | python benchmark-tfdata.py /path/to/imagenet --name train --batch 128
17 | ```
--------------------------------------------------------------------------------
/ImageNet/augmentors.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | # File: augmentors.py
4 | 
5 | import numpy as np
6 | import cv2
7 | from tensorpack.dataflow import imgaug
8 | 
9 | 
10 | __all__ = ['fbresnet_augmentor', 'inference_augmentor',
11 |            'resizeAndLighting_augmentor', 'resizeOnly_augmentor']
12 | 
13 | 
14 | 
15 | def inference_augmentor():
16 |     return [
17 |         imgaug.ResizeShortestEdge(256, cv2.INTER_CUBIC),
18 |         imgaug.CenterCrop((224, 224))
19 |     ]
20 | 
21 | 
22 | def fbresnet_augmentor():
23 |     # assume BGR input
24 |     augmentors = [
25 |         imgaug.GoogleNetRandomCropAndResize(),
26 |         imgaug.RandomOrderAug(
27 |             [imgaug.BrightnessScale((0.6, 1.4), clip=False),
28 |              imgaug.Contrast((0.6, 1.4), clip=False),
29 |              imgaug.Saturation(0.4, rgb=False),
30 |              # rgb->bgr conversion for the constants copied from fb.resnet.torch
31 |              imgaug.Lighting(0.1,
32 |                              eigval=np.asarray(
33 |                                  [0.2175, 0.0188, 0.0045][::-1]) * 255.0,
34 |                              eigvec=np.array(
35 |                                  [[-0.5675, 0.7192, 0.4009],
36 |                                   [-0.5808, -0.0045, -0.8140],
37 |                                   [-0.5836, -0.6948, 0.4203]],
38 |                                  dtype='float32')[::-1, ::-1]
39 |                              )]),
40 |         imgaug.Flip(horiz=True),
41 |     ]
42 |     return augmentors
43 | 
44 | 
45 | def resizeAndLighting_augmentor():
46 |     # assume BGR input
47 |     augmentors = [
48 |         imgaug.GoogleNetRandomCropAndResize(),
49 |         imgaug.Lighting(0.1,
50 |                         eigval=np.asarray(
51 |                             [0.2175, 0.0188, 0.0045][::-1]) * 255.0,
52 |                         eigvec=np.array(
53 |                             [[-0.5675, 0.7192, 0.4009],
54 |                              [-0.5808, -0.0045, -0.8140],
55 |                              [-0.5836, -0.6948, 0.4203]],
56 |                             dtype='float32')[::-1, ::-1]),
57 |         imgaug.Flip(horiz=True),
58 |     ]
59 |     return augmentors
60 | 
61 | 
62 | def resizeOnly_augmentor():
63 |     # assume BGR input; resize and flip only, no lighting
64 |     augmentors = [
65 |         imgaug.GoogleNetRandomCropAndResize(),
66 |         imgaug.Flip(horiz=True),
67 |     ]
68 |     return augmentors
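

# Minimal usage sketch (the ImageNet path is a placeholder), mirroring how
# benchmark-dataflow.py consumes these augmentor lists:
#
#   from tensorpack.dataflow import dataset, AugmentImageComponent
#   ds = dataset.ILSVRC12('/path/to/imagenet', 'train', shuffle=True)
#   ds = AugmentImageComponent(ds, fbresnet_augmentor())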
--------------------------------------------------------------------------------
/ImageNet/benchmark-dataflow.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | # File: benchmark-dataflow.py
4 | 
5 | import argparse
6 | import cv2
7 | 
8 | from tensorpack import *
9 | from tensorpack.dataflow.imgaug import *
10 | from tensorpack.dataflow.parallel import PlasmaGetData, PlasmaPutData  # noqa
11 | from tensorpack.utils.serialize import loads
12 | 
13 | import augmentors
14 | 
15 | 
16 | def test_orig(dir, name, augs, batch):
17 |     ds = dataset.ILSVRC12(dir, name, shuffle=True)
18 |     ds = AugmentImageComponent(ds, augs)
19 | 
20 |     ds = BatchData(ds, batch)
21 |     # ds = PlasmaPutData(ds)
22 |     ds = MultiProcessRunnerZMQ(ds, 50, hwm=80)
23 |     # ds = PlasmaGetData(ds)
24 |     return ds
25 | 
26 | 
27 | def test_lmdb_train(db, augs, batch):
28 |     ds = LMDBData(db, shuffle=False)
29 |     ds = LocallyShuffleData(ds, 50000)
30 |     ds = MultiProcessRunner(ds, 5000, 1)
31 |     ds = LMDBDataPoint(ds)
32 | 
33 |     def f(x):
34 |         return cv2.imdecode(x, cv2.IMREAD_COLOR)
35 |     ds = MapDataComponent(ds, f, 0)
36 |     ds = AugmentImageComponent(ds, augs)
37 | 
38 |     ds = BatchData(ds, batch, use_list=True)
39 |     # ds = PlasmaPutData(ds)
40 |     ds = MultiProcessRunnerZMQ(ds, 40, hwm=80)
41 |     # ds = PlasmaGetData(ds)
42 |     return ds
43 | 
44 | 
45 | def test_lmdb_inference(db, augs, batch):
46 |     ds = LMDBData(db, shuffle=False)
47 |     # ds = LocallyShuffleData(ds, 50000)
48 | 
49 |     augs = AugmentorList(augs)
50 | 
51 |     def mapper(data):
52 |         im, label = loads(data[1])
53 |         im = cv2.imdecode(im, cv2.IMREAD_COLOR)
54 |         im = augs.augment(im)
55 |         return im, label
56 | 
57 |     ds = MultiProcessMapData(ds, 40, mapper,
58 |                              buffer_size=200)
59 |     # ds = MultiThreadMapData(ds, 40, mapper, buffer_size=2000)
60 | 
61 |     ds = BatchData(ds, batch)
62 |     ds = MultiProcessRunnerZMQ(ds, 1)
63 |     return ds
64 | 
65 | 
66 | def test_inference(dir, name, augs, batch=128):
67 |     ds = dataset.ILSVRC12Files(dir, name, shuffle=False, dir_structure='train')
68 | 
69 |     aug = imgaug.AugmentorList(augs)
70 | 
71 |     def mapf(dp):
72 |         fname, cls = dp
73 |         im = cv2.imread(fname, cv2.IMREAD_COLOR)
74 |         im = aug.augment(im)
75 |         return im, cls
76 |     ds = MultiThreadMapData(ds, 30, mapf, buffer_size=2000, strict=True)
77 |     ds = BatchData(ds, batch)
78 |     ds = MultiProcessRunnerZMQ(ds, 1)
79 |     return ds
80 | 
81 | 
82 | if __name__ == '__main__':
83 |     available_augmentors = [
84 |         k[:-len("_augmentor")]
85 |         for k in augmentors.__all__ if k.endswith('_augmentor')]
86 |     parser = argparse.ArgumentParser()
87 |     parser.add_argument('data', help='file or directory of dataset')
88 |     parser.add_argument('--batch', type=int, default=64)
89 |     parser.add_argument('--name', choices=['train', 'val'], default='train')
90 |     parser.add_argument('--aug', choices=available_augmentors, required=True)
91 |     args = parser.parse_args()
92 | 
93 |     augs = getattr(augmentors, args.aug + '_augmentor')()
94 | 
95 |     if args.data.endswith('lmdb'):
96 |         if args.name == 'train':
97 |             ds = test_lmdb_train(args.data, augs, args.batch)
98 |         else:
99 |             ds = test_lmdb_inference(args.data, augs, args.batch)
100 |     else:
101 |         if args.name == 'train':
102 |             ds = test_orig(args.data, args.name, augs, args.batch)
103 |         else:
104 |             ds = test_inference(args.data, args.name, augs, args.batch)
105 |     TestDataSpeed(ds, 500000, warmup=100).start()
--------------------------------------------------------------------------------
/ImageNet/benchmark-opencv-resize.py:
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: benchmark-opencv-resize.py 4 | 5 | 6 | import cv2 7 | import time 8 | import numpy as np 9 | 10 | """ 11 | Some prebuilt opencv is much slower than others. 12 | You should check with this script and make sure it prints < 1s. 13 | 14 | 15 | On E5-2680v3, archlinux, this script prints: 16 | 0.61s for system opencv 3.4.0-2. 17 | >5 s for anaconda opencv 3.3.1 py36h6cbbc71_1. 18 | 19 | On E5-2650v4, this script prints: 20 | 0.6s for opencv built locally with -DWITH_OPENMP=OFF 21 | 0.6s for opencv from `pip install opencv-python`. 22 | 1.3s for opencv built locally with -DWITH_OPENMP=ON 23 | 2s for opencv from `conda install`. 24 | """ 25 | 26 | 27 | img = (np.random.rand(256, 256, 3) * 255).astype('uint8') 28 | 29 | start = time.time() 30 | for k in range(1000): 31 | out = cv2.resize(img, (384, 384)) 32 | print(time.time() - start) 33 | -------------------------------------------------------------------------------- /ImageNet/benchmark-tfdata.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: benchmark-tfdata.py 4 | 5 | import tqdm 6 | import argparse 7 | import tensorflow as tf 8 | from tensorpack.tfutils.common import get_default_sess_config 9 | 10 | from symbolic_imagenet import get_imglist, build_pipeline 11 | 12 | 13 | def benchmark_ds(ds, count, warmup=200): 14 | itr = ds.make_initializable_iterator() 15 | dp = itr.get_next() 16 | dpop = tf.group(*dp) 17 | with tf.Session(config=get_default_sess_config()) as sess: 18 | 19 | sess.run(itr.initializer) 20 | for _ in tqdm.trange(warmup): 21 | sess.run(dpop) 22 | for _ in tqdm.trange(count, smoothing=0.1): 23 | sess.run(dpop) 24 | 25 | 26 | if __name__ == '__main__': 27 | parser = argparse.ArgumentParser() 28 | parser.add_argument('data', help='directory to imagenet') 29 | parser.add_argument('--name', choices=['train', 'val'], default='train') 30 | parser.add_argument('--batch', type=int, default=128) 31 | parser.add_argument('--parallel', type=int, default=40) 32 | args = parser.parse_args() 33 | 34 | imglist = get_imglist(args.data, args.name) 35 | print("Number of Images: {}".format(len(imglist))) 36 | 37 | with tf.device('/cpu:0'): 38 | data = build_pipeline( 39 | imglist, args.name == 'train', 40 | args.batch, args.parallel) 41 | if args.name != 'train': 42 | data = data.repeat() # for benchmark 43 | benchmark_ds(data, 100000) 44 | -------------------------------------------------------------------------------- /ImageNet/dump-lmdb.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: dump-lmdb.py 4 | 5 | import numpy as np 6 | import cv2 7 | import os 8 | import argparse 9 | 10 | from tensorpack.dataflow import * 11 | 12 | 13 | class RawILSVRC12(DataFlow): 14 | def __init__(self, dir, name): 15 | self.dir = os.path.join(dir, name) 16 | 17 | meta = dataset.ILSVRCMeta() 18 | self.imglist = meta.get_image_list( 19 | name, 20 | dataset.ILSVRCMeta.guess_dir_structure(self.dir)) 21 | np.random.shuffle(self.imglist) 22 | 23 | def get_data(self): 24 | for fname, label in self.imglist: 25 | fname = os.path.join(self.dir, fname) 26 | im = cv2.imread(fname) 27 | assert im is not None, fname 28 | with open(fname, 'rb') as f: 29 | jpeg = f.read() 30 | jpeg = np.asarray(bytearray(jpeg), dtype='uint8') 31 | assert 
len(jpeg) > 10 32 | yield [jpeg, label] 33 | 34 | def size(self): 35 | return len(self.imglist) 36 | 37 | 38 | if __name__ == '__main__': 39 | parser = argparse.ArgumentParser() 40 | parser.add_argument('--data', help='path to ILSVRC12 images') 41 | parser.add_argument('--name', choices=['train', 'val']) 42 | parser.add_argument('--output', required=True) 43 | args = parser.parse_args() 44 | assert args.output.endswith('.lmdb') 45 | 46 | ds = RawILSVRC12(args.data, args.name) 47 | ds = PrefetchDataZMQ(ds, 1) 48 | dftools.dump_dataflow_to_lmdb(ds, args.output) 49 | -------------------------------------------------------------------------------- /ImageNet/symbolic_imagenet.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: symbolic_imagenet.py 4 | 5 | import os 6 | import tensorflow as tf 7 | import numpy as np 8 | from tensorpack.dataflow import dataset 9 | from tensorpack.utils import logger 10 | 11 | __all__ = ['get_imglist', 'build_pipeline', 'lighting'] 12 | 13 | 14 | def get_imglist(dir, name): 15 | """ 16 | Args: 17 | dir(str): directory which contains name 18 | name(str): 'train' or 'val' 19 | 20 | Returns: 21 | [(full filename, label)] 22 | """ 23 | dir = os.path.join(dir, name) 24 | meta = dataset.ILSVRCMeta() 25 | imglist = meta.get_image_list( 26 | name, 27 | dataset.ILSVRCMeta.guess_dir_structure(dir)) 28 | 29 | def _filter(fname): 30 | # png 31 | return 'n02105855_2933.JPEG' in fname 32 | 33 | ret = [] 34 | for fname, label in imglist: 35 | if _filter(fname): 36 | logger.info("Image {} was filtered out.".format(fname)) 37 | continue 38 | fname = os.path.join(dir, fname) 39 | ret.append((fname, label)) 40 | return ret 41 | 42 | 43 | def uint8_resize_bicubic(image, shape): 44 | ret = tf.image.resize_bicubic([image], shape) 45 | return tf.cast(tf.clip_by_value(ret, 0, 255), tf.uint8)[0] 46 | 47 | 48 | def resize_shortest_edge(image, image_shape, size): 49 | shape = tf.cast(image_shape, tf.float32) 50 | w_greater = tf.greater(image_shape[0], image_shape[1]) 51 | shape = tf.cond(w_greater, 52 | lambda: tf.cast([shape[0] / shape[1] * size, size], tf.int32), 53 | lambda: tf.cast([size, shape[1] / shape[0] * size], tf.int32)) 54 | 55 | return uint8_resize_bicubic(image, shape) 56 | 57 | 58 | def center_crop(image, size): 59 | image_height = tf.shape(image)[0] 60 | image_width = tf.shape(image)[1] 61 | 62 | offset_height = (image_height - size) // 2 63 | offset_width = (image_width - size) // 2 64 | image = tf.slice(image, [offset_height, offset_width, 0], [size, size, -1]) 65 | return image 66 | 67 | 68 | def lighting(image, std, eigval, eigvec): 69 | v = tf.random_uniform(shape=[3]) * std * eigval 70 | inc = tf.matmul(eigvec, tf.reshape(v, [3, 1])) 71 | image = tf.cast(tf.cast(image, tf.float32) + tf.reshape(inc, [3]), image.dtype) 72 | return image 73 | 74 | 75 | def training_mapper(filename, label): 76 | byte = tf.read_file(filename) 77 | 78 | jpeg_opt = {'fancy_upscaling': True, 'dct_method': 'INTEGER_ACCURATE'} 79 | jpeg_shape = tf.image.extract_jpeg_shape(byte) # hwc 80 | bbox_begin, bbox_size, distort_bbox = tf.image.sample_distorted_bounding_box( 81 | jpeg_shape, 82 | bounding_boxes=tf.zeros(shape=[0, 0, 4]), 83 | min_object_covered=0, 84 | aspect_ratio_range=[0.75, 1.33], 85 | area_range=[0.08, 1.0], 86 | max_attempts=10, 87 | use_image_if_no_bounding_boxes=True) 88 | 89 | is_bad = tf.reduce_sum(tf.cast(tf.equal(bbox_size, jpeg_shape), tf.int32)) >= 2 90 | 91 | def good(): 92 | 
offset_y, offset_x, _ = tf.unstack(bbox_begin) 93 | target_height, target_width, _ = tf.unstack(bbox_size) 94 | crop_window = tf.stack([offset_y, offset_x, target_height, target_width]) 95 | 96 | image = tf.image.decode_and_crop_jpeg( 97 | byte, crop_window, channels=3, **jpeg_opt) 98 | image = uint8_resize_bicubic(image, [224, 224]) 99 | return image 100 | 101 | def bad(): 102 | image = tf.image.decode_jpeg( 103 | tf.reshape(byte, shape=[]), 3, **jpeg_opt) 104 | image = resize_shortest_edge(image, jpeg_shape, 224) 105 | image = center_crop(image, 224) 106 | return image 107 | 108 | image = tf.cond(is_bad, bad, good) 109 | # TODO other imgproc 110 | # image = lighting(image, 0.1, 111 | # eigval=np.array([0.2175, 0.0188, 0.0045], dtype='float32') * 255.0, 112 | # eigvec=np.array([[-0.5675, 0.7192, 0.4009], 113 | # [-0.5808, -0.0045, -0.8140], 114 | # [-0.5836, -0.6948, 0.4203]], dtype='float32')) 115 | image = tf.image.random_flip_left_right(image) 116 | return image, label 117 | 118 | 119 | def validation_mapper(filename, label): 120 | byte = tf.read_file(filename) 121 | 122 | jpeg_opt = {'fancy_upscaling': True, 'dct_method': 'INTEGER_ACCURATE'} 123 | image = tf.image.decode_jpeg( 124 | tf.reshape(byte, shape=[]), 3, **jpeg_opt) 125 | image = resize_shortest_edge(image, tf.shape(image), 256) 126 | image = center_crop(image, 224) 127 | return image, label 128 | 129 | 130 | def build_pipeline(imglist, training, batch, parallel): 131 | """ 132 | Args: 133 | imglist (list): [(full filename, label)] 134 | training (bool): 135 | batch (int): 136 | parallel (int): 137 | 138 | If training, returns an infinite dataset. 139 | 140 | Note that it produces RGB images, not BGR. 141 | """ 142 | N = len(imglist) 143 | filenames = tf.constant([k[0] for k in imglist], name='filenames') 144 | labels = tf.constant([k[1] for k in imglist], dtype=tf.int32, name='labels') 145 | 146 | ds = tf.data.Dataset.from_tensor_slices((filenames, labels)) 147 | 148 | if training: 149 | ds = ds.shuffle(N, reshuffle_each_iteration=True).repeat() 150 | 151 | mapper = training_mapper if training else validation_mapper 152 | 153 | ds = ds.apply( 154 | tf.contrib.data.map_and_batch( 155 | mapper, 156 | batch_size=batch, 157 | num_parallel_batches=parallel)) 158 | ds = ds.prefetch(100) 159 | return ds 160 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | This is free and unencumbered software released into the public domain. 2 | 3 | Anyone is free to copy, modify, publish, use, compile, sell, or 4 | distribute this software, either in source code form or as a compiled 5 | binary, for any purpose, commercial or non-commercial, and by any 6 | means. 7 | 8 | In jurisdictions that recognize copyright laws, the author or authors 9 | of this software dedicate any and all copyright interest in the 10 | software to the public domain. We make this dedication for the benefit 11 | of the public at large and to the detriment of our heirs and 12 | successors. We intend this dedication to be an overt act of 13 | relinquishment in perpetuity of all present and future rights to this 14 | software under copyright law. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
19 | IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
20 | OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
21 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
22 | OTHER DEALINGS IN THE SOFTWARE.
23 | 
24 | For more information, please refer to <https://unlicense.org>
25 | 
--------------------------------------------------------------------------------
/MaskRCNN/README.md:
--------------------------------------------------------------------------------
1 | ## Mask R-CNN
2 | 
3 | This benchmarks tensorpack's Mask R-CNN implementation
4 | against the popular Matterport Mask R-CNN implementation.
5 | 
6 | ### Environment:
7 | 
8 | * TensorFlow 1.14 (6e0893c79) + PR30893
9 | * Python 3.7
10 | * CUDA 10.0, CuDNN 7.6.2
11 | * tensorpack 0.9.7.1 (a7f4094d)
12 | * keras 2.2.5
13 | * matterport/Mask_RCNN 3deaec5d
14 | * horovod 0.18.0
15 | * 8xV100s + 80xE5-2698 v4
16 | 
17 | ### Settings:
18 | * Use the standard hyperparameters used by [Detectron](https://github.com/facebookresearch/Detectron/),
19 |   except that the total batch size is set to 8.
20 | 
21 | * `export TF_CUDNN_USE_AUTOTUNE=0` to avoid CuDNN warmup time.
22 | 
23 | * Measure speed using "images per second", in the second or later epochs.
24 | 
25 | 
26 | ### [tensorpack FasterRCNN example](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN):
27 | 
28 | Using `TRAINER=replicated`, the speed is about 42 img/s:
29 | ```
30 | ./train.py --config DATA.BASEDIR=~/data/coco DATA.NUM_WORKERS=20 MODE_FPN=True --load ImageNet-R50-AlignPadding.npz
31 | ```
32 | 
33 | Using `TRAINER=horovod`, the speed is about 50 img/s:
34 | ```
35 | mpirun -np 8 ./train.py --config DATA.BASEDIR=~/data/coco MODE_FPN=True TRAINER=horovod --load ImageNet-R50-AlignPadding.npz
36 | ```
37 | 
38 | ### [matterport/Mask_RCNN](https://github.com/matterport/Mask_RCNN/):
39 | 
40 | Apply [maskrcnn.patch](maskrcnn.patch) to make it use the same hyperparameters.
41 | Then, run the command:
42 | 
43 | ```
44 | python coco.py train --dataset=~/data/coco/ --model=imagenet
45 | ```
46 | 
47 | It trains at 0.77 s/step, i.e. about 10 img/s.
48 | If using 2 images per GPU, it improves to 12 img/s.
49 | 
50 | 
51 | ### Note:
52 | 
53 | * Mask R-CNN is a complicated system and there could be many implementation differences.
54 |   The patch above only makes the two systems run roughly the same training.
55 | 
56 | * The training time of an R-CNN typically decreases slowly as training progresses.
57 |   In this experiment we only look at the training time of the first couple thousand iterations.
58 |   It cannot be extrapolated to compute the total training time of the model.
59 | 
60 | * Tensorpack's Mask R-CNN is not only fast, but also
61 |   [more accurate](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN#results).
--------------------------------------------------------------------------------
/MaskRCNN/maskrcnn.patch:
--------------------------------------------------------------------------------
1 | diff --git i/samples/coco/coco.py w/samples/coco/coco.py
2 | index 5d172b5..b0bba41 100644
3 | --- i/samples/coco/coco.py
4 | +++ w/samples/coco/coco.py
5 | @@ -78,10 +78,13 @@ class CocoConfig(Config):
6 |  
7 |      # We use a GPU with 12GB memory, which can fit two images.
8 |      # Adjust down if you use a smaller GPU.
9 | -    IMAGES_PER_GPU = 2
10 | +    IMAGES_PER_GPU = 1
11 |  
12 |      # Uncomment to train on 8 GPUs (default is 1)
13 | -    # GPU_COUNT = 8
14 | +    GPU_COUNT = 8
15 | +    BACKBONE = "resnet50"
16 | +    STEPS_PER_EPOCH = 200
17 | +    TRAIN_ROIS_PER_IMAGE = 512
18 |  
19 |      # Number of classes (including background)
20 |      NUM_CLASSES = 1 + 80  # COCO has 80 classes
21 | @@ -496,29 +499,10 @@ if __name__ == '__main__':
22 |          # *** This training schedule is an example. Update to your needs ***
23 |  
24 |          # Training - Stage 1
25 | -        print("Training network heads")
26 |          model.train(dataset_train, dataset_val,
27 |                      learning_rate=config.LEARNING_RATE,
28 |                      epochs=40,
29 | -                    layers='heads',
30 | -                    augmentation=augmentation)
31 | -
32 | -        # Training - Stage 2
33 | -        # Finetune layers from ResNet stage 4 and up
34 | -        print("Fine tune Resnet stage 4 and up")
35 | -        model.train(dataset_train, dataset_val,
36 | -                    learning_rate=config.LEARNING_RATE,
37 | -                    epochs=120,
38 | -                    layers='4+',
39 | -                    augmentation=augmentation)
40 | -
41 | -        # Training - Stage 3
42 | -        # Fine tune all layers
43 | -        print("Fine tune all layers")
44 | -        model.train(dataset_train, dataset_val,
45 | -                    learning_rate=config.LEARNING_RATE / 10,
46 | -                    epochs=160,
47 | -                    layers='all',
48 | +                    layers='3+',
49 |                      augmentation=augmentation)
50 |  
51 |      elif args.command == "evaluate":
52 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # tensorpack benchmarks
3 | 
4 | We use TensorFlow efficiently. Tensorpack is:
5 | 
6 | * [As fast as tensorflow/benchmarks in multi-GPU ResNet training](ResNet-MultiGPU/)
7 | * [__1.2x~5x__ faster than Keras & tflearn in common CNNs](other-wrappers/)
8 | * [Able to reproduce "ImageNet in one hour" with 256 GPUs](ResNet-Horovod/)
9 | * [Able to train Cifar10 to 94% accuracy within __a minute__](Cifar10-fast)
10 | * [__5x__ faster than matterport/Mask_RCNN](MaskRCNN/)
11 | * [2.8x faster than DCGAN-tensorflow](DCGAN/)
12 | 
13 | All of the above claims can be reproduced with the corresponding code.
--------------------------------------------------------------------------------
/ResNet-Horovod/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # Tensorpack + Horovod
3 | 
4 | Multi-GPU / distributed training on ImageNet, with TensorFlow + Tensorpack + Horovod.
5 | 
6 | It reproduces the settings in the paper
7 | + [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677)
8 | 
9 | The code is annotated with sentences from the paper.
10 | 
11 | Based on this baseline implementation, we implemented adversarial training and obtained ImageNet classifiers with state-of-the-art adversarial robustness. See our code release at [facebookresearch/ImageNet-Adversarial-Training](https://github.com/facebookresearch/ImageNet-Adversarial-Training/).
12 | 
13 | ## Dependencies:
14 | + TensorFlow>=1.5, tensorpack>=0.9.9.
15 | + [Horovod](https://github.com/uber/horovod) with NCCL support.
16 |   See [doc](https://github.com/uber/horovod/blob/master/docs/gpus.md) for its installation instructions.
17 | + [zmq_ops](https://github.com/tensorpack/zmq_ops): optional but recommended.
18 | + Prepare ImageNet data into [this structure](http://tensorpack.readthedocs.io/modules/dataflow.dataset.html#tensorpack.dataflow.dataset.ILSVRC12).
19 | 
20 | ## Run:
21 | ```bash
22 | # Single Machine, Multiple GPUs:
23 | # Run the following two commands together:
24 | $ ./serve-data.py --data ~/data/imagenet/ --batch 64
25 | $ mpirun -np 8 --output-filename test.log python3 ./imagenet-resnet-horovod.py -d 50 --data ~/data/imagenet/ --batch 64
26 | ```
27 | 
28 | ```bash
29 | # Multiple Machines with RoCE/IB:
30 | host1$ ./serve-data.py --data ~/data/imagenet/ --batch 64
31 | host2$ ./serve-data.py --data ~/data/imagenet/ --batch 64
32 | $ mpirun -np 16 -H host1:8,host2:8 --output-filename test.log \
33 |     -bind-to none -map-by slot -mca pml ob1 \
34 |     -x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO \
35 |     -x PATH -x PYTHONPATH -x LD_LIBRARY_PATH \
36 |     python3 ./imagenet-resnet-horovod.py -d 50 \
37 |     --data ~/data/imagenet/ --batch 64 --validation distributed
38 | ```
39 | 
40 | Notes:
41 | 1. MPI does not like fork(), so running `serve-data.py` inside MPI is not a good idea.
42 | 2. You may tune the best mca & NCCL options for your own systems.
43 |    See [horovod docs](https://github.com/uber/horovod/blob/master/docs/) for details.
44 |    Note that plain TCP connections will have much worse scaling efficiency.
45 | 3. To train on small datasets, __you don't need a separate data serving process or zmq ops__.
46 |    You can simply load data inside each training process with its own data loader.
47 |    The main motivation to use a separate data loader is to avoid fork() inside
48 |    MPI and to make it easier to benchmark.
49 | 4. You can pass `--no-zmq-ops` to both scripts, to use Python for communication instead of the faster zmq_ops.
50 | 5. If you're using slurm in a cluster, check out an example [sbatch script](slurm.script).
51 | 
52 | ## Performance Benchmark:
53 | ```bash
54 | # To benchmark data speed:
55 | $ ./serve-data.py --data ~/data/imagenet/ --batch 64 --benchmark
56 | # To benchmark training with fake data:
57 | # Run the training command with `--fake`
58 | ```
59 | 
60 | ## Distributed ResNet50 Results:
61 | 
62 | | devices   | batch per GPU | time [1](#ft1)   | top1 err [3](#ft3) |
63 | | -         | -             | -                | -                  |
64 | | 32 P100s  | 64            | 5h9min           | 23.73%             |
65 | | 128 P100s | 32            | 1h40min          | 23.62%             |
66 | | 128 P100s | 64            | 1h23min          | 23.97%             |
67 | | 256 P100s | 32            | 1h9min [2](#ft2) | 23.90%             |
68 | 
69 | 
70 | 1: Validation time excluded from total time. Time depends on your hardware.
71 | 
72 | 2: This corresponds to exactly the "1 hour" setting in the original paper.
73 | 
74 | 3: The final error typically has ±0.1 or more fluctuation according to the paper.
75 | 
76 | Although the code does not scale ideally to 32 machines, it does scale with 90+% efficiency on 2 or 4 machines.
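
As a worked example of the linear scaling rule in `imagenet-resnet-horovod.py`
(`BASE_LR = 0.1 * (total_batch // 256)`): the 256-GPU row above uses a total batch of
256 × 32 = 8192, hence a reference learning rate of 0.1 × 8192 / 256 = 3.2,
reached by linear warmup from 0.1 over the first 5 epochs.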
77 | 
--------------------------------------------------------------------------------
/ResNet-Horovod/imagenet-resnet-horovod.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | # File: imagenet-resnet-horovod.py
4 | 
5 | import argparse
6 | import sys
7 | import os
8 | import socket
9 | import numpy as np
10 | 
11 | import tensorflow as tf
12 | from tensorpack import *
13 | from tensorpack.tfutils import argscope, SmartInit
14 | 
15 | import horovod.tensorflow as hvd
16 | 
17 | from imagenet_utils import (
18 |     fbresnet_augmentor, get_val_dataflow, ImageNetModel, eval_classification)
19 | from resnet_model import (
20 |     resnet_group, resnet_bottleneck, resnet_backbone, Norm)
21 | 
22 | 
23 | class Model(ImageNetModel):
24 |     def __init__(self, depth, norm='BN'):
25 |         self.num_blocks = {
26 |             50: [3, 4, 6, 3],
27 |             101: [3, 4, 23, 3],
28 |             152: [3, 8, 36, 3],
29 |         }[depth]
30 |         self.norm = norm
31 | 
32 |     def get_logits(self, image):
33 |         with argscope([Conv2D, MaxPooling, GlobalAvgPooling, BatchNorm], data_format='NCHW'), \
34 |                 argscope(Norm, type=self.norm):
35 |             return resnet_backbone(image, self.num_blocks, resnet_group, resnet_bottleneck)
36 | 
37 | 
38 | class HorovodClassificationError(ClassificationError):
39 |     """
40 |     Like ClassificationError, it evaluates the total number of samples and the number of wrong samples.
41 |     In the end, total & count are aggregated across all workers by a horovod allreduce.
42 |     """
43 |     def _setup_graph(self):
44 |         self._placeholder = tf.placeholder(tf.float32, shape=[2], name='to_be_reduced')
45 |         self._reduced = hvd.allreduce(self._placeholder, average=False)
46 | 
47 |     def _after_inference(self):
48 |         tot = self.err_stat.total
49 |         cnt = self.err_stat.count
50 |         tot, cnt = self._reduced.eval(feed_dict={self._placeholder: [tot, cnt]})
51 |         return {self.summary_name: cnt * 1. / tot}
52 | 
53 | 
54 | def get_config(model, fake=False):
55 |     batch = args.batch
56 |     total_batch = batch * hvd.size()
57 | 
58 |     if fake:
59 |         data = FakeData(
60 |             [[args.batch, 224, 224, 3], [args.batch]], 1000,
61 |             random=False, dtype=['uint8', 'int32'])
62 |         data = StagingInput(QueueInput(data))
63 |         callbacks = []
64 |         steps_per_epoch = 50
65 |     else:
66 |         logger.info("#Tower: {}; Batch size per tower: {}".format(hvd.size(), batch))
67 |         zmq_addr = 'ipc://@imagenet-train-b{}'.format(batch)
68 |         if args.no_zmq_ops:
69 |             dataflow = RemoteDataZMQ(zmq_addr, hwm=150, bind=False)
70 |             data = QueueInput(dataflow)
71 |         else:
72 |             data = ZMQInput(zmq_addr, 30, bind=False)
73 |         data = StagingInput(data, nr_stage=1)
74 | 
75 |         steps_per_epoch = int(np.round(1281167 / total_batch))
76 | 
77 |     """
78 |     Sec 2.1: Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
79 |     """
80 |     BASE_LR = 0.1 * (total_batch // 256)
81 |     logger.info("Base LR: {}".format(BASE_LR))
82 |     """
83 |     Sec 5.1:
84 |     We call this number (0.1 * kn / 256) the reference learning rate,
85 |     and reduce it by 1/10 at the 30-th, 60-th, and 80-th epoch
86 |     """
87 |     callbacks = [
88 |         ModelSaver(max_to_keep=100),
89 |         EstimatedTimeLeft(),
90 |         ScheduledHyperParamSetter(
91 |             'learning_rate', [(0, BASE_LR), (30, BASE_LR * 1e-1), (60, BASE_LR * 1e-2),
92 |                               (80, BASE_LR * 1e-3)]),
93 |     ]
94 |     if BASE_LR > 0.1:
95 |         """
96 |         Sec 2.2: In practice, with a large minibatch of size kn, we start from a learning rate of η and increment
97 |         it by a constant amount at each iteration such that it reaches η̂ = kη after 5 epochs.
98 |         After the warmup phase, we go back to the original learning rate schedule.
99 |         """
100 |         callbacks.append(
101 |             ScheduledHyperParamSetter(
102 |                 'learning_rate', [(0, 0.1), (5 * steps_per_epoch, BASE_LR)],
103 |                 interp='linear', step_based=True))
104 | 
105 |     if args.validation is not None:
106 |         # TODO For distributed training, you probably don't want every worker to wait for the master to run validation.
107 |         # Better to start a separate job, since the model is saved.
108 |         if args.validation == 'master' and hvd.rank() == 0:
109 |             # For reproducibility, do not use remote data for validation
110 |             dataset_val = get_val_dataflow(
111 |                 args.data, 64, fbresnet_augmentor(False))
112 |             infs = [ClassificationError('wrong-top1', 'val-error-top1'),
113 |                     ClassificationError('wrong-top5', 'val-error-top5')]
114 |             callbacks.append(InferenceRunner(QueueInput(dataset_val), infs))
115 |         # For simple validation tasks such as image classification, distributed validation is possible.
116 |         elif args.validation == 'distributed':
117 |             dataset_val = get_val_dataflow(
118 |                 args.data, 64, fbresnet_augmentor(False),
119 |                 num_splits=hvd.size(), split_index=hvd.rank())
120 |             infs = [HorovodClassificationError('wrong-top1', 'val-error-top1'),
121 |                     HorovodClassificationError('wrong-top5', 'val-error-top5')]
122 |             callbacks.append(
123 |                 InferenceRunner(QueueInput(dataset_val), infs).set_chief_only(False))
124 | 
125 |     return TrainConfig(
126 |         model=model,
127 |         data=data,
128 |         callbacks=callbacks,
129 |         steps_per_epoch=steps_per_epoch,
130 |         max_epoch=35 if args.fake else 95,
131 |     )
132 | 
133 | 
134 | if __name__ == '__main__':
135 |     parser = argparse.ArgumentParser()
136 |     parser.add_argument('--data', help='ILSVRC dataset dir')
137 |     parser.add_argument('--logdir', help='Directory for models and training stats.')
138 |     parser.add_argument('--load', help='load model')
139 |     parser.add_argument('--eval', action='store_true', help='run evaluation with --load instead of training.')
140 | 
141 |     parser.add_argument('--fake', help='use fake data to test or benchmark this model', action='store_true')
142 |     parser.add_argument('-d', '--depth', help='resnet depth',
143 |                         type=int, default=50, choices=[50, 101, 152])
144 |     parser.add_argument('--norm', choices=['BN', 'GN'], default='BN')
145 |     parser.add_argument('--accum-grad', type=int, default=1)
146 |     parser.add_argument('--weight-decay-norm', action='store_true',
147 |                         help="apply weight decay on normalization layers (gamma & beta). "
148 |                              "This is used in torch/pytorch, and slightly "
149 |                              "improves validation accuracy of large models.")
150 |     parser.add_argument('--validation', choices=['distributed', 'master'],
151 |                         help='Validation method. By default the script performs no validation.')
152 |     parser.add_argument('--no-zmq-ops', help='use pure python to send/receive data',
153 |                         action='store_true')
154 |     """
155 |     Sec 2.3: We keep the per-worker sample size n constant when we change the number of workers k.
156 |     In this work, we use n = 32 which has performed well for a wide range of datasets and networks.
157 | """ 158 | parser.add_argument('--batch', help='per-GPU batch size', default=32, type=int) 159 | args = parser.parse_args() 160 | 161 | model = Model(args.depth, args.norm) 162 | model.accum_grad = args.accum_grad 163 | if args.weight_decay_norm: 164 | model.weight_decay_pattern = ".*/W|.*/gamma|.*/beta" 165 | 166 | if args.eval: 167 | batch = 128 # something that can run on one gpu 168 | ds = get_val_dataflow(args.data, batch, fbresnet_augmentor(False)) 169 | eval_classification(model, SmartInit(args.load), ds) 170 | sys.exit() 171 | 172 | logger.info("Training on {}".format(socket.gethostname())) 173 | # Print some information for sanity check. 174 | os.system("nvidia-smi") 175 | assert args.load is None 176 | 177 | hvd.init() 178 | 179 | if args.logdir is None: 180 | args.logdir = os.path.join('train_log', 'Horovod-{}GPUs-{}Batch'.format(hvd.size(), args.batch)) 181 | 182 | if hvd.rank() == 0: 183 | logger.set_logger_dir(args.logdir, 'd') 184 | logger.info("Rank={}, Local Rank={}, Size={}".format(hvd.rank(), hvd.local_rank(), hvd.size())) 185 | 186 | """ 187 | Sec 3: Remark 3: Normalize the per-worker loss by 188 | total minibatch size kn, not per-worker size n. 189 | """ 190 | model.loss_scale = 1.0 / hvd.size() 191 | config = get_config(model, fake=args.fake) 192 | """ 193 | Sec 3: standard communication primitives like 194 | allreduce [11] perform summing, not averaging 195 | """ 196 | trainer = HorovodTrainer(average=False) 197 | launch_train_with_config(config, trainer) 198 | -------------------------------------------------------------------------------- /ResNet-Horovod/imagenet_utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: imagenet_utils.py 4 | 5 | """ 6 | This file is modified from 7 | https://github.com/tensorpack/tensorpack/blob/master/examples/ImageNetModels/imagenet_utils.py 8 | """ 9 | 10 | 11 | import multiprocessing as mp 12 | import numpy as np 13 | from abc import abstractmethod 14 | import cv2 15 | import tensorflow as tf 16 | 17 | from tensorpack import imgaug, dataset, ModelDesc 18 | from tensorpack.dataflow import ( 19 | BatchData, MultiThreadMapData, DataFromList, 20 | AugmentImageComponent, MultiProcessRunnerZMQ) 21 | from tensorpack.models import regularize_cost 22 | from tensorpack.predict import FeedfreePredictor, PredictConfig 23 | from tensorpack.tfutils.summary import add_moving_summary 24 | from tensorpack.tfutils.optimizer import AccumGradOptimizer 25 | from tensorpack.utils import logger 26 | from tensorpack.utils.stats import RatioCounter 27 | 28 | """ 29 | ====== DataFlow ======= 30 | """ 31 | 32 | 33 | def fbresnet_augmentor(isTrain): 34 | """ 35 | Augmentor used in fb.resnet.torch, for BGR images in range [0,255]. 36 | """ 37 | interpolation = cv2.INTER_LINEAR 38 | if isTrain: 39 | """ 40 | Sec 5.1: 41 | We use scale and aspect ratio data augmentation [35] as 42 | in [12]. The network input image is a 224×224 pixel random 43 | crop from an augmented image or its horizontal flip. 44 | """ 45 | augmentors = [ 46 | imgaug.GoogleNetRandomCropAndResize(interp=interpolation), 47 | # It's OK to remove the following augs if your CPU is not fast enough. 48 | # Removing brightness/contrast/saturation does not have a significant effect on accuracy. 49 | # Removing lighting leads to a tiny drop in accuracy. 
50 |             imgaug.ToFloat32(),
51 |             imgaug.RandomOrderAug(
52 |                 [imgaug.BrightnessScale((0.6, 1.4), clip=False),
53 |                  imgaug.Contrast((0.6, 1.4), rgb=False, clip=False),
54 |                  imgaug.Saturation(0.4, rgb=False, clip=False),
55 |                  # rgb-bgr conversion for the constants copied from fb.resnet.torch
56 |                  imgaug.Lighting(0.1,
57 |                                  eigval=np.asarray(
58 |                                      [0.2175, 0.0188, 0.0045][::-1]) * 255.0,
59 |                                  eigvec=np.array(
60 |                                      [[-0.5675, 0.7192, 0.4009],
61 |                                       [-0.5808, -0.0045, -0.8140],
62 |                                       [-0.5836, -0.6948, 0.4203]],
63 |                                      dtype='float32')[::-1, ::-1]
64 |                                  )]),
65 |             imgaug.ToUint8(),
66 |             imgaug.Flip(horiz=True),
67 |         ]
68 |     else:
69 |         augmentors = [
70 |             imgaug.ResizeShortestEdge(256, interp=interpolation),
71 |             imgaug.CenterCrop((224, 224)),
72 |         ]
73 |     return augmentors
74 | 
75 | 
76 | def get_train_dataflow(datadir, batch, augmentors=None, parallel=None):
77 |     """
78 |     Sec 3, Remark 4:
79 |     Use a single random shuffling of the training data (per epoch)
80 |     that is divided amongst all k workers.
81 | 
82 |     NOTE: This implementation differs from the paper in this respect:
83 |     here, each machine shuffles its data independently.
84 |     """
85 |     if parallel is None:
86 |         parallel = min(50, mp.cpu_count())
87 |     if augmentors is None:
88 |         augmentors = fbresnet_augmentor(True)
89 |     ds = dataset.ILSVRC12(datadir, 'train', shuffle=True)
90 |     ds = AugmentImageComponent(ds, augmentors, copy=False)
91 |     ds = BatchData(ds, batch, remainder=False)
92 |     ds = MultiProcessRunnerZMQ(ds, parallel)
93 |     return ds
94 | 
95 | 
96 | def get_val_dataflow(
97 |         datadir, batch_size,
98 |         augmentors=None, parallel=None,
99 |         num_splits=None, split_index=None):
100 |     if augmentors is None:
101 |         augmentors = fbresnet_augmentor(False)
102 |     assert datadir is not None
103 |     assert isinstance(augmentors, list)
104 |     if parallel is None:
105 |         parallel = min(40, mp.cpu_count())
106 | 
107 |     if num_splits is None:
108 |         ds = dataset.ILSVRC12Files(datadir, 'val', shuffle=False)
109 |     else:
110 |         # shard validation data
111 |         assert split_index < num_splits
112 |         files = dataset.ILSVRC12Files(datadir, 'val', shuffle=False)
113 |         files.reset_state()
114 |         files = list(files.get_data())
115 |         logger.info("Number of validation data = {}".format(len(files)))
116 |         split_size = len(files) // num_splits
117 |         start, end = split_size * split_index, split_size * (split_index + 1)
118 |         end = min(end, len(files))
119 |         logger.info("Local validation split = {} - {}".format(start, end))
120 |         files = files[start: end]
121 |         ds = DataFromList(files, shuffle=False)
122 |     aug = imgaug.AugmentorList(augmentors)
123 | 
124 |     def mapf(dp):
125 |         fname, cls = dp
126 |         im = cv2.imread(fname, cv2.IMREAD_COLOR)
127 |         im = aug.augment(im)
128 |         return im, cls
129 |     ds = MultiThreadMapData(ds, parallel, mapf,
130 |                             buffer_size=min(2000, ds.size()), strict=True)
131 |     ds = BatchData(ds, batch_size, remainder=True)
132 |     # do not fork() under MPI
133 |     return ds
134 | 
135 | 
136 | def eval_classification(model, sessinit, dataflow):
137 |     """
138 |     Eval a classification model on the dataset. It assumes the model inputs are
139 |     named "input" and "label", and that the graph contains "wrong-top1" and "wrong-top5" tensors.
140 | """ 141 | pred_config = PredictConfig( 142 | model=model, 143 | session_init=sessinit, 144 | input_names=['input', 'label'], 145 | output_names=['wrong-top1', 'wrong-top5'] 146 | ) 147 | acc1, acc5 = RatioCounter(), RatioCounter() 148 | 149 | # This does not have a visible improvement over naive predictor, 150 | # but will have an improvement if image_dtype is set to float32. 151 | pred = FeedfreePredictor(pred_config, StagingInput(QueueInput(dataflow), device='/gpu:0')) 152 | for _ in tqdm.trange(dataflow.size()): 153 | top1, top5 = pred() 154 | batch_size = top1.shape[0] 155 | acc1.feed(top1.sum(), batch_size) 156 | acc5.feed(top5.sum(), batch_size) 157 | 158 | print("Top1 Error: {}".format(acc1.ratio)) 159 | print("Top5 Error: {}".format(acc5.ratio)) 160 | 161 | 162 | class ImageNetModel(ModelDesc): 163 | image_shape = 224 164 | 165 | """ 166 | uint8 instead of float32 is used as input type to reduce copy overhead. 167 | It might hurt the performance a liiiitle bit. 168 | The pretrained models were trained with float32. 169 | """ 170 | image_dtype = tf.uint8 171 | 172 | """ 173 | Either 'NCHW' or 'NHWC' 174 | """ 175 | data_format = 'NCHW' 176 | 177 | """ 178 | Whether the image is BGR or RGB. If using DataFlow, then it should be BGR. 179 | """ 180 | image_bgr = True 181 | 182 | weight_decay = 1e-4 183 | 184 | """ 185 | To apply on normalization parameters, use '.*/W|.*/gamma|.*/beta' 186 | 187 | Sec 5.1: We use a weight decay λ of 0.0001 and following [16] we do not apply 188 | weight decay on the learnable BN coefficients 189 | """ 190 | weight_decay_pattern = '.*/W' 191 | 192 | """ 193 | Scale the loss, for whatever reasons (e.g., gradient averaging, fp16 training, etc) 194 | """ 195 | loss_scale = 1. 196 | 197 | """ 198 | Label smoothing (See tf.losses.softmax_cross_entropy) 199 | """ 200 | label_smoothing = 0. 201 | 202 | """ 203 | Accumulate gradients across several steps (by default 1, which means no accumulation across steps). 204 | """ 205 | accum_grad = 1 206 | 207 | def inputs(self): 208 | return [tf.TensorSpec([None, self.image_shape, self.image_shape, 3], self.image_dtype, 'input'), 209 | tf.TensorSpec([None], tf.int32, 'label')] 210 | 211 | def build_graph(self, image, label): 212 | image = self.image_preprocess(image) 213 | assert self.data_format == 'NCHW' 214 | image = tf.transpose(image, [0, 3, 1, 2]) 215 | 216 | logits = self.get_logits(image) 217 | loss = ImageNetModel.compute_loss_and_error( 218 | logits, label, label_smoothing=self.label_smoothing) 219 | 220 | if self.weight_decay > 0: 221 | wd_loss = regularize_cost(self.weight_decay_pattern, 222 | tf.contrib.layers.l2_regularizer(self.weight_decay), 223 | name='l2_regularize_loss') 224 | add_moving_summary(loss, wd_loss) 225 | total_cost = tf.add_n([loss, wd_loss], name='cost') 226 | else: 227 | total_cost = tf.identity(loss, name='cost') 228 | add_moving_summary(total_cost) 229 | 230 | if self.loss_scale != 1.: 231 | logger.info("Scaling the total loss by {} ...".format(self.loss_scale)) 232 | return total_cost * self.loss_scale 233 | else: 234 | return total_cost 235 | 236 | @abstractmethod 237 | def get_logits(self, image): 238 | """ 239 | Args: 240 | image: 4D tensor of ``self.input_shape`` in ``self.data_format`` 241 | 242 | Returns: 243 | Nx#class logits 244 | """ 245 | 246 | def optimizer(self): 247 | """ 248 | Sec 5.1: We use Nesterov momentum with m of 0.9. 249 | 250 | Sec 3: momentum correction 251 | Tensorflow's momentum optimizer does not need correction. 
252 | """ 253 | lr = tf.get_variable('learning_rate', initializer=0.1, trainable=False) 254 | tf.summary.scalar('learning_rate-summary', lr) 255 | opt = tf.train.MomentumOptimizer(lr, 0.9, use_nesterov=True) 256 | if self.accum_grad != 1: 257 | opt = AccumGradOptimizer(opt, self.accum_grad) 258 | return opt 259 | 260 | def image_preprocess(self, image): 261 | with tf.name_scope('image_preprocess'): 262 | if image.dtype.base_dtype != tf.float32: 263 | image = tf.cast(image, tf.float32) 264 | 265 | """ 266 | Sec 5.1: 267 | The input image is normalized by the per-color mean and 268 | standard deviation, as in [12] 269 | """ 270 | mean = [0.485, 0.456, 0.406] # rgb 271 | std = [0.229, 0.224, 0.225] 272 | if self.image_bgr: 273 | mean = mean[::-1] 274 | std = std[::-1] 275 | image_mean = tf.constant(mean, dtype=tf.float32) * 255. 276 | image_std = tf.constant(std, dtype=tf.float32) * 255. 277 | image = (image - image_mean) / image_std 278 | return image 279 | 280 | @staticmethod 281 | def compute_loss_and_error(logits, label, label_smoothing=0.): 282 | if label_smoothing != 0.: 283 | nclass = logits.shape[-1] 284 | label = tf.one_hot(label, nclass) if label.shape.ndims == 1 else label 285 | 286 | if label.shape.ndims == 1: 287 | loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label) 288 | else: 289 | loss = tf.losses.softmax_cross_entropy( 290 | label, logits, label_smoothing=label_smoothing, 291 | reduction=tf.losses.Reduction.NONE) 292 | loss = tf.reduce_mean(loss, name='xentropy-loss') 293 | 294 | def prediction_incorrect(logits, label, topk=1, name='incorrect_vector'): 295 | with tf.name_scope('prediction_incorrect'): 296 | x = tf.logical_not(tf.nn.in_top_k(logits, label, topk)) 297 | return tf.cast(x, tf.float32, name=name) 298 | 299 | wrong = prediction_incorrect(logits, label, 1, name='wrong-top1') 300 | add_moving_summary(tf.reduce_mean(wrong, name='train-error-top1')) 301 | 302 | wrong = prediction_incorrect(logits, label, 5, name='wrong-top5') 303 | add_moving_summary(tf.reduce_mean(wrong, name='train-error-top5')) 304 | return loss 305 | -------------------------------------------------------------------------------- /ResNet-Horovod/resnet_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: resnet_model.py 4 | 5 | import tensorflow as tf 6 | from contextlib import contextmanager 7 | 8 | 9 | from tensorpack.tfutils.argscope import argscope 10 | from tensorpack.tfutils.varreplace import remap_variables 11 | from tensorpack.models import ( 12 | Conv2D, MaxPooling, GlobalAvgPooling, BatchNorm, FullyConnected, BNReLU, layer_register) 13 | 14 | 15 | @layer_register(log_shape=False, use_scope=False) 16 | def Norm(x, type, gamma_initializer=tf.constant_initializer(1.)): 17 | """ 18 | A norm layer (which depends on 'type') 19 | 20 | Args: 21 | type (str): one of "BN" or "GN" 22 | """ 23 | assert type in ["BN", "GN"] 24 | if type == "BN": 25 | return BatchNorm('bn', x, gamma_initializer=gamma_initializer) 26 | else: 27 | return GroupNorm('gn', x, gamma_initializer=gamma_initializer) 28 | 29 | 30 | @layer_register(log_shape=True) 31 | def GroupNorm(x, group=32, gamma_initializer=tf.constant_initializer(1.)): 32 | """ 33 | https://arxiv.org/abs/1803.08494 34 | """ 35 | shape = x.get_shape().as_list() 36 | ndims = len(shape) 37 | assert ndims == 4, shape 38 | chan = shape[1] 39 | assert chan % group == 0, chan 40 | group_size = chan // group 41 | 42 | orig_shape = 
tf.shape(x) 43 | h, w = orig_shape[2], orig_shape[3] 44 | 45 | x = tf.reshape(x, tf.stack([-1, group, group_size, h, w])) 46 | 47 | mean, var = tf.nn.moments(x, [2, 3, 4], keep_dims=True) 48 | 49 | new_shape = [1, group, group_size, 1, 1] 50 | 51 | beta = tf.get_variable('beta', [chan], initializer=tf.constant_initializer()) 52 | beta = tf.reshape(beta, new_shape) 53 | 54 | gamma = tf.get_variable('gamma', [chan], initializer=gamma_initializer) 55 | gamma = tf.reshape(gamma, new_shape) 56 | 57 | out = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-5, name='output') 58 | return tf.reshape(out, orig_shape, name='output') 59 | 60 | 61 | def resnet_shortcut(l, n_out, stride, activation=tf.identity): 62 | n_in = l.get_shape().as_list()[1] 63 | if n_in != n_out: # change dimension when channel is not the same 64 | return Conv2D('convshortcut', l, n_out, 1, strides=stride, activation=activation) 65 | else: 66 | return l 67 | 68 | 69 | def resnet_bottleneck(l, ch_out, stride, stride_first=False): 70 | shortcut = l 71 | norm_relu = lambda x: tf.nn.relu(Norm(x)) 72 | l = Conv2D('conv1', l, ch_out, 1, strides=stride if stride_first else 1, activation=norm_relu) 73 | """ 74 | Sec 5.1: 75 | We use the ResNet-50 [16] variant from [12], noting that 76 | the stride-2 convolutions are on 3×3 layers instead of on 1×1 layers 77 | """ 78 | l = Conv2D('conv2', l, ch_out, 3, strides=1 if stride_first else stride, activation=norm_relu) 79 | """ 80 | Section 5.1: 81 | For BN layers, the learnable scaling coefficient γ is initialized 82 | to be 1, except for each residual block's last BN 83 | where γ is initialized to be 0. 84 | """ 85 | l = Conv2D('conv3', l, ch_out * 4, 1, activation=lambda x: Norm(x, gamma_initializer=tf.zeros_initializer())) 86 | ret = l + resnet_shortcut(shortcut, ch_out * 4, stride, activation=lambda x: Norm(x)) 87 | return tf.nn.relu(ret, name='block_output') 88 | 89 | 90 | def resnet_group(name, l, block_func, features, count, stride): 91 | with tf.variable_scope(name): 92 | for i in range(0, count): 93 | with tf.variable_scope('block{}'.format(i)): 94 | l = block_func(l, features, stride if i == 0 else 1) 95 | return l 96 | 97 | 98 | @contextmanager 99 | def weight_standardization_context(enable=True): 100 | """ 101 | Implement Centered Weight Normalization 102 | (http://openaccess.thecvf.com/content_ICCV_2017/papers/Huang_Centered_Weight_Normalization_ICCV_2017_paper.pdf) 103 | or Weight Standardization (https://arxiv.org/abs/1903.10520) 104 | 105 | Usage: 106 | 107 | with weight_standardization_context(): 108 | l = Conv2D('conv', l) 109 | ... 110 | """ 111 | if enable: 112 | def weight_standardization(v): 113 | if (not v.name.endswith('/W:0')) or v.shape.ndims != 4: 114 | return v 115 | mean, var = tf.nn.moments(v, [0, 1, 2], keep_dims=True) 116 | v = (v - mean) / (tf.sqrt(var) + 1e-5) 117 | return v 118 | 119 | with remap_variables(weight_standardization): 120 | yield 121 | 122 | else: 123 | yield 124 | 125 | 126 | def resnet_backbone(image, num_blocks, group_func, block_func): 127 | """ 128 | Sec 5.1: We adopt the initialization of [15] for all convolutional layers. 129 | TensorFlow does not have the true "MSRA init". We use variance_scaling as an approximation. 
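(variance_scaling_initializer with scale=2.0 and mode='fan_out' draws from a truncated normal with stddev = sqrt(2 / fan_out), i.e. the fan-out variant of He initialization.)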
130 | """ 131 | with argscope(Conv2D, use_bias=False, 132 | kernel_initializer=tf.variance_scaling_initializer(scale=2.0, mode='fan_out')): 133 | l = Conv2D('conv0', image, 64, 7, strides=2, activation=BNReLU) 134 | l = MaxPooling('pool0', l, pool_size=3, strides=2, padding='SAME') 135 | l = group_func('group0', l, block_func, 64, num_blocks[0], 1) 136 | l = group_func('group1', l, block_func, 128, num_blocks[1], 2) 137 | l = group_func('group2', l, block_func, 256, num_blocks[2], 2) 138 | l = group_func('group3', l, block_func, 512, num_blocks[3], 2) 139 | l = GlobalAvgPooling('gap', l) 140 | logits = FullyConnected('linear', l, 1000, 141 | kernel_initializer=tf.random_normal_initializer(stddev=0.01)) 142 | """ 143 | Sec 5.1: 144 | The 1000-way fully-connected layer is initialized by 145 | drawing weights from a zero-mean Gaussian with standard 146 | deviation of 0.01. 147 | """ 148 | return logits 149 | -------------------------------------------------------------------------------- /ResNet-Horovod/serve-data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: serve-data.py 4 | 5 | import argparse 6 | import os 7 | import socket 8 | 9 | from tensorpack.dataflow import ( 10 | send_dataflow_zmq, MapData, TestDataSpeed, FakeData, dataset) 11 | from tensorpack.utils import logger 12 | from imagenet_utils import get_train_dataflow 13 | 14 | from zmq_ops import dump_arrays 15 | 16 | 17 | if __name__ == '__main__': 18 | parser = argparse.ArgumentParser() 19 | parser.add_argument('--data', help='ILSVRC dataset dir') 20 | parser.add_argument('--fake', action='store_true') 21 | parser.add_argument('--batch', help='per-GPU batch size', 22 | default=32, type=int) 23 | parser.add_argument('--benchmark', action='store_true') 24 | parser.add_argument('--no-zmq-ops', action='store_true') 25 | args = parser.parse_args() 26 | 27 | os.environ['CUDA_VISIBLE_DEVICES'] = '' 28 | 29 | if args.fake: 30 | ds = FakeData( 31 | [[args.batch, 224, 224, 3], [args.batch]], 32 | 1000, random=False, dtype=['uint8', 'int32']) 33 | else: 34 | ds = get_train_dataflow(args.data, args.batch) 35 | 36 | logger.info("Serving data on {}".format(socket.gethostname())) 37 | 38 | if args.benchmark: 39 | ds = MapData(ds, dump_arrays) 40 | TestDataSpeed(ds, warmup=300).start() 41 | else: 42 | format = None if args.no_zmq_ops else 'zmq_ops' 43 | send_dataflow_zmq( 44 | ds, 'ipc://@imagenet-train-b{}'.format(args.batch), 45 | hwm=150, format=format, bind=True) 46 | -------------------------------------------------------------------------------- /ResNet-Horovod/slurm.script: -------------------------------------------------------------------------------- 1 | #!/bin/bash -e 2 | #SBATCH --output=logs/job-%j.%N.out 3 | #SBATCH --error=logs/job-%j.%N.err 4 | #SBATCH --ntasks-per-node=8 # 8 tasks per node 5 | #SBATCH --gres=gpu:8 # 8 GPUs per node 6 | #SBATCH --cpus-per-task=10 # 80/8 cpus per task 7 | #SBATCH --mem=200G # ask for 200G 8 | 9 | # To run on 4 nodes x 8 GPUs: use "mkdir -p logs && sbatch --nodes=4 slurm.script" 10 | 11 | echo "NNODES: $SLURM_NNODES" 12 | echo "JOBID: $SLURM_JOB_ID" 13 | env | grep PATH 14 | 15 | export TENSORPACK_PROGRESS_REFRESH=20 16 | export TENSORPACK_SERIALIZE=msgpack 17 | DATA_PATH=~/data/imagenet 18 | BATCH=64 19 | 20 | # launch data 21 | srun --output=logs/data-%J.%N.log \ 22 | --error=logs/data-%J.%N.err \ 23 | --gres=gpu:0 --cpus-per-task=60 --mincpus 60 \ 24 | --ntasks=$SLURM_NNODES --ntasks-per-node=1 
\ 25 | python ./serve-data.py --data $DATA_PATH --batch $BATCH & 26 | DATA_PID=$! 27 | 28 | # launch training 29 | # https://www.open-mpi.org/faq/?category=openfabrics#ib-router has documentation on IB options 30 | # the queue parameters can sometimes hang the communication (for some MPI versions and some operations) 31 | mpirun -output-filename logs/train-$SLURM_JOB_ID.log -tag-output \ 32 | -bind-to none -map-by slot \ 33 | -mca pml ob1 -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,65536,32 \ 34 | -x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO \ 35 | python ./imagenet-resnet-horovod.py \ 36 | --data $DATA_PATH --batch $BATCH --validation distributed $@ & 37 | MPI_PID=$! 38 | 39 | wait $MPI_PID 40 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Benchmark Multi-GPU training against tensorflow/benchmarks 3 | 4 | Tensorpack Multi-GPU trainers are implemented following the awesome examples in 5 | [tensorflow/benchmarks](https://github.com/tensorflow/benchmarks). 6 | Their performance should be the same. 7 | 8 | Here we measure performance by the number of images the trainer can process per second when training a ResNet-50 on ImageNet-size images. 9 | 10 | This script is tested on fake data to focus on the performance of different parallel strategies. 11 | To train on real data with reasonable experiment settings, see 12 | [Tensorpack ResNet example](https://github.com/tensorpack/tensorpack/tree/master/examples/ResNet) or [ResNet-Horovod benchmark](../ResNet-Horovod). 13 | 14 | ## Scripts: 15 | 16 | The following command in tensorflow/benchmarks: 17 | ``` 18 | python tf_cnn_benchmarks.py --num_gpus=8 --batch_size=64 --model=resnet50 --variable_update=replicated/parameter_server --local_parameter_device=cpu 19 | ``` 20 | 21 | is roughly the same as this command in tensorpack: 22 | ``` 23 | python resnet-multigpu.py --fake-location gpu --variable-update=replicated/parameter_server 24 | ``` 25 | 26 | There are tiny differences in the way data is synthesized (the `--fake-location` option): 27 | * gpu: synthesized on GPU as a constant. The TF benchmark uses something closest to this. 28 | * cpu: synthesized on CPU inside the TF runtime. 29 | * python-queue: synthesized on CPU inside Python, transferred to TF with a queue, prefetched on GPU. 30 | This is the most common experiment setting in tensorpack, because it's easy to write 31 | (the data loader is plain Python) and still fast. 32 | * python-dataset: also uses Python to write the data loader, but transfers to TF with `tf.data + prefetch` instead. 33 | * zmq-consume: consume data from a ZMQ pipe, using [my zmq ops](https://github.com/tensorpack/zmq_ops), also with GPU prefetch. 34 | The data producer can therefore be written in any language. 35 | 36 | Data processing inside TF is often a [bad idea in practice](https://tensorpack.readthedocs.io/tutorial/philosophy/dataflow.html#alternative-data-loading-solutions). 37 | When data comes from outside TF, my experiments show 38 | that `zmq-consume` is the fastest input pipeline compared to the others here. 39 | It's also faster than `tensorflow/benchmarks` (tested on Jan 6 2018 with TF1.5.0rc0) when training on real data. 40 | 41 | ## Performance @ Sep 2017: 42 | 43 | Environment: 44 | * Software: TF v1.3.0-rc1-1302-g593dc8e; tensorpack 0.5. 45 | * Hardware: 8 P100s. 
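For example, the 8-GPU `replicated` row in the tables below corresponds to a command of the form (assuming all 8 GPUs are visible):
```
python resnet-multigpu.py --fake-location gpu --variable-update=replicated
```
The three numbers in each tensorpack column correspond to `--fake-location` gpu / cpu / python-queue, respectively.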
46 | 47 | Note that the latest source code uses new features in tensorpack and therefore won't work with tensorpack 0.5. 48 | Check out an old version if you intend to reproduce these numbers. 49 | 50 | Experiments were not run multiple times, so expect some small variance in the results. 51 | 52 | `variable-update=replicated`: 53 | 54 | | #GPU | tensorpack(GPU/CPU/Python) | tensorflow/benchmarks | 55 | | --------- | ---------------------- | -------------------- | 56 | | 1 | 228/228/219 | 225.73 | 57 | | 2 | 427/423/415 | 424.54 | 58 | | 4 | 802/785/787 | 789.51 | 59 | | 8 | 1612/1579/1551 | 1580.58 | 60 | 61 | `variable-update=parameter_server`: 62 | 63 | | #GPU | tensorpack(GPU/CPU/Python) | tensorflow/benchmarks | 64 | | --------- | ------------------- | -------------------- | 65 | | 1 | 227/227/223 | 221.68 | 66 | | 2 | 428/418/403 | 421.01 | 67 | | 4 | 817/802/787 | 828.29 | 68 | | 8 | 1651/1556/1574 | 1604.55 | 69 | 70 | ## Performance @ May 2019: 71 | 72 | Environment: 73 | 74 | * Software: TF 1.13.1, cuda 10, cudnn 7.4.2, tensorpack 0.9.5. 75 | * Hardware: AWS p3.16xlarge (8 V100s) 76 | 77 | Results: 78 | 79 | * `--fake-location=gpu --variable-update=horovod`: 2874 img/s. 80 | * `--fake-location=gpu --variable-update=replicated`: 2844 img/s. 81 | * `--fake-location=gpu --variable-update=replicated --use-fp16`: 5224 img/s. 82 | * `--fake-location=gpu --variable-update=replicated --use-fp16 --batch 128`: 5891 img/s 83 | * `--fake-location=gpu --variable-update=replicated --use-fp16 --batch 128 --use-xla-compile`: 9225 img/s 84 | 85 | ## Performance @ Jul 2020: 86 | 87 | Environment: 88 | 89 | * Software: TF 2.2 in v1-compatible mode, cuda 10.1, cudnn 7.6.5, tensorpack 0.10.1. 90 | * Hardware: AWS p3.16xlarge (8 V100s, 16G memory each) 91 | 92 | Results: 93 | 94 | * `--fake-location=gpu --variable-update=horovod`: 2973 img/s. 95 | * `--fake-location=gpu --variable-update=replicated`: 2961 img/s. 96 | * `--fake-location=gpu --variable-update=replicated --batch 128`: 3137 img/s. 
97 | * `--fake-location=gpu --variable-update=replicated --use-fp16 --batch 128`: 6141 img/s 98 | * `--fake-location=gpu --variable-update=replicated --use-fp16 --batch 128 --use-xla-compile`: 9341 img/s 99 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/resnet-multigpu.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: resnet-multigpu.py 4 | 5 | import sys 6 | import argparse 7 | import os 8 | from contextlib import contextmanager, ExitStack 9 | import tensorflow as tf 10 | 11 | from tensorpack import * 12 | from tensorpack.utils.gpu import get_nr_gpu 13 | from tensorpack.tfutils.summary import * 14 | from tensorpack.utils.argtools import log_once 15 | 16 | from tensorpack.tfutils.collection import freeze_collection 17 | from tensorpack.tfutils import get_current_tower_context 18 | from tensorpack.tfutils.varreplace import custom_getter_scope 19 | 20 | from tfbench.convnet_builder import ConvNetBuilder 21 | from tfbench import model_config 22 | 23 | INPUT_SHAPE = 224 24 | IMAGE_DTYPE = tf.uint8 25 | IMAGE_DTYPE_NUMPY = 'uint8' 26 | 27 | 28 | class Model(ModelDesc): 29 | def __init__(self, data_format='NCHW'): 30 | self.data_format = data_format 31 | 32 | def inputs(self): 33 | return [tf.TensorSpec([args.batch, INPUT_SHAPE, INPUT_SHAPE, 3], IMAGE_DTYPE, 'input'), 34 | tf.TensorSpec([args.batch], tf.int32, 'label')] 35 | 36 | def optimizer(self): 37 | lr = tf.get_variable('learning_rate', initializer=0.1, trainable=False) 38 | return tf.train.GradientDescentOptimizer(lr) 39 | 40 | def build_graph(self, image, label): 41 | # all-zero tensor hurt performance for some reason. 42 | ctx = get_current_tower_context() 43 | label = tf.random_uniform( 44 | [args.batch], 45 | minval=0, maxval=1000 - 1, 46 | dtype=tf.int32, name='synthetic_labels') 47 | 48 | # our fake images are in [0, 1] 49 | target_dtype = tf.float16 if args.use_fp16 else tf.float32 50 | if image.dtype != target_dtype: 51 | image = tf.cast(image, target_dtype) * 2.0 - 1. 
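# inputs come as NHWC; transpose to NCHW below, which is faster for cuDNN convolutions on GPU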
52 | if self.data_format == 'NCHW': 53 | image = tf.transpose(image, [0, 3, 1, 2]) 54 | 55 | logits = self._get_logits(image) 56 | 57 | if logits.dtype != tf.float32: 58 | logger.info("Casting logits back to fp32 ...") 59 | logits = tf.cast(logits, tf.float32) 60 | 61 | loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label) 62 | loss = tf.reduce_mean(loss, name='xentropy-loss') 63 | if ctx.index == ctx.total - 1: 64 | wd_cost = regularize_cost('.*', tf.nn.l2_loss) * (1e-4 * ctx.total) 65 | return tf.add_n([loss, wd_cost], name='cost') 66 | else: 67 | return loss 68 | 69 | 70 | @contextmanager 71 | def maybe_freeze_updates(enable): 72 | if enable: 73 | with freeze_collection([tf.GraphKeys.UPDATE_OPS]): 74 | yield 75 | else: 76 | yield 77 | 78 | 79 | class TFBenchModel(Model): 80 | def _get_logits(self, image): 81 | ctx = get_current_tower_context() 82 | with maybe_freeze_updates(ctx.index > 0): 83 | network = ConvNetBuilder( 84 | image, 3, True, 85 | use_tf_layers=True, 86 | data_format=self.data_format, 87 | dtype=tf.float16 if args.use_fp16 else tf.float32, 88 | variable_dtype=tf.float32) 89 | with custom_getter_scope(network.get_custom_getter()): 90 | dataset = lambda: 1 91 | dataset.name = 'imagenet' 92 | model_conf = model_config.get_model_config('resnet50', dataset) 93 | model_conf.set_batch_size(args.batch) 94 | model_conf.add_inference(network) 95 | return network.affine(1000, activation='linear', stddev=0.001) 96 | 97 | 98 | class TensorpackModel(Model): 99 | """ 100 | Implement the same model with tensorpack layers. 101 | """ 102 | def _get_logits(self, image): 103 | 104 | def fp16_getter(getter, *args, **kwargs): 105 | name = args[0] if len(args) else kwargs['name'] 106 | if not name.endswith('/W') and not name.endswith('/b'): 107 | """ 108 | Following convention, convolution & fc are quantized. 109 | BatchNorm (gamma & beta) are not quantized. 
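Conv/FC weights are instead kept as fp32 master copies and cast to fp16 in the branch below.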
110 | """ 111 | return getter(*args, **kwargs) 112 | else: 113 | if kwargs['dtype'] == tf.float16: 114 | kwargs['dtype'] = tf.float32 115 | ret = getter(*args, **kwargs) 116 | ret = tf.cast(ret, tf.float16) 117 | log_once("Variable {} casted to fp16 ...".format(name)) 118 | return ret 119 | else: 120 | return getter(*args, **kwargs) 121 | 122 | def shortcut(l, n_in, n_out, stride): 123 | if n_in != n_out: 124 | l = Conv2D('convshortcut', l, n_out, 1, strides=stride) 125 | l = BatchNorm('bnshortcut', l) 126 | return l 127 | else: 128 | return l 129 | 130 | def bottleneck(l, ch_out, stride, preact): 131 | ch_in = l.get_shape().as_list()[1] 132 | input = l 133 | l = Conv2D('conv1', l, ch_out, 1, strides=stride, activation=BNReLU) 134 | l = Conv2D('conv2', l, ch_out, 3, strides=1, activation=BNReLU) 135 | l = Conv2D('conv3', l, ch_out * 4, 1, activation=tf.identity) 136 | l = BatchNorm('bn', l) 137 | ret = l + shortcut(input, ch_in, ch_out * 4, stride) 138 | return tf.nn.relu(ret) 139 | 140 | def layer(l, layername, block_func, features, count, stride, first=False): 141 | with tf.variable_scope(layername): 142 | with tf.variable_scope('block0'): 143 | l = block_func(l, features, stride, 144 | 'no_preact' if first else 'both_preact') 145 | for i in range(1, count): 146 | with tf.variable_scope('block{}'.format(i)): 147 | l = block_func(l, features, 1, 'default') 148 | return l 149 | 150 | defs = [3, 4, 6, 3] 151 | 152 | with ExitStack() as stack: 153 | stack.enter_context(argscope( 154 | Conv2D, use_bias=False, 155 | kernel_initializer=tf.variance_scaling_initializer(mode='fan_out'))) 156 | stack.enter_context(argscope( 157 | [Conv2D, MaxPooling, GlobalAvgPooling, BatchNorm], 158 | data_format=self.data_format)) 159 | if args.use_fp16: 160 | stack.enter_context(custom_getter_scope(fp16_getter)) 161 | image = tf.cast(image, tf.float16) 162 | logits = (LinearWrap(image) 163 | .Conv2D('conv0', 64, 7, strides=2) 164 | .BatchNorm('bn0') 165 | .tf.nn.relu() 166 | .MaxPooling('pool0', 3, strides=2, padding='SAME') 167 | .apply(layer, 'group0', bottleneck, 64, defs[0], 1, first=True) 168 | .apply(layer, 'group1', bottleneck, 128, defs[1], 2) 169 | .apply(layer, 'group2', bottleneck, 256, defs[2], 2) 170 | .apply(layer, 'group3', bottleneck, 512, defs[3], 2) 171 | .GlobalAvgPooling('gap') 172 | .FullyConnected('linear', 1000)()) 173 | if args.use_fp16: 174 | logits = tf.cast(logits, tf.float32) 175 | return logits 176 | 177 | 178 | def get_data(mode): 179 | # get input 180 | input_shape = [args.batch, 224, 224, 3] 181 | label_shape = [args.batch] 182 | dataflow = FakeData( 183 | [input_shape, label_shape], 1000, 184 | random=False, dtype=[IMAGE_DTYPE_NUMPY, 'int32']) 185 | if mode == 'gpu': 186 | return DummyConstantInput([input_shape, label_shape]) 187 | elif mode == 'cpu': 188 | def fn(): 189 | # these copied from tensorflow/benchmarks 190 | with tf.device('/cpu:0'): 191 | if IMAGE_DTYPE == tf.float32: 192 | images = tf.truncated_normal( 193 | input_shape, dtype=IMAGE_DTYPE, stddev=0.1, name='synthetic_images') 194 | else: 195 | images = tf.random_uniform( 196 | input_shape, minval=0, maxval=255, dtype=tf.int32, name='synthetic_images') 197 | images = tf.cast(images, IMAGE_DTYPE) 198 | labels = tf.random_uniform( 199 | label_shape, minval=1, maxval=1000, dtype=tf.int32, name='synthetic_labels') 200 | # images = tf.contrib.framework.local_variable(images, name='images') 201 | return [images, labels] 202 | ret = TensorInput(fn) 203 | return StagingInput(ret, nr_stage=1) 204 | elif mode == 'python' or 
mode == 'python-queue': 205 | ret = QueueInput( 206 | dataflow, 207 | queue=tf.FIFOQueue(args.prefetch, [IMAGE_DTYPE, tf.int32])) 208 | return StagingInput(ret, nr_stage=1) 209 | elif mode == 'python-dataset': 210 | ds = TFDatasetInput.dataflow_to_dataset(dataflow, [IMAGE_DTYPE, tf.int32]) 211 | ds = ds.repeat().prefetch(args.prefetch) 212 | ret = TFDatasetInput(ds) 213 | return StagingInput(ret, nr_stage=1) 214 | elif mode == 'zmq-serve': 215 | send_dataflow_zmq(dataflow, 'ipc://testpipe', hwm=args.prefetch, format='zmq_op') 216 | sys.exit() 217 | elif mode == 'zmq-consume': 218 | ret = ZMQInput( 219 | 'ipc://testpipe', hwm=args.prefetch) 220 | return StagingInput(ret, nr_stage=1) 221 | 222 | 223 | if __name__ == '__main__': 224 | parser = argparse.ArgumentParser() 225 | parser.add_argument('--gpu', help='comma separated list of GPU(s) to use.') 226 | parser.add_argument('--model', choices=['tfbench', 'tensorpack'], default='tensorpack') 227 | parser.add_argument('--load', help='load model') 228 | parser.add_argument('--prefetch', type=int, default=150) 229 | parser.add_argument('--use-fp16', action='store_true') 230 | parser.add_argument('--use-xla-compile', action='store_true') 231 | parser.add_argument('--batch', type=int, default=64, help='per GPU batch size') 232 | parser.add_argument('--data_format', help='specify NCHW or NHWC', 233 | type=str, default='NCHW') 234 | parser.add_argument('--fake-location', help='the place to create fake data', 235 | type=str, default='gpu', 236 | choices=[ 237 | 'cpu', 'gpu', 238 | 'python', 'python-queue', 'python-dataset', 239 | 'zmq-serve', 'zmq-consume']) 240 | parser.add_argument('--variable-update', help='variable update strategy', 241 | type=str, 242 | choices=['replicated', 'parameter_server', 'horovod', 'byteps'], 243 | required=True) 244 | 245 | parser.add_argument('--ps-hosts') 246 | parser.add_argument('--worker-hosts') 247 | parser.add_argument('--job') 248 | args = parser.parse_args() 249 | 250 | if args.gpu: 251 | os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu 252 | 253 | sessconf = get_default_sess_config() 254 | sessconf.inter_op_parallelism_threads = 80 - 16 255 | if args.job: 256 | # distributed: 257 | cluster_spec = tf.train.ClusterSpec({ 258 | 'ps': args.ps_hosts.split(','), 259 | 'worker': args.worker_hosts.split(',') 260 | }) 261 | job = args.job.split(':')[0] 262 | if job == 'ps': 263 | os.environ['CUDA_VISIBLE_DEVICES'] = '' 264 | task_index = int(args.job.split(':')[1]) 265 | server = tf.train.Server( 266 | cluster_spec, job_name=job, task_index=task_index, 267 | config=sessconf) 268 | 269 | NUM_GPU = get_nr_gpu() 270 | logger.info("Number of GPUs found: " + str(NUM_GPU)) 271 | 272 | if args.job: 273 | trainer = { 274 | 'replicated': lambda: DistributedTrainerReplicated(NUM_GPU, server), 275 | 'parameter_server': lambda: DistributedTrainerParameterServer(NUM_GPU, server), 276 | }[args.variable_update]() 277 | else: 278 | if NUM_GPU == 1: 279 | trainer = SimpleTrainer() 280 | else: 281 | trainer = { 282 | 'replicated': lambda: SyncMultiGPUTrainerReplicated( 283 | NUM_GPU, average=False, mode='hierarchical' if NUM_GPU >= 8 else 'cpu'), 284 | # average=False is the actual configuration used by tfbench 285 | 'horovod': lambda: HorovodTrainer(), 286 | 'byteps': lambda: BytePSTrainer(), 287 | 'parameter_server': lambda: SyncMultiGPUTrainerParameterServer(NUM_GPU, ps_device='cpu') 288 | }[args.variable_update]() 289 | if args.variable_update == 'replicated': 290 | trainer.BROADCAST_EVERY_EPOCH = False 291 | 292 | M = TFBenchModel if 
args.model == 'tfbench' else TensorpackModel 293 | callbacks = [ 294 | GPUMemoryTracker(), 295 | # ModelSaver(checkpoint_dir='./tmpmodel'), # it takes time 296 | ] 297 | if args.variable_update != 'horovod': 298 | callbacks.append(GPUUtilizationTracker()) 299 | if args.variable_update in ['horovod', 'byteps']: 300 | # disable logging if not on first rank 301 | if trainer.hvd.rank() != 0: 302 | logger.addFilter(lambda _: False) 303 | NUM_GPU = trainer.hvd.size() 304 | logger.info("Number of GPUs to use: " + str(NUM_GPU)) 305 | 306 | try: 307 | from tensorpack.callbacks import ThroughputTracker 308 | except ImportError: 309 | # only available in latest tensorpack 310 | pass 311 | else: 312 | callbacks.append(ThroughputTracker(samples_per_step=args.batch * NUM_GPU)) 313 | 314 | config = TrainConfig( 315 | data=get_data(args.fake_location), 316 | model=M(data_format=args.data_format), 317 | callbacks=callbacks, 318 | extra_callbacks=[ 319 | # MovingAverageSummary(), # tensorflow/benchmarks does not do this 320 | ProgressBar(), # nor this 321 | # MergeAllSummaries(), 322 | RunUpdateOps() 323 | ], 324 | session_config=sessconf if not args.job else None, 325 | steps_per_epoch=50, 326 | max_epoch=10, 327 | ) 328 | 329 | # consistent with tensorflow/benchmarks 330 | trainer.COLOCATE_GRADIENTS_WITH_OPS = False 331 | trainer.XLA_COMPILE = args.use_xla_compile 332 | launch_train_with_config(config, trainer) 333 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/tfbench/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorpack/benchmarks/7c97174f9a00e440d60c38b54d4c45f58271fc3e/ResNet-MultiGPU/tfbench/__init__.py -------------------------------------------------------------------------------- /ResNet-MultiGPU/tfbench/convnet_builder.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # ============================================================================== 15 | """CNN builder.""" 16 | 17 | from __future__ import print_function 18 | 19 | from collections import defaultdict 20 | import contextlib 21 | 22 | import numpy as np 23 | 24 | import tensorflow as tf 25 | 26 | from tensorflow.python.layers import convolutional as conv_layers 27 | from tensorflow.python.layers import core as core_layers 28 | from tensorflow.python.layers import pooling as pooling_layers 29 | from tensorflow.python.training import moving_averages 30 | 31 | 32 | class ConvNetBuilder(object): 33 | """Builder of cnn net.""" 34 | 35 | def __init__(self, 36 | input_op, 37 | input_nchan, 38 | phase_train, 39 | use_tf_layers, 40 | data_format='NCHW', 41 | dtype=tf.float32, 42 | variable_dtype=tf.float32): 43 | self.top_layer = input_op 44 | self.top_size = input_nchan 45 | self.phase_train = phase_train 46 | self.use_tf_layers = use_tf_layers 47 | self.data_format = data_format 48 | self.dtype = dtype 49 | self.variable_dtype = variable_dtype 50 | self.counts = defaultdict(lambda: 0) 51 | self.use_batch_norm = False 52 | self.batch_norm_config = {} # 'decay': 0.997, 'scale': True} 53 | self.channel_pos = ('channels_last' 54 | if data_format == 'NHWC' else 'channels_first') 55 | self.aux_top_layer = None 56 | self.aux_top_size = 0 57 | 58 | def get_custom_getter(self): 59 | """Returns a custom getter that this class's methods must be called under. 60 | 61 | All methods of this class must be called under a variable scope that was 62 | passed this custom getter. Example: 63 | 64 | ```python 65 | network = ConvNetBuilder(...) 66 | with tf.variable_scope('cg', custom_getter=network.get_custom_getter()): 67 | network.conv(...) 68 | # Call more methods of network here 69 | ``` 70 | 71 | Currently, this custom getter only does anything if self.use_tf_layers is 72 | True. In that case, it causes variables to be stored as dtype 73 | self.variable_dtype, then cast to the requested dtype, instead of directly 74 | storing the variable as the requested dtype. 75 | """ 76 | def inner_custom_getter(getter, *args, **kwargs): 77 | """Custom getter that forces variables to have type self.variable_dtype.""" 78 | if not self.use_tf_layers: 79 | return getter(*args, **kwargs) 80 | requested_dtype = kwargs['dtype'] 81 | if not (requested_dtype == tf.float32 and 82 | self.variable_dtype == tf.float16): 83 | # Only change the variable dtype if doing so does not decrease variable 84 | # precision. 85 | kwargs['dtype'] = self.variable_dtype 86 | var = getter(*args, **kwargs) 87 | # This if statement is needed to guard the cast, because batch norm 88 | # assigns directly to the return value of this custom getter. The cast 89 | # makes the return value not a variable so it cannot be assigned. Batch 90 | # norm variables are always in fp32 so this if statement is never 91 | # triggered for them. 
92 | if var.dtype.base_dtype != requested_dtype: 93 | var = tf.cast(var, requested_dtype) 94 | return var 95 | return inner_custom_getter 96 | 97 | @contextlib.contextmanager 98 | def switch_to_aux_top_layer(self): 99 | """Context that construct cnn in the auxiliary arm.""" 100 | if self.aux_top_layer is None: 101 | raise RuntimeError('Empty auxiliary top layer in the network.') 102 | saved_top_layer = self.top_layer 103 | saved_top_size = self.top_size 104 | self.top_layer = self.aux_top_layer 105 | self.top_size = self.aux_top_size 106 | yield 107 | self.aux_top_layer = self.top_layer 108 | self.aux_top_size = self.top_size 109 | self.top_layer = saved_top_layer 110 | self.top_size = saved_top_size 111 | 112 | def get_variable(self, name, shape, dtype, cast_dtype, *args, **kwargs): 113 | # TODO(reedwm): Currently variables and gradients are transferred to other 114 | # devices and machines as type `dtype`, not `cast_dtype`. In particular, 115 | # this means in fp16 mode, variables are transferred as fp32 values, not 116 | # fp16 values, which uses extra bandwidth. 117 | var = tf.get_variable(name, shape, dtype, *args, **kwargs) 118 | return tf.cast(var, cast_dtype) 119 | 120 | def _conv2d_impl(self, input_layer, num_channels_in, filters, kernel_size, 121 | strides, padding, kernel_initializer): 122 | if self.use_tf_layers: 123 | return conv_layers.conv2d(input_layer, filters, kernel_size, strides, 124 | padding, self.channel_pos, 125 | kernel_initializer=kernel_initializer, 126 | use_bias=False) 127 | else: 128 | weights_shape = [kernel_size[0], kernel_size[1], num_channels_in, filters] 129 | # We use the name 'conv2d/kernel' so the variable has the same name as its 130 | # tf.layers equivalent. This way, if a checkpoint is written when 131 | # self.use_tf_layers == True, it can be loaded when 132 | # self.use_tf_layers == False, and vice versa. 
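# self.get_variable stores the kernel as self.variable_dtype (the fp32 master copy) and returns it cast to the compute dtype self.dtype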
133 | weights = self.get_variable('conv2d/kernel', weights_shape, 134 | self.variable_dtype, self.dtype, 135 | initializer=kernel_initializer) 136 | if self.data_format == 'NHWC': 137 | strides = [1] + strides + [1] 138 | else: 139 | strides = [1, 1] + strides 140 | return tf.nn.conv2d(input_layer, weights, strides, padding, 141 | data_format=self.data_format) 142 | 143 | def conv(self, 144 | num_out_channels, 145 | k_height, 146 | k_width, 147 | d_height=1, 148 | d_width=1, 149 | mode='SAME', 150 | input_layer=None, 151 | num_channels_in=None, 152 | use_batch_norm=None, 153 | stddev=None, 154 | activation='relu', 155 | bias=0.0, 156 | kernel_initializer=None): 157 | """Construct a conv2d layer on top of cnn.""" 158 | if input_layer is None: 159 | input_layer = self.top_layer 160 | if num_channels_in is None: 161 | num_channels_in = self.top_size 162 | if stddev is not None and kernel_initializer is None: 163 | kernel_initializer = tf.truncated_normal_initializer(stddev=stddev) 164 | name = 'conv' + str(self.counts['conv']) 165 | self.counts['conv'] += 1 166 | with tf.variable_scope(name): 167 | strides = [1, d_height, d_width, 1] 168 | if self.data_format == 'NCHW': 169 | strides = [strides[0], strides[3], strides[1], strides[2]] 170 | if mode != 'SAME_RESNET': 171 | conv = self._conv2d_impl(input_layer, num_channels_in, num_out_channels, 172 | kernel_size=[k_height, k_width], 173 | strides=[d_height, d_width], padding=mode, 174 | kernel_initializer=kernel_initializer) 175 | else: # Special padding mode for ResNet models 176 | if d_height == 1 and d_width == 1: 177 | conv = self._conv2d_impl(input_layer, num_channels_in, 178 | num_out_channels, 179 | kernel_size=[k_height, k_width], 180 | strides=[d_height, d_width], padding='SAME', 181 | kernel_initializer=kernel_initializer) 182 | else: 183 | rate = 1 # Unused (for 'a trous' convolutions) 184 | kernel_height_effective = k_height + (k_height - 1) * (rate - 1) 185 | pad_h_beg = (kernel_height_effective - 1) // 2 186 | pad_h_end = kernel_height_effective - 1 - pad_h_beg 187 | kernel_width_effective = k_width + (k_width - 1) * (rate - 1) 188 | pad_w_beg = (kernel_width_effective - 1) // 2 189 | pad_w_end = kernel_width_effective - 1 - pad_w_beg 190 | padding = [[0, 0], [pad_h_beg, pad_h_end], 191 | [pad_w_beg, pad_w_end], [0, 0]] 192 | if self.data_format == 'NCHW': 193 | padding = [padding[0], padding[3], padding[1], padding[2]] 194 | input_layer = tf.pad(input_layer, padding) 195 | conv = self._conv2d_impl(input_layer, num_channels_in, 196 | num_out_channels, 197 | kernel_size=[k_height, k_width], 198 | strides=[d_height, d_width], padding='VALID', 199 | kernel_initializer=kernel_initializer) 200 | if use_batch_norm is None: 201 | use_batch_norm = self.use_batch_norm 202 | if not use_batch_norm: 203 | if bias is not None: 204 | biases = self.get_variable('biases', [num_out_channels], 205 | self.variable_dtype, self.dtype, 206 | initializer=tf.constant_initializer(bias)) 207 | biased = tf.reshape( 208 | tf.nn.bias_add(conv, biases, data_format=self.data_format), 209 | conv.get_shape()) 210 | else: 211 | biased = conv 212 | else: 213 | self.top_layer = conv 214 | self.top_size = num_out_channels 215 | biased = self.batch_norm(**self.batch_norm_config) 216 | if activation == 'relu': 217 | conv1 = tf.nn.relu(biased) 218 | elif activation == 'linear' or activation is None: 219 | conv1 = biased 220 | elif activation == 'tanh': 221 | conv1 = tf.nn.tanh(biased) 222 | else: 223 | raise KeyError('Invalid activation type \'%s\'' % activation) 224 | 
self.top_layer = conv1 225 | self.top_size = num_out_channels 226 | return conv1 227 | 228 | def _pool(self, 229 | pool_name, 230 | pool_function, 231 | k_height, 232 | k_width, 233 | d_height, 234 | d_width, 235 | mode, 236 | input_layer, 237 | num_channels_in): 238 | """Construct a pooling layer.""" 239 | if input_layer is None: 240 | input_layer = self.top_layer 241 | else: 242 | self.top_size = num_channels_in 243 | name = pool_name + str(self.counts[pool_name]) 244 | self.counts[pool_name] += 1 245 | if self.use_tf_layers: 246 | pool = pool_function( 247 | input_layer, [k_height, k_width], [d_height, d_width], 248 | padding=mode, 249 | data_format=self.channel_pos, 250 | name=name) 251 | else: 252 | if self.data_format == 'NHWC': 253 | ksize = [1, k_height, k_width, 1] 254 | strides = [1, d_height, d_width, 1] 255 | else: 256 | ksize = [1, 1, k_height, k_width] 257 | strides = [1, 1, d_height, d_width] 258 | pool = tf.nn.max_pool(input_layer, ksize, strides, padding=mode, 259 | data_format=self.data_format, name=name) 260 | self.top_layer = pool 261 | return pool 262 | 263 | def mpool(self, 264 | k_height, 265 | k_width, 266 | d_height=2, 267 | d_width=2, 268 | mode='VALID', 269 | input_layer=None, 270 | num_channels_in=None): 271 | """Construct a max pooling layer.""" 272 | return self._pool('mpool', pooling_layers.max_pooling2d, k_height, k_width, 273 | d_height, d_width, mode, input_layer, num_channels_in) 274 | 275 | def apool(self, 276 | k_height, 277 | k_width, 278 | d_height=2, 279 | d_width=2, 280 | mode='VALID', 281 | input_layer=None, 282 | num_channels_in=None): 283 | """Construct an average pooling layer.""" 284 | return self._pool('apool', pooling_layers.average_pooling2d, k_height, 285 | k_width, d_height, d_width, mode, input_layer, 286 | num_channels_in) 287 | 288 | def reshape(self, shape, input_layer=None): 289 | if input_layer is None: 290 | input_layer = self.top_layer 291 | self.top_layer = tf.reshape(input_layer, shape) 292 | self.top_size = shape[-1] # HACK This may not always work 293 | return self.top_layer 294 | 295 | def affine(self, 296 | num_out_channels, 297 | input_layer=None, 298 | num_channels_in=None, 299 | bias=0.0, 300 | stddev=None, 301 | activation='relu'): 302 | if input_layer is None: 303 | input_layer = self.top_layer 304 | if num_channels_in is None: 305 | num_channels_in = self.top_size 306 | name = 'affine' + str(self.counts['affine']) 307 | self.counts['affine'] += 1 308 | with tf.variable_scope(name): 309 | init_factor = 2. if activation == 'relu' else 1. 
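# default stddev: sqrt(2/fan_in) for relu (He-style init), sqrt(1/fan_in) otherwise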
310 | stddev = stddev or np.sqrt(init_factor / num_channels_in) 311 | kernel = self.get_variable( 312 | 'weights', [num_channels_in, num_out_channels], 313 | self.variable_dtype, self.dtype, 314 | initializer=tf.truncated_normal_initializer(stddev=stddev)) 315 | biases = self.get_variable('biases', [num_out_channels], 316 | self.variable_dtype, self.dtype, 317 | initializer=tf.constant_initializer(bias)) 318 | logits = tf.nn.xw_plus_b(input_layer, kernel, biases) 319 | if activation == 'relu': 320 | affine1 = tf.nn.relu(logits, name=name) 321 | elif activation == 'linear' or activation is None: 322 | affine1 = logits 323 | else: 324 | raise KeyError('Invalid activation type \'%s\'' % activation) 325 | self.top_layer = affine1 326 | self.top_size = num_out_channels 327 | return affine1 328 | 329 | def inception_module(self, name, cols, input_layer=None, in_size=None): 330 | if input_layer is None: 331 | input_layer = self.top_layer 332 | if in_size is None: 333 | in_size = self.top_size 334 | name += str(self.counts[name]) 335 | self.counts[name] += 1 336 | with tf.variable_scope(name): 337 | col_layers = [] 338 | col_layer_sizes = [] 339 | for c, col in enumerate(cols): 340 | col_layers.append([]) 341 | col_layer_sizes.append([]) 342 | for l, layer in enumerate(col): 343 | ltype, args = layer[0], layer[1:] 344 | kwargs = { 345 | 'input_layer': input_layer, 346 | 'num_channels_in': in_size 347 | } if l == 0 else {} 348 | if ltype == 'conv': 349 | self.conv(*args, **kwargs) 350 | elif ltype == 'mpool': 351 | self.mpool(*args, **kwargs) 352 | elif ltype == 'apool': 353 | self.apool(*args, **kwargs) 354 | elif ltype == 'share': # Share matching layer from previous column 355 | self.top_layer = col_layers[c - 1][l] 356 | self.top_size = col_layer_sizes[c - 1][l] 357 | else: 358 | raise KeyError( 359 | 'Invalid layer type for inception module: \'%s\'' % ltype) 360 | col_layers[c].append(self.top_layer) 361 | col_layer_sizes[c].append(self.top_size) 362 | catdim = 3 if self.data_format == 'NHWC' else 1 363 | self.top_layer = tf.concat([layers[-1] for layers in col_layers], catdim) 364 | self.top_size = sum([sizes[-1] for sizes in col_layer_sizes]) 365 | return self.top_layer 366 | 367 | def spatial_mean(self, keep_dims=False): 368 | name = 'spatial_mean' + str(self.counts['spatial_mean']) 369 | self.counts['spatial_mean'] += 1 370 | axes = [1, 2] if self.data_format == 'NHWC' else [2, 3] 371 | self.top_layer = tf.reduce_mean( 372 | self.top_layer, axes, keepdims=keep_dims, name=name) 373 | return self.top_layer 374 | 375 | def dropout(self, keep_prob=0.5, input_layer=None): 376 | if input_layer is None: 377 | input_layer = self.top_layer 378 | else: 379 | self.top_size = None 380 | name = 'dropout' + str(self.counts['dropout']) 381 | with tf.variable_scope(name): 382 | if not self.phase_train: 383 | keep_prob = 1.0 384 | if self.use_tf_layers: 385 | dropout = core_layers.dropout(input_layer, 1. - keep_prob, 386 | training=self.phase_train) 387 | else: 388 | dropout = tf.nn.dropout(input_layer, keep_prob) 389 | self.top_layer = dropout 390 | return dropout 391 | 392 | def _batch_norm_without_layers(self, input_layer, decay, use_scale, epsilon): 393 | """Batch normalization on `input_layer` without tf.layers.""" 394 | # We make this function as similar as possible to the 395 | # tf.contrib.layers.batch_norm, to minimize the differences between using 396 | # layers and not using layers. 
397 | shape = input_layer.shape 398 | num_channels = shape[3] if self.data_format == 'NHWC' else shape[1] 399 | beta = self.get_variable('beta', [num_channels], tf.float32, tf.float32, 400 | initializer=tf.zeros_initializer()) 401 | if use_scale: 402 | gamma = self.get_variable('gamma', [num_channels], tf.float32, 403 | tf.float32, initializer=tf.ones_initializer()) 404 | else: 405 | gamma = tf.constant(1.0, tf.float32, [num_channels]) 406 | # For moving variables, we use tf.get_variable instead of self.get_variable, 407 | # since self.get_variable returns the result of tf.cast which we cannot 408 | # assign to. 409 | moving_mean = tf.get_variable('moving_mean', [num_channels], 410 | tf.float32, 411 | initializer=tf.zeros_initializer(), 412 | trainable=False) 413 | moving_variance = tf.get_variable('moving_variance', [num_channels], 414 | tf.float32, 415 | initializer=tf.ones_initializer(), 416 | trainable=False) 417 | if self.phase_train: 418 | bn, batch_mean, batch_variance = tf.nn.fused_batch_norm( 419 | input_layer, gamma, beta, epsilon=epsilon, 420 | data_format=self.data_format, is_training=True) 421 | mean_update = moving_averages.assign_moving_average( 422 | moving_mean, batch_mean, decay=decay, zero_debias=False) 423 | variance_update = moving_averages.assign_moving_average( 424 | moving_variance, batch_variance, decay=decay, zero_debias=False) 425 | tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, mean_update) 426 | tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, variance_update) 427 | else: 428 | bn, _, _ = tf.nn.fused_batch_norm( 429 | input_layer, gamma, beta, mean=moving_mean, 430 | variance=moving_variance, epsilon=epsilon, 431 | data_format=self.data_format, is_training=False) 432 | return bn 433 | 434 | def batch_norm(self, input_layer=None, decay=0.999, scale=False, 435 | epsilon=0.001): 436 | """Adds a Batch Normalization layer.""" 437 | if input_layer is None: 438 | input_layer = self.top_layer 439 | else: 440 | self.top_size = None 441 | name = 'batchnorm' + str(self.counts['batchnorm']) 442 | self.counts['batchnorm'] += 1 443 | 444 | with tf.variable_scope(name) as scope: 445 | if self.use_tf_layers: 446 | bn = tf.contrib.layers.batch_norm( 447 | input_layer, 448 | decay=decay, 449 | scale=scale, 450 | epsilon=epsilon, 451 | is_training=self.phase_train, 452 | fused=True, 453 | data_format=self.data_format, 454 | scope=scope) 455 | else: 456 | bn = self._batch_norm_without_layers(input_layer, decay, scale, epsilon) 457 | self.top_layer = bn 458 | self.top_size = bn.shape[3] if self.data_format == 'NHWC' else bn.shape[1] 459 | self.top_size = int(self.top_size) 460 | return bn 461 | 462 | def lrn(self, depth_radius, bias, alpha, beta): 463 | """Adds a local response normalization layer.""" 464 | name = 'lrn' + str(self.counts['lrn']) 465 | self.counts['lrn'] += 1 466 | self.top_layer = tf.nn.lrn( 467 | self.top_layer, depth_radius, bias, alpha, beta, name=name) 468 | return self.top_layer 469 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/tfbench/model.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """Base model configuration for CNN benchmarks.""" 16 | 17 | 18 | class Model(object): 19 | """Base model configuration for CNN benchmarks.""" 20 | 21 | def __init__(self, 22 | model, 23 | image_size, 24 | batch_size, 25 | learning_rate, 26 | layer_counts=None, 27 | fp16_loss_scale=128): 28 | self.model = model 29 | self.image_size = image_size 30 | self.batch_size = batch_size 31 | self.default_batch_size = batch_size 32 | self.learning_rate = learning_rate 33 | self.layer_counts = layer_counts 34 | # TODO(reedwm) Set custom loss scales for each model instead of using the 35 | # default of 128. 36 | self.fp16_loss_scale = fp16_loss_scale 37 | 38 | def get_model(self): 39 | return self.model 40 | 41 | def get_image_size(self): 42 | return self.image_size 43 | 44 | def get_batch_size(self): 45 | return self.batch_size 46 | 47 | def set_batch_size(self, batch_size): 48 | self.batch_size = batch_size 49 | 50 | def get_default_batch_size(self): 51 | return self.default_batch_size 52 | 53 | def get_layer_counts(self): 54 | return self.layer_counts 55 | 56 | def get_fp16_loss_scale(self): 57 | return self.fp16_loss_scale 58 | 59 | def get_learning_rate(self, global_step, batch_size): 60 | del global_step 61 | del batch_size 62 | return self.learning_rate 63 | 64 | def add_inference(self, unused_cnn): 65 | raise ValueError('Must be implemented in derived classes') 66 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/tfbench/model_config.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | 16 | """Model configurations for CNN benchmarks. 17 | """ 18 | 19 | from . 
import resnet_model 20 | 21 | 22 | _model_name_to_imagenet_model = { 23 | 'resnet50': resnet_model.create_resnet50_model, 24 | 'resnet50_v2': resnet_model.create_resnet50_v2_model, 25 | 'resnet101': resnet_model.create_resnet101_model, 26 | 'resnet101_v2': resnet_model.create_resnet101_v2_model, 27 | 'resnet152': resnet_model.create_resnet152_model, 28 | 'resnet152_v2': resnet_model.create_resnet152_v2_model, 29 | } 30 | 31 | 32 | _model_name_to_cifar_model = { 33 | } 34 | 35 | 36 | def _get_model_map(dataset_name): 37 | if 'cifar10' == dataset_name: 38 | return _model_name_to_cifar_model 39 | elif dataset_name in ('imagenet', 'synthetic'): 40 | return _model_name_to_imagenet_model 41 | else: 42 | raise ValueError('Invalid dataset name: %s' % dataset_name) 43 | 44 | 45 | def get_model_config(model_name, dataset): 46 | """Map model name to model network configuration.""" 47 | model_map = _get_model_map(dataset.name) 48 | if model_name not in model_map: 49 | raise ValueError('Invalid model name \'%s\' for dataset \'%s\'' % 50 | (model_name, dataset.name)) 51 | else: 52 | return model_map[model_name]() 53 | 54 | 55 | def register_model(model_name, dataset_name, model_func): 56 | """Register a new model that can be obtained with `get_model_config`.""" 57 | model_map = _get_model_map(dataset_name) 58 | if model_name in model_map: 59 | raise ValueError('Model "%s" is already registered for dataset "%s"' % 60 | (model_name, dataset_name)) 61 | model_map[model_name] = model_func 62 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/tfbench/resnet_model.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | 16 | """Resnet model configuration. 17 | 18 | References: 19 | Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun 20 | Deep Residual Learning for Image Recognition 21 | arXiv:1512.03385 (2015) 22 | 23 | Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun 24 | Identity Mappings in Deep Residual Networks 25 | arXiv:1603.05027 (2016) 26 | 27 | Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, 28 | Alan L. Yuille 29 | DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, 30 | Atrous Convolution, and Fully Connected CRFs 31 | arXiv:1606.00915 (2016) 32 | """ 33 | 34 | import numpy as np 35 | from six.moves import xrange # pylint: disable=redefined-builtin 36 | import tensorflow as tf 37 | from . import model as model_lib 38 | 39 | 40 | def bottleneck_block_v1(cnn, depth, depth_bottleneck, stride): 41 | """Bottleneck block with identity short-cut for ResNet v1. 42 | 43 | Args: 44 | cnn: the network to append bottleneck blocks. 45 | depth: the number of output filters for this bottleneck block. 
46 | depth_bottleneck: the number of bottleneck filters for this block. 47 | stride: Stride used in the first layer of the bottleneck block. 48 | """ 49 | input_layer = cnn.top_layer 50 | in_size = cnn.top_size 51 | name_key = 'resnet_v1' 52 | name = name_key + str(cnn.counts[name_key]) 53 | cnn.counts[name_key] += 1 54 | 55 | with tf.variable_scope(name): 56 | if depth == in_size: 57 | if stride == 1: 58 | shortcut = input_layer 59 | else: 60 | shortcut = cnn.apool( 61 | 1, 1, stride, stride, input_layer=input_layer, 62 | num_channels_in=in_size) 63 | else: 64 | shortcut = cnn.conv( 65 | depth, 1, 1, stride, stride, activation=None, 66 | use_batch_norm=True, input_layer=input_layer, 67 | num_channels_in=in_size, bias=None) 68 | cnn.conv(depth_bottleneck, 1, 1, stride, stride, 69 | input_layer=input_layer, num_channels_in=in_size, 70 | use_batch_norm=True, bias=None) 71 | cnn.conv(depth_bottleneck, 3, 3, 1, 1, mode='SAME_RESNET', 72 | use_batch_norm=True, bias=None) 73 | res = cnn.conv(depth, 1, 1, 1, 1, activation=None, 74 | use_batch_norm=True, bias=None) 75 | output = tf.nn.relu(shortcut + res) 76 | cnn.top_layer = output 77 | cnn.top_size = depth 78 | 79 | 80 | def bottleneck_block_v2(cnn, depth, depth_bottleneck, stride): 81 | """Bottleneck block with identity short-cut for ResNet v2. 82 | 83 | The main difference from v1 is that a batch norm and relu are done at the 84 | start of the block, instead of the end. This initial batch norm and relu is 85 | collectively called a pre-activation. 86 | 87 | Args: 88 | cnn: the network to append bottleneck blocks. 89 | depth: the number of output filters for this bottleneck block. 90 | depth_bottleneck: the number of bottleneck filters for this block. 91 | stride: Stride used in the first layer of the bottleneck block. 92 | """ 93 | input_layer = cnn.top_layer 94 | in_size = cnn.top_size 95 | name_key = 'resnet_v2' 96 | name = name_key + str(cnn.counts[name_key]) 97 | cnn.counts[name_key] += 1 98 | 99 | preact = cnn.batch_norm() 100 | preact = tf.nn.relu(preact) 101 | with tf.variable_scope(name): 102 | if depth == in_size: 103 | if stride == 1: 104 | shortcut = input_layer 105 | else: 106 | shortcut = cnn.apool( 107 | 1, 1, stride, stride, input_layer=input_layer, 108 | num_channels_in=in_size) 109 | else: 110 | shortcut = cnn.conv( 111 | depth, 1, 1, stride, stride, activation=None, use_batch_norm=False, 112 | input_layer=preact, num_channels_in=in_size, bias=None) 113 | cnn.conv(depth_bottleneck, 1, 1, stride, stride, 114 | input_layer=preact, num_channels_in=in_size, 115 | use_batch_norm=True, bias=None) 116 | cnn.conv(depth_bottleneck, 3, 3, 1, 1, mode='SAME_RESNET', 117 | use_batch_norm=True, bias=None) 118 | res = cnn.conv(depth, 1, 1, 1, 1, activation=None, 119 | use_batch_norm=False, bias=None) 120 | output = shortcut + res 121 | cnn.top_layer = output 122 | cnn.top_size = depth 123 | 124 | 125 | def bottleneck_block(cnn, depth, depth_bottleneck, stride, pre_activation): 126 | """Bottleneck block with identity short-cut. 127 | 128 | Args: 129 | cnn: the network to append bottleneck blocks. 130 | depth: the number of output filters for this bottleneck block. 131 | depth_bottleneck: the number of bottleneck filters for this block. 132 | stride: Stride used in the first layer of the bottleneck block. 133 | pre_activation: use pre_activation structure used in v2 or not. 
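Dispatches to the v1 or v2 block implementation depending on pre_activation.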
134 | """ 135 | if pre_activation: 136 | bottleneck_block_v2(cnn, depth, depth_bottleneck, stride) 137 | else: 138 | bottleneck_block_v1(cnn, depth, depth_bottleneck, stride) 139 | 140 | 141 | def residual_block(cnn, depth, stride, pre_activation): 142 | """Residual block with identity short-cut. 143 | 144 | Args: 145 | cnn: the network to append residual blocks. 146 | depth: the number of output filters for this residual block. 147 | stride: Stride used in the first layer of the residual block. 148 | pre_activation: use pre_activation structure or not. 149 | """ 150 | input_layer = cnn.top_layer 151 | in_size = cnn.top_size 152 | if in_size != depth: 153 | # Plan A of shortcut. 154 | shortcut = cnn.apool(1, 1, stride, stride, 155 | input_layer=input_layer, 156 | num_channels_in=in_size) 157 | padding = (depth - in_size) // 2 158 | if cnn.channel_pos == 'channels_last': 159 | shortcut = tf.pad( 160 | shortcut, [[0, 0], [0, 0], [0, 0], [padding, padding]]) 161 | else: 162 | shortcut = tf.pad( 163 | shortcut, [[0, 0], [padding, padding], [0, 0], [0, 0]]) 164 | else: 165 | shortcut = input_layer 166 | if pre_activation: 167 | res = cnn.batch_norm(input_layer) 168 | res = tf.nn.relu(res) 169 | else: 170 | res = input_layer 171 | cnn.conv(depth, 3, 3, stride, stride, 172 | input_layer=res, num_channels_in=in_size, 173 | use_batch_norm=True, bias=None) 174 | if pre_activation: 175 | res = cnn.conv(depth, 3, 3, 1, 1, activation=None, 176 | use_batch_norm=False, bias=None) 177 | output = shortcut + res 178 | else: 179 | res = cnn.conv(depth, 3, 3, 1, 1, activation=None, 180 | use_batch_norm=True, bias=None) 181 | output = tf.nn.relu(shortcut + res) 182 | cnn.top_layer = output 183 | cnn.top_size = depth 184 | 185 | 186 | class ResnetModel(model_lib.Model): 187 | """Resnet cnn network configuration.""" 188 | 189 | def __init__(self, model, layer_counts): 190 | default_batch_sizes = { 191 | 'resnet50': 64, 192 | 'resnet101': 32, 193 | 'resnet152': 32, 194 | 'resnet50_v2': 64, 195 | 'resnet101_v2': 32, 196 | 'resnet152_v2': 32, 197 | } 198 | batch_size = default_batch_sizes.get(model, 32) 199 | super(ResnetModel, self).__init__(model, 224, batch_size, 0.005, 200 | layer_counts) 201 | self.pre_activation = 'v2' in model 202 | 203 | def add_inference(self, cnn): 204 | if self.layer_counts is None: 205 | raise ValueError('Layer counts not specified for %s' % self.get_model()) 206 | cnn.use_batch_norm = True 207 | cnn.batch_norm_config = {'decay': 0.997, 'epsilon': 1e-5, 'scale': True} 208 | cnn.conv(64, 7, 7, 2, 2, mode='SAME_RESNET', use_batch_norm=True) 209 | cnn.mpool(3, 3, 2, 2, mode='SAME') 210 | for _ in xrange(self.layer_counts[0]): 211 | bottleneck_block(cnn, 256, 64, 1, self.pre_activation) 212 | for i in xrange(self.layer_counts[1]): 213 | stride = 2 if i == 0 else 1 214 | bottleneck_block(cnn, 512, 128, stride, self.pre_activation) 215 | for i in xrange(self.layer_counts[2]): 216 | stride = 2 if i == 0 else 1 217 | bottleneck_block(cnn, 1024, 256, stride, self.pre_activation) 218 | for i in xrange(self.layer_counts[3]): 219 | stride = 2 if i == 0 else 1 220 | bottleneck_block(cnn, 2048, 512, stride, self.pre_activation) 221 | if self.pre_activation: 222 | cnn.batch_norm() 223 | cnn.top_layer = tf.nn.relu(cnn.top_layer) 224 | cnn.spatial_mean() 225 | 226 | def get_learning_rate(self, global_step, batch_size): 227 | num_batches_per_epoch = ( 228 | float(datasets.IMAGENET_NUM_TRAIN_IMAGES) / batch_size) 229 | boundaries = [int(num_batches_per_epoch * x) for x in [30, 60]] 230 | values = [0.1, 
231 |     return tf.train.piecewise_constant(global_step, boundaries, values)
232 | 
233 | 
234 | def create_resnet50_model():
235 |   return ResnetModel('resnet50', (3, 4, 6, 3))
236 | 
237 | 
238 | def create_resnet50_v2_model():
239 |   return ResnetModel('resnet50_v2', (3, 4, 6, 3))
240 | 
241 | 
242 | def create_resnet101_model():
243 |   return ResnetModel('resnet101', (3, 4, 23, 3))
244 | 
245 | 
246 | def create_resnet101_v2_model():
247 |   return ResnetModel('resnet101_v2', (3, 4, 23, 3))
248 | 
249 | 
250 | def create_resnet152_model():
251 |   return ResnetModel('resnet152', (3, 8, 36, 3))
252 | 
253 | 
254 | def create_resnet152_v2_model():
255 |   return ResnetModel('resnet152_v2', (3, 8, 36, 3))
256 | 
257 | 
258 | class ResnetCifar10Model(model_lib.Model):
259 |   """Resnet cnn network configuration for Cifar 10 dataset.
260 | 
261 |   V1 model architecture follows the one defined in the paper:
262 |   https://arxiv.org/pdf/1512.03385.pdf.
263 | 
264 |   V2 model architecture follows the one defined in the paper:
265 |   https://arxiv.org/pdf/1603.05027.pdf.
266 |   """
267 | 
268 |   def __init__(self, model, layer_counts):
269 |     self.pre_activation = 'v2' in model
270 |     super(ResnetCifar10Model, self).__init__(
271 |         model, 32, 128, 0.1, layer_counts)
272 | 
273 |   def add_inference(self, cnn):
274 |     if self.layer_counts is None:
275 |       raise ValueError('Layer counts not specified for %s' % self.get_model())
276 | 
277 |     cnn.use_batch_norm = True
278 |     cnn.batch_norm_config = {'decay': 0.9, 'epsilon': 1e-5, 'scale': True}
279 |     if self.pre_activation:
280 |       cnn.conv(16, 3, 3, 1, 1, use_batch_norm=True)
281 |     else:
282 |       cnn.conv(16, 3, 3, 1, 1, activation=None, use_batch_norm=True)
283 |     for i in xrange(self.layer_counts[0]):
284 |       # output shape: batch_size x 16 x 32 x 32
285 |       residual_block(cnn, 16, 1, self.pre_activation)
286 |     for i in xrange(self.layer_counts[1]):
287 |       # Subsampling is performed by the first convolution, with a stride of 2.
288 |       stride = 2 if i == 0 else 1
289 |       # output shape: batch_size x 32 x 16 x 16
290 |       residual_block(cnn, 32, stride, self.pre_activation)
291 |     for i in xrange(self.layer_counts[2]):
292 |       stride = 2 if i == 0 else 1
293 |       # output shape: batch_size x 64 x 8 x 8
294 |       residual_block(cnn, 64, stride, self.pre_activation)
295 |     if self.pre_activation:
296 |       cnn.batch_norm()
297 |       cnn.top_layer = tf.nn.relu(cnn.top_layer)
298 |     cnn.spatial_mean()
299 | 
300 |   def get_learning_rate(self, global_step, batch_size):
301 |     num_batches_per_epoch = int(50000 / batch_size)
302 |     boundaries = num_batches_per_epoch * np.array([82, 123, 300],
303 |                                                   dtype=np.int64)
304 |     boundaries = list(boundaries)  # convert from np.ndarray to a list
305 |     values = [0.1, 0.01, 0.001, 0.0002]
306 |     return tf.train.piecewise_constant(global_step, boundaries, values)
307 | 
308 | 
309 | def create_resnet20_cifar_model():
310 |   return ResnetCifar10Model('resnet20', (3, 3, 3))
311 | 
312 | 
313 | def create_resnet20_v2_cifar_model():
314 |   return ResnetCifar10Model('resnet20_v2', (3, 3, 3))
315 | 
316 | 
317 | def create_resnet32_cifar_model():
318 |   return ResnetCifar10Model('resnet32', (5, 5, 5))
319 | 
320 | 
321 | def create_resnet32_v2_cifar_model():
322 |   return ResnetCifar10Model('resnet32_v2', (5, 5, 5))
323 | 
324 | 
325 | def create_resnet44_cifar_model():
326 |   return ResnetCifar10Model('resnet44', (7, 7, 7))
327 | 
328 | 
329 | def create_resnet44_v2_cifar_model():
330 |   return ResnetCifar10Model('resnet44_v2', (7, 7, 7))
331 | 
332 | 
333 | def create_resnet56_cifar_model():
334 |   return ResnetCifar10Model('resnet56', (9, 9, 9))
335 | 
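# A quick sanity check on the naming: each bottleneck block above contains
# 3 convolutions, so resnet50 stacks 3 * (3 + 4 + 6 + 3) = 48 convolutions
# plus the stem convolution and the final FC layer, i.e. 50 weight layers in
# total. The Cifar10 models in this file use 2-conv residual blocks instead,
# so e.g. resnet56 has 2 * (9 + 9 + 9) + 2 = 56 weight layers.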
336 | 
337 | def create_resnet56_v2_cifar_model():
338 |   return ResnetCifar10Model('resnet56_v2', (9, 9, 9))
339 | 
340 | 
341 | def create_resnet110_cifar_model():
342 |   return ResnetCifar10Model('resnet110', (18, 18, 18))
343 | 
344 | 
345 | def create_resnet110_v2_cifar_model():
346 |   return ResnetCifar10Model('resnet110_v2', (18, 18, 18))
347 | 
--------------------------------------------------------------------------------
/other-wrappers/README.md:
--------------------------------------------------------------------------------
1 | ## Benchmark CNN with other TF high-level APIs
2 | 
3 | Tensorpack is __1.2x~5x__ faster than equivalent code written in some other TF high-level APIs.
4 | 
5 | ### Benchmark setting:
6 | 
7 | * Hardware: AWS p3.16xlarge (8 Tesla V100s)
8 | * Software:
9 |   Python 3.6, TF 1.13.1, cuda 10, cudnn 7.4.2, Keras 2.1.5, tflearn 0.3.2, tensorpack 0.9.4.
10 | * Measurement: speed is measured in images per second (__larger is better__). The first epoch is warmup and
11 |   is excluded from timing. Second and later epochs differ by statistically insignificant amounts.
12 | * Data:
13 |   * Real data for Cifar10.
14 |   * For ImageNet, the input is assumed to be a constant numpy array already available in CPU memory.
15 |     This is a reasonable setting, because data always has to arrive in CPU memory from somewhere anyway.
16 | * __Source code is here__. Each script is under 100 lines, so you can easily
17 |   run it and __verify the numbers by yourself__.
18 | 
19 | ### On a Single GPU:
20 | | Task                           | tensorpack | Keras | tflearn |
21 | | ------------------------------ | ---------- | ----- | ------- |
22 | | Keras Official Cifar10 Example | __11904__  | 7142  | 5882    |
23 | | VGG16 on fake ImageNet         | __230__    | 204   | 194     |
24 | | AlexNet on fake ImageNet       | __2603__   | 1454  | N/A     |
25 | | ResNet50 on fake ImageNet      | __333__    | 266   | N/A     |
26 | 
27 | ### Data Parallel on 8 GPUs:
28 | 
29 | Each script in this section can be run as `./script.py NUM_GPU` to choose the number of GPUs.
30 | 
31 | |                     | 1 GPU | 2 GPUs | 8 GPUs   |
32 | | ------------------- | ----- | ------ | -------- |
33 | | tensorpack+ResNet50 | 333   | 579    | __2245__ |
34 | | Keras+ResNet50      | 266   | 320    | 470      |
35 | |                     |       |        |          |
36 | | tensorpack+VGG16    | 230   | 438    | __1651__ |
37 | | Keras+VGG16         | 204   | 304    | 449      |
38 | 
39 | 
40 | 
41 | Notes:
42 | 
43 | 1. With a better setting (different batch sizes, etc.) in [../ResNet-MultiGPU/](../ResNet-MultiGPU/),
44 |    tensorpack can further reach 2800 im/s for ResNet50 on a p3.16xlarge instance,
45 |    and 9225 im/s with all optimizations plus fp16 turned on.
46 | 2. It's possible to make Keras faster (with a better input pipeline, by building the data-parallel graph yourself, etc.), but that's NOT
47 |    how most users use Keras, nor how any of the Keras examples are written.
48 | 
49 |    Using Keras together with Tensorpack is one way to make Keras faster.
50 |    See the [Keras+Tensorpack example](https://github.com/tensorpack/tensorpack/tree/master/examples/keras).
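
For reference, the images-per-second metric above is plain arithmetic over wall time. Below is a minimal sketch of such a computation (not code from the benchmark scripts; the function name and arguments are made up for illustration), assuming you record the wall time of every training step and know the total batch size:

```
def images_per_sec(step_times, total_batch_size, warmup=1):
    """Throughput from a list of per-step wall times, in seconds.

    The first `warmup` measurements are dropped, mirroring how the
    first (warmup) epoch is excluded from the numbers above.
    """
    steady = step_times[warmup:]
    return total_batch_size * len(steady) / sum(steady)


# 3 steady steps of 0.1s each at total batch size 512 -> 5120 im/s.
print(images_per_sec([0.5, 0.1, 0.1, 0.1], 512))
```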
51 | -------------------------------------------------------------------------------- /other-wrappers/keras.alexnet.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import keras 4 | keras.backend.set_image_data_format('channels_first') 5 | from keras.models import Sequential 6 | from keras.layers import * 7 | from keras.utils import np_utils 8 | import numpy as np 9 | import sys 10 | 11 | try: 12 | NUM_GPU = int(sys.argv[1]) 13 | except IndexError: 14 | NUM_GPU = 1 15 | batch_size = 64 * NUM_GPU 16 | nb_classes = 1000 17 | 18 | img_rows, img_cols = 224, 224 19 | 20 | if keras.backend.image_data_format() == 'channels_first': 21 | X_train = np.random.random((batch_size, 3, img_rows, img_cols)).astype('float32') 22 | else: 23 | X_train = np.random.random((batch_size, img_rows, img_cols, 3)).astype('float32') 24 | Y_train = np.random.random((batch_size,)).astype('int32') 25 | Y_train = np_utils.to_categorical(Y_train, nb_classes) 26 | 27 | 28 | def gen(): 29 | while True: 30 | yield (X_train, Y_train) 31 | 32 | 33 | model = Sequential() 34 | model.add(Convolution2D(64, 11, strides=4, padding='valid', input_shape=X_train.shape[1:])) 35 | model.add(Activation('relu')) 36 | model.add(MaxPooling2D(pool_size=(3, 3), strides=2)) 37 | model.add(Convolution2D(192, 5, padding='same')) 38 | model.add(Activation('relu')) 39 | model.add(MaxPooling2D(pool_size=(3, 3), strides=2)) 40 | 41 | model.add(Convolution2D(384, 3, padding='same')) 42 | model.add(Activation('relu')) 43 | model.add(Convolution2D(256, 3, padding='same')) 44 | model.add(Activation('relu')) 45 | model.add(Convolution2D(256, 3, padding='same')) 46 | model.add(Activation('relu')) 47 | model.add(MaxPooling2D(pool_size=(3, 3), strides=2)) 48 | 49 | model.add(Flatten()) 50 | model.add(Dense(4096)) 51 | model.add(Activation('relu')) 52 | model.add(Dropout(0.5)) 53 | model.add(Dense(4096)) 54 | model.add(Activation('relu')) 55 | model.add(Dropout(0.5)) 56 | model.add(Dense(nb_classes)) 57 | model.add(Activation('softmax')) 58 | 59 | for l in model.layers: 60 | print(l.input_shape, l.output_shape) 61 | 62 | if NUM_GPU != 1: 63 | model = keras.utils.multi_gpu_model(model, NUM_GPU) 64 | 65 | # Let's train the model using RMSprop 66 | model.compile(loss='categorical_crossentropy', 67 | optimizer='rmsprop', 68 | metrics=['accuracy']) 69 | 70 | model.fit_generator(gen(), epochs=100, steps_per_epoch=200) 71 | -------------------------------------------------------------------------------- /other-wrappers/keras.cifar10.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | '''Train a simple deep CNN on the CIFAR10 small images dataset. 3 | 4 | It gets to 75% validation accuracy in 25 epochs, and 79% after 50 epochs. 5 | (it's still underfitting at that point, though). 6 | ''' 7 | 8 | from __future__ import print_function 9 | import keras 10 | from keras.datasets import cifar10 11 | from keras.models import Sequential 12 | from keras.layers import Dense, Dropout, Activation, Flatten 13 | from keras.layers import Conv2D, MaxPooling2D 14 | 15 | batch_size = 32 16 | num_classes = 10 17 | epochs = 100 18 | 19 | # The data, split between train and test sets: 20 | (x_train, y_train), (x_test, y_test) = cifar10.load_data() 21 | print('x_train shape:', x_train.shape) 22 | print(x_train.shape[0], 'train samples') 23 | print(x_test.shape[0], 'test samples') 24 | 25 | # Convert class vectors to binary class matrices. 
26 | y_train = keras.utils.to_categorical(y_train, num_classes) 27 | y_test = keras.utils.to_categorical(y_test, num_classes) 28 | 29 | model = Sequential() 30 | model.add(Conv2D(32, (3, 3), padding='same', 31 | input_shape=x_train.shape[1:])) 32 | model.add(Activation('relu')) 33 | model.add(Conv2D(32, (3, 3))) 34 | model.add(Activation('relu')) 35 | model.add(MaxPooling2D(pool_size=(2, 2))) 36 | model.add(Dropout(0.25)) 37 | 38 | model.add(Conv2D(64, (3, 3), padding='same')) 39 | model.add(Activation('relu')) 40 | model.add(Conv2D(64, (3, 3))) 41 | model.add(Activation('relu')) 42 | model.add(MaxPooling2D(pool_size=(2, 2))) 43 | model.add(Dropout(0.25)) 44 | 45 | model.add(Flatten()) 46 | model.add(Dense(512)) 47 | model.add(Activation('relu')) 48 | model.add(Dropout(0.5)) 49 | model.add(Dense(num_classes)) 50 | model.add(Activation('softmax')) 51 | 52 | # initiate RMSprop optimizer 53 | opt = keras.optimizers.rmsprop(lr=0.0001, decay=1e-6) 54 | 55 | # Let's train the model using RMSprop 56 | model.compile(loss='categorical_crossentropy', 57 | optimizer=opt, 58 | metrics=['accuracy']) 59 | 60 | x_train = x_train.astype('float32') 61 | x_test = x_test.astype('float32') 62 | x_train /= 255 63 | x_test /= 255 64 | 65 | print('Not using data augmentation.') 66 | model.fit(x_train, y_train, 67 | batch_size=batch_size, 68 | epochs=epochs, 69 | # validation_data=(x_test, y_test), 70 | shuffle=True) 71 | -------------------------------------------------------------------------------- /other-wrappers/keras.resnet.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import keras 4 | keras.backend.set_image_data_format('channels_first') 5 | from keras.applications.resnet50 import ResNet50 6 | from keras.utils import np_utils 7 | import numpy as np 8 | import sys 9 | 10 | try: 11 | NUM_GPU = int(sys.argv[1]) 12 | except IndexError: 13 | NUM_GPU = 1 14 | 15 | batch_size = 32 * NUM_GPU 16 | 17 | img_rows, img_cols = 224, 224 18 | 19 | if keras.backend.image_data_format() == 'channels_first': 20 | X_train = np.random.random((batch_size, 3, img_rows, img_cols)).astype('float32') 21 | else: 22 | X_train = np.random.random((batch_size, img_rows, img_cols, 3)).astype('float32') 23 | Y_train = np.random.random((batch_size,)).astype('int32') 24 | Y_train = np_utils.to_categorical(Y_train, 1000) 25 | 26 | 27 | def gen(): 28 | while True: 29 | yield (X_train, Y_train) 30 | 31 | 32 | model = ResNet50(weights=None, input_shape=X_train.shape[1:]) 33 | 34 | if NUM_GPU != 1: 35 | model = keras.utils.multi_gpu_model(model, gpus=NUM_GPU) 36 | 37 | for l in model.layers: 38 | print(l.input_shape, l.output_shape) 39 | 40 | model.compile(loss='categorical_crossentropy', 41 | optimizer='sgd', metrics=['accuracy']) 42 | 43 | model.fit_generator(gen(), epochs=100, steps_per_epoch=50) 44 | -------------------------------------------------------------------------------- /other-wrappers/keras.vgg.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | import keras 5 | keras.backend.set_image_data_format('channels_first') 6 | from keras.models import Model 7 | from keras.utils import np_utils 8 | from keras.layers import * 9 | import numpy as np 10 | 11 | try: 12 | NUM_GPU = int(sys.argv[1]) 13 | except IndexError: 14 | NUM_GPU = 1 15 | batch_size = 64 * NUM_GPU 16 | nb_classes = 1000 17 | nb_epoch = 200 18 | 19 | img_rows, img_cols = 224, 224 20 | 21 | if 
keras.backend.image_data_format() == 'channels_first': 22 | X_train = np.random.random((batch_size, 3, img_rows, img_cols)) 23 | else: 24 | X_train = np.random.random((batch_size, img_rows, img_cols, 3)) 25 | Y_train = np.random.random((batch_size,)).astype('int32') 26 | Y_train = np_utils.to_categorical(Y_train, nb_classes) 27 | 28 | 29 | def gen(): 30 | while True: 31 | yield (X_train, Y_train) 32 | 33 | 34 | img_input = Input(shape=X_train.shape[1:], dtype="float32") 35 | # Block 1 36 | x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1')(img_input) 37 | x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2')(x) 38 | x = MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool')(x) 39 | 40 | # Block 2 41 | x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv1')(x) 42 | x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv2')(x) 43 | x = MaxPooling2D((2, 2), strides=(2, 2), name='block2_pool')(x) 44 | 45 | # Block 3 46 | x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv1')(x) 47 | x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv2')(x) 48 | x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv3')(x) 49 | x = MaxPooling2D((2, 2), strides=(2, 2), name='block3_pool')(x) 50 | 51 | # Block 4 52 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv1')(x) 53 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv2')(x) 54 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv3')(x) 55 | x = MaxPooling2D((2, 2), strides=(2, 2), name='block4_pool')(x) 56 | 57 | # Block 5 58 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv1')(x) 59 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv2')(x) 60 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv3')(x) 61 | x = MaxPooling2D((2, 2), strides=(2, 2), name='block5_pool')(x) 62 | 63 | x = Flatten(name='flatten')(x) 64 | x = Dense(4096, activation='relu', name='fc1')(x) 65 | x = Dense(4096, activation='relu', name='fc2')(x) 66 | x = Dense(nb_classes, activation='softmax', name='predictions')(x) 67 | 68 | model = Model(img_input, x, name='vgg16') 69 | 70 | if NUM_GPU != 1: 71 | model = keras.utils.multi_gpu_model(model, NUM_GPU) 72 | 73 | # Let's train the model using RMSprop 74 | model.compile(loss='categorical_crossentropy', 75 | optimizer='rmsprop', 76 | metrics=['accuracy']) 77 | 78 | model.fit_generator(gen(), epochs=nb_epoch, steps_per_epoch=50) 79 | -------------------------------------------------------------------------------- /other-wrappers/tensorpack.alexnet.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: tensorpack.alexnet.py 4 | import tensorflow as tf 5 | import numpy as np 6 | from tensorpack import * 7 | 8 | BATCH = 64 9 | 10 | 11 | class Model(ModelDesc): 12 | def inputs(self): 13 | return [tf.TensorSpec([BATCH, 3, 224, 224], tf.float32, 'input'), 14 | tf.TensorSpec([BATCH], tf.int32, 'label')] 15 | 16 | def build_graph(self, image, label): 17 | image = image / 255.0 18 | 19 | with argscope(Conv2D, activation=tf.nn.relu, kernel_size=3), \ 20 | argscope([Conv2D, MaxPooling], data_format='channels_first'): 21 | logits = (LinearWrap(image) 22 | .Conv2D('conv1_1', 64, kernel_size=11, strides=4, padding='VALID') 23 | 
.MaxPooling('pool1', 3, 2) 24 | .Conv2D('conv1_2', 192, kernel_size=5) 25 | .MaxPooling('pool2', 3, 2) 26 | 27 | .Conv2D('conv3', 384) 28 | .Conv2D('conv4', 256) 29 | .Conv2D('conv5', 256) 30 | .MaxPooling('pool3', 3, 2) 31 | 32 | .FullyConnected('fc6', 4096, activation=tf.nn.relu) 33 | .Dropout('drop0', rate=0.5) 34 | .FullyConnected('fc7', 4096, activation=tf.nn.relu) 35 | .Dropout('drop1', rate=0.5) 36 | .FullyConnected('fc8', 1000, activation=tf.identity)()) 37 | 38 | cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label) 39 | cost = tf.reduce_mean(cost, name='cost') 40 | return cost 41 | 42 | def optimizer(self): 43 | return tf.train.RMSPropOptimizer(1e-3, epsilon=1e-8) 44 | 45 | 46 | def get_data(): 47 | X_train = np.random.random((BATCH, 3, 224, 224)).astype('float32') 48 | Y_train = np.random.random((BATCH,)).astype('int32') 49 | 50 | def gen(): 51 | while True: 52 | yield [X_train, Y_train] 53 | return DataFromGenerator(gen) 54 | 55 | 56 | if __name__ == '__main__': 57 | dataset_train = get_data() 58 | config = TrainConfig( 59 | model=Model(), 60 | data=StagingInput(QueueInput(dataset_train)), 61 | callbacks=[], 62 | max_epoch=100, 63 | steps_per_epoch=200, 64 | ) 65 | launch_train_with_config(config, SimpleTrainer()) 66 | -------------------------------------------------------------------------------- /other-wrappers/tensorpack.cifar10.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: tensorpack.cifar10.py 4 | import tensorflow as tf 5 | from tensorpack import * 6 | 7 | 8 | class Model(ModelDesc): 9 | def inputs(self): 10 | return [tf.TensorSpec([None, 32, 32, 3], tf.float32, 'input'), 11 | tf.TensorSpec([None], tf.int32, 'label')] 12 | 13 | def build_graph(self, image, label): 14 | image = tf.transpose(image, [0, 3, 1, 2]) 15 | image = image / 255.0 16 | 17 | with argscope(Conv2D, activation=tf.nn.relu, kernel_size=3, padding='VALID'), \ 18 | argscope([Conv2D, MaxPooling], data_format='NCHW'): 19 | logits = (LinearWrap(image) 20 | .Conv2D('conv0', 32, padding='SAME') 21 | .Conv2D('conv1', 32) 22 | .MaxPooling('pool0', 2) 23 | .Dropout(rate=0.25) 24 | .Conv2D('conv2', 64, padding='SAME') 25 | .Conv2D('conv3', 64) 26 | .MaxPooling('pool1', 2) 27 | .Dropout(rate=0.25) 28 | .FullyConnected('fc1', 512, activation=tf.nn.relu) 29 | .Dropout(rate=0.5) 30 | .FullyConnected('linear', 10, activation=tf.identity)()) 31 | 32 | cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label) 33 | cost = tf.reduce_mean(cost, name='cost') 34 | 35 | wrong = tf.cast(tf.logical_not(tf.nn.in_top_k(logits, label, 1)), tf.float32, name='wrong') 36 | tf.reduce_mean(wrong, name='train_error') 37 | # no weight decay 38 | return cost 39 | 40 | def optimizer(self): 41 | lr = tf.get_variable('learning_rate', initializer=1e-4, trainable=False) 42 | return tf.train.RMSPropOptimizer(lr, epsilon=1e-8) 43 | 44 | 45 | def get_data(train_or_test): 46 | isTrain = train_or_test == 'train' 47 | ds = dataset.Cifar10(train_or_test) 48 | ds = BatchData(ds, 32, remainder=not isTrain) 49 | return ds 50 | 51 | 52 | if __name__ == '__main__': 53 | dataset_train = get_data('train') 54 | dataset_test = get_data('test') 55 | config = TrainConfig( 56 | model=Model(), 57 | data=QueueInput(dataset_train, 58 | queue=tf.FIFOQueue(300, [tf.float32, tf.int32])), 59 | # callbacks=[InferenceRunner(dataset_test, ClassificationError('wrong'))], # skip validation 60 | callbacks=[], 61 | # 
keras monitors these two metrics live during training; do the same here (no overhead actually)
62 |         extra_callbacks=[ProgressBar(['cost', 'train_error']), MergeAllSummaries()],
63 |         max_epoch=200,
64 |     )
65 |     launch_train_with_config(config, SimpleTrainer())
66 | 
--------------------------------------------------------------------------------
/other-wrappers/tensorpack.resnet.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | # File: tensorpack.resnet.py
4 | import tensorflow as tf
5 | import sys
6 | import numpy as np
7 | from tensorpack import *
8 | 
9 | BATCH = 32      # tensorpack's "batch" is per-GPU batch.
10 | try:
11 |     NUM_GPU = int(sys.argv[1])
12 | except IndexError:
13 |     NUM_GPU = 1
14 | 
15 | 
16 | def resnet_shortcut(l, n_out, stride, activation=tf.identity):
17 |     n_in = l.get_shape().as_list()[1]
18 |     if n_in != n_out:   # project the shortcut when the channel count changes
19 |         return Conv2D('convshortcut', l, n_out, 1, strides=stride, activation=activation)
20 |     else:
21 |         return l
22 | 
23 | 
24 | def block_func(l, ch_out, stride):
25 |     BN = lambda x, name=None: BatchNorm('bn', x)
26 |     shortcut = l
27 |     l = Conv2D('conv1', l, ch_out, 1, strides=stride, activation=BNReLU)
28 |     l = Conv2D('conv2', l, ch_out, 3, strides=1, activation=BNReLU)
29 |     l = Conv2D('conv3', l, ch_out * 4, 1, activation=BN)
30 |     return tf.nn.relu(l + resnet_shortcut(shortcut, ch_out * 4, stride, activation=BN))
31 | 
32 | 
33 | def group_func(l, name, block_func, features, count, stride):
34 |     with tf.variable_scope(name):
35 |         for i in range(0, count):
36 |             with tf.variable_scope('block{}'.format(i)):
37 |                 l = block_func(l, features, stride if i == 0 else 1)
38 |     return l
39 | 
40 | 
41 | class Model(ModelDesc):
42 |     def inputs(self):
43 |         return [tf.TensorSpec([None, 3, 224, 224], tf.float32, 'input'),
44 |                 tf.TensorSpec([None], tf.int32, 'label')]
45 | 
46 |     def build_graph(self, image, label):
47 |         image = image / 255.0
48 | 
49 |         num_blocks = [3, 4, 6, 3]
50 |         with argscope([Conv2D, MaxPooling, BatchNorm, GlobalAvgPooling], data_format='channels_first'), \
51 |                 argscope(Conv2D, use_bias=False):
52 |             logits = (LinearWrap(image)
53 |                       .tf.pad([[0, 0], [0, 0], [3, 3], [3, 3]])  # pad H and W by 3 each (NCHW layout)
54 |                       .Conv2D('conv0', 64, 7, strides=2, activation=BNReLU, padding='VALID')
55 |                       .MaxPooling('pool0', 3, strides=2, padding='SAME')
56 |                       .apply(group_func, 'group0', block_func, 64, num_blocks[0], 1)
57 |                       .apply(group_func, 'group1', block_func, 128, num_blocks[1], 2)
58 |                       .apply(group_func, 'group2', block_func, 256, num_blocks[2], 2)
59 |                       .apply(group_func, 'group3', block_func, 512, num_blocks[3], 2)
60 |                       .GlobalAvgPooling('gap')
61 |                       .FullyConnected('linear', 1000, activation=tf.identity)())
62 | 
63 |         cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label)
64 |         cost = tf.reduce_mean(cost, name='cost')
65 |         return cost
66 | 
67 |     def optimizer(self):
68 |         return tf.train.GradientDescentOptimizer(1e-3)
69 | 
70 | 
71 | def get_data():
72 |     X_train = np.random.random((BATCH, 3, 224, 224)).astype('float32')
73 |     Y_train = np.random.random((BATCH,)).astype('int32')
74 | 
75 |     def gen():
76 |         while True:
77 |             yield [X_train, Y_train]
78 |     return DataFromGenerator(gen)
79 | 
80 | 
81 | if __name__ == '__main__':
82 |     dataset_train = get_data()
83 |     config = TrainConfig(
84 |         model=Model(),
85 |         dataflow=dataset_train,
86 |         callbacks=[],
87 |         max_epoch=100,
88 |         steps_per_epoch=50,
89 |     )
90 |     trainer = SyncMultiGPUTrainerReplicated(
91 |         NUM_GPU,
mode='hierarchical' if NUM_GPU == 8 else 'cpu') 92 | launch_train_with_config(config, trainer) 93 | -------------------------------------------------------------------------------- /other-wrappers/tensorpack.vgg.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: tensorpack.vgg.py 4 | import tensorflow as tf 5 | import numpy as np 6 | import sys 7 | from tensorpack import * 8 | 9 | BATCH = 64 # tensorpack's "batch" is per-GPU batch. 10 | try: 11 | NUM_GPU = int(sys.argv[1]) 12 | except IndexError: 13 | NUM_GPU = 1 14 | 15 | 16 | class Model(ModelDesc): 17 | def inputs(self): 18 | return [tf.TensorSpec([None, 3, 224, 224], tf.float32, 'input'), 19 | tf.TensorSpec([None], tf.int32, 'label')] 20 | 21 | def build_graph(self, image, label): 22 | image = image / 255.0 23 | 24 | with argscope(Conv2D, activation=tf.nn.relu, kernel_size=3), \ 25 | argscope([Conv2D, MaxPooling], data_format='channels_first'): 26 | logits = (LinearWrap(image) 27 | .Conv2D('conv1_1', 64) 28 | .Conv2D('conv1_2', 64) 29 | .MaxPooling('pool1', 2) 30 | # 112 31 | .Conv2D('conv2_1', 128) 32 | .Conv2D('conv2_2', 128) 33 | .MaxPooling('pool2', 2) 34 | # 56 35 | .Conv2D('conv3_1', 256) 36 | .Conv2D('conv3_2', 256) 37 | .Conv2D('conv3_3', 256) 38 | .MaxPooling('pool3', 2) 39 | # 28 40 | .Conv2D('conv4_1', 512) 41 | .Conv2D('conv4_2', 512) 42 | .Conv2D('conv4_3', 512) 43 | .MaxPooling('pool4', 2) 44 | # 14 45 | .Conv2D('conv5_1', 512) 46 | .Conv2D('conv5_2', 512) 47 | .Conv2D('conv5_3', 512) 48 | .MaxPooling('pool5', 2) 49 | # 7 50 | .FullyConnected('fc6', 4096, activation=tf.nn.relu) 51 | .FullyConnected('fc7', 4096, activation=tf.nn.relu) 52 | .FullyConnected('fc8', 1000, activation=tf.identity)()) 53 | 54 | cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label) 55 | cost = tf.reduce_mean(cost, name='cost') 56 | return cost 57 | 58 | def optimizer(self): 59 | return tf.train.RMSPropOptimizer(1e-3, epsilon=1e-8) 60 | 61 | 62 | def get_data(): 63 | X_train = np.random.random((BATCH, 3, 224, 224)).astype('float32') 64 | Y_train = np.random.random((BATCH,)).astype('int32') 65 | 66 | def gen(): 67 | while True: 68 | yield [X_train, Y_train] 69 | return DataFromGenerator(gen) 70 | 71 | 72 | if __name__ == '__main__': 73 | dataset_train = get_data() 74 | config = TrainConfig( 75 | model=Model(), 76 | data=StagingInput(QueueInput(dataset_train)), 77 | callbacks=[], 78 | extra_callbacks=[ProgressBar(['cost'])], 79 | max_epoch=200, 80 | steps_per_epoch=50, 81 | ) 82 | if NUM_GPU == 1: 83 | trainer = SimpleTrainer() 84 | else: 85 | trainer = SyncMultiGPUTrainerReplicated(NUM_GPU, mode='nccl') 86 | launch_train_with_config(config, trainer) 87 | -------------------------------------------------------------------------------- /other-wrappers/tflearn.cifar10.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ Convolutional network applied to CIFAR-10 dataset classification task. 5 | 6 | References: 7 | Learning Multiple Layers of Features from Tiny Images, A. Krizhevsky, 2009. 
8 | 9 | Links: 10 | [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html) 11 | 12 | """ 13 | from __future__ import division, print_function, absolute_import 14 | 15 | import tflearn 16 | from tflearn.data_utils import shuffle, to_categorical 17 | from tflearn.layers.core import input_data, dropout, fully_connected 18 | from tflearn.layers.conv import conv_2d, max_pool_2d 19 | from tflearn.layers.estimator import regression 20 | from tflearn.data_preprocessing import ImagePreprocessing 21 | from tflearn.data_augmentation import ImageAugmentation 22 | 23 | # Data loading and preprocessing 24 | from tflearn.datasets import cifar10 25 | (X, Y), (X_test, Y_test) = cifar10.load_data() 26 | X, Y = shuffle(X, Y) 27 | Y = to_categorical(Y, 10) 28 | Y_test = to_categorical(Y_test, 10) 29 | 30 | # Real-time data preprocessing 31 | img_prep = ImagePreprocessing() 32 | 33 | # Real-time data augmentation 34 | img_aug = ImageAugmentation() 35 | 36 | # Convolutional network building 37 | network = input_data(shape=[None, 32, 32, 3], 38 | data_preprocessing=img_prep, 39 | data_augmentation=img_aug) 40 | network = conv_2d(network, 32, 3, activation='relu') 41 | network = conv_2d(network, 32, 3, activation='relu', padding='valid') 42 | network = max_pool_2d(network, 2) 43 | network = dropout(network, 0.5) 44 | 45 | network = conv_2d(network, 64, 3, activation='relu') 46 | network = conv_2d(network, 64, 3, activation='relu', padding='valid') 47 | network = max_pool_2d(network, 2) 48 | network = dropout(network, 0.5) 49 | network = fully_connected(network, 512, activation='relu') 50 | network = dropout(network, 0.5) 51 | network = fully_connected(network, 10, activation='softmax') 52 | network = regression(network, optimizer='rmsprop', 53 | loss='categorical_crossentropy', 54 | learning_rate=0.0001) 55 | 56 | # Train using classifier 57 | model = tflearn.DNN(network, tensorboard_verbose=0) 58 | model.fit(X, Y, n_epoch=50, shuffle=True, 59 | show_metric=True, batch_size=32, run_id='cifar10_cnn') 60 | -------------------------------------------------------------------------------- /other-wrappers/tflearn.vgg.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | from __future__ import division, print_function, absolute_import 5 | 6 | import tflearn 7 | from tflearn.data_utils import to_categorical 8 | from tflearn.layers.core import input_data, dropout, fully_connected 9 | import numpy as np 10 | from tflearn.layers.conv import conv_2d, max_pool_2d 11 | from tflearn.layers.estimator import regression 12 | 13 | batch_size = 64 14 | img_rows, img_cols = 224, 224 15 | # Data loading and preprocessing 16 | X = np.random.random((batch_size, img_rows, img_cols, 3)) 17 | Y = np.random.random((batch_size,)).astype('int32') 18 | Y = to_categorical(Y, 1000) 19 | 20 | # Convolutional network building 21 | network = input_data(shape=[None, img_rows, img_cols, 3]) 22 | network = conv_2d(network, 64, 3, activation='relu') 23 | network = conv_2d(network, 64, 3, activation='relu') 24 | network = max_pool_2d(network, 2) 25 | 26 | network = conv_2d(network, 128, 3, activation='relu') 27 | network = conv_2d(network, 128, 3, activation='relu') 28 | network = max_pool_2d(network, 2) 29 | 30 | network = conv_2d(network, 256, 3, activation='relu') 31 | network = conv_2d(network, 256, 3, activation='relu') 32 | network = conv_2d(network, 256, 3, activation='relu') 33 | network = max_pool_2d(network, 2) 34 | 35 | network = 
conv_2d(network, 512, 3, activation='relu') 36 | network = conv_2d(network, 512, 3, activation='relu') 37 | network = conv_2d(network, 512, 3, activation='relu') 38 | network = max_pool_2d(network, 2) 39 | 40 | network = conv_2d(network, 512, 3, activation='relu') 41 | network = conv_2d(network, 512, 3, activation='relu') 42 | network = conv_2d(network, 512, 3, activation='relu') 43 | network = max_pool_2d(network, 2) 44 | 45 | network = fully_connected(network, 4096, activation='relu') 46 | network = dropout(network, 0.5) 47 | network = fully_connected(network, 4096, activation='relu') 48 | network = dropout(network, 0.5) 49 | network = fully_connected(network, 1000, activation='softmax') 50 | network = regression(network, optimizer='rmsprop', 51 | loss='categorical_crossentropy', 52 | learning_rate=0.001) 53 | 54 | # Train using classifier 55 | model = tflearn.DNN(network, tensorboard_verbose=0) 56 | import time 57 | while True: 58 | start = time.time() 59 | model.fit(X, Y, n_epoch=50, snapshot_step=99999) 60 | print("Time:", time.time() - start) 61 | -------------------------------------------------------------------------------- /profile-import/README.md: -------------------------------------------------------------------------------- 1 | 2 | This small script shows that the import time of tensorpack is negligible compared to tensorflow. 3 | -------------------------------------------------------------------------------- /profile-import/import_profiler.py: -------------------------------------------------------------------------------- 1 | import time 2 | 3 | 4 | class ImportInfo(object): 5 | def __init__(self, name, context_name, counter): 6 | self.name = name 7 | self.context_name = context_name 8 | self._counter = counter 9 | self._depth = 0 10 | self._start = time.time() 11 | 12 | self.elapsed = None 13 | 14 | def done(self): 15 | self.elapsed = time.time() - self._start 16 | 17 | @property 18 | def _key(self): 19 | return self.name, self.context_name, self._counter 20 | 21 | def __repr__(self): 22 | return "ImportInfo({!r}, {!r}, {!r})".format(*self._key) 23 | 24 | def __hash__(self): 25 | return hash(self._key) 26 | 27 | def __eq__(self, other): 28 | if isinstance(other, ImportInfo): 29 | return other._key == self._key 30 | return NotImplemented 31 | 32 | def __ne__(self, other): 33 | return not self == other 34 | 35 | 36 | class ImportStack(object): 37 | def __init__(self): 38 | self._current_stack = [] 39 | self._full_stack = {} 40 | self._counter = 0 41 | 42 | def push(self, name, context_name): 43 | info = ImportInfo(name, context_name, self._counter) 44 | self._counter += 1 45 | 46 | if len(self._current_stack) > 0: 47 | parent = self._current_stack[-1] 48 | if parent not in self._full_stack: 49 | self._full_stack[parent] = [] 50 | self._full_stack[parent].append(info) 51 | self._current_stack.append(info) 52 | 53 | info._depth = len(self._current_stack) - 1 54 | 55 | return info 56 | 57 | def pop(self, import_info): 58 | top = self._current_stack.pop() 59 | assert top is import_info 60 | top.done() 61 | 62 | 63 | def compute_intime(parent, full_stack, ordered_visited, visited, depth=0): 64 | if parent in visited: 65 | return 66 | 67 | cumtime = intime = parent.elapsed 68 | visited[parent] = [cumtime, parent.name, parent.context_name, depth] 69 | ordered_visited.append(parent) 70 | 71 | for child in full_stack.get(parent, []): 72 | intime -= child.elapsed 73 | compute_intime(child, full_stack, ordered_visited, visited, depth + 1) 74 | 75 | visited[parent].append(intime) 76 | 
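# A worked example of the two quantities computed above (numbers are
# illustrative only): if `import pkg` takes 10 ms of wall time in total, and
# while running it triggers `import pkg.sub` which takes 6 ms, then for `pkg`
# the cumulative time (cumtime) is 10 ms while the in-module time (intime) is
# 10 - 6 = 4 ms. `ImportProfilerContext.print_info` below reports both values
# in milliseconds.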
77 | 78 | class ImportProfilerContext(object): 79 | def __init__(self): 80 | self._original_importer = __builtins__["__import__"] 81 | self._import_stack = ImportStack() 82 | 83 | def enable(self): 84 | __builtins__["__import__"] = self._profiled_import 85 | 86 | def disable(self): 87 | __builtins__["__import__"] = self._original_importer 88 | 89 | def print_info(self, threshold=1.): 90 | """ Print profiler results. 91 | 92 | Parameters 93 | ---------- 94 | threshold : float 95 | import statements taking less than threshold (in ms) will not be 96 | displayed. 97 | """ 98 | full_stack = self._import_stack._full_stack 99 | 100 | keys = sorted(full_stack.keys(), key=lambda p: p._counter) 101 | visited = {} 102 | ordered_visited = [] 103 | 104 | for key in keys: 105 | compute_intime(key, full_stack, ordered_visited, visited) 106 | 107 | lines = [] 108 | for k in ordered_visited: 109 | node = visited[k] 110 | cumtime = node[0] * 1000 111 | name = node[1] 112 | level = node[3] 113 | intime = node[-1] * 1000 114 | if cumtime > threshold and level < 6: 115 | lines.append(( 116 | "{:.1f}".format(cumtime), 117 | "{:.1f}".format(intime), 118 | "+" * level + name, 119 | )) 120 | 121 | # Import here to avoid messing with the profile 122 | import tabulate 123 | 124 | print( 125 | tabulate.tabulate( 126 | lines, headers=("cumtime (ms)", "intime (ms)", "name"), tablefmt="plain") 127 | ) 128 | 129 | # Protocol implementations 130 | def __enter__(self): 131 | self.enable() 132 | return self 133 | 134 | def __exit__(self, *a, **kw): 135 | self.disable() 136 | 137 | def _profiled_import(self, name, globals=None, locals=None, fromlist=None, 138 | level=0, *a, **kw): 139 | if globals is None: 140 | context_name = None 141 | else: 142 | context_name = globals.get("__name__") 143 | if context_name is None: 144 | context_name = globals.get("__file__") 145 | 146 | info = self._import_stack.push(name, context_name) 147 | try: 148 | return self._original_importer(name, globals, locals, fromlist, level, *a, **kw) 149 | finally: 150 | self._import_stack.pop(info) 151 | 152 | 153 | def profile_import(): 154 | return ImportProfilerContext() 155 | -------------------------------------------------------------------------------- /profile-import/profile-import.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # flake8: noqa 3 | # -*- coding: utf-8 -*- 4 | # File: profile-import.py 5 | 6 | import sys 7 | from import_profiler import profile_import 8 | 9 | if __name__ == '__main__': 10 | task = sys.argv[1] 11 | 12 | if task == 'contrib': 13 | import tensorflow 14 | with profile_import() as context: 15 | import tensorflow.contrib 16 | context.print_info(threshold=5) 17 | elif task == 'tensorpack': 18 | import tensorflow.contrib.framework 19 | import cv2 20 | with profile_import() as context: 21 | import tensorpack 22 | context.print_info(threshold=5) 23 | elif task == 'timing': 24 | import tensorflow.contrib.framework 25 | import cv2 26 | import time 27 | s = time.time() 28 | import tensorpack 29 | print(time.time() - s) 30 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [flake8] 2 | max-line-length = 120 3 | ignore = E741,E742,E743,F405,F403,E402,E731 4 | exclude = .git, 5 | ResNet-MultiGPU/tfbench 6 | --------------------------------------------------------------------------------