├── .github
│   └── ISSUE_TEMPLATE.md
├── .gitignore
├── Cifar10-fast
│   ├── README.md
│   └── cifar10-fast.py
├── DCGAN
│   └── README.md
├── ImageNet
│   ├── README.md
│   ├── augmentors.py
│   ├── benchmark-dataflow.py
│   ├── benchmark-opencv-resize.py
│   ├── benchmark-tfdata.py
│   ├── dump-lmdb.py
│   └── symbolic_imagenet.py
├── LICENSE
├── MaskRCNN
│   ├── README.md
│   └── maskrcnn.patch
├── README.md
├── ResNet-Horovod
│   ├── README.md
│   ├── imagenet-resnet-horovod.py
│   ├── imagenet_utils.py
│   ├── resnet_model.py
│   ├── serve-data.py
│   └── slurm.script
├── ResNet-MultiGPU
│   ├── README.md
│   ├── resnet-multigpu.py
│   └── tfbench
│       ├── __init__.py
│       ├── convnet_builder.py
│       ├── model.py
│       ├── model_config.py
│       └── resnet_model.py
├── other-wrappers
│   ├── README.md
│   ├── keras.alexnet.py
│   ├── keras.cifar10.py
│   ├── keras.resnet.py
│   ├── keras.vgg.py
│   ├── tensorpack.alexnet.py
│   ├── tensorpack.cifar10.py
│   ├── tensorpack.resnet.py
│   ├── tensorpack.vgg.py
│   ├── tflearn.cifar10.py
│   └── tflearn.vgg.py
├── profile-import
│   ├── README.md
│   ├── import_profiler.py
│   └── profile-import.py
└── tox.ini

/.github/ISSUE_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | If you're asking about an unexpected problem whose root cause you do not know,
2 | please include:
3 | 
4 | ### 1. What you did:
5 | 
6 | (1) **The command you ran:**
7 | 
8 | (2) **Have you made any changes to the code? Paste `git status; git diff` here:**
9 | 
10 | Please try to provide enough information to let others __reproduce__ your issue.
11 | Without reproducing the issue, we may not be able to investigate it.
12 | 
13 | ### 2. What you observed:
14 | 
15 | (1) **Include the ENTIRE logs here:**
16 | 
17 | It's always better to copy-paste what you observed instead of describing it.
18 | 
19 | It's always better to paste **as much as possible**, although sometimes a partial log is OK.
20 | 
21 | Tensorpack typically saves stdout to its training log.
22 | If stderr is relevant, you can run a command with `CMD 2>&1 | tee logs.txt`
23 | to save both stdout and stderr to one file.
24 | 
25 | (2) **Other observations, if any:**
26 | For example, CPU/GPU utilization, output images, tensorboard curves, if relevant to your issue.
27 | 
28 | ### 3. What you expected, if not obvious.
29 | 
30 | If you expect a certain accuracy, we can help only in one of these two conditions:
31 | (1) You're unable to reproduce the accuracy documented in tensorpack examples.
32 | (2) It appears to be a tensorpack bug.
33 | 
34 | Otherwise, how to train a model to a certain accuracy is a machine learning question.
35 | We do not answer machine learning questions, and it is your responsibility to
36 | figure out how to make your models more accurate.
37 | 
38 | ### 4. Your environment:
39 | + Python version:
40 | + TF version: `python -c 'import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)'`.
41 | + Tensorpack version: `python -c 'import tensorpack; print(tensorpack.__version__);'`.
42 |   You can install Tensorpack master by `pip install -U git+https://github.com/ppwwyyxx/tensorpack.git`
43 |   and see if your issue is already solved.
44 | + If you're not using tensorpack under a normal command line shell (e.g.,
45 |   using an IDE or jupyter notebook), please retry under a normal command line shell.
46 | + Hardware information, e.g. number of GPUs used.
47 | 
48 | You may often want to provide extra information related to your issue, but
49 | at the minimum, please try to provide the above information __accurately__ to save effort in the investigation.
50 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | train_log
3 | __pycache__
4 | *.pyc
5 | *.so
6 | cifar-10-batches-py
--------------------------------------------------------------------------------
/Cifar10-fast/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # Train to 94% accuracy on Cifar10 within a minute
3 | 
4 | This script is able to train to 94% accuracy (average over multiple runs)
5 | on Cifar10 within a minute, when run with:
6 | 
7 | * 1 V100 GPU
8 | * TensorFlow 1.14
9 | * Tensorpack @047579df
10 | * CUDA 10, CuDNN 7.6.2
11 | 
12 | The script mostly follows the [cifar10-fast repo](https://github.com/davidcpage/cifar10-fast)
13 | with small modifications on architecture.
14 | 
15 | This sort of "competition" doesn't really involve any innovation, since it's
16 | mainly about recipe tuning and overfitting the test accuracy.
17 | But since someone has already tuned it, it's an interesting exercise to follow.
18 | 
19 | ## To Run:
20 | ```
21 | ./cifar10-fast.py --num-runs 10
22 | ```
23 | 
24 | Time: it takes about 2.2s per epoch over the 24-epoch training.
25 | The first epoch is slower because of CuDNN warmup and XLA compilation.
26 | 
27 | Accuracy: it prints the test accuracy after every run finishes, and the average accuracy in the end.
--------------------------------------------------------------------------------
/Cifar10-fast/cifar10-fast.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | 
3 | import argparse
4 | import numpy as np
5 | import os
6 | import multiprocessing as mp
7 | import tensorflow as tf
8 | 
9 | from tensorpack import *
10 | from tensorpack.utils import logger
11 | from tensorpack.tfutils.tower import TowerFunc
12 | from tensorpack.tfutils.varreplace import custom_getter_scope
13 | from tensorpack.dataflow import dataset
14 | 
15 | 
16 | BATCH = 512
17 | STEPS_PER_EPOCH = 50000 // 512 + 1
18 | TOTAL_EPOCH = 24
19 | WARMUP_EPOCH = 5.0 / 24 * TOTAL_EPOCH
20 | USE_FP16 = True
21 | DATA_FORMAT = "NCHW"
22 | 
23 | 
24 | def get_inputs(batch):
25 |     return [tf.TensorSpec(
26 |         (batch, 3, 32, 32) if DATA_FORMAT == "NCHW" else (batch, 32, 32, 3),
27 |         tf.float32, 'input'),
28 |         tf.TensorSpec((batch, 10), tf.float32, 'label')]
29 | 
30 | 
31 | def build_graph(image, label):
32 |     if USE_FP16:
33 |         image = tf.cast(image, tf.float16)
34 | 
35 |     def activation(x):
36 |         return tf.nn.leaky_relu(x, alpha=0.1)
37 | 
38 |     def residual(name, x, chan):
39 |         with tf.variable_scope(name):
40 |             x = Conv2D('res1', x, chan, 3)
41 |             x = BatchNorm('bn1', x)
42 |             x = activation(x)
43 |             x = Conv2D('res2', x, chan, 3)
44 |             x = BatchNorm('bn2', x)
45 |             x = activation(x)
46 |             return x
47 | 
48 |     def fp16_getter(getter, *args, **kwargs):
49 |         name = args[0] if len(args) else kwargs['name']
50 |         if not USE_FP16 or (not name.endswith('/W') and not name.endswith('/b')):
51 |             # ignore BN's gamma and beta
52 |             return getter(*args, **kwargs)
53 |         else:
54 |             if kwargs['dtype'] == tf.float16:
55 |                 kwargs['dtype'] = tf.float32
56 |                 ret = getter(*args, **kwargs)
57 |                 return tf.cast(ret, tf.float16)
58 |             else:
59 |                 return getter(*args, **kwargs)
60 | 
61 |     with custom_getter_scope(fp16_getter), \
62 |             argscope(Conv2D, activation=tf.identity, use_bias=False), \
63 |             argscope([Conv2D, MaxPooling, BatchNorm], data_format=DATA_FORMAT), \
64 |             argscope(BatchNorm,
momentum=0.8): 65 | 66 | with tf.variable_scope('prep'): 67 | l = Conv2D('conv', image, 64, 3) 68 | l = BatchNorm('bn', l) 69 | l = activation(l) 70 | 71 | with tf.variable_scope("layer1"): 72 | l = Conv2D('conv', l, 128, 3) 73 | l = MaxPooling('pool', l, 2) 74 | l = BatchNorm('bn', l) 75 | l = activation(l) 76 | l = l + residual('res', l, 128) 77 | 78 | with tf.variable_scope("layer2"): 79 | l = Conv2D('conv', l, 256, 3) 80 | l = MaxPooling('pool', l, 2) 81 | l = BatchNorm('bn', l) 82 | l = activation(l) 83 | 84 | with tf.variable_scope("layer3"): 85 | l = Conv2D('conv', l, 512, 3) 86 | l = MaxPooling('pool', l, 2) 87 | l = BatchNorm('bn', l) 88 | l = activation(l) 89 | l = l + residual('res', l, 512) 90 | 91 | l = tf.reduce_max(l, axis=[2, 3] if DATA_FORMAT == "NCHW" else [1, 2]) 92 | l = FullyConnected('fc', l, 10, use_bias=False) 93 | logits = tf.cast(l * 0.125, tf.float32, name='logits') 94 | 95 | cost = tf.nn.softmax_cross_entropy_with_logits(labels=label, logits=logits) 96 | cost = tf.reduce_sum(cost) 97 | wd_cost = regularize_cost('.*', l2_regularizer(5e-4 * BATCH), name='regularize_loss') 98 | 99 | correct = tf.equal(tf.argmax(logits, axis=1), tf.argmax(label, axis=1), name='correct') 100 | return tf.add_n([cost, wd_cost], name='cost') 101 | 102 | 103 | def get_data(train_or_test): 104 | isTrain = train_or_test == 'train' 105 | ds = dataset.Cifar10(train_or_test) 106 | 107 | cifar10_mean = np.asarray([0.4914, 0.4822, 0.4465], dtype="float32") * 255. 108 | cifar10_invstd = 1.0 / (np.asarray([0.2471, 0.2435, 0.2616], dtype="float32") * 255) 109 | 110 | if isTrain: 111 | augmentors = imgaug.AugmentorList([ 112 | imgaug.RandomCrop((32, 32)), 113 | imgaug.Flip(horiz=True), 114 | imgaug.RandomCutout(8, 8), 115 | ]) 116 | 117 | def mapf(dp): 118 | img, label = dp 119 | img = (img.astype("float32") - cifar10_mean) * cifar10_invstd 120 | 121 | if isTrain: 122 | img = np.pad(img, [(4, 4), (4, 4), (0, 0)], mode='reflect') 123 | img = augmentors.augment(img) 124 | 125 | onehot = np.zeros((10, ), dtype=np.float32) + 0.2 / 9 126 | onehot[label] = 0.8 127 | else: 128 | onehot = np.zeros((10, ), dtype=np.float32) 129 | onehot[label] = 1. 
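        # Label smoothing: during training the true class gets probability 0.8 and
        # the remaining 0.2 is spread evenly over the other 9 classes;
        # validation uses hard one-hot labels.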
130 | 131 | if DATA_FORMAT == "NCHW": 132 | img = img.transpose(2, 0, 1) 133 | return img, onehot 134 | 135 | if not isTrain: 136 | ds = MapData(ds, mapf) 137 | ds = BatchData(ds, BATCH, remainder=False) 138 | return ds 139 | 140 | ds = MultiProcessMapAndBatchDataZMQ(ds, 8, mapf, BATCH, buffer_size=20000) 141 | ds = RepeatedData(ds, -1) 142 | return ds 143 | 144 | def run_once(result_queue): 145 | tf.reset_default_graph() 146 | trainer = SimpleTrainer() 147 | trainer.XLA_COMPILE = True 148 | 149 | with tf.device('/gpu:0'): 150 | gs = tf.train.get_or_create_global_step() 151 | gs = tf.cast(gs, tf.float32) 152 | # 0.0 -> 0.4 in warmup 153 | # 0.4 -> 0.0 in the rest epochs 154 | lr = tf.where(tf.greater(gs, 5 * STEPS_PER_EPOCH), 155 | (TOTAL_EPOCH * STEPS_PER_EPOCH - gs) / ((TOTAL_EPOCH - WARMUP_EPOCH) * STEPS_PER_EPOCH) * 0.4 / BATCH, 156 | gs / (WARMUP_EPOCH * STEPS_PER_EPOCH) * 0.4 / BATCH 157 | ) 158 | 159 | trainer.setup_graph( 160 | get_inputs(BATCH), 161 | #StagingInput(QueueInput(get_data('train'), queue=tf.FIFOQueue(300, [tf.float32, tf.int64]))), 162 | #DummyConstantInput([x.shape for x in get_inputs()]), 163 | StagingInput(TFDatasetInput(get_data('train'))), 164 | build_graph, 165 | lambda: tf.train.MomentumOptimizer(lr, 0.9, use_nesterov=True) 166 | ) 167 | 168 | trainer.train_with_defaults( 169 | callbacks=[ 170 | PeriodicTrigger( 171 | InferenceRunner( 172 | get_data('test'), ClassificationError('correct', 'val_acc'), 173 | # We used static shape in training, in order to allow XLA 174 | # But we want dynamic batch size for inference, therefore 175 | # recreate a tower function with different input signature. 176 | tower_func=TowerFunc(build_graph, get_inputs(None)) 177 | ), every_k_epochs=TOTAL_EPOCH), 178 | RunUpdateOps(), 179 | ], 180 | extra_callbacks=[], # disable all default callbacks 181 | monitors=[ScalarPrinter()], # disable other default monitors 182 | steps_per_epoch=STEPS_PER_EPOCH, 183 | max_epoch=TOTAL_EPOCH 184 | ) 185 | result_queue.put(trainer.monitors.get_latest('val_acc')) 186 | 187 | 188 | if __name__ == '__main__': 189 | parser = argparse.ArgumentParser() 190 | parser.add_argument('--gpu', help='comma separated list of GPU(s) to use.') 191 | parser.add_argument('--num-runs', default=1, type=int) 192 | args = parser.parse_args() 193 | 194 | if args.gpu: 195 | os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu 196 | os.environ["TF_AUTOTUNE_THRESHOLD"] = '1' 197 | 198 | q = mp.Queue() 199 | if args.num_runs == 1: 200 | run_once(q) 201 | logger.info("Val Acc: " + str(q.get())) 202 | else: 203 | val_accs = [] 204 | for k in range(args.num_runs): 205 | proc = mp.Process(target=run_once, args=(q,)) 206 | proc.start() 207 | val_accs.append(q.get()) 208 | proc.join(timeout=5) 209 | proc.terminate() 210 | logger.info("Val Accs: " + str(val_accs)) 211 | logger.info("Mean Val Acc: " + str(np.mean(val_accs))) 212 | -------------------------------------------------------------------------------- /DCGAN/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## DCGAN 3 | Environment: TFv1.3.0-rc1-1302-g593dc8e. Tesla P100. 4 | 5 | Code: [DCGAN-tensorflow](https://github.com/carpedm20/DCGAN-tensorflow/) at commit b13830. 6 | 7 | * DCGAN-tensorflow: 8 | ``` 9 | python main.py --dataset celebA --train --crop 10 | ``` 11 | This command takes 0.36s per iteration, where each iteration is 1 update to D and 2 updates to G. 
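(At 0.36 s per iteration, that is roughly 2.8 such iterations per second.)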
12 | 
13 | * [tensorpack DCGAN examples](https://github.com/tensorpack/tensorpack/blob/master/examples/GAN/DCGAN.py):
14 | 
15 | Modify the code to use `SeparateGANTrainer(..., d_period=2)`, and run with:
16 | ```
17 | python DCGAN.py --data /path/to/img_align_celebA --crop-size 108 --batch 64
18 | ```
19 | 
20 | This script runs at 15.5 it/s, where every two iterations correspond to one iteration in DCGAN-tensorflow.
21 | Therefore this script is roughly 2.8x faster.
--------------------------------------------------------------------------------
/ImageNet/README.md:
--------------------------------------------------------------------------------
1 | Some scripts to test ImageNet reading speed.
2 | 
3 | + With `tensorpack.dataflow` (pure Python loader):
4 | 
5 | Augmentations=[GoogleNetResize, Lighting, Flip]: reaches 5.6k images/s on a DGX1.
6 | 
7 | ```
8 | python benchmark-dataflow.py /path/to/imagenet --batch 128 --name train --aug resizeAndLighting
9 | ```
10 | 
11 | + With `tf.data`:
12 | 
13 | Augmentations=[GoogleNetResize, Flip]: 11k images/s on a DGX1.
14 | As a reference, a DGX1 with 8 V100s can train ResNet-50 in fp32 at 2.6k images/s.
15 | ```
16 | python benchmark-tfdata.py /path/to/imagenet --name train --batch 128
17 | ```
--------------------------------------------------------------------------------
/ImageNet/augmentors.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | # File: augmentors.py
4 | 
5 | import numpy as np
6 | import cv2
7 | from tensorpack.dataflow import imgaug
8 | 
9 | 
10 | __all__ = ['fbresnet_augmentor', 'inference_augmentor',
11 |            'resizeAndLighting_augmentor', 'resizeOnly_augmentor']
12 | 
13 | 
14 | 
15 | def inference_augmentor():
16 |     return [
17 |         imgaug.ResizeShortestEdge(256, cv2.INTER_CUBIC),
18 |         imgaug.CenterCrop((224, 224))
19 |     ]
20 | 
21 | 
22 | def fbresnet_augmentor():
23 |     # assume BGR input
24 |     augmentors = [
25 |         imgaug.GoogleNetRandomCropAndResize(),
26 |         imgaug.RandomOrderAug(
27 |             [imgaug.BrightnessScale((0.6, 1.4), clip=False),
28 |              imgaug.Contrast((0.6, 1.4), clip=False),
29 |              imgaug.Saturation(0.4, rgb=False),
30 |              # rgb->bgr conversion for the constants copied from fb.resnet.torch
31 |              imgaug.Lighting(0.1,
32 |                              eigval=np.asarray(
33 |                                  [0.2175, 0.0188, 0.0045][::-1]) * 255.0,
34 |                              eigvec=np.array(
35 |                                  [[-0.5675, 0.7192, 0.4009],
36 |                                   [-0.5808, -0.0045, -0.8140],
37 |                                   [-0.5836, -0.6948, 0.4203]],
38 |                                  dtype='float32')[::-1, ::-1]
39 |                              )]),
40 |         imgaug.Flip(horiz=True),
41 |     ]
42 |     return augmentors
43 | 
44 | 
45 | def resizeAndLighting_augmentor():
46 |     # assume BGR input
47 |     augmentors = [
48 |         imgaug.GoogleNetRandomCropAndResize(),
49 |         imgaug.Lighting(0.1,
50 |                         eigval=np.asarray(
51 |                             [0.2175, 0.0188, 0.0045][::-1]) * 255.0,
52 |                         eigvec=np.array(
53 |                             [[-0.5675, 0.7192, 0.4009],
54 |                              [-0.5808, -0.0045, -0.8140],
55 |                              [-0.5836, -0.6948, 0.4203]],
56 |                             dtype='float32')[::-1, ::-1]),
57 |         imgaug.Flip(horiz=True),
58 |     ]
59 |     return augmentors
60 | 
61 | 
62 | def resizeOnly_augmentor():
63 |     # assume BGR input; resize and flip only, no lighting
64 |     augmentors = [
65 |         imgaug.GoogleNetRandomCropAndResize(),
66 |         imgaug.Flip(horiz=True),
67 |     ]
68 |     return augmentors
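

# Minimal usage sketch (the ImageNet path is a placeholder), mirroring how
# benchmark-dataflow.py consumes these augmentor lists:
#
#   from tensorpack.dataflow import dataset, AugmentImageComponent
#   ds = dataset.ILSVRC12('/path/to/imagenet', 'train', shuffle=True)
#   ds = AugmentImageComponent(ds, fbresnet_augmentor())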
--------------------------------------------------------------------------------
/ImageNet/benchmark-dataflow.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | # File: benchmark-dataflow.py
4 | 
5 | import argparse
6 | import cv2
7 | 
8 | from tensorpack import *
9 | from tensorpack.dataflow.imgaug import *
10 | from tensorpack.dataflow.parallel import PlasmaGetData, PlasmaPutData  # noqa
11 | from tensorpack.utils.serialize import loads
12 | 
13 | import augmentors
14 | 
15 | 
16 | def test_orig(dir, name, augs, batch):
17 |     ds = dataset.ILSVRC12(dir, name, shuffle=True)
18 |     ds = AugmentImageComponent(ds, augs)
19 | 
20 |     ds = BatchData(ds, batch)
21 |     # ds = PlasmaPutData(ds)
22 |     ds = MultiProcessRunnerZMQ(ds, 50, hwm=80)
23 |     # ds = PlasmaGetData(ds)
24 |     return ds
25 | 
26 | 
27 | def test_lmdb_train(db, augs, batch):
28 |     ds = LMDBData(db, shuffle=False)
29 |     ds = LocallyShuffleData(ds, 50000)
30 |     ds = MultiProcessRunner(ds, 5000, 1)
31 |     ds = LMDBDataPoint(ds)
32 | 
33 |     def f(x):
34 |         return cv2.imdecode(x, cv2.IMREAD_COLOR)
35 |     ds = MapDataComponent(ds, f, 0)
36 |     ds = AugmentImageComponent(ds, augs)
37 | 
38 |     ds = BatchData(ds, batch, use_list=True)
39 |     # ds = PlasmaPutData(ds)
40 |     ds = MultiProcessRunnerZMQ(ds, 40, hwm=80)
41 |     # ds = PlasmaGetData(ds)
42 |     return ds
43 | 
44 | 
45 | def test_lmdb_inference(db, augs, batch):
46 |     ds = LMDBData(db, shuffle=False)
47 |     # ds = LocallyShuffleData(ds, 50000)
48 | 
49 |     augs = AugmentorList(augs)
50 | 
51 |     def mapper(data):
52 |         im, label = loads(data[1])
53 |         im = cv2.imdecode(im, cv2.IMREAD_COLOR)
54 |         im = augs.augment(im)
55 |         return im, label
56 | 
57 |     ds = MultiProcessMapData(ds, 40, mapper,
58 |                              buffer_size=200)
59 |     # ds = MultiThreadMapData(ds, 40, mapper, buffer_size=2000)
60 | 
61 |     ds = BatchData(ds, batch)
62 |     ds = MultiProcessRunnerZMQ(ds, 1)
63 |     return ds
64 | 
65 | 
66 | def test_inference(dir, name, augs, batch=128):
67 |     ds = dataset.ILSVRC12Files(dir, name, shuffle=False, dir_structure='train')
68 | 
69 |     aug = imgaug.AugmentorList(augs)
70 | 
71 |     def mapf(dp):
72 |         fname, cls = dp
73 |         im = cv2.imread(fname, cv2.IMREAD_COLOR)
74 |         im = aug.augment(im)
75 |         return im, cls
76 |     ds = MultiThreadMapData(ds, 30, mapf, buffer_size=2000, strict=True)
77 |     ds = BatchData(ds, batch)
78 |     ds = MultiProcessRunnerZMQ(ds, 1)
79 |     return ds
80 | 
81 | 
82 | if __name__ == '__main__':
83 |     available_augmentors = [
84 |         k[:-len("_augmentor")]
85 |         for k in augmentors.__all__ if k.endswith('_augmentor')]
86 |     parser = argparse.ArgumentParser()
87 |     parser.add_argument('data', help='file or directory of dataset')
88 |     parser.add_argument('--batch', type=int, default=64)
89 |     parser.add_argument('--name', choices=['train', 'val'], default='train')
90 |     parser.add_argument('--aug', choices=available_augmentors, required=True)
91 |     args = parser.parse_args()
92 | 
93 |     augs = getattr(augmentors, args.aug + '_augmentor')()
94 | 
95 |     if args.data.endswith('lmdb'):
96 |         if args.name == 'train':
97 |             ds = test_lmdb_train(args.data, augs, args.batch)
98 |         else:
99 |             ds = test_lmdb_inference(args.data, augs, args.batch)
100 |     else:
101 |         if args.name == 'train':
102 |             ds = test_orig(args.data, args.name, augs, args.batch)
103 |         else:
104 |             ds = test_inference(args.data, args.name, augs, args.batch)
105 |     TestDataSpeed(ds, 500000, warmup=100).start()
--------------------------------------------------------------------------------
/ImageNet/benchmark-opencv-resize.py:
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: benchmark-opencv-resize.py 4 | 5 | 6 | import cv2 7 | import time 8 | import numpy as np 9 | 10 | """ 11 | Some prebuilt opencv is much slower than others. 12 | You should check with this script and make sure it prints < 1s. 13 | 14 | 15 | On E5-2680v3, archlinux, this script prints: 16 | 0.61s for system opencv 3.4.0-2. 17 | >5 s for anaconda opencv 3.3.1 py36h6cbbc71_1. 18 | 19 | On E5-2650v4, this script prints: 20 | 0.6s for opencv built locally with -DWITH_OPENMP=OFF 21 | 0.6s for opencv from `pip install opencv-python`. 22 | 1.3s for opencv built locally with -DWITH_OPENMP=ON 23 | 2s for opencv from `conda install`. 24 | """ 25 | 26 | 27 | img = (np.random.rand(256, 256, 3) * 255).astype('uint8') 28 | 29 | start = time.time() 30 | for k in range(1000): 31 | out = cv2.resize(img, (384, 384)) 32 | print(time.time() - start) 33 | -------------------------------------------------------------------------------- /ImageNet/benchmark-tfdata.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: benchmark-tfdata.py 4 | 5 | import tqdm 6 | import argparse 7 | import tensorflow as tf 8 | from tensorpack.tfutils.common import get_default_sess_config 9 | 10 | from symbolic_imagenet import get_imglist, build_pipeline 11 | 12 | 13 | def benchmark_ds(ds, count, warmup=200): 14 | itr = ds.make_initializable_iterator() 15 | dp = itr.get_next() 16 | dpop = tf.group(*dp) 17 | with tf.Session(config=get_default_sess_config()) as sess: 18 | 19 | sess.run(itr.initializer) 20 | for _ in tqdm.trange(warmup): 21 | sess.run(dpop) 22 | for _ in tqdm.trange(count, smoothing=0.1): 23 | sess.run(dpop) 24 | 25 | 26 | if __name__ == '__main__': 27 | parser = argparse.ArgumentParser() 28 | parser.add_argument('data', help='directory to imagenet') 29 | parser.add_argument('--name', choices=['train', 'val'], default='train') 30 | parser.add_argument('--batch', type=int, default=128) 31 | parser.add_argument('--parallel', type=int, default=40) 32 | args = parser.parse_args() 33 | 34 | imglist = get_imglist(args.data, args.name) 35 | print("Number of Images: {}".format(len(imglist))) 36 | 37 | with tf.device('/cpu:0'): 38 | data = build_pipeline( 39 | imglist, args.name == 'train', 40 | args.batch, args.parallel) 41 | if args.name != 'train': 42 | data = data.repeat() # for benchmark 43 | benchmark_ds(data, 100000) 44 | -------------------------------------------------------------------------------- /ImageNet/dump-lmdb.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: dump-lmdb.py 4 | 5 | import numpy as np 6 | import cv2 7 | import os 8 | import argparse 9 | 10 | from tensorpack.dataflow import * 11 | 12 | 13 | class RawILSVRC12(DataFlow): 14 | def __init__(self, dir, name): 15 | self.dir = os.path.join(dir, name) 16 | 17 | meta = dataset.ILSVRCMeta() 18 | self.imglist = meta.get_image_list( 19 | name, 20 | dataset.ILSVRCMeta.guess_dir_structure(self.dir)) 21 | np.random.shuffle(self.imglist) 22 | 23 | def get_data(self): 24 | for fname, label in self.imglist: 25 | fname = os.path.join(self.dir, fname) 26 | im = cv2.imread(fname) 27 | assert im is not None, fname 28 | with open(fname, 'rb') as f: 29 | jpeg = f.read() 30 | jpeg = np.asarray(bytearray(jpeg), dtype='uint8') 31 | assert 
len(jpeg) > 10 32 | yield [jpeg, label] 33 | 34 | def size(self): 35 | return len(self.imglist) 36 | 37 | 38 | if __name__ == '__main__': 39 | parser = argparse.ArgumentParser() 40 | parser.add_argument('--data', help='path to ILSVRC12 images') 41 | parser.add_argument('--name', choices=['train', 'val']) 42 | parser.add_argument('--output', required=True) 43 | args = parser.parse_args() 44 | assert args.output.endswith('.lmdb') 45 | 46 | ds = RawILSVRC12(args.data, args.name) 47 | ds = PrefetchDataZMQ(ds, 1) 48 | dftools.dump_dataflow_to_lmdb(ds, args.output) 49 | -------------------------------------------------------------------------------- /ImageNet/symbolic_imagenet.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: symbolic_imagenet.py 4 | 5 | import os 6 | import tensorflow as tf 7 | import numpy as np 8 | from tensorpack.dataflow import dataset 9 | from tensorpack.utils import logger 10 | 11 | __all__ = ['get_imglist', 'build_pipeline', 'lighting'] 12 | 13 | 14 | def get_imglist(dir, name): 15 | """ 16 | Args: 17 | dir(str): directory which contains name 18 | name(str): 'train' or 'val' 19 | 20 | Returns: 21 | [(full filename, label)] 22 | """ 23 | dir = os.path.join(dir, name) 24 | meta = dataset.ILSVRCMeta() 25 | imglist = meta.get_image_list( 26 | name, 27 | dataset.ILSVRCMeta.guess_dir_structure(dir)) 28 | 29 | def _filter(fname): 30 | # png 31 | return 'n02105855_2933.JPEG' in fname 32 | 33 | ret = [] 34 | for fname, label in imglist: 35 | if _filter(fname): 36 | logger.info("Image {} was filtered out.".format(fname)) 37 | continue 38 | fname = os.path.join(dir, fname) 39 | ret.append((fname, label)) 40 | return ret 41 | 42 | 43 | def uint8_resize_bicubic(image, shape): 44 | ret = tf.image.resize_bicubic([image], shape) 45 | return tf.cast(tf.clip_by_value(ret, 0, 255), tf.uint8)[0] 46 | 47 | 48 | def resize_shortest_edge(image, image_shape, size): 49 | shape = tf.cast(image_shape, tf.float32) 50 | w_greater = tf.greater(image_shape[0], image_shape[1]) 51 | shape = tf.cond(w_greater, 52 | lambda: tf.cast([shape[0] / shape[1] * size, size], tf.int32), 53 | lambda: tf.cast([size, shape[1] / shape[0] * size], tf.int32)) 54 | 55 | return uint8_resize_bicubic(image, shape) 56 | 57 | 58 | def center_crop(image, size): 59 | image_height = tf.shape(image)[0] 60 | image_width = tf.shape(image)[1] 61 | 62 | offset_height = (image_height - size) // 2 63 | offset_width = (image_width - size) // 2 64 | image = tf.slice(image, [offset_height, offset_width, 0], [size, size, -1]) 65 | return image 66 | 67 | 68 | def lighting(image, std, eigval, eigvec): 69 | v = tf.random_uniform(shape=[3]) * std * eigval 70 | inc = tf.matmul(eigvec, tf.reshape(v, [3, 1])) 71 | image = tf.cast(tf.cast(image, tf.float32) + tf.reshape(inc, [3]), image.dtype) 72 | return image 73 | 74 | 75 | def training_mapper(filename, label): 76 | byte = tf.read_file(filename) 77 | 78 | jpeg_opt = {'fancy_upscaling': True, 'dct_method': 'INTEGER_ACCURATE'} 79 | jpeg_shape = tf.image.extract_jpeg_shape(byte) # hwc 80 | bbox_begin, bbox_size, distort_bbox = tf.image.sample_distorted_bounding_box( 81 | jpeg_shape, 82 | bounding_boxes=tf.zeros(shape=[0, 0, 4]), 83 | min_object_covered=0, 84 | aspect_ratio_range=[0.75, 1.33], 85 | area_range=[0.08, 1.0], 86 | max_attempts=10, 87 | use_image_if_no_bounding_boxes=True) 88 | 89 | is_bad = tf.reduce_sum(tf.cast(tf.equal(bbox_size, jpeg_shape), tf.int32)) >= 2 90 | 91 | def good(): 92 | 
offset_y, offset_x, _ = tf.unstack(bbox_begin) 93 | target_height, target_width, _ = tf.unstack(bbox_size) 94 | crop_window = tf.stack([offset_y, offset_x, target_height, target_width]) 95 | 96 | image = tf.image.decode_and_crop_jpeg( 97 | byte, crop_window, channels=3, **jpeg_opt) 98 | image = uint8_resize_bicubic(image, [224, 224]) 99 | return image 100 | 101 | def bad(): 102 | image = tf.image.decode_jpeg( 103 | tf.reshape(byte, shape=[]), 3, **jpeg_opt) 104 | image = resize_shortest_edge(image, jpeg_shape, 224) 105 | image = center_crop(image, 224) 106 | return image 107 | 108 | image = tf.cond(is_bad, bad, good) 109 | # TODO other imgproc 110 | # image = lighting(image, 0.1, 111 | # eigval=np.array([0.2175, 0.0188, 0.0045], dtype='float32') * 255.0, 112 | # eigvec=np.array([[-0.5675, 0.7192, 0.4009], 113 | # [-0.5808, -0.0045, -0.8140], 114 | # [-0.5836, -0.6948, 0.4203]], dtype='float32')) 115 | image = tf.image.random_flip_left_right(image) 116 | return image, label 117 | 118 | 119 | def validation_mapper(filename, label): 120 | byte = tf.read_file(filename) 121 | 122 | jpeg_opt = {'fancy_upscaling': True, 'dct_method': 'INTEGER_ACCURATE'} 123 | image = tf.image.decode_jpeg( 124 | tf.reshape(byte, shape=[]), 3, **jpeg_opt) 125 | image = resize_shortest_edge(image, tf.shape(image), 256) 126 | image = center_crop(image, 224) 127 | return image, label 128 | 129 | 130 | def build_pipeline(imglist, training, batch, parallel): 131 | """ 132 | Args: 133 | imglist (list): [(full filename, label)] 134 | training (bool): 135 | batch (int): 136 | parallel (int): 137 | 138 | If training, returns an infinite dataset. 139 | 140 | Note that it produces RGB images, not BGR. 141 | """ 142 | N = len(imglist) 143 | filenames = tf.constant([k[0] for k in imglist], name='filenames') 144 | labels = tf.constant([k[1] for k in imglist], dtype=tf.int32, name='labels') 145 | 146 | ds = tf.data.Dataset.from_tensor_slices((filenames, labels)) 147 | 148 | if training: 149 | ds = ds.shuffle(N, reshuffle_each_iteration=True).repeat() 150 | 151 | mapper = training_mapper if training else validation_mapper 152 | 153 | ds = ds.apply( 154 | tf.contrib.data.map_and_batch( 155 | mapper, 156 | batch_size=batch, 157 | num_parallel_batches=parallel)) 158 | ds = ds.prefetch(100) 159 | return ds 160 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | This is free and unencumbered software released into the public domain. 2 | 3 | Anyone is free to copy, modify, publish, use, compile, sell, or 4 | distribute this software, either in source code form or as a compiled 5 | binary, for any purpose, commercial or non-commercial, and by any 6 | means. 7 | 8 | In jurisdictions that recognize copyright laws, the author or authors 9 | of this software dedicate any and all copyright interest in the 10 | software to the public domain. We make this dedication for the benefit 11 | of the public at large and to the detriment of our heirs and 12 | successors. We intend this dedication to be an overt act of 13 | relinquishment in perpetuity of all present and future rights to this 14 | software under copyright law. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
19 | IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
20 | OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
21 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
22 | OTHER DEALINGS IN THE SOFTWARE.
23 | 
24 | For more information, please refer to <https://unlicense.org>
25 | 
--------------------------------------------------------------------------------
/MaskRCNN/README.md:
--------------------------------------------------------------------------------
1 | ## Mask R-CNN
2 | 
3 | This benchmarks tensorpack's Mask R-CNN implementation
4 | against the popular Matterport Mask R-CNN implementation.
5 | 
6 | ### Environment:
7 | 
8 | * TensorFlow 1.14 (6e0893c79) + PR30893
9 | * Python 3.7
10 | * CUDA 10.0, CuDNN 7.6.2
11 | * tensorpack 0.9.7.1 (a7f4094d)
12 | * keras 2.2.5
13 | * matterport/Mask_RCNN 3deaec5d
14 | * horovod 0.18.0
15 | * 8xV100s + 80xE5-2698 v4
16 | 
17 | ### Settings:
18 | * Use the standard hyperparameters used by [Detectron](https://github.com/facebookresearch/Detectron/),
19 |   except that the total batch size is set to 8.
20 | 
21 | * `export TF_CUDNN_USE_AUTOTUNE=0` to avoid CuDNN warmup time.
22 | 
23 | * Measure speed using "images per second", in the second or later epochs.
24 | 
25 | 
26 | ### [tensorpack FasterRCNN example](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN):
27 | 
28 | Using `TRAINER=replicated`, the speed is about 42 img/s:
29 | ```
30 | ./train.py --config DATA.BASEDIR=~/data/coco DATA.NUM_WORKERS=20 MODE_FPN=True --load ImageNet-R50-AlignPadding.npz
31 | ```
32 | 
33 | Using `TRAINER=horovod`, the speed is about 50 img/s:
34 | ```
35 | mpirun -np 8 ./train.py --config DATA.BASEDIR=~/data/coco MODE_FPN=True TRAINER=horovod --load ImageNet-R50-AlignPadding.npz
36 | ```
37 | 
38 | ### [matterport/Mask_RCNN](https://github.com/matterport/Mask_RCNN/):
39 | 
40 | Apply [maskrcnn.patch](maskrcnn.patch) to make it use the same hyperparameters.
41 | Then, run the command:
42 | 
43 | ```
44 | python coco.py train --dataset=~/data/coco/ --model=imagenet
45 | ```
46 | 
47 | It trains at 0.77 s/step, i.e. about 10 img/s.
48 | If using 2 images per GPU, it improves to 12 img/s.
49 | 
50 | 
51 | ### Note:
52 | 
53 | * Mask R-CNN is a complicated system and there could be many implementation differences.
54 |   The patch above only makes the two systems run roughly the same training.
55 | 
56 | * The training time of an R-CNN typically decreases slowly as training progresses.
57 |   In this experiment we only look at the training time of the first couple thousand iterations.
58 |   It cannot be extrapolated to compute the total training time of the model.
59 | 
60 | * Tensorpack's Mask R-CNN is not only fast, but also
61 |   [more accurate](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN#results).
--------------------------------------------------------------------------------
/MaskRCNN/maskrcnn.patch:
--------------------------------------------------------------------------------
1 | diff --git i/samples/coco/coco.py w/samples/coco/coco.py
2 | index 5d172b5..b0bba41 100644
3 | --- i/samples/coco/coco.py
4 | +++ w/samples/coco/coco.py
5 | @@ -78,10 +78,13 @@ class CocoConfig(Config):
6 |  
7 |      # We use a GPU with 12GB memory, which can fit two images.
8 |      # Adjust down if you use a smaller GPU.
9 | -    IMAGES_PER_GPU = 2
10 | +    IMAGES_PER_GPU = 1
11 |  
12 |      # Uncomment to train on 8 GPUs (default is 1)
13 | -    # GPU_COUNT = 8
14 | +    GPU_COUNT = 8
15 | +    BACKBONE = "resnet50"
16 | +    STEPS_PER_EPOCH = 200
17 | +    TRAIN_ROIS_PER_IMAGE = 512
18 |  
19 |      # Number of classes (including background)
20 |      NUM_CLASSES = 1 + 80  # COCO has 80 classes
21 | @@ -496,29 +499,10 @@ if __name__ == '__main__':
22 |          # *** This training schedule is an example. Update to your needs ***
23 |  
24 |          # Training - Stage 1
25 | -        print("Training network heads")
26 |          model.train(dataset_train, dataset_val,
27 |                      learning_rate=config.LEARNING_RATE,
28 |                      epochs=40,
29 | -                    layers='heads',
30 | -                    augmentation=augmentation)
31 | -
32 | -        # Training - Stage 2
33 | -        # Finetune layers from ResNet stage 4 and up
34 | -        print("Fine tune Resnet stage 4 and up")
35 | -        model.train(dataset_train, dataset_val,
36 | -                    learning_rate=config.LEARNING_RATE,
37 | -                    epochs=120,
38 | -                    layers='4+',
39 | -                    augmentation=augmentation)
40 | -
41 | -        # Training - Stage 3
42 | -        # Fine tune all layers
43 | -        print("Fine tune all layers")
44 | -        model.train(dataset_train, dataset_val,
45 | -                    learning_rate=config.LEARNING_RATE / 10,
46 | -                    epochs=160,
47 | -                    layers='all',
48 | +                    layers='3+',
49 |                      augmentation=augmentation)
50 |  
51 |      elif args.command == "evaluate":
52 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # tensorpack benchmarks
3 | 
4 | We use TensorFlow efficiently. Tensorpack is:
5 | 
6 | * [As fast as tensorflow/benchmarks in multi-GPU ResNet training](ResNet-MultiGPU/)
7 | * [__1.2x~5x__ faster than Keras & tflearn in common CNNs](other-wrappers/)
8 | * [Able to reproduce "ImageNet in one hour" with 256 GPUs](ResNet-Horovod/)
9 | * [Able to train Cifar10 to 94% accuracy within __a minute__](Cifar10-fast)
10 | * [__5x__ faster than matterport/Mask_RCNN](MaskRCNN/)
11 | * [2.8x faster than DCGAN-tensorflow](DCGAN/)
12 | 
13 | All of the above claims can be reproduced with the corresponding code.
--------------------------------------------------------------------------------
/ResNet-Horovod/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # Tensorpack + Horovod
3 | 
4 | Multi-GPU / distributed training on ImageNet, with TensorFlow + Tensorpack + Horovod.
5 | 
6 | It reproduces the settings in the paper
7 | + [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677)
8 | 
9 | The code is annotated with sentences from the paper.
10 | 
11 | Based on this baseline implementation, we implemented adversarial training and obtained ImageNet classifiers with state-of-the-art adversarial robustness. See our code release at [facebookresearch/ImageNet-Adversarial-Training](https://github.com/facebookresearch/ImageNet-Adversarial-Training/).
12 | 
13 | ## Dependencies:
14 | + TensorFlow>=1.5, tensorpack>=0.9.9.
15 | + [Horovod](https://github.com/uber/horovod) with NCCL support.
16 |   See [doc](https://github.com/uber/horovod/blob/master/docs/gpus.md) for its installation instructions.
17 | + [zmq_ops](https://github.com/tensorpack/zmq_ops): optional but recommended.
18 | + Prepare ImageNet data into [this structure](http://tensorpack.readthedocs.io/modules/dataflow.dataset.html#tensorpack.dataflow.dataset.ILSVRC12).
19 | 
20 | ## Run:
21 | ```bash
22 | # Single Machine, Multiple GPUs:
23 | # Run the following two commands together:
24 | $ ./serve-data.py --data ~/data/imagenet/ --batch 64
25 | $ mpirun -np 8 --output-filename test.log python3 ./imagenet-resnet-horovod.py -d 50 --data ~/data/imagenet/ --batch 64
26 | ```
27 | 
28 | ```bash
29 | # Multiple Machines with RoCE/IB:
30 | host1$ ./serve-data.py --data ~/data/imagenet/ --batch 64
31 | host2$ ./serve-data.py --data ~/data/imagenet/ --batch 64
32 | $ mpirun -np 16 -H host1:8,host2:8 --output-filename test.log \
33 |     -bind-to none -map-by slot -mca pml ob1 \
34 |     -x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO \
35 |     -x PATH -x PYTHONPATH -x LD_LIBRARY_PATH \
36 |     python3 ./imagenet-resnet-horovod.py -d 50 \
37 |     --data ~/data/imagenet/ --batch 64 --validation distributed
38 | ```
39 | 
40 | Notes:
41 | 1. MPI does not like fork(), so running `serve-data.py` inside MPI is not a good idea.
42 | 2. You may tune the best mca & NCCL options for your own systems.
43 |    See [horovod docs](https://github.com/uber/horovod/blob/master/docs/) for details.
44 |    Note that plain TCP connections will have much worse scaling efficiency.
45 | 3. To train on small datasets, __you don't need a separate data serving process or zmq ops__.
46 |    You can simply load data inside each training process with its own data loader.
47 |    The main motivation to use a separate data loader is to avoid fork() inside
48 |    MPI and to make it easier to benchmark.
49 | 4. You can pass `--no-zmq-ops` to both scripts, to use Python for communication instead of the faster zmq_ops.
50 | 5. If you're using slurm in a cluster, check out an example [sbatch script](slurm.script).
51 | 
52 | ## Performance Benchmark:
53 | ```bash
54 | # To benchmark data speed:
55 | $ ./serve-data.py --data ~/data/imagenet/ --batch 64 --benchmark
56 | # To benchmark training with fake data:
57 | # Run the training command with `--fake`
58 | ```
59 | 
60 | ## Distributed ResNet50 Results:
61 | 
62 | | devices   | batch per GPU | time [1](#ft1)   | top1 err [3](#ft3) |
63 | | -         | -             | -                | -                  |
64 | | 32 P100s  | 64            | 5h9min           | 23.73%             |
65 | | 128 P100s | 32            | 1h40min          | 23.62%             |
66 | | 128 P100s | 64            | 1h23min          | 23.97%             |
67 | | 256 P100s | 32            | 1h9min [2](#ft2) | 23.90%             |
68 | 
69 | 
70 | 1: Validation time excluded from total time. Time depends on your hardware.
71 | 
72 | 2: This corresponds to exactly the "1 hour" setting in the original paper.
73 | 
74 | 3: The final error typically has ±0.1 or more fluctuation according to the paper.
75 | 
76 | Although the code does not scale ideally to 32 machines, it does scale with 90+% efficiency on 2 or 4 machines.
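
As a worked example of the linear scaling rule in `imagenet-resnet-horovod.py`
(`BASE_LR = 0.1 * (total_batch // 256)`): the 256-GPU row above uses a total batch of
256 × 32 = 8192, hence a reference learning rate of 0.1 × 8192 / 256 = 3.2,
reached by linear warmup from 0.1 over the first 5 epochs.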
77 | 
--------------------------------------------------------------------------------
/ResNet-Horovod/imagenet-resnet-horovod.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | # File: imagenet-resnet-horovod.py
4 | 
5 | import argparse
6 | import sys
7 | import os
8 | import socket
9 | import numpy as np
10 | 
11 | import tensorflow as tf
12 | from tensorpack import *
13 | from tensorpack.tfutils import argscope, SmartInit
14 | 
15 | import horovod.tensorflow as hvd
16 | 
17 | from imagenet_utils import (
18 |     fbresnet_augmentor, get_val_dataflow, ImageNetModel, eval_classification)
19 | from resnet_model import (
20 |     resnet_group, resnet_bottleneck, resnet_backbone, Norm)
21 | 
22 | 
23 | class Model(ImageNetModel):
24 |     def __init__(self, depth, norm='BN'):
25 |         self.num_blocks = {
26 |             50: [3, 4, 6, 3],
27 |             101: [3, 4, 23, 3],
28 |             152: [3, 8, 36, 3],
29 |         }[depth]
30 |         self.norm = norm
31 | 
32 |     def get_logits(self, image):
33 |         with argscope([Conv2D, MaxPooling, GlobalAvgPooling, BatchNorm], data_format='NCHW'), \
34 |                 argscope(Norm, type=self.norm):
35 |             return resnet_backbone(image, self.num_blocks, resnet_group, resnet_bottleneck)
36 | 
37 | 
38 | class HorovodClassificationError(ClassificationError):
39 |     """
40 |     Like ClassificationError, it evaluates the total number of samples and the number of wrong samples.
41 |     In the end, total & count are aggregated across all workers by a horovod allreduce.
42 |     """
43 |     def _setup_graph(self):
44 |         self._placeholder = tf.placeholder(tf.float32, shape=[2], name='to_be_reduced')
45 |         self._reduced = hvd.allreduce(self._placeholder, average=False)
46 | 
47 |     def _after_inference(self):
48 |         tot = self.err_stat.total
49 |         cnt = self.err_stat.count
50 |         tot, cnt = self._reduced.eval(feed_dict={self._placeholder: [tot, cnt]})
51 |         return {self.summary_name: cnt * 1. / tot}
52 | 
53 | 
54 | def get_config(model, fake=False):
55 |     batch = args.batch
56 |     total_batch = batch * hvd.size()
57 | 
58 |     if fake:
59 |         data = FakeData(
60 |             [[args.batch, 224, 224, 3], [args.batch]], 1000,
61 |             random=False, dtype=['uint8', 'int32'])
62 |         data = StagingInput(QueueInput(data))
63 |         callbacks = []
64 |         steps_per_epoch = 50
65 |     else:
66 |         logger.info("#Tower: {}; Batch size per tower: {}".format(hvd.size(), batch))
67 |         zmq_addr = 'ipc://@imagenet-train-b{}'.format(batch)
68 |         if args.no_zmq_ops:
69 |             dataflow = RemoteDataZMQ(zmq_addr, hwm=150, bind=False)
70 |             data = QueueInput(dataflow)
71 |         else:
72 |             data = ZMQInput(zmq_addr, 30, bind=False)
73 |         data = StagingInput(data, nr_stage=1)
74 | 
75 |         steps_per_epoch = int(np.round(1281167 / total_batch))
76 | 
77 |     """
78 |     Sec 2.1: Linear Scaling Rule: When the minibatch size is multiplied by k, multiply the learning rate by k.
79 |     """
80 |     BASE_LR = 0.1 * (total_batch // 256)
81 |     logger.info("Base LR: {}".format(BASE_LR))
82 |     """
83 |     Sec 5.1:
84 |     We call this number (0.1 * kn / 256) the reference learning rate,
85 |     and reduce it by 1/10 at the 30-th, 60-th, and 80-th epoch
86 |     """
87 |     callbacks = [
88 |         ModelSaver(max_to_keep=100),
89 |         EstimatedTimeLeft(),
90 |         ScheduledHyperParamSetter(
91 |             'learning_rate', [(0, BASE_LR), (30, BASE_LR * 1e-1), (60, BASE_LR * 1e-2),
92 |                               (80, BASE_LR * 1e-3)]),
93 |     ]
94 |     if BASE_LR > 0.1:
95 |         """
96 |         Sec 2.2: In practice, with a large minibatch of size kn, we start from a learning rate of η and increment
97 |         it by a constant amount at each iteration such that it reaches η̂ = kη after 5 epochs.
98 |         After the warmup phase, we go back to the original learning rate schedule.
99 |         """
100 |         callbacks.append(
101 |             ScheduledHyperParamSetter(
102 |                 'learning_rate', [(0, 0.1), (5 * steps_per_epoch, BASE_LR)],
103 |                 interp='linear', step_based=True))
104 | 
105 |     if args.validation is not None:
106 |         # TODO For distributed training, you probably don't want every worker to wait for the master to run validation.
107 |         # Better to start a separate job, since the model is saved.
108 |         if args.validation == 'master' and hvd.rank() == 0:
109 |             # For reproducibility, do not use remote data for validation
110 |             dataset_val = get_val_dataflow(
111 |                 args.data, 64, fbresnet_augmentor(False))
112 |             infs = [ClassificationError('wrong-top1', 'val-error-top1'),
113 |                     ClassificationError('wrong-top5', 'val-error-top5')]
114 |             callbacks.append(InferenceRunner(QueueInput(dataset_val), infs))
115 |         # For simple validation tasks such as image classification, distributed validation is possible.
116 |         elif args.validation == 'distributed':
117 |             dataset_val = get_val_dataflow(
118 |                 args.data, 64, fbresnet_augmentor(False),
119 |                 num_splits=hvd.size(), split_index=hvd.rank())
120 |             infs = [HorovodClassificationError('wrong-top1', 'val-error-top1'),
121 |                     HorovodClassificationError('wrong-top5', 'val-error-top5')]
122 |             callbacks.append(
123 |                 InferenceRunner(QueueInput(dataset_val), infs).set_chief_only(False))
124 | 
125 |     return TrainConfig(
126 |         model=model,
127 |         data=data,
128 |         callbacks=callbacks,
129 |         steps_per_epoch=steps_per_epoch,
130 |         max_epoch=35 if args.fake else 95,
131 |     )
132 | 
133 | 
134 | if __name__ == '__main__':
135 |     parser = argparse.ArgumentParser()
136 |     parser.add_argument('--data', help='ILSVRC dataset dir')
137 |     parser.add_argument('--logdir', help='Directory for models and training stats.')
138 |     parser.add_argument('--load', help='load model')
139 |     parser.add_argument('--eval', action='store_true', help='run evaluation with --load instead of training.')
140 | 
141 |     parser.add_argument('--fake', help='use fake data to test or benchmark this model', action='store_true')
142 |     parser.add_argument('-d', '--depth', help='resnet depth',
143 |                         type=int, default=50, choices=[50, 101, 152])
144 |     parser.add_argument('--norm', choices=['BN', 'GN'], default='BN')
145 |     parser.add_argument('--accum-grad', type=int, default=1)
146 |     parser.add_argument('--weight-decay-norm', action='store_true',
147 |                         help="apply weight decay on normalization layers (gamma & beta). "
148 |                              "This is used in torch/pytorch, and slightly "
149 |                              "improves validation accuracy of large models.")
150 |     parser.add_argument('--validation', choices=['distributed', 'master'],
151 |                         help='Validation method. By default the script performs no validation.')
152 |     parser.add_argument('--no-zmq-ops', help='use pure python to send/receive data',
153 |                         action='store_true')
154 |     """
155 |     Sec 2.3: We keep the per-worker sample size n constant when we change the number of workers k.
156 |     In this work, we use n = 32 which has performed well for a wide range of datasets and networks.
157 | """ 158 | parser.add_argument('--batch', help='per-GPU batch size', default=32, type=int) 159 | args = parser.parse_args() 160 | 161 | model = Model(args.depth, args.norm) 162 | model.accum_grad = args.accum_grad 163 | if args.weight_decay_norm: 164 | model.weight_decay_pattern = ".*/W|.*/gamma|.*/beta" 165 | 166 | if args.eval: 167 | batch = 128 # something that can run on one gpu 168 | ds = get_val_dataflow(args.data, batch, fbresnet_augmentor(False)) 169 | eval_classification(model, SmartInit(args.load), ds) 170 | sys.exit() 171 | 172 | logger.info("Training on {}".format(socket.gethostname())) 173 | # Print some information for sanity check. 174 | os.system("nvidia-smi") 175 | assert args.load is None 176 | 177 | hvd.init() 178 | 179 | if args.logdir is None: 180 | args.logdir = os.path.join('train_log', 'Horovod-{}GPUs-{}Batch'.format(hvd.size(), args.batch)) 181 | 182 | if hvd.rank() == 0: 183 | logger.set_logger_dir(args.logdir, 'd') 184 | logger.info("Rank={}, Local Rank={}, Size={}".format(hvd.rank(), hvd.local_rank(), hvd.size())) 185 | 186 | """ 187 | Sec 3: Remark 3: Normalize the per-worker loss by 188 | total minibatch size kn, not per-worker size n. 189 | """ 190 | model.loss_scale = 1.0 / hvd.size() 191 | config = get_config(model, fake=args.fake) 192 | """ 193 | Sec 3: standard communication primitives like 194 | allreduce [11] perform summing, not averaging 195 | """ 196 | trainer = HorovodTrainer(average=False) 197 | launch_train_with_config(config, trainer) 198 | -------------------------------------------------------------------------------- /ResNet-Horovod/imagenet_utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: imagenet_utils.py 4 | 5 | """ 6 | This file is modified from 7 | https://github.com/tensorpack/tensorpack/blob/master/examples/ImageNetModels/imagenet_utils.py 8 | """ 9 | 10 | 11 | import multiprocessing as mp 12 | import numpy as np 13 | from abc import abstractmethod 14 | import cv2 15 | import tensorflow as tf 16 | 17 | from tensorpack import imgaug, dataset, ModelDesc 18 | from tensorpack.dataflow import ( 19 | BatchData, MultiThreadMapData, DataFromList, 20 | AugmentImageComponent, MultiProcessRunnerZMQ) 21 | from tensorpack.models import regularize_cost 22 | from tensorpack.predict import FeedfreePredictor, PredictConfig 23 | from tensorpack.tfutils.summary import add_moving_summary 24 | from tensorpack.tfutils.optimizer import AccumGradOptimizer 25 | from tensorpack.utils import logger 26 | from tensorpack.utils.stats import RatioCounter 27 | 28 | """ 29 | ====== DataFlow ======= 30 | """ 31 | 32 | 33 | def fbresnet_augmentor(isTrain): 34 | """ 35 | Augmentor used in fb.resnet.torch, for BGR images in range [0,255]. 36 | """ 37 | interpolation = cv2.INTER_LINEAR 38 | if isTrain: 39 | """ 40 | Sec 5.1: 41 | We use scale and aspect ratio data augmentation [35] as 42 | in [12]. The network input image is a 224×224 pixel random 43 | crop from an augmented image or its horizontal flip. 44 | """ 45 | augmentors = [ 46 | imgaug.GoogleNetRandomCropAndResize(interp=interpolation), 47 | # It's OK to remove the following augs if your CPU is not fast enough. 48 | # Removing brightness/contrast/saturation does not have a significant effect on accuracy. 49 | # Removing lighting leads to a tiny drop in accuracy. 
50 |             imgaug.ToFloat32(),
51 |             imgaug.RandomOrderAug(
52 |                 [imgaug.BrightnessScale((0.6, 1.4), clip=False),
53 |                  imgaug.Contrast((0.6, 1.4), rgb=False, clip=False),
54 |                  imgaug.Saturation(0.4, rgb=False, clip=False),
55 |                  # rgb-bgr conversion for the constants copied from fb.resnet.torch
56 |                  imgaug.Lighting(0.1,
57 |                                  eigval=np.asarray(
58 |                                      [0.2175, 0.0188, 0.0045][::-1]) * 255.0,
59 |                                  eigvec=np.array(
60 |                                      [[-0.5675, 0.7192, 0.4009],
61 |                                       [-0.5808, -0.0045, -0.8140],
62 |                                       [-0.5836, -0.6948, 0.4203]],
63 |                                      dtype='float32')[::-1, ::-1]
64 |                                  )]),
65 |             imgaug.ToUint8(),
66 |             imgaug.Flip(horiz=True),
67 |         ]
68 |     else:
69 |         augmentors = [
70 |             imgaug.ResizeShortestEdge(256, interp=interpolation),
71 |             imgaug.CenterCrop((224, 224)),
72 |         ]
73 |     return augmentors
74 | 
75 | 
76 | def get_train_dataflow(datadir, batch, augmentors=None, parallel=None):
77 |     """
78 |     Sec 3, Remark 4:
79 |     Use a single random shuffling of the training data (per epoch)
80 |     that is divided amongst all k workers.
81 | 
82 |     NOTE: This implementation differs from the paper in this respect:
83 |     here, each machine shuffles its data independently.
84 |     """
85 |     if parallel is None:
86 |         parallel = min(50, mp.cpu_count())
87 |     if augmentors is None:
88 |         augmentors = fbresnet_augmentor(True)
89 |     ds = dataset.ILSVRC12(datadir, 'train', shuffle=True)
90 |     ds = AugmentImageComponent(ds, augmentors, copy=False)
91 |     ds = BatchData(ds, batch, remainder=False)
92 |     ds = MultiProcessRunnerZMQ(ds, parallel)
93 |     return ds
94 | 
95 | 
96 | def get_val_dataflow(
97 |         datadir, batch_size,
98 |         augmentors=None, parallel=None,
99 |         num_splits=None, split_index=None):
100 |     if augmentors is None:
101 |         augmentors = fbresnet_augmentor(False)
102 |     assert datadir is not None
103 |     assert isinstance(augmentors, list)
104 |     if parallel is None:
105 |         parallel = min(40, mp.cpu_count())
106 | 
107 |     if num_splits is None:
108 |         ds = dataset.ILSVRC12Files(datadir, 'val', shuffle=False)
109 |     else:
110 |         # shard validation data
111 |         assert split_index < num_splits
112 |         files = dataset.ILSVRC12Files(datadir, 'val', shuffle=False)
113 |         files.reset_state()
114 |         files = list(files.get_data())
115 |         logger.info("Number of validation data = {}".format(len(files)))
116 |         split_size = len(files) // num_splits
117 |         start, end = split_size * split_index, split_size * (split_index + 1)
118 |         end = min(end, len(files))
119 |         logger.info("Local validation split = {} - {}".format(start, end))
120 |         files = files[start: end]
121 |         ds = DataFromList(files, shuffle=False)
122 |     aug = imgaug.AugmentorList(augmentors)
123 | 
124 |     def mapf(dp):
125 |         fname, cls = dp
126 |         im = cv2.imread(fname, cv2.IMREAD_COLOR)
127 |         im = aug.augment(im)
128 |         return im, cls
129 |     ds = MultiThreadMapData(ds, parallel, mapf,
130 |                             buffer_size=min(2000, ds.size()), strict=True)
131 |     ds = BatchData(ds, batch_size, remainder=True)
132 |     # do not fork() under MPI
133 |     return ds
134 | 
135 | 
136 | def eval_classification(model, sessinit, dataflow):
137 |     """
138 |     Eval a classification model on the dataset. It assumes the model inputs are
139 |     named "input" and "label", and that the graph contains "wrong-top1" and "wrong-top5" tensors.
140 | """ 141 | pred_config = PredictConfig( 142 | model=model, 143 | session_init=sessinit, 144 | input_names=['input', 'label'], 145 | output_names=['wrong-top1', 'wrong-top5'] 146 | ) 147 | acc1, acc5 = RatioCounter(), RatioCounter() 148 | 149 | # This does not have a visible improvement over naive predictor, 150 | # but will have an improvement if image_dtype is set to float32. 151 | pred = FeedfreePredictor(pred_config, StagingInput(QueueInput(dataflow), device='/gpu:0')) 152 | for _ in tqdm.trange(dataflow.size()): 153 | top1, top5 = pred() 154 | batch_size = top1.shape[0] 155 | acc1.feed(top1.sum(), batch_size) 156 | acc5.feed(top5.sum(), batch_size) 157 | 158 | print("Top1 Error: {}".format(acc1.ratio)) 159 | print("Top5 Error: {}".format(acc5.ratio)) 160 | 161 | 162 | class ImageNetModel(ModelDesc): 163 | image_shape = 224 164 | 165 | """ 166 | uint8 instead of float32 is used as input type to reduce copy overhead. 167 | It might hurt the performance a liiiitle bit. 168 | The pretrained models were trained with float32. 169 | """ 170 | image_dtype = tf.uint8 171 | 172 | """ 173 | Either 'NCHW' or 'NHWC' 174 | """ 175 | data_format = 'NCHW' 176 | 177 | """ 178 | Whether the image is BGR or RGB. If using DataFlow, then it should be BGR. 179 | """ 180 | image_bgr = True 181 | 182 | weight_decay = 1e-4 183 | 184 | """ 185 | To apply on normalization parameters, use '.*/W|.*/gamma|.*/beta' 186 | 187 | Sec 5.1: We use a weight decay λ of 0.0001 and following [16] we do not apply 188 | weight decay on the learnable BN coefficients 189 | """ 190 | weight_decay_pattern = '.*/W' 191 | 192 | """ 193 | Scale the loss, for whatever reasons (e.g., gradient averaging, fp16 training, etc) 194 | """ 195 | loss_scale = 1. 196 | 197 | """ 198 | Label smoothing (See tf.losses.softmax_cross_entropy) 199 | """ 200 | label_smoothing = 0. 201 | 202 | """ 203 | Accumulate gradients across several steps (by default 1, which means no accumulation across steps). 204 | """ 205 | accum_grad = 1 206 | 207 | def inputs(self): 208 | return [tf.TensorSpec([None, self.image_shape, self.image_shape, 3], self.image_dtype, 'input'), 209 | tf.TensorSpec([None], tf.int32, 'label')] 210 | 211 | def build_graph(self, image, label): 212 | image = self.image_preprocess(image) 213 | assert self.data_format == 'NCHW' 214 | image = tf.transpose(image, [0, 3, 1, 2]) 215 | 216 | logits = self.get_logits(image) 217 | loss = ImageNetModel.compute_loss_and_error( 218 | logits, label, label_smoothing=self.label_smoothing) 219 | 220 | if self.weight_decay > 0: 221 | wd_loss = regularize_cost(self.weight_decay_pattern, 222 | tf.contrib.layers.l2_regularizer(self.weight_decay), 223 | name='l2_regularize_loss') 224 | add_moving_summary(loss, wd_loss) 225 | total_cost = tf.add_n([loss, wd_loss], name='cost') 226 | else: 227 | total_cost = tf.identity(loss, name='cost') 228 | add_moving_summary(total_cost) 229 | 230 | if self.loss_scale != 1.: 231 | logger.info("Scaling the total loss by {} ...".format(self.loss_scale)) 232 | return total_cost * self.loss_scale 233 | else: 234 | return total_cost 235 | 236 | @abstractmethod 237 | def get_logits(self, image): 238 | """ 239 | Args: 240 | image: 4D tensor of ``self.input_shape`` in ``self.data_format`` 241 | 242 | Returns: 243 | Nx#class logits 244 | """ 245 | 246 | def optimizer(self): 247 | """ 248 | Sec 5.1: We use Nesterov momentum with m of 0.9. 249 | 250 | Sec 3: momentum correction 251 | Tensorflow's momentum optimizer does not need correction. 
252 | """ 253 | lr = tf.get_variable('learning_rate', initializer=0.1, trainable=False) 254 | tf.summary.scalar('learning_rate-summary', lr) 255 | opt = tf.train.MomentumOptimizer(lr, 0.9, use_nesterov=True) 256 | if self.accum_grad != 1: 257 | opt = AccumGradOptimizer(opt, self.accum_grad) 258 | return opt 259 | 260 | def image_preprocess(self, image): 261 | with tf.name_scope('image_preprocess'): 262 | if image.dtype.base_dtype != tf.float32: 263 | image = tf.cast(image, tf.float32) 264 | 265 | """ 266 | Sec 5.1: 267 | The input image is normalized by the per-color mean and 268 | standard deviation, as in [12] 269 | """ 270 | mean = [0.485, 0.456, 0.406] # rgb 271 | std = [0.229, 0.224, 0.225] 272 | if self.image_bgr: 273 | mean = mean[::-1] 274 | std = std[::-1] 275 | image_mean = tf.constant(mean, dtype=tf.float32) * 255. 276 | image_std = tf.constant(std, dtype=tf.float32) * 255. 277 | image = (image - image_mean) / image_std 278 | return image 279 | 280 | @staticmethod 281 | def compute_loss_and_error(logits, label, label_smoothing=0.): 282 | if label_smoothing != 0.: 283 | nclass = logits.shape[-1] 284 | label = tf.one_hot(label, nclass) if label.shape.ndims == 1 else label 285 | 286 | if label.shape.ndims == 1: 287 | loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label) 288 | else: 289 | loss = tf.losses.softmax_cross_entropy( 290 | label, logits, label_smoothing=label_smoothing, 291 | reduction=tf.losses.Reduction.NONE) 292 | loss = tf.reduce_mean(loss, name='xentropy-loss') 293 | 294 | def prediction_incorrect(logits, label, topk=1, name='incorrect_vector'): 295 | with tf.name_scope('prediction_incorrect'): 296 | x = tf.logical_not(tf.nn.in_top_k(logits, label, topk)) 297 | return tf.cast(x, tf.float32, name=name) 298 | 299 | wrong = prediction_incorrect(logits, label, 1, name='wrong-top1') 300 | add_moving_summary(tf.reduce_mean(wrong, name='train-error-top1')) 301 | 302 | wrong = prediction_incorrect(logits, label, 5, name='wrong-top5') 303 | add_moving_summary(tf.reduce_mean(wrong, name='train-error-top5')) 304 | return loss 305 | -------------------------------------------------------------------------------- /ResNet-Horovod/resnet_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: resnet_model.py 4 | 5 | import tensorflow as tf 6 | from contextlib import contextmanager 7 | 8 | 9 | from tensorpack.tfutils.argscope import argscope 10 | from tensorpack.tfutils.varreplace import remap_variables 11 | from tensorpack.models import ( 12 | Conv2D, MaxPooling, GlobalAvgPooling, BatchNorm, FullyConnected, BNReLU, layer_register) 13 | 14 | 15 | @layer_register(log_shape=False, use_scope=False) 16 | def Norm(x, type, gamma_initializer=tf.constant_initializer(1.)): 17 | """ 18 | A norm layer (which depends on 'type') 19 | 20 | Args: 21 | type (str): one of "BN" or "GN" 22 | """ 23 | assert type in ["BN", "GN"] 24 | if type == "BN": 25 | return BatchNorm('bn', x, gamma_initializer=gamma_initializer) 26 | else: 27 | return GroupNorm('gn', x, gamma_initializer=gamma_initializer) 28 | 29 | 30 | @layer_register(log_shape=True) 31 | def GroupNorm(x, group=32, gamma_initializer=tf.constant_initializer(1.)): 32 | """ 33 | https://arxiv.org/abs/1803.08494 34 | """ 35 | shape = x.get_shape().as_list() 36 | ndims = len(shape) 37 | assert ndims == 4, shape 38 | chan = shape[1] 39 | assert chan % group == 0, chan 40 | group_size = chan // group 41 | 42 | orig_shape = 
tf.shape(x) 43 | h, w = orig_shape[2], orig_shape[3] 44 | 45 | x = tf.reshape(x, tf.stack([-1, group, group_size, h, w])) 46 | 47 | mean, var = tf.nn.moments(x, [2, 3, 4], keep_dims=True) 48 | 49 | new_shape = [1, group, group_size, 1, 1] 50 | 51 | beta = tf.get_variable('beta', [chan], initializer=tf.constant_initializer()) 52 | beta = tf.reshape(beta, new_shape) 53 | 54 | gamma = tf.get_variable('gamma', [chan], initializer=gamma_initializer) 55 | gamma = tf.reshape(gamma, new_shape) 56 | 57 | out = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-5, name='output') 58 | return tf.reshape(out, orig_shape, name='output') 59 | 60 | 61 | def resnet_shortcut(l, n_out, stride, activation=tf.identity): 62 | n_in = l.get_shape().as_list()[1] 63 | if n_in != n_out: # change dimension when channel is not the same 64 | return Conv2D('convshortcut', l, n_out, 1, strides=stride, activation=activation) 65 | else: 66 | return l 67 | 68 | 69 | def resnet_bottleneck(l, ch_out, stride, stride_first=False): 70 | shortcut = l 71 | norm_relu = lambda x: tf.nn.relu(Norm(x)) 72 | l = Conv2D('conv1', l, ch_out, 1, strides=stride if stride_first else 1, activation=norm_relu) 73 | """ 74 | Sec 5.1: 75 | We use the ResNet-50 [16] variant from [12], noting that 76 | the stride-2 convolutions are on 3×3 layers instead of on 1×1 layers 77 | """ 78 | l = Conv2D('conv2', l, ch_out, 3, strides=1 if stride_first else stride, activation=norm_relu) 79 | """ 80 | Section 5.1: 81 | For BN layers, the learnable scaling coefficient γ is initialized 82 | to be 1, except for each residual block's last BN 83 | where γ is initialized to be 0. 84 | """ 85 | l = Conv2D('conv3', l, ch_out * 4, 1, activation=lambda x: Norm(x, gamma_initializer=tf.zeros_initializer())) 86 | ret = l + resnet_shortcut(shortcut, ch_out * 4, stride, activation=lambda x: Norm(x)) 87 | return tf.nn.relu(ret, name='block_output') 88 | 89 | 90 | def resnet_group(name, l, block_func, features, count, stride): 91 | with tf.variable_scope(name): 92 | for i in range(0, count): 93 | with tf.variable_scope('block{}'.format(i)): 94 | l = block_func(l, features, stride if i == 0 else 1) 95 | return l 96 | 97 | 98 | @contextmanager 99 | def weight_standardization_context(enable=True): 100 | """ 101 | Implement Centered Weight Normalization 102 | (http://openaccess.thecvf.com/content_ICCV_2017/papers/Huang_Centered_Weight_Normalization_ICCV_2017_paper.pdf) 103 | or Weight Standardization (https://arxiv.org/abs/1903.10520) 104 | 105 | Usage: 106 | 107 | with weight_standardization_context(): 108 | l = Conv2D('conv', l) 109 | ... 110 | """ 111 | if enable: 112 | def weight_standardization(v): 113 | if (not v.name.endswith('/W:0')) or v.shape.ndims != 4: 114 | return v 115 | mean, var = tf.nn.moments(v, [0, 1, 2], keep_dims=True) 116 | v = (v - mean) / (tf.sqrt(var) + 1e-5) 117 | return v 118 | 119 | with remap_variables(weight_standardization): 120 | yield 121 | 122 | else: 123 | yield 124 | 125 | 126 | def resnet_backbone(image, num_blocks, group_func, block_func): 127 | """ 128 | Sec 5.1: We adopt the initialization of [15] for all convolutional layers. 129 | TensorFlow does not have the true "MSRA init". We use variance_scaling as an approximation. 
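(variance_scaling_initializer with scale=2.0 and mode='fan_out' draws from a truncated normal with stddev = sqrt(2 / fan_out), i.e. the fan-out variant of He initialization.)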
130 | """ 131 | with argscope(Conv2D, use_bias=False, 132 | kernel_initializer=tf.variance_scaling_initializer(scale=2.0, mode='fan_out')): 133 | l = Conv2D('conv0', image, 64, 7, strides=2, activation=BNReLU) 134 | l = MaxPooling('pool0', l, pool_size=3, strides=2, padding='SAME') 135 | l = group_func('group0', l, block_func, 64, num_blocks[0], 1) 136 | l = group_func('group1', l, block_func, 128, num_blocks[1], 2) 137 | l = group_func('group2', l, block_func, 256, num_blocks[2], 2) 138 | l = group_func('group3', l, block_func, 512, num_blocks[3], 2) 139 | l = GlobalAvgPooling('gap', l) 140 | logits = FullyConnected('linear', l, 1000, 141 | kernel_initializer=tf.random_normal_initializer(stddev=0.01)) 142 | """ 143 | Sec 5.1: 144 | The 1000-way fully-connected layer is initialized by 145 | drawing weights from a zero-mean Gaussian with standard 146 | deviation of 0.01. 147 | """ 148 | return logits 149 | -------------------------------------------------------------------------------- /ResNet-Horovod/serve-data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: serve-data.py 4 | 5 | import argparse 6 | import os 7 | import socket 8 | 9 | from tensorpack.dataflow import ( 10 | send_dataflow_zmq, MapData, TestDataSpeed, FakeData, dataset) 11 | from tensorpack.utils import logger 12 | from imagenet_utils import get_train_dataflow 13 | 14 | from zmq_ops import dump_arrays 15 | 16 | 17 | if __name__ == '__main__': 18 | parser = argparse.ArgumentParser() 19 | parser.add_argument('--data', help='ILSVRC dataset dir') 20 | parser.add_argument('--fake', action='store_true') 21 | parser.add_argument('--batch', help='per-GPU batch size', 22 | default=32, type=int) 23 | parser.add_argument('--benchmark', action='store_true') 24 | parser.add_argument('--no-zmq-ops', action='store_true') 25 | args = parser.parse_args() 26 | 27 | os.environ['CUDA_VISIBLE_DEVICES'] = '' 28 | 29 | if args.fake: 30 | ds = FakeData( 31 | [[args.batch, 224, 224, 3], [args.batch]], 32 | 1000, random=False, dtype=['uint8', 'int32']) 33 | else: 34 | ds = get_train_dataflow(args.data, args.batch) 35 | 36 | logger.info("Serving data on {}".format(socket.gethostname())) 37 | 38 | if args.benchmark: 39 | ds = MapData(ds, dump_arrays) 40 | TestDataSpeed(ds, warmup=300).start() 41 | else: 42 | format = None if args.no_zmq_ops else 'zmq_ops' 43 | send_dataflow_zmq( 44 | ds, 'ipc://@imagenet-train-b{}'.format(args.batch), 45 | hwm=150, format=format, bind=True) 46 | -------------------------------------------------------------------------------- /ResNet-Horovod/slurm.script: -------------------------------------------------------------------------------- 1 | #!/bin/bash -e 2 | #SBATCH --output=logs/job-%j.%N.out 3 | #SBATCH --error=logs/job-%j.%N.err 4 | #SBATCH --ntasks-per-node=8 # 8 tasks per node 5 | #SBATCH --gres=gpu:8 # 8 GPUs per node 6 | #SBATCH --cpus-per-task=10 # 80/8 cpus per task 7 | #SBATCH --mem=200G # ask for 200G 8 | 9 | # To run on 4 nodes x 8 GPUs: use "mkdir -p logs && sbatch --nodes=4 slurm.script" 10 | 11 | echo "NNODES: $SLURM_NNODES" 12 | echo "JOBID: $SLURM_JOB_ID" 13 | env | grep PATH 14 | 15 | export TENSORPACK_PROGRESS_REFRESH=20 16 | export TENSORPACK_SERIALIZE=msgpack 17 | DATA_PATH=~/data/imagenet 18 | BATCH=64 19 | 20 | # launch data 21 | srun --output=logs/data-%J.%N.log \ 22 | --error=logs/data-%J.%N.err \ 23 | --gres=gpu:0 --cpus-per-task=60 --mincpus 60 \ 24 | --ntasks=$SLURM_NNODES --ntasks-per-node=1 
\ 25 | python ./serve-data.py --data $DATA_PATH --batch $BATCH & 26 | DATA_PID=$! 27 | 28 | # launch training 29 | # https://www.open-mpi.org/faq/?category=openfabrics#ib-router has documentation on IB options 30 | # the queue parameters can sometimes hang the communication (for some MPI versions and some operations) 31 | mpirun -output-filename logs/train-$SLURM_JOB_ID.log -tag-output \ 32 | -bind-to none -map-by slot \ 33 | -mca pml ob1 -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,65536,32 \ 34 | -x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO \ 35 | python ./imagenet-resnet-horovod.py \ 36 | --data $DATA_PATH --batch $BATCH --validation distributed $@ & 37 | MPI_PID=$! 38 | 39 | wait $MPI_PID 40 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Benchmark Multi-GPU training against tensorflow/benchmarks 3 | 4 | Tensorpack Multi-GPU trainers are implemented following the awesome examples in 5 | [tensorflow/benchmarks](https://github.com/tensorflow/benchmarks). 6 | Their performance should be the same. 7 | 8 | Here we measure performance by the number of images the trainer can process per second when training a ResNet-50 on ImageNet-size images. 9 | 10 | This script is tested on fake data to focus on the performance of different parallel strategies. 11 | To train on real data with reasonable experiment settings, see 12 | [Tensorpack ResNet example](https://github.com/tensorpack/tensorpack/tree/master/examples/ResNet) or [ResNet-Horovod benchmark](../ResNet-Horovod). 13 | 14 | ## Scripts: 15 | 16 | The following command in tensorflow/benchmarks: 17 | ``` 18 | python tf_cnn_benchmarks.py --num_gpus=8 --batch_size=64 --model=resnet50 --variable_update=replicated/parameter_server --local_parameter_device=cpu 19 | ``` 20 | 21 | is roughly the same as this command in tensorpack: 22 | ``` 23 | python resnet-multigpu.py --fake-location gpu --variable-update=replicated/parameter_server 24 | ``` 25 | 26 | There are tiny differences in the way data is synthesized (the `--fake-location` option): 27 | * gpu: synthesized on GPU as a constant. The TF benchmark uses something closest to this. 28 | * cpu: synthesized on CPU inside the TF runtime. 29 | * python-queue: synthesized on CPU inside Python, transferred to TF with a queue, prefetched on GPU. 30 | This is the most common experiment setting in tensorpack, because it's easy to write 31 | (the data loader is plain Python) and still fast. 32 | * python-dataset: also uses Python to write the data loader, but transfers to TF with `tf.data + prefetch` instead. 33 | * zmq-consume: consume data from a ZMQ pipe, using [my zmq ops](https://github.com/tensorpack/zmq_ops), also with GPU prefetch. 34 | The data producer can therefore be written in any language. 35 | 36 | Data processing inside TF is often a [bad idea in practice](https://tensorpack.readthedocs.io/tutorial/philosophy/dataflow.html#alternative-data-loading-solutions). 37 | When data comes from outside TF, my experiments show 38 | that `zmq-consume` is the fastest input pipeline compared to the others here. 39 | It's also faster than `tensorflow/benchmarks` (tested on Jan 6 2018 with TF1.5.0rc0) when training on real data. 40 | 41 | ## Performance @ Sep 2017: 42 | 43 | Environment: 44 | * Software: TF v1.3.0-rc1-1302-g593dc8e; tensorpack 0.5. 45 | * Hardware: 8 P100s. 
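For example, the 8-GPU `replicated` row in the tables below corresponds to a command of the form (assuming all 8 GPUs are visible):
```
python resnet-multigpu.py --fake-location gpu --variable-update=replicated
```
The three numbers in each tensorpack column correspond to `--fake-location` gpu / cpu / python-queue, respectively.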
46 | 47 | Note that the latest source code uses new features in tensorpack and therefore won't work with tensorpack 0.5. 48 | Check out an old version if you intend to reproduce these numbers. 49 | 50 | Experiments were not run multiple times, so expect some small variance in the results. 51 | 52 | `variable-update=replicated`: 53 | 54 | | #GPU | tensorpack(GPU/CPU/Python) | tensorflow/benchmarks | 55 | | --------- | ---------------------- | -------------------- | 56 | | 1 | 228/228/219 | 225.73 | 57 | | 2 | 427/423/415 | 424.54 | 58 | | 4 | 802/785/787 | 789.51 | 59 | | 8 | 1612/1579/1551 | 1580.58 | 60 | 61 | `variable-update=parameter_server`: 62 | 63 | | #GPU | tensorpack(GPU/CPU/Python) | tensorflow/benchmarks | 64 | | --------- | ------------------- | -------------------- | 65 | | 1 | 227/227/223 | 221.68 | 66 | | 2 | 428/418/403 | 421.01 | 67 | | 4 | 817/802/787 | 828.29 | 68 | | 8 | 1651/1556/1574 | 1604.55 | 69 | 70 | ## Performance @ May 2019: 71 | 72 | Environment: 73 | 74 | * Software: TF 1.13.1, cuda 10, cudnn 7.4.2, tensorpack 0.9.5. 75 | * Hardware: AWS p3.16xlarge (8 V100s) 76 | 77 | Results: 78 | 79 | * `--fake-location=gpu --variable-update=horovod`: 2874 img/s. 80 | * `--fake-location=gpu --variable-update=replicated`: 2844 img/s. 81 | * `--fake-location=gpu --variable-update=replicated --use-fp16`: 5224 img/s. 82 | * `--fake-location=gpu --variable-update=replicated --use-fp16 --batch 128`: 5891 img/s 83 | * `--fake-location=gpu --variable-update=replicated --use-fp16 --batch 128 --use-xla-compile`: 9225 img/s 84 | 85 | ## Performance @ Jul 2020: 86 | 87 | Environment: 88 | 89 | * Software: TF 2.2 in v1-compatible mode, cuda 10.1, cudnn 7.6.5, tensorpack 0.10.1. 90 | * Hardware: AWS p3.16xlarge (8 V100s, 16G memory each) 91 | 92 | Results: 93 | 94 | * `--fake-location=gpu --variable-update=horovod`: 2973 img/s. 95 | * `--fake-location=gpu --variable-update=replicated`: 2961 img/s. 96 | * `--fake-location=gpu --variable-update=replicated --batch 128`: 3137 img/s. 
97 | * `--fake-location=gpu --variable-update=replicated --use-fp16 --batch 128`: 6141 img/s 98 | * `--fake-location=gpu --variable-update=replicated --use-fp16 --batch 128 --use-xla-compile`: 9341 img/s 99 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/resnet-multigpu.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: resnet-multigpu.py 4 | 5 | import sys 6 | import argparse 7 | import os 8 | from contextlib import contextmanager, ExitStack 9 | import tensorflow as tf 10 | 11 | from tensorpack import * 12 | from tensorpack.utils.gpu import get_nr_gpu 13 | from tensorpack.tfutils.summary import * 14 | from tensorpack.utils.argtools import log_once 15 | 16 | from tensorpack.tfutils.collection import freeze_collection 17 | from tensorpack.tfutils import get_current_tower_context 18 | from tensorpack.tfutils.varreplace import custom_getter_scope 19 | 20 | from tfbench.convnet_builder import ConvNetBuilder 21 | from tfbench import model_config 22 | 23 | INPUT_SHAPE = 224 24 | IMAGE_DTYPE = tf.uint8 25 | IMAGE_DTYPE_NUMPY = 'uint8' 26 | 27 | 28 | class Model(ModelDesc): 29 | def __init__(self, data_format='NCHW'): 30 | self.data_format = data_format 31 | 32 | def inputs(self): 33 | return [tf.TensorSpec([args.batch, INPUT_SHAPE, INPUT_SHAPE, 3], IMAGE_DTYPE, 'input'), 34 | tf.TensorSpec([args.batch], tf.int32, 'label')] 35 | 36 | def optimizer(self): 37 | lr = tf.get_variable('learning_rate', initializer=0.1, trainable=False) 38 | return tf.train.GradientDescentOptimizer(lr) 39 | 40 | def build_graph(self, image, label): 41 | # all-zero tensor hurt performance for some reason. 42 | ctx = get_current_tower_context() 43 | label = tf.random_uniform( 44 | [args.batch], 45 | minval=0, maxval=1000 - 1, 46 | dtype=tf.int32, name='synthetic_labels') 47 | 48 | # our fake images are in [0, 1] 49 | target_dtype = tf.float16 if args.use_fp16 else tf.float32 50 | if image.dtype != target_dtype: 51 | image = tf.cast(image, target_dtype) * 2.0 - 1. 
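# inputs come as NHWC; transpose to NCHW below, which is faster for cuDNN convolutions on GPU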
52 | if self.data_format == 'NCHW': 53 | image = tf.transpose(image, [0, 3, 1, 2]) 54 | 55 | logits = self._get_logits(image) 56 | 57 | if logits.dtype != tf.float32: 58 | logger.info("Casting logits back to fp32 ...") 59 | logits = tf.cast(logits, tf.float32) 60 | 61 | loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label) 62 | loss = tf.reduce_mean(loss, name='xentropy-loss') 63 | if ctx.index == ctx.total - 1: 64 | wd_cost = regularize_cost('.*', tf.nn.l2_loss) * (1e-4 * ctx.total) 65 | return tf.add_n([loss, wd_cost], name='cost') 66 | else: 67 | return loss 68 | 69 | 70 | @contextmanager 71 | def maybe_freeze_updates(enable): 72 | if enable: 73 | with freeze_collection([tf.GraphKeys.UPDATE_OPS]): 74 | yield 75 | else: 76 | yield 77 | 78 | 79 | class TFBenchModel(Model): 80 | def _get_logits(self, image): 81 | ctx = get_current_tower_context() 82 | with maybe_freeze_updates(ctx.index > 0): 83 | network = ConvNetBuilder( 84 | image, 3, True, 85 | use_tf_layers=True, 86 | data_format=self.data_format, 87 | dtype=tf.float16 if args.use_fp16 else tf.float32, 88 | variable_dtype=tf.float32) 89 | with custom_getter_scope(network.get_custom_getter()): 90 | dataset = lambda: 1 91 | dataset.name = 'imagenet' 92 | model_conf = model_config.get_model_config('resnet50', dataset) 93 | model_conf.set_batch_size(args.batch) 94 | model_conf.add_inference(network) 95 | return network.affine(1000, activation='linear', stddev=0.001) 96 | 97 | 98 | class TensorpackModel(Model): 99 | """ 100 | Implement the same model with tensorpack layers. 101 | """ 102 | def _get_logits(self, image): 103 | 104 | def fp16_getter(getter, *args, **kwargs): 105 | name = args[0] if len(args) else kwargs['name'] 106 | if not name.endswith('/W') and not name.endswith('/b'): 107 | """ 108 | Following convention, convolution & fc are quantized. 109 | BatchNorm (gamma & beta) are not quantized. 
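Conv/FC weights are instead kept as fp32 master copies and cast to fp16 in the branch below.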
110 | """ 111 | return getter(*args, **kwargs) 112 | else: 113 | if kwargs['dtype'] == tf.float16: 114 | kwargs['dtype'] = tf.float32 115 | ret = getter(*args, **kwargs) 116 | ret = tf.cast(ret, tf.float16) 117 | log_once("Variable {} casted to fp16 ...".format(name)) 118 | return ret 119 | else: 120 | return getter(*args, **kwargs) 121 | 122 | def shortcut(l, n_in, n_out, stride): 123 | if n_in != n_out: 124 | l = Conv2D('convshortcut', l, n_out, 1, strides=stride) 125 | l = BatchNorm('bnshortcut', l) 126 | return l 127 | else: 128 | return l 129 | 130 | def bottleneck(l, ch_out, stride, preact): 131 | ch_in = l.get_shape().as_list()[1] 132 | input = l 133 | l = Conv2D('conv1', l, ch_out, 1, strides=stride, activation=BNReLU) 134 | l = Conv2D('conv2', l, ch_out, 3, strides=1, activation=BNReLU) 135 | l = Conv2D('conv3', l, ch_out * 4, 1, activation=tf.identity) 136 | l = BatchNorm('bn', l) 137 | ret = l + shortcut(input, ch_in, ch_out * 4, stride) 138 | return tf.nn.relu(ret) 139 | 140 | def layer(l, layername, block_func, features, count, stride, first=False): 141 | with tf.variable_scope(layername): 142 | with tf.variable_scope('block0'): 143 | l = block_func(l, features, stride, 144 | 'no_preact' if first else 'both_preact') 145 | for i in range(1, count): 146 | with tf.variable_scope('block{}'.format(i)): 147 | l = block_func(l, features, 1, 'default') 148 | return l 149 | 150 | defs = [3, 4, 6, 3] 151 | 152 | with ExitStack() as stack: 153 | stack.enter_context(argscope( 154 | Conv2D, use_bias=False, 155 | kernel_initializer=tf.variance_scaling_initializer(mode='fan_out'))) 156 | stack.enter_context(argscope( 157 | [Conv2D, MaxPooling, GlobalAvgPooling, BatchNorm], 158 | data_format=self.data_format)) 159 | if args.use_fp16: 160 | stack.enter_context(custom_getter_scope(fp16_getter)) 161 | image = tf.cast(image, tf.float16) 162 | logits = (LinearWrap(image) 163 | .Conv2D('conv0', 64, 7, strides=2) 164 | .BatchNorm('bn0') 165 | .tf.nn.relu() 166 | .MaxPooling('pool0', 3, strides=2, padding='SAME') 167 | .apply(layer, 'group0', bottleneck, 64, defs[0], 1, first=True) 168 | .apply(layer, 'group1', bottleneck, 128, defs[1], 2) 169 | .apply(layer, 'group2', bottleneck, 256, defs[2], 2) 170 | .apply(layer, 'group3', bottleneck, 512, defs[3], 2) 171 | .GlobalAvgPooling('gap') 172 | .FullyConnected('linear', 1000)()) 173 | if args.use_fp16: 174 | logits = tf.cast(logits, tf.float32) 175 | return logits 176 | 177 | 178 | def get_data(mode): 179 | # get input 180 | input_shape = [args.batch, 224, 224, 3] 181 | label_shape = [args.batch] 182 | dataflow = FakeData( 183 | [input_shape, label_shape], 1000, 184 | random=False, dtype=[IMAGE_DTYPE_NUMPY, 'int32']) 185 | if mode == 'gpu': 186 | return DummyConstantInput([input_shape, label_shape]) 187 | elif mode == 'cpu': 188 | def fn(): 189 | # these copied from tensorflow/benchmarks 190 | with tf.device('/cpu:0'): 191 | if IMAGE_DTYPE == tf.float32: 192 | images = tf.truncated_normal( 193 | input_shape, dtype=IMAGE_DTYPE, stddev=0.1, name='synthetic_images') 194 | else: 195 | images = tf.random_uniform( 196 | input_shape, minval=0, maxval=255, dtype=tf.int32, name='synthetic_images') 197 | images = tf.cast(images, IMAGE_DTYPE) 198 | labels = tf.random_uniform( 199 | label_shape, minval=1, maxval=1000, dtype=tf.int32, name='synthetic_labels') 200 | # images = tf.contrib.framework.local_variable(images, name='images') 201 | return [images, labels] 202 | ret = TensorInput(fn) 203 | return StagingInput(ret, nr_stage=1) 204 | elif mode == 'python' or 
mode == 'python-queue': 205 | ret = QueueInput( 206 | dataflow, 207 | queue=tf.FIFOQueue(args.prefetch, [IMAGE_DTYPE, tf.int32])) 208 | return StagingInput(ret, nr_stage=1) 209 | elif mode == 'python-dataset': 210 | ds = TFDatasetInput.dataflow_to_dataset(dataflow, [IMAGE_DTYPE, tf.int32]) 211 | ds = ds.repeat().prefetch(args.prefetch) 212 | ret = TFDatasetInput(ds) 213 | return StagingInput(ret, nr_stage=1) 214 | elif mode == 'zmq-serve': 215 | send_dataflow_zmq(dataflow, 'ipc://testpipe', hwm=args.prefetch, format='zmq_op') 216 | sys.exit() 217 | elif mode == 'zmq-consume': 218 | ret = ZMQInput( 219 | 'ipc://testpipe', hwm=args.prefetch) 220 | return StagingInput(ret, nr_stage=1) 221 | 222 | 223 | if __name__ == '__main__': 224 | parser = argparse.ArgumentParser() 225 | parser.add_argument('--gpu', help='comma separated list of GPU(s) to use.') 226 | parser.add_argument('--model', choices=['tfbench', 'tensorpack'], default='tensorpack') 227 | parser.add_argument('--load', help='load model') 228 | parser.add_argument('--prefetch', type=int, default=150) 229 | parser.add_argument('--use-fp16', action='store_true') 230 | parser.add_argument('--use-xla-compile', action='store_true') 231 | parser.add_argument('--batch', type=int, default=64, help='per GPU batch size') 232 | parser.add_argument('--data_format', help='specify NCHW or NHWC', 233 | type=str, default='NCHW') 234 | parser.add_argument('--fake-location', help='the place to create fake data', 235 | type=str, default='gpu', 236 | choices=[ 237 | 'cpu', 'gpu', 238 | 'python', 'python-queue', 'python-dataset', 239 | 'zmq-serve', 'zmq-consume']) 240 | parser.add_argument('--variable-update', help='variable update strategy', 241 | type=str, 242 | choices=['replicated', 'parameter_server', 'horovod', 'byteps'], 243 | required=True) 244 | 245 | parser.add_argument('--ps-hosts') 246 | parser.add_argument('--worker-hosts') 247 | parser.add_argument('--job') 248 | args = parser.parse_args() 249 | 250 | if args.gpu: 251 | os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu 252 | 253 | sessconf = get_default_sess_config() 254 | sessconf.inter_op_parallelism_threads = 80 - 16 255 | if args.job: 256 | # distributed: 257 | cluster_spec = tf.train.ClusterSpec({ 258 | 'ps': args.ps_hosts.split(','), 259 | 'worker': args.worker_hosts.split(',') 260 | }) 261 | job = args.job.split(':')[0] 262 | if job == 'ps': 263 | os.environ['CUDA_VISIBLE_DEVICES'] = '' 264 | task_index = int(args.job.split(':')[1]) 265 | server = tf.train.Server( 266 | cluster_spec, job_name=job, task_index=task_index, 267 | config=sessconf) 268 | 269 | NUM_GPU = get_nr_gpu() 270 | logger.info("Number of GPUs found: " + str(NUM_GPU)) 271 | 272 | if args.job: 273 | trainer = { 274 | 'replicated': lambda: DistributedTrainerReplicated(NUM_GPU, server), 275 | 'parameter_server': lambda: DistributedTrainerParameterServer(NUM_GPU, server), 276 | }[args.variable_update]() 277 | else: 278 | if NUM_GPU == 1: 279 | trainer = SimpleTrainer() 280 | else: 281 | trainer = { 282 | 'replicated': lambda: SyncMultiGPUTrainerReplicated( 283 | NUM_GPU, average=False, mode='hierarchical' if NUM_GPU >= 8 else 'cpu'), 284 | # average=False is the actual configuration used by tfbench 285 | 'horovod': lambda: HorovodTrainer(), 286 | 'byteps': lambda: BytePSTrainer(), 287 | 'parameter_server': lambda: SyncMultiGPUTrainerParameterServer(NUM_GPU, ps_device='cpu') 288 | }[args.variable_update]() 289 | if args.variable_update == 'replicated': 290 | trainer.BROADCAST_EVERY_EPOCH = False 291 | 292 | M = TFBenchModel if 
args.model == 'tfbench' else TensorpackModel 293 | callbacks = [ 294 | GPUMemoryTracker(), 295 | # ModelSaver(checkpoint_dir='./tmpmodel'), # it takes time 296 | ] 297 | if args.variable_update != 'horovod': 298 | callbacks.append(GPUUtilizationTracker()) 299 | if args.variable_update in ['horovod', 'byteps']: 300 | # disable logging if not on first rank 301 | if trainer.hvd.rank() != 0: 302 | logger.addFilter(lambda _: False) 303 | NUM_GPU = trainer.hvd.size() 304 | logger.info("Number of GPUs to use: " + str(NUM_GPU)) 305 | 306 | try: 307 | from tensorpack.callbacks import ThroughputTracker 308 | except ImportError: 309 | # only available in latest tensorpack 310 | pass 311 | else: 312 | callbacks.append(ThroughputTracker(samples_per_step=args.batch * NUM_GPU)) 313 | 314 | config = TrainConfig( 315 | data=get_data(args.fake_location), 316 | model=M(data_format=args.data_format), 317 | callbacks=callbacks, 318 | extra_callbacks=[ 319 | # MovingAverageSummary(), # tensorflow/benchmarks does not do this 320 | ProgressBar(), # nor this 321 | # MergeAllSummaries(), 322 | RunUpdateOps() 323 | ], 324 | session_config=sessconf if not args.job else None, 325 | steps_per_epoch=50, 326 | max_epoch=10, 327 | ) 328 | 329 | # consistent with tensorflow/benchmarks 330 | trainer.COLOCATE_GRADIENTS_WITH_OPS = False 331 | trainer.XLA_COMPILE = args.use_xla_compile 332 | launch_train_with_config(config, trainer) 333 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/tfbench/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tensorpack/benchmarks/7c97174f9a00e440d60c38b54d4c45f58271fc3e/ResNet-MultiGPU/tfbench/__init__.py -------------------------------------------------------------------------------- /ResNet-MultiGPU/tfbench/convnet_builder.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # ============================================================================== 15 | """CNN builder.""" 16 | 17 | from __future__ import print_function 18 | 19 | from collections import defaultdict 20 | import contextlib 21 | 22 | import numpy as np 23 | 24 | import tensorflow as tf 25 | 26 | from tensorflow.python.layers import convolutional as conv_layers 27 | from tensorflow.python.layers import core as core_layers 28 | from tensorflow.python.layers import pooling as pooling_layers 29 | from tensorflow.python.training import moving_averages 30 | 31 | 32 | class ConvNetBuilder(object): 33 | """Builder of cnn net.""" 34 | 35 | def __init__(self, 36 | input_op, 37 | input_nchan, 38 | phase_train, 39 | use_tf_layers, 40 | data_format='NCHW', 41 | dtype=tf.float32, 42 | variable_dtype=tf.float32): 43 | self.top_layer = input_op 44 | self.top_size = input_nchan 45 | self.phase_train = phase_train 46 | self.use_tf_layers = use_tf_layers 47 | self.data_format = data_format 48 | self.dtype = dtype 49 | self.variable_dtype = variable_dtype 50 | self.counts = defaultdict(lambda: 0) 51 | self.use_batch_norm = False 52 | self.batch_norm_config = {} # 'decay': 0.997, 'scale': True} 53 | self.channel_pos = ('channels_last' 54 | if data_format == 'NHWC' else 'channels_first') 55 | self.aux_top_layer = None 56 | self.aux_top_size = 0 57 | 58 | def get_custom_getter(self): 59 | """Returns a custom getter that this class's methods must be called under. 60 | 61 | All methods of this class must be called under a variable scope that was 62 | passed this custom getter. Example: 63 | 64 | ```python 65 | network = ConvNetBuilder(...) 66 | with tf.variable_scope('cg', custom_getter=network.get_custom_getter()): 67 | network.conv(...) 68 | # Call more methods of network here 69 | ``` 70 | 71 | Currently, this custom getter only does anything if self.use_tf_layers is 72 | True. In that case, it causes variables to be stored as dtype 73 | self.variable_dtype, then cast to the requested dtype, instead of directly 74 | storing the variable as the requested dtype. 75 | """ 76 | def inner_custom_getter(getter, *args, **kwargs): 77 | """Custom getter that forces variables to have type self.variable_dtype.""" 78 | if not self.use_tf_layers: 79 | return getter(*args, **kwargs) 80 | requested_dtype = kwargs['dtype'] 81 | if not (requested_dtype == tf.float32 and 82 | self.variable_dtype == tf.float16): 83 | # Only change the variable dtype if doing so does not decrease variable 84 | # precision. 85 | kwargs['dtype'] = self.variable_dtype 86 | var = getter(*args, **kwargs) 87 | # This if statement is needed to guard the cast, because batch norm 88 | # assigns directly to the return value of this custom getter. The cast 89 | # makes the return value not a variable so it cannot be assigned. Batch 90 | # norm variables are always in fp32 so this if statement is never 91 | # triggered for them. 
92 | if var.dtype.base_dtype != requested_dtype: 93 | var = tf.cast(var, requested_dtype) 94 | return var 95 | return inner_custom_getter 96 | 97 | @contextlib.contextmanager 98 | def switch_to_aux_top_layer(self): 99 | """Context that construct cnn in the auxiliary arm.""" 100 | if self.aux_top_layer is None: 101 | raise RuntimeError('Empty auxiliary top layer in the network.') 102 | saved_top_layer = self.top_layer 103 | saved_top_size = self.top_size 104 | self.top_layer = self.aux_top_layer 105 | self.top_size = self.aux_top_size 106 | yield 107 | self.aux_top_layer = self.top_layer 108 | self.aux_top_size = self.top_size 109 | self.top_layer = saved_top_layer 110 | self.top_size = saved_top_size 111 | 112 | def get_variable(self, name, shape, dtype, cast_dtype, *args, **kwargs): 113 | # TODO(reedwm): Currently variables and gradients are transferred to other 114 | # devices and machines as type `dtype`, not `cast_dtype`. In particular, 115 | # this means in fp16 mode, variables are transferred as fp32 values, not 116 | # fp16 values, which uses extra bandwidth. 117 | var = tf.get_variable(name, shape, dtype, *args, **kwargs) 118 | return tf.cast(var, cast_dtype) 119 | 120 | def _conv2d_impl(self, input_layer, num_channels_in, filters, kernel_size, 121 | strides, padding, kernel_initializer): 122 | if self.use_tf_layers: 123 | return conv_layers.conv2d(input_layer, filters, kernel_size, strides, 124 | padding, self.channel_pos, 125 | kernel_initializer=kernel_initializer, 126 | use_bias=False) 127 | else: 128 | weights_shape = [kernel_size[0], kernel_size[1], num_channels_in, filters] 129 | # We use the name 'conv2d/kernel' so the variable has the same name as its 130 | # tf.layers equivalent. This way, if a checkpoint is written when 131 | # self.use_tf_layers == True, it can be loaded when 132 | # self.use_tf_layers == False, and vice versa. 
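# self.get_variable stores the kernel as self.variable_dtype (the fp32 master copy) and returns it cast to the compute dtype self.dtype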
133 | weights = self.get_variable('conv2d/kernel', weights_shape, 134 | self.variable_dtype, self.dtype, 135 | initializer=kernel_initializer) 136 | if self.data_format == 'NHWC': 137 | strides = [1] + strides + [1] 138 | else: 139 | strides = [1, 1] + strides 140 | return tf.nn.conv2d(input_layer, weights, strides, padding, 141 | data_format=self.data_format) 142 | 143 | def conv(self, 144 | num_out_channels, 145 | k_height, 146 | k_width, 147 | d_height=1, 148 | d_width=1, 149 | mode='SAME', 150 | input_layer=None, 151 | num_channels_in=None, 152 | use_batch_norm=None, 153 | stddev=None, 154 | activation='relu', 155 | bias=0.0, 156 | kernel_initializer=None): 157 | """Construct a conv2d layer on top of cnn.""" 158 | if input_layer is None: 159 | input_layer = self.top_layer 160 | if num_channels_in is None: 161 | num_channels_in = self.top_size 162 | if stddev is not None and kernel_initializer is None: 163 | kernel_initializer = tf.truncated_normal_initializer(stddev=stddev) 164 | name = 'conv' + str(self.counts['conv']) 165 | self.counts['conv'] += 1 166 | with tf.variable_scope(name): 167 | strides = [1, d_height, d_width, 1] 168 | if self.data_format == 'NCHW': 169 | strides = [strides[0], strides[3], strides[1], strides[2]] 170 | if mode != 'SAME_RESNET': 171 | conv = self._conv2d_impl(input_layer, num_channels_in, num_out_channels, 172 | kernel_size=[k_height, k_width], 173 | strides=[d_height, d_width], padding=mode, 174 | kernel_initializer=kernel_initializer) 175 | else: # Special padding mode for ResNet models 176 | if d_height == 1 and d_width == 1: 177 | conv = self._conv2d_impl(input_layer, num_channels_in, 178 | num_out_channels, 179 | kernel_size=[k_height, k_width], 180 | strides=[d_height, d_width], padding='SAME', 181 | kernel_initializer=kernel_initializer) 182 | else: 183 | rate = 1 # Unused (for 'a trous' convolutions) 184 | kernel_height_effective = k_height + (k_height - 1) * (rate - 1) 185 | pad_h_beg = (kernel_height_effective - 1) // 2 186 | pad_h_end = kernel_height_effective - 1 - pad_h_beg 187 | kernel_width_effective = k_width + (k_width - 1) * (rate - 1) 188 | pad_w_beg = (kernel_width_effective - 1) // 2 189 | pad_w_end = kernel_width_effective - 1 - pad_w_beg 190 | padding = [[0, 0], [pad_h_beg, pad_h_end], 191 | [pad_w_beg, pad_w_end], [0, 0]] 192 | if self.data_format == 'NCHW': 193 | padding = [padding[0], padding[3], padding[1], padding[2]] 194 | input_layer = tf.pad(input_layer, padding) 195 | conv = self._conv2d_impl(input_layer, num_channels_in, 196 | num_out_channels, 197 | kernel_size=[k_height, k_width], 198 | strides=[d_height, d_width], padding='VALID', 199 | kernel_initializer=kernel_initializer) 200 | if use_batch_norm is None: 201 | use_batch_norm = self.use_batch_norm 202 | if not use_batch_norm: 203 | if bias is not None: 204 | biases = self.get_variable('biases', [num_out_channels], 205 | self.variable_dtype, self.dtype, 206 | initializer=tf.constant_initializer(bias)) 207 | biased = tf.reshape( 208 | tf.nn.bias_add(conv, biases, data_format=self.data_format), 209 | conv.get_shape()) 210 | else: 211 | biased = conv 212 | else: 213 | self.top_layer = conv 214 | self.top_size = num_out_channels 215 | biased = self.batch_norm(**self.batch_norm_config) 216 | if activation == 'relu': 217 | conv1 = tf.nn.relu(biased) 218 | elif activation == 'linear' or activation is None: 219 | conv1 = biased 220 | elif activation == 'tanh': 221 | conv1 = tf.nn.tanh(biased) 222 | else: 223 | raise KeyError('Invalid activation type \'%s\'' % activation) 224 | 
self.top_layer = conv1 225 | self.top_size = num_out_channels 226 | return conv1 227 | 228 | def _pool(self, 229 | pool_name, 230 | pool_function, 231 | k_height, 232 | k_width, 233 | d_height, 234 | d_width, 235 | mode, 236 | input_layer, 237 | num_channels_in): 238 | """Construct a pooling layer.""" 239 | if input_layer is None: 240 | input_layer = self.top_layer 241 | else: 242 | self.top_size = num_channels_in 243 | name = pool_name + str(self.counts[pool_name]) 244 | self.counts[pool_name] += 1 245 | if self.use_tf_layers: 246 | pool = pool_function( 247 | input_layer, [k_height, k_width], [d_height, d_width], 248 | padding=mode, 249 | data_format=self.channel_pos, 250 | name=name) 251 | else: 252 | if self.data_format == 'NHWC': 253 | ksize = [1, k_height, k_width, 1] 254 | strides = [1, d_height, d_width, 1] 255 | else: 256 | ksize = [1, 1, k_height, k_width] 257 | strides = [1, 1, d_height, d_width] 258 | pool = tf.nn.max_pool(input_layer, ksize, strides, padding=mode, 259 | data_format=self.data_format, name=name) 260 | self.top_layer = pool 261 | return pool 262 | 263 | def mpool(self, 264 | k_height, 265 | k_width, 266 | d_height=2, 267 | d_width=2, 268 | mode='VALID', 269 | input_layer=None, 270 | num_channels_in=None): 271 | """Construct a max pooling layer.""" 272 | return self._pool('mpool', pooling_layers.max_pooling2d, k_height, k_width, 273 | d_height, d_width, mode, input_layer, num_channels_in) 274 | 275 | def apool(self, 276 | k_height, 277 | k_width, 278 | d_height=2, 279 | d_width=2, 280 | mode='VALID', 281 | input_layer=None, 282 | num_channels_in=None): 283 | """Construct an average pooling layer.""" 284 | return self._pool('apool', pooling_layers.average_pooling2d, k_height, 285 | k_width, d_height, d_width, mode, input_layer, 286 | num_channels_in) 287 | 288 | def reshape(self, shape, input_layer=None): 289 | if input_layer is None: 290 | input_layer = self.top_layer 291 | self.top_layer = tf.reshape(input_layer, shape) 292 | self.top_size = shape[-1] # HACK This may not always work 293 | return self.top_layer 294 | 295 | def affine(self, 296 | num_out_channels, 297 | input_layer=None, 298 | num_channels_in=None, 299 | bias=0.0, 300 | stddev=None, 301 | activation='relu'): 302 | if input_layer is None: 303 | input_layer = self.top_layer 304 | if num_channels_in is None: 305 | num_channels_in = self.top_size 306 | name = 'affine' + str(self.counts['affine']) 307 | self.counts['affine'] += 1 308 | with tf.variable_scope(name): 309 | init_factor = 2. if activation == 'relu' else 1. 
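# default stddev: sqrt(2/fan_in) for relu (He-style init), sqrt(1/fan_in) otherwise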
310 | stddev = stddev or np.sqrt(init_factor / num_channels_in) 311 | kernel = self.get_variable( 312 | 'weights', [num_channels_in, num_out_channels], 313 | self.variable_dtype, self.dtype, 314 | initializer=tf.truncated_normal_initializer(stddev=stddev)) 315 | biases = self.get_variable('biases', [num_out_channels], 316 | self.variable_dtype, self.dtype, 317 | initializer=tf.constant_initializer(bias)) 318 | logits = tf.nn.xw_plus_b(input_layer, kernel, biases) 319 | if activation == 'relu': 320 | affine1 = tf.nn.relu(logits, name=name) 321 | elif activation == 'linear' or activation is None: 322 | affine1 = logits 323 | else: 324 | raise KeyError('Invalid activation type \'%s\'' % activation) 325 | self.top_layer = affine1 326 | self.top_size = num_out_channels 327 | return affine1 328 | 329 | def inception_module(self, name, cols, input_layer=None, in_size=None): 330 | if input_layer is None: 331 | input_layer = self.top_layer 332 | if in_size is None: 333 | in_size = self.top_size 334 | name += str(self.counts[name]) 335 | self.counts[name] += 1 336 | with tf.variable_scope(name): 337 | col_layers = [] 338 | col_layer_sizes = [] 339 | for c, col in enumerate(cols): 340 | col_layers.append([]) 341 | col_layer_sizes.append([]) 342 | for l, layer in enumerate(col): 343 | ltype, args = layer[0], layer[1:] 344 | kwargs = { 345 | 'input_layer': input_layer, 346 | 'num_channels_in': in_size 347 | } if l == 0 else {} 348 | if ltype == 'conv': 349 | self.conv(*args, **kwargs) 350 | elif ltype == 'mpool': 351 | self.mpool(*args, **kwargs) 352 | elif ltype == 'apool': 353 | self.apool(*args, **kwargs) 354 | elif ltype == 'share': # Share matching layer from previous column 355 | self.top_layer = col_layers[c - 1][l] 356 | self.top_size = col_layer_sizes[c - 1][l] 357 | else: 358 | raise KeyError( 359 | 'Invalid layer type for inception module: \'%s\'' % ltype) 360 | col_layers[c].append(self.top_layer) 361 | col_layer_sizes[c].append(self.top_size) 362 | catdim = 3 if self.data_format == 'NHWC' else 1 363 | self.top_layer = tf.concat([layers[-1] for layers in col_layers], catdim) 364 | self.top_size = sum([sizes[-1] for sizes in col_layer_sizes]) 365 | return self.top_layer 366 | 367 | def spatial_mean(self, keep_dims=False): 368 | name = 'spatial_mean' + str(self.counts['spatial_mean']) 369 | self.counts['spatial_mean'] += 1 370 | axes = [1, 2] if self.data_format == 'NHWC' else [2, 3] 371 | self.top_layer = tf.reduce_mean( 372 | self.top_layer, axes, keepdims=keep_dims, name=name) 373 | return self.top_layer 374 | 375 | def dropout(self, keep_prob=0.5, input_layer=None): 376 | if input_layer is None: 377 | input_layer = self.top_layer 378 | else: 379 | self.top_size = None 380 | name = 'dropout' + str(self.counts['dropout']) 381 | with tf.variable_scope(name): 382 | if not self.phase_train: 383 | keep_prob = 1.0 384 | if self.use_tf_layers: 385 | dropout = core_layers.dropout(input_layer, 1. - keep_prob, 386 | training=self.phase_train) 387 | else: 388 | dropout = tf.nn.dropout(input_layer, keep_prob) 389 | self.top_layer = dropout 390 | return dropout 391 | 392 | def _batch_norm_without_layers(self, input_layer, decay, use_scale, epsilon): 393 | """Batch normalization on `input_layer` without tf.layers.""" 394 | # We make this function as similar as possible to the 395 | # tf.contrib.layers.batch_norm, to minimize the differences between using 396 | # layers and not using layers. 
397 | shape = input_layer.shape 398 | num_channels = shape[3] if self.data_format == 'NHWC' else shape[1] 399 | beta = self.get_variable('beta', [num_channels], tf.float32, tf.float32, 400 | initializer=tf.zeros_initializer()) 401 | if use_scale: 402 | gamma = self.get_variable('gamma', [num_channels], tf.float32, 403 | tf.float32, initializer=tf.ones_initializer()) 404 | else: 405 | gamma = tf.constant(1.0, tf.float32, [num_channels]) 406 | # For moving variables, we use tf.get_variable instead of self.get_variable, 407 | # since self.get_variable returns the result of tf.cast which we cannot 408 | # assign to. 409 | moving_mean = tf.get_variable('moving_mean', [num_channels], 410 | tf.float32, 411 | initializer=tf.zeros_initializer(), 412 | trainable=False) 413 | moving_variance = tf.get_variable('moving_variance', [num_channels], 414 | tf.float32, 415 | initializer=tf.ones_initializer(), 416 | trainable=False) 417 | if self.phase_train: 418 | bn, batch_mean, batch_variance = tf.nn.fused_batch_norm( 419 | input_layer, gamma, beta, epsilon=epsilon, 420 | data_format=self.data_format, is_training=True) 421 | mean_update = moving_averages.assign_moving_average( 422 | moving_mean, batch_mean, decay=decay, zero_debias=False) 423 | variance_update = moving_averages.assign_moving_average( 424 | moving_variance, batch_variance, decay=decay, zero_debias=False) 425 | tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, mean_update) 426 | tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, variance_update) 427 | else: 428 | bn, _, _ = tf.nn.fused_batch_norm( 429 | input_layer, gamma, beta, mean=moving_mean, 430 | variance=moving_variance, epsilon=epsilon, 431 | data_format=self.data_format, is_training=False) 432 | return bn 433 | 434 | def batch_norm(self, input_layer=None, decay=0.999, scale=False, 435 | epsilon=0.001): 436 | """Adds a Batch Normalization layer.""" 437 | if input_layer is None: 438 | input_layer = self.top_layer 439 | else: 440 | self.top_size = None 441 | name = 'batchnorm' + str(self.counts['batchnorm']) 442 | self.counts['batchnorm'] += 1 443 | 444 | with tf.variable_scope(name) as scope: 445 | if self.use_tf_layers: 446 | bn = tf.contrib.layers.batch_norm( 447 | input_layer, 448 | decay=decay, 449 | scale=scale, 450 | epsilon=epsilon, 451 | is_training=self.phase_train, 452 | fused=True, 453 | data_format=self.data_format, 454 | scope=scope) 455 | else: 456 | bn = self._batch_norm_without_layers(input_layer, decay, scale, epsilon) 457 | self.top_layer = bn 458 | self.top_size = bn.shape[3] if self.data_format == 'NHWC' else bn.shape[1] 459 | self.top_size = int(self.top_size) 460 | return bn 461 | 462 | def lrn(self, depth_radius, bias, alpha, beta): 463 | """Adds a local response normalization layer.""" 464 | name = 'lrn' + str(self.counts['lrn']) 465 | self.counts['lrn'] += 1 466 | self.top_layer = tf.nn.lrn( 467 | self.top_layer, depth_radius, bias, alpha, beta, name=name) 468 | return self.top_layer 469 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/tfbench/model.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """Base model configuration for CNN benchmarks.""" 16 | 17 | 18 | class Model(object): 19 | """Base model configuration for CNN benchmarks.""" 20 | 21 | def __init__(self, 22 | model, 23 | image_size, 24 | batch_size, 25 | learning_rate, 26 | layer_counts=None, 27 | fp16_loss_scale=128): 28 | self.model = model 29 | self.image_size = image_size 30 | self.batch_size = batch_size 31 | self.default_batch_size = batch_size 32 | self.learning_rate = learning_rate 33 | self.layer_counts = layer_counts 34 | # TODO(reedwm) Set custom loss scales for each model instead of using the 35 | # default of 128. 36 | self.fp16_loss_scale = fp16_loss_scale 37 | 38 | def get_model(self): 39 | return self.model 40 | 41 | def get_image_size(self): 42 | return self.image_size 43 | 44 | def get_batch_size(self): 45 | return self.batch_size 46 | 47 | def set_batch_size(self, batch_size): 48 | self.batch_size = batch_size 49 | 50 | def get_default_batch_size(self): 51 | return self.default_batch_size 52 | 53 | def get_layer_counts(self): 54 | return self.layer_counts 55 | 56 | def get_fp16_loss_scale(self): 57 | return self.fp16_loss_scale 58 | 59 | def get_learning_rate(self, global_step, batch_size): 60 | del global_step 61 | del batch_size 62 | return self.learning_rate 63 | 64 | def add_inference(self, unused_cnn): 65 | raise ValueError('Must be implemented in derived classes') 66 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/tfbench/model_config.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | 16 | """Model configurations for CNN benchmarks. 17 | """ 18 | 19 | from . 
import resnet_model 20 | 21 | 22 | _model_name_to_imagenet_model = { 23 | 'resnet50': resnet_model.create_resnet50_model, 24 | 'resnet50_v2': resnet_model.create_resnet50_v2_model, 25 | 'resnet101': resnet_model.create_resnet101_model, 26 | 'resnet101_v2': resnet_model.create_resnet101_v2_model, 27 | 'resnet152': resnet_model.create_resnet152_model, 28 | 'resnet152_v2': resnet_model.create_resnet152_v2_model, 29 | } 30 | 31 | 32 | _model_name_to_cifar_model = { 33 | } 34 | 35 | 36 | def _get_model_map(dataset_name): 37 | if 'cifar10' == dataset_name: 38 | return _model_name_to_cifar_model 39 | elif dataset_name in ('imagenet', 'synthetic'): 40 | return _model_name_to_imagenet_model 41 | else: 42 | raise ValueError('Invalid dataset name: %s' % dataset_name) 43 | 44 | 45 | def get_model_config(model_name, dataset): 46 | """Map model name to model network configuration.""" 47 | model_map = _get_model_map(dataset.name) 48 | if model_name not in model_map: 49 | raise ValueError('Invalid model name \'%s\' for dataset \'%s\'' % 50 | (model_name, dataset.name)) 51 | else: 52 | return model_map[model_name]() 53 | 54 | 55 | def register_model(model_name, dataset_name, model_func): 56 | """Register a new model that can be obtained with `get_model_config`.""" 57 | model_map = _get_model_map(dataset_name) 58 | if model_name in model_map: 59 | raise ValueError('Model "%s" is already registered for dataset "%s"' % 60 | (model_name, dataset_name)) 61 | model_map[model_name] = model_func 62 | -------------------------------------------------------------------------------- /ResNet-MultiGPU/tfbench/resnet_model.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | 16 | """Resnet model configuration. 17 | 18 | References: 19 | Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun 20 | Deep Residual Learning for Image Recognition 21 | arXiv:1512.03385 (2015) 22 | 23 | Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun 24 | Identity Mappings in Deep Residual Networks 25 | arXiv:1603.05027 (2016) 26 | 27 | Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, 28 | Alan L. Yuille 29 | DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, 30 | Atrous Convolution, and Fully Connected CRFs 31 | arXiv:1606.00915 (2016) 32 | """ 33 | 34 | import numpy as np 35 | from six.moves import xrange # pylint: disable=redefined-builtin 36 | import tensorflow as tf 37 | from . import model as model_lib 38 | 39 | 40 | def bottleneck_block_v1(cnn, depth, depth_bottleneck, stride): 41 | """Bottleneck block with identity short-cut for ResNet v1. 42 | 43 | Args: 44 | cnn: the network to append bottleneck blocks. 45 | depth: the number of output filters for this bottleneck block. 
46 | depth_bottleneck: the number of bottleneck filters for this block. 47 | stride: Stride used in the first layer of the bottleneck block. 48 | """ 49 | input_layer = cnn.top_layer 50 | in_size = cnn.top_size 51 | name_key = 'resnet_v1' 52 | name = name_key + str(cnn.counts[name_key]) 53 | cnn.counts[name_key] += 1 54 | 55 | with tf.variable_scope(name): 56 | if depth == in_size: 57 | if stride == 1: 58 | shortcut = input_layer 59 | else: 60 | shortcut = cnn.apool( 61 | 1, 1, stride, stride, input_layer=input_layer, 62 | num_channels_in=in_size) 63 | else: 64 | shortcut = cnn.conv( 65 | depth, 1, 1, stride, stride, activation=None, 66 | use_batch_norm=True, input_layer=input_layer, 67 | num_channels_in=in_size, bias=None) 68 | cnn.conv(depth_bottleneck, 1, 1, stride, stride, 69 | input_layer=input_layer, num_channels_in=in_size, 70 | use_batch_norm=True, bias=None) 71 | cnn.conv(depth_bottleneck, 3, 3, 1, 1, mode='SAME_RESNET', 72 | use_batch_norm=True, bias=None) 73 | res = cnn.conv(depth, 1, 1, 1, 1, activation=None, 74 | use_batch_norm=True, bias=None) 75 | output = tf.nn.relu(shortcut + res) 76 | cnn.top_layer = output 77 | cnn.top_size = depth 78 | 79 | 80 | def bottleneck_block_v2(cnn, depth, depth_bottleneck, stride): 81 | """Bottleneck block with identity short-cut for ResNet v2. 82 | 83 | The main difference from v1 is that a batch norm and relu are done at the 84 | start of the block, instead of the end. This initial batch norm and relu is 85 | collectively called a pre-activation. 86 | 87 | Args: 88 | cnn: the network to append bottleneck blocks. 89 | depth: the number of output filters for this bottleneck block. 90 | depth_bottleneck: the number of bottleneck filters for this block. 91 | stride: Stride used in the first layer of the bottleneck block. 92 | """ 93 | input_layer = cnn.top_layer 94 | in_size = cnn.top_size 95 | name_key = 'resnet_v2' 96 | name = name_key + str(cnn.counts[name_key]) 97 | cnn.counts[name_key] += 1 98 | 99 | preact = cnn.batch_norm() 100 | preact = tf.nn.relu(preact) 101 | with tf.variable_scope(name): 102 | if depth == in_size: 103 | if stride == 1: 104 | shortcut = input_layer 105 | else: 106 | shortcut = cnn.apool( 107 | 1, 1, stride, stride, input_layer=input_layer, 108 | num_channels_in=in_size) 109 | else: 110 | shortcut = cnn.conv( 111 | depth, 1, 1, stride, stride, activation=None, use_batch_norm=False, 112 | input_layer=preact, num_channels_in=in_size, bias=None) 113 | cnn.conv(depth_bottleneck, 1, 1, stride, stride, 114 | input_layer=preact, num_channels_in=in_size, 115 | use_batch_norm=True, bias=None) 116 | cnn.conv(depth_bottleneck, 3, 3, 1, 1, mode='SAME_RESNET', 117 | use_batch_norm=True, bias=None) 118 | res = cnn.conv(depth, 1, 1, 1, 1, activation=None, 119 | use_batch_norm=False, bias=None) 120 | output = shortcut + res 121 | cnn.top_layer = output 122 | cnn.top_size = depth 123 | 124 | 125 | def bottleneck_block(cnn, depth, depth_bottleneck, stride, pre_activation): 126 | """Bottleneck block with identity short-cut. 127 | 128 | Args: 129 | cnn: the network to append bottleneck blocks. 130 | depth: the number of output filters for this bottleneck block. 131 | depth_bottleneck: the number of bottleneck filters for this block. 132 | stride: Stride used in the first layer of the bottleneck block. 133 | pre_activation: use pre_activation structure used in v2 or not. 
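Dispatches to the v1 or v2 block implementation depending on pre_activation.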
134 | """ 135 | if pre_activation: 136 | bottleneck_block_v2(cnn, depth, depth_bottleneck, stride) 137 | else: 138 | bottleneck_block_v1(cnn, depth, depth_bottleneck, stride) 139 | 140 | 141 | def residual_block(cnn, depth, stride, pre_activation): 142 | """Residual block with identity short-cut. 143 | 144 | Args: 145 | cnn: the network to append residual blocks. 146 | depth: the number of output filters for this residual block. 147 | stride: Stride used in the first layer of the residual block. 148 | pre_activation: use pre_activation structure or not. 149 | """ 150 | input_layer = cnn.top_layer 151 | in_size = cnn.top_size 152 | if in_size != depth: 153 | # Plan A of shortcut. 154 | shortcut = cnn.apool(1, 1, stride, stride, 155 | input_layer=input_layer, 156 | num_channels_in=in_size) 157 | padding = (depth - in_size) // 2 158 | if cnn.channel_pos == 'channels_last': 159 | shortcut = tf.pad( 160 | shortcut, [[0, 0], [0, 0], [0, 0], [padding, padding]]) 161 | else: 162 | shortcut = tf.pad( 163 | shortcut, [[0, 0], [padding, padding], [0, 0], [0, 0]]) 164 | else: 165 | shortcut = input_layer 166 | if pre_activation: 167 | res = cnn.batch_norm(input_layer) 168 | res = tf.nn.relu(res) 169 | else: 170 | res = input_layer 171 | cnn.conv(depth, 3, 3, stride, stride, 172 | input_layer=res, num_channels_in=in_size, 173 | use_batch_norm=True, bias=None) 174 | if pre_activation: 175 | res = cnn.conv(depth, 3, 3, 1, 1, activation=None, 176 | use_batch_norm=False, bias=None) 177 | output = shortcut + res 178 | else: 179 | res = cnn.conv(depth, 3, 3, 1, 1, activation=None, 180 | use_batch_norm=True, bias=None) 181 | output = tf.nn.relu(shortcut + res) 182 | cnn.top_layer = output 183 | cnn.top_size = depth 184 | 185 | 186 | class ResnetModel(model_lib.Model): 187 | """Resnet cnn network configuration.""" 188 | 189 | def __init__(self, model, layer_counts): 190 | default_batch_sizes = { 191 | 'resnet50': 64, 192 | 'resnet101': 32, 193 | 'resnet152': 32, 194 | 'resnet50_v2': 64, 195 | 'resnet101_v2': 32, 196 | 'resnet152_v2': 32, 197 | } 198 | batch_size = default_batch_sizes.get(model, 32) 199 | super(ResnetModel, self).__init__(model, 224, batch_size, 0.005, 200 | layer_counts) 201 | self.pre_activation = 'v2' in model 202 | 203 | def add_inference(self, cnn): 204 | if self.layer_counts is None: 205 | raise ValueError('Layer counts not specified for %s' % self.get_model()) 206 | cnn.use_batch_norm = True 207 | cnn.batch_norm_config = {'decay': 0.997, 'epsilon': 1e-5, 'scale': True} 208 | cnn.conv(64, 7, 7, 2, 2, mode='SAME_RESNET', use_batch_norm=True) 209 | cnn.mpool(3, 3, 2, 2, mode='SAME') 210 | for _ in xrange(self.layer_counts[0]): 211 | bottleneck_block(cnn, 256, 64, 1, self.pre_activation) 212 | for i in xrange(self.layer_counts[1]): 213 | stride = 2 if i == 0 else 1 214 | bottleneck_block(cnn, 512, 128, stride, self.pre_activation) 215 | for i in xrange(self.layer_counts[2]): 216 | stride = 2 if i == 0 else 1 217 | bottleneck_block(cnn, 1024, 256, stride, self.pre_activation) 218 | for i in xrange(self.layer_counts[3]): 219 | stride = 2 if i == 0 else 1 220 | bottleneck_block(cnn, 2048, 512, stride, self.pre_activation) 221 | if self.pre_activation: 222 | cnn.batch_norm() 223 | cnn.top_layer = tf.nn.relu(cnn.top_layer) 224 | cnn.spatial_mean() 225 | 226 | def get_learning_rate(self, global_step, batch_size): 227 | num_batches_per_epoch = ( 228 | float(datasets.IMAGENET_NUM_TRAIN_IMAGES) / batch_size) 229 | boundaries = [int(num_batches_per_epoch * x) for x in [30, 60]] 230 | values = [0.1, 
231 |     return tf.train.piecewise_constant(global_step, boundaries, values)
232 | 
233 | 
234 | def create_resnet50_model():
235 |   return ResnetModel('resnet50', (3, 4, 6, 3))
236 | 
237 | 
238 | def create_resnet50_v2_model():
239 |   return ResnetModel('resnet50_v2', (3, 4, 6, 3))
240 | 
241 | 
242 | def create_resnet101_model():
243 |   return ResnetModel('resnet101', (3, 4, 23, 3))
244 | 
245 | 
246 | def create_resnet101_v2_model():
247 |   return ResnetModel('resnet101_v2', (3, 4, 23, 3))
248 | 
249 | 
250 | def create_resnet152_model():
251 |   return ResnetModel('resnet152', (3, 8, 36, 3))
252 | 
253 | 
254 | def create_resnet152_v2_model():
255 |   return ResnetModel('resnet152_v2', (3, 8, 36, 3))
256 | 
257 | 
258 | class ResnetCifar10Model(model_lib.Model):
259 |   """Resnet cnn network configuration for Cifar 10 dataset.
260 | 
261 |   V1 model architecture follows the one defined in the paper:
262 |   https://arxiv.org/pdf/1512.03385.pdf.
263 | 
264 |   V2 model architecture follows the one defined in the paper:
265 |   https://arxiv.org/pdf/1603.05027.pdf.
266 |   """
267 | 
268 |   def __init__(self, model, layer_counts):
269 |     self.pre_activation = 'v2' in model
270 |     super(ResnetCifar10Model, self).__init__(
271 |         model, 32, 128, 0.1, layer_counts)
272 | 
273 |   def add_inference(self, cnn):
274 |     if self.layer_counts is None:
275 |       raise ValueError('Layer counts not specified for %s' % self.get_model())
276 | 
277 |     cnn.use_batch_norm = True
278 |     cnn.batch_norm_config = {'decay': 0.9, 'epsilon': 1e-5, 'scale': True}
279 |     if self.pre_activation:
280 |       cnn.conv(16, 3, 3, 1, 1, use_batch_norm=True)
281 |     else:
282 |       cnn.conv(16, 3, 3, 1, 1, activation=None, use_batch_norm=True)
283 |     for i in xrange(self.layer_counts[0]):
284 |       # output shape: batch_size x 16 x 32 x 32
285 |       residual_block(cnn, 16, 1, self.pre_activation)
286 |     for i in xrange(self.layer_counts[1]):
287 |       # Subsampling is performed by the first convolution, with a stride of 2.
288 |       stride = 2 if i == 0 else 1
289 |       # output shape: batch_size x 32 x 16 x 16
290 |       residual_block(cnn, 32, stride, self.pre_activation)
291 |     for i in xrange(self.layer_counts[2]):
292 |       stride = 2 if i == 0 else 1
293 |       # output shape: batch_size x 64 x 8 x 8
294 |       residual_block(cnn, 64, stride, self.pre_activation)
295 |     if self.pre_activation:
296 |       cnn.batch_norm()
297 |       cnn.top_layer = tf.nn.relu(cnn.top_layer)
298 |     cnn.spatial_mean()
299 | 
300 |   def get_learning_rate(self, global_step, batch_size):
301 |     num_batches_per_epoch = int(50000 / batch_size)
302 |     boundaries = num_batches_per_epoch * np.array([82, 123, 300],
303 |                                                   dtype=np.int64)
304 |     boundaries = list(boundaries)  # convert from np.ndarray to a list
305 |     values = [0.1, 0.01, 0.001, 0.0002]
306 |     return tf.train.piecewise_constant(global_step, boundaries, values)
307 | 
308 | 
309 | def create_resnet20_cifar_model():
310 |   return ResnetCifar10Model('resnet20', (3, 3, 3))
311 | 
312 | 
313 | def create_resnet20_v2_cifar_model():
314 |   return ResnetCifar10Model('resnet20_v2', (3, 3, 3))
315 | 
316 | 
317 | def create_resnet32_cifar_model():
318 |   return ResnetCifar10Model('resnet32', (5, 5, 5))
319 | 
320 | 
321 | def create_resnet32_v2_cifar_model():
322 |   return ResnetCifar10Model('resnet32_v2', (5, 5, 5))
323 | 
324 | 
325 | def create_resnet44_cifar_model():
326 |   return ResnetCifar10Model('resnet44', (7, 7, 7))
327 | 
328 | 
329 | def create_resnet44_v2_cifar_model():
330 |   return ResnetCifar10Model('resnet44_v2', (7, 7, 7))
331 | 
332 | 
333 | def create_resnet56_cifar_model():
334 |   return ResnetCifar10Model('resnet56', (9, 9, 9))
335 | 
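# A quick sanity check on the naming: each bottleneck block above contains
# 3 convolutions, so resnet50 stacks 3 * (3 + 4 + 6 + 3) = 48 convolutions
# plus the stem convolution and the final FC layer, i.e. 50 weight layers in
# total. The Cifar10 models in this file use 2-conv residual blocks instead,
# so e.g. resnet56 has 2 * (9 + 9 + 9) + 2 = 56 weight layers.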
336 | 
337 | def create_resnet56_v2_cifar_model():
338 |   return ResnetCifar10Model('resnet56_v2', (9, 9, 9))
339 | 
340 | 
341 | def create_resnet110_cifar_model():
342 |   return ResnetCifar10Model('resnet110', (18, 18, 18))
343 | 
344 | 
345 | def create_resnet110_v2_cifar_model():
346 |   return ResnetCifar10Model('resnet110_v2', (18, 18, 18))
347 | 
--------------------------------------------------------------------------------
/other-wrappers/README.md:
--------------------------------------------------------------------------------
1 | ## Benchmark CNN with other TF high-level APIs
2 | 
3 | Tensorpack is __1.2x~5x__ faster than equivalent code written in some other TF high-level APIs.
4 | 
5 | ### Benchmark setting:
6 | 
7 | * Hardware: AWS p3.16xlarge (8 Tesla V100s)
8 | * Software:
9 |   Python 3.6, TF 1.13.1, cuda 10, cudnn 7.4.2, Keras 2.1.5, tflearn 0.3.2, tensorpack 0.9.4.
10 | * Measurement: speed is measured in images per second (__larger is better__). The first epoch is warmup and
11 |   is excluded from timing. Second and later epochs differ by statistically insignificant amounts.
12 | * Data:
13 |   * Real data for Cifar10.
14 |   * For ImageNet, the input is assumed to be a constant numpy array already available in CPU memory.
15 |     This is a reasonable setting, because data always has to arrive in CPU memory from somewhere anyway.
16 | * __Source code is here__. Each script is under 100 lines, so you can easily
17 |   run it and __verify the numbers by yourself__.
18 | 
19 | ### On a Single GPU:
20 | | Task                           | tensorpack | Keras | tflearn |
21 | | ------------------------------ | ---------- | ----- | ------- |
22 | | Keras Official Cifar10 Example | __11904__  | 7142  | 5882    |
23 | | VGG16 on fake ImageNet         | __230__    | 204   | 194     |
24 | | AlexNet on fake ImageNet       | __2603__   | 1454  | N/A     |
25 | | ResNet50 on fake ImageNet      | __333__    | 266   | N/A     |
26 | 
27 | ### Data Parallel on 8 GPUs:
28 | 
29 | Each script in this section can be run as `./script.py NUM_GPU` to choose the number of GPUs.
30 | 
31 | |                     | 1 GPU | 2 GPUs | 8 GPUs   |
32 | | ------------------- | ----- | ------ | -------- |
33 | | tensorpack+ResNet50 | 333   | 579    | __2245__ |
34 | | Keras+ResNet50      | 266   | 320    | 470      |
35 | |                     |       |        |          |
36 | | tensorpack+VGG16    | 230   | 438    | __1651__ |
37 | | Keras+VGG16         | 204   | 304    | 449      |
38 | 
39 | 
40 | 
41 | Notes:
42 | 
43 | 1. With a better setting (different batch sizes, etc.) in [../ResNet-MultiGPU/](../ResNet-MultiGPU/),
44 |    tensorpack can further reach 2800 im/s for ResNet50 on a p3.16xlarge instance,
45 |    and 9225 im/s with all optimizations plus fp16 turned on.
46 | 2. It's possible to make Keras faster (with a better input pipeline, by building the data-parallel graph yourself, etc.), but that's NOT
47 |    how most users use Keras, nor how any of the Keras examples are written.
48 | 
49 |    Using Keras together with Tensorpack is one way to make Keras faster.
50 |    See the [Keras+Tensorpack example](https://github.com/tensorpack/tensorpack/tree/master/examples/keras).
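
For reference, the images-per-second metric above is plain arithmetic over wall time. Below is a minimal sketch of such a computation (not code from the benchmark scripts; the function name and arguments are made up for illustration), assuming you record the wall time of every training step and know the total batch size:

```
def images_per_sec(step_times, total_batch_size, warmup=1):
    """Throughput from a list of per-step wall times, in seconds.

    The first `warmup` measurements are dropped, mirroring how the
    first (warmup) epoch is excluded from the numbers above.
    """
    steady = step_times[warmup:]
    return total_batch_size * len(steady) / sum(steady)


# 3 steady steps of 0.1s each at total batch size 512 -> 5120 im/s.
print(images_per_sec([0.5, 0.1, 0.1, 0.1], 512))
```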
51 | -------------------------------------------------------------------------------- /other-wrappers/keras.alexnet.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import keras 4 | keras.backend.set_image_data_format('channels_first') 5 | from keras.models import Sequential 6 | from keras.layers import * 7 | from keras.utils import np_utils 8 | import numpy as np 9 | import sys 10 | 11 | try: 12 | NUM_GPU = int(sys.argv[1]) 13 | except IndexError: 14 | NUM_GPU = 1 15 | batch_size = 64 * NUM_GPU 16 | nb_classes = 1000 17 | 18 | img_rows, img_cols = 224, 224 19 | 20 | if keras.backend.image_data_format() == 'channels_first': 21 | X_train = np.random.random((batch_size, 3, img_rows, img_cols)).astype('float32') 22 | else: 23 | X_train = np.random.random((batch_size, img_rows, img_cols, 3)).astype('float32') 24 | Y_train = np.random.random((batch_size,)).astype('int32') 25 | Y_train = np_utils.to_categorical(Y_train, nb_classes) 26 | 27 | 28 | def gen(): 29 | while True: 30 | yield (X_train, Y_train) 31 | 32 | 33 | model = Sequential() 34 | model.add(Convolution2D(64, 11, strides=4, padding='valid', input_shape=X_train.shape[1:])) 35 | model.add(Activation('relu')) 36 | model.add(MaxPooling2D(pool_size=(3, 3), strides=2)) 37 | model.add(Convolution2D(192, 5, padding='same')) 38 | model.add(Activation('relu')) 39 | model.add(MaxPooling2D(pool_size=(3, 3), strides=2)) 40 | 41 | model.add(Convolution2D(384, 3, padding='same')) 42 | model.add(Activation('relu')) 43 | model.add(Convolution2D(256, 3, padding='same')) 44 | model.add(Activation('relu')) 45 | model.add(Convolution2D(256, 3, padding='same')) 46 | model.add(Activation('relu')) 47 | model.add(MaxPooling2D(pool_size=(3, 3), strides=2)) 48 | 49 | model.add(Flatten()) 50 | model.add(Dense(4096)) 51 | model.add(Activation('relu')) 52 | model.add(Dropout(0.5)) 53 | model.add(Dense(4096)) 54 | model.add(Activation('relu')) 55 | model.add(Dropout(0.5)) 56 | model.add(Dense(nb_classes)) 57 | model.add(Activation('softmax')) 58 | 59 | for l in model.layers: 60 | print(l.input_shape, l.output_shape) 61 | 62 | if NUM_GPU != 1: 63 | model = keras.utils.multi_gpu_model(model, NUM_GPU) 64 | 65 | # Let's train the model using RMSprop 66 | model.compile(loss='categorical_crossentropy', 67 | optimizer='rmsprop', 68 | metrics=['accuracy']) 69 | 70 | model.fit_generator(gen(), epochs=100, steps_per_epoch=200) 71 | -------------------------------------------------------------------------------- /other-wrappers/keras.cifar10.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | '''Train a simple deep CNN on the CIFAR10 small images dataset. 3 | 4 | It gets to 75% validation accuracy in 25 epochs, and 79% after 50 epochs. 5 | (it's still underfitting at that point, though). 6 | ''' 7 | 8 | from __future__ import print_function 9 | import keras 10 | from keras.datasets import cifar10 11 | from keras.models import Sequential 12 | from keras.layers import Dense, Dropout, Activation, Flatten 13 | from keras.layers import Conv2D, MaxPooling2D 14 | 15 | batch_size = 32 16 | num_classes = 10 17 | epochs = 100 18 | 19 | # The data, split between train and test sets: 20 | (x_train, y_train), (x_test, y_test) = cifar10.load_data() 21 | print('x_train shape:', x_train.shape) 22 | print(x_train.shape[0], 'train samples') 23 | print(x_test.shape[0], 'test samples') 24 | 25 | # Convert class vectors to binary class matrices. 
26 | y_train = keras.utils.to_categorical(y_train, num_classes) 27 | y_test = keras.utils.to_categorical(y_test, num_classes) 28 | 29 | model = Sequential() 30 | model.add(Conv2D(32, (3, 3), padding='same', 31 | input_shape=x_train.shape[1:])) 32 | model.add(Activation('relu')) 33 | model.add(Conv2D(32, (3, 3))) 34 | model.add(Activation('relu')) 35 | model.add(MaxPooling2D(pool_size=(2, 2))) 36 | model.add(Dropout(0.25)) 37 | 38 | model.add(Conv2D(64, (3, 3), padding='same')) 39 | model.add(Activation('relu')) 40 | model.add(Conv2D(64, (3, 3))) 41 | model.add(Activation('relu')) 42 | model.add(MaxPooling2D(pool_size=(2, 2))) 43 | model.add(Dropout(0.25)) 44 | 45 | model.add(Flatten()) 46 | model.add(Dense(512)) 47 | model.add(Activation('relu')) 48 | model.add(Dropout(0.5)) 49 | model.add(Dense(num_classes)) 50 | model.add(Activation('softmax')) 51 | 52 | # initiate RMSprop optimizer 53 | opt = keras.optimizers.rmsprop(lr=0.0001, decay=1e-6) 54 | 55 | # Let's train the model using RMSprop 56 | model.compile(loss='categorical_crossentropy', 57 | optimizer=opt, 58 | metrics=['accuracy']) 59 | 60 | x_train = x_train.astype('float32') 61 | x_test = x_test.astype('float32') 62 | x_train /= 255 63 | x_test /= 255 64 | 65 | print('Not using data augmentation.') 66 | model.fit(x_train, y_train, 67 | batch_size=batch_size, 68 | epochs=epochs, 69 | # validation_data=(x_test, y_test), 70 | shuffle=True) 71 | -------------------------------------------------------------------------------- /other-wrappers/keras.resnet.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import keras 4 | keras.backend.set_image_data_format('channels_first') 5 | from keras.applications.resnet50 import ResNet50 6 | from keras.utils import np_utils 7 | import numpy as np 8 | import sys 9 | 10 | try: 11 | NUM_GPU = int(sys.argv[1]) 12 | except IndexError: 13 | NUM_GPU = 1 14 | 15 | batch_size = 32 * NUM_GPU 16 | 17 | img_rows, img_cols = 224, 224 18 | 19 | if keras.backend.image_data_format() == 'channels_first': 20 | X_train = np.random.random((batch_size, 3, img_rows, img_cols)).astype('float32') 21 | else: 22 | X_train = np.random.random((batch_size, img_rows, img_cols, 3)).astype('float32') 23 | Y_train = np.random.random((batch_size,)).astype('int32') 24 | Y_train = np_utils.to_categorical(Y_train, 1000) 25 | 26 | 27 | def gen(): 28 | while True: 29 | yield (X_train, Y_train) 30 | 31 | 32 | model = ResNet50(weights=None, input_shape=X_train.shape[1:]) 33 | 34 | if NUM_GPU != 1: 35 | model = keras.utils.multi_gpu_model(model, gpus=NUM_GPU) 36 | 37 | for l in model.layers: 38 | print(l.input_shape, l.output_shape) 39 | 40 | model.compile(loss='categorical_crossentropy', 41 | optimizer='sgd', metrics=['accuracy']) 42 | 43 | model.fit_generator(gen(), epochs=100, steps_per_epoch=50) 44 | -------------------------------------------------------------------------------- /other-wrappers/keras.vgg.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | import keras 5 | keras.backend.set_image_data_format('channels_first') 6 | from keras.models import Model 7 | from keras.utils import np_utils 8 | from keras.layers import * 9 | import numpy as np 10 | 11 | try: 12 | NUM_GPU = int(sys.argv[1]) 13 | except IndexError: 14 | NUM_GPU = 1 15 | batch_size = 64 * NUM_GPU 16 | nb_classes = 1000 17 | nb_epoch = 200 18 | 19 | img_rows, img_cols = 224, 224 20 | 21 | if 
keras.backend.image_data_format() == 'channels_first': 22 | X_train = np.random.random((batch_size, 3, img_rows, img_cols)) 23 | else: 24 | X_train = np.random.random((batch_size, img_rows, img_cols, 3)) 25 | Y_train = np.random.random((batch_size,)).astype('int32') 26 | Y_train = np_utils.to_categorical(Y_train, nb_classes) 27 | 28 | 29 | def gen(): 30 | while True: 31 | yield (X_train, Y_train) 32 | 33 | 34 | img_input = Input(shape=X_train.shape[1:], dtype="float32") 35 | # Block 1 36 | x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1')(img_input) 37 | x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2')(x) 38 | x = MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool')(x) 39 | 40 | # Block 2 41 | x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv1')(x) 42 | x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv2')(x) 43 | x = MaxPooling2D((2, 2), strides=(2, 2), name='block2_pool')(x) 44 | 45 | # Block 3 46 | x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv1')(x) 47 | x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv2')(x) 48 | x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv3')(x) 49 | x = MaxPooling2D((2, 2), strides=(2, 2), name='block3_pool')(x) 50 | 51 | # Block 4 52 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv1')(x) 53 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv2')(x) 54 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv3')(x) 55 | x = MaxPooling2D((2, 2), strides=(2, 2), name='block4_pool')(x) 56 | 57 | # Block 5 58 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv1')(x) 59 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv2')(x) 60 | x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv3')(x) 61 | x = MaxPooling2D((2, 2), strides=(2, 2), name='block5_pool')(x) 62 | 63 | x = Flatten(name='flatten')(x) 64 | x = Dense(4096, activation='relu', name='fc1')(x) 65 | x = Dense(4096, activation='relu', name='fc2')(x) 66 | x = Dense(nb_classes, activation='softmax', name='predictions')(x) 67 | 68 | model = Model(img_input, x, name='vgg16') 69 | 70 | if NUM_GPU != 1: 71 | model = keras.utils.multi_gpu_model(model, NUM_GPU) 72 | 73 | # Let's train the model using RMSprop 74 | model.compile(loss='categorical_crossentropy', 75 | optimizer='rmsprop', 76 | metrics=['accuracy']) 77 | 78 | model.fit_generator(gen(), epochs=nb_epoch, steps_per_epoch=50) 79 | -------------------------------------------------------------------------------- /other-wrappers/tensorpack.alexnet.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: tensorpack.alexnet.py 4 | import tensorflow as tf 5 | import numpy as np 6 | from tensorpack import * 7 | 8 | BATCH = 64 9 | 10 | 11 | class Model(ModelDesc): 12 | def inputs(self): 13 | return [tf.TensorSpec([BATCH, 3, 224, 224], tf.float32, 'input'), 14 | tf.TensorSpec([BATCH], tf.int32, 'label')] 15 | 16 | def build_graph(self, image, label): 17 | image = image / 255.0 18 | 19 | with argscope(Conv2D, activation=tf.nn.relu, kernel_size=3), \ 20 | argscope([Conv2D, MaxPooling], data_format='channels_first'): 21 | logits = (LinearWrap(image) 22 | .Conv2D('conv1_1', 64, kernel_size=11, strides=4, padding='VALID') 23 | 
.MaxPooling('pool1', 3, 2) 24 | .Conv2D('conv1_2', 192, kernel_size=5) 25 | .MaxPooling('pool2', 3, 2) 26 | 27 | .Conv2D('conv3', 384) 28 | .Conv2D('conv4', 256) 29 | .Conv2D('conv5', 256) 30 | .MaxPooling('pool3', 3, 2) 31 | 32 | .FullyConnected('fc6', 4096, activation=tf.nn.relu) 33 | .Dropout('drop0', rate=0.5) 34 | .FullyConnected('fc7', 4096, activation=tf.nn.relu) 35 | .Dropout('drop1', rate=0.5) 36 | .FullyConnected('fc8', 1000, activation=tf.identity)()) 37 | 38 | cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label) 39 | cost = tf.reduce_mean(cost, name='cost') 40 | return cost 41 | 42 | def optimizer(self): 43 | return tf.train.RMSPropOptimizer(1e-3, epsilon=1e-8) 44 | 45 | 46 | def get_data(): 47 | X_train = np.random.random((BATCH, 3, 224, 224)).astype('float32') 48 | Y_train = np.random.random((BATCH,)).astype('int32') 49 | 50 | def gen(): 51 | while True: 52 | yield [X_train, Y_train] 53 | return DataFromGenerator(gen) 54 | 55 | 56 | if __name__ == '__main__': 57 | dataset_train = get_data() 58 | config = TrainConfig( 59 | model=Model(), 60 | data=StagingInput(QueueInput(dataset_train)), 61 | callbacks=[], 62 | max_epoch=100, 63 | steps_per_epoch=200, 64 | ) 65 | launch_train_with_config(config, SimpleTrainer()) 66 | -------------------------------------------------------------------------------- /other-wrappers/tensorpack.cifar10.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: tensorpack.cifar10.py 4 | import tensorflow as tf 5 | from tensorpack import * 6 | 7 | 8 | class Model(ModelDesc): 9 | def inputs(self): 10 | return [tf.TensorSpec([None, 32, 32, 3], tf.float32, 'input'), 11 | tf.TensorSpec([None], tf.int32, 'label')] 12 | 13 | def build_graph(self, image, label): 14 | image = tf.transpose(image, [0, 3, 1, 2]) 15 | image = image / 255.0 16 | 17 | with argscope(Conv2D, activation=tf.nn.relu, kernel_size=3, padding='VALID'), \ 18 | argscope([Conv2D, MaxPooling], data_format='NCHW'): 19 | logits = (LinearWrap(image) 20 | .Conv2D('conv0', 32, padding='SAME') 21 | .Conv2D('conv1', 32) 22 | .MaxPooling('pool0', 2) 23 | .Dropout(rate=0.25) 24 | .Conv2D('conv2', 64, padding='SAME') 25 | .Conv2D('conv3', 64) 26 | .MaxPooling('pool1', 2) 27 | .Dropout(rate=0.25) 28 | .FullyConnected('fc1', 512, activation=tf.nn.relu) 29 | .Dropout(rate=0.5) 30 | .FullyConnected('linear', 10, activation=tf.identity)()) 31 | 32 | cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label) 33 | cost = tf.reduce_mean(cost, name='cost') 34 | 35 | wrong = tf.cast(tf.logical_not(tf.nn.in_top_k(logits, label, 1)), tf.float32, name='wrong') 36 | tf.reduce_mean(wrong, name='train_error') 37 | # no weight decay 38 | return cost 39 | 40 | def optimizer(self): 41 | lr = tf.get_variable('learning_rate', initializer=1e-4, trainable=False) 42 | return tf.train.RMSPropOptimizer(lr, epsilon=1e-8) 43 | 44 | 45 | def get_data(train_or_test): 46 | isTrain = train_or_test == 'train' 47 | ds = dataset.Cifar10(train_or_test) 48 | ds = BatchData(ds, 32, remainder=not isTrain) 49 | return ds 50 | 51 | 52 | if __name__ == '__main__': 53 | dataset_train = get_data('train') 54 | dataset_test = get_data('test') 55 | config = TrainConfig( 56 | model=Model(), 57 | data=QueueInput(dataset_train, 58 | queue=tf.FIFOQueue(300, [tf.float32, tf.int32])), 59 | # callbacks=[InferenceRunner(dataset_test, ClassificationError('wrong'))], # skip validation 60 | callbacks=[], 61 | # 
keras monitors these two metrics live during training; do the same here (no overhead actually)
62 |         extra_callbacks=[ProgressBar(['cost', 'train_error']), MergeAllSummaries()],
63 |         max_epoch=200,
64 |     )
65 |     launch_train_with_config(config, SimpleTrainer())
66 | 
--------------------------------------------------------------------------------
/other-wrappers/tensorpack.resnet.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | # File: tensorpack.resnet.py
4 | import tensorflow as tf
5 | import sys
6 | import numpy as np
7 | from tensorpack import *
8 | 
9 | BATCH = 32      # tensorpack's "batch" is per-GPU batch.
10 | try:
11 |     NUM_GPU = int(sys.argv[1])
12 | except IndexError:
13 |     NUM_GPU = 1
14 | 
15 | 
16 | def resnet_shortcut(l, n_out, stride, activation=tf.identity):
17 |     n_in = l.get_shape().as_list()[1]
18 |     if n_in != n_out:   # project the shortcut when the channel count changes
19 |         return Conv2D('convshortcut', l, n_out, 1, strides=stride, activation=activation)
20 |     else:
21 |         return l
22 | 
23 | 
24 | def block_func(l, ch_out, stride):
25 |     BN = lambda x, name=None: BatchNorm('bn', x)
26 |     shortcut = l
27 |     l = Conv2D('conv1', l, ch_out, 1, strides=stride, activation=BNReLU)
28 |     l = Conv2D('conv2', l, ch_out, 3, strides=1, activation=BNReLU)
29 |     l = Conv2D('conv3', l, ch_out * 4, 1, activation=BN)
30 |     return tf.nn.relu(l + resnet_shortcut(shortcut, ch_out * 4, stride, activation=BN))
31 | 
32 | 
33 | def group_func(l, name, block_func, features, count, stride):
34 |     with tf.variable_scope(name):
35 |         for i in range(0, count):
36 |             with tf.variable_scope('block{}'.format(i)):
37 |                 l = block_func(l, features, stride if i == 0 else 1)
38 |     return l
39 | 
40 | 
41 | class Model(ModelDesc):
42 |     def inputs(self):
43 |         return [tf.TensorSpec([None, 3, 224, 224], tf.float32, 'input'),
44 |                 tf.TensorSpec([None], tf.int32, 'label')]
45 | 
46 |     def build_graph(self, image, label):
47 |         image = image / 255.0
48 | 
49 |         num_blocks = [3, 4, 6, 3]
50 |         with argscope([Conv2D, MaxPooling, BatchNorm, GlobalAvgPooling], data_format='channels_first'), \
51 |                 argscope(Conv2D, use_bias=False):
52 |             logits = (LinearWrap(image)
53 |                       .tf.pad([[0, 0], [0, 0], [3, 3], [3, 3]])  # pad H and W by 3 each (NCHW layout)
54 |                       .Conv2D('conv0', 64, 7, strides=2, activation=BNReLU, padding='VALID')
55 |                       .MaxPooling('pool0', 3, strides=2, padding='SAME')
56 |                       .apply(group_func, 'group0', block_func, 64, num_blocks[0], 1)
57 |                       .apply(group_func, 'group1', block_func, 128, num_blocks[1], 2)
58 |                       .apply(group_func, 'group2', block_func, 256, num_blocks[2], 2)
59 |                       .apply(group_func, 'group3', block_func, 512, num_blocks[3], 2)
60 |                       .GlobalAvgPooling('gap')
61 |                       .FullyConnected('linear', 1000, activation=tf.identity)())
62 | 
63 |         cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label)
64 |         cost = tf.reduce_mean(cost, name='cost')
65 |         return cost
66 | 
67 |     def optimizer(self):
68 |         return tf.train.GradientDescentOptimizer(1e-3)
69 | 
70 | 
71 | def get_data():
72 |     X_train = np.random.random((BATCH, 3, 224, 224)).astype('float32')
73 |     Y_train = np.random.random((BATCH,)).astype('int32')
74 | 
75 |     def gen():
76 |         while True:
77 |             yield [X_train, Y_train]
78 |     return DataFromGenerator(gen)
79 | 
80 | 
81 | if __name__ == '__main__':
82 |     dataset_train = get_data()
83 |     config = TrainConfig(
84 |         model=Model(),
85 |         dataflow=dataset_train,
86 |         callbacks=[],
87 |         max_epoch=100,
88 |         steps_per_epoch=50,
89 |     )
90 |     trainer = SyncMultiGPUTrainerReplicated(
91 |         NUM_GPU,
mode='hierarchical' if NUM_GPU == 8 else 'cpu') 92 | launch_train_with_config(config, trainer) 93 | -------------------------------------------------------------------------------- /other-wrappers/tensorpack.vgg.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # File: tensorpack.vgg.py 4 | import tensorflow as tf 5 | import numpy as np 6 | import sys 7 | from tensorpack import * 8 | 9 | BATCH = 64 # tensorpack's "batch" is per-GPU batch. 10 | try: 11 | NUM_GPU = int(sys.argv[1]) 12 | except IndexError: 13 | NUM_GPU = 1 14 | 15 | 16 | class Model(ModelDesc): 17 | def inputs(self): 18 | return [tf.TensorSpec([None, 3, 224, 224], tf.float32, 'input'), 19 | tf.TensorSpec([None], tf.int32, 'label')] 20 | 21 | def build_graph(self, image, label): 22 | image = image / 255.0 23 | 24 | with argscope(Conv2D, activation=tf.nn.relu, kernel_size=3), \ 25 | argscope([Conv2D, MaxPooling], data_format='channels_first'): 26 | logits = (LinearWrap(image) 27 | .Conv2D('conv1_1', 64) 28 | .Conv2D('conv1_2', 64) 29 | .MaxPooling('pool1', 2) 30 | # 112 31 | .Conv2D('conv2_1', 128) 32 | .Conv2D('conv2_2', 128) 33 | .MaxPooling('pool2', 2) 34 | # 56 35 | .Conv2D('conv3_1', 256) 36 | .Conv2D('conv3_2', 256) 37 | .Conv2D('conv3_3', 256) 38 | .MaxPooling('pool3', 2) 39 | # 28 40 | .Conv2D('conv4_1', 512) 41 | .Conv2D('conv4_2', 512) 42 | .Conv2D('conv4_3', 512) 43 | .MaxPooling('pool4', 2) 44 | # 14 45 | .Conv2D('conv5_1', 512) 46 | .Conv2D('conv5_2', 512) 47 | .Conv2D('conv5_3', 512) 48 | .MaxPooling('pool5', 2) 49 | # 7 50 | .FullyConnected('fc6', 4096, activation=tf.nn.relu) 51 | .FullyConnected('fc7', 4096, activation=tf.nn.relu) 52 | .FullyConnected('fc8', 1000, activation=tf.identity)()) 53 | 54 | cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label) 55 | cost = tf.reduce_mean(cost, name='cost') 56 | return cost 57 | 58 | def optimizer(self): 59 | return tf.train.RMSPropOptimizer(1e-3, epsilon=1e-8) 60 | 61 | 62 | def get_data(): 63 | X_train = np.random.random((BATCH, 3, 224, 224)).astype('float32') 64 | Y_train = np.random.random((BATCH,)).astype('int32') 65 | 66 | def gen(): 67 | while True: 68 | yield [X_train, Y_train] 69 | return DataFromGenerator(gen) 70 | 71 | 72 | if __name__ == '__main__': 73 | dataset_train = get_data() 74 | config = TrainConfig( 75 | model=Model(), 76 | data=StagingInput(QueueInput(dataset_train)), 77 | callbacks=[], 78 | extra_callbacks=[ProgressBar(['cost'])], 79 | max_epoch=200, 80 | steps_per_epoch=50, 81 | ) 82 | if NUM_GPU == 1: 83 | trainer = SimpleTrainer() 84 | else: 85 | trainer = SyncMultiGPUTrainerReplicated(NUM_GPU, mode='nccl') 86 | launch_train_with_config(config, trainer) 87 | -------------------------------------------------------------------------------- /other-wrappers/tflearn.cifar10.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ Convolutional network applied to CIFAR-10 dataset classification task. 5 | 6 | References: 7 | Learning Multiple Layers of Features from Tiny Images, A. Krizhevsky, 2009. 
8 | 9 | Links: 10 | [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html) 11 | 12 | """ 13 | from __future__ import division, print_function, absolute_import 14 | 15 | import tflearn 16 | from tflearn.data_utils import shuffle, to_categorical 17 | from tflearn.layers.core import input_data, dropout, fully_connected 18 | from tflearn.layers.conv import conv_2d, max_pool_2d 19 | from tflearn.layers.estimator import regression 20 | from tflearn.data_preprocessing import ImagePreprocessing 21 | from tflearn.data_augmentation import ImageAugmentation 22 | 23 | # Data loading and preprocessing 24 | from tflearn.datasets import cifar10 25 | (X, Y), (X_test, Y_test) = cifar10.load_data() 26 | X, Y = shuffle(X, Y) 27 | Y = to_categorical(Y, 10) 28 | Y_test = to_categorical(Y_test, 10) 29 | 30 | # Real-time data preprocessing 31 | img_prep = ImagePreprocessing() 32 | 33 | # Real-time data augmentation 34 | img_aug = ImageAugmentation() 35 | 36 | # Convolutional network building 37 | network = input_data(shape=[None, 32, 32, 3], 38 | data_preprocessing=img_prep, 39 | data_augmentation=img_aug) 40 | network = conv_2d(network, 32, 3, activation='relu') 41 | network = conv_2d(network, 32, 3, activation='relu', padding='valid') 42 | network = max_pool_2d(network, 2) 43 | network = dropout(network, 0.5) 44 | 45 | network = conv_2d(network, 64, 3, activation='relu') 46 | network = conv_2d(network, 64, 3, activation='relu', padding='valid') 47 | network = max_pool_2d(network, 2) 48 | network = dropout(network, 0.5) 49 | network = fully_connected(network, 512, activation='relu') 50 | network = dropout(network, 0.5) 51 | network = fully_connected(network, 10, activation='softmax') 52 | network = regression(network, optimizer='rmsprop', 53 | loss='categorical_crossentropy', 54 | learning_rate=0.0001) 55 | 56 | # Train using classifier 57 | model = tflearn.DNN(network, tensorboard_verbose=0) 58 | model.fit(X, Y, n_epoch=50, shuffle=True, 59 | show_metric=True, batch_size=32, run_id='cifar10_cnn') 60 | -------------------------------------------------------------------------------- /other-wrappers/tflearn.vgg.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | from __future__ import division, print_function, absolute_import 5 | 6 | import tflearn 7 | from tflearn.data_utils import to_categorical 8 | from tflearn.layers.core import input_data, dropout, fully_connected 9 | import numpy as np 10 | from tflearn.layers.conv import conv_2d, max_pool_2d 11 | from tflearn.layers.estimator import regression 12 | 13 | batch_size = 64 14 | img_rows, img_cols = 224, 224 15 | # Data loading and preprocessing 16 | X = np.random.random((batch_size, img_rows, img_cols, 3)) 17 | Y = np.random.random((batch_size,)).astype('int32') 18 | Y = to_categorical(Y, 1000) 19 | 20 | # Convolutional network building 21 | network = input_data(shape=[None, img_rows, img_cols, 3]) 22 | network = conv_2d(network, 64, 3, activation='relu') 23 | network = conv_2d(network, 64, 3, activation='relu') 24 | network = max_pool_2d(network, 2) 25 | 26 | network = conv_2d(network, 128, 3, activation='relu') 27 | network = conv_2d(network, 128, 3, activation='relu') 28 | network = max_pool_2d(network, 2) 29 | 30 | network = conv_2d(network, 256, 3, activation='relu') 31 | network = conv_2d(network, 256, 3, activation='relu') 32 | network = conv_2d(network, 256, 3, activation='relu') 33 | network = max_pool_2d(network, 2) 34 | 35 | network = 
conv_2d(network, 512, 3, activation='relu') 36 | network = conv_2d(network, 512, 3, activation='relu') 37 | network = conv_2d(network, 512, 3, activation='relu') 38 | network = max_pool_2d(network, 2) 39 | 40 | network = conv_2d(network, 512, 3, activation='relu') 41 | network = conv_2d(network, 512, 3, activation='relu') 42 | network = conv_2d(network, 512, 3, activation='relu') 43 | network = max_pool_2d(network, 2) 44 | 45 | network = fully_connected(network, 4096, activation='relu') 46 | network = dropout(network, 0.5) 47 | network = fully_connected(network, 4096, activation='relu') 48 | network = dropout(network, 0.5) 49 | network = fully_connected(network, 1000, activation='softmax') 50 | network = regression(network, optimizer='rmsprop', 51 | loss='categorical_crossentropy', 52 | learning_rate=0.001) 53 | 54 | # Train using classifier 55 | model = tflearn.DNN(network, tensorboard_verbose=0) 56 | import time 57 | while True: 58 | start = time.time() 59 | model.fit(X, Y, n_epoch=50, snapshot_step=99999) 60 | print("Time:", time.time() - start) 61 | -------------------------------------------------------------------------------- /profile-import/README.md: -------------------------------------------------------------------------------- 1 | 2 | This small script shows that the import time of tensorpack is negligible compared to tensorflow. 3 | -------------------------------------------------------------------------------- /profile-import/import_profiler.py: -------------------------------------------------------------------------------- 1 | import time 2 | 3 | 4 | class ImportInfo(object): 5 | def __init__(self, name, context_name, counter): 6 | self.name = name 7 | self.context_name = context_name 8 | self._counter = counter 9 | self._depth = 0 10 | self._start = time.time() 11 | 12 | self.elapsed = None 13 | 14 | def done(self): 15 | self.elapsed = time.time() - self._start 16 | 17 | @property 18 | def _key(self): 19 | return self.name, self.context_name, self._counter 20 | 21 | def __repr__(self): 22 | return "ImportInfo({!r}, {!r}, {!r})".format(*self._key) 23 | 24 | def __hash__(self): 25 | return hash(self._key) 26 | 27 | def __eq__(self, other): 28 | if isinstance(other, ImportInfo): 29 | return other._key == self._key 30 | return NotImplemented 31 | 32 | def __ne__(self, other): 33 | return not self == other 34 | 35 | 36 | class ImportStack(object): 37 | def __init__(self): 38 | self._current_stack = [] 39 | self._full_stack = {} 40 | self._counter = 0 41 | 42 | def push(self, name, context_name): 43 | info = ImportInfo(name, context_name, self._counter) 44 | self._counter += 1 45 | 46 | if len(self._current_stack) > 0: 47 | parent = self._current_stack[-1] 48 | if parent not in self._full_stack: 49 | self._full_stack[parent] = [] 50 | self._full_stack[parent].append(info) 51 | self._current_stack.append(info) 52 | 53 | info._depth = len(self._current_stack) - 1 54 | 55 | return info 56 | 57 | def pop(self, import_info): 58 | top = self._current_stack.pop() 59 | assert top is import_info 60 | top.done() 61 | 62 | 63 | def compute_intime(parent, full_stack, ordered_visited, visited, depth=0): 64 | if parent in visited: 65 | return 66 | 67 | cumtime = intime = parent.elapsed 68 | visited[parent] = [cumtime, parent.name, parent.context_name, depth] 69 | ordered_visited.append(parent) 70 | 71 | for child in full_stack.get(parent, []): 72 | intime -= child.elapsed 73 | compute_intime(child, full_stack, ordered_visited, visited, depth + 1) 74 | 75 | visited[parent].append(intime) 76 | 
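# A worked example of the two quantities computed above (numbers are
# illustrative only): if `import pkg` takes 10 ms of wall time in total, and
# while running it triggers `import pkg.sub` which takes 6 ms, then for `pkg`
# the cumulative time (cumtime) is 10 ms while the in-module time (intime) is
# 10 - 6 = 4 ms. `ImportProfilerContext.print_info` below reports both values
# in milliseconds.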
77 | 78 | class ImportProfilerContext(object): 79 | def __init__(self): 80 | self._original_importer = __builtins__["__import__"] 81 | self._import_stack = ImportStack() 82 | 83 | def enable(self): 84 | __builtins__["__import__"] = self._profiled_import 85 | 86 | def disable(self): 87 | __builtins__["__import__"] = self._original_importer 88 | 89 | def print_info(self, threshold=1.): 90 | """ Print profiler results. 91 | 92 | Parameters 93 | ---------- 94 | threshold : float 95 | import statements taking less than threshold (in ms) will not be 96 | displayed. 97 | """ 98 | full_stack = self._import_stack._full_stack 99 | 100 | keys = sorted(full_stack.keys(), key=lambda p: p._counter) 101 | visited = {} 102 | ordered_visited = [] 103 | 104 | for key in keys: 105 | compute_intime(key, full_stack, ordered_visited, visited) 106 | 107 | lines = [] 108 | for k in ordered_visited: 109 | node = visited[k] 110 | cumtime = node[0] * 1000 111 | name = node[1] 112 | level = node[3] 113 | intime = node[-1] * 1000 114 | if cumtime > threshold and level < 6: 115 | lines.append(( 116 | "{:.1f}".format(cumtime), 117 | "{:.1f}".format(intime), 118 | "+" * level + name, 119 | )) 120 | 121 | # Import here to avoid messing with the profile 122 | import tabulate 123 | 124 | print( 125 | tabulate.tabulate( 126 | lines, headers=("cumtime (ms)", "intime (ms)", "name"), tablefmt="plain") 127 | ) 128 | 129 | # Protocol implementations 130 | def __enter__(self): 131 | self.enable() 132 | return self 133 | 134 | def __exit__(self, *a, **kw): 135 | self.disable() 136 | 137 | def _profiled_import(self, name, globals=None, locals=None, fromlist=None, 138 | level=0, *a, **kw): 139 | if globals is None: 140 | context_name = None 141 | else: 142 | context_name = globals.get("__name__") 143 | if context_name is None: 144 | context_name = globals.get("__file__") 145 | 146 | info = self._import_stack.push(name, context_name) 147 | try: 148 | return self._original_importer(name, globals, locals, fromlist, level, *a, **kw) 149 | finally: 150 | self._import_stack.pop(info) 151 | 152 | 153 | def profile_import(): 154 | return ImportProfilerContext() 155 | -------------------------------------------------------------------------------- /profile-import/profile-import.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # flake8: noqa 3 | # -*- coding: utf-8 -*- 4 | # File: profile-import.py 5 | 6 | import sys 7 | from import_profiler import profile_import 8 | 9 | if __name__ == '__main__': 10 | task = sys.argv[1] 11 | 12 | if task == 'contrib': 13 | import tensorflow 14 | with profile_import() as context: 15 | import tensorflow.contrib 16 | context.print_info(threshold=5) 17 | elif task == 'tensorpack': 18 | import tensorflow.contrib.framework 19 | import cv2 20 | with profile_import() as context: 21 | import tensorpack 22 | context.print_info(threshold=5) 23 | elif task == 'timing': 24 | import tensorflow.contrib.framework 25 | import cv2 26 | import time 27 | s = time.time() 28 | import tensorpack 29 | print(time.time() - s) 30 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [flake8] 2 | max-line-length = 120 3 | ignore = E741,E742,E743,F405,F403,E402,E731 4 | exclude = .git, 5 | ResNet-MultiGPU/tfbench 6 | --------------------------------------------------------------------------------