├── README.md
├── configs
│   ├── cifar10_cnn.yaml
│   ├── cifar10_resnet.yaml
│   ├── hello.yaml
│   └── imagenet_resnet.yaml
├── data
│   ├── __init__.py
│   ├── cifar10.py
│   ├── dummy.py
│   └── imagenet.py
├── logs
│   └── README.md
├── models
│   ├── __init__.py
│   ├── cnn.py
│   └── resnet.py
├── notebooks
│   └── Analysis.ipynb
├── scripts
│   ├── cifar_cnn.sh
│   └── cifar_resnet.sh
├── train_horovod.py
└── utils
    ├── __init__.py
    ├── callbacks.py
    ├── device.py
    └── optimizers.py

/README.md:
--------------------------------------------------------------------------------
1 | # Deep Learning for Science School Tutorial: Deep Learning At Scale
2 |
3 | This repository contains the material for the DL4Sci tutorial:
4 | *Deep Learning at Scale*.
5 |
6 | It contains specifications for datasets, a couple of CNN models, and
7 | all the training code to enable training the models in a distributed fashion
8 | using Horovod.
9 |
10 | As part of the tutorial, you will train a ResNet model to classify images
11 | from the CIFAR10 dataset on multiple nodes with synchronous data parallelism.
12 |
13 | **Contents**
14 | * [Links](https://github.com/NERSC/dl4sci-scaling-tutorial#links)
15 | * [Installation](https://github.com/NERSC/dl4sci-scaling-tutorial#installation)
16 | * [Navigating the repository](https://github.com/NERSC/dl4sci-scaling-tutorial#navigating-the-repository)
17 | * [Hands-on walk-through](https://github.com/NERSC/dl4sci-scaling-tutorial#hands-on-multi-node-training-example)
18 | * [Code references](https://github.com/NERSC/dl4sci-scaling-tutorial#code-references)
19 |
20 | ## Links
21 |
22 | NERSC JupyterHub: https://jupyter-dl.nersc.gov
23 |
24 | Slides: https://drive.google.com/drive/folders/10NqOLaqPTZ0nobE7JNSaGwrXQi_ACD35?usp=sharing
25 |
26 | ## Installation
27 |
28 | 1. Start a terminal on Cori, either via ssh or from the Jupyter interface.
29 |     * **IMPORTANT: if using Jupyter, you need to use a SHARED CPU. Click the CPU button instead of the GPU button to run this example!**
30 | 2. Clone the repository using git:\
31 | `git clone https://github.com/NERSC/dl4sci-scaling-tutorial.git`
32 |
33 | That's it! The rest of the software (Keras, TensorFlow) is pre-installed on Cori
34 | and loaded via the scripts used below.
35 |
36 | ## Navigating the repository
37 |
38 | **`train_horovod.py`** - the main training script which can be steered with YAML
39 | configuration files.
40 |
41 | **`data/`** - folder containing the specifications of the datasets. Each dataset
42 | has a corresponding name which is mapped to the specification in `data/__init__.py`
43 |
44 | **`models/`** - folder containing the Keras model definitions. Again, each model
45 | has a name which is interpreted in `models/__init__.py`.
46 |
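The name-based lookup described for `data/` and `models/` is a small factory pattern. A minimal sketch of the idea follows; `build_cnn` is a hypothetical stand-in that returns a plain dictionary rather than a Keras model, so the snippet runs without Keras installed.

```python
def build_cnn(input_shape=(32, 32, 3), n_classes=10, dropout=0):
    """Hypothetical stand-in for a real model builder; returns a description."""
    return {'kind': 'cnn', 'input_shape': tuple(input_shape),
            'n_classes': n_classes, 'dropout': dropout}


def get_model(name, **model_args):
    """Dispatch on the model name; the remaining keys become builder arguments."""
    if name == 'cnn':
        return build_cnn(**model_args)
    raise ValueError('Model %s unknown' % name)


# Mirrors the `model:` section of configs/cifar10_cnn.yaml
model_config = {'name': 'cnn', 'input_shape': [32, 32, 3],
                'n_classes': 10, 'dropout': 0.1}
model = get_model(**model_config)
print(model)
```

Because the whole config section is splatted into the factory, adding a new option to a model only requires a new keyword argument on its builder and a new key in the YAML file.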
47 | **`configs/`** - folder containing the configuration files. Each
48 | configuration specifies a dataset, a model, and all relevant configuration
49 | options (with some exceptions like the number of nodes, which is specified
50 | instead to SLURM via the command line).
51 |
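The `output_dir` entries in the configuration files contain shell-style variables such as `$SCRATCH` and `${SLURM_JOB_ID}`. One way a script could resolve them is with `os.path.expandvars` from the standard library. This is a sketch under that assumption; it is not confirmed to be what `train_horovod.py` does, and `expand_output_dir` and the environment values below are placeholders for illustration.

```python
import os


def expand_output_dir(path):
    """Expand $VAR and ${VAR} references using the current environment."""
    return os.path.expandvars(path)


# Placeholder values for illustration only; on the cluster these come from
# the user environment and from SLURM.
os.environ['SCRATCH'] = '/tmp/scratch'
os.environ['SLURM_JOB_NUM_NODES'] = '8'
os.environ['SLURM_JOB_ID'] = '12345'

template = ('$SCRATCH/isc19-dl-tutorial/'
            'cifar10-resnet-N${SLURM_JOB_NUM_NODES}-${SLURM_JOB_ID}')
output_dir = expand_output_dir(template)
print(output_dir)
```

Embedding the node count and job ID in the path keeps the results of the single-node and multi-node runs in separate directories.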
52 | **`scripts/`** - contains SLURM batch scripts for easily submitting the
53 | example jobs to the Cori batch system.
54 |
55 | **`utils/`** - contains additional useful code for the training script, e.g.
56 | custom callbacks, device configuration, and optimizers logic.
57 |
58 | ## Hands-on multi-node training example
59 |
60 | We will use a customized ResNet model in this example to classify CIFAR10
61 | images and demonstrate distributed training with Horovod.
62 |
63 | 1. Check out the ResNet model code in [models/resnet.py](models/resnet.py).
64 |    Note that the model code is broken into multiple functions for easy reuse.
65 |    We provide here two versions of ResNet models: a standard ResNet50 (with 50
66 |    layers) and a smaller ResNet consisting of 26 layers.
67 |     * Identify the identity block and conv block functions. *How many convolutional
68 |       layers does each of these have?*
69 |     * Identify the functions that build the ResNet50 and the ResNetSmall. Given how
70 |       many layers are in each block, *see if you can confirm how many layers (conv
71 |       and dense) are in the models*. **Hint:** we don't normally count the
72 |       convolution applied to the shortcuts.
73 |
74 | 2. Inspect the optimizer setup in [utils/optimizers.py](utils/optimizers.py).
75 |     * Note how we scale the learning rate (`lr`) according to the number of
76 |       processes (ranks).
77 |     * Note how we construct our optimizer and then wrap it in the Horovod
78 |       `DistributedOptimizer`.
79 |
80 | 3. Inspect the main [train_horovod.py](train_horovod.py) training script.
81 |     * Identify the `init_workers` function where we initialize Horovod.
82 |       Note where this is invoked in the `main()` function (right away).
83 |     * Identify where we set up our distributed training callbacks.
84 |     * *Which callback ensures we have consistent model weights at the start of training?*
85 |     * Identify the callbacks responsible for the learning rate schedule (warmup and decay).
86 |
87 | 4. Finally, look at the configuration file:
88 |    [configs/cifar10_resnet.yaml](configs/cifar10_resnet.yaml)
89 |     * YAML lets you express configurations in a rich, human-readable, hierarchical structure.
90 |     * Identify where you would edit to modify the optimizer, learning rate, batch size, etc.
91 |
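The learning-rate scaling mentioned in step 2 can be sketched in a few lines. This is an illustration, not the code in `utils/optimizers.py`: the function name `scale_lr` is hypothetical, and the `'linear'` and `'sqrt'` modes are taken from the `lr_scaling` option and the scaling suggestion in this README. In the real script the number of ranks would come from Horovod (`hvd.size()`); here it is passed in directly so the snippet runs without Horovod.

```python
import math


def scale_lr(base_lr, n_ranks, lr_scaling=None):
    """Scale a single-worker learning rate for data-parallel training.

    Hypothetical helper: 'linear' multiplies by the number of ranks,
    'sqrt' by its square root, and None leaves the rate unchanged.
    """
    if lr_scaling is None:
        return base_lr
    if lr_scaling == 'linear':
        return base_lr * n_ranks
    if lr_scaling == 'sqrt':
        return base_lr * math.sqrt(n_ranks)
    raise ValueError('Unknown lr_scaling: %s' % lr_scaling)


# Base rate from configs/cifar10_resnet.yaml, scaled for 8 ranks
base_lr = 0.0001
for mode in (None, 'sqrt', 'linear'):
    print(mode, scale_lr(base_lr, n_ranks=8, lr_scaling=mode))
```

The motivation is that with synchronous data parallelism the effective batch size grows with the number of ranks, so the step size is usually increased to compensate.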
92 | That's mostly it for the code. Note that in general when training distributed
93 | you might want to use more complicated data handling, e.g. to ensure different
94 | workers are always processing different samples of your data within a training
95 | epoch. In this case we aren't worrying about that and are, for simplicity,
96 | relying on the independent random shuffling of the data by each worker as well
97 | as the random data augmentation.
98 |
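For comparison, the "more complicated data handling" mentioned above could look like the following sketch, in which every worker applies the same per-epoch shuffle and then keeps a disjoint strided slice. The function `shard_indices` is hypothetical and is not part of this repository, which relies on independent shuffling instead.

```python
import random


def shard_indices(n_samples, rank, n_ranks, epoch):
    """Return the sample indices assigned to one worker for one epoch."""
    indices = list(range(n_samples))
    # Seed with the epoch so every rank computes the same permutation
    random.Random(epoch).shuffle(indices)
    # A strided slice gives each rank a disjoint subset of the permutation
    return indices[rank::n_ranks]


# Four workers splitting ten samples in epoch 0
shards = [shard_indices(n_samples=10, rank=r, n_ranks=4, epoch=0)
          for r in range(4)]
for r, shard in enumerate(shards):
    print('rank', r, shard)
```

With this scheme no sample is seen twice within an epoch across workers, at the cost of every worker having to agree on the seed.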
99 | 5. To gain an appreciation for the speedup of training on
100 | multiple nodes, first train the ResNet model on a single node.
101 | Adjust the configuration in [configs/cifar10_resnet.yaml](configs/cifar10_resnet.yaml)
102 | to train for just 1 epoch and then submit the job to the Cori batch system with
103 | SLURM sbatch and our provided SLURM batch script:\
104 | `sbatch -N 1 scripts/cifar_resnet.sh`
105 |     * **Important:** the first time you run a CIFAR10 example, it will
106 |       automatically download the dataset. If more than one job attempts
107 |       this download simultaneously, it will likely fail.
108 |
109 | 6. Now we are ready to train our ResNet model on multiple nodes using Horovod
110 | and MPI! If you changed the config to 1 epoch above, be sure to change it back
111 | to 32 epochs for this step. To launch the ResNet training on 8 nodes, do:\
112 | `sbatch -N 8 scripts/cifar_resnet.sh`
113 |
114 | 7. Check on the status of your job by running `sqs`.
115 | Once the job starts running, you should see the output start to appear in the
116 | SLURM log file `logs/cifar-resnet-*.out`. Some printouts come from every
117 | worker; others are printed only from rank 0.
118 |
119 | 8. When the job is finished, check the log to identify how well your model learned
120 | to solve the CIFAR10 classification task. For every epoch you should see the
121 | loss and accuracy reported for both the training set and the validation set.
122 | Take note of the best validation accuracy achieved.
123 |
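As a check on the layer-count exercise from step 1, the arithmetic can be written out. The sketch assumes the counting convention from the hint (shortcut convolutions are not counted) and reads the blocks per stage off `models/resnet.py`; `count_layers` is a hypothetical helper.

```python
def count_layers(blocks_per_stage, convs_per_block=3):
    """Count conv and dense layers, ignoring shortcut convolutions."""
    stem_conv = 1         # the initial `conv1` layer
    classifier_dense = 1  # the final softmax Dense layer
    block_convs = convs_per_block * sum(blocks_per_stage)
    return stem_conv + block_convs + classifier_dense


# Blocks (conv_block + identity_block) in stages 2 to 5 of each model
resnet50_layers = count_layers([3, 4, 6, 3])
resnet_small_layers = count_layers([2, 2, 2, 2])
print(resnet50_layers, resnet_small_layers)  # prints "50 26"
```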
124 | Now that you've finished the main tutorial material, try to play with the code
125 | and/or configuration to see the effect on the training results. You can try
126 | things like:
127 | * Changing the optimizer (search for Keras optimizers on Google).
128 | * Changing the nominal learning rate, number of warmup epochs, or decay schedule.
129 | * Changing the learning rate scaling (e.g. try "sqrt" scaling instead of linear).
130 |
131 | Most of these things can be changed entirely within the configuration.
132 | See [configs/imagenet_resnet.yaml](configs/imagenet_resnet.yaml) for examples.
133 |
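To make the warmup and decay options concrete, here is a sketch of how such a schedule could be evaluated. The `lr_schedule` entries are copied from `configs/imagenet_resnet.yaml`; the functions `lr_multiplier` and `warmup_lr`, the half-open `[start_epoch, end_epoch)` interpretation, and the linear ramp starting from the single-worker rate are assumptions for illustration, not a description of the Horovod callbacks the training script uses.

```python
def lr_multiplier(epoch, lr_schedule):
    """Return the multiplier of the first schedule entry covering `epoch`."""
    for entry in lr_schedule:
        start = entry['start_epoch']
        end = entry.get('end_epoch')  # a missing end means "until training ends"
        if epoch >= start and (end is None or epoch < end):
            return entry['multiplier']
    return 1.0  # epochs outside every entry keep the nominal rate


def warmup_lr(epoch, scaled_lr, n_ranks, warmup_epochs):
    """Ramp linearly from scaled_lr / n_ranks at epoch 0 up to scaled_lr."""
    if epoch >= warmup_epochs:
        return scaled_lr
    start_lr = scaled_lr / n_ranks
    fraction = epoch / warmup_epochs
    return start_lr + fraction * (scaled_lr - start_lr)


# Entries copied from configs/imagenet_resnet.yaml
lr_schedule = [
    {'start_epoch': 5, 'end_epoch': 30, 'multiplier': 1.},
    {'start_epoch': 30, 'end_epoch': 60, 'multiplier': 1.e-1},
    {'start_epoch': 60, 'end_epoch': 80, 'multiplier': 1.e-2},
    {'start_epoch': 80, 'multiplier': 1.e-3},
]

for epoch in (0, 5, 30, 60, 80):
    print(epoch, lr_multiplier(epoch, lr_schedule))
```

The warmup phase starts large-batch training gently, while the step decay lowers the rate once progress at the current rate has levelled off.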
134 | ## Code references
135 |
136 | Keras ResNet50 official model:
137 | https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py
138 |
139 | Horovod ResNet + ImageNet example:
140 | https://github.com/uber/horovod/blob/master/examples/keras_imagenet_resnet50.py
141 |
142 | CIFAR10 CNN and ResNet examples:
143 | https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py
144 | https://github.com/keras-team/keras/blob/master/examples/cifar10_resnet.py
145 |
--------------------------------------------------------------------------------
/configs/cifar10_cnn.yaml:
--------------------------------------------------------------------------------
1 | description: 'CNN CIFAR10'
2 | output_dir: $SCRATCH/isc19-dl-tutorial/cifar-cnn-N${SLURM_JOB_NUM_NODES}-${SLURM_JOB_ID}
3 |
4 | data:
5 |     name: cifar10
6 |
7 | model:
8 |     name: cnn
9 |     input_shape: [32, 32, 3]
10 |     n_classes: 10
11 |     dropout: 0.1
12 |
13 | optimizer:
14 |     name: Adam
15 |     lr: 0.001
16 |
17 | training:
18 |     batch_size: 64
19 |     n_epochs: 24
20 |     lr_warmup_epochs: 5
21 |     loss: categorical_crossentropy
22 |     metrics: [accuracy]
23 |
--------------------------------------------------------------------------------
/configs/cifar10_resnet.yaml:
--------------------------------------------------------------------------------
1 | description: 'ResNet CIFAR10'
2 | output_dir: $SCRATCH/isc19-dl-tutorial/cifar10-resnet-N${SLURM_JOB_NUM_NODES}-${SLURM_JOB_ID}
3 |
4 | data:
5 |     name: cifar10
6 |
7 | model:
8 |     name: resnet_small
9 |     input_shape: [32, 32, 3]
10 |     n_classes: 10
11 |
12 | optimizer:
13 |     name: Adam
14 |     lr: 0.0001
15 |     lr_scaling: linear
16 |
17 | training:
18 |     batch_size: 64
19 |     n_epochs: 32
20 |     lr_warmup_epochs: 5
21 |     loss: categorical_crossentropy
22 |     metrics: [accuracy]
23 |
24 | device:
25 |     intra_threads: 32
26 |
--------------------------------------------------------------------------------
/configs/hello.yaml:
--------------------------------------------------------------------------------
1 | description: 'Hello world'
2 | output_dir: $SCRATCH/isc19-dl-tutorial/hello-world
3 |
4 | data:
5 |     name: dummy
6 |     input_shape: [64, 64, 3]
7 |     n_train: 1024
8 |     n_valid: 1024
9 |
10 | model:
11 |     name: cnn
12 |     input_shape: [64, 64, 3]
13 |     n_classes: 1
14 |
15 | optimizer:
16 |     name: Adam
17 |     lr: 0.001
18 |
19 | training:
20 |     batch_size: 32
21 |     n_epochs: 1
22 |     loss: binary_crossentropy
23 |     metrics: [accuracy]
24 |
--------------------------------------------------------------------------------
/configs/imagenet_resnet.yaml:
--------------------------------------------------------------------------------
1 | # This configuration should match what is implemented in the horovod example:
2 | # https://github.com/uber/horovod/blob/master/examples/keras_imagenet_resnet50.py
3 |
4 | description: 'ResNet ImageNet'
5 | output_dir: $SCRATCH/isc19-dl-tutorial/imagenet-resnet-N${SLURM_JOB_NUM_NODES}-${SLURM_JOB_ID}
6 |
7 | data:
8 |     name: imagenet
9 |     train_dir: /global/cscratch1/sd/sfarrell/ImageNet-100/train
10 |     valid_dir: /global/cscratch1/sd/sfarrell/ImageNet-100/validation
11 |
12 | model:
13 |     name: resnet50
14 |     input_shape: [224, 224, 3]
15 |     n_classes: 100
16 |
17 | optimizer:
18 |     name: SGD
19 |     lr: 0.0125
20 |     momentum: 0.9
21 |
22 | training:
23 |     batch_size: 32
24 |     n_epochs: 100
25 |     lr_warmup_epochs: 5
26 |     loss: categorical_crossentropy
27 |     metrics: [accuracy, top_k_categorical_accuracy]
28 |     lr_schedule:
29 |         - {start_epoch: 5, end_epoch: 30, multiplier: 1.}
30 |         - {start_epoch: 30, end_epoch: 60, multiplier: 1.e-1}
31 |         - {start_epoch: 60, end_epoch: 80, multiplier: 1.e-2}
32 |         - {start_epoch: 80, multiplier: 1.e-3}
33 |
--------------------------------------------------------------------------------
/data/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 | Keras dataset specifications.
3 | TODO: add MNIST.
4 | """
5 |
6 | def get_datasets(name, **data_args):
7 |     if name == 'dummy':
8 |         from .dummy import get_datasets
9 |         return get_datasets(**data_args)
10 |     elif name == 'cifar10':
11 |         from .cifar10 import get_datasets
12 |         return get_datasets(**data_args)
13 |     elif name == 'imagenet':
14 |         from .imagenet import get_datasets
15 |         return get_datasets(**data_args)
16 |     else:
17 |         raise ValueError('Dataset %s unknown' % name)
18 |
--------------------------------------------------------------------------------
/data/cifar10.py:
--------------------------------------------------------------------------------
1 | """
2 | CIFAR10 dataset specification.
3 |
4 | https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py
5 | https://github.com/keras-team/keras/blob/master/examples/cifar10_resnet.py
6 | """
7 |
8 | # Externals
9 | import keras
10 | from keras.datasets import cifar10
11 | from keras.preprocessing.image import ImageDataGenerator
12 |
13 | def get_datasets(batch_size, n_train=None, n_valid=None):
14 |     """
15 |     Load the CIFAR10 data and construct pipeline.
16 |     """
17 |     (x_train, y_train), (x_valid, y_valid) = cifar10.load_data()
18 |
19 |     # Normalize pixel values
20 |     x_train = x_train.astype('float32') / 255
21 |     x_valid = x_valid.astype('float32') / 255
22 |
23 |     # Select subset of data if specified
24 |     if n_train is not None:
25 |         x_train, y_train = x_train[:n_train], y_train[:n_train]
26 |     if n_valid is not None:
27 |         x_valid, y_valid = x_valid[:n_valid], y_valid[:n_valid]
28 |
29 |     # Convert labels to class vectors
30 |     n_classes = 10
31 |     y_train = keras.utils.to_categorical(y_train, n_classes)
32 |     y_valid = keras.utils.to_categorical(y_valid, n_classes)
33 |
34 |     # Prepare the generators with data augmentation
35 |     train_gen = ImageDataGenerator(width_shift_range=0.1,
36 |                                    height_shift_range=0.1,
37 |                                    horizontal_flip=True)
38 |     valid_gen = ImageDataGenerator()
39 |     train_iter = train_gen.flow(x_train, y_train, batch_size=batch_size, shuffle=True)
40 |     valid_iter = valid_gen.flow(x_valid, y_valid, batch_size=batch_size, shuffle=True)
41 |     return train_iter, valid_iter
42 |
--------------------------------------------------------------------------------
/data/dummy.py:
--------------------------------------------------------------------------------
1 | """
2 | Random dummy dataset specification.
3 | """
4 |
5 | # System
6 | import math
7 |
8 | # Externals
9 | import numpy as np
10 | from keras.utils import Sequence
11 |
12 | class DummyDataset(Sequence):
13 |
14 |     def __init__(self, n_samples, batch_size, input_shape, target_shape):
15 |         self.x = np.random.normal(size=(n_samples,) + tuple(input_shape))
16 |         self.y = np.random.normal(size=(n_samples,) + tuple(target_shape))
17 |         self.batch_size = batch_size
18 |
19 |     def __len__(self):
20 |         return math.ceil(len(self.x) / self.batch_size)
21 |
22 |     def __getitem__(self, idx):
23 |         start = idx * self.batch_size
24 |         end = (idx + 1) * self.batch_size
25 |         return self.x[start:end], self.y[start:end]
26 |
27 | def get_datasets(batch_size, n_train=1024, n_valid=1024,
28 |                  input_shape=(32, 32, 3), target_shape=()):
29 |     train_data = DummyDataset(n_train, batch_size, input_shape, target_shape)
30 |     valid_data = DummyDataset(n_valid, batch_size, input_shape, target_shape)
31 |     return train_data, valid_data
32 |
--------------------------------------------------------------------------------
/data/imagenet.py:
--------------------------------------------------------------------------------
1 | """
2 | ImageNet dataset specification.
3 |
4 | Adapted from
5 | https://github.com/uber/horovod/blob/master/examples/keras_imagenet_resnet50.py
6 | """
7 |
8 | # Externals
9 | import keras
10 | from keras.preprocessing.image import ImageDataGenerator
11 |
12 | def get_datasets(batch_size, train_dir, valid_dir):
13 |     train_gen = ImageDataGenerator(
14 |         preprocessing_function=keras.applications.resnet50.preprocess_input,
15 |         width_shift_range=0.33, height_shift_range=0.33, zoom_range=0.5,
16 |         horizontal_flip=True)
17 |     test_gen = ImageDataGenerator(
18 |         preprocessing_function=keras.applications.resnet50.preprocess_input,
19 |         zoom_range=(0.875, 0.875))
20 |     train_iter = train_gen.flow_from_directory(train_dir, batch_size=batch_size,
21 |                                                target_size=(224, 224), shuffle=True)
22 |     test_iter = test_gen.flow_from_directory(valid_dir, batch_size=batch_size,
23 |                                              target_size=(224, 224), shuffle=True)
24 |     return train_iter, test_iter
25 |
--------------------------------------------------------------------------------
/logs/README.md:
--------------------------------------------------------------------------------
1 | Slurm logs go in this directory
2 |
--------------------------------------------------------------------------------
/models/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 | Keras example model factory functions.
3 | """
4 |
5 | def get_model(name, **model_args):
6 |     if name == 'cnn':
7 |         from .cnn import build_model
8 |         return build_model(**model_args)
9 |     elif name == 'resnet_small':
10 |         from .resnet import ResNetSmall
11 |         return ResNetSmall(**model_args)
12 |     elif name == 'resnet50':
13 |         from .resnet import ResNet50
14 |         return ResNet50(**model_args)
15 |     else:
16 |         raise ValueError('Model %s unknown' % name)
17 |
--------------------------------------------------------------------------------
/models/cnn.py:
--------------------------------------------------------------------------------
1 | """
2 | Simple CNN classifier model.
3 | """
4 |
5 | import keras
6 | from keras.models import Sequential
7 | from keras.layers import Dense, Dropout, Activation, Flatten
8 | from keras.layers import Conv2D, MaxPooling2D
9 |
10 | def build_model(input_shape=(32, 32, 3), n_classes=10, dropout=0):
11 |     """Construct the simple CNN model"""
12 |     conv_args = dict(kernel_size=3, padding='same', activation='relu')
13 |     model = Sequential()
14 |     model.add(Conv2D(16, input_shape=input_shape, **conv_args))
15 |     model.add(MaxPooling2D(pool_size=2))
16 |     model.add(Conv2D(32, **conv_args))
17 |     model.add(MaxPooling2D(pool_size=2))
18 |     model.add(Conv2D(64, **conv_args))
19 |     model.add(MaxPooling2D(pool_size=2))
20 |     model.add(Flatten())
21 |     model.add(Dense(128, activation='relu'))
22 |     model.add(Dropout(dropout))
23 |     model.add(Dense(n_classes, activation='softmax'))
24 |     return model
25 |
--------------------------------------------------------------------------------
/models/resnet.py:
--------------------------------------------------------------------------------
1 | """
2 | ResNet models for Keras.
3 | Implementations have been adapted from keras_applications/resnet50.py
4 | """
5 |
6 | # Externals
7 | import keras
8 | from keras import backend, layers, models, regularizers
9 |
10 | def identity_block(input_tensor, kernel_size, filters, stage, block,
11 |                    l2_reg=5e-5, bn_mom=0.9):
12 |     """The identity block is the block that has no conv layer at shortcut.
13 |
14 |     # Arguments
15 |         input_tensor: input tensor
16 |         kernel_size: default 3, the kernel size of
17 |             middle conv layer at main path
18 |         filters: list of integers, the filters of 3 conv layer at main path
19 |         stage: integer, current stage label, used for generating layer names
20 |         block: 'a','b'..., current block label, used for generating layer names
21 |         l2_reg: L2 weight regularization (weight decay)
22 |         bn_mom: batch-norm momentum
23 |
24 |     # Returns
25 |         Output tensor for the block.
26 |     """
27 |     filters1, filters2, filters3 = filters
28 |     if backend.image_data_format() == 'channels_last':
29 |         bn_axis = 3
30 |     else:
31 |         bn_axis = 1
32 |     conv_name_base = 'res' + str(stage) + block + '_branch'
33 |     bn_name_base = 'bn' + str(stage) + block + '_branch'
34 |
35 |     x = layers.Conv2D(filters1, (1, 1),
36 |                       kernel_initializer='he_normal',
37 |                       kernel_regularizer=regularizers.l2(l2_reg),
38 |                       name=conv_name_base + '2a')(input_tensor)
39 |     x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2a',
40 |                                   momentum=bn_mom, epsilon=1e-5)(x)
41 |     x = layers.Activation('relu')(x)
42 |
43 |     x = layers.Conv2D(filters2, kernel_size,
44 |                       padding='same',
45 |                       kernel_initializer='he_normal',
46 |                       kernel_regularizer=regularizers.l2(l2_reg),
47 |                       name=conv_name_base + '2b')(x)
48 |     x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2b',
49 |                                   momentum=bn_mom, epsilon=1e-5)(x)
50 |     x = layers.Activation('relu')(x)
51 |
52 |     x = layers.Conv2D(filters3, (1, 1),
53 |                       kernel_initializer='he_normal',
54 |                       kernel_regularizer=regularizers.l2(l2_reg),
55 |                       name=conv_name_base + '2c')(x)
56 |     x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2c',
57 |                                   momentum=bn_mom, epsilon=1e-5)(x)
58 |
59 |     x = layers.add([x, input_tensor])
60 |     x = layers.Activation('relu')(x)
61 |     return x
62 |
63 | def conv_block(input_tensor, kernel_size, filters, stage, block,
64 |                strides=(2, 2), l2_reg=5e-5, bn_mom=0.9):
65 |     """A block that has a conv layer at shortcut.
66 |
67 |     # Arguments
68 |         input_tensor: input tensor
69 |         kernel_size: default 3, the kernel size of
70 |             middle conv layer at main path
71 |         filters: list of integers, the filters of 3 conv layer at main path
72 |         stage: integer, current stage label, used for generating layer names
73 |         block: 'a','b'..., current block label, used for generating layer names
74 |         strides: Strides for the first conv layer in the block.
75 |         l2_reg: L2 weight regularization (weight decay)
76 |         bn_mom: batch-norm momentum
77 |
78 |     # Returns
79 |         Output tensor for the block.
80 |     """
81 |     filters1, filters2, filters3 = filters
82 |     if backend.image_data_format() == 'channels_last':
83 |         bn_axis = 3
84 |     else:
85 |         bn_axis = 1
86 |     conv_name_base = 'res' + str(stage) + block + '_branch'
87 |     bn_name_base = 'bn' + str(stage) + block + '_branch'
88 |
89 |     x = layers.Conv2D(filters1, (1, 1), strides=strides,
90 |                       kernel_initializer='he_normal',
91 |                       kernel_regularizer=regularizers.l2(l2_reg),
92 |                       name=conv_name_base + '2a')(input_tensor)
93 |     x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2a')(x)
94 |     x = layers.Activation('relu')(x)
95 |
96 |     x = layers.Conv2D(filters2, kernel_size, padding='same',
97 |                       kernel_initializer='he_normal',
98 |                       kernel_regularizer=regularizers.l2(l2_reg),
99 |                       name=conv_name_base + '2b')(x)
100 |     x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2b')(x)
101 |     x = layers.Activation('relu')(x)
102 |
103 |     x = layers.Conv2D(filters3, (1, 1),
104 |                       kernel_initializer='he_normal',
105 |                       kernel_regularizer=regularizers.l2(l2_reg),
106 |                       name=conv_name_base + '2c')(x)
107 |     x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2c')(x)
108 |
109 |     shortcut = layers.Conv2D(filters3, (1, 1), strides=strides,
110 |                              kernel_initializer='he_normal',
111 |                              kernel_regularizer=regularizers.l2(l2_reg),
112 |                              name=conv_name_base + '1')(input_tensor)
113 |     shortcut = layers.BatchNormalization(
114 |         axis=bn_axis, name=bn_name_base + '1')(shortcut)
115 |
116 |     x = layers.add([x, shortcut])
117 |     x = layers.Activation('relu')(x)
118 |     return x
119 |
120 | def ResNet50(input_shape=(224, 224, 3), n_classes=1000,
121 |              l2_reg=5e-5, bn_mom=0.9):
122 |     """Instantiates the ResNet50 architecture.
123 |
124 |     # Arguments
125 |         input_shape: input shape tuple. It should have 3 input channels.
126 |         n_classes: number of classes to classify images.
127 |         l2_reg: L2 weight regularization (weight decay)
128 |         bn_mom: batch-norm momentum
129 |
130 |     # Returns
131 |         A Keras model instance.
132 |     """
133 |     img_input = layers.Input(shape=input_shape)
134 |
135 |     if backend.image_data_format() == 'channels_last':
136 |         bn_axis = 3
137 |     else:
138 |         bn_axis = 1
139 |
140 |     x = layers.ZeroPadding2D(padding=(3, 3), name='conv1_pad')(img_input)
141 |     x = layers.Conv2D(64, (7, 7), strides=(2, 2), padding='valid',
142 |                       kernel_initializer='he_normal',
143 |                       kernel_regularizer=regularizers.l2(l2_reg),
144 |                       name='conv1')(x)
145 |     x = layers.BatchNormalization(axis=bn_axis, name='bn_conv1')(x)
146 |     x = layers.Activation('relu')(x)
147 |     x = layers.ZeroPadding2D(padding=(1, 1), name='pool1_pad')(x)
148 |     x = layers.MaxPooling2D((3, 3), strides=(2, 2))(x)
149 |
150 |     x = conv_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1),
151 |                    l2_reg=l2_reg, bn_mom=bn_mom)
152 |     x = identity_block(x, 3, [64, 64, 256], stage=2, block='b',
153 |                        l2_reg=l2_reg, bn_mom=bn_mom)
154 |     x = identity_block(x, 3, [64, 64, 256], stage=2, block='c',
155 |                        l2_reg=l2_reg, bn_mom=bn_mom)
156 |
157 |     x = conv_block(x, 3, [128, 128, 512], stage=3, block='a',
158 |                    l2_reg=l2_reg, bn_mom=bn_mom)
159 |     x = identity_block(x, 3, [128, 128, 512], stage=3, block='b',
160 |                        l2_reg=l2_reg, bn_mom=bn_mom)
161 |     x = identity_block(x, 3, [128, 128, 512], stage=3, block='c',
162 |                        l2_reg=l2_reg, bn_mom=bn_mom)
163 |     x = identity_block(x, 3, [128, 128, 512], stage=3, block='d',
164 |                        l2_reg=l2_reg, bn_mom=bn_mom)
165 |
166 |     x = conv_block(x, 3, [256, 256, 1024], stage=4, block='a',
167 |                    l2_reg=l2_reg, bn_mom=bn_mom)
168 |     x = identity_block(x, 3, [256, 256, 1024], stage=4, block='b',
169 |                        l2_reg=l2_reg, bn_mom=bn_mom)
170 |     x = identity_block(x, 3, [256, 256, 1024], stage=4, block='c',
171 |                        l2_reg=l2_reg, bn_mom=bn_mom)
172 |     x = identity_block(x, 3, [256, 256, 1024], stage=4, block='d',
173 |                        l2_reg=l2_reg, bn_mom=bn_mom)
174 |     x = identity_block(x, 3, [256, 256, 1024], stage=4, block='e',
175 |                        l2_reg=l2_reg, bn_mom=bn_mom)
176 |     x = identity_block(x, 3, [256, 256, 1024], stage=4, block='f',
177 |                        l2_reg=l2_reg, bn_mom=bn_mom)
178 |
179 |     x = conv_block(x, 3, [512, 512, 2048], stage=5, block='a',
180 |                    l2_reg=l2_reg, bn_mom=bn_mom)
181 |     x = identity_block(x, 3, [512, 512, 2048], stage=5, block='b',
182 |                        l2_reg=l2_reg, bn_mom=bn_mom)
183 |     x = identity_block(x, 3, [512, 512, 2048], stage=5, block='c',
184 |                        l2_reg=l2_reg, bn_mom=bn_mom)
185 |
186 |     x = layers.GlobalAveragePooling2D(name='avg_pool')(x)
187 |     x = layers.Dense(n_classes, activation='softmax',
188 |                      kernel_regularizer=regularizers.l2(l2_reg),
189 |                      name='fc1000')(x)
190 |
191 |     return models.Model(img_input, x, name='resnet50')
192 |
193 |
194 | def ResNetSmall(input_shape=(32, 32, 3), n_classes=10,
195 |                 l2_reg=5e-5, bn_mom=0.9):
196 |     """Instantiates the small ResNet architecture.
197 |
198 |     # Arguments
199 |         input_shape: input shape tuple. It should have 3 input channels.
200 |         n_classes: number of classes to classify images.
201 |         l2_reg: L2 weight regularization (weight decay)
202 |         bn_mom: batch-norm momentum
203 |
204 |     # Returns
205 |         A Keras model instance.
206 |     """
207 |     img_input = layers.Input(shape=input_shape)
208 |
209 |     if backend.image_data_format() == 'channels_last':
210 |         bn_axis = 3
211 |     else:
212 |         bn_axis = 1
213 |
214 |     x = img_input
215 |     x = layers.Conv2D(64, (3, 3), strides=(1, 1), padding='same',
216 |                       kernel_initializer='he_normal',
217 |                       kernel_regularizer=regularizers.l2(l2_reg),
218 |                       name='conv1')(img_input)
219 |     x = layers.BatchNormalization(axis=bn_axis, name='bn_conv1')(x)
220 |     x = layers.Activation('relu')(x)
221 |
222 |     x = conv_block(x, 3, [64, 64, 64], stage=2, block='a',
223 |                    l2_reg=l2_reg, bn_mom=bn_mom)
224 |     x = identity_block(x, 3, [64, 64, 64], stage=2, block='b',
225 |                        l2_reg=l2_reg, bn_mom=bn_mom)
226 |
227 |     x = conv_block(x, 3, [128, 128, 128], stage=3, block='a',
228 |                    l2_reg=l2_reg, bn_mom=bn_mom)
229 |     x = identity_block(x, 3, [128, 128, 128], stage=3, block='b',
230 |                        l2_reg=l2_reg, bn_mom=bn_mom)
231 |
232 |     x = conv_block(x, 3, [256, 256, 256], stage=4, block='a',
233 |                    l2_reg=l2_reg, bn_mom=bn_mom)
234 |     x = identity_block(x, 3, [256, 256, 256], stage=4, block='b',
235 |                        l2_reg=l2_reg, bn_mom=bn_mom)
236 |
237 |     x = conv_block(x, 3, [512, 512, 512], stage=5, block='a',
238 |                    l2_reg=l2_reg, bn_mom=bn_mom)
239 |     x = identity_block(x, 3, [512, 512, 512], stage=5, block='b',
240 |                        l2_reg=l2_reg, bn_mom=bn_mom)
241 |
242 |     x = layers.GlobalAveragePooling2D(name='avg_pool')(x)
243 |     x = layers.Dense(n_classes, activation='softmax',
244 |                      kernel_regularizer=regularizers.l2(l2_reg),
245 |                      name='fc1000')(x)
246 |
247 |     return models.Model(img_input, x, name='resnet_small')
248 |
--------------------------------------------------------------------------------
/notebooks/Analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Results analysis\n",
8 | "\n",
9 | "You can use this notebook to analyze the results of your training runs on Cori.\n",
10 | "\n",
11 | "More documentation to come."
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 1,
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "# System\n",
21 | "import os\n",
22 | "\n",
23 | "# Externals\n",
24 | "import numpy as np\n",
25 | "import matplotlib.pyplot as plt"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 2,
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "# Magic\n",
35 | "%matplotlib inline"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "## Specify the results to load\n",
43 | "\n",
44 | "We start by plotting just one experiment. Later we may want to support plotting multiple runs, as in TensorBoard."
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 22,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "results_dir = '/global/cscratch1/sd/sfarrell/sc18-dl-tutorial/cifar10-cnn-N1-16150683'"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 23,
59 | "metadata": {},
60 | "outputs": [
61 | {
62 | "name": "stdout",
63 | "output_type": "stream",
64 | "text": [
65 | "\u001b[0m\u001b[01;34mcheckpoints\u001b[0m/ history.npz out.log\n"
66 | ]
67 | }
68 | ],
69 | "source": [
70 | "ls $results_dir"
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "## Load the summary results and inspect them"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": 24,
83 | "metadata": {},
84 | "outputs": [],
85 | "source": [
86 | "history = np.load(os.path.join(results_dir, 'history.npz'))"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 25,
92 | "metadata": {},
93 | "outputs": [],
94 | "source": [
95 | "n_ranks = int(history['n_ranks'])"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 26,
101 | "metadata": {},
102 | "outputs": [
103 | {
104 | "data": {
105 | "text/plain": [
106 | "['n_ranks', 'val_loss', 'val_acc', 'loss', 'acc', 'lr']"
107 | ]
108 | },
109 | "execution_count": 26,
110 | "metadata": {},
111 | "output_type": "execute_result"
112 | }
113 | ],
114 | "source": [
115 | "history.files"
116 | ]
117 | },
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": [
122 | "## Plot loss and accuracy"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 27,
128 | "metadata": {},
129 | "outputs": [
130 | {
131 | "data": {
132 | "image/png": "\n",
133 | "text/plain": [
134 | ""
135 | ]
136 | },
137 | "metadata": {
138 | "needs_background": "light"
139 | },
140 | "output_type": "display_data"
141 | }
142 | ],
143 | "source": [
144 | "fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))\n",
145 | "\n",
146 | "# Plot losses\n",
147 | "ax0.plot(history['loss'], label='train')\n",
148 | "ax0.plot(history['val_loss'], label='validation')\n",
149 | "ax0.set_xlabel('Epoch')\n",
150 | "ax0.set_ylabel('Loss')\n",
151 | "ax0.legend(loc=0)\n",
152 | "\n",
153 | "ax1.plot(history['acc'], label='train')\n",
154 | "ax1.plot(history['val_acc'], label='validation')\n",
155 | "ax1.set_xlabel('Epoch')\n",
156 | "ax1.set_ylabel('Accuracy')\n",
157 | "ax1.legend(loc=0)\n",
158 | "\n",
159 | "plt.tight_layout()"
160 | ]
161 | }
162 | ],
163 | "metadata": {
164 | "kernelspec": {
165 | "display_name": "tf-1.11.0-py36",
166 | "language": "python",
167 | "name": "tf-1.11.0-py36"
168 | },
169 | "language_info": {
170 | "codemirror_mode": {
171 | "name": "ipython",
172 | "version": 3
173 | },
174 | "file_extension": ".py",
175 | "mimetype": "text/x-python",
176 | "name": "python",
177 | "nbconvert_exporter": "python",
178 | "pygments_lexer": "ipython3",
179 | "version": "3.6.5"
180 | },
181 | "toc-autonumbering": false,
182 | "toc-showmarkdowntxt": true,
183 | "toc-showtags": false
184 | },
185 | "nbformat": 4,
186 | "nbformat_minor": 2
187 | }
188 |
--------------------------------------------------------------------------------
/scripts/cifar_cnn.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH -J cifar-cnn
3 | #SBATCH -C knl
4 | #SBATCH -N 1
5 | #SBATCH -q regular
6 | #SBATCH --reservation dl4sci
7 | #SBATCH -t 1:00:00
8 | #SBATCH -o logs/%x-%j.out
9 |
10 | # Load the software
11 | module load tensorflow/intel-1.13.1-py36
12 | export KMP_BLOCKTIME=1
13 | export KMP_AFFINITY="granularity=fine,compact,1,0"
14 |
15 | # Ensure dataset is downloaded by single process
16 | python -c "import keras; keras.datasets.cifar10.load_data()"
17 |
18 | # Submit multi-node training
19 | srun -u python train_horovod.py configs/cifar10_cnn.yaml -d
20 |
--------------------------------------------------------------------------------
/scripts/cifar_resnet.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH -J cifar-resnet
3 | #SBATCH -C knl
4 | #SBATCH -N 1
5 | #SBATCH -q regular
6 | #SBATCH --reservation dl4sci
7 | #SBATCH -t 1:00:00
8 | #SBATCH -o logs/%x-%j.out
9 |
10 | # Load the software
11 | module load tensorflow/intel-1.13.1-py36
12 | export KMP_BLOCKTIME=1
13 | export KMP_AFFINITY="granularity=fine,compact,1,0"
14 |
15 | # Ensure dataset is downloaded by single process
16 | python -c "import keras; keras.datasets.cifar10.load_data()"
17 |
18 | # Submit multi-node training
19 | srun -u python train_horovod.py configs/cifar10_resnet.yaml -d
20 |
--------------------------------------------------------------------------------
/train_horovod.py:
--------------------------------------------------------------------------------
1 | """
2 | Main training script for the Deep Learning at Scale Keras examples.
3 | """
4 |
5 | # System
6 | import os
7 | import sys
8 | import argparse
9 | import logging
10 |
11 | # Externals
12 | import keras
13 | import horovod.keras as hvd
14 | import yaml
15 | import numpy as np
16 |
17 | # Locals
18 | from data import get_datasets
19 | from models import get_model
20 | from utils.device import configure_session
21 | from utils.optimizers import get_optimizer
22 | from utils.callbacks import TimingCallback
23 |
24 | # Suppress TF warnings
25 | import tensorflow as tf
26 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
27 | tf.logging.set_verbosity(logging.ERROR)
28 |
29 | # Argparse action that parses "k1=v1,k2=v2" into a dictionary of strings
30 | class StoreDictKeyPair(argparse.Action):
31 |     def __call__(self, parser, namespace, values, option_string=None):
32 |         my_dict = {}
33 |         for kv in values.split(","):
34 |             k, v = kv.split("=")
35 |             my_dict[k] = v
36 |         setattr(namespace, self.dest, my_dict)
37 |
38 | def parse_args():
39 |     """Parse command line arguments."""
40 |     parser = argparse.ArgumentParser('train_horovod.py')
41 |     add_arg = parser.add_argument
42 |     add_arg('config', nargs='?', default='configs/hello.yaml')
43 |     add_arg('-d', '--distributed', action='store_true')
44 |     add_arg('-v', '--verbose', action='store_true')
45 |     add_arg('--show-config', action='store_true')
46 |     add_arg('--interactive', action='store_true')
47 |     # Parameters which override the YAML file
48 |     add_arg('--dropout', type=float, help='keep rate for dropout layers')
49 |     add_arg("--optimizer", action=StoreDictKeyPair, help="optimizer parameters")
50 |     add_arg('--batch_size', type=int, help='batch size for training')
51 |     add_arg('--n_epochs', type=int, help='number of epochs to train')
52 |     return parser.parse_args()
53 |
54 | def config_logging(verbose):
55 |     log_format = '%(asctime)s %(levelname)s %(message)s'
56 |     log_level = logging.DEBUG if verbose else logging.INFO
57 |     logging.basicConfig(level=log_level, format=log_format)
58 |
59 | def init_workers(distributed=False):
60 |     rank, n_ranks = 0, 1
61 |     if distributed:
62 |         hvd.init()
63 |         rank, n_ranks = hvd.rank(), hvd.size()
64 |     return rank, n_ranks
65 |
66 | def load_config(arguments):
67 |     # Read base config from YAML file
68 |     config_file = arguments.config
69 |     with open(config_file) as f:
70 |         config = yaml.load(f, Loader=yaml.FullLoader)
71 |
72 |     # Override with command-line arguments
73 |     if arguments.dropout is not None:
74 |         config["model"]["dropout"] = arguments.dropout
75 |     if arguments.batch_size:
76 |         config["training"]["batch_size"] = arguments.batch_size
77 |     if arguments.n_epochs:
78 |         config["training"]["n_epochs"] = arguments.n_epochs
79 |     if arguments.optimizer:
80 |         if "name" in arguments.optimizer:
81 |             config["optimizer"]["name"] = arguments.optimizer["name"]
82 |         if "lr" in arguments.optimizer:
83 |             config["optimizer"]["lr"] = float(arguments.optimizer["lr"])
84 |         if "lr_scaling" in arguments.optimizer:
85 |             config["optimizer"]["lr_scaling"] = arguments.optimizer["lr_scaling"]
86 |         if "lr_warmup_epochs" in arguments.optimizer:
87 |             config["training"]["lr_warmup_epochs"] = int(arguments.optimizer["lr_warmup_epochs"])
88 |
89 |     return config
90 |
91 | def get_basic_callbacks(distributed=False):
92 |     cb = []
93 |
94 |     if distributed:
95 |         # Broadcast the initial model state from rank 0 to all other ranks
96 |         cb.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0))
97 |
98 |         # Average the reported metrics across all ranks
99 |         cb.append(hvd.callbacks.MetricAverageCallback())
100 |
101 |     return cb
102 |
103 | def main():
104 |     """Main function"""
105 |
106 |     # Initialization
107 |     args = parse_args()
108 |     rank, n_ranks = init_workers(args.distributed)
109 |
110 |     # Load configuration
111 |     config = load_config(args)
112 |     train_config = config['training']
113 |     output_dir = os.path.expandvars(config['output_dir'])
114 |     checkpoint_format = os.path.join(output_dir, 'checkpoints',
115 |                                      'checkpoint-{epoch}.h5')
116 |     if rank == 0:
117 |         os.makedirs(output_dir, exist_ok=True)
118 |
119 |     # Logging
120 |     config_logging(verbose=args.verbose)
121 |     logging.info('Initialized rank %i out of %i', rank, n_ranks)
122 |     if args.show_config:
123 |         logging.info('Command line config: %s', args)
124 |     if rank == 0:
125 |         logging.info('Job configuration: %s', config)
126 |         logging.info('Saving job outputs to %s', output_dir)
127 |
128 |     # Configure session
129 |     device_config = config.get('device', {})
130 |     configure_session(**device_config)
131 |
132 |     # Load the data
133 |     train_gen, valid_gen = get_datasets(batch_size=train_config['batch_size'],
134 |                                         **config['data'])
135 |
136 |     # Build the model
137 |     model = get_model(**config['model'])
138 |     # Configure optimizer
139 |     opt = get_optimizer(n_ranks=n_ranks, **config['optimizer'])
140 |     # Compile the model
141 |     model.compile(loss=train_config['loss'], optimizer=opt,
142 |                   metrics=train_config['metrics'])
143 |     if rank == 0:
144 |         model.summary()
145 |
146 |     # Prepare the training callbacks
147 |     callbacks = get_basic_callbacks(args.distributed)
148 |
149 |     # Learning rate warmup
150 |     warmup_epochs = train_config.get('lr_warmup_epochs', 0)
151 |     callbacks.append(hvd.callbacks.LearningRateWarmupCallback(
152 |         warmup_epochs=warmup_epochs, verbose=1))
153 |
154 |     # Learning rate decay schedule
155 |     for lr_schedule in train_config.get('lr_schedule', []):
156 |         if rank == 0:
157 |             logging.info('Adding LR schedule: %s', lr_schedule)
158 |         callbacks.append(hvd.callbacks.LearningRateScheduleCallback(**lr_schedule))
159 |
160 |     # Checkpoint only from rank 0
161 |     if rank == 0:
162 |         os.makedirs(os.path.dirname(checkpoint_format), exist_ok=True)
163 |         callbacks.append(keras.callbacks.ModelCheckpoint(checkpoint_format))
164 |
165 |     # Timing callback
166 |     timing_callback = TimingCallback()
167 |     callbacks.append(timing_callback)
168 |
169 |     # Train the model; each rank processes 1/n_ranks of the batches per epoch
170 |     train_steps_per_epoch = max([len(train_gen) // n_ranks, 1])
171 |     valid_steps_per_epoch = max([len(valid_gen) // n_ranks, 1])
172 |     history = model.fit_generator(train_gen,
173 |                                   epochs=train_config['n_epochs'],
174 |                                   steps_per_epoch=train_steps_per_epoch,
175 |                                   validation_data=valid_gen,
176 |                                   validation_steps=valid_steps_per_epoch,
177 |                                   callbacks=callbacks,
178 |                                   workers=4, verbose=2 if rank==0 else 0)
179 |
180 |     # Save training history
181 |     if rank == 0:
182 |         # Print some best-found metrics
183 |         if 'val_acc' in history.history.keys():
184 |             logging.info('Best validation accuracy: %.3f',
185 |                          max(history.history['val_acc']))
186 |         if 'val_top_k_categorical_accuracy' in history.history.keys():
187 |             logging.info('Best top-5 validation accuracy: %.3f',
188 |                          max(history.history['val_top_k_categorical_accuracy']))
189 |         logging.info('Average time per epoch: %.3f s',
190 |                      np.mean(timing_callback.times))
191 |         np.savez(os.path.join(output_dir, 'history'),
192 |                  n_ranks=n_ranks, **history.history)
193 |
194 |     # Drop to IPython interactive shell
195 |     if args.interactive and (rank == 0):
196 |         logging.info('Starting IPython interactive session')
197 |         import IPython
198 |         IPython.embed()
199 |
200 |     if rank == 0:
201 |         logging.info('All done!')
202 |
203 | if __name__ == '__main__':
204 |     main()
205 |
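The `--optimizer` option of this script takes a comma-separated list of `key=value` pairs. The standalone sketch below reuses the same action class to show what such an argument parses to. The command line shown is a hypothetical example, not one taken from the tutorial configs, and the name `optimizer_overrides` is introduced here for illustration only.

```python
import argparse


class StoreDictKeyPair(argparse.Action):
    """Parse 'k1=v1,k2=v2' into a dict of strings."""
    def __call__(self, parser, namespace, values, option_string=None):
        my_dict = {}
        for kv in values.split(","):
            k, v = kv.split("=")
            my_dict[k] = v
        setattr(namespace, self.dest, my_dict)


parser = argparse.ArgumentParser()
parser.add_argument("--optimizer", action=StoreDictKeyPair)

# Hypothetical command line (assumption); every value arrives as a string
args = parser.parse_args(["--optimizer", "name=Adam,lr=0.001,lr_scaling=sqrt"])
optimizer_overrides = args.optimizer
print(optimizer_overrides)
```

Because the values are plain strings, numeric entries such as `lr` still need an explicit cast, which is what `load_config` does with `float(...)` and `int(...)`.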
--------------------------------------------------------------------------------
/utils/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 | """
3 |
--------------------------------------------------------------------------------
/utils/callbacks.py:
--------------------------------------------------------------------------------
1 | """
2 | This module contains some utility callbacks for Keras training.
3 | """
4 |
5 | # System
6 | from time import time
7 |
8 | # Externals
9 | import keras
10 |
11 | class TimingCallback(keras.callbacks.Callback):
12 |     """A Keras Callback which records the time of each epoch"""
13 |     def __init__(self):
14 |         self.times = []
15 |
16 |     def on_epoch_begin(self, epoch, logs={}):
17 |         self.starttime = time()
18 |
19 |     def on_epoch_end(self, epoch, logs={}):
20 |         epoch_time = time() - self.starttime
21 |         self.times.append(epoch_time)
22 |
--------------------------------------------------------------------------------
/utils/device.py:
--------------------------------------------------------------------------------
1 | """
2 | Hardware/device configuration
3 | """
4 |
5 | # System
6 | import os
7 |
8 | # Externals
9 | import keras
10 | import tensorflow as tf
11 |
12 | def configure_session(intra_threads=32, inter_threads=2, gpu=None):
13 |     """Sets the thread knobs in the TF backend"""
14 |     os.environ['OMP_NUM_THREADS'] = str(intra_threads)
15 |     config = tf.ConfigProto(
16 |         inter_op_parallelism_threads=inter_threads,
17 |         intra_op_parallelism_threads=intra_threads
18 |     )
19 |     if gpu is not None:
20 |         config.gpu_options.visible_device_list = str(gpu)
21 |     keras.backend.set_session(tf.Session(config=config))
22 |
--------------------------------------------------------------------------------
/utils/optimizers.py:
--------------------------------------------------------------------------------
1 | """
2 | Utility code for constructing optimizers and scheduling learning rates.
3 | """
4 |
5 | # System
6 | import math
7 |
8 | # Externals
9 | import keras
10 | import horovod.keras as hvd
11 |
12 | def get_optimizer(name, lr, lr_scaling='linear', n_ranks=1, **opt_args):
13 |     """
14 |     Configure the optimizer and scale the learning rate by n_ranks.
15 |     """
16 |     # Scale the learning rate
17 |     if lr_scaling == 'linear':
18 |         lr = lr * n_ranks
19 |     elif lr_scaling == 'sqrt':
20 |         lr = lr * math.sqrt(n_ranks)
21 |
22 |     # Construct the optimizer
23 |     OptType = getattr(keras.optimizers, name)
24 |     opt = OptType(lr=lr, **opt_args)
25 |
26 |     # Distributed optimizer wrapper
27 |     if n_ranks > 1:
28 |         opt = hvd.DistributedOptimizer(opt)
29 |
30 |     return opt
31 |
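The learning-rate scaling in `get_optimizer` can be tried on its own, without Keras or Horovod installed. The sketch below copies only the scaling rule into a hypothetical helper, `scale_lr` (a name introduced here, not part of the repository), and uses an illustrative base learning rate.

```python
import math


def scale_lr(lr, n_ranks, lr_scaling="linear"):
    """Return the learning rate after scaling by the number of ranks."""
    if lr_scaling == "linear":
        return lr * n_ranks
    if lr_scaling == "sqrt":
        return lr * math.sqrt(n_ranks)
    # Any other value leaves the base learning rate unchanged
    return lr


base_lr = 0.01  # illustrative value (assumption)
for n in (1, 4, 16):
    print(n, scale_lr(base_lr, n, "linear"), scale_lr(base_lr, n, "sqrt"))
```

With linear scaling the learning rate grows in proportion to the number of ranks, while square-root scaling grows more slowly, so the two rules diverge quickly as ranks are added.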
--------------------------------------------------------------------------------