├── README.md ├── configs ├── cifar10_cnn.yaml ├── cifar10_resnet.yaml ├── hello.yaml └── imagenet_resnet.yaml ├── data ├── __init__.py ├── cifar10.py ├── dummy.py └── imagenet.py ├── logs └── README.md ├── models ├── __init__.py ├── cnn.py └── resnet.py ├── notebooks └── Analysis.ipynb ├── scripts ├── cifar_cnn.sh └── cifar_resnet.sh ├── train_horovod.py └── utils ├── __init__.py ├── callbacks.py ├── device.py └── optimizers.py /README.md: -------------------------------------------------------------------------------- 1 | # Deep Learning for Science School Tutorial:
Deep Learning At Scale 2 | 3 | This repository contains the material for the DL4Sci tutorial: 4 | *Deep Learning at Scale*. 5 | 6 | It contains specifications for datasets, a couple of CNN models, and 7 | all the training code to enable training the models in a distributed fashion 8 | using Horovod. 9 | 10 | As part of the tutorial, you will train a ResNet model to classify images 11 | from the CIFAR10 dataset on multiple nodes with synchronous data parallelism. 12 | 13 | **Contents** 14 | * [Links](https://github.com/NERSC/dl4sci-scaling-tutorial#links) 15 | * [Installation](https://github.com/NERSC/dl4sci-scaling-tutorial#installation) 16 | * [Navigating the repository](https://github.com/NERSC/dl4sci-scaling-tutorial#navigating-the-repository) 17 | * [Hands-on walk-through](https://github.com/NERSC/dl4sci-scaling-tutorial#hands-on-multi-node-training-example) 18 | * [Code references](https://github.com/NERSC/dl4sci-scaling-tutorial#code-references) 19 | 20 | ## Links 21 | 22 | NERSC JupyterHub: https://jupyter-dl.nersc.gov 23 | 24 | Slides: https://drive.google.com/drive/folders/10NqOLaqPTZ0nobE7JNSaGwrXQi_ACD35?usp=sharing 25 | 26 | ## Installation 27 | 28 | 1. Start a terminal on Cori, either via ssh or from the Jupyter interface. 29 | * **IMPORTANT: if using jupyter, you need to use a SHARED CPU. Click the CPU button instead of the GPU button to run this example!** 30 | 2. Clone the repository using git:\ 31 | `git clone https://github.com/NERSC/dl4sci-scaling-tutorial.git` 32 | 33 | That's it! The rest of the software (Keras, TensorFlow) is pre-installed on Cori 34 | and loaded via the scripts used below. 35 | 36 | ## Navigating the repository 37 | 38 | **`train_horovod.py`** - the main training script which can be steered with YAML 39 | configuration files. 40 | 41 | **`data/`** - folder containing the specifications of the datasets. 
Each dataset 42 | has a corresponding name which is mapped to the specification in `data/__init__.py`. 43 | 44 | **`models/`** - folder containing the Keras model definitions. Again, each model 45 | has a name which is interpreted in `models/__init__.py`. 46 | 47 | **`configs/`** - folder containing the configuration files. Each 48 | configuration specifies a dataset, a model, and all relevant configuration 49 | options (with some exceptions like the number of nodes, which is instead 50 | passed to SLURM on the command line). 51 | 52 | **`scripts/`** - contains SLURM batch scripts that load the software environment 53 | and submit the example jobs to the Cori batch system. 54 | 55 | **`utils/`** - contains additional useful code for the training script, e.g. 56 | custom callbacks, device configuration, and optimizer logic. 57 | 58 | ## Hands-on multi-node training example 59 | 60 | We will use a customized ResNet model in this example to classify CIFAR10 61 | images and demonstrate distributed training with Horovod. 62 | 63 | 1. Check out the ResNet model code in [models/resnet.py](models/resnet.py). 64 | Note that the model code is broken into multiple functions for easy reuse. 65 | We provide here two versions of ResNet models: a standard ResNet50 (with 50 66 | layers) and a smaller ResNet consisting of 26 layers. 67 | * Identify the identity block and conv block functions. *How many convolutional 68 | layers does each of these have*? 69 | * Identify the functions that build the ResNet50 and the ResNetSmall. Given how 70 | many layers are in each block, *see if you can confirm how many layers (conv 71 | and dense) are in the models*. **Hint:** we don't normally count the 72 | convolution applied to the shortcuts. 73 | 74 | 2. Inspect the optimizer setup in [utils/optimizers.py](utils/optimizers.py). 75 | * Note how we scale the learning rate (`lr`) according to the number of 76 | processes (ranks).
77 | * Note how we construct our optimizer and then wrap it in the Horovod 78 | DistributedOptimizer. 79 | 80 | 3. Inspect the main [train_horovod.py](train_horovod.py) training script. 81 | * Identify the `init_workers` function where we initialize Horovod. 82 | Note where this is invoked in the `main()` function (right away). 83 | * Identify where we set up our distributed training callbacks. 84 | * *Which callback ensures we have consistent model weights at the start of training?* 85 | * Identify the callbacks responsible for the learning rate schedule (warmup and decay). 86 | 87 | 4. Finally, look at the configuration file: 88 | [configs/cifar10_resnet.yaml](configs/cifar10_resnet.yaml) 89 | * YAML lets you express configurations in a rich, human-readable, hierarchical structure. 90 | * Identify where you would edit to modify the optimizer, learning rate, batch size, etc. 91 | 92 | That's mostly it for the code. Note that in general for distributed training 93 | you might want to use more complicated data handling, e.g. to ensure different 94 | workers are always processing different samples of your data within a training 95 | epoch. In this case we aren't worrying about that and are, for simplicity, 96 | relying on the independent random shuffling of the data by each worker as well 97 | as the random data augmentation. 98 | 99 | 5. To gain an appreciation for the speedup of training on 100 | multiple nodes, first train the ResNet model on a single node. 101 | Adjust the configuration in [configs/cifar10_resnet.yaml](configs/cifar10_resnet.yaml) 102 | to train for just 1 epoch and then submit the job to the Cori batch system with 103 | SLURM sbatch and our provided SLURM batch script:\ 104 | `sbatch -N 1 scripts/cifar_resnet.sh` 105 | * **Important:** the first time you run a CIFAR10 example, it will 106 | automatically download the dataset. If more than one job attempts 107 | this download simultaneously, it will likely fail. 108 | 109 | 6.
Now we are ready to train our ResNet model on multiple nodes using Horovod 110 | and MPI! If you changed the config to 1 epoch above, be sure to change it back 111 | to 32 epochs for this step. To launch the ResNet training on 8 nodes, do:\ 112 | `sbatch -N 8 scripts/cifar_resnet.sh` 113 | 114 | 7. Check on the status of your job by running `sqs`. 115 | Once the job starts running, you should see the output start to appear in the 116 | Slurm log file `logs/cifar-resnet-*.out`. Some printouts come from every 117 | worker; others are printed only from rank 0. 118 | 119 | 8. When the job is finished, check the log to identify how well your model learned 120 | to solve the CIFAR10 classification task. For every epoch you should see the 121 | loss and accuracy reported for both the training set and the validation set. 122 | Take note of the best validation accuracy achieved. 123 | 124 | Now that you've finished the main tutorial material, try to play with the code 125 | and/or configuration to see the effect on the training results. You can try 126 | things like 127 | * Changing the optimizer (see the Keras optimizers documentation). 128 | * Changing the nominal learning rate, number of warmup epochs, or decay schedule. 129 | * Changing the learning rate scaling (e.g. try "sqrt" scaling instead of linear). 130 | 131 | Most of these things can be changed entirely within the configuration. 132 | See [configs/imagenet_resnet.yaml](configs/imagenet_resnet.yaml) for examples.
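
To make the learning rate scaling and warmup options above more concrete, here is a minimal sketch in plain Python. It is not the code in `utils/optimizers.py` or the Horovod warmup callback; the helper names `scale_lr` and `warmup_lr` are hypothetical and only illustrate the idea, assuming the nominal `lr` from the config is multiplied by the number of ranks (linear) or by its square root (sqrt), and then ramped up linearly over `lr_warmup_epochs`.

```python
import math


def scale_lr(base_lr, n_ranks, lr_scaling='linear'):
    """Scale a nominal learning rate by the number of workers.

    Hypothetical helper for illustration; not part of this repository.
    """
    if lr_scaling == 'linear':
        return base_lr * n_ranks
    if lr_scaling == 'sqrt':
        return base_lr * math.sqrt(n_ranks)
    if lr_scaling in (None, 'none'):
        return base_lr
    raise ValueError('Unknown lr_scaling: %s' % lr_scaling)


def warmup_lr(base_lr, scaled_lr, epoch, warmup_epochs):
    """Ramp linearly from base_lr to scaled_lr over warmup_epochs.

    Hypothetical helper; a per-epoch simplification of a warmup schedule.
    """
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return scaled_lr
    return base_lr + (scaled_lr - base_lr) * epoch / warmup_epochs


if __name__ == '__main__':
    # Values taken from configs/cifar10_resnet.yaml, with 8 ranks assumed
    base_lr, n_ranks, warmup_epochs = 0.0001, 8, 5
    for mode in ('linear', 'sqrt'):
        target = scale_lr(base_lr, n_ranks, mode)
        schedule = [warmup_lr(base_lr, target, e, warmup_epochs)
                    for e in range(warmup_epochs + 1)]
        print(mode, schedule)
```

Running the sketch prints the per-epoch learning rate during warmup for both scaling modes, which can help when deciding how aggressively to scale `lr` as you add nodes.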
133 | 134 | ## Code references 135 | 136 | Keras ResNet50 official model: 137 | https://github.com/keras-team/keras-applications/blob/master/keras_applications/resnet50.py 138 | 139 | Horovod ResNet + ImageNet example: 140 | https://github.com/uber/horovod/blob/master/examples/keras_imagenet_resnet50.py 141 | 142 | CIFAR10 CNN and ResNet examples: 143 | https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py 144 | https://github.com/keras-team/keras/blob/master/examples/cifar10_resnet.py 145 | -------------------------------------------------------------------------------- /configs/cifar10_cnn.yaml: -------------------------------------------------------------------------------- 1 | description: 'CNN CIFAR10' 2 | output_dir: $SCRATCH/isc19-dl-tutorial/cifar-cnn-N${SLURM_JOB_NUM_NODES}-${SLURM_JOB_ID} 3 | 4 | data: 5 | name: cifar10 6 | 7 | model: 8 | name: cnn 9 | input_shape: [32, 32, 3] 10 | n_classes: 10 11 | dropout: 0.1 12 | 13 | optimizer: 14 | name: Adam 15 | lr: 0.001 16 | 17 | training: 18 | batch_size: 64 19 | n_epochs: 24 20 | lr_warmup_epochs: 5 21 | loss: categorical_crossentropy 22 | metrics: [accuracy] 23 | -------------------------------------------------------------------------------- /configs/cifar10_resnet.yaml: -------------------------------------------------------------------------------- 1 | description: 'ResNet CIFAR10' 2 | output_dir: $SCRATCH/isc19-dl-tutorial/cifar10-resnet-N${SLURM_JOB_NUM_NODES}-${SLURM_JOB_ID} 3 | 4 | data: 5 | name: cifar10 6 | 7 | model: 8 | name: resnet_small 9 | input_shape: [32, 32, 3] 10 | n_classes: 10 11 | 12 | optimizer: 13 | name: Adam 14 | lr: 0.0001 15 | lr_scaling: linear 16 | 17 | training: 18 | batch_size: 64 19 | n_epochs: 32 20 | lr_warmup_epochs: 5 21 | loss: categorical_crossentropy 22 | metrics: [accuracy] 23 | 24 | device: 25 | intra_threads: 32 26 | -------------------------------------------------------------------------------- /configs/hello.yaml: 
-------------------------------------------------------------------------------- 1 | description: 'Hello world' 2 | output_dir: $SCRATCH/isc19-dl-tutorial/hello-world 3 | 4 | data: 5 | name: dummy 6 | input_shape: [64, 64, 3] 7 | n_train: 1024 8 | n_valid: 1024 9 | 10 | model: 11 | name: cnn 12 | input_shape: [64, 64, 3] 13 | n_classes: 1 14 | 15 | optimizer: 16 | name: Adam 17 | lr: 0.001 18 | 19 | training: 20 | batch_size: 32 21 | n_epochs: 1 22 | loss: binary_crossentropy 23 | metrics: [accuracy] 24 | -------------------------------------------------------------------------------- /configs/imagenet_resnet.yaml: -------------------------------------------------------------------------------- 1 | # This configuration should match what is implemented in the horovod example: 2 | # https://github.com/uber/horovod/blob/master/examples/keras_imagenet_resnet50.py 3 | 4 | description: 'ResNet ImageNet' 5 | output_dir: $SCRATCH/isc19-dl-tutorial/imagenet-resnet-N${SLURM_JOB_NUM_NODES}-${SLURM_JOB_ID} 6 | 7 | data: 8 | name: imagenet 9 | train_dir: /global/cscratch1/sd/sfarrell/ImageNet-100/train 10 | valid_dir: /global/cscratch1/sd/sfarrell/ImageNet-100/validation 11 | 12 | model: 13 | name: resnet50 14 | input_shape: [224, 224, 3] 15 | n_classes: 100 16 | 17 | optimizer: 18 | name: SGD 19 | lr: 0.0125 20 | momentum: 0.9 21 | 22 | training: 23 | batch_size: 32 24 | n_epochs: 100 25 | lr_warmup_epochs: 5 26 | loss: categorical_crossentropy 27 | metrics: [accuracy, top_k_categorical_accuracy] 28 | lr_schedule: 29 | - {start_epoch: 5, end_epoch: 30, multiplier: 1.} 30 | - {start_epoch: 30, end_epoch: 60, multiplier: 1.e-1} 31 | - {start_epoch: 60, end_epoch: 80, multiplier: 1.e-2} 32 | - {start_epoch: 80, multiplier: 1.e-3} 33 | -------------------------------------------------------------------------------- /data/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Keras dataset specifications. 3 | TODO: add MNIST. 
4 | """ 5 | 6 | def get_datasets(name, **data_args): 7 | if name == 'dummy': 8 | from .dummy import get_datasets 9 | return get_datasets(**data_args) 10 | elif name == 'cifar10': 11 | from .cifar10 import get_datasets 12 | return get_datasets(**data_args) 13 | elif name == 'imagenet': 14 | from .imagenet import get_datasets 15 | return get_datasets(**data_args) 16 | else: 17 | raise ValueError('Dataset %s unknown' % name) 18 | -------------------------------------------------------------------------------- /data/cifar10.py: -------------------------------------------------------------------------------- 1 | """ 2 | CIFAR10 dataset specification. 3 | 4 | https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py 5 | https://github.com/keras-team/keras/blob/master/examples/cifar10_resnet.py 6 | """ 7 | 8 | # Externals 9 | import keras 10 | from keras.datasets import cifar10 11 | from keras.preprocessing.image import ImageDataGenerator 12 | 13 | def get_datasets(batch_size, n_train=None, n_valid=None): 14 | """ 15 | Load the CIFAR10 data and construct pipeline. 
16 | """ 17 | (x_train, y_train), (x_valid, y_valid) = cifar10.load_data() 18 | 19 | # Normalize pixel values 20 | x_train = x_train.astype('float32') / 255 21 | x_valid = x_valid.astype('float32') / 255 22 | 23 | # Select subset of data if specified 24 | if n_train is not None: 25 | x_train, y_train = x_train[:n_train], y_train[:n_train] 26 | if n_valid is not None: 27 | x_valid, y_valid = x_valid[:n_valid], y_valid[:n_valid] 28 | 29 | # Convert labels to class vectors 30 | n_classes = 10 31 | y_train = keras.utils.to_categorical(y_train, n_classes) 32 | y_valid = keras.utils.to_categorical(y_valid, n_classes) 33 | 34 | # Prepare the generators with data augmentation 35 | train_gen = ImageDataGenerator(width_shift_range=0.1, 36 | height_shift_range=0.1, 37 | horizontal_flip=True) 38 | valid_gen = ImageDataGenerator() 39 | train_iter = train_gen.flow(x_train, y_train, batch_size=batch_size, shuffle=True) 40 | valid_iter = valid_gen.flow(x_valid, y_valid, batch_size=batch_size, shuffle=True) 41 | return train_iter, valid_iter 42 | -------------------------------------------------------------------------------- /data/dummy.py: -------------------------------------------------------------------------------- 1 | """ 2 | Random dummy dataset specification. 
3 | """ 4 | 5 | # System 6 | import math 7 | 8 | # Externals 9 | import numpy as np 10 | from keras.utils import Sequence 11 | 12 | class DummyDataset(Sequence): 13 | 14 | def __init__(self, n_samples, batch_size, input_shape, target_shape): 15 | self.x = np.random.normal(size=(n_samples,) + tuple(input_shape)) 16 | self.y = np.random.normal(size=(n_samples,) + tuple(target_shape)) 17 | self.batch_size = batch_size 18 | 19 | def __len__(self): 20 | return math.ceil(len(self.x) / self.batch_size) 21 | 22 | def __getitem__(self, idx): 23 | start = idx * self.batch_size 24 | end = (idx + 1) * self.batch_size 25 | return self.x[start:end], self.y[start:end] 26 | 27 | def get_datasets(batch_size, n_train=1024, n_valid=1024, 28 | input_shape=(32, 32, 3), target_shape=()): 29 | train_data = DummyDataset(n_train, batch_size, input_shape, target_shape) 30 | valid_data = DummyDataset(n_valid, batch_size, input_shape, target_shape) 31 | return train_data, valid_data 32 | -------------------------------------------------------------------------------- /data/imagenet.py: -------------------------------------------------------------------------------- 1 | """ 2 | ImageNet dataset specification. 
3 | 4 | Adapted from 5 | https://github.com/uber/horovod/blob/master/examples/keras_imagenet_resnet50.py 6 | """ 7 | 8 | # Externals 9 | import keras 10 | from keras.preprocessing.image import ImageDataGenerator 11 | 12 | def get_datasets(batch_size, train_dir, valid_dir): 13 | train_gen = ImageDataGenerator( 14 | preprocessing_function=keras.applications.resnet50.preprocess_input, 15 | width_shift_range=0.33, height_shift_range=0.33, zoom_range=0.5, 16 | horizontal_flip=True) 17 | test_gen = ImageDataGenerator( 18 | preprocessing_function=keras.applications.resnet50.preprocess_input, 19 | zoom_range=(0.875, 0.875)) 20 | train_iter = train_gen.flow_from_directory(train_dir, batch_size=batch_size, 21 | target_size=(224, 224), shuffle=True) 22 | test_iter = test_gen.flow_from_directory(valid_dir, batch_size=batch_size, 23 | target_size=(224, 224), shuffle=True) 24 | return train_iter, test_iter 25 | -------------------------------------------------------------------------------- /logs/README.md: -------------------------------------------------------------------------------- 1 | Slurm logs go in this directory 2 | -------------------------------------------------------------------------------- /models/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Keras example model factory functions. 3 | """ 4 | 5 | def get_model(name, **model_args): 6 | if name == 'cnn': 7 | from .cnn import build_model 8 | return build_model(**model_args) 9 | elif name == 'resnet_small': 10 | from .resnet import ResNetSmall 11 | return ResNetSmall(**model_args) 12 | elif name == 'resnet50': 13 | from .resnet import ResNet50 14 | return ResNet50(**model_args) 15 | else: 16 | raise ValueError('Model %s unknown' % name) 17 | -------------------------------------------------------------------------------- /models/cnn.py: -------------------------------------------------------------------------------- 1 | """ 2 | Simple CNN classifier model.
3 | """ 4 | 5 | import keras 6 | from keras.models import Sequential 7 | from keras.layers import Dense, Dropout, Activation, Flatten 8 | from keras.layers import Conv2D, MaxPooling2D 9 | 10 | def build_model(input_shape=(32, 32, 3), n_classes=10, dropout=0): 11 | """Construct the simple CNN model""" 12 | conv_args = dict(kernel_size=3, padding='same', activation='relu') 13 | model = Sequential() 14 | model.add(Conv2D(16, input_shape=input_shape, **conv_args)) 15 | model.add(MaxPooling2D(pool_size=2)) 16 | model.add(Conv2D(32, **conv_args)) 17 | model.add(MaxPooling2D(pool_size=2)) 18 | model.add(Conv2D(64, **conv_args)) 19 | model.add(MaxPooling2D(pool_size=2)) 20 | model.add(Flatten()) 21 | model.add(Dense(128, activation='relu')) 22 | model.add(Dropout(dropout)) 23 | model.add(Dense(n_classes, activation='softmax')) 24 | return model 25 | -------------------------------------------------------------------------------- /models/resnet.py: -------------------------------------------------------------------------------- 1 | """ 2 | ResNet models for Keras. 3 | Implementations have been adapted from keras_applications/resnet50.py 4 | """ 5 | 6 | # Externals 7 | import keras 8 | from keras import backend, layers, models, regularizers 9 | 10 | def identity_block(input_tensor, kernel_size, filters, stage, block, 11 | l2_reg=5e-5, bn_mom=0.9): 12 | """The identity block is the block that has no conv layer at shortcut. 13 | 14 | # Arguments 15 | input_tensor: input tensor 16 | kernel_size: default 3, the kernel size of 17 | middle conv layer at main path 18 | filters: list of integers, the filters of 3 conv layer at main path 19 | stage: integer, current stage label, used for generating layer names 20 | block: 'a','b'..., current block label, used for generating layer names 21 | l2_reg: L2 weight regularization (weight decay) 22 | bn_mom: batch-norm momentum 23 | 24 | # Returns 25 | Output tensor for the block. 
26 | """ 27 | filters1, filters2, filters3 = filters 28 | if backend.image_data_format() == 'channels_last': 29 | bn_axis = 3 30 | else: 31 | bn_axis = 1 32 | conv_name_base = 'res' + str(stage) + block + '_branch' 33 | bn_name_base = 'bn' + str(stage) + block + '_branch' 34 | 35 | x = layers.Conv2D(filters1, (1, 1), 36 | kernel_initializer='he_normal', 37 | kernel_regularizer=regularizers.l2(l2_reg), 38 | name=conv_name_base + '2a')(input_tensor) 39 | x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2a', 40 | momentum=bn_mom, epsilon=1e-5)(x) 41 | x = layers.Activation('relu')(x) 42 | 43 | x = layers.Conv2D(filters2, kernel_size, 44 | padding='same', 45 | kernel_initializer='he_normal', 46 | kernel_regularizer=regularizers.l2(l2_reg), 47 | name=conv_name_base + '2b')(x) 48 | x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2b', 49 | momentum=bn_mom, epsilon=1e-5)(x) 50 | x = layers.Activation('relu')(x) 51 | 52 | x = layers.Conv2D(filters3, (1, 1), 53 | kernel_initializer='he_normal', 54 | kernel_regularizer=regularizers.l2(l2_reg), 55 | name=conv_name_base + '2c')(x) 56 | x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2c', 57 | momentum=bn_mom, epsilon=1e-5)(x) 58 | 59 | x = layers.add([x, input_tensor]) 60 | x = layers.Activation('relu')(x) 61 | return x 62 | 63 | def conv_block(input_tensor, kernel_size, filters, stage, block, 64 | strides=(2, 2), l2_reg=5e-5, bn_mom=0.9): 65 | """A block that has a conv layer at shortcut. 66 | 67 | # Arguments 68 | input_tensor: input tensor 69 | kernel_size: default 3, the kernel size of 70 | middle conv layer at main path 71 | filters: list of integers, the filters of 3 conv layer at main path 72 | stage: integer, current stage label, used for generating layer names 73 | block: 'a','b'..., current block label, used for generating layer names 74 | strides: Strides for the first conv layer in the block. 
75 | l2_reg: L2 weight regularization (weight decay) 76 | bn_mom: batch-norm momentum 77 | 78 | # Returns 79 | Output tensor for the block. 80 | """ 81 | filters1, filters2, filters3 = filters 82 | if backend.image_data_format() == 'channels_last': 83 | bn_axis = 3 84 | else: 85 | bn_axis = 1 86 | conv_name_base = 'res' + str(stage) + block + '_branch' 87 | bn_name_base = 'bn' + str(stage) + block + '_branch' 88 | 89 | x = layers.Conv2D(filters1, (1, 1), strides=strides, 90 | kernel_initializer='he_normal', 91 | kernel_regularizer=regularizers.l2(l2_reg), 92 | name=conv_name_base + '2a')(input_tensor) 93 | x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2a')(x) 94 | x = layers.Activation('relu')(x) 95 | 96 | x = layers.Conv2D(filters2, kernel_size, padding='same', 97 | kernel_initializer='he_normal', 98 | kernel_regularizer=regularizers.l2(l2_reg), 99 | name=conv_name_base + '2b')(x) 100 | x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2b')(x) 101 | x = layers.Activation('relu')(x) 102 | 103 | x = layers.Conv2D(filters3, (1, 1), 104 | kernel_initializer='he_normal', 105 | kernel_regularizer=regularizers.l2(l2_reg), 106 | name=conv_name_base + '2c')(x) 107 | x = layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2c')(x) 108 | 109 | shortcut = layers.Conv2D(filters3, (1, 1), strides=strides, 110 | kernel_initializer='he_normal', 111 | kernel_regularizer=regularizers.l2(l2_reg), 112 | name=conv_name_base + '1')(input_tensor) 113 | shortcut = layers.BatchNormalization( 114 | axis=bn_axis, name=bn_name_base + '1')(shortcut) 115 | 116 | x = layers.add([x, shortcut]) 117 | x = layers.Activation('relu')(x) 118 | return x 119 | 120 | def ResNet50(input_shape=(224, 224, 3), n_classes=1000, 121 | l2_reg=5e-5, bn_mom=0.9): 122 | """Instantiates the ResNet50 architecture. 123 | 124 | # Arguments 125 | input_shape: input shape tuple. It should have 3 input channels. 126 | n_classes: number of classes to classify images. 
127 | l2_reg: L2 weight regularization (weight decay) 128 | bn_mom: batch-norm momentum 129 | 130 | # Returns 131 | A Keras model instance. 132 | """ 133 | img_input = layers.Input(shape=input_shape) 134 | 135 | if backend.image_data_format() == 'channels_last': 136 | bn_axis = 3 137 | else: 138 | bn_axis = 1 139 | 140 | x = layers.ZeroPadding2D(padding=(3, 3), name='conv1_pad')(img_input) 141 | x = layers.Conv2D(64, (7, 7), strides=(2, 2), padding='valid', 142 | kernel_initializer='he_normal', 143 | kernel_regularizer=regularizers.l2(l2_reg), 144 | name='conv1')(x) 145 | x = layers.BatchNormalization(axis=bn_axis, name='bn_conv1')(x) 146 | x = layers.Activation('relu')(x) 147 | x = layers.ZeroPadding2D(padding=(1, 1), name='pool1_pad')(x) 148 | x = layers.MaxPooling2D((3, 3), strides=(2, 2))(x) 149 | 150 | x = conv_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1), 151 | l2_reg=l2_reg, bn_mom=bn_mom) 152 | x = identity_block(x, 3, [64, 64, 256], stage=2, block='b', 153 | l2_reg=l2_reg, bn_mom=bn_mom) 154 | x = identity_block(x, 3, [64, 64, 256], stage=2, block='c', 155 | l2_reg=l2_reg, bn_mom=bn_mom) 156 | 157 | x = conv_block(x, 3, [128, 128, 512], stage=3, block='a', 158 | l2_reg=l2_reg, bn_mom=bn_mom) 159 | x = identity_block(x, 3, [128, 128, 512], stage=3, block='b', 160 | l2_reg=l2_reg, bn_mom=bn_mom) 161 | x = identity_block(x, 3, [128, 128, 512], stage=3, block='c', 162 | l2_reg=l2_reg, bn_mom=bn_mom) 163 | x = identity_block(x, 3, [128, 128, 512], stage=3, block='d', 164 | l2_reg=l2_reg, bn_mom=bn_mom) 165 | 166 | x = conv_block(x, 3, [256, 256, 1024], stage=4, block='a', 167 | l2_reg=l2_reg, bn_mom=bn_mom) 168 | x = identity_block(x, 3, [256, 256, 1024], stage=4, block='b', 169 | l2_reg=l2_reg, bn_mom=bn_mom) 170 | x = identity_block(x, 3, [256, 256, 1024], stage=4, block='c', 171 | l2_reg=l2_reg, bn_mom=bn_mom) 172 | x = identity_block(x, 3, [256, 256, 1024], stage=4, block='d', 173 | l2_reg=l2_reg, bn_mom=bn_mom) 174 | x = identity_block(x, 
3, [256, 256, 1024], stage=4, block='e', 175 | l2_reg=l2_reg, bn_mom=bn_mom) 176 | x = identity_block(x, 3, [256, 256, 1024], stage=4, block='f', 177 | l2_reg=l2_reg, bn_mom=bn_mom) 178 | 179 | x = conv_block(x, 3, [512, 512, 2048], stage=5, block='a', 180 | l2_reg=l2_reg, bn_mom=bn_mom) 181 | x = identity_block(x, 3, [512, 512, 2048], stage=5, block='b', 182 | l2_reg=l2_reg, bn_mom=bn_mom) 183 | x = identity_block(x, 3, [512, 512, 2048], stage=5, block='c', 184 | l2_reg=l2_reg, bn_mom=bn_mom) 185 | 186 | x = layers.GlobalAveragePooling2D(name='avg_pool')(x) 187 | x = layers.Dense(n_classes, activation='softmax', 188 | kernel_regularizer=regularizers.l2(l2_reg), 189 | name='fc1000')(x) 190 | 191 | return models.Model(img_input, x, name='resnet50') 192 | 193 | 194 | def ResNetSmall(input_shape=(32, 32, 3), n_classes=10, 195 | l2_reg=5e-5, bn_mom=0.9): 196 | """Instantiates the small ResNet architecture. 197 | 198 | # Arguments 199 | input_shape: input shape tuple. It should have 3 input channels. 200 | n_classes: number of classes to classify images. 201 | l2_reg: L2 weight regularization (weight decay) 202 | bn_mom: batch-norm momentum 203 | 204 | # Returns 205 | A Keras model instance. 
206 | """ 207 | img_input = layers.Input(shape=input_shape) 208 | 209 | if backend.image_data_format() == 'channels_last': 210 | bn_axis = 3 211 | else: 212 | bn_axis = 1 213 | 214 | x = img_input 215 | x = layers.Conv2D(64, (3, 3), strides=(1, 1), padding='same', 216 | kernel_initializer='he_normal', 217 | kernel_regularizer=regularizers.l2(l2_reg), 218 | name='conv1')(img_input) 219 | x = layers.BatchNormalization(axis=bn_axis, name='bn_conv1')(x) 220 | x = layers.Activation('relu')(x) 221 | 222 | x = conv_block(x, 3, [64, 64, 64], stage=2, block='a', 223 | l2_reg=l2_reg, bn_mom=bn_mom) 224 | x = identity_block(x, 3, [64, 64, 64], stage=2, block='b', 225 | l2_reg=l2_reg, bn_mom=bn_mom) 226 | 227 | x = conv_block(x, 3, [128, 128, 128], stage=3, block='a', 228 | l2_reg=l2_reg, bn_mom=bn_mom) 229 | x = identity_block(x, 3, [128, 128, 128], stage=3, block='b', 230 | l2_reg=l2_reg, bn_mom=bn_mom) 231 | 232 | x = conv_block(x, 3, [256, 256, 256], stage=4, block='a', 233 | l2_reg=l2_reg, bn_mom=bn_mom) 234 | x = identity_block(x, 3, [256, 256, 256], stage=4, block='b', 235 | l2_reg=l2_reg, bn_mom=bn_mom) 236 | 237 | x = conv_block(x, 3, [512, 512, 512], stage=5, block='a', 238 | l2_reg=l2_reg, bn_mom=bn_mom) 239 | x = identity_block(x, 3, [512, 512, 512], stage=5, block='b', 240 | l2_reg=l2_reg, bn_mom=bn_mom) 241 | 242 | x = layers.GlobalAveragePooling2D(name='avg_pool')(x) 243 | x = layers.Dense(n_classes, activation='softmax', 244 | kernel_regularizer=regularizers.l2(l2_reg), 245 | name='fc1000')(x) 246 | 247 | return models.Model(img_input, x, name='resnet_small') 248 | -------------------------------------------------------------------------------- /notebooks/Analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Results analysis\n", 8 | "\n", 9 | "You can use this notebook to analyze the results of your training runs on Cori.\n",
10 | "\n", 11 | "More documentation to come." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "# System\n", 21 | "import os\n", 22 | "\n", 23 | "# Externals\n", 24 | "import numpy as np\n", 25 | "import matplotlib.pyplot as plt" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "# Magic\n", 35 | "%matplotlib inline" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "## Specify the results to load\n", 43 | "\n", 44 | "I'll start with plotting just one experiment. Later we may want to allow to plot multiple runs, like in Tensorboard." 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 22, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "results_dir = '/global/cscratch1/sd/sfarrell/sc18-dl-tutorial/cifar10-cnn-N1-16150683'" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 23, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "name": "stdout", 63 | "output_type": "stream", 64 | "text": [ 65 | "\u001b[0m\u001b[01;34mcheckpoints\u001b[0m/ history.npz out.log\n" 66 | ] 67 | } 68 | ], 69 | "source": [ 70 | "ls $results_dir" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "## Load the summary results and inspect them" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 24, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "history = np.load(os.path.join(results_dir, 'history.npz'))" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 25, 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "n_ranks = int(history['n_ranks'])" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 26, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "data": { 105 | "text/plain": [ 
106 | "['n_ranks', 'val_loss', 'val_acc', 'loss', 'acc', 'lr']" 107 | ] 108 | }, 109 | "execution_count": 26, 110 | "metadata": {}, 111 | "output_type": "execute_result" 112 | } 113 | ], 114 | "source": [ 115 | "history.files" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "## Plot loss and accuracy" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 27, 128 | "metadata": {}, 129 | "outputs": [ 130 | { 131 | "data": { 132 | "image/png": "\n", 133 | "text/plain": [ 134 | "
" 135 | ] 136 | }, 137 | "metadata": { 138 | "needs_background": "light" 139 | }, 140 | "output_type": "display_data" 141 | } 142 | ], 143 | "source": [ 144 | "fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))\n", 145 | "\n", 146 | "# Plot losses\n", 147 | "ax0.plot(history['loss'], label='train')\n", 148 | "ax0.plot(history['val_loss'], label='validation')\n", 149 | "ax0.set_xlabel('Epoch')\n", 150 | "ax0.set_ylabel('Loss')\n", 151 | "ax0.legend(loc=0)\n", 152 | "\n", 153 | "ax1.plot(history['acc'], label='train')\n", 154 | "ax1.plot(history['val_acc'], label='validation')\n", 155 | "ax1.set_xlabel('Epoch')\n", 156 | "ax1.set_ylabel('Accuracy')\n", 157 | "ax1.legend(loc=0)\n", 158 | "\n", 159 | "plt.tight_layout()" 160 | ] 161 | } 162 | ], 163 | "metadata": { 164 | "kernelspec": { 165 | "display_name": "tf-1.11.0-py36", 166 | "language": "python", 167 | "name": "tf-1.11.0-py36" 168 | }, 169 | "language_info": { 170 | "codemirror_mode": { 171 | "name": "ipython", 172 | "version": 3 173 | }, 174 | "file_extension": ".py", 175 | "mimetype": "text/x-python", 176 | "name": "python", 177 | "nbconvert_exporter": "python", 178 | "pygments_lexer": "ipython3", 179 | "version": "3.6.5" 180 | }, 181 | "toc-autonumbering": false, 182 | "toc-showmarkdowntxt": true, 183 | "toc-showtags": false 184 | }, 185 | "nbformat": 4, 186 | "nbformat_minor": 2 187 | } 188 | -------------------------------------------------------------------------------- /scripts/cifar_cnn.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH -J cifar-cnn 3 | #SBATCH -C knl 4 | #SBATCH -N 1 5 | #SBATCH -q regular 6 | #SBATCH --reservation dl4sci 7 | #SBATCH -t 1:00:00 8 | #SBATCH -o logs/%x-%j.out 9 | 10 | # Load the software 11 | module load tensorflow/intel-1.13.1-py36 12 | export KMP_BLOCKTIME=1 13 | export KMP_AFFINITY="granularity=fine,compact,1,0" 14 | 15 | # Ensure dataset is downloaded by single process 16 | python -c "import 
keras; keras.datasets.cifar10.load_data()" 17 | 18 | # Submit multi-node training 19 | srun -u python train_horovod.py configs/cifar10_cnn.yaml -d 20 | -------------------------------------------------------------------------------- /scripts/cifar_resnet.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH -J cifar-resnet 3 | #SBATCH -C knl 4 | #SBATCH -N 1 5 | #SBATCH -q regular 6 | #SBATCH --reservation dl4sci 7 | #SBATCH -t 1:00:00 8 | #SBATCH -o logs/%x-%j.out 9 | 10 | # Load the software 11 | module load tensorflow/intel-1.13.1-py36 12 | export KMP_BLOCKTIME=1 13 | export KMP_AFFINITY="granularity=fine,compact,1,0" 14 | 15 | # Ensure dataset is downloaded by single process 16 | python -c "import keras; keras.datasets.cifar10.load_data()" 17 | 18 | # Submit multi-node training 19 | srun -u python train_horovod.py configs/cifar10_resnet.yaml -d 20 | -------------------------------------------------------------------------------- /train_horovod.py: -------------------------------------------------------------------------------- 1 | """ 2 | Main training script for the Deep Learning at Scale Keras examples. 
3 | """ 4 | 5 | # System 6 | import os 7 | import sys 8 | import argparse 9 | import logging 10 | 11 | # Externals 12 | import keras 13 | import horovod.keras as hvd 14 | import yaml 15 | import numpy as np 16 | 17 | # Locals 18 | from data import get_datasets 19 | from models import get_model 20 | from utils.device import configure_session 21 | from utils.optimizers import get_optimizer 22 | from utils.callbacks import TimingCallback 23 | 24 | # Suppress TF warnings 25 | import tensorflow as tf 26 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 27 | tf.logging.set_verbosity(logging.ERROR) 28 | 29 | #load dictionary from argparse 30 | class StoreDictKeyPair(argparse.Action): 31 | def __call__(self, parser, namespace, values, option_string=None): 32 | my_dict = {} 33 | for kv in values.split(","): 34 | k,v = kv.split("=") 35 | my_dict[k] = v 36 | setattr(namespace, self.dest, my_dict) 37 | 38 | def parse_args(): 39 | """Parse command line arguments.""" 40 | parser = argparse.ArgumentParser('train.py') 41 | add_arg = parser.add_argument 42 | add_arg('config', nargs='?', default='configs/hello.yaml') 43 | add_arg('-d', '--distributed', action='store_true') 44 | add_arg('-v', '--verbose', action='store_true') 45 | add_arg('--show-config', action='store_true') 46 | add_arg('--interactive', action='store_true') 47 | #parameters which override the YAML file 48 | add_arg('--dropout', type=float, help='keep rate for dropout layers') 49 | add_arg("--optimizer", action=StoreDictKeyPair, help="optimizer parameters") 50 | add_arg('--batch_size', type=int, help='batch size for training') 51 | add_arg('--n_epochs', type=int, help='number of epochs to train') 52 | return parser.parse_args() 53 | 54 | def config_logging(verbose): 55 | log_format = '%(asctime)s %(levelname)s %(message)s' 56 | log_level = logging.DEBUG if verbose else logging.INFO 57 | logging.basicConfig(level=log_level, format=log_format) 58 | 59 | def init_workers(distributed=False): 60 | rank, n_ranks = 0, 1 61 | if 
distributed: 62 | hvd.init() 63 | rank, n_ranks = hvd.rank(), hvd.size() 64 | return rank, n_ranks 65 | 66 | def load_config(arguments): 67 | #read base config from yaml file 68 | config_file = arguments.config 69 | with open(config_file) as f: 70 | config = yaml.load(f, Loader=yaml.FullLoader) 71 | 72 | #override with CLA 73 | if arguments.dropout: 74 | config["model"]["dropout"] = arguments.dropout 75 | if arguments.batch_size: 76 | config["training"]["batch_size"] = arguments.batch_size 77 | if arguments.n_epochs: 78 | config["training"]["n_epochs"] = arguments.n_epochs 79 | if arguments.optimizer: 80 | if "name" in arguments.optimizer: 81 | config["optimizer"]["name"] = arguments.optimizer["name"] 82 | if "lr" in arguments.optimizer: 83 | config["optimizer"]["lr"] = float(arguments.optimizer["lr"]) 84 | if "lr_scaling" in arguments.optimizer: 85 | config["optimizer"]["lr_scaling"] = arguments.optimizer["lr_scaling"] 86 | if "lr_warmup_epochs" in arguments.optimizer: 87 | config["training"]["lr_warmup_epochs"] = int(arguments.optimizer["lr_warmup_epochs"]) 88 | 89 | return config 90 | 91 | def get_basic_callbacks(distributed=False): 92 | cb = [] 93 | 94 | if distributed: 95 | #this is for broadcasting the initial model to all nodes 96 | cb.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0)) 97 | 98 | #this is for averaging the reported metrics across all nodes 99 | cb.append(hvd.callbacks.MetricAverageCallback()) 100 | 101 | return cb 102 | 103 | def main(): 104 | """Main function""" 105 | 106 | # Initialization 107 | args = parse_args() 108 | rank, n_ranks = init_workers(args.distributed) 109 | 110 | # Load configuration 111 | config = load_config(args) 112 | train_config = config['training'] 113 | output_dir = os.path.expandvars(config['output_dir']) 114 | checkpoint_format = os.path.join(output_dir, 'checkpoints', 115 | 'checkpoint-{epoch}.h5') 116 | if rank==0: 117 | os.makedirs(output_dir, exist_ok=True) 118 | 119 | # Logging 120 | 
config_logging(verbose=args.verbose) 121 | logging.info('Initialized rank %i out of %i', rank, n_ranks) 122 | if args.show_config: 123 | logging.info('Command line config: %s', args) 124 | if rank == 0: 125 | logging.info('Job configuration: %s', config) 126 | logging.info('Saving job outputs to %s', output_dir) 127 | 128 | # Configure session 129 | device_config = config.get('device', {}) 130 | configure_session(**device_config) 131 | 132 | # Load the data 133 | train_gen, valid_gen = get_datasets(batch_size=train_config['batch_size'], 134 | **config['data']) 135 | 136 | # Build the model 137 | model = get_model(**config['model']) 138 | # Configure optimizer 139 | opt = get_optimizer(n_ranks=n_ranks, **config['optimizer']) 140 | # Compile the model 141 | model.compile(loss=train_config['loss'], optimizer=opt, 142 | metrics=train_config['metrics']) 143 | if rank == 0: 144 | model.summary() 145 | 146 | # Prepare the training callbacks 147 | callbacks = get_basic_callbacks(args.distributed) 148 | 149 | # Learning rate warmup 150 | warmup_epochs = train_config.get('lr_warmup_epochs', 0) 151 | callbacks.append(hvd.callbacks.LearningRateWarmupCallback( 152 | warmup_epochs=warmup_epochs, verbose=1)) 153 | 154 | # Learning rate decay schedule 155 | for lr_schedule in train_config.get('lr_schedule', []): 156 | if rank == 0: 157 | logging.info('Adding LR schedule: %s', lr_schedule) 158 | callbacks.append(hvd.callbacks.LearningRateScheduleCallback(**lr_schedule)) 159 | 160 | # Checkpoint only from rank 0 161 | if rank == 0: 162 | os.makedirs(os.path.dirname(checkpoint_format), exist_ok=True) 163 | callbacks.append(keras.callbacks.ModelCheckpoint(checkpoint_format)) 164 | 165 | # Timing callback 166 | timing_callback = TimingCallback() 167 | callbacks.append(timing_callback) 168 | 169 | # Train the model 170 | train_steps_per_epoch = max([len(train_gen) // n_ranks, 1]) 171 | valid_steps_per_epoch = max([len(valid_gen) // n_ranks, 1]) 172 | history = 
model.fit_generator(train_gen, 173 | epochs=train_config['n_epochs'], 174 | steps_per_epoch=train_steps_per_epoch, 175 | validation_data=valid_gen, 176 | validation_steps=valid_steps_per_epoch, 177 | callbacks=callbacks, 178 | workers=4, verbose=2 if rank==0 else 0) 179 | 180 | # Save training history 181 | if rank == 0: 182 | # Print some best-found metrics 183 | if 'val_acc' in history.history.keys(): 184 | logging.info('Best validation accuracy: %.3f', 185 | max(history.history['val_acc'])) 186 | if 'val_top_k_categorical_accuracy' in history.history.keys(): 187 | logging.info('Best top-5 validation accuracy: %.3f', 188 | max(history.history['val_top_k_categorical_accuracy'])) 189 | logging.info('Average time per epoch: %.3f s', 190 | np.mean(timing_callback.times)) 191 | np.savez(os.path.join(output_dir, 'history'), 192 | n_ranks=n_ranks, **history.history) 193 | 194 | # Drop to IPython interactive shell 195 | if args.interactive and (rank == 0): 196 | logging.info('Starting IPython interactive session') 197 | import IPython 198 | IPython.embed() 199 | 200 | if rank == 0: 201 | logging.info('All done!') 202 | 203 | if __name__ == '__main__': 204 | main() 205 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | """ 3 | -------------------------------------------------------------------------------- /utils/callbacks.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module contains some utility callbacks for Keras training. 
3 | """ 4 | 5 | # System 6 | from time import time 7 | 8 | # Externals 9 | import keras 10 | 11 | class TimingCallback(keras.callbacks.Callback): 12 | """A Keras Callback which records the time of each epoch""" 13 | def __init__(self): 14 | self.times = [] 15 | 16 | def on_epoch_begin(self, epoch, logs={}): 17 | self.starttime = time() 18 | 19 | def on_epoch_end(self, epoch, logs={}): 20 | epoch_time = time() - self.starttime 21 | self.times.append(epoch_time) 22 | -------------------------------------------------------------------------------- /utils/device.py: -------------------------------------------------------------------------------- 1 | """ 2 | Hardware/device configuration 3 | """ 4 | 5 | # System 6 | import os 7 | 8 | # Externals 9 | import keras 10 | import tensorflow as tf 11 | 12 | def configure_session(intra_threads=32, inter_threads=2, gpu=None): 13 | """Sets the thread knobs in the TF backend""" 14 | os.environ['OMP_NUM_THREADS'] = str(intra_threads) 15 | config = tf.ConfigProto( 16 | inter_op_parallelism_threads=inter_threads, 17 | intra_op_parallelism_threads=intra_threads 18 | ) 19 | if gpu is not None: 20 | config.gpu_options.visible_device_list = str(gpu) 21 | keras.backend.set_session(tf.Session(config=config)) 22 | -------------------------------------------------------------------------------- /utils/optimizers.py: -------------------------------------------------------------------------------- 1 | """ 2 | Utility code for constructing optimizers and scheduling learning rates. 3 | """ 4 | 5 | # System 6 | import math 7 | 8 | # Externals 9 | import keras 10 | import horovod.keras as hvd 11 | 12 | def get_optimizer(name, lr, lr_scaling='linear', n_ranks=1, **opt_args): 13 | """ 14 | Configure the optimizer and scale the learning rate by n_ranks. 
15 | """ 16 | # Scale the learning rate 17 | if lr_scaling == 'linear': 18 | lr = lr * n_ranks 19 | elif lr_scaling == 'sqrt': 20 | lr = lr * math.sqrt(n_ranks) 21 | 22 | # Construct the optimizer 23 | OptType = getattr(keras.optimizers, name) 24 | opt = OptType(lr=lr, **opt_args) 25 | 26 | # Distributed optimizer wrapper 27 | if n_ranks > 1: 28 | opt = hvd.DistributedOptimizer(opt) 29 | 30 | return opt 31 | --------------------------------------------------------------------------------
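The learning-rate scaling rule in `utils/optimizers.py` can be tried on its own, without Keras or Horovod installed. The sketch below is a minimal stand-alone illustration: `scale_learning_rate` is a hypothetical helper name that does not exist in the repository, and it only mirrors the `linear` and `sqrt` branches of `get_optimizer`. As in the original, any other `lr_scaling` value leaves the learning rate unchanged.

```python
import math


def scale_learning_rate(lr, n_ranks=1, lr_scaling="linear"):
    """Return the base learning rate scaled for n_ranks workers.

    Hypothetical helper mirroring the scaling branches of get_optimizer;
    it is not part of the repository.
    """
    if lr_scaling == "linear":
        # Linear scaling: the effective batch size grows with n_ranks,
        # so the learning rate is multiplied by n_ranks.
        return lr * n_ranks
    if lr_scaling == "sqrt":
        # Square-root scaling: a more conservative alternative.
        return lr * math.sqrt(n_ranks)
    # Any other value: no scaling applied, as in get_optimizer.
    return lr


if __name__ == "__main__":
    for n_ranks in (1, 4, 16):
        print(n_ranks,
              scale_learning_rate(0.01, n_ranks, "linear"),
              scale_learning_rate(0.01, n_ranks, "sqrt"))
```

With a single rank both rules return the base learning rate, which matches the non-distributed default `n_ranks=1`.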