├── .gitignore ├── 0-intro ├── README.md ├── thumbnail.png └── workflow.png ├── 1-docker ├── README.md └── src │ ├── Dockerfile │ └── main.py ├── 2-kubernetes └── README.md ├── 3-helm ├── README.md └── dokuwiki.png ├── 4-gpus ├── README.md └── src │ ├── Dockerfile │ └── main.py ├── 5-tfjob ├── README.md ├── file-share.png └── tensorboard.png ├── 6-distributed-tensorflow ├── README.md ├── solution-src │ ├── Dockerfile │ └── main.py └── tensorboard.png ├── 7-hyperparam-sweep ├── README.md ├── solution-chart │ ├── Chart.yaml │ ├── templates │ │ ├── _helpers.tpl │ │ └── deployment.yaml │ └── values.yaml ├── src │ ├── Dockerfile │ ├── Dockerfile.gpu │ ├── main.py │ ├── requirements.txt │ └── starry.jpg └── tensorboard.png ├── 8-going-further ├── NFSonAzureConcept.png └── README.md ├── 9-jupyter └── README.md ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store -------------------------------------------------------------------------------- /0-intro/README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | ## Motivations 4 | 5 | Machine learning model development and operationalization currently has very few industry-wide best practices to help us reduce the time to market and optimize the different steps. 6 | 7 | However, in traditional application development, DevOps practices are becoming ubiquitous. 8 | We can benefit from many of these practices by applying them to model development and operationalization. 9 | 10 | Here is a subset of the pain points that exist in a typical ML workflow. 11 | #### A Typical (Simplified) ML Workflow and its Pain Points 12 | ![Typical Workflow](workflow.png) 13 | 14 | This workshop is going to focus on improving the training process by leveraging containers and Kubernetes. 15 | 16 | Today many data scientists are training their models either on their physical workstation (be it a laptop or a desktop with multiple GPUs) or using a VM (sometimes, but rarely, a couple of them) in the cloud. 17 | 18 | This approach is sub-optimal for many reasons, among which: 19 | * Training is slow and sequential 20 | * Having a single (or a few) GPU on hand means there are only so many trainings you can run at a time. It also means that once your GPU is busy with a training, you cannot use it for anything else, such as smaller experiments. 21 | * Hyper-parameter sweeping is vastly inefficient: the different hypotheses you want to test will run sequentially and not in parallel. In practice this means that very often we don't have time to really explore the hyper-parameter space, and we just run the couple of experiments that we think will yield the best results. 22 | The longer the training time, the fewer experiments we can run. 23 | * Distributed training is hard (or impossible) to set up 24 | * In practice very few data scientists benefit from distributed training, either because they simply can't use it (you need multiple machines for that) or because it is too tedious to set up. 25 | * High cost 26 | * If each member of the team has their own allocated resources, in practice many of them will sit unused at any given time; given the price of a single GPU, this is very costly. On the other hand, pooling resources (such as sharing VMs) is also painful since multiple people might want to use them at the same time. 
27 | 28 | Using Kubernetes, we can alleviate many of these pain points: 29 | * Training is massively parallelizable 30 | * Kubernetes is highly scalable (up to 1200 VMs for a single cluster on Azure). In practice that means you could run as many experiments as you want at the same time. This makes exploring and comparing different hypotheses much simpler and more efficient. 31 | * Distributed training is much simpler 32 | * As we will see in this workshop, it is very easy to set up distributed TensorFlow training on Kubernetes and scale it to whatever size you want, making it much more usable in practice. 33 | * Optimized cost with autoscaling* 34 | * Kubernetes allows for resource pooling while at the same time ensuring that any training job can run without waiting for another one to finish. 35 | * With autoscaling the cluster can automatically scale out or in to ensure maximum utilization, thus keeping the cost as low as possible. 36 | 37 | *While autoscaling is very powerful, it is outside the scope of this workshop. However, we will give you resources and pointers to get started with it. 38 | 39 | 40 | ## OpenAI: Building the Infrastructure that Powers the Future of AI 41 | 42 | During KubeCon 2017, Vicki Cheung and Jonas Schneider delivered a keynote explaining how OpenAI manages training at very large scale with Kubernetes; it is worth listening to: 43 | 44 | ![OpenAI](./thumbnail.png) 45 | 46 | ## Next Step 47 | [Module 1: Docker](../1-docker/README.md) 48 | -------------------------------------------------------------------------------- /0-intro/thumbnail.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/0-intro/thumbnail.png -------------------------------------------------------------------------------- /0-intro/workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/0-intro/workflow.png -------------------------------------------------------------------------------- /1-docker/README.md: -------------------------------------------------------------------------------- 1 | # Docker 2 | 3 | ### Summary 4 | 5 | In this section you will learn about: 6 | * Running Docker locally 7 | * Basics of Docker 8 | * Containerizing a simple application 9 | * Building and Pushing an Image 10 | 11 | 12 | 13 | ### Basics of Docker and Containers 14 | 15 | Docker has a very well structured six-part tutorial. 16 | While for this workshop you don't need to go through all of them, parts 1 and 2 are required: 17 | * [Get Started, Part 1: Orientation and setup](https://docs.docker.com/get-started) 18 | * [Get Started, Part 2: Containers](https://docs.docker.com/get-started/part2/) 19 | 20 | By the end of Part 2, you should have a simple container up and running, and understand the basic concepts of a container. 21 | 22 | #### Additional Important Docker Commands 23 | 24 | Here are a few other docker commands that are important to be aware of for the rest of this workshop: 25 | 26 | 1. `docker ps` 27 | 28 | The docker `ps` command allows you to list the status of your containers. 29 | 30 | A container can be either stopped or running. When it finishes executing its process, it stops. 
31 | 32 | For example, if you run the command `docker run -it ubuntu hostname`, it will: 33 | - Pull the official ubuntu image from the registry 34 | - Start the container in the interactive mode `-it` 35 | - Execute the command: `hostname` 36 | - Stop 37 | 38 | ``` 39 | $ docker run -it ubuntu hostname 40 | 0d0af5005fc7 41 | ``` 42 | 43 | If you run the command `docker ps -a` you should see: 44 | ``` 45 | $ docker ps -a 46 | CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 47 | 0d0af5005fc7 ubuntu "hostname" 58 seconds ago Exited (0) About a minute ago gifted_darwin 48 | ``` 49 | 50 | > The `-a` flag allows you to list all the containers, not just the ones that are running 51 | 52 | We can notice a few things here, such as: 53 | - The status `Exited...` for our container 54 | - The name `gifted_darwin` that was randomly generated for our container; we can specify a custom one using the `--name` flag. 55 | - We can re-execute our container using the command `docker start gifted_darwin` 56 | - We can run the command `docker run -it ubuntu hostname` again and do a `docker ps -a`; we should see two exited containers. 57 | 58 | 1. `docker logs` 59 | 60 | The docker `logs` command allows you to fetch the console output from inside the container. 61 | 62 | From our previous example, we can run `docker logs gifted_darwin` 63 | 64 | ``` 65 | $ docker logs gifted_darwin 66 | 0d0af5005fc7 67 | ``` 68 | 69 | You can also stream the logs with the `-f` flag, printing the stdout of your container to your console in real time. 70 | 71 | 1. `docker rm` 72 | 73 | The docker `rm` command allows you to remove a container. 74 | 75 | From the previous example, we can see that we have a container listed as exited in our environment, or maybe more if we run the same command `docker run -it ubuntu hostname` multiple times. If we want to do some cleaning and remove those executions from our environment, we can use the command `docker rm`. 76 | 77 | ``` 78 | $ docker rm gifted_darwin 79 | gifted_darwin 80 | ``` 81 | 82 | > You can specify either the **CONTAINER ID** or the **NAME** of the container to refer to it 83 | 84 | 85 | 1. `docker images` 86 | 87 | This command allows us to list all the images available in the environment. 88 | 89 | ``` 90 | $ docker images 91 | REPOSITORY TAG IMAGE ID CREATED SIZE 92 | ubuntu latest 20c44cd7596f 2 days ago 123MB 93 | example-scratch latest 32ff7b65f567 5 days ago 30.7MB 94 | node 8.9.1-slim a6bb2cc1118f 11 days ago 230MB 95 | buildpack-deps xenial a27b6a8abd1c 2 weeks ago 644MB 96 | ``` 97 | 98 | > You can manage your images by removing them using `docker rmi IMAGENAME` or pulling a new one with `docker pull IMAGENAME` 99 | 100 | ### Containerizing a TensorFlow model 101 | 102 | Now that we understand the basics of Docker, let's containerize our first TensorFlow model that we will reuse in the following modules. 103 | Our first model will be a very simple MNIST classifier. You can see the source code in [`./src/main.py`](./src/main.py). 104 | As you can see, there is nothing specific to containers in this code; you could run this script directly on your laptop or on a VM. 105 | 106 | Now, to have this run in a container, we need to build an image containing this code and its dependencies. 107 | As you saw in the tutorial, we will use a `Dockerfile` to do this. 
108 | 109 | Here is the (very simple) `Dockerfile` that we are going to use for this model (located in [`./src/Dockerfile`](./src/Dockerfile)): 110 | 111 | ```dockerfile 112 | FROM tensorflow/tensorflow:1.4.0 113 | COPY main.py /app/main.py 114 | 115 | ENTRYPOINT ["python", "/app/main.py"] 116 | ``` 117 | 118 | As you can see, we are not building a new image from scratch, instead we are using a base image from TensorFlow. Indeed, TensorFlow has a bunch of base images that you can start with. 119 | You can see the full list here: https://hub.docker.com/r/tensorflow/tensorflow/tags/. 120 | 121 | What is important to note is that different tags need to be used depending on if you want to use GPU or not. 122 | For example, if you wanted to run your model with TensorFlow 1.4.0 and CPU only, you would use `tensorflow/tensorflow:1.4.0`. 123 | If instead you wanted to use GPU, you would start from `tensorflow/tensorflow:1.4.0-gpu`. 124 | 125 | The two other instructions are pretty straightforward, first we copy our script into the container, and then we set this script as the entry point for our container, so that any argument passed to our container would actually get passed to our script. 126 | 127 | #### Building the image 128 | 129 | > If you don't already have a Docker account, see [Log in with your Docker ID](https://docs.docker.com/get-started/part2/#log-in-with-your-docker-id). 130 | 131 | The next step is to build our image to be able to run it using docker. For that, we will use the command `docker build`. 132 | 133 | From the [`./src`](./src) repository, we can build the image with 134 | 135 | ```console 136 | docker build -t ${DOCKER_USERNAME}/tf-mnist . 137 | ``` 138 | > Reminder: the `-t` argument allows to **tag** the image with a specific name. 139 | 140 | `${DOCKER_USERNAME}` should be your Docker username that you use to connect to Docker Hub. 141 | 142 | The output from this command should look like this: 143 | 144 | ``` 145 | Sending build context to Docker daemon 11.26kB 146 | Step 1/3 : FROM tensorflow/tensorflow:1.4.0 147 | ---> a61a91cc0d1b 148 | Step 2/3 : COPY main.py /app/main.py 149 | ---> b264d6e9a5ef 150 | Removing intermediate container fe8128425296 151 | Step 3/3 : ENTRYPOINT python /app/main.py 152 | ---> Running in 7acb7aac7a9f 153 | ---> 92c7ed17916b 154 | Removing intermediate container 7acb7aac7a9f 155 | Successfully built 92c7ed17916b 156 | Successfully tagged wbuchwalter/tf-mnist:latest 157 | ``` 158 | Let's analyse this image full name (`wbuchwalter/tf-mnist:latest`): 159 | * `wbuchwalter` is the name of my repository, this is where we can find the image. This will be different for you (same as your docker hub username). 160 | * `tf-mnist` is the name of the image itself 161 | * `latest` is the tag. `latest` is the default tag if you don't specify any. Tags are usually used to denote different versions or flavors of a same image. For example you could have a tag `v1` and `v2` to denote different versions, or `cpu` and `gpu` to denote what hardware it can run on. 162 | 163 | When you have the successfully built message, you should now be able to see if your image is locally available with the command `docker images` described earlier. 164 | 165 | #### Running the image 166 | 167 | Now we can try to run it locally using the `docker run` command. 168 | By default the model will run 1000 training steps which can take a few minutes on a laptop. Let's reduce this number to 100 with the `--max_steps` argument. 
169 | 170 | ```console 171 | docker run -it ${DOCKER_USERNAME}/tf-mnist --max_steps 100 172 | ``` 173 | 174 | If everything is okay you should see the model training: 175 | 176 | ``` 177 | Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes. 178 | Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz 179 | Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. 180 | Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz 181 | Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. 182 | Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz 183 | Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. 184 | Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz 185 | 2017-11-29 18:32:41.992194: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA 186 | Accuracy at step 0: 0.1292 187 | Accuracy at step 10: 0.7198 188 | Accuracy at step 20: 0.834 189 | Accuracy at step 30: 0.8698 190 | Accuracy at step 40: 0.8783 191 | Accuracy at step 50: 0.8968 192 | Accuracy at step 60: 0.9023 193 | Accuracy at step 70: 0.9059 194 | Accuracy at step 80: 0.9084 195 | Accuracy at step 90: 0.9154 196 | Adding run metadata for 99 197 | ``` 198 | 199 | You can kill the process and exit the container at any time with `ctrl + c`. 200 | 201 | ### Running the image with the NVIDIA GPU of your machine (If you have one) 202 | 203 | **Currently, running docker containers with GPU is only supported on Linux.** 204 | 205 | First install [nvidia-docker](https://github.com/NVIDIA/nvidia-docker). 206 | 207 | You also need to make sure the image you are going to use is optimized for GPU. 208 | In our example you need to modify the `Dockerfile` to use a TensorFlow image built for GPU: 209 | 210 | ```dockerfile 211 | FROM tensorflow/tensorflow:1.4.0-gpu 212 | COPY main.py /app/main.py 213 | 214 | ENTRYPOINT ["python", "/app/main.py"] 215 | ``` 216 | 217 | Then simply rebuild the image with a new tag (you can use docker or nvidia-docker interchangeably for any command except run): 218 | 219 | ```console 220 | docker build -t ${DOCKER_USERNAME}/tf-mnist:gpu . 221 | ``` 222 | 223 | Finally run the container with nvidia-docker: 224 | 225 | ```console 226 | nvidia-docker run -it ${DOCKER_USERNAME}/tf-mnist:gpu 227 | ``` 228 | 229 | > Note: If the command fails with `Unknown runtime specified nvidia`, follow the steps described here (Systemd drop-in file): https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup 230 | 231 | #### Publish the Image 232 | 233 | Our image is now built and running locally, but what about sharing it to be able to use it from anywhere by anyone? 234 | Most importantly we want to be able to reuse this image on the Kubernetes cluster we are going to create in module 2. 
235 | So let's push our image to Docker Hub: 236 | 237 | ```console 238 | docker push ${DOCKER_USERNAME}/tf-mnist 239 | ``` 240 | 241 | If this command doesn't look familiar to you, make sure you went through part 1 and 2 of Docker's tutorial, and more precisely: [Tutorial - Share your image](https://docs.docker.com/get-started/part2/#share-your-image) 242 | 243 | 244 | ### Useful Links 245 | * [What is Docker ?](https://www.docker.com/what-docker) 246 | * [Docker for beginner](https://github.com/docker/labs/blob/master/beginner/readme.md) 247 | 248 | 249 | ## Next Step 250 | [2 - Kubernetes](../2-kubernetes/README.md) 251 | -------------------------------------------------------------------------------- /1-docker/src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.4.0 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] -------------------------------------------------------------------------------- /1-docker/src/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the 'License'); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an 'AS IS' BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """A simple MNIST classifier which displays summaries in TensorBoard. 16 | 17 | This is an unimpressive MNIST model, but it is a good example of using 18 | tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of 19 | naming summary tags so that they are grouped meaningfully in TensorBoard. 20 | 21 | It demonstrates the functionality of every TensorBoard dashboard. 22 | """ 23 | from __future__ import absolute_import 24 | from __future__ import division 25 | from __future__ import print_function 26 | 27 | import argparse 28 | import os 29 | import sys 30 | 31 | import tensorflow as tf 32 | 33 | from tensorflow.examples.tutorials.mnist import input_data 34 | 35 | FLAGS = None 36 | 37 | 38 | def train(): 39 | # Import data 40 | mnist = input_data.read_data_sets(FLAGS.data_dir, 41 | one_hot=True, 42 | fake_data=FLAGS.fake_data) 43 | 44 | # Create a multilayer model. 45 | 46 | # Input placeholders 47 | with tf.name_scope('input'): 48 | x = tf.placeholder(tf.float32, [None, 784], name='x-input') 49 | y_ = tf.placeholder(tf.float32, [None, 10], name='y-input') 50 | 51 | with tf.name_scope('input_reshape'): 52 | image_shaped_input = tf.reshape(x, [-1, 28, 28, 1]) 53 | tf.summary.image('input', image_shaped_input, 10) 54 | 55 | # We can't initialize these variables to 0 - the network will get stuck. 
56 | def weight_variable(shape): 57 | """Create a weight variable with appropriate initialization.""" 58 | initial = tf.truncated_normal(shape, stddev=0.1) 59 | return tf.Variable(initial) 60 | 61 | def bias_variable(shape): 62 | """Create a bias variable with appropriate initialization.""" 63 | initial = tf.constant(0.1, shape=shape) 64 | return tf.Variable(initial) 65 | 66 | def variable_summaries(var): 67 | """Attach a lot of summaries to a Tensor (for TensorBoard visualization).""" 68 | with tf.name_scope('summaries'): 69 | mean = tf.reduce_mean(var) 70 | tf.summary.scalar('mean', mean) 71 | with tf.name_scope('stddev'): 72 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 73 | tf.summary.scalar('stddev', stddev) 74 | tf.summary.scalar('max', tf.reduce_max(var)) 75 | tf.summary.scalar('min', tf.reduce_min(var)) 76 | tf.summary.histogram('histogram', var) 77 | 78 | def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu): 79 | """Reusable code for making a simple neural net layer. 80 | 81 | It does a matrix multiply, bias add, and then uses ReLU to nonlinearize. 82 | It also sets up name scoping so that the resultant graph is easy to read, 83 | and adds a number of summary ops. 84 | """ 85 | # Adding a name scope ensures logical grouping of the layers in the graph. 86 | with tf.name_scope(layer_name): 87 | # This Variable will hold the state of the weights for the layer 88 | with tf.name_scope('weights'): 89 | weights = weight_variable([input_dim, output_dim]) 90 | variable_summaries(weights) 91 | with tf.name_scope('biases'): 92 | biases = bias_variable([output_dim]) 93 | variable_summaries(biases) 94 | with tf.name_scope('Wx_plus_b'): 95 | preactivate = tf.matmul(input_tensor, weights) + biases 96 | tf.summary.histogram('pre_activations', preactivate) 97 | activations = act(preactivate, name='activation') 98 | tf.summary.histogram('activations', activations) 99 | return activations 100 | 101 | hidden1 = nn_layer(x, 784, 500, 'layer1') 102 | 103 | with tf.name_scope('dropout'): 104 | keep_prob = tf.placeholder(tf.float32) 105 | tf.summary.scalar('dropout_keep_probability', keep_prob) 106 | dropped = tf.nn.dropout(hidden1, keep_prob) 107 | 108 | # Do not apply softmax activation yet, see below. 109 | y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity) 110 | 111 | with tf.name_scope('cross_entropy'): 112 | # The raw formulation of cross-entropy, 113 | # 114 | # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)), 115 | # reduction_indices=[1])) 116 | # 117 | # can be numerically unstable. 118 | # 119 | # So here we use tf.nn.softmax_cross_entropy_with_logits on the 120 | # raw outputs of the nn_layer above, and then average across 121 | # the batch. 
122 | diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) 123 | with tf.name_scope('total'): 124 | cross_entropy = tf.reduce_mean(diff) 125 | tf.summary.scalar('cross_entropy', cross_entropy) 126 | 127 | with tf.name_scope('train'): 128 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize( 129 | cross_entropy) 130 | 131 | with tf.name_scope('accuracy'): 132 | with tf.name_scope('correct_prediction'): 133 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 134 | with tf.name_scope('accuracy'): 135 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 136 | tf.summary.scalar('accuracy', accuracy) 137 | 138 | # Merge all the summaries and write them out to 139 | # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default) 140 | merged = tf.summary.merge_all() 141 | 142 | def feed_dict(train): 143 | """Make a TensorFlow feed_dict: maps data onto Tensor placeholders.""" 144 | if train or FLAGS.fake_data: 145 | xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data) 146 | k = FLAGS.dropout 147 | else: 148 | xs, ys = mnist.test.images, mnist.test.labels 149 | k = 1.0 150 | return {x: xs, y_: ys, keep_prob: k} 151 | 152 | sess = tf.InteractiveSession() 153 | train_writer = tf.summary.FileWriter(FLAGS.log_dir + '/train', sess.graph) 154 | test_writer = tf.summary.FileWriter(FLAGS.log_dir + '/test') 155 | tf.global_variables_initializer().run() 156 | # Train the model, and also write summaries. 157 | # Every 10th step, measure test-set accuracy, and write test summaries 158 | # All other steps, run train_step on training data, & add training summaries 159 | 160 | 161 | for i in range(FLAGS.max_steps): 162 | if i % 10 == 0: # Record summaries and test-set accuracy 163 | summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False)) 164 | test_writer.add_summary(summary, i) 165 | print('Accuracy at step %s: %s' % (i, acc)) 166 | else: # Record train set summaries, and train 167 | if i % 100 == 99: # Record execution stats 168 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 169 | run_metadata = tf.RunMetadata() 170 | summary, _ = sess.run([merged, train_step], 171 | feed_dict=feed_dict(True), 172 | options=run_options, 173 | run_metadata=run_metadata) 174 | train_writer.add_run_metadata(run_metadata, 'step%03d' % i) 175 | train_writer.add_summary(summary, i) 176 | print('Adding run metadata for', i) 177 | else: # Record a summary 178 | summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True)) 179 | train_writer.add_summary(summary, i) 180 | train_writer.close() 181 | test_writer.close() 182 | 183 | 184 | def main(_): 185 | train() 186 | 187 | 188 | if __name__ == '__main__': 189 | parser = argparse.ArgumentParser() 190 | parser.add_argument('--fake_data', nargs='?', const=True, type=bool, 191 | default=False, 192 | help='If true, uses fake data for unit testing.') 193 | parser.add_argument('--max_steps', type=int, default=1000, 194 | help='Number of steps to run trainer.') 195 | parser.add_argument('--learning_rate', type=float, default=0.001, 196 | help='Initial learning rate') 197 | parser.add_argument('--dropout', type=float, default=0.9, 198 | help='Keep probability for training dropout.') 199 | parser.add_argument( 200 | '--data_dir', 201 | type=str, 202 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 203 | 'tensorflow/input_data'), 204 | help='Directory for storing input data') 205 | parser.add_argument( 206 | '--log_dir', 207 | type=str, 208 | 
default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 209 | 'tensorflow/logs'), 210 | help='Summaries log directory') 211 | FLAGS, unparsed = parser.parse_known_args() 212 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) -------------------------------------------------------------------------------- /2-kubernetes/README.md: -------------------------------------------------------------------------------- 1 | # Kubernetes 2 | 3 | ### Prerequisites 4 | * [Docker Basics](../1-docker/README.md) 5 | 6 | ### Summary 7 | 8 | In this module you will learn: 9 | * The basic concepts of Kubernetes 10 | * How to create a Kubernetes cluster on Azure 11 | 12 | > *Important* : Kubernetes is very often abbreviated to **K8s**. This is the name we are going to use in this workshop. 13 | 14 | ## The basic concepts of Kubernetes 15 | 16 | [Kubernetes](https://kubernetes.io/) is an open-source technology that makes it easier to automate deployment, scale, and manage containerized applications in a clustered environment. The ability to use GPUs with Kubernetes allows the clusters to facilitate running frequent experimentations, using it for high-performing serving, and auto-scaling of deep learning models, and much more. 17 | 18 | ### Overview 19 | 20 | Kubernetes is a system for managing containerized applications across a cluster of nodes. To work with Kubernetes, you use Kubernetes API objects to describe your cluster’s desired state: what applications or other workloads you want to run, what container images they use, the number of replicas, what network and disk resources you want to make available, and more. You set your desired state by creating objects using the Kubernetes API. Once you’ve set your desired state, the Kubernetes Control Plane works to make the cluster’s current state match the desired state. To do so, Kubernetes performs a variety of tasks automatically, such as starting or restarting containers, scaling the number of replicas of a given application, and more. 21 | 22 | ### Kubernetes Master 23 | 24 | The Kubernetes master is responsible for maintaining the desired state for your cluster. When you interact with Kubernetes, such as by using the kubectl command-line interface, you’re communicating with your cluster’s Kubernetes master. These master services can be installed on a single machine, or distributed across multiple machines. In the following [Provisioning a Kubernetes cluster on Azure](#provisioning-a-kubernetes-cluster-on-azure) section, we will be creating a Kubernetes cluster with 1 master. 25 | 26 | ### Kubernetes Nodes 27 | 28 | The worker nodes communicate with the master components, configure the networking for containers, and run the actual workloads assigned to them. In the following [Provisioning a Kubernetes cluster on Azure](#provisioning-a-kubernetes-cluster-on-azure) section, we will be creating a Kubernetes cluster with 3 worker nodes. 29 | 30 | ### Kubernetes Objects 31 | 32 | Kubernetes contains a number of abstractions that represent the state of your system: deployed containerized applications and workloads, their associated network and disk resources, and other information about what your cluster is doing. A Kubernetes object is a "record of intent" – once you create the object, the Kubernetes system will constantly work to ensure that object exists. By creating an object, you’re telling the Kubernetes system your cluster’s desired state. 
33 | 34 | The basic Kubernetes objects include: 35 | * **Pod** - the smallest and simplest unit in the Kubernetes object model that you create or deploy. A Pod encapsulates an application container (or multiple containers), storage resources, a unique network IP, and options that govern how the container(s) should run. 36 | * **Service** - an abstraction which defines a logical set of Pods and a policy by which to access them. 37 | * **Volume** - an abstraction which allows data to be preserved across container restarts and allows data to be shared between different containers. 38 | * **Namespace** - a way to divide a physical cluster resources into multiple virtual clusters between multiple users. 39 | * **Deployment** - Manages pods and ensures a certain number of them are running. This is typically used to deploy pods that should always be up, such as a web server. 40 | * **Job** - A job creates one or more pods and ensures that a specified number of them successfully terminate. In other words, we use Job to run a task that finishes at some point, such as training a model. 41 | 42 | ### Creating a Kubernetes Object 43 | 44 | When you create an object in Kubernetes, you must provide the object specifications that describes its desired state, as well as some basic information about the object (such as a name) to the Kubernetes API either directly or via the `kubectl` command-line interface. Usually, you will provide the information to `kubectl` in a .yaml file. `kubectl` then converts the information to JSON when making the API request. In the next few sections, we will be using various yaml files to describe the Kubernetes objects we want to deploy to our Kubernetes cluster. 45 | 46 | For example, the `.yaml` file shown below includes the required fields and object spec for a Kubernetes Deployment. A Kubernetes Deployment is an object that can represent an application running on your cluster. In the example below, the Deployment spec describes the desired state of three replicas of the nginx application to be running. When you create the Deployment, the Kubernetes system reads the Deployment spec and starts three instances of your desired application, updating the status to match your spec. 47 | 48 | ```yaml 49 | apiVersion: apps/v1beta2 # Kubernetes API version for the object 50 | kind: Deployment # The type of object described by this YAML, here a Deployment 51 | metadata: 52 | name: nginx-deployment # Name of the deployment 53 | spec: # Actual specifications of this deployment 54 | replicas: 3 # Number of replicas (instances) for this deployment. 1 replica = 1 pod 55 | template: 56 | metadata: 57 | labels: 58 | app: nginx 59 | spec: # Specification for the Pod 60 | containers: # These are the containers running inside our Pod, in our case a single one 61 | - name: nginx # Name of this container 62 | image: nginx:1.7.9 # Image to run 63 | ports: # Ports to expose 64 | - containerPort: 80 65 | ``` 66 | 67 | To create all the objects described in a Deployment using a `.yaml` file like the one above in your own Kubernetes cluster you can use Kubernetes' CLI (`kubectl`). 68 | We will be creating a deployment in the exercise toward the end of this module, but first we need a cluster. 
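For reference, once a spec like the one above is saved to a file, this is how you would create and inspect the resulting objects with `kubectl` (the filename here is just an illustrative choice):

```console
kubectl create -f nginx-deployment.yaml   # create the Deployment described in the file
kubectl get deployments                   # check the Deployment and how many replicas are ready
kubectl get pods                          # list the three nginx Pods the Deployment created
```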
69 | 70 | ## Provisioning a Kubernetes cluster on Azure 71 | 72 | There are multiple ways to provision a Kubernetes (K8s) cluster on Azure: 73 | * ACS 74 | * AKS 75 | * acs-engine 76 | 77 | AKS is currently still in preview and acs-engine is a bit more complex to set up, so we advise you to create your cluster using ACS. 78 | 79 | We are going to create a Linux-based K8s cluster. 80 | You can either create the cluster using the portal, or using the Azure CLI (`az`). 81 | 82 | ### A Note on GPUs with Kubernetes 83 | 84 | As of this writing, GPUs are still in preview with ACS. 85 | You can deploy an ACS cluster with GPU VMs (such as `Standard_NC6`) in `westus2` or `uksouth`, but you should be aware of some pitfalls: 86 | * Deploying a GPU cluster takes longer than a CPU cluster (about 10-15 minutes more) because the NVIDIA drivers need to be installed as well. 87 | * Since this is a preview, you might hit capacity issues if the location you chose does not have enough GPUs available to accommodate you. 88 | 89 | **Unless you are already pretty familiar with docker and Kubernetes, we recommend that you create a cluster with CPU VMs to save some time.** 90 | Only module 4 has an exercise that is specific to GPU VMs; all other modules can be followed on either CPU or GPU clusters. 91 | 92 | ### With the CLI 93 | 94 | #### Creating a resource group 95 | ```console 96 | az group create --name <RESOURCE_GROUP_NAME> --location <LOCATION> 97 | ``` 98 | 99 | With: 100 | 101 | | Parameter | Description | 102 | | --- | --- | 103 | | RESOURCE_GROUP_NAME | Name of the resource group where the cluster will be deployed. | 104 | | LOCATION | Name of the region where the cluster should be deployed. | 105 | 106 | #### Creating the cluster 107 | ```console 108 | az acs create --agent-vm-size <AGENT_SIZE> --resource-group <RG> --name <NAME> \ 109 | --orchestrator-type Kubernetes --agent-count <AGENT_COUNT> \ 110 | --location <LOCATION> --generate-ssh-keys 111 | ``` 112 | 113 | With: 114 | 115 | | Parameter | Description | 116 | | --- | --- | 117 | | AGENT_SIZE | The size of K8s's agent VM. `Standard_D2_v2` is enough for this workshop. | 118 | | RG | Name of the resource group that was created in the previous step. | 119 | | NAME | Name of the ACS resource (can be whatever you want). | 120 | | AGENT_COUNT | The number of agents (virtual machines) that you want in your cluster. 2 or 3 is recommended to play with hyper-parameter tuning and distributed TensorFlow | 121 | | LOCATION | Same location that was specified for the resource group creation. | 122 | 123 | The command should take a few minutes to complete (longer if you chose GPU VMs). Once it is done, the output should be a JSON object indicating among other things the `provisioningState`: 124 | ``` 125 | { 126 | [...] 127 | "provisioningState": "Succeeded", 128 | [...] 129 | } 130 | ``` 131 | 132 | #### Getting the `kubeconfig` file 133 | 134 | The `kubeconfig` file is a configuration file that will allow Kubernetes' CLI (`kubectl`) to know how to talk to our cluster. 135 | To download the `kubeconfig` file from the cluster we just created, run: 136 | 137 | ```console 138 | az acs kubernetes get-credentials --name <NAME> --resource-group <RG> 139 | ``` 140 | 141 | Where `NAME` and `RG` should be the same values as for the cluster creation. 
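For example, with illustrative values (a resource group named `k8s-workshop` and a cluster named `k8s-ml`, both hypothetical names), the whole sequence could look like this:

```console
# create the resource group
az group create --name k8s-workshop --location westus2
# create a 3-agent Kubernetes cluster with small CPU VMs
az acs create --agent-vm-size Standard_D2_v2 --resource-group k8s-workshop --name k8s-ml \
    --orchestrator-type Kubernetes --agent-count 3 --location westus2 --generate-ssh-keys
# download the kubeconfig file so kubectl can talk to the cluster
az acs kubernetes get-credentials --name k8s-ml --resource-group k8s-workshop
```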
142 | 143 | ##### Validation 144 | 145 | Once you are done with the cluster creation and have downloaded the `kubeconfig` file, running the following command: 146 | 147 | ```console 148 | kubectl get nodes 149 | ``` 150 | 151 | Should yield an output similar to this one: 152 | ``` 153 | NAME STATUS AGE VERSION 154 | k8s-agent-ef2b999d-0 Ready 9d v1.7.7 155 | k8s-agent-ef2b999d-1 Ready 9d v1.7.7 156 | k8s-agent-ef2b999d-2 Ready 9d v1.7.7 157 | k8s-master-ef2b999d-0 Ready 9d v1.7.7 158 | ``` 159 | 160 | If you provisioned GPU VMs, describing one of the nodes should indicate the presence of GPU(s) on the node: 161 | ```console 162 | > kubectl describe node <NODE_NAME> 163 | 164 | [...] 165 | Capacity: 166 | alpha.kubernetes.io/nvidia-gpu: 1 167 | [...] 168 | ``` 169 | 170 | ## Exercise 171 | 172 | ### Running our Model on Kubernetes 173 | 174 | > Note: If you didn't complete the exercise in module 1, you can use the `wbuchwalter/tf-mnist` image for this exercise. 175 | 176 | In module 1, we created an image for our MNIST classifier, ran a small training locally and pushed this image to Docker Hub. 177 | Since we now have a running Kubernetes cluster, let's run our training on it! 178 | 179 | First, we need to create a YAML template to define what we want to deploy. 180 | We want our deployment to have a few characteristics: 181 | * It should be a `Job` since we expect the training to finish successfully after some time. 182 | * It should run the image you created in module 1 (or `wbuchwalter/tf-mnist` if you skipped this module). 183 | * The `Job` should be named `module2-ex1`. 184 | * We want our training to run for `500` steps. 185 | 186 | Here is what this would look like in YAML format: 187 | 188 | ```yaml 189 | apiVersion: batch/v1 190 | kind: Job # Our training should be a Job since it is supposed to terminate at some point 191 | metadata: 192 | name: module2-ex1 # Name of our job 193 | spec: 194 | template: # Template of the Pod that is going to be run by the Job 195 | metadata: 196 | name: module2-ex1 # Name of the pod 197 | spec: 198 | containers: # List of containers that should run inside the pod, in our case there is only one. 199 | - name: tensorflow 200 | image: ${DOCKER_USERNAME}/tf-mnist # The image to run, you can replace it with your own. 201 | args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile 202 | restartPolicy: OnFailure # restart the pod if it fails 203 | ``` 204 | 205 | Save this template somewhere and deploy it with: 206 | 207 | ```console 208 | kubectl create -f <template-path> 209 | ``` 210 | 211 | #### Validation 212 | 213 | After deploying the template, the following command: 214 | 215 | ```console 216 | kubectl get job 217 | ``` 218 | 219 | Should show your new job: 220 | 221 | ```bash 222 | NAME DESIRED SUCCESSFUL AGE 223 | module2-ex1 1 0 1m 224 | ``` 225 | 226 | Looking at the Pods: 227 | ```console 228 | kubectl get pods 229 | ``` 230 | You should see your training running: 231 | ```bash 232 | NAME READY STATUS RESTARTS AGE 233 | module2-ex1-c5b8q 1/1 Running 0 1m 234 | ``` 235 | 236 | Finally, you can look at the logs of your pod with: 237 | 238 | ```console 239 | kubectl logs <pod-name> 240 | ``` 241 | > Be careful to use the Pod name (from `kubectl get pods`) not the Job name. 
242 | 243 | And you should see the training happening 244 | 245 | ```bash 246 | 2017-11-29 21:49:16.462292: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX 247 | Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes. 248 | Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz 249 | Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. 250 | Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz 251 | Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. 252 | Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz 253 | Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. 254 | Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz 255 | Accuracy at step 0: 0.1285 256 | Accuracy at step 10: 0.674 257 | Accuracy at step 20: 0.8065 258 | Accuracy at step 30: 0.8606 259 | Accuracy at step 40: 0.8759 260 | Accuracy at step 50: 0.888 261 | [...] 262 | ``` 263 | 264 | After a few minutes, looking again at the Job should show that it has completed successfully: 265 | ```console 266 | kubectl get job 267 | ``` 268 | 269 | ```bash 270 | NAME DESIRED SUCCESSFUL AGE 271 | module2-ex1 1 1 3m 272 | ``` 273 | 274 | ## Next Step 275 | 276 | Currently our training doesn't do anything interesting. We are not even saving the model and summaries anywhere, but don't worry we are going to dive into this starting in Module 4. 277 | 278 | [Module 3: Helm](../3-helm/README.md) 279 | -------------------------------------------------------------------------------- /3-helm/README.md: -------------------------------------------------------------------------------- 1 | # Helm 2 | 3 | ## Prerequisites 4 | 5 | * [Docker Basics](../1-docker/README.md) 6 | * [Kubernetes Basics and cluster created](../2-kubernetes) 7 | 8 | ## Summary 9 | 10 | In this module you will learn : 11 | * What is Helm and how to use it 12 | * What is a Chart and how to create one 13 | 14 | ## Context 15 | 16 | As you saw in the second module [Kubernetes Basics and cluster created](../2-kubernetes), the default way to deploy objects in Kubernetes is by using `kubectl` with `yaml` files. 17 | 18 | For example, if we want to deploy a `pod` running `nginx` and then make it available from an external IP using a `service` you will need to describe at least these two objects such as : 19 | 20 | Deployment : 21 | ```yaml 22 | apiVersion: apps/v1beta2 # for versions before 1.8.0 use apps/v1beta1 23 | kind: Deployment 24 | metadata: 25 | name: nginx-deployment 26 | spec: 27 | selector: 28 | matchLabels: 29 | app: nginx 30 | replicas: 2 # tells deployment to run 2 pods matching the template 31 | template: # create pods using pod definition in this template 32 | metadata: 33 | labels: 34 | app: nginx 35 | spec: 36 | containers: 37 | - name: nginx 38 | image: nginx:1.7.9 39 | ports: 40 | - containerPort: 80 41 | ``` 42 | Service : 43 | ```yaml 44 | apiVersion: v1 45 | kind: Service 46 | metadata: 47 | name: nginx-service 48 | spec: 49 | ports: 50 | - port: 8000 51 | targetPort: 80 52 | protocol: TCP 53 | type: LoadBalancer 54 | selector: 55 | app: nginx 56 | ``` 57 | 58 | The problem with this approach is that when you need to make an update to your solution, you will need to update it across different yaml files. 59 | 60 | Let's say you want to change the name of the app from `nginx` to `nginx-production`. 
You have to change it in a few places in the deployment and not forget to change the selector setting in the service as well. 61 | 62 | This is one example among others where Helm is fixing the issue by being able to create and use templates. 63 | 64 | ## Helm and Chart 65 | 66 | Helm is the [package manager for Kubernetes](https://deis.com/blog/2016/trusting-whos-at-the-helm/). 67 | 68 | A package is named a **Chart**. 69 | 70 | You can either create you own, or pull and install an official one such as Wordpress, GitLab, Apache Spark, etc... 71 | 72 | You can find a list of the official ones here : [https://github.com/kubernetes/charts/tree/master/stable](https://github.com/kubernetes/charts/tree/master/stable) 73 | 74 | To use Helm, you need to have the [CLI installed on your machine](https://github.com/kubernetes/helm/blob/master/docs/install.md) 75 | 76 | Let's try to deploy an official Chart such as the popular [Wordpress](https://github.com/kubernetes/charts/tree/master/stable/wordpress) 77 | 78 | We'll need to initialize helm first, with this command: 79 | 80 | ```bash 81 | helm init 82 | ``` 83 | 84 | Which should return something similar to: 85 | ```bash 86 | Adding stable repo with URL: https://kubernetes-charts.storage.googleapis.com 87 | Adding local repo with URL: http://127.0.0.1:8879/charts 88 | $HELM_HOME has been configured at /Users/YOURUSER/.helm. 89 | ``` 90 | 91 | ```bash 92 | helm install stable/wordpress 93 | ``` 94 | 95 | > Note: If you have an error such has `Error: incompatible versions client[v2.7.0] server[v2.6.2]`, please run `helm init --upgrade` 96 | 97 | After a few seconds you should see the following output in your terminal : 98 | 99 | ```bash 100 | ... 101 | NAME: cloying-crocodile 102 | LAST DEPLOYED: Wed Nov 22 11:29:55 2017 103 | NAMESPACE: default 104 | STATUS: DEPLOYED 105 | 106 | RESOURCES: 107 | ==> v1beta1/Deployment 108 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 109 | cloying-crocodile-mariadb 1 1 1 0 10s 110 | cloying-crocodile-wordpress 1 1 1 0 10s 111 | 112 | ==> v1/Pod(related) 113 | NAME READY STATUS RESTARTS AGE 114 | cloying-crocodile-mariadb-1648957417-0prvc 0/1 Pending 0 10s 115 | cloying-crocodile-wordpress-3958361718-z9qr3 0/1 Pending 0 10s 116 | 117 | ==> v1/Secret 118 | NAME TYPE DATA AGE 119 | cloying-crocodile-mariadb Opaque 2 10s 120 | cloying-crocodile-wordpress Opaque 2 10s 121 | 122 | ==> v1/ConfigMap 123 | NAME DATA AGE 124 | cloying-crocodile-mariadb 1 10s 125 | 126 | ==> v1/PersistentVolumeClaim 127 | NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE 128 | cloying-crocodile-mariadb Pending default 10s 129 | cloying-crocodile-wordpress Pending default 10s 130 | 131 | ==> v1/Service 132 | NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 133 | cloying-crocodile-mariadb ClusterIP 10.0.163.26 3306/TCP 10s 134 | cloying-crocodile-wordpress LoadBalancer 10.0.168.104 80:31549/TCP,443:32728/TCP 10s 135 | ... 136 | ``` 137 | 138 | You can see all the objects that are necessary to run our Wordpress application in your Kubernetes cluster deployed such as **pods**, **services**, **secrets** etc... Furthermore, since we need a MariaDB engine to run Wordpress, Helm did also automatically deploy it as a dependency in the cluster ! 139 | 140 | As you can see inside the [Wordpress's Chart documentation](https://github.com/kubernetes/charts/tree/master/stable/wordpress) you can override some values such as the image, the database name or the SMTP server for example. 
141 | 142 | You just have to use the `--set` option during the `install` command, like so : 143 | 144 | ```bash 145 | helm install --name my-wordpress \ 146 | --set wordpressUsername=admin,wordpressPassword=password,mariadb.mariadbRootPassword=secretpassword \ 147 | stable/wordpress 148 | ``` 149 | 150 | ## Create your own Chart 151 | 152 | You can also create your own Chart by using the scaffolding command `helm create mychart` 153 | 154 | This will create a folder which includes all the files necessary to create your own package : 155 | 156 | ```bash 157 | ├── Chart.yaml 158 | ├── templates 159 | │   ├── NOTES.txt 160 | │   ├── _helpers.tpl 161 | │   ├── deployment.yaml 162 | │   ├── ingress.yaml 163 | │   └── service.yaml 164 | └── values.yaml 165 | ``` 166 | 167 | All the objects that you want to deploy are stored inside the templates folder in different .yaml files. 168 | 169 | You can find more information on how to create your own chart here : [https://deis.com/blog/2016/getting-started-authoring-helm-charts/](https://deis.com/blog/2016/getting-started-authoring-helm-charts/) 170 | 171 | When you are done with your package, Helm provides a linting tool `helm lint mychart` to help you find issues in it. 172 | 173 | If you want to deploy it into your cluster, you can run the following command from the repository folder: 174 | 175 | ```bash 176 | helm install . --name my-custom-chart 177 | ``` 178 | 179 | ## Exercises 180 | 181 | ### Exercise 1 - Deploy an official Chart : DokuWiki 182 | 183 | From the [official Chart repository](https://github.com/kubernetes/charts/tree/master) you have to deploy a DokuWiki environment. 184 | 185 | [DokuWiki](https://www.dokuwiki.org/) is a standards-compliant, simple to use wiki optimized for creating documentation. It is targeted at developer teams, workgroups, and small companies. All data is stored in plain text files, so no database is required. 186 | 187 | #### Validation 188 | 189 | We want to be able to define a custom Wiki name such as `Hello MLADS` at the deployment. 190 | 191 | You should see the following web page from your deployment : 192 | 193 | ![](dokuwiki.png) 194 | 195 | #### Solution 196 | 197 |
198 | <details><summary>Solution (expand to see)</summary> 199 |

200 | 201 | ```bash 202 | helm install stable/dokuwiki --set dokuwikiWikiName="Hello MLADS" 203 | ``` 204 | 205 |

206 | </details>
207 | 208 | 209 | ## Next Step 210 | 211 | [4 - GPUs](../4-gpus/README.md) -------------------------------------------------------------------------------- /3-helm/dokuwiki.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/3-helm/dokuwiki.png -------------------------------------------------------------------------------- /4-gpus/README.md: -------------------------------------------------------------------------------- 1 | # GPUs And Kubernetes 2 | 3 | ## Prerequisites 4 | * [1 - Docker Basics](../1-docker) 5 | * [2 - Kubernetes Basics and cluster created](../2-kubernetes) 6 | 7 | ## Summary 8 | 9 | In this module you will learn how to: 10 | * Create a Pod that uses a GPU, by: 11 | * Requesting a GPU 12 | * Mounting the NVIDIA drivers into the container 13 | 14 | 15 | ## Important Note 16 | 17 | If you created a cluster with CPU VMs only you won't be able to complete the exercises in this module, but it still contains valuable information that you should read through nonetheless. 18 | 19 | ## How GPUs work with Kubernetes 20 | 21 | GPU support in K8s is still in its early stages, and as such requires a bit of effort on your part to use. 22 | 23 | While you don't need to do anything to access a CPU from inside your container (except optionally specifying a CPU request and limit), getting access to the agent's GPU is a little bit more tricky: 24 | * First, the drivers need to be installed on the agent, otherwise this agent will not report the presence of a GPU, and you won't be able to use it (this is already done for you in ACS/AKS/acs-engine). 25 | * Then you need to explicitly ask for 1 or multiple GPU(s) to be mounted into your container, otherwise you will simply not be able to access the GPU, even if it is running on a GPU agent. 26 | * Finally, and most importantly, you need to mount the drivers from the agent VM into your container. 27 | 28 | In Module 5, we will see how this process can be greatly simplified when using TensorFlow with `TFJob`, but for now, let's do it ourselves. 29 | 30 | 31 | ### Creating a container that can benefit from GPU 32 | 33 | As a prerequisite for everything else, it is important to make sure that the container we are going to use actually knows what to do with a GPU. 34 | For example, TensorFlow needs to be installed with GPU support. CUDA and cuDNN also need to be present. 35 | Thankfully, most deep learning frameworks provide base images that are ready to use with GPU support, so we can use them as our base image. 36 | 37 | For example, TensorFlow has a lot of different images ready to use at [https://hub.docker.com/r/tensorflow/tensorflow/tags/](https://hub.docker.com/r/tensorflow/tensorflow/tags/), such as: 38 | * `tensorflow/tensorflow:1.4.0-gpu-py3` for GPU 39 | * `tensorflow/tensorflow:1.4.0-py3` for CPU only 40 | 41 | CNTK also has pre-built images with or without GPU support ([https://hub.docker.com/r/microsoft/cntk/tags/](https://hub.docker.com/r/microsoft/cntk/tags/)): 42 | * `microsoft/cntk:2.2-gpu-python3.5-cuda8.0-cudnn6.0` for GPU 43 | * `microsoft/cntk:2.2-python3.5` for CPU only 44 | 45 | Also, what's important to note is that most deep learning framework images are built on top of the official [nvidia/cuda](https://hub.docker.com/r/nvidia/cuda/) image, which already comes with CUDA and cuDNN preinstalled, so you don't need to worry about installing them. 
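As a concrete illustration, a GPU-ready variant of the module 1 `Dockerfile` only differs from the CPU one by its base image (this is the same change shown at the end of module 1):

```dockerfile
# GPU-enabled base image; CUDA and cuDNN are already included
FROM tensorflow/tensorflow:1.4.0-gpu
COPY main.py /app/main.py

ENTRYPOINT ["python", "/app/main.py"]
```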
46 | 47 | 48 | ### Requesting GPU(s) 49 | 50 | K8s has a concept of resource `requests` and `limits` allowing you to specify how much CPU, RAM and GPU should be reserved for a specific container. 51 | By default, if no `limits` are specified for CPU or RAM on a container, K8s will schedule it on any node and run the container with unbounded CPU and memory limits. 52 | 53 | > *To learn more about K8s `requests` and `limits`, see [Managing Compute Resources for Containers](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/).* 54 | 55 | However, things are different for GPUs. If no `limit` is defined for GPU, K8s will run the pod on any node (with or without GPU), and will not expose the GPU even if the node has one. So you need to explicitly set the `limit` to the exact number of GPUs that should be assigned to your container. 56 | Also, note that while you can request a fraction of a CPU, you cannot request a fraction of a GPU. One GPU can thus only be assigned to one container at a time. 57 | The name for the GPU resource in K8s is `alpha.kubernetes.io/nvidia-gpu` for versions `1.8` and below and `nvidia.com/gpu` for versions > `1.9`. Note that currently only NVIDIA GPUs are supported. 58 | 59 | To set the `limit` for GPU, you should provide a value to `spec.containers[].resources.limits.alpha.kubernetes.io/nvidia-gpu`; in YAML this would look like: 60 | 61 | ```yaml 62 | [...] 63 | containers: 64 | - name: tensorflow 65 | image: tensorflow/tensorflow:latest-gpu 66 | resources: 67 | limits: 68 | alpha.kubernetes.io/nvidia-gpu: 1 69 | [...] 70 | ``` 71 | 72 | ### Exposing the node's drivers into the container 73 | 74 | Now for the tricky part. 75 | As stated earlier, the NVIDIA drivers need to be exposed (mounted) from the node into the container. This is a bit tricky since the location of the drivers can vary depending on the operating system of the node, as well as on how the drivers were installed. 76 | For ACS/AKS/acs-engine only Ubuntu nodes are supported so far, so it should be a consistent experience as long as your cluster was created with one of them. 77 | 78 | ##### Driver locations on the node 79 | 80 | | Path | Purpose | 81 | |----|----| 82 | |`/usr/lib/nvidia-384` | NVIDIA libraries | 83 | |`/usr/lib/nvidia-384/bin`| NVIDIA binaries | 84 | |`/usr/lib/x86_64-linux-gnu/libcuda.so.1` | CUDA Driver API library | 85 | 86 | > Note that the NVIDIA driver's version is `384` at the time of this writing, but the driver's location will change as the version changes. 87 | 88 | For each of the above paths we need to create a corresponding `Volume` and a `VolumeMount` to expose them into our container. 89 | 90 | > To understand how to configure `Volumes` and `VolumeMounts`, take a look at [Volumes](https://kubernetes.io/docs/user-guide/walkthrough/#volumes) in the Kubernetes documentation. 91 | 92 | ## Exercises 93 | 94 | ### 1. NVIDIA-SMI 95 | In this first exercise we are simply going to schedule a `Job` that will run `nvidia-smi`, print details about our GPU from inside the container, and exit. 96 | You don't need to build a custom image; instead, simply use the official `nvidia/cuda` Docker image. 
97 | 98 | Your K8s YAML template should have the following characteristics: 99 | * It should be a `Job` 100 | * It should be named `module4-ex1` 101 | * It should request 1 GPU 102 | * It should mount the drivers from the node into the container 103 | * It should run the `nvidia-smi` executable 104 | 105 | #### Useful Links 106 | * [Microsoft Azure Container Service Engine - Using GPUs with Kubernetes](https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/gpu.md) 107 | 108 | #### Validation 109 | 110 | Once you have created your Job with `kubectl create -f <template-path>`: 111 | 112 | ```console 113 | kubectl get pods -a 114 | ``` 115 | The `-a` argument tells K8s to also report pods that have already completed. Since the container exits as soon as nvidia-smi finishes executing, it might already be completed by the time you execute the command. 116 | 117 | ```bash 118 | NAME READY STATUS RESTARTS AGE 119 | module4-ex1-p40vx 0/1 Completed 0 20s 120 | ``` 121 | 122 | Let's look at the logs of our pod: 123 | 124 | ```console 125 | kubectl logs <pod-name> 126 | ``` 127 | ```bash 128 | Wed Nov 29 23:43:03 2017 129 | +-----------------------------------------------------------------------------+ 130 | | NVIDIA-SMI 384.98 Driver Version: 384.98 | 131 | |-------------------------------+----------------------+----------------------+ 132 | | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | 133 | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | 134 | |===============================+======================+======================| 135 | | 0 Tesla K80 Off | 0000E322:00:00.0 Off | 0 | 136 | | N/A 39C P0 70W / 149W | 0MiB / 11439MiB | 0% Default | 137 | +-------------------------------+----------------------+----------------------+ 138 | 139 | +-----------------------------------------------------------------------------+ 140 | | Processes: GPU Memory | 141 | | GPU PID Type Process name Usage | 142 | |=============================================================================| 143 | | No running processes found | 144 | +-----------------------------------------------------------------------------+ 145 | ``` 146 | We can see that `nvidia-smi` has successfully detected a Tesla K80 with driver version `384.98`. 147 | 148 | #### Solution 149 | 150 |
151 | Solution (expand to see) 152 |

153 | 154 | ```yaml 155 | apiVersion: batch/v1 156 | kind: Job # We want a Job 157 | metadata: 158 | name: 4-nvidia-smi 159 | spec: 160 | template: 161 | metadata: 162 | name: module4-ex1 163 | spec: 164 | restartPolicy: Never 165 | volumes: # Where the NVIDIA driver libraries and binaries are located on the host (note that libcuda is not needed to run nvidia-smi) 166 | - name: bin 167 | hostPath: 168 | path: /usr/lib/nvidia-384/bin 169 | - name: lib 170 | hostPath: 171 | path: /usr/lib/nvidia-384 172 | containers: 173 | - name: nvidia-smi 174 | image: nvidia/cuda # Which image to run 175 | command: 176 | - nvidia-smi 177 | resources: 178 | limits: 179 | alpha.kubernetes.io/nvidia-gpu: 1 # Requesting 1 GPU 180 | volumeMounts: # Where the NVIDIA driver libraries and binaries should be mounted inside our container 181 | - name: bin 182 | mountPath: /usr/local/nvidia/bin 183 | - name: lib 184 | mountPath: /usr/local/nvidia/lib64 185 | ``` 186 |

187 |
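If your pod stays stuck in `Pending` instead of reaching `Completed`, the scheduler most likely could not find a node with a free GPU, and the events reported by `kubectl describe` usually tell you why. This is only a sketch: substitute the pod name reported by `kubectl get pods -a`, and use whatever job name you chose (the one below matches the solution above).

```console
# Look for scheduling events such as "Insufficient alpha.kubernetes.io/nvidia-gpu"
kubectl describe pod <pod-name>

# Clean up the job once you are done with it
kubectl delete job 4-nvidia-smi
```
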
188 | 
189 | ### 2. Running TensorFlow with GPU
190 | 
191 | In modules 1 and 2, we first created a Docker image for our MNIST classifier and then ran a training on Kubernetes.
192 | However, this training only used CPU. Let's make things much faster by accelerating our training with a GPU.
193 | 
194 | You'll find the code and the `Dockerfile` under [`./src`](./src).
195 | 
196 | For this exercise, your tasks are to:
197 | * Modify our `Dockerfile` to use a base image compatible with GPU, such as `tensorflow/tensorflow:1.4.0-gpu`
198 | * Build and push this new image under a new tag, such as `${DOCKER_USERNAME}/tf-mnist:gpu`
199 | * Modify the [template we built in module 2](../2-kubernetes/training.yaml) to add a GPU `limit` and mount the driver libraries.
200 | * Deploy this new template.
201 | 
202 | ### Validation
203 | 
204 | Once you have deployed your template, take a look at the logs of your pod:
205 | 
206 | ```console
207 | kubectl logs <pod-name>
208 | ```
209 | You should see that your GPU is correctly detected and used by TensorFlow (`[...] Found device 0 with properties: name: Tesla K80 [...]`):
210 | 
211 | ```bash
212 | 2017-11-30 00:59:54.053227: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
213 | 2017-11-30 01:00:03.274198: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
214 | name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
215 | pciBusID: b2de:00:00.0
216 | totalMemory: 11.17GiB freeMemory: 11.10GiB
217 | 2017-11-30 01:00:03.274238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: b2de:00:00.0, compute capability: 3.7)
218 | 2017-11-30 01:00:08.000884: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
219 | Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
220 | Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
221 | Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
222 | Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
223 | Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
224 | Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
225 | Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
226 | Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
227 | Accuracy at step 0: 0.1245
228 | Accuracy at step 10: 0.6664
229 | Accuracy at step 20: 0.8227
230 | Accuracy at step 30: 0.8657
231 | Accuracy at step 40: 0.8815
232 | Accuracy at step 50: 0.892
233 | Accuracy at step 60: 0.9068
234 | [...]
235 | ```
236 | 
237 | 
238 | ### Solution
239 | 
240 | 
241 | 
242 | Solution (expand to see) 243 |

244 | 245 | First we need to modify the `Dockerfile`. 246 | We just need to change the tag of the TensorFlow base image to be one that support GPU: 247 | 248 | ```dockerfile 249 | FROM tensorflow/tensorflow:1.4.0-gpu 250 | COPY main.py /app/main.py 251 | 252 | ENTRYPOINT ["python", "/app/main.py"] 253 | ``` 254 | 255 | Then we can create our Job template: 256 | 257 | ```yaml 258 | apiVersion: batch/v1 259 | kind: Job # Our training should be a Job since it is supposed to terminate at some point 260 | metadata: 261 | name: module4-ex2 # Name of our job 262 | spec: 263 | template: # Template of the Pod that is going to be run by the Job 264 | metadata: 265 | name: mnist-pod # Name of the pod 266 | spec: 267 | containers: # List of containers that should run inside the pod, in our case there is only one. 268 | - name: tensorflow 269 | image: wbuchwalter/tf-mnist:gpu # The image to run, you can replace by your own. 270 | args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile 271 | resources: 272 | limits: 273 | alpha.kubernetes.io/nvidia-gpu: 1 274 | volumeMounts: # Where the drivers should be mounted in the container 275 | - name: lib 276 | mountPath: /usr/local/nvidia/lib64 277 | - name: libcuda 278 | mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1 279 | restartPolicy: OnFailure 280 | volumes: # Where the drivers are located on the node 281 | - name: lib 282 | hostPath: 283 | path: /usr/lib/nvidia-384 284 | - name: libcuda 285 | hostPath: 286 | path: /usr/lib/x86_64-linux-gnu/libcuda.so.1 287 | ``` 288 | 289 | And deploy it with 290 | 291 | ```console 292 | kubectl create -f 293 | ``` 294 | 295 |

296 |
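Before moving on, it can also be useful to confirm that your nodes actually advertise their GPU to Kubernetes; if they don't, the jobs from both exercises will never be scheduled. A quick sketch (the `<node-name>` placeholder is whatever `kubectl get nodes` returns, and the exact resource name depends on your Kubernetes version, as discussed earlier):

```console
kubectl get nodes
kubectl describe node <node-name> | grep -i -A 6 "Capacity"
```

The `Capacity` and `Allocatable` sections should list a non-zero `alpha.kubernetes.io/nvidia-gpu` (or `nvidia.com/gpu`) entry.
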
297 | 298 | ## Next Step 299 | [5 - TFJob](../5-tfjob/README.md) 300 | -------------------------------------------------------------------------------- /4-gpus/src/Dockerfile: -------------------------------------------------------------------------------- 1 | # Change this image to one that supports GPU 2 | FROM tensorflow/tensorflow:1.4.0 3 | COPY main.py /app/main.py 4 | 5 | ENTRYPOINT ["python", "/app/main.py"] -------------------------------------------------------------------------------- /4-gpus/src/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the 'License'); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an 'AS IS' BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """A simple MNIST classifier which displays summaries in TensorBoard. 16 | 17 | This is an unimpressive MNIST model, but it is a good example of using 18 | tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of 19 | naming summary tags so that they are grouped meaningfully in TensorBoard. 20 | 21 | It demonstrates the functionality of every TensorBoard dashboard. 22 | """ 23 | from __future__ import absolute_import 24 | from __future__ import division 25 | from __future__ import print_function 26 | 27 | import argparse 28 | import os 29 | import sys 30 | 31 | import tensorflow as tf 32 | 33 | from tensorflow.examples.tutorials.mnist import input_data 34 | 35 | FLAGS = None 36 | 37 | 38 | def train(): 39 | # Import data 40 | mnist = input_data.read_data_sets(FLAGS.data_dir, 41 | one_hot=True, 42 | fake_data=FLAGS.fake_data) 43 | 44 | # Create a multilayer model. 45 | 46 | # Input placeholders 47 | with tf.name_scope('input'): 48 | x = tf.placeholder(tf.float32, [None, 784], name='x-input') 49 | y_ = tf.placeholder(tf.float32, [None, 10], name='y-input') 50 | 51 | with tf.name_scope('input_reshape'): 52 | image_shaped_input = tf.reshape(x, [-1, 28, 28, 1]) 53 | tf.summary.image('input', image_shaped_input, 10) 54 | 55 | # We can't initialize these variables to 0 - the network will get stuck. 
56 | def weight_variable(shape): 57 | """Create a weight variable with appropriate initialization.""" 58 | initial = tf.truncated_normal(shape, stddev=0.1) 59 | return tf.Variable(initial) 60 | 61 | def bias_variable(shape): 62 | """Create a bias variable with appropriate initialization.""" 63 | initial = tf.constant(0.1, shape=shape) 64 | return tf.Variable(initial) 65 | 66 | def variable_summaries(var): 67 | """Attach a lot of summaries to a Tensor (for TensorBoard visualization).""" 68 | with tf.name_scope('summaries'): 69 | mean = tf.reduce_mean(var) 70 | tf.summary.scalar('mean', mean) 71 | with tf.name_scope('stddev'): 72 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 73 | tf.summary.scalar('stddev', stddev) 74 | tf.summary.scalar('max', tf.reduce_max(var)) 75 | tf.summary.scalar('min', tf.reduce_min(var)) 76 | tf.summary.histogram('histogram', var) 77 | 78 | def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu): 79 | """Reusable code for making a simple neural net layer. 80 | 81 | It does a matrix multiply, bias add, and then uses ReLU to nonlinearize. 82 | It also sets up name scoping so that the resultant graph is easy to read, 83 | and adds a number of summary ops. 84 | """ 85 | # Adding a name scope ensures logical grouping of the layers in the graph. 86 | with tf.name_scope(layer_name): 87 | # This Variable will hold the state of the weights for the layer 88 | with tf.name_scope('weights'): 89 | weights = weight_variable([input_dim, output_dim]) 90 | variable_summaries(weights) 91 | with tf.name_scope('biases'): 92 | biases = bias_variable([output_dim]) 93 | variable_summaries(biases) 94 | with tf.name_scope('Wx_plus_b'): 95 | preactivate = tf.matmul(input_tensor, weights) + biases 96 | tf.summary.histogram('pre_activations', preactivate) 97 | activations = act(preactivate, name='activation') 98 | tf.summary.histogram('activations', activations) 99 | return activations 100 | 101 | hidden1 = nn_layer(x, 784, 500, 'layer1') 102 | 103 | with tf.name_scope('dropout'): 104 | keep_prob = tf.placeholder(tf.float32) 105 | tf.summary.scalar('dropout_keep_probability', keep_prob) 106 | dropped = tf.nn.dropout(hidden1, keep_prob) 107 | 108 | # Do not apply softmax activation yet, see below. 109 | y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity) 110 | 111 | with tf.name_scope('cross_entropy'): 112 | # The raw formulation of cross-entropy, 113 | # 114 | # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)), 115 | # reduction_indices=[1])) 116 | # 117 | # can be numerically unstable. 118 | # 119 | # So here we use tf.nn.softmax_cross_entropy_with_logits on the 120 | # raw outputs of the nn_layer above, and then average across 121 | # the batch. 
122 | diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) 123 | with tf.name_scope('total'): 124 | cross_entropy = tf.reduce_mean(diff) 125 | tf.summary.scalar('cross_entropy', cross_entropy) 126 | 127 | with tf.name_scope('train'): 128 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize( 129 | cross_entropy) 130 | 131 | with tf.name_scope('accuracy'): 132 | with tf.name_scope('correct_prediction'): 133 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 134 | with tf.name_scope('accuracy'): 135 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 136 | tf.summary.scalar('accuracy', accuracy) 137 | 138 | # Merge all the summaries and write them out to 139 | # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default) 140 | merged = tf.summary.merge_all() 141 | 142 | def feed_dict(train): 143 | """Make a TensorFlow feed_dict: maps data onto Tensor placeholders.""" 144 | if train or FLAGS.fake_data: 145 | xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data) 146 | k = FLAGS.dropout 147 | else: 148 | xs, ys = mnist.test.images, mnist.test.labels 149 | k = 1.0 150 | return {x: xs, y_: ys, keep_prob: k} 151 | 152 | sess = tf.InteractiveSession() 153 | train_writer = tf.summary.FileWriter(FLAGS.log_dir + '/train', sess.graph) 154 | test_writer = tf.summary.FileWriter(FLAGS.log_dir + '/test') 155 | tf.global_variables_initializer().run() 156 | # Train the model, and also write summaries. 157 | # Every 10th step, measure test-set accuracy, and write test summaries 158 | # All other steps, run train_step on training data, & add training summaries 159 | 160 | 161 | for i in range(FLAGS.max_steps): 162 | if i % 10 == 0: # Record summaries and test-set accuracy 163 | summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False)) 164 | test_writer.add_summary(summary, i) 165 | print('Accuracy at step %s: %s' % (i, acc)) 166 | else: # Record train set summaries, and train 167 | if i % 100 == 99: # Record execution stats 168 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 169 | run_metadata = tf.RunMetadata() 170 | summary, _ = sess.run([merged, train_step], 171 | feed_dict=feed_dict(True), 172 | options=run_options, 173 | run_metadata=run_metadata) 174 | train_writer.add_run_metadata(run_metadata, 'step%03d' % i) 175 | train_writer.add_summary(summary, i) 176 | print('Adding run metadata for', i) 177 | else: # Record a summary 178 | summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True)) 179 | train_writer.add_summary(summary, i) 180 | train_writer.close() 181 | test_writer.close() 182 | 183 | 184 | def main(_): 185 | train() 186 | 187 | 188 | if __name__ == '__main__': 189 | parser = argparse.ArgumentParser() 190 | parser.add_argument('--fake_data', nargs='?', const=True, type=bool, 191 | default=False, 192 | help='If true, uses fake data for unit testing.') 193 | parser.add_argument('--max_steps', type=int, default=1000, 194 | help='Number of steps to run trainer.') 195 | parser.add_argument('--learning_rate', type=float, default=0.001, 196 | help='Initial learning rate') 197 | parser.add_argument('--dropout', type=float, default=0.9, 198 | help='Keep probability for training dropout.') 199 | parser.add_argument( 200 | '--data_dir', 201 | type=str, 202 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 203 | 'tensorflow/input_data'), 204 | help='Directory for storing input data') 205 | parser.add_argument( 206 | '--log_dir', 207 | type=str, 208 | 
default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 209 | 'tensorflow/logs'), 210 | help='Summaries log directory') 211 | FLAGS, unparsed = parser.parse_known_args() 212 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) -------------------------------------------------------------------------------- /5-tfjob/README.md: -------------------------------------------------------------------------------- 1 | # `tensorflow/k8s` and `TFJob` 2 | 3 | ## Prerequisites 4 | 5 | * [3 - Helm](../3-helm/README.md) 6 | * [4 - GPUs](../4-gpus/README.md) 7 | 8 | ## Summary 9 | 10 | In this module you will learn how [`tensorflow/k8s`](https://github.com/tensorflow/k8s) can greatly simplify our lives when running TensorFlow on Kubernetes. 11 | 12 | ## `tensorflow/k8s` 13 | 14 | As we saw earlier, giving a container access to GPU is not exactly a breeze on Kubernetes: We need to manually mount the drivers from the node into the container. 15 | If you already tried to run a distributed TensorFlow training, you know that it's not easy either. Getting the `ClusterSpec` right can be painful if you have more than a couple VMs, and it's also quite brittle (we will look more into distributed TensorFlow in module [6 - Distributed TensorFlow](../6-distributed-tensorflow/README.md)). 16 | 17 | `tensorflow/k8s` is a new project in TensorFlow's organization on GitHub that makes all of this much easier. 18 | 19 | 20 | ### Installing `tensorflow/k8s` 21 | 22 | Installing `tensorflow/k8s` with Helm is very easy, just run the following commands: 23 | 24 | ```console 25 | > CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz 26 | > helm install ${CHART} -n tf-job --wait --replace --set cloud=azure 27 | ``` 28 | 29 | If it worked, you should see something like: 30 | 31 | ``` 32 | NAME: tf-job 33 | LAST DEPLOYED: Mon Nov 20 14:24:16 2017 34 | NAMESPACE: default 35 | STATUS: DEPLOYED 36 | 37 | RESOURCES: 38 | ==> v1/ConfigMap 39 | NAME DATA AGE 40 | tf-job-operator-config 1 7s 41 | 42 | ==> v1beta1/Deployment 43 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 44 | tf-job-operator 1 1 1 1 7s 45 | 46 | ==> v1/Pod(related) 47 | NAME READY STATUS RESTARTS AGE 48 | tf-job-operator-3005087210-c3js3 1/1 Running 1 4s 49 | ``` 50 | 51 | This means that 3 resources were created, a `ConfigMap`, a `Deployment`, and a `Pod`. 52 | We will see in just a moment what each of them do. 53 | 54 | ### Kubernetes Custom Resource Definition 55 | 56 | Kubernetes has a concept of [Custom Resources](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) (often abbreviated CRD) that allows us to create custom object that we will then be able to use. 57 | In the case of `tensorflow/k8s`, after installation, a new `TFJob` object will be available in our cluster. This object allows us to describe TensorFlow a training. 58 | 59 | #### `TFJob` Specifications 60 | 61 | Before going further, let's take a look at what the `TFJob` looks like: 62 | 63 | > Note: Some of the fields are not described here for brevity. 64 | 65 | **`TFJob` Object** 66 | 67 | | Field | Type| Description | 68 | |-------|-----|-------------| 69 | | apiVersion | `string` | Versioned schema of this representation of an object. In our case, it's `kubeflow.org/v1alpha1` | 70 | | kind | `string` | Value representing the REST resource this object represents. 
In our case it's `TFJob` | 71 | | metadata | [`ObjectMeta`](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata)| Standard object's metadata. | 72 | | spec | `TFJobSpec` | The actual specification of our TensorFlow job, defined below. | 73 | 74 | `spec` is the most important part, so let's look at it too: 75 | 76 | **`TFJobSpec` Object** 77 | 78 | | Field | Type| Description | 79 | |-------|-----|-------------| 80 | | ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below | 81 | 82 | Let's go deeper: 83 | 84 | **`TFReplicaSpec` Object** 85 | 86 | | Field | Type| Description | 87 | |-------|-----|-------------| 88 | | TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER` which happens to be the default value. | 89 | | Replicas | `int` | Number of replicas of `TfReplicaType`. Again this is useful only for distributed TensorFLow. Default value is `1`. | 90 | | Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. | 91 | 92 | 93 | As a refresher, here is what a simple TensorFlow training (with GPU) would look like using "vanilla" kubernetes: 94 | 95 | ```yaml 96 | apiVersion: batch/v1 97 | kind: Job 98 | metadata: 99 | name: example-job 100 | spec: 101 | template: 102 | metadata: 103 | name: example-job 104 | spec: 105 | restartPolicy: OnFailure 106 | volumes: 107 | - name: bin 108 | hostPath: 109 | path: /usr/lib/nvidia-384/bin 110 | - name: lib 111 | hostPath: 112 | path: /usr/lib/nvidia-384 113 | containers: 114 | - name: tensorflow 115 | image: wbuchwalter/ 116 | resources: 117 | requests: 118 | alpha.kubernetes.io/nvidia-gpu: 1 119 | volumeMounts: 120 | - name: bin 121 | mountPath: /usr/local/nvidia/bin 122 | - name: lib 123 | mountPath: /usr/local/nvidia/lib64 124 | ``` 125 | Here is what the same thing looks like using the new `TFJob` resource: 126 | 127 | ```yaml 128 | apiVersion: kubeflow.org/v1alpha1 129 | kind: TFJob 130 | metadata: 131 | name: example-tfjob 132 | spec: 133 | replicaSpecs: 134 | - template: 135 | spec: 136 | containers: 137 | - image: wbuchwalter/ 138 | name: tensorflow 139 | resources: 140 | requests: 141 | alpha.kubernetes.io/nvidia-gpu: 1 142 | restartPolicy: OnFailure 143 | ``` 144 | 145 | No need to mount drivers anymore! Note that we are not specifying `TfReplicaType` or `Replicas` as the default values are already what we want. 146 | 147 | #### How does this work? 148 | 149 | As we saw earlier, when we installed the Helm chart for `tensorflow/k8s`, 3 resources were created in our cluster: 150 | * A `ConfigMap` named `tf-job-operator-config` 151 | * A `Deployment` 152 | * And a `Pod` named `tf-job-operator` 153 | 154 | The `tf-job-operator` pod (simply called the operator, or `TFJob` operator), is going to monitor your cluster, and every time you create a new resource of type `TFJob`, the operator will know what to do with it. 155 | Specifically, when you create a new `TFJob`, the operator will create a new Kubernetes `Job` for it, and automatically mount the drivers if needed (i.e. when you request a GPU). 156 | 157 | You may wonder how the operator knows which directory needs to be mounted in the container for the NVIDIA drivers: that's where the `ConfigMap` comes into play. 
158 | 159 | In K8s, a [`ConfigMap`](https://kubernetes.io/docs/tasks/configure-pod-container/configmap/) is a simple object that contains key-value pairs. This `ConfigMap` can then be linked with a container to inject some configuration. 160 | 161 | When we installed the Helm chart, we specified which cloud provider we are running on by doing `--set cloud=azure`. 162 | This creates a `ConfigMap` that contains configuration options specific for Azure, including the list of directory to mount. 163 | 164 | We can take a look at what is inside our `tf-job-operator-config` by doing: 165 | 166 | ```console 167 | kubectl describe configmaps tf-job-operator-config 168 | ``` 169 | 170 | The output is: 171 | 172 | ``` 173 | Name: tf-job-operator-config 174 | Namespace: default 175 | Labels: 176 | Annotations: 177 | 178 | Data 179 | ==== 180 | controller_config_file.yaml: 181 | ---- 182 | grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py 183 | accelerators: 184 | alpha.kubernetes.io/nvidia-gpu: 185 | volumes: 186 | - name: lib 187 | mountPath: /usr/local/nvidia/lib64 188 | hostPath: /usr/lib/nvidia-384 189 | - name: bin 190 | mountPath: /usr/local/nvidia/bin 191 | hostPath: /usr/lib/nvidia-384/bin 192 | - name: libcuda 193 | mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1 194 | hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1 195 | ``` 196 | 197 | If you want to know more: 198 | * [tensorflow/k8s](https://github.com/tensorflow/k8s) GitHub repository 199 | * [Introducing Operators](https://coreos.com/blog/introducing-operators.html), a blog post by CoreOS explaining the Operator pattern 200 | 201 | ## Exercises 202 | 203 | ### Exercise 1: A Simple `TFJob` 204 | 205 | Let's schedule a very simple TensorFlow job using `TFJob` first. 206 | 207 | > Note: If you completed the exercise in Module 1 and 2, you can change the image to use the one you pushed instead. 208 | 209 | Depending on whether or not your cluster has GPU, choose the correct template: 210 | 211 |
212 | CPU Only 213 | 214 | ```yaml 215 | apiVersion: kubeflow.org/v1alpha1 216 | kind: TFJob 217 | metadata: 218 | name: module5-ex1 219 | spec: 220 | replicaSpecs: 221 | - template: 222 | spec: 223 | containers: 224 | - image: wbuchwalter/tf-mnist:cpu 225 | name: tensorflow 226 | restartPolicy: OnFailure 227 | ``` 228 | 229 |
230 | 231 |
232 | With GPU 233 | 234 | When using GPU, we need to request for one (or multiple), and the image we are using also needs to be based on TensorFlow's GPU image. 235 | 236 | ```yaml 237 | apiVersion: kubeflow.org/v1alpha1 238 | kind: TFJob 239 | metadata: 240 | name: module5-ex1-gpu 241 | spec: 242 | replicaSpecs: 243 | - template: 244 | spec: 245 | containers: 246 | - image: wbuchwalter/tf-mnist:gpu 247 | name: tensorflow 248 | resources: 249 | requests: 250 | alpha.kubernetes.io/nvidia-gpu: 1 251 | restartPolicy: OnFailure 252 | ``` 253 | 254 |
255 | 
256 | 
257 | 
258 | Save the template that applies to you in a file, and create the `TFJob`:
259 | ```console
260 | kubectl create -f <template-file>
261 | ```
262 | 
263 | Let's look at what has been created in our cluster.
264 | 
265 | First, a `TFJob` was created:
266 | 
267 | ```console
268 | kubectl get tfjob
269 | ```
270 | Returns:
271 | ```
272 | NAME          KIND
273 | module5-ex1   TFJob.v1alpha1.tensorflow.org
274 | ```
275 | 
276 | As well as a `Job`, which was actually created by the operator:
277 | 
278 | ```console
279 | kubectl get job
280 | ```
281 | Returns:
282 | ```
283 | NAME                        DESIRED   SUCCESSFUL   AGE
284 | module5-ex1-master-xs4b-0   1         0            2m
285 | ```
286 | and a `Pod`:
287 | 
288 | ```console
289 | kubectl get pod
290 | ```
291 | Returns:
292 | ```
293 | NAME                              READY     STATUS    RESTARTS   AGE
294 | module5-ex1-master-xs4b-0-6gpfn   1/1       Running   0          2m
295 | ```
296 | 
297 | Note that the `Pod` might take a few minutes before actually running: the Docker image first needs to be pulled on the node.
298 | 
299 | Once the `Pod`'s status is either `Running` or `Completed`, we can start looking at its logs:
300 | 
301 | ```console
302 | kubectl logs <pod-name>
303 | ```
304 | 
305 | This container is pretty verbose, but you should see a TensorFlow training happening:
306 | 
307 | ```
308 | [...]
309 | INFO:tensorflow:2017-11-20 20:57:22.314198: Step 480: Cross entropy = 0.142486
310 | INFO:tensorflow:2017-11-20 20:57:22.370080: Step 480: Validation accuracy = 85.0% (N=100)
311 | INFO:tensorflow:2017-11-20 20:57:22.896383: Step 490: Train accuracy = 98.0%
312 | INFO:tensorflow:2017-11-20 20:57:22.896600: Step 490: Cross entropy = 0.075210
313 | INFO:tensorflow:2017-11-20 20:57:22.945611: Step 490: Validation accuracy = 91.0% (N=100)
314 | INFO:tensorflow:2017-11-20 20:57:23.407756: Step 499: Train accuracy = 94.0%
315 | INFO:tensorflow:2017-11-20 20:57:23.407980: Step 499: Cross entropy = 0.170348
316 | INFO:tensorflow:2017-11-20 20:57:23.457325: Step 499: Validation accuracy = 89.0% (N=100)
317 | INFO:tensorflow:Final test accuracy = 88.4% (N=353)
318 | [...]
319 | ```
320 | 
321 | Once your job is completed, clean it up:
322 | 
323 | ```console
324 | kubectl delete tfjob module5-ex1
325 | ```
326 | 
327 | > That's great and all, but how do we grab our trained model and TensorFlow's summaries?
328 | 
329 | Well, currently we can't. As soon as the training is complete, the container stops, and everything inside it, including the model and logs, is lost.
330 | 
331 | Thankfully, Kubernetes `Volumes` can help us here.
332 | If you remember, we quickly introduced `Volumes` in module [2 - Kubernetes](../2-kubernetes/), and that's what we already used to mount the drivers from the node into the container.
333 | But `Volumes` are not just for mounting things from a node; we can also use them to mount many different storage solutions (you can see the full list [here](https://kubernetes.io/docs/concepts/storage/volumes/)).
334 | 
335 | In our case we are going to use Azure Files, as it is really easy to use with Kubernetes.
336 | 
337 | ## Exercise 2: Azure Files to the Rescue
338 | 
339 | ### Creating a New File Share and Kubernetes Secret
340 | 
341 | In the official documentation: [Using Azure Files with Kubernetes](https://docs.microsoft.com/en-in/azure/aks/azure-files), follow the steps listed under `Create an Azure file share` and `Create Kubernetes Secret`, but be aware of a few details first:
342 | * It is **very** important that you create your storage account (and hence your resource group) in the **same** region as your Kubernetes cluster: because Azure Files uses the `SMB` protocol, it won't work across regions. `AKS_PERS_LOCATION` should be updated accordingly.
343 | * While this document specifically refers to AKS, it will work for any K8s cluster.
344 | * Name your file share `tensorflow`. While the share could be named anything, it will make it easier to follow the examples later on. `AKS_PERS_SHARE_NAME` should be updated accordingly.
345 | 
346 | Once you have completed all the steps, run:
347 | ```console
348 | kubectl get secrets
349 | ```
350 | 
351 | Which should return:
352 | ```
353 | NAME           TYPE      DATA      AGE
354 | azure-secret   Opaque    2         4m
355 | ```
356 | 
357 | 
358 | ### Updating our example to use our Azure File Share
359 | 
360 | Now we need to mount our new file share into our container so the model and the summaries can be persisted.
361 | It turns out mounting an Azure File share into a container is really easy: we simply need to reference our secret in the `Volume` definition.
362 | 
363 | ```yaml
364 | [...]
365 | containers:
366 |   - image: <IMAGE>
367 |     name: tensorflow
368 |     volumeMounts:
369 |       - name: azurefile
370 |         mountPath: <MOUNT_PATH>
371 | volumes:
372 |   - name: azurefile
373 |     azureFile:
374 |       secretName: azure-secret
375 |       shareName: tensorflow
376 |       readOnly: false
377 | ```
378 | 
379 | Update your template from exercise 1 to mount the Azure File share into your container, and create your new job.
380 | Note that by default our container saves everything into `/app/tf_files`, so that's the value you will want to use for `<MOUNT_PATH>`.
381 | 
382 | Once the container starts running, if you go to the Azure Portal, into your storage account, and browse your `tensorflow` file share, you should see something like this:
383 | 
384 | ![file-share](./file-share.png)
385 | 
386 | This means that when we run a training, all the important data is now stored in Azure Files and remains available as long as we don't delete the file share.
387 | 
388 | #### Solution for Exercise 2
389 | 
390 | *For brevity, the solution shown here is for CPU-only training. If you are using GPU, don't forget to update the image tag and add a GPU request.*
391 | 
392 | 
393 | Solution 394 | 395 | ```yaml 396 | apiVersion: kubeflow.org/v1alpha1 397 | kind: TFJob 398 | metadata: 399 | name: module5-ex2 400 | spec: 401 | replicaSpecs: 402 | - template: 403 | spec: 404 | containers: 405 | - image: wbuchwalter/tf-mnist:cpu 406 | name: tensorflow 407 | volumeMounts: 408 | # By default our classifier saves the summaries in /tmp/tensorflow, 409 | # so that's where we want to mount our Azure File Share. 410 | - name: azurefile 411 | # The subPath allows us to mount a subdirectory within the azure file share instead of root 412 | # this is useful so that we can save the logs for each run in a different subdirectory 413 | # instead of overwriting what was done before. 414 | subPath: module5-ex2 415 | mountPath: /tmp/tensorflow 416 | volumes: 417 | - name: azurefile 418 | azureFile: 419 | # We reference the secret we created just earlier 420 | # so that the account name and key are passed securely and not directly in a template 421 | secretName: azure-secret 422 | shareName: tensorflow 423 | readOnly: false 424 | restartPolicy: OnFailure 425 | ``` 426 | 427 |
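If you prefer the CLI over the portal, you can also list the content of the share to check that the summaries were written. This is only a sketch, and it assumes your storage account name and key are available in the `STORAGE_ACCOUNT` and `STORAGE_KEY` variables (reuse the values from the file-share creation steps above):

```console
az storage file list --share-name tensorflow \
    --account-name $STORAGE_ACCOUNT --account-key $STORAGE_KEY \
    --output table
```

You should see a `module5-ex2` directory containing the checkpoints and summaries written by the job.
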
428 | 
429 | 
430 | **Don't forget to delete the `TFJob` once it is completed!**
431 | 
432 | > Great, but what if I want to check out the training in TensorBoard? Do I need to download everything onto my machine?
433 | 
434 | Actually no, you don't. `TFJob` provides a very handy mechanism to monitor your trainings with TensorBoard easily!
435 | We will try that in our third exercise.
436 | 
437 | ### Exercise 3: Adding TensorBoard
438 | 
439 | So far, we have a TensorFlow training running, and its model and summaries are persisted to an Azure File share.
440 | But having TensorBoard monitoring the training would be pretty useful as well.
441 | It turns out `TFJob` can also help us with that.
442 | 
443 | When we looked at the `TFJob` specification at the beginning of this module, we omitted some fields in the `TFJobSpec` description.
444 | Here is a still incomplete but more accurate representation with one additional field:
445 | 
446 | **`TFJobSpec` Object**
447 | 
448 | | Field | Type | Description |
449 | |-------|-----|-------------|
450 | | ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes. |
451 | | **TensorBoard** | `TensorBoardSpec` | Configuration to start a TensorBoard deployment associated with our training job. Defined below. |
452 | 
453 | That's right, `TFJobSpec` contains an object of type `TensorBoardSpec` which allows us to describe a TensorBoard instance!
454 | Let's look at it:
455 | 
456 | **`TensorBoardSpec` Object**
457 | 
458 | | Field | Type | Description |
459 | |-------|-----|-------------|
460 | | LogDir | `string` | Location of the TensorFlow summaries in the TensorBoard container. |
461 | | ServiceType | [`ServiceType`](https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services---service-types) | What kind of service should expose TensorBoard. Usually `ClusterIP` (only reachable from within the cluster) or `LoadBalancer` (exposes the service externally using a cloud provider's load balancer). |
462 | | Volumes | [`Volume`](https://kubernetes.io/docs/api-reference/v1.8/#volume-v1-core) array | List of volumes that can be mounted. |
463 | | VolumeMounts | [`VolumeMount`](https://kubernetes.io/docs/api-reference/v1.8/#volumemount-v1-core) array | Pod volumes to mount into the container's filesystem. |
464 | 
465 | 
466 | Let's add TensorBoard to our job then.
467 | Here is how this will work: we will keep the same TensorFlow training job as in exercise 2. This `TFJob` will write the model and summaries to the Azure File share.
468 | We will also set up the configuration for TensorBoard so that it reads the summaries from the same Azure File share:
469 | * `Volumes` and `VolumeMounts` in `TensorBoardSpec` should be updated adequately.
470 | * For `ServiceType`, you should use `LoadBalancer`; this will create a public IP, so TensorBoard will be easier to access.
471 | * `LogDir` will depend on how you configure `VolumeMounts`, but on your file share, the summaries will be under the `training_summaries` subdirectory.
472 | 
473 | #### Solution for Exercise 3
474 | 
475 | *For brevity, the solution shown here is for CPU-only training. If you are using GPU, don't forget to update the image tag and add a GPU request.*
476 | 
477 | 
478 | Solution 479 | 480 | ```yaml 481 | apiVersion: kubeflow.org/v1alpha1 482 | kind: TFJob 483 | metadata: 484 | name: module5-ex3 485 | spec: 486 | replicaSpecs: 487 | - template: 488 | spec: 489 | volumes: 490 | - name: azurefile 491 | azureFile: 492 | secretName: azure-secret 493 | shareName: tensorflow 494 | readOnly: false 495 | containers: 496 | - image: wbuchwalter/tf-mnist:cpu 497 | name: tensorflow 498 | volumeMounts: 499 | - mountPath: /tmp/tensorflow 500 | subPath: module5-ex3 # Again we isolate the logs in a new directory on Azure Files 501 | name: azurefile 502 | restartPolicy: OnFailure 503 | tensorboard: 504 | logDir: /tmp/tensorflow/logs 505 | serviceType: LoadBalancer # We request a public IP for our TensorBoard instance 506 | volumes: 507 | - name: azurefile 508 | azureFile: 509 | secretName: azure-secret 510 | shareName: tensorflow 511 | volumeMounts: 512 | - mountPath: /tmp/tensorflow/ #This could be any other path. All that maters is that LogDir reflects it. 513 | subPath: module5-ex3 # This should match the directory our Master is actually writing in 514 | name: azurefile 515 | ``` 516 |
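After creating this `TFJob`, you can watch the services until the TensorBoard public IP switches from `<pending>` to an actual address (a sketch; the exact service name is generated from your job's name):

```console
kubectl get svc -w
```
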
517 | 518 | 519 | #### Validation 520 | 521 | If you updated the `TFJob` template correctly, when doing: 522 | ```console 523 | kubectl get services 524 | ``` 525 | You should see something like: 526 | ``` 527 | NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE 528 | kubernetes 10.0.0.1 443/TCP 14d 529 | module5-ex3-master-7yqt-0 10.0.126.11 2222/TCP 5m 530 | module5-ex3-tensorboard-7yqt 10.0.199.170 104.42.193.76 80:31770/TCP 5m 531 | ``` 532 | Note that provisioning a public IP on Azure can take a few minutes. During this time the `EXTERNAL-IP` for TensorBoard's service will show as ``. 533 | 534 | Once the public IP is provisioned, browse it, and you should land on a working TensorBoard instance with live monitoring of the training job running. 535 | 536 | ![TensorBoard](./tensorboard.png) 537 | 538 | ## Next Step 539 | 540 | [6 - Distributed TensorFlow](../6-distributed-tensorflow) 541 | -------------------------------------------------------------------------------- /5-tfjob/file-share.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/5-tfjob/file-share.png -------------------------------------------------------------------------------- /5-tfjob/tensorboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/5-tfjob/tensorboard.png -------------------------------------------------------------------------------- /6-distributed-tensorflow/README.md: -------------------------------------------------------------------------------- 1 | # Distributed TensorFlow with `TFJob` 2 | 3 | ## Prerequisites 4 | 5 | [5 - TFJob](../5-tfjob/) 6 | 7 | ## Summary 8 | 9 | In this module we will see how `TFJob` can greatly simplify the deployment and monitoring of distributed TensorFlow trainings. 10 | 11 | ## "Vanilla" Distributed TensorFlow is Hard 12 | 13 | First let's see how we would setup a distributed TensorFlow training without Kubernetes or `TFJob` (fear not, we are not actually going to do that). 14 | First, you would have to find or setup a bunch of idle VMs, or physical machines. In most companies, this would already be a feat, and likely require the coordination of multiple department (such as IT) to get the VMs up, running and reserved for your experiment. 15 | Then you would likely have to do some back and forth with the IT department to be able to setup your training: the VMs need to be able to talk to each others and have stable endpoints. Work might be needed to access the data, you would need to upload your TF code on every single machine etc. 16 | If you add GPU to the mix, it would likely get even harder since GPUs aren't usually just waiting there because of their high cost. 17 | 18 | Assuming you get through this, you now need to modify your model for distributed training. 19 | Among other things, you will need to setup the `ClusterSpec` ([`tf.train.ClusterSpec`](https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec)): a TensorFlow class that allows you to describe the architecture of your cluster. 
20 | For example, if you were to set up a distributed training with a mere 2 workers and 2 parameter servers, your cluster spec would look like this (the `clusterSpec` would most likely not be hardcoded but passed as an argument to your training script, as we will see below; this is just for illustration):
21 | 
22 | ```python
23 | cluster = tf.train.ClusterSpec({"worker": ["<worker0-host>:2222",
24 |                                            "<worker1-host>:2222"],
25 |                                 "ps": ["<ps0-host>:2222",
26 |                                        "<ps1-host>:2222"]})
27 | ```
28 | Here we assume that you want your workers to run on GPU VMs and your parameter servers to run on CPU VMs.
29 | 
30 | We will not go through the rest of the modifications needed (splitting operations across devices, getting the master session, etc.), as we will look at them later and they would be pretty much the same no matter how you run your distributed training.
31 | 
32 | Once your model is ready, you need to start the training.
33 | You will need to connect to every single VM, and pass the `ClusterSpec` as well as the assigned job name (ps or worker) and task index to each VM.
34 | So it would look something like this:
35 | 
36 | ```bash
37 | # On ps0:
38 | $ python trainer.py \
39 |      --ps_hosts=<ps0-host>:2222,<ps1-host>:2222 \
40 |      --worker_hosts=<worker0-host>:2222,<worker1-host>:2222 \
41 |      --job_name=ps --task_index=0
42 | # On ps1:
43 | $ python trainer.py \
44 |      --ps_hosts=<ps0-host>:2222,<ps1-host>:2222 \
45 |      --worker_hosts=<worker0-host>:2222,<worker1-host>:2222 \
46 |      --job_name=ps --task_index=1
47 | # On worker0:
48 | $ python trainer.py \
49 |      --ps_hosts=<ps0-host>:2222,<ps1-host>:2222 \
50 |      --worker_hosts=<worker0-host>:2222,<worker1-host>:2222 \
51 |      --job_name=worker --task_index=0
52 | # On worker1:
53 | $ python trainer.py \
54 |      --ps_hosts=<ps0-host>:2222,<ps1-host>:2222 \
55 |      --worker_hosts=<worker0-host>:2222,<worker1-host>:2222 \
56 |      --job_name=worker --task_index=1
57 | ```
58 | 
59 | At this point your training would finally start.
60 | However, if for some reason an IP changes (a VM restarts, for example), you would need to go back to every VM in your cluster and restart the training with an updated `ClusterSpec` (if the IT department of your company is feeling extra generous, they might assign a DNS name to every VM, which would already make your life much easier).
61 | If you see that your training is not doing well and you need to update the code, you have to redeploy it on every VM and restart the training everywhere.
62 | If for some reason you want to retrain after a while, you would most likely need to go back to step 1: ask for the VMs to be allocated, redeploy, and update the `clusterSpec`.
63 | 
64 | All these hurdles mean that in practice very few people actually bother with distributed training, as the time gained during training might not be worth the energy and time necessary to set it up correctly.
65 | 
66 | ## Distributed TensorFlow with Kubernetes and `TFJob`
67 | 
68 | Thankfully, with Kubernetes and `TFJob` things are much, much simpler, making distributed training something you might actually be able to benefit from.
69 | 
70 | 
71 | #### A Small Disclaimer
72 | The issues we saw in the first part of this module can be categorized in two groups:
73 | * Issues with getting access to enough resources for the trainings (VMs, GPUs, etc.)
74 | * Issues with setting up the training itself
75 | 
76 | The first group of issues is still very dependent on the processes in your company or group. If you need to go through a formal request to get access to extra VMs/GPUs, it will still be a hassle, and there is nothing Kubernetes can do about that.
77 | However, Kubernetes makes this process much easier:
78 | * On ACS and AKS you can spin up new VMs with a single command: [`az aks scale`](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az_aks_scale)
79 | * On acs-engine you can set up autoscaling so that any time you schedule a training on Kubernetes, the autoscaler will make sure your cluster has all the resources it needs to run it, and when your training is completed, it will shut down any idle VMs, making this the best solution in terms of cost and effort. While autoscaling is outside the scope of this workshop, we will give you pointers in module [8 - Going Further](../8-going-further).
80 | 
81 | Setting up the training, however, is drastically simplified with Kubernetes and `TFJob`.
82 | 
83 | ### Overview of `TFJob` distributed training
84 | 
85 | So, how does `TFJob` work for distributed training?
86 | Let's look again at what the `TFJobSpec` and `TFReplicaSpec` objects look like:
87 | 
88 | **`TFJobSpec` Object**
89 | 
90 | | Field | Type | Description |
91 | |-------|-----|-------------|
92 | | ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below |
93 | 
94 | 
95 | **`TFReplicaSpec` Object**
96 | 
97 | Note the last parameter, `IsDefaultPS`, which we didn't talk about before.
98 | 
99 | | Field | Type | Description |
100 | |-------|-----|-------------|
101 | | TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER`, which happens to be the default value. |
102 | | Replicas | `int` | Number of replicas of `TfReplicaType`. Again, this is useful only for distributed TensorFlow. Default value is `1`. |
103 | | Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. |
104 | | **IsDefaultPS** | `boolean` | Whether the parameter server should use a default image or a custom one (defaults to `true`) |
105 | 
106 | In case the distinction between master and workers is not clear: there is a single master per TensorFlow cluster, and it is in fact a worker. The difference is that the master is the worker that handles the creation of the `tf.Session`, writes the logs and saves the model.
107 | 
108 | As you can see, `TFJobSpec` and `TFReplicaSpec` allow us to easily define the architecture of the TensorFlow cluster we would like to set up.
109 | 
110 | Once we have defined this architecture in a `TFJob` template and deployed it with `kubectl create`, the operator will do most of the work for us.
111 | For each master, worker and parameter server in our TensorFlow cluster, the operator will create a service exposing it so they can communicate.
112 | It will then create an internal representation of the cluster, with each node and its associated internal DNS name.
113 | 114 | For example, if you were to create a `TFJob` with 1 `MASTER`, 2 `WORKERS` and 1 `PS`, this representation would look similar to this: 115 | ```json 116 | { 117 | "master":[ 118 | "distributed-mnist-master-5oz2-0:2222" 119 | ], 120 | "ps":[ 121 | "distributed-mnist-ps-5oz2-0:2222" 122 | ], 123 | "worker":[ 124 | "distributed-mnist-worker-5oz2-0:2222", 125 | "distributed-mnist-worker-5oz2-1:2222" 126 | ] 127 | } 128 | ``` 129 | 130 | Finally, the operator will create all the necessary pods, and in each one, inject an environment variable named `Tf_CONFIG`, containing the cluster specification above, as well as the respective job name and task id that each node of the TensorFlow cluster should assume. 131 | 132 | For example, here is the value of the `TF_CONFIG` environment variable that would be sent to worker 1: 133 | 134 | ```json 135 | { 136 | "cluster":{ 137 | "master":[ 138 | "distributed-mnist-master-5oz2-0:2222" 139 | ], 140 | "ps":[ 141 | "distributed-mnist-ps-5oz2-0:2222" 142 | ], 143 | "worker":[ 144 | "distributed-mnist-worker-5oz2-0:2222", 145 | "distributed-mnist-worker-5oz2-1:2222" 146 | ] 147 | }, 148 | "task":{ 149 | "type":"worker", 150 | "index":1 151 | }, 152 | "environment":"cloud" 153 | } 154 | ``` 155 | 156 | As you can see, this completely takes the responsibility of building and maintaining the `ClusterSpec` away from you. 157 | All you have to do, is modify your code to read the `TF_CONFIG` and act accordingly. 158 | 159 | ### Modifying your model to use `TFJob`'s `TF_CONFIG` 160 | 161 | Concretely, let's see how you would modify your code: 162 | 163 | ```python 164 | # Grab the TF_CONFIG environment variable 165 | tf_config_json = os.environ.get("TF_CONFIG", "{}") 166 | 167 | # Deserialize to a python object 168 | tf_config = json.loads(tf_config_json) 169 | 170 | # Grab the cluster specification from tf_config and create a new tf.train.ClusterSpec instance with it 171 | cluster_spec = tf_config.get("cluster", {}) 172 | cluster_spec_object = tf.train.ClusterSpec(cluster_spec) 173 | 174 | # Grab the task assigned to this specific process from the config. job_name might be "worker" and task_id might be 1 for example 175 | task = tf_config.get("task", {}) 176 | job_name = task["type"] 177 | task_id = task["index"] 178 | 179 | # Configure the TensorFlow server 180 | server_def = tf.train.ServerDef( 181 | cluster=cluster_spec_object.as_cluster_def(), 182 | protocol="grpc", 183 | job_name=job_name, 184 | task_index=task_id) 185 | server = tf.train.Server(server_def) 186 | 187 | # checking if this process is the chief (also called master). The master has the responsibility of creating the session, saving the summaries etc. 188 | is_chief = (job_name == 'master') 189 | 190 | # Notice that we are not handling the case where job_name == 'ps'. That is because `TFJob` will take care of the parameter servers for us by default. 191 | ``` 192 | 193 | As for any distributed TensorFlow training, you will then also need to modify your model to split the operations and variables among the workers and parameter servers as well as create on session on the master. 194 | 195 | ## Exercises 196 | 197 | ### 1 - Modifying Our MNIST Example to Support Distributed Training 198 | 199 | #### 1. a. 200 | Starting from the MNIST sample we have been working with so far, modify it to work with distributed TensorFlow and `TFJob`. 201 | You will then need to build the image and push it (you should push it under a different name or tag to avoid overwriting what you did before). 
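For example, building and pushing the updated image could look like this. This is only a sketch: the image name and tag are an illustration, and `DOCKER_USERNAME` is assumed to contain your Docker Hub username.

```console
# Build the distributed version of the MNIST image and push it under a new tag
docker build -t ${DOCKER_USERNAME}/tf-mnist:distributed .
docker push ${DOCKER_USERNAME}/tf-mnist:distributed
```
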
202 | 203 | #### 1. b. 204 | 205 | Modify the yaml template from module [5 - TFJob](../5-tfjob) exercise 3, to instead deploy 1 master, 2 workers and 1 PS. We also want to monitor the training with TensorBoard. 206 | Note that since our model is very simple, TensorFlow will likely use only 1 of the workers, but it will still work fine. 207 | Don't forget to update the image or tag. 208 | 209 | #### Validation 210 | 211 | ```console 212 | kubectl get pods 213 | ``` 214 | 215 | Should yield: 216 | 217 | ``` 218 | NAME READY STATUS RESTARTS AGE 219 | module6-ex1-master-3khk-0-fkm8p 1/1 Running 0 39s 220 | module6-ex1-ps-3khk-0-rqkv5 1/1 Running 0 39s 221 | module6-ex1-tensorboard-3khk-2845579357-75rtd 1/1 Running 0 39s 222 | module6-ex1-worker-3khk-0-jsm8c 1/1 Running 0 39s 223 | module6-ex1-worker-3khk-1-8rgh4 1/1 Running 0 39s 224 | ``` 225 | 226 | looking at the logs of the master with: 227 | 228 | ```console 229 | kubectl logs 230 | ``` 231 | 232 | Should yield: 233 | 234 | ``` 235 | [...] 236 | Initialize GrpcChannelCache for job master -> {0 -> localhost:2222} 237 | Initialize GrpcChannelCache for job ps -> {0 -> distributed-mnist-ps-5oz2-0:2222} 238 | Initialize GrpcChannelCache for job worker -> {0 -> distributed-mnist-worker-5oz2-0:2222, 1 -> distributed-mnist-worker-5oz2-1:2222} 239 | 2017-12-01 20:10:11.826258: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222 240 | 2017-12-01 20:10:14.395476: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 87c6df6850b8f074 with config: 241 | ``` 242 | 243 | This indicates that the `ClusterSpec` was correctly extracted from the environment variable and given to TensorFlow. 244 | 245 | Once TensorBoard public IP is successfully provisioned (check with `kubectl get svc`), go in TensorBoard's graph section and change the color to Device in the left menu. 246 | You should see that your model is indeed correctly distributed between workers and PS: 247 | 248 | ![TensorBoard](./tensorboard.png) 249 | 250 | Again, since our model is very simple, TensorFlow will most likely only use a single worker. 251 | 252 | After a few minutes, the status of both worker nodes should show as `Completed` when doing `kubectl get pods -a`. 253 | 254 | #### Solution 255 | 256 | 257 | A working code sample is available in [`solution-src/main.py`](./solution-src/main.py). 258 | 259 |
260 | TFJob's Template 261 | 262 | ```yaml 263 | apiVersion: kubeflow.org/v1alpha1 264 | kind: TFJob 265 | metadata: 266 | name: module6-ex1 267 | spec: 268 | tensorboard: # Specification fot the TensorBoard instance that is going to monitor our training 269 | logDir: /tmp/tensorflow/logs 270 | serviceType: LoadBalancer 271 | volumes: 272 | - name: azurefile 273 | azureFile: 274 | secretName: azure-secret 275 | shareName: tensorflow 276 | volumeMounts: 277 | - mountPath: /tmp/tensorflow 278 | subPath: module6-ex1 279 | name: azurefile 280 | replicaSpecs: 281 | - replicas: 1 # 1 Master 282 | tfReplicaType: MASTER 283 | template: 284 | spec: 285 | volumes: 286 | - name: azurefile 287 | azureFile: 288 | secretName: azure-secret 289 | shareName: tensorflow 290 | readOnly: false 291 | containers: 292 | - image: wbuchwalter/tf-mnist:distributed # You can replace this by your own image 293 | name: tensorflow 294 | imagePullPolicy: Always 295 | volumeMounts: 296 | - mountPath: /tmp/tensorflow 297 | subPath: module6-ex1 298 | name: azurefile 299 | restartPolicy: OnFailure 300 | - replicas: 2 # 2 Workers 301 | tfReplicaType: WORKER 302 | template: 303 | spec: 304 | containers: 305 | - image: wbuchwalter/tf-mnist:distributed # You can replace this by your own image 306 | name: tensorflow 307 | imagePullPolicy: Always 308 | restartPolicy: OnFailure 309 | - replicas: 1 # 1 Parameter server 310 | tfReplicaType: PS 311 | ``` 312 | 313 | There are two things to notice here: 314 | * Since only the master will be saving the model and the summaries, we only need to mount the Azure File share on the master's `replicaSpec`, not on the `workers` or `ps`. 315 | * We are not specifying anything for the `PS` `replicaSpec` except the number of replicas. This is because `IsDefaultPS` is set to `true` by default. This means that the parameter server(s) will be started with a pre-built docker image that is already configured to read the `TF_CONFIG` and act as a TensorFlow server, so we don't need to do anything here. 316 | 317 |
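To try this solution, save the template to a file and create it, then watch the master, worker, parameter server and TensorBoard pods come up (a sketch; the file name is just an example):

```console
kubectl create -f module6-ex1.yaml
kubectl get pods -w
```
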
318 | 319 | 320 | ## Next Step 321 | 322 | [7 - Hyperparameters Sweep with Helm](../7-hyperparam-sweep) 323 | -------------------------------------------------------------------------------- /6-distributed-tensorflow/solution-src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.4.0 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] -------------------------------------------------------------------------------- /6-distributed-tensorflow/solution-src/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the 'License'); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an 'AS IS' BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """A simple MNIST classifier which displays summaries in TensorBoard. 16 | 17 | This is an unimpressive MNIST model, but it is a good example of using 18 | tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of 19 | naming summary tags so that they are grouped meaningfully in TensorBoard. 20 | 21 | It demonstrates the functionality of every TensorBoard dashboard. 22 | """ 23 | from __future__ import absolute_import 24 | from __future__ import division 25 | from __future__ import print_function 26 | 27 | import argparse 28 | import os 29 | import sys 30 | import ast 31 | import json 32 | 33 | import tensorflow as tf 34 | 35 | from tensorflow.examples.tutorials.mnist import input_data 36 | 37 | FLAGS = None 38 | 39 | def train(): 40 | tf_config_json = os.environ.get("TF_CONFIG", "{}") 41 | tf_config = json.loads(tf_config_json) 42 | 43 | task = tf_config.get("task", {}) 44 | cluster_spec = tf_config.get("cluster", {}) 45 | cluster_spec_object = tf.train.ClusterSpec(cluster_spec) 46 | job_name = task["type"] 47 | task_id = task["index"] 48 | server_def = tf.train.ServerDef( 49 | cluster=cluster_spec_object.as_cluster_def(), 50 | protocol="grpc", 51 | job_name=job_name, 52 | task_index=task_id) 53 | server = tf.train.Server(server_def) 54 | 55 | is_chief = (job_name == 'master') 56 | 57 | # Import data 58 | mnist = input_data.read_data_sets(FLAGS.data_dir, 59 | one_hot=True, 60 | fake_data=FLAGS.fake_data) 61 | 62 | 63 | # Create a multilayer model. 
64 | 65 | 66 | # Between-graph replication 67 | with tf.device(tf.train.replica_device_setter( 68 | worker_device="/job:worker/task:%d" % task_id, 69 | cluster=cluster_spec)): 70 | 71 | # count the number of updates 72 | global_step = tf.get_variable( 73 | 'global_step', 74 | [], 75 | initializer = tf.constant_initializer(0), 76 | trainable = False) 77 | 78 | # Input placeholders 79 | with tf.name_scope('input'): 80 | x = tf.placeholder(tf.float32, [None, 784], name='x-input') 81 | y_ = tf.placeholder(tf.float32, [None, 10], name='y-input') 82 | 83 | with tf.name_scope('input_reshape'): 84 | image_shaped_input = tf.reshape(x, [-1, 28, 28, 1]) 85 | tf.summary.image('input', image_shaped_input, 10) 86 | 87 | # We can't initialize these variables to 0 - the network will get stuck. 88 | def weight_variable(shape): 89 | """Create a weight variable with appropriate initialization.""" 90 | initial = tf.truncated_normal(shape, stddev=0.1) 91 | return tf.Variable(initial) 92 | 93 | def bias_variable(shape): 94 | """Create a bias variable with appropriate initialization.""" 95 | initial = tf.constant(0.1, shape=shape) 96 | return tf.Variable(initial) 97 | 98 | def variable_summaries(var): 99 | """Attach a lot of summaries to a Tensor (for TensorBoard visualization).""" 100 | with tf.name_scope('summaries'): 101 | mean = tf.reduce_mean(var) 102 | tf.summary.scalar('mean', mean) 103 | with tf.name_scope('stddev'): 104 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 105 | tf.summary.scalar('stddev', stddev) 106 | tf.summary.scalar('max', tf.reduce_max(var)) 107 | tf.summary.scalar('min', tf.reduce_min(var)) 108 | tf.summary.histogram('histogram', var) 109 | 110 | def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu): 111 | """Reusable code for making a simple neural net layer. 112 | 113 | It does a matrix multiply, bias add, and then uses ReLU to nonlinearize. 114 | It also sets up name scoping so that the resultant graph is easy to read, 115 | and adds a number of summary ops. 116 | """ 117 | # Adding a name scope ensures logical grouping of the layers in the graph. 118 | with tf.name_scope(layer_name): 119 | # This Variable will hold the state of the weights for the layer 120 | with tf.name_scope('weights'): 121 | weights = weight_variable([input_dim, output_dim]) 122 | variable_summaries(weights) 123 | with tf.name_scope('biases'): 124 | biases = bias_variable([output_dim]) 125 | variable_summaries(biases) 126 | with tf.name_scope('Wx_plus_b'): 127 | preactivate = tf.matmul(input_tensor, weights) + biases 128 | tf.summary.histogram('pre_activations', preactivate) 129 | activations = act(preactivate, name='activation') 130 | tf.summary.histogram('activations', activations) 131 | return activations 132 | 133 | hidden1 = nn_layer(x, 784, 500, 'layer1') 134 | 135 | with tf.name_scope('dropout'): 136 | keep_prob = tf.placeholder(tf.float32) 137 | tf.summary.scalar('dropout_keep_probability', keep_prob) 138 | dropped = tf.nn.dropout(hidden1, keep_prob) 139 | 140 | # Do not apply softmax activation yet, see below. 141 | y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity) 142 | 143 | with tf.name_scope('cross_entropy'): 144 | # The raw formulation of cross-entropy, 145 | # 146 | # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)), 147 | # reduction_indices=[1])) 148 | # 149 | # can be numerically unstable. 
150 | # 151 | # So here we use tf.nn.softmax_cross_entropy_with_logits on the 152 | # raw outputs of the nn_layer above, and then average across 153 | # the batch. 154 | diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) 155 | with tf.name_scope('total'): 156 | cross_entropy = tf.reduce_mean(diff) 157 | tf.summary.scalar('cross_entropy', cross_entropy) 158 | 159 | with tf.name_scope('train'): 160 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize( 161 | cross_entropy) 162 | 163 | with tf.name_scope('accuracy'): 164 | with tf.name_scope('correct_prediction'): 165 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 166 | with tf.name_scope('accuracy'): 167 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 168 | tf.summary.scalar('accuracy', accuracy) 169 | 170 | # Merge all the summaries and write them out to 171 | # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default) 172 | merged = tf.summary.merge_all() 173 | 174 | init_op = tf.global_variables_initializer() 175 | 176 | def feed_dict(train): 177 | """Make a TensorFlow feed_dict: maps data onto Tensor placeholders.""" 178 | if train or FLAGS.fake_data: 179 | xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data) 180 | k = FLAGS.dropout 181 | else: 182 | xs, ys = mnist.test.images, mnist.test.labels 183 | k = 1.0 184 | return {x: xs, y_: ys, keep_prob: k} 185 | 186 | 187 | 188 | sv = tf.train.Supervisor(is_chief=is_chief, 189 | global_step=global_step, 190 | init_op=init_op, 191 | logdir=FLAGS.logdir) 192 | 193 | with sv.prepare_or_wait_for_session(server.target) as sess: 194 | train_writer = tf.summary.FileWriter(FLAGS.logdir + '/train', sess.graph) 195 | test_writer = tf.summary.FileWriter(FLAGS.logdir + '/test') 196 | # Train the model, and also write summaries. 
197 | # Every 10th step, measure test-set accuracy, and write test summaries 198 | # All other steps, run train_step on training data, & add training summaries 199 | 200 | for i in range(FLAGS.max_steps): 201 | if i % 10 == 0: # Record summaries and test-set accuracy 202 | summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False)) 203 | test_writer.add_summary(summary, i) 204 | print('Accuracy at step %s: %s' % (i, acc)) 205 | else: # Record train set summaries, and train 206 | if i % 100 == 99: # Record execution stats 207 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 208 | run_metadata = tf.RunMetadata() 209 | summary, _ = sess.run([merged, train_step], 210 | feed_dict=feed_dict(True), 211 | options=run_options, 212 | run_metadata=run_metadata) 213 | train_writer.add_run_metadata(run_metadata, 'step%03d' % i) 214 | train_writer.add_summary(summary, i) 215 | print('Adding run metadata for', i) 216 | else: # Record a summary 217 | summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True)) 218 | train_writer.add_summary(summary, i) 219 | train_writer.close() 220 | test_writer.close() 221 | 222 | 223 | def main(_): 224 | train() 225 | 226 | 227 | if __name__ == '__main__': 228 | parser = argparse.ArgumentParser() 229 | parser.add_argument('--fake_data', nargs='?', const=True, type=bool, 230 | default=False, 231 | help='If true, uses fake data for unit testing.') 232 | parser.add_argument('--max_steps', type=int, default=1000, 233 | help='Number of steps to run trainer.') 234 | parser.add_argument('--learning_rate', type=float, default=0.001, 235 | help='Initial learning rate') 236 | parser.add_argument('--dropout', type=float, default=0.9, 237 | help='Keep probability for training dropout.') 238 | parser.add_argument( 239 | '--data_dir', 240 | type=str, 241 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 242 | 'tensorflow/input_data'), 243 | help='Directory for storing input data') 244 | parser.add_argument( 245 | '--logdir', 246 | type=str, 247 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 248 | 'tensorflow/logs'), 249 | help='Summaries log directory') 250 | FLAGS, unparsed = parser.parse_known_args() 251 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) -------------------------------------------------------------------------------- /6-distributed-tensorflow/tensorboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/6-distributed-tensorflow/tensorboard.png -------------------------------------------------------------------------------- /7-hyperparam-sweep/README.md: -------------------------------------------------------------------------------- 1 | # Automated Hyperparameters Sweep with `TFJob` and Helm 2 | 3 | ## Prerequisites 4 | 5 | * [3 - Helm](../3-helm) 6 | * [5 - TFJob](../5-tfjob) 7 | 8 | ### "Vanilla" Hyperparameter Sweep 9 | 10 | Just as distributed training, automated hyperparameter sweep is barely used in many organizations. 11 | The reasons are similar: It takes a lot of resources, or time, to run more than a couple training for the same model. 12 | * Either you run different hypothesis in parallel, which will likely requires a lot of resources and VMs. These VMs need to be managed by someone, the model need to be deployed, logs and checkpoints have to be gathered etc. 
13 | * Or you run everything sequentially on a small number of VMs, which takes a lot of time before you can compare results.
14 | 
15 | So in practice, most people manually fine-tune their hyperparameters through a few runs and pick a winner.
16 | 
17 | ### Kubernetes + Helm
18 | 
19 | Kubernetes coupled with Helm can make this easier, as we will see.
20 | Because Kubernetes on Azure also allows you to scale very easily (manually or automatically), it lets you explore a very large hyperparameter space while maximizing the usage of your cluster (and thus optimizing cost).
21 | 
22 | In practice, this process is still rudimentary today, as the technologies involved are all pretty young. Tools better suited for hyperparameter sweeping on distributed systems will most likely become available soon, but in the meantime Kubernetes and Helm already allow us to deploy a large number of trainings fairly easily.
23 | 
24 | ### Why Helm?
25 | 
26 | As we saw in module [3 - Helm](../3-helm), Helm enables us to package an application in a chart and parametrize its deployment easily.
27 | To do that, Helm lets us use the Go templating engine in the chart definitions. This means we can use conditions, loops, variables and [much more](https://docs.helm.sh/chart_template_guide).
28 | This allows us to create complex deployment flows.
29 | 
30 | In the case of hyperparameter sweeping, we want a chart able to deploy a number of `TFJobs`, each trying different values for some hyperparameters.
31 | We will also want to deploy a single TensorBoard instance monitoring all these `TFJobs`; that way we can quickly compare all our hypotheses, and even early-stop jobs that clearly don't perform well if we want to reduce cost as much as possible.
32 | For now, this chart will simply do a grid search, and while it is less efficient than random search, it is a good place to start.
33 | 
34 | ## Exercise
35 | 
36 | ### Creating and Deploying the Chart
37 | In this exercise, you will create a new Helm chart that will deploy a number of `TFJobs` as well as a TensorBoard instance.
38 | 
39 | Here is what our `values.yaml` file could look like, for example (you are free to go a different route):
40 | 
41 | ```yaml
42 | image: wbuchwalter/tf-paint:cpu
43 | useGPU: false
44 | shareName: tensorflow
45 | hyperParamValues:
46 |   learningRate:
47 |     - 0.001
48 |     - 0.01
49 |     - 0.1
50 |   hiddenLayers:
51 |     - 5
52 |     - 6
53 |     - 7
54 | ```
55 | 
56 | That way, when installing the chart, 9 `TFJobs` will actually get deployed (3 learning rates × 3 hidden-layer depths), testing every combination of the values we specified.
57 | This is a very simple example (our model is also very simple), but hopefully you start to see the possibilities that Helm offers.
58 | 
59 | In this exercise, we are going to use a new model based on [Andrej Karpathy's Image painting demo](http://cs.stanford.edu/people/karpathy/convnetjs/demo/image_regression.html).
60 | The objective of this model is to create a new picture as close as possible to the original one, "The Starry Night" by Vincent van Gogh:
61 | 
62 | ![Starry](./src/starry.jpg)
63 | 
64 | The source code is located in [src/](./src/).
65 | 
66 | Our model takes 3 parameters:
67 | 
68 | | argument | description | default value |
69 | |------|-------------|---------------|
70 | |`--learning-rate` | Learning rate value | `0.001` |
71 | |`--hidden-layers` | Number of hidden layers in our network. | `4` |
72 | |`--log-dir` | Path to save TensorFlow's summaries | `None`|
73 | 
74 | For simplicity, Docker images have already been created so you don't have to build and push them yourself:
75 | * `wbuchwalter/tf-paint:cpu` for CPU only.
76 | * `wbuchwalter/tf-paint:gpu` for GPU.
77 | 
78 | The goal of this exercise is to create a Helm chart that will allow us to test as many variations and combinations of the two hyperparameters `--learning-rate` and `--hidden-layers` as we want, just by adding them to our `values.yaml` file.
79 | This chart should also deploy a single TensorBoard instance (and its associated service), so we can quickly monitor and compare our different hypotheses.
80 | 
81 | If you are pretty new to Kubernetes and Helm and don't feel like building your own Helm chart just yet, you can skip to the solution, where details and explanations are provided.
82 | 
83 | #### Validation
84 | 
85 | Once you have created and deployed your chart, look at the pods that were created: you should see a bunch of them, as well as a single TensorBoard instance monitoring all of them:
86 | 
87 | ```console
88 | kubectl get pods
89 | ```
90 | 
91 | ```
92 | NAME                                       READY     STATUS    RESTARTS   AGE
93 | module7-tensorboard-3609490657-6w7zf       1/1       Running   0          23s
94 | module7-tf-paint-0-0-master-tduk-0-zqngf   1/1       Running   0          2m
95 | module7-tf-paint-0-1-master-ub9a-0-3rlbr   1/1       Running   0          2m
96 | module7-tf-paint-0-2-master-ekw1-0-cp0l3   1/1       Running   0          2m
97 | module7-tf-paint-1-0-master-jr7r-0-6jkwc   1/1       Running   0          2m
98 | module7-tf-paint-1-1-master-rqh2-0-0t4zw   1/1       Running   0          2m
99 | module7-tf-paint-1-2-master-le5b-0-34q38   1/1       Running   0          2m
100 | module7-tf-paint-2-0-master-g7i9-0-jq1c1   1/1       Running   0          2m
101 | module7-tf-paint-2-1-master-1urb-0-sq92z   1/1       Running   0          2m
102 | module7-tf-paint-2-2-master-ay57-0-0qt2c   1/1       Running   0          2m
103 | ```
104 | 
105 | Looking at TensorBoard, you should see something similar to this:
106 | ![TensorBoard](tensorboard.png)
107 | 
108 | > Note that TensorBoard can take a while before correctly displaying images.
109 | 
110 | Here we can see that some models are doing much better than others. Models with a learning rate of `0.1`, for example, are producing an all-black image; we are probably overshooting.
111 | After a few minutes, we can see that the two best performing models are:
112 | * 5 hidden layers and learning rate of `0.01`
113 | * 7 hidden layers and learning rate of `0.001`
114 | 
115 | At this point we could decide to kill all the other models if we wanted to free some capacity in our cluster, or to launch new experiments based on our initial findings.
116 | 
117 | #### Solution
118 | 
119 | Check out the commented solution chart: [./solution-chart/templates/deployment.yaml](./solution-chart/templates/deployment.yaml)
120 | 
121 | ## Next Step
122 | 
123 | [8 - Going Further](../8-going-further)
124 | 
--------------------------------------------------------------------------------
/7-hyperparam-sweep/solution-chart/Chart.yaml:
--------------------------------------------------------------------------------
1 | apiVersion: v1
2 | description: A Helm chart for Kubernetes
3 | name: module7-hyperparam-sweep
4 | version: 0.1.0
5 | 
--------------------------------------------------------------------------------
/7-hyperparam-sweep/solution-chart/templates/_helpers.tpl:
--------------------------------------------------------------------------------
1 | {{/* vim: set filetype=mustache: */}}
2 | {{/*
3 | Expand the name of the chart.
4 | */}} 5 | {{- define "name" -}} 6 | {{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} 7 | {{- end -}} 8 | 9 | {{/* 10 | Create a default fully qualified app name. 11 | We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). 12 | */}} 13 | {{- define "fullname" -}} 14 | {{- $name := default .Chart.Name .Values.nameOverride -}} 15 | {{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} 16 | {{- end -}} 17 | -------------------------------------------------------------------------------- /7-hyperparam-sweep/solution-chart/templates/deployment.yaml: -------------------------------------------------------------------------------- 1 | 2 | # First we copy the values of values.yaml in variable to make it easier to access them 3 | {{- $lrlist := .Values.hyperParamValues.learningRate -}} 4 | {{- $nblayerslist := .Values.hyperParamValues.hiddenLayers -}} 5 | {{- $image := .Values.image -}} 6 | {{- $useGPU := .Values.useGPU -}} 7 | {{- $shareName := .Values.shareName -}} 8 | {{- $chartname := .Chart.Name -}} 9 | {{- $chartversion := .Chart.Version -}} 10 | 11 | # Then we loop over every value of $lrlist (learning rate) and $nblayerslist (hidden layer depth) 12 | # This will result in create 1 TFJob for every pair of learning rate and hidden layer depth 13 | {{- range $i, $lr := $lrlist }} 14 | {{- range $j, $nblayers := $nblayerslist }} 15 | apiVersion: kubeflow.org/v1alpha1 16 | kind: TFJob # Each one of our trainings will be a separate TFJob 17 | metadata: 18 | name: module7-tf-paint-{{ $i }}-{{ $j }} # We give a unique name to each training 19 | labels: 20 | chart: "{{ $chartname }}-{{ $chartversion | replace "+" "_" }}" 21 | spec: 22 | replicaSpecs: 23 | - template: 24 | spec: 25 | restartPolicy: OnFailure 26 | containers: 27 | - name: tensorflow 28 | image: {{ $image }} 29 | env: 30 | - name: LC_ALL 31 | value: C.UTF-8 32 | args: 33 | # Here we pass a unique learning rate and hidden layer count to each instance. 34 | # We also put the values between quotes to avoid potential formatting issues 35 | - --learning-rate 36 | - {{ $lr | quote }} 37 | - --hidden-layers 38 | - {{ $nblayers | quote }} 39 | - --logdir 40 | - /tmp/tensorflow/tf-paint-lr{{ $lr }}-d-{{ $nblayers }} # We save the summaries in a different directory 41 | {{ if $useGPU }} # We only want to request GPUs if we asked for it in values.yaml with useGPU 42 | resources: 43 | requests: 44 | alpha.kubernetes.io/nvidia-gpu: 1 45 | {{ end }} 46 | volumeMounts: 47 | - mountPath: /tmp/tensorflow 48 | subPath: module7 # As usual we want to save everything in a separate subdirectory 49 | name: azurefile 50 | volumes: 51 | - name: azurefile 52 | azureFile: 53 | secretName: azure-secret 54 | shareName: {{ $shareName }} 55 | readOnly: false 56 | --- 57 | {{- end }} 58 | {{- end }} 59 | # We are not using TFJob integrated TensorBoard, because we want to create the Service and Deployment outside of the loop 60 | # since we only want one instance running for all our jobs, and not 1 per job. 
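# The Service below exposes this single TensorBoard instance through an Azure
# LoadBalancer (port 80 -> container port 6006), and the Deployment mounts the
# same Azure Files share (subPath "module7") that every TFJob above writes its
# summaries to. That shared logdir is what lets one TensorBoard display all runs.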
61 | apiVersion: v1 62 | kind: Service 63 | metadata: 64 | labels: 65 | app: tensorboard 66 | name: module7-tensorboard 67 | spec: 68 | ports: 69 | - port: 80 70 | targetPort: 6006 71 | selector: 72 | app: tensorboard 73 | type: LoadBalancer 74 | --- 75 | apiVersion: extensions/v1beta1 76 | kind: Deployment 77 | metadata: 78 | labels: 79 | app: tensorboard 80 | name: module7-tensorboard 81 | spec: 82 | template: 83 | metadata: 84 | labels: 85 | app: tensorboard 86 | spec: 87 | volumes: 88 | - name: azurefile 89 | azureFile: 90 | secretName: azure-secret 91 | shareName: {{ $shareName }} 92 | readOnly: false 93 | containers: 94 | - name: tensorboard 95 | command: 96 | - /usr/local/bin/tensorboard 97 | - --logdir=/tmp/tensorflow 98 | - --host=0.0.0.0 99 | image: tensorflow/tensorflow 100 | ports: 101 | - containerPort: 6006 102 | volumeMounts: 103 | - mountPath: /tmp/tensorflow 104 | subPath: module7 105 | name: azurefile -------------------------------------------------------------------------------- /7-hyperparam-sweep/solution-chart/values.yaml: -------------------------------------------------------------------------------- 1 | image: wbuchwalter/tf-paint:cpu 2 | useGPU: false 3 | shareName: tensorflow 4 | hyperParamValues: 5 | learningRate: 6 | - 0.001 7 | - 0.01 8 | - 0.1 9 | hiddenLayers: 10 | - 5 11 | - 6 12 | - 7 13 | -------------------------------------------------------------------------------- /7-hyperparam-sweep/src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.4.0 2 | COPY requirements.txt /app/requirements.txt 3 | WORKDIR /app 4 | RUN mkdir ./output 5 | RUN mkdir ./logs 6 | RUN mkdir ./checkpoints 7 | RUN pip install -r requirements.txt 8 | COPY ./* /app/ 9 | 10 | ENTRYPOINT [ "python", "/app/main.py" ] -------------------------------------------------------------------------------- /7-hyperparam-sweep/src/Dockerfile.gpu: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.4.0-gpu 2 | COPY requirements.txt /app/requirements.txt 3 | WORKDIR /app 4 | RUN mkdir ./output 5 | RUN mkdir ./logs 6 | RUN mkdir ./checkpoints 7 | RUN pip install -r requirements.txt 8 | COPY ./* /app/ 9 | 10 | ENTRYPOINT [ "python", "/app/main.py" ] -------------------------------------------------------------------------------- /7-hyperparam-sweep/src/main.py: -------------------------------------------------------------------------------- 1 | import click 2 | import tensorflow as tf 3 | import numpy as np 4 | from skimage.data import astronaut 5 | from scipy.misc import imresize, imsave, imread 6 | 7 | img = imread('./starry.jpg') 8 | img = imresize(img, (100, 100)) 9 | save_dir = 'output' 10 | epochs = 2000 11 | 12 | 13 | def linear_layer(X, layer_size, layer_name): 14 | with tf.variable_scope(layer_name): 15 | W = tf.Variable(tf.random_uniform([X.get_shape().as_list()[1], layer_size], dtype=tf.float32), name='W') 16 | b = tf.Variable(tf.zeros([layer_size]), name='b') 17 | return tf.nn.relu(tf.matmul(X, W) + b) 18 | 19 | @click.command() 20 | @click.option("--learning-rate", default=0.01) 21 | @click.option("--hidden-layers", default=7) 22 | @click.option("--logdir") 23 | def main(learning_rate, hidden_layers, logdir='./logs/1'): 24 | X = tf.placeholder(dtype=tf.float32, shape=(None, 2), name='X') 25 | y = tf.placeholder(dtype=tf.float32, shape=(None, 3), name='y') 26 | current_input = X 27 | for layer_id in range(hidden_layers): 28 | h = 
linear_layer(current_input, 20, 'layer{}'.format(layer_id)) 29 | current_input = h 30 | 31 | y_pred = linear_layer(current_input, 3, 'output') 32 | 33 | #loss will be distance between predicted and true RGB 34 | loss = tf.reduce_mean(tf.reduce_sum(tf.squared_difference(y, y_pred), 1)) 35 | tf.summary.scalar('loss', loss) 36 | 37 | train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss) 38 | merged_summary_op = tf.summary.merge_all() 39 | 40 | res_img = tf.cast(tf.clip_by_value(tf.reshape(y_pred, (1,) + img.shape), 0, 255), tf.uint8) 41 | img_summary = tf.summary.image('out', res_img, max_outputs=1) 42 | 43 | xs, ys = get_data(img) 44 | 45 | with tf.Session() as sess: 46 | tf.global_variables_initializer().run() 47 | train_writer = tf.summary.FileWriter(logdir + '/train', sess.graph) 48 | test_writer = tf.summary.FileWriter(logdir + '/test') 49 | batch_size = 50 50 | for i in range(epochs): 51 | # Get a random sampling of the dataset 52 | idxs = np.random.permutation(range(len(xs))) 53 | # The number of batches we have to iterate over 54 | n_batches = len(idxs) // batch_size 55 | # Now iterate over our stochastic minibatches: 56 | for batch_i in range(n_batches): 57 | batch_idxs = idxs[batch_i * batch_size: (batch_i + 1) * batch_size] 58 | sess.run([train_op, loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]}) 59 | if batch_i % 100 == 0: 60 | c, summary = sess.run([loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]}) 61 | train_writer.add_summary(summary, (i * n_batches * batch_size) + batch_i) 62 | print("epoch {}, (l2) loss {}".format(i, c)) 63 | 64 | if i % 10 == 0: 65 | img_summary_res = sess.run(img_summary, feed_dict={X: xs, y: ys}) 66 | test_writer.add_summary(img_summary_res, i * n_batches * batch_size) 67 | 68 | def get_data(img): 69 | xs = [] 70 | ys = [] 71 | for row_i in range(img.shape[0]): 72 | for col_i in range(img.shape[1]): 73 | xs.append([row_i, col_i]) 74 | ys.append(img[row_i, col_i]) 75 | 76 | xs = (xs - np.mean(xs)) / np.std(xs) 77 | return xs, np.array(ys) 78 | 79 | if __name__ == "__main__": 80 | main() -------------------------------------------------------------------------------- /7-hyperparam-sweep/src/requirements.txt: -------------------------------------------------------------------------------- 1 | scikit-image 2 | click>=6.2 -------------------------------------------------------------------------------- /7-hyperparam-sweep/src/starry.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/7-hyperparam-sweep/src/starry.jpg -------------------------------------------------------------------------------- /7-hyperparam-sweep/tensorboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/7-hyperparam-sweep/tensorboard.png -------------------------------------------------------------------------------- /8-going-further/NFSonAzureConcept.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/8-going-further/NFSonAzureConcept.png -------------------------------------------------------------------------------- /8-going-further/README.md: 
--------------------------------------------------------------------------------
1 | # Going Further
2 | 
3 | ## Summary
4 | 
5 | * Learn how to deal with very large amounts of input data
6 | * Learn how to autoscale a Kubernetes cluster created with acs-engine.
7 | 
8 | 
9 | ## Working with Large Amounts of Data
10 | 
11 | In the previous modules, we saw how we could easily scale our trainings on Kubernetes to a large number of nodes.
12 | We also saw how Azure Files provides an easy way to add persistent storage to our trainings to save the output models, summaries, etc.
13 | 
14 | However, in every example we have worked with so far, the training data was either downloaded at run time or directly baked into the container. While this can work for very small datasets, it is not a scalable approach for larger ones.
15 | Just as we used Azure Files to store models and logs, we can use Azure Files to store our dataset and mount the share in our containers. This works well only up to a certain point though (say a few GBs), as an Azure file share is limited to 1000 IOPS.
16 | 
17 | So how can we deal with larger datasets?
18 | One solution is to use a distributed file system.
19 | 
20 | ### Distributed File System, Tools and Concepts
21 | 
22 | A distributed file system aggregates storage from multiple machines over the network into a single large file system.
23 | Such a file system can be deployed inside your Kubernetes cluster, and can use the disks already attached to your nodes as partitions of the overall network file system.
24 | 
25 | 
26 | ![](NFSonAzureConcept.png)
27 | 
28 | Here are some tools and frameworks that make it easy to deploy such a distributed file system on Kubernetes:
29 | 
30 | * [GlusterFS](http://www.gluster.org/)
31 | * [Rook](https://rook.io/)
32 | * [Portworx](https://portworx.com/)
33 | * [Pachyderm](http://pachyderm.io/)
34 | 
35 | 
36 | ## Autoscaling a Kubernetes Cluster
37 | 
38 | As we saw in modules [6 - Distributed TensorFlow](../6-distributed-tensorflow/) and [7 - Hyperparameters Sweep](../7-hyperparam-sweep), being able to autoscale our Kubernetes cluster can be extremely useful.
39 | Indeed, automatic scale-out (adding more VMs to the cluster) allows anyone to run any experiment they want, whenever they want, without needing to involve an ops person to set up and prepare virtual machines.
40 | And because the cluster can also automatically scale in by deleting idle VMs once training jobs are completed, we can keep the cost as low as possible by paying only for what we actually use.
41 | 
42 | As of this writing, autoscaling is only supported on Kubernetes clusters created with [acs-engine](https://github.com/Azure/acs-engine).
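To make the scale-out / scale-in idea concrete, here is a small conceptual sketch of the decision the autoscaler automates. This is *not* the actual autoscaler (see the resources below); it is only an illustration written with the official `kubernetes` Python client, and it assumes `pip install kubernetes` and a valid kubeconfig for your cluster:

```python
# Conceptual sketch only: pending pods usually mean we are out of capacity
# (scale out), while nodes running no user workload are candidates for
# deletion (scale in). A real autoscaler also handles cooldowns, VM
# provisioning through ARM/acs-engine, node draining, etc.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_pod_for_all_namespaces(watch=False).items
nodes = v1.list_node().items

# Pods stuck in "Pending" typically could not be scheduled for lack of CPU/GPU.
pending = [p.metadata.name for p in pods if p.status.phase == "Pending"]

# Nodes hosting only kube-system pods are idle from a training point of view.
busy_nodes = {p.spec.node_name for p in pods
              if p.spec.node_name and p.metadata.namespace != "kube-system"}
idle_nodes = [n.metadata.name for n in nodes if n.metadata.name not in busy_nodes]

if pending:
    print("Scale out: %d pod(s) waiting for capacity: %s" % (len(pending), pending))
elif idle_nodes:
    print("Scale in: node(s) %s are idle and could be removed" % idle_nodes)
else:
    print("Cluster looks correctly sized.")
```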
43 | 
44 | See the following resources to get started:
45 | * [Kubernetes `acs-engine` Autoscaler](https://github.com/wbuchwalter/Kubernetes-acs-engine-autoscaler)
46 | * [Autoscaling a Kubernetes cluster created with acs-engine on Azure](https://medium.com/@wbuchwalter/autoscaling-a-kubernetes-cluster-created-with-acs-engine-on-azure-5e24ddc6402e)
47 | * [Case study: Autoscaling Deep Learning Training with Kubernetes](https://www.microsoft.com/developerblog/2017/11/21/autoscaling-deep-learning-training-kubernetes/)
48 | 
49 | 
50 | ## Next Step
51 | 
52 | [9 - Jupyter Notebooks on Kubernetes](../9-jupyter)
53 | 
--------------------------------------------------------------------------------
/9-jupyter/README.md:
--------------------------------------------------------------------------------
1 | # Jupyter Notebooks on Kubernetes
2 | 
3 | ## Prerequisites
4 | * [1 - Docker Basics](../1-docker)
5 | * [2 - Kubernetes Basics and cluster created](../2-kubernetes)
6 | 
7 | ## Summary
8 | 
9 | In this module, you will learn how to:
10 | * Run Jupyter Notebooks locally using Docker
11 | * Run Jupyter Notebooks on Kubernetes
12 | 
13 | ## How Jupyter Notebooks work
14 | 
15 | The [Jupyter Notebook](http://jupyter.org/) is an open source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text for rapid prototyping. It is often used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more. To better support exploratory iteration and to accelerate the computation of TensorFlow jobs, let's look at how we can include data science tools like Jupyter Notebook with Docker and Kubernetes.
16 | 
17 | ## Exercises
18 | 
19 | ### Exercise 1: Run Jupyter Notebooks locally using Docker
20 | 
21 | In this first exercise, we will run Jupyter Notebooks locally using Docker. We will use the official TensorFlow Docker image, as it comes with Jupyter Notebook.
22 | 
23 | ```console
24 | docker run -it -p 8888:8888 tensorflow/tensorflow
25 | ```
26 | 
27 | #### Validation
28 | 
29 | To verify, browse to the URL in the output log.
30 | 
31 | For example: `http://localhost:8888/?token=a3ea3cd914c5b68149e2b4a6d0220eca186fec41563c0413`
32 | 
33 | 
34 | ### Exercise 2: Run Jupyter Notebooks on Kubernetes
35 | 
36 | In this exercise, we will run Jupyter Notebooks on a Kubernetes cluster.
37 | 
38 | As a prerequisite, you should already have a Kubernetes cluster running; you can follow [module 2 - Kubernetes](../2-kubernetes) to create your own cluster.
39 | 
40 | Similar to running Jupyter Notebooks locally using Docker, we can again use the official TensorFlow Docker image, as it comes with Jupyter Notebook. But here we can run many instances of Jupyter Notebook in the cluster to handle additional load.
41 | 
42 | To run Jupyter Notebook using Kubernetes, you need to:
43 | * Create a Pod using the TensorFlow image
44 | * Expose port 8888 to run Jupyter Notebook
45 | * [With GPU] Mount the NVIDIA libraries from the host VM to a custom directory in the container
46 | * Create a Service to run Jupyter Notebook
47 | 
48 | #### Solution for Exercise 2
49 | 
50 | Create a YAML file like the one below.
51 | 
52 | 
53 | Solution for CPU only:

55 | 56 | ```yaml 57 | apiVersion: v1 58 | kind: Service 59 | metadata: 60 | labels: 61 | app: jupyter-server 62 | name: jupyter-server 63 | spec: 64 | ports: 65 | - port: 8888 66 | targetPort: 8888 67 | selector: 68 | app: jupyter-server 69 | type: LoadBalancer 70 | --- 71 | apiVersion: extensions/v1beta1 72 | kind: Deployment 73 | metadata: 74 | name: jupyter-server 75 | spec: 76 | replicas: 1 77 | template: 78 | metadata: 79 | labels: 80 | app: jupyter-server 81 | spec: 82 | containers: 83 | - args: 84 | image: tensorflow/tensorflow 85 | name: jupyter-server 86 | ports: 87 | - containerPort: 8888 88 | ``` 89 | 90 |

91 |
92 | 93 |
94 | Solution with GPU:

96 | 97 | ```yaml 98 | apiVersion: v1 99 | kind: Service 100 | metadata: 101 | labels: 102 | app: jupyter-server 103 | name: jupyter-server 104 | spec: 105 | ports: 106 | - port: 8888 107 | targetPort: 8888 108 | selector: 109 | app: jupyter-server 110 | type: LoadBalancer 111 | --- 112 | apiVersion: extensions/v1beta1 113 | kind: Deployment 114 | metadata: 115 | name: jupyter-server 116 | spec: 117 | replicas: 1 118 | template: 119 | metadata: 120 | labels: 121 | app: jupyter-server 122 | spec: 123 | containers: 124 | - name: jupyter-server 125 | image: tensorflow/tensorflow:latest-gpu 126 | ports: 127 | - containerPort: 8888 128 | imagePullPolicy: IfNotPresent 129 | env: 130 | - name: LD_LIBRARY_PATH 131 | value: /usr/lib/nvidia:/usr/lib/x86_64-linux-gnu 132 | resources: 133 | requests: 134 | alpha.kubernetes.io/nvidia-gpu: 1 135 | volumeMounts: 136 | - mountPath: /usr/local/nvidia/bin 137 | name: bin 138 | - mountPath: /usr/lib/nvidia 139 | name: lib 140 | - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1 141 | name: libcuda 142 | volumes: 143 | - name: bin 144 | hostPath: 145 | path: /usr/lib/nvidia-384/bin 146 | - name: lib 147 | hostPath: 148 | path: /usr/lib/nvidia-384 149 | - name: libcuda 150 | hostPath: 151 | path: /usr/lib/x86_64-linux-gnu/libcuda.so.1 152 | ``` 153 | 154 |

155 |
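As a side note, once the GPU variant above is running, you can confirm from a notebook cell that TensorFlow actually sees the GPU. Here is a minimal sketch using the TensorFlow 1.x API that ships in the `tensorflow/tensorflow:latest-gpu` image used above:

```python
# Run this inside a notebook cell on the GPU-backed Jupyter server.
import tensorflow as tf
from tensorflow.python.client import device_lib

# Prints something like "/device:GPU:0" when the NVIDIA libraries are mounted
# correctly; an empty string means TensorFlow only sees the CPU.
print(tf.test.gpu_device_name())

# Lists every device TensorFlow can use (CPU and, hopefully, GPU:0).
print([d.name for d in device_lib.list_local_devices()])
```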
156 | 
157 | Save the YAML file, then deploy it to your Kubernetes cluster by running:
158 | 
159 | ```console
160 | kubectl create -f <path-to-your-yaml-file>
161 | ```
162 | 
163 | #### Validation
164 | 
165 | After the deployment is created, a pod running TensorFlow will be created, along with a new service for the Jupyter Notebook. The new service will acquire a new external IP to serve Jupyter Notebook on port 8888. This may take a few minutes to complete.
166 | 
167 | To verify, run the following to view the output log and get the URL and token of the hosted Jupyter Notebook:
168 | 
169 | ```console
170 | kubectl logs jupyter-server-xxxxx
171 | 
172 | # sample output
173 | 
174 | http://localhost:8888/?token=2e7c875bd4e72137911d33e209c91d01f7a7b44868cf664d
175 | 
176 | ```
177 | 
178 | Next, to get the public IP of the new service created for Jupyter Notebook, run:
179 | 
180 | ```console
181 | kubectl get svc jupyter-server -o jsonpath={.status.loadBalancer.ingress[0].ip}
182 | 
183 | xx.xx.xx.xx
184 | ```
185 | From a browser, navigate to the Jupyter Notebook with the following URL, replacing `PUBLICIP` with the output from the previous step:
186 | 
187 | ```
188 | http://PUBLICIP:8888/?token=2e7c875bd4e72137911d33e209c91d01f7a7b44868cf664d
189 | ```
190 | 
191 | 
192 | 
193 | 
194 | 
195 | 
196 | 
197 | 
198 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017 Microsoft
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ### :warning: This repository is deprecated! Go to [Azure/kubeflow-labs](https://github.com/Azure/kubeflow-labs) instead :warning:
2 | 
3 | # Train TensorFlow Models at Scale with Kubernetes on Azure
4 | 
5 | 
8 | 
9 | ## Prerequisites
10 | 
11 | 1. Have a valid Microsoft Azure subscription allowing the creation of an ACS cluster
12 | 1. Docker client installed: [Installing Docker](https://www.docker.com/community-edition)
13 | 1. Azure-cli (2.0) installed: [Installing the Azure CLI 2.0 | Microsoft Docs](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest)
14 | 1. Git CLI installed: [Installing Git CLI](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
15 | 1. Kubectl installed: [Installing Kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
16 | 1. Helm installed: [Installing Helm CLI](https://docs.helm.sh/using_helm/#from-the-binary-releases) (**Note**: On Windows you can extract the `tar` file using a tool like 7Zip.)
17 | 
18 | Clone this repository somewhere so you can easily access the different source files:
19 | ```console
20 | git clone https://github.com/wbuchwalter/tensorflow-k8s-azure
21 | ```
22 | 
23 | ## Content Summary
24 | 
25 | | | Module | Description |
26 | | --- | --- | --- |
27 | |0| **[Introduction](0-intro)** | Introduction to this workshop. Motivations and goals.|
28 | |1| **[Docker](1-docker)** | Docker and containers 101.|
29 | |2| **[Kubernetes](2-kubernetes)** | Overview of important Kubernetes concepts.|
30 | |3| **[Helm](3-helm)** | Introduction to Helm. |
31 | |4| **[GPUs](4-gpus)** | How to use GPUs with Kubernetes.|
32 | |5| **[TFJob](5-tfjob)** | How to use `tensorflow/k8s` and `TFJob` to deploy a simple TensorFlow training.|
33 | |6| **[Distributed TensorFlow](6-distributed-tensorflow)** | Going distributed with `TFJob`.|
34 | |7| **[Hyperparameters Sweep with Helm](7-hyperparam-sweep)** | Using Helm to deploy a large number of trainings testing different hypotheses, and to monitor and compare them. |
35 | |8| **[Going Further](8-going-further)** | Links and resources to go further: Autoscaling, Distributed Storage. |
36 | |9| **[Jupyter Notebooks](9-jupyter)** | Easily deploy a Jupyter Notebook instance on Kubernetes. |
37 | 
--------------------------------------------------------------------------------