├── .gitignore ├── 0-intro ├── README.md ├── thumbnail.png └── workflow.png ├── 1-docker ├── README.md └── src │ ├── Dockerfile │ └── main.py ├── 2-kubernetes └── README.md ├── 3-helm ├── README.md └── dokuwiki.png ├── 4-gpus ├── README.md └── src │ ├── Dockerfile │ └── main.py ├── 5-tfjob ├── README.md ├── file-share.png └── tensorboard.png ├── 6-distributed-tensorflow ├── README.md ├── solution-src │ ├── Dockerfile │ └── main.py └── tensorboard.png ├── 7-hyperparam-sweep ├── README.md ├── solution-chart │ ├── Chart.yaml │ ├── templates │ │ ├── _helpers.tpl │ │ └── deployment.yaml │ └── values.yaml ├── src │ ├── Dockerfile │ ├── Dockerfile.gpu │ ├── main.py │ ├── requirements.txt │ └── starry.jpg └── tensorboard.png ├── 8-going-further ├── NFSonAzureConcept.png └── README.md ├── 9-jupyter └── README.md ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store -------------------------------------------------------------------------------- /0-intro/README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | ## Motivations 4 | 5 | Machine learning model development and operationalization currently has very few industry-wide best practices to help us reduce the time to market and optimize the different steps. 6 | 7 | However, in traditional application development, DevOps practices are becoming ubiquitous. 8 | We can benefit from many of these practices by applying them to model development and operationalization. 9 | 10 | Here is a subset of the pain points that exist in a typical ML workflow. 11 | #### A Typical (Simplified) ML Workflow and its Pain Points 12 | ![Typical Workflow](workflow.png) 13 | 14 | This workshop is going to focus on improving the training process by leveraging containers and Kubernetes. 15 | 16 | Today many data scientists are training their models either on their physical workstation (be it a laptop or a desktop with multiple GPUs) or using a VM (sometimes, but rarely, a couple of them) in the cloud. 17 | 18 | This approach is sub-optimal for many reasons, among which: 19 | * Training is slow and sequential 20 | * Having a single (or a few) GPU on hand means there are only so many trainings you can run at a time. It also means that once your GPU is busy with a training, you cannot use it for anything else, such as smaller experiments. 21 | * Hyper-parameter sweeping is vastly inefficient: the different hypotheses you want to test will run sequentially and not in parallel. In practice this means that very often we don't have time to really explore the hyper-parameter space, and we just run the couple of experiments that we think will yield the best results. 22 | The longer the training time, the fewer experiments we can run. 23 | * Distributed training is hard (or impossible) to set up 24 | * In practice very few data scientists benefit from distributed training, either because they simply can't use it (you need multiple machines for that) or because it is too tedious to set up. 25 | * High cost 26 | * If each member of the team has their own allocated resources, in practice many of them will sit unused at any given time; given the price of a single GPU, this is very costly. On the other hand, pooling resources (such as sharing VMs) is also painful since multiple people might want to use them at the same time. 
27 | 28 | Using Kubernetes, we can alleviate many of these pain points: 29 | * Training is massively parallelizable 30 | * Kubernetes is highly scalable (up to 1200 VMs for a single cluster on Azure). In practice that means you could run as many experiments as you want at the same time. This makes exploring and comparing different hypotheses much simpler and more efficient. 31 | * Distributed training is much simpler 32 | * As we will see in this workshop, it is very easy to set up distributed TensorFlow training on Kubernetes and scale it to whatever size you want, making it much more usable in practice. 33 | * Optimized cost with autoscaling* 34 | * Kubernetes allows for resource pooling while at the same time ensuring that any training job can run without waiting for another one to finish. 35 | * With autoscaling the cluster can automatically scale out or in to ensure maximum utilization, thus keeping the cost as low as possible. 36 | 37 | *While autoscaling is very powerful, it is outside the scope of this workshop. However, we will give you resources and pointers to get started with it. 38 | 39 | 40 | ## OpenAI: Building the Infrastructure that Powers the Future of AI 41 | 42 | During KubeCon 2017, Vicki Cheung and Jonas Schneider delivered a keynote explaining how OpenAI manages training at very large scale with Kubernetes; it is worth listening to: 43 | 44 | ![OpenAI](./thumbnail.png) 45 | 46 | ## Next Step 47 | [Module 1: Docker](../1-docker/README.md) 48 | -------------------------------------------------------------------------------- /0-intro/thumbnail.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/0-intro/thumbnail.png -------------------------------------------------------------------------------- /0-intro/workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/0-intro/workflow.png -------------------------------------------------------------------------------- /1-docker/README.md: -------------------------------------------------------------------------------- 1 | # Docker 2 | 3 | ### Summary 4 | 5 | In this section you will learn about: 6 | * Running Docker locally 7 | * Basics of Docker 8 | * Containerizing a simple application 9 | * Building and Pushing an Image 10 | 11 | 12 | 13 | ### Basics of Docker and Containers 14 | 15 | Docker has a very well structured six-part tutorial. 16 | While for this workshop you don't need to go through all of them, parts 1 and 2 are required: 17 | * [Get Started, Part 1: Orientation and setup](https://docs.docker.com/get-started) 18 | * [Get Started, Part 2: Containers](https://docs.docker.com/get-started/part2/) 19 | 20 | By the end of Part 2, you should have a simple container up and running, and understand the basic concepts of a container. 21 | 22 | #### Additional Important Docker Commands 23 | 24 | Here are a few other docker commands that are important to be aware of for the rest of this workshop: 25 | 26 | 1. `docker ps` 27 | 28 | The docker `ps` command allows you to list the status of your containers. 29 | 30 | A container can be either stopped or running. When it finishes executing its process, it stops. 
31 | 32 | For example, if you run the command `docker run -it ubuntu hostname`, it will: 33 | - Pull the official ubuntu image from the registry 34 | - Start the container in the interactive mode `-it` 35 | - Execute the command: `hostname` 36 | - Stop 37 | 38 | ``` 39 | $ docker run -it ubuntu hostname 40 | 0d0af5005fc7 41 | ``` 42 | 43 | If you run the command `docker ps -a` you should see: 44 | ``` 45 | $ docker ps -a 46 | CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 47 | 0d0af5005fc7 ubuntu "hostname" 58 seconds ago Exited (0) About a minute ago gifted_darwin 48 | ``` 49 | 50 | > The `-a` flag allows you to list all the containers, not just the ones that are running 51 | 52 | We can notice a few things here, such as: 53 | - The status `Exited...` for our container 54 | - The name `gifted_darwin` that was randomly generated for our container; we can specify a custom one using the `--name` flag. 55 | - We can re-execute our container using the command `docker start gifted_darwin` 56 | - We can run the command `docker run -it ubuntu hostname` again and do a `docker ps -a`; we should see two exited containers. 57 | 58 | 1. `docker logs` 59 | 60 | The docker `logs` command allows you to fetch the console output from inside the container. 61 | 62 | From our previous example, we can run `docker logs gifted_darwin` 63 | 64 | ``` 65 | $ docker logs gifted_darwin 66 | 0d0af5005fc7 67 | ``` 68 | 69 | You can also stream the logs with the `-f` flag, printing the stdout of your container to your console in real time. 70 | 71 | 1. `docker rm` 72 | 73 | The docker `rm` command allows you to remove a container. 74 | 75 | From the previous example, we can see that we have a container listed as exited in our environment, or maybe more if we run the same command `docker run -it ubuntu hostname` multiple times. If we want to do some cleaning and remove those executions from our environment, we can use the command `docker rm`. 76 | 77 | ``` 78 | $ docker rm gifted_darwin 79 | gifted_darwin 80 | ``` 81 | 82 | > You can specify either the **CONTAINER ID** or the **NAME** of the container to refer to it 83 | 84 | 85 | 1. `docker images` 86 | 87 | This command allows us to list all the images available in the environment. 88 | 89 | ``` 90 | $ docker images 91 | REPOSITORY TAG IMAGE ID CREATED SIZE 92 | ubuntu latest 20c44cd7596f 2 days ago 123MB 93 | example-scratch latest 32ff7b65f567 5 days ago 30.7MB 94 | node 8.9.1-slim a6bb2cc1118f 11 days ago 230MB 95 | buildpack-deps xenial a27b6a8abd1c 2 weeks ago 644MB 96 | ``` 97 | 98 | > You can manage your images by removing them using `docker rmi IMAGENAME` or pulling a new one with `docker pull IMAGENAME` 99 | 100 | ### Containerizing a TensorFlow model 101 | 102 | Now that we understand the basics of Docker, let's containerize our first TensorFlow model that we will reuse in the following modules. 103 | Our first model will be a very simple MNIST classifier. You can see the source code in [`./src/main.py`](./src/main.py). 104 | As you can see, there is nothing specific to containers in this code; you could run this script directly on your laptop or on a VM. 105 | 106 | Now, to have this run in a container, we need to build an image containing this code and its dependencies. 107 | As you saw in the tutorial, we will use a `Dockerfile` to do this. 
108 | 109 | Here is the (very simple) `Dockerfile` that we are going to use for this model (located in [`./src/Dockerfile`](./src/Dockerfile)): 110 | 111 | ```dockerfile 112 | FROM tensorflow/tensorflow:1.4.0 113 | COPY main.py /app/main.py 114 | 115 | ENTRYPOINT ["python", "/app/main.py"] 116 | ``` 117 | 118 | As you can see, we are not building a new image from scratch, instead we are using a base image from TensorFlow. Indeed, TensorFlow has a bunch of base images that you can start with. 119 | You can see the full list here: https://hub.docker.com/r/tensorflow/tensorflow/tags/. 120 | 121 | What is important to note is that different tags need to be used depending on if you want to use GPU or not. 122 | For example, if you wanted to run your model with TensorFlow 1.4.0 and CPU only, you would use `tensorflow/tensorflow:1.4.0`. 123 | If instead you wanted to use GPU, you would start from `tensorflow/tensorflow:1.4.0-gpu`. 124 | 125 | The two other instructions are pretty straightforward, first we copy our script into the container, and then we set this script as the entry point for our container, so that any argument passed to our container would actually get passed to our script. 126 | 127 | #### Building the image 128 | 129 | > If you don't already have a Docker account, see [Log in with your Docker ID](https://docs.docker.com/get-started/part2/#log-in-with-your-docker-id). 130 | 131 | The next step is to build our image to be able to run it using docker. For that, we will use the command `docker build`. 132 | 133 | From the [`./src`](./src) repository, we can build the image with 134 | 135 | ```console 136 | docker build -t ${DOCKER_USERNAME}/tf-mnist . 137 | ``` 138 | > Reminder: the `-t` argument allows to **tag** the image with a specific name. 139 | 140 | `${DOCKER_USERNAME}` should be your Docker username that you use to connect to Docker Hub. 141 | 142 | The output from this command should look like this: 143 | 144 | ``` 145 | Sending build context to Docker daemon 11.26kB 146 | Step 1/3 : FROM tensorflow/tensorflow:1.4.0 147 | ---> a61a91cc0d1b 148 | Step 2/3 : COPY main.py /app/main.py 149 | ---> b264d6e9a5ef 150 | Removing intermediate container fe8128425296 151 | Step 3/3 : ENTRYPOINT python /app/main.py 152 | ---> Running in 7acb7aac7a9f 153 | ---> 92c7ed17916b 154 | Removing intermediate container 7acb7aac7a9f 155 | Successfully built 92c7ed17916b 156 | Successfully tagged wbuchwalter/tf-mnist:latest 157 | ``` 158 | Let's analyse this image full name (`wbuchwalter/tf-mnist:latest`): 159 | * `wbuchwalter` is the name of my repository, this is where we can find the image. This will be different for you (same as your docker hub username). 160 | * `tf-mnist` is the name of the image itself 161 | * `latest` is the tag. `latest` is the default tag if you don't specify any. Tags are usually used to denote different versions or flavors of a same image. For example you could have a tag `v1` and `v2` to denote different versions, or `cpu` and `gpu` to denote what hardware it can run on. 162 | 163 | When you have the successfully built message, you should now be able to see if your image is locally available with the command `docker images` described earlier. 164 | 165 | #### Running the image 166 | 167 | Now we can try to run it locally using the `docker run` command. 168 | By default the model will run 1000 training steps which can take a few minutes on a laptop. Let's reduce this number to 100 with the `--max_steps` argument. 
169 | 170 | ```console 171 | docker run -it ${DOCKER_USERNAME}/tf-mnist --max_steps 100 172 | ``` 173 | 174 | If everything is okay you should see the model training: 175 | 176 | ``` 177 | Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes. 178 | Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz 179 | Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. 180 | Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz 181 | Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. 182 | Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz 183 | Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. 184 | Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz 185 | 2017-11-29 18:32:41.992194: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA 186 | Accuracy at step 0: 0.1292 187 | Accuracy at step 10: 0.7198 188 | Accuracy at step 20: 0.834 189 | Accuracy at step 30: 0.8698 190 | Accuracy at step 40: 0.8783 191 | Accuracy at step 50: 0.8968 192 | Accuracy at step 60: 0.9023 193 | Accuracy at step 70: 0.9059 194 | Accuracy at step 80: 0.9084 195 | Accuracy at step 90: 0.9154 196 | Adding run metadata for 99 197 | ``` 198 | 199 | You can kill the process and exit the container at any time with `ctrl + c`. 200 | 201 | ### Running the image with the NVIDIA GPU of your machine (If you have one) 202 | 203 | **Currently, running docker containers with GPU is only supported on Linux.** 204 | 205 | First install [nvidia-docker](https://github.com/NVIDIA/nvidia-docker). 206 | 207 | You also need to make sure the image you are going to use is optimized for GPU. 208 | In our example you need to modify the `Dockerfile` to use a TensorFlow image built for GPU: 209 | 210 | ```dockerfile 211 | FROM tensorflow/tensorflow:1.4.0-gpu 212 | COPY main.py /app/main.py 213 | 214 | ENTRYPOINT ["python", "/app/main.py"] 215 | ``` 216 | 217 | Then simply rebuild the image with a new tag (you can use docker or nvidia-docker interchangeably for any command except run): 218 | 219 | ```console 220 | docker build -t ${DOCKER_USERNAME}/tf-mnist:gpu . 221 | ``` 222 | 223 | Finally run the container with nvidia-docker: 224 | 225 | ```console 226 | nvidia-docker run -it ${DOCKER_USERNAME}/tf-mnist:gpu 227 | ``` 228 | 229 | > Note: If the command fails with `Unknown runtime specified nvidia`, follow the steps described here (Systemd drop-in file): https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup 230 | 231 | #### Publish the Image 232 | 233 | Our image is now built and running locally, but what about sharing it to be able to use it from anywhere by anyone? 234 | Most importantly we want to be able to reuse this image on the Kubernetes cluster we are going to create in module 2. 
235 | So let's push our image to Docker Hub: 236 | 237 | ```console 238 | docker push ${DOCKER_USERNAME}/tf-mnist 239 | ``` 240 | 241 | If this command doesn't look familiar to you, make sure you went through part 1 and 2 of Docker's tutorial, and more precisely: [Tutorial - Share your image](https://docs.docker.com/get-started/part2/#share-your-image) 242 | 243 | 244 | ### Useful Links 245 | * [What is Docker ?](https://www.docker.com/what-docker) 246 | * [Docker for beginner](https://github.com/docker/labs/blob/master/beginner/readme.md) 247 | 248 | 249 | ## Next Step 250 | [2 - Kubernetes](../2-kubernetes/README.md) 251 | -------------------------------------------------------------------------------- /1-docker/src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.4.0 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] -------------------------------------------------------------------------------- /1-docker/src/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the 'License'); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an 'AS IS' BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """A simple MNIST classifier which displays summaries in TensorBoard. 16 | 17 | This is an unimpressive MNIST model, but it is a good example of using 18 | tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of 19 | naming summary tags so that they are grouped meaningfully in TensorBoard. 20 | 21 | It demonstrates the functionality of every TensorBoard dashboard. 22 | """ 23 | from __future__ import absolute_import 24 | from __future__ import division 25 | from __future__ import print_function 26 | 27 | import argparse 28 | import os 29 | import sys 30 | 31 | import tensorflow as tf 32 | 33 | from tensorflow.examples.tutorials.mnist import input_data 34 | 35 | FLAGS = None 36 | 37 | 38 | def train(): 39 | # Import data 40 | mnist = input_data.read_data_sets(FLAGS.data_dir, 41 | one_hot=True, 42 | fake_data=FLAGS.fake_data) 43 | 44 | # Create a multilayer model. 45 | 46 | # Input placeholders 47 | with tf.name_scope('input'): 48 | x = tf.placeholder(tf.float32, [None, 784], name='x-input') 49 | y_ = tf.placeholder(tf.float32, [None, 10], name='y-input') 50 | 51 | with tf.name_scope('input_reshape'): 52 | image_shaped_input = tf.reshape(x, [-1, 28, 28, 1]) 53 | tf.summary.image('input', image_shaped_input, 10) 54 | 55 | # We can't initialize these variables to 0 - the network will get stuck. 
56 | def weight_variable(shape): 57 | """Create a weight variable with appropriate initialization.""" 58 | initial = tf.truncated_normal(shape, stddev=0.1) 59 | return tf.Variable(initial) 60 | 61 | def bias_variable(shape): 62 | """Create a bias variable with appropriate initialization.""" 63 | initial = tf.constant(0.1, shape=shape) 64 | return tf.Variable(initial) 65 | 66 | def variable_summaries(var): 67 | """Attach a lot of summaries to a Tensor (for TensorBoard visualization).""" 68 | with tf.name_scope('summaries'): 69 | mean = tf.reduce_mean(var) 70 | tf.summary.scalar('mean', mean) 71 | with tf.name_scope('stddev'): 72 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 73 | tf.summary.scalar('stddev', stddev) 74 | tf.summary.scalar('max', tf.reduce_max(var)) 75 | tf.summary.scalar('min', tf.reduce_min(var)) 76 | tf.summary.histogram('histogram', var) 77 | 78 | def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu): 79 | """Reusable code for making a simple neural net layer. 80 | 81 | It does a matrix multiply, bias add, and then uses ReLU to nonlinearize. 82 | It also sets up name scoping so that the resultant graph is easy to read, 83 | and adds a number of summary ops. 84 | """ 85 | # Adding a name scope ensures logical grouping of the layers in the graph. 86 | with tf.name_scope(layer_name): 87 | # This Variable will hold the state of the weights for the layer 88 | with tf.name_scope('weights'): 89 | weights = weight_variable([input_dim, output_dim]) 90 | variable_summaries(weights) 91 | with tf.name_scope('biases'): 92 | biases = bias_variable([output_dim]) 93 | variable_summaries(biases) 94 | with tf.name_scope('Wx_plus_b'): 95 | preactivate = tf.matmul(input_tensor, weights) + biases 96 | tf.summary.histogram('pre_activations', preactivate) 97 | activations = act(preactivate, name='activation') 98 | tf.summary.histogram('activations', activations) 99 | return activations 100 | 101 | hidden1 = nn_layer(x, 784, 500, 'layer1') 102 | 103 | with tf.name_scope('dropout'): 104 | keep_prob = tf.placeholder(tf.float32) 105 | tf.summary.scalar('dropout_keep_probability', keep_prob) 106 | dropped = tf.nn.dropout(hidden1, keep_prob) 107 | 108 | # Do not apply softmax activation yet, see below. 109 | y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity) 110 | 111 | with tf.name_scope('cross_entropy'): 112 | # The raw formulation of cross-entropy, 113 | # 114 | # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)), 115 | # reduction_indices=[1])) 116 | # 117 | # can be numerically unstable. 118 | # 119 | # So here we use tf.nn.softmax_cross_entropy_with_logits on the 120 | # raw outputs of the nn_layer above, and then average across 121 | # the batch. 
122 | diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) 123 | with tf.name_scope('total'): 124 | cross_entropy = tf.reduce_mean(diff) 125 | tf.summary.scalar('cross_entropy', cross_entropy) 126 | 127 | with tf.name_scope('train'): 128 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize( 129 | cross_entropy) 130 | 131 | with tf.name_scope('accuracy'): 132 | with tf.name_scope('correct_prediction'): 133 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 134 | with tf.name_scope('accuracy'): 135 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 136 | tf.summary.scalar('accuracy', accuracy) 137 | 138 | # Merge all the summaries and write them out to 139 | # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default) 140 | merged = tf.summary.merge_all() 141 | 142 | def feed_dict(train): 143 | """Make a TensorFlow feed_dict: maps data onto Tensor placeholders.""" 144 | if train or FLAGS.fake_data: 145 | xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data) 146 | k = FLAGS.dropout 147 | else: 148 | xs, ys = mnist.test.images, mnist.test.labels 149 | k = 1.0 150 | return {x: xs, y_: ys, keep_prob: k} 151 | 152 | sess = tf.InteractiveSession() 153 | train_writer = tf.summary.FileWriter(FLAGS.log_dir + '/train', sess.graph) 154 | test_writer = tf.summary.FileWriter(FLAGS.log_dir + '/test') 155 | tf.global_variables_initializer().run() 156 | # Train the model, and also write summaries. 157 | # Every 10th step, measure test-set accuracy, and write test summaries 158 | # All other steps, run train_step on training data, & add training summaries 159 | 160 | 161 | for i in range(FLAGS.max_steps): 162 | if i % 10 == 0: # Record summaries and test-set accuracy 163 | summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False)) 164 | test_writer.add_summary(summary, i) 165 | print('Accuracy at step %s: %s' % (i, acc)) 166 | else: # Record train set summaries, and train 167 | if i % 100 == 99: # Record execution stats 168 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 169 | run_metadata = tf.RunMetadata() 170 | summary, _ = sess.run([merged, train_step], 171 | feed_dict=feed_dict(True), 172 | options=run_options, 173 | run_metadata=run_metadata) 174 | train_writer.add_run_metadata(run_metadata, 'step%03d' % i) 175 | train_writer.add_summary(summary, i) 176 | print('Adding run metadata for', i) 177 | else: # Record a summary 178 | summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True)) 179 | train_writer.add_summary(summary, i) 180 | train_writer.close() 181 | test_writer.close() 182 | 183 | 184 | def main(_): 185 | train() 186 | 187 | 188 | if __name__ == '__main__': 189 | parser = argparse.ArgumentParser() 190 | parser.add_argument('--fake_data', nargs='?', const=True, type=bool, 191 | default=False, 192 | help='If true, uses fake data for unit testing.') 193 | parser.add_argument('--max_steps', type=int, default=1000, 194 | help='Number of steps to run trainer.') 195 | parser.add_argument('--learning_rate', type=float, default=0.001, 196 | help='Initial learning rate') 197 | parser.add_argument('--dropout', type=float, default=0.9, 198 | help='Keep probability for training dropout.') 199 | parser.add_argument( 200 | '--data_dir', 201 | type=str, 202 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 203 | 'tensorflow/input_data'), 204 | help='Directory for storing input data') 205 | parser.add_argument( 206 | '--log_dir', 207 | type=str, 208 | 
default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 209 | 'tensorflow/logs'), 210 | help='Summaries log directory') 211 | FLAGS, unparsed = parser.parse_known_args() 212 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) -------------------------------------------------------------------------------- /2-kubernetes/README.md: -------------------------------------------------------------------------------- 1 | # Kubernetes 2 | 3 | ### Prerequisites 4 | * [Docker Basics](../1-docker/README.md) 5 | 6 | ### Summary 7 | 8 | In this module you will learn: 9 | * The basic concepts of Kubernetes 10 | * How to create a Kubernetes cluster on Azure 11 | 12 | > *Important* : Kubernetes is very often abbreviated to **K8s**. This is the name we are going to use in this workshop. 13 | 14 | ## The basic concepts of Kubernetes 15 | 16 | [Kubernetes](https://kubernetes.io/) is an open-source technology that makes it easier to automate deployment, scale, and manage containerized applications in a clustered environment. The ability to use GPUs with Kubernetes allows the clusters to facilitate running frequent experimentations, using it for high-performing serving, and auto-scaling of deep learning models, and much more. 17 | 18 | ### Overview 19 | 20 | Kubernetes is a system for managing containerized applications across a cluster of nodes. To work with Kubernetes, you use Kubernetes API objects to describe your cluster’s desired state: what applications or other workloads you want to run, what container images they use, the number of replicas, what network and disk resources you want to make available, and more. You set your desired state by creating objects using the Kubernetes API. Once you’ve set your desired state, the Kubernetes Control Plane works to make the cluster’s current state match the desired state. To do so, Kubernetes performs a variety of tasks automatically, such as starting or restarting containers, scaling the number of replicas of a given application, and more. 21 | 22 | ### Kubernetes Master 23 | 24 | The Kubernetes master is responsible for maintaining the desired state for your cluster. When you interact with Kubernetes, such as by using the kubectl command-line interface, you’re communicating with your cluster’s Kubernetes master. These master services can be installed on a single machine, or distributed across multiple machines. In the following [Provisioning a Kubernetes cluster on Azure](#provisioning-a-kubernetes-cluster-on-azure) section, we will be creating a Kubernetes cluster with 1 master. 25 | 26 | ### Kubernetes Nodes 27 | 28 | The worker nodes communicate with the master components, configure the networking for containers, and run the actual workloads assigned to them. In the following [Provisioning a Kubernetes cluster on Azure](#provisioning-a-kubernetes-cluster-on-azure) section, we will be creating a Kubernetes cluster with 3 worker nodes. 29 | 30 | ### Kubernetes Objects 31 | 32 | Kubernetes contains a number of abstractions that represent the state of your system: deployed containerized applications and workloads, their associated network and disk resources, and other information about what your cluster is doing. A Kubernetes object is a "record of intent" – once you create the object, the Kubernetes system will constantly work to ensure that object exists. By creating an object, you’re telling the Kubernetes system your cluster’s desired state. 
33 | 34 | The basic Kubernetes objects include: 35 | * **Pod** - the smallest and simplest unit in the Kubernetes object model that you create or deploy. A Pod encapsulates an application container (or multiple containers), storage resources, a unique network IP, and options that govern how the container(s) should run. 36 | * **Service** - an abstraction which defines a logical set of Pods and a policy by which to access them. 37 | * **Volume** - an abstraction which allows data to be preserved across container restarts and allows data to be shared between different containers. 38 | * **Namespace** - a way to divide a physical cluster resources into multiple virtual clusters between multiple users. 39 | * **Deployment** - Manages pods and ensures a certain number of them are running. This is typically used to deploy pods that should always be up, such as a web server. 40 | * **Job** - A job creates one or more pods and ensures that a specified number of them successfully terminate. In other words, we use Job to run a task that finishes at some point, such as training a model. 41 | 42 | ### Creating a Kubernetes Object 43 | 44 | When you create an object in Kubernetes, you must provide the object specifications that describes its desired state, as well as some basic information about the object (such as a name) to the Kubernetes API either directly or via the `kubectl` command-line interface. Usually, you will provide the information to `kubectl` in a .yaml file. `kubectl` then converts the information to JSON when making the API request. In the next few sections, we will be using various yaml files to describe the Kubernetes objects we want to deploy to our Kubernetes cluster. 45 | 46 | For example, the `.yaml` file shown below includes the required fields and object spec for a Kubernetes Deployment. A Kubernetes Deployment is an object that can represent an application running on your cluster. In the example below, the Deployment spec describes the desired state of three replicas of the nginx application to be running. When you create the Deployment, the Kubernetes system reads the Deployment spec and starts three instances of your desired application, updating the status to match your spec. 47 | 48 | ```yaml 49 | apiVersion: apps/v1beta2 # Kubernetes API version for the object 50 | kind: Deployment # The type of object described by this YAML, here a Deployment 51 | metadata: 52 | name: nginx-deployment # Name of the deployment 53 | spec: # Actual specifications of this deployment 54 | replicas: 3 # Number of replicas (instances) for this deployment. 1 replica = 1 pod 55 | template: 56 | metadata: 57 | labels: 58 | app: nginx 59 | spec: # Specification for the Pod 60 | containers: # These are the containers running inside our Pod, in our case a single one 61 | - name: nginx # Name of this container 62 | image: nginx:1.7.9 # Image to run 63 | ports: # Ports to expose 64 | - containerPort: 80 65 | ``` 66 | 67 | To create all the objects described in a Deployment using a `.yaml` file like the one above in your own Kubernetes cluster you can use Kubernetes' CLI (`kubectl`). 68 | We will be creating a deployment in the exercise toward the end of this module, but first we need a cluster. 
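For reference, once a spec like the one above is saved to a file, this is how you would create and inspect the resulting objects with `kubectl` (the filename here is just an illustrative choice):

```console
kubectl create -f nginx-deployment.yaml   # create the Deployment described in the file
kubectl get deployments                   # check the Deployment and how many replicas are ready
kubectl get pods                          # list the three nginx Pods the Deployment created
```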
69 | 70 | ## Provisioning a Kubernetes cluster on Azure 71 | 72 | There are multiple ways to provision a Kubernetes (K8s) cluster on Azure: 73 | * ACS 74 | * AKS 75 | * acs-engine 76 | 77 | AKS is currently still in preview and acs-engine is a bit more complex to set up, so we advise you to create your cluster using ACS. 78 | 79 | We are going to create a Linux-based K8s cluster. 80 | You can either create the cluster using the portal, or using the Azure CLI (`az`). 81 | 82 | ### A Note on GPUs with Kubernetes 83 | 84 | As of this writing, GPUs are still in preview with ACS. 85 | You can deploy an ACS cluster with GPU VMs (such as `Standard_NC6`) in `westus2` or `uksouth`, but you should be aware of some pitfalls: 86 | * Deploying a GPU cluster takes longer than a CPU cluster (about 10-15 minutes more) because the NVIDIA drivers need to be installed as well. 87 | * Since this is a preview, you might hit capacity issues if the location you chose does not have enough GPUs available to accommodate you. 88 | 89 | **Unless you are already pretty familiar with docker and Kubernetes, we recommend that you create a cluster with CPU VMs to save some time.** 90 | Only module 4 has an exercise that is specific to GPU VMs; all other modules can be followed on either CPU or GPU clusters. 91 | 92 | ### With the CLI 93 | 94 | #### Creating a resource group 95 | ```console 96 | az group create --name <RESOURCE_GROUP_NAME> --location <LOCATION> 97 | ``` 98 | 99 | With: 100 | 101 | | Parameter | Description | 102 | | --- | --- | 103 | | RESOURCE_GROUP_NAME | Name of the resource group where the cluster will be deployed. | 104 | | LOCATION | Name of the region where the cluster should be deployed. | 105 | 106 | #### Creating the cluster 107 | ```console 108 | az acs create --agent-vm-size <AGENT_SIZE> --resource-group <RG> --name <NAME> \ 109 | --orchestrator-type Kubernetes --agent-count <AGENT_COUNT> \ 110 | --location <LOCATION> --generate-ssh-keys 111 | ``` 112 | 113 | With: 114 | 115 | | Parameter | Description | 116 | | --- | --- | 117 | | AGENT_SIZE | The size of K8s's agent VM. `Standard_D2_v2` is enough for this workshop. | 118 | | RG | Name of the resource group that was created in the previous step. | 119 | | NAME | Name of the ACS resource (can be whatever you want). | 120 | | AGENT_COUNT | The number of agents (virtual machines) that you want in your cluster. 2 or 3 is recommended to play with hyper-parameter tuning and distributed TensorFlow | 121 | | LOCATION | Same location that was specified for the resource group creation. | 122 | 123 | The command should take a few minutes to complete (longer if you chose GPU VMs). Once it is done, the output should be a JSON object indicating among other things the `provisioningState`: 124 | ``` 125 | { 126 | [...] 127 | "provisioningState": "Succeeded", 128 | [...] 129 | } 130 | ``` 131 | 132 | #### Getting the `kubeconfig` file 133 | 134 | The `kubeconfig` file is a configuration file that will allow Kubernetes' CLI (`kubectl`) to know how to talk to our cluster. 135 | To download the `kubeconfig` file from the cluster we just created, run: 136 | 137 | ```console 138 | az acs kubernetes get-credentials --name <NAME> --resource-group <RG> 139 | ``` 140 | 141 | Where `NAME` and `RG` should be the same values as for the cluster creation. 
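For example, with illustrative values (a resource group named `k8s-workshop` and a cluster named `k8s-ml`, both hypothetical names), the whole sequence could look like this:

```console
# create the resource group
az group create --name k8s-workshop --location westus2
# create a 3-agent Kubernetes cluster with small CPU VMs
az acs create --agent-vm-size Standard_D2_v2 --resource-group k8s-workshop --name k8s-ml \
    --orchestrator-type Kubernetes --agent-count 3 --location westus2 --generate-ssh-keys
# download the kubeconfig file so kubectl can talk to the cluster
az acs kubernetes get-credentials --name k8s-ml --resource-group k8s-workshop
```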
142 | 143 | ##### Validation 144 | 145 | Once you are done with the cluster creation and have downloaded the `kubeconfig` file, running the following command: 146 | 147 | ```console 148 | kubectl get nodes 149 | ``` 150 | 151 | Should yield an output similar to this one: 152 | ``` 153 | NAME STATUS AGE VERSION 154 | k8s-agent-ef2b999d-0 Ready 9d v1.7.7 155 | k8s-agent-ef2b999d-1 Ready 9d v1.7.7 156 | k8s-agent-ef2b999d-2 Ready 9d v1.7.7 157 | k8s-master-ef2b999d-0 Ready 9d v1.7.7 158 | ``` 159 | 160 | If you provisioned GPU VMs, describing one of the nodes should indicate the presence of GPU(s) on the node: 161 | ```console 162 | > kubectl describe node <NODE_NAME> 163 | 164 | [...] 165 | Capacity: 166 | alpha.kubernetes.io/nvidia-gpu: 1 167 | [...] 168 | ``` 169 | 170 | ## Exercise 171 | 172 | ### Running our Model on Kubernetes 173 | 174 | > Note: If you didn't complete the exercise in module 1, you can use the `wbuchwalter/tf-mnist` image for this exercise. 175 | 176 | In module 1, we created an image for our MNIST classifier, ran a small training locally and pushed this image to Docker Hub. 177 | Since we now have a running Kubernetes cluster, let's run our training on it! 178 | 179 | First, we need to create a YAML template to define what we want to deploy. 180 | We want our deployment to have a few characteristics: 181 | * It should be a `Job` since we expect the training to finish successfully after some time. 182 | * It should run the image you created in module 1 (or `wbuchwalter/tf-mnist` if you skipped this module). 183 | * The `Job` should be named `module2-ex1`. 184 | * We want our training to run for `500` steps. 185 | 186 | Here is what this would look like in YAML format: 187 | 188 | ```yaml 189 | apiVersion: batch/v1 190 | kind: Job # Our training should be a Job since it is supposed to terminate at some point 191 | metadata: 192 | name: module2-ex1 # Name of our job 193 | spec: 194 | template: # Template of the Pod that is going to be run by the Job 195 | metadata: 196 | name: module2-ex1 # Name of the pod 197 | spec: 198 | containers: # List of containers that should run inside the pod, in our case there is only one. 199 | - name: tensorflow 200 | image: ${DOCKER_USERNAME}/tf-mnist # The image to run, you can replace it with your own. 201 | args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile 202 | restartPolicy: OnFailure # restart the pod if it fails 203 | ``` 204 | 205 | Save this template somewhere and deploy it with: 206 | 207 | ```console 208 | kubectl create -f <template-path> 209 | ``` 210 | 211 | #### Validation 212 | 213 | After deploying the template, the following command: 214 | 215 | ```console 216 | kubectl get job 217 | ``` 218 | 219 | Should show your new job: 220 | 221 | ```bash 222 | NAME DESIRED SUCCESSFUL AGE 223 | module2-ex1 1 0 1m 224 | ``` 225 | 226 | Looking at the Pods: 227 | ```console 228 | kubectl get pods 229 | ``` 230 | You should see your training running: 231 | ```bash 232 | NAME READY STATUS RESTARTS AGE 233 | module2-ex1-c5b8q 1/1 Running 0 1m 234 | ``` 235 | 236 | Finally, you can look at the logs of your pod with: 237 | 238 | ```console 239 | kubectl logs <pod-name> 240 | ``` 241 | > Be careful to use the Pod name (from `kubectl get pods`) not the Job name. 
242 | 243 | And you should see the training happening 244 | 245 | ```bash 246 | 2017-11-29 21:49:16.462292: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX 247 | Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes. 248 | Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz 249 | Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. 250 | Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz 251 | Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. 252 | Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz 253 | Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. 254 | Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz 255 | Accuracy at step 0: 0.1285 256 | Accuracy at step 10: 0.674 257 | Accuracy at step 20: 0.8065 258 | Accuracy at step 30: 0.8606 259 | Accuracy at step 40: 0.8759 260 | Accuracy at step 50: 0.888 261 | [...] 262 | ``` 263 | 264 | After a few minutes, looking again at the Job should show that it has completed successfully: 265 | ```console 266 | kubectl get job 267 | ``` 268 | 269 | ```bash 270 | NAME DESIRED SUCCESSFUL AGE 271 | module2-ex1 1 1 3m 272 | ``` 273 | 274 | ## Next Step 275 | 276 | Currently our training doesn't do anything interesting. We are not even saving the model and summaries anywhere, but don't worry we are going to dive into this starting in Module 4. 277 | 278 | [Module 3: Helm](../3-helm/README.md) 279 | -------------------------------------------------------------------------------- /3-helm/README.md: -------------------------------------------------------------------------------- 1 | # Helm 2 | 3 | ## Prerequisites 4 | 5 | * [Docker Basics](../1-docker/README.md) 6 | * [Kubernetes Basics and cluster created](../2-kubernetes) 7 | 8 | ## Summary 9 | 10 | In this module you will learn : 11 | * What is Helm and how to use it 12 | * What is a Chart and how to create one 13 | 14 | ## Context 15 | 16 | As you saw in the second module [Kubernetes Basics and cluster created](../2-kubernetes), the default way to deploy objects in Kubernetes is by using `kubectl` with `yaml` files. 17 | 18 | For example, if we want to deploy a `pod` running `nginx` and then make it available from an external IP using a `service` you will need to describe at least these two objects such as : 19 | 20 | Deployment : 21 | ```yaml 22 | apiVersion: apps/v1beta2 # for versions before 1.8.0 use apps/v1beta1 23 | kind: Deployment 24 | metadata: 25 | name: nginx-deployment 26 | spec: 27 | selector: 28 | matchLabels: 29 | app: nginx 30 | replicas: 2 # tells deployment to run 2 pods matching the template 31 | template: # create pods using pod definition in this template 32 | metadata: 33 | labels: 34 | app: nginx 35 | spec: 36 | containers: 37 | - name: nginx 38 | image: nginx:1.7.9 39 | ports: 40 | - containerPort: 80 41 | ``` 42 | Service : 43 | ```yaml 44 | apiVersion: v1 45 | kind: Service 46 | metadata: 47 | name: nginx-service 48 | spec: 49 | ports: 50 | - port: 8000 51 | targetPort: 80 52 | protocol: TCP 53 | type: LoadBalancer 54 | selector: 55 | app: nginx 56 | ``` 57 | 58 | The problem with this approach is that when you need to make an update to your solution, you will need to update it across different yaml files. 59 | 60 | Let's say you want to change the name of the app from `nginx` to `nginx-production`. 
You have to change it in a few places in the deployment and not forget to change the selector setting in the service as well. 61 | 62 | This is one example among others where Helm is fixing the issue by being able to create and use templates. 63 | 64 | ## Helm and Chart 65 | 66 | Helm is the [package manager for Kubernetes](https://deis.com/blog/2016/trusting-whos-at-the-helm/). 67 | 68 | A package is named a **Chart**. 69 | 70 | You can either create you own, or pull and install an official one such as Wordpress, GitLab, Apache Spark, etc... 71 | 72 | You can find a list of the official ones here : [https://github.com/kubernetes/charts/tree/master/stable](https://github.com/kubernetes/charts/tree/master/stable) 73 | 74 | To use Helm, you need to have the [CLI installed on your machine](https://github.com/kubernetes/helm/blob/master/docs/install.md) 75 | 76 | Let's try to deploy an official Chart such as the popular [Wordpress](https://github.com/kubernetes/charts/tree/master/stable/wordpress) 77 | 78 | We'll need to initialize helm first, with this command: 79 | 80 | ```bash 81 | helm init 82 | ``` 83 | 84 | Which should return something similar to: 85 | ```bash 86 | Adding stable repo with URL: https://kubernetes-charts.storage.googleapis.com 87 | Adding local repo with URL: http://127.0.0.1:8879/charts 88 | $HELM_HOME has been configured at /Users/YOURUSER/.helm. 89 | ``` 90 | 91 | ```bash 92 | helm install stable/wordpress 93 | ``` 94 | 95 | > Note: If you have an error such has `Error: incompatible versions client[v2.7.0] server[v2.6.2]`, please run `helm init --upgrade` 96 | 97 | After a few seconds you should see the following output in your terminal : 98 | 99 | ```bash 100 | ... 101 | NAME: cloying-crocodile 102 | LAST DEPLOYED: Wed Nov 22 11:29:55 2017 103 | NAMESPACE: default 104 | STATUS: DEPLOYED 105 | 106 | RESOURCES: 107 | ==> v1beta1/Deployment 108 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 109 | cloying-crocodile-mariadb 1 1 1 0 10s 110 | cloying-crocodile-wordpress 1 1 1 0 10s 111 | 112 | ==> v1/Pod(related) 113 | NAME READY STATUS RESTARTS AGE 114 | cloying-crocodile-mariadb-1648957417-0prvc 0/1 Pending 0 10s 115 | cloying-crocodile-wordpress-3958361718-z9qr3 0/1 Pending 0 10s 116 | 117 | ==> v1/Secret 118 | NAME TYPE DATA AGE 119 | cloying-crocodile-mariadb Opaque 2 10s 120 | cloying-crocodile-wordpress Opaque 2 10s 121 | 122 | ==> v1/ConfigMap 123 | NAME DATA AGE 124 | cloying-crocodile-mariadb 1 10s 125 | 126 | ==> v1/PersistentVolumeClaim 127 | NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE 128 | cloying-crocodile-mariadb Pending default 10s 129 | cloying-crocodile-wordpress Pending default 10s 130 | 131 | ==> v1/Service 132 | NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 133 | cloying-crocodile-mariadb ClusterIP 10.0.163.26 3306/TCP 10s 134 | cloying-crocodile-wordpress LoadBalancer 10.0.168.104 80:31549/TCP,443:32728/TCP 10s 135 | ... 136 | ``` 137 | 138 | You can see all the objects that are necessary to run our Wordpress application in your Kubernetes cluster deployed such as **pods**, **services**, **secrets** etc... Furthermore, since we need a MariaDB engine to run Wordpress, Helm did also automatically deploy it as a dependency in the cluster ! 139 | 140 | As you can see inside the [Wordpress's Chart documentation](https://github.com/kubernetes/charts/tree/master/stable/wordpress) you can override some values such as the image, the database name or the SMTP server for example. 
141 | 142 | You just have to use the `--set` option during the `install` command, like so : 143 | 144 | ```bash 145 | helm install --name my-wordpress \ 146 | --set wordpressUsername=admin,wordpressPassword=password,mariadb.mariadbRootPassword=secretpassword \ 147 | stable/wordpress 148 | ``` 149 | 150 | ## Create your own Chart 151 | 152 | You can also create your own Chart by using the scaffolding command `helm create mychart` 153 | 154 | This will create a folder which includes all the files necessary to create your own package : 155 | 156 | ```bash 157 | ├── Chart.yaml 158 | ├── templates 159 | │   ├── NOTES.txt 160 | │   ├── _helpers.tpl 161 | │   ├── deployment.yaml 162 | │   ├── ingress.yaml 163 | │   └── service.yaml 164 | └── values.yaml 165 | ``` 166 | 167 | All the objects that you want to deploy are stored inside the templates folder in different .yaml files. 168 | 169 | You can find more information on how to create your own chart here : [https://deis.com/blog/2016/getting-started-authoring-helm-charts/](https://deis.com/blog/2016/getting-started-authoring-helm-charts/) 170 | 171 | When you are done with your package, Helm provides a linting tool `helm lint mychart` to help you find issues in it. 172 | 173 | If you want to deploy it into your cluster, you can run the following command from the repository folder: 174 | 175 | ```bash 176 | helm install . --name my-custom-chart 177 | ``` 178 | 179 | ## Exercises 180 | 181 | ### Exercise 1 - Deploy an official Chart : DokuWiki 182 | 183 | From the [official Chart repository](https://github.com/kubernetes/charts/tree/master) you have to deploy a DokuWiki environment. 184 | 185 | [DokuWiki](https://www.dokuwiki.org/) is a standards-compliant, simple to use wiki optimized for creating documentation. It is targeted at developer teams, workgroups, and small companies. All data is stored in plain text files, so no database is required. 186 | 187 | #### Validation 188 | 189 | We want to be able to define a custom Wiki name such as `Hello MLADS` at the deployment. 190 | 191 | You should see the following web page from your deployment : 192 | 193 | ![](dokuwiki.png) 194 | 195 | #### Solution 196 | 197 |
198 | <details><summary>Solution (expand to see)</summary> 199 |

200 | 201 | ```bash 202 | helm install stable/dokuwiki --set dokuwikiWikiName="Hello MLADS" 203 | ``` 204 | 205 |

206 | </details>
207 | 208 | 209 | ## Next Step 210 | 211 | [4 - GPUs](../4-gpus/README.md) -------------------------------------------------------------------------------- /3-helm/dokuwiki.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/3-helm/dokuwiki.png -------------------------------------------------------------------------------- /4-gpus/README.md: -------------------------------------------------------------------------------- 1 | # GPUs And Kubernetes 2 | 3 | ## Prerequisites 4 | * [1 - Docker Basics](../1-docker) 5 | * [2 - Kubernetes Basics and cluster created](../2-kubernetes) 6 | 7 | ## Summary 8 | 9 | In this module you will learn how to: 10 | * Create a Pod that uses a GPU, by: 11 | * Requesting a GPU 12 | * Mounting the NVIDIA drivers into the container 13 | 14 | 15 | ## Important Note 16 | 17 | If you created a cluster with CPU VMs only you won't be able to complete the exercises in this module, but it still contains valuable information that you should read through nonetheless. 18 | 19 | ## How GPUs work with Kubernetes 20 | 21 | GPU support in K8s is still in its early stages, and as such requires a bit of effort on your part to use. 22 | 23 | While you don't need to do anything to access a CPU from inside your container (except optionally specifying a CPU request and limit), getting access to the agent's GPU is a little bit more tricky: 24 | * First, the drivers need to be installed on the agent, otherwise this agent will not report the presence of a GPU, and you won't be able to use it (this is already done for you in ACS/AKS/acs-engine). 25 | * Then you need to explicitly ask for 1 or multiple GPU(s) to be mounted into your container, otherwise you will simply not be able to access the GPU, even if it is running on a GPU agent. 26 | * Finally, and most importantly, you need to mount the drivers from the agent VM into your container. 27 | 28 | In Module 5, we will see how this process can be greatly simplified when using TensorFlow with `TFJob`, but for now, let's do it ourselves. 29 | 30 | 31 | ### Creating a container that can benefit from GPU 32 | 33 | As a prerequisite for everything else, it is important to make sure that the container we are going to use actually knows what to do with a GPU. 34 | For example, TensorFlow needs to be installed with GPU support. CUDA and cuDNN also need to be present. 35 | Thankfully, most deep learning frameworks provide base images that are ready to use with GPU support, so we can use them as our base image. 36 | 37 | For example, TensorFlow has a lot of different images ready to use at [https://hub.docker.com/r/tensorflow/tensorflow/tags/](https://hub.docker.com/r/tensorflow/tensorflow/tags/), such as: 38 | * `tensorflow/tensorflow:1.4.0-gpu-py3` for GPU 39 | * `tensorflow/tensorflow:1.4.0-py3` for CPU only 40 | 41 | CNTK also has pre-built images with or without GPU support ([https://hub.docker.com/r/microsoft/cntk/tags/](https://hub.docker.com/r/microsoft/cntk/tags/)): 42 | * `microsoft/cntk:2.2-gpu-python3.5-cuda8.0-cudnn6.0` for GPU 43 | * `microsoft/cntk:2.2-python3.5` for CPU only 44 | 45 | Also, what's important to note is that most deep learning framework images are built on top of the official [nvidia/cuda](https://hub.docker.com/r/nvidia/cuda/) image, which already comes with CUDA and cuDNN preinstalled, so you don't need to worry about installing them. 
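As a concrete illustration, a GPU-ready variant of the module 1 `Dockerfile` only differs from the CPU one by its base image (this is the same change shown at the end of module 1):

```dockerfile
# GPU-enabled base image; CUDA and cuDNN are already included
FROM tensorflow/tensorflow:1.4.0-gpu
COPY main.py /app/main.py

ENTRYPOINT ["python", "/app/main.py"]
```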
46 | 47 | 48 | ### Requesting GPU(s) 49 | 50 | K8s has a concept of resource `requests` and `limits` allowing you to specify how much CPU, RAM and GPU should be reserved for a specific container. 51 | By default, if no `limits` are specified for CPU or RAM on a container, K8s will schedule it on any node and run the container with unbounded CPU and memory limits. 52 | 53 | > *To learn more about K8s `requests` and `limits`, see [Managing Compute Resources for Containers](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/).* 54 | 55 | However, things are different for GPUs. If no `limit` is defined for GPU, K8s will run the pod on any node (with or without GPU), and will not expose the GPU even if the node has one. So you need to explicitly set the `limit` to the exact number of GPUs that should be assigned to your container. 56 | Also, note that while you can request a fraction of a CPU, you cannot request a fraction of a GPU. One GPU can thus only be assigned to one container at a time. 57 | The name for the GPU resource in K8s is `alpha.kubernetes.io/nvidia-gpu` for versions `1.8` and below and `nvidia.com/gpu` for versions > `1.9`. Note that currently only NVIDIA GPUs are supported. 58 | 59 | To set the `limit` for GPU, you should provide a value to `spec.containers[].resources.limits.alpha.kubernetes.io/nvidia-gpu`; in YAML this would look like: 60 | 61 | ```yaml 62 | [...] 63 | containers: 64 | - name: tensorflow 65 | image: tensorflow/tensorflow:latest-gpu 66 | resources: 67 | limits: 68 | alpha.kubernetes.io/nvidia-gpu: 1 69 | [...] 70 | ``` 71 | 72 | ### Exposing the node's drivers into the container 73 | 74 | Now for the tricky part. 75 | As stated earlier, the NVIDIA drivers need to be exposed (mounted) from the node into the container. This is a bit tricky since the location of the drivers can vary depending on the operating system of the node, as well as on how the drivers were installed. 76 | For ACS/AKS/acs-engine only Ubuntu nodes are supported so far, so it should be a consistent experience as long as your cluster was created with one of them. 77 | 78 | ##### Driver locations on the node 79 | 80 | | Path | Purpose | 81 | |----|----| 82 | |`/usr/lib/nvidia-384` | NVIDIA libraries | 83 | |`/usr/lib/nvidia-384/bin`| NVIDIA binaries | 84 | |`/usr/lib/x86_64-linux-gnu/libcuda.so.1` | CUDA Driver API library | 85 | 86 | > Note that the NVIDIA driver's version is `384` at the time of this writing, but the driver's location will change as the version changes. 87 | 88 | For each of the above paths we need to create a corresponding `Volume` and a `VolumeMount` to expose them into our container. 89 | 90 | > To understand how to configure `Volumes` and `VolumeMounts`, take a look at [Volumes](https://kubernetes.io/docs/user-guide/walkthrough/#volumes) in the Kubernetes documentation. 91 | 92 | ## Exercises 93 | 94 | ### 1. NVIDIA-SMI 95 | In this first exercise we are simply going to schedule a `Job` that will run `nvidia-smi`, print details about our GPU from inside the container, and exit. 96 | You don't need to build a custom image; instead, simply use the official `nvidia/cuda` Docker image. 
97 | 98 | Your K8s YAML template should have the following characteristics: 99 | * It should be a `Job` 100 | * It should be named `module4-ex1` 101 | * It should request 1 GPU 102 | * It should mount the drivers from the node into the container 103 | * It should run the `nvidia-smi` executable 104 | 105 | #### Useful Links 106 | * [Microsoft Azure Container Service Engine - Using GPUs with Kubernetes](https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/gpu.md) 107 | 108 | #### Validation 109 | 110 | Once you have created your Job with `kubectl create -f <template-path>`: 111 | 112 | ```console 113 | kubectl get pods -a 114 | ``` 115 | The `-a` argument tells K8s to also report pods that have already completed. Since the container exits as soon as nvidia-smi finishes executing, it might already be completed by the time you execute the command. 116 | 117 | ```bash 118 | NAME READY STATUS RESTARTS AGE 119 | module4-ex1-p40vx 0/1 Completed 0 20s 120 | ``` 121 | 122 | Let's look at the logs of our pod: 123 | 124 | ```console 125 | kubectl logs <pod-name> 126 | ``` 127 | ```bash 128 | Wed Nov 29 23:43:03 2017 129 | +-----------------------------------------------------------------------------+ 130 | | NVIDIA-SMI 384.98 Driver Version: 384.98 | 131 | |-------------------------------+----------------------+----------------------+ 132 | | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | 133 | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | 134 | |===============================+======================+======================| 135 | | 0 Tesla K80 Off | 0000E322:00:00.0 Off | 0 | 136 | | N/A 39C P0 70W / 149W | 0MiB / 11439MiB | 0% Default | 137 | +-------------------------------+----------------------+----------------------+ 138 | 139 | +-----------------------------------------------------------------------------+ 140 | | Processes: GPU Memory | 141 | | GPU PID Type Process name Usage | 142 | |=============================================================================| 143 | | No running processes found | 144 | +-----------------------------------------------------------------------------+ 145 | ``` 146 | We can see that `nvidia-smi` has successfully detected a Tesla K80 with driver version `384.98`. 147 | 148 | #### Solution 149 | 150 |
151 | Solution (expand to see) 152 |

153 | 154 | ```yaml 155 | apiVersion: batch/v1 156 | kind: Job # We want a Job 157 | metadata: 158 | name: 4-nvidia-smi 159 | spec: 160 | template: 161 | metadata: 162 | name: module4-ex1 163 | spec: 164 | restartPolicy: Never 165 | volumes: # Where the NVIDIA driver libraries and binaries are located on the host (note that libcuda is not needed to run nvidia-smi) 166 | - name: bin 167 | hostPath: 168 | path: /usr/lib/nvidia-384/bin 169 | - name: lib 170 | hostPath: 171 | path: /usr/lib/nvidia-384 172 | containers: 173 | - name: nvidia-smi 174 | image: nvidia/cuda # Which image to run 175 | command: 176 | - nvidia-smi 177 | resources: 178 | limits: 179 | alpha.kubernetes.io/nvidia-gpu: 1 # Requesting 1 GPU 180 | volumeMounts: # Where the NVIDIA driver libraries and binaries should be mounted inside our container 181 | - name: bin 182 | mountPath: /usr/local/nvidia/bin 183 | - name: lib 184 | mountPath: /usr/local/nvidia/lib64 185 | ``` 186 |

187 |
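If your pod stays stuck in `Pending` instead of reaching `Completed`, the scheduler most likely could not find a node with a free GPU, and the events reported by `kubectl describe` usually tell you why. This is only a sketch: substitute the pod name reported by `kubectl get pods -a`, and use whatever job name you chose (the one below matches the solution above).

```console
# Look for scheduling events such as "Insufficient alpha.kubernetes.io/nvidia-gpu"
kubectl describe pod <pod-name>

# Clean up the job once you are done with it
kubectl delete job 4-nvidia-smi
```
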
188 | 
189 | ### 2. Running TensorFlow with GPU
190 | 
191 | In modules 1 and 2, we first created a Docker image for our MNIST classifier and then ran a training on Kubernetes.
192 | However, this training only used CPU. Let's make things much faster by accelerating our training with a GPU.
193 | 
194 | You'll find the code and the `Dockerfile` under [`./src`](./src).
195 | 
196 | For this exercise, your tasks are to:
197 | * Modify our `Dockerfile` to use a base image compatible with GPU, such as `tensorflow/tensorflow:1.4.0-gpu`
198 | * Build and push this new image under a new tag, such as `${DOCKER_USERNAME}/tf-mnist:gpu`
199 | * Modify the [template we built in module 2](../2-kubernetes/training.yaml) to add a GPU `limit` and mount the driver libraries.
200 | * Deploy this new template.
201 | 
202 | ### Validation
203 | 
204 | Once you have deployed your template, take a look at the logs of your pod:
205 | 
206 | ```console
207 | kubectl logs <pod-name>
208 | ```
209 | You should see that your GPU is correctly detected and used by TensorFlow (`[...] Found device 0 with properties: name: Tesla K80 [...]`):
210 | 
211 | ```bash
212 | 2017-11-30 00:59:54.053227: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
213 | 2017-11-30 01:00:03.274198: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
214 | name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
215 | pciBusID: b2de:00:00.0
216 | totalMemory: 11.17GiB freeMemory: 11.10GiB
217 | 2017-11-30 01:00:03.274238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: b2de:00:00.0, compute capability: 3.7)
218 | 2017-11-30 01:00:08.000884: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
219 | Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
220 | Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
221 | Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
222 | Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
223 | Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
224 | Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
225 | Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
226 | Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
227 | Accuracy at step 0: 0.1245
228 | Accuracy at step 10: 0.6664
229 | Accuracy at step 20: 0.8227
230 | Accuracy at step 30: 0.8657
231 | Accuracy at step 40: 0.8815
232 | Accuracy at step 50: 0.892
233 | Accuracy at step 60: 0.9068
234 | [...]
235 | ```
236 | 
237 | 
238 | ### Solution
239 | 
240 | 
241 | 
242 | Solution (expand to see) 243 |

244 | 245 | First we need to modify the `Dockerfile`. 246 | We just need to change the tag of the TensorFlow base image to be one that support GPU: 247 | 248 | ```dockerfile 249 | FROM tensorflow/tensorflow:1.4.0-gpu 250 | COPY main.py /app/main.py 251 | 252 | ENTRYPOINT ["python", "/app/main.py"] 253 | ``` 254 | 255 | Then we can create our Job template: 256 | 257 | ```yaml 258 | apiVersion: batch/v1 259 | kind: Job # Our training should be a Job since it is supposed to terminate at some point 260 | metadata: 261 | name: module4-ex2 # Name of our job 262 | spec: 263 | template: # Template of the Pod that is going to be run by the Job 264 | metadata: 265 | name: mnist-pod # Name of the pod 266 | spec: 267 | containers: # List of containers that should run inside the pod, in our case there is only one. 268 | - name: tensorflow 269 | image: wbuchwalter/tf-mnist:gpu # The image to run, you can replace by your own. 270 | args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile 271 | resources: 272 | limits: 273 | alpha.kubernetes.io/nvidia-gpu: 1 274 | volumeMounts: # Where the drivers should be mounted in the container 275 | - name: lib 276 | mountPath: /usr/local/nvidia/lib64 277 | - name: libcuda 278 | mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1 279 | restartPolicy: OnFailure 280 | volumes: # Where the drivers are located on the node 281 | - name: lib 282 | hostPath: 283 | path: /usr/lib/nvidia-384 284 | - name: libcuda 285 | hostPath: 286 | path: /usr/lib/x86_64-linux-gnu/libcuda.so.1 287 | ``` 288 | 289 | And deploy it with 290 | 291 | ```console 292 | kubectl create -f 293 | ``` 294 | 295 |

296 |
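Before moving on, it can also be useful to confirm that your nodes actually advertise their GPU to Kubernetes; if they don't, the jobs from both exercises will never be scheduled. A quick sketch (the `<node-name>` placeholder is whatever `kubectl get nodes` returns, and the exact resource name depends on your Kubernetes version, as discussed earlier):

```console
kubectl get nodes
kubectl describe node <node-name> | grep -i -A 6 "Capacity"
```

The `Capacity` and `Allocatable` sections should list a non-zero `alpha.kubernetes.io/nvidia-gpu` (or `nvidia.com/gpu`) entry.
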
297 | 298 | ## Next Step 299 | [5 - TFJob](../5-tfjob/README.md) 300 | -------------------------------------------------------------------------------- /4-gpus/src/Dockerfile: -------------------------------------------------------------------------------- 1 | # Change this image to one that supports GPU 2 | FROM tensorflow/tensorflow:1.4.0 3 | COPY main.py /app/main.py 4 | 5 | ENTRYPOINT ["python", "/app/main.py"] -------------------------------------------------------------------------------- /4-gpus/src/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the 'License'); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an 'AS IS' BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """A simple MNIST classifier which displays summaries in TensorBoard. 16 | 17 | This is an unimpressive MNIST model, but it is a good example of using 18 | tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of 19 | naming summary tags so that they are grouped meaningfully in TensorBoard. 20 | 21 | It demonstrates the functionality of every TensorBoard dashboard. 22 | """ 23 | from __future__ import absolute_import 24 | from __future__ import division 25 | from __future__ import print_function 26 | 27 | import argparse 28 | import os 29 | import sys 30 | 31 | import tensorflow as tf 32 | 33 | from tensorflow.examples.tutorials.mnist import input_data 34 | 35 | FLAGS = None 36 | 37 | 38 | def train(): 39 | # Import data 40 | mnist = input_data.read_data_sets(FLAGS.data_dir, 41 | one_hot=True, 42 | fake_data=FLAGS.fake_data) 43 | 44 | # Create a multilayer model. 45 | 46 | # Input placeholders 47 | with tf.name_scope('input'): 48 | x = tf.placeholder(tf.float32, [None, 784], name='x-input') 49 | y_ = tf.placeholder(tf.float32, [None, 10], name='y-input') 50 | 51 | with tf.name_scope('input_reshape'): 52 | image_shaped_input = tf.reshape(x, [-1, 28, 28, 1]) 53 | tf.summary.image('input', image_shaped_input, 10) 54 | 55 | # We can't initialize these variables to 0 - the network will get stuck. 
56 | def weight_variable(shape): 57 | """Create a weight variable with appropriate initialization.""" 58 | initial = tf.truncated_normal(shape, stddev=0.1) 59 | return tf.Variable(initial) 60 | 61 | def bias_variable(shape): 62 | """Create a bias variable with appropriate initialization.""" 63 | initial = tf.constant(0.1, shape=shape) 64 | return tf.Variable(initial) 65 | 66 | def variable_summaries(var): 67 | """Attach a lot of summaries to a Tensor (for TensorBoard visualization).""" 68 | with tf.name_scope('summaries'): 69 | mean = tf.reduce_mean(var) 70 | tf.summary.scalar('mean', mean) 71 | with tf.name_scope('stddev'): 72 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 73 | tf.summary.scalar('stddev', stddev) 74 | tf.summary.scalar('max', tf.reduce_max(var)) 75 | tf.summary.scalar('min', tf.reduce_min(var)) 76 | tf.summary.histogram('histogram', var) 77 | 78 | def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu): 79 | """Reusable code for making a simple neural net layer. 80 | 81 | It does a matrix multiply, bias add, and then uses ReLU to nonlinearize. 82 | It also sets up name scoping so that the resultant graph is easy to read, 83 | and adds a number of summary ops. 84 | """ 85 | # Adding a name scope ensures logical grouping of the layers in the graph. 86 | with tf.name_scope(layer_name): 87 | # This Variable will hold the state of the weights for the layer 88 | with tf.name_scope('weights'): 89 | weights = weight_variable([input_dim, output_dim]) 90 | variable_summaries(weights) 91 | with tf.name_scope('biases'): 92 | biases = bias_variable([output_dim]) 93 | variable_summaries(biases) 94 | with tf.name_scope('Wx_plus_b'): 95 | preactivate = tf.matmul(input_tensor, weights) + biases 96 | tf.summary.histogram('pre_activations', preactivate) 97 | activations = act(preactivate, name='activation') 98 | tf.summary.histogram('activations', activations) 99 | return activations 100 | 101 | hidden1 = nn_layer(x, 784, 500, 'layer1') 102 | 103 | with tf.name_scope('dropout'): 104 | keep_prob = tf.placeholder(tf.float32) 105 | tf.summary.scalar('dropout_keep_probability', keep_prob) 106 | dropped = tf.nn.dropout(hidden1, keep_prob) 107 | 108 | # Do not apply softmax activation yet, see below. 109 | y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity) 110 | 111 | with tf.name_scope('cross_entropy'): 112 | # The raw formulation of cross-entropy, 113 | # 114 | # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)), 115 | # reduction_indices=[1])) 116 | # 117 | # can be numerically unstable. 118 | # 119 | # So here we use tf.nn.softmax_cross_entropy_with_logits on the 120 | # raw outputs of the nn_layer above, and then average across 121 | # the batch. 
122 | diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) 123 | with tf.name_scope('total'): 124 | cross_entropy = tf.reduce_mean(diff) 125 | tf.summary.scalar('cross_entropy', cross_entropy) 126 | 127 | with tf.name_scope('train'): 128 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize( 129 | cross_entropy) 130 | 131 | with tf.name_scope('accuracy'): 132 | with tf.name_scope('correct_prediction'): 133 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 134 | with tf.name_scope('accuracy'): 135 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 136 | tf.summary.scalar('accuracy', accuracy) 137 | 138 | # Merge all the summaries and write them out to 139 | # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default) 140 | merged = tf.summary.merge_all() 141 | 142 | def feed_dict(train): 143 | """Make a TensorFlow feed_dict: maps data onto Tensor placeholders.""" 144 | if train or FLAGS.fake_data: 145 | xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data) 146 | k = FLAGS.dropout 147 | else: 148 | xs, ys = mnist.test.images, mnist.test.labels 149 | k = 1.0 150 | return {x: xs, y_: ys, keep_prob: k} 151 | 152 | sess = tf.InteractiveSession() 153 | train_writer = tf.summary.FileWriter(FLAGS.log_dir + '/train', sess.graph) 154 | test_writer = tf.summary.FileWriter(FLAGS.log_dir + '/test') 155 | tf.global_variables_initializer().run() 156 | # Train the model, and also write summaries. 157 | # Every 10th step, measure test-set accuracy, and write test summaries 158 | # All other steps, run train_step on training data, & add training summaries 159 | 160 | 161 | for i in range(FLAGS.max_steps): 162 | if i % 10 == 0: # Record summaries and test-set accuracy 163 | summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False)) 164 | test_writer.add_summary(summary, i) 165 | print('Accuracy at step %s: %s' % (i, acc)) 166 | else: # Record train set summaries, and train 167 | if i % 100 == 99: # Record execution stats 168 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 169 | run_metadata = tf.RunMetadata() 170 | summary, _ = sess.run([merged, train_step], 171 | feed_dict=feed_dict(True), 172 | options=run_options, 173 | run_metadata=run_metadata) 174 | train_writer.add_run_metadata(run_metadata, 'step%03d' % i) 175 | train_writer.add_summary(summary, i) 176 | print('Adding run metadata for', i) 177 | else: # Record a summary 178 | summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True)) 179 | train_writer.add_summary(summary, i) 180 | train_writer.close() 181 | test_writer.close() 182 | 183 | 184 | def main(_): 185 | train() 186 | 187 | 188 | if __name__ == '__main__': 189 | parser = argparse.ArgumentParser() 190 | parser.add_argument('--fake_data', nargs='?', const=True, type=bool, 191 | default=False, 192 | help='If true, uses fake data for unit testing.') 193 | parser.add_argument('--max_steps', type=int, default=1000, 194 | help='Number of steps to run trainer.') 195 | parser.add_argument('--learning_rate', type=float, default=0.001, 196 | help='Initial learning rate') 197 | parser.add_argument('--dropout', type=float, default=0.9, 198 | help='Keep probability for training dropout.') 199 | parser.add_argument( 200 | '--data_dir', 201 | type=str, 202 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 203 | 'tensorflow/input_data'), 204 | help='Directory for storing input data') 205 | parser.add_argument( 206 | '--log_dir', 207 | type=str, 208 | 
default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 209 | 'tensorflow/logs'), 210 | help='Summaries log directory') 211 | FLAGS, unparsed = parser.parse_known_args() 212 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) -------------------------------------------------------------------------------- /5-tfjob/README.md: -------------------------------------------------------------------------------- 1 | # `tensorflow/k8s` and `TFJob` 2 | 3 | ## Prerequisites 4 | 5 | * [3 - Helm](../3-helm/README.md) 6 | * [4 - GPUs](../4-gpus/README.md) 7 | 8 | ## Summary 9 | 10 | In this module you will learn how [`tensorflow/k8s`](https://github.com/tensorflow/k8s) can greatly simplify our lives when running TensorFlow on Kubernetes. 11 | 12 | ## `tensorflow/k8s` 13 | 14 | As we saw earlier, giving a container access to GPU is not exactly a breeze on Kubernetes: We need to manually mount the drivers from the node into the container. 15 | If you already tried to run a distributed TensorFlow training, you know that it's not easy either. Getting the `ClusterSpec` right can be painful if you have more than a couple VMs, and it's also quite brittle (we will look more into distributed TensorFlow in module [6 - Distributed TensorFlow](../6-distributed-tensorflow/README.md)). 16 | 17 | `tensorflow/k8s` is a new project in TensorFlow's organization on GitHub that makes all of this much easier. 18 | 19 | 20 | ### Installing `tensorflow/k8s` 21 | 22 | Installing `tensorflow/k8s` with Helm is very easy, just run the following commands: 23 | 24 | ```console 25 | > CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz 26 | > helm install ${CHART} -n tf-job --wait --replace --set cloud=azure 27 | ``` 28 | 29 | If it worked, you should see something like: 30 | 31 | ``` 32 | NAME: tf-job 33 | LAST DEPLOYED: Mon Nov 20 14:24:16 2017 34 | NAMESPACE: default 35 | STATUS: DEPLOYED 36 | 37 | RESOURCES: 38 | ==> v1/ConfigMap 39 | NAME DATA AGE 40 | tf-job-operator-config 1 7s 41 | 42 | ==> v1beta1/Deployment 43 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 44 | tf-job-operator 1 1 1 1 7s 45 | 46 | ==> v1/Pod(related) 47 | NAME READY STATUS RESTARTS AGE 48 | tf-job-operator-3005087210-c3js3 1/1 Running 1 4s 49 | ``` 50 | 51 | This means that 3 resources were created, a `ConfigMap`, a `Deployment`, and a `Pod`. 52 | We will see in just a moment what each of them do. 53 | 54 | ### Kubernetes Custom Resource Definition 55 | 56 | Kubernetes has a concept of [Custom Resources](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) (often abbreviated CRD) that allows us to create custom object that we will then be able to use. 57 | In the case of `tensorflow/k8s`, after installation, a new `TFJob` object will be available in our cluster. This object allows us to describe TensorFlow a training. 58 | 59 | #### `TFJob` Specifications 60 | 61 | Before going further, let's take a look at what the `TFJob` looks like: 62 | 63 | > Note: Some of the fields are not described here for brevity. 64 | 65 | **`TFJob` Object** 66 | 67 | | Field | Type| Description | 68 | |-------|-----|-------------| 69 | | apiVersion | `string` | Versioned schema of this representation of an object. In our case, it's `kubeflow.org/v1alpha1` | 70 | | kind | `string` | Value representing the REST resource this object represents. 
In our case it's `TFJob` | 71 | | metadata | [`ObjectMeta`](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata)| Standard object's metadata. | 72 | | spec | `TFJobSpec` | The actual specification of our TensorFlow job, defined below. | 73 | 74 | `spec` is the most important part, so let's look at it too: 75 | 76 | **`TFJobSpec` Object** 77 | 78 | | Field | Type| Description | 79 | |-------|-----|-------------| 80 | | ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below | 81 | 82 | Let's go deeper: 83 | 84 | **`TFReplicaSpec` Object** 85 | 86 | | Field | Type| Description | 87 | |-------|-----|-------------| 88 | | TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER` which happens to be the default value. | 89 | | Replicas | `int` | Number of replicas of `TfReplicaType`. Again this is useful only for distributed TensorFLow. Default value is `1`. | 90 | | Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. | 91 | 92 | 93 | As a refresher, here is what a simple TensorFlow training (with GPU) would look like using "vanilla" kubernetes: 94 | 95 | ```yaml 96 | apiVersion: batch/v1 97 | kind: Job 98 | metadata: 99 | name: example-job 100 | spec: 101 | template: 102 | metadata: 103 | name: example-job 104 | spec: 105 | restartPolicy: OnFailure 106 | volumes: 107 | - name: bin 108 | hostPath: 109 | path: /usr/lib/nvidia-384/bin 110 | - name: lib 111 | hostPath: 112 | path: /usr/lib/nvidia-384 113 | containers: 114 | - name: tensorflow 115 | image: wbuchwalter/ 116 | resources: 117 | requests: 118 | alpha.kubernetes.io/nvidia-gpu: 1 119 | volumeMounts: 120 | - name: bin 121 | mountPath: /usr/local/nvidia/bin 122 | - name: lib 123 | mountPath: /usr/local/nvidia/lib64 124 | ``` 125 | Here is what the same thing looks like using the new `TFJob` resource: 126 | 127 | ```yaml 128 | apiVersion: kubeflow.org/v1alpha1 129 | kind: TFJob 130 | metadata: 131 | name: example-tfjob 132 | spec: 133 | replicaSpecs: 134 | - template: 135 | spec: 136 | containers: 137 | - image: wbuchwalter/ 138 | name: tensorflow 139 | resources: 140 | requests: 141 | alpha.kubernetes.io/nvidia-gpu: 1 142 | restartPolicy: OnFailure 143 | ``` 144 | 145 | No need to mount drivers anymore! Note that we are not specifying `TfReplicaType` or `Replicas` as the default values are already what we want. 146 | 147 | #### How does this work? 148 | 149 | As we saw earlier, when we installed the Helm chart for `tensorflow/k8s`, 3 resources were created in our cluster: 150 | * A `ConfigMap` named `tf-job-operator-config` 151 | * A `Deployment` 152 | * And a `Pod` named `tf-job-operator` 153 | 154 | The `tf-job-operator` pod (simply called the operator, or `TFJob` operator), is going to monitor your cluster, and every time you create a new resource of type `TFJob`, the operator will know what to do with it. 155 | Specifically, when you create a new `TFJob`, the operator will create a new Kubernetes `Job` for it, and automatically mount the drivers if needed (i.e. when you request a GPU). 156 | 157 | You may wonder how the operator knows which directory needs to be mounted in the container for the NVIDIA drivers: that's where the `ConfigMap` comes into play. 
158 | 159 | In K8s, a [`ConfigMap`](https://kubernetes.io/docs/tasks/configure-pod-container/configmap/) is a simple object that contains key-value pairs. This `ConfigMap` can then be linked with a container to inject some configuration. 160 | 161 | When we installed the Helm chart, we specified which cloud provider we are running on by doing `--set cloud=azure`. 162 | This creates a `ConfigMap` that contains configuration options specific for Azure, including the list of directory to mount. 163 | 164 | We can take a look at what is inside our `tf-job-operator-config` by doing: 165 | 166 | ```console 167 | kubectl describe configmaps tf-job-operator-config 168 | ``` 169 | 170 | The output is: 171 | 172 | ``` 173 | Name: tf-job-operator-config 174 | Namespace: default 175 | Labels: 176 | Annotations: 177 | 178 | Data 179 | ==== 180 | controller_config_file.yaml: 181 | ---- 182 | grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py 183 | accelerators: 184 | alpha.kubernetes.io/nvidia-gpu: 185 | volumes: 186 | - name: lib 187 | mountPath: /usr/local/nvidia/lib64 188 | hostPath: /usr/lib/nvidia-384 189 | - name: bin 190 | mountPath: /usr/local/nvidia/bin 191 | hostPath: /usr/lib/nvidia-384/bin 192 | - name: libcuda 193 | mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1 194 | hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1 195 | ``` 196 | 197 | If you want to know more: 198 | * [tensorflow/k8s](https://github.com/tensorflow/k8s) GitHub repository 199 | * [Introducing Operators](https://coreos.com/blog/introducing-operators.html), a blog post by CoreOS explaining the Operator pattern 200 | 201 | ## Exercises 202 | 203 | ### Exercise 1: A Simple `TFJob` 204 | 205 | Let's schedule a very simple TensorFlow job using `TFJob` first. 206 | 207 | > Note: If you completed the exercise in Module 1 and 2, you can change the image to use the one you pushed instead. 208 | 209 | Depending on whether or not your cluster has GPU, choose the correct template: 210 | 211 |
212 | CPU Only 213 | 214 | ```yaml 215 | apiVersion: kubeflow.org/v1alpha1 216 | kind: TFJob 217 | metadata: 218 | name: module5-ex1 219 | spec: 220 | replicaSpecs: 221 | - template: 222 | spec: 223 | containers: 224 | - image: wbuchwalter/tf-mnist:cpu 225 | name: tensorflow 226 | restartPolicy: OnFailure 227 | ``` 228 | 229 |
230 | 231 |
232 | With GPU 233 | 234 | When using GPU, we need to request for one (or multiple), and the image we are using also needs to be based on TensorFlow's GPU image. 235 | 236 | ```yaml 237 | apiVersion: kubeflow.org/v1alpha1 238 | kind: TFJob 239 | metadata: 240 | name: module5-ex1-gpu 241 | spec: 242 | replicaSpecs: 243 | - template: 244 | spec: 245 | containers: 246 | - image: wbuchwalter/tf-mnist:gpu 247 | name: tensorflow 248 | resources: 249 | requests: 250 | alpha.kubernetes.io/nvidia-gpu: 1 251 | restartPolicy: OnFailure 252 | ``` 253 | 254 |
255 | 
256 | 
257 | 
258 | Save the template that applies to you in a file, and create the `TFJob`:
259 | ```console
260 | kubectl create -f <template-file>
261 | ```
262 | 
263 | Let's look at what has been created in our cluster.
264 | 
265 | First, a `TFJob` was created:
266 | 
267 | ```console
268 | kubectl get tfjob
269 | ```
270 | Returns:
271 | ```
272 | NAME          KIND
273 | module5-ex1   TFJob.v1alpha1.tensorflow.org
274 | ```
275 | 
276 | As well as a `Job`, which was actually created by the operator:
277 | 
278 | ```console
279 | kubectl get job
280 | ```
281 | Returns:
282 | ```
283 | NAME                        DESIRED   SUCCESSFUL   AGE
284 | module5-ex1-master-xs4b-0   1         0            2m
285 | ```
286 | and a `Pod`:
287 | 
288 | ```console
289 | kubectl get pod
290 | ```
291 | Returns:
292 | ```
293 | NAME                              READY     STATUS    RESTARTS   AGE
294 | module5-ex1-master-xs4b-0-6gpfn   1/1       Running   0          2m
295 | ```
296 | 
297 | Note that the `Pod` might take a few minutes before actually running: the Docker image first needs to be pulled on the node.
298 | 
299 | Once the `Pod`'s status is either `Running` or `Completed`, we can start looking at its logs:
300 | 
301 | ```console
302 | kubectl logs <pod-name>
303 | ```
304 | 
305 | This container is pretty verbose, but you should see a TensorFlow training happening:
306 | 
307 | ```
308 | [...]
309 | INFO:tensorflow:2017-11-20 20:57:22.314198: Step 480: Cross entropy = 0.142486
310 | INFO:tensorflow:2017-11-20 20:57:22.370080: Step 480: Validation accuracy = 85.0% (N=100)
311 | INFO:tensorflow:2017-11-20 20:57:22.896383: Step 490: Train accuracy = 98.0%
312 | INFO:tensorflow:2017-11-20 20:57:22.896600: Step 490: Cross entropy = 0.075210
313 | INFO:tensorflow:2017-11-20 20:57:22.945611: Step 490: Validation accuracy = 91.0% (N=100)
314 | INFO:tensorflow:2017-11-20 20:57:23.407756: Step 499: Train accuracy = 94.0%
315 | INFO:tensorflow:2017-11-20 20:57:23.407980: Step 499: Cross entropy = 0.170348
316 | INFO:tensorflow:2017-11-20 20:57:23.457325: Step 499: Validation accuracy = 89.0% (N=100)
317 | INFO:tensorflow:Final test accuracy = 88.4% (N=353)
318 | [...]
319 | ```
320 | 
321 | Once your job is completed, clean it up:
322 | 
323 | ```console
324 | kubectl delete tfjob module5-ex1
325 | ```
326 | 
327 | > That's great and all, but how do we grab our trained model and TensorFlow's summaries?
328 | 
329 | Well, currently we can't. As soon as the training is complete, the container stops, and everything inside it, including the model and logs, is lost.
330 | 
331 | Thankfully, Kubernetes `Volumes` can help us here.
332 | If you remember, we quickly introduced `Volumes` in module [2 - Kubernetes](../2-kubernetes/), and that's what we already used to mount the drivers from the node into the container.
333 | But `Volumes` are not just for mounting things from a node; we can also use them to mount many different storage solutions (you can see the full list [here](https://kubernetes.io/docs/concepts/storage/volumes/)).
334 | 
335 | In our case we are going to use Azure Files, as it is really easy to use with Kubernetes.
336 | 
337 | ## Exercise 2: Azure Files to the Rescue
338 | 
339 | ### Creating a New File Share and Kubernetes Secret
340 | 
341 | In the official documentation: [Using Azure Files with Kubernetes](https://docs.microsoft.com/en-in/azure/aks/azure-files), follow the steps listed under `Create an Azure file share` and `Create Kubernetes Secret`, but be aware of a few details first:
342 | * It is **very** important that you create your storage account (and hence your resource group) in the **same** region as your Kubernetes cluster: because Azure Files uses the `SMB` protocol, it won't work across regions. `AKS_PERS_LOCATION` should be updated accordingly.
343 | * While this document specifically refers to AKS, it will work for any K8s cluster.
344 | * Name your file share `tensorflow`. While the share could be named anything, it will make it easier to follow the examples later on. `AKS_PERS_SHARE_NAME` should be updated accordingly.
345 | 
346 | Once you have completed all the steps, run:
347 | ```console
348 | kubectl get secrets
349 | ```
350 | 
351 | Which should return:
352 | ```
353 | NAME           TYPE      DATA      AGE
354 | azure-secret   Opaque    2         4m
355 | ```
356 | 
357 | 
358 | ### Updating our example to use our Azure File Share
359 | 
360 | Now we need to mount our new file share into our container so the model and the summaries can be persisted.
361 | It turns out mounting an Azure File share into a container is really easy: we simply need to reference our secret in the `Volume` definition.
362 | 
363 | ```yaml
364 | [...]
365 | containers:
366 |   - image: <IMAGE>
367 |     name: tensorflow
368 |     volumeMounts:
369 |       - name: azurefile
370 |         mountPath: <MOUNT_PATH>
371 | volumes:
372 |   - name: azurefile
373 |     azureFile:
374 |       secretName: azure-secret
375 |       shareName: tensorflow
376 |       readOnly: false
377 | ```
378 | 
379 | Update your template from exercise 1 to mount the Azure File share into your container, and create your new job.
380 | Note that by default our container saves everything into `/app/tf_files`, so that's the value you will want to use for `<MOUNT_PATH>`.
381 | 
382 | Once the container starts running, if you go to the Azure Portal, into your storage account, and browse your `tensorflow` file share, you should see something like this:
383 | 
384 | ![file-share](./file-share.png)
385 | 
386 | This means that when we run a training, all the important data is now stored in Azure Files and remains available as long as we don't delete the file share.
387 | 
388 | #### Solution for Exercise 2
389 | 
390 | *For brevity, the solution shown here is for CPU-only training. If you are using GPU, don't forget to update the image tag and add a GPU request.*
391 | 
392 | 
393 | Solution 394 | 395 | ```yaml 396 | apiVersion: kubeflow.org/v1alpha1 397 | kind: TFJob 398 | metadata: 399 | name: module5-ex2 400 | spec: 401 | replicaSpecs: 402 | - template: 403 | spec: 404 | containers: 405 | - image: wbuchwalter/tf-mnist:cpu 406 | name: tensorflow 407 | volumeMounts: 408 | # By default our classifier saves the summaries in /tmp/tensorflow, 409 | # so that's where we want to mount our Azure File Share. 410 | - name: azurefile 411 | # The subPath allows us to mount a subdirectory within the azure file share instead of root 412 | # this is useful so that we can save the logs for each run in a different subdirectory 413 | # instead of overwriting what was done before. 414 | subPath: module5-ex2 415 | mountPath: /tmp/tensorflow 416 | volumes: 417 | - name: azurefile 418 | azureFile: 419 | # We reference the secret we created just earlier 420 | # so that the account name and key are passed securely and not directly in a template 421 | secretName: azure-secret 422 | shareName: tensorflow 423 | readOnly: false 424 | restartPolicy: OnFailure 425 | ``` 426 | 427 |
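If you prefer the CLI over the portal, you can also list the content of the share to check that the summaries were written. This is only a sketch, and it assumes your storage account name and key are available in the `STORAGE_ACCOUNT` and `STORAGE_KEY` variables (reuse the values from the file-share creation steps above):

```console
az storage file list --share-name tensorflow \
    --account-name $STORAGE_ACCOUNT --account-key $STORAGE_KEY \
    --output table
```

You should see a `module5-ex2` directory containing the checkpoints and summaries written by the job.
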
428 | 
429 | 
430 | **Don't forget to delete the `TFJob` once it is completed!**
431 | 
432 | > Great, but what if I want to check out the training in TensorBoard? Do I need to download everything onto my machine?
433 | 
434 | Actually no, you don't. `TFJob` provides a very handy mechanism to monitor your trainings with TensorBoard easily!
435 | We will try that in our third exercise.
436 | 
437 | ### Exercise 3: Adding TensorBoard
438 | 
439 | So far, we have a TensorFlow training running, and its model and summaries are persisted to an Azure File share.
440 | But having TensorBoard monitoring the training would be pretty useful as well.
441 | It turns out `TFJob` can also help us with that.
442 | 
443 | When we looked at the `TFJob` specification at the beginning of this module, we omitted some fields in the `TFJobSpec` description.
444 | Here is a still incomplete but more accurate representation with one additional field:
445 | 
446 | **`TFJobSpec` Object**
447 | 
448 | | Field | Type | Description |
449 | |-------|-----|-------------|
450 | | ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes. |
451 | | **TensorBoard** | `TensorBoardSpec` | Configuration to start a TensorBoard deployment associated with our training job. Defined below. |
452 | 
453 | That's right, `TFJobSpec` contains an object of type `TensorBoardSpec` which allows us to describe a TensorBoard instance!
454 | Let's look at it:
455 | 
456 | **`TensorBoardSpec` Object**
457 | 
458 | | Field | Type | Description |
459 | |-------|-----|-------------|
460 | | LogDir | `string` | Location of the TensorFlow summaries in the TensorBoard container. |
461 | | ServiceType | [`ServiceType`](https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services---service-types) | What kind of service should expose TensorBoard. Usually `ClusterIP` (only reachable from within the cluster) or `LoadBalancer` (exposes the service externally using a cloud provider's load balancer). |
462 | | Volumes | [`Volume`](https://kubernetes.io/docs/api-reference/v1.8/#volume-v1-core) array | List of volumes that can be mounted. |
463 | | VolumeMounts | [`VolumeMount`](https://kubernetes.io/docs/api-reference/v1.8/#volumemount-v1-core) array | Pod volumes to mount into the container's filesystem. |
464 | 
465 | 
466 | Let's add TensorBoard to our job then.
467 | Here is how this will work: we will keep the same TensorFlow training job as in exercise 2. This `TFJob` will write the model and summaries to the Azure File share.
468 | We will also set up the configuration for TensorBoard so that it reads the summaries from the same Azure File share:
469 | * `Volumes` and `VolumeMounts` in `TensorBoardSpec` should be updated adequately.
470 | * For `ServiceType`, you should use `LoadBalancer`; this will create a public IP, so TensorBoard will be easier to access.
471 | * `LogDir` will depend on how you configure `VolumeMounts`, but on your file share, the summaries will be under the `training_summaries` subdirectory.
472 | 
473 | #### Solution for Exercise 3
474 | 
475 | *For brevity, the solution shown here is for CPU-only training. If you are using GPU, don't forget to update the image tag and add a GPU request.*
476 | 
477 | 
478 | Solution 479 | 480 | ```yaml 481 | apiVersion: kubeflow.org/v1alpha1 482 | kind: TFJob 483 | metadata: 484 | name: module5-ex3 485 | spec: 486 | replicaSpecs: 487 | - template: 488 | spec: 489 | volumes: 490 | - name: azurefile 491 | azureFile: 492 | secretName: azure-secret 493 | shareName: tensorflow 494 | readOnly: false 495 | containers: 496 | - image: wbuchwalter/tf-mnist:cpu 497 | name: tensorflow 498 | volumeMounts: 499 | - mountPath: /tmp/tensorflow 500 | subPath: module5-ex3 # Again we isolate the logs in a new directory on Azure Files 501 | name: azurefile 502 | restartPolicy: OnFailure 503 | tensorboard: 504 | logDir: /tmp/tensorflow/logs 505 | serviceType: LoadBalancer # We request a public IP for our TensorBoard instance 506 | volumes: 507 | - name: azurefile 508 | azureFile: 509 | secretName: azure-secret 510 | shareName: tensorflow 511 | volumeMounts: 512 | - mountPath: /tmp/tensorflow/ #This could be any other path. All that maters is that LogDir reflects it. 513 | subPath: module5-ex3 # This should match the directory our Master is actually writing in 514 | name: azurefile 515 | ``` 516 |
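After creating this `TFJob`, you can watch the services until the TensorBoard public IP switches from `<pending>` to an actual address (a sketch; the exact service name is generated from your job's name):

```console
kubectl get svc -w
```
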
517 | 518 | 519 | #### Validation 520 | 521 | If you updated the `TFJob` template correctly, when doing: 522 | ```console 523 | kubectl get services 524 | ``` 525 | You should see something like: 526 | ``` 527 | NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE 528 | kubernetes 10.0.0.1 443/TCP 14d 529 | module5-ex3-master-7yqt-0 10.0.126.11 2222/TCP 5m 530 | module5-ex3-tensorboard-7yqt 10.0.199.170 104.42.193.76 80:31770/TCP 5m 531 | ``` 532 | Note that provisioning a public IP on Azure can take a few minutes. During this time the `EXTERNAL-IP` for TensorBoard's service will show as ``. 533 | 534 | Once the public IP is provisioned, browse it, and you should land on a working TensorBoard instance with live monitoring of the training job running. 535 | 536 | ![TensorBoard](./tensorboard.png) 537 | 538 | ## Next Step 539 | 540 | [6 - Distributed TensorFlow](../6-distributed-tensorflow) 541 | -------------------------------------------------------------------------------- /5-tfjob/file-share.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/5-tfjob/file-share.png -------------------------------------------------------------------------------- /5-tfjob/tensorboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/5-tfjob/tensorboard.png -------------------------------------------------------------------------------- /6-distributed-tensorflow/README.md: -------------------------------------------------------------------------------- 1 | # Distributed TensorFlow with `TFJob` 2 | 3 | ## Prerequisites 4 | 5 | [5 - TFJob](../5-tfjob/) 6 | 7 | ## Summary 8 | 9 | In this module we will see how `TFJob` can greatly simplify the deployment and monitoring of distributed TensorFlow trainings. 10 | 11 | ## "Vanilla" Distributed TensorFlow is Hard 12 | 13 | First let's see how we would setup a distributed TensorFlow training without Kubernetes or `TFJob` (fear not, we are not actually going to do that). 14 | First, you would have to find or setup a bunch of idle VMs, or physical machines. In most companies, this would already be a feat, and likely require the coordination of multiple department (such as IT) to get the VMs up, running and reserved for your experiment. 15 | Then you would likely have to do some back and forth with the IT department to be able to setup your training: the VMs need to be able to talk to each others and have stable endpoints. Work might be needed to access the data, you would need to upload your TF code on every single machine etc. 16 | If you add GPU to the mix, it would likely get even harder since GPUs aren't usually just waiting there because of their high cost. 17 | 18 | Assuming you get through this, you now need to modify your model for distributed training. 19 | Among other things, you will need to setup the `ClusterSpec` ([`tf.train.ClusterSpec`](https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec)): a TensorFlow class that allows you to describe the architecture of your cluster. 
20 | For example, if you were to set up a distributed training with a mere 2 workers and 2 parameter servers, your cluster spec would look like this (the `clusterSpec` would most likely not be hardcoded but passed as an argument to your training script, as we will see below; this is just for illustration):
21 | 
22 | ```python
23 | cluster = tf.train.ClusterSpec({"worker": ["<worker0-host>:2222",
24 |                                            "<worker1-host>:2222"],
25 |                                 "ps": ["<ps0-host>:2222",
26 |                                        "<ps1-host>:2222"]})
27 | ```
28 | Here we assume that you want your workers to run on GPU VMs and your parameter servers to run on CPU VMs.
29 | 
30 | We will not go through the rest of the modifications needed (splitting operations across devices, getting the master session, etc.), as we will look at them later and they would be pretty much the same no matter how you run your distributed training.
31 | 
32 | Once your model is ready, you need to start the training.
33 | You will need to connect to every single VM, and pass the `ClusterSpec` as well as the assigned job name (ps or worker) and task index to each VM.
34 | So it would look something like this:
35 | 
36 | ```bash
37 | # On ps0:
38 | $ python trainer.py \
39 |      --ps_hosts=<ps0-host>:2222,<ps1-host>:2222 \
40 |      --worker_hosts=<worker0-host>:2222,<worker1-host>:2222 \
41 |      --job_name=ps --task_index=0
42 | # On ps1:
43 | $ python trainer.py \
44 |      --ps_hosts=<ps0-host>:2222,<ps1-host>:2222 \
45 |      --worker_hosts=<worker0-host>:2222,<worker1-host>:2222 \
46 |      --job_name=ps --task_index=1
47 | # On worker0:
48 | $ python trainer.py \
49 |      --ps_hosts=<ps0-host>:2222,<ps1-host>:2222 \
50 |      --worker_hosts=<worker0-host>:2222,<worker1-host>:2222 \
51 |      --job_name=worker --task_index=0
52 | # On worker1:
53 | $ python trainer.py \
54 |      --ps_hosts=<ps0-host>:2222,<ps1-host>:2222 \
55 |      --worker_hosts=<worker0-host>:2222,<worker1-host>:2222 \
56 |      --job_name=worker --task_index=1
57 | ```
58 | 
59 | At this point your training would finally start.
60 | However, if for some reason an IP changes (a VM restarts, for example), you would need to go back to every VM in your cluster and restart the training with an updated `ClusterSpec` (if the IT department of your company is feeling extra generous, they might assign a DNS name to every VM, which would already make your life much easier).
61 | If you see that your training is not doing well and you need to update the code, you have to redeploy it on every VM and restart the training everywhere.
62 | If for some reason you want to retrain after a while, you would most likely need to go back to step 1: ask for the VMs to be allocated, redeploy, and update the `clusterSpec`.
63 | 
64 | All these hurdles mean that in practice very few people actually bother with distributed training, as the time gained during training might not be worth the energy and time necessary to set it up correctly.
65 | 
66 | ## Distributed TensorFlow with Kubernetes and `TFJob`
67 | 
68 | Thankfully, with Kubernetes and `TFJob` things are much, much simpler, making distributed training something you might actually be able to benefit from.
69 | 
70 | 
71 | #### A Small Disclaimer
72 | The issues we saw in the first part of this module can be categorized in two groups:
73 | * Issues with getting access to enough resources for the trainings (VMs, GPUs, etc.)
74 | * Issues with setting up the training itself
75 | 
76 | The first group of issues is still very dependent on the processes in your company or group. If you need to go through a formal request to get access to extra VMs/GPUs, it will still be a hassle, and there is nothing Kubernetes can do about that.
77 | However, Kubernetes makes this process much easier:
78 | * On ACS and AKS you can spin up new VMs with a single command: [`az aks scale`](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az_aks_scale)
79 | * On acs-engine you can set up autoscaling so that any time you schedule a training on Kubernetes, the autoscaler will make sure your cluster has all the resources it needs to run it, and when your training is completed, it will shut down any idle VMs, making this the best solution in terms of cost and effort. While autoscaling is outside the scope of this workshop, we will give you pointers in module [8 - Going Further](../8-going-further).
80 | 
81 | Setting up the training, however, is drastically simplified with Kubernetes and `TFJob`.
82 | 
83 | ### Overview of `TFJob` distributed training
84 | 
85 | So, how does `TFJob` work for distributed training?
86 | Let's look again at what the `TFJobSpec` and `TFReplicaSpec` objects look like:
87 | 
88 | **`TFJobSpec` Object**
89 | 
90 | | Field | Type | Description |
91 | |-------|-----|-------------|
92 | | ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below |
93 | 
94 | 
95 | **`TFReplicaSpec` Object**
96 | 
97 | Note the last parameter, `IsDefaultPS`, which we didn't talk about before.
98 | 
99 | | Field | Type | Description |
100 | |-------|-----|-------------|
101 | | TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER`, which happens to be the default value. |
102 | | Replicas | `int` | Number of replicas of `TfReplicaType`. Again, this is useful only for distributed TensorFlow. Default value is `1`. |
103 | | Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. |
104 | | **IsDefaultPS** | `boolean` | Whether the parameter server should use a default image or a custom one (defaults to `true`) |
105 | 
106 | In case the distinction between master and workers is not clear: there is a single master per TensorFlow cluster, and it is in fact a worker. The difference is that the master is the worker that handles the creation of the `tf.Session`, writes the logs and saves the model.
107 | 
108 | As you can see, `TFJobSpec` and `TFReplicaSpec` allow us to easily define the architecture of the TensorFlow cluster we would like to set up.
109 | 
110 | Once we have defined this architecture in a `TFJob` template and deployed it with `kubectl create`, the operator will do most of the work for us.
111 | For each master, worker and parameter server in our TensorFlow cluster, the operator will create a service exposing it so they can communicate.
112 | It will then create an internal representation of the cluster, with each node and its associated internal DNS name.
113 | 114 | For example, if you were to create a `TFJob` with 1 `MASTER`, 2 `WORKERS` and 1 `PS`, this representation would look similar to this: 115 | ```json 116 | { 117 | "master":[ 118 | "distributed-mnist-master-5oz2-0:2222" 119 | ], 120 | "ps":[ 121 | "distributed-mnist-ps-5oz2-0:2222" 122 | ], 123 | "worker":[ 124 | "distributed-mnist-worker-5oz2-0:2222", 125 | "distributed-mnist-worker-5oz2-1:2222" 126 | ] 127 | } 128 | ``` 129 | 130 | Finally, the operator will create all the necessary pods, and in each one, inject an environment variable named `Tf_CONFIG`, containing the cluster specification above, as well as the respective job name and task id that each node of the TensorFlow cluster should assume. 131 | 132 | For example, here is the value of the `TF_CONFIG` environment variable that would be sent to worker 1: 133 | 134 | ```json 135 | { 136 | "cluster":{ 137 | "master":[ 138 | "distributed-mnist-master-5oz2-0:2222" 139 | ], 140 | "ps":[ 141 | "distributed-mnist-ps-5oz2-0:2222" 142 | ], 143 | "worker":[ 144 | "distributed-mnist-worker-5oz2-0:2222", 145 | "distributed-mnist-worker-5oz2-1:2222" 146 | ] 147 | }, 148 | "task":{ 149 | "type":"worker", 150 | "index":1 151 | }, 152 | "environment":"cloud" 153 | } 154 | ``` 155 | 156 | As you can see, this completely takes the responsibility of building and maintaining the `ClusterSpec` away from you. 157 | All you have to do, is modify your code to read the `TF_CONFIG` and act accordingly. 158 | 159 | ### Modifying your model to use `TFJob`'s `TF_CONFIG` 160 | 161 | Concretely, let's see how you would modify your code: 162 | 163 | ```python 164 | # Grab the TF_CONFIG environment variable 165 | tf_config_json = os.environ.get("TF_CONFIG", "{}") 166 | 167 | # Deserialize to a python object 168 | tf_config = json.loads(tf_config_json) 169 | 170 | # Grab the cluster specification from tf_config and create a new tf.train.ClusterSpec instance with it 171 | cluster_spec = tf_config.get("cluster", {}) 172 | cluster_spec_object = tf.train.ClusterSpec(cluster_spec) 173 | 174 | # Grab the task assigned to this specific process from the config. job_name might be "worker" and task_id might be 1 for example 175 | task = tf_config.get("task", {}) 176 | job_name = task["type"] 177 | task_id = task["index"] 178 | 179 | # Configure the TensorFlow server 180 | server_def = tf.train.ServerDef( 181 | cluster=cluster_spec_object.as_cluster_def(), 182 | protocol="grpc", 183 | job_name=job_name, 184 | task_index=task_id) 185 | server = tf.train.Server(server_def) 186 | 187 | # checking if this process is the chief (also called master). The master has the responsibility of creating the session, saving the summaries etc. 188 | is_chief = (job_name == 'master') 189 | 190 | # Notice that we are not handling the case where job_name == 'ps'. That is because `TFJob` will take care of the parameter servers for us by default. 191 | ``` 192 | 193 | As for any distributed TensorFlow training, you will then also need to modify your model to split the operations and variables among the workers and parameter servers as well as create on session on the master. 194 | 195 | ## Exercises 196 | 197 | ### 1 - Modifying Our MNIST Example to Support Distributed Training 198 | 199 | #### 1. a. 200 | Starting from the MNIST sample we have been working with so far, modify it to work with distributed TensorFlow and `TFJob`. 201 | You will then need to build the image and push it (you should push it under a different name or tag to avoid overwriting what you did before). 
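For example, building and pushing the updated image could look like this. This is only a sketch: the image name and tag are an illustration, and `DOCKER_USERNAME` is assumed to contain your Docker Hub username.

```console
# Build the distributed version of the MNIST image and push it under a new tag
docker build -t ${DOCKER_USERNAME}/tf-mnist:distributed .
docker push ${DOCKER_USERNAME}/tf-mnist:distributed
```
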
202 | 203 | #### 1. b. 204 | 205 | Modify the yaml template from module [5 - TFJob](../5-tfjob) exercise 3, to instead deploy 1 master, 2 workers and 1 PS. We also want to monitor the training with TensorBoard. 206 | Note that since our model is very simple, TensorFlow will likely use only 1 of the workers, but it will still work fine. 207 | Don't forget to update the image or tag. 208 | 209 | #### Validation 210 | 211 | ```console 212 | kubectl get pods 213 | ``` 214 | 215 | Should yield: 216 | 217 | ``` 218 | NAME READY STATUS RESTARTS AGE 219 | module6-ex1-master-3khk-0-fkm8p 1/1 Running 0 39s 220 | module6-ex1-ps-3khk-0-rqkv5 1/1 Running 0 39s 221 | module6-ex1-tensorboard-3khk-2845579357-75rtd 1/1 Running 0 39s 222 | module6-ex1-worker-3khk-0-jsm8c 1/1 Running 0 39s 223 | module6-ex1-worker-3khk-1-8rgh4 1/1 Running 0 39s 224 | ``` 225 | 226 | looking at the logs of the master with: 227 | 228 | ```console 229 | kubectl logs 230 | ``` 231 | 232 | Should yield: 233 | 234 | ``` 235 | [...] 236 | Initialize GrpcChannelCache for job master -> {0 -> localhost:2222} 237 | Initialize GrpcChannelCache for job ps -> {0 -> distributed-mnist-ps-5oz2-0:2222} 238 | Initialize GrpcChannelCache for job worker -> {0 -> distributed-mnist-worker-5oz2-0:2222, 1 -> distributed-mnist-worker-5oz2-1:2222} 239 | 2017-12-01 20:10:11.826258: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222 240 | 2017-12-01 20:10:14.395476: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 87c6df6850b8f074 with config: 241 | ``` 242 | 243 | This indicates that the `ClusterSpec` was correctly extracted from the environment variable and given to TensorFlow. 244 | 245 | Once TensorBoard public IP is successfully provisioned (check with `kubectl get svc`), go in TensorBoard's graph section and change the color to Device in the left menu. 246 | You should see that your model is indeed correctly distributed between workers and PS: 247 | 248 | ![TensorBoard](./tensorboard.png) 249 | 250 | Again, since our model is very simple, TensorFlow will most likely only use a single worker. 251 | 252 | After a few minutes, the status of both worker nodes should show as `Completed` when doing `kubectl get pods -a`. 253 | 254 | #### Solution 255 | 256 | 257 | A working code sample is available in [`solution-src/main.py`](./solution-src/main.py). 258 | 259 |
260 | TFJob's Template 261 | 262 | ```yaml 263 | apiVersion: kubeflow.org/v1alpha1 264 | kind: TFJob 265 | metadata: 266 | name: module6-ex1 267 | spec: 268 | tensorboard: # Specification fot the TensorBoard instance that is going to monitor our training 269 | logDir: /tmp/tensorflow/logs 270 | serviceType: LoadBalancer 271 | volumes: 272 | - name: azurefile 273 | azureFile: 274 | secretName: azure-secret 275 | shareName: tensorflow 276 | volumeMounts: 277 | - mountPath: /tmp/tensorflow 278 | subPath: module6-ex1 279 | name: azurefile 280 | replicaSpecs: 281 | - replicas: 1 # 1 Master 282 | tfReplicaType: MASTER 283 | template: 284 | spec: 285 | volumes: 286 | - name: azurefile 287 | azureFile: 288 | secretName: azure-secret 289 | shareName: tensorflow 290 | readOnly: false 291 | containers: 292 | - image: wbuchwalter/tf-mnist:distributed # You can replace this by your own image 293 | name: tensorflow 294 | imagePullPolicy: Always 295 | volumeMounts: 296 | - mountPath: /tmp/tensorflow 297 | subPath: module6-ex1 298 | name: azurefile 299 | restartPolicy: OnFailure 300 | - replicas: 2 # 2 Workers 301 | tfReplicaType: WORKER 302 | template: 303 | spec: 304 | containers: 305 | - image: wbuchwalter/tf-mnist:distributed # You can replace this by your own image 306 | name: tensorflow 307 | imagePullPolicy: Always 308 | restartPolicy: OnFailure 309 | - replicas: 1 # 1 Parameter server 310 | tfReplicaType: PS 311 | ``` 312 | 313 | There are two things to notice here: 314 | * Since only the master will be saving the model and the summaries, we only need to mount the Azure File share on the master's `replicaSpec`, not on the `workers` or `ps`. 315 | * We are not specifying anything for the `PS` `replicaSpec` except the number of replicas. This is because `IsDefaultPS` is set to `true` by default. This means that the parameter server(s) will be started with a pre-built docker image that is already configured to read the `TF_CONFIG` and act as a TensorFlow server, so we don't need to do anything here. 316 | 317 |
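To try this solution, save the template to a file and create it, then watch the master, worker, parameter server and TensorBoard pods come up (a sketch; the file name is just an example):

```console
kubectl create -f module6-ex1.yaml
kubectl get pods -w
```
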
318 | 319 | 320 | ## Next Step 321 | 322 | [7 - Hyperparameters Sweep with Helm](../7-hyperparam-sweep) 323 | -------------------------------------------------------------------------------- /6-distributed-tensorflow/solution-src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.4.0 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] -------------------------------------------------------------------------------- /6-distributed-tensorflow/solution-src/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the 'License'); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an 'AS IS' BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """A simple MNIST classifier which displays summaries in TensorBoard. 16 | 17 | This is an unimpressive MNIST model, but it is a good example of using 18 | tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of 19 | naming summary tags so that they are grouped meaningfully in TensorBoard. 20 | 21 | It demonstrates the functionality of every TensorBoard dashboard. 22 | """ 23 | from __future__ import absolute_import 24 | from __future__ import division 25 | from __future__ import print_function 26 | 27 | import argparse 28 | import os 29 | import sys 30 | import ast 31 | import json 32 | 33 | import tensorflow as tf 34 | 35 | from tensorflow.examples.tutorials.mnist import input_data 36 | 37 | FLAGS = None 38 | 39 | def train(): 40 | tf_config_json = os.environ.get("TF_CONFIG", "{}") 41 | tf_config = json.loads(tf_config_json) 42 | 43 | task = tf_config.get("task", {}) 44 | cluster_spec = tf_config.get("cluster", {}) 45 | cluster_spec_object = tf.train.ClusterSpec(cluster_spec) 46 | job_name = task["type"] 47 | task_id = task["index"] 48 | server_def = tf.train.ServerDef( 49 | cluster=cluster_spec_object.as_cluster_def(), 50 | protocol="grpc", 51 | job_name=job_name, 52 | task_index=task_id) 53 | server = tf.train.Server(server_def) 54 | 55 | is_chief = (job_name == 'master') 56 | 57 | # Import data 58 | mnist = input_data.read_data_sets(FLAGS.data_dir, 59 | one_hot=True, 60 | fake_data=FLAGS.fake_data) 61 | 62 | 63 | # Create a multilayer model. 
64 | 65 | 66 | # Between-graph replication 67 | with tf.device(tf.train.replica_device_setter( 68 | worker_device="/job:worker/task:%d" % task_id, 69 | cluster=cluster_spec)): 70 | 71 | # count the number of updates 72 | global_step = tf.get_variable( 73 | 'global_step', 74 | [], 75 | initializer = tf.constant_initializer(0), 76 | trainable = False) 77 | 78 | # Input placeholders 79 | with tf.name_scope('input'): 80 | x = tf.placeholder(tf.float32, [None, 784], name='x-input') 81 | y_ = tf.placeholder(tf.float32, [None, 10], name='y-input') 82 | 83 | with tf.name_scope('input_reshape'): 84 | image_shaped_input = tf.reshape(x, [-1, 28, 28, 1]) 85 | tf.summary.image('input', image_shaped_input, 10) 86 | 87 | # We can't initialize these variables to 0 - the network will get stuck. 88 | def weight_variable(shape): 89 | """Create a weight variable with appropriate initialization.""" 90 | initial = tf.truncated_normal(shape, stddev=0.1) 91 | return tf.Variable(initial) 92 | 93 | def bias_variable(shape): 94 | """Create a bias variable with appropriate initialization.""" 95 | initial = tf.constant(0.1, shape=shape) 96 | return tf.Variable(initial) 97 | 98 | def variable_summaries(var): 99 | """Attach a lot of summaries to a Tensor (for TensorBoard visualization).""" 100 | with tf.name_scope('summaries'): 101 | mean = tf.reduce_mean(var) 102 | tf.summary.scalar('mean', mean) 103 | with tf.name_scope('stddev'): 104 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 105 | tf.summary.scalar('stddev', stddev) 106 | tf.summary.scalar('max', tf.reduce_max(var)) 107 | tf.summary.scalar('min', tf.reduce_min(var)) 108 | tf.summary.histogram('histogram', var) 109 | 110 | def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu): 111 | """Reusable code for making a simple neural net layer. 112 | 113 | It does a matrix multiply, bias add, and then uses ReLU to nonlinearize. 114 | It also sets up name scoping so that the resultant graph is easy to read, 115 | and adds a number of summary ops. 116 | """ 117 | # Adding a name scope ensures logical grouping of the layers in the graph. 118 | with tf.name_scope(layer_name): 119 | # This Variable will hold the state of the weights for the layer 120 | with tf.name_scope('weights'): 121 | weights = weight_variable([input_dim, output_dim]) 122 | variable_summaries(weights) 123 | with tf.name_scope('biases'): 124 | biases = bias_variable([output_dim]) 125 | variable_summaries(biases) 126 | with tf.name_scope('Wx_plus_b'): 127 | preactivate = tf.matmul(input_tensor, weights) + biases 128 | tf.summary.histogram('pre_activations', preactivate) 129 | activations = act(preactivate, name='activation') 130 | tf.summary.histogram('activations', activations) 131 | return activations 132 | 133 | hidden1 = nn_layer(x, 784, 500, 'layer1') 134 | 135 | with tf.name_scope('dropout'): 136 | keep_prob = tf.placeholder(tf.float32) 137 | tf.summary.scalar('dropout_keep_probability', keep_prob) 138 | dropped = tf.nn.dropout(hidden1, keep_prob) 139 | 140 | # Do not apply softmax activation yet, see below. 141 | y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity) 142 | 143 | with tf.name_scope('cross_entropy'): 144 | # The raw formulation of cross-entropy, 145 | # 146 | # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)), 147 | # reduction_indices=[1])) 148 | # 149 | # can be numerically unstable. 
150 | # 151 | # So here we use tf.nn.softmax_cross_entropy_with_logits on the 152 | # raw outputs of the nn_layer above, and then average across 153 | # the batch. 154 | diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) 155 | with tf.name_scope('total'): 156 | cross_entropy = tf.reduce_mean(diff) 157 | tf.summary.scalar('cross_entropy', cross_entropy) 158 | 159 | with tf.name_scope('train'): 160 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize( 161 | cross_entropy) 162 | 163 | with tf.name_scope('accuracy'): 164 | with tf.name_scope('correct_prediction'): 165 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 166 | with tf.name_scope('accuracy'): 167 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 168 | tf.summary.scalar('accuracy', accuracy) 169 | 170 | # Merge all the summaries and write them out to 171 | # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default) 172 | merged = tf.summary.merge_all() 173 | 174 | init_op = tf.global_variables_initializer() 175 | 176 | def feed_dict(train): 177 | """Make a TensorFlow feed_dict: maps data onto Tensor placeholders.""" 178 | if train or FLAGS.fake_data: 179 | xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data) 180 | k = FLAGS.dropout 181 | else: 182 | xs, ys = mnist.test.images, mnist.test.labels 183 | k = 1.0 184 | return {x: xs, y_: ys, keep_prob: k} 185 | 186 | 187 | 188 | sv = tf.train.Supervisor(is_chief=is_chief, 189 | global_step=global_step, 190 | init_op=init_op, 191 | logdir=FLAGS.logdir) 192 | 193 | with sv.prepare_or_wait_for_session(server.target) as sess: 194 | train_writer = tf.summary.FileWriter(FLAGS.logdir + '/train', sess.graph) 195 | test_writer = tf.summary.FileWriter(FLAGS.logdir + '/test') 196 | # Train the model, and also write summaries. 
197 | # Every 10th step, measure test-set accuracy, and write test summaries 198 | # All other steps, run train_step on training data, & add training summaries 199 | 200 | for i in range(FLAGS.max_steps): 201 | if i % 10 == 0: # Record summaries and test-set accuracy 202 | summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False)) 203 | test_writer.add_summary(summary, i) 204 | print('Accuracy at step %s: %s' % (i, acc)) 205 | else: # Record train set summaries, and train 206 | if i % 100 == 99: # Record execution stats 207 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 208 | run_metadata = tf.RunMetadata() 209 | summary, _ = sess.run([merged, train_step], 210 | feed_dict=feed_dict(True), 211 | options=run_options, 212 | run_metadata=run_metadata) 213 | train_writer.add_run_metadata(run_metadata, 'step%03d' % i) 214 | train_writer.add_summary(summary, i) 215 | print('Adding run metadata for', i) 216 | else: # Record a summary 217 | summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True)) 218 | train_writer.add_summary(summary, i) 219 | train_writer.close() 220 | test_writer.close() 221 | 222 | 223 | def main(_): 224 | train() 225 | 226 | 227 | if __name__ == '__main__': 228 | parser = argparse.ArgumentParser() 229 | parser.add_argument('--fake_data', nargs='?', const=True, type=bool, 230 | default=False, 231 | help='If true, uses fake data for unit testing.') 232 | parser.add_argument('--max_steps', type=int, default=1000, 233 | help='Number of steps to run trainer.') 234 | parser.add_argument('--learning_rate', type=float, default=0.001, 235 | help='Initial learning rate') 236 | parser.add_argument('--dropout', type=float, default=0.9, 237 | help='Keep probability for training dropout.') 238 | parser.add_argument( 239 | '--data_dir', 240 | type=str, 241 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 242 | 'tensorflow/input_data'), 243 | help='Directory for storing input data') 244 | parser.add_argument( 245 | '--logdir', 246 | type=str, 247 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 248 | 'tensorflow/logs'), 249 | help='Summaries log directory') 250 | FLAGS, unparsed = parser.parse_known_args() 251 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) -------------------------------------------------------------------------------- /6-distributed-tensorflow/tensorboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/6-distributed-tensorflow/tensorboard.png -------------------------------------------------------------------------------- /7-hyperparam-sweep/README.md: -------------------------------------------------------------------------------- 1 | # Automated Hyperparameters Sweep with `TFJob` and Helm 2 | 3 | ## Prerequisites 4 | 5 | * [3 - Helm](../3-helm) 6 | * [5 - TFJob](../5-tfjob) 7 | 8 | ### "Vanilla" Hyperparameter Sweep 9 | 10 | Just as distributed training, automated hyperparameter sweep is barely used in many organizations. 11 | The reasons are similar: It takes a lot of resources, or time, to run more than a couple training for the same model. 12 | * Either you run different hypothesis in parallel, which will likely requires a lot of resources and VMs. These VMs need to be managed by someone, the model need to be deployed, logs and checkpoints have to be gathered etc. 
13 | * Or you run everything sequentially on a small number of VMs, which takes a lot of time before you can compare results.
14 | 
15 | So in practice, most people manually fine-tune their hyperparameters through a few runs and pick a winner.
16 | 
17 | ### Kubernetes + Helm
18 | 
19 | Kubernetes coupled with Helm can make this easier, as we will see.
20 | Because Kubernetes on Azure also allows you to scale very easily (manually or automatically), it lets you explore a very large hyperparameter space while maximizing the usage of your cluster (and thus optimizing cost).
21 | 
22 | In practice, this process is still rudimentary today, as the technologies involved are all pretty young. Tools better suited for hyperparameter sweeping on distributed systems will most likely become available soon, but in the meantime Kubernetes and Helm already allow us to deploy a large number of trainings fairly easily.
23 | 
24 | ### Why Helm?
25 | 
26 | As we saw in module [3 - Helm](../3-helm), Helm enables us to package an application in a chart and parametrize its deployment easily.
27 | To do that, Helm lets us use the Go templating engine in the chart definitions. This means we can use conditions, loops, variables and [much more](https://docs.helm.sh/chart_template_guide).
28 | This allows us to create complex deployment flows.
29 | 
30 | In the case of hyperparameter sweeping, we want a chart able to deploy a number of `TFJobs`, each trying different values for some hyperparameters.
31 | We will also want to deploy a single TensorBoard instance monitoring all these `TFJobs`; that way we can quickly compare all our hypotheses, and even early-stop jobs that clearly don't perform well if we want to reduce cost as much as possible.
32 | For now, this chart will simply do a grid search, and while it is less efficient than random search, it is a good place to start.
33 | 
34 | ## Exercise
35 | 
36 | ### Creating and Deploying the Chart
37 | In this exercise, you will create a new Helm chart that will deploy a number of `TFJobs` as well as a TensorBoard instance.
38 | 
39 | Here is what our `values.yaml` file could look like, for example (you are free to go a different route):
40 | 
41 | ```yaml
42 | image: wbuchwalter/tf-paint:cpu
43 | useGPU: false
44 | shareName: tensorflow
45 | hyperParamValues:
46 |   learningRate:
47 |     - 0.001
48 |     - 0.01
49 |     - 0.1
50 |   hiddenLayers:
51 |     - 5
52 |     - 6
53 |     - 7
54 | ```
55 | 
56 | That way, when installing the chart, 9 `TFJobs` will actually get deployed (3 learning rates × 3 hidden-layer depths), testing every combination of the values we specified.
57 | This is a very simple example (our model is also very simple), but hopefully you start to see the possibilities that Helm offers.
58 | 
59 | In this exercise, we are going to use a new model based on [Andrej Karpathy's Image painting demo](http://cs.stanford.edu/people/karpathy/convnetjs/demo/image_regression.html).
60 | The objective of this model is to create a new picture as close as possible to the original one, "The Starry Night" by Vincent van Gogh:
61 | 
62 | ![Starry](./src/starry.jpg)
63 | 
64 | The source code is located in [src/](./src/).
65 | 
66 | Our model takes 3 parameters:
67 | 
68 | | argument | description | default value |
69 | |------|-------------|---------------|
70 | |`--learning-rate` | Learning rate value | `0.001` |
71 | |`--hidden-layers` | Number of hidden layers in our network. | `4` |
72 | |`--log-dir` | Path to save TensorFlow's summaries | `None`|
73 | 
74 | For simplicity, Docker images have already been created so you don't have to build and push them yourself:
75 | * `wbuchwalter/tf-paint:cpu` for CPU only.
76 | * `wbuchwalter/tf-paint:gpu` for GPU.
77 | 
78 | The goal of this exercise is to create a Helm chart that will allow us to test as many variations and combinations of the two hyperparameters `--learning-rate` and `--hidden-layers` as we want, just by adding them to our `values.yaml` file.
79 | This chart should also deploy a single TensorBoard instance (and its associated service), so we can quickly monitor and compare our different hypotheses.
80 | 
81 | If you are pretty new to Kubernetes and Helm and don't feel like building your own Helm chart just yet, you can skip to the solution, where details and explanations are provided.
82 | 
83 | #### Validation
84 | 
85 | Once you have created and deployed your chart, look at the pods that were created: you should see a bunch of them, as well as a single TensorBoard instance monitoring all of them:
86 | 
87 | ```console
88 | kubectl get pods
89 | ```
90 | 
91 | ```
92 | NAME                                       READY     STATUS    RESTARTS   AGE
93 | module7-tensorboard-3609490657-6w7zf       1/1       Running   0          23s
94 | module7-tf-paint-0-0-master-tduk-0-zqngf   1/1       Running   0          2m
95 | module7-tf-paint-0-1-master-ub9a-0-3rlbr   1/1       Running   0          2m
96 | module7-tf-paint-0-2-master-ekw1-0-cp0l3   1/1       Running   0          2m
97 | module7-tf-paint-1-0-master-jr7r-0-6jkwc   1/1       Running   0          2m
98 | module7-tf-paint-1-1-master-rqh2-0-0t4zw   1/1       Running   0          2m
99 | module7-tf-paint-1-2-master-le5b-0-34q38   1/1       Running   0          2m
100 | module7-tf-paint-2-0-master-g7i9-0-jq1c1   1/1       Running   0          2m
101 | module7-tf-paint-2-1-master-1urb-0-sq92z   1/1       Running   0          2m
102 | module7-tf-paint-2-2-master-ay57-0-0qt2c   1/1       Running   0          2m
103 | ```
104 | 
105 | Looking at TensorBoard, you should see something similar to this:
106 | ![TensorBoard](tensorboard.png)
107 | 
108 | > Note that TensorBoard can take a while before correctly displaying images.
109 | 
110 | Here we can see that some models are doing much better than others. Models with a learning rate of `0.1`, for example, are producing an all-black image; we are probably overshooting.
111 | After a few minutes, we can see that the two best performing models are:
112 | * 5 hidden layers and learning rate of `0.01`
113 | * 7 hidden layers and learning rate of `0.001`
114 | 
115 | At this point we could decide to kill all the other models if we wanted to free some capacity in our cluster, or to launch new experiments based on our initial findings.
116 | 
117 | #### Solution
118 | 
119 | Check out the commented solution chart: [./solution-chart/templates/deployment.yaml](./solution-chart/templates/deployment.yaml)
120 | 
121 | ## Next Step
122 | 
123 | [8 - Going Further](../8-going-further)
124 | 
--------------------------------------------------------------------------------
/7-hyperparam-sweep/solution-chart/Chart.yaml:
--------------------------------------------------------------------------------
1 | apiVersion: v1
2 | description: A Helm chart for Kubernetes
3 | name: module7-hyperparam-sweep
4 | version: 0.1.0
5 | 
--------------------------------------------------------------------------------
/7-hyperparam-sweep/solution-chart/templates/_helpers.tpl:
--------------------------------------------------------------------------------
1 | {{/* vim: set filetype=mustache: */}}
2 | {{/*
3 | Expand the name of the chart.
4 | */}} 5 | {{- define "name" -}} 6 | {{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} 7 | {{- end -}} 8 | 9 | {{/* 10 | Create a default fully qualified app name. 11 | We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). 12 | */}} 13 | {{- define "fullname" -}} 14 | {{- $name := default .Chart.Name .Values.nameOverride -}} 15 | {{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} 16 | {{- end -}} 17 | -------------------------------------------------------------------------------- /7-hyperparam-sweep/solution-chart/templates/deployment.yaml: -------------------------------------------------------------------------------- 1 | 2 | # First we copy the values of values.yaml in variable to make it easier to access them 3 | {{- $lrlist := .Values.hyperParamValues.learningRate -}} 4 | {{- $nblayerslist := .Values.hyperParamValues.hiddenLayers -}} 5 | {{- $image := .Values.image -}} 6 | {{- $useGPU := .Values.useGPU -}} 7 | {{- $shareName := .Values.shareName -}} 8 | {{- $chartname := .Chart.Name -}} 9 | {{- $chartversion := .Chart.Version -}} 10 | 11 | # Then we loop over every value of $lrlist (learning rate) and $nblayerslist (hidden layer depth) 12 | # This will result in create 1 TFJob for every pair of learning rate and hidden layer depth 13 | {{- range $i, $lr := $lrlist }} 14 | {{- range $j, $nblayers := $nblayerslist }} 15 | apiVersion: kubeflow.org/v1alpha1 16 | kind: TFJob # Each one of our trainings will be a separate TFJob 17 | metadata: 18 | name: module7-tf-paint-{{ $i }}-{{ $j }} # We give a unique name to each training 19 | labels: 20 | chart: "{{ $chartname }}-{{ $chartversion | replace "+" "_" }}" 21 | spec: 22 | replicaSpecs: 23 | - template: 24 | spec: 25 | restartPolicy: OnFailure 26 | containers: 27 | - name: tensorflow 28 | image: {{ $image }} 29 | env: 30 | - name: LC_ALL 31 | value: C.UTF-8 32 | args: 33 | # Here we pass a unique learning rate and hidden layer count to each instance. 34 | # We also put the values between quotes to avoid potential formatting issues 35 | - --learning-rate 36 | - {{ $lr | quote }} 37 | - --hidden-layers 38 | - {{ $nblayers | quote }} 39 | - --logdir 40 | - /tmp/tensorflow/tf-paint-lr{{ $lr }}-d-{{ $nblayers }} # We save the summaries in a different directory 41 | {{ if $useGPU }} # We only want to request GPUs if we asked for it in values.yaml with useGPU 42 | resources: 43 | requests: 44 | alpha.kubernetes.io/nvidia-gpu: 1 45 | {{ end }} 46 | volumeMounts: 47 | - mountPath: /tmp/tensorflow 48 | subPath: module7 # As usual we want to save everything in a separate subdirectory 49 | name: azurefile 50 | volumes: 51 | - name: azurefile 52 | azureFile: 53 | secretName: azure-secret 54 | shareName: {{ $shareName }} 55 | readOnly: false 56 | --- 57 | {{- end }} 58 | {{- end }} 59 | # We are not using TFJob integrated TensorBoard, because we want to create the Service and Deployment outside of the loop 60 | # since we only want one instance running for all our jobs, and not 1 per job. 
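# The Service below exposes this single TensorBoard instance through an Azure
# LoadBalancer (port 80 -> container port 6006), and the Deployment mounts the
# same Azure Files share (subPath "module7") that every TFJob above writes its
# summaries to. That shared logdir is what lets one TensorBoard display all runs.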
61 | apiVersion: v1 62 | kind: Service 63 | metadata: 64 | labels: 65 | app: tensorboard 66 | name: module7-tensorboard 67 | spec: 68 | ports: 69 | - port: 80 70 | targetPort: 6006 71 | selector: 72 | app: tensorboard 73 | type: LoadBalancer 74 | --- 75 | apiVersion: extensions/v1beta1 76 | kind: Deployment 77 | metadata: 78 | labels: 79 | app: tensorboard 80 | name: module7-tensorboard 81 | spec: 82 | template: 83 | metadata: 84 | labels: 85 | app: tensorboard 86 | spec: 87 | volumes: 88 | - name: azurefile 89 | azureFile: 90 | secretName: azure-secret 91 | shareName: {{ $shareName }} 92 | readOnly: false 93 | containers: 94 | - name: tensorboard 95 | command: 96 | - /usr/local/bin/tensorboard 97 | - --logdir=/tmp/tensorflow 98 | - --host=0.0.0.0 99 | image: tensorflow/tensorflow 100 | ports: 101 | - containerPort: 6006 102 | volumeMounts: 103 | - mountPath: /tmp/tensorflow 104 | subPath: module7 105 | name: azurefile -------------------------------------------------------------------------------- /7-hyperparam-sweep/solution-chart/values.yaml: -------------------------------------------------------------------------------- 1 | image: wbuchwalter/tf-paint:cpu 2 | useGPU: false 3 | shareName: tensorflow 4 | hyperParamValues: 5 | learningRate: 6 | - 0.001 7 | - 0.01 8 | - 0.1 9 | hiddenLayers: 10 | - 5 11 | - 6 12 | - 7 13 | -------------------------------------------------------------------------------- /7-hyperparam-sweep/src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.4.0 2 | COPY requirements.txt /app/requirements.txt 3 | WORKDIR /app 4 | RUN mkdir ./output 5 | RUN mkdir ./logs 6 | RUN mkdir ./checkpoints 7 | RUN pip install -r requirements.txt 8 | COPY ./* /app/ 9 | 10 | ENTRYPOINT [ "python", "/app/main.py" ] -------------------------------------------------------------------------------- /7-hyperparam-sweep/src/Dockerfile.gpu: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.4.0-gpu 2 | COPY requirements.txt /app/requirements.txt 3 | WORKDIR /app 4 | RUN mkdir ./output 5 | RUN mkdir ./logs 6 | RUN mkdir ./checkpoints 7 | RUN pip install -r requirements.txt 8 | COPY ./* /app/ 9 | 10 | ENTRYPOINT [ "python", "/app/main.py" ] -------------------------------------------------------------------------------- /7-hyperparam-sweep/src/main.py: -------------------------------------------------------------------------------- 1 | import click 2 | import tensorflow as tf 3 | import numpy as np 4 | from skimage.data import astronaut 5 | from scipy.misc import imresize, imsave, imread 6 | 7 | img = imread('./starry.jpg') 8 | img = imresize(img, (100, 100)) 9 | save_dir = 'output' 10 | epochs = 2000 11 | 12 | 13 | def linear_layer(X, layer_size, layer_name): 14 | with tf.variable_scope(layer_name): 15 | W = tf.Variable(tf.random_uniform([X.get_shape().as_list()[1], layer_size], dtype=tf.float32), name='W') 16 | b = tf.Variable(tf.zeros([layer_size]), name='b') 17 | return tf.nn.relu(tf.matmul(X, W) + b) 18 | 19 | @click.command() 20 | @click.option("--learning-rate", default=0.01) 21 | @click.option("--hidden-layers", default=7) 22 | @click.option("--logdir") 23 | def main(learning_rate, hidden_layers, logdir='./logs/1'): 24 | X = tf.placeholder(dtype=tf.float32, shape=(None, 2), name='X') 25 | y = tf.placeholder(dtype=tf.float32, shape=(None, 3), name='y') 26 | current_input = X 27 | for layer_id in range(hidden_layers): 28 | h = 
linear_layer(current_input, 20, 'layer{}'.format(layer_id)) 29 | current_input = h 30 | 31 | y_pred = linear_layer(current_input, 3, 'output') 32 | 33 | #loss will be distance between predicted and true RGB 34 | loss = tf.reduce_mean(tf.reduce_sum(tf.squared_difference(y, y_pred), 1)) 35 | tf.summary.scalar('loss', loss) 36 | 37 | train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss) 38 | merged_summary_op = tf.summary.merge_all() 39 | 40 | res_img = tf.cast(tf.clip_by_value(tf.reshape(y_pred, (1,) + img.shape), 0, 255), tf.uint8) 41 | img_summary = tf.summary.image('out', res_img, max_outputs=1) 42 | 43 | xs, ys = get_data(img) 44 | 45 | with tf.Session() as sess: 46 | tf.global_variables_initializer().run() 47 | train_writer = tf.summary.FileWriter(logdir + '/train', sess.graph) 48 | test_writer = tf.summary.FileWriter(logdir + '/test') 49 | batch_size = 50 50 | for i in range(epochs): 51 | # Get a random sampling of the dataset 52 | idxs = np.random.permutation(range(len(xs))) 53 | # The number of batches we have to iterate over 54 | n_batches = len(idxs) // batch_size 55 | # Now iterate over our stochastic minibatches: 56 | for batch_i in range(n_batches): 57 | batch_idxs = idxs[batch_i * batch_size: (batch_i + 1) * batch_size] 58 | sess.run([train_op, loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]}) 59 | if batch_i % 100 == 0: 60 | c, summary = sess.run([loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]}) 61 | train_writer.add_summary(summary, (i * n_batches * batch_size) + batch_i) 62 | print("epoch {}, (l2) loss {}".format(i, c)) 63 | 64 | if i % 10 == 0: 65 | img_summary_res = sess.run(img_summary, feed_dict={X: xs, y: ys}) 66 | test_writer.add_summary(img_summary_res, i * n_batches * batch_size) 67 | 68 | def get_data(img): 69 | xs = [] 70 | ys = [] 71 | for row_i in range(img.shape[0]): 72 | for col_i in range(img.shape[1]): 73 | xs.append([row_i, col_i]) 74 | ys.append(img[row_i, col_i]) 75 | 76 | xs = (xs - np.mean(xs)) / np.std(xs) 77 | return xs, np.array(ys) 78 | 79 | if __name__ == "__main__": 80 | main() -------------------------------------------------------------------------------- /7-hyperparam-sweep/src/requirements.txt: -------------------------------------------------------------------------------- 1 | scikit-image 2 | click>=6.2 -------------------------------------------------------------------------------- /7-hyperparam-sweep/src/starry.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/7-hyperparam-sweep/src/starry.jpg -------------------------------------------------------------------------------- /7-hyperparam-sweep/tensorboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/7-hyperparam-sweep/tensorboard.png -------------------------------------------------------------------------------- /8-going-further/NFSonAzureConcept.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wbuchwalter/tensorflow-k8s-azure/497ea5d68f4da287693fd19541473b6e9d796041/8-going-further/NFSonAzureConcept.png -------------------------------------------------------------------------------- /8-going-further/README.md: 
--------------------------------------------------------------------------------
1 | # Going Further
2 | 
3 | ## Summary
4 | 
5 | * Learn how to deal with very large amounts of input data
6 | * Learn how to autoscale a Kubernetes cluster created with acs-engine.
7 | 
8 | 
9 | ## Working with Large Amounts of Data
10 | 
11 | In the previous modules, we saw how we could easily scale our trainings on Kubernetes to a large number of nodes.
12 | We also saw how Azure Files provides an easy way to add persistent storage to our trainings to save the output models, summaries, etc.
13 | 
14 | However, in every example we have worked with so far, the training data was either downloaded at run time or directly baked into the container. While this can work for very small datasets, it is not a scalable approach for larger ones.
15 | Just as we used Azure Files to store models and logs, we can use Azure Files to store our dataset and mount the share in our containers. This works well only up to a certain point though (say a few GBs), as an Azure file share is limited to 1000 IOPS.
16 | 
17 | So how can we deal with larger datasets?
18 | One solution is to use a distributed file system.
19 | 
20 | ### Distributed File System, Tools and Concepts
21 | 
22 | A distributed file system aggregates storage from multiple machines over the network into a single large file system.
23 | Such a file system can be deployed inside your Kubernetes cluster, and can use the disks already attached to your nodes as partitions of the overall network file system.
24 | 
25 | 
26 | ![](NFSonAzureConcept.png)
27 | 
28 | Here are some tools and frameworks that make it easy to deploy such a distributed file system on Kubernetes:
29 | 
30 | * [GlusterFS](http://www.gluster.org/)
31 | * [Rook](https://rook.io/)
32 | * [Portworx](https://portworx.com/)
33 | * [Pachyderm](http://pachyderm.io/)
34 | 
35 | 
36 | ## Autoscaling a Kubernetes Cluster
37 | 
38 | As we saw in modules [6 - Distributed TensorFlow](../6-distributed-tensorflow/) and [7 - Hyperparameters Sweep](../7-hyperparam-sweep), being able to autoscale our Kubernetes cluster can be extremely useful.
39 | Indeed, automatic scale-out (adding more VMs to the cluster) allows anyone to run any experiment they want, whenever they want, without needing to involve an ops person to set up and prepare virtual machines.
40 | And because the cluster can also automatically scale in by deleting idle VMs once training jobs are completed, we can keep the cost as low as possible by paying only for what we actually use.
41 | 
42 | As of this writing, autoscaling is only supported on Kubernetes clusters created with [acs-engine](https://github.com/Azure/acs-engine).
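To make the scale-out / scale-in idea concrete, here is a small conceptual sketch of the decision the autoscaler automates. This is *not* the actual autoscaler (see the resources below); it is only an illustration written with the official `kubernetes` Python client, and it assumes `pip install kubernetes` and a valid kubeconfig for your cluster:

```python
# Conceptual sketch only: pending pods usually mean we are out of capacity
# (scale out), while nodes running no user workload are candidates for
# deletion (scale in). A real autoscaler also handles cooldowns, VM
# provisioning through ARM/acs-engine, node draining, etc.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_pod_for_all_namespaces(watch=False).items
nodes = v1.list_node().items

# Pods stuck in "Pending" typically could not be scheduled for lack of CPU/GPU.
pending = [p.metadata.name for p in pods if p.status.phase == "Pending"]

# Nodes hosting only kube-system pods are idle from a training point of view.
busy_nodes = {p.spec.node_name for p in pods
              if p.spec.node_name and p.metadata.namespace != "kube-system"}
idle_nodes = [n.metadata.name for n in nodes if n.metadata.name not in busy_nodes]

if pending:
    print("Scale out: %d pod(s) waiting for capacity: %s" % (len(pending), pending))
elif idle_nodes:
    print("Scale in: node(s) %s are idle and could be removed" % idle_nodes)
else:
    print("Cluster looks correctly sized.")
```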
43 | 
44 | See the following resources to get started:
45 | * [Kubernetes `acs-engine` Autoscaler](https://github.com/wbuchwalter/Kubernetes-acs-engine-autoscaler)
46 | * [Autoscaling a Kubernetes cluster created with acs-engine on Azure](https://medium.com/@wbuchwalter/autoscaling-a-kubernetes-cluster-created-with-acs-engine-on-azure-5e24ddc6402e)
47 | * [Case study: Autoscaling Deep Learning Training with Kubernetes](https://www.microsoft.com/developerblog/2017/11/21/autoscaling-deep-learning-training-kubernetes/)
48 | 
49 | 
50 | ## Next Step
51 | 
52 | [9 - Jupyter Notebooks on Kubernetes](../9-jupyter)
53 | 
--------------------------------------------------------------------------------
/9-jupyter/README.md:
--------------------------------------------------------------------------------
1 | # Jupyter Notebooks on Kubernetes
2 | 
3 | ## Prerequisites
4 | * [1 - Docker Basics](../1-docker)
5 | * [2 - Kubernetes Basics and cluster created](../2-kubernetes)
6 | 
7 | ## Summary
8 | 
9 | In this module, you will learn how to:
10 | * Run Jupyter Notebooks locally using Docker
11 | * Run Jupyter Notebooks on Kubernetes
12 | 
13 | ## How Jupyter Notebooks work
14 | 
15 | The [Jupyter Notebook](http://jupyter.org/) is an open source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text for rapid prototyping. It is often used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more. To better support exploratory iteration and to accelerate the computation of TensorFlow jobs, let's look at how we can include data science tools like Jupyter Notebook with Docker and Kubernetes.
16 | 
17 | ## Exercises
18 | 
19 | ### Exercise 1: Run Jupyter Notebooks locally using Docker
20 | 
21 | In this first exercise, we will run Jupyter Notebooks locally using Docker. We will use the official TensorFlow Docker image, as it comes with Jupyter Notebook.
22 | 
23 | ```console
24 | docker run -it -p 8888:8888 tensorflow/tensorflow
25 | ```
26 | 
27 | #### Validation
28 | 
29 | To verify, browse to the URL in the output log.
30 | 
31 | For example: `http://localhost:8888/?token=a3ea3cd914c5b68149e2b4a6d0220eca186fec41563c0413`
32 | 
33 | 
34 | ### Exercise 2: Run Jupyter Notebooks on Kubernetes
35 | 
36 | In this exercise, we will run Jupyter Notebooks on a Kubernetes cluster.
37 | 
38 | As a prerequisite, you should already have a Kubernetes cluster running; you can follow [module 2 - Kubernetes](../2-kubernetes) to create your own cluster.
39 | 
40 | Similar to running Jupyter Notebooks locally using Docker, we can again use the official TensorFlow Docker image, as it comes with Jupyter Notebook. But here we can run many instances of Jupyter Notebook in the cluster to handle additional load.
41 | 
42 | To run Jupyter Notebook using Kubernetes, you need to:
43 | * Create a Pod using the TensorFlow image
44 | * Expose port 8888 to run Jupyter Notebook
45 | * [With GPU] Mount the NVIDIA libraries from the host VM to a custom directory in the container
46 | * Create a Service to run Jupyter Notebook
47 | 
48 | #### Solution for Exercise 2
49 | 
50 | Create a YAML file like the one below.
51 | 
52 | 
53 | Solution for CPU only:

55 | 56 | ```yaml 57 | apiVersion: v1 58 | kind: Service 59 | metadata: 60 | labels: 61 | app: jupyter-server 62 | name: jupyter-server 63 | spec: 64 | ports: 65 | - port: 8888 66 | targetPort: 8888 67 | selector: 68 | app: jupyter-server 69 | type: LoadBalancer 70 | --- 71 | apiVersion: extensions/v1beta1 72 | kind: Deployment 73 | metadata: 74 | name: jupyter-server 75 | spec: 76 | replicas: 1 77 | template: 78 | metadata: 79 | labels: 80 | app: jupyter-server 81 | spec: 82 | containers: 83 | - args: 84 | image: tensorflow/tensorflow 85 | name: jupyter-server 86 | ports: 87 | - containerPort: 8888 88 | ``` 89 | 90 |

91 |
92 | 93 |
94 | Solution with GPU:

96 | 97 | ```yaml 98 | apiVersion: v1 99 | kind: Service 100 | metadata: 101 | labels: 102 | app: jupyter-server 103 | name: jupyter-server 104 | spec: 105 | ports: 106 | - port: 8888 107 | targetPort: 8888 108 | selector: 109 | app: jupyter-server 110 | type: LoadBalancer 111 | --- 112 | apiVersion: extensions/v1beta1 113 | kind: Deployment 114 | metadata: 115 | name: jupyter-server 116 | spec: 117 | replicas: 1 118 | template: 119 | metadata: 120 | labels: 121 | app: jupyter-server 122 | spec: 123 | containers: 124 | - name: jupyter-server 125 | image: tensorflow/tensorflow:latest-gpu 126 | ports: 127 | - containerPort: 8888 128 | imagePullPolicy: IfNotPresent 129 | env: 130 | - name: LD_LIBRARY_PATH 131 | value: /usr/lib/nvidia:/usr/lib/x86_64-linux-gnu 132 | resources: 133 | requests: 134 | alpha.kubernetes.io/nvidia-gpu: 1 135 | volumeMounts: 136 | - mountPath: /usr/local/nvidia/bin 137 | name: bin 138 | - mountPath: /usr/lib/nvidia 139 | name: lib 140 | - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1 141 | name: libcuda 142 | volumes: 143 | - name: bin 144 | hostPath: 145 | path: /usr/lib/nvidia-384/bin 146 | - name: lib 147 | hostPath: 148 | path: /usr/lib/nvidia-384 149 | - name: libcuda 150 | hostPath: 151 | path: /usr/lib/x86_64-linux-gnu/libcuda.so.1 152 | ``` 153 | 154 |

155 |
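As a side note, once the GPU variant above is running, you can confirm from a notebook cell that TensorFlow actually sees the GPU. Here is a minimal sketch using the TensorFlow 1.x API that ships in the `tensorflow/tensorflow:latest-gpu` image used above:

```python
# Run this inside a notebook cell on the GPU-backed Jupyter server.
import tensorflow as tf
from tensorflow.python.client import device_lib

# Prints something like "/device:GPU:0" when the NVIDIA libraries are mounted
# correctly; an empty string means TensorFlow only sees the CPU.
print(tf.test.gpu_device_name())

# Lists every device TensorFlow can use (CPU and, hopefully, GPU:0).
print([d.name for d in device_lib.list_local_devices()])
```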
156 | 
157 | Save the YAML file, then deploy it to your Kubernetes cluster by running:
158 | 
159 | ```console
160 | kubectl create -f <path-to-your-yaml-file>
161 | ```
162 | 
163 | #### Validation
164 | 
165 | After the deployment is created, a pod running TensorFlow will be created, along with a new service for the Jupyter Notebook. The new service will acquire a new external IP to serve Jupyter Notebook on port 8888. This may take a few minutes to complete.
166 | 
167 | To verify, run the following to view the output log and get the URL and token of the hosted Jupyter Notebook:
168 | 
169 | ```console
170 | kubectl logs jupyter-server-xxxxx
171 | 
172 | # sample output
173 | 
174 | http://localhost:8888/?token=2e7c875bd4e72137911d33e209c91d01f7a7b44868cf664d
175 | 
176 | ```
177 | 
178 | Next, to get the public IP of the new service created for Jupyter Notebook, run:
179 | 
180 | ```console
181 | kubectl get svc jupyter-server -o jsonpath={.status.loadBalancer.ingress[0].ip}
182 | 
183 | xx.xx.xx.xx
184 | ```
185 | From a browser, navigate to the Jupyter Notebook with the following URL, replacing `PUBLICIP` with the output from the previous step:
186 | 
187 | ```
188 | http://PUBLICIP:8888/?token=2e7c875bd4e72137911d33e209c91d01f7a7b44868cf664d
189 | ```
190 | 
191 | 
192 | 
193 | 
194 | 
195 | 
196 | 
197 | 
198 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017 Microsoft
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ### :warning: This repository is deprecated! Go to [Azure/kubeflow-labs](https://github.com/Azure/kubeflow-labs) instead :warning:
2 | 
3 | # Train TensorFlow Models at Scale with Kubernetes on Azure
4 | 
5 | 
8 | 
9 | ## Prerequisites
10 | 
11 | 1. Have a valid Microsoft Azure subscription allowing the creation of an ACS cluster
12 | 1. Docker client installed: [Installing Docker](https://www.docker.com/community-edition)
13 | 1. Azure-cli (2.0) installed: [Installing the Azure CLI 2.0 | Microsoft Docs](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest)
14 | 1. Git CLI installed: [Installing Git CLI](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
15 | 1. Kubectl installed: [Installing Kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
16 | 1. Helm installed: [Installing Helm CLI](https://docs.helm.sh/using_helm/#from-the-binary-releases) (**Note**: On Windows you can extract the `tar` file using a tool like 7Zip.)
17 | 
18 | Clone this repository somewhere so you can easily access the different source files:
19 | ```console
20 | git clone https://github.com/wbuchwalter/tensorflow-k8s-azure
21 | ```
22 | 
23 | ## Content Summary
24 | 
25 | | | Module | Description |
26 | | --- | --- | --- |
27 | |0| **[Introduction](0-intro)** | Introduction to this workshop. Motivations and goals.|
28 | |1| **[Docker](1-docker)** | Docker and containers 101.|
29 | |2| **[Kubernetes](2-kubernetes)** | Overview of important Kubernetes concepts.|
30 | |3| **[Helm](3-helm)** | Introduction to Helm. |
31 | |4| **[GPUs](4-gpus)** | How to use GPUs with Kubernetes.|
32 | |5| **[TFJob](5-tfjob)** | How to use `tensorflow/k8s` and `TFJob` to deploy a simple TensorFlow training.|
33 | |6| **[Distributed TensorFlow](6-distributed-tensorflow)** | Going distributed with `TFJob`.|
34 | |7| **[Hyperparameters Sweep with Helm](7-hyperparam-sweep)** | Using Helm to deploy a large number of trainings testing different hypotheses, and to monitor and compare them. |
35 | |8| **[Going Further](8-going-further)** | Links and resources to go further: Autoscaling, Distributed Storage. |
36 | |9| **[Jupyter Notebooks](9-jupyter)** | Easily deploy a Jupyter Notebook instance on Kubernetes. |
37 | 
--------------------------------------------------------------------------------