├── .gitignore ├── 0-intro ├── README.md ├── thumbnail.png └── workflow.png ├── 1-docker ├── README.md └── src │ ├── Dockerfile │ ├── Dockerfile.gpu │ └── main.py ├── 10-going-further ├── NFSonAzureConcept.png └── README.md ├── 2-kubernetes └── README.md ├── 3-helm ├── README.md └── dokuwiki.png ├── 4-kubeflow └── README.md ├── 5-jupyterhub ├── README.md └── jupyterhub.png ├── 6-tfjob ├── README.md └── file-share.png ├── 7-distributed-tensorflow ├── README.md ├── solution-src │ ├── Dockerfile │ ├── Dockerfile.gpu │ └── main.py └── tensorboard.png ├── 8-hyperparam-sweep ├── README.md ├── solution-chart │ ├── Chart.yaml │ ├── templates │ │ ├── _helpers.tpl │ │ └── deployment.yaml │ └── values.yaml ├── src │ ├── Dockerfile │ ├── Dockerfile.gpu │ ├── main.py │ ├── requirements.txt │ └── starry.jpg └── tensorboard.png ├── 9-serving ├── README.md ├── data │ ├── 0.png │ ├── 1.png │ ├── 2.png │ ├── 3.png │ ├── 4.png │ ├── 5.png │ ├── 6.png │ ├── 7.png │ ├── 8.png │ └── 9.png ├── mnist_client.py └── requirements.txt ├── LICENSE ├── LICENSE-CODE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Binaries for programs and plugins 2 | *.exe 3 | *.dll 4 | *.so 5 | *.dylib 6 | 7 | # Test binary, build with `go test -c` 8 | *.test 9 | 10 | # Output of the go coverage tool, specifically when used with LiteIDE 11 | *.out 12 | 13 | # Project-local glide cache, RE: https://github.com/Masterminds/glide/issues/736 14 | .glide/ 15 | .DS_Store 16 | -------------------------------------------------------------------------------- /0-intro/README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | This labs will walk you through setting up [Kubeflow](https://github.com/kubeflow/kubeflow) on a kubernetes cluster on Azure Container Service (AKS). 4 | 5 | We will then take a look at how to use the different components that make up Kubeflow. 6 | 7 | ## Motivations 8 | 9 | Machine learning model development and operationalization currently has very few industry-wide best practices to help us reduce the time to market and optimize the different steps. 10 | 11 | However in traditional application development, DevOps practices are becoming ubiquitous. 12 | We can benefit from many of these practices by applying them to model development and operationalization. 13 | 14 | Here are a subset of pain points that exists in a typical ML workflow. 15 | #### A Typical (Simplified) ML Workflow and its Pain Points 16 | ![Typical Workflow](workflow.png) 17 | 18 | This workshop is going to focus on improving the training and serving process by leveraging containers and Kubernetes. 19 | 20 | Today many data scientists are training their models either on their physical workstation (be it a laptop or a desktop with multiple GPUs) or using a VM (sometime, but rarely, a couple of them) in the cloud. 21 | 22 | This approach is sub-optimal for many reasons, among which: 23 | * Training is slow and sequential 24 | * Having a single (or few) GPU on hand, means there is only so much trainings you can do at the time. It also means that once your GPU is busy with a training you cannot use it to do something else, just as smaller experiments. 25 | * Hyper-parameter sweeping is vastly inefficient: The different hypothesis you want to test will run sequentially and not in parallel. 
In practice, this means that very often we don't have time to really explore the hyper-parameter space, and we just run a couple of experiments that we think will yield the best results. 26 | The longer the training time, the fewer experiments we can run. 27 | * Distributed training is hard (or impossible) to set up 28 | * In practice, very few data scientists benefit from distributed training, either because they simply can't use it (you need multiple machines for that) or because it is too tedious to set up. 29 | * High cost 30 | * If each member of the team has their own allocated resources, in practice many of them will sit idle at any given time; given the price of a single GPU, this is very costly. On the other hand, pooling resources (such as sharing VMs) is also painful, since multiple people might want to use them at the same time. 31 | 32 | Using Kubernetes, we can alleviate many of these pain points: 33 | * Training is massively parallelizable 34 | * Kubernetes is highly scalable (up to 1200 VMs for a single cluster on Azure). In practice, that means you can run as many experiments as you want at the same time. This makes exploring and comparing different hypotheses much simpler and more efficient. 35 | * Distributed training is much simpler 36 | * As we will see in this workshop, it is very easy to set up distributed TensorFlow training on Kubernetes and scale it to whatever size you want, making it much more usable in practice. 37 | * Optimized cost with autoscaling* 38 | * Kubernetes allows for resource pooling while at the same time ensuring that any training job can run without waiting for another one to finish. 39 | * With autoscaling, the cluster can automatically scale out or in to ensure maximum utilization, thus keeping the cost as low as possible. 40 | 41 | *While autoscaling is very powerful, it is outside the scope of this workshop. However, we will give you resources and pointers to get started with it. 42 | 43 | 44 | ## OpenAI: Building the Infrastructure that Powers the Future of AI 45 | 46 | During KubeCon 2017, Vicki Cheung and Jonas Schneider delivered a keynote explaining how OpenAI manages training at very large scale with Kubernetes; it is worth watching: 47 | 48 | ![OpenAI](./thumbnail.png) 49 | 50 | ## Next Step 51 | [Module 1: Docker](../1-docker/README.md) 52 | -------------------------------------------------------------------------------- /0-intro/thumbnail.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/0-intro/thumbnail.png -------------------------------------------------------------------------------- /0-intro/workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/0-intro/workflow.png -------------------------------------------------------------------------------- /1-docker/README.md: -------------------------------------------------------------------------------- 1 | # Docker 2 | 3 | ### Summary 4 | 5 | In this section you will learn about: 6 | * Running Docker locally 7 | * Basics of Docker 8 | * Containerizing a simple application 9 | * Building and Pushing an Image 10 | 11 | ### Basics of Docker and Containers 12 | 13 | Docker has a very well-structured six-part tutorial.
14 | While for this workshop you don't need to go through all of them, part 1 and 2 are required: 15 | * [Get Started, Part 1: Orientation and setup](https://docs.docker.com/get-started) 16 | * [Get Started, Part 2: Containers](https://docs.docker.com/get-started/part2/) 17 | 18 | By the end of Part 2, you should have a simple container up and running, and understand the basic concepts of a container. 19 | 20 | #### Additional Important Docker Command 21 | 22 | Here a few other docker commands that are important to be aware of for the rest of this workshop: 23 | 24 | 1. `docker ps` 25 | 26 | The docker `ps` command allows to list the status of the containers. 27 | 28 | A container could be either stopped or running. When it finishes to execute the process, it will stop. 29 | 30 | For example if you run the command `docker run -it ubuntu hostname` this will : 31 | - Pull the official ubuntu image from the registry 32 | - Start the container in the interactive mode `-it` 33 | - Execute the command : `hostname` 34 | - Stop 35 | 36 | ``` 37 | $ docker run -it ubuntu hostname 38 | 0d0af5005fc7 39 | ``` 40 | 41 | If you run the command `docker ps -a` you should see : 42 | ``` 43 | $ docker ps -a 44 | CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 45 | 0d0af5005fc7 ubuntu "hostname" 58 seconds ago Exited (0) About a minute ago gifted_darwin 46 | ``` 47 | 48 | > The `-a` allows you to list all the containers, not just the one that is running 49 | 50 | We can notice a few things here such as : 51 | - The status `Exited...` for our container 52 | - The name `gifted_darwin` randomly generated for our container, we can specify a custom one using the command `--name` explained in the previous section. 53 | - We can re-execute our container using the command `docker start gifted_darwin` 54 | - We can run the command `docker run -it ubuntu hostname` again and do a `docker ps -a`; we should see how two containers exited. 55 | 56 | 1. `docker logs` 57 | 58 | The docker `logs` allow to fetch the console output from inside the container 59 | 60 | From our previous example, we can run `docker logs gifted_darwin` 61 | 62 | ``` 63 | $ docker logs gifted_darwin 64 | 0d0af5005fc7 65 | ``` 66 | 67 | You can also stream the logs using the command `-f` and print in real time in your console the stdout of your container. 68 | 69 | 1. `docker rm` 70 | 71 | The docker `rm` command allows to remove a container. 72 | 73 | From the previous example, we can see that we have a container listed as exited in our environment, or maybe more if we run the same command `docker run -it ubuntu hostname` multiple times. If we want to do some cleaning and remove those executions from our environment we can use the command `docker rm` 74 | 75 | ``` 76 | $ docker rm gifted_darwin 77 | gifted_darwin 78 | ``` 79 | 80 | > You can either specify the **CONTAINER ID** or the **NAME** of the container to refer to it 81 | 82 | 83 | 1. `docker images` 84 | 85 | This command allows us to list all the base images available in the environment. 
86 | 87 | ``` 88 | $ docker images 89 | REPOSITORY TAG IMAGE ID CREATED SIZE 90 | ubuntu latest 20c44cd7596f 2 days ago 123MB 91 | example-scratch latest 32ff7b65f567 5 days ago 30.7MB 92 | node 8.9.1-slim a6bb2cc1118f 11 days ago 230MB 93 | buildpack-deps xenial a27b6a8abd1c 2 weeks ago 644MB 94 | ``` 95 | 96 | > You can manage your images by removing them using `docker rmi IMAGENAME` or pulling a new one with `docker pull IMAGENAME` 97 | 98 | ### Containerizing a TensorFlow model 99 | 100 | Now that we understand the basics of Docker, let's containerize our first TensorFlow model that we will reuse in the following modules. 101 | Our first model will be a very simple MNIST classifier. You can see the source code in [`./src/main.py`](./src/main.py). 102 | As you can see there is nothing specific to containers in this code, you can run this script directly on your laptop or on a VM. 103 | 104 | Now, to have this run in a container, we need to build an image containing this code and it's dependencies. 105 | As you saw in the tutorial, we will use a `Dockerfile` to do this. 106 | 107 | Here is the (very simple) `Dockerfile` that we are going to use for this model (located in [`./src/Dockerfile`](./src/Dockerfile)): 108 | 109 | ```dockerfile 110 | FROM tensorflow/tensorflow:1.10.0 111 | COPY main.py /app/main.py 112 | 113 | ENTRYPOINT ["python", "/app/main.py"] 114 | ``` 115 | 116 | As you can see, we are not building a new image from scratch, instead we are using a base image from TensorFlow. Indeed, TensorFlow has a bunch of base images that you can start with. 117 | You can see the full list here: https://hub.docker.com/r/tensorflow/tensorflow/tags/. 118 | 119 | What is important to note is that different tags need to be used depending on if you want to use GPU or not. 120 | For example, if you wanted to run your model with TensorFlow 1.10.0 and CPU only, you would use `tensorflow/tensorflow:1.10.0`. 121 | If instead you wanted to use GPU, you would start from `tensorflow/tensorflow:1.10.0-gpu`. 122 | 123 | The two other instructions are pretty straightforward, first we copy our script into the container, and then we set this script as the entry point for our container, so that any argument passed to our container would actually get passed to our script. 124 | 125 | #### Building the image 126 | 127 | > If you don't already have a Docker account, see [Log in with your Docker ID](https://docs.docker.com/get-started/part2/#log-in-with-your-docker-id). 128 | 129 | The next step is to build our image to be able to run it using docker. For that, we will use the command `docker build`. 130 | 131 | From the [`./src`](./src) repository, we can build the image with 132 | 133 | ```console 134 | cd src 135 | docker build -t ${DOCKER_USERNAME}/tf-mnist . 136 | ``` 137 | > Reminder: the `-t` argument allows to **tag** the image with a specific name. 138 | 139 | `${DOCKER_USERNAME}` should be your Docker username that you use to connect to Docker Hub. 
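If you have not defined that variable in your shell yet, a minimal sketch for a bash-like shell looks like this (the value is a placeholder, substitute your own Docker Hub username):

```console
export DOCKER_USERNAME=<your-docker-hub-username>
docker build -t ${DOCKER_USERNAME}/tf-mnist .
```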
140 | 141 | The output from this command should look like this: 142 | 143 | ``` 144 | Sending build context to Docker daemon 11.26kB 145 | Step 1/3 : FROM tensorflow/tensorflow:1.10.0 146 | ---> a61a91cc0d1b 147 | Step 2/3 : COPY main.py /app/main.py 148 | ---> b264d6e9a5ef 149 | Removing intermediate container fe8128425296 150 | Step 3/3 : ENTRYPOINT python /app/main.py 151 | ---> Running in 7acb7aac7a9f 152 | ---> 92c7ed17916b 153 | Removing intermediate container 7acb7aac7a9f 154 | Successfully built 92c7ed17916b 155 | Successfully tagged wbuchwalter/tf-mnist:latest 156 | ``` 157 | Let's analyse this image full name (`wbuchwalter/tf-mnist:latest`): 158 | * `wbuchwalter` is the name of my repository, this is where we can find the image. This will be different for you (same as your docker hub username). 159 | * `tf-mnist` is the name of the image itself 160 | * `latest` is the tag. `latest` is the default tag if you don't specify any. Tags are usually used to denote different versions or flavors of a same image. For example you could have a tag `v1` and `v2` to denote different versions, or `cpu` and `gpu` to denote what hardware it can run on. 161 | 162 | When you have the successfully built message, you should now be able to see if your image is locally available with the command `docker images` described earlier. 163 | 164 | #### Running the image 165 | 166 | Now we can try to run it locally using the `docker run` command. 167 | By default the model will run 1000 training steps which can take a few minutes on a laptop. Let's reduce this number to 100 with the `--max_steps` argument. 168 | 169 | ```console 170 | docker run -it ${DOCKER_USERNAME}/tf-mnist --max_steps 100 171 | ``` 172 | 173 | If everything is okay you should see the model training: 174 | 175 | ``` 176 | Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes. 177 | Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz 178 | Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. 179 | Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz 180 | Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. 181 | Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz 182 | Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. 183 | Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz 184 | 2017-11-29 18:32:41.992194: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA 185 | Accuracy at step 0: 0.1292 186 | Accuracy at step 10: 0.7198 187 | Accuracy at step 20: 0.834 188 | Accuracy at step 30: 0.8698 189 | Accuracy at step 40: 0.8783 190 | Accuracy at step 50: 0.8968 191 | Accuracy at step 60: 0.9023 192 | Accuracy at step 70: 0.9059 193 | Accuracy at step 80: 0.9084 194 | Accuracy at step 90: 0.9154 195 | Adding run metadata for 99 196 | ``` 197 | 198 | You can kill the process and exit the container at any time with `ctrl + c`. 199 | 200 | ### Running the image with the NVIDIA GPU of your machine (If you have one) 201 | 202 | **Currently, running docker containers with GPU is only supported on Linux.** 203 | 204 | First install [nvidia-docker](https://github.com/NVIDIA/nvidia-docker). 205 | 206 | You also need to make sure the image you are going to use is optimized for GPU. 
207 | In our example you need to modify the `Dockerfile` to use a TensorFlow image built for GPU: 208 | 209 | ```dockerfile 210 | FROM tensorflow/tensorflow:1.10.0-gpu 211 | COPY main.py /app/main.py 212 | 213 | ENTRYPOINT ["python", "/app/main.py"] 214 | ``` 215 | 216 | Then simply rebuild the image with a new tag (you can use docker or nvidia-docker interchangeably for any command except run): 217 | 218 | ```console 219 | cd src 220 | docker build -t ${DOCKER_USERNAME}/tf-mnist:gpu -f Dockerfile.gpu . 221 | ``` 222 | 223 | Finally run the container with nvidia-docker: 224 | 225 | ```console 226 | nvidia-docker run -it ${DOCKER_USERNAME}/tf-mnist:gpu 227 | ``` 228 | 229 | > Note: If the command fails with `Unknown runtime specified nvidia`, follow the steps described here (Systemd drop-in file): https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup 230 | 231 | #### Publish the Image 232 | 233 | Our image is now built and running locally, but what about sharing it to be able to use it from anywhere by anyone? 234 | Most importantly we want to be able to reuse this image on the Kubernetes cluster we are going to create in module 2. 235 | So let's push our image to Docker Hub: 236 | 237 | ```console 238 | docker push ${DOCKER_USERNAME}/tf-mnist:gpu 239 | ``` 240 | 241 | If this command doesn't look familiar to you, make sure you went through part 1 and 2 of Docker's tutorial, and more precisely: [Tutorial - Share your image](https://docs.docker.com/get-started/part2/#share-your-image) 242 | 243 | 244 | ### Useful Links 245 | * [What is Docker ?](https://www.docker.com/what-docker) 246 | * [Docker for beginner](https://github.com/docker/labs/blob/master/beginner/readme.md) 247 | 248 | 249 | ## Next Step 250 | [2 - Kubernetes](../2-kubernetes/README.md) 251 | -------------------------------------------------------------------------------- /1-docker/src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.10.0 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] 5 | -------------------------------------------------------------------------------- /1-docker/src/Dockerfile.gpu: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.10.0-gpu 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] 5 | -------------------------------------------------------------------------------- /1-docker/src/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the 'License'); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an 'AS IS' BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """A simple MNIST classifier which displays summaries in TensorBoard. 
16 | 17 | This is an unimpressive MNIST model, but it is a good example of using 18 | tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of 19 | naming summary tags so that they are grouped meaningfully in TensorBoard. 20 | 21 | It demonstrates the functionality of every TensorBoard dashboard. 22 | """ 23 | from __future__ import absolute_import 24 | from __future__ import division 25 | from __future__ import print_function 26 | 27 | import argparse 28 | import os 29 | import sys 30 | 31 | import tensorflow as tf 32 | 33 | from tensorflow.examples.tutorials.mnist import input_data 34 | 35 | FLAGS = None 36 | 37 | 38 | def train(): 39 | # Import data 40 | mnist = input_data.read_data_sets(FLAGS.data_dir, 41 | one_hot=True, 42 | fake_data=FLAGS.fake_data) 43 | 44 | # Create a multilayer model. 45 | 46 | # Input placeholders 47 | with tf.name_scope('input'): 48 | x = tf.placeholder(tf.float32, [None, 784], name='x-input') 49 | y_ = tf.placeholder(tf.float32, [None, 10], name='y-input') 50 | 51 | with tf.name_scope('input_reshape'): 52 | image_shaped_input = tf.reshape(x, [-1, 28, 28, 1]) 53 | tf.summary.image('input', image_shaped_input, 10) 54 | 55 | # We can't initialize these variables to 0 - the network will get stuck. 56 | def weight_variable(shape): 57 | """Create a weight variable with appropriate initialization.""" 58 | initial = tf.truncated_normal(shape, stddev=0.1) 59 | return tf.Variable(initial) 60 | 61 | def bias_variable(shape): 62 | """Create a bias variable with appropriate initialization.""" 63 | initial = tf.constant(0.1, shape=shape) 64 | return tf.Variable(initial) 65 | 66 | def variable_summaries(var): 67 | """Attach a lot of summaries to a Tensor (for TensorBoard visualization).""" 68 | with tf.name_scope('summaries'): 69 | mean = tf.reduce_mean(var) 70 | tf.summary.scalar('mean', mean) 71 | with tf.name_scope('stddev'): 72 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 73 | tf.summary.scalar('stddev', stddev) 74 | tf.summary.scalar('max', tf.reduce_max(var)) 75 | tf.summary.scalar('min', tf.reduce_min(var)) 76 | tf.summary.histogram('histogram', var) 77 | 78 | def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu): 79 | """Reusable code for making a simple neural net layer. 80 | 81 | It does a matrix multiply, bias add, and then uses ReLU to nonlinearize. 82 | It also sets up name scoping so that the resultant graph is easy to read, 83 | and adds a number of summary ops. 84 | """ 85 | # Adding a name scope ensures logical grouping of the layers in the graph. 86 | with tf.name_scope(layer_name): 87 | # This Variable will hold the state of the weights for the layer 88 | with tf.name_scope('weights'): 89 | weights = weight_variable([input_dim, output_dim]) 90 | variable_summaries(weights) 91 | with tf.name_scope('biases'): 92 | biases = bias_variable([output_dim]) 93 | variable_summaries(biases) 94 | with tf.name_scope('Wx_plus_b'): 95 | preactivate = tf.matmul(input_tensor, weights) + biases 96 | tf.summary.histogram('pre_activations', preactivate) 97 | activations = act(preactivate, name='activation') 98 | tf.summary.histogram('activations', activations) 99 | return activations 100 | 101 | hidden1 = nn_layer(x, 784, 500, 'layer1') 102 | 103 | with tf.name_scope('dropout'): 104 | keep_prob = tf.placeholder(tf.float32) 105 | tf.summary.scalar('dropout_keep_probability', keep_prob) 106 | dropped = tf.nn.dropout(hidden1, keep_prob) 107 | 108 | # Do not apply softmax activation yet, see below. 
109 | y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity) 110 | 111 | with tf.name_scope('cross_entropy'): 112 | # The raw formulation of cross-entropy, 113 | # 114 | # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)), 115 | # reduction_indices=[1])) 116 | # 117 | # can be numerically unstable. 118 | # 119 | # So here we use tf.nn.softmax_cross_entropy_with_logits on the 120 | # raw outputs of the nn_layer above, and then average across 121 | # the batch. 122 | diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) 123 | with tf.name_scope('total'): 124 | cross_entropy = tf.reduce_mean(diff) 125 | tf.summary.scalar('cross_entropy', cross_entropy) 126 | 127 | with tf.name_scope('train'): 128 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize( 129 | cross_entropy) 130 | 131 | with tf.name_scope('accuracy'): 132 | with tf.name_scope('correct_prediction'): 133 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 134 | with tf.name_scope('accuracy'): 135 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 136 | tf.summary.scalar('accuracy', accuracy) 137 | 138 | # Merge all the summaries and write them out to 139 | # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default) 140 | merged = tf.summary.merge_all() 141 | 142 | def feed_dict(train): 143 | """Make a TensorFlow feed_dict: maps data onto Tensor placeholders.""" 144 | if train or FLAGS.fake_data: 145 | xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data) 146 | k = FLAGS.dropout 147 | else: 148 | xs, ys = mnist.test.images, mnist.test.labels 149 | k = 1.0 150 | return {x: xs, y_: ys, keep_prob: k} 151 | 152 | sess = tf.InteractiveSession() 153 | train_writer = tf.summary.FileWriter(FLAGS.log_dir + '/train', sess.graph) 154 | test_writer = tf.summary.FileWriter(FLAGS.log_dir + '/test') 155 | tf.global_variables_initializer().run() 156 | # Train the model, and also write summaries. 
157 | # Every 10th step, measure test-set accuracy, and write test summaries 158 | # All other steps, run train_step on training data, & add training summaries 159 | 160 | 161 | for i in range(FLAGS.max_steps): 162 | if i % 10 == 0: # Record summaries and test-set accuracy 163 | summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False)) 164 | test_writer.add_summary(summary, i) 165 | print('Accuracy at step %s: %s' % (i, acc)) 166 | else: # Record train set summaries, and train 167 | if i % 100 == 99: # Record execution stats 168 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 169 | run_metadata = tf.RunMetadata() 170 | summary, _ = sess.run([merged, train_step], 171 | feed_dict=feed_dict(True), 172 | options=run_options, 173 | run_metadata=run_metadata) 174 | train_writer.add_run_metadata(run_metadata, 'step%03d' % i) 175 | train_writer.add_summary(summary, i) 176 | print('Adding run metadata for', i) 177 | else: # Record a summary 178 | summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True)) 179 | train_writer.add_summary(summary, i) 180 | train_writer.close() 181 | test_writer.close() 182 | 183 | 184 | def main(_): 185 | train() 186 | 187 | 188 | if __name__ == '__main__': 189 | parser = argparse.ArgumentParser() 190 | parser.add_argument('--fake_data', nargs='?', const=True, type=bool, 191 | default=False, 192 | help='If true, uses fake data for unit testing.') 193 | parser.add_argument('--max_steps', type=int, default=1000, 194 | help='Number of steps to run trainer.') 195 | parser.add_argument('--learning_rate', type=float, default=0.001, 196 | help='Initial learning rate') 197 | parser.add_argument('--dropout', type=float, default=0.9, 198 | help='Keep probability for training dropout.') 199 | parser.add_argument( 200 | '--data_dir', 201 | type=str, 202 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 203 | 'tensorflow/input_data'), 204 | help='Directory for storing input data') 205 | parser.add_argument( 206 | '--log_dir', 207 | type=str, 208 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 209 | 'tensorflow/logs'), 210 | help='Summaries log directory') 211 | FLAGS, unparsed = parser.parse_known_args() 212 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) -------------------------------------------------------------------------------- /10-going-further/NFSonAzureConcept.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/10-going-further/NFSonAzureConcept.png -------------------------------------------------------------------------------- /10-going-further/README.md: -------------------------------------------------------------------------------- 1 | # Going Further 2 | 3 | ## Summary 4 | 5 | * Learn how to deal with very large amount of input data 6 | * Learn how to autoscale a kubernetes cluster on acs-engine. 7 | 8 | ## Working with large amount of data 9 | 10 | In the previous modules, we saw how we could easily scale our trainings on Kubernetes to a large amount of nodes. 11 | We also saw how Azure Files provides an easy way to add persistent storage to our trainings to save the output models, summaries etc. 12 | 13 | However in every example we worked with so far, the training data was either downloaded at run time, or directly baked into the container. While this can work for very small datasets, it is not a scalable approach for larger ones. 
14 | Just as we used Azure Files to store models and logs, we can use Azure Files to store our dataset and mount the share in our containers. This work well only up to a certain point though (say a few GBs), as an Azure file share is limited to 1000 IOPS. 15 | 16 | So how can we deal with larger datasets? 17 | One solution to this is to use a distributed file system. 18 | 19 | ### Distributed File System, Tools and Concepts 20 | 21 | A distributed file system aggregates various storages over network into a single large network file system. 22 | Such a file system can be deployed inside your Kubernetes cluster, and can use the disks already attached to your nodes as a partition of the overall network file system. 23 | 24 | ![](NFSonAzureConcept.png) 25 | 26 | Here are some tools and frameworks that can make it easy to deploy such a distributed file system on Kubernetes: 27 | 28 | * [GlusterFS](http://www.gluster.org/) 29 | * [Rook](https://rook.io/) 30 | * [Portworx](https://portworx.com/) 31 | * [Pachyderm](http://pachyderm.io/) 32 | 33 | ## Autoscaling a Kubernetes Cluster 34 | 35 | As we saw in module [7 - Distributed TensorFlow](../7-distributed-tensorflow/) and [8 - Hyperparameters Sweep](../8-hyperparam-sweep), being able to autoscale our Kubernetes cluster can be extremely useful. 36 | Indeed, automatic scale-out (adding more VMs to the cluster) would allow anyone to run any experiment they want, whenever they want without needing to involve an ops person to setup and prepare virtual machines. 37 | And because the cluster can also automatically scale-in by deleting idle VMs once training jobs are completed, we can keep the cost as low as possible by just paying for what we actually use. 38 | 39 | As of this writing, autoscaling is only supported on Kubernetes cluster created with [acs-engine](https://github.com/Azure/acs-engine). 40 | 41 | See the following resources to get started: 42 | 43 | * [Kubernetes Azure cluster-autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/azure) 44 | -------------------------------------------------------------------------------- /2-kubernetes/README.md: -------------------------------------------------------------------------------- 1 | # Kubernetes 2 | 3 | ### Prerequisites 4 | 5 | - [Docker Basics](../1-docker/README.md) 6 | 7 | ### Summary 8 | 9 | In this module you will learn: 10 | 11 | - The basic concepts of Kubernetes 12 | - How to create a Kubernetes cluster on Azure 13 | 14 | > _Important_ : Kubernetes is very often abbreviated to **K8s**. This is the name we are going to use in this lab. 15 | 16 | ## The basic concepts of Kubernetes 17 | 18 | [Kubernetes](https://kubernetes.io/) is an open-source technology that makes it easier to automate deployment, scale, and manage containerized applications in a clustered environment. The ability to use GPUs with Kubernetes allows the clusters to facilitate running frequent experimentations, using it for high-performing serving, and auto-scaling of deep learning models, and much more. 19 | 20 | ### Overview 21 | 22 | Kubernetes is a system for managing containerized applications across a cluster of nodes. To work with Kubernetes, you use Kubernetes API objects to describe your cluster’s desired state: what applications or other workloads you want to run, what container images they use, the number of replicas, what network and disk resources you want to make available, and more. You set your desired state by creating objects using the Kubernetes API. 
Once you’ve set your desired state, the Kubernetes Control Plane works to make the cluster’s current state match the desired state. To do so, Kubernetes performs a variety of tasks automatically, such as starting or restarting containers, scaling the number of replicas of a given application, and more. 23 | 24 | ### Kubernetes Master 25 | 26 | The Kubernetes master is responsible for maintaining the desired state for your cluster. When you interact with Kubernetes, such as by using the kubectl command-line interface, you’re communicating with your cluster’s Kubernetes master. These master services can be installed on a single machine, or distributed across multiple machines. In the following [Provisioning a Kubernetes cluster on Azure](#provisioning-a-kubernetes-cluster-on-azure) section, we will be creating a Kubernetes cluster with 1 master. 27 | 28 | ### Kubernetes Nodes 29 | 30 | The worker nodes communicate with the master components, configure the networking for containers, and run the actual workloads assigned to them. In the following [Provisioning a Kubernetes cluster on Azure](#provisioning-a-kubernetes-cluster-on-azure) section, we will be creating a Kubernetes cluster with 3 worker nodes. 31 | 32 | ### Kubernetes Objects 33 | 34 | Kubernetes contains a number of abstractions that represent the state of your system: deployed containerized applications and workloads, their associated network and disk resources, and other information about what your cluster is doing. A Kubernetes object is a "record of intent" – once you create the object, the Kubernetes system will constantly work to ensure that object exists. By creating an object, you’re telling the Kubernetes system your cluster’s desired state. 35 | 36 | The basic Kubernetes objects include: 37 | 38 | - Pod - the smallest and simplest unit in the Kubernetes object model that you create or deploy. A Pod encapsulates an application container (or multiple containers), storage resources, a unique network IP, and options that govern how the container(s) should run. 39 | - Service - an abstraction which defines a logical set of Pods and a policy by which to access them. 40 | - Volume - an abstraction which allows data to be preserved across container restarts and allows data to be shared between different containers. 41 | - Namespace - a way to divide a physical cluster resources into multiple virtual clusters between multiple users. 42 | - Deployment - Manages pods and ensures a certain number of them are running. This is typically used to deploy pods that should always be up, such as a web server. 43 | - Job - A job creates one or more pods and ensures that a specified number of them successfully terminate. In other words, we use Job to run a task that finishes at some point, such as training a model. 44 | 45 | ### Creating a Kubernetes Object 46 | 47 | When you create an object in Kubernetes, you must provide the object specifications that describes its desired state, as well as some basic information about the object (such as a name) to the Kubernetes API either directly or via the `kubectl` command-line interface. Usually, you will provide the information to `kubectl` in a .yaml file. `kubectl` then converts the information to JSON when making the API request. In the next few sections, we will be using various yaml files to describe the Kubernetes objects we want to deploy to our Kubernetes cluster. 48 | 49 | For example, the `.yaml` file shown below includes the required fields and object spec for a Kubernetes Deployment. 
A Kubernetes Deployment is an object that can represent an application running on your cluster. In the example below, the Deployment spec describes the desired state of three replicas of the nginx application to be running. When you create the Deployment, the Kubernetes system reads the Deployment spec and starts three instances of your desired application, updating the status to match your spec. 50 | 51 | ```yaml 52 | apiVersion: apps/v1beta2 # Kubernetes API version for the object 53 | kind: Deployment # The type of object described by this YAML, here a Deployment 54 | metadata: 55 | name: nginx-deployment # Name of the deployment 56 | spec: # Actual specifications of this deployment 57 | replicas: 3 # Number of replicas (instances) for this deployment. 1 replica = 1 pod 58 | template: 59 | metadata: 60 | labels: 61 | app: nginx 62 | spec: # Specification for the Pod 63 | containers: # These are the containers running inside our Pod, in our case a single one 64 | - name: nginx # Name of this container 65 | image: nginx:1.7.9 # Image to run 66 | ports: # Ports to expose 67 | - containerPort: 80 68 | ``` 69 | 70 | To create all the objects described in a Deployment using a `.yaml` file like the one above in your own Kubernetes cluster you can use Kubernetes' CLI (`kubectl`). 71 | We will be creating a deployment in the exercise toward the end of this module, but first we need a cluster. 72 | 73 | ## Provisioning a Kubernetes cluster on Azure 74 | 75 | We are going to use AKS to create a GPU-enabled Kubernetes cluster. 76 | You could also use [aks-engine](https://github.com/Azure/aks-engine) if you prefer, this guide will assume you are using aks. 77 | 78 | ### A Note on GPUs with Kubernetes 79 | 80 | You can view AKS region availability in [Azure AKS docs](https://docs.microsoft.com/en-us/azure/aks/container-service-quotas#region-availability) 81 | 82 | You can find NVIDIA GPUs (N-series) availability in [region availability documentation](https://azure.microsoft.com/en-us/global-infrastructure/services/?products=virtual-machines®ions=all) 83 | 84 | If you want more options, you may want to use aks-engine for more flexibility. 85 | 86 | ### With the CLI 87 | 88 | #### Creating a resource group 89 | 90 | ```console 91 | az group create --name --location 92 | ``` 93 | 94 | With: 95 | 96 | | Parameter | Description | 97 | | ------------------- | -------------------------------------------------------------- | 98 | | RESOURCE_GROUP_NAME | Name of the resource group where the cluster will be deployed. | 99 | | LOCATION | Name of the region where the cluster should be deployed. | 100 | 101 | #### Creating the cluster 102 | 103 | ```console 104 | az aks create --node-vm-size --resource-group --name 105 | --node-count --kubernetes-version 1.12.5 --location --generate-ssh-keys 106 | ``` 107 | 108 | > Note : The kubernetes verion could change depending where you are deploying your cluster. You can get more informations running the `az aks get-versions` command. 109 | 110 | With: 111 | 112 | | Parameter | Description | 113 | | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 114 | | AGENT_SIZE | The size of K8s's agent VM. Choose `Standard_NC6` for GPUs or `Standard_D2_v2` if you just want CPUs. 
Full list of [options here](https://github.com/Azure/azure-sdk-for-python/blob/master/azure-mgmt-containerservice/azure/mgmt/containerservice/models/container_service_client_enums.py#L21). | 115 | | RG | Name of the resource group that was created in the previous step. | 116 | | NAME | Name of the AKS resource (can be whatever you want). | 117 | | AGENT_COUNT | The number of agents (virtual machines) that you want in your cluster. 3 or 4 is recommended to play with hyper-parameter tuning and distributed TensorFlow | 118 | | LOCATION | Same location that was specified for the resource group creation. | 119 | 120 | The command should take a few minutes to complete. Once it is done, the output should be a JSON object indicating among other things the `provisioningState`: 121 | 122 | ``` 123 | { 124 | [...] 125 | "provisioningState": "Succeeded", 126 | [...] 127 | } 128 | ``` 129 | 130 | #### Getting the `kubeconfig` file 131 | 132 | The `kubeconfig` file is a configuration file that will allow Kubernetes's CLI (`kubectl`) to know how to talk to our cluster. 133 | To download the `kubeconfig` file from the cluster we just created, run: 134 | 135 | ```console 136 | az aks get-credentials --name --resource-group 137 | ``` 138 | 139 | Where `NAME` and `RG` should be the same values as for the cluster creation. 140 | 141 | #### Installing NVIDIA Device Plugin (AKS only) 142 | 143 | For AKS, install NVIDIA Device Plugin using: 144 | 145 | For Kubernetes 1.10: 146 | 147 | ```bash 148 | kubectl apply -f https://raw.githubusercontent.com/nvidia/k8s-device-plugin/v1.10/nvidia-device-plugin.yml 149 | ``` 150 | 151 | For Kubernetes 1.11 and above: 152 | 153 | ```bash 154 | kubectl apply -f https://raw.githubusercontent.com/nvidia/k8s-device-plugin/v1.11/nvidia-device-plugin.yml 155 | ``` 156 | 157 | For AKS Engine, NVIDIA Device Plugin will automatically installed with N-Series GPU clusters. 158 | 159 | ##### Validation 160 | 161 | Once you are done with the cluster creation, and downloaded the `kubeconfig` file, running the following command: 162 | 163 | ```console 164 | kubectl get nodes 165 | ``` 166 | 167 | Should yield an output similar to this one: 168 | 169 | ``` 170 | NAME STATUS ROLES AGE VERSION 171 | aks-nodepool1-42640332-0 Ready agent 1h v1.11.1 172 | aks-nodepool1-42640332-1 Ready agent 1h v1.11.1 173 | aks-nodepool1-42640332-2 Ready agent 1h v1.11.1 174 | ``` 175 | 176 | If you provisioned GPU VM, describing one of the node should indicate the presence of GPU(s) on the node: 177 | 178 | ```console 179 | > kubectl describe node 180 | 181 | [...] 182 | Capacity: 183 | nvidia.com/gpu: 1 184 | [...] 185 | ``` 186 | 187 | > Note: In some scenarios, you might not see GPU resources under Capacity. To resolve this, you must install a daemonset as described in the troubleshooting section here: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster 188 | 189 | ## Exercise 190 | 191 | ### Running our Model on Kubernetes 192 | 193 | > Note: If you didn't complete the exercise in module 1, you can use `wbuchwalter/tf-mnist` image for this exercise. 194 | 195 | In module 1, we created an image for our MNIST classifier, ran a small training locally and pushed this image to Docker Hub. 196 | Since we now have a running Kubernetes cluster, let's run our training on it! 197 | 198 | First, we need to create a YAML template to define what we want to deploy. 
199 | We want our deployment to have a few characteristics: 200 | 201 | - It should be a `Job` since we expect the training to finish successfully after some time. 202 | - It should run the image you created in module 1 (or `wbuchwalter/tf-mnist` if you skipped this module). 203 | - The `Job` should be named `2-mnist-training`. 204 | - We want our training to run for `500` steps. 205 | - We want our training to use 1 GPU 206 | 207 | Here is what this would look like in YAML format: 208 | 209 | ```yaml 210 | apiVersion: batch/v1 211 | kind: Job # Our training should be a Job since it is supposed to terminate at some point 212 | metadata: 213 | name: module2-ex1 # Name of our job 214 | spec: 215 | template: # Template of the Pod that is going to be run by the Job 216 | metadata: 217 | name: module2-ex1 # Name of the pod 218 | spec: 219 | containers: # List of containers that should run inside the pod, in our case there is only one. 220 | - name: tensorflow 221 | image: ${DOCKER_USERNAME}/tf-mnist:gpu # The image to run, you can replace by your own. 222 | args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile 223 | resources: 224 | limits: 225 | nvidia.com/gpu: 1 # We ask Kubernetes to assign 1 GPU to this container 226 | volumeMounts: 227 | - name: nvidia 228 | mountPath: /usr/local/nvidia 229 | volumes: 230 | - name: nvidia 231 | hostPath: 232 | path: /usr/local/nvidia 233 | restartPolicy: OnFailure # restart the pod if it fails 234 | ``` 235 | 236 | Save this template somewhere and deploy it with: 237 | 238 | ```console 239 | kubectl create -f 240 | ``` 241 | 242 | #### Validation 243 | 244 | After deploying the template, 245 | 246 | ```console 247 | kubectl get job 248 | ``` 249 | 250 | Should show your new job: 251 | 252 | ```bash 253 | NAME DESIRED SUCCESSFUL AGE 254 | module2-ex1 1 0 1m 255 | ``` 256 | 257 | Looking at the Pods: 258 | 259 | ```console 260 | kubectl get pods 261 | ``` 262 | 263 | You should see your training running 264 | 265 | ```bash 266 | NAME READY STATUS RESTARTS AGE 267 | module2-ex1-c5b8q 1/1 Runing 0 1m 268 | ``` 269 | 270 | Finally you can look at the logs of your pod with: 271 | 272 | ```console 273 | kubectl logs 274 | ``` 275 | 276 | > Be careful to use the Pod name (from `kubectl get pods`) not the Job name. 277 | 278 | And you should see the training happening 279 | 280 | ```bash 281 | 2017-11-29 21:49:16.462292: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX 282 | Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes. 283 | Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz 284 | Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. 285 | Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz 286 | Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. 287 | Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz 288 | Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. 289 | Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz 290 | Accuracy at step 0: 0.1285 291 | Accuracy at step 10: 0.674 292 | Accuracy at step 20: 0.8065 293 | Accuracy at step 30: 0.8606 294 | Accuracy at step 40: 0.8759 295 | Accuracy at step 50: 0.888 296 | [...] 
297 | ``` 298 | 299 | After a few minutes, looking again at the Job should show that it has completed successfully: 300 | 301 | ```console 302 | kubectl get job 303 | ``` 304 | 305 | ```bash 306 | NAME DESIRED SUCCESSFUL AGE 307 | module2-ex1 1 1 3m 308 | ``` 309 | 310 | ## Next Step 311 | 312 | Currently our training doesn't do anything interesting. We are not even saving the model and summaries anywhere, but don't worry we are going to dive into this starting in Module 4. 313 | 314 | [Module 3: Helm](../3-helm/README.md) 315 | -------------------------------------------------------------------------------- /3-helm/README.md: -------------------------------------------------------------------------------- 1 | # Helm 2 | 3 | ## Prerequisites 4 | 5 | * [Docker Basics](../1-docker/README.md) 6 | * [Kubernetes Basics and cluster created](../2-kubernetes) 7 | 8 | ## Summary 9 | 10 | In this module you will learn : 11 | * What is Helm and how to use it 12 | * What is a Chart and how to create one 13 | 14 | ## Context 15 | 16 | As you saw in the second module [Kubernetes Basics and cluster created](../2-kubernetes), the default way to deploy objects in Kubernetes is by using `kubectl` with `yaml` files. 17 | 18 | For example, if we want to deploy a `pod` running `nginx` and then make it available from an external IP using a `service` you will need to describe at least these two objects such as : 19 | 20 | Deployment : 21 | ```yaml 22 | apiVersion: apps/v1beta2 # for versions before 1.8.0 use apps/v1beta1 23 | kind: Deployment 24 | metadata: 25 | name: nginx-deployment 26 | spec: 27 | selector: 28 | matchLabels: 29 | app: nginx 30 | replicas: 2 # tells deployment to run 2 pods matching the template 31 | template: # create pods using pod definition in this template 32 | metadata: 33 | labels: 34 | app: nginx 35 | spec: 36 | containers: 37 | - name: nginx 38 | image: nginx:1.7.9 39 | ports: 40 | - containerPort: 80 41 | ``` 42 | Service : 43 | ```yaml 44 | apiVersion: v1 45 | kind: Service 46 | metadata: 47 | name: nginx-service 48 | spec: 49 | ports: 50 | - port: 8000 51 | targetPort: 80 52 | protocol: TCP 53 | type: LoadBalancer 54 | selector: 55 | app: nginx 56 | ``` 57 | 58 | The problem with this approach is that when you need to make an update to your solution, you will need to update it across different yaml files. 59 | 60 | Let's say you want to change the name of the app from `nginx` to `nginx-production`. You have to change it in a few places in the deployment and not forget to change the selector setting in the service as well. 61 | 62 | This is one example among others where Helm is fixing the issue by being able to create and use templates. 63 | 64 | ## Helm and Chart 65 | 66 | Helm is the [package manager for Kubernetes](https://deis.com/blog/2016/trusting-whos-at-the-helm/). 67 | 68 | A package is named a **Chart**. 69 | 70 | You can either create you own, or pull and install an official one such as Wordpress, GitLab, Apache Spark, etc... 71 | 72 | You can find a list of the official ones here : [https://github.com/kubernetes/charts/tree/master/stable](https://github.com/kubernetes/charts/tree/master/stable) 73 | 74 | To use Helm, you need to have the [CLI installed on your machine](https://github.com/kubernetes/helm/blob/master/docs/install.md) 75 | 76 | Before using Helm with your cluster, make sure you have the Tiller component running in your cluster. 
77 | ```bash 78 | kubectl get pod --all-namespaces | grep tiller 79 | ``` 80 | 81 | If you do not see Tiller running and if you have an RBAC-enabled cluster, you need a service account and role binding for the Tiller service in your cluster. To install these components, refer to these guides to [create a service account](https://docs.microsoft.com/en-us/azure/aks/kubernetes-helm#create-a-service-account) and [configure helm](https://docs.microsoft.com/en-us/azure/aks/kubernetes-helm#configure-helm) to initialize Tiller in the cluster. 82 | 83 | Once Tiller service is up and running in the cluster and you have initialized helm, let's try to deploy an official Chart such as the popular [Wordpress](https://github.com/kubernetes/charts/tree/master/stable/wordpress) 84 | 85 | ```bash 86 | helm install stable/wordpress 87 | ``` 88 | 89 | > Note: If you have an error such has `Error: incompatible versions client[v2.7.0] server[v2.6.2]`, please run `helm init --upgrade` 90 | 91 | After a few seconds you should see the following output in your terminal : 92 | 93 | ```bash 94 | ... 95 | NAME: cloying-crocodile 96 | LAST DEPLOYED: Wed Nov 22 11:29:55 2017 97 | NAMESPACE: default 98 | STATUS: DEPLOYED 99 | 100 | RESOURCES: 101 | ==> v1beta1/Deployment 102 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 103 | cloying-crocodile-mariadb 1 1 1 0 10s 104 | cloying-crocodile-wordpress 1 1 1 0 10s 105 | 106 | ==> v1/Pod(related) 107 | NAME READY STATUS RESTARTS AGE 108 | cloying-crocodile-mariadb-1648957417-0prvc 0/1 Pending 0 10s 109 | cloying-crocodile-wordpress-3958361718-z9qr3 0/1 Pending 0 10s 110 | 111 | ==> v1/Secret 112 | NAME TYPE DATA AGE 113 | cloying-crocodile-mariadb Opaque 2 10s 114 | cloying-crocodile-wordpress Opaque 2 10s 115 | 116 | ==> v1/ConfigMap 117 | NAME DATA AGE 118 | cloying-crocodile-mariadb 1 10s 119 | 120 | ==> v1/PersistentVolumeClaim 121 | NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE 122 | cloying-crocodile-mariadb Pending default 10s 123 | cloying-crocodile-wordpress Pending default 10s 124 | 125 | ==> v1/Service 126 | NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 127 | cloying-crocodile-mariadb ClusterIP 10.0.163.26 3306/TCP 10s 128 | cloying-crocodile-wordpress LoadBalancer 10.0.168.104 80:31549/TCP,443:32728/TCP 10s 129 | ... 130 | ``` 131 | 132 | You can see all the objects that are necessary to run our Wordpress application in your Kubernetes cluster deployed such as **pods**, **services**, **secrets** etc... Furthermore, since we need a MariaDB engine to run Wordpress, Helm did also automatically deploy it as a dependency in the cluster ! 133 | 134 | As you can see inside the [Wordpress's Chart documentation](https://github.com/kubernetes/charts/tree/master/stable/wordpress) you can override some values such as the image, the database name or the SMTP server for example. 
135 | 136 | You just have to use the `--set` option during the `install` command, like so : 137 | 138 | ```bash 139 | helm install --name my-wordpress \ 140 | --set wordpressUsername=admin,wordpressPassword=password,mariadb.mariadbRootPassword=secretpassword \ 141 | stable/wordpress 142 | ``` 143 | 144 | ## Create your own Chart 145 | 146 | You can also create your own Chart by using the scaffolding command `helm create mychart` 147 | 148 | This will create a folder which includes all the files necessary to create your own package : 149 | 150 | ```bash 151 | ├── Chart.yaml 152 | ├── templates 153 | │   ├── NOTES.txt 154 | │   ├── _helpers.tpl 155 | │   ├── deployment.yaml 156 | │   ├── ingress.yaml 157 | │   └── service.yaml 158 | └── values.yaml 159 | ``` 160 | 161 | All the objects that you want to deploy are stored inside the templates folder in different .yaml files. 162 | 163 | You can find more information on how to create your own chart here : [https://deis.com/blog/2016/getting-started-authoring-helm-charts/](https://deis.com/blog/2016/getting-started-authoring-helm-charts/) 164 | 165 | When you are done with your package, Helm provides a linting tool `helm lint mychart` to help you find issues in it. 166 | 167 | If you want to deploy it into your cluster, you can run the following command from the repository folder: 168 | 169 | ```bash 170 | helm install . --name my-custom-chart 171 | ``` 172 | 173 | ## Exercises 174 | 175 | ### Exercise 1 - Deploy an official Chart : DokuWiki 176 | 177 | From the [official Chart repository](https://github.com/kubernetes/charts/tree/master) you have to deploy a DokuWiki environment. 178 | 179 | [DokuWiki](https://www.dokuwiki.org/) is a standards-compliant, simple to use wiki optimized for creating documentation. It is targeted at developer teams, workgroups, and small companies. All data is stored in plain text files, so no database is required. 180 | 181 | #### Validation 182 | 183 | We want to be able to define a custom Wiki name such as `Hello MLADS` at the deployment. 184 | 185 | You should see the following web page from your deployment : 186 | 187 | ![](dokuwiki.png) 188 | 189 | #### Solution 190 | 191 |

194 | 195 | ```bash 196 | helm install stable/dokuwiki --set dokuwikiWikiName="Hello MLADS" 197 | ``` 198 | 199 |
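If you want to check on the release and find the address to browse to, you can inspect what was deployed (a minimal sketch; replace the release name with the one Helm generated for your `helm install`):

```bash
helm status <your-release-name>
kubectl get svc
```

The chart exposes DokuWiki through a `LoadBalancer` service by default, so once the `EXTERNAL-IP` column is populated you can open that address in your browser to see the page above.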

200 |
201 | 202 | 203 | ## Next Step 204 | 205 | [4 - Kubeflow](../4-kubeflow/README.md) -------------------------------------------------------------------------------- /3-helm/dokuwiki.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/3-helm/dokuwiki.png -------------------------------------------------------------------------------- /4-kubeflow/README.md: -------------------------------------------------------------------------------- 1 | # Kubeflow - Overview and Installation 2 | 3 | ## Prerequisites 4 | 5 | - [1 - Docker](../1-docker/README.md) 6 | - [2 - Kubernetes](../2-kubernetes/README.md) 7 | 8 | ## Summary 9 | 10 | In this module we are going to get an overview of the different components that make up [Kubeflow](https://github.com/kubeflow/kubeflow), and how to install them into our newly deployed Kubernetes cluster. 11 | 12 | ### Kubeflow Overview 13 | 14 | From [Kubeflow](https://github.com/kubeflow/kubeflow)'s own documetation: 15 | 16 | > The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow. 17 | 18 | Kubeflow is composed of multiple components: 19 | 20 | - [JupyterHub](https://jupyterhub.readthedocs.io/en/latest/), which allows user to request an instance of a Jupyter Notebook server dedicated to them. 21 | - One or multiple training controllers. These are component that simplifies and manages the deployment of training jobs. For the purpose of this lab we are only going to deploy a training controller for TensorFlow jobs. However the Kubeflow community has started working on controllers for PyTorch and Caffe2 as well. 22 | - A serving component that will help you serve predictions with your models. 23 | 24 | For more general info on Kubeflow, head to the repo's [README](https://github.com/kubeflow/kubeflow/blob/master/README.md). 25 | 26 | ### Deploying Kubeflow 27 | 28 | Kubeflow uses [`ksonnet`](https://github.com/ksonnet/ksonnet) templates as a way to package and deploy the different components. 29 | 30 | > ksonnet simplifies defining an application configuration, updating the configuration over time, and specializing it for different clusters and environments. 31 | 32 | First, install ksonnet version [0.13.1](https://ksonnet.io/#get-started), or you can [download a prebuilt binary](https://github.com/ksonnet/ksonnet/releases/tag/v0.13.1) for your OS. 33 | 34 | Then run the following commands to download Kubeflow: 35 | 36 | ```bash 37 | KUBEFLOW_SRC=$(realpath kubeflow) 38 | 39 | mkdir ${KUBEFLOW_SRC} 40 | cd ${KUBEFLOW_SRC} 41 | 42 | export KUBEFLOW_TAG=v0.4.1 43 | 44 | curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash 45 | ``` 46 | 47 | `KUBEFLOW_SRC` a directory where you want to download the source to 48 | 49 | `KUBEFLOW_TAG` a tag corresponding to the version to check out, such as master for the latest code. 
50 | 51 | ```bash 52 | # Initialize a kubeflow app 53 | KFAPP=mykubeflowapp 54 | ${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform none 55 | 56 | # Generate kubeflow app 57 | cd ${KFAPP} 58 | ${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s 59 | 60 | # Deploy Kubeflow app 61 | ${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s 62 | ``` 63 | 64 | ### Validation 65 | 66 | `kubectl get pods -n kubeflow` 67 | 68 | should return something like this: 69 | 70 | ``` 71 | NAME READY STATUS RESTARTS AGE 72 | kubeflow ambassador-b4d9cdb8-2qgww 1/1 Running 0 111m 73 | kubeflow ambassador-b4d9cdb8-hpwdc 1/1 Running 0 111m 74 | kubeflow ambassador-b4d9cdb8-khg8l 1/1 Running 0 111m 75 | kubeflow argo-ui-6d6658d8f7-t6whw 1/1 Running 0 110m 76 | kubeflow centraldashboard-6f686c5b7c-462cq 1/1 Running 0 111m 77 | kubeflow jupyter-0 1/1 Running 0 111m 78 | kubeflow katib-ui-6c59754c48-mgf62 1/1 Running 0 110m 79 | kubeflow metacontroller-0 1/1 Running 0 111m 80 | kubeflow minio-d79b65988-6qkxp 1/1 Running 0 110m 81 | kubeflow ml-pipeline-66df9d86f6-rp245 1/1 Running 0 110m 82 | kubeflow ml-pipeline-persistenceagent-7b86dbf4b5-rgndj 1/1 Running 0 110m 83 | kubeflow ml-pipeline-scheduledworkflow-84f6477479-9tvhk 1/1 Running 0 110m 84 | kubeflow ml-pipeline-ui-f76bb5f97-2s5qb 1/1 Running 0 110m 85 | kubeflow mysql-ffc889689-xkpxb 1/1 Running 0 110m 86 | kubeflow pytorch-operator-ff46f9b7d-qkbvh 1/1 Running 0 111m 87 | kubeflow spartakus-volunteer-5b6c956c8f-2gnvb 1/1 Running 0 111m 88 | kubeflow studyjob-controller-b7cdbd4cd-nf9z5 1/1 Running 0 110m 89 | kubeflow tf-job-dashboard-7746db84cf-njdzk 1/1 Running 0 111m 90 | kubeflow tf-job-operator-v1beta1-5949f668f7-j5zrn 1/1 Running 0 111m 91 | kubeflow vizier-core-7c56465f6-t6d5p 1/1 Running 0 110m 92 | kubeflow vizier-core-rest-67f588b4cb-lqvgr 1/1 Running 0 110m 93 | kubeflow vizier-db-86dc7d89c5-8vtfs 1/1 Running 0 110m 94 | kubeflow vizier-suggestion-bayesianoptimization-7cb546fb84-tsrn4 1/1 Running 0 110m 95 | kubeflow vizier-suggestion-grid-6587f9d6b-92c9h 1/1 Running 0 110m 96 | kubeflow vizier-suggestion-hyperband-8bb44f8c8-gs72m 1/1 Running 0 110m 97 | kubeflow vizier-suggestion-random-7ff5db687b-bjdh5 1/1 Running 0 110m 98 | kubeflow workflow-controller-cf79dfbff-lv7jk 1/1 Running 0 110m 99 | ``` 100 | 101 | The most important components for the purpose of this lab are `jupyter-0` which is the JupyterHub spawner running on your cluster, and `tf-job-operator-v1beta1-5949f668f7-j5zrn` which is a controller that will monitor your cluster for new TensorFlow training jobs (called `TfJobs`) specifications and manages the training, we will look at this two components later. 
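You can also confirm that the custom resources used in the next modules were registered by the deployment. A quick check (the exact list varies slightly between Kubeflow versions):

```console
kubectl get crd | grep kubeflow.org
```

You should see entries such as `tfjobs.kubeflow.org`, the definition backing the `TFJob` objects we will create in module 6.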
102 | 103 | ### Remove Kubeflow 104 | 105 | If you want to remove the Kubeflow deployment, you can run the following to remove the namespace and installed components: 106 | 107 | ```bash 108 | cd ${KUBEFLOW_SRC}/${KFAPP} 109 | ${KUBEFLOW_SRC}/scripts/kfctl.sh delete k8s 110 | ``` 111 | 112 | ## Next Step 113 | 114 | [5 - JupyterHub](../5-jupyterhub/README.md) 115 | -------------------------------------------------------------------------------- /5-jupyterhub/README.md: -------------------------------------------------------------------------------- 1 | # Jupyter Notebooks on Kubernetes 2 | 3 | ## Prerequisites 4 | 5 | - [1 - Docker Basics](../1-docker) 6 | - [2 - Kubernetes Basics and cluster created](../2-kubernetes) 7 | - [4 - Kubeflow](../4-kubeflow) 8 | 9 | ## Summary 10 | 11 | In this module, you will learn how to: 12 | 13 | - Run Jupyter Notebooks locally using Docker 14 | - Run JupyterHub on Kubernetes using Kubeflow 15 | 16 | ## How Jupyter Notebooks work 17 | 18 | The [Jupyter Notebook](http://jupyter.org/) is an open source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text for rapid prototyping. It is often used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more. To better support exploratory iteration and to accelerate computation of Tensorflow jobs, let's look at how we can include data science tools like Jupyter Notebook with Docker and Kubernetes. 19 | 20 | ## How JupyterHub works 21 | 22 | The [JupyterHub](https://jupyterhub.readthedocs.io/en/latest/) is a multi-user Hub, spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server. JupyterHub can be used to serve notebooks to a class of students, a corporate data science group, or a scientific research group. Let's look at how we can create JupyterHub to spawn multiple instances of Jupyter Notebook on Kubernetes using Kubeflow. 23 | 24 | ## Exercises 25 | 26 | ### Exercise 1: Run Jupyter Notebooks locally using Docker 27 | 28 | In this first exercise, we will run Jupyter Notebooks locally using Docker. We will use the official tensorflow docker image as it comes with Jupyter notebook. 29 | 30 | ```console 31 | docker run -it -p 8888:8888 tensorflow/tensorflow 32 | ``` 33 | 34 | #### Validation 35 | 36 | To verify, browse to the url in the output log. 37 | 38 | For example: `http://localhost:8888/?token=a3ea3cd914c5b68149e2b4a6d0220eca186fec41563c0413` 39 | 40 | ### Exercise 2: Run JupyterHub on Kubernetes using Kubeflow 41 | 42 | In this exercise, we will run JupyterHub to spawn multiple instances of Jupyter Notebooks on a Kubernetes cluster using Kubeflow. 43 | 44 | As a prerequisite, you should already have a Kubernetes cluster running, you can follow [module 2 - Kubernetes](../2-kubernetes) to create your own cluster and you should already have Kubeflow running in your Kubernetes cluster, you can follow [module 4 - Kubeflow and tfjob Basics](../4-kubeflow-tfjob). 45 | 46 | In module 4, you installed the kubeflow-core component, which already includes JupyterHub and a corresponding load balancer service of type `ClusterIP`. To check its status, run the following kubectl command. 47 | 48 | ``` 49 | NAMESPACE=kubeflow 50 | kubectl get svc -n=${NAMESPACE} 51 | 52 | NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 53 | ... 
54 | jupyter-0 ClusterIP None 8000/TCP 132m 55 | jupyter-lb ClusterIP 10.0.191.68 80/TCP 132m 56 | ``` 57 | 58 | To connect to the Kubeflow dashboard locally: 59 | 60 | ```bash 61 | kubectl port-forward svc/ambassador -n ${NAMESPACE} 8080:80 62 | ``` 63 | 64 | Then navigate to JupyterHub: http://localhost:8080/hub 65 | 66 | [Optional] To connect to your JupyterHub over a public IP: 67 | 68 | To update the default service created for JupyterHub, run the following command to change the service to type LoadBalancer: 69 | 70 | ```bash 71 | cd ks_app 72 | ks param set jupyter serviceType LoadBalancer 73 | cd .. 74 | ${KUBEFLOW_SOURCE}/scripts/kfctl.sh apply k8s 75 | ``` 76 | 77 | Create a new Jupyter Notebook instance: 78 | 79 | - open http://localhost:8080/hub/ in your browser (or use the public IP for the service `tf-hub-lb`) 80 | - log in using any username and password 81 | - click the "Start My Server" button to sprawn a new Jupyter notebook 82 | - from the image dropdown, select a tensorflow image for your notebook 83 | - for CPU and memory, enter values based on your resource requirements, for example: 1 CPU and 2Gi 84 | - to get available GPUs in your cluster, run the following command: 85 | 86 | ``` 87 | kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.alpha\.kubernetes\.io\/nvidia-gpu" 88 | ``` 89 | 90 | - for GPU, enter values in json format `{"nvidia.com/gpu":"1"}` 91 | - click the "Spawn" button 92 | 93 | ![jupyterhub](./jupyterhub.png) 94 | 95 | The images are quite large. This process can take a long time. 96 | 97 | #### Validation 98 | 99 | You can check the status of the pod by running: 100 | 101 | ``` 102 | kubectl -n ${NAMESPACE} describe pods jupyter-${USERNAME} 103 | ``` 104 | 105 | After the pod status changes to `running`, to verify you will see a new Jupyter notebook running at: http://127.0.0.1:8000/user/{USERNAME}/tree or http://{PUBLIC-IP}/user/{USERNAME}/tree 106 | 107 | ## Next Step 108 | 109 | [6 - TfJob](../6-tfjob) 110 | -------------------------------------------------------------------------------- /5-jupyterhub/jupyterhub.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/5-jupyterhub/jupyterhub.png -------------------------------------------------------------------------------- /6-tfjob/README.md: -------------------------------------------------------------------------------- 1 | # `TFJob` 2 | 3 | ## Prerequisites 4 | 5 | - [1 - Docker](../1-docker/README.md) 6 | - [2 - Kubernetes](../2-kubernetes/README.md) 7 | - [4 - Kubeflow](../4-kubeflow/README.md) 8 | 9 | ## Summary 10 | 11 | In this module you will learn how to describe a TensorFlow training using `TFJob` object. 12 | 13 | ### Kubernetes Custom Resource Definition 14 | 15 | Kubernetes has a concept of [Custom Resources](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) (often abbreviated CRD) that allows us to create custom object that we will then be able to use. 16 | In the case of Kubeflow, after installation, a new `TFJob` object will be available in our cluster. This object allows us to describe a TensorFlow training. 17 | 18 | #### `TFJob` Specifications 19 | 20 | Before going further, let's take a look at what the `TFJob` object looks like: 21 | 22 | > Note: Some of the fields are not described here for brevity. 
23 | 24 | **`TFJob` Object** 25 | 26 | | Field | Type | Description | 27 | | ---------- | ------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- | 28 | | apiVersion | `string` | Versioned schema of this representation of an object. In our case, it's `kubeflow.org/v1beta1` | 29 | | kind | `string` | Value representing the REST resource this object represents. In our case it's `TFJob` | 30 | | metadata | [`ObjectMeta`](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata) | Standard object's metadata. | 31 | | spec | `TFJobSpec` | The actual specification of our TensorFlow job, defined below. | 32 | 33 | `spec` is the most important part, so let's look at it too: 34 | 35 | **`TFJobSpec` Object** 36 | 37 | | Field | Type | Description | 38 | | ------------- | --------------------- | -------------------------------------------------------------- | 39 | | TFReplicaSpec | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below | 40 | 41 | Let's go deeper: 42 | 43 | **`TFReplicaSpec` Object** 44 | 45 | | Field | Type | Description | 46 | | ------------- | ------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 47 | | TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER` which happens to be the default value. | 48 | | Replicas | `int` | Number of replicas of `TfReplicaType`. Again this is useful only for distributed TensorFLow. Default value is `1`. | 49 | | Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. | 50 | 51 | Here is what a simple TensorFlow training looks like using this `TFJob` object: 52 | 53 | ```yaml 54 | apiVersion: kubeflow.org/v1beta1 55 | kind: TFJob 56 | metadata: 57 | name: example-tfjob 58 | spec: 59 | tfReplicaSpecs: 60 | MASTER: 61 | replicas: 1 62 | template: 63 | spec: 64 | containers: 65 | - image: /tf-mnist:gpu 66 | name: tensorflow 67 | resources: 68 | limits: 69 | nvidia.com/gpu: 1 70 | restartPolicy: OnFailure 71 | ``` 72 | 73 | Note that we are note specifying `TfReplicaType` or `Replicas` as the default values are already what we want. 74 | 75 | ## Exercises 76 | 77 | ### Exercise 1: A Simple `TFJob` 78 | 79 | Let's schedule a very simple TensorFlow job using `TFJob` first. 80 | 81 | > Note: If you completed the exercise in Module 1 and 2, you can change the image to use the one you pushed instead. 82 | 83 | When using GPU, we need to request for one (or multiple), and the image we are using also needs to be based on TensorFlow's GPU image. 
84 | 85 | ```yaml 86 | apiVersion: kubeflow.org/v1beta1 87 | kind: TFJob 88 | metadata: 89 | name: module6-ex1-gpu 90 | spec: 91 | tfReplicaSpecs: 92 | MASTER: 93 | replicas: 1 94 | template: 95 | spec: 96 | containers: 97 | - image: /tf-mnist:gpu # From module 1 98 | name: tensorflow 99 | resources: 100 | limits: 101 | nvidia.com/gpu: 1 102 | restartPolicy: OnFailure 103 | ``` 104 | 105 | Save the template that applies to you in a file, and create the `TFJob`: 106 | 107 | ```console 108 | kubectl create -f 109 | ``` 110 | 111 | Let's look at what has been created in our cluster. 112 | 113 | First a `TFJob` was created: 114 | 115 | ```console 116 | kubectl get tfjob 117 | ``` 118 | 119 | Returns: 120 | 121 | ``` 122 | NAME AGE 123 | module6-ex1-gpu 5s 124 | ``` 125 | 126 | As well as a `Pod`, which was actually created by the operator: 127 | 128 | ```console 129 | kubectl get pod 130 | ``` 131 | 132 | Returns: 133 | 134 | ``` 135 | NAME READY STATUS RESTARTS AGE 136 | module6-ex1-master-xs4b-0-6gpfn 1/1 Running 0 2m 137 | ``` 138 | 139 | Note that the `Pod` might take a few minutes before actually running, the docker image needs to be pulled on the node first. 140 | 141 | Once the `Pod`'s status is either `Running` or `Completed` we can start looking at it's logs: 142 | 143 | ```console 144 | kubectl logs 145 | ``` 146 | 147 | This container is pretty verbose, but you should see a TensorFlow training happening: 148 | 149 | ``` 150 | [...] 151 | INFO:tensorflow:2017-11-20 20:57:22.314198: Step 480: Cross entropy = 0.142486 152 | INFO:tensorflow:2017-11-20 20:57:22.370080: Step 480: Validation accuracy = 85.0% (N=100) 153 | INFO:tensorflow:2017-11-20 20:57:22.896383: Step 490: Train accuracy = 98.0% 154 | INFO:tensorflow:2017-11-20 20:57:22.896600: Step 490: Cross entropy = 0.075210 155 | INFO:tensorflow:2017-11-20 20:57:22.945611: Step 490: Validation accuracy = 91.0% (N=100) 156 | INFO:tensorflow:2017-11-20 20:57:23.407756: Step 499: Train accuracy = 94.0% 157 | INFO:tensorflow:2017-11-20 20:57:23.407980: Step 499: Cross entropy = 0.170348 158 | INFO:tensorflow:2017-11-20 20:57:23.457325: Step 499: Validation accuracy = 89.0% (N=100) 159 | INFO:tensorflow:Final test accuracy = 88.4% (N=353) 160 | [...] 161 | ``` 162 | 163 | > That's great and all, but how do we grab our trained model and TensorFlow's summaries? 164 | 165 | Well currently we can't. As soon as the training is complete, the container stops and everything inside it, including model and logs are lost. 166 | 167 | Thankfully, Kubernetes `Volumes` can help us here. 168 | If you remember, we quickly introduced `Volumes` in module [2 - Kubernetes](../2-kubernetes/), and that's what we already used to mount the drivers from the node into the container. 169 | But `Volumes` are not just for mounting things from a node, we can also use them to mount a lot of different storage solutions, you can see the full list [here](https://kubernetes.io/docs/concepts/storage/volumes/). 170 | 171 | In our case we are going to use Azure Files, as it is really easy to use with Kubernetes. 172 | 173 | ## Exercise 2: Azure Files to the Rescue 174 | 175 | ### Creating a New File Share and Kubernetes Secret 176 | 177 | In the official documentation: [Persistent volumes with Azure files](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv), follow the steps listed under `Create storage account`, `Create storage class`, and `Create persistent volume claim`. 
178 | Be aware of a few details first: 179 | 180 | - It is **very** important that you create your storage account in the **same** region and the same resource group (with MC\_ prefix) as your Kubernetes cluster: because Azure File uses the `SMB` protocol it won't work cross-regions. `AKS_PERS_LOCATION` should be updated accordingly. 181 | - While this document specifically refers to AKS, it will work for any K8s cluster 182 | - Once the PVC is created, you will see a new file share under that storage account. All subsequent modules will be writing to that file share. 183 | - PVC are namespaced so be sure to create it on the same namespace that is launching the TFJob objects 184 | - If you are using RBAC might need to run the cluster role and binding: [see docs here](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv#create-a-cluster-role-and-binding) 185 | 186 | Once you completed all the steps, run: 187 | 188 | ```console 189 | kubectl get pvc 190 | ``` 191 | 192 | Which should return: 193 | 194 | ``` 195 | NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE 196 | azurefile Bound pvc-346ab93b-4cbf-11e8-9fed-000d3a17b5e9 5Gi RWO azurefile 5m 197 | ``` 198 | 199 | ### Updating our example to use our Azure File Share 200 | 201 | Now we need to mount our new file share into our container so the model and the summaries can be persisted. 202 | Turns out mounting an Azure File share into a container is really easy, we simply need to reference our PVC in the `Volume` definition: 203 | 204 | ```yaml 205 | [...] 206 | containers: 207 | - image: 208 | name: tensorflow 209 | resources: 210 | limits: 211 | nvidia.com/gpu: 1 212 | volumeMounts: 213 | - name: azurefile 214 | subPath: module6-ex2-gpu 215 | mountPath: /tmp/tensorflow 216 | volumes: 217 | - name: azurefile 218 | persistentVolumeClaim: 219 | claimName: azurefile 220 | ``` 221 | 222 | Update your template from exercise 1 to mount the Azure File share into your container, and create your new job. 223 | 224 | Once the container starts running, if you go to the Azure Portal, into your storage account, and browse your `tensorflow` file share, you should see something like that: 225 | 226 | ![file-share](./file-share.png) 227 | 228 | This means that when we run a training, all the important data is now stored in Azure File and is still available as long as we don't delete the file share. 229 | 230 | #### Solution for Exercise 2 231 | 232 |
233 | Solution 234 | 235 | ```yaml 236 | apiVersion: kubeflow.org/v1beta1 237 | kind: TFJob 238 | metadata: 239 | name: module6-ex2 240 | spec: 241 | tfReplicaSpecs: 242 | MASTER: 243 | replicas: 1 244 | template: 245 | spec: 246 | containers: 247 | - image: /tf-mnist:gpu 248 | name: tensorflow 249 | resources: 250 | limits: 251 | nvidia.com/gpu: 1 252 | volumeMounts: 253 | # By default our classifier saves the summaries in /tmp/tensorflow, 254 | # so that's where we want to mount our Azure File Share. 255 | - name: azurefile 256 | # The subPath allows us to mount a subdirectory within the azure file share instead of root 257 | # this is useful so that we can save the logs for each run in a different subdirectory 258 | # instead of overwriting what was done before. 259 | subPath: module6-ex2-gpu 260 | mountPath: /tmp/tensorflow 261 | restartPolicy: OnFailure 262 | volumes: 263 | - name: azurefile 264 | persistentVolumeClaim: 265 | claimName: azurefile 266 | ``` 267 | 268 |
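For reference, the `azurefile` claim used above comes from the dynamic provisioning steps of the Azure documentation linked earlier. A minimal sketch of what the storage class and claim typically look like (names, SKU and size are assumptions; follow the official doc for the authoritative version):

```yaml
# Sketch of the StorageClass and PersistentVolumeClaim assumed by the solution above
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_LRS
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azurefile
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: azurefile
  resources:
    requests:
      storage: 5Gi
```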
269 | 270 | ## Next Step 271 | 272 | [7 - Distributed TensorFlow](../7-distributed-tensorflow) 273 | -------------------------------------------------------------------------------- /6-tfjob/file-share.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/6-tfjob/file-share.png -------------------------------------------------------------------------------- /7-distributed-tensorflow/README.md: -------------------------------------------------------------------------------- 1 | # Distributed TensorFlow with `TFJob` 2 | 3 | ## Prerequisites 4 | 5 | * [2 - Kubernetes Basics and cluster created](../2-kubernetes) 6 | * [4 - Kubeflow](../4-kubeflow) 7 | * [6 - TFJob](../6-tfjob) 8 | 9 | ## Summary 10 | 11 | Distributed TensorFlow trainings can be complicated. In this module, we will see how `TFJob`, one of the components of Kubeflow, can be used to simplify the deployment and monitoring of distributed TensorFlow trainings. 12 | 13 | ## "Vanilla" Distributed TensorFlow is Hard 14 | 15 | First let's see how we would setup a distributed TensorFlow training without Kubernetes or `TFJob` (fear not, we are not actually going to do that). 16 | First, you would have to find or setup a bunch of idle VMs, or physical machines. In most companies, this would already be a feat, and likely require the coordination of multiple department (such as IT) to get the VMs up, running and reserved for your experiment. 17 | Then you would likely have to do some back and forth with the IT department to be able to setup your training: the VMs need to be able to talk to each others and have stable endpoints. Work might be needed to access the data, you would need to upload your TF code on every single machine etc. 18 | If you add GPU to the mix, it would likely get even harder since GPUs aren't usually just waiting there because of their high cost. 19 | 20 | Assuming you get through this, you now need to modify your model for distributed training. 21 | Among other things, you will need to setup the `ClusterSpec` ([`tf.train.ClusterSpec`](https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec)): a TensorFlow class that allows you to describe the architecture of your cluster. 22 | For example, if you were to setup a distributed training with a mere 2 workers and 2 parameter servers, your cluster spec would look like this (the `clusterSpec` would most likely not be hardcoded, but passed as argument to your training script as we will see below, this is for illustration): 23 | 24 | ```python 25 | cluster = tf.train.ClusterSpec({"worker": [":2222", 26 | ":2222"], 27 | "ps": [":2222", 28 | ":2222"]}) 29 | ``` 30 | Here we assume that you want your workers to run on GPU VMs and your parameter servers to run on CPU VMs. 31 | 32 | We will not go through the rest of the modifications needed (splitting operation across devices, getting the master session etc.), as we will look at them later and this would be pretty much the same thing no matter how you run your distributed training. 33 | 34 | Once your model is ready, you need to start the training. 35 | You will need to connect to every single VM, and pass the `ClusterSpec` as well as the assigned job name (ps or worker) and task index to each VM. 
36 | So it would look something like this: 37 | 38 | ```bash 39 | # On ps0: 40 | $ python trainer.py \ 41 | --ps_hosts=:2222,:2222 \ 42 | --worker_hosts=:2222,:2222 \ 43 | --job_name=ps --task_index=0 44 | # On ps1: 45 | $ python trainer.py \ 46 | --ps_hosts=:2222,:2222 \ 47 | --worker_hosts=:2222,:2222 \ 48 | --job_name=ps --task_index=1 49 | # On worker0: 50 | $ python trainer.py \ 51 | --ps_hosts=:2222,:2222 \ 52 | --worker_hosts=:2222,:2222 \ 53 | --job_name=worker --task_index=0 54 | # On worker1: 55 | $ python trainer.py \ 56 | --ps_hosts=:2222,:2222 \ 57 | --worker_hosts=:2222,:2222 \ 58 | --job_name=worker --task_index=1 59 | ``` 60 | 61 | At this point your training would finally start. 62 | However, if for some reason an IP changes (a VM restarts for example), you would need to go back on every VM in your cluster, and restart the training with an updated `ClusterSpec` (If the IT department of your company is feeling extra-generous they might assign a DNS name to every VM which would already make your life much easier). 63 | If you see that your training is not doing well and you need to update the code, you have to redeploy it on every VM and restart the training everywhere. 64 | If for some reason you want to retrain after a while, you would most likely need to go back to step 1: ask for the VMs to be allocated, redeploy, update the `clusterSpec`. 65 | 66 | All this hurdles means that in practice very few people actually bother with distributed training as the time gained during training might not be worth the energy and time necessary to set it up correctly. 67 | 68 | ## Distributed TensorFlow with Kubernetes and `TFJob` 69 | 70 | Thankfully, with Kubernetes and `TFJob` things are much, much simpler, making distributed training something you might actually be able to benefit from. Before submitting a training job, you should have deployed Kubeflow to your cluster. Doing so ensures that the `TFJob` custom resource is available when you submit the training job. 71 | 72 | As a prerequisite, you should already have a Kubernetes cluster running, you can follow [module 2 - Kubernetes](../2-kubernetes) to create your own cluster and you should already have Kubeflow and TFJob running in your Kubernetes cluster, you can follow [module 4 - Kubeflow](../4-kubeflow) and [module 6 - TFJob](../6-tfjob). 73 | 74 | #### A Small Disclaimer 75 | The issues we saw in the first part of this module can be categorized in two groups: 76 | * Issues with getting access to enough resources for the trainings (VMs, GPU etc) 77 | * Issues with setting up the training itself 78 | 79 | The first group of issue is still very dependent on the processes in your company/group. If you need to go through a formal request to get access to extra VMs/GPU, it will still be a hassle and there is nothing Kubernetes can do about that. 80 | 81 | However, Kubernetes makes this process much easier: 82 | On AKS, you can spin up new VMs with a single command: [`az aks scale`](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az-aks-scale) 83 | 84 | Setting up the training, however, is drastically simplified with Kubernetes, KubeFlow and `TFJob`. 85 | 86 | ### Overview of `TFJob` distributed training 87 | 88 | So, how does `TFJob` work for distributed training? 
89 | Let's look again at what the `TFJobSpec`and `TFReplicaSpec` objects looks like: 90 | 91 | **`TFJobSpec` Object** 92 | 93 | | Field | Type| Description | 94 | |-------|-----|-------------| 95 | | ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below | 96 | 97 | 98 | **`TFReplicaSpec` Object** 99 | 100 | Note the last parameter `IsDefaultPS` that we didn't talk about before. 101 | 102 | | Field | Type| Description | 103 | |-------|-----|-------------| 104 | | TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER` which happens to be the default value. | 105 | | Replicas | `int` | Number of replicas of `TfReplicaType`. Again this is useful only for distributed TensorFLow. Default value is `1`. | 106 | | Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. | 107 | | **IsDefaultPS** | `boolean` | Whether the parameter server should be using a default image or a custom one (default to `true`) | 108 | 109 | In case the distinction between master and workers is not clear, there is a single master per TensorFlow cluster, and it is in fact a worker. The difference is that the master is the worker that is going to handle the creation of the `tf.Session`, write logs and save the model. 110 | 111 | As you can see, `TFJobSpec` and `TFReplicaSpec` allow us to easily define the architecture of the TensorFlow cluster we would like to setup. 112 | 113 | Once we have defined this architecture in a `TFJob` template and deployed it with `kubectl create`, the operator will do most of the work for us. 114 | For each master, worker and parameter server in our TensorFlow cluster, the operator will create a service exposing it so they can communicate. 115 | It will then create an internal representation of the cluster with each node and it's associated internal DNS name. 116 | 117 | For example, if you were to create a `TFJob` with 1 `MASTER`, 2 `WORKERS` and 1 `PS`, this representation would look similar to this: 118 | ```json 119 | { 120 | "master":[ 121 | "distributed-mnist-master-5oz2-0:2222" 122 | ], 123 | "ps":[ 124 | "distributed-mnist-ps-5oz2-0:2222" 125 | ], 126 | "worker":[ 127 | "distributed-mnist-worker-5oz2-0:2222", 128 | "distributed-mnist-worker-5oz2-1:2222" 129 | ] 130 | } 131 | ``` 132 | 133 | Finally, the operator will create all the necessary pods, and in each one, inject an environment variable named `Tf_CONFIG`, containing the cluster specification above, as well as the respective job name and task id that each node of the TensorFlow cluster should assume. 134 | 135 | For example, here is the value of the `TF_CONFIG` environment variable that would be sent to worker 1: 136 | 137 | ```json 138 | { 139 | "cluster":{ 140 | "master":[ 141 | "distributed-mnist-master-5oz2-0:2222" 142 | ], 143 | "ps":[ 144 | "distributed-mnist-ps-5oz2-0:2222" 145 | ], 146 | "worker":[ 147 | "distributed-mnist-worker-5oz2-0:2222", 148 | "distributed-mnist-worker-5oz2-1:2222" 149 | ] 150 | }, 151 | "task":{ 152 | "type":"worker", 153 | "index":1 154 | }, 155 | "environment":"cloud" 156 | } 157 | ``` 158 | 159 | As you can see, this completely takes the responsibility of building and maintaining the `ClusterSpec` away from you. 
160 | All you have to do, is modify your code to read the `TF_CONFIG` and act accordingly. 161 | 162 | ### Modifying your model to use `TFJob`'s `TF_CONFIG` 163 | 164 | Concretely, let's see how you would modify your code: 165 | 166 | ```python 167 | # Grab the TF_CONFIG environment variable 168 | tf_config_json = os.environ.get("TF_CONFIG", "{}") 169 | 170 | # Deserialize to a python object 171 | tf_config = json.loads(tf_config_json) 172 | 173 | # Grab the cluster specification from tf_config and create a new tf.train.ClusterSpec instance with it 174 | cluster_spec = tf_config.get("cluster", {}) 175 | cluster_spec_object = tf.train.ClusterSpec(cluster_spec) 176 | 177 | # Grab the task assigned to this specific process from the config. job_name might be "worker" and task_id might be 1 for example 178 | task = tf_config.get("task", {}) 179 | job_name = task["type"] 180 | task_id = task["index"] 181 | 182 | # Configure the TensorFlow server 183 | server_def = tf.train.ServerDef( 184 | cluster=cluster_spec_object.as_cluster_def(), 185 | protocol="grpc", 186 | job_name=job_name, 187 | task_index=task_id) 188 | server = tf.train.Server(server_def) 189 | 190 | # checking if this process is the chief (also called master). The master has the responsibility of creating the session, saving the summaries etc. 191 | is_chief = (job_name == 'master') 192 | 193 | # Notice that we are not handling the case where job_name == 'ps'. That is because `TFJob` will take care of the parameter servers for us by default. 194 | ``` 195 | 196 | As for any distributed TensorFlow training, you will then also need to modify your model to split the operations and variables among the workers and parameter servers as well as create on session on the master. 197 | 198 | ## Exercises 199 | 200 | ### 1 - Modifying Our MNIST Example to Support Distributed Training 201 | 202 | #### 1. a. 203 | 204 | Starting from the MNIST sample we have been working with so far, modify it to work with distributed TensorFlow and `TFJob`. 205 | You will then need to build the image and push it (you should push it under a different name or tag to avoid overwriting what you did before). 206 | 207 | ``` 208 | cd 7-distributed-tensorflow/solution-src 209 | # build from tensorflow/tensorflow:gpu for master and workers 210 | docker build -t ${DOCKER_USERNAME}/tf-mnist:distributedgpu -f ./Dockerfile.gpu . 211 | 212 | # builld from tensorflow/tensorflow for the parameter server 213 | docker build -t ${DOCKER_USERNAME}/tf-mnist:distributed . 214 | ``` 215 | 216 | #### 1. b. 217 | 218 | Modify the yaml template from module [6 - TFJob](../6-tfjob), to instead deploy 1 master, 2 workers and 1 PS. Then create a yaml to deploy TensorBoard to monitor the training with TensorBoard. 219 | Note that since our model is very simple, TensorFlow will likely use only 1 of the workers, but it will still work fine. Don't forget to update the image or tag. 220 | 221 | #### Validation 222 | 223 | ```console 224 | kubectl get pods 225 | ``` 226 | 227 | Should yield: 228 | 229 | ``` 230 | NAME READY STATUS RESTARTS AGE 231 | module6-ex1-master-m8vi-0-rdr5o 1/1 Running 0 23s 232 | module6-ex1-ps-m8vi-0-0vhjm 1/1 Running 0 23s 233 | module6-ex1-worker-m8vi-0-eyb6l 1/1 Running 0 23s 234 | module6-ex1-worker-m8vi-1-bm2ue 1/1 Running 0 23s 235 | ``` 236 | 237 | looking at the logs of the master with: 238 | 239 | ```console 240 | kubectl logs 241 | ``` 242 | 243 | Should yield: 244 | 245 | ``` 246 | [...] 
247 | Initialize GrpcChannelCache for job master -> {0 -> localhost:2222} 248 | Initialize GrpcChannelCache for job ps -> {0 -> module6-ex1-ps-m8vi-0:2222} 249 | Initialize GrpcChannelCache for job worker -> {0 -> module6-ex1-worker-m8vi-0:2222, 1 -> module6-ex1-worker-m8vi-1:2222} 250 | 2018-04-30 22:45:28.963803: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:333] Started server with target: grpc://localhost:2222 251 | ... 252 | 253 | Accuracy at step 970: 0.9784 254 | Accuracy at step 980: 0.9791 255 | Accuracy at step 990: 0.9796 256 | Adding run metadata for 999 257 | ``` 258 | 259 | This indicates that the `ClusterSpec` was correctly extracted from the environment variable and given to TensorFlow. 260 | 261 | Once the TensorBoard pod is provisioned and running, we can connect to it using: 262 | 263 | ```console 264 | PODNAME=$(kubectl get pod -l app=tensorboard -o jsonpath='{.items[0].metadata.name}') 265 | kubectl port-forward ${PODNAME} 6006:6006 266 | ``` 267 | 268 | From the browser, connect to it at http://127.0.0.1:6006, you should see that your model is indeed correctly distributed between workers and PS: 269 | 270 | ![TensorBoard](./tensorboard.png) 271 | 272 | After a few minutes, the status of both worker nodes should show as `Completed` when doing `kubectl get pods -a`. 273 | 274 | #### Solution 275 | 276 | A working code sample is available in [`solution-src/main.py`](./solution-src/main.py). 277 | 278 |
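If you want to double-check what the operator injected into each replica before diving into the templates, you can print the environment variable straight from one of the running pods. A quick sketch, assuming the master pod name from the validation output above (yours will differ):

```console
kubectl exec module6-ex1-master-m8vi-0-rdr5o -- printenv TF_CONFIG
```

The output should be the JSON structure described earlier, with `task.type` set to `master` and `task.index` set to `0`.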
279 | TFJob's Template 280 | 281 | ```yaml 282 | apiVersion: kubeflow.org/v1alpha2 283 | kind: TFJob 284 | metadata: 285 | name: module7-ex1-gpu 286 | spec: 287 | tfReplicaSpecs: 288 | MASTER: 289 | replicas: 1 290 | template: 291 | spec: 292 | volumes: 293 | - name: azurefile 294 | persistentVolumeClaim: 295 | claimName: azurefile 296 | containers: 297 | - image: /tf-mnist:distributedgpu # You can replace this by your own image 298 | name: tensorflow 299 | imagePullPolicy: Always 300 | resources: 301 | limits: 302 | nvidia.com/gpu: 1 303 | volumeMounts: 304 | - mountPath: /tmp/tensorflow 305 | subPath: module7-ex1-gpu 306 | name: azurefile 307 | restartPolicy: OnFailure 308 | WORKER: 309 | replicas: 2 310 | template: 311 | spec: 312 | containers: 313 | - image: /tf-mnist:distributedgpu # You can replace this by your own image 314 | name: tensorflow 315 | imagePullPolicy: Always 316 | resources: 317 | limits: 318 | nvidia.com/gpu: 1 319 | volumeMounts: 320 | restartPolicy: OnFailure 321 | PS: 322 | replicas: 1 323 | template: 324 | spec: 325 | containers: 326 | - image: /tf-mnist:distributed # You can replace this by your own image 327 | name: tensorflow 328 | imagePullPolicy: Always 329 | ports: 330 | - containerPort: 6006 331 | restartPolicy: OnFailure 332 | ``` 333 | 334 | There are few things to notice here: 335 | * Since only the master will be saving the model and the summaries, we only need to mount the Azure File share on the master's `tfReplicaSpecs`, not on the `WORKER`s or `PS`. 336 | * We are not specifying anything for the `PS` `tfReplicaSpecs` except the number of replicas. This is because `IsDefaultPS` is set to `true` by default. This means that the parameter server(s) will be started with a pre-built docker image that is already configured to read the `TF_CONFIG` and act as a TensorFlow server, so we don't need to do anything here. 337 | * When you have limited GPU resources, you can specify Master and Worker nodes to request GPU resources and PS node will only request CPU resources. 338 | 339 |
340 | 341 |
342 | TensorBoard Template 343 | 344 | ```yaml 345 | apiVersion: extensions/v1beta1 346 | kind: Deployment 347 | metadata: 348 | labels: 349 | app: tensorboard 350 | name: tensorboard 351 | spec: 352 | replicas: 1 353 | selector: 354 | matchLabels: 355 | app: tensorboard 356 | template: 357 | metadata: 358 | labels: 359 | app: tensorboard 360 | spec: 361 | volumes: 362 | - name: azurefile 363 | persistentVolumeClaim: 364 | claimName: azurefile 365 | containers: 366 | - name: tensorboard 367 | image: tensorflow/tensorflow:1.10.0 368 | imagePullPolicy: Always 369 | command: 370 | - /usr/local/bin/tensorboard 371 | args: 372 | - --logdir 373 | - /tmp/tensorflow/logs 374 | volumeMounts: 375 | - mountPath: /tmp/tensorflow 376 | subPath: module7-ex1-gpu 377 | name: azurefile 378 | ports: 379 | - containerPort: 6006 380 | protocol: TCP 381 | dnsPolicy: ClusterFirst 382 | restartPolicy: Always 383 | 384 | ``` 385 | There are two things to notice here: 386 | * To view logs and to get saved models from previous trainings, we need to mount `/tmp/tensorflow` from the Azure File share to TensorBoard as that is the mounting point for all the persisted data from TFJob Master. 387 | * To access TensorBoard pod locally, we need to port-forward traffic against the port specified by `containerPort: 6006`. 388 | 389 |
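As an alternative to looking up the pod name first, recent versions of kubectl can port-forward to the deployment directly; this is only a convenience and behaves like the pod-level command shown in the validation section:

```console
kubectl port-forward deployment/tensorboard 6006:6006
```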
390 | 391 | 392 | ## Next Step 393 | 394 | [8 - Hyperparameters Sweep with Helm](../8-hyperparam-sweep) 395 | -------------------------------------------------------------------------------- /7-distributed-tensorflow/solution-src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.10.0 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] 5 | -------------------------------------------------------------------------------- /7-distributed-tensorflow/solution-src/Dockerfile.gpu: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.10.0-gpu 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] 5 | -------------------------------------------------------------------------------- /7-distributed-tensorflow/solution-src/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the 'License'); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an 'AS IS' BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """A simple MNIST classifier which displays summaries in TensorBoard. 16 | 17 | This is an unimpressive MNIST model, but it is a good example of using 18 | tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of 19 | naming summary tags so that they are grouped meaningfully in TensorBoard. 20 | 21 | It demonstrates the functionality of every TensorBoard dashboard. 22 | """ 23 | from __future__ import absolute_import 24 | from __future__ import division 25 | from __future__ import print_function 26 | 27 | import argparse 28 | import os 29 | import sys 30 | import ast 31 | import json 32 | 33 | import tensorflow as tf 34 | 35 | from tensorflow.examples.tutorials.mnist import input_data 36 | 37 | FLAGS = None 38 | 39 | def train(): 40 | tf_config_json = os.environ.get("TF_CONFIG", "{}") 41 | tf_config = json.loads(tf_config_json) 42 | 43 | task = tf_config.get("task", {}) 44 | cluster_spec = tf_config.get("cluster", {}) 45 | cluster_spec_object = tf.train.ClusterSpec(cluster_spec) 46 | job_name = task["type"] 47 | task_id = task["index"] 48 | server_def = tf.train.ServerDef( 49 | cluster=cluster_spec_object.as_cluster_def(), 50 | protocol="grpc", 51 | job_name=job_name, 52 | task_index=task_id) 53 | server = tf.train.Server(server_def) 54 | 55 | is_chief = (job_name == 'master') 56 | 57 | # Import data 58 | mnist = input_data.read_data_sets(FLAGS.data_dir, 59 | one_hot=True, 60 | fake_data=FLAGS.fake_data) 61 | 62 | 63 | # Create a multilayer model. 
64 | 65 | 66 | # Between-graph replication 67 | with tf.device(tf.train.replica_device_setter( 68 | worker_device="/job:worker/task:%d" % task_id, 69 | cluster=cluster_spec)): 70 | 71 | # count the number of updates 72 | global_step = tf.get_variable( 73 | 'global_step', 74 | [], 75 | initializer = tf.constant_initializer(0), 76 | trainable = False) 77 | 78 | # Input placeholders 79 | with tf.name_scope('input'): 80 | x = tf.placeholder(tf.float32, [None, 784], name='x-input') 81 | y_ = tf.placeholder(tf.float32, [None, 10], name='y-input') 82 | 83 | with tf.name_scope('input_reshape'): 84 | image_shaped_input = tf.reshape(x, [-1, 28, 28, 1]) 85 | tf.summary.image('input', image_shaped_input, 10) 86 | 87 | # We can't initialize these variables to 0 - the network will get stuck. 88 | def weight_variable(shape): 89 | """Create a weight variable with appropriate initialization.""" 90 | initial = tf.truncated_normal(shape, stddev=0.1) 91 | return tf.Variable(initial) 92 | 93 | def bias_variable(shape): 94 | """Create a bias variable with appropriate initialization.""" 95 | initial = tf.constant(0.1, shape=shape) 96 | return tf.Variable(initial) 97 | 98 | def variable_summaries(var): 99 | """Attach a lot of summaries to a Tensor (for TensorBoard visualization).""" 100 | with tf.name_scope('summaries'): 101 | mean = tf.reduce_mean(var) 102 | tf.summary.scalar('mean', mean) 103 | with tf.name_scope('stddev'): 104 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 105 | tf.summary.scalar('stddev', stddev) 106 | tf.summary.scalar('max', tf.reduce_max(var)) 107 | tf.summary.scalar('min', tf.reduce_min(var)) 108 | tf.summary.histogram('histogram', var) 109 | 110 | def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu): 111 | """Reusable code for making a simple neural net layer. 112 | 113 | It does a matrix multiply, bias add, and then uses ReLU to nonlinearize. 114 | It also sets up name scoping so that the resultant graph is easy to read, 115 | and adds a number of summary ops. 116 | """ 117 | # Adding a name scope ensures logical grouping of the layers in the graph. 118 | with tf.name_scope(layer_name): 119 | # This Variable will hold the state of the weights for the layer 120 | with tf.name_scope('weights'): 121 | weights = weight_variable([input_dim, output_dim]) 122 | variable_summaries(weights) 123 | with tf.name_scope('biases'): 124 | biases = bias_variable([output_dim]) 125 | variable_summaries(biases) 126 | with tf.name_scope('Wx_plus_b'): 127 | preactivate = tf.matmul(input_tensor, weights) + biases 128 | tf.summary.histogram('pre_activations', preactivate) 129 | activations = act(preactivate, name='activation') 130 | tf.summary.histogram('activations', activations) 131 | return activations 132 | 133 | hidden1 = nn_layer(x, 784, 500, 'layer1') 134 | 135 | with tf.name_scope('dropout'): 136 | keep_prob = tf.placeholder(tf.float32) 137 | tf.summary.scalar('dropout_keep_probability', keep_prob) 138 | dropped = tf.nn.dropout(hidden1, keep_prob) 139 | 140 | # Do not apply softmax activation yet, see below. 141 | y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity) 142 | 143 | with tf.name_scope('cross_entropy'): 144 | # The raw formulation of cross-entropy, 145 | # 146 | # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)), 147 | # reduction_indices=[1])) 148 | # 149 | # can be numerically unstable. 
150 | # 151 | # So here we use tf.nn.softmax_cross_entropy_with_logits on the 152 | # raw outputs of the nn_layer above, and then average across 153 | # the batch. 154 | diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) 155 | with tf.name_scope('total'): 156 | cross_entropy = tf.reduce_mean(diff) 157 | tf.summary.scalar('cross_entropy', cross_entropy) 158 | 159 | with tf.name_scope('train'): 160 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize( 161 | cross_entropy) 162 | 163 | with tf.name_scope('accuracy'): 164 | with tf.name_scope('correct_prediction'): 165 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 166 | with tf.name_scope('accuracy'): 167 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 168 | tf.summary.scalar('accuracy', accuracy) 169 | 170 | # Merge all the summaries and write them out to 171 | # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default) 172 | merged = tf.summary.merge_all() 173 | 174 | init_op = tf.global_variables_initializer() 175 | 176 | def feed_dict(train): 177 | """Make a TensorFlow feed_dict: maps data onto Tensor placeholders.""" 178 | if train or FLAGS.fake_data: 179 | xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data) 180 | k = FLAGS.dropout 181 | else: 182 | xs, ys = mnist.test.images, mnist.test.labels 183 | k = 1.0 184 | return {x: xs, y_: ys, keep_prob: k} 185 | 186 | 187 | 188 | sv = tf.train.Supervisor(is_chief=is_chief, 189 | global_step=global_step, 190 | init_op=init_op, 191 | logdir=FLAGS.logdir) 192 | 193 | with sv.prepare_or_wait_for_session(server.target) as sess: 194 | train_writer = tf.summary.FileWriter(FLAGS.logdir + '/train', sess.graph) 195 | test_writer = tf.summary.FileWriter(FLAGS.logdir + '/test') 196 | # Train the model, and also write summaries. 
197 | # Every 10th step, measure test-set accuracy, and write test summaries 198 | # All other steps, run train_step on training data, & add training summaries 199 | 200 | for i in range(FLAGS.max_steps): 201 | if i % 10 == 0: # Record summaries and test-set accuracy 202 | summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False)) 203 | test_writer.add_summary(summary, i) 204 | print('Accuracy at step %s: %s' % (i, acc)) 205 | else: # Record train set summaries, and train 206 | if i % 100 == 99: # Record execution stats 207 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 208 | run_metadata = tf.RunMetadata() 209 | summary, _ = sess.run([merged, train_step], 210 | feed_dict=feed_dict(True), 211 | options=run_options, 212 | run_metadata=run_metadata) 213 | train_writer.add_run_metadata(run_metadata, 'step%03d' % i) 214 | train_writer.add_summary(summary, i) 215 | print('Adding run metadata for', i) 216 | else: # Record a summary 217 | summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True)) 218 | train_writer.add_summary(summary, i) 219 | train_writer.close() 220 | test_writer.close() 221 | 222 | 223 | def main(_): 224 | train() 225 | 226 | 227 | if __name__ == '__main__': 228 | parser = argparse.ArgumentParser() 229 | parser.add_argument('--fake_data', nargs='?', const=True, type=bool, 230 | default=False, 231 | help='If true, uses fake data for unit testing.') 232 | parser.add_argument('--max_steps', type=int, default=1000, 233 | help='Number of steps to run trainer.') 234 | parser.add_argument('--learning_rate', type=float, default=0.001, 235 | help='Initial learning rate') 236 | parser.add_argument('--dropout', type=float, default=0.9, 237 | help='Keep probability for training dropout.') 238 | parser.add_argument( 239 | '--data_dir', 240 | type=str, 241 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 242 | 'tensorflow/input_data'), 243 | help='Directory for storing input data') 244 | parser.add_argument( 245 | '--logdir', 246 | type=str, 247 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 248 | 'tensorflow/logs'), 249 | help='Summaries log directory') 250 | FLAGS, unparsed = parser.parse_known_args() 251 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) -------------------------------------------------------------------------------- /7-distributed-tensorflow/tensorboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/7-distributed-tensorflow/tensorboard.png -------------------------------------------------------------------------------- /8-hyperparam-sweep/README.md: -------------------------------------------------------------------------------- 1 | # Automated Hyperparameters Sweep with `TFJob` and Helm 2 | 3 | ## Prerequisites 4 | 5 | * [3 - Helm](../3-helm) 6 | * [6 - TFJob](../6-tfjob) 7 | 8 | ### "Vanilla" Hyperparameter Sweep 9 | 10 | Just as distributed training, automated hyperparameter sweep is barely used in many organizations. 11 | The reasons are similar: It takes a lot of resources, or time, to run more than a couple training for the same model. 12 | * Either you run different hypothesis in parallel, which will likely requires a lot of resources and VMs. These VMs need to be managed by someone, the model need to be deployed, logs and checkpoints have to be gathered etc. 
13 | * Or you run everything sequentially on a small number of VMs, which takes a lot of time before you can compare results
14 | 
15 | So in practice most people manually fine-tune their hyperparameters through a few runs and pick a winner.
16 | 
17 | ### Kubernetes + Helm
18 | 
19 | Kubernetes coupled with Helm can make this easier, as we will see.
20 | Because Kubernetes on Azure also allows you to scale very easily (manually or automatically), it lets you explore a very large hyperparameter space while maximizing the usage of your cluster (and thus optimizing cost).
21 | 
22 | In practice, this process is still rudimentary today as the technologies involved are all pretty young. Tools better suited to hyperparameter sweeping in distributed systems will most likely become available soon, but in the meantime Kubernetes and Helm already allow us to deploy a large number of trainings fairly easily.
23 | 
24 | ### Why Helm?
25 | 
26 | As we saw in module [3 - Helm](../3-helm), Helm enables us to package an application in a chart and parameterize its deployment easily.
27 | To do that, Helm lets us use the Go templating engine in the chart definitions. This means we can use conditions, loops, variables and [much more](https://docs.helm.sh/chart_template_guide).
28 | This allows us to create complex deployment flows.
29 | 
30 | In the case of hyperparameter sweeping, we want a chart able to deploy a number of `TFJobs`, each trying different values for some hyperparameters.
31 | We will also want to deploy a single TensorBoard instance monitoring all these `TFJobs`; that way we can quickly compare all our hypotheses, and even stop early the jobs that clearly don't perform well if we want to reduce cost as much as possible.
32 | For now, this chart will simply do a grid search; while it is less efficient than random search, it is a good place to start.
33 | 
34 | ## Exercise
35 | 
36 | ### Creating and Deploying the Chart
37 | In this exercise, you will create a new Helm chart that will deploy a number of `TFJobs` as well as a TensorBoard instance.
38 | 
39 | Here is what our `values.yaml` file could look like, for example (you are free to go a different route):
40 | 
41 | ```yaml
42 | image: ritazh/tf-paint:gpu
43 | useGPU: true
44 | hyperParamValues:
45 |   learningRate:
46 |     - 0.001
47 |     - 0.01
48 |     - 0.1
49 |   hiddenLayers:
50 |     - 5
51 |     - 6
52 |     - 7
53 | ```
54 | 
55 | That way, when installing the chart, 9 `TFJob`s will actually get deployed, testing all the combinations of learning rate and hidden-layer depth that we specified.
56 | This is a very simple example (our model is also very simple), but hopefully you start to see the possibilities that Helm offers.
57 | 
58 | In this exercise, we are going to use a new model based on [Andrej Karpathy's Image painting demo](http://cs.stanford.edu/people/karpathy/convnetjs/demo/image_regression.html).
59 | This model's objective is to create a new picture as close as possible to the original one, "The Starry Night" by Vincent van Gogh:
60 | 
61 | ![Starry](./src/starry.jpg)
62 | 
63 | The source code is located in [src/](./src/).
64 | 
65 | Our model takes 3 parameters:
66 | 
67 | | argument | description | default value |
68 | |------|-------------|---------------|
69 | |`--learning-rate` | Learning rate value | `0.001` |
70 | |`--hidden-layers` | Number of hidden layers in our network.
| `4` | 71 | |`--log-dir` | Path to save TensorFlow's summaries | `None`| 72 | 73 | For simplicity, docker images have already been created so you don't have to build and push yourself: 74 | * `ritazh/tf-paint:cpu` for CPU only. 75 | * `ritazh/tf-paint:gpu` for GPU. 76 | 77 | The goal of this exercise is to create an Helm chart that will allow us to test as many variations and combinations of the two hyperparameters `--learning-rate`and `--hidden-layers` as we want by just adding them in our `values.yaml` file. 78 | This chart should also deploy a single TensorBoard instance (and it's associated service with a public IP), so we can quickly monitor and compare our different hypothesis. 79 | 80 | If you are pretty new to Kubernetes and Helm and don't feel like building your own helm chart just yet, you can skip to the solution where details and explanations are provided. 81 | 82 | #### Validation 83 | 84 | Once you have created and deployed your chart, looking at the pods that were created, you should see a bunch of them, as well as a single TensorBoard instance monitoring all of them: 85 | 86 | ```console 87 | kubectl get pods 88 | ``` 89 | 90 | ``` 91 | NAME READY STATUS RESTARTS AGE 92 | module8-tensorboard-7ccb598cdd-6vg7h 1/1 Running 0 16s 93 | module8-tf-paint-0-0-master-juc5-0-hw5cm 0/1 Pending 0 4s 94 | module8-tf-paint-0-1-master-pu49-0-jp06r 1/1 Running 0 14s 95 | module8-tf-paint-0-2-master-awhs-0-gfra0 0/1 Pending 0 6s 96 | module8-tf-paint-1-0-master-5tfm-0-dhhhv 1/1 Running 0 16s 97 | module8-tf-paint-1-1-master-be91-0-zw4gk 1/1 Running 0 16s 98 | module8-tf-paint-1-2-master-r2nd-0-zhws1 0/1 Pending 0 7s 99 | module8-tf-paint-2-0-master-7w37-0-ff0w9 0/1 Pending 0 13s 100 | module8-tf-paint-2-1-master-260j-0-l4o7r 0/1 Pending 0 10s 101 | module8-tf-paint-2-2-master-jtjb-0-5l84q 0/1 Pending 0 9s 102 | ``` 103 | Note: Some pods are pending due to the GPU resource available in the cluster. If you have 3 GPUs in the cluster, then there can only be a maximum number of 3 parallel trainings at a given time. 104 | 105 | Once the TensorBoard service for this module is created, you can use the External-IP of that service to connect to the TensorBoard. 106 | 107 | ```console 108 | kubectl get service 109 | 110 | NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 111 | module8-tensorboard LoadBalancer 10.0.142.217 80:30896/TCP 5m 112 | 113 | ``` 114 | 115 | Looking at TensorBoard, you should see something similar to this: 116 | ![TensorBoard](tensorboard.png) 117 | 118 | > Note that TensorBoard can take a while before correctly displaying images. 119 | 120 | Here we can see that some models are doing much better than others. Models with a learning rate of `0.1` for example are producing an all-black image, we are probably over-shooting. 121 | After a few minutes, we can see that the two best performing models are: 122 | * 5 hidden layers and learning rate of `0.01` 123 | * 7 hidden layers and learning rate of `0.001` 124 | 125 | At this point we could decide to kill all the other models if we wanted to free some capacity in our cluster, or launch additional new experiments based on our initial findings. 126 | 127 | #### Solution 128 | Check out the commented solution chart: [./solution-chart/templates/deployment.yaml](./solution-chart/templates/deployment.yaml) 129 |
130 | Hyperparameter Sweep Helm Chart 131 | 132 | Install the chart with command: 133 | 134 | ```console 135 | cd 8-hyperparam-sweep/solution-chart/ 136 | helm install . 137 | 138 | NAME: telling-buffalo 139 | LAST DEPLOYED: 140 | NAMESPACE: tfworkflow 141 | STATUS: DEPLOYED 142 | 143 | RESOURCES: 144 | ==> v1/Service 145 | NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 146 | module8-tensorboard LoadBalancer 10.0.142.217 80:30896/TCP 1s 147 | 148 | ==> v1beta1/Deployment 149 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 150 | module8-tensorboard 1 1 1 0 1s 151 | 152 | ==> v1alpha1/TFJob 153 | NAME AGE 154 | module8-tf-paint-0-0 1s 155 | module8-tf-paint-1-0 1s 156 | module8-tf-paint-1-1 1s 157 | module8-tf-paint-2-1 1s 158 | module8-tf-paint-2-2 1s 159 | module8-tf-paint-0-1 1s 160 | module8-tf-paint-0-2 1s 161 | module8-tf-paint-1-2 1s 162 | module8-tf-paint-2-0 0s 163 | 164 | ==> v1/Pod(related) 165 | NAME READY STATUS RESTARTS AGE 166 | module8-tensorboard-7ccb598cdd-6vg7h 0/1 ContainerCreating 0 1s 167 | 168 | ``` 169 |
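Once TensorBoard has shown you which runs are worth keeping, you can free cluster capacity by deleting the under-performing `TFJob`s, or remove the whole experiment when you are done. A sketch, assuming the resource and release names shown above (yours will differ):

```console
# Delete a single under-performing run
kubectl delete tfjob module8-tf-paint-0-2

# Tear down the whole experiment, including TensorBoard (release name from the install output)
helm delete telling-buffalo --purge
```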
170 | 171 | ## Next Step 172 | 173 | [9 - Serving](../9-serving) 174 | -------------------------------------------------------------------------------- /8-hyperparam-sweep/solution-chart/Chart.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | description: A Helm chart for Kubernetes 3 | name: module8-hyperparam-sweep 4 | version: 0.1.0 5 | -------------------------------------------------------------------------------- /8-hyperparam-sweep/solution-chart/templates/_helpers.tpl: -------------------------------------------------------------------------------- 1 | {{/* vim: set filetype=mustache: */}} 2 | {{/* 3 | Expand the name of the chart. 4 | */}} 5 | {{- define "name" -}} 6 | {{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} 7 | {{- end -}} 8 | 9 | {{/* 10 | Create a default fully qualified app name. 11 | We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). 12 | */}} 13 | {{- define "fullname" -}} 14 | {{- $name := default .Chart.Name .Values.nameOverride -}} 15 | {{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} 16 | {{- end -}} 17 | -------------------------------------------------------------------------------- /8-hyperparam-sweep/solution-chart/templates/deployment.yaml: -------------------------------------------------------------------------------- 1 | 2 | # First we copy the values of values.yaml in variable to make it easier to access them 3 | {{- $lrlist := .Values.hyperParamValues.learningRate -}} 4 | {{- $nblayerslist := .Values.hyperParamValues.hiddenLayers -}} 5 | {{- $image := .Values.image -}} 6 | {{- $useGPU := .Values.useGPU -}} 7 | {{- $chartname := .Chart.Name -}} 8 | {{- $chartversion := .Chart.Version -}} 9 | 10 | # Then we loop over every value of $lrlist (learning rate) and $nblayerslist (hidden layer depth) 11 | # This will result in create 1 TFJob for every pair of learning rate and hidden layer depth 12 | {{- range $i, $lr := $lrlist }} 13 | {{- range $j, $nblayers := $nblayerslist }} 14 | apiVersion: kubeflow.org/v1beta1 15 | kind: TFJob # Each one of our trainings will be a separate TFJob 16 | metadata: 17 | name: module8-tf-paint-{{ $i }}-{{ $j }} # We give a unique name to each training 18 | labels: 19 | chart: "{{ $chartname }}-{{ $chartversion | replace "+" "_" }}" 20 | spec: 21 | tfReplicaSpecs: 22 | MASTER: 23 | template: 24 | spec: 25 | restartPolicy: OnFailure 26 | containers: 27 | - name: tensorflow 28 | image: {{ $image }} 29 | env: 30 | - name: LC_ALL 31 | value: C.UTF-8 32 | args: 33 | # Here we pass a unique learning rate and hidden layer count to each instance. 
34 | # We also put the values between quotes to avoid potential formatting issues 35 | - --learning-rate 36 | - {{ $lr | quote }} 37 | - --hidden-layers 38 | - {{ $nblayers | quote }} 39 | - --logdir 40 | - /tmp/tensorflow/tf-paint-lr{{ $lr }}-d-{{ $nblayers }} # We save the summaries in a different directory 41 | {{ if $useGPU }} # We only want to request GPUs if we asked for it in values.yaml with useGPU 42 | resources: 43 | limits: 44 | nvidia.com/gpu: 1 45 | {{ end }} 46 | volumeMounts: 47 | - mountPath: /tmp/tensorflow 48 | subPath: module8-tf-paint # As usual we want to save everything in a separate subdirectory 49 | name: azurefile 50 | volumes: 51 | - name: azurefile 52 | persistentVolumeClaim: 53 | claimName: azurefile 54 | --- 55 | {{- end }} 56 | {{- end }} 57 | # We only want one instance running for all our jobs, and not 1 per job. 58 | apiVersion: v1 59 | kind: Service 60 | metadata: 61 | labels: 62 | app: tensorboard 63 | name: module8-tensorboard 64 | spec: 65 | ports: 66 | - port: 80 67 | targetPort: 6006 68 | selector: 69 | app: tensorboard 70 | type: LoadBalancer 71 | --- 72 | apiVersion: extensions/v1beta1 73 | kind: Deployment 74 | metadata: 75 | labels: 76 | app: tensorboard 77 | name: module8-tensorboard 78 | spec: 79 | template: 80 | metadata: 81 | labels: 82 | app: tensorboard 83 | spec: 84 | volumes: 85 | - name: azurefile 86 | persistentVolumeClaim: 87 | claimName: azurefile 88 | containers: 89 | - name: tensorboard 90 | command: 91 | - /usr/local/bin/tensorboard 92 | - --logdir=/tmp/tensorflow 93 | - --host=0.0.0.0 94 | image: tensorflow/tensorflow 95 | ports: 96 | - containerPort: 6006 97 | volumeMounts: 98 | - mountPath: /tmp/tensorflow 99 | subPath: module8-tf-paint 100 | name: azurefile -------------------------------------------------------------------------------- /8-hyperparam-sweep/solution-chart/values.yaml: -------------------------------------------------------------------------------- 1 | image: ritazh/tf-paint:gpu 2 | useGPU: true 3 | hyperParamValues: 4 | learningRate: 5 | - 0.001 6 | - 0.01 7 | - 0.1 8 | hiddenLayers: 9 | - 5 10 | - 6 11 | - 7 12 | -------------------------------------------------------------------------------- /8-hyperparam-sweep/src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.7.0 2 | COPY requirements.txt /app/requirements.txt 3 | WORKDIR /app 4 | RUN mkdir ./output 5 | RUN mkdir ./logs 6 | RUN mkdir ./checkpoints 7 | RUN pip install -r requirements.txt 8 | COPY ./* /app/ 9 | 10 | ENTRYPOINT [ "python", "/app/main.py" ] -------------------------------------------------------------------------------- /8-hyperparam-sweep/src/Dockerfile.gpu: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.7.0-gpu 2 | COPY requirements.txt /app/requirements.txt 3 | WORKDIR /app 4 | RUN mkdir ./output 5 | RUN mkdir ./logs 6 | RUN mkdir ./checkpoints 7 | RUN pip install -r requirements.txt 8 | COPY ./* /app/ 9 | 10 | ENTRYPOINT [ "python", "/app/main.py" ] -------------------------------------------------------------------------------- /8-hyperparam-sweep/src/main.py: -------------------------------------------------------------------------------- 1 | import click 2 | import tensorflow as tf 3 | import numpy as np 4 | from skimage.data import astronaut 5 | from scipy.misc import imresize, imsave, imread 6 | 7 | img = imread('./starry.jpg') 8 | img = imresize(img, (100, 100)) 9 | save_dir = 'output' 10 | 
epochs = 2000 11 | 12 | 13 | def linear_layer(X, layer_size, layer_name): 14 | with tf.variable_scope(layer_name): 15 | W = tf.Variable(tf.random_uniform([X.get_shape().as_list()[1], layer_size], dtype=tf.float32), name='W') 16 | b = tf.Variable(tf.zeros([layer_size]), name='b') 17 | return tf.nn.relu(tf.matmul(X, W) + b) 18 | 19 | @click.command() 20 | @click.option("--learning-rate", default=0.01) 21 | @click.option("--hidden-layers", default=7) 22 | @click.option("--logdir") 23 | def main(learning_rate, hidden_layers, logdir='./logs/1'): 24 | X = tf.placeholder(dtype=tf.float32, shape=(None, 2), name='X') 25 | y = tf.placeholder(dtype=tf.float32, shape=(None, 3), name='y') 26 | current_input = X 27 | for layer_id in range(hidden_layers): 28 | h = linear_layer(current_input, 20, 'layer{}'.format(layer_id)) 29 | current_input = h 30 | 31 | y_pred = linear_layer(current_input, 3, 'output') 32 | 33 | #loss will be distance between predicted and true RGB 34 | loss = tf.reduce_mean(tf.reduce_sum(tf.squared_difference(y, y_pred), 1)) 35 | tf.summary.scalar('loss', loss) 36 | 37 | train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss) 38 | merged_summary_op = tf.summary.merge_all() 39 | 40 | res_img = tf.cast(tf.clip_by_value(tf.reshape(y_pred, (1,) + img.shape), 0, 255), tf.uint8) 41 | img_summary = tf.summary.image('out', res_img, max_outputs=1) 42 | 43 | xs, ys = get_data(img) 44 | 45 | with tf.Session() as sess: 46 | tf.global_variables_initializer().run() 47 | train_writer = tf.summary.FileWriter(logdir + '/train', sess.graph) 48 | test_writer = tf.summary.FileWriter(logdir + '/test') 49 | batch_size = 50 50 | for i in range(epochs): 51 | # Get a random sampling of the dataset 52 | idxs = np.random.permutation(range(len(xs))) 53 | # The number of batches we have to iterate over 54 | n_batches = len(idxs) // batch_size 55 | # Now iterate over our stochastic minibatches: 56 | for batch_i in range(n_batches): 57 | batch_idxs = idxs[batch_i * batch_size: (batch_i + 1) * batch_size] 58 | sess.run([train_op, loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]}) 59 | if batch_i % 100 == 0: 60 | c, summary = sess.run([loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]}) 61 | train_writer.add_summary(summary, (i * n_batches * batch_size) + batch_i) 62 | print("epoch {}, (l2) loss {}".format(i, c)) 63 | 64 | if i % 10 == 0: 65 | img_summary_res = sess.run(img_summary, feed_dict={X: xs, y: ys}) 66 | test_writer.add_summary(img_summary_res, i * n_batches * batch_size) 67 | 68 | def get_data(img): 69 | xs = [] 70 | ys = [] 71 | for row_i in range(img.shape[0]): 72 | for col_i in range(img.shape[1]): 73 | xs.append([row_i, col_i]) 74 | ys.append(img[row_i, col_i]) 75 | 76 | xs = (xs - np.mean(xs)) / np.std(xs) 77 | return xs, np.array(ys) 78 | 79 | if __name__ == "__main__": 80 | main() -------------------------------------------------------------------------------- /8-hyperparam-sweep/src/requirements.txt: -------------------------------------------------------------------------------- 1 | scikit-image 2 | click>=6.2 -------------------------------------------------------------------------------- /8-hyperparam-sweep/src/starry.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/8-hyperparam-sweep/src/starry.jpg -------------------------------------------------------------------------------- 
/8-hyperparam-sweep/tensorboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/8-hyperparam-sweep/tensorboard.png
--------------------------------------------------------------------------------
/9-serving/README.md:
--------------------------------------------------------------------------------
1 | # TensorFlow Serving
2 | 
3 | ## Prerequisites
4 | 
5 | * [1 - Docker Basics](../1-docker)
6 | * [2 - Kubernetes Basics and cluster created](../2-kubernetes)
7 | * [3 - Helm](../3-helm)
8 | * [4 - Kubeflow](../4-kubeflow)
9 | 
10 | ## Summary
11 | 
12 | In this section you will learn about:
13 | 
14 | * Setting up Minio file storage in our Kubernetes cluster
15 | * Serving trained models using TensorFlow Serving
16 | 
17 | ## Context
18 | 
19 | TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data.
20 | 
21 | ## Exercises
22 | 
23 | ### Exercise 1: Setting up file storage
24 | 
25 | First, we'll get started with a file storage backend.
26 | 
27 | If you already have a model uploaded to storage, you can skip this step.
28 | If not, you can [download the Minio client](https://minio.io/downloads.html#download-client-macos) for your operating system of choice to upload your trained and exported model.
29 | 
30 | As we saw in module [3 - Helm](../3-helm), Helm enables us to package an application in a chart and parameterize its deployment easily. We'll use Helm to create a Minio deployment in our cluster.
31 | 
32 | ```console
33 | ACCESS_KEY=
34 | ACCESS_SECRET_KEY=
35 | 
36 | helm install --name minio --set accessKey=$ACCESS_KEY,secretKey=$ACCESS_SECRET_KEY,service.type=LoadBalancer stable/minio
37 | ```
38 | 
39 | ```console
40 | SERVICE_IP=$(kubectl get svc minio --template="{{range .status.loadBalancer.ingress}}{{.ip}}{{end}}")
41 | S3_ENDPOINT=${SERVICE_IP}:9000
42 | ```
43 | 
44 | Setting up the Minio host:
45 | 
46 | ```console
47 | mc config host add minio $S3_ENDPOINT $ACCESS_KEY $ACCESS_SECRET_KEY
48 | ```
49 | 
50 | Creating a bucket and uploading our trained model:
51 | 
52 | ```console
53 | BUCKET_NAME=kubeflow
54 | 
55 | mc mb minio/$BUCKET_NAME
56 | 
57 | mc cp --recursive /path/to/your/exported/model minio/$BUCKET_NAME
58 | ```
59 | 
60 | After this command, you should see the files being uploaded.
61 | 
62 | ### Exercise 2: Setting up the TensorFlow Serving model server
63 | 
64 | In this exercise, we are going to set up a TensorFlow model server and start serving our trained model.
65 | 
66 | Creating our namespace for serving:
67 | 
68 | ```console
69 | export NAMESPACE=serving
70 | 
71 | kubectl create namespace $NAMESPACE
72 | ```
73 | 
74 | Creating a secret for the Minio storage so the TensorFlow Serving container can access it:
75 | 
76 | ```console
77 | kubectl create secret generic serving-creds --from-literal=accessKeyID=${ACCESS_KEY} \
78 | --from-literal=secretAccessKey=${ACCESS_SECRET_KEY} -n $NAMESPACE
79 | ```
80 | 
81 | Defining variables such as the model name and the TensorFlow Serving image:
82 | 
83 | ```console
84 | S3_USE_HTTPS=0
85 | S3_VERIFY_SSL=0
86 | JOB_NAME=myjob
87 | MODEL_COMPONENT=mnist
88 | MODEL_NAME=mnist
89 | MODEL_PATH=s3://${BUCKET_NAME}/models/${JOB_NAME}/export/${MODEL_NAME}/
90 | MODEL_SERVER_IMAGE=sozercan/tensorflow-model-server
91 | ```
92 | 
93 | Initialize the ksonnet application and install the Kubeflow `tf-serving` package:
94 | 
95 | ```console
96 | ks init my-model-server
97 | cd my-model-server
98 | ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
99 | ks pkg install kubeflow/tf-serving@74629b7
100 | ```
101 | 
102 | Setting up the environment for Kubeflow:
103 | 
104 | ```console
105 | ks env add azure
106 | ks env set azure --namespace ${NAMESPACE}
107 | ```
108 | 
109 | Generating the template:
110 | 
111 | ```console
112 | ks generate tf-serving ${MODEL_COMPONENT} --name=${MODEL_NAME}
113 | ```
114 | 
115 | Overriding parameters with our own values:
116 | 
117 | ```console
118 | ks param set --env azure ${MODEL_COMPONENT} modelServerImage $MODEL_SERVER_IMAGE
119 | ks param set --env azure ${MODEL_COMPONENT} modelPath $MODEL_PATH
120 | ks param set --env azure ${MODEL_COMPONENT} s3Enable true
121 | ks param set --env azure ${MODEL_COMPONENT} s3SecretName serving-creds
122 | ks param set --env azure ${MODEL_COMPONENT} s3SecretAccesskeyidKeyName accessKeyID
123 | ks param set --env azure ${MODEL_COMPONENT} s3SecretSecretaccesskeyKeyName secretAccessKey
124 | ks param set --env azure ${MODEL_COMPONENT} s3Endpoint $S3_ENDPOINT
125 | ks param set --env azure ${MODEL_COMPONENT} s3AwsRegion us-east-1
126 | ks param set --env azure ${MODEL_COMPONENT} s3UseHttps $S3_USE_HTTPS --as-string
127 | ks param set --env azure ${MODEL_COMPONENT} s3VerifySsl $S3_VERIFY_SSL --as-string
128 | ks param set --env azure ${MODEL_COMPONENT} serviceType LoadBalancer
129 | ```
130 | 
131 | Deploying TensorFlow Serving to our cluster:
132 | 
133 | ```console
134 | ks apply azure -c ${MODEL_COMPONENT}
135 | ```
136 | 
137 | After deploying, you should see a deployment and a service in your cluster. You can verify this with the following:
138 | 
139 | ```console
140 | kubectl get pods -n ${NAMESPACE}
141 | 
142 | kubectl get svc -n ${NAMESPACE}
143 | ```
144 | 
145 | ### Exercise 3: Using a client to query the TensorFlow Serving model server
146 | 
147 | In this exercise, we'll use a client to query the TensorFlow Serving model server.
148 | 
149 | ```console
150 | cd 9-serving
151 | ```
152 | 
153 | If you don't have virtualenv installed, you can install it with:
154 | 
155 | ```console
156 | pip install virtualenv
157 | ```
158 | 
159 | Setting up our virtual environment:
160 | 
161 | ```console
162 | virtualenv venv
163 | source venv/bin/activate
164 | pip install -r requirements.txt
165 | ```
166 | 
167 | Starting our query from the client:
168 | 
169 | ```console
170 | export TF_MODEL_SERVER_HOST=$(kubectl get svc ${MODEL_NAME} -n ${NAMESPACE} --template="{{range .status.loadBalancer.ingress}}{{.ip}}{{end}}")
171 | 
172 | export TF_MNIST_IMAGE_PATH=data/7.png
173 | 
174 | python mnist_client.py
175 | ```
176 | 
177 | If everything is working correctly, you should see the model's prediction for the image we just sent.
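The client also supports a couple of other input modes (see [mnist_client.py](mnist_client.py)): you can point it at any of the digit images under `data/`, or let it pull an image from the MNIST test set instead. The snippet below is just an illustration: the test image index `42` is arbitrary, and it assumes `TF_MODEL_SERVER_HOST` is still exported from the previous step.

```console
# Classify another one of the provided digit images.
export TF_MNIST_IMAGE_PATH=data/3.png
python mnist_client.py

# Or let the client download the MNIST test set and classify a specific test image.
# TF_MNIST_IMAGE_PATH must be unset first, otherwise it takes precedence (see mnist_client.py).
unset TF_MNIST_IMAGE_PATH
export TF_MNIST_TEST_IMAGE_NUMBER=42
python mnist_client.py
```

In every case the client prints the raw prediction tensors and an ASCII rendering of the input digit, as in the sample below for `data/7.png`.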
178 | 179 | Sample output: 180 | 181 | ``` 182 | outputs { 183 | key: "classes" 184 | value { 185 | dtype: DT_UINT8 186 | tensor_shape { 187 | dim { 188 | size: 1 189 | } 190 | } 191 | int_val: 7 192 | } 193 | } 194 | outputs { 195 | key: "predictions" 196 | value { 197 | dtype: DT_FLOAT 198 | tensor_shape { 199 | dim { 200 | size: 1 201 | } 202 | dim { 203 | size: 10 204 | } 205 | } 206 | float_val: 0.0 207 | float_val: 0.0 208 | float_val: 0.0 209 | float_val: 0.0 210 | float_val: 0.0 211 | float_val: 0.0 212 | float_val: 0.0 213 | float_val: 1.0 214 | float_val: 0.0 215 | float_val: 0.0 216 | } 217 | } 218 | 219 | 220 | ............................ 221 | ............................ 222 | ............................ 223 | ............................ 224 | ............................ 225 | ............................ 226 | ............................ 227 | ..............@@@@@@........ 228 | ..........@@@@@@@@@@........ 229 | ........@@@@@@@@@@@@........ 230 | ........@@@@@@@@.@@@........ 231 | ........@@@@....@@@@........ 232 | ................@@@@........ 233 | ...............@@@@......... 234 | ...............@@@@......... 235 | ...............@@@.......... 236 | ..............@@@@.......... 237 | ..............@@@........... 238 | .............@@@@........... 239 | .............@@@............ 240 | ............@@@@............ 241 | ............@@@............. 242 | ............@@@............. 243 | ...........@@@.............. 244 | ..........@@@@.............. 245 | ..........@@@@.............. 246 | ..........@@................ 247 | ............................ 248 | Your model says the above number is... 7! 249 | ``` 250 | 251 | ## Next Step 252 | 253 | [10 - Going Further](../10-going-further) 254 | -------------------------------------------------------------------------------- /9-serving/data/0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/0.png -------------------------------------------------------------------------------- /9-serving/data/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/1.png -------------------------------------------------------------------------------- /9-serving/data/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/2.png -------------------------------------------------------------------------------- /9-serving/data/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/3.png -------------------------------------------------------------------------------- /9-serving/data/4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/4.png -------------------------------------------------------------------------------- /9-serving/data/5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/5.png 
-------------------------------------------------------------------------------- /9-serving/data/6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/6.png -------------------------------------------------------------------------------- /9-serving/data/7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/7.png -------------------------------------------------------------------------------- /9-serving/data/8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/8.png -------------------------------------------------------------------------------- /9-serving/data/9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/9.png -------------------------------------------------------------------------------- /9-serving/mnist_client.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2.7 2 | 3 | import os 4 | import random 5 | import numpy 6 | 7 | from PIL import Image 8 | 9 | import tensorflow as tf 10 | from tensorflow.examples.tutorials.mnist import input_data 11 | from tensorflow_serving.apis import predict_pb2 12 | from tensorflow_serving.apis import prediction_service_pb2 13 | 14 | from grpc.beta import implementations 15 | 16 | from mnist import MNIST # pylint: disable=no-name-in-module 17 | 18 | TF_MODEL_SERVER_HOST = os.getenv("TF_MODEL_SERVER_HOST", "127.0.0.1") 19 | TF_MODEL_SERVER_PORT = int(os.getenv("TF_MODEL_SERVER_PORT", 9000)) 20 | TF_DATA_DIR = os.getenv("TF_DATA_DIR", "/tmp/data/") 21 | TF_MNIST_IMAGE_PATH = os.getenv("TF_MNIST_IMAGE_PATH", None) 22 | TF_MNIST_TEST_IMAGE_NUMBER = int(os.getenv("TF_MNIST_TEST_IMAGE_NUMBER", -1)) 23 | 24 | if TF_MNIST_IMAGE_PATH != None: 25 | raw_image = Image.open(TF_MNIST_IMAGE_PATH) 26 | int_image = numpy.array(raw_image) 27 | image = numpy.reshape(int_image, 784).astype(numpy.float32) 28 | elif TF_MNIST_TEST_IMAGE_NUMBER > -1: 29 | test_data_set = input_data.read_data_sets(TF_DATA_DIR, one_hot=True).test 30 | image = test_data_set.images[TF_MNIST_TEST_IMAGE_NUMBER] 31 | else: 32 | test_data_set = input_data.read_data_sets(TF_DATA_DIR, one_hot=True).test 33 | image = random.choice(test_data_set.images) 34 | 35 | channel = implementations.insecure_channel( 36 | TF_MODEL_SERVER_HOST, TF_MODEL_SERVER_PORT) 37 | stub = prediction_service_pb2.beta_create_PredictionService_stub(channel) 38 | 39 | request = predict_pb2.PredictRequest() 40 | request.model_spec.name = "mnist" 41 | request.model_spec.signature_name = "serving_default" 42 | request.inputs['x'].CopyFrom( 43 | tf.contrib.util.make_tensor_proto(image, shape=[1, 28, 28])) 44 | 45 | result = stub.Predict(request, 10.0) # 10 secs timeout 46 | 47 | print(result) 48 | print(MNIST.display(image, threshold=0)) 49 | print("Your model says the above number is... %d!" 
% 50 | result.outputs["classes"].int_val[0]) 51 | -------------------------------------------------------------------------------- /9-serving/requirements.txt: -------------------------------------------------------------------------------- 1 | grpc==0.3.post19 2 | numpy==1.14.0 3 | Pillow==5.0.0 4 | python-mnist==0.5 5 | tensorflow==1.5.0 6 | tensorflow-serving-api==1.4.0 7 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Corporation ("Creative Commons") is not a law firm and 2 | does not provide legal services or legal advice. Distribution of 3 | Creative Commons public licenses does not create a lawyer-client or 4 | other relationship. Creative Commons makes its licenses and related 5 | information available on an "as-is" basis. Creative Commons gives no 6 | warranties regarding its licenses, any material licensed under their 7 | terms and conditions, or any related information. Creative Commons 8 | disclaims all liability for damages resulting from their use to the 9 | fullest extent possible. 10 | 11 | Using Creative Commons Public Licenses 12 | 13 | Creative Commons public licenses provide a standard set of terms and 14 | conditions that creators and other rights holders may use to share 15 | original works of authorship and other material subject to copyright 16 | and certain other rights specified in the public license below. The 17 | following considerations are for informational purposes only, are not 18 | exhaustive, and do not form part of our licenses. 19 | 20 | Considerations for licensors: Our public licenses are 21 | intended for use by those authorized to give the public 22 | permission to use material in ways otherwise restricted by 23 | copyright and certain other rights. Our licenses are 24 | irrevocable. Licensors should read and understand the terms 25 | and conditions of the license they choose before applying it. 26 | Licensors should also secure all rights necessary before 27 | applying our licenses so that the public can reuse the 28 | material as expected. Licensors should clearly mark any 29 | material not subject to the license. This includes other CC- 30 | licensed material, or material used under an exception or 31 | limitation to copyright. More considerations for licensors: 32 | wiki.creativecommons.org/Considerations_for_licensors 33 | 34 | Considerations for the public: By using one of our public 35 | licenses, a licensor grants the public permission to use the 36 | licensed material under specified terms and conditions. If 37 | the licensor's permission is not necessary for any reason--for 38 | example, because of any applicable exception or limitation to 39 | copyright--then that use is not regulated by the license. Our 40 | licenses grant only permissions under copyright and certain 41 | other rights that a licensor has authority to grant. Use of 42 | the licensed material may still be restricted for other 43 | reasons, including because others have copyright or other 44 | rights in the material. A licensor may make special requests, 45 | such as asking that all changes be marked or described. 46 | Although not required by our licenses, you are encouraged to 47 | respect those requests where reasonable. 
More_considerations 48 | for the public: 49 | wiki.creativecommons.org/Considerations_for_licensees 50 | 51 | ======================================================================= 52 | 53 | Creative Commons Attribution 4.0 International Public License 54 | 55 | By exercising the Licensed Rights (defined below), You accept and agree 56 | to be bound by the terms and conditions of this Creative Commons 57 | Attribution 4.0 International Public License ("Public License"). To the 58 | extent this Public License may be interpreted as a contract, You are 59 | granted the Licensed Rights in consideration of Your acceptance of 60 | these terms and conditions, and the Licensor grants You such rights in 61 | consideration of benefits the Licensor receives from making the 62 | Licensed Material available under these terms and conditions. 63 | 64 | 65 | Section 1 -- Definitions. 66 | 67 | a. Adapted Material means material subject to Copyright and Similar 68 | Rights that is derived from or based upon the Licensed Material 69 | and in which the Licensed Material is translated, altered, 70 | arranged, transformed, or otherwise modified in a manner requiring 71 | permission under the Copyright and Similar Rights held by the 72 | Licensor. For purposes of this Public License, where the Licensed 73 | Material is a musical work, performance, or sound recording, 74 | Adapted Material is always produced where the Licensed Material is 75 | synched in timed relation with a moving image. 76 | 77 | b. Adapter's License means the license You apply to Your Copyright 78 | and Similar Rights in Your contributions to Adapted Material in 79 | accordance with the terms and conditions of this Public License. 80 | 81 | c. Copyright and Similar Rights means copyright and/or similar rights 82 | closely related to copyright including, without limitation, 83 | performance, broadcast, sound recording, and Sui Generis Database 84 | Rights, without regard to how the rights are labeled or 85 | categorized. For purposes of this Public License, the rights 86 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 87 | Rights. 88 | 89 | d. Effective Technological Measures means those measures that, in the 90 | absence of proper authority, may not be circumvented under laws 91 | fulfilling obligations under Article 11 of the WIPO Copyright 92 | Treaty adopted on December 20, 1996, and/or similar international 93 | agreements. 94 | 95 | e. Exceptions and Limitations means fair use, fair dealing, and/or 96 | any other exception or limitation to Copyright and Similar Rights 97 | that applies to Your use of the Licensed Material. 98 | 99 | f. Licensed Material means the artistic or literary work, database, 100 | or other material to which the Licensor applied this Public 101 | License. 102 | 103 | g. Licensed Rights means the rights granted to You subject to the 104 | terms and conditions of this Public License, which are limited to 105 | all Copyright and Similar Rights that apply to Your use of the 106 | Licensed Material and that the Licensor has authority to license. 107 | 108 | h. Licensor means the individual(s) or entity(ies) granting rights 109 | under this Public License. 110 | 111 | i. 
Share means to provide material to the public by any means or 112 | process that requires permission under the Licensed Rights, such 113 | as reproduction, public display, public performance, distribution, 114 | dissemination, communication, or importation, and to make material 115 | available to the public including in ways that members of the 116 | public may access the material from a place and at a time 117 | individually chosen by them. 118 | 119 | j. Sui Generis Database Rights means rights other than copyright 120 | resulting from Directive 96/9/EC of the European Parliament and of 121 | the Council of 11 March 1996 on the legal protection of databases, 122 | as amended and/or succeeded, as well as other essentially 123 | equivalent rights anywhere in the world. 124 | 125 | k. You means the individual or entity exercising the Licensed Rights 126 | under this Public License. Your has a corresponding meaning. 127 | 128 | 129 | Section 2 -- Scope. 130 | 131 | a. License grant. 132 | 133 | 1. Subject to the terms and conditions of this Public License, 134 | the Licensor hereby grants You a worldwide, royalty-free, 135 | non-sublicensable, non-exclusive, irrevocable license to 136 | exercise the Licensed Rights in the Licensed Material to: 137 | 138 | a. reproduce and Share the Licensed Material, in whole or 139 | in part; and 140 | 141 | b. produce, reproduce, and Share Adapted Material. 142 | 143 | 2. Exceptions and Limitations. For the avoidance of doubt, where 144 | Exceptions and Limitations apply to Your use, this Public 145 | License does not apply, and You do not need to comply with 146 | its terms and conditions. 147 | 148 | 3. Term. The term of this Public License is specified in Section 149 | 6(a). 150 | 151 | 4. Media and formats; technical modifications allowed. The 152 | Licensor authorizes You to exercise the Licensed Rights in 153 | all media and formats whether now known or hereafter created, 154 | and to make technical modifications necessary to do so. The 155 | Licensor waives and/or agrees not to assert any right or 156 | authority to forbid You from making technical modifications 157 | necessary to exercise the Licensed Rights, including 158 | technical modifications necessary to circumvent Effective 159 | Technological Measures. For purposes of this Public License, 160 | simply making modifications authorized by this Section 2(a) 161 | (4) never produces Adapted Material. 162 | 163 | 5. Downstream recipients. 164 | 165 | a. Offer from the Licensor -- Licensed Material. Every 166 | recipient of the Licensed Material automatically 167 | receives an offer from the Licensor to exercise the 168 | Licensed Rights under the terms and conditions of this 169 | Public License. 170 | 171 | b. No downstream restrictions. You may not offer or impose 172 | any additional or different terms or conditions on, or 173 | apply any Effective Technological Measures to, the 174 | Licensed Material if doing so restricts exercise of the 175 | Licensed Rights by any recipient of the Licensed 176 | Material. 177 | 178 | 6. No endorsement. Nothing in this Public License constitutes or 179 | may be construed as permission to assert or imply that You 180 | are, or that Your use of the Licensed Material is, connected 181 | with, or sponsored, endorsed, or granted official status by, 182 | the Licensor or others designated to receive attribution as 183 | provided in Section 3(a)(1)(A)(i). 184 | 185 | b. Other rights. 186 | 187 | 1. 
Moral rights, such as the right of integrity, are not 188 | licensed under this Public License, nor are publicity, 189 | privacy, and/or other similar personality rights; however, to 190 | the extent possible, the Licensor waives and/or agrees not to 191 | assert any such rights held by the Licensor to the limited 192 | extent necessary to allow You to exercise the Licensed 193 | Rights, but not otherwise. 194 | 195 | 2. Patent and trademark rights are not licensed under this 196 | Public License. 197 | 198 | 3. To the extent possible, the Licensor waives any right to 199 | collect royalties from You for the exercise of the Licensed 200 | Rights, whether directly or through a collecting society 201 | under any voluntary or waivable statutory or compulsory 202 | licensing scheme. In all other cases the Licensor expressly 203 | reserves any right to collect such royalties. 204 | 205 | 206 | Section 3 -- License Conditions. 207 | 208 | Your exercise of the Licensed Rights is expressly made subject to the 209 | following conditions. 210 | 211 | a. Attribution. 212 | 213 | 1. If You Share the Licensed Material (including in modified 214 | form), You must: 215 | 216 | a. retain the following if it is supplied by the Licensor 217 | with the Licensed Material: 218 | 219 | i. identification of the creator(s) of the Licensed 220 | Material and any others designated to receive 221 | attribution, in any reasonable manner requested by 222 | the Licensor (including by pseudonym if 223 | designated); 224 | 225 | ii. a copyright notice; 226 | 227 | iii. a notice that refers to this Public License; 228 | 229 | iv. a notice that refers to the disclaimer of 230 | warranties; 231 | 232 | v. a URI or hyperlink to the Licensed Material to the 233 | extent reasonably practicable; 234 | 235 | b. indicate if You modified the Licensed Material and 236 | retain an indication of any previous modifications; and 237 | 238 | c. indicate the Licensed Material is licensed under this 239 | Public License, and include the text of, or the URI or 240 | hyperlink to, this Public License. 241 | 242 | 2. You may satisfy the conditions in Section 3(a)(1) in any 243 | reasonable manner based on the medium, means, and context in 244 | which You Share the Licensed Material. For example, it may be 245 | reasonable to satisfy the conditions by providing a URI or 246 | hyperlink to a resource that includes the required 247 | information. 248 | 249 | 3. If requested by the Licensor, You must remove any of the 250 | information required by Section 3(a)(1)(A) to the extent 251 | reasonably practicable. 252 | 253 | 4. If You Share Adapted Material You produce, the Adapter's 254 | License You apply must not prevent recipients of the Adapted 255 | Material from complying with this Public License. 256 | 257 | 258 | Section 4 -- Sui Generis Database Rights. 259 | 260 | Where the Licensed Rights include Sui Generis Database Rights that 261 | apply to Your use of the Licensed Material: 262 | 263 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 264 | to extract, reuse, reproduce, and Share all or a substantial 265 | portion of the contents of the database; 266 | 267 | b. if You include all or a substantial portion of the database 268 | contents in a database in which You have Sui Generis Database 269 | Rights, then the database in which You have Sui Generis Database 270 | Rights (but not its individual contents) is Adapted Material; and 271 | 272 | c. 
You must comply with the conditions in Section 3(a) if You Share 273 | all or a substantial portion of the contents of the database. 274 | 275 | For the avoidance of doubt, this Section 4 supplements and does not 276 | replace Your obligations under this Public License where the Licensed 277 | Rights include other Copyright and Similar Rights. 278 | 279 | 280 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 281 | 282 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 283 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 284 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 285 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 286 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 287 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 288 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 289 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 290 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 291 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 292 | 293 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 294 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 295 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 296 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 297 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 298 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 299 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 300 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 301 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 302 | 303 | c. The disclaimer of warranties and limitation of liability provided 304 | above shall be interpreted in a manner that, to the extent 305 | possible, most closely approximates an absolute disclaimer and 306 | waiver of all liability. 307 | 308 | 309 | Section 6 -- Term and Termination. 310 | 311 | a. This Public License applies for the term of the Copyright and 312 | Similar Rights licensed here. However, if You fail to comply with 313 | this Public License, then Your rights under this Public License 314 | terminate automatically. 315 | 316 | b. Where Your right to use the Licensed Material has terminated under 317 | Section 6(a), it reinstates: 318 | 319 | 1. automatically as of the date the violation is cured, provided 320 | it is cured within 30 days of Your discovery of the 321 | violation; or 322 | 323 | 2. upon express reinstatement by the Licensor. 324 | 325 | For the avoidance of doubt, this Section 6(b) does not affect any 326 | right the Licensor may have to seek remedies for Your violations 327 | of this Public License. 328 | 329 | c. For the avoidance of doubt, the Licensor may also offer the 330 | Licensed Material under separate terms or conditions or stop 331 | distributing the Licensed Material at any time; however, doing so 332 | will not terminate this Public License. 333 | 334 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 335 | License. 336 | 337 | 338 | Section 7 -- Other Terms and Conditions. 339 | 340 | a. The Licensor shall not be bound by any additional or different 341 | terms or conditions communicated by You unless expressly agreed. 342 | 343 | b. 
Any arrangements, understandings, or agreements regarding the 344 | Licensed Material not stated herein are separate from and 345 | independent of the terms and conditions of this Public License. 346 | 347 | 348 | Section 8 -- Interpretation. 349 | 350 | a. For the avoidance of doubt, this Public License does not, and 351 | shall not be interpreted to, reduce, limit, restrict, or impose 352 | conditions on any use of the Licensed Material that could lawfully 353 | be made without permission under this Public License. 354 | 355 | b. To the extent possible, if any provision of this Public License is 356 | deemed unenforceable, it shall be automatically reformed to the 357 | minimum extent necessary to make it enforceable. If the provision 358 | cannot be reformed, it shall be severed from this Public License 359 | without affecting the enforceability of the remaining terms and 360 | conditions. 361 | 362 | c. No term or condition of this Public License will be waived and no 363 | failure to comply consented to unless expressly agreed to by the 364 | Licensor. 365 | 366 | d. Nothing in this Public License constitutes or may be interpreted 367 | as a limitation upon, or waiver of, any privileges and immunities 368 | that apply to the Licensor or You, including from the legal 369 | processes of any jurisdiction or authority. 370 | 371 | 372 | ======================================================================= 373 | 374 | Creative Commons is not a party to its public 375 | licenses. Notwithstanding, Creative Commons may elect to apply one of 376 | its public licenses to material it publishes and in those instances 377 | will be considered the “Licensor.” The text of the Creative Commons 378 | public licenses is dedicated to the public domain under the CC0 Public 379 | Domain Dedication. Except for the limited purpose of indicating that 380 | material is shared under a Creative Commons public license or as 381 | otherwise permitted by the Creative Commons policies published at 382 | creativecommons.org/policies, Creative Commons does not authorize the 383 | use of the trademark "Creative Commons" or any other trademark or logo 384 | of Creative Commons without its prior written consent including, 385 | without limitation, in connection with any unauthorized modifications 386 | to any of its public licenses or any other arrangements, 387 | understandings, or agreements concerning use of licensed material. For 388 | the avoidance of doubt, this paragraph does not form part of the 389 | public licenses. 390 | 391 | Creative Commons may be contacted at creativecommons.org. 392 | -------------------------------------------------------------------------------- /LICENSE-CODE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | Copyright (c) Microsoft Corporation 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and 5 | associated documentation files (the "Software"), to deal in the Software without restriction, 6 | including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, 7 | and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, 8 | subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all copies or substantial 11 | portions of the Software. 
12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT 14 | NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 15 | IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 16 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 17 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Labs for Training and Serving TensorFlow Models with Kubernetes and Kubeflow on Azure Container Service (AKS) 2 | 3 | 6 | 7 | ## Prerequisites 8 | 9 | 1. Have a valid Microsoft Azure subscription allowing the creation of an AKS cluster 10 | 1. Docker client installed: [Installing Docker](https://www.docker.com/community-edition) 11 | 1. Azure-cli (2.0) installed: [Installing the Azure CLI 2.0 | Microsoft Docs](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) 12 | 1. Git cli installed: [Installing Git CLI](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) 13 | 1. Kubectl installed: [Installing Kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) 14 | 1. Helm installed: [Installing Helm CLI](https://docs.helm.sh/using_helm/#from-the-binary-releases) (**Note**: On Windows you can extract the `tar` file using a tool like 7Zip.) 15 | 1. ksonnet installed: [Installing ksonnet CLI](https://ksonnet.io/#get-started) 16 | 17 | Clone this repository somewhere so you can easily access the different source files: 18 | ```console 19 | git clone https://github.com/Azure/kubeflow-labs 20 | ``` 21 | 22 | ## Content Summary 23 | 24 | | | Module | Description | 25 | | --- | --- | --- | 26 | |0| **[Introduction](0-intro)** | Introduction to this workshop. Motivations and goals.| 27 | |1| **[Docker](1-docker)** | Docker and containers 101.| 28 | |2| **[Kubernetes](2-kubernetes)** | Kubernetes important concepts overview.| 29 | |3| **[Helm](3-helm)** | Introduction to Helm | 30 | |4| **[Kubeflow](4-kubeflow)** | Introduction to Kubeflow and how to deploy it in your cluster.| 31 | |5| **[JupyterHub](5-jupyterhub)** | Learn how to run JupyterHub to create and manage Jupyter notebooks using Kubeflow | 32 | |6| **[TFJob](6-tfjob)** | Introduction to `TFJob` and how to use it to deploy a simple TensorFlow training.| 33 | |7| **[Distributed Tensorflow](7-distributed-tensorflow)** | Learn how to deploy and monitor distributed TensorFlow trainings with `TFJob`| 34 | |8| **[Hyperparameters Sweep with Helm](8-hyperparam-sweep)** | Using Helm to deploy a large number of trainings testing different hypothesis, and TensorBoard to monitor and compare the results | 35 | |9| **[Serving](9-serving)** | Using TensorFlow Serving to serve predictions | 36 | |10| **[Going Further](10-going-further)** | Links and resources to go further: Autoscaling, Distributed Storage etc. | 37 | 38 | 39 | # Contributing 40 | 41 | This project welcomes contributions and suggestions. Most contributions require you to agree to a 42 | Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us 43 | the rights to use your contribution. For details, visit https://cla.microsoft.com. 
44 | 45 | When you submit a pull request, a CLA-bot will automatically determine whether you need to provide 46 | a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions 47 | provided by the bot. You will only need to do this once across all repos using our CLA. 48 | 49 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 50 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or 51 | contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. 52 | 53 | # Legal Notices 54 | 55 | Microsoft and any contributors grant you a license to the Microsoft documentation and other content 56 | in this repository under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode), 57 | see the [LICENSE](LICENSE) file, and grant you a license to any code in the repository under the [MIT License](https://opensource.org/licenses/MIT), see the 58 | [LICENSE-CODE](LICENSE-CODE) file. 59 | 60 | Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation 61 | may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. 62 | The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. 63 | Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653. 64 | 65 | Privacy information can be found at https://privacy.microsoft.com/en-us/ 66 | 67 | Microsoft and any contributors reserve all others rights, whether under their respective copyrights, patents, 68 | or trademarks, whether by implication, estoppel or otherwise. 69 | --------------------------------------------------------------------------------