├── .gitignore ├── 0-intro ├── README.md ├── thumbnail.png └── workflow.png ├── 1-docker ├── README.md └── src │ ├── Dockerfile │ ├── Dockerfile.gpu │ └── main.py ├── 10-going-further ├── NFSonAzureConcept.png └── README.md ├── 2-kubernetes └── README.md ├── 3-helm ├── README.md └── dokuwiki.png ├── 4-kubeflow └── README.md ├── 5-jupyterhub ├── README.md └── jupyterhub.png ├── 6-tfjob ├── README.md └── file-share.png ├── 7-distributed-tensorflow ├── README.md ├── solution-src │ ├── Dockerfile │ ├── Dockerfile.gpu │ └── main.py └── tensorboard.png ├── 8-hyperparam-sweep ├── README.md ├── solution-chart │ ├── Chart.yaml │ ├── templates │ │ ├── _helpers.tpl │ │ └── deployment.yaml │ └── values.yaml ├── src │ ├── Dockerfile │ ├── Dockerfile.gpu │ ├── main.py │ ├── requirements.txt │ └── starry.jpg └── tensorboard.png ├── 9-serving ├── README.md ├── data │ ├── 0.png │ ├── 1.png │ ├── 2.png │ ├── 3.png │ ├── 4.png │ ├── 5.png │ ├── 6.png │ ├── 7.png │ ├── 8.png │ └── 9.png ├── mnist_client.py └── requirements.txt ├── LICENSE ├── LICENSE-CODE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Binaries for programs and plugins 2 | *.exe 3 | *.dll 4 | *.so 5 | *.dylib 6 | 7 | # Test binary, build with `go test -c` 8 | *.test 9 | 10 | # Output of the go coverage tool, specifically when used with LiteIDE 11 | *.out 12 | 13 | # Project-local glide cache, RE: https://github.com/Masterminds/glide/issues/736 14 | .glide/ 15 | .DS_Store 16 | -------------------------------------------------------------------------------- /0-intro/README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | This labs will walk you through setting up [Kubeflow](https://github.com/kubeflow/kubeflow) on a kubernetes cluster on Azure Container Service (AKS). 4 | 5 | We will then take a look at how to use the different components that make up Kubeflow. 6 | 7 | ## Motivations 8 | 9 | Machine learning model development and operationalization currently has very few industry-wide best practices to help us reduce the time to market and optimize the different steps. 10 | 11 | However in traditional application development, DevOps practices are becoming ubiquitous. 12 | We can benefit from many of these practices by applying them to model development and operationalization. 13 | 14 | Here are a subset of pain points that exists in a typical ML workflow. 15 | #### A Typical (Simplified) ML Workflow and its Pain Points 16 | ![Typical Workflow](workflow.png) 17 | 18 | This workshop is going to focus on improving the training and serving process by leveraging containers and Kubernetes. 19 | 20 | Today many data scientists are training their models either on their physical workstation (be it a laptop or a desktop with multiple GPUs) or using a VM (sometime, but rarely, a couple of them) in the cloud. 21 | 22 | This approach is sub-optimal for many reasons, among which: 23 | * Training is slow and sequential 24 | * Having a single (or few) GPU on hand, means there is only so much trainings you can do at the time. It also means that once your GPU is busy with a training you cannot use it to do something else, just as smaller experiments. 25 | * Hyper-parameter sweeping is vastly inefficient: The different hypothesis you want to test will run sequentially and not in parallel. 
In practice, this means that very often we don't have time to really explore the hyper-parameter space, and we just run a couple of experiments that we think will yield the best results. 26 | The longer the training time, the fewer experiments we can run. 27 | * Distributed training is hard (or impossible) to set up 28 | * In practice, very few data scientists benefit from distributed training, either because they simply can't use it (you need multiple machines for that) or because it is too tedious to set up. 29 | * High cost 30 | * If each member of the team has their own allocated resources, in practice many of them will sit idle at any given time; given the price of a single GPU, this is very costly. On the other hand, pooling resources (such as sharing VMs) is also painful, since multiple people might want to use them at the same time. 31 | 32 | Using Kubernetes, we can alleviate many of these pain points: 33 | * Training is massively parallelizable 34 | * Kubernetes is highly scalable (up to 1200 VMs for a single cluster on Azure). In practice, that means you can run as many experiments as you want at the same time. This makes exploring and comparing different hypotheses much simpler and more efficient. 35 | * Distributed training is much simpler 36 | * As we will see in this workshop, it is very easy to set up distributed TensorFlow training on Kubernetes and scale it to whatever size you want, making it much more usable in practice. 37 | * Optimized cost with autoscaling* 38 | * Kubernetes allows for resource pooling while at the same time ensuring that any training job can run without waiting for another one to finish. 39 | * With autoscaling, the cluster can automatically scale out or in to ensure maximum utilization, thus keeping the cost as low as possible. 40 | 41 | *While autoscaling is very powerful, it is outside the scope of this workshop. However, we will give you resources and pointers to get started with it. 42 | 43 | 44 | ## OpenAI: Building the Infrastructure that Powers the Future of AI 45 | 46 | During KubeCon 2017, Vicki Cheung and Jonas Schneider delivered a keynote explaining how OpenAI manages training at very large scale with Kubernetes; it is worth watching: 47 | 48 | ![OpenAI](./thumbnail.png) 49 | 50 | ## Next Step 51 | [Module 1: Docker](../1-docker/README.md) 52 | -------------------------------------------------------------------------------- /0-intro/thumbnail.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/0-intro/thumbnail.png -------------------------------------------------------------------------------- /0-intro/workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/0-intro/workflow.png -------------------------------------------------------------------------------- /1-docker/README.md: -------------------------------------------------------------------------------- 1 | # Docker 2 | 3 | ### Summary 4 | 5 | In this section you will learn about: 6 | * Running Docker locally 7 | * Basics of Docker 8 | * Containerizing a simple application 9 | * Building and Pushing an Image 10 | 11 | ### Basics of Docker and Containers 12 | 13 | Docker has a very well-structured six-part tutorial.
14 | While for this workshop you don't need to go through all of them, part 1 and 2 are required: 15 | * [Get Started, Part 1: Orientation and setup](https://docs.docker.com/get-started) 16 | * [Get Started, Part 2: Containers](https://docs.docker.com/get-started/part2/) 17 | 18 | By the end of Part 2, you should have a simple container up and running, and understand the basic concepts of a container. 19 | 20 | #### Additional Important Docker Command 21 | 22 | Here a few other docker commands that are important to be aware of for the rest of this workshop: 23 | 24 | 1. `docker ps` 25 | 26 | The docker `ps` command allows to list the status of the containers. 27 | 28 | A container could be either stopped or running. When it finishes to execute the process, it will stop. 29 | 30 | For example if you run the command `docker run -it ubuntu hostname` this will : 31 | - Pull the official ubuntu image from the registry 32 | - Start the container in the interactive mode `-it` 33 | - Execute the command : `hostname` 34 | - Stop 35 | 36 | ``` 37 | $ docker run -it ubuntu hostname 38 | 0d0af5005fc7 39 | ``` 40 | 41 | If you run the command `docker ps -a` you should see : 42 | ``` 43 | $ docker ps -a 44 | CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 45 | 0d0af5005fc7 ubuntu "hostname" 58 seconds ago Exited (0) About a minute ago gifted_darwin 46 | ``` 47 | 48 | > The `-a` allows you to list all the containers, not just the one that is running 49 | 50 | We can notice a few things here such as : 51 | - The status `Exited...` for our container 52 | - The name `gifted_darwin` randomly generated for our container, we can specify a custom one using the command `--name` explained in the previous section. 53 | - We can re-execute our container using the command `docker start gifted_darwin` 54 | - We can run the command `docker run -it ubuntu hostname` again and do a `docker ps -a`; we should see how two containers exited. 55 | 56 | 1. `docker logs` 57 | 58 | The docker `logs` allow to fetch the console output from inside the container 59 | 60 | From our previous example, we can run `docker logs gifted_darwin` 61 | 62 | ``` 63 | $ docker logs gifted_darwin 64 | 0d0af5005fc7 65 | ``` 66 | 67 | You can also stream the logs using the command `-f` and print in real time in your console the stdout of your container. 68 | 69 | 1. `docker rm` 70 | 71 | The docker `rm` command allows to remove a container. 72 | 73 | From the previous example, we can see that we have a container listed as exited in our environment, or maybe more if we run the same command `docker run -it ubuntu hostname` multiple times. If we want to do some cleaning and remove those executions from our environment we can use the command `docker rm` 74 | 75 | ``` 76 | $ docker rm gifted_darwin 77 | gifted_darwin 78 | ``` 79 | 80 | > You can either specify the **CONTAINER ID** or the **NAME** of the container to refer to it 81 | 82 | 83 | 1. `docker images` 84 | 85 | This command allows us to list all the base images available in the environment. 
86 | 87 | ``` 88 | $ docker images 89 | REPOSITORY TAG IMAGE ID CREATED SIZE 90 | ubuntu latest 20c44cd7596f 2 days ago 123MB 91 | example-scratch latest 32ff7b65f567 5 days ago 30.7MB 92 | node 8.9.1-slim a6bb2cc1118f 11 days ago 230MB 93 | buildpack-deps xenial a27b6a8abd1c 2 weeks ago 644MB 94 | ``` 95 | 96 | > You can manage your images by removing them using `docker rmi IMAGENAME` or pulling a new one with `docker pull IMAGENAME` 97 | 98 | ### Containerizing a TensorFlow model 99 | 100 | Now that we understand the basics of Docker, let's containerize our first TensorFlow model that we will reuse in the following modules. 101 | Our first model will be a very simple MNIST classifier. You can see the source code in [`./src/main.py`](./src/main.py). 102 | As you can see there is nothing specific to containers in this code, you can run this script directly on your laptop or on a VM. 103 | 104 | Now, to have this run in a container, we need to build an image containing this code and it's dependencies. 105 | As you saw in the tutorial, we will use a `Dockerfile` to do this. 106 | 107 | Here is the (very simple) `Dockerfile` that we are going to use for this model (located in [`./src/Dockerfile`](./src/Dockerfile)): 108 | 109 | ```dockerfile 110 | FROM tensorflow/tensorflow:1.10.0 111 | COPY main.py /app/main.py 112 | 113 | ENTRYPOINT ["python", "/app/main.py"] 114 | ``` 115 | 116 | As you can see, we are not building a new image from scratch, instead we are using a base image from TensorFlow. Indeed, TensorFlow has a bunch of base images that you can start with. 117 | You can see the full list here: https://hub.docker.com/r/tensorflow/tensorflow/tags/. 118 | 119 | What is important to note is that different tags need to be used depending on if you want to use GPU or not. 120 | For example, if you wanted to run your model with TensorFlow 1.10.0 and CPU only, you would use `tensorflow/tensorflow:1.10.0`. 121 | If instead you wanted to use GPU, you would start from `tensorflow/tensorflow:1.10.0-gpu`. 122 | 123 | The two other instructions are pretty straightforward, first we copy our script into the container, and then we set this script as the entry point for our container, so that any argument passed to our container would actually get passed to our script. 124 | 125 | #### Building the image 126 | 127 | > If you don't already have a Docker account, see [Log in with your Docker ID](https://docs.docker.com/get-started/part2/#log-in-with-your-docker-id). 128 | 129 | The next step is to build our image to be able to run it using docker. For that, we will use the command `docker build`. 130 | 131 | From the [`./src`](./src) repository, we can build the image with 132 | 133 | ```console 134 | cd src 135 | docker build -t ${DOCKER_USERNAME}/tf-mnist . 136 | ``` 137 | > Reminder: the `-t` argument allows to **tag** the image with a specific name. 138 | 139 | `${DOCKER_USERNAME}` should be your Docker username that you use to connect to Docker Hub. 
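If you have not defined that variable in your shell yet, a minimal sketch for a bash-like shell looks like this (the value is a placeholder, substitute your own Docker Hub username):

```console
export DOCKER_USERNAME=<your-docker-hub-username>
docker build -t ${DOCKER_USERNAME}/tf-mnist .
```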
140 | 141 | The output from this command should look like this: 142 | 143 | ``` 144 | Sending build context to Docker daemon 11.26kB 145 | Step 1/3 : FROM tensorflow/tensorflow:1.10.0 146 | ---> a61a91cc0d1b 147 | Step 2/3 : COPY main.py /app/main.py 148 | ---> b264d6e9a5ef 149 | Removing intermediate container fe8128425296 150 | Step 3/3 : ENTRYPOINT python /app/main.py 151 | ---> Running in 7acb7aac7a9f 152 | ---> 92c7ed17916b 153 | Removing intermediate container 7acb7aac7a9f 154 | Successfully built 92c7ed17916b 155 | Successfully tagged wbuchwalter/tf-mnist:latest 156 | ``` 157 | Let's analyse this image full name (`wbuchwalter/tf-mnist:latest`): 158 | * `wbuchwalter` is the name of my repository, this is where we can find the image. This will be different for you (same as your docker hub username). 159 | * `tf-mnist` is the name of the image itself 160 | * `latest` is the tag. `latest` is the default tag if you don't specify any. Tags are usually used to denote different versions or flavors of a same image. For example you could have a tag `v1` and `v2` to denote different versions, or `cpu` and `gpu` to denote what hardware it can run on. 161 | 162 | When you have the successfully built message, you should now be able to see if your image is locally available with the command `docker images` described earlier. 163 | 164 | #### Running the image 165 | 166 | Now we can try to run it locally using the `docker run` command. 167 | By default the model will run 1000 training steps which can take a few minutes on a laptop. Let's reduce this number to 100 with the `--max_steps` argument. 168 | 169 | ```console 170 | docker run -it ${DOCKER_USERNAME}/tf-mnist --max_steps 100 171 | ``` 172 | 173 | If everything is okay you should see the model training: 174 | 175 | ``` 176 | Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes. 177 | Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz 178 | Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. 179 | Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz 180 | Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. 181 | Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz 182 | Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. 183 | Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz 184 | 2017-11-29 18:32:41.992194: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA 185 | Accuracy at step 0: 0.1292 186 | Accuracy at step 10: 0.7198 187 | Accuracy at step 20: 0.834 188 | Accuracy at step 30: 0.8698 189 | Accuracy at step 40: 0.8783 190 | Accuracy at step 50: 0.8968 191 | Accuracy at step 60: 0.9023 192 | Accuracy at step 70: 0.9059 193 | Accuracy at step 80: 0.9084 194 | Accuracy at step 90: 0.9154 195 | Adding run metadata for 99 196 | ``` 197 | 198 | You can kill the process and exit the container at any time with `ctrl + c`. 199 | 200 | ### Running the image with the NVIDIA GPU of your machine (If you have one) 201 | 202 | **Currently, running docker containers with GPU is only supported on Linux.** 203 | 204 | First install [nvidia-docker](https://github.com/NVIDIA/nvidia-docker). 205 | 206 | You also need to make sure the image you are going to use is optimized for GPU. 
207 | In our example you need to modify the `Dockerfile` to use a TensorFlow image built for GPU: 208 | 209 | ```dockerfile 210 | FROM tensorflow/tensorflow:1.10.0-gpu 211 | COPY main.py /app/main.py 212 | 213 | ENTRYPOINT ["python", "/app/main.py"] 214 | ``` 215 | 216 | Then simply rebuild the image with a new tag (you can use docker or nvidia-docker interchangeably for any command except run): 217 | 218 | ```console 219 | cd src 220 | docker build -t ${DOCKER_USERNAME}/tf-mnist:gpu -f Dockerfile.gpu . 221 | ``` 222 | 223 | Finally run the container with nvidia-docker: 224 | 225 | ```console 226 | nvidia-docker run -it ${DOCKER_USERNAME}/tf-mnist:gpu 227 | ``` 228 | 229 | > Note: If the command fails with `Unknown runtime specified nvidia`, follow the steps described here (Systemd drop-in file): https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup 230 | 231 | #### Publish the Image 232 | 233 | Our image is now built and running locally, but what about sharing it to be able to use it from anywhere by anyone? 234 | Most importantly we want to be able to reuse this image on the Kubernetes cluster we are going to create in module 2. 235 | So let's push our image to Docker Hub: 236 | 237 | ```console 238 | docker push ${DOCKER_USERNAME}/tf-mnist:gpu 239 | ``` 240 | 241 | If this command doesn't look familiar to you, make sure you went through part 1 and 2 of Docker's tutorial, and more precisely: [Tutorial - Share your image](https://docs.docker.com/get-started/part2/#share-your-image) 242 | 243 | 244 | ### Useful Links 245 | * [What is Docker ?](https://www.docker.com/what-docker) 246 | * [Docker for beginner](https://github.com/docker/labs/blob/master/beginner/readme.md) 247 | 248 | 249 | ## Next Step 250 | [2 - Kubernetes](../2-kubernetes/README.md) 251 | -------------------------------------------------------------------------------- /1-docker/src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.10.0 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] 5 | -------------------------------------------------------------------------------- /1-docker/src/Dockerfile.gpu: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.10.0-gpu 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] 5 | -------------------------------------------------------------------------------- /1-docker/src/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the 'License'); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an 'AS IS' BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """A simple MNIST classifier which displays summaries in TensorBoard. 
16 | 17 | This is an unimpressive MNIST model, but it is a good example of using 18 | tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of 19 | naming summary tags so that they are grouped meaningfully in TensorBoard. 20 | 21 | It demonstrates the functionality of every TensorBoard dashboard. 22 | """ 23 | from __future__ import absolute_import 24 | from __future__ import division 25 | from __future__ import print_function 26 | 27 | import argparse 28 | import os 29 | import sys 30 | 31 | import tensorflow as tf 32 | 33 | from tensorflow.examples.tutorials.mnist import input_data 34 | 35 | FLAGS = None 36 | 37 | 38 | def train(): 39 | # Import data 40 | mnist = input_data.read_data_sets(FLAGS.data_dir, 41 | one_hot=True, 42 | fake_data=FLAGS.fake_data) 43 | 44 | # Create a multilayer model. 45 | 46 | # Input placeholders 47 | with tf.name_scope('input'): 48 | x = tf.placeholder(tf.float32, [None, 784], name='x-input') 49 | y_ = tf.placeholder(tf.float32, [None, 10], name='y-input') 50 | 51 | with tf.name_scope('input_reshape'): 52 | image_shaped_input = tf.reshape(x, [-1, 28, 28, 1]) 53 | tf.summary.image('input', image_shaped_input, 10) 54 | 55 | # We can't initialize these variables to 0 - the network will get stuck. 56 | def weight_variable(shape): 57 | """Create a weight variable with appropriate initialization.""" 58 | initial = tf.truncated_normal(shape, stddev=0.1) 59 | return tf.Variable(initial) 60 | 61 | def bias_variable(shape): 62 | """Create a bias variable with appropriate initialization.""" 63 | initial = tf.constant(0.1, shape=shape) 64 | return tf.Variable(initial) 65 | 66 | def variable_summaries(var): 67 | """Attach a lot of summaries to a Tensor (for TensorBoard visualization).""" 68 | with tf.name_scope('summaries'): 69 | mean = tf.reduce_mean(var) 70 | tf.summary.scalar('mean', mean) 71 | with tf.name_scope('stddev'): 72 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 73 | tf.summary.scalar('stddev', stddev) 74 | tf.summary.scalar('max', tf.reduce_max(var)) 75 | tf.summary.scalar('min', tf.reduce_min(var)) 76 | tf.summary.histogram('histogram', var) 77 | 78 | def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu): 79 | """Reusable code for making a simple neural net layer. 80 | 81 | It does a matrix multiply, bias add, and then uses ReLU to nonlinearize. 82 | It also sets up name scoping so that the resultant graph is easy to read, 83 | and adds a number of summary ops. 84 | """ 85 | # Adding a name scope ensures logical grouping of the layers in the graph. 86 | with tf.name_scope(layer_name): 87 | # This Variable will hold the state of the weights for the layer 88 | with tf.name_scope('weights'): 89 | weights = weight_variable([input_dim, output_dim]) 90 | variable_summaries(weights) 91 | with tf.name_scope('biases'): 92 | biases = bias_variable([output_dim]) 93 | variable_summaries(biases) 94 | with tf.name_scope('Wx_plus_b'): 95 | preactivate = tf.matmul(input_tensor, weights) + biases 96 | tf.summary.histogram('pre_activations', preactivate) 97 | activations = act(preactivate, name='activation') 98 | tf.summary.histogram('activations', activations) 99 | return activations 100 | 101 | hidden1 = nn_layer(x, 784, 500, 'layer1') 102 | 103 | with tf.name_scope('dropout'): 104 | keep_prob = tf.placeholder(tf.float32) 105 | tf.summary.scalar('dropout_keep_probability', keep_prob) 106 | dropped = tf.nn.dropout(hidden1, keep_prob) 107 | 108 | # Do not apply softmax activation yet, see below. 
109 | y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity) 110 | 111 | with tf.name_scope('cross_entropy'): 112 | # The raw formulation of cross-entropy, 113 | # 114 | # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)), 115 | # reduction_indices=[1])) 116 | # 117 | # can be numerically unstable. 118 | # 119 | # So here we use tf.nn.softmax_cross_entropy_with_logits on the 120 | # raw outputs of the nn_layer above, and then average across 121 | # the batch. 122 | diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) 123 | with tf.name_scope('total'): 124 | cross_entropy = tf.reduce_mean(diff) 125 | tf.summary.scalar('cross_entropy', cross_entropy) 126 | 127 | with tf.name_scope('train'): 128 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize( 129 | cross_entropy) 130 | 131 | with tf.name_scope('accuracy'): 132 | with tf.name_scope('correct_prediction'): 133 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 134 | with tf.name_scope('accuracy'): 135 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 136 | tf.summary.scalar('accuracy', accuracy) 137 | 138 | # Merge all the summaries and write them out to 139 | # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default) 140 | merged = tf.summary.merge_all() 141 | 142 | def feed_dict(train): 143 | """Make a TensorFlow feed_dict: maps data onto Tensor placeholders.""" 144 | if train or FLAGS.fake_data: 145 | xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data) 146 | k = FLAGS.dropout 147 | else: 148 | xs, ys = mnist.test.images, mnist.test.labels 149 | k = 1.0 150 | return {x: xs, y_: ys, keep_prob: k} 151 | 152 | sess = tf.InteractiveSession() 153 | train_writer = tf.summary.FileWriter(FLAGS.log_dir + '/train', sess.graph) 154 | test_writer = tf.summary.FileWriter(FLAGS.log_dir + '/test') 155 | tf.global_variables_initializer().run() 156 | # Train the model, and also write summaries. 
157 | # Every 10th step, measure test-set accuracy, and write test summaries 158 | # All other steps, run train_step on training data, & add training summaries 159 | 160 | 161 | for i in range(FLAGS.max_steps): 162 | if i % 10 == 0: # Record summaries and test-set accuracy 163 | summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False)) 164 | test_writer.add_summary(summary, i) 165 | print('Accuracy at step %s: %s' % (i, acc)) 166 | else: # Record train set summaries, and train 167 | if i % 100 == 99: # Record execution stats 168 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 169 | run_metadata = tf.RunMetadata() 170 | summary, _ = sess.run([merged, train_step], 171 | feed_dict=feed_dict(True), 172 | options=run_options, 173 | run_metadata=run_metadata) 174 | train_writer.add_run_metadata(run_metadata, 'step%03d' % i) 175 | train_writer.add_summary(summary, i) 176 | print('Adding run metadata for', i) 177 | else: # Record a summary 178 | summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True)) 179 | train_writer.add_summary(summary, i) 180 | train_writer.close() 181 | test_writer.close() 182 | 183 | 184 | def main(_): 185 | train() 186 | 187 | 188 | if __name__ == '__main__': 189 | parser = argparse.ArgumentParser() 190 | parser.add_argument('--fake_data', nargs='?', const=True, type=bool, 191 | default=False, 192 | help='If true, uses fake data for unit testing.') 193 | parser.add_argument('--max_steps', type=int, default=1000, 194 | help='Number of steps to run trainer.') 195 | parser.add_argument('--learning_rate', type=float, default=0.001, 196 | help='Initial learning rate') 197 | parser.add_argument('--dropout', type=float, default=0.9, 198 | help='Keep probability for training dropout.') 199 | parser.add_argument( 200 | '--data_dir', 201 | type=str, 202 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 203 | 'tensorflow/input_data'), 204 | help='Directory for storing input data') 205 | parser.add_argument( 206 | '--log_dir', 207 | type=str, 208 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 209 | 'tensorflow/logs'), 210 | help='Summaries log directory') 211 | FLAGS, unparsed = parser.parse_known_args() 212 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) -------------------------------------------------------------------------------- /10-going-further/NFSonAzureConcept.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/10-going-further/NFSonAzureConcept.png -------------------------------------------------------------------------------- /10-going-further/README.md: -------------------------------------------------------------------------------- 1 | # Going Further 2 | 3 | ## Summary 4 | 5 | * Learn how to deal with very large amount of input data 6 | * Learn how to autoscale a kubernetes cluster on acs-engine. 7 | 8 | ## Working with large amount of data 9 | 10 | In the previous modules, we saw how we could easily scale our trainings on Kubernetes to a large amount of nodes. 11 | We also saw how Azure Files provides an easy way to add persistent storage to our trainings to save the output models, summaries etc. 12 | 13 | However in every example we worked with so far, the training data was either downloaded at run time, or directly baked into the container. While this can work for very small datasets, it is not a scalable approach for larger ones. 
14 | Just as we used Azure Files to store models and logs, we can use Azure Files to store our dataset and mount the share in our containers. This work well only up to a certain point though (say a few GBs), as an Azure file share is limited to 1000 IOPS. 15 | 16 | So how can we deal with larger datasets? 17 | One solution to this is to use a distributed file system. 18 | 19 | ### Distributed File System, Tools and Concepts 20 | 21 | A distributed file system aggregates various storages over network into a single large network file system. 22 | Such a file system can be deployed inside your Kubernetes cluster, and can use the disks already attached to your nodes as a partition of the overall network file system. 23 | 24 | ![](NFSonAzureConcept.png) 25 | 26 | Here are some tools and frameworks that can make it easy to deploy such a distributed file system on Kubernetes: 27 | 28 | * [GlusterFS](http://www.gluster.org/) 29 | * [Rook](https://rook.io/) 30 | * [Portworx](https://portworx.com/) 31 | * [Pachyderm](http://pachyderm.io/) 32 | 33 | ## Autoscaling a Kubernetes Cluster 34 | 35 | As we saw in module [7 - Distributed TensorFlow](../7-distributed-tensorflow/) and [8 - Hyperparameters Sweep](../8-hyperparam-sweep), being able to autoscale our Kubernetes cluster can be extremely useful. 36 | Indeed, automatic scale-out (adding more VMs to the cluster) would allow anyone to run any experiment they want, whenever they want without needing to involve an ops person to setup and prepare virtual machines. 37 | And because the cluster can also automatically scale-in by deleting idle VMs once training jobs are completed, we can keep the cost as low as possible by just paying for what we actually use. 38 | 39 | As of this writing, autoscaling is only supported on Kubernetes cluster created with [acs-engine](https://github.com/Azure/acs-engine). 40 | 41 | See the following resources to get started: 42 | 43 | * [Kubernetes Azure cluster-autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/azure) 44 | -------------------------------------------------------------------------------- /2-kubernetes/README.md: -------------------------------------------------------------------------------- 1 | # Kubernetes 2 | 3 | ### Prerequisites 4 | 5 | - [Docker Basics](../1-docker/README.md) 6 | 7 | ### Summary 8 | 9 | In this module you will learn: 10 | 11 | - The basic concepts of Kubernetes 12 | - How to create a Kubernetes cluster on Azure 13 | 14 | > _Important_ : Kubernetes is very often abbreviated to **K8s**. This is the name we are going to use in this lab. 15 | 16 | ## The basic concepts of Kubernetes 17 | 18 | [Kubernetes](https://kubernetes.io/) is an open-source technology that makes it easier to automate deployment, scale, and manage containerized applications in a clustered environment. The ability to use GPUs with Kubernetes allows the clusters to facilitate running frequent experimentations, using it for high-performing serving, and auto-scaling of deep learning models, and much more. 19 | 20 | ### Overview 21 | 22 | Kubernetes is a system for managing containerized applications across a cluster of nodes. To work with Kubernetes, you use Kubernetes API objects to describe your cluster’s desired state: what applications or other workloads you want to run, what container images they use, the number of replicas, what network and disk resources you want to make available, and more. You set your desired state by creating objects using the Kubernetes API. 
Once you’ve set your desired state, the Kubernetes Control Plane works to make the cluster’s current state match the desired state. To do so, Kubernetes performs a variety of tasks automatically, such as starting or restarting containers, scaling the number of replicas of a given application, and more. 23 | 24 | ### Kubernetes Master 25 | 26 | The Kubernetes master is responsible for maintaining the desired state for your cluster. When you interact with Kubernetes, such as by using the kubectl command-line interface, you’re communicating with your cluster’s Kubernetes master. These master services can be installed on a single machine, or distributed across multiple machines. In the following [Provisioning a Kubernetes cluster on Azure](#provisioning-a-kubernetes-cluster-on-azure) section, we will be creating a Kubernetes cluster with 1 master. 27 | 28 | ### Kubernetes Nodes 29 | 30 | The worker nodes communicate with the master components, configure the networking for containers, and run the actual workloads assigned to them. In the following [Provisioning a Kubernetes cluster on Azure](#provisioning-a-kubernetes-cluster-on-azure) section, we will be creating a Kubernetes cluster with 3 worker nodes. 31 | 32 | ### Kubernetes Objects 33 | 34 | Kubernetes contains a number of abstractions that represent the state of your system: deployed containerized applications and workloads, their associated network and disk resources, and other information about what your cluster is doing. A Kubernetes object is a "record of intent" – once you create the object, the Kubernetes system will constantly work to ensure that object exists. By creating an object, you’re telling the Kubernetes system your cluster’s desired state. 35 | 36 | The basic Kubernetes objects include: 37 | 38 | - Pod - the smallest and simplest unit in the Kubernetes object model that you create or deploy. A Pod encapsulates an application container (or multiple containers), storage resources, a unique network IP, and options that govern how the container(s) should run. 39 | - Service - an abstraction which defines a logical set of Pods and a policy by which to access them. 40 | - Volume - an abstraction which allows data to be preserved across container restarts and allows data to be shared between different containers. 41 | - Namespace - a way to divide a physical cluster resources into multiple virtual clusters between multiple users. 42 | - Deployment - Manages pods and ensures a certain number of them are running. This is typically used to deploy pods that should always be up, such as a web server. 43 | - Job - A job creates one or more pods and ensures that a specified number of them successfully terminate. In other words, we use Job to run a task that finishes at some point, such as training a model. 44 | 45 | ### Creating a Kubernetes Object 46 | 47 | When you create an object in Kubernetes, you must provide the object specifications that describes its desired state, as well as some basic information about the object (such as a name) to the Kubernetes API either directly or via the `kubectl` command-line interface. Usually, you will provide the information to `kubectl` in a .yaml file. `kubectl` then converts the information to JSON when making the API request. In the next few sections, we will be using various yaml files to describe the Kubernetes objects we want to deploy to our Kubernetes cluster. 48 | 49 | For example, the `.yaml` file shown below includes the required fields and object spec for a Kubernetes Deployment. 
A Kubernetes Deployment is an object that can represent an application running on your cluster. In the example below, the Deployment spec describes the desired state of three replicas of the nginx application to be running. When you create the Deployment, the Kubernetes system reads the Deployment spec and starts three instances of your desired application, updating the status to match your spec. 50 | 51 | ```yaml 52 | apiVersion: apps/v1beta2 # Kubernetes API version for the object 53 | kind: Deployment # The type of object described by this YAML, here a Deployment 54 | metadata: 55 | name: nginx-deployment # Name of the deployment 56 | spec: # Actual specifications of this deployment 57 | replicas: 3 # Number of replicas (instances) for this deployment. 1 replica = 1 pod 58 | template: 59 | metadata: 60 | labels: 61 | app: nginx 62 | spec: # Specification for the Pod 63 | containers: # These are the containers running inside our Pod, in our case a single one 64 | - name: nginx # Name of this container 65 | image: nginx:1.7.9 # Image to run 66 | ports: # Ports to expose 67 | - containerPort: 80 68 | ``` 69 | 70 | To create all the objects described in a Deployment using a `.yaml` file like the one above in your own Kubernetes cluster you can use Kubernetes' CLI (`kubectl`). 71 | We will be creating a deployment in the exercise toward the end of this module, but first we need a cluster. 72 | 73 | ## Provisioning a Kubernetes cluster on Azure 74 | 75 | We are going to use AKS to create a GPU-enabled Kubernetes cluster. 76 | You could also use [aks-engine](https://github.com/Azure/aks-engine) if you prefer, this guide will assume you are using aks. 77 | 78 | ### A Note on GPUs with Kubernetes 79 | 80 | You can view AKS region availability in [Azure AKS docs](https://docs.microsoft.com/en-us/azure/aks/container-service-quotas#region-availability) 81 | 82 | You can find NVIDIA GPUs (N-series) availability in [region availability documentation](https://azure.microsoft.com/en-us/global-infrastructure/services/?products=virtual-machines®ions=all) 83 | 84 | If you want more options, you may want to use aks-engine for more flexibility. 85 | 86 | ### With the CLI 87 | 88 | #### Creating a resource group 89 | 90 | ```console 91 | az group create --name --location 92 | ``` 93 | 94 | With: 95 | 96 | | Parameter | Description | 97 | | ------------------- | -------------------------------------------------------------- | 98 | | RESOURCE_GROUP_NAME | Name of the resource group where the cluster will be deployed. | 99 | | LOCATION | Name of the region where the cluster should be deployed. | 100 | 101 | #### Creating the cluster 102 | 103 | ```console 104 | az aks create --node-vm-size --resource-group --name 105 | --node-count --kubernetes-version 1.12.5 --location --generate-ssh-keys 106 | ``` 107 | 108 | > Note : The kubernetes verion could change depending where you are deploying your cluster. You can get more informations running the `az aks get-versions` command. 109 | 110 | With: 111 | 112 | | Parameter | Description | 113 | | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 114 | | AGENT_SIZE | The size of K8s's agent VM. Choose `Standard_NC6` for GPUs or `Standard_D2_v2` if you just want CPUs. 
Full list of [options here](https://github.com/Azure/azure-sdk-for-python/blob/master/azure-mgmt-containerservice/azure/mgmt/containerservice/models/container_service_client_enums.py#L21). | 115 | | RG | Name of the resource group that was created in the previous step. | 116 | | NAME | Name of the AKS resource (can be whatever you want). | 117 | | AGENT_COUNT | The number of agents (virtual machines) that you want in your cluster. 3 or 4 is recommended to play with hyper-parameter tuning and distributed TensorFlow | 118 | | LOCATION | Same location that was specified for the resource group creation. | 119 | 120 | The command should take a few minutes to complete. Once it is done, the output should be a JSON object indicating among other things the `provisioningState`: 121 | 122 | ``` 123 | { 124 | [...] 125 | "provisioningState": "Succeeded", 126 | [...] 127 | } 128 | ``` 129 | 130 | #### Getting the `kubeconfig` file 131 | 132 | The `kubeconfig` file is a configuration file that will allow Kubernetes's CLI (`kubectl`) to know how to talk to our cluster. 133 | To download the `kubeconfig` file from the cluster we just created, run: 134 | 135 | ```console 136 | az aks get-credentials --name --resource-group 137 | ``` 138 | 139 | Where `NAME` and `RG` should be the same values as for the cluster creation. 140 | 141 | #### Installing NVIDIA Device Plugin (AKS only) 142 | 143 | For AKS, install NVIDIA Device Plugin using: 144 | 145 | For Kubernetes 1.10: 146 | 147 | ```bash 148 | kubectl apply -f https://raw.githubusercontent.com/nvidia/k8s-device-plugin/v1.10/nvidia-device-plugin.yml 149 | ``` 150 | 151 | For Kubernetes 1.11 and above: 152 | 153 | ```bash 154 | kubectl apply -f https://raw.githubusercontent.com/nvidia/k8s-device-plugin/v1.11/nvidia-device-plugin.yml 155 | ``` 156 | 157 | For AKS Engine, NVIDIA Device Plugin will automatically installed with N-Series GPU clusters. 158 | 159 | ##### Validation 160 | 161 | Once you are done with the cluster creation, and downloaded the `kubeconfig` file, running the following command: 162 | 163 | ```console 164 | kubectl get nodes 165 | ``` 166 | 167 | Should yield an output similar to this one: 168 | 169 | ``` 170 | NAME STATUS ROLES AGE VERSION 171 | aks-nodepool1-42640332-0 Ready agent 1h v1.11.1 172 | aks-nodepool1-42640332-1 Ready agent 1h v1.11.1 173 | aks-nodepool1-42640332-2 Ready agent 1h v1.11.1 174 | ``` 175 | 176 | If you provisioned GPU VM, describing one of the node should indicate the presence of GPU(s) on the node: 177 | 178 | ```console 179 | > kubectl describe node 180 | 181 | [...] 182 | Capacity: 183 | nvidia.com/gpu: 1 184 | [...] 185 | ``` 186 | 187 | > Note: In some scenarios, you might not see GPU resources under Capacity. To resolve this, you must install a daemonset as described in the troubleshooting section here: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster 188 | 189 | ## Exercise 190 | 191 | ### Running our Model on Kubernetes 192 | 193 | > Note: If you didn't complete the exercise in module 1, you can use `wbuchwalter/tf-mnist` image for this exercise. 194 | 195 | In module 1, we created an image for our MNIST classifier, ran a small training locally and pushed this image to Docker Hub. 196 | Since we now have a running Kubernetes cluster, let's run our training on it! 197 | 198 | First, we need to create a YAML template to define what we want to deploy. 
199 | We want our deployment to have a few characteristics: 200 | 201 | - It should be a `Job` since we expect the training to finish successfully after some time. 202 | - It should run the image you created in module 1 (or `wbuchwalter/tf-mnist` if you skipped this module). 203 | - The `Job` should be named `2-mnist-training`. 204 | - We want our training to run for `500` steps. 205 | - We want our training to use 1 GPU 206 | 207 | Here is what this would look like in YAML format: 208 | 209 | ```yaml 210 | apiVersion: batch/v1 211 | kind: Job # Our training should be a Job since it is supposed to terminate at some point 212 | metadata: 213 | name: module2-ex1 # Name of our job 214 | spec: 215 | template: # Template of the Pod that is going to be run by the Job 216 | metadata: 217 | name: module2-ex1 # Name of the pod 218 | spec: 219 | containers: # List of containers that should run inside the pod, in our case there is only one. 220 | - name: tensorflow 221 | image: ${DOCKER_USERNAME}/tf-mnist:gpu # The image to run, you can replace by your own. 222 | args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile 223 | resources: 224 | limits: 225 | nvidia.com/gpu: 1 # We ask Kubernetes to assign 1 GPU to this container 226 | volumeMounts: 227 | - name: nvidia 228 | mountPath: /usr/local/nvidia 229 | volumes: 230 | - name: nvidia 231 | hostPath: 232 | path: /usr/local/nvidia 233 | restartPolicy: OnFailure # restart the pod if it fails 234 | ``` 235 | 236 | Save this template somewhere and deploy it with: 237 | 238 | ```console 239 | kubectl create -f 240 | ``` 241 | 242 | #### Validation 243 | 244 | After deploying the template, 245 | 246 | ```console 247 | kubectl get job 248 | ``` 249 | 250 | Should show your new job: 251 | 252 | ```bash 253 | NAME DESIRED SUCCESSFUL AGE 254 | module2-ex1 1 0 1m 255 | ``` 256 | 257 | Looking at the Pods: 258 | 259 | ```console 260 | kubectl get pods 261 | ``` 262 | 263 | You should see your training running 264 | 265 | ```bash 266 | NAME READY STATUS RESTARTS AGE 267 | module2-ex1-c5b8q 1/1 Runing 0 1m 268 | ``` 269 | 270 | Finally you can look at the logs of your pod with: 271 | 272 | ```console 273 | kubectl logs 274 | ``` 275 | 276 | > Be careful to use the Pod name (from `kubectl get pods`) not the Job name. 277 | 278 | And you should see the training happening 279 | 280 | ```bash 281 | 2017-11-29 21:49:16.462292: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX 282 | Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes. 283 | Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz 284 | Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. 285 | Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz 286 | Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. 287 | Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz 288 | Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. 289 | Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz 290 | Accuracy at step 0: 0.1285 291 | Accuracy at step 10: 0.674 292 | Accuracy at step 20: 0.8065 293 | Accuracy at step 30: 0.8606 294 | Accuracy at step 40: 0.8759 295 | Accuracy at step 50: 0.888 296 | [...] 
297 | ``` 298 | 299 | After a few minutes, looking again at the Job should show that it has completed successfully: 300 | 301 | ```console 302 | kubectl get job 303 | ``` 304 | 305 | ```bash 306 | NAME DESIRED SUCCESSFUL AGE 307 | module2-ex1 1 1 3m 308 | ``` 309 | 310 | ## Next Step 311 | 312 | Currently our training doesn't do anything interesting. We are not even saving the model and summaries anywhere, but don't worry we are going to dive into this starting in Module 4. 313 | 314 | [Module 3: Helm](../3-helm/README.md) 315 | -------------------------------------------------------------------------------- /3-helm/README.md: -------------------------------------------------------------------------------- 1 | # Helm 2 | 3 | ## Prerequisites 4 | 5 | * [Docker Basics](../1-docker/README.md) 6 | * [Kubernetes Basics and cluster created](../2-kubernetes) 7 | 8 | ## Summary 9 | 10 | In this module you will learn : 11 | * What is Helm and how to use it 12 | * What is a Chart and how to create one 13 | 14 | ## Context 15 | 16 | As you saw in the second module [Kubernetes Basics and cluster created](../2-kubernetes), the default way to deploy objects in Kubernetes is by using `kubectl` with `yaml` files. 17 | 18 | For example, if we want to deploy a `pod` running `nginx` and then make it available from an external IP using a `service` you will need to describe at least these two objects such as : 19 | 20 | Deployment : 21 | ```yaml 22 | apiVersion: apps/v1beta2 # for versions before 1.8.0 use apps/v1beta1 23 | kind: Deployment 24 | metadata: 25 | name: nginx-deployment 26 | spec: 27 | selector: 28 | matchLabels: 29 | app: nginx 30 | replicas: 2 # tells deployment to run 2 pods matching the template 31 | template: # create pods using pod definition in this template 32 | metadata: 33 | labels: 34 | app: nginx 35 | spec: 36 | containers: 37 | - name: nginx 38 | image: nginx:1.7.9 39 | ports: 40 | - containerPort: 80 41 | ``` 42 | Service : 43 | ```yaml 44 | apiVersion: v1 45 | kind: Service 46 | metadata: 47 | name: nginx-service 48 | spec: 49 | ports: 50 | - port: 8000 51 | targetPort: 80 52 | protocol: TCP 53 | type: LoadBalancer 54 | selector: 55 | app: nginx 56 | ``` 57 | 58 | The problem with this approach is that when you need to make an update to your solution, you will need to update it across different yaml files. 59 | 60 | Let's say you want to change the name of the app from `nginx` to `nginx-production`. You have to change it in a few places in the deployment and not forget to change the selector setting in the service as well. 61 | 62 | This is one example among others where Helm is fixing the issue by being able to create and use templates. 63 | 64 | ## Helm and Chart 65 | 66 | Helm is the [package manager for Kubernetes](https://deis.com/blog/2016/trusting-whos-at-the-helm/). 67 | 68 | A package is named a **Chart**. 69 | 70 | You can either create you own, or pull and install an official one such as Wordpress, GitLab, Apache Spark, etc... 71 | 72 | You can find a list of the official ones here : [https://github.com/kubernetes/charts/tree/master/stable](https://github.com/kubernetes/charts/tree/master/stable) 73 | 74 | To use Helm, you need to have the [CLI installed on your machine](https://github.com/kubernetes/helm/blob/master/docs/install.md) 75 | 76 | Before using Helm with your cluster, make sure you have the Tiller component running in your cluster. 
77 | ```bash 78 | kubectl get pod --all-namespaces | grep tiller 79 | ``` 80 | 81 | If you do not see Tiller running and if you have an RBAC-enabled cluster, you need a service account and role binding for the Tiller service in your cluster. To install these components, refer to these guides to [create a service account](https://docs.microsoft.com/en-us/azure/aks/kubernetes-helm#create-a-service-account) and [configure helm](https://docs.microsoft.com/en-us/azure/aks/kubernetes-helm#configure-helm) to initialize Tiller in the cluster. 82 | 83 | Once Tiller service is up and running in the cluster and you have initialized helm, let's try to deploy an official Chart such as the popular [Wordpress](https://github.com/kubernetes/charts/tree/master/stable/wordpress) 84 | 85 | ```bash 86 | helm install stable/wordpress 87 | ``` 88 | 89 | > Note: If you have an error such has `Error: incompatible versions client[v2.7.0] server[v2.6.2]`, please run `helm init --upgrade` 90 | 91 | After a few seconds you should see the following output in your terminal : 92 | 93 | ```bash 94 | ... 95 | NAME: cloying-crocodile 96 | LAST DEPLOYED: Wed Nov 22 11:29:55 2017 97 | NAMESPACE: default 98 | STATUS: DEPLOYED 99 | 100 | RESOURCES: 101 | ==> v1beta1/Deployment 102 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 103 | cloying-crocodile-mariadb 1 1 1 0 10s 104 | cloying-crocodile-wordpress 1 1 1 0 10s 105 | 106 | ==> v1/Pod(related) 107 | NAME READY STATUS RESTARTS AGE 108 | cloying-crocodile-mariadb-1648957417-0prvc 0/1 Pending 0 10s 109 | cloying-crocodile-wordpress-3958361718-z9qr3 0/1 Pending 0 10s 110 | 111 | ==> v1/Secret 112 | NAME TYPE DATA AGE 113 | cloying-crocodile-mariadb Opaque 2 10s 114 | cloying-crocodile-wordpress Opaque 2 10s 115 | 116 | ==> v1/ConfigMap 117 | NAME DATA AGE 118 | cloying-crocodile-mariadb 1 10s 119 | 120 | ==> v1/PersistentVolumeClaim 121 | NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE 122 | cloying-crocodile-mariadb Pending default 10s 123 | cloying-crocodile-wordpress Pending default 10s 124 | 125 | ==> v1/Service 126 | NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 127 | cloying-crocodile-mariadb ClusterIP 10.0.163.26 3306/TCP 10s 128 | cloying-crocodile-wordpress LoadBalancer 10.0.168.104 80:31549/TCP,443:32728/TCP 10s 129 | ... 130 | ``` 131 | 132 | You can see all the objects that are necessary to run our Wordpress application in your Kubernetes cluster deployed such as **pods**, **services**, **secrets** etc... Furthermore, since we need a MariaDB engine to run Wordpress, Helm did also automatically deploy it as a dependency in the cluster ! 133 | 134 | As you can see inside the [Wordpress's Chart documentation](https://github.com/kubernetes/charts/tree/master/stable/wordpress) you can override some values such as the image, the database name or the SMTP server for example. 
135 | 136 | You just have to use the `--set` option during the `install` command, like so : 137 | 138 | ```bash 139 | helm install --name my-wordpress \ 140 | --set wordpressUsername=admin,wordpressPassword=password,mariadb.mariadbRootPassword=secretpassword \ 141 | stable/wordpress 142 | ``` 143 | 144 | ## Create your own Chart 145 | 146 | You can also create your own Chart by using the scaffolding command `helm create mychart` 147 | 148 | This will create a folder which includes all the files necessary to create your own package : 149 | 150 | ```bash 151 | ├── Chart.yaml 152 | ├── templates 153 | │   ├── NOTES.txt 154 | │   ├── _helpers.tpl 155 | │   ├── deployment.yaml 156 | │   ├── ingress.yaml 157 | │   └── service.yaml 158 | └── values.yaml 159 | ``` 160 | 161 | All the objects that you want to deploy are stored inside the templates folder in different .yaml files. 162 | 163 | You can find more information on how to create your own chart here : [https://deis.com/blog/2016/getting-started-authoring-helm-charts/](https://deis.com/blog/2016/getting-started-authoring-helm-charts/) 164 | 165 | When you are done with your package, Helm provides a linting tool `helm lint mychart` to help you find issues in it. 166 | 167 | If you want to deploy it into your cluster, you can run the following command from the repository folder: 168 | 169 | ```bash 170 | helm install . --name my-custom-chart 171 | ``` 172 | 173 | ## Exercises 174 | 175 | ### Exercise 1 - Deploy an official Chart : DokuWiki 176 | 177 | From the [official Chart repository](https://github.com/kubernetes/charts/tree/master) you have to deploy a DokuWiki environment. 178 | 179 | [DokuWiki](https://www.dokuwiki.org/) is a standards-compliant, simple to use wiki optimized for creating documentation. It is targeted at developer teams, workgroups, and small companies. All data is stored in plain text files, so no database is required. 180 | 181 | #### Validation 182 | 183 | We want to be able to define a custom Wiki name such as `Hello MLADS` at the deployment. 184 | 185 | You should see the following web page from your deployment : 186 | 187 | ![](dokuwiki.png) 188 | 189 | #### Solution 190 | 191 |

194 | 195 | ```bash 196 | helm install stable/dokuwiki --set dokuwikiWikiName="Hello MLADS" 197 | ``` 198 | 199 |
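If you want to check on the release and find the address to browse to, you can inspect what was deployed (a minimal sketch; replace the release name with the one Helm generated for your `helm install`):

```bash
helm status <your-release-name>
kubectl get svc
```

The chart exposes DokuWiki through a `LoadBalancer` service by default, so once the `EXTERNAL-IP` column is populated you can open that address in your browser to see the page above.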

200 |
201 | 202 | 203 | ## Next Step 204 | 205 | [4 - Kubeflow](../4-kubeflow/README.md) -------------------------------------------------------------------------------- /3-helm/dokuwiki.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/3-helm/dokuwiki.png -------------------------------------------------------------------------------- /4-kubeflow/README.md: -------------------------------------------------------------------------------- 1 | # Kubeflow - Overview and Installation 2 | 3 | ## Prerequisites 4 | 5 | - [1 - Docker](../1-docker/README.md) 6 | - [2 - Kubernetes](../2-kubernetes/README.md) 7 | 8 | ## Summary 9 | 10 | In this module we are going to get an overview of the different components that make up [Kubeflow](https://github.com/kubeflow/kubeflow), and how to install them into our newly deployed Kubernetes cluster. 11 | 12 | ### Kubeflow Overview 13 | 14 | From [Kubeflow](https://github.com/kubeflow/kubeflow)'s own documetation: 15 | 16 | > The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow. 17 | 18 | Kubeflow is composed of multiple components: 19 | 20 | - [JupyterHub](https://jupyterhub.readthedocs.io/en/latest/), which allows user to request an instance of a Jupyter Notebook server dedicated to them. 21 | - One or multiple training controllers. These are component that simplifies and manages the deployment of training jobs. For the purpose of this lab we are only going to deploy a training controller for TensorFlow jobs. However the Kubeflow community has started working on controllers for PyTorch and Caffe2 as well. 22 | - A serving component that will help you serve predictions with your models. 23 | 24 | For more general info on Kubeflow, head to the repo's [README](https://github.com/kubeflow/kubeflow/blob/master/README.md). 25 | 26 | ### Deploying Kubeflow 27 | 28 | Kubeflow uses [`ksonnet`](https://github.com/ksonnet/ksonnet) templates as a way to package and deploy the different components. 29 | 30 | > ksonnet simplifies defining an application configuration, updating the configuration over time, and specializing it for different clusters and environments. 31 | 32 | First, install ksonnet version [0.13.1](https://ksonnet.io/#get-started), or you can [download a prebuilt binary](https://github.com/ksonnet/ksonnet/releases/tag/v0.13.1) for your OS. 33 | 34 | Then run the following commands to download Kubeflow: 35 | 36 | ```bash 37 | KUBEFLOW_SRC=$(realpath kubeflow) 38 | 39 | mkdir ${KUBEFLOW_SRC} 40 | cd ${KUBEFLOW_SRC} 41 | 42 | export KUBEFLOW_TAG=v0.4.1 43 | 44 | curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash 45 | ``` 46 | 47 | `KUBEFLOW_SRC` a directory where you want to download the source to 48 | 49 | `KUBEFLOW_TAG` a tag corresponding to the version to check out, such as master for the latest code. 
50 | 51 | ```bash 52 | # Initialize a kubeflow app 53 | KFAPP=mykubeflowapp 54 | ${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform none 55 | 56 | # Generate kubeflow app 57 | cd ${KFAPP} 58 | ${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s 59 | 60 | # Deploy Kubeflow app 61 | ${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s 62 | ``` 63 | 64 | ### Validation 65 | 66 | `kubectl get pods -n kubeflow` 67 | 68 | should return something like this: 69 | 70 | ``` 71 | NAME READY STATUS RESTARTS AGE 72 | kubeflow ambassador-b4d9cdb8-2qgww 1/1 Running 0 111m 73 | kubeflow ambassador-b4d9cdb8-hpwdc 1/1 Running 0 111m 74 | kubeflow ambassador-b4d9cdb8-khg8l 1/1 Running 0 111m 75 | kubeflow argo-ui-6d6658d8f7-t6whw 1/1 Running 0 110m 76 | kubeflow centraldashboard-6f686c5b7c-462cq 1/1 Running 0 111m 77 | kubeflow jupyter-0 1/1 Running 0 111m 78 | kubeflow katib-ui-6c59754c48-mgf62 1/1 Running 0 110m 79 | kubeflow metacontroller-0 1/1 Running 0 111m 80 | kubeflow minio-d79b65988-6qkxp 1/1 Running 0 110m 81 | kubeflow ml-pipeline-66df9d86f6-rp245 1/1 Running 0 110m 82 | kubeflow ml-pipeline-persistenceagent-7b86dbf4b5-rgndj 1/1 Running 0 110m 83 | kubeflow ml-pipeline-scheduledworkflow-84f6477479-9tvhk 1/1 Running 0 110m 84 | kubeflow ml-pipeline-ui-f76bb5f97-2s5qb 1/1 Running 0 110m 85 | kubeflow mysql-ffc889689-xkpxb 1/1 Running 0 110m 86 | kubeflow pytorch-operator-ff46f9b7d-qkbvh 1/1 Running 0 111m 87 | kubeflow spartakus-volunteer-5b6c956c8f-2gnvb 1/1 Running 0 111m 88 | kubeflow studyjob-controller-b7cdbd4cd-nf9z5 1/1 Running 0 110m 89 | kubeflow tf-job-dashboard-7746db84cf-njdzk 1/1 Running 0 111m 90 | kubeflow tf-job-operator-v1beta1-5949f668f7-j5zrn 1/1 Running 0 111m 91 | kubeflow vizier-core-7c56465f6-t6d5p 1/1 Running 0 110m 92 | kubeflow vizier-core-rest-67f588b4cb-lqvgr 1/1 Running 0 110m 93 | kubeflow vizier-db-86dc7d89c5-8vtfs 1/1 Running 0 110m 94 | kubeflow vizier-suggestion-bayesianoptimization-7cb546fb84-tsrn4 1/1 Running 0 110m 95 | kubeflow vizier-suggestion-grid-6587f9d6b-92c9h 1/1 Running 0 110m 96 | kubeflow vizier-suggestion-hyperband-8bb44f8c8-gs72m 1/1 Running 0 110m 97 | kubeflow vizier-suggestion-random-7ff5db687b-bjdh5 1/1 Running 0 110m 98 | kubeflow workflow-controller-cf79dfbff-lv7jk 1/1 Running 0 110m 99 | ``` 100 | 101 | The most important components for the purpose of this lab are `jupyter-0` which is the JupyterHub spawner running on your cluster, and `tf-job-operator-v1beta1-5949f668f7-j5zrn` which is a controller that will monitor your cluster for new TensorFlow training jobs (called `TfJobs`) specifications and manages the training, we will look at this two components later. 
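You can also confirm that the custom resources used in the next modules were registered by the deployment. A quick check (the exact list varies slightly between Kubeflow versions):

```console
kubectl get crd | grep kubeflow.org
```

You should see entries such as `tfjobs.kubeflow.org`, the definition backing the `TFJob` objects we will create in module 6.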
102 | 103 | ### Remove Kubeflow 104 | 105 | If you want to remove the Kubeflow deployment, you can run the following to remove the namespace and installed components: 106 | 107 | ```bash 108 | cd ${KUBEFLOW_SRC}/${KFAPP} 109 | ${KUBEFLOW_SRC}/scripts/kfctl.sh delete k8s 110 | ``` 111 | 112 | ## Next Step 113 | 114 | [5 - JupyterHub](../5-jupyterhub/README.md) 115 | -------------------------------------------------------------------------------- /5-jupyterhub/README.md: -------------------------------------------------------------------------------- 1 | # Jupyter Notebooks on Kubernetes 2 | 3 | ## Prerequisites 4 | 5 | - [1 - Docker Basics](../1-docker) 6 | - [2 - Kubernetes Basics and cluster created](../2-kubernetes) 7 | - [4 - Kubeflow](../4-kubeflow) 8 | 9 | ## Summary 10 | 11 | In this module, you will learn how to: 12 | 13 | - Run Jupyter Notebooks locally using Docker 14 | - Run JupyterHub on Kubernetes using Kubeflow 15 | 16 | ## How Jupyter Notebooks work 17 | 18 | The [Jupyter Notebook](http://jupyter.org/) is an open source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text for rapid prototyping. It is often used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more. To better support exploratory iteration and to accelerate computation of Tensorflow jobs, let's look at how we can include data science tools like Jupyter Notebook with Docker and Kubernetes. 19 | 20 | ## How JupyterHub works 21 | 22 | The [JupyterHub](https://jupyterhub.readthedocs.io/en/latest/) is a multi-user Hub, spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server. JupyterHub can be used to serve notebooks to a class of students, a corporate data science group, or a scientific research group. Let's look at how we can create JupyterHub to spawn multiple instances of Jupyter Notebook on Kubernetes using Kubeflow. 23 | 24 | ## Exercises 25 | 26 | ### Exercise 1: Run Jupyter Notebooks locally using Docker 27 | 28 | In this first exercise, we will run Jupyter Notebooks locally using Docker. We will use the official tensorflow docker image as it comes with Jupyter notebook. 29 | 30 | ```console 31 | docker run -it -p 8888:8888 tensorflow/tensorflow 32 | ``` 33 | 34 | #### Validation 35 | 36 | To verify, browse to the url in the output log. 37 | 38 | For example: `http://localhost:8888/?token=a3ea3cd914c5b68149e2b4a6d0220eca186fec41563c0413` 39 | 40 | ### Exercise 2: Run JupyterHub on Kubernetes using Kubeflow 41 | 42 | In this exercise, we will run JupyterHub to spawn multiple instances of Jupyter Notebooks on a Kubernetes cluster using Kubeflow. 43 | 44 | As a prerequisite, you should already have a Kubernetes cluster running, you can follow [module 2 - Kubernetes](../2-kubernetes) to create your own cluster and you should already have Kubeflow running in your Kubernetes cluster, you can follow [module 4 - Kubeflow and tfjob Basics](../4-kubeflow-tfjob). 45 | 46 | In module 4, you installed the kubeflow-core component, which already includes JupyterHub and a corresponding load balancer service of type `ClusterIP`. To check its status, run the following kubectl command. 47 | 48 | ``` 49 | NAMESPACE=kubeflow 50 | kubectl get svc -n=${NAMESPACE} 51 | 52 | NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 53 | ... 
54 | jupyter-0 ClusterIP None 8000/TCP 132m 55 | jupyter-lb ClusterIP 10.0.191.68 80/TCP 132m 56 | ``` 57 | 58 | To connect to the Kubeflow dashboard locally: 59 | 60 | ```bash 61 | kubectl port-forward svc/ambassador -n ${NAMESPACE} 8080:80 62 | ``` 63 | 64 | Then navigate to JupyterHub: http://localhost:8080/hub 65 | 66 | [Optional] To connect to your JupyterHub over a public IP: 67 | 68 | To update the default service created for JupyterHub, run the following command to change the service to type LoadBalancer: 69 | 70 | ```bash 71 | cd ks_app 72 | ks param set jupyter serviceType LoadBalancer 73 | cd .. 74 | ${KUBEFLOW_SOURCE}/scripts/kfctl.sh apply k8s 75 | ``` 76 | 77 | Create a new Jupyter Notebook instance: 78 | 79 | - open http://localhost:8080/hub/ in your browser (or use the public IP for the service `tf-hub-lb`) 80 | - log in using any username and password 81 | - click the "Start My Server" button to sprawn a new Jupyter notebook 82 | - from the image dropdown, select a tensorflow image for your notebook 83 | - for CPU and memory, enter values based on your resource requirements, for example: 1 CPU and 2Gi 84 | - to get available GPUs in your cluster, run the following command: 85 | 86 | ``` 87 | kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.alpha\.kubernetes\.io\/nvidia-gpu" 88 | ``` 89 | 90 | - for GPU, enter values in json format `{"nvidia.com/gpu":"1"}` 91 | - click the "Spawn" button 92 | 93 | ![jupyterhub](./jupyterhub.png) 94 | 95 | The images are quite large. This process can take a long time. 96 | 97 | #### Validation 98 | 99 | You can check the status of the pod by running: 100 | 101 | ``` 102 | kubectl -n ${NAMESPACE} describe pods jupyter-${USERNAME} 103 | ``` 104 | 105 | After the pod status changes to `running`, to verify you will see a new Jupyter notebook running at: http://127.0.0.1:8000/user/{USERNAME}/tree or http://{PUBLIC-IP}/user/{USERNAME}/tree 106 | 107 | ## Next Step 108 | 109 | [6 - TfJob](../6-tfjob) 110 | -------------------------------------------------------------------------------- /5-jupyterhub/jupyterhub.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/5-jupyterhub/jupyterhub.png -------------------------------------------------------------------------------- /6-tfjob/README.md: -------------------------------------------------------------------------------- 1 | # `TFJob` 2 | 3 | ## Prerequisites 4 | 5 | - [1 - Docker](../1-docker/README.md) 6 | - [2 - Kubernetes](../2-kubernetes/README.md) 7 | - [4 - Kubeflow](../4-kubeflow/README.md) 8 | 9 | ## Summary 10 | 11 | In this module you will learn how to describe a TensorFlow training using `TFJob` object. 12 | 13 | ### Kubernetes Custom Resource Definition 14 | 15 | Kubernetes has a concept of [Custom Resources](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) (often abbreviated CRD) that allows us to create custom object that we will then be able to use. 16 | In the case of Kubeflow, after installation, a new `TFJob` object will be available in our cluster. This object allows us to describe a TensorFlow training. 17 | 18 | #### `TFJob` Specifications 19 | 20 | Before going further, let's take a look at what the `TFJob` object looks like: 21 | 22 | > Note: Some of the fields are not described here for brevity. 
23 | 24 | **`TFJob` Object** 25 | 26 | | Field | Type | Description | 27 | | ---------- | ------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------- | 28 | | apiVersion | `string` | Versioned schema of this representation of an object. In our case, it's `kubeflow.org/v1beta1` | 29 | | kind | `string` | Value representing the REST resource this object represents. In our case it's `TFJob` | 30 | | metadata | [`ObjectMeta`](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata) | Standard object's metadata. | 31 | | spec | `TFJobSpec` | The actual specification of our TensorFlow job, defined below. | 32 | 33 | `spec` is the most important part, so let's look at it too: 34 | 35 | **`TFJobSpec` Object** 36 | 37 | | Field | Type | Description | 38 | | ------------- | --------------------- | -------------------------------------------------------------- | 39 | | TFReplicaSpec | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below | 40 | 41 | Let's go deeper: 42 | 43 | **`TFReplicaSpec` Object** 44 | 45 | | Field | Type | Description | 46 | | ------------- | ------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 47 | | TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER` which happens to be the default value. | 48 | | Replicas | `int` | Number of replicas of `TfReplicaType`. Again this is useful only for distributed TensorFLow. Default value is `1`. | 49 | | Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. | 50 | 51 | Here is what a simple TensorFlow training looks like using this `TFJob` object: 52 | 53 | ```yaml 54 | apiVersion: kubeflow.org/v1beta1 55 | kind: TFJob 56 | metadata: 57 | name: example-tfjob 58 | spec: 59 | tfReplicaSpecs: 60 | MASTER: 61 | replicas: 1 62 | template: 63 | spec: 64 | containers: 65 | - image: /tf-mnist:gpu 66 | name: tensorflow 67 | resources: 68 | limits: 69 | nvidia.com/gpu: 1 70 | restartPolicy: OnFailure 71 | ``` 72 | 73 | Note that we are note specifying `TfReplicaType` or `Replicas` as the default values are already what we want. 74 | 75 | ## Exercises 76 | 77 | ### Exercise 1: A Simple `TFJob` 78 | 79 | Let's schedule a very simple TensorFlow job using `TFJob` first. 80 | 81 | > Note: If you completed the exercise in Module 1 and 2, you can change the image to use the one you pushed instead. 82 | 83 | When using GPU, we need to request for one (or multiple), and the image we are using also needs to be based on TensorFlow's GPU image. 
84 | 85 | ```yaml 86 | apiVersion: kubeflow.org/v1beta1 87 | kind: TFJob 88 | metadata: 89 | name: module6-ex1-gpu 90 | spec: 91 | tfReplicaSpecs: 92 | MASTER: 93 | replicas: 1 94 | template: 95 | spec: 96 | containers: 97 | - image: /tf-mnist:gpu # From module 1 98 | name: tensorflow 99 | resources: 100 | limits: 101 | nvidia.com/gpu: 1 102 | restartPolicy: OnFailure 103 | ``` 104 | 105 | Save the template that applies to you in a file, and create the `TFJob`: 106 | 107 | ```console 108 | kubectl create -f 109 | ``` 110 | 111 | Let's look at what has been created in our cluster. 112 | 113 | First a `TFJob` was created: 114 | 115 | ```console 116 | kubectl get tfjob 117 | ``` 118 | 119 | Returns: 120 | 121 | ``` 122 | NAME AGE 123 | module6-ex1-gpu 5s 124 | ``` 125 | 126 | As well as a `Pod`, which was actually created by the operator: 127 | 128 | ```console 129 | kubectl get pod 130 | ``` 131 | 132 | Returns: 133 | 134 | ``` 135 | NAME READY STATUS RESTARTS AGE 136 | module6-ex1-master-xs4b-0-6gpfn 1/1 Running 0 2m 137 | ``` 138 | 139 | Note that the `Pod` might take a few minutes before actually running, the docker image needs to be pulled on the node first. 140 | 141 | Once the `Pod`'s status is either `Running` or `Completed` we can start looking at it's logs: 142 | 143 | ```console 144 | kubectl logs 145 | ``` 146 | 147 | This container is pretty verbose, but you should see a TensorFlow training happening: 148 | 149 | ``` 150 | [...] 151 | INFO:tensorflow:2017-11-20 20:57:22.314198: Step 480: Cross entropy = 0.142486 152 | INFO:tensorflow:2017-11-20 20:57:22.370080: Step 480: Validation accuracy = 85.0% (N=100) 153 | INFO:tensorflow:2017-11-20 20:57:22.896383: Step 490: Train accuracy = 98.0% 154 | INFO:tensorflow:2017-11-20 20:57:22.896600: Step 490: Cross entropy = 0.075210 155 | INFO:tensorflow:2017-11-20 20:57:22.945611: Step 490: Validation accuracy = 91.0% (N=100) 156 | INFO:tensorflow:2017-11-20 20:57:23.407756: Step 499: Train accuracy = 94.0% 157 | INFO:tensorflow:2017-11-20 20:57:23.407980: Step 499: Cross entropy = 0.170348 158 | INFO:tensorflow:2017-11-20 20:57:23.457325: Step 499: Validation accuracy = 89.0% (N=100) 159 | INFO:tensorflow:Final test accuracy = 88.4% (N=353) 160 | [...] 161 | ``` 162 | 163 | > That's great and all, but how do we grab our trained model and TensorFlow's summaries? 164 | 165 | Well currently we can't. As soon as the training is complete, the container stops and everything inside it, including model and logs are lost. 166 | 167 | Thankfully, Kubernetes `Volumes` can help us here. 168 | If you remember, we quickly introduced `Volumes` in module [2 - Kubernetes](../2-kubernetes/), and that's what we already used to mount the drivers from the node into the container. 169 | But `Volumes` are not just for mounting things from a node, we can also use them to mount a lot of different storage solutions, you can see the full list [here](https://kubernetes.io/docs/concepts/storage/volumes/). 170 | 171 | In our case we are going to use Azure Files, as it is really easy to use with Kubernetes. 172 | 173 | ## Exercise 2: Azure Files to the Rescue 174 | 175 | ### Creating a New File Share and Kubernetes Secret 176 | 177 | In the official documentation: [Persistent volumes with Azure files](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv), follow the steps listed under `Create storage account`, `Create storage class`, and `Create persistent volume claim`. 
178 | Be aware of a few details first: 179 | 180 | - It is **very** important that you create your storage account in the **same** region and the same resource group (with MC\_ prefix) as your Kubernetes cluster: because Azure File uses the `SMB` protocol it won't work cross-regions. `AKS_PERS_LOCATION` should be updated accordingly. 181 | - While this document specifically refers to AKS, it will work for any K8s cluster 182 | - Once the PVC is created, you will see a new file share under that storage account. All subsequent modules will be writing to that file share. 183 | - PVC are namespaced so be sure to create it on the same namespace that is launching the TFJob objects 184 | - If you are using RBAC might need to run the cluster role and binding: [see docs here](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv#create-a-cluster-role-and-binding) 185 | 186 | Once you completed all the steps, run: 187 | 188 | ```console 189 | kubectl get pvc 190 | ``` 191 | 192 | Which should return: 193 | 194 | ``` 195 | NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE 196 | azurefile Bound pvc-346ab93b-4cbf-11e8-9fed-000d3a17b5e9 5Gi RWO azurefile 5m 197 | ``` 198 | 199 | ### Updating our example to use our Azure File Share 200 | 201 | Now we need to mount our new file share into our container so the model and the summaries can be persisted. 202 | Turns out mounting an Azure File share into a container is really easy, we simply need to reference our PVC in the `Volume` definition: 203 | 204 | ```yaml 205 | [...] 206 | containers: 207 | - image: 208 | name: tensorflow 209 | resources: 210 | limits: 211 | nvidia.com/gpu: 1 212 | volumeMounts: 213 | - name: azurefile 214 | subPath: module6-ex2-gpu 215 | mountPath: /tmp/tensorflow 216 | volumes: 217 | - name: azurefile 218 | persistentVolumeClaim: 219 | claimName: azurefile 220 | ``` 221 | 222 | Update your template from exercise 1 to mount the Azure File share into your container, and create your new job. 223 | 224 | Once the container starts running, if you go to the Azure Portal, into your storage account, and browse your `tensorflow` file share, you should see something like that: 225 | 226 | ![file-share](./file-share.png) 227 | 228 | This means that when we run a training, all the important data is now stored in Azure File and is still available as long as we don't delete the file share. 229 | 230 | #### Solution for Exercise 2 231 | 232 |
233 | Solution 234 | 235 | ```yaml 236 | apiVersion: kubeflow.org/v1beta1 237 | kind: TFJob 238 | metadata: 239 | name: module6-ex2 240 | spec: 241 | tfReplicaSpecs: 242 | MASTER: 243 | replicas: 1 244 | template: 245 | spec: 246 | containers: 247 | - image: /tf-mnist:gpu 248 | name: tensorflow 249 | resources: 250 | limits: 251 | nvidia.com/gpu: 1 252 | volumeMounts: 253 | # By default our classifier saves the summaries in /tmp/tensorflow, 254 | # so that's where we want to mount our Azure File Share. 255 | - name: azurefile 256 | # The subPath allows us to mount a subdirectory within the azure file share instead of root 257 | # this is useful so that we can save the logs for each run in a different subdirectory 258 | # instead of overwriting what was done before. 259 | subPath: module6-ex2-gpu 260 | mountPath: /tmp/tensorflow 261 | restartPolicy: OnFailure 262 | volumes: 263 | - name: azurefile 264 | persistentVolumeClaim: 265 | claimName: azurefile 266 | ``` 267 | 268 |
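For reference, the `azurefile` claim used above comes from the dynamic provisioning steps of the Azure documentation linked earlier. A minimal sketch of what the storage class and claim typically look like (names, SKU and size are assumptions; follow the official doc for the authoritative version):

```yaml
# Sketch of the StorageClass and PersistentVolumeClaim assumed by the solution above
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_LRS
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azurefile
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: azurefile
  resources:
    requests:
      storage: 5Gi
```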
269 | 270 | ## Next Step 271 | 272 | [7 - Distributed TensorFlow](../7-distributed-tensorflow) 273 | -------------------------------------------------------------------------------- /6-tfjob/file-share.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/6-tfjob/file-share.png -------------------------------------------------------------------------------- /7-distributed-tensorflow/README.md: -------------------------------------------------------------------------------- 1 | # Distributed TensorFlow with `TFJob` 2 | 3 | ## Prerequisites 4 | 5 | * [2 - Kubernetes Basics and cluster created](../2-kubernetes) 6 | * [4 - Kubeflow](../4-kubeflow) 7 | * [6 - TFJob](../6-tfjob) 8 | 9 | ## Summary 10 | 11 | Distributed TensorFlow trainings can be complicated. In this module, we will see how `TFJob`, one of the components of Kubeflow, can be used to simplify the deployment and monitoring of distributed TensorFlow trainings. 12 | 13 | ## "Vanilla" Distributed TensorFlow is Hard 14 | 15 | First let's see how we would setup a distributed TensorFlow training without Kubernetes or `TFJob` (fear not, we are not actually going to do that). 16 | First, you would have to find or setup a bunch of idle VMs, or physical machines. In most companies, this would already be a feat, and likely require the coordination of multiple department (such as IT) to get the VMs up, running and reserved for your experiment. 17 | Then you would likely have to do some back and forth with the IT department to be able to setup your training: the VMs need to be able to talk to each others and have stable endpoints. Work might be needed to access the data, you would need to upload your TF code on every single machine etc. 18 | If you add GPU to the mix, it would likely get even harder since GPUs aren't usually just waiting there because of their high cost. 19 | 20 | Assuming you get through this, you now need to modify your model for distributed training. 21 | Among other things, you will need to setup the `ClusterSpec` ([`tf.train.ClusterSpec`](https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec)): a TensorFlow class that allows you to describe the architecture of your cluster. 22 | For example, if you were to setup a distributed training with a mere 2 workers and 2 parameter servers, your cluster spec would look like this (the `clusterSpec` would most likely not be hardcoded, but passed as argument to your training script as we will see below, this is for illustration): 23 | 24 | ```python 25 | cluster = tf.train.ClusterSpec({"worker": [":2222", 26 | ":2222"], 27 | "ps": [":2222", 28 | ":2222"]}) 29 | ``` 30 | Here we assume that you want your workers to run on GPU VMs and your parameter servers to run on CPU VMs. 31 | 32 | We will not go through the rest of the modifications needed (splitting operation across devices, getting the master session etc.), as we will look at them later and this would be pretty much the same thing no matter how you run your distributed training. 33 | 34 | Once your model is ready, you need to start the training. 35 | You will need to connect to every single VM, and pass the `ClusterSpec` as well as the assigned job name (ps or worker) and task index to each VM. 
36 | So it would look something like this: 37 | 38 | ```bash 39 | # On ps0: 40 | $ python trainer.py \ 41 | --ps_hosts=:2222,:2222 \ 42 | --worker_hosts=:2222,:2222 \ 43 | --job_name=ps --task_index=0 44 | # On ps1: 45 | $ python trainer.py \ 46 | --ps_hosts=:2222,:2222 \ 47 | --worker_hosts=:2222,:2222 \ 48 | --job_name=ps --task_index=1 49 | # On worker0: 50 | $ python trainer.py \ 51 | --ps_hosts=:2222,:2222 \ 52 | --worker_hosts=:2222,:2222 \ 53 | --job_name=worker --task_index=0 54 | # On worker1: 55 | $ python trainer.py \ 56 | --ps_hosts=:2222,:2222 \ 57 | --worker_hosts=:2222,:2222 \ 58 | --job_name=worker --task_index=1 59 | ``` 60 | 61 | At this point your training would finally start. 62 | However, if for some reason an IP changes (a VM restarts for example), you would need to go back on every VM in your cluster, and restart the training with an updated `ClusterSpec` (If the IT department of your company is feeling extra-generous they might assign a DNS name to every VM which would already make your life much easier). 63 | If you see that your training is not doing well and you need to update the code, you have to redeploy it on every VM and restart the training everywhere. 64 | If for some reason you want to retrain after a while, you would most likely need to go back to step 1: ask for the VMs to be allocated, redeploy, update the `clusterSpec`. 65 | 66 | All this hurdles means that in practice very few people actually bother with distributed training as the time gained during training might not be worth the energy and time necessary to set it up correctly. 67 | 68 | ## Distributed TensorFlow with Kubernetes and `TFJob` 69 | 70 | Thankfully, with Kubernetes and `TFJob` things are much, much simpler, making distributed training something you might actually be able to benefit from. Before submitting a training job, you should have deployed Kubeflow to your cluster. Doing so ensures that the `TFJob` custom resource is available when you submit the training job. 71 | 72 | As a prerequisite, you should already have a Kubernetes cluster running, you can follow [module 2 - Kubernetes](../2-kubernetes) to create your own cluster and you should already have Kubeflow and TFJob running in your Kubernetes cluster, you can follow [module 4 - Kubeflow](../4-kubeflow) and [module 6 - TFJob](../6-tfjob). 73 | 74 | #### A Small Disclaimer 75 | The issues we saw in the first part of this module can be categorized in two groups: 76 | * Issues with getting access to enough resources for the trainings (VMs, GPU etc) 77 | * Issues with setting up the training itself 78 | 79 | The first group of issue is still very dependent on the processes in your company/group. If you need to go through a formal request to get access to extra VMs/GPU, it will still be a hassle and there is nothing Kubernetes can do about that. 80 | 81 | However, Kubernetes makes this process much easier: 82 | On AKS, you can spin up new VMs with a single command: [`az aks scale`](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az-aks-scale) 83 | 84 | Setting up the training, however, is drastically simplified with Kubernetes, KubeFlow and `TFJob`. 85 | 86 | ### Overview of `TFJob` distributed training 87 | 88 | So, how does `TFJob` work for distributed training? 
89 | Let's look again at what the `TFJobSpec`and `TFReplicaSpec` objects looks like: 90 | 91 | **`TFJobSpec` Object** 92 | 93 | | Field | Type| Description | 94 | |-------|-----|-------------| 95 | | ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below | 96 | 97 | 98 | **`TFReplicaSpec` Object** 99 | 100 | Note the last parameter `IsDefaultPS` that we didn't talk about before. 101 | 102 | | Field | Type| Description | 103 | |-------|-----|-------------| 104 | | TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER` which happens to be the default value. | 105 | | Replicas | `int` | Number of replicas of `TfReplicaType`. Again this is useful only for distributed TensorFLow. Default value is `1`. | 106 | | Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. | 107 | | **IsDefaultPS** | `boolean` | Whether the parameter server should be using a default image or a custom one (default to `true`) | 108 | 109 | In case the distinction between master and workers is not clear, there is a single master per TensorFlow cluster, and it is in fact a worker. The difference is that the master is the worker that is going to handle the creation of the `tf.Session`, write logs and save the model. 110 | 111 | As you can see, `TFJobSpec` and `TFReplicaSpec` allow us to easily define the architecture of the TensorFlow cluster we would like to setup. 112 | 113 | Once we have defined this architecture in a `TFJob` template and deployed it with `kubectl create`, the operator will do most of the work for us. 114 | For each master, worker and parameter server in our TensorFlow cluster, the operator will create a service exposing it so they can communicate. 115 | It will then create an internal representation of the cluster with each node and it's associated internal DNS name. 116 | 117 | For example, if you were to create a `TFJob` with 1 `MASTER`, 2 `WORKERS` and 1 `PS`, this representation would look similar to this: 118 | ```json 119 | { 120 | "master":[ 121 | "distributed-mnist-master-5oz2-0:2222" 122 | ], 123 | "ps":[ 124 | "distributed-mnist-ps-5oz2-0:2222" 125 | ], 126 | "worker":[ 127 | "distributed-mnist-worker-5oz2-0:2222", 128 | "distributed-mnist-worker-5oz2-1:2222" 129 | ] 130 | } 131 | ``` 132 | 133 | Finally, the operator will create all the necessary pods, and in each one, inject an environment variable named `Tf_CONFIG`, containing the cluster specification above, as well as the respective job name and task id that each node of the TensorFlow cluster should assume. 134 | 135 | For example, here is the value of the `TF_CONFIG` environment variable that would be sent to worker 1: 136 | 137 | ```json 138 | { 139 | "cluster":{ 140 | "master":[ 141 | "distributed-mnist-master-5oz2-0:2222" 142 | ], 143 | "ps":[ 144 | "distributed-mnist-ps-5oz2-0:2222" 145 | ], 146 | "worker":[ 147 | "distributed-mnist-worker-5oz2-0:2222", 148 | "distributed-mnist-worker-5oz2-1:2222" 149 | ] 150 | }, 151 | "task":{ 152 | "type":"worker", 153 | "index":1 154 | }, 155 | "environment":"cloud" 156 | } 157 | ``` 158 | 159 | As you can see, this completely takes the responsibility of building and maintaining the `ClusterSpec` away from you. 
160 | All you have to do, is modify your code to read the `TF_CONFIG` and act accordingly. 161 | 162 | ### Modifying your model to use `TFJob`'s `TF_CONFIG` 163 | 164 | Concretely, let's see how you would modify your code: 165 | 166 | ```python 167 | # Grab the TF_CONFIG environment variable 168 | tf_config_json = os.environ.get("TF_CONFIG", "{}") 169 | 170 | # Deserialize to a python object 171 | tf_config = json.loads(tf_config_json) 172 | 173 | # Grab the cluster specification from tf_config and create a new tf.train.ClusterSpec instance with it 174 | cluster_spec = tf_config.get("cluster", {}) 175 | cluster_spec_object = tf.train.ClusterSpec(cluster_spec) 176 | 177 | # Grab the task assigned to this specific process from the config. job_name might be "worker" and task_id might be 1 for example 178 | task = tf_config.get("task", {}) 179 | job_name = task["type"] 180 | task_id = task["index"] 181 | 182 | # Configure the TensorFlow server 183 | server_def = tf.train.ServerDef( 184 | cluster=cluster_spec_object.as_cluster_def(), 185 | protocol="grpc", 186 | job_name=job_name, 187 | task_index=task_id) 188 | server = tf.train.Server(server_def) 189 | 190 | # checking if this process is the chief (also called master). The master has the responsibility of creating the session, saving the summaries etc. 191 | is_chief = (job_name == 'master') 192 | 193 | # Notice that we are not handling the case where job_name == 'ps'. That is because `TFJob` will take care of the parameter servers for us by default. 194 | ``` 195 | 196 | As for any distributed TensorFlow training, you will then also need to modify your model to split the operations and variables among the workers and parameter servers as well as create on session on the master. 197 | 198 | ## Exercises 199 | 200 | ### 1 - Modifying Our MNIST Example to Support Distributed Training 201 | 202 | #### 1. a. 203 | 204 | Starting from the MNIST sample we have been working with so far, modify it to work with distributed TensorFlow and `TFJob`. 205 | You will then need to build the image and push it (you should push it under a different name or tag to avoid overwriting what you did before). 206 | 207 | ``` 208 | cd 7-distributed-tensorflow/solution-src 209 | # build from tensorflow/tensorflow:gpu for master and workers 210 | docker build -t ${DOCKER_USERNAME}/tf-mnist:distributedgpu -f ./Dockerfile.gpu . 211 | 212 | # builld from tensorflow/tensorflow for the parameter server 213 | docker build -t ${DOCKER_USERNAME}/tf-mnist:distributed . 214 | ``` 215 | 216 | #### 1. b. 217 | 218 | Modify the yaml template from module [6 - TFJob](../6-tfjob), to instead deploy 1 master, 2 workers and 1 PS. Then create a yaml to deploy TensorBoard to monitor the training with TensorBoard. 219 | Note that since our model is very simple, TensorFlow will likely use only 1 of the workers, but it will still work fine. Don't forget to update the image or tag. 220 | 221 | #### Validation 222 | 223 | ```console 224 | kubectl get pods 225 | ``` 226 | 227 | Should yield: 228 | 229 | ``` 230 | NAME READY STATUS RESTARTS AGE 231 | module6-ex1-master-m8vi-0-rdr5o 1/1 Running 0 23s 232 | module6-ex1-ps-m8vi-0-0vhjm 1/1 Running 0 23s 233 | module6-ex1-worker-m8vi-0-eyb6l 1/1 Running 0 23s 234 | module6-ex1-worker-m8vi-1-bm2ue 1/1 Running 0 23s 235 | ``` 236 | 237 | looking at the logs of the master with: 238 | 239 | ```console 240 | kubectl logs 241 | ``` 242 | 243 | Should yield: 244 | 245 | ``` 246 | [...] 
247 | Initialize GrpcChannelCache for job master -> {0 -> localhost:2222} 248 | Initialize GrpcChannelCache for job ps -> {0 -> module6-ex1-ps-m8vi-0:2222} 249 | Initialize GrpcChannelCache for job worker -> {0 -> module6-ex1-worker-m8vi-0:2222, 1 -> module6-ex1-worker-m8vi-1:2222} 250 | 2018-04-30 22:45:28.963803: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:333] Started server with target: grpc://localhost:2222 251 | ... 252 | 253 | Accuracy at step 970: 0.9784 254 | Accuracy at step 980: 0.9791 255 | Accuracy at step 990: 0.9796 256 | Adding run metadata for 999 257 | ``` 258 | 259 | This indicates that the `ClusterSpec` was correctly extracted from the environment variable and given to TensorFlow. 260 | 261 | Once the TensorBoard pod is provisioned and running, we can connect to it using: 262 | 263 | ```console 264 | PODNAME=$(kubectl get pod -l app=tensorboard -o jsonpath='{.items[0].metadata.name}') 265 | kubectl port-forward ${PODNAME} 6006:6006 266 | ``` 267 | 268 | From the browser, connect to it at http://127.0.0.1:6006, you should see that your model is indeed correctly distributed between workers and PS: 269 | 270 | ![TensorBoard](./tensorboard.png) 271 | 272 | After a few minutes, the status of both worker nodes should show as `Completed` when doing `kubectl get pods -a`. 273 | 274 | #### Solution 275 | 276 | A working code sample is available in [`solution-src/main.py`](./solution-src/main.py). 277 | 278 |
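If you want to double-check what the operator injected into each replica before diving into the templates, you can print the environment variable straight from one of the running pods. A quick sketch, assuming the master pod name from the validation output above (yours will differ):

```console
kubectl exec module6-ex1-master-m8vi-0-rdr5o -- printenv TF_CONFIG
```

The output should be the JSON structure described earlier, with `task.type` set to `master` and `task.index` set to `0`.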
279 | TFJob's Template 280 | 281 | ```yaml 282 | apiVersion: kubeflow.org/v1alpha2 283 | kind: TFJob 284 | metadata: 285 | name: module7-ex1-gpu 286 | spec: 287 | tfReplicaSpecs: 288 | MASTER: 289 | replicas: 1 290 | template: 291 | spec: 292 | volumes: 293 | - name: azurefile 294 | persistentVolumeClaim: 295 | claimName: azurefile 296 | containers: 297 | - image: /tf-mnist:distributedgpu # You can replace this by your own image 298 | name: tensorflow 299 | imagePullPolicy: Always 300 | resources: 301 | limits: 302 | nvidia.com/gpu: 1 303 | volumeMounts: 304 | - mountPath: /tmp/tensorflow 305 | subPath: module7-ex1-gpu 306 | name: azurefile 307 | restartPolicy: OnFailure 308 | WORKER: 309 | replicas: 2 310 | template: 311 | spec: 312 | containers: 313 | - image: /tf-mnist:distributedgpu # You can replace this by your own image 314 | name: tensorflow 315 | imagePullPolicy: Always 316 | resources: 317 | limits: 318 | nvidia.com/gpu: 1 319 | volumeMounts: 320 | restartPolicy: OnFailure 321 | PS: 322 | replicas: 1 323 | template: 324 | spec: 325 | containers: 326 | - image: /tf-mnist:distributed # You can replace this by your own image 327 | name: tensorflow 328 | imagePullPolicy: Always 329 | ports: 330 | - containerPort: 6006 331 | restartPolicy: OnFailure 332 | ``` 333 | 334 | There are few things to notice here: 335 | * Since only the master will be saving the model and the summaries, we only need to mount the Azure File share on the master's `tfReplicaSpecs`, not on the `WORKER`s or `PS`. 336 | * We are not specifying anything for the `PS` `tfReplicaSpecs` except the number of replicas. This is because `IsDefaultPS` is set to `true` by default. This means that the parameter server(s) will be started with a pre-built docker image that is already configured to read the `TF_CONFIG` and act as a TensorFlow server, so we don't need to do anything here. 337 | * When you have limited GPU resources, you can specify Master and Worker nodes to request GPU resources and PS node will only request CPU resources. 338 | 339 |
340 | 341 |
342 | TensorBoard Template 343 | 344 | ```yaml 345 | apiVersion: extensions/v1beta1 346 | kind: Deployment 347 | metadata: 348 | labels: 349 | app: tensorboard 350 | name: tensorboard 351 | spec: 352 | replicas: 1 353 | selector: 354 | matchLabels: 355 | app: tensorboard 356 | template: 357 | metadata: 358 | labels: 359 | app: tensorboard 360 | spec: 361 | volumes: 362 | - name: azurefile 363 | persistentVolumeClaim: 364 | claimName: azurefile 365 | containers: 366 | - name: tensorboard 367 | image: tensorflow/tensorflow:1.10.0 368 | imagePullPolicy: Always 369 | command: 370 | - /usr/local/bin/tensorboard 371 | args: 372 | - --logdir 373 | - /tmp/tensorflow/logs 374 | volumeMounts: 375 | - mountPath: /tmp/tensorflow 376 | subPath: module7-ex1-gpu 377 | name: azurefile 378 | ports: 379 | - containerPort: 6006 380 | protocol: TCP 381 | dnsPolicy: ClusterFirst 382 | restartPolicy: Always 383 | 384 | ``` 385 | There are two things to notice here: 386 | * To view logs and to get saved models from previous trainings, we need to mount `/tmp/tensorflow` from the Azure File share to TensorBoard as that is the mounting point for all the persisted data from TFJob Master. 387 | * To access TensorBoard pod locally, we need to port-forward traffic against the port specified by `containerPort: 6006`. 388 | 389 |
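As an alternative to looking up the pod name first, recent versions of kubectl can port-forward to the deployment directly; this is only a convenience and behaves like the pod-level command shown in the validation section:

```console
kubectl port-forward deployment/tensorboard 6006:6006
```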
390 | 391 | 392 | ## Next Step 393 | 394 | [8 - Hyperparameters Sweep with Helm](../8-hyperparam-sweep) 395 | -------------------------------------------------------------------------------- /7-distributed-tensorflow/solution-src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.10.0 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] 5 | -------------------------------------------------------------------------------- /7-distributed-tensorflow/solution-src/Dockerfile.gpu: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.10.0-gpu 2 | COPY main.py /app/main.py 3 | 4 | ENTRYPOINT ["python", "/app/main.py"] 5 | -------------------------------------------------------------------------------- /7-distributed-tensorflow/solution-src/main.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the 'License'); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an 'AS IS' BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """A simple MNIST classifier which displays summaries in TensorBoard. 16 | 17 | This is an unimpressive MNIST model, but it is a good example of using 18 | tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of 19 | naming summary tags so that they are grouped meaningfully in TensorBoard. 20 | 21 | It demonstrates the functionality of every TensorBoard dashboard. 22 | """ 23 | from __future__ import absolute_import 24 | from __future__ import division 25 | from __future__ import print_function 26 | 27 | import argparse 28 | import os 29 | import sys 30 | import ast 31 | import json 32 | 33 | import tensorflow as tf 34 | 35 | from tensorflow.examples.tutorials.mnist import input_data 36 | 37 | FLAGS = None 38 | 39 | def train(): 40 | tf_config_json = os.environ.get("TF_CONFIG", "{}") 41 | tf_config = json.loads(tf_config_json) 42 | 43 | task = tf_config.get("task", {}) 44 | cluster_spec = tf_config.get("cluster", {}) 45 | cluster_spec_object = tf.train.ClusterSpec(cluster_spec) 46 | job_name = task["type"] 47 | task_id = task["index"] 48 | server_def = tf.train.ServerDef( 49 | cluster=cluster_spec_object.as_cluster_def(), 50 | protocol="grpc", 51 | job_name=job_name, 52 | task_index=task_id) 53 | server = tf.train.Server(server_def) 54 | 55 | is_chief = (job_name == 'master') 56 | 57 | # Import data 58 | mnist = input_data.read_data_sets(FLAGS.data_dir, 59 | one_hot=True, 60 | fake_data=FLAGS.fake_data) 61 | 62 | 63 | # Create a multilayer model. 
64 | 65 | 66 | # Between-graph replication 67 | with tf.device(tf.train.replica_device_setter( 68 | worker_device="/job:worker/task:%d" % task_id, 69 | cluster=cluster_spec)): 70 | 71 | # count the number of updates 72 | global_step = tf.get_variable( 73 | 'global_step', 74 | [], 75 | initializer = tf.constant_initializer(0), 76 | trainable = False) 77 | 78 | # Input placeholders 79 | with tf.name_scope('input'): 80 | x = tf.placeholder(tf.float32, [None, 784], name='x-input') 81 | y_ = tf.placeholder(tf.float32, [None, 10], name='y-input') 82 | 83 | with tf.name_scope('input_reshape'): 84 | image_shaped_input = tf.reshape(x, [-1, 28, 28, 1]) 85 | tf.summary.image('input', image_shaped_input, 10) 86 | 87 | # We can't initialize these variables to 0 - the network will get stuck. 88 | def weight_variable(shape): 89 | """Create a weight variable with appropriate initialization.""" 90 | initial = tf.truncated_normal(shape, stddev=0.1) 91 | return tf.Variable(initial) 92 | 93 | def bias_variable(shape): 94 | """Create a bias variable with appropriate initialization.""" 95 | initial = tf.constant(0.1, shape=shape) 96 | return tf.Variable(initial) 97 | 98 | def variable_summaries(var): 99 | """Attach a lot of summaries to a Tensor (for TensorBoard visualization).""" 100 | with tf.name_scope('summaries'): 101 | mean = tf.reduce_mean(var) 102 | tf.summary.scalar('mean', mean) 103 | with tf.name_scope('stddev'): 104 | stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean))) 105 | tf.summary.scalar('stddev', stddev) 106 | tf.summary.scalar('max', tf.reduce_max(var)) 107 | tf.summary.scalar('min', tf.reduce_min(var)) 108 | tf.summary.histogram('histogram', var) 109 | 110 | def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu): 111 | """Reusable code for making a simple neural net layer. 112 | 113 | It does a matrix multiply, bias add, and then uses ReLU to nonlinearize. 114 | It also sets up name scoping so that the resultant graph is easy to read, 115 | and adds a number of summary ops. 116 | """ 117 | # Adding a name scope ensures logical grouping of the layers in the graph. 118 | with tf.name_scope(layer_name): 119 | # This Variable will hold the state of the weights for the layer 120 | with tf.name_scope('weights'): 121 | weights = weight_variable([input_dim, output_dim]) 122 | variable_summaries(weights) 123 | with tf.name_scope('biases'): 124 | biases = bias_variable([output_dim]) 125 | variable_summaries(biases) 126 | with tf.name_scope('Wx_plus_b'): 127 | preactivate = tf.matmul(input_tensor, weights) + biases 128 | tf.summary.histogram('pre_activations', preactivate) 129 | activations = act(preactivate, name='activation') 130 | tf.summary.histogram('activations', activations) 131 | return activations 132 | 133 | hidden1 = nn_layer(x, 784, 500, 'layer1') 134 | 135 | with tf.name_scope('dropout'): 136 | keep_prob = tf.placeholder(tf.float32) 137 | tf.summary.scalar('dropout_keep_probability', keep_prob) 138 | dropped = tf.nn.dropout(hidden1, keep_prob) 139 | 140 | # Do not apply softmax activation yet, see below. 141 | y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity) 142 | 143 | with tf.name_scope('cross_entropy'): 144 | # The raw formulation of cross-entropy, 145 | # 146 | # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)), 147 | # reduction_indices=[1])) 148 | # 149 | # can be numerically unstable. 
150 | # 151 | # So here we use tf.nn.softmax_cross_entropy_with_logits on the 152 | # raw outputs of the nn_layer above, and then average across 153 | # the batch. 154 | diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) 155 | with tf.name_scope('total'): 156 | cross_entropy = tf.reduce_mean(diff) 157 | tf.summary.scalar('cross_entropy', cross_entropy) 158 | 159 | with tf.name_scope('train'): 160 | train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize( 161 | cross_entropy) 162 | 163 | with tf.name_scope('accuracy'): 164 | with tf.name_scope('correct_prediction'): 165 | correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 166 | with tf.name_scope('accuracy'): 167 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 168 | tf.summary.scalar('accuracy', accuracy) 169 | 170 | # Merge all the summaries and write them out to 171 | # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default) 172 | merged = tf.summary.merge_all() 173 | 174 | init_op = tf.global_variables_initializer() 175 | 176 | def feed_dict(train): 177 | """Make a TensorFlow feed_dict: maps data onto Tensor placeholders.""" 178 | if train or FLAGS.fake_data: 179 | xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data) 180 | k = FLAGS.dropout 181 | else: 182 | xs, ys = mnist.test.images, mnist.test.labels 183 | k = 1.0 184 | return {x: xs, y_: ys, keep_prob: k} 185 | 186 | 187 | 188 | sv = tf.train.Supervisor(is_chief=is_chief, 189 | global_step=global_step, 190 | init_op=init_op, 191 | logdir=FLAGS.logdir) 192 | 193 | with sv.prepare_or_wait_for_session(server.target) as sess: 194 | train_writer = tf.summary.FileWriter(FLAGS.logdir + '/train', sess.graph) 195 | test_writer = tf.summary.FileWriter(FLAGS.logdir + '/test') 196 | # Train the model, and also write summaries. 
197 | # Every 10th step, measure test-set accuracy, and write test summaries 198 | # All other steps, run train_step on training data, & add training summaries 199 | 200 | for i in range(FLAGS.max_steps): 201 | if i % 10 == 0: # Record summaries and test-set accuracy 202 | summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False)) 203 | test_writer.add_summary(summary, i) 204 | print('Accuracy at step %s: %s' % (i, acc)) 205 | else: # Record train set summaries, and train 206 | if i % 100 == 99: # Record execution stats 207 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 208 | run_metadata = tf.RunMetadata() 209 | summary, _ = sess.run([merged, train_step], 210 | feed_dict=feed_dict(True), 211 | options=run_options, 212 | run_metadata=run_metadata) 213 | train_writer.add_run_metadata(run_metadata, 'step%03d' % i) 214 | train_writer.add_summary(summary, i) 215 | print('Adding run metadata for', i) 216 | else: # Record a summary 217 | summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True)) 218 | train_writer.add_summary(summary, i) 219 | train_writer.close() 220 | test_writer.close() 221 | 222 | 223 | def main(_): 224 | train() 225 | 226 | 227 | if __name__ == '__main__': 228 | parser = argparse.ArgumentParser() 229 | parser.add_argument('--fake_data', nargs='?', const=True, type=bool, 230 | default=False, 231 | help='If true, uses fake data for unit testing.') 232 | parser.add_argument('--max_steps', type=int, default=1000, 233 | help='Number of steps to run trainer.') 234 | parser.add_argument('--learning_rate', type=float, default=0.001, 235 | help='Initial learning rate') 236 | parser.add_argument('--dropout', type=float, default=0.9, 237 | help='Keep probability for training dropout.') 238 | parser.add_argument( 239 | '--data_dir', 240 | type=str, 241 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 242 | 'tensorflow/input_data'), 243 | help='Directory for storing input data') 244 | parser.add_argument( 245 | '--logdir', 246 | type=str, 247 | default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'), 248 | 'tensorflow/logs'), 249 | help='Summaries log directory') 250 | FLAGS, unparsed = parser.parse_known_args() 251 | tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) -------------------------------------------------------------------------------- /7-distributed-tensorflow/tensorboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/7-distributed-tensorflow/tensorboard.png -------------------------------------------------------------------------------- /8-hyperparam-sweep/README.md: -------------------------------------------------------------------------------- 1 | # Automated Hyperparameters Sweep with `TFJob` and Helm 2 | 3 | ## Prerequisites 4 | 5 | * [3 - Helm](../3-helm) 6 | * [6 - TFJob](../6-tfjob) 7 | 8 | ### "Vanilla" Hyperparameter Sweep 9 | 10 | Just as distributed training, automated hyperparameter sweep is barely used in many organizations. 11 | The reasons are similar: It takes a lot of resources, or time, to run more than a couple training for the same model. 12 | * Either you run different hypothesis in parallel, which will likely requires a lot of resources and VMs. These VMs need to be managed by someone, the model need to be deployed, logs and checkpoints have to be gathered etc. 
13 | * Or you run everything sequentially on a small number of VMs, which takes a lot of time before you can compare results
14 | 
15 | So in practice most people manually fine-tune their hyperparameters through a few runs and pick a winner.
16 | 
17 | ### Kubernetes + Helm
18 | 
19 | Kubernetes coupled with Helm can make this easier, as we will see.
20 | Because Kubernetes on Azure also allows you to scale very easily (manually or automatically), it lets you explore a very large hyperparameter space while maximizing the usage of your cluster (and thus optimizing cost).
21 | 
22 | In practice, this process is still rudimentary today as the technologies involved are all pretty young. Tools better suited to hyperparameter sweeping in distributed systems will most likely become available soon, but in the meantime Kubernetes and Helm already allow us to deploy a large number of trainings fairly easily.
23 | 
24 | ### Why Helm?
25 | 
26 | As we saw in module [3 - Helm](../3-helm), Helm enables us to package an application in a chart and parameterize its deployment easily.
27 | To do that, Helm lets us use the Go templating engine in the chart definitions. This means we can use conditions, loops, variables and [much more](https://docs.helm.sh/chart_template_guide).
28 | This allows us to create complex deployment flows.
29 | 
30 | In the case of hyperparameter sweeping, we want a chart able to deploy a number of `TFJobs`, each trying different values for some hyperparameters.
31 | We will also want to deploy a single TensorBoard instance monitoring all these `TFJobs`; that way we can quickly compare all our hypotheses, and even stop early the jobs that clearly don't perform well if we want to reduce cost as much as possible.
32 | For now, this chart will simply do a grid search; while it is less efficient than random search, it is a good place to start.
33 | 
34 | ## Exercise
35 | 
36 | ### Creating and Deploying the Chart
37 | In this exercise, you will create a new Helm chart that will deploy a number of `TFJobs` as well as a TensorBoard instance.
38 | 
39 | Here is what our `values.yaml` file could look like, for example (you are free to go a different route):
40 | 
41 | ```yaml
42 | image: ritazh/tf-paint:gpu
43 | useGPU: true
44 | hyperParamValues:
45 |   learningRate:
46 |     - 0.001
47 |     - 0.01
48 |     - 0.1
49 |   hiddenLayers:
50 |     - 5
51 |     - 6
52 |     - 7
53 | ```
54 | 
55 | That way, when installing the chart, 9 `TFJob`s will actually get deployed, testing all the combinations of learning rate and hidden-layer depth that we specified.
56 | This is a very simple example (our model is also very simple), but hopefully you start to see the possibilities that Helm offers.
57 | 
58 | In this exercise, we are going to use a new model based on [Andrej Karpathy's Image painting demo](http://cs.stanford.edu/people/karpathy/convnetjs/demo/image_regression.html).
59 | This model's objective is to create a new picture as close as possible to the original one, "The Starry Night" by Vincent van Gogh:
60 | 
61 | ![Starry](./src/starry.jpg)
62 | 
63 | The source code is located in [src/](./src/).
64 | 
65 | Our model takes 3 parameters:
66 | 
67 | | argument | description | default value |
68 | |------|-------------|---------------|
69 | |`--learning-rate` | Learning rate value | `0.001` |
70 | |`--hidden-layers` | Number of hidden layers in our network.
| `4` | 71 | |`--log-dir` | Path to save TensorFlow's summaries | `None`| 72 | 73 | For simplicity, docker images have already been created so you don't have to build and push yourself: 74 | * `ritazh/tf-paint:cpu` for CPU only. 75 | * `ritazh/tf-paint:gpu` for GPU. 76 | 77 | The goal of this exercise is to create an Helm chart that will allow us to test as many variations and combinations of the two hyperparameters `--learning-rate`and `--hidden-layers` as we want by just adding them in our `values.yaml` file. 78 | This chart should also deploy a single TensorBoard instance (and it's associated service with a public IP), so we can quickly monitor and compare our different hypothesis. 79 | 80 | If you are pretty new to Kubernetes and Helm and don't feel like building your own helm chart just yet, you can skip to the solution where details and explanations are provided. 81 | 82 | #### Validation 83 | 84 | Once you have created and deployed your chart, looking at the pods that were created, you should see a bunch of them, as well as a single TensorBoard instance monitoring all of them: 85 | 86 | ```console 87 | kubectl get pods 88 | ``` 89 | 90 | ``` 91 | NAME READY STATUS RESTARTS AGE 92 | module8-tensorboard-7ccb598cdd-6vg7h 1/1 Running 0 16s 93 | module8-tf-paint-0-0-master-juc5-0-hw5cm 0/1 Pending 0 4s 94 | module8-tf-paint-0-1-master-pu49-0-jp06r 1/1 Running 0 14s 95 | module8-tf-paint-0-2-master-awhs-0-gfra0 0/1 Pending 0 6s 96 | module8-tf-paint-1-0-master-5tfm-0-dhhhv 1/1 Running 0 16s 97 | module8-tf-paint-1-1-master-be91-0-zw4gk 1/1 Running 0 16s 98 | module8-tf-paint-1-2-master-r2nd-0-zhws1 0/1 Pending 0 7s 99 | module8-tf-paint-2-0-master-7w37-0-ff0w9 0/1 Pending 0 13s 100 | module8-tf-paint-2-1-master-260j-0-l4o7r 0/1 Pending 0 10s 101 | module8-tf-paint-2-2-master-jtjb-0-5l84q 0/1 Pending 0 9s 102 | ``` 103 | Note: Some pods are pending due to the GPU resource available in the cluster. If you have 3 GPUs in the cluster, then there can only be a maximum number of 3 parallel trainings at a given time. 104 | 105 | Once the TensorBoard service for this module is created, you can use the External-IP of that service to connect to the TensorBoard. 106 | 107 | ```console 108 | kubectl get service 109 | 110 | NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 111 | module8-tensorboard LoadBalancer 10.0.142.217 80:30896/TCP 5m 112 | 113 | ``` 114 | 115 | Looking at TensorBoard, you should see something similar to this: 116 | ![TensorBoard](tensorboard.png) 117 | 118 | > Note that TensorBoard can take a while before correctly displaying images. 119 | 120 | Here we can see that some models are doing much better than others. Models with a learning rate of `0.1` for example are producing an all-black image, we are probably over-shooting. 121 | After a few minutes, we can see that the two best performing models are: 122 | * 5 hidden layers and learning rate of `0.01` 123 | * 7 hidden layers and learning rate of `0.001` 124 | 125 | At this point we could decide to kill all the other models if we wanted to free some capacity in our cluster, or launch additional new experiments based on our initial findings. 126 | 127 | #### Solution 128 | Check out the commented solution chart: [./solution-chart/templates/deployment.yaml](./solution-chart/templates/deployment.yaml) 129 |
130 | Hyperparameter Sweep Helm Chart 131 | 132 | Install the chart with command: 133 | 134 | ```console 135 | cd 8-hyperparam-sweep/solution-chart/ 136 | helm install . 137 | 138 | NAME: telling-buffalo 139 | LAST DEPLOYED: 140 | NAMESPACE: tfworkflow 141 | STATUS: DEPLOYED 142 | 143 | RESOURCES: 144 | ==> v1/Service 145 | NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 146 | module8-tensorboard LoadBalancer 10.0.142.217 80:30896/TCP 1s 147 | 148 | ==> v1beta1/Deployment 149 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 150 | module8-tensorboard 1 1 1 0 1s 151 | 152 | ==> v1alpha1/TFJob 153 | NAME AGE 154 | module8-tf-paint-0-0 1s 155 | module8-tf-paint-1-0 1s 156 | module8-tf-paint-1-1 1s 157 | module8-tf-paint-2-1 1s 158 | module8-tf-paint-2-2 1s 159 | module8-tf-paint-0-1 1s 160 | module8-tf-paint-0-2 1s 161 | module8-tf-paint-1-2 1s 162 | module8-tf-paint-2-0 0s 163 | 164 | ==> v1/Pod(related) 165 | NAME READY STATUS RESTARTS AGE 166 | module8-tensorboard-7ccb598cdd-6vg7h 0/1 ContainerCreating 0 1s 167 | 168 | ``` 169 |
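Once TensorBoard has shown you which runs are worth keeping, you can free cluster capacity by deleting the under-performing `TFJob`s, or remove the whole experiment when you are done. A sketch, assuming the resource and release names shown above (yours will differ):

```console
# Delete a single under-performing run
kubectl delete tfjob module8-tf-paint-0-2

# Tear down the whole experiment, including TensorBoard (release name from the install output)
helm delete telling-buffalo --purge
```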
170 | 171 | ## Next Step 172 | 173 | [9 - Serving](../9-serving) 174 | -------------------------------------------------------------------------------- /8-hyperparam-sweep/solution-chart/Chart.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | description: A Helm chart for Kubernetes 3 | name: module8-hyperparam-sweep 4 | version: 0.1.0 5 | -------------------------------------------------------------------------------- /8-hyperparam-sweep/solution-chart/templates/_helpers.tpl: -------------------------------------------------------------------------------- 1 | {{/* vim: set filetype=mustache: */}} 2 | {{/* 3 | Expand the name of the chart. 4 | */}} 5 | {{- define "name" -}} 6 | {{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}} 7 | {{- end -}} 8 | 9 | {{/* 10 | Create a default fully qualified app name. 11 | We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec). 12 | */}} 13 | {{- define "fullname" -}} 14 | {{- $name := default .Chart.Name .Values.nameOverride -}} 15 | {{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" -}} 16 | {{- end -}} 17 | -------------------------------------------------------------------------------- /8-hyperparam-sweep/solution-chart/templates/deployment.yaml: -------------------------------------------------------------------------------- 1 | 2 | # First we copy the values of values.yaml in variable to make it easier to access them 3 | {{- $lrlist := .Values.hyperParamValues.learningRate -}} 4 | {{- $nblayerslist := .Values.hyperParamValues.hiddenLayers -}} 5 | {{- $image := .Values.image -}} 6 | {{- $useGPU := .Values.useGPU -}} 7 | {{- $chartname := .Chart.Name -}} 8 | {{- $chartversion := .Chart.Version -}} 9 | 10 | # Then we loop over every value of $lrlist (learning rate) and $nblayerslist (hidden layer depth) 11 | # This will result in create 1 TFJob for every pair of learning rate and hidden layer depth 12 | {{- range $i, $lr := $lrlist }} 13 | {{- range $j, $nblayers := $nblayerslist }} 14 | apiVersion: kubeflow.org/v1beta1 15 | kind: TFJob # Each one of our trainings will be a separate TFJob 16 | metadata: 17 | name: module8-tf-paint-{{ $i }}-{{ $j }} # We give a unique name to each training 18 | labels: 19 | chart: "{{ $chartname }}-{{ $chartversion | replace "+" "_" }}" 20 | spec: 21 | tfReplicaSpecs: 22 | MASTER: 23 | template: 24 | spec: 25 | restartPolicy: OnFailure 26 | containers: 27 | - name: tensorflow 28 | image: {{ $image }} 29 | env: 30 | - name: LC_ALL 31 | value: C.UTF-8 32 | args: 33 | # Here we pass a unique learning rate and hidden layer count to each instance. 
34 | # We also put the values between quotes to avoid potential formatting issues 35 | - --learning-rate 36 | - {{ $lr | quote }} 37 | - --hidden-layers 38 | - {{ $nblayers | quote }} 39 | - --logdir 40 | - /tmp/tensorflow/tf-paint-lr{{ $lr }}-d-{{ $nblayers }} # We save the summaries in a different directory 41 | {{ if $useGPU }} # We only want to request GPUs if we asked for it in values.yaml with useGPU 42 | resources: 43 | limits: 44 | nvidia.com/gpu: 1 45 | {{ end }} 46 | volumeMounts: 47 | - mountPath: /tmp/tensorflow 48 | subPath: module8-tf-paint # As usual we want to save everything in a separate subdirectory 49 | name: azurefile 50 | volumes: 51 | - name: azurefile 52 | persistentVolumeClaim: 53 | claimName: azurefile 54 | --- 55 | {{- end }} 56 | {{- end }} 57 | # We only want one instance running for all our jobs, and not 1 per job. 58 | apiVersion: v1 59 | kind: Service 60 | metadata: 61 | labels: 62 | app: tensorboard 63 | name: module8-tensorboard 64 | spec: 65 | ports: 66 | - port: 80 67 | targetPort: 6006 68 | selector: 69 | app: tensorboard 70 | type: LoadBalancer 71 | --- 72 | apiVersion: extensions/v1beta1 73 | kind: Deployment 74 | metadata: 75 | labels: 76 | app: tensorboard 77 | name: module8-tensorboard 78 | spec: 79 | template: 80 | metadata: 81 | labels: 82 | app: tensorboard 83 | spec: 84 | volumes: 85 | - name: azurefile 86 | persistentVolumeClaim: 87 | claimName: azurefile 88 | containers: 89 | - name: tensorboard 90 | command: 91 | - /usr/local/bin/tensorboard 92 | - --logdir=/tmp/tensorflow 93 | - --host=0.0.0.0 94 | image: tensorflow/tensorflow 95 | ports: 96 | - containerPort: 6006 97 | volumeMounts: 98 | - mountPath: /tmp/tensorflow 99 | subPath: module8-tf-paint 100 | name: azurefile -------------------------------------------------------------------------------- /8-hyperparam-sweep/solution-chart/values.yaml: -------------------------------------------------------------------------------- 1 | image: ritazh/tf-paint:gpu 2 | useGPU: true 3 | hyperParamValues: 4 | learningRate: 5 | - 0.001 6 | - 0.01 7 | - 0.1 8 | hiddenLayers: 9 | - 5 10 | - 6 11 | - 7 12 | -------------------------------------------------------------------------------- /8-hyperparam-sweep/src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.7.0 2 | COPY requirements.txt /app/requirements.txt 3 | WORKDIR /app 4 | RUN mkdir ./output 5 | RUN mkdir ./logs 6 | RUN mkdir ./checkpoints 7 | RUN pip install -r requirements.txt 8 | COPY ./* /app/ 9 | 10 | ENTRYPOINT [ "python", "/app/main.py" ] -------------------------------------------------------------------------------- /8-hyperparam-sweep/src/Dockerfile.gpu: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.7.0-gpu 2 | COPY requirements.txt /app/requirements.txt 3 | WORKDIR /app 4 | RUN mkdir ./output 5 | RUN mkdir ./logs 6 | RUN mkdir ./checkpoints 7 | RUN pip install -r requirements.txt 8 | COPY ./* /app/ 9 | 10 | ENTRYPOINT [ "python", "/app/main.py" ] -------------------------------------------------------------------------------- /8-hyperparam-sweep/src/main.py: -------------------------------------------------------------------------------- 1 | import click 2 | import tensorflow as tf 3 | import numpy as np 4 | from skimage.data import astronaut 5 | from scipy.misc import imresize, imsave, imread 6 | 7 | img = imread('./starry.jpg') 8 | img = imresize(img, (100, 100)) 9 | save_dir = 'output' 10 | 
epochs = 2000 11 | 12 | 13 | def linear_layer(X, layer_size, layer_name): 14 | with tf.variable_scope(layer_name): 15 | W = tf.Variable(tf.random_uniform([X.get_shape().as_list()[1], layer_size], dtype=tf.float32), name='W') 16 | b = tf.Variable(tf.zeros([layer_size]), name='b') 17 | return tf.nn.relu(tf.matmul(X, W) + b) 18 | 19 | @click.command() 20 | @click.option("--learning-rate", default=0.01) 21 | @click.option("--hidden-layers", default=7) 22 | @click.option("--logdir") 23 | def main(learning_rate, hidden_layers, logdir='./logs/1'): 24 | X = tf.placeholder(dtype=tf.float32, shape=(None, 2), name='X') 25 | y = tf.placeholder(dtype=tf.float32, shape=(None, 3), name='y') 26 | current_input = X 27 | for layer_id in range(hidden_layers): 28 | h = linear_layer(current_input, 20, 'layer{}'.format(layer_id)) 29 | current_input = h 30 | 31 | y_pred = linear_layer(current_input, 3, 'output') 32 | 33 | #loss will be distance between predicted and true RGB 34 | loss = tf.reduce_mean(tf.reduce_sum(tf.squared_difference(y, y_pred), 1)) 35 | tf.summary.scalar('loss', loss) 36 | 37 | train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss) 38 | merged_summary_op = tf.summary.merge_all() 39 | 40 | res_img = tf.cast(tf.clip_by_value(tf.reshape(y_pred, (1,) + img.shape), 0, 255), tf.uint8) 41 | img_summary = tf.summary.image('out', res_img, max_outputs=1) 42 | 43 | xs, ys = get_data(img) 44 | 45 | with tf.Session() as sess: 46 | tf.global_variables_initializer().run() 47 | train_writer = tf.summary.FileWriter(logdir + '/train', sess.graph) 48 | test_writer = tf.summary.FileWriter(logdir + '/test') 49 | batch_size = 50 50 | for i in range(epochs): 51 | # Get a random sampling of the dataset 52 | idxs = np.random.permutation(range(len(xs))) 53 | # The number of batches we have to iterate over 54 | n_batches = len(idxs) // batch_size 55 | # Now iterate over our stochastic minibatches: 56 | for batch_i in range(n_batches): 57 | batch_idxs = idxs[batch_i * batch_size: (batch_i + 1) * batch_size] 58 | sess.run([train_op, loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]}) 59 | if batch_i % 100 == 0: 60 | c, summary = sess.run([loss, merged_summary_op], feed_dict={X: xs[batch_idxs], y: ys[batch_idxs]}) 61 | train_writer.add_summary(summary, (i * n_batches * batch_size) + batch_i) 62 | print("epoch {}, (l2) loss {}".format(i, c)) 63 | 64 | if i % 10 == 0: 65 | img_summary_res = sess.run(img_summary, feed_dict={X: xs, y: ys}) 66 | test_writer.add_summary(img_summary_res, i * n_batches * batch_size) 67 | 68 | def get_data(img): 69 | xs = [] 70 | ys = [] 71 | for row_i in range(img.shape[0]): 72 | for col_i in range(img.shape[1]): 73 | xs.append([row_i, col_i]) 74 | ys.append(img[row_i, col_i]) 75 | 76 | xs = (xs - np.mean(xs)) / np.std(xs) 77 | return xs, np.array(ys) 78 | 79 | if __name__ == "__main__": 80 | main() -------------------------------------------------------------------------------- /8-hyperparam-sweep/src/requirements.txt: -------------------------------------------------------------------------------- 1 | scikit-image 2 | click>=6.2 -------------------------------------------------------------------------------- /8-hyperparam-sweep/src/starry.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/8-hyperparam-sweep/src/starry.jpg -------------------------------------------------------------------------------- 
/8-hyperparam-sweep/tensorboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/8-hyperparam-sweep/tensorboard.png
--------------------------------------------------------------------------------
/9-serving/README.md:
--------------------------------------------------------------------------------
1 | # TensorFlow Serving
2 | 
3 | ## Prerequisites
4 | 
5 | * [1 - Docker Basics](../1-docker)
6 | * [2 - Kubernetes Basics and cluster created](../2-kubernetes)
7 | * [3 - Helm](../3-helm)
8 | * [4 - Kubeflow](../4-kubeflow)
9 | 
10 | ## Summary
11 | 
12 | In this section you will learn about:
13 | 
14 | * Setting up Minio file storage in our Kubernetes cluster
15 | * Serving trained models using TensorFlow Serving
16 | 
17 | ## Context
18 | 
19 | TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data.
20 | 
21 | ## Exercises
22 | 
23 | ### Exercise 1: Setting up file storage
24 | 
25 | First, we'll get started with a file storage backend.
26 | 
27 | If you already have a model uploaded to storage, you can skip this step.
28 | If not, you can [download the Minio client](https://minio.io/downloads.html#download-client-macos) for your operating system of choice to upload your trained and exported model.
29 | 
30 | As we saw in module [3 - Helm](../3-helm), Helm enables us to package an application in a chart and parameterize its deployment easily. We'll use Helm to create a Minio deployment in our cluster.
31 | 
32 | ```console
33 | ACCESS_KEY=
34 | ACCESS_SECRET_KEY=
35 | 
36 | helm install --name minio --set accessKey=$ACCESS_KEY,secretKey=$ACCESS_SECRET_KEY,service.type=LoadBalancer stable/minio
37 | ```
38 | 
39 | ```console
40 | SERVICE_IP=$(kubectl get svc minio --template="{{range .status.loadBalancer.ingress}}{{.ip}}{{end}}")
41 | S3_ENDPOINT=${SERVICE_IP}:9000
42 | ```
43 | 
44 | Setting up the Minio host:
45 | 
46 | ```console
47 | mc config host add minio $S3_ENDPOINT $ACCESS_KEY $ACCESS_SECRET_KEY
48 | ```
49 | 
50 | Creating a bucket and uploading our trained model:
51 | 
52 | ```console
53 | BUCKET_NAME=kubeflow
54 | 
55 | mc mb minio/$BUCKET_NAME
56 | 
57 | mc cp --recursive /path/to/your/exported/model minio/$BUCKET_NAME
58 | ```
59 | 
60 | After this command, you should see the files being uploaded.
61 | 
62 | ### Exercise 2: Setting up the TensorFlow Serving model server
63 | 
64 | In this exercise, we are going to set up a TensorFlow model server and start serving our trained model.
65 | 
66 | Creating our namespace for serving:
67 | 
68 | ```console
69 | export NAMESPACE=serving
70 | 
71 | kubectl create namespace $NAMESPACE
72 | ```
73 | 
74 | Creating a secret for the Minio storage so the TensorFlow Serving container can access it:
75 | 
76 | ```console
77 | kubectl create secret generic serving-creds --from-literal=accessKeyID=${ACCESS_KEY} \
78 | --from-literal=secretAccessKey=${ACCESS_SECRET_KEY} -n $NAMESPACE
79 | ```
80 | 
81 | Defining variables such as the model name and the TensorFlow Serving image:
82 | 
83 | ```console
84 | S3_USE_HTTPS=0
85 | S3_VERIFY_SSL=0
86 | JOB_NAME=myjob
87 | MODEL_COMPONENT=mnist
88 | MODEL_NAME=mnist
89 | MODEL_PATH=s3://${BUCKET_NAME}/models/${JOB_NAME}/export/${MODEL_NAME}/
90 | MODEL_SERVER_IMAGE=sozercan/tensorflow-model-server
91 | ```
92 | 
93 | Initialize the ksonnet application and install the Kubeflow `tf-serving` package:
94 | 
95 | ```console
96 | ks init my-model-server
97 | cd my-model-server
98 | ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
99 | ks pkg install kubeflow/tf-serving@74629b7
100 | ```
101 | 
102 | Setting up the environment for Kubeflow:
103 | 
104 | ```console
105 | ks env add azure
106 | ks env set azure --namespace ${NAMESPACE}
107 | ```
108 | 
109 | Generating the template:
110 | 
111 | ```console
112 | ks generate tf-serving ${MODEL_COMPONENT} --name=${MODEL_NAME}
113 | ```
114 | 
115 | Overriding parameters with our own values:
116 | 
117 | ```console
118 | ks param set --env azure ${MODEL_COMPONENT} modelServerImage $MODEL_SERVER_IMAGE
119 | ks param set --env azure ${MODEL_COMPONENT} modelPath $MODEL_PATH
120 | ks param set --env azure ${MODEL_COMPONENT} s3Enable true
121 | ks param set --env azure ${MODEL_COMPONENT} s3SecretName serving-creds
122 | ks param set --env azure ${MODEL_COMPONENT} s3SecretAccesskeyidKeyName accessKeyID
123 | ks param set --env azure ${MODEL_COMPONENT} s3SecretSecretaccesskeyKeyName secretAccessKey
124 | ks param set --env azure ${MODEL_COMPONENT} s3Endpoint $S3_ENDPOINT
125 | ks param set --env azure ${MODEL_COMPONENT} s3AwsRegion us-east-1
126 | ks param set --env azure ${MODEL_COMPONENT} s3UseHttps $S3_USE_HTTPS --as-string
127 | ks param set --env azure ${MODEL_COMPONENT} s3VerifySsl $S3_VERIFY_SSL --as-string
128 | ks param set --env azure ${MODEL_COMPONENT} serviceType LoadBalancer
129 | ```
130 | 
131 | Deploying TensorFlow Serving to our cluster:
132 | 
133 | ```console
134 | ks apply azure -c ${MODEL_COMPONENT}
135 | ```
136 | 
137 | After deploying, you should see a deployment and a service in your cluster. You can verify this with the following:
138 | 
139 | ```console
140 | kubectl get pods -n ${NAMESPACE}
141 | 
142 | kubectl get svc -n ${NAMESPACE}
143 | ```
144 | 
145 | ### Exercise 3: Using a client to query the TensorFlow Serving model server
146 | 
147 | In this exercise, we'll use a client to query the TensorFlow Serving model server.
148 | 
149 | ```console
150 | cd 9-serving
151 | ```
152 | 
153 | If you don't have virtualenv installed, you can install it with:
154 | 
155 | ```console
156 | pip install virtualenv
157 | ```
158 | 
159 | Setting up our virtual environment:
160 | 
161 | ```console
162 | virtualenv venv
163 | source venv/bin/activate
164 | pip install -r requirements.txt
165 | ```
166 | 
167 | Starting our query from the client:
168 | 
169 | ```console
170 | export TF_MODEL_SERVER_HOST=$(kubectl get svc ${MODEL_NAME} -n ${NAMESPACE} --template="{{range .status.loadBalancer.ingress}}{{.ip}}{{end}}")
171 | 
172 | export TF_MNIST_IMAGE_PATH=data/7.png
173 | 
174 | python mnist_client.py
175 | ```
176 | 
177 | If everything is working correctly, you should see the model's prediction for the image we just sent.
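The client also supports a couple of other input modes (see [mnist_client.py](mnist_client.py)): you can point it at any of the digit images under `data/`, or let it pull an image from the MNIST test set instead. The snippet below is just an illustration: the test image index `42` is arbitrary, and it assumes `TF_MODEL_SERVER_HOST` is still exported from the previous step.

```console
# Classify another one of the provided digit images.
export TF_MNIST_IMAGE_PATH=data/3.png
python mnist_client.py

# Or let the client download the MNIST test set and classify a specific test image.
# TF_MNIST_IMAGE_PATH must be unset first, otherwise it takes precedence (see mnist_client.py).
unset TF_MNIST_IMAGE_PATH
export TF_MNIST_TEST_IMAGE_NUMBER=42
python mnist_client.py
```

In every case the client prints the raw prediction tensors and an ASCII rendering of the input digit, as in the sample below for `data/7.png`.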
178 | 179 | Sample output: 180 | 181 | ``` 182 | outputs { 183 | key: "classes" 184 | value { 185 | dtype: DT_UINT8 186 | tensor_shape { 187 | dim { 188 | size: 1 189 | } 190 | } 191 | int_val: 7 192 | } 193 | } 194 | outputs { 195 | key: "predictions" 196 | value { 197 | dtype: DT_FLOAT 198 | tensor_shape { 199 | dim { 200 | size: 1 201 | } 202 | dim { 203 | size: 10 204 | } 205 | } 206 | float_val: 0.0 207 | float_val: 0.0 208 | float_val: 0.0 209 | float_val: 0.0 210 | float_val: 0.0 211 | float_val: 0.0 212 | float_val: 0.0 213 | float_val: 1.0 214 | float_val: 0.0 215 | float_val: 0.0 216 | } 217 | } 218 | 219 | 220 | ............................ 221 | ............................ 222 | ............................ 223 | ............................ 224 | ............................ 225 | ............................ 226 | ............................ 227 | ..............@@@@@@........ 228 | ..........@@@@@@@@@@........ 229 | ........@@@@@@@@@@@@........ 230 | ........@@@@@@@@.@@@........ 231 | ........@@@@....@@@@........ 232 | ................@@@@........ 233 | ...............@@@@......... 234 | ...............@@@@......... 235 | ...............@@@.......... 236 | ..............@@@@.......... 237 | ..............@@@........... 238 | .............@@@@........... 239 | .............@@@............ 240 | ............@@@@............ 241 | ............@@@............. 242 | ............@@@............. 243 | ...........@@@.............. 244 | ..........@@@@.............. 245 | ..........@@@@.............. 246 | ..........@@................ 247 | ............................ 248 | Your model says the above number is... 7! 249 | ``` 250 | 251 | ## Next Step 252 | 253 | [10 - Going Further](../10-going-further) 254 | -------------------------------------------------------------------------------- /9-serving/data/0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/0.png -------------------------------------------------------------------------------- /9-serving/data/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/1.png -------------------------------------------------------------------------------- /9-serving/data/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/2.png -------------------------------------------------------------------------------- /9-serving/data/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/3.png -------------------------------------------------------------------------------- /9-serving/data/4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/4.png -------------------------------------------------------------------------------- /9-serving/data/5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/5.png 
-------------------------------------------------------------------------------- /9-serving/data/6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/6.png -------------------------------------------------------------------------------- /9-serving/data/7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/7.png -------------------------------------------------------------------------------- /9-serving/data/8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/8.png -------------------------------------------------------------------------------- /9-serving/data/9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/kubeflow-labs/39657fb130a563ef03d4beaf526f8fc544825287/9-serving/data/9.png -------------------------------------------------------------------------------- /9-serving/mnist_client.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2.7 2 | 3 | import os 4 | import random 5 | import numpy 6 | 7 | from PIL import Image 8 | 9 | import tensorflow as tf 10 | from tensorflow.examples.tutorials.mnist import input_data 11 | from tensorflow_serving.apis import predict_pb2 12 | from tensorflow_serving.apis import prediction_service_pb2 13 | 14 | from grpc.beta import implementations 15 | 16 | from mnist import MNIST # pylint: disable=no-name-in-module 17 | 18 | TF_MODEL_SERVER_HOST = os.getenv("TF_MODEL_SERVER_HOST", "127.0.0.1") 19 | TF_MODEL_SERVER_PORT = int(os.getenv("TF_MODEL_SERVER_PORT", 9000)) 20 | TF_DATA_DIR = os.getenv("TF_DATA_DIR", "/tmp/data/") 21 | TF_MNIST_IMAGE_PATH = os.getenv("TF_MNIST_IMAGE_PATH", None) 22 | TF_MNIST_TEST_IMAGE_NUMBER = int(os.getenv("TF_MNIST_TEST_IMAGE_NUMBER", -1)) 23 | 24 | if TF_MNIST_IMAGE_PATH != None: 25 | raw_image = Image.open(TF_MNIST_IMAGE_PATH) 26 | int_image = numpy.array(raw_image) 27 | image = numpy.reshape(int_image, 784).astype(numpy.float32) 28 | elif TF_MNIST_TEST_IMAGE_NUMBER > -1: 29 | test_data_set = input_data.read_data_sets(TF_DATA_DIR, one_hot=True).test 30 | image = test_data_set.images[TF_MNIST_TEST_IMAGE_NUMBER] 31 | else: 32 | test_data_set = input_data.read_data_sets(TF_DATA_DIR, one_hot=True).test 33 | image = random.choice(test_data_set.images) 34 | 35 | channel = implementations.insecure_channel( 36 | TF_MODEL_SERVER_HOST, TF_MODEL_SERVER_PORT) 37 | stub = prediction_service_pb2.beta_create_PredictionService_stub(channel) 38 | 39 | request = predict_pb2.PredictRequest() 40 | request.model_spec.name = "mnist" 41 | request.model_spec.signature_name = "serving_default" 42 | request.inputs['x'].CopyFrom( 43 | tf.contrib.util.make_tensor_proto(image, shape=[1, 28, 28])) 44 | 45 | result = stub.Predict(request, 10.0) # 10 secs timeout 46 | 47 | print(result) 48 | print(MNIST.display(image, threshold=0)) 49 | print("Your model says the above number is... %d!" 
% 50 | result.outputs["classes"].int_val[0]) 51 | -------------------------------------------------------------------------------- /9-serving/requirements.txt: -------------------------------------------------------------------------------- 1 | grpc==0.3.post19 2 | numpy==1.14.0 3 | Pillow==5.0.0 4 | python-mnist==0.5 5 | tensorflow==1.5.0 6 | tensorflow-serving-api==1.4.0 7 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Corporation ("Creative Commons") is not a law firm and 2 | does not provide legal services or legal advice. Distribution of 3 | Creative Commons public licenses does not create a lawyer-client or 4 | other relationship. Creative Commons makes its licenses and related 5 | information available on an "as-is" basis. Creative Commons gives no 6 | warranties regarding its licenses, any material licensed under their 7 | terms and conditions, or any related information. Creative Commons 8 | disclaims all liability for damages resulting from their use to the 9 | fullest extent possible. 10 | 11 | Using Creative Commons Public Licenses 12 | 13 | Creative Commons public licenses provide a standard set of terms and 14 | conditions that creators and other rights holders may use to share 15 | original works of authorship and other material subject to copyright 16 | and certain other rights specified in the public license below. The 17 | following considerations are for informational purposes only, are not 18 | exhaustive, and do not form part of our licenses. 19 | 20 | Considerations for licensors: Our public licenses are 21 | intended for use by those authorized to give the public 22 | permission to use material in ways otherwise restricted by 23 | copyright and certain other rights. Our licenses are 24 | irrevocable. Licensors should read and understand the terms 25 | and conditions of the license they choose before applying it. 26 | Licensors should also secure all rights necessary before 27 | applying our licenses so that the public can reuse the 28 | material as expected. Licensors should clearly mark any 29 | material not subject to the license. This includes other CC- 30 | licensed material, or material used under an exception or 31 | limitation to copyright. More considerations for licensors: 32 | wiki.creativecommons.org/Considerations_for_licensors 33 | 34 | Considerations for the public: By using one of our public 35 | licenses, a licensor grants the public permission to use the 36 | licensed material under specified terms and conditions. If 37 | the licensor's permission is not necessary for any reason--for 38 | example, because of any applicable exception or limitation to 39 | copyright--then that use is not regulated by the license. Our 40 | licenses grant only permissions under copyright and certain 41 | other rights that a licensor has authority to grant. Use of 42 | the licensed material may still be restricted for other 43 | reasons, including because others have copyright or other 44 | rights in the material. A licensor may make special requests, 45 | such as asking that all changes be marked or described. 46 | Although not required by our licenses, you are encouraged to 47 | respect those requests where reasonable. 
More_considerations 48 | for the public: 49 | wiki.creativecommons.org/Considerations_for_licensees 50 | 51 | ======================================================================= 52 | 53 | Creative Commons Attribution 4.0 International Public License 54 | 55 | By exercising the Licensed Rights (defined below), You accept and agree 56 | to be bound by the terms and conditions of this Creative Commons 57 | Attribution 4.0 International Public License ("Public License"). To the 58 | extent this Public License may be interpreted as a contract, You are 59 | granted the Licensed Rights in consideration of Your acceptance of 60 | these terms and conditions, and the Licensor grants You such rights in 61 | consideration of benefits the Licensor receives from making the 62 | Licensed Material available under these terms and conditions. 63 | 64 | 65 | Section 1 -- Definitions. 66 | 67 | a. Adapted Material means material subject to Copyright and Similar 68 | Rights that is derived from or based upon the Licensed Material 69 | and in which the Licensed Material is translated, altered, 70 | arranged, transformed, or otherwise modified in a manner requiring 71 | permission under the Copyright and Similar Rights held by the 72 | Licensor. For purposes of this Public License, where the Licensed 73 | Material is a musical work, performance, or sound recording, 74 | Adapted Material is always produced where the Licensed Material is 75 | synched in timed relation with a moving image. 76 | 77 | b. Adapter's License means the license You apply to Your Copyright 78 | and Similar Rights in Your contributions to Adapted Material in 79 | accordance with the terms and conditions of this Public License. 80 | 81 | c. Copyright and Similar Rights means copyright and/or similar rights 82 | closely related to copyright including, without limitation, 83 | performance, broadcast, sound recording, and Sui Generis Database 84 | Rights, without regard to how the rights are labeled or 85 | categorized. For purposes of this Public License, the rights 86 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 87 | Rights. 88 | 89 | d. Effective Technological Measures means those measures that, in the 90 | absence of proper authority, may not be circumvented under laws 91 | fulfilling obligations under Article 11 of the WIPO Copyright 92 | Treaty adopted on December 20, 1996, and/or similar international 93 | agreements. 94 | 95 | e. Exceptions and Limitations means fair use, fair dealing, and/or 96 | any other exception or limitation to Copyright and Similar Rights 97 | that applies to Your use of the Licensed Material. 98 | 99 | f. Licensed Material means the artistic or literary work, database, 100 | or other material to which the Licensor applied this Public 101 | License. 102 | 103 | g. Licensed Rights means the rights granted to You subject to the 104 | terms and conditions of this Public License, which are limited to 105 | all Copyright and Similar Rights that apply to Your use of the 106 | Licensed Material and that the Licensor has authority to license. 107 | 108 | h. Licensor means the individual(s) or entity(ies) granting rights 109 | under this Public License. 110 | 111 | i. 
Share means to provide material to the public by any means or 112 | process that requires permission under the Licensed Rights, such 113 | as reproduction, public display, public performance, distribution, 114 | dissemination, communication, or importation, and to make material 115 | available to the public including in ways that members of the 116 | public may access the material from a place and at a time 117 | individually chosen by them. 118 | 119 | j. Sui Generis Database Rights means rights other than copyright 120 | resulting from Directive 96/9/EC of the European Parliament and of 121 | the Council of 11 March 1996 on the legal protection of databases, 122 | as amended and/or succeeded, as well as other essentially 123 | equivalent rights anywhere in the world. 124 | 125 | k. You means the individual or entity exercising the Licensed Rights 126 | under this Public License. Your has a corresponding meaning. 127 | 128 | 129 | Section 2 -- Scope. 130 | 131 | a. License grant. 132 | 133 | 1. Subject to the terms and conditions of this Public License, 134 | the Licensor hereby grants You a worldwide, royalty-free, 135 | non-sublicensable, non-exclusive, irrevocable license to 136 | exercise the Licensed Rights in the Licensed Material to: 137 | 138 | a. reproduce and Share the Licensed Material, in whole or 139 | in part; and 140 | 141 | b. produce, reproduce, and Share Adapted Material. 142 | 143 | 2. Exceptions and Limitations. For the avoidance of doubt, where 144 | Exceptions and Limitations apply to Your use, this Public 145 | License does not apply, and You do not need to comply with 146 | its terms and conditions. 147 | 148 | 3. Term. The term of this Public License is specified in Section 149 | 6(a). 150 | 151 | 4. Media and formats; technical modifications allowed. The 152 | Licensor authorizes You to exercise the Licensed Rights in 153 | all media and formats whether now known or hereafter created, 154 | and to make technical modifications necessary to do so. The 155 | Licensor waives and/or agrees not to assert any right or 156 | authority to forbid You from making technical modifications 157 | necessary to exercise the Licensed Rights, including 158 | technical modifications necessary to circumvent Effective 159 | Technological Measures. For purposes of this Public License, 160 | simply making modifications authorized by this Section 2(a) 161 | (4) never produces Adapted Material. 162 | 163 | 5. Downstream recipients. 164 | 165 | a. Offer from the Licensor -- Licensed Material. Every 166 | recipient of the Licensed Material automatically 167 | receives an offer from the Licensor to exercise the 168 | Licensed Rights under the terms and conditions of this 169 | Public License. 170 | 171 | b. No downstream restrictions. You may not offer or impose 172 | any additional or different terms or conditions on, or 173 | apply any Effective Technological Measures to, the 174 | Licensed Material if doing so restricts exercise of the 175 | Licensed Rights by any recipient of the Licensed 176 | Material. 177 | 178 | 6. No endorsement. Nothing in this Public License constitutes or 179 | may be construed as permission to assert or imply that You 180 | are, or that Your use of the Licensed Material is, connected 181 | with, or sponsored, endorsed, or granted official status by, 182 | the Licensor or others designated to receive attribution as 183 | provided in Section 3(a)(1)(A)(i). 184 | 185 | b. Other rights. 186 | 187 | 1. 
Moral rights, such as the right of integrity, are not 188 | licensed under this Public License, nor are publicity, 189 | privacy, and/or other similar personality rights; however, to 190 | the extent possible, the Licensor waives and/or agrees not to 191 | assert any such rights held by the Licensor to the limited 192 | extent necessary to allow You to exercise the Licensed 193 | Rights, but not otherwise. 194 | 195 | 2. Patent and trademark rights are not licensed under this 196 | Public License. 197 | 198 | 3. To the extent possible, the Licensor waives any right to 199 | collect royalties from You for the exercise of the Licensed 200 | Rights, whether directly or through a collecting society 201 | under any voluntary or waivable statutory or compulsory 202 | licensing scheme. In all other cases the Licensor expressly 203 | reserves any right to collect such royalties. 204 | 205 | 206 | Section 3 -- License Conditions. 207 | 208 | Your exercise of the Licensed Rights is expressly made subject to the 209 | following conditions. 210 | 211 | a. Attribution. 212 | 213 | 1. If You Share the Licensed Material (including in modified 214 | form), You must: 215 | 216 | a. retain the following if it is supplied by the Licensor 217 | with the Licensed Material: 218 | 219 | i. identification of the creator(s) of the Licensed 220 | Material and any others designated to receive 221 | attribution, in any reasonable manner requested by 222 | the Licensor (including by pseudonym if 223 | designated); 224 | 225 | ii. a copyright notice; 226 | 227 | iii. a notice that refers to this Public License; 228 | 229 | iv. a notice that refers to the disclaimer of 230 | warranties; 231 | 232 | v. a URI or hyperlink to the Licensed Material to the 233 | extent reasonably practicable; 234 | 235 | b. indicate if You modified the Licensed Material and 236 | retain an indication of any previous modifications; and 237 | 238 | c. indicate the Licensed Material is licensed under this 239 | Public License, and include the text of, or the URI or 240 | hyperlink to, this Public License. 241 | 242 | 2. You may satisfy the conditions in Section 3(a)(1) in any 243 | reasonable manner based on the medium, means, and context in 244 | which You Share the Licensed Material. For example, it may be 245 | reasonable to satisfy the conditions by providing a URI or 246 | hyperlink to a resource that includes the required 247 | information. 248 | 249 | 3. If requested by the Licensor, You must remove any of the 250 | information required by Section 3(a)(1)(A) to the extent 251 | reasonably practicable. 252 | 253 | 4. If You Share Adapted Material You produce, the Adapter's 254 | License You apply must not prevent recipients of the Adapted 255 | Material from complying with this Public License. 256 | 257 | 258 | Section 4 -- Sui Generis Database Rights. 259 | 260 | Where the Licensed Rights include Sui Generis Database Rights that 261 | apply to Your use of the Licensed Material: 262 | 263 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 264 | to extract, reuse, reproduce, and Share all or a substantial 265 | portion of the contents of the database; 266 | 267 | b. if You include all or a substantial portion of the database 268 | contents in a database in which You have Sui Generis Database 269 | Rights, then the database in which You have Sui Generis Database 270 | Rights (but not its individual contents) is Adapted Material; and 271 | 272 | c. 
You must comply with the conditions in Section 3(a) if You Share 273 | all or a substantial portion of the contents of the database. 274 | 275 | For the avoidance of doubt, this Section 4 supplements and does not 276 | replace Your obligations under this Public License where the Licensed 277 | Rights include other Copyright and Similar Rights. 278 | 279 | 280 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 281 | 282 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 283 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 284 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 285 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 286 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 287 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 288 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 289 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 290 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 291 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 292 | 293 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 294 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 295 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 296 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 297 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 298 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 299 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 300 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 301 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 302 | 303 | c. The disclaimer of warranties and limitation of liability provided 304 | above shall be interpreted in a manner that, to the extent 305 | possible, most closely approximates an absolute disclaimer and 306 | waiver of all liability. 307 | 308 | 309 | Section 6 -- Term and Termination. 310 | 311 | a. This Public License applies for the term of the Copyright and 312 | Similar Rights licensed here. However, if You fail to comply with 313 | this Public License, then Your rights under this Public License 314 | terminate automatically. 315 | 316 | b. Where Your right to use the Licensed Material has terminated under 317 | Section 6(a), it reinstates: 318 | 319 | 1. automatically as of the date the violation is cured, provided 320 | it is cured within 30 days of Your discovery of the 321 | violation; or 322 | 323 | 2. upon express reinstatement by the Licensor. 324 | 325 | For the avoidance of doubt, this Section 6(b) does not affect any 326 | right the Licensor may have to seek remedies for Your violations 327 | of this Public License. 328 | 329 | c. For the avoidance of doubt, the Licensor may also offer the 330 | Licensed Material under separate terms or conditions or stop 331 | distributing the Licensed Material at any time; however, doing so 332 | will not terminate this Public License. 333 | 334 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 335 | License. 336 | 337 | 338 | Section 7 -- Other Terms and Conditions. 339 | 340 | a. The Licensor shall not be bound by any additional or different 341 | terms or conditions communicated by You unless expressly agreed. 342 | 343 | b. 
Any arrangements, understandings, or agreements regarding the 344 | Licensed Material not stated herein are separate from and 345 | independent of the terms and conditions of this Public License. 346 | 347 | 348 | Section 8 -- Interpretation. 349 | 350 | a. For the avoidance of doubt, this Public License does not, and 351 | shall not be interpreted to, reduce, limit, restrict, or impose 352 | conditions on any use of the Licensed Material that could lawfully 353 | be made without permission under this Public License. 354 | 355 | b. To the extent possible, if any provision of this Public License is 356 | deemed unenforceable, it shall be automatically reformed to the 357 | minimum extent necessary to make it enforceable. If the provision 358 | cannot be reformed, it shall be severed from this Public License 359 | without affecting the enforceability of the remaining terms and 360 | conditions. 361 | 362 | c. No term or condition of this Public License will be waived and no 363 | failure to comply consented to unless expressly agreed to by the 364 | Licensor. 365 | 366 | d. Nothing in this Public License constitutes or may be interpreted 367 | as a limitation upon, or waiver of, any privileges and immunities 368 | that apply to the Licensor or You, including from the legal 369 | processes of any jurisdiction or authority. 370 | 371 | 372 | ======================================================================= 373 | 374 | Creative Commons is not a party to its public 375 | licenses. Notwithstanding, Creative Commons may elect to apply one of 376 | its public licenses to material it publishes and in those instances 377 | will be considered the “Licensor.” The text of the Creative Commons 378 | public licenses is dedicated to the public domain under the CC0 Public 379 | Domain Dedication. Except for the limited purpose of indicating that 380 | material is shared under a Creative Commons public license or as 381 | otherwise permitted by the Creative Commons policies published at 382 | creativecommons.org/policies, Creative Commons does not authorize the 383 | use of the trademark "Creative Commons" or any other trademark or logo 384 | of Creative Commons without its prior written consent including, 385 | without limitation, in connection with any unauthorized modifications 386 | to any of its public licenses or any other arrangements, 387 | understandings, or agreements concerning use of licensed material. For 388 | the avoidance of doubt, this paragraph does not form part of the 389 | public licenses. 390 | 391 | Creative Commons may be contacted at creativecommons.org. 392 | -------------------------------------------------------------------------------- /LICENSE-CODE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | Copyright (c) Microsoft Corporation 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and 5 | associated documentation files (the "Software"), to deal in the Software without restriction, 6 | including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, 7 | and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, 8 | subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all copies or substantial 11 | portions of the Software. 
12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT 14 | NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 15 | IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 16 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 17 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Labs for Training and Serving TensorFlow Models with Kubernetes and Kubeflow on Azure Container Service (AKS) 2 | 3 | 6 | 7 | ## Prerequisites 8 | 9 | 1. Have a valid Microsoft Azure subscription allowing the creation of an AKS cluster 10 | 1. Docker client installed: [Installing Docker](https://www.docker.com/community-edition) 11 | 1. Azure-cli (2.0) installed: [Installing the Azure CLI 2.0 | Microsoft Docs](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) 12 | 1. Git cli installed: [Installing Git CLI](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) 13 | 1. Kubectl installed: [Installing Kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) 14 | 1. Helm installed: [Installing Helm CLI](https://docs.helm.sh/using_helm/#from-the-binary-releases) (**Note**: On Windows you can extract the `tar` file using a tool like 7Zip.) 15 | 1. ksonnet installed: [Installing ksonnet CLI](https://ksonnet.io/#get-started) 16 | 17 | Clone this repository somewhere so you can easily access the different source files: 18 | ```console 19 | git clone https://github.com/Azure/kubeflow-labs 20 | ``` 21 | 22 | ## Content Summary 23 | 24 | | | Module | Description | 25 | | --- | --- | --- | 26 | |0| **[Introduction](0-intro)** | Introduction to this workshop. Motivations and goals.| 27 | |1| **[Docker](1-docker)** | Docker and containers 101.| 28 | |2| **[Kubernetes](2-kubernetes)** | Kubernetes important concepts overview.| 29 | |3| **[Helm](3-helm)** | Introduction to Helm | 30 | |4| **[Kubeflow](4-kubeflow)** | Introduction to Kubeflow and how to deploy it in your cluster.| 31 | |5| **[JupyterHub](5-jupyterhub)** | Learn how to run JupyterHub to create and manage Jupyter notebooks using Kubeflow | 32 | |6| **[TFJob](6-tfjob)** | Introduction to `TFJob` and how to use it to deploy a simple TensorFlow training.| 33 | |7| **[Distributed Tensorflow](7-distributed-tensorflow)** | Learn how to deploy and monitor distributed TensorFlow trainings with `TFJob`| 34 | |8| **[Hyperparameters Sweep with Helm](8-hyperparam-sweep)** | Using Helm to deploy a large number of trainings testing different hypothesis, and TensorBoard to monitor and compare the results | 35 | |9| **[Serving](9-serving)** | Using TensorFlow Serving to serve predictions | 36 | |10| **[Going Further](10-going-further)** | Links and resources to go further: Autoscaling, Distributed Storage etc. | 37 | 38 | 39 | # Contributing 40 | 41 | This project welcomes contributions and suggestions. Most contributions require you to agree to a 42 | Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us 43 | the rights to use your contribution. For details, visit https://cla.microsoft.com. 
44 | 45 | When you submit a pull request, a CLA-bot will automatically determine whether you need to provide 46 | a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions 47 | provided by the bot. You will only need to do this once across all repos using our CLA. 48 | 49 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 50 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or 51 | contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. 52 | 53 | # Legal Notices 54 | 55 | Microsoft and any contributors grant you a license to the Microsoft documentation and other content 56 | in this repository under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode), 57 | see the [LICENSE](LICENSE) file, and grant you a license to any code in the repository under the [MIT License](https://opensource.org/licenses/MIT), see the 58 | [LICENSE-CODE](LICENSE-CODE) file. 59 | 60 | Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation 61 | may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. 62 | The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. 63 | Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653. 64 | 65 | Privacy information can be found at https://privacy.microsoft.com/en-us/ 66 | 67 | Microsoft and any contributors reserve all others rights, whether under their respective copyrights, patents, 68 | or trademarks, whether by implication, estoppel or otherwise. 69 | --------------------------------------------------------------------------------