242 | Solution (expand to see)
243 |
244 |
245 | First, we need to modify the `Dockerfile`.
246 | We only need to change the tag of the TensorFlow base image to one that supports GPUs:
247 |
248 | ```dockerfile
249 | FROM tensorflow/tensorflow:1.4.0-gpu
250 | COPY main.py /app/main.py
251 |
252 | ENTRYPOINT ["python", "/app/main.py"]
253 | ```
254 |
255 | Then we can create our Job template:
256 |
257 | ```yaml
258 | apiVersion: batch/v1
259 | kind: Job # Our training should be a Job since it is supposed to terminate at some point
260 | metadata:
261 |   name: module4-ex2 # Name of our job
262 | spec:
263 |   template: # Template of the Pod that is going to be run by the Job
264 |     metadata:
265 |       name: mnist-pod # Name of the pod
266 |     spec:
267 |       containers: # List of containers that should run inside the pod, in our case there is only one.
268 |       - name: tensorflow
269 |         image: wbuchwalter/tf-mnist:gpu # The image to run, you can replace by your own.
270 |         args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile
271 |         resources:
272 |           limits:
273 |             alpha.kubernetes.io/nvidia-gpu: 1
274 |         volumeMounts: # Where the drivers should be mounted in the container
275 |         - name: lib
276 |           mountPath: /usr/local/nvidia/lib64
277 |         - name: libcuda
278 |           mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
279 |       restartPolicy: OnFailure
280 |       volumes: # Where the drivers are located on the node
281 |       - name: lib
282 |         hostPath:
283 |           path: /usr/lib/nvidia-384
284 |       - name: libcuda
285 |         hostPath:
286 |           path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
287 | ```
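
The same template can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; `build_gpu_job` and `unmatched_mounts` are hypothetical helper names introduced for illustration, not part of Kubernetes or of this workshop. It mirrors the fields of the YAML above and checks that every `volumeMounts` entry refers to a declared volume, which is easy to get wrong when mounting drivers by hand.

```python
import json


def build_gpu_job(name, image, args, gpus=1):
    """Return a Job manifest (as a dict) mirroring the YAML template above."""
    container = {
        "name": "tensorflow",
        "image": image,
        "args": list(args),
        "resources": {"limits": {"alpha.kubernetes.io/nvidia-gpu": gpus}},
        # Where the drivers should be mounted in the container
        "volumeMounts": [
            {"name": "lib", "mountPath": "/usr/local/nvidia/lib64"},
            {"name": "libcuda",
             "mountPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1"},
        ],
    }
    # Where the drivers are located on the node
    volumes = [
        {"name": "lib", "hostPath": {"path": "/usr/lib/nvidia-384"}},
        {"name": "libcuda",
         "hostPath": {"path": "/usr/lib/x86_64-linux-gnu/libcuda.so.1"}},
    ]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {"template": {
            "metadata": {"name": "mnist-pod"},
            "spec": {
                "containers": [container],
                "restartPolicy": "OnFailure",
                "volumes": volumes,
            },
        }},
    }


def unmatched_mounts(job):
    """Return the names of volumeMounts with no matching entry in volumes."""
    pod_spec = job["spec"]["template"]["spec"]
    declared = {v["name"] for v in pod_spec.get("volumes", [])}
    return sorted(
        m["name"]
        for c in pod_spec["containers"]
        for m in c.get("volumeMounts", [])
        if m["name"] not in declared
    )


job = build_gpu_job("module4-ex2", "wbuchwalter/tf-mnist:gpu",
                    ["--max_steps", "500"])
assert unmatched_mounts(job) == []
print(json.dumps(job, indent=2))
```

Since `kubectl` accepts JSON manifests as well as YAML, the printed output could be saved to a file and used in place of the YAML template.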
288 |
289 | And deploy it with
290 |
291 | ```console
292 | kubectl create -f
293 | ```
294 |
295 |
296 |
297 |
298 | ## Next Step
299 | [5 - TFJob](../5-tfjob/README.md)
300 |
--------------------------------------------------------------------------------
/4-gpus/src/Dockerfile:
--------------------------------------------------------------------------------
1 | # Change this image to one that supports GPU
2 | FROM tensorflow/tensorflow:1.4.0
3 | COPY main.py /app/main.py
4 |
5 | ENTRYPOINT ["python", "/app/main.py"]
--------------------------------------------------------------------------------
/4-gpus/src/main.py:
--------------------------------------------------------------------------------
1 | # Copyright 2015 The TensorFlow Authors. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the 'License');
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an 'AS IS' BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | # ==============================================================================
15 | """A simple MNIST classifier which displays summaries in TensorBoard.
16 |
17 | This is an unimpressive MNIST model, but it is a good example of using
18 | tf.name_scope to make a graph legible in the TensorBoard graph explorer, and of
19 | naming summary tags so that they are grouped meaningfully in TensorBoard.
20 |
21 | It demonstrates the functionality of every TensorBoard dashboard.
22 | """
23 | from __future__ import absolute_import
24 | from __future__ import division
25 | from __future__ import print_function
26 |
27 | import argparse
28 | import os
29 | import sys
30 |
31 | import tensorflow as tf
32 |
33 | from tensorflow.examples.tutorials.mnist import input_data
34 |
35 | FLAGS = None
36 |
37 |
38 | def train():
39 |   # Import data
40 |   mnist = input_data.read_data_sets(FLAGS.data_dir,
41 |                                     one_hot=True,
42 |                                     fake_data=FLAGS.fake_data)
43 |
44 |   # Create a multilayer model.
45 |
46 |   # Input placeholders
47 |   with tf.name_scope('input'):
48 |     x = tf.placeholder(tf.float32, [None, 784], name='x-input')
49 |     y_ = tf.placeholder(tf.float32, [None, 10], name='y-input')
50 |
51 |   with tf.name_scope('input_reshape'):
52 |     image_shaped_input = tf.reshape(x, [-1, 28, 28, 1])
53 |     tf.summary.image('input', image_shaped_input, 10)
54 |
55 |   # We can't initialize these variables to 0 - the network will get stuck.
56 |   def weight_variable(shape):
57 |     """Create a weight variable with appropriate initialization."""
58 |     initial = tf.truncated_normal(shape, stddev=0.1)
59 |     return tf.Variable(initial)
60 |
61 |   def bias_variable(shape):
62 |     """Create a bias variable with appropriate initialization."""
63 |     initial = tf.constant(0.1, shape=shape)
64 |     return tf.Variable(initial)
65 |
66 |   def variable_summaries(var):
67 |     """Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
68 |     with tf.name_scope('summaries'):
69 |       mean = tf.reduce_mean(var)
70 |       tf.summary.scalar('mean', mean)
71 |       with tf.name_scope('stddev'):
72 |         stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
73 |       tf.summary.scalar('stddev', stddev)
74 |       tf.summary.scalar('max', tf.reduce_max(var))
75 |       tf.summary.scalar('min', tf.reduce_min(var))
76 |       tf.summary.histogram('histogram', var)
77 |
78 |   def nn_layer(input_tensor, input_dim, output_dim, layer_name, act=tf.nn.relu):
79 |     """Reusable code for making a simple neural net layer.
80 |
81 |     It does a matrix multiply, bias add, and then uses ReLU to nonlinearize.
82 |     It also sets up name scoping so that the resultant graph is easy to read,
83 |     and adds a number of summary ops.
84 |     """
85 |     # Adding a name scope ensures logical grouping of the layers in the graph.
86 |     with tf.name_scope(layer_name):
87 |       # This Variable will hold the state of the weights for the layer
88 |       with tf.name_scope('weights'):
89 |         weights = weight_variable([input_dim, output_dim])
90 |         variable_summaries(weights)
91 |       with tf.name_scope('biases'):
92 |         biases = bias_variable([output_dim])
93 |         variable_summaries(biases)
94 |       with tf.name_scope('Wx_plus_b'):
95 |         preactivate = tf.matmul(input_tensor, weights) + biases
96 |         tf.summary.histogram('pre_activations', preactivate)
97 |       activations = act(preactivate, name='activation')
98 |       tf.summary.histogram('activations', activations)
99 |       return activations
100 |
101 |   hidden1 = nn_layer(x, 784, 500, 'layer1')
102 |
103 |   with tf.name_scope('dropout'):
104 |     keep_prob = tf.placeholder(tf.float32)
105 |     tf.summary.scalar('dropout_keep_probability', keep_prob)
106 |     dropped = tf.nn.dropout(hidden1, keep_prob)
107 |
108 |   # Do not apply softmax activation yet, see below.
109 |   y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity)
110 |
111 |   with tf.name_scope('cross_entropy'):
112 |     # The raw formulation of cross-entropy,
113 |     #
114 |     # tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.softmax(y)),
115 |     #                               reduction_indices=[1]))
116 |     #
117 |     # can be numerically unstable.
118 |     #
119 |     # So here we use tf.nn.softmax_cross_entropy_with_logits on the
120 |     # raw outputs of the nn_layer above, and then average across
121 |     # the batch.
122 |     diff = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y)
123 |     with tf.name_scope('total'):
124 |       cross_entropy = tf.reduce_mean(diff)
125 |   tf.summary.scalar('cross_entropy', cross_entropy)
126 |
127 |   with tf.name_scope('train'):
128 |     train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(
129 |         cross_entropy)
130 |
131 |   with tf.name_scope('accuracy'):
132 |     with tf.name_scope('correct_prediction'):
133 |       correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
134 |     with tf.name_scope('accuracy'):
135 |       accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
136 |   tf.summary.scalar('accuracy', accuracy)
137 |
138 |   # Merge all the summaries and write them out to
139 |   # /tmp/tensorflow/mnist/logs/mnist_with_summaries (by default)
140 |   merged = tf.summary.merge_all()
141 |
142 |   def feed_dict(train):
143 |     """Make a TensorFlow feed_dict: maps data onto Tensor placeholders."""
144 |     if train or FLAGS.fake_data:
145 |       xs, ys = mnist.train.next_batch(100, fake_data=FLAGS.fake_data)
146 |       k = FLAGS.dropout
147 |     else:
148 |       xs, ys = mnist.test.images, mnist.test.labels
149 |       k = 1.0
150 |     return {x: xs, y_: ys, keep_prob: k}
151 |
152 |   sess = tf.InteractiveSession()
153 |   train_writer = tf.summary.FileWriter(FLAGS.log_dir + '/train', sess.graph)
154 |   test_writer = tf.summary.FileWriter(FLAGS.log_dir + '/test')
155 |   tf.global_variables_initializer().run()
156 |   # Train the model, and also write summaries.
157 |   # Every 10th step, measure test-set accuracy, and write test summaries
158 |   # All other steps, run train_step on training data, & add training summaries
159 |
160 |
161 |   for i in range(FLAGS.max_steps):
162 |     if i % 10 == 0:  # Record summaries and test-set accuracy
163 |       summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False))
164 |       test_writer.add_summary(summary, i)
165 |       print('Accuracy at step %s: %s' % (i, acc))
166 |     else:  # Record train set summaries, and train
167 |       if i % 100 == 99:  # Record execution stats
168 |         run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
169 |         run_metadata = tf.RunMetadata()
170 |         summary, _ = sess.run([merged, train_step],
171 |                               feed_dict=feed_dict(True),
172 |                               options=run_options,
173 |                               run_metadata=run_metadata)
174 |         train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
175 |         train_writer.add_summary(summary, i)
176 |         print('Adding run metadata for', i)
177 |       else:  # Record a summary
178 |         summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True))
179 |         train_writer.add_summary(summary, i)
180 |   train_writer.close()
181 |   test_writer.close()
182 |
183 |
184 | def main(_):
185 |   train()
186 |
187 |
188 | if __name__ == '__main__':
189 |   parser = argparse.ArgumentParser()
190 |   parser.add_argument('--fake_data', nargs='?', const=True, type=bool,
191 |                       default=False,
192 |                       help='If true, uses fake data for unit testing.')
193 |   parser.add_argument('--max_steps', type=int, default=1000,
194 |                       help='Number of steps to run trainer.')
195 |   parser.add_argument('--learning_rate', type=float, default=0.001,
196 |                       help='Initial learning rate')
197 |   parser.add_argument('--dropout', type=float, default=0.9,
198 |                       help='Keep probability for training dropout.')
199 |   parser.add_argument(
200 |       '--data_dir',
201 |       type=str,
202 |       default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'),
203 |                            'tensorflow/input_data'),
204 |       help='Directory for storing input data')
205 |   parser.add_argument(
206 |       '--log_dir',
207 |       type=str,
208 |       default=os.path.join(os.getenv('TEST_TMPDIR', '/tmp'),
209 |                            'tensorflow/logs'),
210 |       help='Summaries log directory')
211 |   FLAGS, unparsed = parser.parse_known_args()
212 |   tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
--------------------------------------------------------------------------------
/5-tfjob/README.md:
--------------------------------------------------------------------------------
1 | # `tensorflow/k8s` and `TFJob`
2 |
3 | ## Prerequisites
4 |
5 | * [3 - Helm](../3-helm/README.md)
6 | * [4 - GPUs](../4-gpus/README.md)
7 |
8 | ## Summary
9 |
10 | In this module you will learn how [`tensorflow/k8s`](https://github.com/tensorflow/k8s) can greatly simplify our lives when running TensorFlow on Kubernetes.
11 |
12 | ## `tensorflow/k8s`
13 |
14 | As we saw earlier, giving a container access to a GPU is not exactly a breeze on Kubernetes: we need to manually mount the drivers from the node into the container.
15 | If you have already tried to run a distributed TensorFlow training, you know that it's not easy either. Getting the `ClusterSpec` right can be painful if you have more than a couple of VMs, and it's also quite brittle (we will look more closely at distributed TensorFlow in module [6 - Distributed TensorFlow](../6-distributed-tensorflow/README.md)).
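
To make the `ClusterSpec` pain point concrete, here is a minimal sketch in plain Python of the mapping that every process in a distributed training has to agree on. The host names, the port `2222`, and the helper `build_cluster_spec` are assumptions made up for illustration; the resulting dict has the shape that `tf.train.ClusterSpec` accepts, a mapping from job name to a list of `host:port` addresses.

```python
def build_cluster_spec(worker_hosts, ps_hosts, port=2222):
    """Map each job name to its list of "host:port" addresses."""
    return {
        "worker": ["%s:%d" % (host, port) for host in worker_hosts],
        "ps": ["%s:%d" % (host, port) for host in ps_hosts],
    }


# Hypothetical host names; every process must receive the exact same spec.
cluster = build_cluster_spec(
    worker_hosts=["worker-0", "worker-1", "worker-2"],
    ps_hosts=["ps-0"],
)
print(cluster)
```

Each VM that is added, removed, or renamed means regenerating this mapping and handing it to every process again, which is where the brittleness comes from.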
16 |
17 | `tensorflow/k8s` is a new project in TensorFlow's organization on GitHub that makes all of this much easier.
18 |
19 |
20 | ### Installing `tensorflow/k8s`
21 |
22 | Installing `tensorflow/k8s` with Helm is straightforward. Run the following commands:
23 |
24 | ```console
25 | > CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
26 | > helm install ${CHART} -n tf-job --wait --replace --set cloud=azure
27 | ```
28 |
29 | If it worked, you should see something like:
30 |
31 | ```
32 | NAME: tf-job
33 | LAST DEPLOYED: Mon Nov 20 14:24:16 2017
34 | NAMESPACE: default
35 | STATUS: DEPLOYED
36 |
37 | RESOURCES:
38 | ==> v1/ConfigMap
39 | NAME DATA AGE
40 | tf-job-operator-config 1 7s
41 |
42 | ==> v1beta1/Deployment
43 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
44 | tf-job-operator 1 1 1 1 7s
45 |
46 | ==> v1/Pod(related)
47 | NAME READY STATUS RESTARTS AGE
48 | tf-job-operator-3005087210-c3js3 1/1 Running 1 4s
49 | ```
50 |
51 | This means that three resources were created: a `ConfigMap`, a `Deployment`, and a `Pod`.
52 | We will see in just a moment what each of them does.
53 |
54 | ### Kubernetes Custom Resource Definition
55 |
56 | Kubernetes has a concept of [Custom Resources](https://kubernetes.io/docs/concepts/api-extension/custom-resources/), declared through a Custom Resource Definition (often abbreviated CRD), that allows us to create custom objects that we will then be able to use.
57 | In the case of `tensorflow/k8s`, after installation, a new `TFJob` object will be available in our cluster. This object allows us to describe a TensorFlow training.
58 |
59 | #### `TFJob` Specifications
60 |
61 | Before going further, let's take a look at what the `TFJob` looks like:
62 |
63 | > Note: Some of the fields are not described here for brevity.
64 |
65 | **`TFJob` Object**
66 |
67 | | Field | Type| Description |
68 | |-------|-----|-------------|
69 | | apiVersion | `string` | Versioned schema of this representation of an object. In our case, it's `kubeflow.org/v1alpha1` |
70 | | kind | `string` | Value representing the REST resource this object represents. In our case it's `TFJob` |
71 | | metadata | [`ObjectMeta`](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata)| Standard object's metadata. |
72 | | spec | `TFJobSpec` | The actual specification of our TensorFlow job, defined below. |
73 |
74 | `spec` is the most important part, so let's look at it too:
75 |
76 | **`TFJobSpec` Object**
77 |
78 | | Field | Type| Description |
79 | |-------|-----|-------------|
80 | | ReplicaSpecs | `TFReplicaSpec` array | Specification for a set of TensorFlow processes, defined below |
81 |
82 | Let's go deeper:
83 |
84 | **`TFReplicaSpec` Object**
85 |
86 | | Field | Type| Description |
87 | |-------|-----|-------------|
88 | | TfReplicaType | `string` | What type of replica are we defining? Can be `MASTER`, `WORKER` or `PS`. When not doing distributed TensorFlow, we just use `MASTER` which happens to be the default value. |
89 | | Replicas | `int` | Number of replicas of `TfReplicaType`. Again, this is useful only for distributed TensorFlow. Default value is `1`. |
90 | | Template | [`PodTemplateSpec`](https://kubernetes.io/docs/api-reference/v1.8/#podtemplatespec-v1-core) | Describes the pod that will be created when executing a job. This is the standard Pod description that we have been using everywhere. |
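
Putting the three tables together, a `TFJob` manifest could be sketched in Python as follows. The helper `build_tfjob` is hypothetical, and the lower camel case key names (`replicaSpecs`, `tfReplicaType`, `replicas`, `template`) are an assumption about how the fields above are serialized; check them against the version of `tensorflow/k8s` you installed.

```python
import json


def build_tfjob(name, image, replica_type="MASTER", replicas=1):
    """Return a TFJob manifest as a dict, following the tables above."""
    if replica_type not in ("MASTER", "WORKER", "PS"):
        raise ValueError("unknown replica type: %s" % replica_type)
    return {
        "apiVersion": "kubeflow.org/v1alpha1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "replicaSpecs": [{
                "tfReplicaType": replica_type,
                "replicas": replicas,
                # Standard PodTemplateSpec, as used in a regular Job.
                "template": {
                    "spec": {
                        "containers": [{"name": "tensorflow", "image": image}],
                        "restartPolicy": "OnFailure",
                    }
                },
            }]
        },
    }


# "example-tfjob" and "example/image:tag" are placeholder values.
print(json.dumps(build_tfjob("example-tfjob", "example/image:tag"), indent=2))
```

The defaults in the signature mirror the defaults stated in the table: a single `MASTER` replica, which is all a non-distributed training needs.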
91 |
92 |
93 | As a refresher, here is what a simple TensorFlow training (with GPU) would look like using "vanilla" Kubernetes:
94 |
95 | ```yaml
96 | apiVersion: batch/v1
97 | kind: Job
98 | metadata:
99 |   name: example-job
100 | spec:
101 |   template:
102 |     metadata:
103 |       name: example-job
104 |     spec:
105 |       restartPolicy: OnFailure
106 |       volumes:
107 |       - name: bin
108 |         hostPath:
109 |           path: /usr/lib/nvidia-384/bin
110 |       - name: lib
111 |         hostPath:
112 |           path: /usr/lib/nvidia-384
113 |       containers:
114 |       - name: tensorflow
115 |         image: wbuchwalter/