├── FreeFlow.md
├── README.md
└── sampleyaml
    ├── freeflow.yaml
    ├── tf-psservers.yaml
    ├── tfgpu-worker0.yaml
    ├── tfgpu-worker1.yaml
    └── tfworkerwithfreeflow.yaml

--------------------------------------------------------------------------------
/FreeFlow.md:
--------------------------------------------------------------------------------
# Deploy FreeFlow plugin in Kubernetes
(Credits: [Yibo Zhu](https://www.microsoft.com/en-us/research/people/yibzh/) from Microsoft Research)

Sometimes pod-to-pod network bandwidth in your Kubernetes cluster is not as good as host-to-host bandwidth, for a variety of reasons, and network bandwidth is often one of the most important factors in distributed model training performance. To optimize pod-to-pod bandwidth, the [FreeFlow](https://github.com/Microsoft/Freeflow) plugin is very helpful, and it takes just 2 simple steps to use.

## 1. Deploy FreeFlow as a DaemonSet in your Kubernetes cluster
See the sample yaml [here](https://github.com/joyq-github/TensorFlowonK8s/blob/master/sampleyaml/freeflow.yaml). In the yaml file, change the value of the HOST_IP_PREFIX environment variable to the actual IP range of your cluster's host network.

## 2. Create your pod with FreeFlow enabled
See the sample yaml [here](https://github.com/joyq-github/TensorFlowonK8s/blob/master/sampleyaml/tfworkerwithfreeflow.yaml). Add the 2 environment variables LD_PRELOAD and VNET_PREFIX to your pod definition, as shown below. Again, change the value of VNET_PREFIX to the actual IP range that your pods use.
<pre>
      containers:
      - name: tf-worker1
        image: tensorflow/tensorflow:1.8.0-gpu
        env:
        - name: LD_PRELOAD
          value: "/freeflow/libfsocket.so"
        - name: VNET_PREFIX
          value: 10.244.0.0/16
</pre>
Also mount the /freeflow volume that contains the FreeFlow library into the pod:
<pre>
        volumeMounts:
        - mountPath: /freeflow
          name: freeflow
      volumes:
      - name: freeflow
        hostPath:
          path: /freeflow
</pre>
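
Before running a workload, you can sanity-check the deployment; a minimal check, assuming the DaemonSet name and label from the sample yaml:
<pre>
# one freeflowrouter pod should be running on every node
kubectl get daemonset freeflowrouter -o wide
kubectl get pods -l freeflowrouter-node=pod -o wide
</pre>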

That's it! You should now have FreeFlow enabled on your pods for optimized network bandwidth.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 4 Simple Steps for Running Distributed TensorFlow on Kubernetes using ACS-Engine

## 1. Create a Kubernetes cluster
Follow the instructions [here](https://github.com/Azure/acs-engine/blob/master/docs/kubernetes.md) to create a Kubernetes cluster using acs-engine.
Note that you can also use [Azure Container Service](https://docs.microsoft.com/en-us/azure/container-service/kubernetes/container-service-kubernetes-walkthrough) to create a Kubernetes cluster, but ACS does not yet support heterogeneous agent pools. If you need a cluster with VMs of different sizes, e.g. some with CPUs only and some with GPUs, use acs-engine for now.

## 2. Set up GPU drivers on the agent host VMs with GPUs
Execute the script [here](https://github.com/Microsoft/DLWorkspace/blob/master/src/ClusterBootstrap/scripts/prepare_acs.sh) on each agent host VM with GPUs.

## 3. Create a set of pods for distributed TensorFlow
For example, to create 2 parameter servers and 2 workers, use these [sample yaml files](https://github.com/joyq-github/TensorFlowonK8s/tree/master/sampleyaml) to create the pods and services in your cluster. Note that you should mount Azure Files as a persistent volume for central storage, e.g. for saving model checkpoints. (For more on using Azure Files with Kubernetes, see https://docs.microsoft.com/en-us/azure/aks/azure-files.) Once done, run *kubectl get pods* and you should see 4 pods returned.

Run *kubectl get svc* to get the service IP address of each ps/worker node, as shown below.
<pre>
    tf-ps0       10.0.211.196           6006/TCP,2222/TCP
    tf-ps1       10.0.81.168            6006/TCP,2222/TCP
    tf-worker0   10.0.221.23            6006/TCP,2222/TCP
    tf-worker1   10.0.118.248           6006/TCP,2222/TCP
</pre>
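
The sample yaml files mount the Azure Files share through a Kubernetes secret named azure-secret. If you have not created it yet, a minimal sketch, with the storage account name and key as placeholders you need to fill in:
<pre>
kubectl create secret generic azure-secret \
  --from-literal=azurestorageaccountname=$STORAGE_ACCOUNT_NAME \
  --from-literal=azurestorageaccountkey=$STORAGE_ACCOUNT_KEY
</pre>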
## 4. Run a distributed TensorFlow training job
For example, to run a distributed TensorFlow training job using 1 parameter server and 2 workers, execute the following commands in the order listed below. The *--ps_hosts* and *--worker_hosts* values are the service IPs from step 3, and *--task_index* tells each process which entry in those lists it is.
The sample TensorFlow script used below can be downloaded from [here](https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py).

a) On the parameter server pod, execute the script below:
<pre>
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=4 \
--batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=200 --cross_replica_sync=False \
--data_name=imagenet --data_dir=/imagenetdata --job_name=ps --ps_hosts=10.0.211.196:2222 \
--worker_hosts=10.0.221.23:2222,10.0.118.248:2222 --task_index=0
</pre>
b) On the worker0 pod, execute the script below:
<pre>
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=4 \
--batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=200 --cross_replica_sync=False \
--data_name=imagenet --data_dir=/imagenetdata --job_name=worker --ps_hosts=10.0.211.196:2222 \
--worker_hosts=10.0.221.23:2222,10.0.118.248:2222 --task_index=0
</pre>

c) On the worker1 pod, execute the script below:
<pre>
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=4 \
--batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=200 --cross_replica_sync=False \
--data_name=imagenet --data_dir=/imagenetdata --job_name=worker --ps_hosts=10.0.211.196:2222 \
--worker_hosts=10.0.221.23:2222,10.0.118.248:2222 --task_index=1
</pre>
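
Each of the commands above is executed inside the corresponding pod. To get an interactive shell in a pod, substitute the pod name reported by *kubectl get pods*:
<pre>
kubectl exec -it $POD_NAME -- /bin/bash
</pre>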

### If you need a solution that covers and automates the setup steps above, check out Deep Learning Workspace from Microsoft Research, an open-source toolkit empowering DL workloads using Kubernetes.
Deep Learning Workspace's alpha release is available at https://github.com/microsoft/DLWorkspace/, with documentation at https://microsoft.github.io/DLWorkspace/

--------------------------------------------------------------------------------
/sampleyaml/freeflow.yaml:
--------------------------------------------------------------------------------
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: freeflowrouter
  namespace: default
spec:
  selector:
    matchLabels:
      freeflowrouter-node: pod
  template:
    metadata:
      name: freeflowrouter
      labels:
        freeflowrouter-node: pod
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      containers:
      - name: freeflowrouter
        image: dlws/freeflow:0.16
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /freeflow
          name: freeflow
        env:
        - name: HOST_IP_PREFIX
          value: 10.240.0.0/16
      volumes:
      - name: freeflow
        hostPath:
          path: /freeflow
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
--------------------------------------------------------------------------------
/sampleyaml/tf-psservers.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tf-ps0
spec:
  replicas: 1
  template:
    metadata:
      labels:
        tf-ps: "0"
        name-prefix: "tf"
        job: "ps"
    spec:
      containers:
      - name: tf-ps0
        image: gcr.io/tensorflow/tensorflow:1.2.0
        ports:
        - containerPort: 6006
          name: tensorboard
        - containerPort: 2222
          name: grpcport
        env: []
        volumeMounts:
        - mountPath: /joy
          name: shared
        - mountPath: /imagenetdata
          name: imagenet
        - mountPath: /imagenetazurefiles
          name: imagenetazure
      nodeSelector:
        nodedesc: cpunode0
      volumes:
      - name: shared
        hostPath:
          path: /home/joyadmin
      - name: imagenet
        hostPath:
          path: /mnt/imagenet-data
      - name: imagenetazure
        azureFile:
          secretName: azure-secret
          shareName: imagenet
          readOnly: false
---
apiVersion: v1
kind: Service
metadata:
  name: tf-ps0
  labels:
    tf-ps: "0"
spec:
  type: ClusterIP
  ports:
  - name: tensorboard
    port: 6006
  - name: grpcport
    port: 2222
  selector:
    tf-ps: "0"
---
--------------------------------------------------------------------------------
/sampleyaml/tfgpu-worker0.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tf-worker0
spec:
  replicas: 1
  template:
    metadata:
      labels:
        tf-worker: "0"
        name-prefix: "tf"
        job: "worker"
    spec:
      containers:
      - name: tf-worker0
        image: gcr.io/tensorflow/tensorflow:1.2.0-gpu
        ports:
        - containerPort: 6006
          name: tensorboard
        - containerPort: 2222
          name: grpcport
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 4
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia-driver
        - mountPath: /joy
          name: shared
        - mountPath: /imagenetdata
          name: imagenet
        - mountPath: /imagenetazurefiles
          name: imagenetazure
      volumes:
      - name: nvidia-driver
        hostPath:
          path: /opt/nvidia-driver/current
      - name: shared
        hostPath:
          path: /home/joyadmin
      - name: imagenet
        hostPath:
          path: /mnt/imagenet-data
      - name: imagenetazure
        azureFile:
          secretName: azure-secret
          shareName: imagenet
          readOnly: false
---
apiVersion: v1
kind: Service
metadata:
  name: tf-worker0
  labels:
    tf-worker: "0"
spec:
  type: ClusterIP
  ports:
  - name: tensorboard
    port: 6006
  - name: grpcport
    port: 2222
  selector:
    tf-worker: "0"
---
--------------------------------------------------------------------------------
/sampleyaml/tfgpu-worker1.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tf-worker1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        tf-worker: "1"
        name-prefix: "tf"
        job: "worker"
    spec:
      containers:
      - name: tf-worker1
        image: gcr.io/tensorflow/tensorflow:1.2.0-gpu
        ports:
        - containerPort: 6006
          name: tensorboard
        - containerPort: 2222
          name: grpcport
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 4
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia-driver
        - mountPath: /joy
          name: shared
        - mountPath: /imagenetdata
          name: imagenet
        - mountPath: /imagenetazurefiles
          name: imagenetazure
      volumes:
      - name: nvidia-driver
        hostPath:
          path: /opt/nvidia-driver/current
      - name: shared
        hostPath:
          path: /home/joyadmin
      - name: imagenet
        hostPath:
          path: /mnt/imagenet-data
      - name: imagenetazure
        azureFile:
          secretName: azure-secret
          shareName: imagenet
          readOnly: false
---
apiVersion: v1
kind: Service
metadata:
  name: tf-worker1
  labels:
    tf-worker: "1"
spec:
  type: ClusterIP
  ports:
  - name: tensorboard
    port: 6006
  - name: grpcport
    port: 2222
  selector:
    tf-worker: "1"
---
--------------------------------------------------------------------------------
/sampleyaml/tfworkerwithfreeflow.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tf-worker1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        tf-worker: "1"
        name-prefix: "tf"
        job: "worker"
    spec:
      containers:
      - name: tf-worker1
        image: tensorflow/tensorflow:1.8.0-gpu
        env:
        - name: LD_PRELOAD
          value: "/freeflow/libfsocket.so"
        - name: VNET_PREFIX
          value: 10.244.0.0/16
        ports:
        - containerPort: 6006
          name: tensorboard
        - containerPort: 2222
          name: grpcport
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 4
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia-driver
        - mountPath: /joy
          name: shared
        - mountPath: /imagenetdata
          name: imagenet
        - mountPath: /freeflow
          name: freeflow
      nodeSelector:
        nodedesc: gpunode1
      volumes:
      - name: nvidia-driver
        hostPath:
          path: /usr/local/nvidia
      - name: shared
        hostPath:
          path: /home/joyadmin
      - name: imagenet
        hostPath:
          path: /mnt/imagenet-data
      - name: freeflow
        hostPath:
          path: /freeflow
---
--------------------------------------------------------------------------------