├── FreeFlow.md
├── README.md
└── sampleyaml
    ├── freeflow.yaml
    ├── tf-psservers.yaml
    ├── tfgpu-worker0.yaml
    ├── tfgpu-worker1.yaml
    └── tfworkerwithfreeflow.yaml

--------------------------------------------------------------------------------
/FreeFlow.md:
--------------------------------------------------------------------------------
# Deploy FreeFlow plugin in Kubernetes
(Credits: [Yibo Zhu](https://www.microsoft.com/en-us/research/people/yibzh/) from Microsoft Research)

Sometimes pod-to-pod network bandwidth in your Kubernetes cluster is not as good as host-to-host bandwidth, for a variety of reasons, and network bandwidth is often one of the most important factors in distributed model training performance. To optimize pod-to-pod bandwidth, the [FreeFlow](https://github.com/Microsoft/Freeflow) plugin is very helpful, and it takes just 2 simple steps to use.

## 1. Deploy FreeFlow as a DaemonSet in your Kubernetes cluster
See the sample yaml [here](https://github.com/joyq-github/TensorFlowonK8s/blob/master/sampleyaml/freeflow.yaml). In the yaml file, change the value of the HOST_IP_PREFIX environment variable to the actual IP range of your cluster's host network.

## 2. Create your pod with FreeFlow enabled
See the sample yaml [here](https://github.com/joyq-github/TensorFlowonK8s/blob/master/sampleyaml/tfworkerwithfreeflow.yaml). Add the 2 environment variables LD_PRELOAD and VNET_PREFIX to your pod definition, as shown below. Again, change the value of VNET_PREFIX to the actual IP range that your pods use.
<pre>
      containers:
      - name: tf-worker1
        image: tensorflow/tensorflow:1.8.0-gpu
        env:
        - name: LD_PRELOAD
          value: "/freeflow/libfsocket.so"
        - name: VNET_PREFIX
          value: 10.244.0.0/16
</pre>
Also mount the /freeflow volume that contains the FreeFlow library into the pod:
<pre>
        volumeMounts:
        - mountPath: /freeflow
          name: freeflow
      volumes:
      - name: freeflow
        hostPath:
          path: /freeflow
</pre>
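
Before running a workload, you can sanity-check the deployment; a minimal check, assuming the DaemonSet name and label from the sample yaml:
<pre>
# one freeflowrouter pod should be running on every node
kubectl get daemonset freeflowrouter -o wide
kubectl get pods -l freeflowrouter-node=pod -o wide
</pre>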

That's it! You should now have FreeFlow enabled on your pods for optimized network bandwidth.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 4 Simple Steps for Running Distributed TensorFlow on Kubernetes using ACS-Engine

## 1. Create a Kubernetes cluster
Follow the instructions [here](https://github.com/Azure/acs-engine/blob/master/docs/kubernetes.md) to create a Kubernetes cluster using acs-engine.
Note that you can also use [Azure Container Service](https://docs.microsoft.com/en-us/azure/container-service/kubernetes/container-service-kubernetes-walkthrough) to create a Kubernetes cluster, but ACS does not yet support heterogeneous agent pools. If you need a cluster with VMs of different sizes, e.g. some with CPUs only and some with GPUs, use acs-engine for now.

## 2. Set up GPU drivers on the agent host VMs with GPUs
Execute the script [here](https://github.com/Microsoft/DLWorkspace/blob/master/src/ClusterBootstrap/scripts/prepare_acs.sh) on each agent host VM with GPUs.

## 3. Create a set of pods for distributed TensorFlow
For example, to create 2 parameter servers and 2 workers, use these [sample yaml files](https://github.com/joyq-github/TensorFlowonK8s/tree/master/sampleyaml) to create the pods and services in your cluster. Note that you should mount Azure Files as a persistent volume for central storage, e.g. for saving model checkpoints. (For more on using Azure Files with Kubernetes, see https://docs.microsoft.com/en-us/azure/aks/azure-files.) Once done, run *kubectl get pods* and you should see 4 pods returned.

Run *kubectl get svc* to get the service IP address of each ps/worker node, as shown below.
<pre>
    tf-ps0       10.0.211.196           6006/TCP,2222/TCP
    tf-ps1       10.0.81.168            6006/TCP,2222/TCP
    tf-worker0   10.0.221.23            6006/TCP,2222/TCP
    tf-worker1   10.0.118.248           6006/TCP,2222/TCP
</pre>
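
The sample yaml files mount the Azure Files share through a Kubernetes secret named azure-secret. If you have not created it yet, a minimal sketch, with the storage account name and key as placeholders you need to fill in:
<pre>
kubectl create secret generic azure-secret \
  --from-literal=azurestorageaccountname=$STORAGE_ACCOUNT_NAME \
  --from-literal=azurestorageaccountkey=$STORAGE_ACCOUNT_KEY
</pre>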
## 4. Run a distributed TensorFlow training job
For example, to run a distributed TensorFlow training job using 1 parameter server and 2 workers, execute the following commands in the order listed below. The *--ps_hosts* and *--worker_hosts* values are the service IPs from step 3, and *--task_index* tells each process which entry in those lists it is.
The sample TensorFlow script used below can be downloaded from [here](https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py).

a) On the parameter server pod, execute the script below:
<pre>
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=4 \
--batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=200 --cross_replica_sync=False \
--data_name=imagenet --data_dir=/imagenetdata --job_name=ps --ps_hosts=10.0.211.196:2222 \
--worker_hosts=10.0.221.23:2222,10.0.118.248:2222 --task_index=0
</pre>
b) On the worker0 pod, execute the script below:
<pre>
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=4 \
--batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=200 --cross_replica_sync=False \
--data_name=imagenet --data_dir=/imagenetdata --job_name=worker --ps_hosts=10.0.211.196:2222 \
--worker_hosts=10.0.221.23:2222,10.0.118.248:2222 --task_index=0
</pre>

c) On the worker1 pod, execute the script below:
<pre>
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=4 \
--batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=200 --cross_replica_sync=False \
--data_name=imagenet --data_dir=/imagenetdata --job_name=worker --ps_hosts=10.0.211.196:2222 \
--worker_hosts=10.0.221.23:2222,10.0.118.248:2222 --task_index=1
</pre>
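
Each of the commands above is executed inside the corresponding pod. To get an interactive shell in a pod, substitute the pod name reported by *kubectl get pods*:
<pre>
kubectl exec -it $POD_NAME -- /bin/bash
</pre>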

### If you need a solution that covers and automates the setup steps above, check out Deep Learning Workspace from Microsoft Research, an open-source toolkit empowering DL workloads using Kubernetes.
Deep Learning Workspace's alpha release is available at https://github.com/microsoft/DLWorkspace/, with documentation at https://microsoft.github.io/DLWorkspace/

--------------------------------------------------------------------------------
/sampleyaml/freeflow.yaml:
--------------------------------------------------------------------------------
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: freeflowrouter
  namespace: default
spec:
  selector:
    matchLabels:
      freeflowrouter-node: pod
  template:
    metadata:
      name: freeflowrouter
      labels:
        freeflowrouter-node: pod
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      containers:
      - name: freeflowrouter
        image: dlws/freeflow:0.16
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /freeflow
          name: freeflow
        env:
        - name: HOST_IP_PREFIX
          value: 10.240.0.0/16
      volumes:
      - name: freeflow
        hostPath:
          path: /freeflow
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
--------------------------------------------------------------------------------
/sampleyaml/tf-psservers.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tf-ps0
spec:
  replicas: 1
  template:
    metadata:
      labels:
        tf-ps: "0"
        name-prefix: "tf"
        job: "ps"
    spec:
      containers:
      - name: tf-ps0
        image: gcr.io/tensorflow/tensorflow:1.2.0
        ports:
        - containerPort: 6006
          name: tensorboard
        - containerPort: 2222
          name: grpcport
        env: []
        volumeMounts:
        - mountPath: /joy
          name: shared
        - mountPath: /imagenetdata
          name: imagenet
        - mountPath: /imagenetazurefiles
          name: imagenetazure
      nodeSelector:
        nodedesc: cpunode0
      volumes:
      - name: shared
        hostPath:
          path: /home/joyadmin
      - name: imagenet
        hostPath:
          path: /mnt/imagenet-data
      - name: imagenetazure
        azureFile:
          secretName: azure-secret
          shareName: imagenet
          readOnly: false
---
apiVersion: v1
kind: Service
metadata:
  name: tf-ps0
  labels:
    tf-ps: "0"
spec:
  type: ClusterIP
  ports:
  - name: tensorboard
    port: 6006
  - name: grpcport
    port: 2222
  selector:
    tf-ps: "0"
---
--------------------------------------------------------------------------------
/sampleyaml/tfgpu-worker0.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tf-worker0
spec:
  replicas: 1
  template:
    metadata:
      labels:
        tf-worker: "0"
        name-prefix: "tf"
        job: "worker"
    spec:
      containers:
      - name: tf-worker0
        image: gcr.io/tensorflow/tensorflow:1.2.0-gpu
        ports:
        - containerPort: 6006
          name: tensorboard
        - containerPort: 2222
          name: grpcport
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 4
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia-driver
        - mountPath: /joy
          name: shared
        - mountPath: /imagenetdata
          name: imagenet
        - mountPath: /imagenetazurefiles
          name: imagenetazure
      volumes:
      - name: nvidia-driver
        hostPath:
          path: /opt/nvidia-driver/current
      - name: shared
        hostPath:
          path: /home/joyadmin
      - name: imagenet
        hostPath:
          path: /mnt/imagenet-data
      - name: imagenetazure
        azureFile:
          secretName: azure-secret
          shareName: imagenet
          readOnly: false
---
apiVersion: v1
kind: Service
metadata:
  name: tf-worker0
  labels:
    tf-worker: "0"
spec:
  type: ClusterIP
  ports:
  - name: tensorboard
    port: 6006
  - name: grpcport
    port: 2222
  selector:
    tf-worker: "0"
---
--------------------------------------------------------------------------------
/sampleyaml/tfgpu-worker1.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tf-worker1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        tf-worker: "1"
        name-prefix: "tf"
        job: "worker"
    spec:
      containers:
      - name: tf-worker1
        image: gcr.io/tensorflow/tensorflow:1.2.0-gpu
        ports:
        - containerPort: 6006
          name: tensorboard
        - containerPort: 2222
          name: grpcport
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 4
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia-driver
        - mountPath: /joy
          name: shared
        - mountPath: /imagenetdata
          name: imagenet
        - mountPath: /imagenetazurefiles
          name: imagenetazure
      volumes:
      - name: nvidia-driver
        hostPath:
          path: /opt/nvidia-driver/current
      - name: shared
        hostPath:
          path: /home/joyadmin
      - name: imagenet
        hostPath:
          path: /mnt/imagenet-data
      - name: imagenetazure
        azureFile:
          secretName: azure-secret
          shareName: imagenet
          readOnly: false
---
apiVersion: v1
kind: Service
metadata:
  name: tf-worker1
  labels:
    tf-worker: "1"
spec:
  type: ClusterIP
  ports:
  - name: tensorboard
    port: 6006
  - name: grpcport
    port: 2222
  selector:
    tf-worker: "1"
---
--------------------------------------------------------------------------------
/sampleyaml/tfworkerwithfreeflow.yaml:
--------------------------------------------------------------------------------
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tf-worker1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        tf-worker: "1"
        name-prefix: "tf"
        job: "worker"
    spec:
      containers:
      - name: tf-worker1
        image: tensorflow/tensorflow:1.8.0-gpu
        env:
        - name: LD_PRELOAD
          value: "/freeflow/libfsocket.so"
        - name: VNET_PREFIX
          value: 10.244.0.0/16
        ports:
        - containerPort: 6006
          name: tensorboard
        - containerPort: 2222
          name: grpcport
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 4
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia-driver
        - mountPath: /joy
          name: shared
        - mountPath: /imagenetdata
          name: imagenet
        - mountPath: /freeflow
          name: freeflow
      nodeSelector:
        nodedesc: gpunode1
      volumes:
      - name: nvidia-driver
        hostPath:
          path: /usr/local/nvidia
      - name: shared
        hostPath:
          path: /home/joyadmin
      - name: imagenet
        hostPath:
          path: /mnt/imagenet-data
      - name: freeflow
        hostPath:
          path: /freeflow
---
--------------------------------------------------------------------------------