├── .gitignore
├── resources
│   ├── description.jpg
│   └── System-overview.jpg
├── LICENSE
├── deployments
│   ├── example-gpu-deployment.yaml
│   └── example-gpu-deployment-nvidia-375-82.yaml
├── scripts
│   ├── init-worker.sh
│   ├── init-master.sh
│   └── init-worker-with-cuda.sh
├── JupyterNotebooks
│   └── ListGPUs.ipynb
└── README.md

/.gitignore:
--------------------------------------------------------------------------------
1 | ## .DS_Store files of macOS
2 | *.DS_Store
--------------------------------------------------------------------------------
/resources/description.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Langhalsdino/Kubernetes-GPU-Guide/HEAD/resources/description.jpg
--------------------------------------------------------------------------------
/resources/System-overview.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Langhalsdino/Kubernetes-GPU-Guide/HEAD/resources/System-overview.jpg
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Langhalsdino / Frederic Jan Tausch
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/deployments/example-gpu-deployment.yaml:
--------------------------------------------------------------------------------
1 | ---
2 | apiVersion: extensions/v1beta1
3 | kind: Deployment
4 | metadata:
5 |   name: tf-jupyter
6 | spec:
7 |   replicas: 1
8 |   template:
9 |     metadata:
10 |       labels:
11 |         app: tf-jupyter
12 |     spec:
13 |       volumes:
14 |       - hostPath:
15 |           path: /usr/lib/nvidia-375/bin
16 |         name: bin
17 |       - hostPath:
18 |           path: /usr/lib/nvidia-375
19 |         name: lib
20 |       containers:
21 |       - name: tensorflow
22 |         image: tensorflow/tensorflow:0.11.0rc0-gpu
23 |         ports:
24 |         - containerPort: 8888
25 |         resources:
26 |           limits:
27 |             alpha.kubernetes.io/nvidia-gpu: 1
28 |         volumeMounts:
29 |         - mountPath: /usr/local/nvidia/bin
30 |           name: bin
31 |         - mountPath: /usr/local/nvidia/lib
32 |           name: lib
33 | ---
34 | apiVersion: v1
35 | kind: Service
36 | metadata:
37 |   name: tf-jupyter-service
38 |   labels:
39 |     app: tf-jupyter
40 | spec:
41 |   selector:
42 |     app: tf-jupyter
43 |   ports:
44 |   - port: 8888
45 |     protocol: TCP
46 |     nodePort: 30061
47 |   type: LoadBalancer
48 | ---
--------------------------------------------------------------------------------
/scripts/init-worker.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 | # The following arguments are necessary:
3 | # 1. -> token, e.g. "f38242.e7f3XXXXXXXXe231e"
4 | # 2. -> IP:port of the master
5 | echo "The following token will be used: ${1}"
6 | echo "The master node's IP:port is: ${2}"
7 | sudo bash -c 'apt-get update && apt-get install -y apt-transport-https
8 | curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
9 | cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
10 | deb http://apt.kubernetes.io/ kubernetes-xenial main
11 | EOF
12 | apt-get update'
13 | sudo apt-get install -y --allow-unauthenticated docker-engine
14 | sudo apt-get install -y --allow-unauthenticated kubelet kubeadm kubectl kubernetes-cni
15 | sudo groupadd docker
16 | sudo usermod -aG docker $USER
17 |
18 | sudo systemctl enable docker && sudo systemctl start docker
19 | sudo systemctl enable kubelet && sudo systemctl start kubelet
20 |
21 | echo 'You might need to reboot / relogin to make docker work correctly'
22 |
23 | for file in /etc/systemd/system/kubelet.service.d/*-kubeadm.conf
24 | do
25 |   echo "Found ${file}"
26 |   FILE_NAME=$file
27 | done
28 |
29 | echo "Chose ${FILE_NAME} as kubeadm.conf"
30 | sudo sed -i '/^ExecStart=\/usr\/bin\/kubelet/ s/$/ --feature-gates="Accelerators=true"/' ${FILE_NAME}
31 |
32 | sudo systemctl daemon-reload
33 | sudo systemctl restart kubelet
34 |
35 | sudo kubeadm join --token $1 $2
--------------------------------------------------------------------------------
/scripts/init-master.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 | # Give a reachable IP as the argument for this script
3 | echo "Public IP that will be advertised is: ${1}"
4 | sudo bash -c 'apt-get update && apt-get install -y apt-transport-https
5 | curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
6 | cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
7 | deb http://apt.kubernetes.io/ kubernetes-xenial main
8 | EOF
9 | apt-get update'
10 | sudo apt-get install -y --allow-unauthenticated docker-engine
11 | sudo apt-get install -y --allow-unauthenticated kubelet kubeadm kubectl kubernetes-cni
12 | sudo groupadd docker
13 | sudo usermod -aG docker $USER
14 |
15 | sudo systemctl enable docker && sudo systemctl start docker
16 | sudo systemctl enable kubelet && sudo systemctl start kubelet
17 |
18 | echo 'You might need to reboot / relogin to make docker work correctly'
19 |
20 | for file in /etc/systemd/system/kubelet.service.d/*-kubeadm.conf
21 | do
22 |   echo "Found ${file}"
23 |   FILE_NAME=$file
24 | done
25 |
26 | echo "Chose ${FILE_NAME} as kubeadm.conf"
27 | sudo sed -i '/^ExecStart=\/usr\/bin\/kubelet/ s/$/ --feature-gates="Accelerators=true"/' ${FILE_NAME}
28 |
29 | sudo systemctl daemon-reload
30 | sudo systemctl restart kubelet
31 |
32 | sudo kubeadm init --apiserver-advertise-address=$1
33 | sudo cp /etc/kubernetes/admin.conf $HOME/
34 | sudo chown $(id -u):$(id -g) $HOME/admin.conf
35 | export KUBECONFIG=$HOME/admin.conf
36 |
37 | kubectl apply -f https://git.io/weave-kube-1.6
38 | kubectl create -f https://git.io/kube-dashboard
--------------------------------------------------------------------------------
/JupyterNotebooks/ListGPUs.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "code",
5 |    "execution_count": 3,
6 |    "metadata": {
7 |     "collapsed": true
8 |    },
9 |    "outputs": [],
10 |    "source": [
11 |     "from tensorflow.python.client import device_lib\n",
12 |     "\n",
13 |     "def get_available_devices():\n",
14 |     "    local_device_protos = device_lib.list_local_devices()\n",
15 |     "    return [x.name for x in local_device_protos]"
16 |    ]
17 |   },
18 |   {
19 |    "cell_type": "code",
20 |    "execution_count": 4,
21 |    "metadata": {},
22 |    "outputs": [
23 |     {
24 |      "name": "stdout",
25 |      "output_type": "stream",
26 |      "text": [
27 |       "[u'/cpu:0', u'/gpu:0']\n"
28 |      ]
29 |     }
30 |    ],
31 |    "source": [
32 |     "print(get_available_devices())"
33 |    ]
34 |   },
35 |   {
36 |    "cell_type": "code",
37 |    "execution_count": null,
38 |    "metadata": {
39 |     "collapsed": true
40 |    },
41 |    "outputs": [],
42 |    "source": []
43 |   }
44 |  ],
45 |  "metadata": {
46 |   "kernelspec": {
47 |    "display_name": "Python 2",
48 |    "language": "python",
49 |    "name": "python2"
50 |   },
51 |   "language_info": {
52 |    "codemirror_mode": {
53 |     "name": "ipython",
54 |     "version": 2
55 |    },
56 |    "file_extension": ".py",
57 |    "mimetype": "text/x-python",
58 |    "name": "python",
59 |    "nbconvert_exporter": "python",
60 |    "pygments_lexer": "ipython2",
61 |    "version": "2.7.12"
62 |   }
63 |  },
64 |  "nbformat": 4,
65 |  "nbformat_minor": 1
66 | }
--------------------------------------------------------------------------------
/deployments/example-gpu-deployment-nvidia-375-82.yaml:
--------------------------------------------------------------------------------
1 | ---
2 | apiVersion: extensions/v1beta1
3 | kind: Deployment
4 | metadata:
5 |   name: tf-jupyter
6 | spec:
7 |   replicas: 1
8 |   template:
9 |     metadata:
10 |       labels:
11 |         app: tf-jupyter
12 |     spec:
13 |       volumes:
14 |       - name: nvidia-driver-375
15 |         hostPath:
16 |           path: /usr/lib/nvidia-375
17 |       - name: libcuda-so
18 |         hostPath:
19 |           path: /usr/lib/x86_64-linux-gnu/libcuda.so
20 |       - name: libcuda-so-1
21 |         hostPath:
22 |           path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
23 |       - name: libcuda-so-375-82
24 |         hostPath:
25 |           path: /usr/lib/x86_64-linux-gnu/libcuda.so.375.82
26 |       containers:
27 |       - name: tensorflow
28 |         image: tensorflow/tensorflow:0.11.0rc0-gpu
29 |         ports:
30 |         - containerPort: 8888
31 |         resources:
32 |           limits:
33 |             alpha.kubernetes.io/nvidia-gpu: 1
34 |         volumeMounts:
35 |         - name: nvidia-driver-375
36 |           mountPath: /usr/local/nvidia
37 |           readOnly: true
38 |         - name: libcuda-so
39 |           mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so
40 |         - name: libcuda-so-1
41 |           mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
42 |         - name: libcuda-so-375-82
43 |           mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.375.82
44 | ---
45 | apiVersion: v1
46 | kind: Service
47 | metadata:
48 |   name: tf-jupyter-service
49 |   labels:
50 |     app: tf-jupyter
51 | spec:
52 |   selector:
53 |     app: tf-jupyter
54 |   ports:
55 |   - port: 8888
56 |     protocol: TCP
57 |     nodePort: 30061
58 |   type: LoadBalancer
59 | ---
--------------------------------------------------------------------------------
/scripts/init-worker-with-cuda.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 | # The following arguments are necessary:
3 | # 1. -> token, e.g. "f38242.e7f3XXXXXXXXe231e"
4 | # 2. -> IP:port of the master
5 | echo "The following token will be used: ${1}"
6 | echo "The master node's IP:port is: ${2}"
7 | sudo bash -c 'apt-get update && apt-get install -y apt-transport-https
8 | curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
9 | cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
10 | deb http://apt.kubernetes.io/ kubernetes-xenial main
11 | EOF
12 | apt-get update'
13 | sudo apt-get install -y --allow-unauthenticated docker-engine
14 | sudo apt-get install -y --allow-unauthenticated kubelet kubeadm kubectl kubernetes-cni
15 | sudo groupadd docker
16 | sudo usermod -aG docker $USER
17 |
18 | # Install CUDA and the Nvidia driver
19 | sudo apt-get install -y linux-headers-$(uname -r)
20 | sudo add-apt-repository ppa:graphics-drivers/ppa
21 | sudo apt-get update
22 | sudo apt-get install -y nvidia-375
23 | sudo apt-get install -y nvidia-cuda-dev nvidia-cuda-toolkit nvidia-nsight
24 |
25 | echo 'In order to use cuDNN, you will need to install it separately. Visit https://developer.nvidia.com/cudnn'
26 |
27 | sudo systemctl enable docker && sudo systemctl start docker
28 | sudo systemctl enable kubelet && sudo systemctl start kubelet
29 |
30 | echo 'You might need to reboot / relogin to make docker work correctly'
31 |
32 | for file in /etc/systemd/system/kubelet.service.d/*-kubeadm.conf
33 | do
34 |   echo "Found ${file}"
35 |   FILE_NAME=$file
36 | done
37 |
38 | echo "Chose ${FILE_NAME} as kubeadm.conf"
39 | sudo sed -i '/^ExecStart=\/usr\/bin\/kubelet/ s/$/ --feature-gates="Accelerators=true"/' ${FILE_NAME}
40 |
41 | sudo systemctl daemon-reload
42 | sudo systemctl restart kubelet
43 |
44 | sudo kubeadm join --token $1 $2
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # How to automate deep learning training with a Kubernetes GPU cluster
2 |
3 | This guide should help fellow researchers and hobbyists to easily automate and accelerate their deep learning training with their own Kubernetes GPU cluster.
4 | To that end, I will explain how to easily set up a GPU cluster on multiple Ubuntu 16.04 bare-metal servers and provide some useful scripts and .yaml files that do the entire setup for you.
5 |
6 | By the way: If you need a Kubernetes GPU cluster for other reasons, this guide might be helpful to you as well.
7 |
8 | **Why did I write this guide?**
9 | I worked as an intern for the startup [understand.ai](https://understand.ai) and noticed the hassle of first designing a machine learning algorithm locally and then bringing it to the cloud for training with different parameters and datasets.
10 | The second part, bringing it to the cloud for extensive training, always takes longer than expected, is frustrating, and usually involves a lot of pitfalls.
11 |
12 | For this reason I decided to work on this problem and make the second part effortless, easy and quick.
13 | The result of this work is this handy guide, which describes how everyone can set up their own Kubernetes GPU cluster to accelerate their work.
16 | Automated deep learning training with a Kubernetes GPU cluster significantly improves the process of bringing your algorithm into the cloud for training.
17 |
18 | This illustration visualizes the new workflow, which involves only two simple steps:
19 | ![My inspiration for the project, designed by Langhalsdino.](resources/description.jpg?raw=true "My inspiration for the project") 20 | 21 | **Disclaimer**
22 | Be aware that the following sections might be opinionated. Kubernetes is an evolving, fast-paced environment, which means this guide will probably be outdated at times, depending on the author's spare time and individual contributions. For this reason, contributions are highly appreciated.
23 |
24 | ## Table of Contents
25 |
26 | * [Quick Kubernetes review](#quick-kubernetes-review)
27 | * [Rough overview of the structure of the cluster](#rough-overview-of-the-structure-of-the-cluster)
28 | * [Initialize nodes](#initialize-nodes)
29 |   - [Constraints of my setup](#constraints-of-my-setup)
30 |   - [Setup instructions](#setup-instructions)
31 |     - [Use the fast setup script](#fast-track---setup-script)
32 |     - [Manual step-by-step instructions](#detailed-step-by-step-instructions)
33 | * [How to build your GPU container](#how-to-build-your-gpu-container)
34 | * [Some helpful commands](#some-helpful-commands)
35 | * [Acknowledgements](#acknowledgements)
36 | * [License](#license)
37 |
38 | ## Quick Kubernetes review
39 |
40 | **These articles might be helpful if you need to refresh your Kubernetes knowledge:**
41 |
42 | * [Introduction to Kubernetes by DigitalOcean](https://www.digitalocean.com/community/tutorials/an-introduction-to-kubernetes)
43 | * [Kubernetes concepts](https://kubernetes.io/docs/concepts/)
44 | * [Kubernetes by example](http://kubernetesbyexample.com/)
45 | * [Kubernetes basics - interactive tutorial](https://kubernetes.io/docs/tutorials/kubernetes-basics/)
46 |
47 | ## Rough overview of the structure of the cluster
48 | The main idea is to have a small CPU-only master node that controls a cluster of GPU worker nodes.
49 | ![Rough overview of the structure of the cluster, designed by Langhalsdino](resources/System-overview.jpg?raw=true "Rough overview")
50 |
51 | ## Initialize nodes
52 | Before we can use the cluster, it first has to be initialized.
53 | Each node has to be manually initialized and joined to the cluster.
54 |
55 | ### Constraints of my setup
56 | These are the constraints of my setup. In some places they are tighter than necessary, but this is my setup and it worked for me 😒
57 |
58 | **Master**
59 |
60 | + Ubuntu 16.04
61 | + SSH access with a sudo user
62 | + Internet access
63 | + ufw deactivated (not recommended, but for ease of use; see the ufw sketch further below for an alternative)
64 | + Enabled ports (udp and tcp)
65 |   - 6443, 443, 8080
66 |   - 30000-32767 (only if your apps need them)
67 |   - These will be used to access services from outside of the cluster
68 |
69 | **Worker**
70 |
71 | + Ubuntu 16.04
72 | + SSH access with a sudo user
73 | + Internet access
74 | + ufw deactivated (not recommended, but for ease of use)
75 | + Enabled ports (udp and tcp)
76 |   - 6443, 443
77 |
78 | ### Setup instructions
79 | These instructions reflect my experience on Ubuntu 16.04 and may or may not transfer to other OSes.
80 |
81 | I have created two scripts that fully initialize the master and worker nodes as described below. If you want to take the fast track, just use them. Otherwise, I recommend reading the step-by-step instructions.
82 |
83 |
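If you would rather keep ufw running than deactivate it, the following is a minimal sketch of my own (not part of the original scripts) that opens just the ports from the constraints above; the exact port set is an assumption based on that list:

```
# Open the required ports instead of disabling ufw entirely.
sudo ufw allow 6443/tcp && sudo ufw allow 6443/udp   # Kubernetes API server
sudo ufw allow 443/tcp  && sudo ufw allow 443/udp    # HTTPS access to services
sudo ufw allow 8080/tcp && sudo ufw allow 8080/udp   # master only
sudo ufw allow 30000:32767/tcp                       # NodePort range, only if your apps need it
sudo ufw enable
```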

#### Fast Track - Setup script

84 | OK, let's take the fast track. Copy the corresponding scripts to your master and workers.
85 | Furthermore, make sure that your setup fits my constraints.
86 |
87 | **MASTER NODE**
88 |
89 | Execute the initialization script and remember the token 😉
90 | The token will look like this: ```--token f38242.e7f3XXXXXXXXe231e```.
91 |
92 | ```
93 | chmod +x init-master.sh
94 | sudo ./init-master.sh <public-ip-of-master>
95 | ```
96 |
97 | **WORKER NODE**
98 |
99 | Execute the initialization script with the correct token and the IP of your master.
100 | The port is usually ```6443```.
101 |
102 | ```
103 | chmod +x init-worker.sh
104 | sudo ./init-worker.sh <token> <master-ip>:<port>
105 | ```
106 |
107 |
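Once both scripts have finished, a quick sanity check from the master does not hurt. This snippet is my own addition and assumes kubectl is configured as shown in the detailed instructions below:

```
kubectl get nodes                          # the new worker should show up and become Ready
kubectl describe nodes | grep -i nvidia    # with the Accelerators gate enabled, GPU capacity appears here
```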

#### Detailed step-by-step instructions

108 |
109 | **MASTER NODE**
110 |
111 | **1.** Add the Kubernetes repository to the package manager
112 | ```
113 | sudo su -
114 | apt-get update && apt-get install -y apt-transport-https
115 | curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
116 | cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
117 | deb http://apt.kubernetes.io/ kubernetes-xenial main
118 | EOF
119 | apt-get update
120 | exit
121 | ```
122 |
123 | **2.** Install docker-engine, kubelet, kubeadm, kubectl and kubernetes-cni
124 |
125 | ```
126 | sudo apt-get install -y docker-engine
127 | sudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni
128 | sudo groupadd docker
129 | sudo usermod -aG docker $USER
130 | echo 'You might need to reboot / relogin to make docker work correctly'
131 | ```
132 |
133 | **3.** Since we want to build a cluster that uses GPUs, we need to enable GPU acceleration on the master node.
134 | Keep in mind that these instructions may become obsolete or change completely in a later version of Kubernetes!
135 |
136 | **3.I**
137 | Add GPU support to the kubeadm configuration while the cluster is not yet initialized.
138 | ```
139 | sudo vim /etc/systemd/system/kubelet.service.d/<number>-kubeadm.conf
140 | ```
141 | Append the flag ```--feature-gates="Accelerators=true"``` to ExecStart, so it will look like this:
142 | ```
143 | ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS [...] --feature-gates="Accelerators=true"
144 | ```
145 |
146 | **3.II** Restart kubelet
147 | ```
148 | sudo systemctl daemon-reload
149 | sudo systemctl restart kubelet
150 | ```
151 |
152 | **4.** Now we will initialize the master node.
153 | For this you will need the IP of your master node.
154 | This step will also provide you with the credentials to add further worker nodes, so remember your token 😉
155 | The token will look like this: ```--token f38242.e7f3XXXXXXXXe231e 130.211.XXX.XXX:6443```
156 | ```
157 | sudo kubeadm init --apiserver-advertise-address=<public-ip-of-master>
158 | ```
159 | **5.** Since Kubernetes 1.6 changed from ABAC role management to RBAC, we need to provide the user's credentials.
160 | You will need to perform this step every time you log into the machine!
161 | ```
162 | sudo cp /etc/kubernetes/admin.conf $HOME/
163 | sudo chown $(id -u):$(id -g) $HOME/admin.conf
164 | export KUBECONFIG=$HOME/admin.conf
165 | ```
166 |
167 | **6.** Install a network add-on so that your pods can communicate with each other. Kubernetes 1.6 has some requirements for the network add-on, among them:
168 |
169 | + CNI-based networks
170 | + RBAC support
171 |
172 | The Google Sheet GoogleSheet-Network-Add-on-Vergleich contains a selection of suitable network add-ons.
173 | I will use Weave Net, just because of my personal preference ;)
174 | ```
175 | kubectl apply -f https://git.io/weave-kube-1.6
176 | ```
177 | **6.II** You are ready to go, maybe check your pods to confirm that everything is working ;)
178 | ```
179 | kubectl get pods --all-namespaces
180 | ```
181 | **N.** If you want to tear down your master, you will need to reset the master node
182 | ```
183 | sudo kubeadm reset
184 | ```
185 |
186 | **WORKER NODE**
187 |
188 | The beginning should be familiar to you and make this process a lot faster ;)
189 |
190 | **1.** Add the Kubernetes repository to the package manager
191 | ```
192 | sudo su -
193 | apt-get update && apt-get install -y apt-transport-https
194 | curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
195 | cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
196 | deb http://apt.kubernetes.io/ kubernetes-xenial main
197 | EOF
198 | apt-get update
199 | exit
200 | ```
201 |
202 | **2.** Install docker-engine, kubelet, kubeadm, kubectl and kubernetes-cni
203 |
204 | ```
205 | sudo apt-get install -y docker-engine
206 | sudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni
207 | sudo groupadd docker
208 | sudo usermod -aG docker $USER
209 | echo 'You might need to reboot / relogin to make docker work correctly'
210 | ```
211 |
212 | **3.** Since we want to build a cluster that uses GPUs, we need to enable GPU acceleration on those worker nodes that have a GPU installed.
213 | Keep in mind that these instructions may become obsolete or change completely in a later version of Kubernetes!
214 |
215 | **3.I**
216 | Add GPU support to the kubeadm configuration while the cluster is not yet initialized.
217 | ```
218 | sudo vim /etc/systemd/system/kubelet.service.d/<number>-kubeadm.conf
219 | ```
220 | Append the flag ```--feature-gates="Accelerators=true"``` to ExecStart, so it will look like this:
221 | ```
222 | ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS [...] --feature-gates="Accelerators=true"
223 | ```
224 |
225 | **3.II** Restart kubelet
226 | ```
227 | sudo systemctl daemon-reload
228 | sudo systemctl restart kubelet
229 | ```
230 |
231 | **4.** Now we will add the worker to the cluster.
232 | For this you will need the token from your master node, so take a deep dive into your notes xD
233 | ```
234 | sudo kubeadm join --token f38242.e7f3XXXXXXe231e 130.211.XXX.XXX:6443
235 | ```
236 | **5.** Finished, check your nodes on the master and see if everything worked.
237 | ```
238 | kubectl get nodes
239 | ```
240 | **N.** If you want to tear down a worker node, first remove the node from the cluster, then reset the worker itself.
241 |
242 | ***On the master:***
243 | ```
244 | kubectl delete node <node-name>
245 | ```
246 | ***On the worker node:***
247 | ```
248 | sudo kubeadm reset
249 | ```
250 |
251 | **Client**
252 |
253 | In order to control your cluster, e.g. your master, from your client, you will need to authenticate your client with the right user.
254 | This guide won't cover creating a separate user for the client; we will just copy the user from the master node.
255 | This will be easier, trust me 🤓
256 | [Instruction to add a custom user will be added in the future]
257 |
258 | **1.** Install kubectl on your client. I have only tested it on my Mac, but Linux should work as well.
259 | I don't know about Windows, but who cares about Windows anyway :D
260 | **On Mac**
261 | ```
262 | brew install kubectl
263 | ```
264 | **2.** Copy the admin credentials from the master to your client
265 | ```
266 | scp uai@130.211.XXX.64:~/admin.conf ~/.kube/
267 | ```
268 | **3.** Add the admin.conf configuration and credentials to your Kubernetes configuration. You will need to do this in every new shell session.
269 | ```
270 | export KUBECONFIG=~/.kube/admin.conf
271 | ```
272 | You are ready to use kubectl on your local client.
273 |
274 | **3.II** You can test it by listing all your pods
275 | ```
276 | kubectl get pods --all-namespaces
277 | ```
278 |
279 |
280 | **Install the Kubernetes dashboard**
281 |
282 | The Kubernetes dashboard is pretty beautiful and gives script kiddies like me access to a lot of functionality.
283 | In order to use the dashboard you will need a working client setup; RBAC will make sure of that 👮
284 |
285 | **You can perform these steps directly on the master or from your client**
286 |
287 | **1.** Check if the dashboard is already installed
288 | ```kubectl get pods --all-namespaces | grep dashboard```
289 |
290 | **2.** If the dashboard isn't installed, install it ;)
291 | ```
292 | kubectl create -f https://git.io/kube-dashboard
293 | ```
294 | If this did not work, check whether the container defined in the .yaml [git.io/kube-dashboard](https://git.io/kube-dashboard) exists. (This bug cost me a lot of time)
295 |
296 | In order to have access to your dashboard you will need to be authenticated with your client.
297 |
298 | **3.** Proxy the dashboard to your client
299 | ```
300 | kubectl proxy
301 | ```
302 |
303 | **4.** Access the dashboard from your browser by visiting
304 | [127.0.0.1:8001/ui](http://127.0.0.1:8001/ui)
305 |
306 | ## How to build your GPU container
307 | This section should help you to get a Docker container running that needs GPU access.
308 |
309 | For this guide I have chosen to build an example Docker container that uses the TensorFlow GPU binaries and can run TensorFlow programs in a Jupyter notebook.
310 |
311 | Keep in mind that this guide was written for Kubernetes 1.6, so later changes may break it.
312 |
313 | ### Essential parts of the .yaml
314 | In order to get your Nvidia GPU with CUDA running, you have to pass the Nvidia driver and the CUDA libraries to your container.
315 | So we will use hostPath volumes to make them available to the Kubernetes pod.
316 | The actual paths differ from machine to machine, since they are set by your Nvidia driver and CUDA installation.
317 | ```
318 | volumes:
319 | - hostPath:
320 |     path: /usr/lib/nvidia-375/bin
321 |   name: bin
322 | - hostPath:
323 |     path: /usr/lib/nvidia-375
324 |   name: lib
325 | ```
326 | Mount the volumes with the driver and CUDA into the right directories for your container. These might differ, due to the specific requirements of your container.
327 | ```
328 | volumeMounts:
329 | - mountPath: /usr/local/nvidia/bin
330 |   name: bin
331 | - mountPath: /usr/local/nvidia/lib
332 |   name: lib
333 | ```
334 | Since you want to tell Kubernetes that you need n GPUs, you can define your requirements here:
335 | ```
336 | resources:
337 |   limits:
338 |     alpha.kubernetes.io/nvidia-gpu: 1
339 | ```
340 | That's it, this is everything you need to build your Kubernetes 1.6 container 😏
341 |
342 | A note at the end that sums up my overall experience:
343 | **Kubernetes + Docker + Machine Learning + GPUs = Pure awesomeness**
344 |
345 | ### Example GPU deployment
346 | My example-gpu-deployment.yaml file describes two parts, a deployment and a service, since I want to make the Jupyter notebook available from the outside.
347 |
348 | Run kubectl create to make it available to the outside world
349 | ```
350 | kubectl create -f deployment.yaml
351 | ```
352 |
353 | The deployment.yaml file looks like this:
354 | ```
355 | ---
356 | apiVersion: extensions/v1beta1
357 | kind: Deployment
358 | metadata:
359 |   name: tf-jupyter
360 | spec:
361 |   replicas: 1
362 |   template:
363 |     metadata:
364 |       labels:
365 |         app: tf-jupyter
366 |     spec:
367 |       volumes:
368 |       - hostPath:
369 |           path: /usr/lib/nvidia-375/bin
370 |         name: bin
371 |       - hostPath:
372 |           path: /usr/lib/nvidia-375
373 |         name: lib
374 |       containers:
375 |       - name: tensorflow
376 |         image: tensorflow/tensorflow:0.11.0rc0-gpu
377 |         ports:
378 |         - containerPort: 8888
379 |         resources:
380 |           limits:
381 |             alpha.kubernetes.io/nvidia-gpu: 1
382 |         volumeMounts:
383 |         - mountPath: /usr/local/nvidia/bin
384 |           name: bin
385 |         - mountPath: /usr/local/nvidia/lib
386 |           name: lib
387 | ---
388 | apiVersion: v1
389 | kind: Service
390 | metadata:
391 |   name: tf-jupyter-service
392 |   labels:
393 |     app: tf-jupyter
394 | spec:
395 |   selector:
396 |     app: tf-jupyter
397 |   ports:
398 |   - port: 8888
399 |     protocol: TCP
400 |     nodePort: 30061
401 |   type: LoadBalancer
402 | ---
403 | ```
404 |
405 | ## Some helpful commands
406 |
407 | **Get commands** with basic output
408 | ```
409 | kubectl get services                  # List all services in the namespace
410 | kubectl get pods --all-namespaces     # List all pods in all namespaces
411 | kubectl get pods -o wide              # List all pods in the namespace, with more details
412 | kubectl get deployment my-dep         # List a particular deployment
413 | ```
414 |
415 | **Describe commands** with verbose output
416 | ```
417 | kubectl describe nodes
418 | kubectl describe pods
419 | ```
420 |
421 | **Deleting resources**
422 | ```
423 | kubectl delete -f ./pod.yaml                        # Delete a pod using the type and name specified in pod.yaml
424 | kubectl delete pod,service baz foo                  # Delete pods and services with the names "baz" and "foo"
425 | kubectl delete pods,services -l name=<label-name>   # Delete pods and services with the label name=<label-name>
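
As a final sanity check, you can verify that the driver mount from the example deployment actually works inside the running pod. This snippet is my own addition; the label selector and the nvidia-smi path simply follow the deployment.yaml above:

```
# Grab the name of the tf-jupyter pod and run nvidia-smi inside it.
POD=$(kubectl get pods -l app=tf-jupyter -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -- /usr/local/nvidia/bin/nvidia-smi   # should list the GPU assigned to the pod
```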