# Kubernetes Day 2

These are notes to accompany my [KubeCon EU 2017 talk](https://cloudnativeeu2017.sched.com/event/9Tcw/kubernetes-day-2-cluster-operations-i-brandon-philips-coreos). The slides [are available as well](https://docs.google.com/presentation/d/1LpiWAGbK77Ha8mOxyw01VlZ18zdMSNimhsrb35sUFv8/edit?usp=sharing). The video is [available on YouTube](https://www.youtube.com/watch?v=U1zR0eDQRYQ).

How do you keep a Kubernetes cluster running long term? Just like any other service, you need a combination of monitoring, alerting, backup, upgrade, and infrastructure management strategies to make it happen. This talk walks through and demonstrates best practices for each of these questions and shows off the latest tooling that makes them possible. The takeaway is a set of lessons and considerations that will influence the way you operate your own Kubernetes clusters.

## WARNING

These are notes for a conference talk. Much of this may become out of date very quickly. My goal is to turn much of this into docs over time.

## Cluster Setup

All of the demos in this talk were done with a self-hosted cluster deployed with the [Tectonic Installer](https://github.com/coreos/tectonic-installer#tectonic-installer) on AWS.

This cluster was also deployed using the self-hosted etcd option, which at the time of this writing [isn't merged into the Tectonic Installer](https://github.com/coreos/tectonic-installer/pull/135) quite yet.

## Failing a Scheduler

Scale the scheduler deployment down to zero to remove all schedulers:

```
kubectl scale -n kube-system deployment kube-scheduler --replicas=0
```

OH NO, scale it back up:

```
kubectl scale -n kube-system deployment kube-scheduler --replicas=1
```

Unfortunately, it is too late. Everything is ruined?!?! With no scheduler left running, the replacement scheduler pod has nothing to schedule it:
```
kubectl get pods -l k8s-app=kube-scheduler -n kube-system
NAME                              READY     STATUS    RESTARTS   AGE
kube-scheduler-3027616201-53jfh   0/1       Pending   0          52s
```

Get the current scheduler deployment:

```
kubectl get -n kube-system deployment -o yaml kube-scheduler > sched.yaml
```

Pick a node name from this list at random:

```
kubectl get nodes -l master=true
```

Edit sched.yaml down to just the pod spec and set the `spec.nodeName` field to the node selected above, which bypasses the (missing) scheduler. Something like this:

```
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
spec:
  nodeName: ip-10-0-37-115.us-west-2.compute.internal
  containers:
  - command:
    - ./hyperkube
    - scheduler
    - --leader-elect=true
    image: quay.io/coreos/hyperkube:v1.5.5_coreos.0
    imagePullPolicy: IfNotPresent
    name: kube-scheduler
    resources: {}
    terminationMessagePath: /dev/termination-log
  dnsPolicy: ClusterFirst
  nodeSelector:
    master: "true"
  restartPolicy: Always
  securityContext: {}
  terminationGracePeriodSeconds: 30
```

Create this temporary pod with `kubectl create -f sched.yaml`. Once it is running, it will schedule the pending pod from the deployment, which can then take over:

```
kubectl get pods -l k8s-app=kube-scheduler -n kube-system
```

Delete the temporary pod:

```
kubectl delete pod -n kube-system kube-scheduler
```

## Downgrade/Upgrade Scheduler

Edit the scheduler deployment and downgrade the image by a patch release:

```
kubectl edit -n kube-system deployment kube-scheduler
```

Now edit it again and upgrade the image back up a patch release:

```
kubectl edit -n kube-system deployment kube-scheduler
```

Boom!
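The same downgrade/upgrade can also be done non-interactively with `kubectl set image` instead of `kubectl edit`; a sketch, assuming the hyperkube image from the pod spec above (the exact target tags are illustrative):

```
# Downgrade the scheduler container by a patch release (target tag is illustrative)
kubectl set image -n kube-system deployment/kube-scheduler \
  kube-scheduler=quay.io/coreos/hyperkube:v1.5.4_coreos.0

# Watch the deployment roll the new pod out
kubectl rollout status -n kube-system deployment/kube-scheduler

# Upgrade back up
kubectl set image -n kube-system deployment/kube-scheduler \
  kube-scheduler=quay.io/coreos/hyperkube:v1.5.5_coreos.0
```

Because the change goes through the deployment, the old scheduler keeps running until the replacement pod is up.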
## kubectl drain and cordon

```
$ kubectl get nodes
NAME                                        STATUS    AGE
ip-10-0-13-248.us-west-2.compute.internal   Ready     19h
```

To mark a node unschedulable and evict all of its pods, run the following:

```
kubectl drain ip-10-0-13-248.us-west-2.compute.internal
```

## kubectl cordon and uncordon

To ensure a node doesn't get additional workloads, you can cordon/uncordon it. This is very useful for investigating an issue while ensuring the node doesn't change under you while debugging.

```
$ kubectl cordon ip-10-0-84-104.us-west-2.compute.internal
node "ip-10-0-84-104.us-west-2.compute.internal" cordoned
```

To undo, run uncordon:

```
$ kubectl uncordon ip-10-0-84-104.us-west-2.compute.internal
node "ip-10-0-84-104.us-west-2.compute.internal" uncordoned
```

## Monitoring

Using [contrib/kube-prometheus](https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus) deployed in the self-hosted configuration.

Port-forward to run queries against Prometheus (the loop restarts the forward if the pod is rescheduled):

```
while true; do kubectl port-forward -n monitoring prometheus-k8s-0 9090; done
```

NOTE: a few bugs were [found and filed](https://github.com/coreos/prometheus-operator/issues/created_by/philips) against this configuration.

## Configure etcd backup

Note: S3 backup isn't working in the etcd Operator on self-hosted yet; hunting this down.
Set up AWS upload creds:

```
kubectl create secret generic aws-credential --from-file=$HOME/.aws/credentials -n kube-system
kubectl create configmap aws-config --from-file=$HOME/.aws/config-us-west-1 -n kube-system
```

Edit the operator deployment to add the backup flags:

```
kubectl edit deployment etcd-operator -n kube-system
```

```
- command:
  - /usr/local/bin/etcd-operator
  - --backup-aws-secret
  - aws-credential
  - --backup-aws-config
  - aws-config
  - --backup-s3-bucket
  - tectonic-eo-etcd-backups
```

Re-apply the etcd cluster object so the operator picks up the backup configuration:

```
kubectl get cluster.etcd -n kube-system kube-etcd -o yaml > etcd
kubectl replace -f etcd -n kube-system
```
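Once the operator restarts with these flags, you can sanity-check that backups are landing; a sketch, assuming the AWS CLI is configured with the same credentials and the bucket name above (the operator's pod label is an assumption):

```
# Look for backup objects the operator has written to the bucket
aws s3 ls --recursive s3://tectonic-eo-etcd-backups/

# Check the operator logs for backup errors (label selector is an assumption)
kubectl logs -n kube-system -l name=etcd-operator
```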