# Kubernetes Day 2

These are notes to accompany my [KubeCon EU 2017 talk](https://cloudnativeeu2017.sched.com/event/9Tcw/kubernetes-day-2-cluster-operations-i-brandon-philips-coreos). The slides [are available as well](https://docs.google.com/presentation/d/1LpiWAGbK77Ha8mOxyw01VlZ18zdMSNimhsrb35sUFv8/edit?usp=sharing). The video is [available on YouTube](https://www.youtube.com/watch?v=U1zR0eDQRYQ).

How do you keep a Kubernetes cluster running long term? Just like any other service, you need a combination of monitoring, alerting, backup, upgrade, and infrastructure management strategies to make it happen. This talk walks through and demonstrates best practices for each of these questions and shows off the latest tooling that makes them possible. The takeaway is a set of lessons and considerations that will influence the way you operate your own Kubernetes clusters.

## WARNING

These are notes for a conference talk. Much of this may become out of date very quickly. My goal is to turn much of this into docs over time.

## Cluster Setup

All of the demos in this talk were done with a self-hosted cluster deployed with the [Tectonic Installer](https://github.com/coreos/tectonic-installer#tectonic-installer) on AWS.

This cluster was also deployed using the self-hosted etcd option, which at the time of this writing [isn't merged into the Tectonic Installer](https://github.com/coreos/tectonic-installer/pull/135) quite yet.

## Failing a Scheduler

Scale the scheduler deployment down to zero to remove all schedulers:

```
kubectl scale -n kube-system deployment kube-scheduler --replicas=0
```

OH NO, scale it back up:

```
kubectl scale -n kube-system deployment kube-scheduler --replicas=1
```

Unfortunately, it is too late. Everything is ruined?!?! With no scheduler left running, the replacement scheduler pod has nothing to schedule it:
```
kubectl get pods -l k8s-app=kube-scheduler -n kube-system
NAME                              READY     STATUS    RESTARTS   AGE
kube-scheduler-3027616201-53jfh   0/1       Pending   0          52s
```

Get the current scheduler deployment:

```
kubectl get -n kube-system deployment -o yaml kube-scheduler > sched.yaml
```

Pick a node name from this list at random:

```
kubectl get nodes -l master=true
```

Edit sched.yaml down to just the pod spec and set the `spec.nodeName` field to the node selected above, which bypasses the (missing) scheduler. Something like this:

```
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
spec:
  nodeName: ip-10-0-37-115.us-west-2.compute.internal
  containers:
  - command:
    - ./hyperkube
    - scheduler
    - --leader-elect=true
    image: quay.io/coreos/hyperkube:v1.5.5_coreos.0
    imagePullPolicy: IfNotPresent
    name: kube-scheduler
    resources: {}
    terminationMessagePath: /dev/termination-log
  dnsPolicy: ClusterFirst
  nodeSelector:
    master: "true"
  restartPolicy: Always
  securityContext: {}
  terminationGracePeriodSeconds: 30
```

Create this temporary pod with `kubectl create -f sched.yaml`. Once it is running, it will schedule the pending pod from the deployment, which can then take over:

```
kubectl get pods -l k8s-app=kube-scheduler -n kube-system
```

Delete the temporary pod:

```
kubectl delete pod -n kube-system kube-scheduler
```

## Downgrade/Upgrade Scheduler

Edit the scheduler deployment and downgrade the image by a patch release:

```
kubectl edit -n kube-system deployment kube-scheduler
```

Now edit it again and upgrade the image back up a patch release:

```
kubectl edit -n kube-system deployment kube-scheduler
```

Boom!
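The same downgrade/upgrade can also be done non-interactively with `kubectl set image` instead of `kubectl edit`; a sketch, assuming the hyperkube image from the pod spec above (the exact target tags are illustrative):

```
# Downgrade the scheduler container by a patch release (target tag is illustrative)
kubectl set image -n kube-system deployment/kube-scheduler \
  kube-scheduler=quay.io/coreos/hyperkube:v1.5.4_coreos.0

# Watch the deployment roll the new pod out
kubectl rollout status -n kube-system deployment/kube-scheduler

# Upgrade back up
kubectl set image -n kube-system deployment/kube-scheduler \
  kube-scheduler=quay.io/coreos/hyperkube:v1.5.5_coreos.0
```

Because the change goes through the deployment, the old scheduler keeps running until the replacement pod is up.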
## kubectl drain and cordon

```
$ kubectl get nodes
NAME                                        STATUS    AGE
ip-10-0-13-248.us-west-2.compute.internal   Ready     19h
```

To mark a node unschedulable and evict all of its pods, run the following:

```
kubectl drain ip-10-0-13-248.us-west-2.compute.internal
```

## kubectl cordon and uncordon

To ensure a node doesn't get additional workloads, you can cordon/uncordon it. This is very useful for investigating an issue while ensuring the node doesn't change under you while debugging.

```
$ kubectl cordon ip-10-0-84-104.us-west-2.compute.internal
node "ip-10-0-84-104.us-west-2.compute.internal" cordoned
```

To undo, run uncordon:

```
$ kubectl uncordon ip-10-0-84-104.us-west-2.compute.internal
node "ip-10-0-84-104.us-west-2.compute.internal" uncordoned
```

## Monitoring

Using [contrib/kube-prometheus](https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus) deployed in the self-hosted configuration.

Port-forward to run queries against Prometheus (the loop restarts the forward if the pod is rescheduled):

```
while true; do kubectl port-forward -n monitoring prometheus-k8s-0 9090; done
```

NOTE: a few bugs were [found and filed](https://github.com/coreos/prometheus-operator/issues/created_by/philips) against this configuration.

## Configure etcd backup

Note: S3 backup isn't working in the etcd Operator on self-hosted yet; hunting this down.
Set up AWS upload creds:

```
kubectl create secret generic aws-credential --from-file=$HOME/.aws/credentials -n kube-system
kubectl create configmap aws-config --from-file=$HOME/.aws/config-us-west-1 -n kube-system
```

Edit the operator deployment to add the backup flags:

```
kubectl edit deployment etcd-operator -n kube-system
```

```
- command:
  - /usr/local/bin/etcd-operator
  - --backup-aws-secret
  - aws-credential
  - --backup-aws-config
  - aws-config
  - --backup-s3-bucket
  - tectonic-eo-etcd-backups
```

Re-apply the etcd cluster object so the operator picks up the backup configuration:

```
kubectl get cluster.etcd -n kube-system kube-etcd -o yaml > etcd
kubectl replace -f etcd -n kube-system
```
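Once the operator restarts with these flags, you can sanity-check that backups are landing; a sketch, assuming the AWS CLI is configured with the same credentials and the bucket name above (the operator's pod label is an assumption):

```
# Look for backup objects the operator has written to the bucket
aws s3 ls --recursive s3://tectonic-eo-etcd-backups/

# Check the operator logs for backup errors (label selector is an assumption)
kubectl logs -n kube-system -l name=etcd-operator
```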