├── LICENSE
└── readme.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Arun Gupta

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
# Kubernetes Debugging Tips

- Debug pods: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/
- Debug Services: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/
- Debug cluster: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/
- Network partitioning: `tcp_retries2` from https://medium.com/hotels-com-technology/improving-kubernetes-responsiveness-under-network-partitioning-df00403a6d97
- Troubleshoot k8s deployments: https://docs.bitnami.com/kubernetes/how-to/troubleshoot-kubernetes-deployments/
- Kubernetes recipes: maintenance and troubleshooting: https://www.oreilly.com/ideas/kubernetes-recipes-maintenance-and-troubleshooting
- 10 Most Common Reasons Kubernetes Deployments Fail: https://kukulinski.com/10-most-common-reasons-kubernetes-deployments-fail-part-1/
- SLIs/SLOs: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md
- Services `curl: (52) Empty reply from server`:
  - Check service:

    ```
    kubectl get svc
    NAME             TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)        AGE
    kubernetes       ClusterIP      10.100.0.1       <none>                                                                   443/TCP        7h1m
    myapp-greeting   LoadBalancer   10.100.144.214   ae8edb3e298f211e9b81a02ee809dbcd-201517527.us-west-2.elb.amazonaws.com   80:30947/TCP   63s
    ```

  - Check pods:

    ```
    kubectl get pods
    NAME                             READY   STATUS         RESTARTS   AGE
    myapp-greeting-84df4cf76-8mn5b   0/1     ErrImagePull   0          117s
    ```

    Diagnosis: Check the image name and tag, and make sure the image is on the registry. Does the registry need credentials?

    OR

    ```
    NAME                             READY   STATUS             RESTARTS   AGE
    myapp-greeting-84df4cf76-fs5hr   0/1     CrashLoopBackOff   2          44s
    ```

  - Describe pod:

    ```
    kubectl describe pod/myapp-greeting-84df4cf76-fs5hr
    ...
    Events:
      Type     Reason     Age               From                                                   Message
      ----     ------     ----              ----                                                   -------
      Normal   Scheduled  56s               default-scheduler                                      Successfully assigned default/myapp-greeting-84df4cf76-fs5hr to ip-192-168-2-219.us-west-2.compute.internal
      Normal   Pulling    55s               kubelet, ip-192-168-2-219.us-west-2.compute.internal   pulling image "arungupta/greeting:prom"
      Normal   Pulled     50s               kubelet, ip-192-168-2-219.us-west-2.compute.internal   Successfully pulled image "arungupta/greeting:prom"
      Normal   Created    2s (x4 over 49s)  kubelet, ip-192-168-2-219.us-west-2.compute.internal   Created container
      Normal   Started    2s (x4 over 48s)  kubelet, ip-192-168-2-219.us-west-2.compute.internal   Started container
      Normal   Pulled     2s (x3 over 48s)  kubelet, ip-192-168-2-219.us-west-2.compute.internal   Container image "arungupta/greeting:prom" already present on machine
      Warning  BackOff    2s (x5 over 47s)  kubelet, ip-192-168-2-219.us-west-2.compute.internal   Back-off restarting failed container
    ```

  - Check pod logs:

    ```
    ```

  - Run Docker image:

    ```
    docker container run -p 80:8080 -it arungupta/greeting:prom
    root@f88d9c61ee6c:/#
    ```

  - Check build output:

    ```
    [INFO] Containerizing application to Docker daemon as arungupta/greeting:prom...
    [INFO] The base image requires auth. Trying again for openjdk:8-jre...
    [INFO] Container program arguments set to [bash] (inherited from base image)
    ```

    Diagnosis: The container starts `bash` instead of the application (note the shell prompt above and the `[bash]` program arguments in the build output). `pom.xml` uses `jib` for building the Docker image and `war` as `<packaging>`. So either add `ServletInitializer` and exclude the Tomcat embedded JAR, or change `<packaging>` to `jar`. Alternatively, use a Dockerfile to create your image.

- Accessing service gives:

  ```
  curl http://$(kubectl get svc/myapp-greeting \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')/hello
  curl: (6) Could not resolve host: a87d6823b994311e9b81a02ee809dbcd-1223624547.us-west-2.elb.amazonaws.com
  ```

  Diagnosis: It takes about 2-3 minutes for the ELB to become available.

- kubectl not working, specifically giving error `Unable to connect to the server: net/http: TLS handshake timeout`
  - check Internet connection
  - check KUBECONFIG
  - check context `kubectl config get-contexts`
  - API server down?
    - Get API server address
    - POST `<address>:8080/healthz` or `<address>:8080/healthz/ping` ??
  - check kubectl skew, supported within +/-1 minor version of the API server
- Kubectl latency is too high?
  - What is the SLO?
  - check API server logs from Container Insights
  - What instance type is used for etcd? IOPS-bound?
    - Use an SSD locally attached to the node instead of over-network?
  - How do we make sure etcd scales? How big is the typical database? What are the limits around it?
  - For large clusters, configure API server to store events in a dedicated etcd instance
  - Building large k8s clusters: https://www.youtube.com/watch?v=kDwQ991NCkg
- Why is HPA not scaling pods?
  - Is metrics-server installed?
- Why is Cluster Autoscaler not scaling the cluster?
  - Is it even installed?
- Use IPVS (hash table) instead of iptables (linear list of routing rules)
- Drop in scheduler throughput and increase in latency for large clusters
- etcd
  - 3 etcd?
  - what is the typical database size?
  - how do we size?
- [Updating cluster](http://arun-gupta.github.io/update-eks-cluster/)
  - version skew policy: https://kubernetes.io/docs/setup/release/version-skew-policy/
  - don't jump two versions in one go in the control plane
  - control plane should always be updated first
  - what if kubectl starts sending requests that are not yet honored by the data plane?
  - what if the update fails midway?
  - what is recommended? migrate to a new node group or update the existing one?
  - is `eksctl update cluster` reliable? or should we recommend creating a new cluster, migrating the workloads, and using DNS to switch?


# Kubernetes Internals

- How does `kubectl exec` work? https://erkanerol.github.io/post/how-kubectl-exec-works/
- What happens when I type `kubectl run`? https://github.com/jamiehannaford/what-happens-when-k8s
  - client-side validation
  - authentication
  - prepare and send REST request
  - API server authenticates and authorizes
  - admission controllers
  - persist resource in etcd
  - run initializers
  - run the controllers
  - schedule to a node
  - kubelet generates pod status
  - kubelet creates cgroups and namespaces
  - kubelet uses CNI to set up networking
  - kubelet creates containers using CRI
- Kubernetes networking: https://sookocheff.com/post/kubernetes/understanding-kubernetes-networking-model/
--------------------------------------------------------------------------------
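
Two of the checks from the debugging tips, the kubectl version skew rule and the pod status triage, can be sketched as small shell helpers. This is only a sketch under assumptions: the function names `skew_ok` and `hint_for_status` are hypothetical and not part of kubectl, the hint text paraphrases the notes, and `ImagePullBackOff` is added here as the back-off state that usually follows `ErrImagePull`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; neither function ships with kubectl.

# skew_ok CLIENT_MINOR SERVER_MINOR
# Succeeds when the client minor version is within one of the server's.
# A trailing "+" (some providers report minors such as "14+") is dropped.
skew_ok() {
  local c=${1%+} s=${2%+}
  local diff=$(( c - s ))
  # ${diff#-} removes a leading minus sign, giving the absolute value.
  [ "${diff#-}" -le 1 ]
}

# hint_for_status STATUS
# Prints the next step suggested in the notes for a given pod status.
hint_for_status() {
  case "$1" in
    ErrImagePull|ImagePullBackOff)
      echo "check image name, tag, and registry credentials" ;;
    CrashLoopBackOff)
      echo "describe the pod, then check its logs" ;;
    *)
      echo "no hint recorded for this status" ;;
  esac
}
```

A possible use: read the `clientVersion.minor` and `serverVersion.minor` fields from `kubectl version -o json` and pass them to `skew_ok`, then pass the STATUS column of `kubectl get pods` to `hint_for_status`.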