Kubernetes Failure Stories
A compiled list of links to public failure stories related to Kubernetes.
Most recent publications on top.

- Misunderstanding the behaviour of one templating line - Skyscanner - blog post 2019
  - involved: HAProxy-Ingress, Service VIPs, Golang templating
  - impact: significantly increased latency, 5XXs thrown from some services
- All pods scheduled to same failing host - Moonlight - postmortem 2019
  - involved: Google Kubernetes Engine, scheduler, anti-affinity rules
  - impact: major production outage, 100% traffic loss
- The shipwreck of GKE Cluster Upgrade - Loveholidays GKE - blog post 2019
  - involved: GCP Ingress, GKE Cluster, nodes
  - impact: critical drop in pod availability, loss of ingress, 2 hour maintenance lasting 7 hours
- Breaking Kubernetes: How We Broke and Fixed our K8s Cluster - Civis Analytics - blog post 2019
  - involved: AWS, kops, large clusters, batch jobs infrastructure, Datadog, API server, DNS, CPU throttling
  - impact: production outage
- Let's talk about Failures with Kubernetes - Zalando - Hamburg meetup 2019
  - involved: AWS, NotReady nodes, ELB dynamic IPs, Ingress, API server, CronJob, CoreDNS, OOMKill, kubelet memory leak, CPU throttling
  - impact: production outages
- Total DNS outage in Kubernetes cluster - Zalando - postmortem 2019
  - involved: AWS, DNS, CoreDNS, OOMKill, ndots:5, HTTP retries
  - impact: production outage
- Maximize learnings from a Kubernetes cluster failure - NU.nl - blog post 2019
  - involved: AWS, NotReady nodes, SystemOOM, Helm, ElastAlert, no resource limits set
  - impact: user experience affected for internally used tools and dashboards
- Kubernetes Load Balancer Configuration - Beware when draining nodes - DevOps Hof - blog post 2019
  - involved: GCP Load Balancer, externalTrafficPolicy, ingress-nginx
  - impact: total ingress traffic outage
- On Infrastructure at Scale: A Cascading Failure of Distributed Systems - Target - Medium post January 2019
  - involved: on-premise, Kafka, large cluster, Consul, Docker daemon, high CPU usage
  - impact: development environment outage
- How NOT to do Kubernetes - Sr. SRE Medya Ghazizadeh - Google - Cloud Native Meetup Sep 2018
  - involved: public container registry, ingress wildcard, image size, replica count, 12factor
  - impact: security, stability of clusters
- Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Zalando - DevOpsCon Munich 2018
  - involved: AWS, Ingress, CronJob, etcd, flannel, Docker, CPU throttling
  - impact: production outages
- Outages? Downtime? - Veracode - blog post 2018
  - involved: AWS, AWS IAM, region migration, kubespray, Terraform, pod CIDR
  - impact: QA/dev cluster outage
- NRE Labs Outage Post-Mortem - NRE Labs - blog post 2018
  - involved: GCP, kubeadm, etcd, Terraform, livenessProbe
  - impact: production outage
- A Perfect DNS Storm - Toyota Connected - blog post 2018
  - involved: Azure, DNS, ndots:5, Alpine musl libc
  - impact: DNS resolution failures
- Kubernetes and the Menace ELB, the tale of an outage - Turnitin - blog post 2018
  - involved: AWS, kube-aws, ELB dynamic IPs, API server, kubelet, NotReady nodes
  - impact: 15 minutes cluster outage
- Moving the Entire Stack to K8s Within a Year – Lessons Learned - ThredUP - DevOpsStage 2018
  - involved: AWS, kops, HAProxy, livenessProbe, DNS, too many open files
  - impact: unknown outages, DNS errors
- AirMap Platform Service Outage - AirMap - incident report 2018
  - involved: Azure, NotReady nodes, kubelet PLEG, CNI
  - impact: production AirMap platform outage
- Anatomy of a Production Kubernetes Outage - Monzo - KubeCon Europe 2018
  - involved: AWS, etcd, Linkerd, NullPointerException, gRPC client, services without endpoints, incompatible Kubernetes API change
  - impact: production ledger/platform outage
- Stories from the Playbook - Google - KubeCon Europe 2018
  - involved: GKE, etcd, Docker daemon, Image registry, dnsmasq vulnerabilities
  - impact: production outage
- 101 Ways to "Break and Recover" Kubernetes Cluster - Oath/Yahoo - KubeCon Europe 2018
  - involved: on-premise, namespace deletion, domain name collision, NotReady nodes, etcd empty dir, TLS certificate refresh, DNS issues, OOM
  - impact: unknown cluster outages
- 101 Ways to Crash Your Cluster - Nordstrom - KubeCon North America 2017
  - involved: AWS, NotReady nodes, OOM, eviction thresholds, ELB dynamic IPs, kubelet, cluster autoscaler, etcd split
  - impact: full production cluster outage, other outages
- Major Outage: Current account payments may fail - Monzo - Monzo Community post 2017
  - involved: AWS, etcd, Linkerd, NullPointerException, services without endpoints
  - impact: major production outage, full platform outage, current account payments fail
- Fallacies of Distributed Computing with Kubernetes on AWS - Zalando - AWS User Group Hamburg October 2017
  - involved: AWS, unhealthy nodes, Ingress, CronJob
  - impact: production outage
- Search and Reporting Outage - Universe - incident report 2017
  - involved: Job, RestartPolicy, consume node resources
  - impact: production Universe search and reporting outage
- Our First Kubernetes Outage - Saltside - blog post 2017
  - involved: AWS, kops, Helm, NotReady nodes, resource exhaustion
  - impact: nonproduction cluster outage
- Our Failure Migrating to Kubernetes - Saltside - blog post 2017
  - involved: AWS, kops, ELB, BackendConnectionErrors, LoadBalancer service
  - impact: aborted application migration
- SaleMove US System Issue - SaleMove - incident report 2017
  - involved: AWS, ELB dynamic IPs, DNS A record for master, API server
  - impact: production issues with SaleMove US System

Why

Kubernetes is a fairly complex system with many moving parts. Its ecosystem is constantly evolving and adding even more layers (service mesh, ...) to the mix. Considering this environment, we don't hear enough real-world horror stories to learn from each other! This compilation of failure stories should make it easier for people dealing with Kubernetes operations (SRE, Ops, platform/infrastructure teams) to learn from others and reduce the unknown unknowns of running Kubernetes in production. For more information, see the blog post.

Contributing

Please help the community and share a link to your failure story by opening a Pull Request! Failure stories can be anything like blog posts, conference/meetup talks, incident postmortems, tweetstorms, ...

I would also be glad to hear about your failure stories on Twitter: my handle is @try_except_