Kubernetes Failure Stories

A compiled list of links to public failure stories related to Kubernetes. Most recent publications on top.

Misunderstanding the behaviour of one templating line - Skyscanner - blog post 2019
- involved: HAProxy-Ingress, Service VIPs, Golang templating
- impact: Significantly increased latency, 5XXs thrown from some services
All pods scheduled to same failing host - Moonlight - postmortem 2019
- involved: Google Kubernetes Engine, scheduler, anti-affinity rules
- impact: major production outage, 100% traffic loss
The shipwreck of GKE Cluster Upgrade - Loveholidays GKE - blog post 2019
- involved: GCP Ingress, GKE Cluster, nodes
- impact: critical drop in pod availability, loss of ingress, 2 hour maintenance lasting 7 hours
Breaking Kubernetes: How We Broke and Fixed our K8s Cluster - Civis Analytics - blog post 2019
- involved: AWS, kops, large clusters, batch jobs infrastructure, Datadog, API server, DNS, CPU throttling
- impact: production outage
Let's talk about Failures with Kubernetes - Zalando - Hamburg meetup 2019
- involved: AWS, NotReady nodes, ELB dynamic IPs, Ingress, API server, CronJob, CoreDNS, OOMKill, kubelet memory leak, CPU throttling
- impact: production outages
Total DNS outage in Kubernetes cluster - Zalando - postmortem 2019
- involved: AWS, DNS, CoreDNS, OOMKill, ndots:5, HTTP retries
- impact: production outage
Maximize learnings from a Kubernetes cluster failure - NU.nl - blog post 2019
- involved: AWS, NotReady nodes, SystemOOM, Helm, ElastAlert, no resource limits set
- impact: user experience affected for internally used tools and dashboards
Kubernetes Load Balancer Configuration - Beware when draining nodes - DevOps Hof - blog post 2019
- involved: GCP Load Balancer, externalTrafficPolicy, ingress-nginx
- impact: total ingress traffic outage
On Infrastructure at Scale: A Cascading Failure of Distributed Systems - Target - Medium post January 2019
- involved: on-premise, Kafka, large cluster, Consul, Docker daemon, high CPU usage
- impact: development environment outage
How NOT to do Kubernetes - Sr. SRE Medya Ghazizadeh - Google - Cloud Native Meetup Sep 2018
- involved: public container registry, ingress wildcard, image size, replica count, 12factor
- impact: security, stability of clusters
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Zalando - DevOpsCon Munich 2018
- involved: AWS, Ingress, CronJob, etcd, flannel, Docker, CPU throttling
- impact: production outages
Outages? Downtime? - Veracode - blog post 2018
- involved: AWS, AWS IAM, region migration, kubespray, Terraform, pod CIDR
- impact: QA/dev cluster outage
NRE Labs Outage Post-Mortem - NRE Labs - blog post 2018
- involved: GCP, kubeadm, etcd, Terraform, livenessProbe
- impact: production outage
A Perfect DNS Storm - Toyota Connected - blog post 2018
- involved: Azure, DNS, ndots:5, Alpine musl libc
- impact: DNS resolution failures
Kubernetes and the Menace ELB, the tale of an outage - Turnitin - blog post 2018
- involved: AWS, kube-aws, ELB dynamic IPs, API server, kubelet, NotReady nodes
- impact: 15 minutes cluster outage
Moving the Entire Stack to K8s Within a Year – Lessons Learned - ThredUP - DevOpsStage 2018
- involved: AWS, kops, HAProxy, livenessProbe, DNS, too many open files
- impact: unknown outages, DNS errors
AirMap Platform Service Outage - AirMap - incident report 2018
- involved: Azure, NotReady nodes, kubelet PLEG, CNI
- impact: production AirMap platform outage
Anatomy of a Production Kubernetes Outage - Monzo - KubeCon Europe 2018
- involved: AWS, etcd, Linkerd, NullPointerException, gRPC client, services without endpoints, incompatible Kubernetes API change
- impact: production ledger/platform outage
Stories from the Playbook - Google - KubeCon Europe 2018
- involved: GKE, etcd, Docker daemon, Image registry, dnsmasq vulnerabilities
- impact: production outage
101 Ways to "Break and Recover" Kubernetes Cluster - Oath/Yahoo - KubeCon Europe 2018
- involved: on-premise, namespace deletion, domain name collision, NotReady nodes, etcd empty dir, TLS certificate refresh, DNS issues, OOM
- impact: unknown cluster outages
101 Ways to Crash Your Cluster - Nordstrom - KubeCon North America 2017
- involved: AWS, NotReady nodes, OOM, eviction thresholds, ELB dynamic IPs, kubelet, cluster autoscaler, etcd split
- impact: full production cluster outage, other outages
Major Outage: Current account payments may fail - Monzo - Monzo Community post 2017
- involved: AWS, etcd, Linkerd, NullPointerException, services without endpoints
- impact: major production outage, full platform outage, current account payments fail
Fallacies of Distributed Computing with Kubernetes on AWS - Zalando - AWS User Group Hamburg October 2017
- involved: AWS, unhealthy nodes, Ingress, CronJob
- impact: production outage
Search and Reporting Outage - Universe - incident report 2017
- involved: Job, RestartPolicy, consume node resources
- impact: production Universe search and reporting outage
Our First Kubernetes Outage - Saltside - blog post 2017
- involved: AWS, kops, Helm, NotReady nodes, resource exhaustion
- impact: nonproduction cluster outage
Our Failure Migrating to Kubernetes - Saltside - blog post 2017
- involved: AWS, kops, ELB, BackendConnectionErrors, LoadBalancer service
- impact: aborted application migration
SaleMove US System Issue - SaleMove - incident report 2017
- involved: AWS, ELB dynamic IPs, DNS A record for master, API server
- impact: production issues with SaleMove US System

Why

Kubernetes is a fairly complex system with many moving parts. Its ecosystem is constantly evolving and adding even more layers (service mesh, ...) to the mix. Considering this environment, we don't hear enough real-world horror stories to learn from each other! This compilation of failure stories should make it easier for people dealing with Kubernetes operations (SRE, Ops, platform/infrastructure teams) to learn from others and reduce the unknown unknowns of running Kubernetes in production. For more information, see the blog post.

Contributing

Please help the community and share a link to your failure story by opening a Pull Request! Failure stories can be anything like blog posts, conference/meetup talks, incident postmortems, tweetstorms, ...

I would also be glad to hear about your failure stories on Twitter: my handle is @try_except_
