├── DistributedProgrammingRPC ├── README.md └── credits.md ├── Halo4 └── Readme.md ├── PapersWeLove └── README.md ├── README.md ├── Sagas └── README.md ├── ScalingStatefulServices └── readme.md ├── SoWeHearYouLikePapers └── README.md ├── TacklingAlertFatigue ├── README.md ├── credits.md └── runbook.md ├── TheVerificationOfADistributedSystem ├── README.md └── credits.md └── TwitterObservability └── README.md /DistributedProgrammingRPC/README.md: -------------------------------------------------------------------------------- 1 | # DistributedProgrammingTalk 2 | References and Credits for A Brief History of Distributed Programming: RPC. To be given at CodeMesh 2016 [[Slides](https://speakerdeck.com/caitiem20/a-brief-history-of-distributed-programming-rpc)] [[Video](https://www.youtube.com/watch?v=aDWZyYHj2XM)] 3 | 4 | #Abstract 5 | While many of the distributed systems we operate today are built with languages like Java and Go, distributed programming has a long history of innovation and adoption of its ideas. This includes innovations seen throughout the various fields of computing: novel type systems for dynamic languages; the concept of the promise, now a standard programming technique in web development; and unified models of programming when data lives across nodes. Some of these ideas had major impact, while some fell incredibly short. Many technically superior ideas were not adopted simply because they were too “research” focused. 6 | During this talk, we will present the history of RPC and why RPC may not be the best abstraction for building your next distributed application. 
7 | 8 | #Resources 9 | * [Node.js](https://nodejs.org/en/about/) 10 | * [The Go Programming Language](https://golang.org/) 11 | * [Finagle](https://twitter.github.io/finagle/) 12 | * [gRPC](http://www.grpc.io/) 13 | * [Blog: Remote Procedure Call](https://christophermeiklejohn.com/pl/2016/04/12/rpc.html) by Christopher Meiklejohn 14 | * [RFC 674](https://tools.ietf.org/html/rfc674) 15 | * [RFC 684](https://tools.ietf.org/html/rfc684) 16 | * [RFC 707](https://tools.ietf.org/html/rfc707) 17 | * [Implementing Remote Procedure Calls](http://www.cs.virginia.edu/~zaher/classes/CS656/birrel.pdf) 18 | * [A Critique of the Remote Procedure Call](http://www.cs.vu.nl/~ast/Publications/Papers/euteco-1988.pdf) 19 | * [RFC 1094 NFS](https://tools.ietf.org/html/rfc1094) 20 | * [A Note on Distributed Computation](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.41.7628&rep=rep1&type=pdf) 21 | * Spores 22 | * [Strange Loop Talk](https://www.youtube.com/watch?v=coX9RKH4rOs) by Heather Miller 23 | * [Spores: A Type-Based Foundation for Closures in the Age of Concurrency and Distribution](https://infoscience.epfl.ch/record/191239/files/spores_1.pdf) 24 | -------------------------------------------------------------------------------- /DistributedProgrammingRPC/credits.md: -------------------------------------------------------------------------------- 1 | #Image Credits 2 | * https://thenounproject.com/search/?q=laptop+user&i=512528 3 | * https://thenounproject.com/term/user/512525/ 4 | * https://thenounproject.com/term/browser-cloud/523468/ 5 | * https://thenounproject.com/search/?q=database&i=9658 6 | * https://thenounproject.com/search/?q=phone&i=565365 7 | -------------------------------------------------------------------------------- /Halo4/Readme.md: -------------------------------------------------------------------------------- 1 | ## Building the Halo 4 Services with Orleans 2 | Given at QCon London 2015 [[Video](https://www.infoq.com/presentations/halo-4-orleans)] 
[[Slides](https://speakerdeck.com/caitiem20/qcon-london-2015-building-the-halo-4-services-with-orleans)] 3 | 4 | ### Summary 5 | Caitie McCaffrey gives an overview of Orleans, the challenges faced when building the Halo 4 services, and why the Actor Model and Orleans in particular were utilized to solve these problems. 6 | 7 | ## Architecting and Launching the Halo 4 Services 8 | Given as the evening Keynote at SRECon 2015 [[Video](https://www.usenix.org/conference/srecon15/program/presentation/mccaffrey)] [[Slides](https://speakerdeck.com/caitiem20/architecting-and-launching-the-halo-4-services-sre-con-15)] 9 | 10 | ### Abstract 11 | Halo 4 is a first-person shooter on the Xbox 360, with fast-paced, competitive gameplay. To complement the code on disc, a set of services was developed and deployed in Azure to store player statistics, display player presence information, deliver daily challenges, modify playlists, catch cheaters, and more. As of June 2013, Halo 4 had 11.6 million players who played 1.5 billion games, logging 270 million hours of gameplay. 12 | 13 | The Halo 4 services were built from the ground up to support high demand, low latency, and high availability. In addition, video games have unique load patterns where the majority of the traffic and sales occur within the first few weeks after launch, making this a critical time period for the game and supporting services. Halo 4 went from 0 to 1 million users on day 1, and 4 million users within the first week. 14 | 15 | This talk will discuss the architectural challenges faced when building these services and how they were solved using Windows Azure and Project Orleans. In addition, we'll discuss the path to production, some of the difficulties faced, and the tooling and practices that made the launch successful. 
16 | 17 | -------------------------------------------------------------------------------- /PapersWeLove/README.md: -------------------------------------------------------------------------------- 1 | The accompanying repository for all of the Papers We Love Talks I've given 2 | 3 | ## Orleans: Virtual Actors For Programmability & Scalability. 4 | Given at Papers We Love DC: September 2014 & Papers We Love SF: February 2015. [[Video](https://www.youtube.com/watch?v=gY8zKZUazvo)] [[Slides](https://speakerdeck.com/caitiem20/papers-we-love-sf-orleans-distributed-virtual-actors-for-programmability-and-scalability)] [[Paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Orleans-MSR-TR-2014-41.pdf)] 5 | 6 | ## Sagas 7 | Given at Papers We Love SF: April 2016 [[Video](https://youtu.be/7dc4Tl5ZHRg?list=PLGRqfvsPiRSih6qb8PRAQYQV9dq9pMgNX)] [[Slides](https://speakerdeck.com/caitiem20/papers-we-love-sf-sagas)] [[Paper](https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf)] 8 | 9 | ## Simple Testing Can Prevent Most Critical Failures 10 | Given at Papers We Love NY: June 2016 [[Video](https://www.youtube.com/watch?v=-3tw2MYYT0Q&feature=youtu.be&t=1h6m17s)] [[Slides](https://speakerdeck.com/caitiem20/pwl-ny-simple-testing-can-prevent-most-critical-failures)] [[Paper](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf)] 11 | 12 | ## Detection of Mutual Inconsistency in Distributed Systems 13 | Given at Papers We Love PDX: June 2016 [No Video] [[Slides](https://speakerdeck.com/caitiem20/papers-we-love-pdx)] [[Paper](http://zoo.cs.yale.edu/classes/cs422/2013/bib/parker83detection.pdf)] 14 | 15 | ## Distributed Programming in Argus 16 | Given at Papers We Love SF: January 2017 & Papers We Love SEA November 2017 [[Slides](https://speakerdeck.com/caitiem20/argus-papers-we-love)] [[Paper](https://pdos.csail.mit.edu/6.824/papers/argus88.pdf)] 17 | 18 | 19 | 20 | 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Talks 2 | This repository contains resources, references & credits for talks I've given. 3 | -------------------------------------------------------------------------------- /Sagas/README.md: -------------------------------------------------------------------------------- 1 | ##Applying the Saga Pattern 2 | This talk was given at Craft Conf 2015 & Goto Chicago 2015. [[Video](https://www.youtube.com/watch?v=xDuwrtwYHu8&index=46&list=PLEx5khR4g7PKFs3Y-gWd8TX4Y_5yTyUTP)] [[Slides](https://speakerdeck.com/caitiem20/applying-the-saga-pattern)] 3 | 4 | ###Abstract 5 | As we build larger, more complex applications and solutions that need to do collaborative processing, the traditional ACID transaction model using coordinated 2-phase commit is often no longer suitable. More frequently we have long-lived transactions or must act upon resources distributed across various locations and trust boundaries. The Saga Pattern is a useful model for long-lived activities and distributed transactions without coordination. 6 | 7 | Sagas split work into a set of transactions whose effects can be reversed even after the work has been performed or committed. If a failure occurs, compensating transactions are performed to roll back the work. So at its core the Saga is a failure management pattern, making it particularly applicable to distributed systems. 8 | 9 | In this talk, I'll discuss the fundamentals of the Saga Pattern and how it can be applied to your systems. In addition, we'll discuss how the Halo 4 Services successfully made use of the Saga Pattern when processing game statistics, and how we implemented it in production. 
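The compensation idea described in the abstract can be illustrated with a minimal in-memory sketch. This is an assumption-laden illustration, not the Halo 4 implementation: `SagaStep`, `run_saga`, and the order-processing step names are hypothetical, and a production saga would also need a durable saga log and idempotent, retryable compensations.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    """One unit of work in a saga (hypothetical type for illustration)."""
    name: str
    action: Callable[[], None]       # forward transaction
    compensate: Callable[[], None]   # reverses the effects of `action`


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # The failed step never completed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
    return True


def demo() -> List[str]:
    """Three-step saga whose second step fails; returns the event log."""
    log: List[str] = []

    def charge_card() -> None:
        raise RuntimeError("payment declined")

    steps = [
        SagaStep("reserve_stock",
                 lambda: log.append("reserve_stock"),
                 lambda: log.append("release_stock")),
        SagaStep("charge_card",
                 charge_card,
                 lambda: log.append("refund_card")),
        SagaStep("ship_order",
                 lambda: log.append("ship_order"),
                 lambda: log.append("cancel_shipment")),
    ]
    ok = run_saga(steps)
    log.append(f"saga_ok={ok}")
    return log
```

In this sketch, `demo()` reserves stock, fails while charging the card, and then releases the stock again; the shipment step never runs. Compensating in reverse order is a design choice that mirrors how the forward steps built on one another.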
10 | -------------------------------------------------------------------------------- /ScalingStatefulServices/readme.md: -------------------------------------------------------------------------------- 1 | # Scaling Stateful Services 2 | The accompanying repository for the Scaling Stateful Services talk 3 | * V1 StrangeLoop 2015: [Video](https://www.youtube.com/watch?v=H0i_bXKwujQ), [Slides](https://speakerdeck.com/caitiem20/building-scalable-stateful-services), [High Scalability Article](http://highscalability.com/blog/2015/10/12/making-the-case-for-building-scalable-stateful-services-in-t.html), [InfoQ Article](http://www.infoq.com/news/2015/11/scaling-stateful-services) 4 | * V2 Craft Conf 2016: [Slides](https://speakerdeck.com/caitiem20/craftconf-2016-building-scalable-stateful-services#) 5 | * V4 Curry On 2016: [Video](https://www.youtube.com/watch?v=aJFxQAAMAQc) 6 | * V3 Nike Tech Talk 2016: [Slides](https://speakerdeck.com/caitiem20/building-scalable-stateful-services-1) 7 | 8 | ## Abstract 9 | The Stateless Service design principle has become ubiquitous in the tech industry 10 | for creating horizontally scalable services. However, our applications do have state; 11 | we have just moved all of it to caches and databases. Today, as applications are 12 | becoming more data intensive and request latencies are expected to be incredibly 13 | low, we’d like the benefits of stateful services, like data locality and sticky 14 | consistency. In this talk I will address the benefits of stateful services, 15 | how to build them so that they scale, and discuss distributed and scalable 16 | services in the real world that implement these techniques successfully. 
17 | 18 | ## Resources 19 | * Consistency Models 20 | * [Strong Consistency Models](https://aphyr.com/posts/313-strong-consistency-models) by Kyle Kingsbury 21 | * [Types of Consistency](http://www.cs.colostate.edu/~cs551/CourseNotes/Consistency/TypesConsistency.html) 22 | * [Consistency Diagram](http://www.vldb.org/pvldb/vol7/p181-bailis.pdf) by Peter Bailis et al 23 | * [Eventual Consistency Revisited](http://www.allthingsdistributed.com/2008/12/eventually_consistent.html) by Werner Vogels 24 | * Cluster Membership: Gossip Protocols 25 | * [Gossip Protocol](https://en.wikipedia.org/wiki/Gossip_protocol) 26 | * [Epidemic Algorithms for Replicated Database Maintenance](https://pdfs.semanticscholar.org/49ed/15db181c74c7067ec01800fb5392411c868c.pdf) 27 | * [Membership, Dissemination and Population Protocols](https://qconnewyork.com/ny2016/ny2016/presentation/membership-dissemination-and-population-protocols.html) Video to Come 28 | * Work Distribution: Consistent Hashing & Distributed Hash Tables 29 | * [Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web](https://www.akamai.com/es/es/multimedia/documents/technical-publication/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf) 30 | * [Dynamo: Amazon's Highly Available Key Value Store](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) uses Consistent Hashing 31 | * [Distributed Hash Table](https://en.wikipedia.org/wiki/Distributed_hash_table) 32 | * Real World Systems 33 | * [Scuba: Diving into Data at Facebook](https://research.facebook.com/publications/scuba-diving-into-data-at-facebook/) 34 | * [Twitter Nuthatch Service](https://blog.twitter.com/2016/observability-at-twitter-technical-overview-part-i) from Observability at Twitter: Technical Overview, part 1 35 | * Uber Ringpop 36 | * [Uber Ringpop](http://uber.github.io/ringpop/) 37 | * [Uber's 
Ringpop and the Fight for Flap Dampening](http://www.infoq.com/presentations/halo-4-orleans) 38 | * [SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol](https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf) 39 | * Orleans 40 | * [Microsoft Orleans](http://dotnet.github.io/orleans/) 41 | * [Orleans: Distributed Virtual Actors for Programmability and Scalability](http://research.microsoft.com/apps/pubs/default.aspx?id=210931) 42 | * [Building the Halo 4 Services with Orleans](http://www.infoq.com/presentations/halo-4-orleans) 43 | * [A Universal Modular ACTOR Formalism for Artificial Intelligence](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.77.7898) 44 | * Challenges & Lessons Learned 45 | * [Everything will Flow: Distributed Queues & Backpressure](https://www.youtube.com/watch?v=1bNOO3xxMc0&app=desktop) 46 | * [Fast Database Restarts at Facebook](https://research.facebook.com/publications/fast-database-restarts-at-facebook/) 47 | * Intro to Distributed Systems 48 | * [What we Talk about when we talk about Distributed Systems](http://videlalvaro.github.io/2015/12/learning-about-distributed-systems.html) 49 | 50 | ## Bio 51 | Caitie McCaffrey is a Backend Brat and Distributed Systems Diva at Twitter. Prior to that she spent the majority of her career building large scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO. 
Caitie has a degree in Computer Science from Cornell University, and has worked on several video games including Gears of War 2, Gears of War 3, Halo 4, and Halo 5. She maintains a blog at CaitieM.com and frequently discusses technology on Twitter @Caitie 52 | -------------------------------------------------------------------------------- /SoWeHearYouLikePapers/README.md: -------------------------------------------------------------------------------- 1 | So We Hear You Like Papers is a series of talks that bridge the gap between academia and industry, given by myself and [Ines Sombra](https://github.com/Randommood). Each version of the talk explores different academic concepts in distributed systems and how they apply to industry. 2 | 3 | ## So We Hear You Like Papers 4 | Given as the evening Keynote at QconSF 2015. [[Slides](https://speakerdeck.com/randommood/we-hear-you-like-papers-qcon-edition)] [[Video](https://www.infoq.com/presentations/papers-large-distributed-systems)] [[Resources](https://github.com/Randommood/QConSF2015)] 5 | 6 | ## So We Hear You Like Papers Too 7 | Given as a Keynote at Velocity Conf Santa Clara 2016 [[Slides](https://speakerdeck.com/randommood/we-hear-you-like-papers-velocity-edition)] [[Video](https://www.oreilly.com/ideas/so-we-hear-you-like-papers)] [[Resources](https://github.com/Randommood/Velocity2016)] 8 | 9 | ## So We Hear You Like Papers: Eventual Consistency 10 | Given at Women Who Code Sydney Meetup December 2016 [[Slides](https://speakerdeck.com/caitiem20/we-hear-you-like-papers-eventual-consistency)] 11 | 12 | ### Resources 13 | * [Detection of Mutual Inconsistency in Distributed Systems](http://zoo.cs.yale.edu/classes/cs422/2013/bib/parker83detection.pdf) 14 | * [Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](http://www.cs.berkeley.edu/~brewer/cs262b/update-conflicts.pdf) 15 | * [Peter Bailis - Papers We Love: Managing Update Conflicts in Bayou, A Weakly Connected Replicated Storage 
System](https://www.youtube.com/watch?v=txP7CI0PjO4) Talk 16 | * [Brewer's conjecture & the feasibility of consistent, available, partition-tolerant web](http://perso.telecom-paristech.fr/~kuznetso/INF346-2015/papers/cap.pdf) 17 | * [CAP Twelve Years Later: How the "Rules" Have Changed](http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed) 18 | * [Conflict-free Replicated Data Types](https://hal.inria.fr/inria-00609399v1/document) Talk from Codemesh IO 2016 19 | * [A Conflict-Free Replicated JSON Datatype](https://martin.kleppmann.com/2016/08/13/json-crdt.html) Martin Kleppmann et al. 20 | * [A Comprehensive study of Convergent and Commutative Replicated Data Types](https://hal.inria.fr/inria-00555588) 21 | * [Readings in Conflict Free Replicated Data Types](https://christophermeiklejohn.com/crdt/2014/07/22/readings-in-crdts.html) 22 | * [Conflict Resolution for Eventual Consistency](https://www.youtube.com/watch?v=8_DfwEpHE88) 23 | * [Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf) 24 | * [The Morning Paper: Feral Concurrency Control](http://blog.acolyer.org/2015/09/04/feral-concurrency-control-an-empirical-investigation-of-modern-application-integrity/) 25 | -------------------------------------------------------------------------------- /TacklingAlertFatigue/README.md: -------------------------------------------------------------------------------- 1 | # Tackling Alert Fatigue 2 | Accompanying Repository for the "Recovering From Alert Fatigue" talk given at [Monitorama 2016](http://monitorama.com/) [[Slides](https://speakerdeck.com/caitiem20/tackling-alert-fatigue)] 3 | 4 | ##Abstract 5 | Systems that generate numerous critical alerts cause alert fatigue, which can lead to service outages and developer burnout. My team at Twitter found themselves in this situation. 
The services had scaled by an order of magnitude in two years and were generating hundreds of alerts per quarter. Over the course of a quarter I led an initiative to decrease the number of alerts, improve the experience of being on call, and increase the reliability of the services. These efforts were incredibly successful, reducing the number of critical alerts by 50%. In this talk I’ll discuss the process and alerting best practices we’ve put in place to successfully combat alert fatigue and avoid over-alerting in the future. 6 | 7 | ##References 8 | * [Novel Approach to Cardiac Alarm Management on Telemetry Units](http://www.nursingcenter.com/pdfjournal?AID=2545317&an=00005082-201409000-00016&Journal_ID=54006&Issue_ID=2544216) 9 | * [How one Hospital Tweaks its EHR to fight alert fatigue](http://www.healthcareitnews.com/news/how-one-hospital-tweaks-its-ehr-fight-alert-fatigue) 10 | * [Applying Cardiac Alarm Management to your Oncall](http://fractio.nl/2014/08/26/cardiac-alarms-and-ops/) 11 | * [Checklist Manifesto](http://www.amazon.com/Checklist-Manifesto-How-Things-Right/dp/0312430000) 12 | * [Engineering for the Long Game](https://www.infoq.com/presentations/continuous-innovation-systems-organizations/?utm_source=lanyrd&utm_medium=coverage&utm_campaign=lanyrdsfvideos) 13 | * [WTF is Operations? 
#serverless](https://charity.wtf/2016/05/31/wtf-is-operations-serverless/) 14 | * [Devops for Developers Building an Effective Ops Org](http://www.ustream.tv/recorded/86181845) 15 | 16 | ##Observability at Twitter 17 | * [Technical Overview Part 1](https://blog.twitter.com/2016/observability-at-twitter-technical-overview-part-i) 18 | * [Technical Overview Part 2](https://blog.twitter.com/2016/observability-at-twitter-technical-overview-part-ii) 19 | * [Of the Order of Billions: Building Observability at Twitter](https://www.youtube.com/watch?v=SC6XuD1tgcQ) 20 | 21 | ##Related Tweets 22 | * (https://twitter.com/mrtazz/status/626107423443410944) 23 | 24 | 25 | ##Bio 26 | Caitie McCaffrey is a Backend Brat and Distributed Systems Diva at Twitter, where she is the Tech Lead of the Observability Team. Prior to that she spent the majority of her career building large scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO. Caitie has a degree in Computer Science from Cornell University, and has worked on several video games including Gears of War 2, Gears of War 3, Halo 4, and Halo 5. She maintains a blog at CaitieM.com and frequently discusses technology on Twitter @Caitie 27 | 28 | 29 | -------------------------------------------------------------------------------- /TacklingAlertFatigue/credits.md: -------------------------------------------------------------------------------- 1 | ##Image Credits 2 | * https://www.flickr.com/photos/nokiae51/13950633381/ 3 | * https://www.flickr.com/photos/richardsummers/523686414/ 4 | * https://www.flickr.com/photos/ppapadimitriou/9201073336/ 5 | * https://www.flickr.com/photos/kimeriksson/621870129/ 6 | * https://www.flickr.com/photos/debsilver/102508862/ 7 | * 
https://www.flickr.com/photos/yourcastlesdecor/14323386992/in/photolist-nPH6W9-7NYUFJ-bxTpuu-6Rjhwc-3uXmm-f17jkv-4T4Ds5-dzPRPZ-6j1PsU-7KvvDn-7NYULd-91k8BQ-phmhM4-7TR654-9LQDLV-5XFLND-oBDwNn-4Xb2L3-7HpUGE-7A4NKR-4m6HHq-3gt8jA-21dbc-pyNHEs-4mM8VA-neEQ8H-6NCaW3-a4gnBr-7NYUQY-dPoCq6-aiR1t8-4Shmo-666Dfo-aebp8e-6nDkE-6JLSsd-dtbpHT-ReyB5-cv9iV5-2sHHNm-bnmSkE-WXfjM-bqwpZN-m3GFx-jmiL6a-n1w19-eAsGk4-izhEW-8XMq8Z-666Cib 8 | * https://www.flickr.com/photos/glenscott/2593434622/ 9 | * https://www.flickr.com/photos/68877365@N07/9500125786/in/photolist-ftuCqb-9jezut-fwi5YG-cagS-6WvUYp-dRKkmS-4Hs3ot-24whQ-a6yntM-6Kti3J-fsnucL-vQhq-2gmnHY-37jAYS-ARLam-49drf1-58qnqs-ke7AtH-4WXJhG-G3uet-68kmaN-6WSRye-8aqiU-9sjC3v-5mYbEA-5ZH94L-5kNqGH-9YPeeT-2DWhMw-2hBDK-dV2Fo-rr64Q8-8vAHKK-29GEg-2CLtH7-6QV2r5-5u4z4L-JkC4-5peTZ7-F3tqq-fgigu2-dKESSo-2S72xK-qyS5Wd-nbqWJM-qnHWoq-m1osdw-7yC8FT-95K2x1-dKMvt7 10 | * https://www.flickr.com/photos/greenputty/5250996246/ 11 | * https://www.flickr.com/photos/warriorwoman531/5443359455/ 12 | * https://www.flickr.com/photos/n-r-t/2635927764/ 13 | -------------------------------------------------------------------------------- /TacklingAlertFatigue/runbook.md: -------------------------------------------------------------------------------- 1 | # Runbook Template 2 | 3 | ## Table of Contents 4 | A Table of Contents with links to main sections 5 | 6 | ## General 7 | A quick description of the service, 1 to 2 sentences max. Why does this service matter? What is its core functionality? What features does it provide users? 8 | 9 | ## Dashboards 10 | Links to the Dashboards for this service 11 | 12 | ## Alerts 13 | Links to the Alerts for this service 14 | 15 | For Every Alert there should be a corresponding section in alphabetical order 16 | ### Alert Title 17 | Alert Description: Why do we have this alert? What does it mean? What is typically the cause of this alert? 18 | 19 | #### Impact to Customers: 20 | How does this situation impact our customers? 
If the customers are not being impacted, this is a good indicator that the alert can be deleted. 21 | 22 | #### Remediation Steps: 23 | Checklist manifesto style steps for how to resolve this alert. A person who has never worked on our stack should be able to follow these steps and remediate the incident. If it cannot be remediated, include escalation steps here. 24 | 1. Do this 25 | 2. Check this graph 26 | 3. Do this thing 27 | 4. Do this other thing 28 | 5. Verify service has recovered 29 | 30 | ## Contact Info 31 | Team contact info. Potentially contact info for who to escalate to. What services do we have dependencies on? How do we escalate to them? Define this information here. 32 | 33 | ## Latest Deployments 34 | We do Production Change Management Deployments via Jira, so we include a link to all the latest changes here. Recent commits, CI logs, etc. are incredibly helpful in understanding what code is deployed to the system and what recent changes were made. 35 | 36 | ## Clusters 37 | Information on where this service is deployed, and how to access those machines. 38 | 39 | ## Deployment 40 | How do you deploy this service? Favor Checklist manifesto style lists here as well. 41 | 1. Do this thing 42 | 2. Do this other thing 43 | 3. Finally do this thing 44 | 45 | ### Canary Deploy 46 | Instructions on how to do a Canary Deployment 47 | 1. Do this canary thing 48 | 2. Do another canary task 49 | 50 | ### Rollback Deploy 51 | Instructions on how to Rollback a Deploy. 52 | 1. Get the rollback build here 53 | 2. Do this thing 54 | 3. Do this other thing. 
55 | 56 | 57 | 58 | -------------------------------------------------------------------------------- /TheVerificationOfADistributedSystem/README.md: -------------------------------------------------------------------------------- 1 | # The Verification of a Distributed System 2 | Accompanying Repository for The Verification of a Distributed System Talk to be given at 3 | * [GOTO Chicago 2016](http://gotocon.com/chicago-2016): [[Slides](https://speakerdeck.com/caitiem20/the-verification-of-a-distributed-system)][[Video](https://youtu.be/kDh5BrqiGhI?list=PLEx5khR4g7PIfvppVcaTPa5IKWTjoASRU)] 4 | * [Qcon New York 2016](https://qconnewyork.com/ny2016/presentation/verification-distributed-system) [[Slides](https://speakerdeck.com/caitiem20/qcon-newyork-2016-the-verification-of-a-distributed-system)][[Video](https://www.infoq.com/presentations/distributed-systems-verification)] 5 | * [YOW Melbourne 2016](http://melbourne.yowconference.com.au/) [[Slides](https://speakerdeck.com/caitiem20/the-verification-of-a-distributed-system-1)] 6 | * [YOW Brisbane 2016](http://brisbane.yowconference.com.au/) [[Slides](https://speakerdeck.com/caitiem20/the-verification-of-a-distributed-system-2)] 7 | * [YOW Sydney 2016](http://sydney.yowconference.com.au/) [[Slides](https://speakerdeck.com/caitiem20/the-verification-of-a-distributed-system-3)] 8 | 9 | ## Abstract 10 | Distributed Systems are difficult to build and test for two main reasons: partial failure & asynchrony. These two realities of distributed systems must be addressed to create a correct system, and often times the resulting systems have a high degree of complexity. Because of this complexity, testing and verifying these systems is critically important. In this talk we will discuss strategies for proving a system is correct, like formal methods, and less strenuous methods of testing which can help increase our confidence that our systems are doing the right thing. 
11 | 12 | ## References 13 | * [The Verification of a Distributed System](http://queue.acm.org/detail.cfm?id=2889274) 14 | * Formal Specifications 15 | * [Specifying Systems](http://research.microsoft.com/en-us/um/people/lamport/tla/book-02-08-08.pdf) 16 | * [Use of Formal Methods at Amazon Web Services](http://research.microsoft.com/en-us/um/people/lamport/tla/formal-methods-amazon.pdf) 17 | * [The Coq Proof Assistant](https://coq.inria.fr/) 18 | * [Simple Testing Can Prevent Most Critical Failures](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf) 19 | * Property Based Testing 20 | * [Haskell: Quick Check](https://hackage.haskell.org/package/QuickCheck) 21 | * [Erlang: Quick Check](http://www.quviq.com/products/erlang-quickcheck/) 22 | * [Other Quick Check Implementations](https://en.wikipedia.org/wiki/QuickCheck) 23 | * [ScalaCheck](https://www.scalacheck.org/) 24 | * [29 GIFs only ScalaCheck Witches will Understand](http://nerd.kelseyinnis.com/blog/2015/01/14/29-GIFs-only-scalacheck-witches-will-understand/) 25 | * [Quick Checking Riak](https://skillsmatter.com/skillscasts/4505-quickchecking-riak) 26 | * [Testing Eventual Consistency in Riak](https://www.youtube.com/watch?v=x9mW54GJpG0) 27 | * [Combining Model Checking and Testing](http://research.microsoft.com/pubs/200544/main.pdf) 28 | * [Testing AUTOSAR Software with QuickCheck](http://ieeexplore.ieee.org/xpl/login.jsp?reload=true&tp=&arnumber=7107466&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D7107466) 29 | * [Modeling Eventual Consistency Databases with QuickCheck](https://vimeo.com/23220830) 30 | * [The Mysteries of Dropbox](https://vimeo.com/158002499) 31 | * Fault Injection 32 | * [Jepsen](http://jepsen.io/) 33 | * [Netflix Simian Army](http://techblog.netflix.com/2011/07/netflix-simian-army.html) 34 | * Game Days 35 | * [Resilience Engineering: Learning to Embrace Failure](https://queue.acm.org/detail.cfm?id=2371297) 36 | * [Game Day Exercises at 
Stripe: Learning from `kill-9`](https://stripe.com/blog/game-day-exercises-at-stripe) 37 | * Systems Complexity Model from [Architectural Patterns of Resilient Distributed Systems](https://github.com/Randommood/YOW2016) 38 | * Areas of Research 39 | * [Cause I'm Strong Enough: Reasoning about Consistency Choices in Distributed Systems](https://pages.lip6.fr/Marc.Shapiro/papers/CISE-POPL-2016.pdf) 40 | * [The CISE Tool: Proving Weakly Consistent Applications Correct](https://hal.inria.fr/hal-01279495v1/document) 41 | * [CISE Tool Demo](https://www.youtube.com/watch?v=HJjWqNDh-GA) 42 | * [Github: Syncfree/CISE](https://github.com/SyncFree/CISE) 43 | * [Syncfree CISE website](https://syncfree.lip6.fr/index.php/2-uncategorised/51-cise) 44 | * [IronFleet: Proving Practical Distributed Systems Correct](http://research.microsoft.com/apps/pubs/default.aspx?id=255833) 45 | * [Dafny](http://research.microsoft.com/en-us/projects/dafny/) 46 | * Lineage-Driven Fault Injection aka Molly 47 | * [Lineage-Driven Fault Injection](http://people.ucsc.edu/~palvaro/molly.pdf) 48 | * [Sigmod 2015 Slides](http://www.slideshare.net/palvaro/lineagedriven-fault-injection-sigmod15) 49 | * [Automated Failure Testing at Netflix](http://techblog.netflix.com/2016/01/automated-failure-testing.html) 50 | * ["Monkeys in Lab Coats": Applied Failure Testing Research at Netflix](http://www.infoq.com/presentations/failure-test-research-netflix) 51 | * [Automating Failure Testing Research at Internet Scale](https://people.ucsc.edu/~palvaro/socc16.pdf) 52 | * [Orchestrated Chaos: Applying Failure Testing Research at Scale](https://www.youtube.com/watch?v=QOTNBKx9Irc) 53 | * [Towards Property Based Consistency Verification](http://www.eurecom.fr/fr/publication/4874/download/ds-publi-4874.pdf) 54 | * [Certified Causally Consistent Distributed Key Value Stores](http://people.csail.mit.edu/lesani/companion/popl16/POPL16.pdf) 55 | * [Planning for Change in a Formal Verification of the Raft Consensus 
Protocol](https://homes.cs.washington.edu/~mernst/pubs/raft-proof-cpp2016.pdf) 56 | 57 | 58 | ## Bio 59 | Caitie McCaffrey is a Backend Brat and Distributed Systems Diva at Twitter. Prior to that she spent the majority of her career building large scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO. Caitie has a degree in Computer Science from Cornell University, and has worked on several video games including Gears of War 2, Gears of War 3, Halo 4, and Halo 5. She maintains a blog at [CaitieM.com](https://caitiem.com/) and frequently discusses technology on Twitter [@Caitie](https://twitter.com/caitie) 60 | -------------------------------------------------------------------------------- /TheVerificationOfADistributedSystem/credits.md: -------------------------------------------------------------------------------- 1 | # Image Credits 2 | * https://www.flickr.com/photos/allenthepostman/2701166533/ 3 | * https://www.flickr.com/photos/ehktang/6820136333/ 4 | * https://www.flickr.com/photos/craig21/16643183961/ 5 | * https://www.flickr.com/photos/pachytime/3056606057/ 6 | * https://www.flickr.com/photos/thomashawk/2727316420/in/photolist-5a1d9U-fkCzt8-forakw-fqXk3K-6GJsvL-3ewM6b-6ce9oL-d2pm5u-8Mc28D-3N8HVk-eayWoU-5EmFFT-6TWLkm-2GWc8n-4EcLbE-7FkuXb-6P9wA3-KDc9V-EYU7a-3BazNK-4TCRby-4kg9Vc-c19jP-6HDx7s-2TFPK-4HoTVj-aTcna2-b7zicc-6cFfeE-4EP5bm-5728Ar-5bT6kU-kMsUp-5FtsrE-c19jN-b4BqAt-9gb1q-55VTSc-5F9ZXB-pLN4Xm-4KXv84-2ESbDb-fERJqW-nWpqou-5X6fz7-fPsDJr-9zKXNy-nWpbLw-ftqE1w-gXdGn4/ 7 | * https://www.flickr.com/photos/rob1501/6115982967/ 8 | * https://www.flickr.com/photos/naathas/3319386898/ 9 | * https://www.flickr.com/photos/pmillera4/21466941223/ 10 | * https://www.flickr.com/photos/zionnps/6198493225/ 11 | * https://www.flickr.com/photos/leemt2/222443032/ 12 | * https://www.flickr.com/photos/joemar/1573687605/ 13 | * https://www.flickr.com/photos/leemt2/203359933/ 14 | * 
https://www.flickr.com/photos/doc44/8287100528/in/photolist-dCiygA-dBGHVZ-5AcVxm-dCpjhQ-dyv3Fn-ddyLx4-au6zsk-9E6zPz-dA6itL-dEen9D-dymHsJ-dSNAUn-dzc3VM-dyv39K-ddyMUL-q9jQcL-dyY7Gn-dANxgh-dBgxtL-dyPqxd-dB9VbR-dyuZZp-dzYv6R-6d9ozc-iDdQ2e-isNKTU-dyw2oF-dBb6Di-aufP4G-dyQwfu-duqS3h-dyv2ya-dwYEev-hQ8LQW-GwAi6Y-dHKhAx-7mYLLB-4BJZe5-eQVZVA-dfnVr2-dCiTd8-dxxsgY-6UX11V-dzhz9f-dziB5q-6RLFTu-dyBAg5-dyw5zB-dAGVvM-dAXoz2 15 | * https://www.flickr.com/photos/eepie/14433673/ 16 | * https://www.flickr.com/photos/kookr/7056077277/ 17 | * https://www.flickr.com/photos/omnia_mutantur/2468322428/in/photolist-4L7Ngf-bVWDFH-cSgeN1-asEyuf-axUKLP-brhFSE-cxTaWQ-aqBKGe-a9eJbP-kfksEJ-h5PkBA-asgFQ7-axjynd-FWnjB6-ddAV5G-ddASU4-9B2imo-5K73G-ddASVA-FU4Cb3-7Chmw3-Fwg8sd-FU69LQ-aym8cu-7ihaXW-F1VcMo-bVAFKn-cPz33A-FweM25-FU5YZq-FU4byb-FU4H4Q-FU4nh5-eJdVzG-apXPSV-e4jRMd-cxT8zu-5HzZ1c-ebfjTh-ddAT4q-ddATCJ-ddARUM-52fNs-hjCbb-dAz4Tu-6N8Zv-as3GF-6fnRjs-4u4QR1-ebfjYU 18 | * https://www.flickr.com/photos/44055945@N06/7224288232/ 19 | * https://www.flickr.com/photos/vickisnature/6465872605/ 20 | * https://www.flickr.com/photos/pinti1/15933819917/ 21 | * https://www.flickr.com/photos/yokohamayomama/5778717345/ 22 | * https://www.flickr.com/photos/dakiny/15079476520/ 23 | * https://www.flickr.com/photos/dakiny/15079685280/ 24 | * https://www.flickr.com/photos/vickisnature/4830596406/ 25 | * https://www.flickr.com/photos/leemt2/284068408/ 26 | -------------------------------------------------------------------------------- /TwitterObservability/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## On the Order of Billions: Building Observability at Twitter 3 | This talk was given at Twitter Flight 2015. 
[[Video](https://www.youtube.com/watch?v=SC6XuD1tgcQ)] [[Slides](https://speakerdeck.com/caitiem20/of-the-order-of-billions-building-observability-at-twitter)] 4 | 5 | ### Abstract 6 | Every minute, Twitter’s Observability stack processes 1.5+ billion metrics in order to provide visibility into Twitter’s distributed microservices architecture. This talk will focus on some of the challenges associated with building and running this large-scale distributed system. We will also focus on lessons learned and on how to build services that scale, which apply to services of any size. 7 | 8 | ### Related Articles 9 | * [Observability at Twitter: technical overview, Part 1](https://blog.twitter.com/2016/observability-at-twitter-technical-overview-part-i) 10 | * [Observability at Twitter: technical overview, Part 2](https://blog.twitter.com/2016/observability-at-twitter-technical-overview-part-ii) 11 | --------------------------------------------------------------------------------