├── README.md
├── cloudformation_terraform_aargh.md
├── cluster_level_monitoring.md
├── green_field_in_2017.md
├── paved_paths.md
├── platforms_cf_k8s_etc.md
└── scaling_deployments.md


/README.md:
--------------------------------------------------------------------------------
1 | # Notes from Scale Summit 2017
2 | 
3 | These are my notes from the sessions I attended
4 | at [Scale Summit](http://www.scalesummit.org) 2017.
5 | 
6 | I did the same in previous years:
7 | 
8 | - https://github.com/bazbremner/scalesummit-2016-notes
9 | - https://github.com/bazbremner/scalesummit-2015-notes
10 | 
11 | # From other attendees
12 | 
13 | I've not yet seen notes from other sessions - if you have some, please
14 | open a PR on this README to include a link!
15 | 
--------------------------------------------------------------------------------
/cloudformation_terraform_aargh.md:
--------------------------------------------------------------------------------
1 | What is it that people build around Terraform, CloudFormation and
2 | similar tools?
3 | OP uses CloudFormation orchestrated with Ansible, plus some Terraform.
4 | 
5 | How should I organise my code? How should I coordinate things if I've
6 | got 20 bits of code? How should I be testing it?
7 | How do you get your infrastructure orchestration to the point that
8 | you're comfortable?
9 | What's beyond hello world?
10 | 
11 | One service provider has migrated from CloudFormation to
12 | Terraform. Probably using about 50% of all of the Terraform providers.
13 | Separate runs for DNS, infrastructure, networking.
14 | Use tfvars files for each environment.
15 | Use a "Terrafile", which is a bit like Bundler/Gemfiles for Terraform
16 | modules. Have a wrapper that pulls modules into a directory.
17 | Test with rspec, just testing against a Terraform plan. And use
18 | Jenkins to run make and then terraform plan/apply.
19 | 
20 | Use GitHub PRs; expect devs to supply the output of the plan as part
21 | of the PR.
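The plan-testing idea above (rspec against a Terraform plan) can be sketched outside rspec too. A minimal Python sketch, assuming a JSON plan as produced by newer Terraform versions (`terraform plan -out=plan.bin && terraform show -json plan.bin > plan.json`; at the time of these notes people parsed the text plan instead). The "no deletions" policy check is illustrative, not from the session.

```python
# Sketch: assert properties of a Terraform JSON plan before applying.
# Assumes the plan JSON schema with "resource_changes"/"change"/"actions".
import json

def destructive_changes(plan: dict) -> list:
    """Return addresses of resources the plan would delete (or replace)."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]

# Example plan fragment, shaped like `terraform show -json` output:
plan = json.loads("""
{"resource_changes": [
  {"address": "aws_instance.web", "change": {"actions": ["update"]}},
  {"address": "aws_iam_role.old", "change": {"actions": ["delete", "create"]}}
]}
""")
assert destructive_changes(plan) == ["aws_iam_role.old"]
```

A CI job could run a check like this between `terraform plan` and `terraform apply` and fail the build on unexpected destroys.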
22 | Each Terraform module is a separate git repo, with roughly one repo
23 | per VPC that pulls this code in.
24 | 
25 | Use remote state or data sources to link the results of different runs.
26 | 
27 | So: how do you walk the graph if multiple runs affect each other?
28 | 
29 | Jenkins jobs are running plans against all environments and different
30 | layers of runs, to ensure changes have actually been applied
31 | everywhere. "Janky tooling over the top"
32 | 
33 | This appears to be the stage we're at at the moment.
34 | 
35 | I recommend people read Charity Majors' blog posts on Terraform if
36 | you're just coming at this problem: https://charity.wtf/tag/terraform/
37 | 
38 | What happens if modules are updated? Up to each project to decide if
39 | they want to update.
40 | Have two module paths: local modules, which are vendored/janky bits,
41 | and then the "Terrafile" for pulling in external modules.
42 | The local modules path is useful for seeing what's being
43 | overridden, and whether there are common patterns occurring.
44 | 
45 | Moving in a different direction: is there anyone using
46 | Terraform or CloudFormation and automatically testing it, perhaps in
47 | another AWS account, to check that what they expect to happen does get
48 | applied?
49 | 
50 | One team is using awspec to check that the resources that were
51 | expected were actually created.
52 | 
53 | Another team looked at a Python library called Terraform Validate that
54 | allows you to do some consistency checking without spinning anything up,
55 | but at the end of the day you really need to spin something up.
56 | 
57 | Atlassian have open sourced LocalStack, which mocks a limited number
58 | of AWS APIs.
59 | 
60 | What do people do if there's no support in a provider?
61 | 
62 | - AWS CloudFormation stack support in Terraform.
63 | - exec resources (but you probably don't want to do this)
64 | - Terraform external providers are a better alternative to execs.
65 | 
66 | Is anyone using pipelines for deploying?
67 | 
68 | [sorry, missed a little bit here]
69 | 
70 | Discussion moved on to adding and removing IAM permissions from the
71 | user running Terraform, to limit accidental deletions for example.
72 | 
73 | Concerns about setting up too many permissions in IAM. Locking down to
74 | different IP addresses.
75 | 
76 | Within Terraform AWS provider blocks, you can specify AWS account IDs
77 | to prevent applying changes to production or another customer by
78 | mistake, for example.
79 | 
80 | Another group create very targeted IAM policies for each service: walk
81 | the tree of all of the resources created by a human, and then use
82 | that policy after the initial run.
83 | 
84 | Some people are looking at grepping the Terraform source, or better still
85 | the debug log, and working out how to build an IAM profile from that.
86 | Alternatively, reconstruct it from CloudTrail logs.
87 | 
88 | Who was running CloudFormation and is now on Terraform, and do you have
89 | legacy code? (About 4-5 put their hands up, out of 50.)
90 | 
91 | CloudFormation was all JSON, no changesets, hard to work with. (About
92 | 18 months ago.) They haven't replaced everything.
93 | 
94 | Nice bits of CloudFormation:
95 | - CloudFormation abstracts the nastiness of the RDS API.
96 | - Autoscaling rolling updates: UpdatePolicy on the launch
97 | configuration; Terraform doesn't support this.
98 | 
99 | The workflow on CloudFormation is very prescriptive.
100 | 
101 | Another team are using CloudFormation generated from another tool, but
102 | it's pretty hairy, especially around IAM policies.
103 | 
104 | Troposphere mentioned - a tool that generates
105 | CloudFormation. Anyone doing something similar?
106 | 
107 | One team is taking YAML and generating Terraform for multiple DNS
108 | providers.
109 | 
110 | Cumulus is another tool mentioned.
111 | 
112 | Anyone using something that's not Terraform or CloudFormation?
113 | 
114 | One person is using ARM templates for Azure.
The APIs seem terrible.
115 | Terraform support for ARM templates seems good. The reason Terraform
116 | doesn't support Azure in "a better" way is that the upstream SDKs don't
117 | support it; Microsoft don't seem to be publishing Swagger files for
118 | those APIs. This person has a Trello board with all the issues they've seen.
119 | 
120 | CloudFoundry users would use BOSH.
121 | 
122 | Show of hands, how many use CF or Terraform for managing production:
123 | 
124 | About 5-10 for each. Lots of other people in the room.
125 | 
126 | Who is using other people's Terraform modules: zero hands.
127 | 
128 | Who has heard of and uses the community modules? A couple of people use
129 | them as a reference only.
130 | 
131 | Gruntwork have a blog post about the ugly bits of Terraform and
132 | potential (nasty) workarounds. Part of this series:
133 | https://blog.gruntwork.io/a-comprehensive-guide-to-terraform-b3d32832baca#.4910unu4e
134 | 
135 | Why don't people think the modules are good enough?
136 | 
137 | Puppet/Chef etc. use lots of attributes to change behaviour.
138 | For something like an ASG, it's pretty much every parameter.
139 | 
140 | I've not seen good examples of people composing modules from other
141 | modules. Seems problematic. Others agree that it is still a problem.
142 | 
143 | Understanding the impact of a low level module changing is very
144 | difficult.
145 | 
146 | Passing inputs and outputs through layers of modules is very difficult.
147 | 
148 | I wonder if it's worth templating out the Terraform code in a full
149 | blown language and feeding that to Terraform. Terraform's JSON input is
150 | a route to this.
151 | 
152 | Terraform _does_ support conditionals, but only on a single line, and
153 | this doesn't pan out if things are optional.
154 | 
155 | Some discussion about Helm vs Terraform; it seems they're solving
156 | different levels of difficulty. And we're out of time.
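The "template in a full-blown language, feed JSON to Terraform" route mentioned above can be sketched very simply: build the configuration as plain data and write it out as a `.tf.json` file Terraform can read alongside `.tf` files. The resource names and values here are illustrative only.

```python
# Sketch: generate Terraform JSON configuration from Python data.
import json

def dns_records(zone_id: str, records: dict) -> dict:
    """Render simple Route 53 A records as a Terraform JSON config dict."""
    resources = {}
    for name, ip in sorted(records.items()):
        # Terraform resource names can't contain dots.
        resources[name.replace(".", "_")] = {
            "zone_id": zone_id,
            "name": name,
            "type": "A",
            "ttl": "300",
            "records": [ip],
        }
    return {"resource": {"aws_route53_record": resources}}

config = dns_records("Z123", {"www.example.com": "10.0.0.1"})
# json.dumps(config, indent=2) would be written to e.g. dns.tf.json
```

Conditionals, loops and composition then happen in the host language, sidestepping the limits of Terraform's own interpolation.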
157 | 
--------------------------------------------------------------------------------
/cluster_level_monitoring.md:
--------------------------------------------------------------------------------
1 | The problem is that as the organisation scales, host level
2 | monitoring is either noisy or not helpful. Understanding the service
3 | as a whole is much more useful, but harder to work out.
4 | 
5 | Are people doing cluster/system level alerting? How do you do RCA?
6 | 
7 | One org tries to monitor and alert on business functionality. That's
8 | required writing services that can do monitoring.
9 | 
10 | Publishing system: monitoring checks that content appears against each
11 | API. Combined with synthetic requests.
12 | End to end checking like that is really valuable.
13 | 
14 | Synthetic requests are another internal service. Re-publish an old
15 | article and check that the article's time is updated.
16 | 
17 | So what if you can't really do it? Working with payments systems, it's
18 | hard to monitor real transactions.
19 | 
20 | One suggestion is to pay for a fake product and refund it. Turns out
21 | this really upsets fraud detection systems.
22 | 
23 | Monitoring users going through a checkout funnel. Watching a mix of
24 | application side and (in this case) PayPal responses. The ratio of
25 | requests and responses can be very different when there are problems.
26 | 
27 | Watching transaction statuses as they go through a process can be
28 | helpful.
29 | 
30 | Engineers add more value if they focus on high level
31 | tests. Hosts/containers are disposable.
32 | 
33 | When there is a problem at the system level, the original person would
34 | like to know what happened.
35 | 
36 | Another team have written their own tooling to write rules that collapse
37 | system level alerts and relate them to user-affecting
38 | alerts.
39 | 
40 | So, are we all doing an excellent job of monitoring then? ;)
41 | 
42 | Turning off Nagios emails was a big win for one team.
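The re-publish synthetic check described earlier (re-publish an old article, check its timestamp moves) can be sketched as a small polling function. `republish` and `fetch_article` stand in for real internal APIs and are entirely hypothetical here.

```python
# Sketch of a synthetic publishing check. The callables are injected so
# the real HTTP clients (hypothetical internal APIs) can be swapped in.
import time

def synthetic_publish_check(article_id, republish, fetch_article,
                            timeout=60, interval=5, sleep=time.sleep):
    """Return True if the article's timestamp updates after a republish."""
    before = fetch_article(article_id)["updated_at"]
    republish(article_id)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_article(article_id)["updated_at"] > before:
            return True  # the pipeline propagated the change end to end
        sleep(interval)
    return False  # alert: publishing pipeline looks broken
```

A monitoring system would run this on a schedule and alert on `False`, which exercises the whole publishing path rather than any single host.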
43 | 
44 | Engineers are rotating through improving alerts. If there are no
45 | alerts, improve things.
46 | 
47 | What is the user need? Can your monitoring tell you which user need
48 | isn't being satisfied? Alerts that tell you that seem a real
49 | improvement - it seems much more common to do this now.
50 | 
51 | Not alerting (especially out of hours), provided that users can still
52 | do what they want to do, can be a big win - it cuts down on alert fatigue
53 | and disrupted sleep.
54 | 
55 | 
56 | Back to the original point: it is very hard to map the fact that a user
57 | cannot do something to - for example - Cassandra compaction taking too
58 | long. Does anyone have ideas?
59 | 
60 | Being able to rely on the low level (instances, containers) being
61 | picked back up helps the focus: provided that some nodes
62 | are up and servicing users, it's OK.
63 | 
64 | On the other side, it is possible to have a problematic configuration
65 | that causes nodes to be restarted constantly without the team going
66 | and investigating for 6 months.
67 | 
68 | Ex: an AWS EC2 machine coming up, dying and being replaced every 2
69 | minutes, yet you're getting billed for the 1st hour.
70 | 
71 | I'd recommend when testing end-to-end, try to separate different bits
72 | of functionality/user journeys in a way that lets you distinguish important
73 | stuff from things that are not used.
74 | 
75 | Has anyone done service level monitoring for serverless systems yet?
76 | 
77 | No-one yet. No-one is using serverless for heavy production loads.
78 | One idea is to push information into logs and extract data from there.
79 | 
80 | What about things that aren't user affecting, but may become an issue,
81 | i.e. SLAs, latency, credits. Are people still monitoring those?
82 | 
83 | Yes, it depends on the users and the focus of your team. Also, if it
84 | transpires that no-one notices/cares, then perhaps not having monitoring
85 | on that is the right choice.
86 | 
87 | There's a balance between the value that these things bring and what
88 | the cost relationship is like.
89 | 
90 | The closer to the metal one team gets, the more metrics around systems
91 | they build. As it goes to a higher level, it's fluffier and harder to
92 | track and keep an eye on things.
93 | 
94 | Apropos SLAs: some services in one team need an out of hours response,
95 | but not everyone pays attention to that, so monitoring the things that
96 | you know will cause you to get out of bed is worth doing.
97 | 
98 | How do people monitor the things you depend on? Ex: an S3 failure.
99 | 
100 | End to end tests often check that, assuming you're covering all of
101 | your functionality that hits all external providers. No need to have
102 | specific tests.
103 | 
104 | There is an art to ensuring that when multiple tests fail, it's clear
105 | which particular problem they're encountering, otherwise there will
106 | be lost time in working out commonalities.
107 | 
108 | 1st line support may not have actually read the detail in alerts,
109 | which is its own problem.
110 | 
111 | Alert fatigue is also an issue - if you see the same alert time and
112 | time again, people will overlook that alert and perhaps others. Be
113 | prepared to turn alerts off until they are valuable (or bin them).
114 | 
115 | One team gets a (manually created) summary of all of the alerts that
116 | fired over the last 24 hours; it shows patterns.
117 | Several teams do end-of-shift or rotation handovers.
118 | 
119 | Agreed that it would be nice to have an automatically generated report
120 | of all alerts that fired.
121 | 
122 | Another group select 2 random developers to look at alerts, to get used
123 | to new pieces of the system.
124 | Two other groups have a similar rotation - team members rotate to handle
125 | support/on-call for a week.
126 | 
127 | An important note is that developers should leave their project work
128 | behind/hand it over so they can focus on the support work and not be
129 | torn.
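The automatically generated alert report the group wanted is mostly a grouping exercise. A minimal sketch, assuming alert events are available as (timestamp, alert_name) pairs from whatever alerting system is in use:

```python
# Sketch: collapse alert events from a time window into a count-ordered
# summary, the kind of thing a daily report or handover email could use.
from collections import Counter

def alert_summary(events, since, now):
    """Summarise alert names fired in the window (since, now]."""
    counts = Counter(name for ts, name in events if since < ts <= now)
    lines = [f"{count:4d}  {name}" for name, count in counts.most_common()]
    return "\n".join(lines) if lines else "no alerts fired"
```

Sorting by count puts the noisiest alerts at the top, which is exactly where the "turn it off or fix it" conversation should start.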
130 | 
131 | Etsy open sourced "Ops Weekly" to summarise the data that the on-call
132 | people had to deal with. https://github.com/etsy/opsweekly
133 | 
134 | 
--------------------------------------------------------------------------------
/green_field_in_2017.md:
--------------------------------------------------------------------------------
1 | Proposal/question: every year, there's something new at Scale
2 | Summit. Microservices, Sensu etc. in the past.
3 | 
4 | What does green field look like in 2017?
5 | 
6 | Provisioning servers, best practices, deployments, logging,
7 | monitoring - what would you do now?
8 | 
9 | What if you're _not_ using AWS? Should you just do that? Would prefer
10 | not to just have a list of AWS services.
11 | 
12 | Show of hands: about 75% are on AWS.
13 | 
14 | 2 said they wouldn't use AWS now, although they are currently on it.
15 | One for compliance reasons - knowing who can have access to
16 | data/machines etc.
17 | Another is running in China, on Alibaba; AWS is harder to get
18 | onto there.
19 | 
20 | One person mentioned that Google Cloud and related products sound
21 | interesting. The same person is not on GC or AWS, looking at both. AWS
22 | feels like the "no-one got fired for buying IBM" of the current age.
23 | 
24 | GC feels like it's behind on managed services - for example, a
25 | managed PostgreSQL.
26 | IAM feels like a real weakness with GC.
27 | 
28 | If you're running multiple platforms, then you're often forced to meet
29 | the lowest common denominator.
30 | 
31 | GC is cheaper, but it's missing a lot of things.
32 | 
33 | One person thinks GC might be offering 1st line support for free if
34 | you meet a set of up-front standards, though if you _can_ meet these
35 | standards, you are probably mature enough that you have SREs.
36 | 
37 | It sounds like Google are positioning Kubernetes as the abstraction
38 | layer: you can run on AWS, GC or others, and Google gets business that
39 | way.
40 | 
41 | Is anyone looking at Lambda or similar if you're starting a green
42 | field (which is what this session is about)?
43 | 
44 | Yes, says one person, who has just started there, but it's working well
45 | so far.
46 | "everyone has a function. Functions, functions for everyone!"
47 | (paraphrasing a little).
48 | 
49 | Making sure you're _not_ special should be your first action if you're
50 | starting now.
51 | You are not special, you just need to realise that.
52 | 
53 | High level services can be a good place to start - Lambda, Elastic
54 | Beanstalk etc.
55 | 
56 | Defer decisions until you can learn what you need on new projects. Use
57 | the high level services to help.
58 | 
59 | Are we all doing microservices? Do you want to spend the one- to two-
60 | thirds overhead to run a microservice model?
61 | 
62 | One answer: it depends on the team and the understanding of the problem
63 | space. If it's a well known problem, and a good team, perhaps.
64 | If the problem is unknown, it's hard to model the problem space, and
65 | refactoring across microservices is hard.
66 | 
67 | Does it have to be binary? Extract functionality as you understand it.
68 | 
69 | One platform has a mix of very small, simple microservices
70 | (i.e. issuing tokens and checking their validity), with a
71 | mini-monolith with the business logic in it whilst that problem space
72 | is being explored.
73 | Planning on breaking that up as the problem becomes better known.
74 | 
75 | It can take a good set of developers to write good interfaces etc. But
76 | microservices aren't going to save you here, nor will a monolith.
77 | 
78 | Write clean interfaces - ask yourself: "should this be a
79 | library?". Then it's easier to pull it out into a microservice.
80 | 
81 | One of the mistakes one person has seen is throwing too many
82 | people at the problem. Try to start
83 | with a small team (where small = 5 people - the whole team: BAs,
84 | product, developers).
This makes moving faster easier: more context,
85 | less communication overhead etc.
86 | 
87 | Being a small company might force/help some decisions - for example,
88 | choosing Lambda might help you not worry about a lot of things.
89 | 
90 | Mention of Etsy - choose boring technology. http://mcfunley.com/choose-boring-technology
91 | 
92 | Lambda _is_ a boring technology. A few people agree with this. There's
93 | magic under the covers.
94 | 
95 | "Greenfield 2017 - No JavaScript!" There was applause.
96 | 
97 | But how do you do that if you're on the web? HTML!
98 | 
99 | Mention of GOV.UK not using much JavaScript, to meet accessibility needs
100 | and degrade gracefully.
101 | 
102 | So what language would you use?
103 | 
104 | Python is a good choice - the person saying this hasn't had someone
105 | coming from another language complain.
106 | 
107 | It depends on the team - choose what works for the people you
108 | have/have hired - remember this is greenfield.
109 | 
110 | Let's not get into language wars.
111 | 
112 | So, just run everything on Lambda and go home? ;)
113 | 
114 | Lambda can do ~2k req/s out of the box, but you can scale up.
115 | 
116 | Going back a bit: AWS vs Google is partly an economic decision. Cost
117 | control in a new project is often a good idea - get more runway to
118 | explore the problem, work out what you need to build and ship.
119 | 
120 | There's opportunity cost too. You _can_ just throw money at the
121 | problem.
122 | 
123 | Someone asked if anyone does rearchitect based on cost drivers. "What
124 | cost? Developer time, the AWS bill, etc." So, are you trying to avoid
125 | big AWS bills, or are you trying to avoid getting paged at night?
126 | 
127 | Small company: as long as my AWS bill is cheaper than my developers,
128 | that's fine. Would rather have the developers focus on building things
129 | than spend time shaving small percentages off an AWS bill.
130 | 
131 | The original poster is not on AWS; they had lots of hardware everywhere
132 | and have consolidated on one provider. Now looking at AWS, but that's a
133 | big move. There's also all the sunk cost. AWS doesn't feel like a
134 | smaller move. The original poster is consuming a firehose of data.
135 | 
136 | Anyone with petabytes of data in the cloud?
137 | One response: baseline load on their own hardware, scaling up and
138 | dealing with seasonal peaks with AWS. Compute and, more interestingly,
139 | storage costs are coming down in the cloud to become much closer in price.
140 | 
141 | Getting data near users is a very good use for the cloud - until
142 | companies build datacentres in Brazil, China, wherever, cloud is the
143 | way to go.
144 | 
145 | Talk about security and how much you trust Google, AWS etc. Generally
146 | people do trust them. Block encryption under the covers. With random
147 | smaller cloud providers, it's up to you to work out how to do that yourself.
148 | 
149 | Moving on: what about supporting services, monitoring, logging etc.?
150 | 
151 | CloudWatch is terrible. One team is just shoving things into Logstash.
152 | 
153 | OP is using InfluxDB and driving monitoring off the back of that.
154 | 
155 | New Relic gets super expensive super quickly. Other people are running
156 | Datadog, which does seem cheaper.
157 | 
158 | One person is using AWS X-Ray; it works.
159 | 
160 | At a small size (4 devs, 4 ops), tools like New Relic really paid off
161 | to find problems quickly. It's harder to justify the cost vs dev time
162 | as you grow.
163 | 
164 | So, how do people make that trade off of build/buy? Mostly it seems
165 | pain based. The next session is buy vs build.
166 | 
167 | Has anyone looked at Influx vs Prometheus? Prometheus has great ingest
168 | performance. At the time, Prometheus had a linear projection function,
169 | which Influx added later.
170 | If you're doing a lot of requests, Prometheus is probably a better
171 | bet.
172 | There seems to be a lot of competition between the two.
--------------------------------------------------------------------------------
/paved_paths.md:
--------------------------------------------------------------------------------
1 | Do we want a chosen set of technology?
2 | And if you do, then how do you end up renewing that year after year?
3 | 
4 | Old tech may be a retention and recruitment problem - if people want
5 | to work on new tech, then they may leave to do that.
6 | 
7 | Yes, there are some people who want to play around with new things,
8 | but letting that trump everything comes at a cost.
9 | 
10 | There's a balance between organisational stability and employee
11 | happiness.
12 | 
13 | "Bi-modal IT" - the new shiny and the old stuff. If you have one team
14 | that gets to work on the new shiny, everyone else hates them.
15 | 
16 | If you're expecting a team to do out of hours support, then you can't
17 | force them to do that if it's not the right tool for the job.
18 | If something is cool and useful, then they will use it, but they have
19 | to be able to support it.
20 | 
21 | How do you 'nudge' teams towards good behaviour? Look at the "nudge unit".
22 | 
23 | Make things easy, accessible, timely and social:
24 | 
25 | - Self-service
26 | - Well documented
27 | - Make it easy
28 | - Social: if others are doing it and say it's great - think about
29 | lightning talks - examples of why it works for them can persuade
30 | others to go the same way.
31 | 
32 | For out of hours support: is each team doing their own out of hours,
33 | or is there an ops team?
34 | 
35 | 1st line support react to things going red
36 | 2nd line for escalation
37 | 3rd line
38 | 
39 | There are lots of different languages to support; it's complex to pay
40 | people to support all of these things.
41 | Runbooks, ways of reporting incidents - you can set standards around
42 | this.
43 | 
44 | A recommendation shout out for Susan J Fowler's Production-Ready
45 | Microservices, around setting standards for what a good service needs.
46 | 
47 | The organisation here is around things like runbooks, monitoring etc.
48 | 
49 | If you're going to phone people up, they're more interested in
50 | creating runbooks. Your future self needs a runbook about what you're
51 | doing today.
52 | 
53 | If you don't do out of hours support for your own service, you'll
54 | build a worse system.
55 | Make sure you empower teams to fix and address problems.
56 | 
57 | Getting back on topic: what about infrastructure choices - ELK,
58 | for example?
59 | 
60 | Q: How do people change that choice?
61 | A: normally through skunkworks, which isn't ideal.
62 | 
63 | Decisions should be recorded _with_ a lifetime for when they should be
64 | re-evaluated.
65 | 
66 | ADRs mentioned:
67 | http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions
68 | 
69 | Talk of pioneer, settler, town planner teams.
70 | http://blog.gardeviance.org/2012/06/pioneers-settlers-and-town-planners.html
71 | 
72 | Are there teams that have tried experimenting on "lower risk" areas?
73 | 
74 | Yes, but then eventually it gets into production, and it's a Real
75 | Thing.
76 | If you use the new shiny on something you don't care about, then it's
77 | hard to get invested in it.
78 | You want to have something worthy to show off.
79 | 
80 | But there are two approaches (probably more): experimenting with new
81 | tech, but also the case where the existing stack can't solve/be bent to
82 | handle a new need. Solving a need is a good reason to explore new
83 | tech.
84 | 
85 | If a team working on a new problem comes across a tech
86 | that solves that problem, it makes it easier to justify.
87 | The teams have to justify that the technology works to the rest of the
88 | organisation. There are some principal architects who can make
89 | suggestions.
There are some strategic changes - i.e. moving to AWS.
90 | 
91 | Many teams will adopt technology quite quickly.
92 | 
93 | Dealing with the long tail takes a very long time.
94 | 
95 | How do you get one team to sell to others?
96 | 
97 | A central tooling team will be approached by a team. The initial team
98 | will hopefully have something already in a reasonable state; the central
99 | tooling team will then polish it and distribute it to other teams to
100 | consider.
101 | 
102 | What happens if two large influential teams have differing opinions?
103 | 
104 | Principal engineers are expected to have the context and deep knowledge
105 | to be able to work these issues out. The expectation is that principals
106 | have both breadth and depth.
107 | This is from a service company where there is high value in
108 | standardisation.
109 | 
110 | If two groups are talking and they have a difference in opinion, then
111 | it probably _is valid_. The use cases _may_ be different.
112 | It may be necessary to mandate changes.
113 | 
114 | What does that communication look like?
115 | 
116 | - Get people working across teams.
117 | - Written documentation is the worst medium: "the side-bar of shame in
118 | Google Docs".
119 | - Run investigations.
120 | - Product owners have to understand that teams need time to investigate
121 | ("faffing around for a week").
122 | - Possibly have a tech radar: new tech, supported tech, deprecated
123 | tech.
124 | - Service catalogues.
125 | - Weekly demos from each engineer. Show and tells.
126 | 
127 | Are teams co-located? Does that make any difference?
128 | 
129 | - 1st: 70% remote
130 | - 2nd: 4 locations
131 | - 3rd: some colocated, other teams are outsourced
132 | 
133 | Lots of discussion happens asynchronously; people talk on Slack etc.
134 | 
135 | There are different sorts of people: magpies, planners etc.
136 | Do you always have to be moving? Have a rolling wave of progress,
137 | perhaps.
137 | 
138 | Another group rotate people through a support team - coming from the
139 | shiny teams, you're reminded that there are still systems to support.
140 | 
141 | Have roadmaps that are related and remind everyone what the common
142 | goal is.
143 | 
144 | Start new teams for exploration, then break them up to spread knowledge
145 | around.
146 | Make sure you're staffing such teams with people who will be around
147 | for the longer run.
148 | 
149 | Exploring new tech is a different problem to bringing in experts to
150 | build something quickly.
151 | 
152 | Avoid having people who don't have to run production services driving
153 | the adoption of new bleeding edge tech.
--------------------------------------------------------------------------------
/platforms_cf_k8s_etc.md:
--------------------------------------------------------------------------------
1 | Proposed because: people were talking about building things with
2 | sensible defaults, and the answers varied quite a lot.
3 | There's hardware, with long lead times; but going to the cloud and
4 | building lots of snowflakes is no good either.
5 | "Just" spinning up Kubernetes is harder than people make it out to
6 | be.
7 | 
8 | One team are running Kubernetes (k8s). A consistent API for teams is a
9 | good thing. It was a lot of blood, sweat and tears. The APIs aren't
10 | stable yet. Very little has hit the stable API layer at the moment.
11 | 
12 | 100 people, a few years old. No traditional infra history. 4 people
13 | building and running Kubernetes. 60-70% of their time has been
14 | Kubernetes wrangling.
15 | 
16 | They were running on Fleet, which was deprecated. The org was growing
17 | fast and needed to service many teams in a consistent way. k8s seemed
18 | like the most viable choice.
19 | "Batteries included, once you work out how to put the batteries in".
20 | 
21 | Using raw k8s. They have been modifying this and effectively building
22 | their own distribution.
23 | 
24 | Another team have a lot of ECS clusters and are looking at k8s at the
25 | moment.
26 | 
27 | k8s has different network ranges: node, pod, service and another
28 | range. The service range isn't routed in AWS, GCE etc. It's not best
29 | practice, but this team have requirements to route these, hence the
30 | changes.
31 | 
32 | Another team are looking at 1000+ nodes, looking at Calico for
33 | networking. Already at 100 nodes.
34 | 
35 | Who in the room is running a "platform" (i.e. CloudFoundry, k8s
36 | etc.)? 4-5 people put their hands up, in a room of ~35-50.
37 | 
38 | One team are moving from a platform they effectively built themselves.
39 | 
40 | So, is this the future direction?
41 | 
42 | Probably 60-70% of the room put their hands up.
43 | 
44 | More background: the UK Government has multiple PaaS - Docker,
45 | CloudFoundry. The US Government has chosen CF. What else is out there?
46 | Running CloudFoundry takes ~20 instances. It needs people to run it.
47 | 
48 | Question/point from elsewhere: is running a PaaS really a
49 | differentiator? Should you really be running that?
50 | 
51 | Kubernetes seems to be a good safe option and a way to move across
52 | multiple cloud providers.
53 | 
54 | Has anyone used Marathon, Mesos or DC/OS?
55 | 
56 | A few people have looked; DC/OS was dropped out of one trial
57 | quickly. Mesos seems to be in a bind - not a big dev community. As a
58 | do-everything platform it's outnumbered.
59 | There are lots more people at Pivotal working on CF. Docker have "all
60 | the cool". Google are effectively giving k8s away.
61 | 
62 | Using Mesos for scheduling work, it's good. For handling application
63 | workloads, it's not designed for it and is being outpaced by the
64 | competition.
65 | 
66 | Question: has anyone looked at these platforms and decided "not for us"?
67 | 
68 | A: yes; it's a hard sell to move if you have a running
69 | system.
For a greenfield system, maybe, but as discussed in the
68 | earlier greenfield session (see my notes in this repo), you
69 | probably want to choose a higher level option rather than sink time
70 | into platforms.
71 | 
72 | I'd quite like something that is not _as_ complex as a full
73 | platform. I have some compliance issues. One suggestion was AWS Elastic
74 | Beanstalk.
75 | 
76 | Docker Swarm seems troublesome.
77 | 
78 | Question: is anyone packing images onto instances? Also, is anyone
79 | experimenting with Convox?
80 | 
81 | No Convox answers.
82 | 
83 | One team is packing containers onto instances with Docker Swarm. They
84 | got something up and running with 3 people in a week.
85 | 
86 | Someone else said Docker Swarm was a good way to get people to think
87 | about how the world will look before they move to k8s. This got laughs,
88 | but seems like a good approach.
89 | 
90 | If you want to get going and try k8s, minikube or kubeadm come highly
91 | recommended.
92 | 
93 | So, how do you progress from that? How do you go from kops to running
94 | something in production, supporting it and knowing how to debug it?
95 | 
96 | The earlier commenter running k8s is thinking about this now. They have
97 | qualification tests and run them on top of the platform. Understanding
98 | how that fails taught them a lot.
99 | 
100 | So, how do you debug and fix things at 2AM when you're called out?
101 | 
102 | Most problems seem to occur around rollouts or an AZ going down in the
103 | cloud provider. k8s seems to self-heal pretty well.
104 | This team built it from the ground up, so it's very different from
105 | using kops.
106 | 
107 | Pivotal have released Kubo recently. CloudFoundry will run Docker
108 | containers if you want it to.
109 | 
110 | If you're shipping containers, then k8s is good for that.
111 | 
112 | Google will certify platforms for a number of nines of
113 | availability. CloudFoundry is apparently certified for 99.99% and
114 | Google will do 1st line support.
115 | 
116 | Many people treat their platform the way they treat their existing
117 | systems or IaaS. They seem to want to change the low level. Is anyone
118 | running a vanilla solution?
119 | 
120 | One team running Elastic Beanstalk, single container on a single
121 | instance. It's just EC2 glue. Looking at building newer things on k8s.
122 | Multiple-container Elastic Beanstalk is ECS.
123 | 
124 | If you're going to run a platform, you either move to it, or you don't
125 | mess with it at all.
126 | When you start modifying the platform, you're probably setting
127 | yourself up for failure.
128 | 
129 | Upstream k8s is a framework to run your platform. You have to bring
130 | opinions. If you don't want to do this - use one of the distributions.
131 | With the upstream, you have to think about network choices etc.
132 | 
133 | This seems to be a distinction that's not well understood.
134 | 
135 | Focus on the thing that provides value for your customers. If running
136 | a platform doesn't deliver value, don't do it.
137 | 
138 | So, do you stick with the vanilla, boring things and work down as you
139 | need to?
140 | 
141 | Using Elastic Beanstalk and embracing its limits has been useful.
142 | 
143 | So: which distribution and why? Is there one coming out on top?
144 | 
145 | It's too early to tell.
146 | "People on the internet are running upstream"
147 | Commercial market is still pretty small.
148 | 
149 | Feels really gold-rushy at the moment.
150 | 
151 | The Cloud Native Computing Foundation is a useful place to look for
152 | solutions that might fit together with k8s.
153 | 
154 | Cap Gemini pitching their own distribution. Have another platform,
155 | 'Apollo'. HP has something similar.
156 | 
157 | When building a platform, we need to build teams. What skills are
158 | available?
159 | 
160 | Ask yourself if you should still be running that in 3-5 years - should you wait
161 | 2 years and move to a hosted solution?
162 | 
163 | Different groups will want to run the bleeding edge of things. We're
164 | perhaps a little more cynical and real world ;)
165 | 
166 | Are people starting to upskill? Do I need to train people?
167 | 
168 | If you want to keep your costs down, yes, hire a couple of contractors
169 | and have them train your team. If you want to hire a lot of people,
170 | it'll be expensive.
171 | 
172 | You should think before spinning up a new team. Try Pivotal Web
173 | Services (PWS), use GCE, Heroku, use something else. You shouldn't
174 | need to spin up a team to run this stuff.
175 | 
176 | What does operations look like for applications on
177 | platforms? There's still configuration, but there's a line that is the
178 | platform.
179 | 
180 | App-ops might be a good thing for a later session. It's lunch time!
181 | 
182 | 
183 | 
184 | 
--------------------------------------------------------------------------------
/scaling_deployments.md:
--------------------------------------------------------------------------------
1 | I would like to know how teams handle their deploys. How are they
2 | automated, fast, safe, how are rollbacks handled, how do you avoid
3 | excessive coordination and assumptions etc.
4 | 
5 | 450 services, continuous deploys. Looking to get to 10k deploys a
6 | day, aspirationally. Actually at a few hundred a day.
7 | 
8 | One of the first things was to have declarative pipelines. Used YAML to
9 | model pipelines.
10 | Another principle was zero-click deploys. Use Drone. "It's been
11 | alright". Auth model is terrible, tied to git access etc.
12 | 
13 | So, code flows through the pipeline. What are the steps and how do
14 | things get to the next step?
15 | 
16 | The team with the 450 services are not actually enforcing automated
17 | tests, but it is obviously good practice.
18 | Some "discussions" around experimentation being an excuse not to
19 | test. Cultural issue.
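The declarative, zero-click pipeline idea described above can be sketched as data plus a small generic runner. This is a minimal illustration, not any team's actual tooling - the step names and the `Step`/`run_pipeline` helpers are invented:

```python
# Sketch: a pipeline declared as data (YAML in the session's case) run by a
# generic, zero-click runner: no human gate between steps, stop on failure.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    run: Callable[[], bool]  # returns True on success

def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first failure. Returns the names
    of the steps that completed successfully."""
    completed = []
    for step in steps:
        if not step.run():
            break
        completed.append(step.name)
    return completed

pipeline = [
    Step("build", lambda: True),
    Step("test", lambda: True),
    Step("deploy-preprod", lambda: False),  # simulate a failing deploy
    Step("deploy-prod", lambda: True),      # never reached
]
print(run_pipeline(pipeline))  # → ['build', 'test']
```

The point of modelling the pipeline as data is that the runner stays generic, and the pipeline definition itself can be reviewed and versioned like any other code.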
20 | 
21 | Sally Goble from the Guardian has a good talk about not running
22 | tests. (I believe the talk in question is https://vimeo.com/162635477).
23 | 
24 | There are no staging environments. Dev environments (devs are expected
25 | to be able to spin up dependencies for their microservice). UX testing
26 | is hard, slow.
27 | 
28 | Have blue/green deploys.
29 | 
30 | Do you have any post-deploy smoke tests etc? Much of this is done with
31 | existing monitoring.
32 | 
33 | Blue/green deploys, but the newest deploy doesn't get any traffic,
34 | then synthetic requests, then bleed over production traffic. Still
35 | building out this logic right now, and trying to work out what the
36 | decision point is.
37 | 
38 | What about other people? How do you avoid having people sitting there
39 | clicking buttons and watching things?
40 | 
41 | On PR merge, build, package, deploy all the way to pre-prod. Depending
42 | on acceptance tests to give confidence. Deploy to live is enabled only
43 | when the tests pass.
44 | 
45 | Do you get a queue of deploys?
46 | 
47 | Yes, that does happen, but there's no good answer yet.
48 | 
49 | Do you have a time limit on acceptance tests? No limits right now,
50 | tests are generally fast, so it's not been an issue.
51 | 
52 | What are people doing when code hits production? Canary, big bang,
53 | validation - what are people doing?
54 | 
55 | One team has the usual pipeline, but they make a lot of UI changes...
56 | For big backend changes, they deploy the branch (ahead of any
57 | merge) - 100% production traffic and monitor for an hour or two.
58 | 
59 | Another team with a LAMP stack use a special header to do gradual
60 | routing on that header to different versions of the code.
61 | 
62 | Netflix for example will use 5% canaries and watch the
63 | metrics. Provided new code is within limits, it's promoted.
64 | 
65 | Guardian had a matrix of user journey vs software versions.
Customers
66 | are really good at testing your software ;)
67 | 
68 | Amy Hughes talked at Pipeline (?) about doing statistical analysis on
69 | new releases. Didn't get the details on this.
70 | 
71 | One team tag AWS instances to be able to identify issues.
72 | 
73 | Being able to run multiple versions of code is very useful.
74 | Feature flagging is worth investigating.
75 | Poisoned session caches are hard to deal with - changing the shape
76 | of data in caches is difficult.
77 | Running multiple versions makes rolling back much easier, especially
78 | if you're routing traffic.
79 | 
80 | How do deploys go wrong?
81 | 
82 | - DB schemas are fun
83 | - Caches
84 | - Anything with persistent data
85 | 
86 | One team deployed once every 2 weeks many years ago. Handed it over to
87 | the ops team to deploy using a 30-step process. Moved to continuous
88 | deployment.
89 | 
90 | Looked at database change scripts to work out if there were any
91 | non-backwards-compatible schema changes. ~90% were OK to deploy and
92 | the remaining 10% could have been written in a way to be backwards
93 | compatible.
94 | 
95 | Persuading the team this was possible was one of the bigger issues.
96 | 
97 | Smaller changesets made a big difference. Making new code read the old
98 | DB schema, then updating the schema, then updating the code was much
99 | easier to understand when everything was deployed within a short period of time.
100 | 
101 | So, how did you make sure that the code and DB versions work together?
102 | 
103 | Test environment had anonymised production data. Could replay
104 | production access logs against it. Deploy DB update, test, deploy
105 | code, test.
106 | Put out comments that "Rollback wasn't a thing, stop it".
107 | 
108 | MBS wrote a blog post about DB migrations done properly a while ago.
109 | http://www.brunton-spall.co.uk/post/2014/05/06/database-migrations-done-right/
110 | 
111 | ORMs can make migrations very difficult, and provide empty
112 | rollbacks.
Doesn't exactly work when you DROP COLUMN :)
113 | 
114 | Don't pretend you can rollback.
115 | 
116 | Does anyone enforce how long old data has to kick around for safety
117 | and backwards/forwards compatibility?
118 | 
119 | One answer: plus/minus 1 version.
120 | 
121 | Customers shouldn't notice deploys.
122 | 
123 | Another team support many (5) versions of APIs, keep the data around;
124 | it's all managed in code.
125 | 
126 | How do you avoid coordinating backend vs frontend dependencies?
127 | i.e. how do you avoid issues with deploying a frontend that depends
128 | on a backend that's not been deployed yet?
129 | 
130 | Answers: build that into your service discovery at deploy time -
131 | discover if what you need is already there.
132 | 
133 | Developer teams can also think about this themselves - it doesn't have to
134 | be a technical solution.
135 | 
136 | One team's clients query using a query string, not just an endpoint
137 | (clients specify which fields they expect to get back). Do validations.
138 | 
139 | Another team did something similar - baseline fields were the same,
140 | but newer clients could request additional newer fields. Easier than
141 | maintaining schema versions of API endpoints.
142 | 
143 | 75% using Jenkins
144 | 5% TeamCity
145 | 1 Drone
146 | 1 Codeship
147 | 
148 | Next session: is there something better than Jenkins?
149 | 
150 | Anyone using AWS CodeDeploy? Not too bad, but invocation is a bit nasty.
151 | 
--------------------------------------------------------------------------------
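A closing sketch of the field-selection approach from the deploys session (clients naming the fields they expect back, with validation). This is a hypothetical illustration - the record, field names, and `fetch` helper are all invented:

```python
# Sketch: clients request explicit fields, so old and new clients share one
# endpoint instead of maintaining versioned schemas. Names are invented.
RECORD = {"id": 42, "name": "example", "created_at": "2017-03-01",
          "owner": "team-a"}  # full server-side representation

def fetch(fields):
    """Return only the requested fields; unknown fields fail loudly,
    which doubles as validation of the client's expectations."""
    unknown = [f for f in fields if f not in RECORD]
    if unknown:
        raise KeyError("unknown fields requested: %s" % unknown)
    return {f: RECORD[f] for f in fields}

print(fetch(["id", "name"]))           # baseline client → {'id': 42, 'name': 'example'}
print(fetch(["id", "name", "owner"]))  # newer client asks for an extra field
```

Because unknown fields fail loudly, a frontend deployed ahead of its backend finds out immediately rather than silently reading missing data - which also speaks to the deploy-time discovery point above.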