├── README.md
├── cloudformation_terraform_aargh.md
├── cluster_level_monitoring.md
├── green_field_in_2017.md
├── paved_paths.md
├── platforms_cf_k8s_etc.md
└── scaling_deployments.md


/README.md:
--------------------------------------------------------------------------------
1 | # Notes from Scale Summit 2017
2 | 
3 | These are my notes from the sessions I attended
4 | at [Scale Summit](http://www.scalesummit.org) 2017.
5 | 
6 | I did the same in previous years:
7 | 
8 | - https://github.com/bazbremner/scalesummit-2016-notes
9 | - https://github.com/bazbremner/scalesummit-2015-notes
10 | 
11 | # From other attendees
12 | 
13 | I've not yet seen notes from other sessions - if you have some, please
14 | open a PR on this README to include a link!
15 | 
--------------------------------------------------------------------------------
/cloudformation_terraform_aargh.md:
--------------------------------------------------------------------------------
1 | What is it that people build around Terraform, CloudFormation and
2 | similar tools?
3 | OP uses CloudFormation orchestrated with Ansible, plus some Terraform.
4 | 
5 | How should I organise my code? How should I coordinate things if I've
6 | got 20 bits of code? How should I be testing it?
7 | How do you get your infrastructure orchestration to the point that
8 | you're comfortable?
9 | What's beyond hello world?
10 | 
11 | One service provider has migrated from CloudFormation to
12 | Terraform. Probably using about 50% of all of the Terraform providers.
13 | Separate runs for DNS, infrastructure, networking.
14 | Use tfvars files for each environment.
15 | Use a "Terrafile", which is a bit like Bundler/Gemfiles for Terraform
16 | modules. Have a wrapper that pulls modules into a directory.
17 | Test with rspec, just testing against a Terraform plan. And use
18 | Jenkins to run make and then terraform plan/apply.
19 | 
20 | Use GitHub PRs; expect devs to supply the output of the plan as part
21 | of the PR.
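The plan-testing idea above (rspec against a Terraform plan) can be sketched outside rspec too. A minimal Python sketch, assuming a JSON plan as produced by newer Terraform versions (`terraform plan -out=plan.bin && terraform show -json plan.bin > plan.json`; at the time of these notes people parsed the text plan instead). The "no deletions" policy check is illustrative, not from the session.

```python
# Sketch: assert properties of a Terraform JSON plan before applying.
# Assumes the plan JSON schema with "resource_changes"/"change"/"actions".
import json

def destructive_changes(plan: dict) -> list:
    """Return addresses of resources the plan would delete (or replace)."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]

# Example plan fragment, shaped like `terraform show -json` output:
plan = json.loads("""
{"resource_changes": [
  {"address": "aws_instance.web", "change": {"actions": ["update"]}},
  {"address": "aws_iam_role.old", "change": {"actions": ["delete", "create"]}}
]}
""")
assert destructive_changes(plan) == ["aws_iam_role.old"]
```

A CI job could run a check like this between `terraform plan` and `terraform apply` and fail the build on unexpected destroys.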
22 | Each Terraform module is a separate git repo, with roughly one repo
23 | per VPC that pulls this code in.
24 | 
25 | Use remote state or data sources to link the results of different runs.
26 | 
27 | So: how do you walk the graph if multiple runs affect each other?
28 | 
29 | Jenkins jobs are running plans against all environments and different
30 | layers of runs, to ensure changes have actually been applied
31 | everywhere. "Janky tooling over the top"
32 | 
33 | This appears to be the stage we're at at the moment.
34 | 
35 | I recommend people read Charity Majors' blog posts on Terraform if
36 | you're just coming at this problem: https://charity.wtf/tag/terraform/
37 | 
38 | What happens if modules are updated? Up to each project to decide if
39 | they want to update.
40 | Have two module paths: local modules, which are vendored/janky bits,
41 | and then the "Terrafile" for pulling in external modules.
42 | The local modules path is useful for seeing what's being
43 | overridden, and whether there are common patterns occurring.
44 | 
45 | Moving in a different direction: is there anyone using
46 | Terraform or CloudFormation and automatically testing it, perhaps in
47 | another AWS account, to check that what they expect to happen does get
48 | applied?
49 | 
50 | One team is using awspec to check that the resources that were
51 | expected were actually created.
52 | 
53 | Another team looked at a Python library called Terraform Validate that
54 | allows you to do some consistency checking without spinning anything up,
55 | but at the end of the day you really need to spin something up.
56 | 
57 | Atlassian have open sourced LocalStack, which mocks a limited number
58 | of AWS APIs.
59 | 
60 | What do people do if there's no support in a provider?
61 | 
62 | - AWS CloudFormation stack support in Terraform.
63 | - exec resources (but you probably don't want to do this)
64 | - Terraform external providers are a better alternative to execs.
65 | 
66 | Is anyone using pipelines for deploying?
67 | 
68 | [sorry, missed a little bit here]
69 | 
70 | Discussion moved on to adding and removing IAM permissions from the
71 | user running Terraform, to limit accidental deletions for example.
72 | 
73 | Concerns about setting up too many permissions in IAM. Locking down to
74 | different IP addresses.
75 | 
76 | Within Terraform AWS provider blocks, you can specify AWS account IDs
77 | to prevent applying changes to production or another customer by
78 | mistake, for example.
79 | 
80 | Another group create very targeted IAM policies for each service: walk
81 | the tree of all of the resources created by a human, and then use
82 | that policy after the initial run.
83 | 
84 | Some people are looking at grepping the Terraform source, or better still
85 | the debug log, and working out how to build an IAM profile from that.
86 | Alternatively, reconstruct it from CloudTrail logs.
87 | 
88 | Who was running CloudFormation and is now on Terraform, and do you have
89 | legacy code? (About 4-5 put their hands up, out of 50.)
90 | 
91 | CloudFormation was all JSON, no changesets, hard to work with. (About
92 | 18 months ago.) They haven't replaced everything.
93 | 
94 | Nice bits of CloudFormation:
95 | - CloudFormation abstracts the nastiness of the RDS API.
96 | - Autoscaling rolling updates: UpdatePolicy on the launch
97 | configuration; Terraform doesn't support this.
98 | 
99 | The workflow on CloudFormation is very prescriptive.
100 | 
101 | Another team are using CloudFormation generated from another tool, but
102 | it's pretty hairy, especially around IAM policies.
103 | 
104 | Troposphere mentioned - a tool that generates
105 | CloudFormation. Anyone doing something similar?
106 | 
107 | One team is taking YAML and generating Terraform for multiple DNS
108 | providers.
109 | 
110 | Cumulus is another tool mentioned.
111 | 
112 | Anyone using something that's not Terraform or CloudFormation?
113 | 
114 | One person is using ARM templates for Azure.
The APIs seem terrible.
115 | Terraform support for ARM templates seems good. The reason Terraform
116 | doesn't support Azure in "a better" way is that the upstream SDKs don't
117 | support it; Microsoft don't seem to be publishing Swagger files for
118 | those APIs. This person has a Trello board with all the issues they've seen.
119 | 
120 | CloudFoundry users would use BOSH.
121 | 
122 | Show of hands, how many use CF or Terraform for managing production:
123 | 
124 | About 5-10 for each. Lots of other people in the room.
125 | 
126 | Who is using other people's Terraform modules: zero hands.
127 | 
128 | Who has heard of and uses the community modules? A couple of people use
129 | them as a reference only.
130 | 
131 | Gruntwork have a blog post about the ugly bits of Terraform and
132 | potential (nasty) workarounds. Part of this series:
133 | https://blog.gruntwork.io/a-comprehensive-guide-to-terraform-b3d32832baca#.4910unu4e
134 | 
135 | Why don't people think the modules are good enough?
136 | 
137 | Puppet/Chef etc. use lots of attributes to change behaviour.
138 | For something like an ASG, it's pretty much every parameter.
139 | 
140 | I've not seen good examples of people composing modules from other
141 | modules. Seems problematic. Others agree that it is still a problem.
142 | 
143 | Understanding the impact of a low level module changing is very
144 | difficult.
145 | 
146 | Passing inputs and outputs through layers of modules is very difficult.
147 | 
148 | I wonder if it's worth templating out the Terraform code in a full
149 | blown language and feeding that to Terraform. Terraform's JSON input is
150 | a route to this.
151 | 
152 | Terraform _does_ support conditionals, but only on a single line, and
153 | this doesn't pan out if things are optional.
154 | 
155 | Some discussion about Helm vs Terraform; it seems they're solving
156 | different levels of difficulty. And we're out of time.
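The "template in a full-blown language, feed JSON to Terraform" route mentioned above can be sketched very simply: build the configuration as plain data and write it out as a `.tf.json` file Terraform can read alongside `.tf` files. The resource names and values here are illustrative only.

```python
# Sketch: generate Terraform JSON configuration from Python data.
import json

def dns_records(zone_id: str, records: dict) -> dict:
    """Render simple Route 53 A records as a Terraform JSON config dict."""
    resources = {}
    for name, ip in sorted(records.items()):
        # Terraform resource names can't contain dots.
        resources[name.replace(".", "_")] = {
            "zone_id": zone_id,
            "name": name,
            "type": "A",
            "ttl": "300",
            "records": [ip],
        }
    return {"resource": {"aws_route53_record": resources}}

config = dns_records("Z123", {"www.example.com": "10.0.0.1"})
# json.dumps(config, indent=2) would be written to e.g. dns.tf.json
```

Conditionals, loops and composition then happen in the host language, sidestepping the limits of Terraform's own interpolation.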
157 | 
--------------------------------------------------------------------------------
/cluster_level_monitoring.md:
--------------------------------------------------------------------------------
1 | The problem is that as the organisation scales, host level
2 | monitoring is either noisy or not helpful. Understanding the service
3 | as a whole is much more useful, but harder to work out.
4 | 
5 | Are people doing cluster/system level alerting? How do you do RCA?
6 | 
7 | One org tries to monitor and alert on business functionality. That's
8 | required writing services that can do monitoring.
9 | 
10 | Publishing system: monitoring checks that content appears against each
11 | API. Combined with synthetic requests.
12 | End to end checking like that is really valuable.
13 | 
14 | Synthetic requests are another internal service. Re-publish an old
15 | article and check that the article's time is updated.
16 | 
17 | So what if you can't really do it? Working with payments systems, it's
18 | hard to monitor real transactions.
19 | 
20 | One suggestion is to pay for a fake product and refund it. Turns out
21 | this really upsets fraud detection systems.
22 | 
23 | Monitoring users going through a checkout funnel. Watching a mix of
24 | application side and (in this case) PayPal responses. The ratio of
25 | requests and responses can be very different when there are problems.
26 | 
27 | Watching transaction statuses as they go through a process can be
28 | helpful.
29 | 
30 | Engineers add more value if they focus on high level
31 | tests. Hosts/containers are disposable.
32 | 
33 | When there is a problem at the system level, the original person would
34 | like to know what happened.
35 | 
36 | Another team have written their own tooling to write rules that collapse
37 | system level alerts and relate them to user-affecting
38 | alerts.
39 | 
40 | So, are we all doing an excellent job of monitoring then? ;)
41 | 
42 | Turning off Nagios emails was a big win for one team.
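The re-publish synthetic check described earlier (re-publish an old article, check its timestamp moves) can be sketched as a small polling function. `republish` and `fetch_article` stand in for real internal APIs and are entirely hypothetical here.

```python
# Sketch of a synthetic publishing check. The callables are injected so
# the real HTTP clients (hypothetical internal APIs) can be swapped in.
import time

def synthetic_publish_check(article_id, republish, fetch_article,
                            timeout=60, interval=5, sleep=time.sleep):
    """Return True if the article's timestamp updates after a republish."""
    before = fetch_article(article_id)["updated_at"]
    republish(article_id)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_article(article_id)["updated_at"] > before:
            return True  # the pipeline propagated the change end to end
        sleep(interval)
    return False  # alert: publishing pipeline looks broken
```

A monitoring system would run this on a schedule and alert on `False`, which exercises the whole publishing path rather than any single host.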
43 | 
44 | Engineers are rotating through improving alerts. If there are no
45 | alerts, improve things.
46 | 
47 | What is the user need? Can your monitoring tell you which user need
48 | isn't being satisfied? Alerts that tell you that seem a real
49 | improvement - it seems much more common to do this now.
50 | 
51 | Not alerting (especially out of hours), provided that users can still
52 | do what they want to do, can be a big win - it cuts down on alert fatigue
53 | and disrupted sleep.
54 | 
55 | 
56 | Back to the original point: it is very hard to map the fact that a user
57 | cannot do something to - for example - Cassandra compaction taking too
58 | long. Does anyone have ideas?
59 | 
60 | Being able to rely on the low level (instances, containers) being
61 | picked back up helps the focus: provided that some nodes
62 | are up and servicing users, it's OK.
63 | 
64 | On the other side, it is possible to have a problematic configuration
65 | that causes nodes to be restarted constantly without the team going
66 | and investigating for 6 months.
67 | 
68 | Ex: an AWS EC2 machine coming up, dying and being replaced every 2
69 | minutes, yet you're getting billed for the 1st hour.
70 | 
71 | I'd recommend when testing end-to-end, try to separate different bits
72 | of functionality/user journeys in a way that lets you distinguish important
73 | stuff from things that are not used.
74 | 
75 | Has anyone done service level monitoring for serverless systems yet?
76 | 
77 | No-one yet. No-one is using serverless for heavy production loads.
78 | One idea is to push information into logs and extract data from there.
79 | 
80 | What about things that aren't user affecting, but may become an issue,
81 | i.e. SLAs, latency, credits. Are people still monitoring those?
82 | 
83 | Yes, it depends on the users and the focus of your team. Also, if it
84 | transpires that no-one notices/cares, then perhaps not having monitoring
85 | on that is the right choice.
86 | 
87 | There's a balance between the value that these things bring and what
88 | the cost relationship is like.
89 | 
90 | The closer to the metal one team gets, the more metrics around systems
91 | they build. As it goes to a higher level, it's fluffier and harder to
92 | track and keep an eye on things.
93 | 
94 | Apropos SLAs: some services in one team need an out of hours response,
95 | but not everyone pays attention to that, so monitoring the things that
96 | you know will cause you to get out of bed is worth doing.
97 | 
98 | How do people monitor the things you depend on? Ex: an S3 failure.
99 | 
100 | End to end tests often check that, assuming you're covering all of
101 | your functionality that hits all external providers. No need to have
102 | specific tests.
103 | 
104 | There is an art to ensuring that when multiple tests fail, it's clear
105 | which particular problem they're encountering, otherwise there will
106 | be lost time in working out commonalities.
107 | 
108 | 1st line support may not have actually read the detail in alerts,
109 | which is its own problem.
110 | 
111 | Alert fatigue is also an issue - if you see the same alert time and
112 | time again, people will overlook that alert and perhaps others. Be
113 | prepared to turn alerts off until they are valuable (or bin them).
114 | 
115 | One team gets a (manually created) summary of all of the alerts that
116 | fired over the last 24 hours; it shows patterns.
117 | Several teams do end-of-shift or rotation handovers.
118 | 
119 | Agreed that it would be nice to have an automatically generated report
120 | of all alerts that fired.
121 | 
122 | Another group select 2 random developers to look at alerts, to get used
123 | to new pieces of the system.
124 | Two other groups have a similar rotation - team members rotate to handle
125 | support/on-call for a week.
126 | 
127 | An important note is that developers should leave their project work
128 | behind/hand it over so they can focus on the support work and not be
129 | torn.
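The automatically generated alert report the group wanted is mostly a grouping exercise. A minimal sketch, assuming alert events are available as (timestamp, alert_name) pairs from whatever alerting system is in use:

```python
# Sketch: collapse alert events from a time window into a count-ordered
# summary, the kind of thing a daily report or handover email could use.
from collections import Counter

def alert_summary(events, since, now):
    """Summarise alert names fired in the window (since, now]."""
    counts = Counter(name for ts, name in events if since < ts <= now)
    lines = [f"{count:4d}  {name}" for name, count in counts.most_common()]
    return "\n".join(lines) if lines else "no alerts fired"
```

Sorting by count puts the noisiest alerts at the top, which is exactly where the "turn it off or fix it" conversation should start.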
130 | 
131 | Etsy open sourced "Ops Weekly" to summarise the data that the on-call
132 | people had to deal with. https://github.com/etsy/opsweekly
133 | 
134 | 
--------------------------------------------------------------------------------
/green_field_in_2017.md:
--------------------------------------------------------------------------------
1 | Proposal/question: every year, there's something new at Scale
2 | Summit. Microservices, Sensu etc. in the past.
3 | 
4 | What does green field look like in 2017?
5 | 
6 | Provisioning servers, best practices, deployments, logging,
7 | monitoring - what would you do now?
8 | 
9 | What if you're _not_ using AWS? Should you just do that? Would prefer
10 | not to just have a list of AWS services.
11 | 
12 | Show of hands: about 75% are on AWS.
13 | 
14 | 2 said they wouldn't use AWS now, although they are currently on it.
15 | One for compliance reasons - knowing who can have access to
16 | data/machines etc.
17 | Another is running in China, on Alibaba; AWS is harder to get
18 | onto there.
19 | 
20 | One person mentioned that Google Cloud and related products sound
21 | interesting. The same person is not on GC or AWS, looking at both. AWS
22 | feels like the "no-one got fired for buying IBM" of the current age.
23 | 
24 | GC feels like it's behind on managed services - for example, a
25 | managed PostgreSQL.
26 | IAM feels like a real weakness with GC.
27 | 
28 | If you're running multiple platforms, then you're often forced to meet
29 | the lowest common denominator.
30 | 
31 | GC is cheaper, but it's missing a lot of things.
32 | 
33 | One person thinks GC might be offering 1st line support for free if
34 | you meet a set of up-front standards, though if you _can_ meet these
35 | standards, you are probably mature enough that you have SREs.
36 | 
37 | It sounds like Google are positioning Kubernetes as the abstraction
38 | layer: you can run on AWS, GC or others, and Google gets business that
39 | way.
40 | 
41 | Is anyone looking at Lambda or similar if you're starting a green
42 | field (which is what this session is about)?
43 | 
44 | Yes, says one person, who has just started there, but it's working well
45 | so far.
46 | "everyone has a function. Functions, functions for everyone!"
47 | (paraphrasing a little).
48 | 
49 | Making sure you're _not_ special should be your first action if you're
50 | starting now.
51 | You are not special, you just need to realise that.
52 | 
53 | High level services can be a good place to start - Lambda, Elastic
54 | Beanstalk etc.
55 | 
56 | Defer decisions until you can learn what you need on new projects. Use
57 | the high level services to help.
58 | 
59 | Are we all doing microservices? Do you want to spend the one- to two-
60 | thirds overhead to run a microservice model?
61 | 
62 | One answer: it depends on the team and the understanding of the problem
63 | space. If it's a well known problem, and a good team, perhaps.
64 | If the problem is unknown, it's hard to model the problem space, and
65 | refactoring across microservices is hard.
66 | 
67 | Does it have to be binary? Extract functionality as you understand it.
68 | 
69 | One platform has a mix of very small, simple microservices
70 | (i.e. issuing tokens and checking their validity), with a
71 | mini-monolith with the business logic in it whilst that problem space
72 | is being explored.
73 | Planning on breaking that up as the problem becomes better known.
74 | 
75 | It can take a good set of developers to write good interfaces etc. But
76 | microservices aren't going to save you here, nor will a monolith.
77 | 
78 | Write clean interfaces - ask yourself: "should this be a
79 | library?". Then it's easier to pull it out into a microservice.
80 | 
81 | One of the mistakes one person has seen is throwing too many
82 | people at the problem. Try to start
83 | with a small team (where small = 5 people - the whole team: BAs,
84 | product, developers).
This makes moving faster easier: more context,
85 | less communication overhead etc.
86 | 
87 | Being a small company might force/help some decisions - for example,
88 | choosing Lambda might help you not worry about a lot of things.
89 | 
90 | Mention of Etsy - choose boring technology. http://mcfunley.com/choose-boring-technology
91 | 
92 | Lambda _is_ a boring technology. A few people agree with this. There's
93 | magic under the covers.
94 | 
95 | "Greenfield 2017 - No JavaScript!" There was applause.
96 | 
97 | But how do you do that if you're on the web? HTML!
98 | 
99 | Mention of GOV.UK not using much JavaScript, to meet accessibility needs
100 | and degrade gracefully.
101 | 
102 | So what language would you use?
103 | 
104 | Python is a good choice - the person saying this hasn't had someone
105 | coming from another language complain.
106 | 
107 | It depends on the team - choose what works for the people you
108 | have/have hired - remember this is greenfield.
109 | 
110 | Let's not get into language wars.
111 | 
112 | So, just run everything on Lambda and go home? ;)
113 | 
114 | Lambda can do ~2k req/s out of the box, but you can scale up.
115 | 
116 | Going back a bit: AWS vs Google is partly an economic decision. Cost
117 | control in a new project is often a good idea - get more runway to
118 | explore the problem, work out what you need to build and ship.
119 | 
120 | There's opportunity cost too. You _can_ just throw money at the
121 | problem.
122 | 
123 | Someone asked if anyone does rearchitect based on cost drivers. "What
124 | cost? Developer time, the AWS bill, etc." So, are you trying to avoid
125 | big AWS bills, or are you trying to avoid getting paged at night?
126 | 
127 | Small company: as long as my AWS bill is cheaper than my developers,
128 | that's fine. Would rather have the developers focus on building things
129 | than spend time shaving small percentages off an AWS bill.
130 | 
131 | The original poster is not on AWS; they had lots of hardware everywhere
132 | and have consolidated on one provider. Now looking at AWS, but that's a
133 | big move. There's also all the sunk cost. AWS doesn't feel like a
134 | smaller move. The original poster is consuming a firehose of data.
135 | 
136 | Anyone with petabytes of data in the cloud?
137 | One response: baseline load on their own hardware, scaling up and
138 | dealing with seasonal peaks with AWS. Compute and, more interestingly,
139 | storage costs are coming down in the cloud to become much closer in price.
140 | 
141 | Getting data near users is a very good use for the cloud - until
142 | companies build datacentres in Brazil, China, wherever, cloud is the
143 | way to go.
144 | 
145 | Talk about security and how much you trust Google, AWS etc. Generally
146 | people do trust them. Block encryption under the covers. With random
147 | smaller cloud providers, it's up to you to work out how to do that yourself.
148 | 
149 | Moving on: what about supporting services, monitoring, logging etc.?
150 | 
151 | CloudWatch is terrible. One team is just shoving things into Logstash.
152 | 
153 | OP is using InfluxDB and driving monitoring off the back of that.
154 | 
155 | New Relic gets super expensive super quickly. Other people are running
156 | Datadog, which does seem cheaper.
157 | 
158 | One person is using AWS X-Ray; it works.
159 | 
160 | At a small size (4 devs, 4 ops), tools like New Relic really paid off
161 | to find problems quickly. It's harder to justify the cost vs dev time
162 | as you grow.
163 | 
164 | So, how do people make that trade off of build/buy? Mostly it seems
165 | pain based. The next session is buy vs build.
166 | 
167 | Has anyone looked at Influx vs Prometheus? Prometheus has great ingest
168 | performance. At the time, Prometheus had a linear projection function,
169 | which Influx added later.
170 | If you're doing a lot of requests, Prometheus is probably a better
171 | bet.
172 | There seems to be a lot of competition between the two.
--------------------------------------------------------------------------------
/paved_paths.md:
--------------------------------------------------------------------------------
1 | Do we want a chosen set of technology?
2 | And if you do, then how do you end up renewing that year after year?
3 | 
4 | Old tech may be a retention and recruitment problem - if people want
5 | to work on new tech, then they may leave to do that.
6 | 
7 | Yes, there are some people who want to play around with new things,
8 | but letting that trump everything comes at a cost.
9 | 
10 | There's a balance between organisational stability and employee
11 | happiness.
12 | 
13 | "Bi-modal IT" - the new shiny and the old stuff. If you have one team
14 | that gets to work on the new shiny, everyone else hates them.
15 | 
16 | If you're expecting a team to do out of hours support, then you can't
17 | force them to do that if it's not the right tool for the job.
18 | If something is cool and useful, then they will use it, but they have
19 | to be able to support it.
20 | 
21 | How do you 'nudge' teams towards good behaviour? Look at the "nudge unit".
22 | 
23 | Make things easy, accessible, timely and social:
24 | 
25 | - Self-service
26 | - Well documented
27 | - Make it easy
28 | - Social: if others are doing it and say it's great - think about
29 | lightning talks - examples of why it works for them can persuade
30 | others to go the same way.
31 | 
32 | For out of hours support: is each team doing their own out of hours,
33 | or is there an ops team?
34 | 
35 | 1st line support react to things going red
36 | 2nd line for escalation
37 | 3rd line
38 | 
39 | There are lots of different languages to support; it's complex to pay
40 | people to support all of these things.
41 | Runbooks, ways of reporting incidents - you can set standards around
42 | this.
43 | 
44 | A recommendation shout out for Susan J Fowler's Production-Ready
45 | Microservices, around setting standards for what a good service needs.
46 | 
47 | The organisation here is around things like runbooks, monitoring etc.
48 | 
49 | If you're going to phone people up, they're more interested in
50 | creating runbooks. Your future self needs a runbook about what you're
51 | doing today.
52 | 
53 | If you don't do out of hours support for your own service, you'll
54 | build a worse system.
55 | Make sure you empower teams to fix and address problems.
56 | 
57 | Getting back on topic: what about infrastructure choices - ELK,
58 | for example?
59 | 
60 | Q: How do people change that choice?
61 | A: normally through skunkworks, which isn't ideal.
62 | 
63 | Decisions should be recorded _with_ a lifetime for when they should be
64 | re-evaluated.
65 | 
66 | ADRs mentioned:
67 | http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions
68 | 
69 | Talk of pioneer, settler, town planner teams.
70 | http://blog.gardeviance.org/2012/06/pioneers-settlers-and-town-planners.html
71 | 
72 | Are there teams that have tried experimenting on "lower risk" areas?
73 | 
74 | Yes, but then eventually it gets into production, and it's a Real
75 | Thing.
76 | If you use the new shiny on something you don't care about, then it's
77 | hard to get invested in it.
78 | You want to have something worthy to show off.
79 | 
80 | But there are two approaches (probably more): experimenting with new
81 | tech, but also the case where the existing stack can't solve/be bent to
82 | handle a new need. Solving a need is a good reason to explore new
83 | tech.
84 | 
85 | If a team working on a new problem comes across a tech
86 | that solves that problem, it makes it easier to justify.
87 | The teams have to justify that the technology works to the rest of the
88 | organisation. There are some principal architects who can make
89 | suggestions.
There are some strategic changes - i.e. moving to AWS.
90 | 
91 | Many teams will adopt technology quite quickly.
92 | 
93 | Dealing with the long tail takes a very long time.
94 | 
95 | How do you get one team to sell to others?
96 | 
97 | A central tooling team will be approached by a team. The initial team
98 | will hopefully have something already in a reasonable state; the central
99 | tooling team will then polish it and distribute it to other teams to
100 | consider.
101 | 
102 | What happens if two large influential teams have differing opinions?
103 | 
104 | Principal engineers are expected to have the context and deep knowledge
105 | to be able to work these issues out. The expectation is that principals
106 | have both breadth and depth.
107 | This is from a service company where there is high value in
108 | standardisation.
109 | 
110 | If two groups are talking and they have a difference in opinion, then
111 | it probably _is valid_. The use cases _may_ be different.
112 | It may be necessary to mandate changes.
113 | 
114 | What does that communication look like?
115 | 
116 | - Get people working across teams.
117 | - Written documentation is the worst medium: "the side-bar of shame in
118 | Google Docs".
119 | - Run investigations.
120 | - Product owners have to understand that teams need time to investigate
121 | ("faffing around for a week").
122 | - Possibly have a tech radar: new tech, supported tech, deprecated
123 | tech.
124 | - Service catalogues.
125 | - Weekly demos from each engineer. Show and tells.
126 | 
127 | Are teams co-located? Does that make any difference?
128 | 
129 | - 1st: 70% remote
130 | - 2nd: 4 locations
131 | - 3rd: some colocated, other teams are outsourced
132 | 
133 | Lots of discussion happens asynchronously; people talk on Slack etc.
134 | 
135 | There are different sorts of people: magpies, planners etc.
136 | Do you always have to be moving? Have a rolling wave of progress,
137 | perhaps.
137 | 
138 | Another group rotate people through a support team - coming from the
139 | shiny teams, you're reminded that there are still systems to support.
140 | 
141 | Have roadmaps that are related and remind everyone what the common
142 | goal is.
143 | 
144 | Start new teams for exploration, then break them up to spread knowledge
145 | around.
146 | Make sure you're staffing such teams with people who will be around
147 | for the longer run.
148 | 
149 | Exploring new tech is a different problem to bringing in experts to
150 | build something quickly.
151 | 
152 | Avoid having people who don't have to run production services driving
153 | the adoption of new bleeding edge tech.
--------------------------------------------------------------------------------
/platforms_cf_k8s_etc.md:
--------------------------------------------------------------------------------
1 | Proposed because: people were talking about building things with
2 | sensible defaults, and the answers varied quite a lot.
3 | There's hardware, with long lead times; but going to the cloud and
4 | building lots of snowflakes is no good either.
5 | "Just" spinning up Kubernetes is harder than people make it out to
6 | be.
7 | 
8 | One team are running Kubernetes (k8s). A consistent API for teams is a
9 | good thing. It was a lot of blood, sweat and tears. The APIs aren't
10 | stable yet. Very little has hit the stable API layer at the moment.
11 | 
12 | 100 people, a few years old. No traditional infra history. 4 people
13 | building and running Kubernetes. 60-70% of their time has been
14 | Kubernetes wrangling.
15 | 
16 | They were running on Fleet, which was deprecated. The org was growing
17 | fast and needed to service many teams in a consistent way. k8s seemed
18 | like the most viable choice.
19 | "Batteries included, once you work out how to put the batteries in".
20 | 
21 | Using raw k8s. They have been modifying this and effectively building
22 | their own distribution.
23 | 
24 | Another team have a lot of ECS clusters and are looking at k8s at the
25 | moment.
26 | 
27 | k8s has different network ranges: node, pod, service and another
28 | range. The service range isn't routed in AWS, GCE etc. It's not best
29 | practice, but this team have requirements to route these, hence the
30 | changes.
31 | 
32 | Another team are looking at 1000+ nodes, looking at Calico for
33 | networking. Already at 100 nodes.
34 | 
35 | Who in the room is running a "platform" (i.e. CloudFoundry, k8s
36 | etc.)? 4-5 people put their hands up, in a room of ~35-50.
37 | 
38 | One team are moving from a platform they effectively built themselves.
39 | 
40 | So, is this the future direction?
41 | 
42 | Probably 60-70% of the room put their hands up.
43 | 
44 | More background: the UK Government has multiple PaaS - Docker,
45 | CloudFoundry. The US Government has chosen CF. What else is out there?
46 | Running CloudFoundry takes ~20 instances. It needs people to run it.
47 | 
48 | Question/point from elsewhere: is running a PaaS really a
49 | differentiator? Should you really be running that?
50 | 
51 | Kubernetes seems to be a good safe option and a way to move across
52 | multiple cloud providers.
53 | 
54 | Has anyone used Marathon, Mesos or DC/OS?
55 | 
56 | A few people have looked; DC/OS was dropped out of one trial
57 | quickly. Mesos seems to be in a bind - not a big dev community. As a
58 | do-everything platform it's outnumbered.
59 | There are lots more people at Pivotal working on CF. Docker have "all
60 | the cool". Google are effectively giving k8s away.
61 | 
62 | Using Mesos for scheduling work, it's good. For handling application
63 | workloads, it's not designed for it and is being outpaced by the
64 | competition.
65 | 
66 | Question: has anyone looked at these platforms and decided "not for us"?
67 | 
68 | A: yes; it's a hard sell to move if you have a running
69 | system.
For a greenfield system, maybe, but as discussed in the
68 | earlier greenfield session (see my notes in this repo), you
69 | probably want to choose a higher level option rather than sink time
70 | into platforms.
71 | 
72 | I'd quite like something that is not _as_ complex as a full
73 | platform. I have some compliance issues. One suggestion was AWS Elastic
74 | Beanstalk.
75 | 
76 | Docker Swarm seems troublesome.
77 | 
78 | Question: is anyone packing images onto instances? Also, is anyone
79 | experimenting with Convox?
80 | 
81 | No Convox answers.
82 | 
83 | One team is packing containers onto instances with Docker Swarm. They
84 | got something up and running with 3 people in a week.
85 | 
86 | Someone else said Docker Swarm was a good way to get people to think
87 | about how the world will look before they move to k8s. This got laughs,
88 | but seems like a good approach.
89 | 
90 | If you want to get going and try k8s, minikube or kubeadm come highly
91 | recommended.
92 | 
93 | So, how do you progress from that? How do you go from kops to running
94 | something in production, supporting it and knowing how to debug it?
95 | 
96 | The earlier commenter running k8s is thinking about this now. They have
97 | qualification tests and run them on top of the platform. Understanding
98 | how that fails taught them a lot.
99 | 
100 | So, how do you debug and fix things at 2AM when you're called out?
101 | 
102 | Most problems seem to occur around rollouts or an AZ going down in the
103 | cloud provider. k8s seems to self-heal pretty well.
104 | This team built it from the ground up, so it's very different from
105 | using kops.
106 | 
107 | Pivotal have released Kubo recently. CloudFoundry will run Docker
108 | containers if you want it to.
109 | 
110 | If you're shipping containers, then k8s is good for that.
111 | 
112 | Google will certify platforms for a number of nines of
113 | availability. CloudFoundry is apparently certified for 99.99% and
114 | Google will do 1st line support.
115 | 
116 | Many people treat their platform the way they treat their existing
117 | systems or IaaS. They seem to want to change the low level. Is anyone
118 | running a vanilla solution?
119 | 
120 | One team running Elastic Beanstalk, single container on a single
121 | instance. It's just EC2 glue. Looking at building newer things on k8s.
122 | Multiple-container Elastic Beanstalk is ECS.
123 | 
124 | If you're going to run a platform, you either move to it, or you don't
125 | mess with it at all.
126 | When you start modifying the platform, you're probably setting
127 | yourself up for failure.
128 | 
129 | Upstream k8s is a framework to run your platform. You have to bring
130 | opinions. If you don't want to do this - use one of the distributions.
131 | With the upstream, you have to think about network choices etc.
132 | 
133 | This seems to be a distinction that's not well understood.
134 | 
135 | Focus on the thing that provides value for your customers. If running
136 | a platform doesn't deliver value, don't do it.
137 | 
138 | So, do you stick with the vanilla, boring things and work down as you
139 | need to?
140 | 
141 | Using Elastic Beanstalk and embracing its limits has been useful.
142 | 
143 | So: which distribution and why? Is there one coming out on top?
144 | 
145 | It's too early to tell.
146 | "People on the internet are running upstream"
147 | Commercial market is still pretty small.
148 | 
149 | Feels really gold-rushy at the moment.
150 | 
151 | The Cloud Native Computing Foundation is a useful place to look for
152 | solutions that might fit together with k8s.
153 | 
154 | Cap Gemini pitching their own distribution. Have another platform,
155 | 'Apollo'. HP has something similar.
156 | 
157 | When building a platform, we need to build teams. What skills are
158 | available?
159 | 
160 | Ask yourself if you should still be running that in 3-5 years - should you wait
161 | 2 years and move to a hosted solution?
162 | 
163 | Different groups will want to run the bleeding edge of things. We're
164 | perhaps a little more cynical and real world ;)
165 | 
166 | Are people starting to upskill? Do I need to train people?
167 | 
168 | If you want to keep your costs down, yes, hire a couple of contractors
169 | and have them train your team. If you want to hire a lot of people,
170 | it'll be expensive.
171 | 
172 | You should think before spinning up a new team. Try Pivotal Web
173 | Services (PWS), use GCE, Heroku, use something else. You shouldn't
174 | need to spin up a team to run this stuff.
175 | 
176 | What does operations look like for applications on
177 | platforms? There's still configuration, but there's a line that is the
178 | platform.
179 | 
180 | App-ops might be a good thing for a later session. It's lunch time!
181 | 
182 | 
183 | 
184 | 
--------------------------------------------------------------------------------
/scaling_deployments.md:
--------------------------------------------------------------------------------
1 | I would like to know how teams handle their deploys. How are they
2 | automated, fast, safe, how are rollbacks handled, how do you avoid
3 | excessive coordination and assumptions etc.
4 | 
5 | 450 services, continuous deploys. Looking to get to 10k deploys a
6 | day, aspirationally. Actually at a few hundred a day.
7 | 
8 | One of the first things was to have declarative pipelines. Used YAML to
9 | model pipelines.
10 | Another principle was zero-click deploys. Use Drone. "It's been
11 | alright". Auth model is terrible, tied to git access etc.
12 | 
13 | So, code flows through the pipeline. What are the steps and how do
14 | things get to the next step?
15 | 
16 | The team with the 450 services are not actually enforcing automated
17 | tests, but it is obviously good practice.
18 | Some "discussions" around experimentation being an excuse not to
19 | test. Cultural issue.
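The declarative, zero-click pipeline idea described above can be sketched as data plus a small generic runner. This is a minimal illustration, not any team's actual tooling - the step names and the `Step`/`run_pipeline` helpers are invented:

```python
# Sketch: a pipeline declared as data (YAML in the session's case) run by a
# generic, zero-click runner: no human gate between steps, stop on failure.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    run: Callable[[], bool]  # returns True on success

def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first failure. Returns the names
    of the steps that completed successfully."""
    completed = []
    for step in steps:
        if not step.run():
            break
        completed.append(step.name)
    return completed

pipeline = [
    Step("build", lambda: True),
    Step("test", lambda: True),
    Step("deploy-preprod", lambda: False),  # simulate a failing deploy
    Step("deploy-prod", lambda: True),      # never reached
]
print(run_pipeline(pipeline))  # → ['build', 'test']
```

The point of modelling the pipeline as data is that the runner stays generic, and the pipeline definition itself can be reviewed and versioned like any other code.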
20 | 
21 | Sally Goble from the Guardian has a good talk about not running
22 | tests. (I believe the talk in question is https://vimeo.com/162635477).
23 | 
24 | There are no staging environments. Dev environments (devs are expected
25 | to be able to spin up dependencies for their microservice). UX testing
26 | is hard, slow.
27 | 
28 | Have blue/green deploys.
29 | 
30 | Do you have any post-deploy smoke tests etc? Much of this is done with
31 | existing monitoring.
32 | 
33 | Blue/green deploys, but the newest deploy doesn't get any traffic,
34 | then synthetic requests, then bleed over production traffic. Still
35 | building out this logic right now, and trying to work out what the
36 | decision point is.
37 | 
38 | What about other people? How do you avoid having people sitting there
39 | clicking buttons and watching things?
40 | 
41 | On PR merge, build, package, deploy all the way to pre-prod. Depending
42 | on acceptance tests to give confidence. Deploy to live is enabled only
43 | when the tests pass.
44 | 
45 | Do you get a queue of deploys?
46 | 
47 | Yes, that does happen, but there's no good answer yet.
48 | 
49 | Do you have a time limit on acceptance tests? No limits right now,
50 | tests are generally fast, so it's not been an issue.
51 | 
52 | What are people doing when code hits production? Canary, big bang,
53 | validation - what are people doing?
54 | 
55 | One team has the usual pipeline, but they make a lot of UI changes...
56 | For big backend changes, they deploy the branch (ahead of any
57 | merge) - 100% production traffic and monitor for an hour or two.
58 | 
59 | Another team with a LAMP stack use a special header to do gradual
60 | routing on that header to different versions of the code.
61 | 
62 | Netflix for example will use 5% canaries and watch the
63 | metrics. Provided new code is within limits, it's promoted.
64 | 
65 | Guardian had a matrix of user journey vs software versions.
Customers
66 | are really good at testing your software ;)
67 | 
68 | Amy Hughes talked at Pipeline (?) about doing statistical analysis on
69 | new releases. Didn't get the details on this.
70 | 
71 | One team tag AWS instances to be able to identify issues.
72 | 
73 | Being able to run multiple versions of code is very useful.
74 | Feature flagging is worth investigating.
75 | Poisoned session caches are hard to deal with - changing the shape
76 | of data in caches is difficult.
77 | Running multiple versions makes rolling back much easier, especially
78 | if you're routing traffic.
79 | 
80 | How do deploys go wrong?
81 | 
82 | - DB schemas are fun
83 | - Caches
84 | - Anything with persistent data
85 | 
86 | One team deployed once every 2 weeks many years ago. Handed it over to
87 | the ops team to deploy using a 30-step process. Moved to continuous
88 | deployment.
89 | 
90 | Looked at database change scripts to work out if there were any
91 | non-backwards-compatible schema changes. ~90% were OK to deploy and
92 | the remaining 10% could have been written in a way to be backwards
93 | compatible.
94 | 
95 | Persuading the team this was possible was one of the bigger issues.
96 | 
97 | Smaller changesets made a big difference. Making new code read the old
98 | DB schema, then updating the schema, then updating the code was much
99 | easier to understand when everything was deployed within a short period of time.
100 | 
101 | So, how did you make sure that the code and DB versions work together?
102 | 
103 | Test environment had anonymised production data. Could replay
104 | production access logs against it. Deploy DB update, test, deploy
105 | code, test.
106 | Put out comments that "Rollback wasn't a thing, stop it".
107 | 
108 | MBS wrote a blog post about DB migrations done properly a while ago.
109 | http://www.brunton-spall.co.uk/post/2014/05/06/database-migrations-done-right/
110 | 
111 | ORMs can make migrations very difficult, and provide empty
112 | rollbacks.
Doesn't exactly work when you DROP COLUMN :)
113 | 
114 | Don't pretend you can rollback.
115 | 
116 | Does anyone enforce how long old data has to kick around for safety
117 | and backwards/forwards compatibility?
118 | 
119 | One answer: plus/minus 1 version.
120 | 
121 | Customers shouldn't notice deploys.
122 | 
123 | Another team support many (5) versions of APIs, keep the data around;
124 | it's all managed in code.
125 | 
126 | How do you avoid coordinating backend vs frontend dependencies?
127 | i.e. how do you avoid issues with deploying a frontend that depends
128 | on a backend that's not been deployed yet?
129 | 
130 | Answers: build that into your service discovery at deploy time -
131 | discover if what you need is already there.
132 | 
133 | Developer teams can also think about this themselves - it doesn't have to
134 | be a technical solution.
135 | 
136 | One team's clients query using a query string, not just an endpoint
137 | (clients specify which fields they expect to get back). Do validations.
138 | 
139 | Another team did something similar - baseline fields were the same,
140 | but newer clients could request additional newer fields. Easier than
141 | maintaining schema versions of API endpoints.
142 | 
143 | 75% using Jenkins
144 | 5% TeamCity
145 | 1 Drone
146 | 1 Codeship
147 | 
148 | Next session: is there something better than Jenkins?
149 | 
150 | Anyone using AWS CodeDeploy? Not too bad, but invocation is a bit nasty.
151 | 
--------------------------------------------------------------------------------
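A closing sketch of the field-selection approach from the deploys session (clients naming the fields they expect back, with validation). This is a hypothetical illustration - the record, field names, and `fetch` helper are all invented:

```python
# Sketch: clients request explicit fields, so old and new clients share one
# endpoint instead of maintaining versioned schemas. Names are invented.
RECORD = {"id": 42, "name": "example", "created_at": "2017-03-01",
          "owner": "team-a"}  # full server-side representation

def fetch(fields):
    """Return only the requested fields; unknown fields fail loudly,
    which doubles as validation of the client's expectations."""
    unknown = [f for f in fields if f not in RECORD]
    if unknown:
        raise KeyError("unknown fields requested: %s" % unknown)
    return {f: RECORD[f] for f in fields}

print(fetch(["id", "name"]))           # baseline client → {'id': 42, 'name': 'example'}
print(fetch(["id", "name", "owner"]))  # newer client asks for an extra field
```

Because unknown fields fail loudly, a frontend deployed ahead of its backend finds out immediately rather than silently reading missing data - which also speaks to the deploy-time discovery point above.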