├── .gitignore ├── LICENSE ├── README.md ├── chapters ├── chapter-01.md ├── chapter-02.md ├── chapter-03.md ├── chapter-04.md ├── chapter-05.md ├── chapter-06.md ├── chapter-07.md ├── chapter-08.md ├── chapter-09.md ├── chapter-10.md ├── chapter-11.md └── chapter-12.md ├── package-lock.json └── package.json /.gitignore: -------------------------------------------------------------------------------- 1 | node_modules/* 2 | *.txt -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Andrew Davis 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Building Microservices by Sam Newman (Personal Notes) 2 | 3 | My personal notes for the book: Building Microservices by Sam Newman. 4 | 5 | Book available for purchase [here](https://www.amazon.com/-/es/Sam-Newman/dp/1491950358). 6 | 7 | # Table of Contents 8 | 9 | - [Chapter 1: Microservices](/chapters/chapter-01.md) 10 | - [Chapter 2: The Evolutionary Architect](/chapters/chapter-02.md) 11 | - [Chapter 3: How to Model Services](/chapters/chapter-03.md) 12 | - [Chapter 4: Integration](/chapters/chapter-04.md) 13 | - [Chapter 5: Splitting the Monolith](/chapters/chapter-05.md) 14 | - [Chapter 6: Deployment](/chapters/chapter-06.md) 15 | - [Chapter 7: Testing](/chapters/chapter-07.md) 16 | - [Chapter 8: Monitoring](/chapters/chapter-08.md) 17 | - [Chapter 9: Security](/chapters/chapter-09.md) 18 | - [Chapter 10: Conway’s Law and System Design](/chapters/chapter-10.md) 19 | - [Chapter 11: Microservices at Scale](/chapters/chapter-11.md) 20 | - [Chapter 12: Bringing It All Together](/chapters/chapter-12.md) 21 | -------------------------------------------------------------------------------- /chapters/chapter-01.md: -------------------------------------------------------------------------------- 1 | # Chapter 1: Microservices 2 | 3 | --- 4 | ## What are Microservices? 5 | Microservices are small, autonomous services that work together. 6 | 7 | --- 8 | ## What does it mean “Small, and Focused on Doing One Thing Well”? How do we set service boundaries? 9 | Microservices take this same approach (cohesion, the Single Responsibility Principle) to independent services. We focus our service boundaries on business boundaries, making it obvious where code lives for a given piece of functionality.
And by keeping this service focused on an explicit boundary, we avoid the temptation for it to grow too large, with all the associated difficulties that this can introduce. 10 | 11 | --- 12 | ## Which is one of the aspects that helps us answer the “how small?” question? 13 | A strong factor in helping us answer “how small?” is how well the service aligns to team structures. If the codebase is too big to be managed by a small team, looking to break it down is very sensible. 14 | 15 | --- 16 | ## What does it mean for a microservice to be autonomous? 17 | Our microservice is a separate entity. It might be deployed as an isolated service on a platform as a service (PaaS), or it might be its own operating system process. 18 | 19 | --- 20 | ## Three properties of an Autonomous microservice 21 | - These services need to be able to change independently of each other, and be deployed by themselves without requiring consumers to change. 22 | - We need to think about what our services should expose, and what they should allow to be hidden. 23 | - If there is too much sharing, our consuming services become coupled to our internal representations. This decreases our autonomy, as it requires additional coordination with consumers when making changes. 24 | 25 | --- 26 | ## Key Benefit: Technology Heterogeneity? 27 | With a system composed of multiple, collaborating services, we can decide to use different technologies inside each one. This allows us to pick the right tool for each job, rather than having to select a more standardized, one-size-fits-all approach that often ends up being the lowest common denominator. 28 | 29 | --- 30 | ## How can we use the key benefit of “technology heterogeneity” to try new technologies? 31 | With a system consisting of multiple services, I have multiple new places in which to try out a new piece of technology. I can pick a service that is perhaps lowest risk and use the technology there, knowing that I can limit any potential negative impact.
32 | 33 | --- 34 | ## Key benefit: Resilience? 35 | If one component of a system fails, but that failure doesn’t cascade, you can isolate the problem and the rest of the system can carry on working. Service boundaries become your obvious bulkheads (walls). In a monolithic service, if the service fails, everything stops working. 36 | 37 | --- 38 | ## When it comes to resilience, what should we be careful of? Any new sources of failure? 39 | We do need to be careful, however. To ensure our microservice systems can properly embrace this improved resilience, we need to understand the new sources of failure that distributed systems have to deal with. Networks can and will fail, as will machines. 40 | 41 | --- 42 | ## Key benefit: Scaling? 43 | With a large, monolithic service, we have to scale everything together. One small part of our overall system is constrained in performance, but if that behavior is locked up in a giant monolithic application, we have to handle scaling everything as a piece. With smaller services, we can just scale those services that need scaling, allowing us to run other parts of the system on smaller, less powerful hardware. 44 | 45 | --- 46 | ## What is one of the issues of deploying small changes with monolithic systems? (Key benefit: Ease of deployment) 47 | A one-line change to a million-line-long monolithic application requires the whole application to be deployed in order to release the change. That could be a large-impact, high-risk deployment. In practice, large-impact, high-risk deployments end up happening infrequently due to understandable fear. Unfortunately, this means that our changes build up and build up between releases, until the new version of our application hitting production has masses of changes. And the bigger the delta between releases, the higher the risk that we’ll get something wrong! 48 | 49 | --- 50 | ## Which advantages do we have when using microservices for deployment?
51 | With microservices, we can make a change to a single service and deploy it independently of the rest of the system. This allows us to get our code deployed faster. If a problem does occur, it can be isolated quickly to an individual service, making fast rollback easy to achieve. It also means we can get our new functionality out to customers faster. 52 | 53 | --- 54 | ## How do microservices help? (Key benefit: Organizational Alignment) 55 | We know that smaller teams working on smaller codebases tend to be more productive. 56 | 57 | --- 58 | ## Which opportunities do we have when using microservices? (Key benefit: Composability) 59 | One of the key promises of distributed systems and service-oriented architectures is that we open up opportunities for reuse of functionality. With microservices, we allow for our functionality to be consumed in different ways for different purposes. 60 | 61 | --- 62 | ## List of all microservice key benefits 63 | - Technology Heterogeneity 64 | - Resilience 65 | - Scaling 66 | - Ease of Deployment 67 | - Organizational Alignment 68 | - Composability 69 | - Optimizing for Replaceability 70 | 71 | --- 72 | ## So is the microservice strategy a silver bullet? 73 | Of course it's not. If you’re coming from a monolithic system point of view, you’ll have to get much better at handling deployment, testing, and monitoring to unlock the benefits we’ve covered so far. You’ll also need to think differently about how you scale your systems and ensure that they are resilient. Don’t be surprised if things like distributed transactions start giving you headaches, either! 74 | 75 | --- 76 | 77 | -------------------------------------------------------------------------------- /chapters/chapter-02.md: -------------------------------------------------------------------------------- 1 | # Chapter 2: The Evolutionary Architect 2 | 3 | --- 4 | 5 | ## What should architects shift their focus away from? What should they focus on instead?
6 | 7 | Thus, our architects need to shift their thinking away from creating the perfect end product, and instead focus on helping create a **framework** in which the right systems can emerge, and continue to grow as we learn more. 8 | 9 | --- 10 | 11 | ## Why is "town planner" a more suitable role when referring to software architects? 12 | 13 | Erik Doernenburg first shared with me the idea that we should think of our role more as town planners than architects for the built environment. The role of the town planner should be familiar to any of you who have played SimCity before. 14 | The way he influences how the city evolves, though, is interesting. He does not say, “build this specific building there”; instead, he zones a city. 15 | 16 | --- 17 | 18 | ## What should a "town planner" be more worried about? (from the software architecture perspective) 19 | 20 | Rather than worrying too much about what happens in one zone (service), the town planner will instead spend far more time working out how people and utilities (data) move from one zone to another. 21 | 22 | --- 23 | 24 | ## In software, why are "cities" better representations than buildings? 25 | 26 | The comparison with software should be obvious. As our users use our software, we need to react and change. We cannot foresee everything that will happen, and so rather than plan for any eventuality, we should plan to allow for change by avoiding the urge to overspecify every last thing. 27 | 28 | --- 29 | 30 | ## Communication between services can get messy. Why does this happen? 31 | 32 | Between services is where things can get messy, however. If one service decides to expose REST over HTTP, another makes use of protocol buffers, and a third uses Java RMI, then integration can become a nightmare as consuming services have to understand and support multiple styles of interchange.
This is why I try to stick to the guideline that we should “be worried about what happens between the boxes, and be liberal in what happens inside.” 33 | 34 | --- 35 | 36 | ## What should we define when making decisions related to microservice architectures? 37 | 38 | In order to frame our decisions, we need to define: 39 | 40 | - Strategic Goals 41 | - Principles 42 | - Practices 43 | 44 | --- 45 | 46 | ## What are Strategic Goals? 47 | 48 | The role of the architect is already daunting enough, so luckily we usually don’t have to also define strategic goals! Strategic goals should speak to where your company is going, and how it sees itself as best making its customers happy. These will be high-level goals, and may not include technology at all. They could be defined at a company level or a division level. 49 | 50 | --- 51 | 52 | ## What are Principles? 53 | 54 | Principles are rules you have made in order to align what you are doing to some larger goal, and will sometimes change. 55 | 56 | --- 57 | 58 | ## What's the difference between Principles and Constraints? 59 | 60 | A constraint is really something that is very hard (or virtually impossible) to change, whereas principles are things we decide to choose (and can change over time, but not so frequently). 61 | 62 | --- 63 | 64 | ## What are Practices? 65 | 66 | Our practices are how we ensure our principles are being carried out. They are a set of detailed, practical guidance for performing tasks. They will often be technology-specific, and should be low-level enough that any developer can understand them. Practices could include coding guidelines, the fact that all log data needs to be captured centrally, or that HTTP/REST is the standard integration style. Due to their technical nature, practices will often change more often than principles. 67 | 68 | --- 69 | 70 | ## Is it OK to combine Principles and Practices? 71 | 72 | One person’s principles are another’s practices.
You might decide to call the use of HTTP/REST a principle rather than a practice, for example. And that would be fine. For a small enough group, perhaps a single team, combining principles and practices might be OK. However, for larger organizations, where the technology and working practices may differ, you may want a different set of practices in different places, as long as they all map to a common set of principles. 73 | 74 | --- 75 | 76 | ## Which "clear attributes" should we define for each service? 77 | 78 | - Monitoring 79 | - Interfaces 80 | - Architectural Safety 81 | 82 | --- 83 | 84 | ## When it comes to Monitoring, what is important to keep in mind? 85 | 86 | Whatever you pick, try to keep it standardized. Make the technology inside the box opaque, and don’t require that your monitoring systems change in order to support it. Logging falls into the same category here: we need it in one place. 87 | 88 | --- 89 | 90 | ## When it comes to Interfaces, what is important to keep in mind? 91 | 92 | Picking a small number of defined interface technologies helps integrate new consumers. Having one standard is a good number. Two isn’t too bad, either. Having 20 different styles of integration is bad. This isn’t just about picking the technology and the protocol. If you pick HTTP/REST, for example, will you use verbs or nouns? How will you handle pagination of resources? How will you handle versioning of endpoints? 93 | 94 | --- 95 | 96 | ## When it comes to Architectural Safety, what is important to keep in mind? 97 | 98 | Playing by the rules is important when it comes to response codes, too. If your circuit breakers rely on HTTP codes, and one service decides to send back 2XX codes for errors, or confuses 4XX codes with 5XX codes, then these safety measures can fall apart. 99 | 100 | --- 101 | 102 | ## Why are "exemplars" important to apply Principles and Practices? 103 | 104 | But developers also like code, and code they can run and explore.
If you have a set of standards or best practices you would like to encourage, then having exemplars that you can point people to is useful. The idea is that people can’t go far wrong just by imitating some of the better parts of your system. Ideally, these should be real-world services you have that get things right, rather than isolated services that are just implemented to be perfect examples. 105 | 106 | --- 107 | 108 | ## What's a Tailored Service Template? And why is it helpful? 109 | 110 | It's a group of libraries/templates/frameworks we implement and encourage teams to use, in order to make sure that the principles and practices we defined are being met. By using a "Tailored Service Template", developers can have most of the code in place, which will allow them to implement the core attributes that each service needs. This also ensures that teams can get going faster, and also that developers have to go out of their way to make their services badly behaved. 111 | 112 | --- 113 | 114 | ## When it comes to designing a service template, should it be the task of one developer/team? 115 | 116 | Ideally, it shouldn't. You do have to be careful that creating the service template doesn’t become the job of a central tools or architecture team who dictates how things should be done, albeit via 117 | code. Defining the practices you use should be a collective activity, so ideally your team(s) should take joint responsibility for updating this template (an internal open source approach works well here). 118 | 119 | --- 120 | 121 | ## Should a service template be optional or mandatory? 122 | 123 | Ideally, its use should be purely optional, but if you are going to be more forceful in its adoption you need to understand that ease of use for the developers has to be a prime guiding force. 124 | 125 | --- 126 | 127 | ## What does "governance through code" mean, and which ways do we have to ensure it?
128 | 129 | It's a way to facilitate the correct fulfillment of our principles and practices by providing tangible examples in code. This code can either come in the form of "exemplars" or "tailored service templates". 130 | 131 | --- 132 | 133 | ## What's the job of governance? 134 | 135 | If one of the architect’s jobs is ensuring there is a technical vision, then governance is about ensuring what we are building matches this vision, and evolving the vision if needed. 136 | 137 | --- 138 | 139 | ## How can we organize Governance? 140 | 141 | Normally, governance is a group activity. It could be an informal chat with a small enough team, or a more structured regular meeting with formal group membership for a larger scope. This is where I think the principles we covered earlier should be discussed and changed as required. This group needs to be led by a technologist, and to consist predominantly of people who are executing the work being governed. This group should also be responsible for tracking and managing technical risks. 142 | -------------------------------------------------------------------------------- /chapters/chapter-03.md: -------------------------------------------------------------------------------- 1 | # Chapter 3: How to Model Services 2 | 3 | --- 4 | 5 | ## What is Loose Coupling? 6 | 7 | When services are loosely coupled, a change to one service should not require a change to another. The whole point of a microservice is being able to make a change to one service and deploy it, without needing to change any other part of the system. This is really quite important. 8 | 9 | --- 10 | 11 | ## What sort of things cause tight coupling? 12 | 13 | A classic mistake is to pick an integration style that tightly binds one service to another, causing changes inside the service to require a change to consumers. 14 | 15 | --- 16 | 17 | ## What is desirable about the number of calls between microservices?
18 | 19 | This also means we probably want to limit the number of different types 20 | of calls from one service to another, because beyond the potential performance 21 | problem, chatty communication can lead to tight coupling. 22 | 23 | --- 24 | 25 | ## (High Cohesion) We want related behavior to sit together, and unrelated behavior to sit elsewhere. Why? 26 | 27 | Well, if we want to change behavior, we want to be able to change it in one place, and release that change as soon as possible. If we have to change that behavior in lots of different places, we’ll have to release lots of different services (perhaps at the same time) to deliver that change. Making changes in lots of different places is slower, and deploying lots of services at once is risky—both of which we want to avoid. 28 | 29 | --- 30 | 31 | ## In the MusicCorp example, which is the domain? And which are the bounded contexts? 32 | 33 | Let’s return for a moment to the MusicCorp business. Our domain is the whole business in which we are operating. It covers everything from the warehouse to the reception desk, from finance to ordering. We may or may not model all of that in our 34 | software, but that is nonetheless the domain in which we are operating. Let’s think 35 | about parts of that domain that look like the bounded contexts that Evans refers to. 36 | At MusicCorp, our warehouse is a hive of activity—managing orders being shipped 37 | out (and the odd return), taking delivery of new stock, having forklift truck races, and 38 | so on. Elsewhere, the finance department is perhaps less fun-loving, but still has a 39 | very important function inside our organization. These employees manage payroll, 40 | keep the company accounts, and produce important reports. Lots of reports. They 41 | probably also have interesting desk toys. 42 | 43 | --- 44 | 45 | ## Why are shared models important?
(Example: Warehouse and Finance departments) 46 | 47 | To be able to work out the valuation of the company, though, the finance employees 48 | need information about the stock we hold. The stock item then becomes a shared 49 | model between the two contexts. However, note that we don’t need to blindly expose 50 | everything about the stock item from the warehouse context. For example, although 51 | internally we keep a record on a stock item as to where it should live within the warehouse, that doesn’t need to be exposed in the shared model. So there is the internal-only representation, and the external representation we expose. 52 | 53 | --- 54 | 55 | ## We need to think clearly about which models should be shared. Why? 56 | 57 | By thinking clearly about what models should be shared, and not sharing our internal 58 | representations, we avoid one of the potential pitfalls that can result in tight coupling 59 | (the opposite of what we want). We have also identified a boundary within our 60 | domain where all like-minded business capabilities should live, giving us the high 61 | cohesion we want. 62 | 63 | --- 64 | 65 | ## Why is it not a good idea to decompose your system into microservices from the beginning? 66 | 67 | Prematurely decomposing a system into microservices can be costly, especially if you are new to the domain. In many ways, having an existing codebase you want to decompose into microservices is much easier than trying to go to microservices from the beginning. 68 | 69 | --- 70 | 71 | ## What questions do we have to ask when deciding which models we should share? 72 | 73 | So ask first “What does this context do?”, and then “So what data 74 | does it need to do that?” When modeled as services, these capabilities become the key operations that will be exposed over the wire to other collaborators. 75 | 76 | --- 77 | 78 | ## What do we have to do when deciding the boundaries of our microservices?
79 | 80 | When considering the boundaries of your microservices, first think in terms of the larger, coarser-grained contexts, and then subdivide along these nested contexts when you’re looking for the benefits of splitting out these seams. 81 | 82 | --- 83 | 84 | ## Nested approach or full separation approach? What should we consider to decide which one of those two? 85 | 86 | In general, there isn’t a hard-and-fast rule as to what approach makes the most sense. 87 | However, whether you choose the nested approach over the full separation approach 88 | should be based on your organizational structure. If order fulfillment, inventory 89 | management, and goods receiving are managed by different teams, they probably 90 | deserve their status as top-level microservices. If, on the other hand, all of them are 91 | managed by one team, then the nested model makes more sense. 92 | 93 | --- 94 | 95 | ## When modeling microservices, what is important to keep in mind when it comes to communication and how it relates to the business concepts? 96 | 97 | The same terms and ideas that are shared between parts of your organization should be reflected in your interfaces. It can be useful to think of forms being sent between these microservices, much as forms are sent around an organization. 98 | 99 | --- 100 | 101 | ## Is it correct to model service boundaries along technical seams? 102 | 103 | Making decisions to model service boundaries along technical seams isn’t always 104 | wrong. I have certainly seen this make lots of sense when an organization is looking 105 | to achieve certain performance objectives, for example. However, it should be your 106 | secondary driver for finding these seams, not your primary one. 
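The warehouse/finance shared-model idea from earlier in this chapter can be sketched in code. This is a minimal, illustrative JavaScript sketch (the field names, prices, and the `toSharedStockItem` helper are my own assumptions, not from the book): the warehouse keeps an internal-only representation, and only a deliberately limited shared model crosses the context boundary.

```javascript
// Internal representation inside the warehouse bounded context.
// shelfLocation is a warehouse-only concern and must never leak out.
const internalStock = [
  { sku: 'CD-001', title: 'Give Blood', quantity: 120, unitPrice: 9.99, shelfLocation: 'A-17' },
  { sku: 'CD-002', title: 'Best Of', quantity: 45, unitPrice: 7.5, shelfLocation: 'B-03' },
];

// The shared model exposed to the finance context: only what finance
// needs to value the stock, nothing about warehouse internals.
function toSharedStockItem(item) {
  return { sku: item.sku, quantity: item.quantity, unitPrice: item.unitPrice };
}

const sharedModel = internalStock.map(toSharedStockItem);

// Finance can work out a stock valuation from the shared model alone,
// so warehouse internals can change without coordinating with finance.
const valuation = sharedModel.reduce((sum, item) => sum + item.quantity * item.unitPrice, 0);
```

Because consumers only ever see `sharedModel`, the warehouse is free to rename or restructure its internal record (the loose coupling this chapter argues for) without breaking finance.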
107 | -------------------------------------------------------------------------------- /chapters/chapter-04.md: -------------------------------------------------------------------------------- 1 | # Chapter 4: Integration 2 | 3 | --- 4 | 5 | ## How should we keep the APIs used for communication between microservices? 6 | 7 | It is also why I think it is very important to ensure that you keep the APIs used 8 | for communication between microservices technology-agnostic. This means avoiding 9 | integration technology that dictates what technology stacks we can use to implement 10 | our microservices. 11 | 12 | --- 13 | 14 | ## What are the problems of "Database Integration" when using microservices? 15 | 16 | Remember when we talked about the core principles behind good microservices? 17 | Strong cohesion and loose coupling—with database integration, we lose both things. 18 | Database integration makes it easy for services to share data, but does nothing about 19 | sharing behavior. Our internal representation is exposed over the wire to our consumers, and it can be very difficult to avoid making breaking changes, which inevitably 20 | leads to a fear of any change at all. Avoid at (nearly) all costs. 21 | 22 | --- 23 | 24 | ## How can we classify the types of communication we use with microservices? 25 | 26 | - Synchronous communication 27 | - Asynchronous communication 28 | - Including: event-based collaboration 29 | 30 | --- 31 | 32 | ## Explain Orchestration and Choreography of microservices 33 | 34 | With orchestration, we rely on a central brain to guide and drive the 35 | process, much like the conductor in an orchestra. With choreography, we inform 36 | each part of the system of its job, and let it work out the details, like dancers all finding their way and reacting to others around them in a ballet. 37 | 38 | --- 39 | 40 | ## Which is the downside of the orchestration approach? 
41 | 42 | The downside to this orchestration approach is that the customer service can become 43 | too much of a central governing authority. It can become the hub in the middle of a 44 | web, and a central point where logic starts to live. I have seen this approach result in a 45 | small number of smart “god” services telling anemic CRUD-based services what to 46 | do. 47 | 48 | --- 49 | 50 | ## How can we apply the choreography approach? 51 | 52 | With a choreographed approach, we could instead just have the customer service 53 | emit an event in an asynchronous manner, saying Customer created. The email service, postal service, and loyalty points bank then just subscribe to these events and 54 | react accordingly. This approach is significantly more decoupled. If 55 | some other service needed to react to the creation of a customer, it just needs to subscribe to the events and do its job when needed. 56 | 57 | --- 58 | 59 | ## Which is the downside of the choreography approach? 60 | 61 | The downside is that the explicit 62 | view of the business process we saw before is now only implicitly reflected in 63 | our system. This means additional work is needed to ensure that you can monitor and track that 64 | the right things have happened. 65 | 66 | --- 67 | 68 | ## How can we tackle the downsides of the choreography approach? 69 | 70 | One approach I like for dealing with this is to build a monitoring system that explicitly matches the view of the business process, but then tracks what each of the services does as independent entities, letting you see odd exceptions mapped onto the more 71 | explicit process flow. 72 | 73 | --- 74 | 75 | ## When using microservices, should we trust networks? 76 | 77 | You need to think about the network itself. Famously, the first of the fallacies of distributed computing is “The network is reliable”. Networks aren’t reliable. They can 78 | and will fail, even if your client and the server you are speaking to are fine.
They can 79 | fail fast, they can fail slow, and they can even malform your packets. You should 80 | assume that your networks are plagued with malevolent entities ready to unleash 81 | their ire on a whim. 82 | 83 | --- 84 | 85 | ## Which is the key challenge when using RPC mechanisms? 86 | 87 | This is a key challenge with any RPC mechanism that promotes the 88 | use of binary stub generation: you don’t get to separate client and server deployments. 89 | If you use this technology, lock-step releases may be in your future. (This means: tight coupling.) 90 | 91 | --- 92 | 93 | ## How should we design our remote calls when applying RPC? 94 | 95 | Don’t abstract your remote calls to the point where the network is 96 | completely hidden, and ensure that you can evolve the server interface without having to insist on lock-step upgrades for clients (tight coupling). Finding the right balance for your client code is important, for example. Make sure your clients aren’t oblivious to the fact 97 | that a network call is going to be made. 98 | 99 | --- 100 | 101 | ## In REST, what does a "Resource" mean? 102 | 103 | Most important is the concept of resources. You can think of a resource as a thing 104 | that the service itself knows about, like a `Customer`. The server creates different representations of this `Customer` on request. How a resource is shown externally is completely decoupled from how it is stored internally. A client might ask for a JSON 105 | representation of a `Customer`, for example, even if it is stored in a completely different 106 | format. Once a client has a representation of this `Customer`, it can then make requests 107 | to change it, and the server may or may not comply with them. 108 | 109 | --- 110 | 111 | ## What's the idea behind HATEOAS? 112 | 113 | The idea behind HATEOAS is that clients should perform interactions (potentially leading to state transitions) with the server via these links (hypermedia controls) to other resources.
The client doesn’t need to know exactly where customers live on the server, or which URI to hit; instead, it looks for and navigates links to find what it needs. 114 | 115 | ## Explain this HATEOAS example... 116 | 117 | ```xml 118 | <album> 119 | <name>Give Blood</name> 120 | <link rel="/artist" href="/artist/theBrakes" /> 121 | <description> 122 | Awesome, short, brutish, funny and loud. Must buy! 123 | </description> 124 | <link rel="/instantpurchase" href="/instantPurchase/1234" /> 125 | </album> 126 | ``` 127 | 128 | - This hypermedia control shows us where to find information about the artist. 129 | - And if we want to purchase the album, we now know where to go. 130 | 131 | --- 132 | 133 | ## What's one of the benefits of using HATEOAS? 134 | 135 | Using these controls to decouple the client and server yields significant benefits over 136 | time that greatly offset the small increase in the time it takes to get these protocols up 137 | and running. By following the links, the client gets to progressively discover the API, 138 | which can be a really handy capability when we are implementing new clients. 139 | 140 | --- 141 | 142 | ## What's one of the downsides of using HATEOAS? 143 | 144 | One of the downsides is that this navigation of controls can be quite chatty, as the 145 | client needs to follow links to find the operation it wants to perform. 146 | 147 | --- 148 | 149 | ## What pattern can we use when deciding how to store our data? 150 | 151 | There is a more general problem at play here. How we decide to store our data, and 152 | how we expose it to our consumers, can easily dominate our thinking. One pattern I 153 | saw used effectively by one of our teams was to delay the implementation of proper 154 | persistence for the microservice, until the interface had stabilized enough. 155 | 156 | --- 157 | 158 | ## Book for learning more about REST 159 | 160 | Despite these disadvantages, REST over HTTP is a sensible default choice for service-to-service interactions. If you want to know more, I recommend REST in Practice 161 | (O’Reilly), which covers the topic of REST over HTTP in depth.
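The link-following behind HATEOAS can be sketched from the client's side. This is an illustrative JavaScript sketch, not from the book: the album object mirrors the shape of the chapter's hypermedia example, and the `follow` helper shows a client that knows only the semantics of each `rel`, never the URI layout.

```javascript
// A representation of the album resource, with hypermedia controls.
// The server is free to change these hrefs later; clients don't hard-code them.
const album = {
  name: 'Give Blood',
  links: [
    { rel: '/artist', href: '/artist/theBrakes' },
    { rel: '/instantpurchase', href: '/instantPurchase/1234' },
  ],
};

// The client navigates by rel, progressively discovering the API.
function follow(resource, rel) {
  const link = resource.links.find((l) => l.rel === rel);
  return link ? link.href : null;
}

const artistUri = follow(album, '/artist'); // where to learn about the artist
const purchaseUri = follow(album, '/instantpurchase'); // where to buy the album
```

This also makes the downside visible: every operation starts with a lookup through the controls, which is why HATEOAS navigation can get chatty.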
---

## How should we keep our middleware? (message brokers, queues)

However, vendors tend to want to package lots of software with them, which can lead to more and more smarts being pushed into the middleware, as evidenced by things like the Enterprise Service Bus. Make sure you know what you’re getting: keep your middleware dumb, and keep the smarts in the endpoints.

---

## Which strategy did Sam Newman use to view bad messages in the pricing system he worked on?

Aside from the bug itself, we’d failed to specify a maximum retry limit for the job on the queue. We fixed the bug itself, and also configured a maximum retry. But we also realized we needed a way to view, and potentially replay, these bad messages. We ended up having to implement a message hospital (or dead letter queue), where messages got sent if they failed. We also created a UI to view those messages and retry them if needed. These sorts of problems aren’t immediately obvious if you are only familiar with synchronous point-to-point communication.

---

## What should we ensure to have if we end up adopting event-driven architectures?

The complexity associated with event-driven architectures and asynchronous programming in general leads me to believe that you should be cautious in how eagerly you start adopting these ideas. Ensure you have good monitoring in place, and strongly consider the use of correlation IDs, which allow you to trace requests across process boundaries.

---

## What happens when you introduce shared code outside your service boundary, and how does RealEstate.com.au deal with it?

If your use of shared code ever leaks outside your service boundary, you have introduced a potential form of coupling.
Using common code like logging libraries is fine, as they are internal concepts that are invisible to the outside world. RealEstate.com.au makes use of a tailored service template to help bootstrap new service creation. Rather than make this code shared, the company copies it for every new service to ensure that coupling doesn’t leak in.

---

## What's the general rule of thumb when sharing code in microservices? (DRY)

My general rule of thumb: don’t violate DRY within a microservice, but be relaxed about violating DRY across all services.

---

## What's the problem with logic creeping into a client library?

The more logic that creeps into the client library, the more cohesion starts to break down, and you find yourself having to change multiple clients to roll out fixes to your server.

---

## What model does AWS use for client libraries?

A model for client libraries I like is the one for Amazon Web Services (AWS). The underlying SOAP or REST web service calls can be made directly, but everyone ends up using just one of the various software development kits (SDKs) that exist, which provide abstractions over the underlying API. These SDKs, though, are written by the community or AWS people other than those who work on the API itself. This degree of separation seems to work, and avoids some of the pitfalls of client libraries. Part of the reason this works so well is that the client is in charge of when the upgrade happens. If you go down the path of client libraries yourself, make sure this is the case.

---

## What could happen after we retrieve a resource from a specific service? (e.g., a Customer resource)

When we retrieve a given Customer resource from the customer service, we get to see what that resource looked like when we made the request.
It is possible that after we requested that Customer resource, something else has changed it. What we have in effect is a memory of what the Customer resource once looked like. The longer we hold on to this memory, the higher the chance that this memory will be false.

---

## In event-based collaboration, what could be valuable to have if we want to know what happened to a specific resource? (e.g., a Customer resource)

With events, we’re saying this happened, but we need to know what happened. If we’re receiving updates due to a Customer resource changing, for example, it could be valuable to us to know what the Customer looked like when the event occurred. As long as we also get a reference to the entity itself so we can look up its current state, then we can get the best of both worlds.

---

## Which mechanisms can we use to reduce load on our services when we need to retrieve resource information?

If we provide additional information when the resource is retrieved, letting us know at what time the resource was in the given state and perhaps how long we can consider this information to be fresh, then we can do a lot with caching to reduce load.

---

## What does Postel's Law consist of?

The example of a client trying to be as flexible as possible in consuming a service demonstrates Postel’s Law (otherwise known as the robustness principle), which states: “Be conservative in what you do, be liberal in what you accept from others.”

---

## Explain semantic versioning. What is it about?

With semantic versioning, each version number is in the form MAJOR.MINOR.PATCH. When the MAJOR number increments, it means that backward incompatible changes have been made. When MINOR increments, new functionality has been added that should be backward compatible.
Finally, a change to PATCH states that bug fixes have been made to existing functionality.

---

## Explain a simple use case when working with semantic versioning

Our helpdesk application is built to work against version 1.2.0 of the customer service. If a new feature is added, causing the customer service to change to 1.3.0, our helpdesk application should see no change in behavior and shouldn’t be expected to make any changes. We couldn’t guarantee that we could work against version 1.1.0 of the customer service, though, as we may rely on functionality added in the 1.2.0 release. We could also expect to have to make changes to our application if a new 2.0.0 release of the customer service comes out.

---

## When introducing a breaking interface change, how can we use our endpoints to handle this issue?

One approach I have used successfully to handle this is to have both the old and new interfaces coexist in the same running service. So if we want to release a breaking change, we deploy a new version of the service that exposes both the old and new versions of the endpoint.

---

## When coexisting different endpoint versions, things can get messy, with a lot of duplicated code, additional tests, etc. How can we work around this?

To make this more manageable, we internally transformed all requests to the V1 endpoint to a V2 request, and then V2 requests to the V3 endpoint. This meant we could clearly delineate what code was going to be retired when the old endpoint(s) died.

---

## When introducing a breaking interface change, how can we use our services to handle this issue?

Another versioning solution often cited is to have different versions of the service live at once, and for older consumers to route their traffic to the older version, with newer versions seeing the new one.
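The endpoint-coexistence idea from the answers above (internally turning V1 requests into V2 requests, and V2 into V3) can be sketched as a chain of transformations. The field names and defaulting rule here are invented for illustration:

```javascript
// Sketch of coexisting V1/V2/V3 endpoints in one service. Only the V3
// handler holds real logic; older endpoints transform and delegate, so
// it is clear which code dies when an old endpoint is retired.
// All field names below are hypothetical.

// V1 sent a single "name" field; V2 split it into first and last name.
function v1ToV2(request) {
  const [firstName, ...rest] = request.name.split(' ');
  return { firstName, lastName: rest.join(' ') };
}

// V3 added an explicit country code; legacy requests get a default.
function v2ToV3(request) {
  return { ...request, countryCode: request.countryCode || 'GB' };
}

function handleV3(request) {
  return `Created customer ${request.firstName} ${request.lastName} (${request.countryCode})`;
}

// Old endpoints are thin adapters in front of the newest handler.
const handleV2 = (request) => handleV3(v2ToV3(request));
const handleV1 = (request) => handleV2(v1ToV2(request));
```

Retiring the V1 endpoint then means deleting `handleV1` and `v1ToV2` and nothing else.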
---

## What problems could we have when using multiple concurrent service versions?

First, if I need to fix an internal bug in my service, I now have to fix and deploy two different sets of services. This would probably mean I have to branch the codebase for my service, and this is always problematic. Second, it means I need smarts to handle directing consumers to the right microservice.

---

## What could be a good rule of thumb when deciding which approach to take for breaking interface changes?

The longer it takes for you to get consumers upgraded to the newer version and released, the more you should look to coexist different endpoints in the same microservice rather than coexist entirely different versions.

---

## When it comes to our user interfaces, how should we adapt our core services?

So, although our core services —our core offering— might be the same, we need a way to adapt them for the different constraints that exist for each type of interface.

---

## Explain UI Fragment Composition

Rather than having our UI make API calls and map everything back to UI controls, we could have our services provide parts of the UI directly, and then just pull these fragments in to create a UI. Imagine, for example, that the recommendation service provides a recommendation widget that is combined with other controls or UI fragments to create an overall UI. It might get rendered as a box on a web page along with other content.

---

## What's an API gateway?

A common solution to the problem of chatty interfaces with backend services, or the need to vary content for different types of devices, is to have a server-side aggregation endpoint, or API gateway.
This can marshal multiple backend calls, vary and aggregate content if needed for different devices, and serve it up.

---

## What problems could we have when using a single API gateway, and which pattern can we use to resolve them?

The problem that can occur is that normally we’ll have one giant layer for all our services. This leads to everything being thrown in together, and suddenly we start to lose isolation of our various user interfaces, limiting our ability to release them independently. A model I prefer and that I’ve seen work well is to restrict the use of these backends to one specific user interface or application, as we see in Figure 4-10.

![image](https://user-images.githubusercontent.com/1868409/85929062-b11c4a00-b87f-11ea-9d0e-9019a71740da.png)

---

## Which dangers do we have when using BFFs?

The danger with this approach is the same as with any aggregating layer; it can take on logic it shouldn’t. The business logic for the various capabilities these backends use should stay in the services themselves. These BFFs should only contain behavior specific to delivering a particular user experience.

---

## When deciding whether to include new software... “Should I build, or should I buy?”

My clients often struggle with the question “Should I build, or should I buy?” In general, the advice I and my colleagues give when having this conversation with the average enterprise organization boils down to “Build if it is unique to what you do, and can be considered a strategic asset; buy if your use of the tool isn’t that special.”

---

## How can we work around CMSes in our microservice systems?

The answer? Front the CMS with your own service that provides the website to the outside world, as shown in Figure 4-11.
Treat the CMS as a service whose role is to allow for the creation and retrieval of content. In your own service, you write the code and integrate with services how you want. You have control over scaling the website (many commercial CMSes provide their own proprietary add-ons to handle load), and you can pick the templating system that makes sense.

![image](https://user-images.githubusercontent.com/1868409/85929253-5388fd00-b881-11ea-99c7-5614f6c4e879.png)

---

## How can we work around CRMs in our microservice systems?

The first thing we did was identify the core concepts of our domain that the CRM system currently owned. One of these was the concept of a project — that is, something to which a member of staff could be assigned. Multiple other systems needed project information. What we did was instead create a project service. This service exposed projects as RESTful resources, and the external systems could move their integration points over to the new, easier-to-work-with service. Internally, the project service was just a façade, hiding the detail of the underlying integration (the CRM itself). You can see this in Figure 4-12.

![image](https://user-images.githubusercontent.com/1868409/85929360-1bce8500-b882-11ea-966d-c705e266fbf1.png)

---

## What is the Strangler Application Pattern?

A useful pattern here is the Strangler Application Pattern. Much like with our example of fronting the CMS system with our own code, with a strangler you capture and intercept calls to the old system. This allows you to decide whether you route these calls to existing legacy code or direct them to new code you may have written. This allows you to replace functionality over time without requiring a big-bang rewrite.
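The interception step of the Strangler Application Pattern can be sketched as a routing decision sitting in front of both systems (the paths and handlers here are hypothetical):

```javascript
// Sketch of a strangler facade: intercept each call and decide whether
// it is served by new code or falls through to the legacy system.
// The migrated paths and handlers are hypothetical.

const migratedPaths = new Set(['/projects', '/customers']);

function routeRequest(path, newSystem, legacySystem) {
  // Functionality moves over path by path; anything not yet migrated
  // keeps hitting the old system, so there is no big-bang cutover.
  return migratedPaths.has(path) ? newSystem(path) : legacySystem(path);
}

// Stand-ins for the two systems, just for demonstration.
const newSystem = (path) => `new:${path}`;
const legacySystem = (path) => `legacy:${path}`;
```

As more functionality is rewritten, entries are added to `migratedPaths` until the legacy system receives no traffic and can be retired.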
---

-------------------------------------------------------------------------------- /chapters/chapter-05.md: --------------------------------------------------------------------------------

# Chapter 5: Splitting the Monolith

---

## What are seams and why are they important to identify?

In his book Working Effectively with Legacy Code (Prentice-Hall), Michael Feathers defines the concept of a seam — that is, a portion of the code that can be treated in isolation and worked on without impacting the rest of the codebase. We also want to identify seams. But rather than finding them for the purpose of cleaning up our codebase, we want to identify seams that can become service boundaries.

---

## What are the two main steps we need to take when breaking apart a monolith?

Imagine we have a large backend monolithic service that represents a substantial amount of the behavior of MusicCorp’s online systems. To start, we should identify the high-level bounded contexts that we think exist in our organization, as we discussed in Chapter 3. Then we want to try to understand what bounded contexts the monolith maps to.

---

## What tool can we use to identify some database-level constraints?

This doesn’t give us the whole story, however. For example, we may be able to tell that the finance code uses the ledger table, and that the catalog code uses the line item table, but it might not be clear that the database enforces a foreign key relationship from the ledger table to the line item table. To see these database-level constraints, which may be a stumbling block, we need to use another tool to visualize the data. A great place to start is to use a tool like the freely available SchemaSpy, which can generate graphical representations of the relationships between tables.
---

## We have this situation: the finance context wants to get some data from the "Line items" table, which belongs to a different bounded context. The monolith can access that table directly, but how can we work around this issue using microservices?

![image](https://user-images.githubusercontent.com/1868409/86134257-64bb4f00-bab7-11ea-983a-9565f67f51d1.png)

So how do we fix things here? Well, we need to make a change in two places. First, we need to stop the finance code from reaching into the line item table, as this table really belongs to the catalog code, and we don’t want database integration happening once catalog and finance are services in their own rights. The quickest way to address this is, rather than having the code in finance reach into the line item table, to expose the data via an API call in the catalog package that the finance code can call. This API call will be the forerunner of a call we will make over the wire, as we see in Figure 5-3.

![image](https://user-images.githubusercontent.com/1868409/86134441-9fbd8280-bab7-11ea-8a0f-5d49571d913a.png)

---

## How can we share static data (e.g., country codes) using microservices? (Hint: There are three ways)

- Well, we have a few options. One is to duplicate this table for each of our packages, with the long-term view that it will be duplicated within each service also. (Downside: updating multiple tables when new static data is available.)
- A second option is to instead treat this shared, static data as code. Perhaps it could be in a property file deployed as part of the service, or perhaps just as an enumeration. (Downside: same as before, but at least updating code has proven to be easier than updating a DB table/collection.)
- A third option, which may well be extreme, is to push this static data into a service in its own right.
(Downside: an overkill solution when the amount of static data is not that big and not that complex.)

---

## We have the following situation: the finance and the warehouse code are writing to, and probably occasionally reading from, the same table. How can we tease this apart?

![image](https://user-images.githubusercontent.com/1868409/86135819-4d7d6100-bab9-11ea-831b-96113067b387.png)

We need to make the current abstract concept of the customer concrete. As a transient step, we create a new package called Customer. We can then use an API to expose Customer code to other packages, such as finance or warehouse. Rolling this all the way forward, we may now end up with a distinct customer service (Figure 5-6).

![image](https://user-images.githubusercontent.com/1868409/86136008-8a495800-bab9-11ea-9ec7-f3ba7125c2c0.png)

---

## We have the following situation: our catalog needs to store the name and price of the records we sell, and the warehouse needs to keep an electronic record of inventory. We decide to keep these two things in the same place in a generic line item table. How can we break this down?

![image](https://user-images.githubusercontent.com/1868409/86195684-7df7e600-bb1f-11ea-87b5-8847e4ac01f5.png)

The answer here is to split the table in two as we have in Figure 5-8, perhaps creating a stock list table for the warehouse, and a catalog entry table for the catalog details.

![image](https://user-images.githubusercontent.com/1868409/86195790-ac75c100-bb1f-11ea-92ef-2b52e7d8593a.png)

---

## What's the best way to split our schemas so that we can avoid doing a "big-bang release"?

What next? Do you do a big-bang release, going from one monolithic service with a single schema to two services, each with its own schema?
I would actually recommend that you split out the schema but keep the service together before splitting the application code out into separate microservices, as shown in Figure 5-9.

![image](https://user-images.githubusercontent.com/1868409/86196115-856bbf00-bb20-11ea-939a-aa8a6b935646.png)

---

## Which benefits do we get by splitting our schemas first (staging the break)?

By splitting the schemas out but keeping the application code together, we give ourselves the ability to revert our changes or continue to tweak things without impacting any consumers of our service. Once we are satisfied that the DB separation makes sense, we can then think about splitting out the application code into two services.

---

## What's eventual consistency?

In many ways, this is another form of what is called eventual consistency. Rather than using a transactional boundary to ensure that the system is in a consistent state when the transaction completes, instead we accept that the system will get itself into a consistent state at some point in the future. This approach is especially useful with business operations that might be long-lived.

---

## What's the "two-phase commit" algorithm about?

The most common algorithm for handling distributed transactions — especially short-lived transactions, as in the case of handling our customer order — is to use a two-phase commit. With a two-phase commit, first comes the voting phase. This is where each participant (also called a cohort in this context) in the distributed transaction tells the transaction manager whether it thinks its local transaction can go ahead. If the transaction manager gets a yes vote from all participants, then it tells them all to go ahead and perform their commits. A single no vote is enough for the transaction manager to send out a rollback to all parties.
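The voting-then-commit flow described above can be sketched as a toy, in-process illustration (the participant shape is hypothetical; real implementations must also deal with timeouts, crashes, and recovery):

```javascript
// Toy sketch of two-phase commit, as described above.

function twoPhaseCommit(participants) {
  // Phase 1 (voting): each participant says whether its local
  // transaction can go ahead.
  const allYes = participants.every((p) => p.canCommit());
  if (allYes) {
    // Phase 2: everyone voted yes, so tell them all to commit.
    participants.forEach((p) => p.commit());
    return 'committed';
  }
  // A single no vote triggers a rollback for all parties.
  participants.forEach((p) => p.rollback());
  return 'rolled back';
}

// A fake participant (cohort) that records the outcome it was given.
function makeParticipant(votesYes) {
  return {
    state: 'pending',
    canCommit: () => votesYes,
    commit() { this.state = 'committed'; },
    rollback() { this.state = 'rolled back'; },
  };
}
```

Note that `twoPhaseCommit` blocks on every participant answering its vote, which is exactly the outage vulnerability the book goes on to discuss.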
---

## What's one of the downsides of using the "two-phase commit" algorithm?

This approach relies on all parties halting until the central coordinating process tells them to proceed. This means we are vulnerable to outages. If the transaction manager goes down, the pending transactions never complete. If a cohort fails to respond during voting, everything blocks. And there is also the case of what happens if a commit fails after voting.

---

## Should we implement our own algorithm for distributed transactions?

I’d suggest you avoid trying to create your own. Instead, do lots of research on this topic if this seems like the route you want to take, and see if you can use an existing implementation.

---

## What do we do when we encounter state that we really want to keep consistent?

If you do encounter state that really, really wants to be kept consistent, do everything you can to avoid splitting it up in the first place. Try really hard. If you really need to go ahead with the split, think about moving from a purely technical view of the process (e.g., a database transaction) and actually create a concrete concept to represent the transaction itself.

---

## (Reporting) What is the "Data Retrieval via Service Calls" strategy about?

There are many variants of this model, but they all rely on pulling the required data from the source systems via API calls. For a very simple reporting system, like a dashboard that might just want to show the number of orders placed in the last 15 minutes, this might be fine. To report across data from two or more systems, you need to make multiple calls to assemble this data.

---

## When processing big volumes of data, which approach can we use to avoid returning a huge data response?
For example, the customer service might expose something like a `BatchCustomerExport` resource endpoint. The calling system would POST a `BatchRequest`, perhaps passing in a location where a file can be placed with all the data. The customer service would return an HTTP 202 response code, indicating that the request was accepted but has not yet been processed. The calling system could then poll the resource waiting until it retrieves a 201 Created status, indicating that the request has been fulfilled, and then the calling system could go and fetch the data. This would allow potentially large data files to be exported without the overhead of being sent over HTTP; instead, the system could simply save a CSV file to a shared location.

---

## (Reporting) What is the "Data Pump" strategy about?

An alternative option is to have a standalone program that directly accesses the database of the service that is the source of data, and pumps it into a reporting database, as shown in Figure 5-13.

![image](https://user-images.githubusercontent.com/1868409/86518917-58aaf680-be03-11ea-8a98-9cade67ad130.png)

---

## How can we mitigate the tight-coupling and low-cohesion downsides that come with using the "Data Pump" strategy?

To start with, the data pump should be built and managed by the same team that manages the service. We try to reduce the problems with coupling to the service’s schema by having the same team that manages the service also manage the pump. I would suggest, in fact, that you version-control these together, and have builds of the data pump created as an additional artifact as part of the build of the service itself, with the assumption that whenever you deploy one of them, you deploy them both.
---

## (Reporting) What is the "Event Data Pump" strategy about?

In Chapter 4, we touched on the idea of microservices emitting events based on the state change of entities that they manage. For example, our customer service may emit an event when a given customer is created, or updated, or deleted. For those microservices that expose such event feeds, we have the option of writing our own event subscriber that pumps data into the reporting database, as shown in Figure 5-15.

![image](https://user-images.githubusercontent.com/1868409/86519011-79278080-be04-11ea-8ac8-00d32ed53347.png)

---

## What's the main downside of using the "Event Data Pump" strategy?

The main downsides to this approach are that all the required information must be broadcast as events, and it may not scale as well for larger volumes of data as a data pump, which has the benefit of operating directly at the database level. Nonetheless, the looser coupling and fresher data available via such an approach make it strongly worth considering if you are already exposing the appropriate events.

---

## What solution does Netflix implement using "Backup Data Pumps"?

In the end, Netflix ended up implementing a pipeline capable of processing large amounts of data using this approach, which it then open sourced as the **Aegisthus project**.

---

## Which technique can we use to mitigate the cost of change associated with splitting a monolith?

A great technique here is to adapt an approach more typically taught for the design of object-oriented systems: class-responsibility-collaboration (CRC) cards. With CRC cards, you write on one index card the name of the class, what its responsibilities are, and who it collaborates with.
When working through a proposed design, for each service I list its responsibilities in terms of the capabilities it provides, with the collaborators specified in the diagram. As you work through more use cases, you start to get a sense as to whether all of this hangs together properly.

---

## Part of the problem is knowing where to start, and I’m hoping this chapter has helped. But another challenge is the cost associated with splitting out services. Finding somewhere to run the service, spinning up a new service stack, and so on, are non-trivial tasks. So how do we address this?

Well, if doing something is right but difficult, we should strive to make things easier. Investment in libraries and lightweight service frameworks can reduce the cost associated with creating the new service. Giving people access to self-service provisioning of virtual machines or even making a platform as a service (PaaS) available will make it easier to provision systems and test them.

-------------------------------------------------------------------------------- /chapters/chapter-06.md: --------------------------------------------------------------------------------

# Chapter 6: Deployment

---

## What's the core goal of CI?

With CI, the core goal is to keep everyone in sync with each other, which we achieve by making sure that newly checked-in code properly integrates with existing code. To do this, a CI server detects that the code has been committed, checks it out, and carries out some verification like making sure the code compiles and that tests pass.

---

## What benefits do we get with CI? (Hint: there are 3)

CI has a number of benefits:

- We get some level of fast feedback as to the quality of our code.
- It allows us to automate the creation of our binary artifacts.
All the code required to build the artifact is itself version controlled, so we can re-create the artifact if needed.
- We also get some level of traceability from a deployed artifact back to the code, and depending on the capabilities of the CI tool itself, can see what tests were run on the code and artifact too.

---

## What questions do we have to ask ourselves to see if we understand CI?

- Do you check in to mainline once per day?
- Do you have a suite of tests to validate your changes?
- When the build is broken, is it the #1 priority of the team to fix it?

---

## According to Jez Humble and Dave Farley’s book, what's Continuous Delivery (CD)?

Continuous delivery (CD) builds on this concept, and then some. As outlined in Jez Humble and Dave Farley’s book of the same name, continuous delivery is the approach whereby we get constant feedback on the production readiness of each and every check-in, and furthermore treat each and every check-in as a release candidate.

---

## What's "configuration drift"?

By storing all our configuration in source control, we are trying to ensure that we can automatically reproduce services and hopefully entire environments at will. But once we run our deployment process, what happens if someone comes along, logs into the box, and changes things independently of what is in source control? This problem is often called configuration drift — the code in source control no longer reflects the configuration of the running host.

---

## How does PaaS work?

When using a platform as a service (PaaS), you are working at a higher-level abstraction than at a single host. Most of these platforms rely on taking a technology-specific artifact, such as a Java WAR file or Ruby gem, and automatically provisioning and running it for you.
Some of these platforms will transparently attempt to handle scaling the system up and down for you, although a more common (and in my experience less error-prone) way will allow you some control over how many nodes your service might run on, but it handles the rest.

---

## Disadvantages of using PaaS solutions

When PaaS solutions work well, they work very well indeed. However, when they don’t quite work for you, you often don’t have much control in terms of getting under the hood to fix things. This is part of the trade-off you make. I would say that in my experience the smarter the PaaS solutions try to be, the more they go wrong. I’ve used more than one PaaS that attempts to autoscale based on application use, but does it badly.

---

## Why might slicing up a physical machine into multiple VMs not be a good idea?

Well, for some people, you can. However, slicing up the machine into ever-increasing numbers of VMs isn’t free. Think of our physical machine as a sock drawer. If we put lots of wooden dividers into our drawer, can we store more socks or fewer? The answer is fewer: the dividers themselves take up room too! Our drawer might be easier to deal with and organize, and perhaps we could decide to put T-shirts in one of the spaces now rather than just socks, but more dividers means less overall space.

---

## Which tool can we use to define the infrastructure of a system?

Building a system like this required a significant amount of work. The effort is often front-loaded, but can be essential to manage the deployment complexity you have. I hope in the future you won’t have to do this yourself. **Terraform** is a very new tool from HashiCorp, which works in this space.

---
-------------------------------------------------------------------------------- /chapters/chapter-07.md: --------------------------------------------------------------------------------

# Chapter 7: Testing

---

## How has the trend been when it comes to testing (manual or automated)?

The trend recently has been away from any large-scale manual testing, in favor of automating as much as possible, and I certainly agree with this approach. If you currently carry out large amounts of manual testing, I would suggest you address that before proceeding too far down the path of microservices, as you won't get many of their benefits if you are unable to validate your software quickly and efficiently.

---

## When it comes to the testing pyramid, what happens as we go up?

When you're reading the pyramid, the key thing to take away is that as you go up the pyramid, the test scope increases, as does our confidence that the functionality being tested works. On the other hand, the feedback cycle time increases as the tests take longer to run, and when a test fails it can be harder to determine which functionality has broken.

---

## When it comes to the testing pyramid, what happens as we go down?

As you go down the pyramid, in general the tests become much faster, so we get much faster feedback cycles. We find broken functionality faster, our continuous integration builds are faster, and we are less likely to move on to a new task before finding out we have broken something. When those smaller-scoped tests fail, we also tend to know what broke, often exactly what line of code. On the flipside, we don't get a lot of confidence that our system as a whole works if we've only tested one line of code!

---

## Explain stubbing downstream collaborators.

When I talk about stubbing downstream collaborators, I mean that we create a stub service that responds with canned responses to known requests from the service under test. For example, I might tell my stub points bank that when asked for the balance of customer 123, it should return 15,000. The test doesn't care if the stub is called 0, 1, or 100 times.

---

## Explain mocking in comparison to stubbing

When using a mock, I actually go further and make sure the call was made. If the expected call is not made, the test fails. Implementing this approach requires more smarts in the fake collaborators that we create, and if overused can cause tests to become brittle. As noted, however, a stub doesn't care if it is called 0, 1, or many times.

---

## What's Mountebank?

You can think of Mountebank as a small software appliance that is programmable via HTTP. The fact that it happens to be written in NodeJS is completely opaque to any calling service. When it launches, you send it commands telling it what port to stub on, what protocol to handle (currently TCP, HTTP, and HTTPS are supported, with more planned), and what responses it should send when requests are sent. It also supports setting expectations if you want to use it as a mock. You can add or remove these stub endpoints at will, making it possible for a single Mountebank instance to stub more than one downstream dependency.

---

## What happens when you have more moving parts in your tests?

The more moving parts, the more brittle our tests may be, and the less deterministic they are. If you have tests that sometimes fail, but everyone just re-runs them because they may pass again later, then you have flaky tests. It isn't only tests covering lots of different processes that are the culprit here.
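Looking back at the stub-versus-mock distinction, it can be sketched with Python's standard `unittest.mock`. The points bank is the book's example; the method name `get_balance` is my own invention for illustration:

```python
from unittest.mock import Mock

# A stub: returns a canned balance for customer 123. The test does not
# care how many times (if at all) the stub gets called.
stub_points_bank = Mock()
stub_points_bank.get_balance.return_value = 15_000

assert stub_points_bank.get_balance(customer_id=123) == 15_000

# A mock: we go further and verify that the expected call was made.
mock_points_bank = Mock()
mock_points_bank.get_balance.return_value = 15_000

mock_points_bank.get_balance(customer_id=123)

# This line fails the test if the call was never made with these
# exact arguments — the extra "smarts" that can make tests brittle
# when overused.
mock_points_bank.get_balance.assert_called_once_with(customer_id=123)
```

The only difference between the two objects is the final verification step, which is exactly the distinction drawn above.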

---

## According to Martin Fowler, what should we do when we have flaky tests?

In "Eradicating Non-Determinism in Tests", Martin Fowler advocates the approach that if you have flaky tests, you should track them down and, if you can't immediately fix them, remove them from the suite so you can treat them.

---

## When is it good to remove an e2e test?

When it comes to the larger-scoped test suites, however, this is exactly what we need to be able to do. If the same feature is covered in 20 different tests, perhaps we can get rid of half of them, as those 20 tests take 10 minutes to run!

---

## Instead of including an e2e test for each new piece of functionality, which other approach can we take?

The best way to counter this is to focus on a small number of core journeys to test for the whole system. Any functionality not covered in these core journeys needs to be covered in tests that analyze services in isolation from each other. These journeys need to be mutually agreed upon, and jointly owned. For our music shop, we might focus on actions like ordering a CD, returning a product, or perhaps creating a new customer: high-value interactions and very few in number.

---

## Explain CDC (Consumer-Driven Contracts)

With CDCs, we are defining the expectations of a consumer on a service (or producer). The expectations of the consumers are captured in code form as tests, which are then run against the producer. If done right, these CDCs should be run as part of the CI build of the producer, ensuring that it never gets deployed if it breaks one of these contracts.

---

## So Should You Use End-to-End Tests?

You can view running end-to-end tests prior to production deployment as training wheels. While you are learning how CDCs work, and improving your production monitoring and deployment techniques, these end-to-end tests may form a useful safety net, where you are trading off cycle time for decreased risk. But as you improve those other areas, you can start to reduce your reliance on end-to-end tests to the point where they are no longer needed.

---

## What's a smoke test suite?

A common example of this is the smoke test suite, a collection of tests designed to be run against newly deployed software to confirm that the deployment worked. These tests help you pick up any local environmental issues.

---

## What's a blue/green deployment?

Another example of this is what is called blue/green deployment. With blue/green, we have two copies of our software deployed at a time, but only one version of it is receiving real requests.

Let's consider a simple example, seen in Figure 7-12. In production, we have v123 of the customer service live. We want to deploy a new version, v456. We deploy this alongside v123, but do not direct any traffic to it. Instead, we perform some testing in situ against the newly deployed version. Once the tests have worked, we direct the production load to the new v456 version of the customer service. It is common to keep the old version around for a short period of time, allowing for a fast fallback if you detect any errors.

![image](https://user-images.githubusercontent.com/1868409/89364152-da5ea000-d69f-11ea-85c3-bb6b400676ae.png)

---

## What's canary releasing?

With canary releasing, we are verifying our newly deployed software by directing small amounts of production traffic against the system to see if it performs as expected.
“Performing as expected” can cover a number of things, both functional and nonfunctional. For example, we could check that a newly deployed service is responding to requests within 500ms, or that we see the same proportional error rates from the new and the old service.

---

## What do you need to decide when considering canary releasing (about the requests)?

When considering canary releasing, you need to decide if you are going to divert a portion of production requests to the canary or just copy production load. Some teams are able to shadow production traffic and direct it to their canary. In this way, the existing production and canary versions can see exactly the same requests, but only the results of the production requests are seen externally. This allows you to do a side-by-side comparison while eliminating the chance that a failure in the canary can be seen by a customer request.

---

## Explain the trade-off between MTBF and MTTR

Sometimes expending the same effort into getting better at remediation of a release can be significantly more beneficial than adding more automated functional tests. In the web operations world, this is often referred to as the trade-off between optimizing for mean time between failures (MTBF) and mean time to repair (MTTR).

---

## What are nonfunctional requirements?

Nonfunctional requirements is an umbrella term used to describe those characteristics your system exhibits that cannot simply be implemented like a normal feature. They include aspects like the acceptable latency of a web page, the number of users a system should support, how accessible your user interface should be to people with disabilities, or how secure your customer data should be.

---

## What should you ensure to measure in your performance tests?

To generate worthwhile results, you'll often need to run given scenarios with gradually increasing numbers of simulated customers. This allows you to see how the latency of calls varies with increasing load. This means that performance tests can take a while to run. In addition, you'll want the system to match production as closely as possible, to ensure that the results you see will be indicative of the performance you can expect on the production systems.

---

## When it comes to our performance test results, what's important to consider?

And make sure you also look at the results! I've been very surprised by the number of teams I have encountered who have spent a lot of work implementing tests and running them, and never check the numbers. Often this is because people don't know what a good result looks like. You really need to have targets. This way, you can make the build go red or green based on the results, with a red (failing) build being a clear call to action.

---

## Two points about CDC and E2E tests...

- Avoid the need for end-to-end tests wherever possible by using consumer-driven contracts.
- Use consumer-driven contracts to provide focus points for conversations between teams.
-------------------------------------------------------------------------------- /chapters/chapter-08.md: --------------------------------------------------------------------------------

# Chapter 8: Monitoring

---

## What two things do you need to monitor in your services?

- First, we'll want to monitor the host itself: CPU, memory, etc.
- Secondly, we might want to monitor the application itself. At a bare minimum, monitoring the response time of the service is a good idea.
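As a sketch of that bare-minimum application metric, here is one hypothetical way a Python service might record per-endpoint response times in process. A real setup would push these numbers to a monitoring system rather than keep them in memory; the endpoint name and metric store here are invented for illustration:

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process metric store, keyed by endpoint name.
response_times = defaultdict(list)

def timed(name):
    """Record the wall-clock response time of each call to the handler."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                response_times[name].append(time.perf_counter() - start)
        return wrapper
    return decorator

@timed("catalogue.search")
def search(term):
    # Stand-in for a real request handler.
    return [item for item in ("cd", "vinyl") if term in item]

search("cd")
search("vinyl")
print(len(response_times["catalogue.search"]))  # 2
```

With timings captured like this, alerting on response time (and later error rates) becomes a question of where you ship the numbers, not of instrumenting the code.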

---

## What tool can we use to keep our logs from taking up all our disk space?

We may even get advanced and use logrotate to move old logs out of the way and avoid them taking up all our disk space.

---

## When it comes to monitoring our service resources (CPU, memory), what tool can we use?

We'll want to know what they should be when things are healthy, so we can alert when they go out of bounds. If we want to run our own monitoring software, we could use something like Nagios to do so, or else use a hosted service like New Relic.

---

## When monitoring a single service running on multiple hosts, what strategy and tool should we use?

So at this point, we still want to track the host-level metrics, and alert on them. But now we want to see what they are across all hosts, as well as individual hosts. In other words, we want to aggregate them up, and still be able to drill down. Nagios lets us group our hosts like this; so far, so good. A similar approach will probably suffice for our application.

---

## What tool can we use to view aggregated logs?

Kibana is an ElasticSearch-backed system for viewing logs, illustrated in Figure 8-4. You can use a query syntax to search through logs, allowing you to do things like restrict time and date ranges or use regular expressions to find matching strings. Kibana can even generate graphs from the logs you send it, allowing you to see at a glance how many errors have been generated over time, for example.

---

## What's the secret to knowing when to panic and when to relax?

Our website is seeing nearly 50 4XX HTTP error codes per second. Is that bad? The CPU load on the catalog service has increased by 20% since lunch; has something gone wrong?
The secret to knowing when to panic and when to relax is to gather metrics about how your system behaves over a long-enough period of time that clear patterns emerge.

---

## What's a good way to collect metrics across multiple services, and what would be a good tool for that?

We'll want to be able to look at a metric aggregated for the whole system (for example, the average CPU load), but we'll also want to aggregate that metric for all the instances of a given service, or even for a single instance of that service. That means we'll need to be able to associate metadata with the metric to allow us to infer this structure. Graphite is one such system that makes this very easy.

---

## What metrics are good to expose for our services?

I would strongly suggest having your services expose basic metrics themselves. At a bare minimum, for a web service you should probably expose metrics like response times and error rates, vital if your server isn't fronted by a web server that is doing this for you. But you should really go further. For example, our accounts service may want to expose the number of times customers view their past orders, or your web shop might want to capture how much money has been made during the last day.

---

## Why do we care about knowing which features the final user is using? (2 reasons)

Why do we care about this? Well, for a number of reasons.

- First, there is an old adage that 80% of software features are never used. Now I can't comment on how accurate that figure is, but as someone who has been developing software for nearly 20 years, I know that I have spent a lot of time on features that never actually get used. Wouldn't it be nice to know what they are?
- Second, we are getting better than ever at reacting to how our users are using our system to work out how to improve it. Metrics that inform us of how our systems behave can only help us here. We push out a new version of the website, and find that the number of searches by genre has gone up significantly on the catalog service. Is that a problem, or expected?

---

## What do we call the fake events that we generate periodically to verify that a service is working correctly, and which technique is associated with them?

This fake event we created is an example of a synthetic transaction. We used this synthetic transaction to ensure the system was behaving semantically, which is why this technique is often called semantic monitoring.

---

## What do we need to be careful about when applying semantic monitoring?

Likewise, we have to make sure we don't accidentally trigger unforeseen side effects. A friend told me a story about an ecommerce company that accidentally ran its tests against its production ordering systems. It didn't realize its mistake until a large number of washing machines arrived at the head office.

---

## How can we trace a transaction or request that goes across multiple services?

One approach that can be useful here is to use correlation IDs. When the first call is made, you generate a GUID for the call. This is then passed along to all subsequent calls, and can be put into your logs in a structured way, much as you'll already do with components like the log level or date. With the right log aggregation tooling, you'll then be able to trace that event all the way through your system.

---

## When talking about our development cycles, when is a good moment to start including correlation IDs?

Although it might seem like additional work up front, I would strongly suggest you consider putting them in as soon as you can, especially if your system will make use of event-driven architecture patterns, which can lead to some odd emergent behavior.

---

## When talking about monitoring, why is it important to have some level of standardization?

You should try to write your logs out in a standard format. You definitely want to have all your metrics in one place, and you may want to have a list of standard names for your metrics too; it would be very annoying for one service to have a metric called ResponseTime, and another to have one called RspTimeSecs, when they mean the same thing.

---

## What questions do we have to ask when thinking about our monitoring metrics?

What our people want to see and react to right now is different than what they need when drilling down. So, for the type of person who will be looking at this data, consider the following:

- What they need to know right now
- What they might want later
- How they like to consume data

---

## Good book for the graphical display of quantitative information

A discussion about all the nuances involved in the graphical display of quantitative information is certainly outside the scope of this book, but a great place to start is Stephen Few's excellent book Information Dashboard Design: Displaying Data for At-a-Glance Monitoring (Analytics Press).

---

## Two things to consider when monitoring one service

- Track inbound response time at a bare minimum. Once you've done that, follow with error rates and then start working on application-level metrics.
- Track the health of all downstream responses, at a bare minimum including the response time of downstream calls, and at best tracking error rates.

---

## Four things to consider when monitoring the whole system

- Ensure your metric storage tool allows for aggregation at a system or service level, and for drilling down to individual hosts.
- Ensure your metric storage tool allows you to maintain data long enough to understand trends in your system.
- Understand what requires a call to action, and structure alerting and dashboards accordingly.
- Investigate the possibility of unifying how you aggregate all of your various metrics by seeing if a tool like Suro or Riemann makes sense for you.
-------------------------------------------------------------------------------- /chapters/chapter-09.md: --------------------------------------------------------------------------------

# Chapter 9: Security

---

## How do we refer to who or what is being authenticated? (abstractly speaking)

Generally, when we're talking abstractly about who or what is being authenticated, we refer to that party as the principal.

---

## What's authorization?

Authorization is the mechanism by which we map from a principal to the action we are allowing her to do. Often, when a principal is authenticated, we will be given information about her that will help us decide what we should let her do. We might, for example, be told what department or office she works in, pieces of information that our systems can use to decide what she can and cannot do.

---

## How does an SSO solution work?

When a principal tries to access a resource (like a web-based interface), she is directed to authenticate with an identity provider.
This may ask her to provide a username and password, or might use something more advanced like two-factor authentication. Once the identity provider is satisfied that the principal has been authenticated, it gives information to the service provider, allowing it to decide whether to grant her access to the resource.

---

## How can we use API gateways with SSO solutions?

Rather than having each service manage handshaking with your identity provider, you can use a gateway to act as a proxy, sitting between your services and the outside world (as shown in Figure 9-1). The idea is that we can centralize the behavior for redirecting the user and perform the handshake in only one place.

![image](https://user-images.githubusercontent.com/1868409/90997068-72102980-e58e-11ea-9166-9b20a3c7c28d.png)

---

## What's the downside of using API gateways with SSOs?

I have seen some people put all their eggs in one basket, relying on the gateway to handle every step for them. And we all know what happens when we have a single point of failure.

---

## When defining principal roles, why should we keep those roles local to the microservice in question?

These decisions need to be local to the microservice in question. I have seen people use the various attributes supplied by identity providers in horrible ways, using really fine-grained roles like CALL_CENTER_50_DOLLAR_REFUND, where they end up putting information specific to one part of one of our system's behavior into their directory services. This is a nightmare to maintain and gives very little scope for our services to have their own independent lifecycle, as suddenly a chunk of information about how a service behaves lives elsewhere, perhaps in a system managed by a different part of the organization.
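One way to picture keeping those decisions local: the identity provider supplies only coarse-grained attributes (like group membership), and the fine-grained policy, such as the $50 refund limit, lives in the microservice's own code. A hypothetical Python sketch (all names and values here are invented for illustration):

```python
# Coarse-grained attributes as an identity provider might supply them.
COARSE_GROUPS_FROM_IDP = {
    "alice": {"CALL_CENTER"},
    "bob": {"WAREHOUSE"},
}

# Fine-grained policy, owned and versioned by the refunds microservice
# itself — not pushed into the directory service as roles like
# CALL_CENTER_50_DOLLAR_REFUND.
REFUND_LIMITS = {
    "CALL_CENTER": 50,
    "CALL_CENTER_LEAD": 200,
}

def max_refund(principal: str) -> int:
    """Map coarse groups to a refund limit, locally to this service."""
    groups = COARSE_GROUPS_FROM_IDP.get(principal, set())
    return max((REFUND_LIMITS.get(g, 0) for g in groups), default=0)

print(max_refund("alice"))  # 50
print(max_refund("bob"))    # 0
```

Changing the refund limit is now a change to this one service, with its own lifecycle, rather than a change to a shared directory managed elsewhere.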

---

## Why should HTTP Basic Authentication be used over HTTPS?

When using HTTPS, the client gains strong guarantees that the server it is talking to is who the client thinks it is. It also gives us additional protection against people eavesdropping on the traffic between the client and server or messing with the payload.

---

## What's the downside of using HTTPS, and how can you work around it?

Another downside is that traffic sent via SSL cannot be cached by reverse proxies like Varnish or Squid. This means that if you need to cache traffic, it will have to be done either inside the server or inside the client. You can fix this by having a load balancer terminate the SSL traffic, and having the cache sit behind the load balancer.

---

## Explain TLS client certificates

Another approach to confirm the identity of a client is to make use of capabilities in Transport Layer Security (TLS), the successor to SSL, in the form of client certificates. Here, each client has an X.509 certificate installed that is used to establish a link between client and server. The server can verify the authenticity of the client certificate, providing strong guarantees that the client is valid.

---

## Using TLS client certificates has its complications, so when should we use them?

Using wildcard certificates can help, but won't solve all problems. This additional burden means you'll be looking to use this technique when you are especially concerned about the sensitivity of the data being sent, or if you are sending data via networks you don't fully control. So you might decide to secure communication of very important data between parties that is sent over the Internet, for example.

---

## What important consideration should we bear in mind when using strategies like HMAC request signing?

Finally, understand that this approach ensures only that no third party has manipulated the request and that the private key itself remains private. The rest of the data in the request will still be visible to parties snooping on the network.

---

## What two approaches do we have when using API keys, and how do we manage them?

Some systems use a single API key that is shared, and use an approach similar to HMAC as just described. A more common approach is to use a public and private key pair. Typically, you'll manage keys centrally, just as we would manage identities of people centrally. The gateway model is very popular in this space.

---

## Explain the type of vulnerability called the "confused deputy problem"

There is a type of vulnerability called the confused deputy problem, which in the context of service-to-service communication refers to a situation where a malicious party can trick a deputy service into making calls to a downstream service on his behalf that he shouldn't be able to. For example, as a customer, when I log in to the online shopping system, I can see my account details. What if I could trick the online shopping UI into making a request for someone else's details, maybe by making a call with my logged-in credentials?

---

## Which encryption algorithms should you use?

For encryption at rest, unless you have a very good reason for picking something else, pick a well-known implementation of AES-128 or AES-256 for your platform.

---

## What technique should you use for securing your stored passwords?

For passwords, you should consider using a technique called salted password hashing.

---

## How should we store our keys to access our databases and services?

One solution is to use a separate security appliance to encrypt and decrypt data. Another is to use a separate key vault that your service can access when it needs a key. The lifecycle management of the keys (and access to change them) can be a vital operation, and these systems can handle this for you.

---

## Explain "Decrypt on Demand"

Encrypt data when you first see it. Only decrypt on demand, and ensure that data is never stored anywhere.

---

## What do we have to do in order to keep our backups safe?

So it may seem like an obvious point, but we need to make sure that our backups are also encrypted. This also means that we need to know which keys are needed to handle which version of data, especially if the keys change. Having clear key management becomes fairly important.

---

## When it comes to security, how can our logs help us out?

Good logging, and specifically the ability to aggregate logs from multiple systems, is not about prevention, but can help with detecting and recovering from bad things happening. For example, after applying security patches you can often see in logs if people have been exploiting certain vulnerabilities. Patching makes sure it won't happen again, but if it already has happened, you may need to go into recovery mode. Having logs available allows you to see if something bad happened after the fact.

---

## In security, what are IDS and IPS?

Intrusion detection systems (IDS) can monitor networks or hosts for suspicious behavior, reporting problems when it sees them. Intrusion prevention systems (IPS), as well as monitoring for suspicious activity, can step in to stop it from happening.
Unlike a firewall, which is primarily looking outward to stop bad things from getting in, IDS and IPS are actively looking inside the perimeter for suspect behavior.

---

## In AWS, which technique can we use to segregate our networks?

AWS, for example, provides the ability to automatically provision a virtual private cloud (VPC), which allows hosts to live in separate subnets. You can then specify which VPCs can see each other by defining peering rules, and even route traffic through gateways to proxy access, giving you in effect multiple perimeters at which additional security measures can be put into place.

---

## Which two steps should you take when securing your OS environment?

- Here, basic advice can get you a long way. Start with only running services as OS users that have as few permissions as possible, to ensure that if such an account is compromised it will do minimal damage.
- Next, patch your software. Regularly. This needs to be automated, and you need to know if your machines are out of sync with the latest patch levels.

---

## How can a microservice architecture give us much more freedom when implementing our security?

For those parts that deal with the most sensitive information or expose the most valuable capabilities, we can adopt the strictest security provisions. But for other parts of the system, we can afford to be much more lax in what we worry about.

---

## When it comes to storing private data, what do you have to take into account? (hint: the German phrase Datensparsamkeit)

The German phrase Datensparsamkeit represents this concept. Originating from German privacy legislation, it encapsulates the concept of only storing as much information as is absolutely required to fulfill business operations or satisfy local laws.

---

## How can we help to educate developers about security concerns?

Getting people familiar with the OWASP Top Ten list and OWASP's Security Testing Framework can be a great place to start. Specialists absolutely have their place, though, and if you have access to them, use them to help you.

---

## How can we use external parties to assess the security of our system?

With security, I think there is great value in having an external assessment done. Exercises like penetration testing, when done by an outside party, really do mimic real-world attempts. They also sidestep the issue that teams aren't always able to see the mistakes they have made themselves, as they are too close to the problem.
-------------------------------------------------------------------------------- /chapters/chapter-10.md: --------------------------------------------------------------------------------

# Chapter 10: Conway's Law and System Design

---

## How does Eric S. Raymond summarize Conway's law?

This statement is often quoted, in various forms, as Conway's law. Eric S. Raymond summarized this phenomenon in The New Hacker's Dictionary (MIT Press) by stating "If you have four groups working on a compiler, you'll get a 4-pass compiler."

---

## What rule does Amazon use to manage the size of its teams?

... It wanted teams to own and operate the systems they looked after, managing the entire lifecycle. But Amazon also knew that small teams can work faster than large teams. This led famously to its two-pizza teams, where no team should be so big that it could not be fed with two pizzas.

---

## What happens when the cost of coordinating change increases?

When the cost of coordinating change increases, one of two things happens.
Either people find ways to reduce the coordination/communication costs, or they stop making changes. The latter is exactly how we end up with large, hard-to-maintain codebases.

---

## When it comes to defining teams based on their geographic location, what's a good piece of advice?

So where does this leave us when considering evolving our own service design? Well, I would suggest that geographical boundaries between people involved with the development of a system can be a great way to drive when services should be decomposed, and that in general, you should look to assign ownership of a service to a single, colocated team who can keep the cost of change low.

---

## Explain having a core service as an internal open source project. What would the responsibility of the core committers (the ones with ownership) be?

With normal open source, a small group of people are considered core committers. They are the custodians of the code. If you want a change to an open source project, you either ask one of the committers to make the change for you, or else you make the change yourself and send them a pull request. The core committers are still in charge of the codebase; they are the owners.

---

## What would the process of vetting and approving changes be for the core team?

The core team needs to have some way of vetting and approving the changes. It needs to make sure the changes are idiomatically consistent—that is, that they follow the general coding guidelines of the rest of the codebase. The people doing the vetting are therefore going to have to spend time working with the submitters to make sure the change is of sufficient quality.

---

## When working with open source projects, when is it good to take submissions, and when is it not?
Most open source projects tend to not take submissions from a wider group of untrusted committers until the core of the first version is done. Following a similar model for your own organization makes sense. If a service is pretty mature, and is rarely changed—for example, our cart service—then perhaps that is the time to open it up for other contributions.

---

## What benefits do we get by drawing our service boundaries around our bounded contexts? (3)

This has multiple benefits. First, a team will find it easier to grasp domain concepts within a bounded context, as they are interrelated. Second, services within a bounded context are more likely to be services that talk to each other, making system design and release coordination easier. Finally, in terms of how the delivery team interacts with the business stakeholders, it becomes easier for the team to create good relationships with the one or two experts in that area.

---

## What about those services that are not changed frequently? How can we work around those when our team structures are aligned along the bounded contexts?

If your team structures are aligned along the bounded contexts of your organization, then even services that are not changed frequently still have a de facto owner. Imagine a team that is aligned with the consumer web sales context. It might handle the website, cart, and recommendation services. Even if the cart service hasn't been changed in months, it would naturally fall to this team to make the change.

---

## Explain the "Line of Business" structure implemented by realestate.com.au

Each squad inside a line of business is expected to own the entire lifecycle of the services it creates, including building, testing and releasing, supporting, and even decommissioning. A core delivery services team provides advice and guidance to these teams, as well as tooling to help them get the job done.

---

## Coming up with a vision for how things should be done without considering how your current staff will feel about it, or what capabilities they have, is likely to lead to a bad place. How can we address this issue?

Each organization has its own set of dynamics around this topic. Understand your staff's appetite to change. Don't push them too fast! Maybe you still have a separate team handle frontline support or deployment for a short period of time, giving your developers time to adjust to other new practices.

---

## In summary, how should we align our teams?

This leads us to trying to align service ownership to colocated teams, which themselves are aligned around the same bounded contexts of the organization.
--------------------------------------------------------------------------------
/chapters/chapter-11.md:
--------------------------------------------------------------------------------
# Chapter 11: Microservices at Scale

---

## Prevent failure or deal with it gracefully?

We can also spend a bit less of our time trying to stop the inevitable, and a bit more of our time dealing with it gracefully. I'm amazed at how many organizations put processes and controls in place to try to stop failure from occurring, but put little to no thought into actually making it easier to recover from failure in the first place.

---

## Which 3 requirements do you need to understand to handle failure?

- Response time/latency
- Availability
- Durability of data

---

## What question should you ask about response time or latency, and why is it important to ask it?

How long should various operations take?
It can be useful here to measure this with different numbers of users to understand how increasing load will impact the response time.

---

## When it comes to "Availability", what's important to consider? (3 questions)

Can you expect a service to be down? Is this considered a 24/7 service? Some people like to look at periods of acceptable downtime when measuring availability, but how useful is this to someone calling your service? I should either be able to rely on your service responding or not. Measuring periods of downtime is really more useful from a historical reporting angle.

---

## When talking about durability of data, what 2 questions are important to answer, and what's important to consider?

How much data loss is acceptable? How long should data be kept for? This is highly likely to change on a case-by-case basis. For example, you might choose to keep user session logs for a year or less to save space, but your financial transaction records might need to be kept for many years.

---

## When and how should you "degrade functionality" in your microservices?

What we need to do is understand the impact of each outage, and work out how to properly degrade functionality. If the shopping cart service is unavailable, we're probably in a lot of trouble, but we could still show the web page with the listing. Perhaps we just hide the shopping cart or replace it with an icon saying "Be Back Soon!"

---

## What do you have to ask yourself when dealing with multiple microservices that depend on multiple downstream collaborators? (Resilience) (2)

But for every customer-facing interface that uses multiple microservices, or every microservice that depends on multiple downstream collaborators, you need to ask yourself, "What happens if this is down?" and know what to do.
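As a toy illustration of answering "what happens if this is down?", a page can catch a failed cart call and degrade to the "Be Back Soon!" placeholder mentioned above. The function names and the failure simulation are invented for this sketch; a real system would be making network calls.

```python
# Hypothetical sketch of graceful degradation: the page still renders even
# when the cart service is unreachable - the widget degrades to a placeholder.

def fetch_cart(available: bool) -> dict:
    """Stand-in for a call to the cart service; flag simulates an outage."""
    if not available:
        raise ConnectionError("cart service unreachable")
    return {"items": 3}

def render_cart_widget(cart_available: bool) -> str:
    try:
        cart = fetch_cart(cart_available)
        return f"Cart ({cart['items']} items)"
    except ConnectionError:
        # Degraded mode: hide the cart rather than failing the whole page.
        return "Be Back Soon!"
```

The key design decision is that the fallback is chosen per dependency, ahead of time, rather than letting one outage cascade into a failed page.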

---

## Systems that act slow or systems that fail fast? Which is worse?

When you get down to it, we discovered the hard way that systems that just act slow are much harder to deal with than systems that just fail fast. In a distributed system, latency kills.

---

## What three (3) fixes did Sam's team implement in the ads website project where he was the technical lead?

We ended up implementing three fixes to avoid this happening again: getting our timeouts right, implementing bulkheads to separate out different connection pools, and implementing a circuit breaker to avoid sending calls to an unhealthy system in the first place.

---

## Explain Chaos Monkey

The most famous of these programs is the Chaos Monkey, which during certain hours of the day will turn off random machines. Knowing that this can and will happen in production means that the developers who create the systems really have to be prepared for it.

---

## Explain the trade-offs when deciding on timeouts (3)

Wait too long to decide that a call has failed, and you can slow the whole system down. Time out too quickly, and you'll consider a call that might have worked as failed. Have no timeouts at all, and a downstream system being down could hang your whole system.

---

## Where do you have to put timeouts, and what do you have to do with them?

Put timeouts on all out-of-process calls, and pick a default timeout for everything. Log when timeouts occur, look at what happens, and change them accordingly.

---

## Explain "Circuit Breakers"

With a circuit breaker, after a certain number of requests to the downstream resource have failed, the circuit breaker is blown. All further requests fail fast while the circuit breaker is in its blown state.
After a certain period of time, the client sends a few requests through to see if the downstream service has recovered, and if it gets enough healthy responses it resets the circuit breaker. You can see an overview of this process in Figure 11-2.

![image](https://user-images.githubusercontent.com/1868409/92310004-12dfeb00-ef78-11ea-8a72-e2350a00e4d6.png)

---

## What do you have to do with a request when a circuit breaker is blown (open)? (synchronous and asynchronous operations)

While the circuit breaker is blown, you have some options. One is to queue up the requests and retry them later on. For some use cases, this might be appropriate, especially if you're carrying out some work as part of an asynchronous job. If this call is being made as part of a synchronous call chain, however, it is probably better to fail fast. This could mean propagating an error up the call chain, or a more subtle degrading of functionality.

---

## Explain Bulkheads

We should have used different connection pools for each downstream connection. That way, if one connection pool gets exhausted, the other connections aren't impacted, as we see in Figure 11-3. This would ensure that if a downstream service started behaving slowly in the future, only that one connection pool would be impacted, allowing other calls to proceed as normal.

![image](https://user-images.githubusercontent.com/1868409/92310219-30ae4f80-ef7a-11ea-9284-10c24c66b99a.png)

---

## How can we combine circuit breakers with bulkheads?

We can think of our circuit breakers as an automatic mechanism to seal a bulkhead, to not only protect the consumer from the downstream problem, but also to potentially protect the downstream service from more calls that may be having an adverse impact.
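The blown/fail-fast/reset cycle described above can be sketched as a small in-process circuit breaker. Thresholds, names, and the injectable clock are illustrative, not a production implementation (real ones would also treat a timed-out call as a failure and handle concurrency):

```python
import time

class CircuitBreaker:
    """Minimal closed -> open (blown) -> half-open circuit breaker sketch."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Blown: fail fast without touching the downstream service.
                raise RuntimeError("circuit open: failing fast")
            # Reset window elapsed: half-open, let a probe request through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = self.clock()  # blow (or re-blow) the breaker
            raise
        self.failures = 0
        self.opened_at = None  # healthy response resets the breaker
        return result
```

Usage is just `breaker.call(lambda: make_downstream_request())`; once enough calls fail, further callers get the fast `RuntimeError` instead of piling up on an unhealthy dependency.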

---

## Why is it important to keep isolation in our services?

The more one service depends on another being up, the more the health of one impacts the ability of the other to do its job. If we can use integration techniques that allow a downstream server to be offline, upstream services are less likely to be affected by outages, planned or unplanned.

---

## What's an idempotent operation?

In idempotent operations, the outcome doesn't change after the first application, even if the operation is subsequently applied multiple times. If operations are idempotent, we can repeat the call multiple times without adverse impact. This is very useful when we want to replay messages that we aren't sure have been processed, a common way of recovering from error.

---

## When does using idempotent operations work well?

This mechanism works just as well with event-based collaboration, and can be especially useful if you have multiple instances of the same type of service subscribing to events. Even if we store which events have been processed, with some forms of asynchronous message delivery there may be small windows where two workers can see the same message. By processing the events in an idempotent manner, we ensure this won't cause us any issues.

---

## Why is it good to move microservices onto their own hosts?

As the microservices are independent processes that communicate over the network, it should be an easy task to then move them onto their own hosts to improve throughput and scaling. This can also increase the resiliency of the system, as a single host outage will impact a reduced number of microservices.

---

## What's important to consider when looking at an SLA? (SLA stands for...?)

If you're using an underlying service provider, it is important to know if a service-level agreement (SLA) is offered and plan accordingly. If you need to ensure your services are down for no more than four hours every quarter, but your hosting provider can only guarantee a downtime of eight hours per quarter, you have to either change the SLA, or come up with an alternative solution.

---

## Explain VLAN

One mitigation is to have all the instances of the microservice inside a single VLAN, as we see in Figure 11-5. A VLAN is a virtual local area network that is isolated in such a way that requests from outside it can come only via a router, and in this case our router is also our SSL-terminating load balancer. The only communication to the microservice from outside the VLAN comes over HTTPS, but internally everything is HTTP.

![image](https://user-images.githubusercontent.com/1868409/92339525-05685500-f08d-11ea-8fca-f72cdcc28bfb.png)

---

## Explain what Jeff Dean said in his presentation "Challenges in Building Large-Scale Information Retrieval Systems" (2)

The architecture that gets you started may not be the architecture that keeps you going when your system has to handle very different volumes of load. As Jeff Dean said in his presentation "Challenges in Building Large-Scale Information Retrieval Systems" (WSDM 2009 conference), you should "design for ~10× growth, but plan to rewrite before ~100×." At certain points, you need to do something pretty radical to support the next level of growth.

---

## Why is preparing our systems for massive usage from the very beginning a bad idea?

We need to be able to rapidly experiment, and understand what capabilities we need to build.
If we tried building for massive scale up front, we'd end up front-loading a huge amount of work to prepare for load that may never come, while diverting effort away from more important activities, like understanding if anyone will want to actually use our product.

---

## How can we scale reads in relational databases? What is such a setup called?

In a relational database management system (RDBMS) like MySQL or Postgres, data can be copied from a primary node to one or more replicas. This is often done to ensure that a copy of our data is kept safe, but we can also use it to distribute our reads. A service could direct all writes to the single primary node, but distribute reads to one or more read replicas, as we see in Figure 11-6. The replication from the primary database to the replicas happens at some point after the write. This means that with this technique reads may sometimes see stale data until the replication has completed. Eventually the reads will see the consistent data. Such a setup is called eventually consistent, and if you can handle the temporary inconsistency it is a fairly easy and common way to help scale systems.

![image](https://user-images.githubusercontent.com/1868409/92339830-69d7e400-f08e-11ea-9961-0b324caad525.png)

---

## What approach do we have when scaling for writes?

One approach is to use sharding. With sharding, you have multiple database nodes. You take a piece of data to be written, apply some hashing function to the key of the data, and based on the result of the function learn where to send the data. To pick a very simplistic (and actually bad) example, imagine that customer records A–M go to one database instance, and N–Z another.
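The hash-the-key-to-pick-a-node idea can be sketched as follows. The node names are invented, and a real system would likely use consistent hashing so that adding a node doesn't remap most keys:

```python
import hashlib

# Illustrative shard map; in practice these would be real database endpoints.
NODES = ["db-node-0", "db-node-1", "db-node-2"]

def shard_for(key: str, nodes=NODES) -> str:
    """Pick a database node by hashing the record key."""
    # Use a stable hash: Python's builtin hash() is seeded per process,
    # so it would route the same key differently across restarts.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

Note the naive modulo has exactly the weakness the text hints at: growing `NODES` from 3 to 4 changes the result for most keys, forcing a large data migration.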

---

## In short words, explain CQRS

The Command-Query Responsibility Segregation (CQRS) pattern refers to an alternate model for storing and querying information. With normal databases, we use one system for performing modifications to data and querying the data. With CQRS, part of the system deals with commands, which capture requests to modify state, while another part of the system deals with queries.

---

## Explain client-side, proxy, and server-side caching

- In client-side caching, the client stores the cached result. The client gets to decide when (and if) it goes and retrieves a fresh copy. Ideally, the downstream service will provide hints to help the client understand what to do with the response, so it knows when and if to make a new request.
- With proxy caching, a proxy is placed between the client and the server. A great example of this is using a reverse proxy or content delivery network (CDN).
- With server-side caching, the server handles caching responsibility, perhaps making use of a system like Redis or Memcache, or even a simple in-memory cache.

---

## Benefits of client-side caching

Client-side caching can help reduce network calls drastically, and can be one of the fastest ways of reducing load on a downstream service. In this case, the client is in charge of the caching behavior.

---

## Benefits of proxy caching

With proxy caching, everything is opaque to both the client and server. This is often a very simple way to add caching to an existing system. If the proxy is designed to cache generic traffic, it can also cache more than one service; a common example is a reverse proxy like Squid or Varnish, which can cache any HTTP traffic.
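A minimal sketch of the client-side caching described above, where the client honors a max-age hint (in seconds) from the downstream service, much like an HTTP cache-control directive. The API and the injectable clock are invented for illustration:

```python
import time

class ClientCache:
    """Tiny client-side cache: serve a stored value until its max-age expires."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}  # key -> (value, expires_at)

    def get(self, key, fetch, max_age=60.0):
        entry = self.entries.get(key)
        if entry is not None and self.clock() < entry[1]:
            return entry[0]  # still fresh: no network call needed
        value = fetch()      # stale or missing: go to the origin
        self.entries[key] = (value, self.clock() + max_age)
        return value
```

The client stays in charge of the caching behavior, but the freshness window (`max_age`) is exactly the kind of hint the downstream service should supply.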

---

## Benefits of server-side caching

With a cache near or inside a service boundary, it can be easier to reason about things like invalidation of data, or track and optimize cache hits. In a situation where you have multiple types of clients, a server-side cache could be the fastest way to improve performance.

---

## How can we use HTTP headers in our caching systems?

First, with HTTP, we can use cache-control directives in our responses to clients. These tell clients if they should cache the resource at all, and if so how long they should cache it for in seconds. We also have the option of setting an Expires header, where instead of saying how long a piece of content can be cached for, we specify a time and date at which a resource should be considered stale and fetched again.

---

## Explain the Guardian technique

A technique I saw used at the Guardian, and subsequently elsewhere, was to crawl the existing live site periodically to generate a static version of the website that could be served in the event of an outage. Although this crawled version wasn't as fresh as the cached content served from the live system, in a pinch it could ensure that a version of the site would get displayed.

---

## How can we avoid our services getting flooded with requests if all the cache vanishes?

One way to protect the origin in such a situation is never to allow requests to go to the origin in the first place. Instead, the origin itself populates the cache asynchronously when needed, as shown in Figure 11-7. If a cache miss is caused, this triggers an event that the origin can pick up on, alerting it that it needs to repopulate the cache. So if an entire shard has vanished, we can rebuild the cache in the background.

![image](https://user-images.githubusercontent.com/1868409/92422506-0617ee80-f154-11ea-9c45-38e787733dec.png)

---

## When applying caching, what do we have to be careful about?

Be careful about caching in too many places! The more caches between you and the source of fresh data, the more stale the data can be, and the harder it can be to determine the freshness of the data that a client eventually sees. This can be especially problematic with a microservice architecture where you have multiple services involved in a call chain.

---

## What's a good piece of advice when using autoscaling?

Both reactive and predictive scaling are very useful, and can help you be much more cost effective if you're using a platform that allows you to pay only for the computing resources you use. But they also require careful observation of the data available to you. I'd suggest using autoscaling for failure conditions first while you collect the data. Once you want to start scaling for load, make sure you are very cautious about scaling down too quickly. In most situations, having more computing power at your hands than you need is much better than not having enough!

---

## Explain the CAP theorem in short words

At its heart it tells us that in a distributed system, we have three things we can trade off against each other: consistency, availability, and partition tolerance. Specifically, the theorem tells us that we get to keep two in a failure mode.

---

## How can we sacrifice consistency? What do we get by doing so?

Let's assume that we don't shut the inventory service down entirely. If I make a change now to the data in DC1, the database in DC2 doesn't see it. This means any requests made to our inventory node in DC2 see potentially stale data.
In other words, our system is still available in that both nodes are able to serve requests, and we have kept the system running despite the partition, but we have lost consistency. This is often called an AP system. We don't get to keep all three.

---

## How can we sacrifice availability? What do we get by doing so?

Now in the partition, if the database nodes can't talk to each other, they cannot coordinate to ensure consistency. We are unable to guarantee consistency, so our only option is to refuse to respond to the request. In other words, we have sacrificed availability. Our system is consistent and partition tolerant, or CP. In this mode our service would have to work out how to degrade functionality until the partition is healed and the database nodes can be resynchronized.

---

## What's a good piece of advice when we want to achieve multinode consistency? What tool can we use?

Getting multinode consistency right is so hard that I would strongly, strongly suggest that if you need it, don't try to invent it yourself. Instead, pick a data store or lock service that offers these characteristics. Consul, for example, which we'll discuss shortly, implements a strongly consistent key/value store designed to share configuration between multiple nodes.

---

## So, AP or CP?

Without knowing the context in which the operation is being used, we can't know the right thing to do. Knowing about the CAP theorem just helps you understand that this trade-off exists and what questions to ask.

---

## What about those posts claiming they have beaten the CAP theorem? Are they true?

You'll often see posts about people beating the CAP theorem. They haven't. What they have done is create a system where some capabilities are CP, and some are AP.

---

## Why is it sometimes more convenient to go AP rather than CP?

We have to recognize that no matter how consistent our systems might be in and of themselves, they cannot know everything that happens, especially when we're keeping records of the real world. This is one of the main reasons why AP systems end up being the right call in many situations. Aside from the complexity of building CP systems, they can't fix all our problems anyway.

---

## Good tool for generating documentation of our microservices

Swagger lets you describe your API in order to generate a very nice web UI that allows you to view the documentation and interact with the API via a web browser. The ability to execute requests is very nice: you can define POST templates, for example, making it clear what sort of content the server expects.

---

## When to use HAL, and when to use Swagger?

If you're using hypermedia, my recommendation is to go with HAL over Swagger. But if you're not using hypermedia and can't justify the switch, I'd definitely suggest giving Swagger a go.
--------------------------------------------------------------------------------
/chapters/chapter-12.md:
--------------------------------------------------------------------------------
[//]: <> (Chapter 12: Bringing It All Together)

## List the Principles of Microservices (7)

---

![image](https://user-images.githubusercontent.com/1868409/94346320-32f84c80-0002-11eb-8888-10b919c74c91.png)

===

## What can we do to maximize the autonomy that microservices make possible?

---

To maximize the autonomy that microservices make possible, we need to constantly be looking for the chance to delegate decision making and control to the teams that own the services themselves.
This process starts with embracing self-service wherever possible, allowing people to deploy software on demand, making development and testing as easy as possible, and avoiding the need for separate teams to perform these activities.

===

## If we don't account for the fact that a downstream call can and will fail, what can happen?

---

If we don't account for the fact that a downstream call can and will fail, our systems might suffer catastrophic cascading failure, and we could find ourselves with a system that is much more fragile than before.

===

## When making decisions about our microservice architecture, we need to accept that we're going to get some things wrong. What are our options?

---

So, knowing we are going to get some things wrong, what are our options? Well, I would suggest finding ways to make each decision small in scope; that way, if you get it wrong, you only impact a small part of your system. Learn to embrace the concept of evolutionary architecture, where your system bends and flexes and changes over time as you learn new things.
43 | -------------------------------------------------------------------------------- /package-lock.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "building-microservices-notes", 3 | "version": "1.0.0", 4 | "lockfileVersion": 1, 5 | "requires": true, 6 | "dependencies": { 7 | "markdown-to-anki": { 8 | "version": "0.2.3", 9 | "resolved": "https://registry.npmjs.org/markdown-to-anki/-/markdown-to-anki-0.2.3.tgz", 10 | "integrity": "sha512-IPRqVtFfVrN7f+my60XxMwRJyxtKoRV7XO1pVj8qiO4F7xWIumi29qziN8jvGApeoHd4RGG5qZ6xaTXwFYtfIw==", 11 | "requires": { 12 | "marked": "^0.3.6" 13 | } 14 | }, 15 | "marked": { 16 | "version": "0.3.19", 17 | "resolved": "https://registry.npmjs.org/marked/-/marked-0.3.19.tgz", 18 | "integrity": "sha512-ea2eGWOqNxPcXv8dyERdSr/6FmzvWwzjMxpfGB/sbMccXoct+xY+YukPD+QTUZwyvK7BZwcr4m21WBOW41pAkg==" 19 | } 20 | } 21 | } 22 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "building-microservices-notes", 3 | "version": "1.0.0", 4 | "description": "My personal notes for the book: Building Microservices by Sam Newman.", 5 | "main": "none", 6 | "scripts": { 7 | "build-12": "markdown-to-anki chapters/chapter-12.md > chapter-12.txt" 8 | }, 9 | "repository": { 10 | "type": "git", 11 | "url": "git+https://github.com/Andrew4d3/building-microservices-notes.git" 12 | }, 13 | "keywords": [ 14 | "anki", 15 | "microservices" 16 | ], 17 | "author": "Andrew4d3", 18 | "license": "MIT", 19 | "bugs": { 20 | "url": "https://github.com/Andrew4d3/building-microservices-notes/issues" 21 | }, 22 | "homepage": "https://github.com/Andrew4d3/building-microservices-notes#readme", 23 | "dependencies": { 24 | "markdown-to-anki": "^0.2.3" 25 | } 26 | } --------------------------------------------------------------------------------