├── CMG-Cover-600.jpg ├── LICENSE └── README.md /CMG-Cover-600.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/csjcode/solutions-architect-metrics-cheatsheet/e93df87eea6cb6be7b29d7b81cf529ed333a61ce/CMG-Cover-600.jpg -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 
34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Cloud Metrics Guide Update!!! 2 | 3 | Good news!!! I have created a full 800+ page PDF of the info below!!! The "cheatsheet" is still available below too. But this PDF goes into extensive detail with charts, formulas, explanations, best practices and more! 4 | 5 | SystemsArchitect.io Store: https://store.systemsarchitect.io/ 6 | 7 | Cloud Metrics Guide (800+ pages, 190+ metrics!): https://www.cloudmetricsguide.com/ 8 | 9 | 10 | ![Cloud Metrics Guide](https://github.com/csjcode/solutions-architect-metrics-cheatsheet/blob/main/CMG-Cover-600.jpg "Cloud Metrics Guide") 11 | 12 | 13 | ### Features 14 | 15 | Cloud Metrics Guide "Banyan Book" (full version, 800+pgs.) 
#### Cloud Metrics Guide ("Banyan Book")

The Cloud Metrics Guide ("Banyan Book") covers 190+ cloud metrics in 800+ pages, in detail, with scenarios, gotchas, diagrams and best practices, providing an important resource for those eager to apply learned insights for success.

This Cloud Metrics reference guide is for technical, project lead and business/marketing people motivated to improve their cloud software, unlock new hidden value and take the leadership initiative for strategy and KPIs.

### For Technical, Devs, Manager, Business and Marketing roles!

**Technical engineering roles:**

Cloud Architects, DevOps, and Developers require careful consideration of metric calculations to ensure software success and make improvements.

**Business, marketing and management roles:**

Leadership, Managers, Product owners, Project managers, operations and IT managers can also gain special insights about key performance indicators in discussions with their tech team.

**Cloud metrics covered:**

190+ in the categories of User, Network, Reliability, Compute, Compute Scaling, API, Database, Storage, Events and Queues, Security and Cost metrics.

* 190+ cloud metrics in detail!
* 800+ pages of valuable content, the ULTIMATE reference guide!
* PDF, Searchable, Linked Table of Contents (TOC) for fast navigation.
* Scenarios giving example calculations, formulas and realistic situations.
* List of "Gotchas" for each metric usage, for better understanding.
* 5+ Best Practices and implementation suggestions for each metric!!!

Cloud Metrics Guide (800+ pages, 190+ metrics!): https://www.cloudmetricsguide.com/

#### ⚡️ 150+ Solutions Architect metrics/calculations cheatsheet

150+ Solutions Architect metrics and calculations for systems design, technology comparisons, planning and projects.
If there is interest, I'll convert this into tables; I made a few but haven't had time to finish them all yet.

Categories: User, Network, Reliability, Compute, Storage, Database, Queues/Events, Security, Cost.

Thanks for checking it out... if you have ideas for improvements, feel free to comment or make a PR on my GitHub repo: https://github.com/csjcode/solutions-architect-metrics-cheatsheet


### ⭐️ User

* **Daily Active Users (DAU)**
  * Unique active users / day
  * Performance and capacity needs, daily, projections.
  * Gauge the day-to-day engagement and optimize infrastructure for daily load variations. This helps in scaling resources efficiently to manage daily user activity without over-provisioning.
* **Monthly Active Users (MAU)**
  * Unique active users / month
  * Performance and capacity over longer time periods, projections.
  * For understanding the broader trend in user engagement over a month. It assists in strategic planning for capacity, marketing efforts, and long-term scaling, ensuring that the system can handle monthly growth patterns without degradation in performance.
* **Concurrent Users, Avg/Max**
  * Number of users at the same time, average and peaks
  * Reliability, server capacity, quotas, service bottlenecks
  * For designing systems that are resilient to spikes in user load, particularly during peak times. Architects rely on this data to ensure that the infrastructure can handle sudden increases in demand, maintaining uptime and performance by provisioning for peak loads.
* **Actions Per User (APU): Average actions per user**
  * Actions performed by all unique users / number of unique users.
  * Performance, bandwidth, concurrency, cost optimization for high/low microservices.
  * Understand user behavior in terms of interaction with the system.
    It informs decisions on optimizing specific paths or features users frequently interact with, allowing for more efficient resource allocation and improved user experience by focusing on high-usage areas.
* **Actions Per User delta (APU Δ)**
  * Change in actions per user over a given time period
  * Speed of scalability, concurrency, performance.
  * Use this metric to monitor how user engagement and interaction intensity evolve. It helps adjust the system to changing user demands, so that scalability and performance enhancements align with changes in user behavior.
* **Daily User Actions (DUA)**
  * Total actions performed by all users in a day.
  * System load, performance, cost and capacity growth needs.
  * Use DUA to understand the daily operational demands on the system. It's essential for predicting system load, informing capacity planning, and ensuring the infrastructure can accommodate daily activity without compromising performance.
* **Requests Per Second (RPS)**
  * Service requests per second
  * Reliability in high traffic, quotas, performance.
  * Use this metric to gauge the system's ability to handle real-time demand. It helps ensure the infrastructure can sustain high traffic volumes, maintain reliability, and optimize performance under varying loads.
* **User Delta**
  * Change in the number of users over a given time period
  * Speed of scalability, capacity planning, concurrency, performance, cost.
  * Track growth trends and fluctuations in the user base. For proactive scalability, efficient capacity planning, and optimizing resource allocation to match the changing size and needs of the user population.
* **Session length**
  * Average duration of user session
  * Server resources, cost, performance, concurrency planning.
  * Apply this metric to understand user engagement depth and resource utilization patterns.
    For managing server resources, minimizing costs, and ensuring high performance and concurrency by tailoring the system to typical user session behaviors.



### ⭐️ Network


* **CIDR/Subnet Calculation**
  * IP address and network mask calculation.
  * Number of IPs = $2^{32 - \text{prefix}}$
  * 10.0.0.0/24 = $2^{32 - 24} = 2^{8} = 256$
  * For network sizing and planning, to determine the appropriate size of subnets to efficiently allocate IP addresses within networks, ensuring optimal use of IP space and network organization.
* **Bandwidth Consumed/Utilization**
  * (Bandwidth used / Total available bandwidth) x 100%
  * Network resource usage, potential bottlenecks, cost, quotas, performance
  * Enables monitoring of network load and identification of potential bottlenecks, guiding bandwidth management and optimization to ensure smooth data flow and prevent over-utilization.
* **Available Bandwidth**
  * Total available bandwidth - bandwidth used.
  * Cost, performance, quotas, adequate resources for network traffic and avoiding slowdowns.
  * Critical for ensuring that there are sufficient bandwidth resources to handle expected traffic volumes, allowing for proactive measures to avoid congestion and maintain high-quality network performance.
* **Data Transmission Rate**
  * Amount of data transferred / time taken
  * Data transfer speed and network efficiency.
  * Provides insights into the efficiency and speed of data movement across the network, helping in the optimization of network configurations and protocols to enhance data transmission rates.
* **Link Capacity**
  * Maximum bandwidth capacity of a link.
  * Reliability - determines maximum bandwidth available for a link to ensure it can handle traffic.
  * For network planning and resilience, ensuring each network link can support the required data volumes without compromising on performance or reliability.
* **Network Throughput**
  * Amount of data transferred / Total time (for example, in Mbps)
  * Performance, operations, network efficiency and data transfer speed.
  * For evaluating the overall capacity of the network to handle data transfers, ensuring the infrastructure is adequately provisioned to meet demand without causing slowdowns or impacting user experience.
* **Network Latency**
  * Delay between sending and receiving data, often measured as round-trip time (RTT).
  * Performance, delay in data transfer and network responsiveness.
  * For optimizing user experience and system performance, as lower latency translates to faster interactions for end-users. It's important in applications requiring real-time responses, helping to identify and mitigate network path issues.
* **Network I/O**
  * Input/output rate of network data.
  * Warning of network bottlenecks, performance, cost.
  * Provides insights into the balance between incoming and outgoing data, highlighting potential congestion points or inefficiencies in the network. This helps in ensuring that the network architecture supports the required data flow smoothly and cost-effectively.
* **Request Error rate**
  * Percentage of failed requests, (Number of failed requests / Total requests) x 100%
  * Identifies issues in the network or application that need to be addressed.
  * For maintaining high-quality service, as a high error rate may indicate underlying problems with the network infrastructure or application code. Monitoring this metric helps in quickly identifying and rectifying issues to minimize user impact and maintain service reliability.
* **Packet Loss Rate**
  * Percentage of lost packets, (Packets lost / Packets sent) x 100%
  * Identifies potential vulnerabilities or issues in the network or application.
* **HTTP response codes**
  * Partial list, see [full list](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
  * **1xx info**, processing
  * **2xx successful**: 200 OK
  * **3xx redirection**: 301 Moved Permanently, 302 Found, 304 Not Modified
  * **4xx client error**: 400 Bad Request (parameters missing?), 401 Unauthorized (API token missing?), 403 Forbidden (permissions?), 404 Not Found (url incorrect?), 405 Method Not Allowed (e.g. POST not accepted on resource), 429 Too Many Requests (exceeded rate limits?), 499 Client Closed Request (non-standard, used by nginx)
  * **5xx server errors**: 500 Internal Server Error, 501 Not Implemented, 502 Bad Gateway, 503 Service Unavailable (overload?), 504 Gateway Timeout (upstream timeout?)
* **HTTP Header Fields**
  * List of [standard headers](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields)



### ⭐️ Reliability


* **Recovery Time Objective (RTO)**
  * Maximum acceptable time within which a service must be restored after a disaster.
  * For disaster recovery planning, ensuring that critical systems and data can be quickly restored to operational status following an outage. This helps in minimizing downtime and maintaining business continuity.
* **Recovery Point Objective (RPO)**
  * Maximum acceptable amount of data loss, measured as the time between the last recoverable copy of the data and a disaster, failure, or comparable event.
  * Critical for data protection strategies, determining the maximum tolerable data loss in terms of time. This informs the frequency of backups and the design of data replication mechanisms to ensure data integrity and availability.
* **Single Point of Failure (SPOF)**
  * Component that can cause system failure.
  * 1 - (redundant components / total components).
  * For enhancing system resilience.
    It guides the design of redundancy and fault tolerance into the system architecture to prevent a single component failure from causing a total system outage.
* **Cloud Services Quotas**
  * Limits on usage of cloud services.
  * Possible SPOF: if a quota is reached, the service becomes unavailable or degraded.
  * Track quotas to prevent service disruptions. Doing so ensures that applications scale within the operational bounds of the cloud environment, avoiding unexpected outages or degraded performance due to quota limits.
* **Availability (% of time)**
  * Proportion of time a system is operational
  * uptime / (uptime + downtime)
  * For quantifying the reliability of a system. It measures the total operational time as a percentage of overall time, offering a clear view of the system's stability and reliability over a defined period.
* **Availability (% of requests)**
  * Proportion of valid requests that a system serves successfully
  * successful requests / valid requests
  * Focused on the user perspective, this measures the effectiveness of a system in handling requests successfully. It provides insights into how often users can expect to complete their actions without encountering errors or outages, directly impacting user satisfaction and trust.
* **Availability in 9s ("nines")**
  * Percentage of uptime in a year.
  * (total time - downtime) / total time
  * Expressing availability in terms of "nines" allows for a precise and universally understood benchmark of system reliability. It sets clear expectations for operational performance, guiding infrastructure resilience and redundancy planning to meet these stringent standards.
* **Maximum Availability with Dependencies**
  * Maximum Availability estimate for multiple services in a distributed system.
  * Availability of Service 1 x Availability of Service 2 x ... x
    Availability of Service n
  * Availability of a single service = MTBF / (MTBF + MTTR), where MTBF is Mean Time Between Failures
  * Underscores the compounded effect of dependencies on overall system availability. It aids in identifying the weakest links within a distributed system, focusing efforts on bolstering the resilience of those components to enhance the collective uptime.
* **Maximum Availability with redundant components**
  * Maximum Availability estimate with duplicated components (higher reliability)
  * $A = 1 - F \approx 1 - f(1 - a)^{s+1}$
  * where $s$ = number of spare components, $f$ = number of failure modes, $a$ = availability of a single component (as a fraction), and $F$ = the resulting probability of failure
  * ex: with 99.5% availability and two spares, the workload's availability is $A \approx 1 - (1)(1 - 0.995)^{3} = 99.9999875\%$
  * Illustrates the significant impact of redundancy on improving system availability. By integrating spare components, the system's resilience against failures is markedly enhanced, driving availability figures closer to the ideal 100%. It quantifies the benefit of redundancy in minimizing downtime, essential for critical systems where continuity is paramount.
* **Mean Time to Failure (MTTF)**
  * Average time until a component fails.
  * total uptime / number of failures.
  * Assesses the reliability of non-repairable systems and components. It provides a measure of the expected lifetime of a system, guiding maintenance schedules and informing the design of robust systems that maximize operational lifespan, thereby reducing replacement costs and downtime.
* **Mean Time Between Critical Failures (MTBCF)**
  * total uptime / number of critical failures.
  * Helps in prioritizing improvements and redundancies for components that are most impactful on overall system stability and performance.
* **Mean Time to Data Loss (MTTDL)**
  * Average time before data loss.
  * MTTDL = (1 - Annualized Rate of Data Loss) / Annualized Rate of Data Loss
  * Annualized Rate of Data Loss = (Total Data Stored) x (Data Loss Rate)
  * Assesses the effectiveness of data protection strategies. It enables organizations to gauge the robustness of their data backup, replication, and recovery solutions, ensuring that data loss incidents are exceedingly rare and manageable within operational risk tolerances.
* **Recovery Time (RT)**
  * TTR + MTTD + MTTI + MTTRM, where TTR is Time to Respond.
  * Adds up the time spent detecting, responding to, identifying, and remediating a failure, giving the total time needed to restore service. Breaking recovery into these phases shows which one contributes most to downtime.
* **Mean Recovery Time (MRT)**
  * Average time it takes to recover from a failure.
  * MRT = ∑RT / Number of incidents
  * A holistic measure of the system's resilience, indicating the effectiveness of recovery strategies over time. It aids in benchmarking and improving recovery processes, aiming to minimize the impact of failures on operations.
* **Mean Time to Detect (MTTD)**
  * Average time it takes to detect a failure.
  * MTTD = ∑time taken to detect incidents / Number of incidents.
  * For improving monitoring and alerting systems, MTTD measures the responsiveness to initial failure signs. Lowering MTTD means faster recognition of issues, enabling quicker response to prevent escalation.
* **Mean Time to Identify (MTTI)**
  * Average time it takes to identify a failure.
  * MTTI = ∑time taken to identify incidents / Number of incidents.
  * Highlights the effectiveness of diagnostic processes and tools in pinpointing the causes of incidents. Reducing MTTI is key to accelerating recovery efforts by quickly understanding the nature of a failure.
* **Mean Time to Remediate (MTTRM)**
  * Average time it takes to fix a failure.
  * MTTRM = ∑time taken to remediate incidents / Number of incidents.
  * Focuses on the repair phase, assessing how swiftly a system can be returned to its fully operational state after a failure. It's critical for continuous improvement efforts aimed at reducing repair times and enhancing system reliability.
* **Mean Time to Resolve (MTTR)**
  * Average time it takes to resolve a failure.
  * MTTR = ∑time taken to resolve incidents / Number of incidents.
* **Mean Time to Respond (MTTR)**
  * Average time it takes to respond to a failure. The MTTR acronym is shared with Mean Time to Resolve, so spell out which one is meant when reporting.
  * MTTR = ∑time taken to respond to incidents / Number of incidents.
* **Change Failure Rate (CFR)**
  * Rate at which changes introduce failures.
  * CFR = Number of failed changes / Number of changes attempted.
* **Defect Escape Rate**
  * Rate at which defects escape detection.
  * DER = Number of defects found after release / Total number of defects.
* **Defect Density**
  * Number of defects per unit of code.
  * Number of defects / Size of the software.
* **Failure Rate**
  * Rate at which components fail.
  * Number of failures / Unit of time
* **Service restoration time**
  * Time taken to restore a failed system.
* **Redundancy**
  * (Number of redundant (backup) components / Total number of components) x 100%
* **Resiliency**
  * Ability to recover from a failure.
  * Availability x Reliability x Maintainability x Recoverability
  * Inputs: availability (% of time), reliability (probability of working service), maintainability, and recoverability (MTTR)


### ⭐️ Compute


Many of the following metrics are available in the cloud provider's analytics services, such as AWS CloudWatch, or in service dashboards.
Metrics are included here for awareness and as a reminder when evaluating Compute resources.

* **CPU utilization**
  * CPU usage rate. Monitor performance & efficiency. Optimize performance.
  * Total CPU time / Elapsed time
* **Disk I/O**
  * Disk input/output. Measure read/write speeds for the Compute device. Optimize throughput.
  * Total bytes read/written / Elapsed time
* **Network I/O**
  * Network input/output. Measure bandwidth usage on the Compute device. Optimize connectivity.
  * Total bytes sent/received / Elapsed time
* **IOPS**
  * Input/Output operations/second. Assess data access performance. Optimize throughput.
  * Total operations / Elapsed time
* **Memory utilization**
  * RAM usage rate. Assess RAM usage for performance.
  * Total RAM usage / Total RAM available
* **Caching**
  * Data storage/retrieval. Improve data performance.
  * Hits / (Hits + Misses)
* **Cost**
  * Expense management. Estimate resource expenses.
  * Actual cost / Estimated cost
* **Container density**
  * Resource utilization. Optimize resource use.
  * Used resources / Total resources
* **Function duration vs. limits**
  * Execution time. Gauge execution time against the configured limit.
  * Analytics or (End time - Start time) vs. quota
* **Function concurrency**
  * Simultaneous operations. Measure concurrency.
  * Analytics or Number of operations per function (e.g. Lambda) / Elapsed time
* **Function response time**
  * Execution time. Evaluate speed.
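As a rough illustration of the Compute ratios above, here is a minimal Python sketch. The `ComputeSample` class, its field names, and the sample numbers are hypothetical and chosen only for this example; in practice these values would come from the cloud provider's monitoring service.

```python
from dataclasses import dataclass


@dataclass
class ComputeSample:
    """Hypothetical counters collected over one sampling window."""
    cpu_seconds: float      # total CPU time consumed in the window
    elapsed_seconds: float  # wall-clock length of the window
    io_operations: int      # read + write operations completed
    ram_used_bytes: int
    ram_total_bytes: int


def cpu_utilization(s: ComputeSample) -> float:
    """Total CPU time / elapsed time (can exceed 1.0 on multi-core hosts)."""
    return s.cpu_seconds / s.elapsed_seconds


def iops(s: ComputeSample) -> float:
    """Total operations / elapsed time."""
    return s.io_operations / s.elapsed_seconds


def memory_utilization(s: ComputeSample) -> float:
    """Total RAM usage / total RAM available."""
    return s.ram_used_bytes / s.ram_total_bytes


def duration_vs_limit(start: float, end: float, limit_seconds: float) -> float:
    """Fraction of a function's execution-time limit used by one invocation."""
    return (end - start) / limit_seconds


sample = ComputeSample(
    cpu_seconds=45.0,
    elapsed_seconds=60.0,
    io_operations=18_000,
    ram_used_bytes=6 * 1024**3,
    ram_total_bytes=8 * 1024**3,
)

print(f"CPU utilization:    {cpu_utilization(sample):.0%}")
print(f"IOPS:               {iops(sample):.0f}")
print(f"Memory utilization: {memory_utilization(sample):.0%}")
print(f"Duration vs. limit: {duration_vs_limit(10.0, 12.5, 10.0):.0%}")
```

Returning fractions rather than percentages keeps the functions easy to compose; formatting as a percentage is left to the caller.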
288 | 289 | ### ⭐️ Load Balancing 290 | * **Load Balancing Algorithm** 291 | * Algorithm used by the load balancer to distribute traffic 292 | * Round Robin, Least Connections, Weighted Round Robin, Weighted Least Connections, Dynamic Least Connections, Source IP Hash, Least Time, Least Packets, Agent-Based Load Balancing, URL Hash, Server Affinity (Sticky Sessions) 293 | * **Request Success Rate** 294 | * Number of successful requests / Total number of requests 295 | * **Latency** 296 | * Time taken to serve a request 297 | * **Error Rate** 298 | * Number of failed requests / Total number of requests 299 | * **Connection Count** 300 | * Total number of connections between clients and servers 301 | * **Active Connections** 302 | * Number of active connections between clients and servers 303 | * **Backend Server Health** 304 | * Availability and response time of backend servers 305 | * **SSL Handshake Time** 306 | * Time taken to establish a secure connection 307 | * **Connection Rebalancing Time** 308 | * Time taken to rebalance connections across servers 309 | 310 | ### ⭐️ Autoscaling 311 | 312 | * **Scaling metric** 313 | * A metric that determines when autoscaling should occur, such as CPU utilization or request count 314 | * **Scaling policy** 315 | * A set of rules that define how autoscaling should occur, such as increasing or decreasing the number of instances based on the scaling metric 316 | * **Target Tracking Scaling** 317 | * Adjusts capacity based on target metrics. 318 | * **Step Scaling** 319 | * Adds or removes capacity based on specific thresholds. 320 | * **Simple Scaling** 321 | * Adds or removes capacity based on logging or CloudWatch alarms. 322 | * **Scheduled Scaling** 323 | * Changes capacity at specific times or dates. 324 | * **Predictive Scaling** 325 | * Uses ML to forecast demand and adjust capacity. 326 | * **Dynamic Scaling** 327 | * Resizes based on changing demand and traffic patterns. 
328 | * **Capacity Optimized Scaling** 329 | * Provisions instances for optimal cost and performance. 330 | * **Scale-out threshold** 331 | * The threshold value for the scaling metric that triggers scaling out (adding instances) 332 | * **Scale-in threshold** 333 | * The threshold value for the scaling metric that triggers scaling in (removing instances) 334 | * **Cool-down period** 335 | * The period of time after scaling has occurred during which autoscaling is suspended to prevent rapid scaling up and down 336 | 337 | ### ⭐️ Elasticity 338 | * **Resource utilization** 339 | * The percentage of available resources (such as CPU or memory) that are currently in use 340 | * **Capacity planning** 341 | * The process of estimating future resource needs based on historical usage patterns and growth projections 342 | * **Time to Scale** 343 | * The time it takes to add or remove resources to meet demand changes 344 | * **Cost Optimization** 345 | * The process of minimizing costs while maintaining the necessary level of elasticity and performance. 346 | 347 | ### ⭐️ Database 348 | 349 | * **Throughput** 350 | * The amount of data transferred per unit of time 351 | * Data transferred / time 352 | * **Latency** 353 | * The time it takes to process a request 354 | * Time to first byte + time to last byte 355 | * **Response time** 356 | * The time it takes to respond to a request 357 | * Time to last byte - time to first byte 358 | * **Concurrency** 359 | * The number of simultaneous users or connections 360 | * Simultaneous requests / time 361 | * **Read-to-Write Ratio** 362 | * The ratio of read requests to write requests 363 | * Read requests / write requests 364 | * **Cache Hit Rate** 365 | * The percentage of data that is retrieved from cache 366 | * cache hit rate = cache hits / (cache hits + cache misses) 367 | * **Database Connections** 368 | * The number of active database connections 369 | * Measured using database monitoring tools. 
370 | * **Query performance** 371 | * Time to execute a database query 372 | * Execution time = end time - start time 373 | * **Index usage** 374 | * How frequently an index is used to retrieve data 375 | * Index usage = (number of times index is used) / (total number of queries) 376 | * **Lock waits** 377 | * Time spent waiting for a locked database object 378 | * Lock wait time = total time spent waiting for a lock 379 | * **Deadlocks** 380 | * Occurrences of simultaneous locking conflicts in transactions 381 | * Deadlocks = number of occurrences of simultaneous locking conflicts 382 | * **Data consistency** 383 | * Degree of uniformity and accuracy in data across systems 384 | * Data consistency = (1 - number of errors detected / total number of checks) * 100 385 | * **Backup and Recovery** 386 | * Time taken to back up and recover data in case of failure 387 | * Time taken to back up or recover / number of backups or recoveries 388 | * **Database Size and Growth** 389 | * Total size of the database and its growth rate 390 | * Projected size = current size of database + (growth rate * time interval) 391 | * **CAP Theorem** 392 | * Pick two of the following three properties: 393 | * **Consistency**: Each read request receives the most recent write or an error when consistency can’t be guaranteed. 394 | * **Availability**: Each request receives a non-error response, even when nodes are down or unavailable. 395 | * **Partition tolerance**: The system operates despite the loss of messages between nodes. 396 | * **ACID** 397 | * **Atomicity**: All or nothing. Either all operations succeed or all operations fail. 398 | * **Consistency**: Data is consistent before and after the transaction. 399 | * **Isolation**: Transactions are isolated from each other. 400 | * **Durability**: Once a transaction has been committed, it will remain so, even in the event of power loss or system crash. 
401 | * **BASE** 402 | * Basically Available 403 | * Soft state (may be inconsistent for brief periods) 404 | * Eventually consistent 405 | 406 | 407 | ### ⭐️ Storage 408 | 409 | 410 | #### General Storage metrics 411 | Applies to most storage mediums, including block, file, and object storage. 412 | 413 | * **Data Durability** 414 | * Probability of data remaining intact over time 415 | * (1 - Annual Failure Rate) ^ Years 416 | * **Latency** 417 | * Time for data to be accessed 418 | * Total time to read or write data 419 | * **Replication Latency** 420 | * Time for replica data to be transferred or accessed 421 | * Total time to read or write data after transfer 422 | * **IOPS (Input/Output Operations Per Second)** 423 | * Number of read/write operations per second 424 | * Total number of operations / time interval 425 | * **Throughput** 426 | * Amount of data transferred per unit of time 427 | * Total data transferred / time interval 428 | * **Maximum Throughput** 429 | * Highest amount of data that can be transferred per unit of time 430 | * Peak data transferred / time interval 431 | 432 | #### Object storage 433 | 434 | * **Object Storage Utilization** 435 | * Amount of object storage used versus total available object storage 436 | * Used object storage / Total object storage 437 | * **Object Storage Tier Data Stored** 438 | * Length of time objects are stored in a tier before being transferred/deleted 439 | * Object transfer or deletion execution duration by tier 440 | * **Object Storage API/request Calls** 441 | * API requests made to object storage service 442 | * API requests / Time interval 443 | * PUT, COPY, POST, LIST requests (pricing may be different) 444 | * GET, SELECT, and all other requests 445 | * Lifecycle Transition requests 446 | * Data Retrieval requests 447 | * **Object Storage Latency** 448 | * Time taken for object storage service to process a request 449 | * Total time for requests / Number of requests 450 | * **Data Transfer per time interval** 451 | * 
Total amount of data transferred 452 | * Data Transferred (in bytes) / time interval 453 | * **Object Storage Retention** 454 | * Length of time objects are stored before being deleted 455 | * Object transfer or deletion execution duration 456 | * **Geographic Put/Get requests** 457 | * Latency of Put/Get requests 458 | * Latency (in milliseconds) / time interval 459 | * **Availability metrics** 460 | * Number of requests that fail 461 | * Requests that fail / total number of requests 462 | * **Data Consistency metrics** 463 | * Number of objects that are successfully stored/retrieved 464 | * Number of objects successfully stored/retrieved / time interval 465 | * **Bandwidth used in/out total, timeframe, region, internet, inside cloud provider** 466 | * Amount of data transferred in/out of object storage 467 | * Data transferred in/out / time interval 468 | * There may be different policies per cloud provider. 469 | 470 | 471 | #### Disk storage 472 | 473 | * **Disk Utilization** 474 | * Amount of disk space used versus total available disk space 475 | * Used disk space / Total disk space 476 | * **Disk IOPS** 477 | * Number of read and write requests to a disk in a second 478 | * Number of requests / Time interval 479 | * **Disk Latency** 480 | * Time taken for a disk to process a read/write request 481 | * Total time for read/write requests / Number of requests 482 | * **Data Replication Latency** 483 | * Time taken to replicate data from one location to another 484 | * Time for replication completion - Time of data creation 485 | * **Data Replication Bandwidth** 486 | * Amount of data replicated per second 487 | * Amount of data / Time interval 488 | * **SSD Endurance** 489 | * Amount of data that can be written to an SSD before failure 490 | * Total bytes written / (Drive size in GB * Drive endurance) 491 | * **Disk Utilization** 492 | * Percentage of disk space used 493 | * (Amount of space used / Total amount of space) x 100 494 | * **RAID Reliability** 495 | 
* Probability that the RAID will remain operational 496 | * (1 - Probability of a disk failure) ^ Number of disks (for an array with no redundancy, such as RAID 0, where every disk must survive) 497 | * **Storage Capacity** 498 | * Total amount of storage space available 499 | * Amount of space used + amount of space available 500 | * **Block Storage IOPS** 501 | * Number of read and write requests to a block storage device in a second 502 | * Number of requests / Time interval 503 | * **Block Storage Latency** 504 | * Time taken for a block storage device to process a read/write request 505 | * Total time for read/write requests / Number of requests 506 | 507 | * Types of RAID 508 | 509 | 510 | RAID Level | Description 511 | -----------|------------ 512 | RAID 0 | Data is striped across multiple disks for increased performance, but offers no redundancy. 513 | RAID 1 | Data is mirrored across two disks for fault tolerance, but offers no write performance improvement. 514 | RAID 5 | Data is striped across multiple disks with parity information distributed across the disks for fault tolerance. 515 | RAID 6 | Similar to RAID 5, but with two sets of parity information for even greater fault tolerance. 516 | RAID 10 | A combination of RAID 1 and RAID 0, where data is mirrored and striped for both performance and fault tolerance. 517 | RAID 50 | A combination of RAID 5 and RAID 0, where data is striped across multiple RAID 5 arrays for increased performance and fault tolerance. 518 | RAID 60 | A combination of RAID 6 and RAID 0, where data is striped across multiple RAID 6 arrays for even greater performance and fault tolerance. 519 | 520 | 521 | ### ⭐️ Queues/Events 522 | 523 | * **Queue Depth** 524 | * Number of events in a queue waiting to be processed. 525 | * Total events - Processed events 526 | * Measures the backlog of work and helps in understanding system load and the efficiency of processing mechanisms. Useful for identifying potential bottlenecks and ensuring that the system is scaled properly to handle incoming volumes, maintaining system responsiveness. 
527 | * **Queue Wait Time** 528 | * Amount of time an event spends waiting in a queue before being processed. 529 | * Total time events spend in queue / Number of events in queue 530 | * Indicates the efficiency of the system in handling incoming requests, directly impacting user experience by determining the delay before a request is addressed. For optimizing processing priorities and capacities to minimize wait times, enhancing overall system performance. 531 | * **Event Arrival Rate** 532 | * Rate at which events are arriving at a queue. 533 | * Number of events arriving / Time interval 534 | * For capacity planning and scalability assessments, this metric shows the demand placed on the system, guiding adjustments to processing capabilities or infrastructure to accommodate fluctuating event volumes without degradation in service quality. 535 | * **Event Processing Time** 536 | * Amount of time it takes to process an event. 537 | * Total time spent processing events / Number of events processed 538 | * Reflects the efficiency of the system's processing capabilities, influencing both throughput and user satisfaction. Shortening event processing time is essential for improving system throughput and reducing the overall latency of operations. 539 | * **Event Processing Rate** 540 | * Rate at which events are being processed. 541 | * Number of events processed / Time interval 542 | * For understanding the efficiency of the event handling system. It helps in assessing whether the processing capabilities are aligned with the incoming workload, ensuring timely handling of events without backlog accumulation. 543 | * **Queue Processing Rate** 544 | * Rate at which events are being processed from a queue. 545 | * Number of events processed from queue / Time interval 546 | * For ensuring the system's responsiveness and efficiency in handling queued events. 
A higher queue processing rate indicates a more efficient system, reducing the risk of bottleneck formation and ensuring smooth flow of data through the system. 547 | * **Queue Time** 548 | * Total time that events spend in a queue, including both wait time and processing time. 549 | * Queue Wait Time + Event Processing Time 550 | * A comprehensive view of the time an event spends in the system, from arrival to completion. Useful for identifying inefficiencies in the queue management and processing stages, aiming to minimize the overall time spent in the queue to enhance throughput and user experience. 551 | * **Queue Throughput** 552 | * The rate at which events are moving through a queue, including both incoming and outgoing events. 553 | * Incoming Event Rate + Outgoing Event Rate 554 | * Useful for assessing the overall performance of the queuing system, indicating the volume of data that can be handled over a specific time period. High throughput is indicative of a well-optimized queue that efficiently manages both the intake and processing of events. 555 | * **Event Drop Rate** 556 | * The rate at which events are being dropped or lost, typically due to queue overflow. 557 | * Number of dropped events / Total number of events 558 | * An indicator of system reliability and capacity. A high drop rate may signal that the system is overwhelmed or improperly configured, necessitating adjustments to either the system's capacity or the management strategy to reduce data loss and improve reliability. 559 | * **Queue Latency** 560 | * The time it takes for an event to travel through a queue, including both wait time and processing time. 561 | * Queue Time / Number of events 562 | * Essential for gauging the delay introduced by queuing mechanisms, impacting the overall speed at which the system can respond to and process events. Reducing queue latency is key to improving responsiveness and ensuring that time-sensitive data is handled promptly. 
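As a rough illustration of how the Queues/Events formulas fit together, here is a minimal Python sketch. The `QueueWindow` record and its field names are hypothetical, not part of any queueing product. Two assumptions are baked in: dropped events are counted separately from events accepted into the queue, and the wait-time average is taken over processed events, since those are the ones with a known wait.

```python
from dataclasses import dataclass


@dataclass
class QueueWindow:
    """Hypothetical counters for one queue over one observation window."""
    window_seconds: float
    enqueued: int                    # events accepted into the queue
    processed: int                   # events fully processed
    dropped: int                     # events lost, e.g. to queue overflow
    total_wait_seconds: float        # summed wait time of processed events
    total_processing_seconds: float  # summed processing time


def queue_depth(w: QueueWindow) -> int:
    """Total events - processed events (backlog still waiting)."""
    return w.enqueued - w.processed


def arrival_rate(w: QueueWindow) -> float:
    """Number of events arriving / time interval (accepted plus dropped)."""
    return (w.enqueued + w.dropped) / w.window_seconds


def processing_rate(w: QueueWindow) -> float:
    """Number of events processed / time interval."""
    return w.processed / w.window_seconds


def avg_wait_time(w: QueueWindow) -> float:
    """Total time spent waiting / number of events (processed ones here)."""
    return w.total_wait_seconds / w.processed if w.processed else 0.0


def avg_processing_time(w: QueueWindow) -> float:
    """Total time spent processing / number of events processed."""
    return w.total_processing_seconds / w.processed if w.processed else 0.0


def queue_time(w: QueueWindow) -> float:
    """Queue wait time + event processing time, per event."""
    return avg_wait_time(w) + avg_processing_time(w)


def drop_rate(w: QueueWindow) -> float:
    """Number of dropped events / total number of events."""
    total = w.enqueued + w.dropped
    return w.dropped / total if total else 0.0


if __name__ == "__main__":
    window = QueueWindow(
        window_seconds=50.0,
        enqueued=1200,
        processed=1000,
        dropped=50,
        total_wait_seconds=500.0,
        total_processing_seconds=250.0,
    )
    print("depth:", queue_depth(window))
    print("arrival rate (events/s):", arrival_rate(window))
    print("processing rate (events/s):", processing_rate(window))
    print("queue time (s):", queue_time(window))
    print("drop rate:", drop_rate(window))
```

When the arrival rate stays above the processing rate, as in this sample window, queue depth grows over time, which is usually the first signal to scale consumers or shed load.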
563 | 564 | 565 | ### ⭐️ Security 566 | * **Network Security Score**: a metric that measures the security posture of a network, including factors such as the number of vulnerabilities, exposure to threats, and compliance with security standards. 567 | 568 | * **Incident Response Time** 569 | * The amount of time it takes to respond to a security incident 570 | * Detection Time + Response Time + Mitigation Time 571 | * Time Detected - Time Reported 572 | * **Risk Assessment Score** 573 | * A numerical score based on a risk assessment methodology such as the NIST Risk Management Framework 574 | * (risk rating x probability of risk) + (residual risk x probability of residual risk) 575 | * Number of Potential Risks / Number of Acceptable Risks 576 | * **Attack Surface** 577 | * The total number of entry points or attack vectors available to attackers 578 | * Attack surface = sum of (threats x vulnerabilities) 579 | * **Vulnerability Assessment Score** 580 | * A numerical score based on a vulnerability assessment methodology such as CVSS 581 | * Common Vulnerability Scoring System [CVSS calculator](https://www.first.org/cvss/calculator/3.1) 582 | * Potential Vulnerabilities / Number of Acceptable Vulnerabilities 583 | * CVSS: (Base Score + Temporal Score + Environmental Score) 584 | * **Access Control Effectiveness** 585 | * The ability of the access control system to protect the system from unauthorized access 586 | * (Authenticated Access Attempts - Unauthorized Access Attempts) / Authenticated Access Attempts 587 | * Number of Access Control Rules / Number of Access Control Rules Enforced 588 | * **Authentication Effectiveness** 589 | * The ability of the authentication system to accurately identify and authenticate users 590 | * Authentication Effectiveness = (Authenticated Users - Unauthenticated Users) / Authenticated Users 591 | * **Authorization Effectiveness** 592 | * The ability of the authorization system to accurately authorize users to access resources 593 | * 
Number of Authorization Controls / Number of Authorization Controls Enforced 594 | * **Security Audit Log Analysis** 595 | * The ability of the security audit system to detect, monitor, and analyze security events 596 | * (Audited Events - Unaudited Events) / Audited Events 597 | * Number of Security Events Detected / Number of Security Events Recorded 598 | * **Security Incident Rate** 599 | * Rate of security incidents per time interval 600 | * Security incidents / time 601 | * **Security Compliance Score** 602 | * Score of how well the system complies with security policies and best practices 603 | * Compliance score = (number of compliant components / total number of components) x 100 604 | * **Security Training Effectiveness** 605 | * Measurement of how well users understand and adhere to security policies 606 | * Training effectiveness = (number of users who successfully complete security trainings / total number of users) x 100 607 | * **Vulnerability Scanning Frequency** 608 | * Measurement of how often the system is tested for security vulnerabilities 609 | * Scanning frequency = (number of scans performed in a given time period / total time period) 610 | * **Identity and Access Management (IAM) roles and permissions audit** 611 | * Measurement of the accuracy and security of IAM roles and permissions 612 | * IAM audit = (number of correct roles and permissions / total number of roles and permissions) x 100 613 | * **Key Management Service (KMS) usage and audit** 614 | * Measurement of the accuracy and security of KMS usage 615 | * KMS audit = (number of correct KMS usages / total number of KMS usages) x 100 616 | * **Security Information and Event Management (SIEM) alerts and monitoring** 617 | * Measurement of the accuracy and security of SIEM alerts 618 | * SIEM monitoring = (number of correctly triggered alerts / total number of alerts) x 100 619 | * **Threat intelligence feeds integration and usage** 620 | * Measurement of how threat intelligence feeds are 
used to help identify and respond to security threats 621 | * Number of threat intelligence feeds used / total number of threat intelligence feeds available 622 | * **Encryption key rotation frequency** 623 | * Measurement of how often cryptographic keys are changed 624 | * Time interval between key changes 625 | * **Compliance posture** 626 | * Measurement of how well the system complies with a security standard 627 | * Number of security standards met / total number of security standards 628 | * **Network traffic monitoring and analysis** 629 | * Measurement of the ability to monitor and analyze network traffic for suspicious activity 630 | * Amount of network traffic monitored / total network traffic 631 | * **User behavior analytics (UBA) and anomaly detection** 632 | * Measurement of the ability to detect anomalous user behavior 633 | * Number of anomalies detected / total user behavior events 634 | * **Data Loss Prevention (DLP)** 635 | * The practice of preventing sensitive data from leaving the organization 636 | * DLP = implementation of policies and technologies + monitoring of user activity 637 | * **Data Encryption** 638 | * The practice of transforming sensitive data into an unreadable format 639 | * Data encryption = implementation of encryption algorithms + encryption of data 640 | 641 | ### ⭐️ Cost 642 | 643 | I am only going to give some brief metrics on Cost, because almost everything above can affect cost and it will vary a lot between providers. 644 | 645 | This is not to minimize cost, it's one of the most important factors!!! 646 | 647 | Just that cost should be considered on ALL the metrics above. 
648 | 649 | * **Total Cost of Ownership (TCO)** = (cost of acquisition + cost of operation + cost of maintenance) over the useful life of the asset 650 | * **Cost per transaction** = total cost / number of transactions 651 | * **Cost per unit of time** = total cost / time period 652 | * **Cost per user** = total cost / number of users 653 | * **Return on Investment (ROI)** = (gain from investment - cost of investment) / cost of investment 654 | * **Cost of Downtime (CoD)** = (lost revenue + recovery costs + damage to brand reputation) / total downtime hours 655 | * **Cost of Poor Quality (CoPQ)** = (internal failure costs + external failure costs + cost of appraisal + cost of prevention) / total number of units produced 656 | * **Cost of Delay (CoD)** = (value of time saved by earlier release - cost of delay) / time saved 657 | 658 | There are a ton of cloud cost tools, but these are some of the popular ones on the biggest platforms (there are many more if you search on their sites): 659 | 660 | * **AWS** 661 | * AWS Cost Explorer 662 | * AWS Budgets 663 | * AWS Trusted Advisor 664 | * **Azure** 665 | * Azure Cost Management + Billing 666 | * Azure Advisor 667 | * Azure Service Health 668 | * **GCP** 669 | * GCP Billing 670 | * GCP Pricing Calculator 671 | * GCP Cost Management 672 | * **Third-party** 673 | * CloudCheckr 674 | * CloudHealth by VMware 675 | * Apptio Cloudability 676 | * CloudBolt 677 | * CloudZero 678 | 679 | ### ⭐️ References 680 | 681 | * [Amazon AWS Well Architected](https://aws.amazon.com/architecture/well-architected/) 682 | * [Microsoft Azure Well Architected](https://learn.microsoft.com/en-us/azure/architecture/framework/) 683 | * [Google Cloud Architecture Framework](https://cloud.google.com/architecture/framework) 684 | * [SystemsArchitect.io](https://systemsarchitect.io/) 685 | * [Site Reliability Engineering](https://sre.google/books/) 686 | * [The Site Reliability Workbook](https://sre.google/books/) 687 | * [System Design 
Primer](https://github.com/donnemartin/system-design-primer) 688 | * [System Design Interview](https://github.com/checkcheckzz/system-design-interview) 689 | 690 | 691 | Thanks for checking it out... if you have ideas for improvements, feel free to comment or make a PR on my GitHub repo: https://github.com/csjcode/solutions-architect-metrics-cheatsheet --------------------------------------------------------------------------------