# CAP Theorem with Tom the Prankster

Link: https://newsletter.systemdesigncodex.com/p/cap-theorem

1. **Essential Elements**: Consistency, Availability, Partition Tolerance.
2. **Partition Tolerance**: A must-have due to the inherent unreliability of communication networks.
3. **Choice between Consistency and Availability**:
   - **Consistency**: All nodes see the same data at the same time. Requires a trade-off with availability.
   - **Availability**: Ensures that the system is always operational, but might compromise on having the latest data across all nodes.

# The Inevitable Law Governing Software Design

Link: https://newsletter.systemdesigncodex.com/p/the-inevitable-law-governing-software-design

- **Basic Principle**: The structure of a software system reflects the communication structure of the organization that builds it.
- **Example**: In a company with inventory, invoicing, and shipping departments, the software will likely have separate systems for each, mirroring these divisions.
- **Implications**:
  - Software integration quality depends on how well these departments communicate.
  - Better communication leads to more effective and integrated software modules.
- **Strategies for Addressing Conway’s Law**:
  1. **Acknowledge It**: Recognize its impact on software design.
  2. **Structure Teams Effectively**: Place teams working on similar systems close to each other for better communication.
  3. **Avoid Dividing by Technology**: Instead of splitting teams by tech layers (front end, back end), organize around business features for smoother collaboration.
  4. **Use Architectural Insights**: Align team structures with the desired software architecture, understanding that organizational decisions influence software design.

# The Ingredients to Delicious Software

Link: https://newsletter.systemdesigncodex.com/p/the-ingredients-to-delicious-software

1. **Scalability**:

   - Ability to handle an increased workload efficiently.
   - Important to identify the point where scaling becomes cost-ineffective.

2. **Latency & Throughput**:

   - Latency: Time taken to respond to a request (e.g., time to serve a cheese sandwich).
   - Throughput: Number of requests handled in a given time (e.g., serving multiple customers).

3. **Availability and Consistency**:
   - Availability: Ability to operate despite issues (e.g., with one cook absent).
     - Measured in 'nines' (e.g., 99.9% availability).
   - Consistency: Synchronization of information across different parts of the system (e.g., order copies being in sync).

# What happens when you type a URL into your browser?

Link: https://blog.bytebytego.com/p/what-happens-when-you-type-a-url

When you type a URL into your browser:

1. **URL Parsing**: The browser identifies the HTTP protocol, domain, path, and resource.
2. **DNS Lookup**: It searches for the IP address of the domain, checking various caches.
3. **TCP Connection**: Establishes a connection with the server.
4. **HTTP Request**: Sends a request for the specific resource.
5. **Server Response**: The server sends back the requested content.
6. **Rendering**: The browser displays the webpage.
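Steps 1 through 5 can be traced with Python's standard library alone. A minimal sketch, assuming plain HTTP on port 80 (no TLS, redirects, or caching) and using `example.com` as a placeholder host:

```python
import socket
from urllib.parse import urlparse
from http.client import HTTPConnection

url = "http://example.com/index.html"

# 1. URL parsing: protocol, domain, path
parts = urlparse(url)                      # scheme='http', netloc='example.com', path='/index.html'

# 2. DNS lookup: resolve the domain to an IP address
ip = socket.gethostbyname(parts.netloc)    # the OS checks its caches, then asks the resolver
print("resolved to", ip)                   # the browser connects to this address

# 3 + 4. TCP connection and HTTP request
conn = HTTPConnection(parts.netloc, 80)    # TCP handshake happens on connect
conn.request("GET", parts.path)            # ask for the specific resource

# 5. Server response (step 6, rendering, is the browser's job)
resp = conn.getresponse()
print(resp.status, len(resp.read()), "bytes")
conn.close()
```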
# How does CDN work?

Link: https://blog.bytebytego.com/p/how-does-cdn-work

1. **Domain Name Lookup**:

   - Bob enters `www.myshop.com` in his browser.
   - The browser checks the local DNS cache for the domain.

2. **DNS Resolver**:

   - If the domain is not in the local cache, the browser contacts the DNS resolver (usually run by the ISP).

3. **Recursive Domain Resolution**:

   - The DNS resolver performs recursive resolution for `www.myshop.com`.

4. **CDN Integration**:

   - Instead of pointing directly to the London server, the authoritative name server redirects to a CDN domain (`www.myshop.cdn.com`).

5. **Load Balancer Query**:

   - The DNS resolver queries the CDN load balancer domain (`www.myshop.lb.com`).

6. **Optimal Server Selection**:

   - The CDN load balancer selects the best CDN edge server based on factors like the user’s location and server load.

7. **Content Delivery**:

   - The browser connects to the chosen CDN edge server to load content.
   - The content includes static elements (e.g., images, videos) and dynamic ones.
   - If content is not on the edge server, it is fetched from higher-level CDN servers or the origin server in London.

8. **CDN Network**:
   - This process is part of a geographically distributed CDN network for efficient content delivery.

# Time complexity

Link: https://newsletter.francofernando.com/p/time-complexity

1. **Purpose**: Time complexity evaluates how an algorithm's performance scales with the size of the input data.

2. **Types of Complexity**:

   - **Worst-Case Complexity**: Maximum number of steps for any input of size `n`. Most commonly used, as it provides guarantees about the algorithm's upper limit.
   - **Best-Case Complexity**: Minimum number of steps for any input of size `n`.
   - **Average-Case Complexity**: Average number of steps over all possible instances of input size `n`.

3. **Big O Notation**: Simplifies the expression of an algorithm's worst-case complexity by focusing on growth rates rather than precise step counts.

4. **Common Complexity Classes**:
   - **Constant - O(1)**: Time is independent of input size (e.g., adding two numbers).
   - **Logarithmic - O(log n)**: Each step cuts the problem size in half (e.g., binary search; see the sketch below).
   - **Linear - O(n)**: Time grows linearly with input size (e.g., finding the max in an array).
   - **Superlinear - O(n log n)**: Combines linear and logarithmic growth (e.g., Mergesort, and Quicksort on average).
   - **Quadratic - O(n^2)**: Time grows with the square of input size (e.g., insertion sort).
   - **Cubic - O(n^3)**: Involves triply nested loops (e.g., certain dynamic programming algorithms).
   - **Exponential - O(c^n)**: Time grows by a constant factor for each unit added to the input size, doubling when c = 2 (e.g., enumerating subsets).
   - **Factorial - O(n!)**: Time grows with the factorial of input size (e.g., generating permutations).
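To make the growth rates concrete, here is a small, self-contained comparison of a linear scan (O(n)) with binary search (O(log n)); the data and target are arbitrary:

```python
def linear_search(items, target):          # O(n): may inspect every element
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1

def binary_search(items, target):          # O(log n): items must be sorted
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1                   # discard the lower half
        else:
            hi = mid - 1                   # discard the upper half
    return -1

data = list(range(1_000_000))
# Same answer, but the scan takes ~1,000,000 steps and the
# binary search about 20 (log2 of 1,000,000).
assert linear_search(data, 999_999) == binary_search(data, 999_999)
```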
# 8 Reasons Why WhatsApp Was Able to Support 50 Billion Messages a Day With Only 32 Engineers

Link: https://newsletter.systemdesign.one/p/whatsapp-engineering

1. **Single Responsibility Principle**:

   - Focused on the core feature: messaging.
   - Avoided feature creep and unnecessary functionality.
   - Prioritized reliability above all.

2. **Technology Stack**:

   - Chose Erlang for server functionality due to its scalability and support for hot-loading.
   - Erlang's efficient threading and context-switching mechanisms contributed to performance.

3. **Utilizing Existing Solutions**:

   - Leveraged open-source solutions like ejabberd, an Erlang-based messaging server.
   - Customized existing solutions to fit specific needs.
   - Integrated third-party services for functionality like push notifications.

4. **Cross-Cutting Concerns**:

   - Emphasized aspects like monitoring and alerting for service health.
   - Implemented Continuous Integration and Continuous Delivery for software development.

5. **Scalability Strategies**:

   - Adopted diagonal scaling, combining horizontal and vertical scaling methods.
   - Ran servers on FreeBSD, tuned for handling millions of connections.
   - Overprovisioned servers to handle traffic spikes and potential failures.

6. **Continuous Improvement (Flywheel Effect)**:

   - Regularly measured performance metrics to identify and eliminate bottlenecks.
   - Maintained a cycle of continuous feedback and improvement.

7. **Focus on Quality**:

   - Conducted load testing to identify and address single points of failure.
   - Used simulated production traffic for realistic testing.

8. **Small Team Size**:
   - Kept the engineering team small (32 engineers) to maintain efficiency and reduce communication overhead.

# This Is How Quora Shards MySQL to Handle 13+ Terabytes

Link: https://newsletter.systemdesign.one/p/mysql-sharding

1. **Vertical Sharding**:

   - **Implementation**: Separating tables onto different servers (leader-follower model).
   - **Purpose**: Enhances write scalability.
   - **Challenges**: Replication lag, transactional limitations, and potential performance issues for large tables.

2. **Horizontal Sharding**:

   - **Reasons for Adoption**: Addresses the challenges of large tables, such as slow schema changes and error risks.
   - **Approach**: Splitting a logical table into multiple physical tables.

3. **Key Decisions in Horizontal Sharding**:
   - **Build vs. Buy**: Opted to build their own sharding solution, reusing vertical sharding logic.
   - **Shard Level**: Sharded at the table level due to extensive use of secondary indexes.
   - **Sharding Method**: Chose range-based partitioning, favoring common range queries (see the routing sketch below).
   - **Metadata Management**: Stored shard metadata in Apache ZooKeeper.
   - **Database API**: Modified to handle sharding columns and keys, improving security against SQL injection.
   - **Sharding Column Selection**: Based on latency sensitivity and queries per second (QPS).
   - **Cross-Shard Indexes**: Used to optimize queries on non-sharding columns, though with potential performance and consistency trade-offs.
   - **Number of Shards**: Kept low to reduce latency for non-sharding-column queries.
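As a rough illustration of range-based partitioning, here is a minimal routing sketch. The shard names and range boundaries are hypothetical, and a plain sorted list stands in for the metadata Quora keeps in ZooKeeper:

```python
import bisect

# Each entry: (start of key range, shard name). Ranges are half-open.
SHARD_RANGES = [
    (0,          "shard-a"),
    (1_000_000,  "shard-b"),
    (5_000_000,  "shard-c"),
]
_starts = [start for start, _ in SHARD_RANGES]

def shard_for(sharding_key: int) -> str:
    """Route a sharding-column value to the shard owning its range."""
    idx = bisect.bisect_right(_starts, sharding_key) - 1
    if idx < 0:
        raise KeyError(f"no shard owns key {sharding_key}")
    return SHARD_RANGES[idx][1]

assert shard_for(42) == "shard-a"
assert shard_for(2_500_000) == "shard-b"   # a contiguous range query stays on one shard
```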
# Making Your Database Highly Available

Link: https://newsletter.systemdesigncodex.com/p/making-your-database-highly-available

## Redundancy

- **Purpose**: Ensure continuous database operation even if one server fails.
- **Not Backup**: Unlike backups, redundancy involves running multiple active database instances.
- **Cost of Outage**: Can be significant, averaging $7,900 per minute.
- **Redundancy Patterns**:
  - **Active-Passive**: One active server handles requests while others stand by.
  - **Active-Active**: Multiple servers handle requests simultaneously.
  - **Multi-Active**: An extension of Active-Active with more complex setups.

## Isolation

- **Goal**: Minimize disaster impact by physically separating database components.
- **Degrees of Separation**:
  - **Server**: Different servers in the same data center.
  - **Rack**: Separate racks within a data center.
  - **Data Center**: Multiple data centers.
  - **Availability Zone**: Distinct zones within a cloud provider's network.
  - **Region**: Geographically dispersed locations.

# How Rate Limiting Works

Link: https://newsletter.systemdesigncodex.com/p/how-rate-limiting-works

## How It Works

1. **Concept**: Limits the number of requests sent to a server.
2. **Implementation**: A rate limiter is used to control traffic to servers or APIs.

## Key Concepts

1. **Limit**: Maximum number of requests allowed in a set time frame (e.g., 600 requests per day).
2. **Window**: The duration of the limit, varying from seconds to days.
3. **Identifier**: A unique attribute (like a user ID or IP address) identifying the request sender.

## Designing a Rate Limiter

- **Process** (sketched below):
  1. **Count Requests**: Track the number of requests from a user or IP.
  2. **Limit Exceeded**: If the count exceeds the limit, block or restrict further requests.
- **Considerations**:
  - Storage of request counters.
  - Rate limiting rules.
  - Response strategy for blocked requests.
  - Rule change implementation.
  - Maintaining application performance.

## System Components

- **Rate Limiter Component**: Checks incoming requests against the rules and stored data (number of requests made).
- **Rules Engine**: Defines the rate limiting rules.
- **Cache**: Stores rate-limiting data for high throughput and low latency.
- **Response Handling**:
  - Allow the request if within the limit.
  - Block the request if over the limit, typically with HTTP status code 429.

## Improvements

- **Silent Drop**: Fool attackers by silently dropping excess requests.
- **Cached Rules**: Improve performance with a cache for the rules engine and background updates for rule changes.
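A minimal sketch of the design above using a fixed-window counter. The dict stands in for the cache that stores counters; a production limiter would keep them in something like Redis and layer the rules engine on top:

```python
import time

LIMIT = 600                      # max requests per window
WINDOW_SECONDS = 24 * 60 * 60    # e.g., 600 requests per day

_counters: dict[tuple[str, int], int] = {}   # stands in for the cache

def allow_request(identifier: str) -> bool:
    """True if the request is within the limit; False means block (HTTP 429) or silently drop."""
    window = int(time.time() // WINDOW_SECONDS)   # which window we are in
    key = (identifier, window)                    # counter per identifier per window
    count = _counters.get(key, 0)
    if count >= LIMIT:
        return False              # limit exceeded
    _counters[key] = count + 1
    return True

for _ in range(3):
    print(allow_request("user-123"))   # True until the 600-request limit is hit
```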
# Caching

Link: https://newsletter.francofernando.com/p/caching

## Concept of Caching

- **Purpose**: Speeds up data access by temporarily storing data in a fast-access hardware or software layer.
- **Cache Hit**: Data is found in the cache.
- **Cache Miss**: Data is not in the cache and must be fetched from its original location.

## Caching in Distributed Systems

- **Levels**: Hardware, OS, front end, web apps, databases, etc.
- **Roles**:
  - Reducing latency.
  - Saving network requests.
  - Storing results of resource-intensive operations.
  - Avoiding repetitive operations.

## Types of Caching

- **Application Caching**: Integrated into app code; checks the cache before database access. Examples: Memcached, Redis.
- **Database Caching**: Built into databases; requires no code changes and optimizes data retrieval.

## Considerations and Challenges

- **Cache Miss Rate**: High miss rates can add latency instead of removing it.
- **Stale Data**: Ensuring cache data is up to date and relevant.

## Caching Strategies

1. **Cache Aside (Lazy Loading)**:

   - Read directly from the cache. On a miss, read from the DB and update the cache (see the sketch at the end of this section).
   - Advantages: Good for read-heavy workloads. The cache stores only data that is actually requested.
   - Disadvantages: Can serve stale data. Initial requests are cache misses.

2. **Read Through**:

   - The application interacts only with the cache. The cache manages fetching data from the DB.
   - Simplifies app code but complicates cache implementation.

3. **Write Through**:

   - Writes data to the cache and the DB simultaneously.
   - Ensures data consistency. Higher write latency.

4. **Write Back (Asynchronous Writing)**:

   - Writes data to the cache, then asynchronously to the DB.
   - Lower write latency. Good for write-heavy workloads.

5. **Write Around**:
   - Writes go directly to the DB; the cache stores only read data.
   - Good for infrequently read data. Higher read latency for newly written data.

## Choosing a Cache Strategy

- **Depends on data access patterns.**
- **Cache-Aside**: Good for general-purpose, read-intensive applications.
- **Write-Heavy Workloads**: Write-back approaches are beneficial.
- **Infrequent Reads**: Write-around strategy.

## Eviction Policies

- **Manage Limited Cache Space**:
  - FIFO: First in, first out.
  - LIFO: Last in, first out.
  - LRU: Least recently used.
  - MRU: Most recently used.
  - LFU: Least frequently used.
  - RR: Random replacement.
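A minimal sketch of the cache-aside strategy, with plain dicts standing in for the cache and the database; `fetch_user_from_db` is a hypothetical loader. Invalidating on write is one common way to limit the stale-data problem noted above:

```python
cache: dict[str, dict] = {}                 # stands in for Redis/Memcached
database = {"u1": {"name": "Ada"}}          # stands in for the primary store

def fetch_user_from_db(user_id: str) -> dict:
    return database[user_id]                # the expensive call we want to avoid

def get_user(user_id: str) -> dict:
    user = cache.get(user_id)
    if user is not None:                    # cache hit: fast path
        return user
    user = fetch_user_from_db(user_id)      # cache miss: read from the DB...
    cache[user_id] = user                   # ...and populate the cache
    return user

def update_user(user_id: str, data: dict) -> None:
    database[user_id] = data                # write to the DB
    cache.pop(user_id, None)                # invalidate so readers don't see stale data

get_user("u1")                              # miss: loads from the DB
get_user("u1")                              # hit: served from the cache
```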
# Database Replication Under the Hood

Link: https://newsletter.systemdesigncodex.com/p/database-replication-under-the-hood

## Statement-based Replication

- **How It Works**: The leader logs every SQL write statement (INSERT, UPDATE, DELETE) and forwards these statements to follower nodes.
- **Advantages**:
  - Efficient in network bandwidth; only SQL statements are transferred.
  - Portable across different database versions.
  - Simpler to implement.
- **Limitations**:
  - Non-deterministic functions (e.g., NOW(), UUID()) yield different values on replicas.
  - Transactions involving auto-incrementing columns must be executed in the same order.
  - Potential unforeseen effects due to triggers or stored procedures.

## Shipping the Write-Ahead Log (WAL)

- **Concept**: The WAL, an append-only sequence of all writes, is shipped to follower nodes.
- **Usage**: Common in databases like PostgreSQL.
- **Advantage**: Creates an exact replica of the leader’s data structures.
- **Disadvantage**: Tightly coupled to the storage engine, making it less flexible across database versions and hindering zero-downtime upgrades.

## Row-Based Replication

- **Functionality**: Uses a logical log describing writes in a row format.
- **Operation Details**:
  - Inserts log new values for all columns.
  - Deletes log identifiers of deleted rows.
  - Updates log identifiers and new values of modified columns.
- **Advantage**: Decoupled from the storage engine, allowing backward compatibility and version flexibility between leader and follower databases.

## Choosing Replication Methods

- The choice depends on the specific requirements of the system, such as:
  - Network efficiency.
  - Consistency requirements.
  - Database version compatibility.
- **Statement-based Replication**: Best for simple, less concurrent environments.
- **WAL Shipping**: Suitable for systems where an exact replica and data integrity are critical.
- **Row-Based Replication**: Ideal for environments requiring flexibility and compatibility across different database versions.

# Consistent Hashing

Link: https://newsletter.francofernando.com/p/consistent-hashing

## Caching Servers

- **Use Case**: Store frequently accessed data in fast, in-memory caches.
- **Hashing Role**: Ensures identical requests are sent to the same server by hashing request attributes (IP, username, etc.).
- **Challenge**: Maintaining effective caching when servers are added or removed.

## Data Partitioning

- **Purpose**: Distribute data across multiple database servers.
- **Hashing Function**: Data keys are hashed to determine the server where the data will be stored.
- **Limitation**: As with caching, adding or removing servers complicates data distribution.

## The Hashing Problem

- **Goal**: Map keys (data identifiers or workload requests) to servers efficiently.
- **Desired Properties**:
  - **Balancing**: Equal distribution of keys among servers.
  - **Scalability**: Easily adding or removing servers with minimal reconfiguration.
  - **Lookup Speed**: Quickly finding the server for a given key.

## Naïve Hashing Approach

- **Method**: Number the servers and use `hash(key) % N` to assign keys to servers.
- **Drawback**: Not scalable. Changing the server count (N) requires remapping almost all keys.

## Consistent Hashing

- **Concept**: Treat the hash space as a circle. Map both keys and servers onto this circle.
- **Operation**: Assign each key to the nearest server on the circle in the clockwise direction.
- **Advantages**:
  - Only a fraction of keys need remapping when servers are added or removed.
  - Better scalability.
- **Issue**: Does not by itself guarantee even key distribution (balancing).

## Virtual Nodes Solution

- **Strategy**: Introduce replicas, or virtual nodes, for each server on the hash circle (see the sketch below).
- **Benefits**:
  - Better balancing due to smaller ranges and more uniform key distribution.
  - Faster rebalancing when servers are added or removed.
  - Support for server fault tolerance and heterogeneity.
- **Implementation**: Assign more virtual nodes to more powerful servers for load balancing.
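A minimal hash ring with virtual nodes, following the description above. MD5 keeps the sketch dependency-free, and the server names and replica count are arbitrary:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=100):
        self._ring = []                          # sorted list of (point, server)
        for server in servers:
            for i in range(vnodes):              # virtual nodes smooth the distribution
                self._ring.append((_hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    def server_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the next server point."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.server_for("user:42"))
# Adding a server remaps only the keys that fall into its new ranges,
# unlike hash(key) % N, which would remap almost everything.
```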
# Why Replication Lag Occurs in Databases

Link: https://newsletter.systemdesigncodex.com/p/why-replication-lag-occurs-in-databases

- **Concept**: Replication lag is the delay between a write operation on the leader node and its replication to follower nodes in a database system.

- **Leader-based Replication Setup**:

  - Writes are processed by a single node (the leader).
  - Read queries can be served by any replica (follower).
  - Common in systems with more reads than writes.

- **Asynchronous vs. Synchronous Replication**:

  - **Synchronous**: All replicas must confirm write operations, causing potential unavailability if a replica is down.
  - **Asynchronous**: Allows distribution of reads across followers, but can lead to outdated reads if a follower lags.

- **How Replication Lag Occurs**:

  1. User A updates data on the leader node.
  2. The leader sends replication data to the followers.
  3. User B reads from a follower (replica 2) before it is updated, receiving outdated information.
  4. Replica 2 eventually gets updated.

- **Implications**:

  - Lag duration varies from fractions of a second to minutes.
  - Causes temporary data inconsistencies (eventual consistency).
  - Large lags can significantly impact application performance.

- **Challenge**: Managing replication lag to minimize data inconsistencies and ensure efficient operation.

# Problems Caused by Database Replication

Link: https://newsletter.systemdesigncodex.com/p/problems-caused-by-db-replication

1. **Vanishing Updates**

   - **Scenario**: A user updates data on the leader node, but a subsequent read from a lagging replica shows outdated data.
   - **Problem**: Users get frustrated as their updates appear to vanish.
   - **Solution**: Implement read-after-write consistency. Methods include:
     - Reading user-modified data from the leader.
     - Tracking recent writes with timestamps.
     - Monitoring and limiting queries on lagging replicas.

2. **Going Backward in Time**

   - **Issue**: A user sees an update (e.g., a new comment) that then disappears on refresh, because of a lagging replica.
   - **User Experience**: Confusion and inconsistency.
   - **Solution**: Ensure monotonic reads.
     - Users always read from the same replica.
     - Use hashing based on the user ID for replica selection.

3. **Violation of Causality**
   - **Problem**: In sharded databases, replication lag can disorder a sequence of communications (e.g., a reply appears before the original message).
   - **Result**: Cause and effect appear reversed.
   - **Solution**: Provide consistent prefix reads.
     - Ensures writes are read in the order they were made.

# How Request Coalescing Works

Link: https://newsletter.systemdesigncodex.com/p/how-request-coalescing-works

**Concept**: Request coalescing is a technique for optimizing database queries by reducing redundant requests for the same data.

**Application**: Used successfully by Discord to manage trillions of messages efficiently.

**Functionality**:

1. **Setup**: Intermediary data services sit between the API layer and the database.
2. **Process**:
   - When the first request is made, a worker task is initiated in the data service.
   - Subsequent requests for the same data subscribe to this existing task.
   - The worker task queries the database once and returns the result to all subscribers simultaneously.

**Differences from Caching**:

1. **Request Initiation**: In request coalescing, only the first request triggers a database query; subsequent ones wait for its result. In caching, every request would hit the cache.
2. **Use with Caching**: Request coalescing can complement caching by reducing the number of hits to the cache.
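The following sketch shows the coalescing pattern with asyncio; it is not Discord's implementation, just the idea: the first caller for a key starts the query, and concurrent callers await the same in-flight task. `query_db` is a hypothetical stand-in for the real database call:

```python
import asyncio

_inflight: dict[str, asyncio.Future] = {}
db_calls = 0

async def query_db(key: str) -> str:
    global db_calls
    db_calls += 1                            # count how often the DB is actually hit
    await asyncio.sleep(0.1)                 # pretend this is an expensive query
    return f"row-for-{key}"

async def coalesced_get(key: str) -> str:
    task = _inflight.get(key)
    if task is None:                         # first request: start the worker task
        task = asyncio.ensure_future(query_db(key))
        _inflight[key] = task
        task.add_done_callback(lambda _: _inflight.pop(key, None))
    return await task                        # later requests subscribe to the same task

async def main():
    results = await asyncio.gather(*[coalesced_get("msg:1") for _ in range(100)])
    assert len(results) == 100 and db_calls == 1   # one query served all 100 callers

asyncio.run(main())
```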
**Internal Working (Based on Discord's Implementation)**:

- Each worker task maintains a local state with requests and a list of requesters.
- Responses are propagated to all waiting requesters upon arrival.

**Applicability**:

- Request coalescing is particularly useful for systems with high concurrency and redundant requests.
- Whether this technique is necessary depends on the scale and specific challenges of the system.

# How to Migrate a MySQL Database

Link: https://newsletter.systemdesign.one/p/how-to-migrate-a-mysql-database

**Context**: Tumblr's MySQL database, spanning 21 terabytes and 60+ billion rows across 200+ servers, required a migration strategy that minimized user impact.

**Challenges**:

- Maintaining high availability and scalability.
- Minimizing downtime and user impact during migration.

**Strategies Used**:

1. **CQRS Pattern (Command and Query Responsibility Segregation)**:

   - Separated read and write operations for the database.
   - Ensured continuous read availability during migration.

2. **Leader-Follower Replication**:

   - A leader in a remote data center handled read-write operations.
   - The local data center had followers for handling read requests.
   - Used persistent connections to reduce latency issues.

3. **Database Proxy (ProxySQL)**:
   - Positioned in the local data center.
   - Maintained persistent connections to the remote leader.
   - Enabled connection pooling, improving performance and reducing disconnections.

**Migration Process**:

1. **Preparation**:
   - Stored metadata about leaders, followers, and proxies in each data center.
2. **Migration Execution**:
   - Shifted the database leader from Data Center A to B.
   - Automated tools redirected followers and proxies to the new leader.
3. **Outcome**:
   - Followers continued serving read requests.
   - Write requests were briefly halted or buffered, resulting in minimal user impact.

**Consideration for Further Improvement**:

- **Leader-Leader Replication**: Could improve write availability but poses a risk of data conflicts.
- **Reason for Non-Use**: The potential for conflicts is likely why Tumblr opted against this approach.

# Durability

Link: https://newsletter.francofernando.com/p/durability

**Core Objective**: Durability, ensuring data is not lost despite failures like power outages, system crashes, or hardware issues.

### Single-Node Database Persistence

1. **Durability Method**: Data is written to nonvolatile storage (hard drive, SSD).
2. **Transaction Processing**:
   - **Log Writing**: Data is first written to a log file before the actual data is updated.
   - **Update Execution**: After the log entry, the database updates the actual data.
   - **Role of the Log**: Enables reprocessing of transactions to restore a consistent state after a failure (see the sketch below).
   - **Efficiency**: Log writing is fast due to its append-only nature, minimizing seek time.
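A minimal sketch of the log-first idea for a single node: append (and fsync) each write to the log before applying it, and replay the log to recover after a crash. The file name and record format are arbitrary:

```python
import json
import os

LOG_PATH = "writes.log"
state: dict[str, str] = {}         # stands in for the actual data structures

def write(key: str, value: str) -> None:
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")
        log.flush()
        os.fsync(log.fileno())     # force the append onto nonvolatile storage
    state[key] = value             # only now apply the actual update

def recover() -> None:
    """Rebuild a consistent state after a crash by replaying the log."""
    if not os.path.exists(LOG_PATH):
        return
    with open(LOG_PATH) as log:
        for line in log:
            entry = json.loads(line)
            state[entry["key"]] = entry["value"]
```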
### Distributed Database Persistence

1. **Complexity**: Higher due to the need for coordination across multiple servers.
2. **Two-Phase Commit Protocol**:
   - **Coordinator Role**: A designated server coordinates the commit process.
   - **Process**:
     - The coordinator sends a commit instruction to all participant servers.
     - It waits for acknowledgments from all participants.
     - It finalizes the transaction with a commit or rollback based on the responses.

# Redis

Link: https://newsletter.francofernando.com/p/redis

**Redis Overview**:

- Redis stands for REmote DIctionary Server.
- It is an open-source, key-value database store.
- Functions as a data structure server, supporting structures like Strings, Lists, Sets, Hashes, Sorted Sets, and HyperLogLogs.

**History**:

- Created by Salvatore Sanfilippo in the late 2000s.
- Developed to address scaling issues with MySQL in real-time analytics.
- Gained wide adoption due to its efficiency and flexibility.

**Operations and Data Types**:

- Basic operations include `GET` and `SET`.
- Supports diverse data structures, each with specific use cases and operations.

**Redis Architectures**:

1. **Single Instance**: The simplest form, running on the same or a separate server.
2. **Replicated Instances**: A primary instance replicated across secondary instances for parallel read requests and backup.
3. **Sentinel**: Manages high availability, monitoring, and failure handling.
4. **Cluster**: Distributes data across multiple machines using sharding.

**Data Persistency**:

- Offers two methods:
  - RDB (Redis Database Backup): Snapshot-based backups.
  - AOF (Append Only File): Logs every change for more recent backups.
- The choice between RDB and AOF depends on the need for speed vs. data recency.

**Single-thread Model**:

- Uses a single-threaded model for operations, avoiding multi-threading overhead.
- Performance is typically limited by memory and network, not CPU.

**Use Cases**:

- **Database**: As a primary key-value store.
- **Cache**: For storing frequent queries or caching API requests.
- **Pub/Sub**: For scalable and fast messaging systems.

# Salt and Pepper

Link: https://newsletter.francofernando.com/p/salt-and-pepper

### 1. **Hashing**

- **Method**: Converts plain-text passwords into a random-looking string of characters.
- **Process**: The user's password is hashed and compared with the stored hash during login.
- **Common Algorithms**: MD5 and the SHA family. On their own, however, these are vulnerable to rainbow table attacks.

### 2. **Salting**

- **Purpose**: Strengthens hashing by defending against pre-computation attacks like rainbow tables.
- **Implementation**:
  - Generate a unique salt for each password.
  - Combine the salt with the password and hash the result.
  - Store the salt in plain text and the hashed password in the database.
- **Validation Process**:
  1. Retrieve the salt from the database.
  2. Combine the entered password with the salt and hash it.
  3. Compare with the stored hash for validation.
- **Uniqueness**: Ensures each stored hash is unique, even for identical passwords.

### 3. **Peppering**

- **Function**: Adds an extra layer of security on top of salting.
- **Mechanism**:
  - Add a pepper value to the password before hashing.
  - The pepper is not stored in the database.
- **Login Process**:
  - Attempt combinations of password and pepper until a match is found.
- **Benefit**: Significantly increases the effort required for brute-force attacks.

**Key Takeaways**:

- **Combining Techniques**: Using both salting and peppering provides robust protection (see the sketch below).
- **Importance of Uniqueness**: Unique salts and peppers make each hash distinct.
- **Updating Practices**: Continuously update and improve password storage methods to counteract new hacking techniques.
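A minimal sketch of salting plus peppering with the standard library. PBKDF2 stands in for the hash function (a dedicated password hash like bcrypt or Argon2 is preferable in practice), and it uses the variant where the pepper is a single application-wide secret kept out of the database, rather than the try-all-peppers scheme described above:

```python
import hashlib
import hmac
import os

PEPPER = b"app-secret-not-in-db"   # illustrative only; load from config or a secrets manager

def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)                              # unique salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode() + PEPPER, salt, 600_000
    )
    return salt, digest                                # store both; the salt may be plain text

def verify(password: str, salt: bytes, stored: bytes) -> bool:
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode() + PEPPER, salt, 600_000
    )
    return hmac.compare_digest(digest, stored)         # constant-time comparison

salt, stored = hash_password("hunter2")
assert verify("hunter2", salt, stored)
assert not verify("wrong-password", salt, stored)
```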
# Moving from Monolithic to Microservices

Link: https://newsletter.systemdesigncodex.com/p/from-monolithic-to-microservices

### 1. **Modular Monolith Approach**

- **Concept**: Incorporates modular design within a monolithic architecture.
- **Characteristics**:
  - Loosely coupled modules.
  - Well-defined boundaries.
  - Explicit dependencies.
- **Structure**: The application is divided into independent modules.
- **Deployment**: Still a single application deployment.
- **Advantages**:
  - Streamlines development and maintenance.
  - Offers microservices-like benefits without the associated complexity.

### 2. **Evolution to Vertical Slice Architecture**

- **Design Shift**: From horizontal layers to vertical slices of business functionality.
- **Benefits**:
  - Changes are scoped to specific business areas.
  - Easier feature addition and modification.
- **Microservices Potential**: Vertical modules can gradually evolve into independent microservices.
- **Learning Opportunity**: Provides insights into domain and functional splits.

### Key Takeaway

- **Balance**: Neither microservices nor monoliths are inherently superior.
- **Evolutionary Approach**: Adapt the architecture to the application's evolving needs.
- **Pragmatism**: Choose the architecture that best suits the project's requirements and context.

# The Secret Trick to High-Availability

Link: https://newsletter.systemdesigncodex.com/p/the-secret-trick-to-high-availability

### Strategies for Static Stability

1. **Active-Active High Availability**:

   - **Implementation**: Distribute traffic across instances in multiple Availability Zones (AZs).
   - **Example**: If two instances are needed, create three (50% over-provisioning).
   - **Benefit**: Maintains full capacity even if an entire AZ fails (see the worked example below).

2. **Active-Passive High Availability**:
   - **Use Case**: For stateful services like databases.
   - **Setup**: A primary instance in one AZ and a standby in another.
   - **Function**: The standby becomes primary if the original primary's AZ goes down.

### Criticism and Justification

- **Criticism**: Viewed as wasteful of resources due to over-provisioning.
- **Justification**:
  - Essential for mission-critical applications where downtime is unacceptable.
  - Used by major cloud services like AWS (EC2, S3, RDS) to prevent outages.

### Key Takeaway

- **Outages as the Norm**: Disruptions are inevitable; planning for them is crucial.
- **Risk Management**: Over-provisioning is a strategic choice to mitigate downtime risks.
- **Context-Dependent**: The level of static stability required varies with the system's criticality.

Static stability, while resource-intensive, is a fundamental approach for ensuring continuous operation in high-stakes environments where reliability and uptime are non-negotiable.
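The over-provisioning arithmetic from the Active-Active example generalizes: spread instances across AZs so that losing any one AZ still leaves the required capacity. A small worked sketch with illustrative numbers:

```python
import math

def instances_for(needed: int, zones: int) -> int:
    """Provision so that any (zones - 1) surviving AZs still hold `needed` instances."""
    per_zone = math.ceil(needed / (zones - 1))
    return per_zone * zones

assert instances_for(needed=2, zones=3) == 3    # the 50% over-provisioning example above
assert instances_for(needed=4, zones=3) == 6    # losing any one AZ still leaves 4
```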
# 4 Types of NoSQL Databases

Link: https://newsletter.systemdesigncodex.com/p/4-types-of-nosql-databases

### 1. **Document Databases**

- **Examples**: MongoDB, Couchbase, RavenDB.
- **Data Storage**: In the form of JSON, BSON, or XML documents.
- **Advantages**: Align closely with the domain-level data objects in applications.
- **Use Case**: Ideal for projects requiring a structure close to the application's data.

### 2. **Key-Value Stores**

- **Examples**: Redis, etcd, DynamoDB.
- **Structure**: Data stored as key-value pairs.
- **Simplicity**: Resembles a two-column table (key and value).
- **Use Cases**: Caching, shopping carts, user profiles.

### 3. **Column-Oriented Databases**

- **Examples**: Apache Cassandra, Apache HBase.
- **Storage Method**: Data stored in columns rather than rows.
- **Advantages**: Efficient for analytics and aggregations over specific columns.
- **Considerations**: Not strongly consistent by default; write operations can be complex.

### 4. **Graph Databases**

- **Examples**: Neo4j, Amazon Neptune.
- **Concept**: Focuses on the relationships between data elements (nodes and links).
- **Strengths**: Eliminates the need for the multi-table joins common in SQL databases.
- **Use Cases**: Knowledge graphs, social networks, map-like applications.

### Decision Guide

- **Document DBs**: Versatile; suitable for most applications that would traditionally use SQL.
- **Key-Value Stores**: For applications requiring fast read/write access to simple data items.
- **Column-Oriented**: Analytics and operations on large datasets.
- **Graph Databases**: Applications where relationships are central to the data model.