├── 1_Intro to ML System Design.pdf ├── 2_ML Product Design.pdf ├── 3_Data Science Methodology.pdf ├── 4_Production ML System Design.pdf ├── Guide - ML System Design Doc.md ├── README.md └── static ├── image-1.png ├── image-2.png └── image-3.png /1_Intro to ML System Design.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/1_Intro to ML System Design.pdf -------------------------------------------------------------------------------- /2_ML Product Design.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/2_ML Product Design.pdf -------------------------------------------------------------------------------- /3_Data Science Methodology.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/3_Data Science Methodology.pdf -------------------------------------------------------------------------------- /4_Production ML System Design.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/4_Production ML System Design.pdf -------------------------------------------------------------------------------- /Guide - ML System Design Doc.md: -------------------------------------------------------------------------------- 1 | # Guide - ML System Design Doc - v1 2 | 3 | ***Overview:** This guide provides a structured approach to creating high-quality Machine Learning System Design Documents. 
It's designed to help business leaders, data scientists, and engineers effectively communicate the strategic value, technical implementation details, and practical considerations of an ML project.* 4 | 5 | ***Purpose:** An ML system design document serves as a blueprint for the entire project, ensuring all stakeholders have a clear understanding of the goals, approach, and implementation details. It facilitates alignment, guides decision-making, and serves as a reference throughout the project lifecycle.* 6 | 7 | # **1. Overview: Purpose and Impact** 8 | 9 | **Overview:** This section provides a high-level summary of the entire project, including purpose, problem, solution, and desired outcome. It usually takes 3-5 sentences. 10 | 11 | Key points: 12 | 13 | - Clear problem statement 14 | - Proposed ML solution 15 | - Expected business impact 16 | - High-level implementation timeline 17 | 18 |
19 | Guide 20 | 21 | **Purpose:** To provide a concise summary that captures the essence of the project and its expected outcomes. 22 | 23 | **Guiding questions:** 24 | 25 | - What specific problem are we addressing? 26 | - Why is this problem important to the business? 27 | - What are the high-level goals of this ML system? 28 | - What key outcomes do we expect? 29 | - What's our timeline for implementation? 30 | 31 |
32 | 33 | # **2. ML Product Design** 34 | 35 | 39 | 40 | > *“…most businesses don’t care about ML metrics unless they can move business metrics”* 41 | Source: [Designing Machine Learning Systems (Chip Huyen 2022)](https://github.com/chiphuyen/dmls-book/blob/main/summary.md#chapter-1-overview-of-machine-learning-systems) 42 | 43 | ## **2.1 Problem Statement (Motivation)** 44 | 45 | Is it the right problem to solve? 46 | 47 | Overview: Explain the business problem and its importance to the organization. 48 | 49 | Key points: 50 | 51 | - Detailed description of the business problem 52 | - Current approaches and their limitations 53 | - Market or industry context 54 | - Alignment with business strategy 55 | 56 |
57 | Guide 58 | 59 | Purpose: Clearly define the business problem and its relevance to the organization. 60 | 61 | Guiding questions: 62 | 63 | - Why is the problem important to solve, and why now? 64 | - What are the costs of not solving this problem? 65 | - How does this align with our overall business strategy? 66 | 67 | > "A problem well-stated is a problem half-solved." - Charles Kettering 68 | 69 |
70 | 71 | 72 | ## 2.2 Customers 73 | 74 | Overview: This section identifies all parties involved in or affected by the ML system. 75 | 76 | Key points: 77 | 78 | - List of key stakeholders and their roles 79 | - Primary end users and their needs 80 | - Potential impact on each group 81 | 82 |
83 | Guide 84 | 85 | Purpose: To ensure all relevant perspectives are considered and to clarify who will be using or impacted by the system. 86 | 87 | Guiding questions: 88 | 89 | - Who will be directly using the ML system? 90 | - Whose work or processes will be affected by the system? 91 | - Who needs to be involved in the decision-making process? 92 | 93 | > "If you want to go fast, go alone. If you want to go far, go together." - African Proverb 94 | 95 |
96 | 97 | ## 2.3 Value Proposition 98 | 99 | Overview: This section articulates why AI/ML is the right approach for solving the problem. 100 | 101 | Key points: 102 | 103 | - Unique advantages of using AI/ML 104 | - Potential improvements over current methods 105 | 106 |
107 | Guide 108 | 109 | Purpose: To justify the use of AI/ML over traditional approaches and highlight its unique benefits. 110 | 111 | Guiding questions: 112 | 113 | - How does AI/ML solve this problem better than traditional methods? 114 | - What new capabilities does AI/ML bring to our business? 115 | - How does this solution position us for future growth? 116 | - Why is AI/ML required? 117 | 118 |
119 | 120 | ## **2.4 Business Metrics (Success)** 121 | 122 | Overview: This section defines measurable outcomes that indicate project success. Usually framed as business goals, such as increased customer engagement (e.g., CTR, DAU), revenue, or reduced cost. 123 | 124 | Key points: 125 | 126 | - Specific, quantifiable business metrics 127 | - Technical performance metrics 128 | - Timeline for achieving key milestones 129 | 130 |
131 | Guide 132 | 133 | Purpose: To establish clear, quantifiable goals that align business objectives with technical performance. 134 | 135 | Guiding questions: 136 | 137 | - How will we measure the success of this ML system? 138 | - What metrics align with our business objectives? 139 | - How do we balance technical and business performance? 140 | 141 | > Guiding Quote: "What gets measured gets managed." - Peter Drucker 142 | 143 |
144 | 145 | 146 | ## 2.5 Assumptions and Constraints 147 | 148 | Overview: This section outlines the foundational beliefs and limitations that shape the project. Make your assumptions and your understanding of the environment explicit. 149 | 150 | Key points: 151 | 152 | - Key assumptions about data, users, and processes 153 | - Technical constraints (e.g., computational resources) 154 | - Business constraints (e.g., budget, timeline) 155 | 156 |
157 | Guide 158 | 159 | Purpose: To make explicit the assumptions underlying the project and acknowledge known constraints. 160 | 161 | Guiding questions: 162 | 163 | - What critical assumptions are we making? 164 | - What technical or business constraints might limit our solution? 165 | - How might our assumptions or constraints impact the project's success? 166 | 167 | > "It is better to be roughly right than precisely wrong." - John Maynard Keynes 168 | 169 |
170 | 171 | ## 2.6 Cost Structure & ROI 172 | 173 | Overview: This section provides a clear financial picture of the project's costs and expected returns. 174 | 175 | Key points: 176 | 177 | - Detailed breakdown of costs (development, serving, infrastructure, maintenance) 178 | - Projected financial benefits and timeline 179 | - ROI calculation and sensitivity analysis 180 | 181 |
182 | Guide 183 | 184 | Purpose: To justify the investment in the ML system and set realistic expectations for financial returns. 185 | 186 | Guiding questions: 187 | 188 | - What are all the costs associated with this project? 189 | - How will the costs change over time? 190 | - When do we expect to see a return on our investment? 191 | - How sensitive is our ROI to changes in key assumptions? 192 | - Consider cost both in money and in time spent by dev/ML specialists. 193 | 194 | > "Price is what you pay. Value is what you get." - Warren Buffett 195 | > 196 | 197 | > *What are the major cost implications of the model we are building? How much will it cost to train, retrain, and serve the model? Computing the exact cost is hard, but a ballpark estimate is usually enough. 198 | Source: [Design documents for ML models](https://medium.com/people-ai-engineering/design-documents-for-ml-models-bbcd30402ff7)* 199 | 200 |
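The ROI calculation and sensitivity analysis mentioned in section 2.6 can be sketched in a few lines. This is a minimal illustration, not a financial model: the function names `roi` and `roi_sensitivity` and all figures are hypothetical, and ROI is assumed to mean `(benefit - cost) / cost` over the same period.

```python
def roi(benefit: float, cost: float) -> float:
    """Return ROI as a fraction: (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


def roi_sensitivity(benefit: float, cost: float, deltas=(-0.2, 0.0, 0.2)) -> dict:
    """ROI when the benefit estimate is off by each relative delta."""
    return {delta: roi(benefit * (1 + delta), cost) for delta in deltas}


# Hypothetical yearly figures, for illustration only.
development_cost = 60_000
serving_cost_per_month = 5_000
annual_cost = development_cost + 12 * serving_cost_per_month
annual_benefit = 180_000

for delta, value in roi_sensitivity(annual_benefit, annual_cost).items():
    print(f"benefit {delta:+.0%}: ROI {value:.0%}")
```

A table like this makes it easy to see whether the project still pays off if the benefit estimate turns out to be optimistic.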
201 | 202 | # 3. Solution Design 203 | 204 | 208 | 209 | ## 3.1 High-Level Solution Overview 210 | 211 | Overview: This section provides a bird's-eye view of the proposed ML system. What should the predictions look like for a consumer? Here we don’t care how these predictions are derived; we only write down our expectations of their form and structure. 212 | 213 | Key points: 214 | 215 | - Consumable format of prediction for key users 216 | - Key components and their interactions 217 | 218 |
219 | Guide 220 | 221 | Purpose: To give all stakeholders a clear understanding of the overall system architecture and components. 222 | 223 | Guiding questions: 224 | 225 | - What should the predictions look like for a consumer? 226 | - What are the main components of our ML system? 227 | - How do these components interact with each other? 228 | - How will your system integrate with upstream data (what data we’ll pass to the model) and downstream users (how users will access predictions)? 229 | 230 | > "Simplicity is the ultimate sophistication." - Leonardo da Vinci 231 | 232 |
233 | 234 | ## **3.2** User Interface and Experience Design 235 | 236 | Overview: This section describes how users will interact with the ML system. 237 | 238 | Key points: 239 | 240 | - User interface mockups or wireframes 241 | - User journey maps 242 | - Integration points with existing systems 243 | 244 |
245 | Guide 246 | 247 | Purpose: To ensure the ML system is user-friendly and integrates well with existing workflows. 248 | 249 | Guiding questions: 250 | 251 | - How will users interact with the ML system? 252 | - What changes to existing workflows are required? 253 | - How can we make the system intuitive and user-friendly? 254 | - How will you incorporate human intervention into your ML system (e.g., product/customer exclusion lists)? 255 | 256 | > "Design is not just what it looks like and feels like. Design is how it works." - Steve Jobs 257 | 258 |
259 | 260 | ## 3.3 Performance Metrics 261 | 262 | Overview: This section defines how the system's performance will be measured. 263 | 264 | Key points: 265 | 266 | - Technical metrics (e.g., accuracy, latency) 267 | - Metrics we calculate for model predictions (offline). This covers the case when predictions don’t affect real users. For example, if we deploy the ML system in shadow mode and it produces recommendations or decisions, how will we measure the quality of those predictions? 268 | - Business metrics (online). This covers the case when predictions affect real users. 269 | 270 |
271 | Guide 272 | 273 | Purpose: To establish clear criteria for evaluating the system's technical performance and business impact. 274 | 275 | - How will we know that the solution is successful? 276 | 277 | > "Not everything that can be counted counts, and not everything that counts can be counted." - Albert Einstein 278 | 279 |
280 | 281 | ## 3.4 Validation Strategy and Pilot Project Plan 282 | 283 | Overview: This section outlines the approach for testing and validating the ML system. 284 | 285 | Key points: 286 | 287 | - Validation methodology 288 | - Pilot project scope and timeline 289 | - Success criteria for moving to full implementation 290 | 291 |
292 | Guide 293 | 294 | Purpose: To ensure the system performs as expected and delivers value before full-scale deployment. 295 | 296 | Guiding questions: 297 | 298 | - How will we validate the ML system's performance? How will we validate that the solution works within the production process (system)? 299 | - What does a successful pilot look like, regardless of the ML model used under the hood? 300 | - How will we incorporate learnings from the pilot? 301 | - If you're A/B testing, how will you assign treatment and control (e.g., customer vs. session-based) and what metrics will you measure? What are the success and [guardrail](https://medium.com/airbnb-engineering/designing-experimentation-guardrails-ed6a976ec669) metrics? 302 | 303 | > "In God we trust. All others must bring data." - W. Edwards Deming 304 | 305 |
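The question about assigning treatment and control can be made concrete with a small sketch. One common approach, assumed here rather than prescribed by this guide, is deterministic hashing of the randomization unit: pass a customer ID for customer-based assignment or a session ID for session-based assignment. The function name `assign_variant` and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a unit (customer or session ID) to a variant.

    The experiment name is mixed into the hash so that the same unit can
    land in different groups across independent experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# Customer-based assignment: the same customer always sees the same variant.
print(assign_variant("customer-42", "ranker-v2"))
```

Hashing avoids storing an assignment table, and the choice of unit ID decides whether a returning customer stays in one group (customer-based) or can switch between visits (session-based).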
306 | 307 | ## **3.5 Requirements & Constraints** 308 | 309 | Overview: This section specifies what the system must do and how well it must perform. Include functional and non-functional requirements. 310 | 311 | Key points: 312 | 313 | - Detailed functional requirements 314 | - Non-functional requirements (e.g., scalability, security, type of inference: batch, real-time, stream) 315 | - Constraints (hardware, model) 316 | - Compliance and regulatory requirements 317 | - Corner cases 318 | - Scope of the solution 319 | - Risks 320 | 321 |
322 | Guide 323 | 324 | Purpose: To clearly define the system's capabilities and performance standards. 325 | 326 | Guiding questions: 327 | 328 | - What specific functions must the system perform? 329 | - What are the performance, security, and scalability requirements? 330 | - What regulatory standards must we adhere to? 331 | - What's in-scope & out-of-scope? Some problems are too big to solve all at once. Be clear about what's out of scope. 332 | - Corner cases: What’s the worst that can happen if the model is wrong only once but for a very important data point? Are all data points equally important? 333 | - Risks: What are the major risks you are facing? What are you doing to mitigate them? Are you doing some bleeding-edge research? Do you depend on a major infrastructure component that is yet to be built? 334 | 335 | > "Quality is never an accident; it is always the result of intelligent effort." - John Ruskin 336 | 337 | **Requirements** 338 | 339 | - Functional requirements are those that should be met to ship the project. They should be described in terms of the customer perspective and benefit. (See [this](https://eugeneyan.com/writing/ml-design-docs/#the-why-and-what-of-design-docs) for more details.) 340 | - Non-functional/technical requirements are those that define system quality and how the system should be implemented. These include performance (throughput, latency, error rates), cost (infra cost, ops effort), security, data privacy, etc. 341 | - Type of inference: batch, real-time, stream 342 | 343 | **Constraints (hardware, model)** 344 | 345 | - Constraints can come in the form of non-functional requirements (e.g., cost below $`x` a month, p99 latency < `y`ms) 346 | 347 |
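A constraint such as "p99 latency < `y` ms" is easiest to agree on when everyone computes it the same way. The sketch below assumes the nearest-rank definition of a percentile and integer percentile values; the helper names `percentile` and `meets_latency_budget` and the sample latencies are hypothetical.

```python
import math


def percentile(values, p: int):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_latency_budget(latencies_ms, budget_ms: float, p: int = 99) -> bool:
    """True if the p-th percentile latency is strictly below the budget."""
    return percentile(latencies_ms, p) < budget_ms


# Hypothetical measurements: 1 ms, 2 ms, ..., 100 ms.
latencies = list(range(1, 101))
print(percentile(latencies, 99), meets_latency_budget(latencies, budget_ms=100))
```

Other percentile definitions (for example, linear interpolation) give slightly different numbers, so the design doc should state which one the constraint refers to.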
348 | 349 | 350 | # **4. Data Science Methodology** 351 | 352 | 356 | 357 | ## **4.1.** Problem Framing and Approach 358 | 359 | Overview: This section explains how the business problem translates into a data science problem. How will you frame the problem? For example, fraud detection can be framed as an unsupervised (outlier detection, graph cluster) or supervised problem (e.g., classification). 360 | 361 | Key points: 362 | 363 | - ML problem type (e.g., classification, regression) 364 | - Approach selection rationale 365 | - Potential alternative approaches 366 | - Baseline solution (without ML) 367 | 368 |
369 | Guide 370 | 371 | Purpose: To ensure the ML approach aligns with the business problem and leverages appropriate techniques. 372 | 373 | Guiding questions: 374 | 375 | - How do we frame this as a machine learning problem? 376 | - What ML metric should we optimize? 377 | - Why is this approach the most suitable? 378 | - What is the simplest solution? Can we solve the problem without ML? 379 | - What is a feasible baseline solution? 380 | 381 | > "If I had an hour to solve a problem, I'd spend 55 minutes thinking about the problem and 5 minutes thinking about solutions." - Albert Einstein 382 | 383 |
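A baseline without ML can be as simple as always predicting the most frequent label. The sketch below is one possible baseline for a classification framing, with hypothetical names (`majority_class_baseline`, `accuracy`) and made-up labels; it is meant as a reference point that any ML model should beat.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common_label


def accuracy(y_true, y_pred) -> float:
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, heavily imbalanced fraud labels.
predict = majority_class_baseline(["ok"] * 9 + ["fraud"])
y_true = ["ok", "ok", "fraud", "ok"]
y_pred = [predict(None) for _ in y_true]
print(accuracy(y_true, y_pred))
```

On imbalanced data this baseline scores high accuracy while catching no fraud at all, which is a useful reminder to pick the ML metric with the business problem in mind.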
384 | 385 | ## **4.2.** Data and Feature Engineering 386 | 387 | Overview: This section outlines the plan for data acquisition, processing, and management. 388 | 389 | Key points: 390 | 391 | - Data sources and collection methods 392 | - Data preprocessing and feature engineering 393 | - Data quality assurance processes 394 | - Data Labeling 395 | 396 |
397 | Guide 398 | 399 | Purpose: To ensure the ML system has access to high-quality, relevant data. 400 | 401 | Guiding questions: 402 | 403 | - What data will you use to train your model? 404 | - What input data is needed during serving? 405 | - How will we ensure data quality and relevance? 406 | - **Data Processing Techniques:** What machine learning techniques will you use? How will you clean and prepare the data (e.g., excluding outliers)? 407 | - **Feature Engineering**: How will you create features? 408 | 409 | > "Data is the new oil. It's valuable, but if unrefined it cannot really be used." - Clive Humby 410 | 411 |
412 | 413 | ## 4.3 Modeling Techniques and Algorithms 414 | 415 | Overview: This section describes the specific ML techniques and algorithms to be used. 416 | 417 | Key points: 418 | 419 | - Selected algorithms and rationale 420 | - Model architecture details 421 | - Hyperparameter tuning strategy 422 | 423 |
424 | Guide 425 | 426 | Purpose: To provide a clear understanding of the technical approach and its rationale. 427 | 428 | Guiding questions: 429 | 430 | - Which ML algorithms are most suitable for our problem? 431 | - How will we optimize model performance? 432 | - What are the trade-offs between different modeling approaches? 433 | 434 | > "All models are wrong, but some are useful." - George Box 435 | 436 |
437 | 438 | ## **4.4.** Model Validation and Evaluation Framework 439 | 440 | Overview: This section explains how model performance will be evaluated and validated. 441 | 442 | Key points: 443 | 444 | - **Techniques**: Cross-validation (e.g., k-fold cross-validation), holdout validation, stratified sampling. 445 | - **Data Splits**: Training set, validation set, and test set. 446 | - **Metrics**: Performance metrics specific to the validation and test phases. Evaluation metrics should be relevant to business metrics. 447 | 448 |
449 | Guide 450 | 451 | > "Trust, but verify." - Ronald Reagan 452 | > 453 | 454 | **Purpose:** To ensure the model's performance can be reliably measured and meets business requirements. 455 | 456 | **Guiding questions:** 457 | 458 | - How will we split our data to validate the model effectively? 459 | - What techniques will we use to validate the model? 460 | - How will we interpret the validation metrics to improve the model? 461 | - How will we measure model performance? 462 | - How will we ensure the evaluation process is unbiased and thorough? 463 | - How will we interpret and report the evaluation results to stakeholders? 464 | 465 | Summary on model Validation & Evaluation 466 | 467 | | | **Model Validation** | **Model Evaluation** | 468 | | --- | --- | --- | 469 | | **Purpose** | Assess generalization to unseen data | Comprehensive assessment before deployment | 470 | | **Focus** | Tuning and improving model performance | Final performance metrics and business impact | 471 | | **Timing** | During the model development phase | After model training and validation | 472 | | **Techniques** | Cross-validation, holdout validation | Confusion matrix, ROC/AUC, MAE, MSE | 473 | | **Data Used** | Training set, validation set | Test set | 474 | | **Metrics** | Validation accuracy, validation loss | Precision, recall, F1-score, RMSE, business metrics | 475 | | **Outcome** | Model refinement and selection | Final decision on model readiness for production | 476 | 477 |
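The k-fold cross-validation named under Techniques can be sketched without any ML library. This is a minimal illustration that assumes the test set has already been held out and that samples can be shuffled freely (no time ordering or grouping); the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps splits reproducible
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Every sample is used for validation exactly once across the folds.
for train, validation in k_fold_indices(n_samples=10, k=5):
    print(len(train), len(validation))
```

For imbalanced labels, a stratified variant that preserves class shares in each fold is usually a better fit, and time-dependent data typically needs splits that respect time order.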
478 | 479 | # **5.** Production ML System Design 480 | 481 | 485 | 486 | ## **5.1. High-level design** 487 | 488 | **Overview:** This section provides a detailed view of the system's technical architecture. Start by providing a big-picture view. [System-context diagrams](https://en.wikipedia.org/wiki/System_context_diagram) and [data-flow diagrams](https://en.wikipedia.org/wiki/Data-flow_diagram) work well. 489 | 490 | **Key points:** 491 | 492 | - Detailed system architecture diagram 493 | - Data flow and processing pipeline 494 | - Integration with existing infrastructure 495 | 496 |
497 | Guide 498 | 499 | Purpose: To ensure all technical stakeholders understand how the system will be built and how data will flow through it. 500 | 501 | Guiding questions: 502 | 503 | - How will the ML system integrate with our current infrastructure? 504 | - What are the key components of our data processing pipeline? 505 | - How will we ensure efficient data flow through the system? 506 | 507 | > "Simplicity is a prerequisite for reliability." - Edsger Dijkstra 508 | 509 |
510 | 511 | ## 5.2 Deployment and Serving 512 | 513 | Overview: This section outlines the plan for deploying and serving the ML model. 514 | 515 | Key points: 516 | 517 | - Type of inference (batch, real-time) 518 | - Requirements for CI/CD. 519 | - Deployment methodology (e.g., blue-green, canary) 520 | - Serving infrastructure details 521 | 522 |
523 | Guide 524 | 525 | Purpose: To ensure smooth deployment and reliable serving of model predictions. 526 | 527 | Guiding questions: 528 | 529 | - Do we want to perform batch (offline) or real-time (online) inference? What tools should we use? 530 | - How will we deploy the model with minimal disruption? Do we need human checks? How do we test the models before deployment to be sure they don’t break prod? 531 | - Do we need to support online A/B testing, canary deployment, etc.? 532 | - What infrastructure is needed to serve model predictions? 533 | 534 | > "Hope is not a strategy." - Vince Lombardi 535 | 536 |
537 | 538 | 539 | ## **5.3** Data Engineering 540 | 541 | Overview: 542 | 543 | This section describes the data infrastructure supporting the ML system. 544 | 545 | Key points: 546 | 547 | - Data ingestion and storage solutions 548 | - ETL processes (data pipelines) 549 | - Data versioning and lineage tracking 550 | - Feature Stores (Data Marts) 551 | 552 |
553 | Guide 554 | 555 | Purpose: To ensure reliable, efficient data processing and feature engineering in production. 556 | 557 | Guiding questions: 558 | 559 | - How will we handle data ingestion and storage? 560 | - What ETL processes are needed to prepare data for the model? 561 | - How will we track data versions and lineage? 562 | 563 | > "Data is the foundation of all machine learning systems." - Andrew Ng 564 | 565 |
566 | 567 | ## **5.4** Model Development Lifecycle 568 | 569 | Overview: 570 | 571 | This section explains the process for ongoing model development and deployment. 572 | 573 | Key points: 574 | 575 | - Requirements for automation, reproducibility, reliability 576 | - Model versioning strategy. 577 | - Model updating and retraining process 578 | - Model Registry 579 | 580 |
581 | Guide 582 | 583 | Purpose: To ensure continuous improvement and reliable updates to the ML system. 584 | 585 | Guiding questions: 586 | 587 | - How will we train the model? 588 | - How often will we retrain it? 589 | - How will we manage model versions? 590 | - What is our CI/CD pipeline for model deployment? 591 | - How and when will we retrain our models? 592 | 593 | > "Continuous improvement is better than delayed perfection." - Mark Twain 594 | 595 |
596 | 597 | ## 5.5. **Testing and Monitoring** 598 | 599 | Overview: This section outlines the approach for ensuring ongoing system quality and performance. 600 | 601 | Key points: 602 | 603 | - Monitoring system design 604 | - Testing strategy (unit, integration, system tests) 605 | - Quality assurance processes 606 | - Biases and misuses of your model. 607 | - Performance Drop 608 | - Data Drift 609 | 610 |
611 | Guide 612 | 613 | Purpose: To maintain system reliability and catch issues before they impact business operations. 614 | 615 | Guiding questions: 616 | 617 | - How will we monitor the system's performance in production? 618 | - What testing procedures will ensure system reliability? 619 | - How will we maintain quality as the system evolves? 620 | - How will we know that the model performs well? 621 | - How will you log events in your system? What metrics will you monitor and how? Will you have alarms if a metric breaches a threshold or something else goes wrong? 622 | - What model and data metrics should we track? 623 | 624 | > "Quality is not an act, it is a habit." - Aristotle 625 | 626 |
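Data drift and threshold alarms can be illustrated with the Population Stability Index (PSI), one of several drift measures and an assumption of this sketch rather than a recommendation of the guide. The function name `psi`, the equal-width binning, and the 0.2 alert threshold (a commonly quoted rule of thumb) are all illustrative choices.

```python
import math


def psi(expected, actual, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a current sample.

    Bins are equal-width over the range of the reference sample; values
    outside that range are clipped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps the logarithm defined for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: training-time reference vs. shifted production data.
reference = list(range(100))
production = [v + 50 for v in reference]
DRIFT_THRESHOLD = 0.2  # illustrative alert level, tune per feature
if psi(reference, production) > DRIFT_THRESHOLD:
    print("ALERT: feature distribution has drifted")
```

The same pattern (compute a metric per monitoring window, compare it with a threshold, raise an alarm) applies to model metrics such as a drop in precision.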
627 | 628 | ## **5.6** Scalability and Infrastructure Planning 629 | 630 | Overview: This section describes how the system will scale to meet future demands. 631 | 632 | Key points: 633 | 634 | - Scalability requirements and approach 635 | - Infrastructure growth plan 636 | - Performance optimization strategies 637 | - Infrastructure costs 638 | 639 |
640 | Guide 641 | 642 | Purpose: To ensure the system can grow with the business and handle increased load. 643 | 644 | Guiding questions: 645 | 646 | - How will you host your system? On-premise, cloud, or hybrid? 647 | - How will our system handle increased load? 648 | - What infrastructure changes are needed for future growth? 649 | - How can we optimize system performance at scale? 650 | - How much will it cost to build and operate your system? Share estimated monthly costs (e.g., EC2 instances, Lambda, etc.) 651 | 652 | > "Scalability is not a feature; it's an architectural characteristic." - Martin L. Abbott 653 | 654 |
655 | 656 | ## **5.7 Requirements & Constraints** 657 | 658 | Overview: 659 | 660 | This section addresses critical security, privacy, regulatory, and other requirements and constraints. 661 | 662 | Key points: 663 | 664 | - Data security measures 665 | - System security 666 | - Data Privacy. Compliance with relevant regulations (e.g., GDPR, CCPA) 667 | - Risks & Uncertainties 668 | - Ethical considerations 669 | - Additional requirements and constraints 670 | 671 |
672 | Guide 673 | 674 | Purpose: To ensure the ML system protects sensitive data and complies with relevant regulations. 675 | 676 | Guiding questions: 677 | 678 | - How will your system/application authenticate users and incoming requests? If it's publicly accessible, will it be behind a firewall? 679 | - How will we protect sensitive data? 680 | - What measures ensure user privacy? 681 | - How do we maintain regulatory compliance? 682 | - Data Privacy. How will you ensure the privacy of customer data? Will your system be compliant with data retention and deletion policies (e.g., [GDPR](https://gdpr.eu/what-is-gdpr/))? 683 | - What worries you that you would like others to review? Risks are the known unknowns; uncertainties are the unknown unknowns. 684 | 685 | > "Security is always excessive until it's not enough." - Robbie Sinclair 686 | 687 |
688 | 689 | 690 | # Resources 691 | 692 | **Templates** 693 | 694 | - [ML System Design doc template - eugeneyan](https://github.com/eugeneyan)/[ml-design-docs](https://github.com/eugeneyan/ml-design-docs) 695 | - [ML DESIGN TEMPLATE (machinelearninginterviews.com)](https://www.machinelearninginterviews.com/ml-design-template/) 696 | - [Postmortem / Correction of Error (CoE) template](https://medium.com/@josh_70523/postmortem-correction-of-error-coe-template-db69481da31d) 697 | 698 | **Posts & Guides** 699 | 700 | - [How to Write Design Docs for Machine Learning Systems](https://eugeneyan.com/writing/ml-design-docs/) 701 | - [The Undeniable Importance of Design Docs to Data Scientists (Vincent Tatan)](https://towardsdatascience.com/the-undeniable-importance-of-design-docs-to-data-scientists-421132561f3c) 702 | - [Understanding Design Docs Principles (Vincent Tatan)](https://towardsdatascience.com/understanding-design-docs-principles-for-achieving-data-scientists-53e6d5ad6f7e) 703 | - [Machine Learning Product Design (Made With ML)](https://madewithml.com/courses/mlops/product-design/) 704 | - [Machine Learning System Design (Made With ML)](https://madewithml.com/courses/mlops/systems-design/) 707 | - [Design documents for ML models (Olexiy Oryeshko, People.ai)](https://medium.com/people-ai-engineering/design-documents-for-ml-models-bbcd30402ff7) 708 | 709 | **Examples** 710 | 711 | - [The Quickest Analytics to Build Your Instagram Business](https://towardsdatascience.com/the-quickest-analytics-to-build-your-instagram-business-b7b3c5d68056) 712 | - Video of the [Demo Day for the class CS 329S: Machine Learning Systems Design at Stanford 
(Winter 2022)](https://www.youtube.com/live/AZNTqytOhXk?feature=shared) and [project reports](https://stanford-cs329s.github.io/reports/) 713 | - Stanford, CS 329S - Machine Learning Systems Design: [Student project - Tender Matching People to Recipes](https://stanford-cs329s.github.io/reports/tender-recipe-recommendations/) (Recipe recommendation + Streamlit) 714 | - Stanford, CS 329S - Machine Learning Systems Design: [Student project - ML Production System For Detecting Covid-19 From Coughs](https://stanford-cs329s.github.io/reports/ml-production-system-for-covid-detection/) (Audio + tabular features + GCP) 715 | - Stanford, CS 329S - Machine Learning Systems Design: Student project - [Building a Context Graph Generator](https://stanford-cs329s.github.io/reports/context-graph-generator/) (Graph DB + BERT + Streamlit) 716 | - Stanford, CS 329S - Machine Learning Systems Design: Student project - [An active data valuation system for dashcam data crowdsourcing](https://stanford-cs329s.github.io/reports/dashcam-data-valuation/) (Technical ML for ML focus + AWS) 717 | - Stanford, CS 329S - Machine Learning Systems Design: Student project - [Stylify](https://stanford-cs329s.github.io/reports/stylify/) (GAN + AWS) 718 | - Stanford, CS 329S - Machine Learning Systems Design: Student project - [Fact-Checking Tool for Public Health Claims](https://stanford-cs329s.github.io/reports/Fact-Checking-Tool-for-Public-Health-Claims/) (Streamlit + Docker + GCP) 719 | 720 | **Books** 721 | 722 | - [Designing Machine Learning Systems book (Chip Huyen, O'Reilly 2022)](https://github.com/chiphuyen/dmls-book/blob/main/summary.md#chapter-1-overview-of-machine-learning-systems) 723 | 724 | 725 | **Courses** 726 | 727 | - [CS 329S: Machine Learning Systems Design](https://stanford-cs329s.github.io/) -------------------------------------------------------------------------------- /README.md: 
-------------------------------------------------------------------------------- 1 | # ML System Design 2 | 3 | ## Goals 4 | 5 | - Understand the critical role of system design in successful ML projects 6 | - Learn to bridge the gap between business needs and technical solutions 7 | - Develop skills to effectively communicate with stakeholders 8 | 9 | ## Learning Outcomes 10 | 11 | - Differentiate ML System Design from traditional approaches 12 | - Create effective ML design documents 13 | - Set appropriate goals and metrics for ML projects 14 | - Decompose a business request into a set of smaller and better-defined problems 15 | 16 | ## Outline 17 | 18 | ### 1. ML System Design Document 19 | 20 | ML System Design Document 21 | 22 | 23 | ### 2. Implement ML System 24 | 25 | Implement ML System 26 | 27 | ### 3. Reflection on ML System Design 28 | 29 | Reflection on ML System Design -------------------------------------------------------------------------------- /static/image-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/static/image-1.png -------------------------------------------------------------------------------- /static/image-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/static/image-2.png -------------------------------------------------------------------------------- /static/image-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/static/image-3.png --------------------------------------------------------------------------------