├── 1_Intro to ML System Design.pdf
├── 2_ML Product Design.pdf
├── 3_Data Science Methodology.pdf
├── 4_Production ML System Design.pdf
├── Guide - ML System Design Doc.md
├── README.md
└── static
    ├── image-1.png
    ├── image-2.png
    └── image-3.png


/1_Intro to ML System Design.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/1_Intro to ML System Design.pdf


--------------------------------------------------------------------------------
/2_ML Product Design.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/2_ML Product Design.pdf


--------------------------------------------------------------------------------
/3_Data Science Methodology.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/3_Data Science Methodology.pdf


--------------------------------------------------------------------------------
/4_Production ML System Design.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/4_Production ML System Design.pdf


--------------------------------------------------------------------------------
/Guide - ML System Design Doc.md:
--------------------------------------------------------------------------------
  1 | # Guide - ML System Design Doc - v1
  2 | 
  3 | ***Overview:** This guide provides a structured approach to creating high-quality Machine Learning System Design Documents. It's designed to help business leaders, data scientists, and engineers effectively communicate the strategic value, technical implementation details, and practical considerations of an ML project.*
  4 | 
  5 | ***Purpose:** An ML system design document serves as a blueprint for the entire project, ensuring all stakeholders have a clear understanding of the goals, approach, and implementation details. It facilitates alignment, guides decision-making, and serves as a reference throughout the project lifecycle.*
  6 | 
  7 | # **1. Overview: Purpose and Impact**
  8 | 
  9 | **Overview:** This section provides a high-level summary of the entire project, including purpose, problem, solution, and desired outcome. It usually takes 3-5 sentences.
 10 | 
 11 | Key points:
 12 | 
 13 | - Clear problem statement
 14 | - Proposed ML solution
 15 | - Expected business impact
 16 | - High-level implementation timeline
 17 | 
 18 | <details>
 19 | <summary>Guide</summary>
 20 | 
 21 | **Purpose:** To provide a concise summary that captures the essence of the project and its expected outcomes.
 22 | 
 23 | **Guiding questions:**
 24 | 
 25 | - What specific problem are we addressing?
 26 | - Why is this problem important to the business?
 27 | - What are the high-level goals of this ML system?
 28 | - What key outcomes do we expect?
 29 | - What's our timeline for implementation?
 30 | 
 31 | </details>
 32 | 
 33 | # **2. ML Product Design**
 34 | 
 35 | <aside>
 36 | 💡 This section discusses how the ML solution will function as a product, helping users and driving business value. It connects technical work to business outcomes.
 37 | 
 38 | </aside>
 39 | 
 40 | > *“…most businesses don’t care about ML metrics unless they can move business metrics”*
 41 | Source: [Designing Machine Learning Systems (Chip Huyen 2022)](https://github.com/chiphuyen/dmls-book/blob/main/summary.md#chapter-1-overview-of-machine-learning-systems)
 42 | 
 43 | ## **2.1 Problem Statement (Motivation)**
 44 | 
 45 | Is it the right problem to solve?
 46 | 
 47 | Overview: Explain the business problem and its importance to the organization. 
 48 | 
 49 | Key points:
 50 | 
 51 | - Detailed description of the business problem
 52 | - Current approaches and their limitations
 53 | - Market or industry context
 54 | - Alignment with business strategy
 55 | 
 56 | <details>
 57 | <summary>Guide</summary>
 58 | 
 59 | Purpose: Clearly define the business problem and its relevance to the organization.
 60 | 
 61 | Guiding questions:
 62 | 
 63 | - Why is the problem important to solve, and why now?
 64 | - What are the costs of not solving this problem?
 65 | - How does this align with our overall business strategy?
 66 | 
 67 | > "A problem well-stated is a problem half-solved." - Charles Kettering
 68 | 
 69 | </details>
 70 | 
 71 | 
 72 | ## 2.2 Customers
 73 | 
 74 | Overview: This section identifies all parties involved in or affected by the ML system.
 75 | 
 76 | Key points:
 77 | 
 78 | - List of key stakeholders and their roles
 79 | - Primary end users and their needs
 80 | - Potential impact on each group
 81 | 
 82 | <details>
 83 | <summary>Guide</summary>
 84 | 
 85 | Purpose: To ensure all relevant perspectives are considered and to clarify who will be using or impacted by the system.
 86 | 
 87 | Guiding questions:
 88 | 
 89 | - Who will be directly using the ML system?
 90 | - Whose work or processes will be affected by the system?
 91 | - Who needs to be involved in the decision-making process?
 92 | 
 93 | > "If you want to go fast, go alone. If you want to go far, go together." - African Proverb
 94 | 
 95 | </details>
 96 | 
 97 | ## 2.3 Value Proposition
 98 | 
 99 | Overview: This section articulates why AI/ML is the right approach for solving the problem.
100 | 
101 | Key points:
102 | 
103 | - Unique advantages of using AI/ML
104 | - Potential improvements over current methods
105 | 
106 | <details>
107 | <summary>Guide</summary>
108 |     
109 | Purpose: To justify the use of AI/ML over traditional approaches and highlight its unique benefits.
110 | 
111 | Guiding questions:
112 | 
113 | - How does AI/ML solve this problem better than traditional methods?
114 | - What new capabilities does AI/ML bring to our business?
115 | - How does this solution position us for future growth?
116 | - Why is AI/ML required?
117 | 
118 | </details>
119 | 
120 | ## **2.4 Business Metrics (Success)**
121 | 
122 | Overview: This section defines measurable outcomes that indicate project success. Usually framed as business goals, such as increased customer engagement (e.g., CTR, DAU), revenue, or reduced cost.
123 | 
124 | Key points:
125 | 
126 | - Specific, quantifiable business metrics
127 | - Technical performance metrics
128 | - Timeline for achieving key milestones
129 | 
130 | <details>
131 | <summary>Guide</summary>
132 | 
133 | Purpose: To establish clear, quantifiable goals that align business objectives with technical performance.
134 | 
135 | Guiding questions:
136 | 
137 | - How will we measure the success of this ML system?
138 | - What metrics align with our business objectives?
139 | - How do we balance technical and business performance?
140 | 
141 | > "What gets measured gets managed." - Peter Drucker
142 | 
143 | </details>
144 | 
145 | 
146 | ## 2.5 Assumptions and Constraints
147 | 
148 | Overview: This section outlines the foundational beliefs and limitations that shape the project. State your assumptions and your understanding of the environment explicitly.
149 | 
150 | Key points:
151 | 
152 | - Key assumptions about data, users, and processes
153 | - Technical constraints (e.g., computational resources)
154 | - Business constraints (e.g., budget, timeline)
155 | 
156 | <details>
157 | <summary>Guide</summary>
158 | 
159 | Purpose: To make explicit the assumptions underlying the project and acknowledge known constraints.
160 | 
161 | Guiding questions:
162 | 
163 | - What critical assumptions are we making?
164 | - What technical or business constraints might limit our solution?
165 | - How might our assumptions or constraints impact the project's success?
166 | 
167 | > "It is better to be roughly right than precisely wrong." - John Maynard Keynes
168 | 
169 | </details>
170 | 
171 | ## 2.6 Cost Structure & ROI
172 | 
173 | Overview: This section provides a clear financial picture of the project's costs and expected returns.
174 | 
175 | Key points:
176 | 
177 | - Detailed breakdown of costs (development, serving, infrastructure, maintenance)
178 | - Projected financial benefits and timeline
179 | - ROI calculation and sensitivity analysis
180 | 
181 | <details>
182 | <summary>Guide</summary>
183 | 
184 | Purpose: To justify the investment in the ML system and set realistic expectations for financial returns.
185 | 
186 | Guiding questions:
187 | 
188 | - What are all the costs associated with this project?
189 | - How will the costs change over time?
190 | - When do we expect to see a return on our investment?
191 | - How sensitive is our ROI to changes in key assumptions?
192 | - Consider cost both in money and in time spent by dev/ML specialists.
193 | 
194 | > "Price is what you pay. Value is what you get." - Warren Buffett
195 | > 
196 | 
197 | > *What are the major cost implications of the model we are building? How much will it cost to train, retrain, and serve the model? Computing the exact cost is hard, but a ballpark estimate is usually enough. 
198 | Source: [Design documents for ML models](https://medium.com/people-ai-engineering/design-documents-for-ml-models-bbcd30402ff7)*
199 | 
200 | </details>
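
The ROI calculation and sensitivity analysis listed in the key points can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function names, the flat monthly benefit and run-cost model, and all figures in `BASE` are assumptions made up for the example.

```python
import math


def roi(total_benefit: float, total_cost: float) -> float:
    """Return ROI as a fraction: (benefit - cost) / cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (total_benefit - total_cost) / total_cost


def project_roi(monthly_benefit, build_cost, monthly_run_cost, months):
    """ROI over a horizon, assuming flat monthly benefit and run cost."""
    benefit = monthly_benefit * months
    cost = build_cost + monthly_run_cost * months
    return roi(benefit, cost)


def break_even_month(monthly_benefit, build_cost, monthly_run_cost):
    """First whole month in which cumulative benefit covers cumulative cost."""
    margin = monthly_benefit - monthly_run_cost
    if margin <= 0:
        return None  # the project never pays back under these assumptions
    return math.ceil(build_cost / margin)


def sensitivity(base: dict, param: str, factors=(0.8, 1.0, 1.2)) -> dict:
    """Recompute ROI while scaling one assumption and holding the rest fixed."""
    results = {}
    for factor in factors:
        scenario = dict(base)
        scenario[param] = base[param] * factor
        results[factor] = project_roi(**scenario)
    return results


# Hypothetical figures, for illustration only.
BASE = dict(
    monthly_benefit=20_000.0,
    build_cost=120_000.0,
    monthly_run_cost=5_000.0,
    months=24,
)

if __name__ == "__main__":
    print("ROI:", project_roi(**BASE))
    print("Break-even month:", break_even_month(20_000.0, 120_000.0, 5_000.0))
    print("Sensitivity to benefit:", sensitivity(BASE, "monthly_benefit"))
```

Scaling one parameter at a time shows which assumption the ROI depends on most, which is usually the one worth validating first.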
201 | 
202 | # 3. Solution Design
203 | 
204 | <aside>
205 | 💡 This section provides a detailed view of the proposed ML solution, including how it will work and how it will be implemented within the business context.
206 | 
207 | </aside>
208 | 
209 | ## 3.1 High-Level Solution Overview
210 | 
211 | Overview: This section provides a bird's-eye view of the proposed ML system. What should the predictions look like to a consumer? Here we don’t care how the predictions are derived; we only write down our expectations for their form and structure.
212 | 
213 | Key points:
214 | 
215 | - Consumable format of prediction for key users
216 | - Key components and their interactions
217 | 
218 | <details>
219 | <summary>Guide</summary>
220 | 
221 | Purpose: To give all stakeholders a clear understanding of the overall system architecture and components.
222 | 
223 | Guiding questions:
224 | 
225 | - What should the predictions look like to a consumer?
226 | - What are the main components of our ML system?
227 | - How do these components interact with each other?
228 | - How will your system integrate with upstream data (what data we’ll pass to the model) and downstream users (how users will access predictions)?
229 | 
230 | > "Simplicity is the ultimate sophistication." - Leonardo da Vinci
231 | 
232 | </details>
233 | 
234 | ## **3.2** User Interface and Experience Design
235 | 
236 | Overview: This section describes how users will interact with the ML system. 
237 | 
238 | Key points:
239 | 
240 | - User interface mockups or wireframes
241 | - User journey maps
242 | - Integration points with existing systems
243 | 
244 | <details>
245 | <summary>Guide</summary>
246 | 
247 | Purpose: To ensure the ML system is user-friendly and integrates well with existing workflows.
248 | 
249 | Guiding questions:
250 | 
251 | - How will users interact with the ML system?
252 | - What changes to existing workflows are required?
253 | - How can we make the system intuitive and user-friendly?
254 | - How will you incorporate human intervention into your ML system (e.g., product/customer exclusion lists)?
255 | 
256 | > "Design is not just what it looks like and feels like. Design is how it works." - Steve Jobs
257 | 
258 | </details>
259 | 
260 | ## 3.3 Performance Metrics
261 | 
262 | Overview: This section defines how the system's performance will be measured.
263 | 
264 | Key points:
265 | 
266 | - Technical metrics (e.g., accuracy, latency)
267 | - Metrics we calculate for model predictions (offline), i.e., when predictions don’t affect real users. For example, if we deploy the ML system in shadow mode and it produces recommendations or decisions, how will we measure the quality of those predictions?
268 | - Business metrics (online), i.e., when predictions affect real users.
269 | 
270 | <details>
271 | <summary>Guide</summary>
272 | 
273 | Purpose: To establish clear criteria for evaluating the system's technical performance and business impact. 
274 | 
275 | - How will we know that the solution is successful?
276 | 
277 | > "Not everything that can be counted counts, and not everything that counts can be counted." - Albert Einstein
278 | 
279 | </details>
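
As an illustration of an offline metric, the sketch below scores logged shadow-mode predictions against the outcomes observed later. It assumes a binary task with labels `0` and `1`; the function name and the sample data are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical shadow-mode log: observed outcomes vs. model predictions.
    outcomes = [1, 1, 1, 0, 0, 0, 1, 0]
    shadow_predictions = [1, 1, 0, 1, 0, 0, 1, 0]
    print(precision_recall_f1(outcomes, shadow_predictions))
```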
280 | 
281 | ## 3.4 Validation Strategy and Pilot Project Plan
282 | 
283 | Overview: This section outlines the approach for testing and validating the ML system.
284 | 
285 | Key points:
286 | 
287 | - Validation methodology
288 | - Pilot project scope and timeline
289 | - Success criteria for moving to full implementation
290 | 
291 | <details>
292 | <summary>Guide</summary>
293 | 
294 | Purpose: To ensure the system performs as expected and delivers value before full-scale deployment.
295 | 
296 | Guiding questions:
297 | 
298 | - How will we validate the ML system's performance? How will we validate that the solution works within the production process (system)?
299 | - What does a successful pilot look like, regardless of the ML model used under the hood?
300 | - How will we incorporate learnings from the pilot?
301 | - If you're A/B testing, how will you assign treatment and control (e.g., customer vs. session-based) and what metrics will you measure? What are the success and [guardrail](https://medium.com/airbnb-engineering/designing-experimentation-guardrails-ed6a976ec669) metrics?
302 | 
303 | > "In God we trust. All others must bring data." - W. Edwards Deming
304 | 
305 | </details>
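
One common way to assign treatment and control is deterministic hashing of the randomization unit, so that the same customer (or session) always lands in the same group. The sketch below is one possible approach under that assumption; `assign_variant`, the experiment name, and the ID format are hypothetical.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a unit (customer or session ID) to a variant."""
    if not 0.0 <= treatment_share <= 1.0:
        raise ValueError("treatment_share must be in [0, 1]")
    # Salting with the experiment name keeps assignments independent
    # across experiments that share the same unit IDs.
    key = f"{experiment}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    for customer in ("customer-1", "customer-2", "customer-3"):
        print(customer, assign_variant(customer, "ranking-v2"))
```

Passing a customer ID gives customer-based assignment; passing a session ID gives session-based assignment, at the cost of one customer possibly seeing both variants.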
306 | 
307 | ## **3.5 Requirements & Constraints**
308 | 
309 | Overview: This section specifies what the system must do and how well it must perform. Include both functional and non-functional requirements.
310 | 
311 | Key points:
312 | 
313 | - Detailed functional requirements
314 | - Non-functional requirements (e.g., scalability, security, type of inference: batch, real-time, stream)
315 | - Constraints (hardware, model)
316 | - Compliance and regulatory requirements
317 | - Corner cases
318 | - Scope of the solution
319 | - Risks
320 | 
321 | <details>
322 | <summary>Guide</summary>
323 | 
324 | Purpose: To clearly define the system's capabilities and performance standards. 
325 | 
326 | Guiding questions:
327 | 
328 | - What specific functions must the system perform?
329 | - What are the performance, security, and scalability requirements?
330 | - What regulatory standards must we adhere to?
331 | - What's in-scope & out-of-scope? Some problems are too big to solve all at once. Be clear about what's out of scope.
332 | - Corner cases: What’s the worst that can happen if the model is wrong only once but for a very important data point? Are all data points equally important?
333 | - Risks: What are the major risks you are facing? What are you doing to mitigate them? Are you doing some bleeding-edge research? Do you depend on a major infrastructure component that is yet to be built?
334 | 
335 | > "Quality is never an accident; it is always the result of intelligent effort." - John Ruskin
336 | 
337 | **Requirements**
338 | 
339 | - Functional requirements are those that should be met to ship the project. They should be described in terms of the customer perspective and benefit. (See [this](https://eugeneyan.com/writing/ml-design-docs/#the-why-and-what-of-design-docs) for more details.)
340 | - Non-functional/technical requirements are those that define system quality and how the system should be implemented. These include performance (throughput, latency, error rates), cost (infra cost, ops effort), security, data privacy, etc.
341 | - Type of inference: batch, real-time, stream
342 | 
343 | **Constraints (hardware, model)**
344 | 
345 | - Constraints can come in the form of non-functional requirements (e.g., cost below $`x` a month, p99 latency < `y`ms)
346 | 
347 | </details>
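
Constraints such as "cost below $`x` a month, p99 latency < `y`ms" are easiest to enforce when they are written as an automated check. The sketch below is a minimal example using the nearest-rank percentile definition; the function names and thresholds are assumptions for illustration.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def check_constraints(latencies_ms, monthly_cost, max_p99_ms, max_monthly_cost):
    """Compare measured p99 latency and monthly cost against their limits."""
    p99 = percentile(latencies_ms, 99)
    return {
        "p99_ms": p99,
        "latency_ok": p99 < max_p99_ms,
        "cost_ok": monthly_cost < max_monthly_cost,
    }


if __name__ == "__main__":
    # Hypothetical measurements: 100 requests taking 1..100 ms.
    report = check_constraints(list(range(1, 101)), 900.0, 100, 1000.0)
    print(report)
```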
348 |     
349 | 
350 | # **4. Data Science Methodology**
351 | 
352 | <aside>
353 | 💡 This section covers the methodology for developing the ML model, from problem framing to model validation.
354 | 
355 | </aside>
356 | 
357 | ## **4.1.** Problem Framing and Approach
358 | 
359 | Overview: This section explains how the business problem translates into a data science problem. How will you frame the problem? For example, fraud detection can be framed as an unsupervised (outlier detection, graph cluster) or supervised problem (e.g., classification).
360 | 
361 | Key points:
362 | 
363 | - ML problem type (e.g., classification, regression)
364 | - Approach selection rationale
365 | - Potential alternative approaches
366 | - Baseline solution (without ML)
367 | 
368 | <details>
369 | <summary>Guide</summary>
370 | 
371 | Purpose: To ensure the ML approach aligns with the business problem and leverages appropriate techniques.
372 | 
373 | Guiding questions:
374 | 
375 | - How do we frame this as a machine learning problem?
376 | - What ML metric should we optimize?
377 | - Why is this approach the most suitable?
378 | - What is the simplest solution? Can we solve the problem without ML?
379 | - What is a feasible baseline solution?
380 | 
381 | > "If I had an hour to solve a problem, I'd spend 55 minutes thinking about the problem and 5 minutes thinking about solutions." - Albert Einstein
382 | 
383 | </details>
384 | 
385 | ## **4.2.** Data and Feature Engineering
386 | 
387 | Overview: This section outlines the plan for data acquisition, processing, and management.
388 | 
389 | Key points:
390 | 
391 | - Data sources and collection methods
392 | - Data preprocessing and feature engineering
393 | - Data quality assurance processes
394 | - Data Labeling
395 | 
396 | <details>
397 | <summary>Guide</summary>
398 | 
399 | Purpose: To ensure the ML system has access to high-quality, relevant data.
400 | 
401 | Guiding questions:
402 | 
403 | - What data will you use to train your model?
404 | - What input data is needed during serving?
405 | - How will we ensure data quality and relevance?
406 | - **Data Processing Techniques:** What machine learning techniques will you use? How will you clean and prepare the data (e.g., excluding outliers)?
407 | - **Feature Engineering**: How will you create features?
408 | 
409 | > "Data is the new oil. It's valuable, but if unrefined it cannot really be used." - Clive Humby
410 | 
411 | </details>
412 | 
413 | ## 4.3 Modeling Techniques and Algorithms
414 | 
415 | Overview: This section describes the specific ML techniques and algorithms to be used.
416 | 
417 | Key points:
418 | 
419 | - Selected algorithms and rationale
420 | - Model architecture details
421 | - Hyperparameter tuning strategy
422 | 
423 | <details>
424 | <summary>Guide</summary>
425 | 
426 | Purpose: To provide a clear understanding of the technical approach and its rationale.
427 | 
428 | Guiding questions:
429 | 
430 | - Which ML algorithms are most suitable for our problem?
431 | - How will we optimize model performance?
432 | - What are the trade-offs between different modeling approaches?
433 | 
434 | > "All models are wrong, but some are useful." - George Box
435 | 
436 | </details>
437 | 
438 | ## **4.4.** Model Validation and Evaluation Framework
439 | 
440 | Overview: This section explains how model performance will be evaluated and validated.
441 | 
442 | Key points:
443 | 
444 | - **Techniques**: Cross-validation (e.g., k-fold cross-validation), holdout validation, stratified sampling.
445 | - **Data Splits**: Training set, validation set, and test set.
446 | - **Metrics**: Performance metrics specific to the validation and test phases. Evaluation metrics should be relevant to business metrics.
447 | 
448 | <details>
449 | <summary>Guide</summary>
450 |     
451 | > "Trust, but verify." - Ronald Reagan
452 | > 
453 | 
454 | **Purpose:** To ensure the model's performance can be reliably measured and meets business requirements.
455 | 
456 | **Guiding questions:**
457 | 
458 | - How will we split our data to validate the model effectively?
459 | - What techniques will we use to validate the model?
460 | - How will we interpret the validation metrics to improve the model?
461 | - How will we measure model performance?
462 | - How will we ensure the evaluation process is unbiased and thorough?
463 | - How will we interpret and report the evaluation results to stakeholders?
464 | 
465 | Summary of Model Validation & Evaluation
466 | 
467 | |  | **Model Validation** | **Model Evaluation**  |
468 | | --- | --- | --- |
469 | | **Purpose** | Assess generalization to unseen data | Comprehensive assessment before deployment |
470 | | **Focus** | Tuning and improving model performance | Final performance metrics and business impact |
471 | | **Timing** | During the model development phase | After model training and validation |
472 | | **Techniques** | Cross-validation, holdout validation | Confusion matrix, ROC/AUC, MAE, MSE |
473 | | **Data Used** | Training set, validation set | Test set |
474 | | **Metrics** | Validation accuracy, validation loss | Precision, recall, F1-score, RMSE, business metrics |
475 | | **Outcome** | Model refinement and selection | Final decision on model readiness for production |
476 | 
477 | </details>
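
The data splits and stratified sampling mentioned above can be sketched with the standard library alone. This is a simplified illustration rather than a recommended implementation: `stratified_split`, the default fractions, and the fixed seed are assumptions, and a real project would more likely rely on an established ML library.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split indices into train/validation/test, preserving class ratios."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    # Hypothetical imbalanced labels: 80 negatives and 20 positives.
    labels = [0] * 80 + [1] * 20
    train, val, test = stratified_split(labels, fractions=(0.5, 0.25, 0.25))
    print(len(train), len(val), len(test))
```

The test indices should be set aside and used only for the final evaluation, matching the "Data Used" row of the table.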
478 | 
479 | # **5.** Production ML System Design
480 | 
481 | <aside>
482 | 💡 This section outlines the design for deploying and operationalizing the ML system in a production environment.
483 | 
484 | </aside>
485 | 
486 | ## **5.1. High-level design**
487 | 
488 | **Overview:** This section provides a detailed view of the system's technical architecture. Start by providing a big-picture view. [System-context diagrams](https://en.wikipedia.org/wiki/System_context_diagram) and [data-flow diagrams](https://en.wikipedia.org/wiki/Data-flow_diagram) work well.
489 | 
490 | **Key points:**
491 | 
492 | - Detailed system architecture diagram
493 | - Data flow and processing pipeline
494 | - Integration with existing infrastructure
495 | 
496 | <details>
497 | <summary>Guide</summary>
498 | 
499 | Purpose: To ensure all technical stakeholders understand how the system will be built and how data will flow through it.
500 | 
501 | Guiding questions:
502 | 
503 | - How will the ML system integrate with our current infrastructure?
504 | - What are the key components of our data processing pipeline?
505 | - How will we ensure efficient data flow through the system?
506 | 
507 | > "Simplicity is a prerequisite for reliability." - Edsger Dijkstra
508 | 
509 | </details>
510 | 
511 | ## 5.2 Deployment and Serving
512 | 
513 | Overview: This section outlines the plan for deploying and serving the ML model.
514 | 
515 | Key points:
516 | 
517 | - Type of inference (batch, real-time)
518 | - Requirements for CI/CD.
519 | - Deployment methodology (e.g., blue-green, canary)
520 | - Serving infrastructure details
521 | 
522 | <details>
523 | <summary>Guide</summary>
524 | 
525 | Purpose: To ensure smooth deployment and reliable serving of model predictions.
526 | 
527 | Guiding questions:
528 | 
529 | - Do we want to perform batch (offline) or real-time (online) inference?  What tools should we use?
530 | - How will we deploy the model with minimal disruption? Do we need human checks? How do we test the models before deployment to be sure they don’t break prod?
531 | - Do we need to support online A/B testing, canary deployment, etc.?
532 | - What infrastructure is needed to serve model predictions?
533 | 
534 | > "Hope is not a strategy." - Vince Lombardi
535 | 
536 | </details>
537 |     
538 | 
539 | ## **5.3** Data Engineering
540 | 
541 | Overview: 
542 | 
543 | This section describes the data infrastructure supporting the ML system.
544 | 
545 | Key points:
546 | 
547 | - Data ingestion and storage solutions
548 | - ETL processes (data pipelines)
549 | - Data versioning and lineage tracking
550 | - Feature Stores (Data Marts)
551 | 
552 | <details>
553 | <summary>Guide</summary>
554 | 
555 | Purpose: To ensure reliable, efficient data processing and feature engineering in production.
556 | 
557 | Guiding questions:
558 | 
559 | - How will we handle data ingestion and storage?
560 | - What ETL processes are needed to prepare data for the model?
561 | - How will we track data versions and lineage?
562 | 
563 | > "Data is the foundation of all machine learning systems." - Andrew Ng
564 | 
565 | </details>
566 | 
567 | ## **5.4** Model Development Lifecycle
568 | 
569 | Overview: 
570 | 
571 | This section explains the process for ongoing model development and deployment.
572 | 
573 | Key points:
574 | 
575 | - Requirements for automation, reproducibility, reliability
576 | - Model versioning strategy.
577 | - Model updating and retraining process
578 | - Model Registry
579 | 
580 | <details>
581 | <summary>Guide</summary>
582 | 
583 | Purpose: To ensure continuous improvement and reliable updates to the ML system.
584 | 
585 | Guiding questions:
586 | 
587 | - How will we train the model?
588 | - How often will we retrain it?
589 | - How will we manage model versions?
590 | - What is our CI/CD pipeline for model deployment?
591 | - How and when will we retrain our models?
592 | 
593 | > "Continuous improvement is better than delayed perfection." - Mark Twain
594 | 
595 | </details>
596 | 
597 | ## 5.5. **Testing and Monitoring**
598 | 
599 | Overview: This section outlines the approach for ensuring ongoing system quality and performance.
600 | 
601 | Key points:
602 | 
603 | - Monitoring system design
604 | - Testing strategy (unit, integration, system tests)
605 | - Quality assurance processes
606 | - Biases and misuses of your model.
607 | - Performance Drop
608 | - Data Drift
609 | 
610 | <details>
611 | <summary>Guide</summary>
612 | 
613 | Purpose: To maintain system reliability and catch issues before they impact business operations.
614 | 
615 | Guiding questions:
616 | 
617 | - How will we monitor the system's performance in production?
618 | - What testing procedures will ensure system reliability?
619 | - How will we maintain quality as the system evolves?
620 | - How will we know that the model performs well?
621 | - How will you log events in your system? What metrics will you monitor and how? Will you have alarms if a metric breaches a threshold or something else goes wrong?
622 | - Which model and data metrics will we track?
623 | 
624 | > "Quality is not an act, it is a habit." - Aristotle
625 | 
626 | </details>
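
Data drift on a numeric feature can be tracked with a simple statistic such as the Population Stability Index (PSI). The sketch below is one possible metric, not the only one; the equal-width binning, the `eps` floor for empty bins, and the 0.25 alert threshold (a common rule of thumb, not a standard) are assumptions.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("expected sample must contain more than one distinct value")
    width = (hi - lo) / n_bins  # equal-width bins over the reference range

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    # Hypothetical feature values: training sample vs. shifted production sample.
    reference = [i / 100 for i in range(100)]
    production = [v + 0.5 for v in reference]
    score = psi(reference, production)
    print("PSI:", score, "ALERT" if score > 0.25 else "ok")
```

A check like this can run on a schedule per feature and raise an alarm when the score breaches the chosen threshold.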
627 | 
628 | ## **5.6** Scalability and Infrastructure Planning
629 | 
630 | Overview: This section describes how the system will scale to meet future demands.
631 | 
632 | Key points:
633 | 
634 | - Scalability requirements and approach
635 | - Infrastructure growth plan
636 | - Performance optimization strategies
637 | - Infrastructure costs
638 | 
639 | <details>
640 | <summary>Guide</summary>
641 | 
642 | Purpose: To ensure the system can grow with the business and handle increased load.
643 | 
644 | Guiding questions:
645 | 
646 | - How will you host your system? On-premise, cloud, or hybrid?
647 | - How will our system handle increased load?
648 | - What infrastructure changes are needed for future growth?
649 | - How can we optimize system performance at scale?
650 | - How much will it cost to build and operate your system? Share estimated monthly costs (e.g., EC2 instances, Lambda, etc.)
651 | 
652 | > "Scalability is not a feature; it's an architectural characteristic." - Martin L. Abbott
653 | 
654 | </details>
655 | 
656 | ## **5.7 Requirements & Constraints**
657 | 
658 | Overview: 
659 | 
660 | This section addresses critical security, privacy, regulatory, and other requirements and constraints.
661 | 
662 | Key points:
663 | 
664 | - Data security measures
665 | - System security
666 | - Data Privacy. Compliance with relevant regulations (e.g., GDPR, CCPA)
667 | - Risks & Uncertainties
668 | - Ethical considerations
669 | - Additional requirements and constraints
670 | 
671 | <details>
672 | <summary>Guide</summary>
673 | 
674 | Purpose: To ensure the ML system protects sensitive data and complies with relevant regulations.
675 | 
676 | Guiding questions:
677 | 
678 | - How will your system/application authenticate users and incoming requests? If it's publicly accessible, will it be behind a firewall?
679 | - How will we protect sensitive data?
680 | - What measures ensure user privacy?
681 | - How do we maintain regulatory compliance?
682 | - Data Privacy. How will you ensure the privacy of customer data? Will your system be compliant with data retention and deletion policies (e.g., [GDPR](https://gdpr.eu/what-is-gdpr/))?
683 | - What worries you that you would like others to review? Risks are the known unknowns; uncertainties are the unknown unknowns.
684 | 
685 | > "Security is always excessive until it's not enough." - Robbie Sinclair
686 | 
687 | </details>
688 | 
689 | 
690 | # Resources
691 | 
692 | **Templates** 
693 | 
694 | - [ML System Design doc template - eugeneyan/ml-design-docs](https://github.com/eugeneyan/ml-design-docs)
695 | - [ML DESIGN TEMPLATE (machinelearninginterviews.com)](https://www.machinelearninginterviews.com/ml-design-template/)
696 | - [Postmortem / Correction of Error (CoE) template](https://medium.com/@josh_70523/postmortem-correction-of-error-coe-template-db69481da31d)
697 | 
698 | **Posts & Guides** 
699 | 
700 | - [How to Write Design Docs for Machine Learning Systems](https://eugeneyan.com/writing/ml-design-docs/)
703 | - [Machine Learning Product Design (Made With ML)](https://madewithml.com/courses/mlops/product-design/)
704 | - [Machine Learning System Design (Made With ML)](https://madewithml.com/courses/mlops/systems-design/)
705 | - [The Undeniable Importance of Design Docs to Data Scientists](https://towardsdatascience.com/the-undeniable-importance-of-design-docs-to-data-scientists-421132561f3c) (Vincent Tatan)
706 | - [Understanding Design Docs Principles (Vincent Tatan)](https://towardsdatascience.com/understanding-design-docs-principles-for-achieving-data-scientists-53e6d5ad6f7e)
707 | - [Design documents for ML models (Olexiy Oryeshko, People.ai)](https://medium.com/people-ai-engineering/design-documents-for-ml-models-bbcd30402ff7)
708 | 
709 | **Examples**  
710 | 
711 | - [The Quickest Analytics to Build Your Instagram Business](https://towardsdatascience.com/the-quickest-analytics-to-build-your-instagram-business-b7b3c5d68056)
712 | - Video of the [Demo Day for the class CS 329S: Machine Learning Systems Design at Stanford (Winter 2022)](https://www.youtube.com/live/AZNTqytOhXk?feature=shared) and [project reports](https://stanford-cs329s.github.io/reports/)
713 | - Stanford, CS 329S - Machine Learning Systems Design: Student project - [Tender Matching People to Recipes](https://stanford-cs329s.github.io/reports/tender-recipe-recommendations/) (Recipe recommendation + Streamlit)
714 | - Stanford, CS 329S - Machine Learning Systems Design: Student project - [ML Production System For Detecting Covid-19 From Coughs](https://stanford-cs329s.github.io/reports/ml-production-system-for-covid-detection/) (Audio + tabular features + GCP)
715 | - Stanford, CS 329S - Machine Learning Systems Design: Student project - [Building a Context Graph Generator](https://stanford-cs329s.github.io/reports/context-graph-generator/) (Graph DB + BERT + Streamlit)
716 | - Stanford, CS 329S - Machine Learning Systems Design: Student project - [An active data valuation system for dashcam data crowdsourcing](https://stanford-cs329s.github.io/reports/dashcam-data-valuation/) (Technical ML for ML focus + AWS)
717 | - Stanford, CS 329S - Machine Learning Systems Design: Student project - [Stylify](https://stanford-cs329s.github.io/reports/stylify/) (GAN + AWS)
718 | - Stanford, CS 329S - Machine Learning Systems Design: Student project - [Fact-Checking Tool for Public Health Claims](https://stanford-cs329s.github.io/reports/Fact-Checking-Tool-for-Public-Health-Claims/) (Streamlit + Docker + GCP)
719 | 
720 | **Books** 
721 | 
722 | - [Designing Machine Learning Systems book (Chip Huyen, O'Reilly 2022)](https://github.com/chiphuyen/dmls-book/blob/main/summary.md#chapter-1-overview-of-machine-learning-systems)
724 | 
725 | **Courses**
726 | 
727 | - [CS 329S: Machine Learning Systems Design](https://stanford-cs329s.github.io/)


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # ML System Design 
 2 | 
 3 | ## Goals
 4 | 
 5 | - Understand the critical role of system design in successful ML projects
 6 | - Learn to bridge the gap between business needs and technical solutions
 7 | - Develop skills to effectively communicate with stakeholders
 8 | 
 9 | ## Learning Outcomes
10 | 
11 | - Differentiate ML System Design from traditional approaches
12 | - Create effective ML design documents
13 | - Set appropriate goals and metrics for ML projects
14 | - Decompose a business request into a set of smaller and better-defined problems
15 | 
16 | ## Outline 
17 | 
18 | ### 1. ML System Design Document
19 | 
20 | <img src="static/image-1.png"  alt="ML System Design Document" width="500">
21 | 
22 | 
23 | ### 2. Implement ML System 
24 | 
25 | <img src="static/image-2.png"  alt="Implement ML System " width="500">
26 | 
27 | ### 3. Reflection on ML System Design
28 | 
29 | <img src="static/image-3.png"  alt="Reflection on ML System Design" width="500">


--------------------------------------------------------------------------------
/static/image-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/static/image-1.png


--------------------------------------------------------------------------------
/static/image-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/static/image-2.png


--------------------------------------------------------------------------------
/static/image-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlrepa/ml-system-design/1688d94d53dd39c26bd56d43de604b43120bb6cb/static/image-3.png


--------------------------------------------------------------------------------