├── .github
    └── CODEOWNERS
├── .gitignore
├── BookOutline.md
├── CONTRIBUTING.md
├── Design_Doc_Examples
    ├── Examples
    │   ├── EN
    │   │   └── Retail_Demand_Forecasting.md
    │   └── README.md
    ├── Mock
    │   ├── EN
    │   │   └── Mock_ML_System_Design_RAG_Chat_With_Doc_Versions
    │   │   │   ├── Mock_ML_System_Design_RAG_Chat_With_Doc_Versions.md
    │   │   │   └── docs
    │   │   │       ├── rag_reliable.png
    │   │   │       ├── rag_reliable_interactive.png
    │   │   │       ├── rag_simple.png
    │   │   │       └── retrieval_baseline.png
    │   └── README.md
    └── README.md
├── README.md
└── templates
    ├── basic_ml_design_doc.md
    └── design_doc_checklist.md


/.github/CODEOWNERS:
--------------------------------------------------------------------------------
1 | # Assign ownership for everything in the repository to 'admin-team'
2 | * @ML-SystemDesign/core-team
3 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | Machine_Learning_System_Design.txt
2 | 


--------------------------------------------------------------------------------
/BookOutline.md:
--------------------------------------------------------------------------------
  1 | # Chapter 1: Essentials of Machine Learning System Design
  2 | 
  3 | ## Introduction
  4 | Machine Learning System Design (MLSD) is a multifaceted discipline that bridges machine learning, software engineering, data engineering, and project management. Its primary goal is to develop ML systems that are not only accurate but also robust, scalable, maintainable, and aligned with business objectives. This chapter lays the groundwork by defining MLSD, highlighting its importance, and outlining the core principles and structure that will be explored throughout this book. It emphasizes the necessity of a holistic, iterative approach, acknowledging that building successful ML systems requires a blend of technical expertise and strategic thinking.
  5 | 
  6 | ## Main Sections
  7 | 
  8 | ### Section 1: Understanding Machine Learning System Design
  9 | - **Definition and scope:**
 10 |     - MLSD involves the end-to-end process of creating production-ready ML systems. This includes problem framing, data acquisition and preparation, model selection and training, system architecture design, deployment, monitoring, and iteration.
 11 |     - The scope extends beyond just the ML model itself to include the entire infrastructure, data pipelines, APIs, and user interfaces that support its operation.
 12 |     - It addresses challenges like ensuring reliability, managing technical debt, handling large-scale data, and adapting to evolving requirements.
 13 | - **Interdisciplinary roles:**
 14 |     - Successful MLSD requires collaboration among various roles:
 15 |         - *Machine Learning Engineers/Scientists:* Focus on model development, experimentation, and algorithm selection.
 16 |         - *Software Engineers:* Build the surrounding infrastructure, APIs, and ensure system robustness and scalability.
 17 |         - *Data Engineers:* Design and manage data pipelines, storage, and quality.
 18 |         - *Product Managers:* Define the problem, requirements, and ensure alignment with business goals.
 19 |         - *DevOps/MLOps Engineers:* Manage deployment, CI/CD pipelines, and system monitoring.
 20 |         - *Domain Experts:* Provide crucial insights into the problem space and data.
 21 | - **The necessity of a comprehensive approach:**
 22 |     - A piecemeal approach focusing solely on model accuracy is insufficient for production systems.
 23 |     - MLSD advocates for a holistic view, considering the entire lifecycle of the ML system, from conception to decommissioning.
 24 |     - This includes upfront planning, rigorous testing at all stages, clear documentation, and proactive monitoring and maintenance.
 25 | 
 26 | ### Section 2: The Importance of Machine Learning System Design
 27 | - **Architectural approach:**
 28 |     - Emphasizes designing a well-defined system architecture that outlines components, their interactions, data flows, and technology choices.
 29 |     - A good architecture promotes modularity, making the system easier to understand, develop, test, and maintain.
 30 |     - It considers trade-offs between different architectural patterns (e.g., microservices, monolithic, batch vs. real-time processing) based on project needs.
 31 | - **Scalability and flexibility:**
 32 |     - ML systems must be designed to handle growing amounts of data, increasing user traffic, and evolving model complexity. Scalability ensures the system can perform efficiently under varying loads.
 33 |     - Flexibility refers to the system's ability to adapt to changes, such as new data sources, updated models, or different business requirements, without requiring a complete overhaul. This often involves designing for easy component replacement and configuration.
 34 | - **Case studies (Illustrative Examples):**
 35 |     - *E.g., Recommendation Systems:* Designing for real-time updates, handling massive user-item interaction data, and A/B testing different recommendation algorithms.
 36 |     - *E.g., Fraud Detection:* Balancing low latency requirements with complex feature engineering and model inference, ensuring high availability and robust monitoring for critical alerts.
 37 |     - These examples will be used throughout the book to illustrate how MLSD principles are applied in practice to solve real-world problems and avoid common pitfalls.
 38 | 
 39 | ### Section 3: The Structure of This Book
 40 | - **Book overview:**
 41 |     - This book guides you through the entire ML system design lifecycle, from initial problem definition and research to deployment, monitoring, and maintenance.
 42 |     - It is divided into parts covering preparations, core design stages, and operational aspects.
 43 |     - Each chapter focuses on a specific component of MLSD, providing practical advice, checklists, and drawing from real-world and illustrative examples.
 44 | - **Essential checklist items:**
 45 |     - Throughout the book, key checklist items will be highlighted to help ensure all critical aspects of design are considered.
 46 |     - These checklists serve as a practical tool during the design process and review stages. (A consolidated checklist is also available in `templates/design_doc_checklist.md`).
 47 | - **Practical guidelines and experiences:**
 48 |     - The content is enriched with practical guidelines derived from industry best practices and common experiences (both successes and failures).
 49 |     - It aims to provide actionable insights rather than purely theoretical discussions.
 50 | 
 51 | ### Section 4: Principles of Machine Learning System Design
 52 | - **Critical principles for complex systems:**
 53 |     - *Start Simple (Iterative Development):* Begin with a basic, functional system and incrementally add complexity. Avoid premature optimization.
 54 |     - *Modularity and Abstraction:* Break down the system into loosely coupled, well-defined components with clear interfaces.
 55 |     - *Automation:* Automate repetitive tasks like data preprocessing, model training, testing, and deployment (MLOps practices).
 56 |     - *Reproducibility:* Ensure that experiments and results can be consistently reproduced. This involves versioning data, code, and configurations.
 57 |     - *Testability:* Design the system for comprehensive testing at all levels (unit, integration, end-to-end, and A/B testing).
 58 | - **System improvement vs. maintenance:**
 59 |     - MLSD covers both the initial creation of a system and its ongoing evolution.
 60 |     - Improvement involves enhancing performance, adding new features, or incorporating new ML techniques.
 61 |     - Maintenance involves bug fixing, addressing model drift, updating dependencies, and ensuring the system continues to operate reliably and efficiently. A good design facilitates both.
 62 | - **The role of design documents:**
 63 |     - Design documents are crucial artifacts in MLSD. They serve as a blueprint for the system, a communication tool for stakeholders, and a record of design decisions.
 64 |     - They help clarify requirements, identify potential risks early, facilitate reviews, and ensure alignment across the team. Chapter 4 will delve deeper into creating effective design documents.
 65 | 
 66 | ## Conclusion
 67 | - MLSD requires a diverse skill set, blending ML expertise with software engineering rigor, data management capabilities, and a strong understanding of the business domain. A holistic approach that considers all these facets is paramount.
 68 | - Implementing structured design processes, such as those outlined in this book, is vital for creating ML systems that are not only powerful but also scalable, maintainable, reliable, and ultimately successful in delivering value.
 69 | - Adhering to the core principles of MLSD, from starting simple to ensuring reproducibility and comprehensive testing, significantly increases the likelihood of project success and mitigates common risks associated with complex ML initiatives.
 70 | 
 71 | # Сhapter 2: Is there a problem?
 72 | 
 73 | ### Introduction
 74 | 
 75 | This chapter emphasizes the critical importance of accurately identifying and articulating the problem before diving into machine learning system design. It argues that a deep understanding of the problem space is essential for the successful development of ML systems. The chapter aims to guide readers through the process of problem identification, exploring the problem space versus the solution space, and understanding the implications of risks, limitations, and the costs of mistakes in ML projects. A failure to thoroughly investigate the problem often leads to solutions that are misaligned with actual needs, wasting resources and effort.
 76 | 
 77 | ### Main Sections
 78 | 
 79 | ### Section 1: Problem Space vs. Solution Space
 80 | 
 81 | - **Key Concepts:**
 82 |     - **The distinction between problem space and solution space:** 
 83 |         - *Problem Space:* Encompasses the needs, goals, context, and constraints of the users or business. It's about understanding the "what" and "why" of the problem itself, independent of any specific solution. For example, "users are spending too much time finding relevant documents."
 84 |         - *Solution Space:* Consists of specific products, features, technologies, or methodologies proposed to address the problem. It's about the "how." For example, "implement a semantic search engine" or "develop a better document categorization system."
 85 |     - **The importance of focusing on the problem space before considering the solution:**
 86 |         - Jumping to solutions prematurely can lead to building the wrong thing or addressing symptoms rather than root causes.
 87 |         - A deep dive into the problem space ensures that any proposed ML system is genuinely addressing a validated need and has a clear purpose.
 88 |         - It helps in defining clear success criteria that are tied to solving the actual problem, not just technical metrics of a chosen solution.
 89 |     - **Techniques like "Five Whys" for deep exploration of the problem space:**
 90 |         - *Five Whys:* An iterative interrogative technique used to explore the cause-and-effect relationships underlying a particular problem. By repeatedly asking "Why?" (five is a rule of thumb), one can peel away layers of symptoms to get to the root cause.
 91 |         - *User Interviews & Observation:* Directly engaging with end-users to understand their pain points, workflows, and unmet needs.
 92 |         - *Data Analysis:* Examining existing data to identify patterns, anomalies, or trends that highlight the problem's impact and characteristics.
 93 | 
 94 | ### Section 2: Finding the Problem
 95 | 
 96 | - **Key Concepts:**
 97 |     - **Strategies for defining and understanding the problem:**
 98 |         - *Problem Statement:* A concise description of the issue that needs to be addressed. It should be clear, specific, and focus on the user or business impact.
 99 |         - *Contextual Inquiry:* Observing users in their natural environment to gain a deeper understanding of their tasks and challenges.
100 |         - *Stakeholder Workshops:* Bringing together diverse stakeholders (users, business owners, technical teams) to collaboratively define and prioritize problems.
101 |     - **The role of engineers in problem space analysis alongside product managers:**
102 |         - While product managers often lead problem definition, engineers (including ML engineers) bring a crucial technical perspective.
103 |         - Engineers can assess the feasibility of potential technical directions early on, identify data requirements or limitations, and contribute to understanding the nuances of how a problem might be framed for an ML approach.
104 |         - Collaborative exploration prevents a disconnect between business needs and technical realities.
105 |     - **The inverted pyramid scheme for problem statement articulation:**
106 |         - Start with the most critical information: a clear, concise summary of the problem and its impact.
107 |         - Follow with important details: supporting facts, evidence, scope, and constraints.
108 |         - Conclude with background information and less critical details.
109 |         - This structure ensures that anyone reading the problem statement can quickly grasp the essentials.
110 | 
111 | ### Section 3: Approximating a Solution through an ML System
112 | 
113 | - **Key Concepts:**
114 |     - **Reframing business problems into software/ML problems:**
115 |         - Business problems are often broad (e.g., "increase customer retention"). ML problems need to be more specific and framed in terms of prediction, classification, generation, etc. (e.g., "predict which customers are likely to churn next month").
116 |         - This involves identifying where an ML model's predictive power can provide a tangible benefit towards solving the broader business problem.
117 |         - It may involve breaking down a large business problem into smaller, solvable ML tasks.
118 |     - **The heuristic of approximating the behavior of a "magic oracle" or expert:**
119 |         - Imagine you had a perfect, all-knowing oracle or a human expert who could instantly provide the desired output given some input. What question would you ask them? What information would they need?
120 |         - This thought experiment helps define the ideal output of the ML system and the necessary inputs (features).
121 |         - It guides the definition of the target variable and the scope of the prediction task.
122 |     - **The trade-offs between robustness and correctness in ML system design:**
123 |         - *Correctness:* How accurate the model's predictions are (e.g., precision, recall, accuracy).
124 |         - *Robustness:* How well the system performs under imperfect conditions, such as noisy data, unexpected inputs, or concept drift. This also includes reliability and availability.
125 |         - Often, achieving perfect correctness is impossible or impractical. A robust system that performs reasonably well across a wide range of real-world scenarios might be more valuable than a brittle system that is highly accurate only under ideal conditions.
126 |         - The acceptable trade-off depends on the specific application and the cost of errors.
127 | 
128 | ### Section 4: Risks, Limitations, and Possible Consequences
129 | 
130 | - **Key Concepts:**
131 |     - **Identifying potential risks and limitations early in the design process:**
132 |         - *Data Risks:* Availability, quality, quantity, bias, privacy concerns, cost of acquisition.
133 |         - *Model Risks:* Difficulty in achieving desired performance, interpretability issues, vulnerability to adversarial attacks, potential for unfair bias.
134 |         - *Technical Risks:* Integration challenges, scalability issues, infrastructure limitations, dependency on specific technologies.
135 |         - *Operational Risks:* Monitoring challenges, maintenance overhead, cost of operation, skill gaps in the team.
136 |         - *Ethical Risks:* Unintended societal impact, discriminatory outcomes, lack of transparency.
137 |     - **The impact of non-functional requirements on system design:**
138 |         - Non-functional requirements (NFRs) define the quality attributes of a system, such as performance (latency, throughput), scalability, reliability, security, maintainability, and cost-effectiveness.
139 |         - NFRs heavily influence architectural choices, technology selection, and the overall complexity of the system.
140 |         - For example, a requirement for low-latency predictions will guide model choice and deployment strategy differently than a batch processing system.
141 |     - **Real-world examples illustrating the importance of considering risks and limitations:**
142 |         - *E.g., a healthcare diagnostic tool with biased training data leading to poorer performance for certain demographic groups.*
143 |         - *E.g., a financial fraud detection system that is too slow to prevent fraudulent transactions in real-time.*
144 |         - *E.g., a content recommendation system that creates filter bubbles or promotes harmful content due to unforeseen feedback loops.*
145 | 
146 | ### Section 5: Costs of a Mistake
147 | 
148 | - **Key Concepts:**
149 |     - **Evaluating the potential costs associated with errors in ML systems:**
150 |         - *Financial Costs:* Lost revenue, operational inefficiencies, fines for non-compliance, cost of remediation.
151 |         - *Reputational Costs:* Damage to brand image, loss of customer trust.
152 |         - *Societal Costs:* Unfair treatment, discrimination, erosion of privacy, safety risks (e.g., autonomous vehicles).
153 |         - *User Impact:* Poor user experience, frustration, incorrect decisions based on flawed outputs.
154 |     - **The significance of understanding both direct and second-order consequences of mistakes:**
155 |         - *Direct Consequences:* The immediate impact of an error (e.g., a misclassification leading to a denied loan application).
156 |         - *Second-Order Consequences:* The ripple effects or longer-term impacts (e.g., the denied loan leading to financial hardship, or systemic bias in lending decisions affecting community development).
157 |         - Considering these broader impacts is crucial for responsible ML system design.
158 |     - **Strategies for assessing and mitigating risks in ML projects:**
159 |         - *Risk Assessment Matrix:* Identifying potential risks, their likelihood, and their potential impact to prioritize mitigation efforts.
160 |         - *Failure Mode and Effects Analysis (FMEA):* A systematic process to identify potential failures in a system and their consequences.
161 |         - *Red Teaming / Adversarial Testing:* Proactively trying to break the system or find vulnerabilities.
162 |         - *Human-in-the-Loop Systems:* Incorporating human oversight and intervention capabilities for critical decisions or when the model's confidence is low.
163 |         - *Regular Audits and Monitoring:* Continuously tracking performance, data quality, and potential biases post-deployment.
164 | 
165 | ### Conclusion
166 | 
167 | - **Key Takeaways:**
168 |     1. A thorough understanding of the problem space is foundational to effective ML system design, ensuring that solutions are relevant and targeted. Rushing into solutions without this clarity leads to wasted effort and systems that fail to deliver real value.
169 |     2. Engaging deeply with the problem, through techniques like the "Five Whys," user research, and the inverted pyramid scheme for articulation, enables designers to uncover essential insights and requirements for the ML system. Collaboration between product, engineering, and domain experts is key.
170 |     3. Considering the potential risks, limitations, and costs of mistakes (both direct and second-order) early in the design process is crucial for developing robust, effective, and safe ML systems. This includes evaluating data, model, technical, operational, and ethical risks.
171 |     4. ML system designers must balance the trade-offs between robustness and correctness, tailoring their approach to the specific context and requirements of the project, and always considering the non-functional requirements that shape the system's architecture and behavior.
172 | 
173 | This outline provides a comprehensive framework for understanding and applying the principles discussed in the chapter, guiding readers through the critical early stages of machine learning system design.
174 | # Chapter 3: Priliminary research
175 | 
176 | ### Introduction
177 | 
178 | This chapter transitions from identifying the problem to exploring the solution space for machine learning system design. It emphasizes the importance of preliminary research in understanding available solutions, deciding whether to build or buy, decomposing the problem, and determining the appropriate level of innovation. The goal is to lay the groundwork for creating a comprehensive design document by examining use cases, addressing the build or buy dilemma, decomposing problems, and choosing the right degree of innovation.
179 | 
180 | ### Main Sections
181 | 
182 | ### Section 1: What Problems Can Inspire You?
183 | 
184 | - **Key Concepts:**
185 |     - The value of learning from existing solutions in various domains.
186 |     - How to draw inspiration from related problems and their solutions.
187 |     - The importance of considering both domain-specific and technical aspects of similar systems.
188 | 
189 | ### Section 2: Build or Buy, Open Source-Based or Proprietary Tech
190 | 
191 | - **Key Concepts:**
192 |     - Evaluating the decision to develop a solution in-house or to purchase a ready-made solution.
193 |     - The trade-offs between using open-source technologies and proprietary solutions.
194 |     - Factors influencing the build or buy decision, including core business relevance, economic considerations, and scalability needs.
195 | 
196 | ### Section 3: Problem Decompositioning
197 | 
198 | - **Key Concepts:**
199 |     - The "divide and conquer" approach to simplifying complex problems.
200 |     - Examples of problem decomposition in machine learning and software engineering.
201 |     - Reasons for decomposition, including computational complexity, algorithm imperfection, and data fusion needs.
202 | 
203 | ### Section 4: Choosing the Right Degree of Innovation
204 | 
205 | - **Key Concepts:**
206 |     - Defining the required level of innovation: minimum viable ML system, average human-level ML system, and best-in-class ML system.
207 |     - Balancing innovation with practical constraints such as time, budget, and existing capabilities.
208 |     - The dynamic nature of innovation requirements as projects evolve from prototypes to mature systems.
209 | 
210 | ### Conclusion
211 | 
212 | - **Key Takeaways:**
213 |     1. Preliminary research is crucial for navigating the solution space effectively, enabling designers to make informed decisions about building or buying, leveraging existing solutions, and setting innovation goals.
214 |     2. The decision to build or buy should be carefully considered, taking into account the core business impact, cost implications, and the strategic value of in-house development versus leveraging third-party solutions.
215 |     3. Problem decomposition is a powerful strategy for managing complexity in machine learning system design, allowing for more manageable sub-problems that can be addressed with targeted solutions.
216 |     4. The degree of innovation required for a machine learning system is context-dependent, influenced by business goals, competitive landscape, and resource availability. Designers must balance the desire for cutting-edge solutions with the practicalities of their specific situation.
217 | 
218 | This outline provides a roadmap for the preliminary research phase of machine learning system design, highlighting the importance of a strategic approach to problem-solving and innovation.
219 | # Chapter 4: Design document
220 | 
221 | ### Introduction
222 | 
223 | This chapter emphasizes the critical role of a design document in the machine learning system design process. It outlines the steps for drafting, reviewing, and evolving a design document, highlighting its importance in clarifying project goals, identifying potential issues, and ensuring stakeholder alignment. The chapter underscores that a well-crafted design document can often lead to the realization that a complex ML project may not be necessary, saving significant time and resources.
224 | 
225 | ### Main Sections
226 | 
227 | ### Section 1: Goals and Antigoals
228 | 
229 | - **Key Concepts:**
230 |     - Importance of clearly defining what the project aims to achieve and deliberately stating what it does not aim to solve (antigoals).
231 |     - Utilizing antigoals to focus the project scope and avoid unnecessary work.
232 |     - Examples of how setting improper goals can mislead the project direction.
233 | 
234 | ### Section 2: Design Document Structure
235 | 
236 | - **Key Concepts:**
237 |     - There is no universal structure for a design document, but it must cover essential aspects such as problem definition, relevance, previous work, and risks.
238 |     - The structure should facilitate easy navigation and understanding for stakeholders from various backgrounds.
239 |     - Introduction of two fictional case studies to illustrate the practical application of design document principles.
240 | 
241 | ### Section 3: Reviewing a Design Document
242 | 
243 | - **Key Concepts:**
244 |     - The iterative nature of design document development, emphasizing the importance of peer feedback.
245 |     - Strategies for effective review, including focusing on areas of expertise, questioning assumptions, and suggesting improvements.
246 |     - The role of the reviewer in enriching the document through constructive criticism and alternative solutions.
247 | 
248 | ### Section 4: The Evolution of Design Docs
249 | 
250 | - **Key Concepts:**
251 |     - Acknowledgment that a design document is a "living" artifact that evolves over time based on new insights, real-world feedback, and changing project requirements.
252 |     - The continuous cycle of iteration and improvement that a design document undergoes, even post-implementation.
253 |     - The necessity of keeping the design document updated to reflect the current understanding and state of the system for future maintenance and scalability.
254 | 
255 | ### Conclusion
256 | 
257 | - **Key Takeaways:**
258 |     1. The process of creating and refining a design document is as critical as the technical development of the ML system itself, serving as a blueprint for the project.
259 |     2. Antigoals are as important as goals in defining the scope and focus of the project, helping to avoid effort on unnecessary or low-impact areas.
260 |     3. Peer review of the design document is essential for uncovering blind spots, validating assumptions, and ensuring the project's technical and business viability.
261 |     4. A design document is never truly final; it must evolve with the project, reflecting changes, learnings, and improvements to remain a relevant and useful guide.
262 | 
263 | This outline provides a comprehensive overview of the process and importance of creating, reviewing, and updating a design document in the context of machine learning system design, highlighting the iterative and collaborative nature of this foundational step.
264 | # Chapter 5: Loss Functions and Metrics
265 | 
266 | 
267 | ### Introduction
268 | 
269 | This chapter delves into the critical aspects of selecting appropriate metrics and loss functions for machine learning systems. It emphasizes the distinction between metrics used for model evaluation and loss functions optimized during training. The chapter aims to guide the reader through the process of choosing metrics and loss functions that align with the system's objectives, ensuring the model's performance is accurately measured and optimized.
270 | 
271 | ### Main Sections
272 | 
273 | ### Section 1: Losses
274 | 
275 | - **Key Concepts:**
276 |     - Importance of choosing the right loss function for model training.
277 |     - Criteria for a function to be considered as a loss function: global continuity and differentiability.
278 |     - Impact of loss function choice on model performance and behavior.
279 |     - Examples illustrating how different loss functions (MSE vs. MAE) can lead to different model outcomes.
280 | 
281 | ### Section 2: Metrics
282 | 
283 | - **Key Concepts:**
284 |     - Distinction between loss functions and metrics.
285 |     - Role of metrics in evaluating model performance.
286 |     - The necessity of aligning metrics with the system's final goals.
287 |     - Discussion on offline and online metrics, including proxy metrics and the hierarchy of metrics.
288 |     - Importance of consistency metrics in ensuring model stability over variations in input or model retraining.
289 | 
290 | ### Section 3: Design Document: Adding Losses and Metrics
291 | 
292 | - **Key Concepts:**
293 |     - Application of the discussed principles to two fictional cases: Supermegaretail and PhotoStock Inc.
294 |     - Detailed walkthrough on selecting metrics and loss functions for each case, considering their specific business objectives and challenges.
295 |     - Emphasis on the iterative process of refining the choice of metrics and loss functions as part of the system design document.
296 | 
297 | ### Conclusion
298 | 
299 | - **Key Takeaways:**
300 |     1. The choice of loss functions and metrics is pivotal in guiding a machine learning system towards achieving its intended objectives. These choices directly influence the model's learning focus and evaluation criteria.
301 |     2. While every loss function can serve as a metric, not all metrics are suitable as loss functions due to requirements for continuity and differentiability.
302 |     3. Consistency metrics play a crucial role in practical applications, ensuring that models remain stable and reliable across varying inputs and retraining cycles.
303 |     4. The development of a machine learning system should include a thoughtful selection of both offline and online metrics, with a clear understanding of how these metrics relate to the system's ultimate goals. Proxy metrics and a well-defined hierarchy of metrics can facilitate this alignment, enabling more effective system evaluation and optimization.
304 | 
305 | This structured outline encapsulates the essence of Chapter 5, highlighting the importance of carefully selecting metrics and loss functions in the design and evaluation of machine learning systems.
306 | # Chapter 6: Gathering Datasets
307 | 
308 | ### Introduction
309 | 
310 | This chapter emphasizes the foundational role of datasets in the development and operation of machine learning systems. It draws parallels between essential life elements and datasets, highlighting the necessity of quality data for the functionality of ML systems. The chapter aims to guide through the process of identifying, processing, and utilizing data sources to construct effective datasets, underlining the principle that the quality of input data directly influences the output of ML systems.
311 | 
312 | ### Main Sections
313 | 
314 | ### Section 1: Data Sources
315 | 
316 | - **Key Concepts:**
317 |     - Variety of data sources ranging from global activities and physical processes to local business operations and artificially generated datasets.
318 |     - Importance of selecting data sources based on the ML system's goals.
319 |     - Challenges and strategies in accessing unique or proprietary data versus publicly available datasets.
320 | 
321 | ### Section 2: Cooking the Dataset
322 | 
323 | - **Key Concepts:**
324 |     - The necessity of transforming raw data into a structured and usable format for ML models.
325 |     - Overview of techniques like ETL (Extract, Transform, Load), filtering, feature engineering, and labeling.
326 |     - The balance between automated processing and manual intervention for data preparation.
327 | 
328 | ### Section 3: Data and Metadata
329 | 
330 | - **Key Concepts:**
331 |     - Distinction between data (directly used for ML modeling) and metadata (descriptive information about data).
332 |     - Role of metadata in ensuring data consistency, aiding in data management, and supporting system functionality.
333 | 
334 | ### Section 4: How Much is Enough?
335 | 
336 | - **Key Concepts:**
337 |     - Discussion on determining the adequate size and quality of datasets for training ML models.
338 |     - Considerations on data diversity, representativeness, and the diminishing returns of adding more data.
339 | 
340 | ### Section 5: Solving the Cold Start Problem
341 | 
342 | - **Key Concepts:**
343 |     - Strategies for overcoming the lack of initial data for training ML systems, including synthetic data generation and leveraging similar datasets.
344 |     - The importance of approximation and proxy datasets in the early stages of ML system development.
345 | 
346 | ### Section 6: Properties of a Healthy Data Pipeline
347 | 
348 | - **Key Concepts:**
349 |     - Essential properties of a data pipeline: reproducibility, consistency, and availability.
350 |     - Importance of data management practices that ensure the reliability and accessibility of data for both the ML system and its developers.
351 | 
352 | ### Conclusion
353 | 
354 | - **Key Takeaways:**
355 |     1. The success of an ML system is heavily dependent on the quality and relevance of its underlying datasets. Identifying and utilizing the right data sources is crucial for system effectiveness.
356 |     2. Data preparation is a critical step in ML system development, requiring a thoughtful balance between automation and manual oversight to ensure data quality and relevance.
357 |     3. Metadata plays a vital role in maintaining data consistency and supporting effective data management throughout the lifecycle of an ML system.
358 |     4. Addressing the cold start problem requires creative approaches to data acquisition and utilization, emphasizing the need for flexibility and innovation in early system development stages.
359 |     5. Establishing a healthy data pipeline is foundational to the long-term success and scalability of ML systems, underscoring the importance of reproducibility, consistency, and availability in data management practices.
360 | 
361 | This outline captures the essence of Chapter 6, providing a structured overview of the critical aspects of gathering and preparing datasets for machine learning systems.
362 | # Chapter 7: Validation Schemas
363 | 
364 | 
365 | ### Introduction
366 | 
367 | This chapter delves into the critical aspect of building a robust evaluation process for machine learning systems through proper validation schemas. It discusses the importance of selecting the right validation schema based on the specifics of a given problem and the factors to consider when designing the evaluation process. The goal is to achieve confident estimates of system performance, ensuring the model's predictive power on unseen data is accurately measured.
368 | 
369 | ### Main Sections
370 | 
371 | ### Section 1: Reliable Evaluation
372 | 
373 | - **Key Concepts:**
374 |     - Importance of a stable and reliable evaluation pipeline.
375 |     - Challenges with simple train-validation-test splits and the assumption of data distribution consistency.
376 |     - The necessity of repeatable use of validation sets and the risks of overfitting towards these sets.
377 | 
378 | ### Section 2: Standard Schemas
379 | 
380 | - **Key Concepts:**
381 |     - Overview of time-tested validation schemas like holdout sets and cross-validation.
382 |     - Discussion on the choice of K in K-fold cross-validation and its impact on bias, variance, and computation time.
383 |     - Special considerations for time-series validation, including window size, training size, seasonality, and gap.
384 | 
385 | ### Section 3: Non-trivial Schemas
386 | 
387 | - **Key Concepts:**
388 |     - Introduction to nested validation for hyperparameter optimization within the learning process.
389 |     - Adversarial validation for estimating dataset differences and constructing representative datasets.
390 |     - Quantifying dataset leakage exploitation with specific measures to minimize data leakage.
391 | 
392 | ### Section 4: Split Updating Procedure
393 | 
394 | - **Key Concepts:**
395 |     - Strategies for updating validation splits in dynamically changing datasets.
396 |     - Fixed shift, fixed ratio, and fixed set as common options for split updating.
397 |     - The importance of maintaining robust and adaptive evaluation processes.
398 | 
399 | ### Section 5: Design Document: Choosing Validation Schemas
400 | 
401 | - **Key Concepts:**
402 |     - Detailed examples of validation schema choices for two hypothetical companies, Supermegaretail and Photostock Inc.
403 |     - Considerations for ensuring validation and test sets are representative, diverse, and free from data leakage.
404 |     - The use of deterministic bucketing for user split assignments and the potential for future adjustments.
405 | 
406 | ### Conclusion
407 | 
408 | - **Key Takeaways:**
409 |     1. The selection of a validation schema is crucial for accurately measuring a model's performance on unseen data, requiring careful consideration of the specific characteristics of the dataset and the problem at hand.
410 |     2. Standard validation schemas provide a solid foundation for most machine learning applications, but non-trivial schemas may be necessary to address unique challenges or specific data characteristics.
411 |     3. Updating validation splits in response to new data or changing distributions is essential for maintaining the relevance and accuracy of performance estimates.
412 |     4. Detailed planning and documentation of the chosen validation schemas within the design document are vital for ensuring the evaluation process is aligned with the project's goals and constraints.
413 | 
414 | This outline captures the essence of Chapter 7, providing a structured overview of the considerations and methodologies involved in selecting and implementing validation schemas for machine learning systems.
415 | # Chapter 8: Baseline Solution
416 | 
417 | ### Introduction
418 | 
419 | This chapter emphasizes the importance of establishing a baseline solution in machine learning system design, likening it to the MVP (Minimum Viable Product) in product development. The baseline serves as the simplest but operational version of a model, setting a foundational performance metric from which improvements can be iteratively made. It underscores the principle that a mediocre model in production is more valuable than a sophisticated model that never leaves the drawing board.
420 | 
421 | ### Main Sections
422 | 
423 | ### Section 1: Baseline: What Are You?
424 | 
425 | - **Key Concepts:**
426 |     - Definition and purpose of a baseline in machine learning.
427 |     - Baselines as risk reducers, early feedback providers, and early value deliverers.
428 |     - The baseline as a placeholder, a comparative measure, and a fallback option.
429 | 
430 | ### Section 2: Constant Baselines
431 | 
432 | - **Key Concepts:**
433 |     - The simplest form of baselines, approximating solutions without dependency on input variables.
434 |     - Use cases for constant baselines, including benchmarking and providing fallback answers.
435 |     - Examples include average or median predictions for regression tasks and major class predictions for classification tasks.
436 | 
437 | ### Section 3: Model Baselines and Feature Baselines
438 | 
439 | - **Key Concepts:**
440 |     - Progression from constant baselines to more complex models like rule-based models and linear models.
441 |     - The importance of starting with a minimal set of features and gradually adding complexity based on the accuracy-effort trade-off.
442 |     - The role of feature engineering and the selection of baseline features.
443 | 
444 | ### Section 4: Variety of Deep Learning Baselines
445 | 
446 | - **Key Concepts:**
447 |     - Strategies for establishing baselines in deep learning, including reusing pretrained models and training simple models from scratch.
448 |     - The benefits of transfer learning and fine-tuning pretrained models for specific tasks.
449 |     - Considerations for choosing between reusing features, applying zero-shot or few-shot learning, and training simpler models.
450 | 
451 | ### Section 5: Baseline Comparison
452 | 
453 | - **Key Concepts:**
454 |     - The trade-off between model accuracy and the effort required for development.
455 |     - Factors to consider when choosing a baseline, including accuracy, development time, interpretability, and computation time.
456 |     - The diminishing returns of increasing model complexity and the importance of early stopping based on the accuracy-effort trade-off.
457 | 
458 | ### Conclusion
459 | 
460 | - **Key Takeaways:**
461 |     1. Establishing a baseline is a critical first step in machine learning system design, serving as a simple, operational starting point for iterative improvement.
462 |     2. The choice of baseline should be guided by a trade-off between desired accuracy and the effort required for development, with simplicity often providing significant advantages in terms of robustness, scalability, and interpretability.
463 |     3. In deep learning applications, leveraging pretrained models or training simple models from scratch can provide effective baselines, with the choice influenced by the specific requirements and constraints of the project.
464 |     4. Continuous evaluation and comparison against the baseline are essential for guiding the development process, ensuring that complexity is added only when it yields proportional benefits in performance.
465 | 
466 | This structured outline captures the essence of Chapter 8, highlighting the strategic role of baselines in the development of machine learning systems and providing a roadmap for their selection and implementation.
467 | # Chapter 9: Error Analysis
468 | 
469 | 
470 | ### Introduction
471 | 
472 | Error analysis acts as a crucial compass in the iterative improvement of machine learning systems, offering insights into error dynamics and patterns post-prediction. This chapter delves into learning curve analysis, residual analysis, and the identification of commonalities in errors, guiding the enhancement of system performance through detailed examination of where and how models falter.
473 | 
474 | ### Main Sections
475 | 
476 | ### Section 1: Learning Curve Analysis
477 | 
478 | - **Key Concepts:**
479 |     - Introduction to learning curves and their significance in evaluating the learning process.
480 |     - Examination of model convergence and the balance between underfitting and overfitting.
481 |     - Exploration of learning curves based on the number of iterations, model complexity, and dataset size.
482 | 
483 | ### Section 2: Overfitting and Underfitting
484 | 
485 | - **Key Concepts:**
486 |     - Definitions and implications of overfitting and underfitting.
487 |     - The bias-variance trade-off and its impact on model performance.
488 |     - Strategies for mitigating overfitting and underfitting.
489 | 
490 | ### Section 3: Residual Analysis
491 | 
492 | - **Key Concepts:**
493 |     - Calculation and interpretation of residuals to assess model predictions.
494 |     - Goals of residual analysis, including verification of model assumptions and detection of error sources.
495 |     - Examination of residual distribution, fairness, and specific patterns like underprediction and overprediction.
496 | 
497 | ### Section 4: Finding Commonalities in Residuals
498 | 
499 | - **Key Concepts:**
500 |     - Techniques for grouping and analyzing residuals to uncover patterns.
501 |     - Worst/best-case analysis for identifying model strengths and weaknesses.
502 |     - Adversarial validation as a method for distinguishing between "good" and "bad" samples.
503 |     - Group analysis and corner-case analysis for detailed error pattern identification.
504 | 
505 | ### Conclusion
506 | 
507 | - **Key Takeaways:**
508 |     1. Error analysis is an indispensable step in refining machine learning systems, providing a deeper understanding of model errors and guiding targeted improvements.
509 |     2. Learning curve analysis offers early insights into model adequacy, highlighting issues of convergence, overfitting, and underfitting that need addressing.
510 |     3. Residual analysis serves as a powerful tool for verifying model assumptions and identifying biases, enabling the detection of specific error patterns and guiding the development of more robust models.
511 |     4. Identifying commonalities in residuals through various analytical approaches, including adversarial validation and group analysis, helps pinpoint specific areas for improvement, ensuring the model performs well across diverse scenarios and datasets.
512 | 
513 | This structured outline encapsulates the essence of Chapter 9, underscoring the critical role of error analysis in the iterative process of machine learning system development and optimization.
514 | # Chapter 10: Training Pipelines
515 | 
516 | 
517 | ### Introduction
518 | 
519 | This chapter delves into the essence of training pipelines in machine learning projects, emphasizing their critical role beyond mere model training. It explores the structured sequence of steps necessary to prepare, train, evaluate, and deploy machine learning models efficiently and reproducibly. The chapter aims to transition the reader from a model-centric to a pipeline-centric view of machine learning, highlighting the importance of reproducibility, scalability, and configurability in the ML lifecycle.
520 | 
521 | ### Main Sections
522 | 
523 | ### Section 1: Understanding Training Pipelines
524 | 
525 | - **Key Concepts:**
526 |     - Definition and importance of training pipelines in ML projects.
527 |     - Distinction between training and inference pipelines.
528 |     - Overview of typical steps in a training pipeline, including data fetching, preprocessing, model training, evaluation, postprocessing, report generation, and artifact packaging.
529 | 
530 | ### Section 2: Tools and Platforms for Training Pipelines
531 | 
532 | - **Key Concepts:**
533 |     - Introduction to various tools and platforms that facilitate the creation and maintenance of training pipelines, without endorsing any specific technology.
534 |     - Discussion on the evolving landscape of MLOps tools and the criteria for selecting appropriate tools based on project needs and infrastructure.
535 | 
536 | ### Section 3: Scalability of Training Pipelines
537 | 
538 | - **Key Concepts:**
539 |     - Strategies for scaling training pipelines to handle large datasets, including vertical and horizontal scaling.
540 |     - Considerations for choosing between scaling strategies based on dataset size, computational resources, and future growth.
541 | 
542 | ### Section 4: Configurability of Training Pipelines
543 | 
544 | - **Key Concepts:**
545 |     - The balance between under-configuration and over-configuration in training pipelines.
546 |     - Guidelines for making pipelines flexible yet manageable, focusing on parameters likely to change and adopting a pragmatic approach to hyperparameter tuning.
547 | 
548 | ### Section 5: Testing Training Pipelines
549 | 
550 | - **Key Concepts:**
551 |     - The challenge and necessity of testing ML pipelines to ensure reliability and performance.
552 |     - Recommendations for combining high-level smoke tests with low-level unit tests to cover both pipeline integrity and component functionality.
553 |     - Introduction to property-based testing for validating model properties such as consistency, monotonicity, and robustness.
554 | 
555 | ### Conclusion
556 | 
557 | - **Key Takeaways:**
558 |     1. Training pipelines are foundational to the success of ML projects, ensuring that models are not only trained but also prepared, evaluated, and deployed in a consistent and reproducible manner.
559 |     2. The choice of tools and platforms for training pipelines should be guided by the project's scale, infrastructure, and specific needs, with an emphasis on flexibility and future growth.
560 |     3. Scalability and configurability are critical attributes of effective training pipelines, enabling them to handle large datasets and adapt to changing requirements without excessive complexity.
561 |     4. Comprehensive testing, including both smoke tests and unit tests, is essential for maintaining the integrity of training pipelines and ensuring the reliability of deployed models.
562 | 
563 | This structured outline encapsulates the core themes of Chapter 10, providing a roadmap for designing and implementing robust training pipelines that are scalable, configurable, and thoroughly tested.
564 | # Chapter 11: Features and Feature Engineering
565 | 
566 | 
567 | ### Introduction
568 | 
569 | This chapter emphasizes the pivotal role of features in machine learning systems, asserting that even a mediocre model can excel with well-engineered features. It explores the iterative process of feature engineering, the analysis of feature importance, the selection of optimal features, and the advantages and disadvantages of feature stores. The chapter aims to guide readers through enhancing model performance and interpretability by meticulously crafting and selecting features.
570 | 
571 | ### Main Sections
572 | 
573 | ### Section 1: The Essence of Feature Engineering
574 | 
575 | - **Key Concepts:**
576 |     - Definition and importance of feature engineering in ML.
577 |     - Iterative process involving creativity, domain expertise, and data engineering.
578 |     - The role of features in deep learning and traditional ML models.
579 |     - Criteria for good versus bad features, including model performance, data availability, and trade-offs between feature quantity and quality.
580 | 
581 | ### Section 2: Feature Generation 101
582 | 
583 | - **Key Concepts:**
584 |     - Strategies for generating new features, including adding new data sources and transforming existing features.
585 |     - Techniques for transforming numeric, categorical, and sequential data.
586 |     - Importance of considering feature interactions and model predictions as features.
587 | 
588 | ### Section 3: Feature Importance Analysis
589 | 
590 | - **Key Concepts:**
591 |     - Methods for determining the impact of features on model predictions.
592 |     - Distinction between interpretability and explainability.
593 |     - Classification of methods into model-specific vs. model-agnostic and individual prediction vs. entire model interpretation.
594 | 
595 | ### Section 4: Feature Selection
596 | 
597 | - **Key Concepts:**
598 |     - The necessity of feature selection for improving model performance and interpretability.
599 |     - Overview of feature selection methods, including filter, wrapper, and embedded methods.
600 |     - The balance between retaining valuable signals and managing feature complexity.
601 | 
602 | ### Section 5: Feature Store
603 | 
604 | - **Key Concepts:**
605 |     - Definition and benefits of a feature store in centralizing feature management.
606 |     - Discussion on the pros and cons of implementing a feature store.
607 |     - Desired properties of a feature store, including read-write skew, pre-calculation, feature versioning, dependencies, and feature catalog.
608 | 
609 | ### Conclusion
610 | 
611 | - **Key Takeaways:**
612 |     1. Effective feature engineering is crucial for enhancing the performance and interpretability of machine learning models. It involves not only the generation of new features but also the careful selection and analysis of these features to ensure they contribute positively to the model's predictions.
613 |     2. The process of feature engineering requires a balance between creativity, domain knowledge, and technical skills. It's an iterative process that often involves collaboration across teams to identify and implement the most impactful features.
614 |     3. Feature importance analysis is essential for understanding the contribution of each feature to the model's predictions. This analysis aids in model transparency and can guide further feature engineering efforts.
615 |     4. Implementing a feature store can offer significant benefits in terms of feature management, reusability, and collaboration across teams. However, it requires careful consideration of the specific needs and infrastructure of the organization.
616 |     5. Ultimately, the goal of feature engineering is to create a set of features that are not only predictive but also interpretable, manageable, and aligned with the business objectives of the machine learning system.
617 | 
618 | This structured outline encapsulates the core themes of Chapter 11, providing a comprehensive guide to the critical role of features in the development and optimization of machine learning models.
619 | # Chapter 12: Measuring and Reporting Results
620 | 
621 | ### Introduction
622 | 
623 | This chapter delves into the critical phase of evaluating and communicating the outcomes of a machine learning system. It underscores the importance of measuring results through offline and online testing, conducting A/B tests to validate the system's effectiveness in real-world scenarios, and effectively reporting these results to stakeholders. The chapter aims to bridge the gap between technical achievements and business impacts, ensuring that the advancements in machine learning translate into tangible benefits.
624 | 
625 | ### Main Sections
626 | 
627 | ### Section 1: Measuring Results
628 | 
629 | - **Key Concepts:**
630 |     - The necessity of understanding the system's goal and designing experiments with clear hypotheses.
631 |     - Offline testing as a proxy for anticipating online performance.
632 |     - Transitioning from model performance metrics to real-world business metrics.
633 |     - The role of simulated environments in enhancing offline testing robustness.
634 | 
635 | ### Section 2: A/B Testing
636 | 
637 | - **Key Concepts:**
638 |     - A/B testing as a gold standard for evaluating changes in a live environment.
639 |     - Designing A/B tests with a clear hypothesis and selecting appropriate metrics.
640 |     - Strategies for splitting data and ensuring representative test groups.
641 |     - Statistical criteria for interpreting A/B test results and the importance of simulated experiments to validate test designs.
642 | 
643 | ### Section 3: Reporting Results
644 | 
645 | - **Key Concepts:**
646 |     - Monitoring control and auxiliary metrics during experiments to identify issues early.
647 |     - Tracking uplift and understanding its implications for business metrics.
648 |     - Deciding when to conclude an experiment and how to interpret mixed outcomes.
649 |     - Structuring reports to communicate findings, including uplift monitoring, statistical significance, and the broader impact on business objectives.
650 | 
651 | ### Conclusion
652 | 
653 | - **Key Takeaways:**
654 |     1. Effective measurement and reporting are foundational to translating machine learning advancements into business value. They ensure that technical improvements are accurately assessed and communicated in terms of their impact on key business metrics.
655 |     2. A/B testing serves as a critical tool for validating the real-world effectiveness of machine learning systems. Properly designed and executed A/B tests provide a reliable basis for making data-driven decisions about deploying new models or system changes.
656 |     3. Reporting results goes beyond stating statistical significance; it involves a comprehensive analysis of how changes affect various metrics, the potential for scaling these changes, and their implications for future strategies.
657 |     4. The process of measuring and reporting results is iterative and should inform ongoing development and refinement of machine learning systems. It requires close collaboration between technical teams and business stakeholders to ensure that the insights gained drive actionable improvements.
658 | 
659 | This structured outline encapsulates the essence of Chapter 12, offering a roadmap for effectively measuring, evaluating, and reporting the outcomes of machine learning systems in a way that aligns technical achievements with business goals.
660 | # Chapter 13: Integration
661 | 
662 | ### Introduction
663 | 
664 | This chapter emphasizes the significance of integration as an ongoing process crucial for the success of machine learning systems. It highlights the importance of API design, the release cycle, operating the system, and implementing overrides and fallbacks to ensure the system's robustness and adaptability. The chapter aims to guide readers through the technical aspects of integrating ML systems into existing workflows and infrastructures, ensuring they are prepared for real-world deployment and operation.
665 | 
666 | ### Main Sections
667 | 
668 | ### Section 1: API Design
669 | 
670 | - **Key Concepts:**
671 |     - API as a contract between the system and its users, emphasizing the importance of simplicity and predictability.
672 |     - The challenge of finding the right level of abstraction to avoid leaky abstractions and overcomplication.
673 |     - The necessity of versioning and ensuring deterministic behavior in ML systems.
674 | 
675 | ### Section 2: Release Cycle
676 | 
677 | - **Key Concepts:**
678 |     - Differences between ML systems and regular software in terms of testing and deployment frequency.
679 |     - The need for human-in-the-loop evaluation due to the unique trade-offs presented by ML model updates.
680 |     - Strategies for managing long training times and ensuring system stability through various release techniques like blue-green and canary deployments.
681 | 
682 | ### Section 3: Operating the System
683 | 
684 | - **Key Concepts:**
685 |     - The role of continuous integration (CI) in facilitating smooth development and integration processes.
686 |     - Importance of logs, metrics, alerting, and incident management platforms for maintaining system health.
687 |     - Addressing non-technical operational concerns such as compliance with regulations, user data management, and system explainability.
688 | 
689 | ### Section 4: Overrides and Fallbacks
690 | 
691 | - **Key Concepts:**
692 |     - Implementing fallback strategies to maintain operational efficiency during unforeseen circumstances or model failures.
693 |     - The use of overrides to manually adjust the system's output in specific scenarios or during transitional periods.
694 |     - The potential of using multisource weak supervision to improve model performance based on collections of overrides.
695 | 
696 | ### Conclusion
697 | 
698 | - **Key Takeaways:**
699 |     1. Integration is a continuous and essential process that ensures the success and longevity of ML systems. It requires careful planning, from API design to deployment and operation.
700 |     2. API design should prioritize simplicity and predictability, with a focus on creating interfaces that hide complexity while allowing for necessary customization and ensuring deterministic behavior.
701 |     3. The release cycle of ML systems presents unique challenges, necessitating a balance between agility and stability. Techniques like blue-green and canary deployments can facilitate safer updates and minimize disruptions.
702 |     4. Operational robustness is achieved not only through technical means such as CI, logging, and monitoring but also by addressing non-technical aspects like compliance and user data management. Overrides and fallbacks are critical for maintaining service continuity and adapting to changes or failures in real-time.
703 | 
704 | This structured outline provides a comprehensive overview of Chapter 13, offering insights into the crucial aspects of integrating ML systems into broader systems and workflows, ensuring they are ready for deployment and capable of evolving in response to new challenges and requirements.
705 | # Chapter 14: **Monitoring and Reliability**
706 | 
707 | ### **Introduction**
708 | 
709 | Chapter 14 of "Machine Learning System Design With End-to-End Examples" addresses the critical aspects of monitoring and ensuring the reliability of machine learning systems post-deployment. This chapter explores why traditional software monitoring practices are insufficient for ML systems and details the specific challenges posed by the dynamic nature of ML models, which may degrade over time or behave unpredictably when faced with real-world data.
710 | 
711 | ### **Main Sections**
712 | 
713 | **Section 14.1: Importance of Monitoring**
714 | 
715 | - Overview of the risks associated with not monitoring ML systems.
716 | - Discussion on the types of issues that can arise from unmonitored systems, including model degradation and data drift.
717 | - The importance of continuous validation and testing to ensure system stability.
718 | 
719 | **Section 14.2: Software System Health**
720 | 
721 | - Critical aspects of maintaining the health of the software infrastructure supporting ML models.
722 | - Techniques and tools for monitoring software performance, such as application and infrastructure monitoring, alerting, and incident management.
723 | - Example metrics for monitoring including error rates, request rates, and system utilization.
724 | 
725 | **Section 14.3: Data Quality and Integrity**
726 | 
727 | - Challenges in maintaining data quality and integrity in dynamic environments.
728 | - Common data issues such as processing errors, source corruption, and cascade/upstream model impacts.
729 | - Strategies for monitoring data quality, including anomaly detection and schema validation.
730 | 
731 | **Section 14.4: Model Quality and Relevance**
732 | 
733 | - Exploring model decay and the concepts of data drift and concept drift.
734 | - Methods for assessing model performance and relevance over time.
735 | - Techniques for detecting and addressing model drift, such as retraining and using robust model architectures.
736 | 
737 | ### **Conclusion**
738 | 
739 | 1. **Monitoring is Essential**: Without proper monitoring, even the most sophisticated ML models can fail, highlighting the need for robust monitoring frameworks that include software health, data quality, and model performance.
740 | 2. **Proactive Maintenance**: Proactive strategies in monitoring can mitigate risks associated with data drift and model decay, ensuring that ML systems continue to perform optimally over time.
741 | 3. **Integrated Approach**: Effective monitoring combines traditional software monitoring techniques with new approaches tailored to the nuances of ML systems, integrating data quality checks, performance benchmarks, and business KPIs to create a holistic view of system health.
742 | 4. **Continuous Improvement**: The field of ML monitoring is evolving, necessitating ongoing adjustments to monitoring practices as new challenges and technological advancements arise.
743 | 
744 | # Chapter 15: **Serving and Inference Optimization**
745 | 
746 | ### **Introduction**
747 | 
748 | Chapter 15 delves into the crucial aspects of deploying and optimizing machine learning models for serving and inference in production environments. It addresses the common challenges that practitioners face during this phase and explores various methods to enhance the efficiency of inference pipelines, emphasizing the importance of this final step in ensuring the practical utility of machine learning models.
749 | 
750 | ### **Main Sections**
751 | 
752 | **Section 15.1: Challenges in Serving and Inference**
753 | 
754 | - Overview of key performance indicators such as latency, throughput, and scalability.
755 | - Discussion on the diverse requirements depending on the target platforms like mobile, IoT, or cloud servers.
756 | - Cost considerations and the balance between computational expense and system performance.
757 | - The importance of system reliability, flexibility, and security in various deployment contexts.
758 | 
759 | **Section 15.2: Trade-offs and Optimization Patterns**
760 | 
761 | - Analysis of common trade-offs between latency, throughput, and cost.
762 | - Exploration of patterns like batching, caching, and model routing to optimize inference.
763 | - Strategies to balance model accuracy with computational efficiency.
764 | 
765 | **Section 15.3: Tools and Frameworks for Inference**
766 | 
767 | - Introduction to different frameworks and their suitability for specific types of inference tasks.
768 | - Discussion on the benefits of separating training and inference frameworks to maximize performance.
769 | - Overview of popular tools such as ONNX, OpenVINO, TensorRT, and their roles in optimizing inference.
770 | 
771 | ### **Conclusion**
772 | 
773 | 1. **Critical Balance of Factors**: Effective inference optimization requires a careful balance between competing factors such as latency, throughput, and cost. Understanding and prioritizing these based on specific application needs is key to successful deployment.
774 | 2. **Choice of Tools and Frameworks**: Selecting the right tools and frameworks is crucial and should be guided by the specific requirements of the deployment environment and the nature of the machine learning tasks.
775 | 3. **Continuous Monitoring and Optimization**: Continuous performance monitoring and iterative optimization are essential to maintain and improve the inference capabilities of machine learning systems in production.
776 | 4. **Strategic Planning for Scalability**: Planning for scalability from the outset can mitigate future challenges and help manage costs effectively as system demand grows.
777 | 
778 | # Chapter 16: **Ownership and Maintenance**
779 | 
780 | ### **Introduction**
781 | 
782 | Chapter 16 of "Machine Learning System Design With end-to-end examples" emphasizes the critical aspects of ownership and maintenance in machine learning (ML) systems. It addresses the necessity of accountability, the balance between efficiency and redundancy, the importance of thorough documentation, and the pitfalls of excessive complexity in system design. This chapter guides the reader on integrating these principles from the inception of the system to ensure its robustness and sustainability over time.
783 | 
784 | ### **Main Sections**
785 | 
786 | **Section 16.1: Accountability**
787 | 
788 | - **Key Concepts:**
789 |     - Involvement of various stakeholders and the importance of their contributions from the early stages.
790 |     - Definition and assignment of clear responsibilities using the RACI matrix.
791 |     - The value of redundancy in roles to ensure continuity and the prevention of knowledge silos.
792 | 
793 | **Section 16.2: The Bus Factor**
794 | 
795 | - **Key Concepts:**
796 |     - Explanation of the "bus factor" and its impact on project continuity.
797 |     - Strategies to mitigate risks associated with high bus factors, such as cross-training and comprehensive documentation.
798 |     - Balancing team efficiency with the need for redundancy in knowledge and skills.
799 | 
800 | **Section 16.3: Documentation**
801 | 
802 | - **Key Concepts:**
803 |     - The role of documentation in maintaining long-term system health and facilitating knowledge transfer.
804 |     - Effective documentation practices, including the use of detailed system descriptions, operational procedures, and maintenance logs.
805 |     - The pitfalls of neglecting documentation, illustrated with practical consequences and mitigation strategies.
806 | 
807 | **Section 16.4: Complexity Management**
808 | 
809 | - **Key Concepts:**
810 |     - Discussion on the deceptive appeal of creating complex systems and the associated risks.
811 |     - Importance of simplicity in design for ease of use, maintenance, and scalability.
812 |     - Techniques for reducing unnecessary complexity, such as modular design and adherence to design principles.
813 | 
814 | ### **Conclusion**
815 | 
816 | - **Key Takeaways:**
817 |     - Proper system maintenance and clear accountability are foundational to the success and sustainability of a machine learning system.
818 |     - Ensuring sufficient redundancy and comprehensive documentation are crucial to mitigate risks associated with personnel changes and system complexity.
819 |     - Avoiding overly complex solutions not only simplifies maintenance but also enhances system reliability and performance.
820 |     - Continuous evaluation and adaptation of the maintenance plan are necessary to respond to new challenges and changes in system requirements or team structure.
821 | 
822 | This structured approach to maintaining ML systems as outlined in Chapter 16 ensures that they remain robust, efficient, and adaptable to changes, providing sustained value over time.
823 | 


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | # Contributing to ML System Design
 2 | 
 3 | Thank you for your interest in contributing to the ML System Design repository! This guide will help you understand how to contribute effectively.
 4 | 
 5 | ## Types of Contributions
 6 | 
 7 | We welcome the following types of contributions:
 8 | 
 9 | 1. **Design Documents**: New ML system design examples
10 | 2. **Templates**: New or improved templates for design documents
11 | 3. **Best Practices**: Documentation of ML system design best practices
12 | 4. **Code Examples**: Implementation examples related to design documents
13 | 5. **Reviews**: Feedback on existing design documents
14 | 6. **Bug Fixes**: Corrections to existing content
15 | 
16 | ## How to Contribute
17 | 
18 | ### Adding a New Design Document
19 | 
20 | 1. Use the template in `templates/basic_ml_design_doc.md`
21 | 2. Place your document in the `Design_Doc_Examples` directory
22 | 3. Follow the naming convention: `[Domain]_[Problem]_Design.md`
23 | 4. Include all relevant sections from the template
24 | 5. Add any supporting diagrams or code snippets
25 | 
26 | ### Quality Guidelines
27 | 
28 | - Follow the structure provided in the templates
29 | - Include clear problem statements and business context
30 | - Provide detailed technical specifications
31 | - Include diagrams where appropriate
32 | - Reference sources and prior work
33 | - Use clear, professional language
34 | 
35 | ### Submission Process
36 | 
37 | 1. Fork the repository
38 | 2. Create a new branch for your contribution
39 | 3. Make your changes
40 | 4. Run through the checklist in `templates/design_doc_checklist.md`
41 | 5. Submit a pull request
42 | 6. Respond to review comments
43 | 
44 | ### Review Process
45 | 
46 | All submissions will be reviewed for:
47 | - Adherence to templates and guidelines
48 | - Technical accuracy and completeness
49 | - Clear writing and organization
50 | - Practical value and applicability
51 | 
52 | ## Style Guide
53 | 
54 | ### Writing Style
55 | - Use clear, concise language
56 | - Define technical terms
57 | - Use consistent formatting
58 | - Include examples where helpful
59 | 
60 | ### Markdown Guidelines
61 | - Use proper heading hierarchy
62 | - Include table of contents for long documents
63 | - Use code blocks for technical content
64 | - Include alt text for images
65 | 
66 | ## Questions?
67 | 
68 | If you have questions about contributing, please:
69 | 1. Check existing issues
70 | 2. Review the templates and examples
71 | 3. Open a new issue for clarification
72 | 
73 | Thank you for helping improve ML system design practices! 


--------------------------------------------------------------------------------
/Design_Doc_Examples/Examples/EN/Retail_Demand_Forecasting.md:
--------------------------------------------------------------------------------
  1 | # Mega retail
  2 | 
  3 | ### **I. Problem definition**
  4 | 
  5 | ### **i. Origin**
  6 | 
  7 | Supermegaretail is a retail chain operating through a network of thousands of stores across different countries in various regions. The chain's customers buy various goods, primarily groceries, household essentials, personal care, sports supplements, and many more.
  8 | 
  9 | To sell these goods, Supermegaretail must purchase or produce them before delivering them to a store's location. The number of purchased goods is the key figure that needs to be defined, and there are different possible scenarios here.
 10 | 
 11 | For easier calculations, we can assume that Supermegaretail brought 1000 units of Item A to the specific store.
 12 | 
 13 | 1. Supermegaretail brought 1000 units and sold 999 until the next delivery. This is an optimal situation. With only 0.1% of leftovers, the retailer is close to the optimal revenue and margin.
 14 | 2. Supermegaretail brought 1000 units and sold 100 until the next delivery. This is usually an awful situation for an apparent reason. Supermegaretail wants to sell almost as many units as it purchased without going out of stock. The more significant the gap, the more considerable Supermegaretail's losses.
 15 | 3. Supermegaretail brought 1000 units and sold 1000. This should be considered a terrible situation because we don't know how many units people would buy if they had an opportunity. It could be 1001, 2000, or 10,000. An out-of-stock situation like that obscures our understanding of the world. Even worse than that—it drives customers from Supermegaretail to its competitors, where they can buy the stuff with no shortages.
 16 | 
 17 | An additional constraint is that we have a lot of perishable foods and can't wait too long; it's either gone or wasted.
 18 | 
 19 | The project goal is to reduce the gap between delivered and sold items, making it as narrow as possible, while avoiding an out-of-stock situation with a specific service-level agreement (SLA) to be specified further. To do that, we plan to forecast the demand for a specific item in a specific store during a particular period with the help of a machine learning system.
 20 | 
 21 | ### **ii. Relevance & Reasons**
 22 | 
 23 | *This section highlights the problem's relevance, backed by exploratory data analysis.*
 24 | 
 25 | **ii.i. Existing flow analysis**
 26 | 
 27 | What is the current way of ordering, delivering, and selling goods in Supermegaretail?
 28 | 
 29 | For Supermegaretail, the key aspects include:
 30 | 
 31 | 1. Planning horizon for making a deal with goods manufacturers:
 32 |    * It's a one-year deal with the opportunity to adjust 90 days ahead within the first nine months.
 33 | 2. Additional discount with an increased volume:
 34 |    * It's an extra 2% off for every additional $20M.
 35 | 3. The number of distribution centers serving as logistics hubs between manufacturers and stores:
 36 |    * There are 47 distribution centers around the country, making them a point of presence and an aggregated entity for the forecast.
 37 | 4. Delivery cadence between distribution centers and stores:
 38 |    * Usually, every two days there is a truck connecting the distribution center and the grocery store.
 39 | 5. Presence or absence of in-store warehouses:
 40 |    * There are no warehouses in most stores. However, the loading bay zone can be (and is) effectively used to store offloaded items for 2–3 days.
 41 | 6. Who and at what stage decides what and where to deliver:
 42 |    * There's a delivery plan coming down from the distribution center. A store's manager can override and adjust it.
 43 | 7. Forecast horizon:
 44 |    * The primary forecast horizon is week-long and month-long. However, a one-year horizon is needed when dealing with goods manufacturers.
 45 | 8. Business owner of the process:
 46 |    * Logistics department.
 47 |    * Procurement department.
 48 |    * Operational department (store managers).
 49 | 
 50 | **ii.ii. How much does Supermegaretail lose on the gap between forecasted and factual demand?**
 51 | 
 52 | While it is relatively easy to calculate the loss due to overstock and expired items, it is much harder to calculate the loss due to out-of-stock situations. The latter can be estimated either through a series of A/B tests or an expert opinion, which is usually much quicker and cheaper than running those tests.
 53 | 
 54 | The overall loss can be approximated by summing up the two, providing an estimate of the gain with an ideal and non-achievable solution.
 55 | 
 56 | **Initial calculation showed the loss to be around $800M USD during the last year.**
 57 | 
 58 | *Starting from section 1.2.3 of the design document (but only for this chapter), we've sketched questions only to avoid this being too voluminous. Answering these questions will help decide on further actions, while the answers are to be revealed in the later chapters with us pacing through different stages of the system.*
 59 | 
 60 | **ii.iii. Other reasons**
 61 | 
 62 | - Can other teams use our solution, making development more appealing and reasonable?
 63 | - Perhaps we can sell demand forecast solutions to other retail companies (obviously not to direct competitors).
 64 | 
 65 | ### **iii. Previous work**
 66 | 
 67 | *This section covers whether this is an entirely new problem or something has been done before. Usually, it is a list of questions you ask to avoid doing double work or repeating previous mistakes.*
 68 | 
 69 | - What if Supermegaretail was aware of this issue and had already implemented some demand forecast approach? It has various stores in different locations; its demand forecast is probably already pretty efficient. How do they do it?
 70 |    - Rolling window?
 71 |    - Experts committee?
 72 |    - Rule of thumb + extra quick delivery?
 73 | - Do we have some limitations to consider that we can't avoid? Like minimum or maximum order size?
 74 | - Can we quickly improve the existing solution, or do we need an entirely new one?
 75 | - What if Supermegaretail's current forecasting is good enough for some categories and useless for others? In other words, can we use a hybrid approach here, at least in the very beginning, and start with the least successful categories, where the existing gap between predictions and actual sales is the widest?
 76 | - If our approach unintentionally breaks something, it is not that dangerous. We are testing it for categories where we always had problems while not touching categories where everything is good.
 77 | - In other words, we need to run an extensive and fresh exploratory data analysis of the existing solution.
 78 | 
 79 | ### **iv. Other issues & Risks**
 80 | 
 81 | - Do we have a required infrastructure, or do we need to build it?
 82 | - If we pick something sophisticated, it can go crazy. What necessary checks and balances do we need to implement to avoid a disaster? Do we have a fallback in case something is broken?
 83 | - How sure are we that we can significantly improve the quality and reduce the manual load? Can we really solve this?
 84 | - What is the price of a mistake? Probably out-of-stock and overstock have different costs of errors.
 85 | - If we deal with an out-of-stock situation, can we handle increased traffic?
 86 | - How often and on what granularity do we need to perform predictions?
 87 | 
 88 | As you can see, even a brief overview of the problem to solve and research using the previously gathered data can easily force us to write down a 10-page doc. This draft will help us decide if we need to go further or it is better to stop right now and avoid a complicated ML solution.
 89 | 
 90 | The next section of Chapter 4 is no less important, though, as it gives a practical example of how to review a design document. If you're new to ML system design, you probably haven't reached the stage of your career where you have enough experience and credibility to be involved in this kind of working routine. However, stepping up to review your first design doc is just a matter of time, so better be prepared beforehand, and you will see some practical advice on the reviewing basics.
 91 | 
 92 | ### **II. Metrics and losses**
 93 | 
 94 | ### **i. Metrics**
 95 | 
 96 | Before picking up a metric on our own, it makes sense to do preliminary research. Fortunately, there are many papers related to this problem, but the one that stands out is *Evaluating predictive count data distributions in retail sales forecasting*.
 97 | 
 98 | Let's recall the project goal, which is to reduce the gap between delivered and sold items, making it as narrow as possible, while avoiding an out-of-stock situation with a specific service-level agreement (SLA) to be specified further. To do that, we plan to forecast the demand for a specific item in a specific store during a particular period using a machine learning system.
 99 | 
100 | In this case, this paper abstract looks like almost a perfect fit to address.
101 | 
102 | *Massive increases in computing power and new database architectures allow data to be stored and processed at increasingly finer granularities, yielding count data time series with lower and lower counts. These series can no longer be dealt with using approximate methods appropriate for continuous probability distributions. In addition, it is not sufficient to calculate point forecasts alone: we need to forecast the entire (discrete) predictive distributions, particularly for supply chain forecasting and inventory control, but also for other planning processes.*
103 | 
104 | (*Count data is an integer-valued time series. It is essential for the supply chain forecasting we are facing, where most products are sold in units.)*
105 | 
106 | With that in mind, we can briefly review this paper (within the lettered list below) and pick the metrics that are most appropriate for our end goal.
107 | 
108 | **a. Measures based on absolute errors**
109 | 
110 | MAE optimizes the median, weighted mean absolute percentage error (wMAPE) is MAE divided by the mean of the out-of-sample realizations, and the mean absolute scaled error is obtained by dividing the MAE by the in-sample MAE of the random walk forecast.
111 | 
112 | Optimizing for the median does not differ much from optimizing for the mean in a symmetric predictive distribution. However, the predictive distributions appropriate for low-volume count data are usually far from symmetric, and this distinction makes a difference in such cases and yields biased forecasts.
113 | 
114 | **b. Percentage errors**
115 | 
116 | The mean absolute percentage error (MAPE) is undefined if any future realization is zero, so it is singularly unsuitable for count data.
117 | 
118 | The symmetric MAPE (sMAPE) is an 'symmetrized' version of the MAPE, which is defined if the point forecasts and actuals are not both zero at all future time points. However, in any period with a zero actual, its contribution is two, regardless of the point forecast, making it unsuitable for count data.
119 | 
120 | **c. Measures based on squared errors**
121 | 
122 | Minimizing the squared error naturally leads to an unbiased point forecast. However, the mean squared error is unsuitable for intermittent-demand items because it is sensitive to very high forecast errors. The same argument stands for non-intermittent count data.
123 | 
124 | **d. Relative errors**
125 | 
126 | Prominent variations are the median relative absolute error (MdRAE) and the geometric mean relative absolute error (GMRAE).
127 | 
128 | In the specific context of forecasting count data, these suffer from two main weaknesses:
129 | 
130 | - Relative errors commonly compare absolute errors. As such, they are subject to the same criticism as MAE-based errors, as detailed above.
131 | - On a period-by-period basis, simple benchmarks such as the naive random walk may forecast without errors, and thus, this period's relative error would be undefined because of a division by zero.
132 | 
133 | **e. Rate-based errors**
134 | 
135 | Kourentzes (2014) recently suggested two new error measures for the intermittent demand, MSR and MAR, which aim to assess whether an intermittent demand point forecast captures the average demand correctly over an increasing period of time. This is an interesting suggestion, but one property of these measures is that they implicitly weigh the short-term future more heavily than the mid-to-long-term future. One could argue that this is exactly what we want to do while forecasting, but even then, a case could be made that such weighting should be explicit, e.g., by using an appropriate weighting scheme when averaging over future time periods.
136 | 
137 | **f. Scaled errors**
138 | 
139 | Petropoulos and Kourentzes (2015) suggest a scaled version of the MSE, the sMSE, which is the mean over squared errors that have been scaled by the squared average actuals over the forecast horizon. The sMSE is well-defined unless all actuals are zero, is minimized by the expectation of *f* and, due to the scaling, can be compared between different time series. In addition (again because of the scaling) it is not quite as sensitive to high-forecast errors as the MSE. Specifically, it is more robust to dramatic underforecasts, although it is still sensitive to large overforecasts.
140 | 
141 | **g. Functionals and loss functions**
142 | 
143 | An alternative way of looking at forecasts concentrates on point forecasts that are functionals of the predictive distribution. One could argue that a retailer aims at a certain level of service (say 95%) and that therefore they are only interested in the corresponding quantile of the predictive distribution. This would then be elicited with appropriate loss functions or scoring rules. This approach is closely related to the idea of considering forecasts as part of a stock control system. In this perspective, quantile forecasts are used as inputs to standard stock control strategies, and the quality of the forecasts is assessed by valuing the total stock position over time and weighting it against out-of-stocks.
144 | 
145 | Although the authors did not see this as the best solution and proposed an alternative, the last paragraph of the paper review is quite promising. Not only does it make sense from a business perspective to predict different quantiles to uphold SLA, but it is desirable from the point of view of having the loss function equal to the metric. Thus, Quantile metrics for quantiles of 1.5, 25, 50, 75, 95, and 99 look like a proper choice. Moreover, if we need to pay more attention to a specific SKU, item group or cluster, in that case, Quantile metrics support the calculation of object/group weights (for example, item price).
146 | 
147 | **i.ii. Metrics to pick**
148 | 
149 | Quantile metrics for quantiles of 1.5, 25, 50, 75, and 95, and 99 both as is and with weights equal to SKU price and an additional penalty for underforecasting or overforecasting if deemed necessary. Calculated as point estimates with 95% confidence intervals (using bootstrap or cross-validation). In addition, we can further transform this metric, representing it not as an absolute value, but an absolute percentage error at a given quantile. All considerations from the article review above regarding percentage errors have to be taken into account. Ultimately, a set of experiments will help to decide a final form. Most probably we will have both, as it makes sense to check both absolute values in money/pieces and percentage error.
150 | 
151 | Online metrics of interest during A/B test are:
152 | 
153 | - Revenue (expected to increase),
154 | - Level of stock (expected to decrease or maintain the same),
155 | - Margin (expected to increase).
156 | - Alpha — coefficient used in quantile-based losses
157 | - W — weights
158 | - I — indicator function
159 | - A — model output
160 | - T — label
161 | 
162 | ### **ii. Loss Functions**
163 | 
164 | With metrics equal to our loss functions, it is straightforward to pick the latter. We will train six models using a quantile loss of 1.5, 25, 50, 75, 95, and 99, resulting in six different models, providing us with various guarantees for the corresponding quantile of the predictive distribution.
165 | 
166 | As a second line of experimentation, we will additionally review the Tweedie Loss Function. Tweedie distributions are a family of probability distributions, including the purely continuous normal, gamma, and inverse Gaussian distributions, the purely discrete scaled Poisson distribution, and the class of compound Poisson–gamma distributions that have positive mass at zero but are otherwise continuous. These qualities make it an attractive candidate for our Count data.
167 | 
168 | ### **III. Dataset**
169 | 
170 | The atomic object of the dataset is a bundle of (date, product, store), and the target variable we aim to predict is the number of units sold.
171 | 
172 | ### **i. Data Sources**
173 | 
174 | There are multiple sources of data we can utilize for our objective.
175 | 
176 | ### Inner sources
177 | 
178 | 1. *Historical data on purchases (i.e. transaction history)* is collected from the chain of stores of Supermegaretail and saved to a centralized database. It will be our *primary source* of truth: the number of sales, the amount of money spent, the discounts applied, transaction ID, and so on.
179 | 2. *Stock history* is the second most important source of truth for our problem since it directly determines how many units of each product can be sold in each store. This source can help estimate how many products are available for sale at the beginning of each day and how many were expired and withdrawn from sale.
180 | 3. Metadata of *each product, store, and transaction*.
181 | 4. Calendar of planned *promo activities*. A significant factor affecting future sales that definitely needs to be taken into account.
182 | 
183 | ### Outer sources: manually gathered data
184 | 
185 | 1. *Price monitoring*. Prices and other product info collected from our competitors. They are manually gathered daily from a subset of stores of different competitors. It could be done either by our in-house team or by a third party (outsourced). A hybrid approach also can take place. Each product should also contain a global product identifier (barcode), so we can easily match collected data with our product. Knowing aggregated competitors' prices and their dynamics helps us understand what is happening in the market.
186 | 
187 | ### Outer sources: purchased data
188 | 
189 | 1. *Weather history and its forecast* we buy from the meteorological service. Weather is an important factor directly affecting consumer behavior.
190 | 2. *Customer Traffic* estimation nearby our stores (from telecom providers).
191 | 3. *Global market indicators*.
192 | 
193 | ### Mobile app and website data (optional)
194 | 
195 | 1. Supermegaretail has a *delivery service* (even if it takes less than 5% of revenue). We will collect *additional data* about specific sales in a specific location. Sometimes this information can be a valuable predictor.
196 | 2. Also, mobile and web services collect implicit feedback about user activity, including views, clicks, or adding to the cart, which also can predict sales in physical stores.
197 | 
198 | ### **ii. Data Labeling**
199 | 
200 | Since we are dealing with a demand forecasting problem, we don't need extra data labeling derived directly from the transaction history.
201 | 
202 | ### **iii. Available Metadata**
203 | 
204 | We forecast demand based on the SKU per store level, with three key elements: products, stores, and transactions.
205 | 
206 | ### Products
207 | 
208 | - *Product ID and barcode*.
209 | - *Category codes of different levels (1, 2, 3)*. We can use a hierarchy of categories for a rule-based measurement of similarity between products. Also, other categorical info, like *brand, manufacturer, or pricing group*.
210 | - *Shelf life* determines how bad it is to overpredict sales for this product.
211 | - Date when the product was *added to the assortment matrix* of the chain.
212 | - *Dimensions* and *weight* of the product.
213 | 
214 | ### Stores
215 | 
216 | - *Store ID*.
217 | - *Location (coordinates)*, with the support of third-party sources—we can use it to add information about the weather, flow of people, and distance to critical points. Other related items, like city, region, and associated logistics center.
218 | - *The nearest competitors' stores* (with their IDs and distances).
219 | - The *size of the store* and its *format*. They determine which products and how many unique products will be in the assortment of this store.
220 | - The dates when the store is open and when it was closed.
221 | 
222 | ### Transactions
223 | 
224 | - *Timestamp*. It allows us to enrich the dataset with things like holidays.
225 | - *Customer ID* (if a loyalty card was applied). Despite the fact that the final unit of the dataset is (product, store), a bundle of (customer, product) can be used in a separate data pipeline for calculating product embeddings via aggregating transactions to a user-item matrix and its factorization. The embeddings will contain patterns of purchasing behavior.
226 | - *Store ID* and *product ID*.
227 | 
228 | ### **iv. Available History**
229 | 
230 | **Demand forecast is nothing new for Supermegaretail**. Critical ETL processes are already in place. Supermegaretail has been collecting data for more than three years.
231 | 
232 | This history is essential for our forecasting model to learn patterns, catching the seasonality of sales, estimate trends, etc. The same is applicable for products and stores metadata. Weather data (which we take from external sources) has been available for a period in the past as long as we need.
233 | 
234 | Stocks history and promo activities have been gathered as well.
235 | 
236 | Price monitoring data of competitors has been collected for the last 2 years.
237 | 
238 | ### **v. Data Quality Issues**
239 | 
240 | Transactions, stock, and promo data may contain missing or duplicated values, so additional filtering or preprocessing is required before aggregation.
241 | 
242 | The external data we bought has already been cleaned and passed some quality control before coming to us. However, necessary checks need to be implemented.
243 | 
244 | The competitors' prices cover about 25% of SKU and have gaps.
245 | 
246 | ### **vi. Final ETL Pipeline**
247 | 
248 | The top-level scheme is as follows:
249 | 
250 | 1.  Transaction data is aggregated daily.
251 | 2.  Add the newly aggregated partition to the table of transaction aggregates.
252 | 3.  (Optionally) We rewrite not only the last day but also the previous 2-3 days to fix possible data corruptions (duplicates, incomplete data, duplicated data, and so on).
253 | 4.  Join other sources of internal/external data based on date, product ID, or store ID.
254 | 5.  Finally, calculate features based on the joined dataset.
255 | 
256 | Optionally, we can add a data pipeline for product embeddings, as described in the **iii. Available Metadata** section, if needed.
257 | 
258 | ### **IV. Validation Schema**
259 | 
260 | ### **i. Requirements**
261 | 
262 | What are the assumptions that we need to pay attention to when figuring out the evaluation process?
263 | 
264 | 1.  New data is coming daily.
265 | 2.  Data can arrive with a delay of up to 48 hours.
266 | 3.  New labels (number of units sold) come with the new data.
267 | 4.  Recent data is most probably more relevant for the prediction task.
268 | 5.  The assortment matrix changes by 15% every month.
269 | 6.  Seasonality is present in the data (weekly/annual cycles).
270 | 
271 | Despite the fact that the data is naturally divided into categories, it is irrelevant for the choice of validation schema.
272 | 
273 | ### **ii. Inference**
274 | 
275 | After fixing a model (within the hyperparameters optimization procedure), we train it on the last 2 years of data and predict future demand for the next 4 weeks. This process is fully reproduced in both inner and outer validation.
276 | 
277 | It is important to note that there should be a gap of 3 days between training and validation sets in order to be prepared for the fact that data may arrive with a delay. Subsequently, this will affect which features we can and cannot calculate when building a model.
278 | 
279 | 
280 | ### **iii. Inner and outer loops**
281 | 
282 | We use two layers of validation. The outer loop is used for the final estimation of the model's performance, while the inner loop is used for hyperparameter optimization.
283 | 
284 | **Outer loop**. Given we are working with time series, rolling cross-validation is an obvious choice. We set K=5 to train five models with optimal parameters. Since we are predicting 4 weeks ahead, the validation window size also consists of 28 days in all splits. There is a gap of 3 days between sets, and the step size is 7 days.
285 | 
286 | *Example of the outer loop:*
287 | 
288 | *1st outer fold:*
289 | 
290 | - *Data for the testing is from **2022-10-10** to **2022-11-06** (4 weeks)*
291 | - *Data for the training is from **2020-10-07** to **2022-10-06** (2 years)*
292 | 
293 | *2nd outer fold:*
294 | 
295 | - *Data for the testing is from **2022-10-03** to **2022-10-30***
296 | - *Data for the training is from **2020-09-29** to **2022-09-28***
297 | 
298 | *…*
299 | 
300 | *5th outer fold:*
301 | 
302 | - *Data for the testing is from **2022-09-12** to **2022-10-09***
303 | - *Data for the training is from **2020-09-09** to **2022-09-08***
304 | 
305 | **Inner loop**. Inside each "train set" of the outer validation, we perform additional rolling cross-validation with a 3-fold split. Each inner loop training sample consists of a 2-year history as well to capture both annual and weekly seasonality. We use the inner loop to tune hyperparameters or for feature selection.
306 | 
307 | *Example of the inner loop (for the 2nd fold of the outer loop):*
308 | 
309 | *Training data for the 2nd outer fold is from **2020-10-03** to **2022-10-02**.*
310 | 
311 | *1st inner fold:*
312 | 
313 | - *Data for the testing is from **2022-09-05** to **2022-10-02** (4 weeks)*
314 | - *Data for the training is from **2020-09-02** to **2022-09-01** (2 years)*
315 | 
316 | *2nd inner fold:*
317 | 
318 | - *Data for the testing is from **2022-08-29** to **2022-09-25***
319 | - *Data for the training is from **2020-08-26** to **2022-08-25***
320 | 
321 | *3rd inner fold:*
322 | 
323 | - *Data for the testing is from **2022-08-22** to **2022-09-18***
324 | - *Data for the training is from **2020-08-19** to **2022-08-18***
325 | 
326 | If the model does not require model tuning yet, we can skip the inner loop.
327 | 
328 | 
329 | ### **iv. Update frequency**
330 | 
331 | We update the split weekly along with new data and labels (so that each validation set always consists of a whole week). This will help us catch local changes and trends in model performance.
332 | 
333 | Additionally, we have a separate holdout set as a benchmark ("golden set"). We update it every three months. It helps us track long-term improvements in our system.
334 | 
335 | ### **V. Baseline Solution**
336 | 
337 | ### **i. Constant Baseline**
338 | 
339 | As a constant baseline for Supermegaretail's demand forecasting system, we plan to use the actual sales value of the previous day per SKU per grocery store. Knowing that data sometimes could appear with delay and that grocery sales experience strong weekly seasonality, we will go one full week back instead of going one day back. As a result, our prediction for a specific item on Sep 8, 2022 will be the actual sales value for this item on Sep 1, 2022.
340 | 
341 | ### **ii. Advanced constant Baseline**
342 | 
343 | The Metrics and Losses chapter mentioned quantile losses of 1.5, 25, 50, 75, 95, and 99th percentiles. We can calculate the same with our baseline using a yearly window.
344 | 
345 | ### **iii. Linear model Baseline**
346 | 
347 | We will use a basic set of features to use linear regression with quantile loss; for a start, we can use target variables with multiple lags and aggregations such as sum/min/max/avg/median or corresponding quantiles for the last 7/14/30/60/90/180 days or different rolling windows of different sizes. A similar approach can be applied with other dynamic data beyond sales date, like price, revenue, average check, or a number of unique customers.
348 | 
349 | 
350 | ### **iv. Time series-specific baseline**
351 | 
352 | ARIMA (Autoregressive integrated moving average) and SARIMA (seasonal ARIMA). Both are autoregressive algorithms for forecasting; the second one considers any seasonality patterns.
353 | 
354 | Both require fine-tuning multiple hyperparameters to provide satisfying accuracy. To avoid this, we may prefer a SOTA forecasting procedure that works great out-of-the-box and is called Prophet (https://github.com/facebook/prophet). The nice advantage of Prophet is that it's robust and doesn't require a lot of preprocessing: outliers, missing values, shifts, and trends are handled automatically.
355 | 
356 | ### **v. Feature Baselines**
357 | 
358 | What additional information could some baselines and possible future models benefit from?
359 | 
360 | We will include extra static info about products (brand, category), stores (geo features), and context (time-based features, seasonality, day of the week)—all of them with preprocessing and encoding appropriate for a chosen model.
361 | 
362 | Features that are also suitable for the baseline are counters and interactions.
363 | 
364 | Examples include:
365 | 
366 | - Difference between current and average price (absolute and relative)
367 | - Penetration: the ratio of product sales to sales of a category (of levels 1, 2, 3) for rolling windows of different sizes
368 | - Number of days since the last purchase
369 | - Number of unique customers
370 | 
371 | etc.
372 | 
373 | ### **VI. Error analysis**
374 | 
375 | Remember we have six quantile losses for 1.5, 25, 50, 75, 95, and 99th quantiles of the target and corresponding six models for each. Constant baseline estimates these quantiles for each product based on the last N days of its sales. These baselines already have some residual distribution with some specific bias that is helpful to consider.
376 | 
377 | Comparing more complex models (linear models and gradient boosting) with these dummy baselines will give us an understanding of whether we are moving in the right direction in modeling and feature engineering or not.
378 | 
379 | ### **i. Learning Curve Analysis**
380 | 
381 | ### Convergence analysis
382 | 
383 | A step-wise learning curve based on the number of iterations comes into play only when we start experimenting with the gradient-boosting algorithm.
384 | 
385 | What are the key questions we should answer when examining the loss curve:
386 | 
387 | 1.  Does the model converge at all?
388 | 2.  Does the model beat baseline metrics (quantile loss, MAPE, etc.)?
389 | 3.  Are issues like underfitting/overfitting present or not?
390 | 
391 | Once we ensure the model converges, we can pick a sufficient number of trees on a rough grid (500-1000-2000-3000-5000) and then fixate on this number for future experiments.
392 | 
393 | For simpler baselines, convergence analysis is not the case.
394 | 
395 | ### Model complexity
396 | 
397 | We will use a model-wise learning curve to decide an optimal number of features and overall model complexity.
398 | 
399 | Let's say we fixate all hyperparameters except the number of lags we use: the more we take, the more complicated patterns and seasonalities our model can capture – and the easier it will be to overfit training data. Should it be N-1, N-2, N-3 days? Or N-1, N-2, …, N-30 days? The optimal number can be determined by the "model size vs. error size" plot.
400 | 
401 | Similarly, we can optimize window sizes. For instance, windows "7/14/21/…" are more granular than "30/60/90/…" ones. The appropriate level of granularity can be chosen, again, by using a model-wise learning curve.
402 | 
403 | In the same fashion, we tweak other key hyperparameters of the model during the initial adjustments, for instance, regularization term size.
404 | 
405 | ### Dataset size
406 | 
407 | Do we need to use all the available data to train the model? How many last months are enough and relevant? Do we need to utilize all (day, store, item) data points, or can we downsample 20% / 10% / 5% of them without noticeable downgrading in metrics?
408 | 
409 | Sample-wise learning curve analysis can help here, as it determines how many samples are necessary for the error on the validation set to reach a plateau.
410 | 
411 | We should make an important design decision of whether we should use (day, store, item) as an object of the dataset or move to less granular (week, store, item). The latter option reduces the number of required computations by a factor of 7, while model performance can either be left unchanged or even be increased.
412 | 
413 | This design decision affects not only the demand forecasting service speed and performance but also the overall product (a stock management system), drastically reshaping the landscape of its possible use cases. Therefore, despite the possible advantages, this decision should be agreed upon with our product managers, users (category managers), and stakeholders.
414 | 
415 | ### **ii. Residual analysis**
416 | 
417 | Remember, we have an asymmetric cost function: overstock is far less harmful than out-of-stock problems. We have either expired goods or missed profit. The uncovered demand problem is a much worse scenario, and in the long run, it is expressed in customers' dissatisfaction and an increased risk that they will pass to competitors.
418 | 
419 | ### Residual Distribution
420 | 
421 | The mentioned peculiarity of the demand should guide us throughout the residual analysis of our forecasting model: positive residuals (overprediction) are more preferred than negative ones (underpredictions). However, too much overprediction is bad as well.
422 | 
423 | Therefore, we plot the distribution of the residuals along with their bias (a simple average among raw residuals). We expect this to be true in one of the following possible scenarios:
424 | 
425 | 1.  A small positive bias reveals slight overprediction, which is the desirable outcome. If, in addition, residuals are not widely spread (low variance), we get a perfect scenario.
426 | 2.  Equally spread residuals in both negative and positive directions would be okay but is less preferred than A. We should force the model to produce more optimistic forecasts to ensure we minimize missed profit.
427 | 3.  The worst scenario is when we have a skew in favor of negative residuals. It means our model tends to increase customers' dissatisfaction. This would definitely be a red flag for the current model version deployment.
428 | 4.  If we have a skew but favor positive residuals, this is unambiguously a good case for Supermegaretail as well, hence, less preferred than the first case.
429 | 5.  These scenarios are applicable when we try to estimate unbiased demand (we use median prediction for that). But as mentioned, we also have a bunch of other models for other quantiles (1.5%, 25%, 75%, 95%, 99%).
430 | 
431 | 
432 | For each of them, we analyze the basic assumption behind each model.
433 | 
434 | For example:
435 | 
436 | - Is it true that 95% of residuals are positive for a model that predicts a 95%-quantile?
437 | - Is it true that 75% of residuals are negative for a model that predicts a 25%-quantile?
438 | 
439 | And so on.
440 | 
441 | ### Elasticity
442 | 
443 | We should validate the elasticity assumptions using elasticity curves. There is no solid understanding of whether all the goods are expected to demonstrate elasticity, and this needs to be confirmed with stakeholders.
444 | 
445 | If we face problems related to elasticity, we have two options to improve the elasticity capturing:
446 | 
447 | 1.  **Post-processing (fast, simple, ad hoc solution).** We can apply an additional model (e.g., isotonic regression) for prediction post-processing to calibrate forecasts.
448 | 2.  **Improve the model (slow, hard, generic solution).** It requires additional modeling, feature engineering, data preprocessing, etc. There is no predefined set of actions that will solve the problem for sure.
449 | 
450 | ### Best-case vs. Worst-case vs. Corner-case
451 | 
452 | Each time we roll out a new version of the model, we automatically report its performance on best/worst/corner cases and save top-N% of these cases as artifacts of the training pipeline. Here is a draft of a checklist of questions for which we should find answers in this report:
453 | 
454 | - What's the model's prediction error when the sales history of an item is short? Are the residuals mostly positive or mostly negative?
455 | - What about items with a high price or with a low price?
456 | - How does prediction error depend on weekends/holidays/promotion days?
457 | - What are the commonalities among the items with almost zero residuals? Is a long sales history necessarily required for them? How long should the sales history be in order to get acceptable performance? Does the model require other conditions that can help us to distinguish those cases where we are certain about the quality of the forecast?
458 | - What are the commonalities among the items with the largest negative residuals? We would 100% prefer to exclude these cases or whole categories from A/B-testing groups or pilots. We should also focus on these items when we start to improve the model.
459 | - And finally, what do the items with the largest positive residuals have in common?
460 | 
461 | ### **VII. Training Pipeline**
462 | 
463 | ### **i. Overview**
464 | 
465 | The demand forecasting model for Supermegaretail aims to predict the demand for specific items in specific stores during a particular period. To achieve this, we need a training pipeline that can preprocess data, train the model, and evaluate its performance. We assume the pipeline should be scalable, easy to maintain, and allow for experimentation with various model architectures, feature engineering techniques, and hyperparameters.
466 | 
467 | ### **ii. Toolset**
468 | 
469 | The suggested tools for the pipeline are:
470 | 
471 | - Python as the primary programming language for its versatility and rich ecosystem for data processing and machine learning.
472 | - Spark for parallel and distributed computing.
473 | - PyTorch for deep learning models.
474 | - MLflow for tracking experiments and managing the machine learning lifecycle.
475 | - Docker for containerization and reproducibility.
476 | - AWS Sagemaker or Google Cloud AI Platform for cloud-based training and deployment.
477 | 
478 | ### **iii. Data Preprocessing**
479 | 
480 | The data preprocessing stage should include:
481 | 
482 | - Data cleaning: Handling missing values, removing duplicates, and correcting erroneous data points.
483 | - Feature engineering: Creating new features from existing ones, such as aggregating sales data, extracting temporal features (day of the week, month, etc.), and incorporating external data (e.g., holidays, weather, and promotions).
484 | - Data normalization: Scaling numeric features to a standard range.
485 | - Train-test split: Splitting the dataset into training and validation sets, ensuring that they do not overlap in time to prevent data leakage.
486 | 
487 | ### **iv. Model Training**
488 | 
489 | The model training stage should accommodate various model architectures and configurations, including:
490 | 
491 | - Baseline models: Simple forecasting methods like moving average, exponential smoothing, and ARIMA.
492 | - Machine learning models: Decision trees, random forests, gradient boosting machines, and support vector machines.
493 | - Deep learning models: Recurrent neural networks (RNNs), Long Short-Term Memory (LSTM) networks, and transformers (if needed).
494 | 
495 | We should also implement a mechanism for hyperparameter tuning, such as grid search or Bayesian optimization, to find the best model configurations.
496 | 
497 | ### **v. Model Evaluation**
498 | 
499 | Model performance should be evaluated using previously derived metrics, such as Quantile metrics for quantiles of 1.5, 25, 50, 75, 95, and 99 both as is and with weights equal to SKU price, calculated as point estimates with 95% confidence intervals (using bootstrap or cross-validation) plus standard metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), or Root Mean Squared Error (RMSE). We should also include custom metrics that are specific to Supermegaretail's business requirements, such as the cost of overstock and out-of-stock situations. (See *Validation* chapter.)
500 | 
501 | ### **vi. Experiment Tracking and Model Management**
502 | 
503 | Using a tool like MLflow, we should track and manage experiments, including:
504 | 
505 | - Model parameters and hyperparameters.
506 | - Input data and feature engineering techniques.
507 | - Evaluation metrics and performance.
508 | - Model artifacts, such as trained model weights and serialized models.
509 | 
510 | ### **vii. Continuous Integration and Deployment**
511 | 
512 | The training pipeline should be integrated into Supermegaretail's existing CI/CD infrastructure. This includes setting up automated training and evaluation on a regular basis, ensuring that the latest data is used to update the model, and deploying the updated model to production with minimal manual intervention.
513 | 
514 | ### **viii. Monitoring and Maintenance**
515 | 
516 | We should monitor the model's performance in production and set up alerts for significant deviations from expected performance. This will enable us to catch issues early and trigger retraining or model updates when necessary. (See *Monitoring* chapter.)
517 | 
518 | ### **ix. Future Work and Experimentation**
519 | 
520 | The training pipeline should be flexible enough to accommodate future experimentation, such as incorporating additional data sources, trying new model architectures, and adjusting loss functions to optimize for specific business objectives.
521 | 
522 | ### **VIII. Features**
523 | 
524 | Our key criteria for selecting the right features (beyond prediction quality) are:
525 | 
526 | 1.  **Prediction quality.** The more accurate forecasts we get, the better.
527 | 2.  **Interpretability and explainability.** We prefer features that are easy to describe and explain ("black box" solutions are neither transparent nor trustworthy, especially in the initial phases of the project).
528 | 3.  **Computation time** (and computational complexity). Features that take a lot of time to compute (also features from multiple data sources and with complex dependencies) are less preferred unless the improvement in prediction quality they give is worth it. This is because they slow down the training cycle and reduce the number of hypotheses we can test.
529 | 4.  **Risks** (and feature stability). Features that require external/multiple data sources, auxiliary models (or simply poorly designed features), and features based on data sources with low data quality make the pipeline more fragile, which should be avoided.
530 | 
531 | If a feature adds a statistically significant improvement to the model's performance but violates one of the other criteria (e.g., it takes 2 days to compute), we prefer not to add this feature to the pipeline.
532 | 
533 | Primary sources of new features:
534 | 
535 | - Adding more internal and external data sources (e.g., monitoring competitors).
536 | - Transforming and combining existing features.
537 | 
538 | Here is a list of features we will experiment with that will guide our further steps of model improvements after initial deployment:
539 | 
540 | 1.  Competitors' prices and how they differ from our prices (external sources).
541 | 2.  Special promotion and discount calendars.
542 | 3.  Prices (old price, discounted price).
543 | 4.  Penetration (ratio between sales of an SKU and sales of a category).
544 | 5.  SKU's attributes (brand, categories of different levels).
545 | 6.  Linear elasticity coefficient.
546 | 7.  A sum/min/max/mean/std of sales of SKU for previous N days.
547 | 8.  A median/quantiles of sales of SKU for previous N days.
548 | 9.  Predicted weather (external sources).
549 | 10. Store's traffic (external sources).
550 | 11. Store's sales volume.
551 | 12. Sales for this SKU 1 year ago.
552 | 13. Economic indicators (external sources).
553 | 
554 | We formulate them as a hypothesis. An example: *Using a promo calendar will help the model capture an instant increase in demand during marketing activities, which will decrease overstock in that period.*
555 | 
556 | We will use model-agnostic (SHAP, LIME, shuffle importance) and in-built methods (linear model's coefficients, number of splits in gradient boosting) to measure feature importance. Main goal: to understand the contribution of each feature to the model's outcome. If a feature doesn't contribute much, we drop it.
557 | 
558 | For automatic feature selection during the first stages (when we haven't determined the basic feature set yet), we use RFE (rotation feature elimination).
559 | 
560 | Also, we include feature tests in the training pipeline before and after training the model:
561 | 
562 | - Test feature ranges & outlier detectors (e.g. 0.0 <= discount < X).
563 | - Test that correlation between any pair of features less than X.
564 | - Test that feature's coefficient/number of splits > 0.
565 | - Test that computation time is less than 6 hours for any feature.
566 | - Etc.
567 | 
568 | To compute and access features more easily, we can use a centralized feature store. This store collects data from different DWH sources and, after various transformations and aggregations, merges it into one datamart (SKU, store, day). It recalculates features daily, making it easy to experiment with new ones and track their versions, dependencies, and other meta-info.
569 | 
570 | 
571 | ### **IX. Measuring and reporting**
572 | 
573 | ### **i. Measuring results**
574 | 
575 | As a first step to improve the prediction quality of the already deployed sales forecasting model, we plan to experiment with combining existing models (one per each category) into one. The reasoning behind that is that we will encode information about the group of items without loss in specificity but will gain more data and, thus, better generalization. Even with no improvement in quality, it is still much easier to maintain one model than many.
576 | 
577 | As offline metrics, we looked into different quantiles, percentage errors, and biases. But instead of evaluating only the overall score, in addition to that, we checked on metrics and the error analysis of specific categories. These offline tests yielded the following intermediate results:
578 | 
579 | - The general prediction quality steadily increases across all metrics, and the majority of validation folds when switching to a split model.
580 | - The categories with a small number of products showed an increase in the offline metrics compared to a baseline (multi) model. The amount of data they had was insufficient to learn meaningful patterns. The large categories didn't show significant wins when switching to a unified model. The result is reproducible through different seasons and key geo regions.
581 | 
582 | Previously, A/B tests provided some estimation on what uplift to expect in each major category based on offline metrics.
583 | 
584 | *If we reduce metric M for product P by X%, it leads to a decrease in missed profit (out-of-stock) by Y% and cuts losses due to the overstock situation by Z%.*
585 | 
586 | According to our estimates, the total increase in revenue for the pilot group is expected to be 0.3–0.7%.
587 | 
588 | ### **ii. A/B tests**
589 | 
590 | Hypothesis: According to the offline metrics improvement, we expect revenue to increase by at least 0.3%.
591 | 
592 | **Key metrics.** The best proxy metric for revenue we can use in the A/B experiment is the average check. It perfectly correlates with the revenue (assuming the number of checks has not changed). The atomic observation for the statistical test will be a single check.
593 | 
594 | **Splitting strategy.** We split by distribution center and, through them, acquire two sets of stores, as each center serves a cluster of stores. From those sets, we pick subsets that are representative of each other and use them as groups A and B.
595 | 
596 | **Additional metrics.** Control metrics for the experiment:
597 | 
598 | - Number of checks per day: whether the sales volume has no significant drop.
599 | - Model's update frequency: does the model accumulate newly gathered data regularly?
600 | - Model's offline metrics: quantile metric, bias, WAPE.
601 | 
602 | Auxiliary metrics:
603 | 
604 | - Daily revenue.
605 | - Daily profit.
606 | 
607 | **Statistical criteria.** We'll use Welch's t-test to capture the difference between samples.
608 | 
609 | **Error rates.** We set the significant level to 5% and the type II error to 10%.
610 | 
611 | **Experiment duration.** According to our calculations, two weeks will be enough to check the results. However, given the distribution center's replenishment cadence of one week, we will extend this period to one full month.
612 | 
613 | ### **iii. Reporting results**
614 | 
615 | A report containing the following chapters is to be provided:
616 | 
617 | - **Results,** shown as 95% confidence intervals for primary and auxiliary metrics.
618 | - **Graphical representation** (value of specific metric at a specific date) of all metrics from both control and treatment group for ease of consumption.
619 | - **Absolute numbers.** E.g., the number of stores in each group, the total number of checks, and total revenue.
620 | - **Methodology** to be supplied in the appendix. E.g., how groups were picked to be representative of each other, simulations run to check for type 1 and type 2 errors, etc.
621 | - **Recommendation / Further steps:** what to do next based on the received results.
622 | 
623 | ### **X. Integration**
624 | 
625 | ### **i. Fallback Strategies**
626 | 
627 | Fallbacks are crucial for maintaining operational efficiency in the face of unforeseen circumstances. Supermegaretail has adopted a multi-tiered fallback system:
628 | 
629 | - **Primary fallback.** The primary model is trained on a subset of the most significant features. It will be used if no feature drift or problems are detected within this subset.
630 | - **Secondary fallback.** Our next layer of fallback involves time-series models like SARIMA or Prophet, which we explored in section 4.4. These models are less dependent on external features, allowing for more robust predictions if drift occurs.
631 | - **Tertiary fallback.** As a last resort, we would predict sales akin to the previous week's data, with modifications for expected events and holidays.
632 | 
633 | The system monitors for data drifts and quality issues, triggering alarms that would automatically switch to the appropriate fallback to ensure the most accurate predictions possible.
634 | 
635 | ### **ii. API design**
636 | 
637 | - **HTTP API Handler.** This component will manage requests and responses, interfacing with users in a structured JSON format.
638 | - **Model API.** This will extract predictions directly from the model.
639 | 
640 | Request format:
641 | 
642 | GET /predictions?query=<query_string>&parameters=<parameters>&version=<version>&limit=<limit>&request_id=<request_id>&sku=<sku>&entity_id=<entity_id>&group=<group_type>
643 | 
644 | Response format:
645 | 
646 | {
647 | 
648 |   "predictions": [
649 | 
650 |     {
651 | 
652 |       "sku": <sku_id>,
653 | 
654 |       "demand": <demand>,
655 | 
656 |       "entity": <entity_id>,
657 | 
658 |       "period": <time_period_for_demand>,
659 | 
660 |     },
661 | 
662 |     ...
663 | 
664 |   ]
665 | 
666 | }
667 | 
668 | ### **iii. Release Cycle**
669 | 
670 | **Release of the wrapper vs. release of the model**
671 | 
672 | Within our integration strategy, the release of the wrapper and the release of the model represent two distinct processes. Below are the nuances for each.
673 | 
674 | **For the release of the wrapper (infrastructure):**
675 | 
676 | - **Frequency & timeline.** The release typically happens less frequently than that of the model. As demand patterns can shift overnight, it is important to be able to incorporate them into the model through training.
677 | - **Dependencies.** Infrastructure releases are mostly dependent on software updates, third-party services, or system requirements. Any changes in such areas may necessitate a new release.
678 | - **Testing.** Comprehensive integration testing is a must to ensure all components work harmoniously. It is also crucial to ensure backward compatibility, so existing services are not disrupted.
679 | - **Rollout.** Usually employs standard software deployment strategies. Depending on the nature of the changes, a blue-green deployment might not always be necessary, especially if the changes are not user-facing and do not affect batch jobs.
680 | - **Monitoring.** The focus will be on system health, uptime, response times, and any error rates.
681 | 
682 | **For the release of the model:**
683 | 
684 | - **Frequency & timeline.** Model releases are more frequent and are tied to the availability of new data, changes in data patterns, or significant improvements in modeling techniques.
685 | - **Dependencies.** Predominantly rely on the quality and quantity of new training data. Any drifts in data patterns or introduction of new data sources can trigger the model's update.
686 | - **Testing.** Before rolling out, the model undergoes a rigorous offline validation. Once validated, it might be tested in a shadow mode, where its predictions run alongside the current model but are not used. This helps in comparing and validating the new model's performance in a real-world scenario without any risks.
687 | - **Rollout.** When introducing a new model, it's not just about deploying the model file. There's also a need to ensure that any preprocessing steps, feature engineering, and other pipelines are consistent with what the model expects.
688 | - **Monitoring.** The primary focus remains on model performance metrics. Also, keeping an eye on data drift is essential. See the Monitoring chapter for more details.
689 | 
690 | **Interplay between wrapper and model releases**
691 | 
692 | In cases where the infrastructure has updates that would affect the model (e.g., changes in data pipelines), coordination between the two releases becomes vital. Additionally, any significant changes to the model's architecture might require updates to the wrapper to accommodate the changes.
693 | 
694 | By treating them as separate processes yet ensuring they're coordinated, we maintain the system's stability while continuously improving its capabilities.
695 | 
696 | ### **iv. Operational concerns**
697 | 
698 | Feedback is integral for continuous improvement. A feedback mechanism, inclusive of an override function, will be available to internal users. Not only does this aid in refining the predictions, but also gives business users a sense of control and adaptability based on real-time insights.
699 | 
700 | ### **v. Non-engineering considerations**
701 | 
702 | The integration strategy will also take into account non-engineering factors.
703 | 
704 | For instance:
705 | 
706 | - **Admin panels.** They will be crucial for managing the system and obtaining a high-level overview.
707 | - **Integration with company-level dashboards.** For company-wide visibility and decision-making.
708 | - **Additional reports.** Essential for deeper insights and analysis.
709 | - **Overrides.** A necessary feature to account for manual adjustments based on unforeseen or unique circumstances.
710 | 
711 | Furthermore, standard CI tools used in the company, along with a typical scheduler, will be integrated to maintain consistency and optimize workflow.
712 | 
713 | ### **vi. Green-Blue Deployment**
714 | 
715 | Given our audience primarily consists of internal customers and the frequent batch jobs, there's no immediate need for green-blue deployment. The absence of end-user traffic eliminates the need for such staggered deployments, simplifying our rollout strategy.
716 | 
717 | ### **XI. Monitoring**
718 | 
719 | ### **i. Existing Infrastructure Analysis**
720 | 
721 | Unfortunately, demand forecasting is among the pioneering machine learning projects for Supermegaretail, meaning that there is no proper ML monitoring infrastructure in place. Luckily, quick preliminary research proved to be fruitful, and we found Evidently AI - an open-source Python library (https://github.com/evidentlyai/evidently) that helps with monitoring. With the motto - "We build tools to evaluate, test and monitor machine learning models, so you don't have to" - this perfectly suits our goals unless and until we decide to build our own platform (see Preliminary research chapter, subchapter - Build or buy, open source-based or proprietary tech). According to the description, they cover: Model Quality, Data Drift, Target Drift and Data Quality. This means that we still have to build some foundations for this to implement.
722 | 
723 | ### **ii. Logging**
724 | 
725 | We will keep model prediction logs in column-oriented DBMS. We should record data on every prediction: the features that went in and the model output + timestamps. We will use an open-source solution such as ClickHouse as it is already used for other similar needs in the company.
726 | 
727 | In addition to that, we will log basic statistics: RPS, resource utilization, error rate, p90, p99, p999 latency and error rate. The number of model calls per hour, day, and week, and the average, median, min, and max prediction from the model on the same aggregation level. We will use Kafka, Prometheus, and Grafana for that. We will keep the last month of data.
728 | 
729 | We will use this stack as well for real-time ML monitoring and visualization (https://grafana.com/docs/grafana-cloud/data-configuration/metrics/metrics-prometheus/prometheus-config-examples/evidently-ai-evidently/).
730 | 
731 | ### **iii. Data quality**
732 | 
733 | In addition to basic ETL and data quality checks, we will monitor for the following:
734 | 
735 | - Missing data, as a percentage from the whole dataset and separately as a percentage of the most important features according to feature importance (see the subchapter). We will use historical data (cleaning out occurrences with broken pipelines) to calculate the z-score. We will set an alarm of 3-z scores for important features and 4-z scores for the rest. In addition, we will use several Test Suite presets from the Evidently AI library. There is a preset to check for Data Quality and another for Data Stability.
736 | - Schema compliance. Are all the features there? Do their types match? Are there new columns?
737 | - Feature ranges and stats. To ensure that the learning model is being fed good quality data, we will manually define expected ranges for each important feature, as three z-scores + checks for invalid stats (e.g., negative data for the amount of sales, min >= 0).
738 | - Correlations. To detect any abnormalities in the data we will plot the correlation matrix between features and compare the difference between the two plots. A basic alarm will be set for residuals higher than |0.15|.
739 | 
740 | ### **iv. Model quality**
741 | 
742 | We are very fortunate to have the availability of true labels with very minor delay. We receive daily sales information 15 minutes after they happen. With that in mind, we will monitor quantile metrics for quantiles of 1.5, 25, 50, 75, 95, and 99, both as is and with weights equal to the SKU price. In addition, we will monitor RMSE and MAE to track the mean and median. We will set up final thresholds after we receive the first three months of the data; we will pick initial thresholds based on the historical data and model performance on the validation.
743 | 
744 | In addition, we set up alarms for Negative values and max values. It will fire if the new max value is 50% higher than the previously seen max value.
745 | 
746 | We will set up prediction drift monitoring. We will use it as an early alarm for the next day, week and month predictions. We will test two approaches: Population Stability Index > 0.2 and Wasserstein distance > 0.1. For Wasserstein distance, we will apply growth multiplayer to the control dataset. For example, when comparing April 2021 to April 2022, knowing that overall growth is expected to be 15% we will multiply everything in 2021 by 1.15. We will further adjust this based on historical data experimentation.
747 | 
748 | ### **v. Data drift**
749 | 
750 | While we have true data available with a minor delay, we don't need to track input drift as a proxy for understanding the model relevance. However, it would still help us detect upcoming change before it affects the model quality. After reading this post: *Which test is the best? We compared 5 methods to detect data drift on large datasets,* https://www.evidentlyai.com/blog/data-drift-detection-large-datasets, we decided to pick Wasserstein distance to alert us in case of data drift. We start from the threshold of the mean drift score.
751 | 
752 | We can later try to apply this paper: *Feature Shift Detection: Localizing Which Features Have Shifted via Conditional Distribution Tests,* https://papers.nips.cc/paper/2020/file/e2d52448d36918c575fa79d88647ba66-Paper.pdf.
753 | 
754 | ### **vi. Business metrics**
755 | 
756 | Business metrics of interest remain the same as we describe in chapter 2.1.1 Metrics to pick - Revenue (expected to increase), level of stock (expected to decrease or maintain the same), Margin (expected to increase), which we will monitor through a series of A/B tests switching and swapping control groups over the course of time.
757 | 
758 | ### **XII. Serving and inference**
759 | 
760 | The primary considerations for serving and inference are:
761 | 
762 | - Efficient batch throughput, as forecasts will be run daily, weekly, and monthly on large volumes of data.
763 | - Security of the sensitive inventory and sales data.
764 | - Cost-effective architecture that can scale batch jobs.
765 | - Monitoring data and prediction quality.
766 | 
767 | ### **i. Serving architecture**
768 | 
769 | We will serve the batch demand forecasting jobs using Docker containers orchestrated by AWS Batch on EC2 machines. AWS Batch will allow defining resource requirements, dynamically scaling the required number of containers, and queuing large workloads.
770 | 
771 | The batch jobs will be triggered on a schedule to process the input data from S3, run inferences, and output results back to S3. A simple Flask API will allow on-demand batch inference requests if required.
772 | 
773 | All data transferring and processing will occur on secured AWS infrastructure, isolated from external access. Proper credentials will be used for authentication and authorization.
774 | 
775 | ### **ii. Infrastructure**
776 | 
777 | The batch servers will use auto-scaling groups to match workload demands. Spot instances can be used to reduce costs for flexible batch jobs.
778 | 
779 | No specialized hardware or optimizations are required at this stage, as batch throughput is the priority, and the batch nature allows ample parallelization. We will leverage the horizontal scalability options provided by AWS Batch and S3.
780 | 
781 | ### **iii. Monitoring**
782 | 
783 | Key metrics to track for the batch jobs include:
784 | 
785 | - Job success rate, duration, and failure rate.
786 | - Number of rows processed per job.
787 | - Server utilization: CPU, memory, disk space.
788 | - Prediction accuracy compared to actual demand.
789 | - Data validation checks and alerts.
790 | 
791 | The above monitoring will help ensure the batch process remains efficient and scalable and produces high-quality predictions. We can assess optimization needs in the future based on production data.
792 | 


--------------------------------------------------------------------------------
/Design_Doc_Examples/Examples/README.md:
--------------------------------------------------------------------------------
 1 | # Real-World Inspired ML System Design Examples
 2 | 
 3 | This directory contains ML system design documents based on real-world scenarios. While these examples are inspired by actual systems, they may be adapted or modified for educational purposes.
 4 | 
 5 | ## Current Examples
 6 | 
 7 | ### EN (English)
 8 | 
 9 | 1. **Retail Demand Forecasting**
10 |    - Domain: Retail/Supply Chain
11 |    - Problem: Optimizing inventory management through ML-based demand forecasting
12 |    - Key Features:
13 |      - Multi-level forecasting (store/product/category)
14 |      - Integration of external data sources
15 |      - Real-time adaptation
16 |      - Risk management and monitoring
17 | 
18 | ## Contributing New Examples
19 | 
20 | To contribute a new example:
21 | 
22 | 1. Choose an appropriate name: `[Domain]_[Problem]_Design.md`
23 | 2. Use the template from `templates/basic_ml_design_doc.md`
24 | 3. Include:
25 |    - Clear problem definition
26 |    - System architecture
27 |    - Data pipeline design
28 |    - Model selection rationale
29 |    - Evaluation strategy
30 |    - Implementation plan
31 | 4. Add supporting diagrams if helpful
32 | 5. Follow the contribution guidelines in the root `CONTRIBUTING.md`
33 | 
34 | ## Example Structure
35 | 
36 | Each example should:
37 | 1. Follow the standard template
38 | 2. Include an executive summary
39 | 3. Provide clear architectural diagrams
40 | 4. Detail evaluation metrics
41 | 5. Discuss implementation challenges
42 | 6. Include monitoring and maintenance plans
43 | 
44 | ## Language Organization
45 | 
46 | - `EN/` - English examples
47 | - Additional language folders can be added following the same structure 


--------------------------------------------------------------------------------
/Design_Doc_Examples/Mock/EN/Mock_ML_System_Design_RAG_Chat_With_Doc_Versions/Mock_ML_System_Design_RAG_Chat_With_Doc_Versions.md:
--------------------------------------------------------------------------------
   1 | # MagicSharepoint
   2 | 
   3 | ### **I. Problem definition**
   4 | 
   5 | ### **i. Origin**
   6 | 
   7 | MagicSharepoint is a platform designed for collaborative work and document sharing among clients. Documents can be in text format (Markdown) or scanned/image formats.
   8 | 
   9 | - Expected Document Size: Up to 500 pages (system restriction, hard limit with error notification).
  10 | - Structure: Documents typically include a table of contents and dedicated sections, such as introduction or glossary.
  11 | - Content: Documents may include text with all Markdown features (e.g., quotes, headings, formulas, tables).
  12 | 
  13 | Clients can edit documents online via the platform or upload documents from their local machines. Each document receives a version number upon:
  14 | 
  15 | - Saving
  16 | - Uploading a new document
  17 | - Uploading a version of the existing document
  18 | 
  19 | Clients can access all versions of each document.
  20 | 
  21 | The project's goal is to provide clients with a tool to get answers about document content and version changes more easily and quickly than by proofreading and comparing documents on their own.
  22 | 
  23 | ### **ii. Relevance & Reasons**
  24 | 
  25 | **ii.i. Existing flow**
  26 | 
  27 | To get answers about the content of a document, clients need to read through the document or use search functionality. 
  28 | 
  29 | Since documents are domain-specific, clients must have relevant expertise depending on the nature of the question. Answers must be cataloged manually in an external tool. 
  30 | 
  31 | Additionally, if another client has the same or a similar question, they have no way of knowing that the question was already asked and answered.
  32 | 
  33 | **ii.ii. Other reasons**
  34 | 
  35 | The proposed tool could be reused to support frequent questions or bulk inquiries.
  36 | 
  37 | In the future, we may reuse some components of Q&A solutions as a starting point for building a Knowledge Center for clients, allowing documents to be represented as a graph of facts or knowledge.
  38 | 
  39 | ### **iii. Expectations**
  40 | 
  41 | Clients expect answers to be:
  42 | - Fast
  43 |     - First token within 1 minute.
  44 |         - "1 minute" was identified by deep interviews, can be specified later.
  45 | - Trustworthy
  46 |     - Limited hallucinations or 'extended' answers. At least 95% of the answers should not contain fact mismatching of level 1 and level 2 (described below).
  47 |     - In case clients have any doubts, they won't need to proofread the whole document to resolve the uncertainty. 
  48 | - Interactive
  49 |     - Ability to provide more details/follow-up questions if the answer is insufficient.
  50 |     - Automatically request more details if unable to generate sufficient answer.
  51 | - Direct
  52 |     - Indicate when an answer cannot be provided due to lack of context.
  53 | 
  54 | Clients need to get answers about:
  55 | - A single document in its latest version.
  56 | - Multiple documents in their latest versions.
  57 | - A single document and its various versions.
  58 | - Multiple documents and their various versions.
  59 | 
  60 | Clients want to select documents:
  61 | - Explicitly, using filters.
  62 | - Implicitly, through dialogue.
  63 | 
  64 | The expectation is that the response would be based on the content of the selected documents (explicit or implicit).
  65 | 
  66 | Fact mismatching levels:
  67 | 1. Fact Presence
  68 | - Numbers / Terms / Facts in the answer were not present in the document
  69 | 2. Fact Integrity
  70 | - Numbers / Terms / Facts in the answer were present in the document but had different context and meaning
  71 | 3. Reasoning
  72 | - Numbers / Terms / Facts in the answer were correct and had the same context and meaning, but the conclusion was off
  73 | 
  74 | Use case examples:
  75 | 
  76 | We will categorize the use cases as follows:
  77 | 
  78 | - Addressable - The system's response must be useful and relevant.
  79 | - Non-addressable - The system cannot provide an answer, or the question is beyond the scope of its capabilities. In such cases, the appropriate action is to implement a proper restriction. Although those questions might not bring benefits, it's very crucial to have a proper fallback.
  80 | 
  81 | We will use split the use cases to better define the overall problem space for the solution.
  82 | We will index the use cases as `Na` where `N` is a priority and `a/na` is an addressability flag.
  83 | 
  84 | Addressable use cases:
  85 | - `1a` Specific Question about Document Content Available in the Document
  86 |     - e.g., how the Attention mechanism works from 'Attention is All You Need'.
  87 | - `2a` Specific Question about Document Version Changes
  88 |     - e.g., how section names changed between v2 and v12 from 'Machine Learning System Design'.
  89 | - `3a` Specific Question about Multiple Document Version Changes
  90 |     - e.g., differences between the first available draft and the published version for all books in the 'Harry Potter' series.
  91 | - `4a` Specific Question about Document Metadata (propose to remove - too little data to use any algorithms)
  92 |     - e.g., author and date.
  93 | 
  94 | Non-addressable use cases
  95 | - `1na` Abstract/Not Relevant Question about Document Content
  96 |     - e.g., how are you doing.
  97 | - `2na` Specific Question about Document Content Not Available in the Document
  98 |     - e.g., how the Attention mechanism works from 'Bible'.
  99 | 
 100 | 
 101 | ### **iv. Previous work**
 102 | 
 103 | - Implemented 'smart' full-text search to help navigate faster and easier.
 104 |     - Similar to the "match" query available in the Elasticsearch out of the box.
 105 | - Cataloged frequent questions and used Mechanical Turk to get answers in advance.
 106 | 
 107 | ### **v. Usage volumes and patterns**
 108 | 
 109 | Every month:
 110 | - Platform has ~1000 unique users.
 111 | - Each user is having 10 document reading sessions.
 112 | - Reading session lasts between 30 and 120 minutes.
 113 | - Version is assigned to 500 documents.
 114 |     - New documents or edited existing documents.
 115 | - 10% of the documents are image based.
 116 | 
 117 | ### **vi. Other details**
 118 | 
 119 | - Cloud Object Storage with automated version cataloging.
 120 | - OCR is not implemented.
 121 | - Documents could be sent to service vendors, provided they are not used for training as per SLA (e.g., OpenAI, Anthropic, etc.).
 122 | 
 123 | ### **II. Metrics and losses**
 124 | 
 125 | **i. Metrics**
 126 | 
 127 | The task could be split into independent subtasks: OCR, Intent classification -> Search/Retrieval -> Answer generation. These parts can be evaluated independently to prioritize improvements based on the source of errors as well as overall solution performance.
 128 | 
 129 | 
 130 | ***Question Intent Classification Metrics:***
 131 | 
 132 | Pre-requirements: Define all possible labels (intents). Prepare dataset of questions/sentences and appropriate labels.
 133 | 
 134 | It shows how accurately a model can classify the intent of questions, which is crucial for the whole processing pipeline. Below metrics could be used for QIC (macro + per class): 
 135 | 
 136 | 1. Precision
 137 | 2. Recall
 138 | 3. F1
 139 | 
 140 |    
 141 | ***Data Extraction Metrics:***
 142 | 
 143 | Pre-requirements: Dataset of scanned documents and their corresponding texts. (As a workaround: readable documents could be scanned manually, which gives both - scanned image and ground truth text values)
 144 | 
 145 | It's reasonable to measure OCR quality separately, as in the case of poor OCR quality, an accurate result can't be reached. In the first stage of the project, let's skip this metrics step, calculate high-level metrics on markup docs vs scanned images, only in case of significant difference in numbers data extraction metrics are calculated.
 146 | In the future, we will need to design a fallback mechanism for processing documents with poor OCR performance.
 147 | 
 148 | **a. Word Error Rate**
 149 | 
 150 | It operates on the word level and shows the percentage of words that are misspelled in the OCR output compared to the ground truth. LLMs usually cope well with misprints; however, there still could be errors in important data, so WER could be used as a quick check of OCR quality. The lower the better. It could be replaced with the Character Error Rate.
 151 | 
 152 | $`word\_error\_rate = \frac{amount\_of\_misspelled\_words}{total\_amount\_of\_words}`$
 153 | 
 154 | **b. Formula Error Rate**
 155 | 
 156 | Formulas could be present in the document as well. They are not a typical OCR problem, so if they are not recognized well, it degrades the system performance. The formula error rate could be measured as the percentage of incorrect OCRed formulas to the total amount of formulas.
 157 | 
 158 | $`formula\_error\_rate = \frac{amount\_of\_misspelled\_formulas}{total\_amount\_of\_formulas}`$
 159 | 
 160 | **c. Cell Error Rate**
 161 | 
 162 | As it is important to extract table-structured data as well, the percentage of incorrectly detected table cells compared to the ground truth could be used as one of the metrics.
 163 | 
 164 | $`cell\_error\_rate = \frac{amount\_of\_incorrectly\_detected\_cells}{total\_amount\_of\_cells}`$
 165 | 
 166 | **d. Data Extraction Error Severity**
 167 | 
 168 | Depending on the documents' domain, we may introduce quality thresholds for the data extraction process depending on specific errors.
 169 | We may introduce more detailed errors in the future (e.g., Formula Error related to a special character).
 170 | 
 171 | ***Retrieval Metrics:***
 172 | 
 173 | Pre-requirements: Dataset of queries collected from experts and list of N most relevant chunks for each of the query. 
 174 | 
 175 | **a. Recall@k**
 176 | 
 177 | It determines how many relevant results from all existing relevant results for the query are returned in the retrieval step, where K is the number of results considered, a hyperparameter of the model. 
 178 | 
 179 | **b. Normalized discounted cumulative gain (NDCG)**
 180 | NDCG is a metric that calculates the average of DCGs for a given set of results, which is a measure of the sum of relevance scores taken from the first N results divided by the ideal DCG. In its turn, DCG is the sum of relevance scores among the first N results in the order of decreasing relevance. 
 181 | 
 182 | ***Answer Generation Metrics:***
 183 | 
 184 | **a. Average Relevance Score**
 185 | 
 186 | Measures how well the generated answers match the context and query. 
 187 | There are several approaches to calculate that metric:
 188 | - automatically with framework ragas [https://docs.ragas.io/en/stable/] (detailed description provided in section IV. Validation Schema)
 189 | - with other llms (paper to consider "Automatic Evaluation of Attribution by Large Language Models" [https://arxiv.org/pdf/2305.06311])
 190 | - manually based on experts output (approach is provided in section IX. Measuring and reporting)
 191 | 
 192 |   
 193 | **b. Hallucination Rate**
 194 | 
 195 | As one of the requirements is to avoid hallucinating, it is possible to calculate the percentage of incorrect or fabricated information in the generated answers. 
 196 | 
 197 | How to calculate: 
 198 | 
 199 | - Manually: prepare dataset of queries (including queries without answer in dataset)  + expected responses; calculate by comparing expected response to provided
 200 | - Finetune smaller llms to detect hallucination
 201 | - Add guardrails [https://github.com/NVIDIA/NeMo-Guardrails] - it will not only improve reponse, but also helps to calculate amount of times, model attempts to hallucinate
 202 | 
 203 | $`hallucination\_rate = \frac{amount\_of\_hallucinated\_responses}{total\_amount\_of\_responses}`$
 204 | 
 205 | 
 206 | **c. Clarification Capability**
 207 | 
 208 | Pre-requirements: Dataset of queries (ideally with unambiguous answer) + expected response, domain experts to evaluate the metric manually.
 209 | 
 210 | As one of the requirements is the ability to automatically request more details if an insufficient answer is generated, the average number of interactions or follow-up questions needed to clarify or correct an answer could be calculated to measure clarification capability and average relevance of follow-up questions. This metric helps to check the system's ability to provide comprehensive answers initially or minimise the number of interactions needed for detalization.
 211 | 
 212 | $`clarification\_capability = \frac{number\_of\_clarification\_questions}{total\_amount\_of\_queries}`$
 213 | 
 214 | 
 215 | *Metrics to pick:*
 216 | 
 217 | A lot of metrics were provided, but it's a good idea to go in the reverse direction: start from the more general and dive deeper into partial ones only when necessary.
 218 | 
 219 | **TODO**: build hierarchy of metrics
 220 | 
 221 | *Online metrics of interest during A/B tests are (more details are provided in section IX. Measuring and reporting):*
 222 | - ~~Time to Retrieve (TTR)~~
 223 | - Average Relevance score
 224 | - Average amount of clarification questions
 225 | - Average time of dialogue
 226 | 
 227 | 
 228 | 
 229 | 
 230 | ### **III. Dataset**
 231 | 
 232 | We have two types of data:
 233 | * data that was used to train the main LLM model;
 234 | * data to perform RAG on.
 235 | 
 236 | We don't control the data for the main LLM training (meaning that we're coping with the LLM limitations and don't influence this until the moment we realize that we could really benefit from fine-tuning, which will become completely different project).
 237 | 
 238 | #### i. Available Data (to perform RAG)
 239 | 
 240 | We don't distinguish between client roles for data access. Every client would have access to every document. So we basically have a shared dataset for the whole system.
 241 | 
 242 | Includes a set of documents available on the Platform. Documents can be in text format (Markdown) or scanned/image formats.
 243 | 
 244 | 1. For Markdown documents
 245 |     - Expected Document Size: Up to 500 pages.
 246 |     - Structure: Documents typically include a table of contents and dedicated sections, such as introduction or glossary.
 247 |     - Content: Documents may include text with all Markdown features (e.g., quotes, headings, formulas, tables).
 248 | 2. For documents in Image form (scans/images): no additional description available. They are just files that contain image/scan inside.
 249 | 3. Each document have origination metadata
 250 | 4. Documents may have v1-v2-v3-... versions.
 251 | 
 252 | Clients can edit documents online via the platform or upload documents from their local machines. Each document receives a version number upon:
 253 | 
 254 | - Saving after editing on the platform
 255 | - Uploading a document
 256 |     - As a new document
 257 |     - As a version of an existing document
 258 | 
 259 | Clients can access all versions of each document.
 260 | 
 261 | 
 262 | #### ii. Document Versioning
 263 | 
 264 | Some documents would be edited directly on our platform. It this case we will know both: the applied diff and the next version of a document. 
 265 | But for the most documents - we would have only the next version of a document, not the diff.
 266 | 
 267 | Required adjustments:
 268 | - Diff generation step
 269 | - Match the question intent against available data
 270 |     - Ex.: custom fallback if the question is about diff, but currently we don't have diff
 271 | - Treat diffs as a document
 272 | 
 273 | #### iii. Data Cleaning
 274 | 
 275 | Data Сleaning process should be automatized with reproducible scripts. Script runs once for each new document that gets uploaded to the system and then for each new version of that document.
 276 | 
 277 | We don't perform duplicate removal neither for markdown nor for images.scans, considering that if the client uploaded several duplicating documents, he has the reason to do this, and this is as it should be.
 278 | 
 279 | Cleaned documents and images should be stored separately from the original files in a `cleaned_data` directory or database. This ensures keeping the original versions for reference and debugging.
 280 | 
 281 | **iii.i. Markdown Documents:**
 282 | 
 283 | **TODO**: update this part after the project'll start and data will be shared. 
 284 | 
 285 | **iii.ii. Scanned/Image Documents:**
 286 | - Enhance the quality of scans (e.g., adjusting brightness/contrast, removing noise).
 287 | - Perform Optical Character Recognition (OCR) for scans. Store both initial scan and its recognized content.
 288 | 
 289 | **TODO**: add more details/ideas how to do it 
 290 | 
 291 | **iii.iii. Quality Controls:**
 292 | Introduce quality metrics and thresholds to indicate successfully cleaned documents. [Monitoring](#xi-monitoring)
 293 | 
 294 | Example:
 295 | - Text length per OCR'ed page.
 296 | - Processing errors raised and on which step.
 297 | 
 298 | #### iv. Data Chunking Strategy
 299 | 
 300 | Common approach to Document Chunking is:
 301 | - a document `[level 0]`
 302 | - an article `[level 1]`
 303 | - a paragraph `[level 2]`
 304 | - a sentence `[level 3]`
 305 | 
 306 | It allows to use the following path:
 307 | - search without specifying a particular document -> `[level 0]`
 308 | - user selects a document and asks if there is a particular section in it -> `[level 1]`
 309 | ... 
 310 | 
 311 | Large chunks can be an issue for Embedder.
 312 | If a specific chunk exceeds size limit, there are following approaches:
 313 | - Replace such chunk with several sub-chunks
 314 |     - Restricted by size limit
 315 |     - Overlapped by 20%
 316 | - Generate Summary
 317 |     - Restricted by size limit
 318 | Such chunk alternatives should be treated as a proper chunk for RAG purposes, but should be indicated in metadata.
 319 | 
 320 | #### v. Data Enhancing
 321 | 
 322 | The purpose of Data Enhancing is to empower the context search, by allowing the use of vector databases, understand the structure and presence of special entities.
 323 | 
 324 | **v.i. Embeddings:**
 325 | 
 326 | A good embedder should define the retrieval layer's ability to:
 327 | 
 328 | - Store content representations efficiently
 329 | - Encode the content and queries to be semantically similar
 330 | - Capture nuanced details due to domain-specific content
 331 | - Compare different levels on chunks between eachother
 332 | 
 333 | Considering that the retrieval layer is the first step in the pipeline, we do believe it could become a bottleneck in performance. This is because if the context is provided incorrectly, there is less ability for the upstream generation model to improve upon the irrelevant context.
 334 | 
 335 | Taking these factors into account, we might be ready to explore and evaluate the performance of various encoders at our disposal. We seek to avoid being restricted by any particular design solution, ensuring that there is room for continuous enhancement over time. With this perspective, we will consider the potential of implementing an in-house embedding solution.
 336 | 
 337 | Here are the benefits we highlight:
 338 | - Provides potential for improving this critical component without vendor lock-in
 339 | - Provides control over versioning, determinism, availability (not going to be depricated)
 340 | - Does not require us to provide per-token costs
 341 | - Could potentially benefit from interaction data enhancements
 342 | 
 343 | Drawbacks:
 344 | - Development and maintenance costs.
 345 | - Per-token costs may not be as optimized as those of larger companies.
 346 | 
 347 | When it comes to generation levels, considering the number of users and the app economy, there is no clear evidence that the company would like to invest in training or fine-tuning custom LLMs. Therefore, it might be beneficial to keep in mind the use of vendor-based API-accessible LLMs.
 348 | 
 349 | Here are the potential benefits:
 350 | - LLMs are continually improving, particularly in few-shot learning capabilities
 351 | - Competitive market dynamics are driving down the cost of API calls over time
 352 | - Switching vendors involves minimal effort since it only requires switching APIs, allowing for potential utilization of multiple vendors.
 353 | 
 354 | Drawbacks:
 355 | - Less control over the responses
 356 | - Data privacy (though not a significant concern)
 357 | - There is a possibility of service denial from a vendor on account of policy-related issues, such as content restrictions or economic sanctions
 358 | 
 359 | 
 360 | **v.ii. Documents Enriching:**
 361 | - Inter-document links
 362 |     - Links that reference another part of the same document.
 363 |     - This helps RAG to understand the context and content of the link.
 364 |     - Example: "Section 2" -> "Section 2: Data Cleaning Procedures"
 365 | - Plain URLs
 366 |     - Links to external web resources
 367 |     - Don't modify, keep them as-is for LLM consumption.
 368 | - Table of Contents (ToC)
 369 |     - Extract ToC for document.
 370 | - Named Entity Recognition
 371 |     - Recognise Entities in document.
 372 |     - May cover only common ones or train own NER.
 373 | - Summaries
 374 |     - Text summary for selected levels of chunks.
 375 | 
 376 | **v.iii. Quality Controls:**
 377 | Introduce quality metrics and thresholds to indicate successfully enhanced documents. [Monitoring](#xi-monitoring)
 378 | Example:
 379 | - Text length per OCR'ed page.
 380 | - Processing errors raised and on which step.
 381 | 
 382 | #### vi. Metadata
 383 | 
 384 | **Document Metadata:**
 385 | - Document title
 386 | - Author
 387 | - Creation date
 388 | - Last modified date
 389 | - Table of Contents (for text documents)
 390 | - Summary (need to discuss whether it's necessary)
 391 | - Version history
 392 |    - Version number
 393 |    - Editor
 394 |    - Version creation date
 395 |    - Changes made in the version (if available)
 396 |    - Diff information (if available)
 397 | 
 398 | **Cleaning Metadata:**
 399 | - Script versions used for OCR.
 400 | - Script versions used for cleaning.
 401 | - (optional) Store exact cleaning steps.
 402 | - Time to apply the OCR.
 403 | - Time to apply the cleaning.
 404 | 
 405 | **Enhancing Metadata:**
 406 | For different chunk levels:
 407 | - Script versions used for embedding generation.
 408 | - Script versions used for enriching.
 409 | - (optional) Store exact enriching steps.
 410 | - Time to apply the embedding generation.
 411 | - Time to apply the enriching.
 412 | 
 413 | **Handling Metadata:**
 414 | 
 415 | For Markdown documents, embed metadata in a YAML format at the top of each document. For images, metadata can be stored in a separate JSON file with the same name as the image.
 416 | 
 417 | #### Example Metadata Structure for a Markdown Document:
 418 | 
 419 | ```yaml
 420 | ---
 421 | title: "Sample Document"
 422 | author: "John Doe"
 423 | created_at: "2023-01-01"
 424 | last_modified: "2024-06-30"
 425 | toc:
 426 |   - chapter: Introduction
 427 |     starts_with: In this article we're about to introduce RAG implementation system for high-load cases.
 428 |     chapter_summary: Introduction to RAG implementation system for high-load cases with author's motivation and real-world examples
 429 |   - chapter: Chapter 1
 430 |     starts_with: Let's consider a situation where we have a platform designed for collaborative work and document sharing among clients.
 431 |     chapter_summary: Problem statement and available data are described.
 432 |   - chapter: Chapter 2
 433 |     starts_with: In order to perform quality RAG, we need the data to be prepared for this.
 434 |     chapter_summary: Data cleaning schema and other aspects.
 435 |   - chapter: Conclusion
 436 |     starts_with: Now let's move on to conclusion.
 437 |     chapter_summary: Conclusion about the ways we can built a system
 438 | summary: "This document provides an overview of..."
 439 | version_info:
 440 |   - version: "v1"
 441 |     editor: "Jane Smith"
 442 |     change_date: "2023-02-01"
 443 |     diff: "Initial creation of the document."
 444 |   - version: "v2"
 445 |     editor: "John Doe"
 446 |     change_date: "2023-06-15"
 447 |     diff: "Added new chapter on advanced topics."
 448 |   - version: "v3"
 449 |     editor: "Jane Smith"
 450 |     change_date: "2024-06-30"
 451 |     diff: "Updated the introduction and conclusion sections."
 452 | ---
 453 | ```
 454 | 
 455 | ### **IV. Validation Schema**
 456 | 
 457 | For validation purposes, we will use a data set generated from the original documents using the [RAGAS](https://docs.ragas.io/en/stable/)  functionality. This approach allows us to create a comprehensive validation set that closely mirrors the real-world usage of our system.
 458 | 
 459 | **TODO**: add more details about what RAGAS is and how it is doing it
 460 | 
 461 | **TODO**: add complementary frameworks for better covering all important aspects
 462 | 
 463 | #### i. Question Selection and Dataset Creation
 464 | RAGAS takes the original documents and their associated metadata and generates a structured dataset with the following components
 465 | 
 466 | * Question: Simulation of user queries
 467 | * Context: Relevant parts of the document(s)
 468 | * Answer: The expected answer
 469 | 
 470 | This structure allows us to evaluate both the retrieval and generation aspects of our RAG system.
 471 | 
 472 | To create a comprehensive and representative validation dataset, we'll employ a multi-faceted approach to question selection:
 473 | 
 474 | 1. Automated Question Generation
 475 |     * Use natural language processing (NLP) techniques to automatically generate questions from the documents.
 476 |     * Apply techniques such as named entity recognition, key phrase extraction and syntactic parsing to identify potential question targets.
 477 |     * Use question generation models (e.g. T5 or BART fine-tuned for question generation) to create different types of questions.
 478 | 
 479 | 2. Human-in-the-Loop Curation
 480 |     * Engage subject matter experts to review and refine auto-generated questions.
 481 |     * Have experts create additional questions, especially for complex scenarios or edge cases that automated systems might miss.
 482 |     * Ensure questions cover various difficulty levels and reasoning types.
 483 | 
 484 | 3.  Real User Query Mining
 485 |     * Analyse logs of actual user queries (if available) to identify common question patterns and topics.
 486 |     * Include anonymised versions of real user questions in the dataset to ensure relevance to actual use cases.
 487 | 
 488 | 4. Question Diversity. Ensure a balanced distribution of question types:
 489 |     * Factual questions (e.g. "Who is the author of this document?")
 490 |     * Inferential questions (e.g. "What are the implications of the findings in section 3?)
 491 |     * Comparative questions (e.g. "How does the methodology in version 2 differ from that in version 1?)
 492 |     * Multi-document questions (e.g. "Summarise the common themes across these three related documents.)
 493 |     * Version-specific questions (e.g. "What changes have been made to the conclusion between versions 3 and 4?)
 494 | 
 495 | 5. Context Selection
 496 |     * For each question, select a relevant context from the document(s).
 497 |     * Include both perfectly matching contexts and partially relevant contexts to test the system's ability to handle nuanced scenarios.
 498 | 
 499 | 6. Answer Generation
 500 |     * Generate a gold standard answer for each question-context pair.
 501 |     * Use a combination of automated methods and human expert review to ensure answer quality
 502 | 
 503 | 7. Metadata Inclusion
 504 |     * Include relevant metadata for each question-context-answer triplet, such as document version or section headings.
 505 | 
 506 | 8. Edge Case Scenarios
 507 |     * Deliberately include edge cases, such as questions about rare document types or extremely long documents.
 508 |     * Create questions that require an understanding of document structure, such as tables of contents or footnotes.
 509 | 
 510 | 9. Negative Examples
 511 |     * Include some questions that cannot be answered from the given context to test the system's ability to recognise when it doesn't have sufficient information.
 512 | 
 513 | 
 514 | #### ii. Periodic Updates
 515 | The validation dataset will be updated periodically to maintain its relevance and comprehensiveness. This includes:
 516 | 
 517 | * Addition of newly uploaded documents
 518 | * Including new versions of existing documents
 519 | * Updating the question set to reflect evolving user needs
 520 | 
 521 | We recommend updating the validation set monthly or whenever there's a significant influx of new documents or versions.
 522 | 
 523 | #### iii. Stratified Sampling
 524 | To ensure balanced representation, we'll use stratified sampling when creating the validation set. Strata may include:
 525 | 
 526 | * Document length (short, medium, long)
 527 | * Document type (text, scanned image)
 528 | * Topic areas
 529 | * Query complexity (simple factual, multi-step reasoning, version comparison)
 530 | 
 531 | 
 532 | ### V. Baseline Solution
 533 | 
 534 | #### Document Extraction Process 
 535 | 
 536 | ##### Baseline extraction
 537 | 
 538 | This baseline pipeline might cover only textual formats like .txt, .doc, or .pdb for simplicity. This significantly simplifies the first iteration by avoiding the need to handle OCR, which contains a machine learning model and requires managing its lifecycle.
 539 | 
 540 | 1. Format Reader: Differentiate and handle file types accordingly
 541 | 2. Markdown Formatting: Ensure that the extracted content is formatted correctly according to markdown standards.
 542 | 3. Error Management & Spell Checking: This part ensures extraction logging and raises awareness for the maintainer that some documents might not be reliable.
 543 | 
 544 | #### Retrieval-Augmented Generation Framework
 545 | 
 546 | The Retrieval-Augmented Generation (RAG) framework can be broken down into two main components:
 547 | 
 548 | - Retrieval
 549 | - Augmented Generation
 550 | 
 551 | Augmented Generation is a recent advancement, while the concept of document retrieval is something that has been with us since the emergine of web search. While there is little to no sense in building the second part using solutions other than LLMs, it might make sense to implement a simple baseline for the retrieval.
 552 | The high-level flow is depicted in the image below.
 553 | 
 554 | ![Retrieval Baseline](docs/retrieval_baseline.png)
 555 | 
 556 | Considering the solution that covers both Retrieval and Generation, we might divide the scope in terms of functionality into three parts:
 557 | 
 558 | - Basic solution: Raw generation from the model given a context.
 559 | - Reliable solution: Queries are checked to be correct, outputs are only generated when the context is considered relevant, the outputs are checked on correctness.
 560 | - Reliable and interactive solution: Reliable, with a context constructed by taking into account the dialogue interaction. The solution is expected to ask for clarification if the context is not clear or the output is ambiguous.
 561 | The idea here is that although the interactive mode brings value and covers the use cases described in Section 1, it comes with the cost of creating, managing, and monitoring a complex workflow.
 562 | 
 563 | 
 564 | Below you will find the schemas describing three options of increased complexity during implementation. The description is available further in the chapter.
 565 | 
 566 | **RAG: Basic solution**
 567 | 
 568 | ![RAG Baseline](docs/rag_simple.png)
 569 | 
 570 | **RAG: Reliable solution**
 571 | 
 572 | ![RAG Reliable](docs/rag_reliable.png)
 573 | 
 574 | **RAG: Reliable & Interactive**
 575 | 
 576 | ![RAG Reliable & Interactive](docs/rag_reliable_interactive.png)
 577 | 
 578 | #### Retrieval Baseline: Sparse Encoded Solution
 579 | 
 580 | Objectives:
 581 | - Create a robust baseline with minimal effort.
 582 | - Validate the hypothesis that an enhanced search capability is beneficial.
 583 | - Gather a dataset based on retrieval, incorporating both implicit and explicit feedback for future refinement.
 584 | 
 585 | Applicability:
 586 | This covers use case `1a`. The solution is not applicable to the use cases `1na` and `2na`, thus also addressed.
 587 | 
 588 | The system enables content search within documents using the BM25 algorithm.
 589 | 
 590 | Components:
 591 | 1. Preprocessing Layer
 592 |     - Tokenizes input data
 593 |     - Filters out irrelevant content
 594 |     - Applies stemming / lemmatization
 595 | 2. Indexing Layer
 596 |     - Maintains a DB-represented corpus
 597 |     - Creates indexes for Term Frequency (TF) and Inverse Document Frequency (IDF)
 598 | 3. Inference Layer
 599 |     - Given query passed trough the preprocessing layer, Executes parallelized scoring computations
 600 |     - Manages ranking and retrieval of results
 601 | 4. Representation Layer
 602 |     - Highlights the top-k results for the user
 603 |     - Handles an explicit user feedback dialogue ("Have you found what you were looking for?")
 604 | 
 605 | **TODO**: add criteria for irrelevant content and some examples
 606 | 
 607 | ##### Pros & Cons
 608 | 
 609 | Pros:
 610 | + Simple to implement, debug, and analyze
 611 | + Fast retrieval due to lightweight computation
 612 | + Scalable, as computation jobs can process document segments independently
 613 | + Popular, with many optimized implementations available
 614 | + Low maintenance costs, suitable for junior engineers
 615 | 
 616 | Cons:
 617 | - No semantic understanding: synonyms are not supported by default
 618 | - Bag-of-words approach: word order is not considered
 619 | - Requires updates to accommodate new vocabulary
 620 | 
 621 | #### RAG
 622 | 
 623 | ##### RAG: common functionality 
 624 | 
 625 | A basic RAG system consists of the following components components:
 626 | 
 627 | 1. Ingestion Layer:
 628 |     - Embedder
 629 |     - DB-indexing
 630 | 2. Retrieval Layer:
 631 |     - Embedder
 632 |     - DB similarity search. This part is usually provided by the same tools utilized for indexing.
 633 | 3. Chat Service:
 634 |     - Manages chat context
 635 |     - Prompt template constructor: supports dialogs for clarification
 636 |     - Stores chat history
 637 | 4. Synthesis Component:
 638 |     - Utilizes an LLM for response generation
 639 | 6. Representation Layer:
 640 |     - Provides a dialogue mode for user interaction.
 641 |     - User Feedback: Collects user input to continuously refine the system.
 642 | 
 643 | 
 644 | ##### RAG: Bridging the Qualitative Gap
 645 | 
 646 | Currently, the common functionality lacks modules to ensure the solution meets quality criteria, specifically in areas such as hallucination mitigation and tolerance against misuse. To address these gaps, we propose using guardrails for quality assurance. This includes a retry strategy and a fallback mechanism designed to enhance reliability and robustness. ([About Guardrails](https://docs.nvidia.com/nemo/guardrails/user_guides/guardrails-process.html))
 647 | 
 648 | NeMo guardrails provide the following levels of checks that are relevant to our project:
 649 | 
 650 | - Input rails
 651 | - Dialogue rails
 652 | - Retrieval rails
 653 | - Output rails
 654 | 
 655 | The first two are simple use cases with fallback: just determine if a query needs to be executed.
 656 | 
 657 | The retrieval rails are designed to help reject non-relevant chunks or alter them. Although this part might be crucial, it's hard to design the specific implementation in advance.
 658 | 
 659 | The output rails are used to check the correctness of the answer. This is probably the right place where we might consider additional designs for tackling hallucination issues.
 660 | 
 661 | ###### Output rails example
 662 | 
 663 | Here is an example of an alogirthm we might utilise. 
 664 | The fallback strategy could involve calling another or multiple LLMs. The Guardrails would evaluate these answers to select the best one that meets quality standards. This approach increases the likelihood of obtaining a satisfactory response.
 665 | 
 666 | The complexity might be increased or decreased depending on the metrics we obtain for the baseline, but this is something we need to keep in mind while choosing the framework in advance.
 667 | 
 668 | *** Algorithm ***
 669 | 
 670 | **Input:** Request from user
 671 | **Output:** Response to user
 672 | 
 673 | 1. **Primary Answer Generation**
 674 |     1.1 `main_answer` ← obtain answer from main process
 675 | 
 676 | 2. **Guardrails Evaluation**
 677 |     2.1 `guardrail_result` ← evaluate `main_answer` with Guardrails
 678 |     2.2 If `guardrail_result` is satisfactory:
 679 |         2.2.1 Return `main_answer` to user
 680 |     2.3 Else:
 681 |         2.3.1 `time_remaining` ← check remaining response time
 682 |         2.3.2 If `time_remaining` is sufficient to invoke fallback model:
 683 |             2.3.2.1 `fallback_answer` ← obtain answer from fallback pipeline
 684 |             2.3.2.2 `fallback_guardrail_result` ← evaluate `fallback_answer` with Guardrails
 685 |             2.3.2.3 If `fallback_guardrail_result` is satisfactory:
 686 |                 2.3.2.3.1 Return `fallback_answer` to user
 687 |             2.3.2.4 Else:
 688 |                 2.3.2.4.1 Return `override_response` to user
 689 |         2.3.3 Else:
 690 |             2.3.3.1 Return `override_response` to user
 691 | 
 692 | **End Algorithm**
 693 | 
 694 | 
 695 | ##### Locating the components
 696 | 
 697 | ###### Embedder
 698 | 
 699 | **Granularity of embeddings**
 700 | 
 701 | Surely enough, we could cover all the chunk levels [Data Chunking Strategy](#iv-data-chunking-strategy) by having separate embeddings for each levels and deciding which one to use based on the context, for example:
 702 | - search without specifying a particular document -> `[level 0]`
 703 | - user selects a document and asks if there is a particular section in it -> `[level 1]`
 704 | ... 
 705 | 
 706 | This approach might bring high accuracy, but it's complex and costly to implement. For the baseline solution, we would like to start with a single embedding representation. Based on the most common use case, this would be a paragraph encoding. Because according to our analysis, most of the problems' answers could be found given the context of a single paragraph.
 707 | 
 708 | #### Framework Selection
 709 | 
 710 | When considering a framework, we would like it to support the following features:
 711 | 1. Document storage
 712 | 2. Index storage
 713 | 3. Chat service
 714 | 4. Modular design for document extraction that supports custom modules
 715 | 5. Modular design for retrieval and generation that can utilize both local and vendor-based solutions
 716 | 6. Built-in logging and monitoring capabilities
 717 | 
 718 | We will compare a couple of popular frameworks that might suit our needs: LlamaIndex and LangChain.
 719 | 
 720 | Here are some resources that summarize the differences between the two frameworks:
 721 | 1. [LlamaIndex vs LangChain: Haystack – Choosing the Right One](https://www.linkedin.com/pulse/llamaindex-vs-langchain-haystack-choosing-right-one-subramaniam-yvere/)
 722 | 2. [LlamaIndex vs LangChain: Key Differences](https://softwaremind.com/blog/llamaindex-vs-langchain-key-differences/)
 723 | 3. [LangChain vs LlamaIndex: Main Differences](https://addepto.com/blog/langchain-vs-llamaindex-main-differences/)
 724 | 
 725 | | Feature/Aspect          | LangChain                                        | LlamaIndex                                      |
 726 | |-------------------------|--------------------------------------------------|-------------------------------------------------|
 727 | | **Main Purpose**        | Various tasks                                    | Querying and retrieving information using LLMs  |
 728 | | **Modularity**          | High, allows swapping of components              | Average, yet sufficient for our current design  |
 729 | | **Workflow Management** | High, supports managing chains of models/prompts | Average, primarily focused on querying          |
 730 | | **Integration**         | High: APIs, databases, guardrails,  etc.         | Average: APIs, data sources, guardrails         |
 731 | | **Tooling**             | Debugging, monitoring, optimization              | Debugging, monitoring                           |
 732 | | **LLM Flexibility**     | Supports various LLMs (local/APIs)               | Supports various LLMs (local/APIs)              |
 733 | | **Indexing**            | No primary focus on indexing                     | Core feature, creates indices for data          |
 734 | | **Query Interface**     | Complex workflows                                | Straightforward                                 |
 735 | | **Optimization**        | Optimization of LLM applications                 | Optimized for the retrieval of relevant data    |
 736 | | **Ease of Use**         | Challenging                                      | Easy                                            |
 737 | 
 738 | Given the pros and cons listed above, it appears that LlamaIndex provides all the features we are looking for, combined with an ease of use that could reduce development and maintenance costs. Additionally, LlamaIndex offers enterprise cloud versions of the platform. If our solution evolves towards a simpler design, we might want to move to the paid cloud version if it makes economical sense.
 739 | 
 740 | 
 741 | ### **VI. Error analysis**
 742 | 
 743 | **TODO**: update this part after the project'll start and data will be shared.
 744 | 
 745 | Given the multi-step nature of the solution, consider potential isuues on each of the steps:
 746 | 
 747 | **0. Intent classification**
 748 | - "under" filtering: Irrelevant questions are treated as meaningful, pipeline's running for  nothing
 749 | - "over" filtering: Filtering out relevant questions ⇒ the response won't be provided
 750 | 
 751 | **1. Embeddings**
 752 | - Poor Quality Embeddings: If the embeddings do not accurately capture the semantic meaning of the input text, the entire pipeline is compromised.
 753 | - Embedding Drift: The embeddings may become less effective over time due to adding docs from specific domain / terminology.
 754 | 
 755 | **2. Retrieval**
 756 | - Ineffective Retrieval Algorithm: If the retrieval mechanism fails to fetch relevant documents, even the best embeddings won't help.
 757 | - Outdated Index: The new versions of documents may not be indexed, which could lead to providing irrelevant/out-of-date information.
 758 | 
 759 | **3. Generation:**
 760 | - Model Hallucination: The generative model might produce  incorrect fabricated information, which looks believable.
 761 | - Lack of Context Understanding: The model might fail to catch a relevant information from context and instead of providing clarification question generate incomplete responses.
 762 | 
 763 | **4. Guardrails**
 764 | - "under" filtering:  filtering out inappropriate or harmful content might fail, leading to problematic outputs.
 765 | - "over" filtering: guardrails might filter out correct, useful information, reducing the system's performance.
 766 | 
 767 | 
 768 | As errors are inharitated from one stage to another. Use following approaches to diagnose them:
 769 | 
 770 | 1. *Isolate Components:*
 771 |     - Test each component individually. For embeddings, manually inspect a sample of embeddings to check their quality. For retrieval, evaluate the retrieved documents separately from the generation step. [**Metrics and Losses**](#ii-metrics-and-losses)
 772 | 2. *Step-by-Step Analysis:*
 773 |     - Follow a query through the entire pipeline to catch where the first major difference from expected behaviour occurs.
 774 | 
 775 | Corner-cases to check are mentioned in section 4. Validation Schema.
 776 | 
 777 | 
 778 | ### **VII. Training Pipeline**
 779 | 
 780 | ### **i. Overview**
 781 | 
 782 | We are planning to use external/pretrained solutions for Embedding generation, OCR, and LLM components. [**Dataset**](#iii-dataset)
 783 | Because of this, our Training Pipeline should focus on:
 784 | - **Stable Data Preprocessing:** Should be executed regularly upon new document submission.
 785 | - **Stable Context Selection:** Enabling robust Prompt Engineering.
 786 | 
 787 | ### **ii. Toolset**
 788 | 
 789 | The suggested tools are:
 790 | - Python
 791 | - Cloud Vector DB service (Pinecone, Azure AI Search, etc.)
 792 | - On-premise out-of-the-box OCR / Cloud OCR service
 793 | - Docker
 794 | - Cloud LLM service (OpenAI / Azure OpenAI)
 795 | 
 796 | ### **iii. Data Preprocessing**
 797 | 
 798 | The data preprocessing should include:
 799 | 
 800 | - **Text Recognition:** OCR module to convert image documents into text representation.
 801 | - **Text Metadata Extraction:** Store explicit information and statistics about documents.
 802 | - **Feature Engineering:** Extract required features on different levels.
 803 | - **Preprocessing Metadata Storage:** Store explicit information about tools and their versions used for preprocessing.
 804 | - **Feature Storage:** Make features accessible and searchable.
 805 | 
 806 | The main goal - document should be preprocessed within 1-2 hours after submission/new version being assigned.
 807 | 
 808 | ### **iv. Evaluations**
 809 | For Context, Prompt, and Chat E2E evaluations we would be using RAGAS tool as described in [**IV. Validation Schema**](#iv-validation-schema)
 810 | 
 811 | ### **v. Continuous Integration and Deployment**
 812 | 
 813 | The pipeline should be integrated into the existing CI/CD infrastructure. This includes setting up automated evaluation on a regular basis, ensuring that the latest data is used and pulled, and releasing changes to production with minimal manual intervention.
 814 | 
 815 | ### **vi. Monitoring and Maintenance**
 816 | 
 817 | We should monitor the model's performance in production and set up alerts for significant deviations from expected performance. This will enable us to catch issues early and trigger retraining or model updates when necessary. [**XI. Monitoring**](#xi-monitoring)
 818 | 
 819 | ### **vii. Future Work and Experimentation**
 820 | 
 821 | Considering we would like more customised and/or on-premise models, we will need to extend this section to cover training for in-house models.
 822 | The section lacks of testing approaches for the specific user needs, such as asking for more details.
 823 | 
 824 | ### **VIII. Features**
 825 | 
 826 | Our key criteria to select features:
 827 | 1. **Context selection flexibility:** Users may ask a variety of questions and features need to be adaptive enough to ensure that context may be selected for any question.
 828 | 2. **Context selection relevance:** Besides supporting a variety of questions, we need to ensure that whenever context is selected - it is relevant for the question.
 829 | 3. **Computational time:** We are not working with online data and are not very time-restricted, and features could be generated with some lag within several hours. However, as we may encounter pretty long documents (up to 500 pages long), some features could be expensive at such scale.
 830 | 
 831 | Adding new features should be formulated as new hypotheses, which should originate from covering specific corner cases or improving metrics.
 832 | 
 833 | The idea is not to execute any automated feature selection, rather than focusing on providing the most complete and relevant context in the prompt for LLM.
 834 | 
 835 | Features are met in the following services, by their components and areas:
 836 | - Metadata enriching service
 837 |     - Document-level enriching component
 838 |         - Storage and accessing the features
 839 |     - Text-level enriching component
 840 |         - Training
 841 |         - Inference
 842 |         - Storage and accessing the features
 843 |     - Token-level enriching component
 844 |         - Training
 845 |         - Inference
 846 |         - Storage and accessing the features
 847 | - OCR service
 848 |     - OCR component
 849 |         - Training - out of scope, as the current goal to use external/out of the box solution.
 850 |         - Inference - out of scope, as the current goal to use external/out of the box solution.
 851 |         - Storage and accessing the features
 852 | - Embedding service
 853 |     - Embedder component
 854 |         - Training - out of scope, as the current goal to use external/out of the box solution.
 855 |         - Inference
 856 |         - Storage and accessing the features
 857 | - Chat service
 858 |     - Vector search component
 859 |         - Search & access the existing features
 860 |     - LLM component
 861 |         - Out of scope, as the current goal to use external/out of the box solution.
 862 |     - Prompt Engineering component
 863 |         - Access the existing features
 864 |         - System Prompt
 865 | 
 866 | On the high level, all features could be classified into:
 867 | - Document level
 868 | - Text level
 869 | - Token level
 870 | - Prompt templates
 871 | 
 872 | **i. Document level features**
 873 | 
 874 | The purpose of this features is to enable easier selection of documents by users.
 875 | Either by using filters, or non-explicit mentioning in the chat.
 876 | Such features are not explictly extracted or crafted, rather then translating the state of a document from other sources.
 877 | 
 878 | As a side effect, they could be usefull for Prompt Engineering to represent some structure to the LLM. [**Metadata**](#vi-metadata)
 879 | 
 880 | **ii. Text level features**
 881 | 
 882 | Such features are targeting context selection for the Prompt. [Data Enhancing](#v-data-enhancing)
 883 | They are split into 3 high level groups:
 884 | 1. **Metadata.** 
 885 |     - Not extracted features. Represents high level state/statistics of text.
 886 | 2. **Explicit enriching.**
 887 |     - Features which are not explicitly available and should be extracted by models / regexp / other approaches.
 888 | 3. **Embeddings.**
 889 |     - Focusing the ability to properly extract the context from the Vector DB.
 890 | 
 891 | **iii. Token level features**
 892 | 
 893 | This set of features further focusing the ability to select context and document selection. [Data Enhancing](#v-data-enhancing)
 894 | 
 895 | **ix. Prompt templates**
 896 | 
 897 | Set of Prompt templates which would be able to cover different question types and intents.
 898 | Each Prompt template would consist out of following component templates (not ordered):
 899 | - Agent Role & Knowledge
 900 | - Agent Task
 901 | - Output Formating
 902 | - Output Restrictions
 903 | - Input Metadata Context
 904 | - Input Document Context
 905 | - Input Documents Relations 
 906 | - (Optional) Task Knowledge Context
 907 | - (Optional) Task & Role Examples
 908 | 
 909 | Some component templates would be pre-created. 
 910 | Some - constructed on the go from selected relevent documents according to the component template.
 911 | 
 912 | In the beggining component templates should be able to cover following conditions (from [**Expectations**](#iii-expectations)):
 913 | - General domain questions
 914 | - Addressable questions
 915 | - Non-addressable questions
 916 | - Questions for 1 Document
 917 | - Questions for multiple different Documents
 918 | - Questions for 1 Document in multiple historical versions
 919 | - Questions for multiple different Documents in their multiple historical versions 
 920 | 
 921 | 
 922 | ### **IX. Measuring and reporting**
 923 | 
 924 | #### i. Measuring Results
 925 | 
 926 | Understanding how to measure the system's performance precisely is essential for validating the effectiveness of the machine learning solution. Experiments must be well-designed to capture meaningful metrics that reflect real-world utility.
 927 | 
 928 | In the [**Baseline Solution**](#v-baseline-solution) section, we highlighted several reasonable approaches, each with unique advantages and drawbacks. The challenge lies in determining which approach is most suitable for specific scenarios, particularly when the trade-offs impact performance unpredictably.
 929 | 
 930 | **Previous work**
 931 | 
 932 | The **Sparse Encoded Retrieval Baseline** serves as a straightforward search engine. While functional, it presents several limitations that new methodologies aim to overcome:
 933 | 
 934 | - No semantic understanding: synonyms are not supported by default
 935 | - Bag-of-words approach: word order is not considered
 936 | - Requires updates to accommodate new vocabulary
 937 | 
 938 | **Evaluation approach**
 939 | 
 940 | Evaluating the relevance of responses to user queries can be challenging. For this purpose, we could use a crowdsourcing platform. Assessors will be provided with a series of prompts and answers not only to assess relevance but also to detect hallucinations. We consider the following metrics:
 941 | 
 942 | - **Average Relevance Score** of the direct questions.
 943 | - **Average Relevance Score** of following-up questions.
 944 | - **Hallucination Rate**: This new metric quantifies the percentage of responses that contain hallucinated content. Responses are considered hallucinated if they include information not supported by facts or the input prompt.
 945 | 
 946 | **Assessment Method**:
 947 | - Assessors will be provided with a series of prompts and answers to evaluate both their relevance and accuracy. Alongside the 5-point scale for relevance, assessors will use a binary scale (Yes/No) to indicate whether each response contains hallucinated information.
 948 | - For nuanced analysis, we can further categorize hallucinations by severity, with minor inaccuracies noted separately from outright fabrications.
 949 | 
 950 | **Platform and Settings**:
 951 | 
 952 | - **Platform Choice**: Yandex.Toloka or Amazon Mechanical Turk, etc.
 953 | - **Total Assessors:** 100
 954 | - **Query-Answer Pairs for Direct Questions:** 500
 955 | - **Query-Answer Pairs for Follow-up Questions:** 500
 956 | 
 957 | **Terminology**:
 958 |  - **Task**: Defined as one Query-Answer Pair, which is a single item for assessment.
 959 |  - **Pool**: Described as a page with multiple tasks for assessors to evaluate.
 960 |  - **Overlap**: Indicates how many different assessors evaluate the same task, ensuring accurate data by having multiple reviewers.
 961 | 
 962 | **Cost Calculation**:
 963 | 
 964 | - **Pool Price:** $0.05
 965 | - **Total Tasks**: 1000 (500 for direct questions and 500 for following-up questions)
 966 | - **Tasks Per Pool:** 5
 967 | - **Overlap:** 5
 968 | 
 969 | **Expense Formula**: `expense = pool_price * (total_tasks / tasks_per_pool) * overlap`
 970 | 
 971 | - Cost of direct questions: $25
 972 | - Cost of following-up questions: $25
 973 | - Total Cost with Hallucination Assessment: $50
 974 | 
 975 | **Budget Adjustments**:
 976 | 
 977 | The settings can be adapted based on the budget, with potential increases to accommodate the additional complexity of assessing hallucinations.
 978 | 
 979 | **Special Considerations for Niche Domains**: The evaluation approach works well for well-known domains. For specific domains, we can use local experts who are familiar with the context.
 980 | 
 981 | #### ii. A/B Tests
 982 | 
 983 | **Hypothesis**
 984 |  - **Primary Hypothesis**: We hypothesize that the new system enhancements will increase user retention rates by making the platform more engaging and responsive to user needs.
 985 |  - **Secondary Hypothesis**: We hypothesize that these enhancements will also lead to an increase in the subscription conversion rate, as the improved user experience encourages more users to commit to a paid subscription.
 986 | 
 987 | **Termination Criteria.** 
 988 | 
 989 |  - The system must deliver responses within an average of 1.5 minutes. 
 990 |  - The percentage of reports with offensive or improper responses must be below 1%.
 991 |  
 992 |  If the termination criteria are met, the experiment will be paused and resumed after corrections.
 993 | 
 994 | **Key Metrics**
 995 | 
 996 | - **User Retention Rate**: Measures the ratio of users who return to the platform within the next month. This metric is a direct indicator of the ongoing engagement and satisfaction of users.
 997 | - **Subscription Conversion Rate**: Measures the percentage of users who upgrade from a free to a paid subscription during the test period.
 998 | 
 999 | **Control metrics**
1000 | 
1001 | - **Positive Feedback Rate**: The percentage of feedback that is positive, reflecting users' approval of new features or improvements. This metric helps identify strengths in the service or product.
1002 | - **Negative Feedback Rate**: The percentage of feedback that is negative, indicating areas of user dissatisfaction. This metric is crucial for pinpointing problems and areas needing improvement.
1003 | - **Reading Efficiency Differential**: Measures the change in time taken to complete reading or information retrieval tasks with the RAG system compared to traditional methods. This metric is designed to quantify the impact of the RAG system on enhancing or reducing the efficiency of document reading processes.
1004 |   - **Baseline Reading Time**: Average time users spend reading or retrieving information using traditional methods.
1005 |   - **RAG Reading Time**: Average time users spend when using the RAG system for similar tasks.
1006 | - **Time to Retrieve (TTR)**: Measures the average time taken by the system to fetch and display results after a query is submitted.
1007 | - **Correction Attempts Rate**: Measures the percentage of responses with correction attempts.
1008 | - **Average Number of Correction Attempts**: Measures the average number of attempts to correct an answer.
1009 | - **Graceful Exits Rate**: Measures the percentage of interactions that result in a graceful exit after unsuccessful correction attempts.
1010 | 
1011 | **Auxiliary metrics**
1012 | 
1013 | - Total Document Count
1014 | - Daily New Documents
1015 | - Total User Count
1016 | - New Users per Day
1017 | - Session Count per Day
1018 | 
1019 | **Splitting Strategy.** Users will be split into two groups by their IDs.
1020 | 
1021 | **Experiment Duration.** 
1022 | 
1023 | The experiment will last four month. After two month, groups will swap configurations to mitigate any biases introduced by variable user experiences and external factors.
1024 | 
1025 | **TODO**: are there any seasonality concerns?
1026 | 
1027 | **Statistical Criteria.** Statistical significance will be determined using Welch's t-test, with a significance level set at 5% and the type II error at 10%.
1028 | 
1029 | **Future Steps for Experiment Improvement.** 
1030 | 
1031 | To further validate our experimental setup, we propose incorporating an A/A testing phase to ensure the reliability of our measurements, followed by A/B/C testing to compare multiple new solutions simultaneously.
1032 | 
1033 | #### iii. Reporting Results
1034 | 
1035 | At the end of the experiment, a comprehensive report will be generated. This will include:
1036 | 
1037 | - Key, control and auxiliary metric results with a 95% confidence interval.
1038 | - Distribution plots showing metric trends over time.
1039 | - Absolute numbers for all collected data.
1040 | - Detailed descriptions of each tested approach with links to full documentation.
1041 | - Conclusive summary and recommendations for further steps.
1042 | 
1043 | ### **X. Integration**
1044 | 
1045 | ![RAG Reliable & Interactive](docs/rag_reliable_interactive.png)
1046 | 
1047 | ### **i. Embeddings Database**
1048 | 
1049 | This is one of the core components in the system for efficient document search and retrieval. [Data Enhancing](#v-data-enhancing)
1050 | It consists of:
1051 | 1. Vector representations of uploaded/created by users files.
1052 | 2. Chat communications and response ratings, structured with fields of user queries, responses, timestamps, ratings and session ID.
1053 | 
1054 | **i.i. Embeddings Generation**
1055 | 
1056 | - **Query Embeddings:** Converts client's query into embeddings for further selecting nearest neighbors within the database.
1057 | - **Document Embeddings:** Creates embeddings using a pre-trained BERT-based model for new documents. The model processes documents via API request, resulting in document ID, document metadata and embeddings. The original file is stored in Documents Storage with the same document ID to avoid overloading the vector DB.
1058 | - **Updates:** Automatically updates embeddings when document version changes to maintain vector relevance.
1059 | 
1060 | **i.ii. Database Features**
1061 | 
1062 | A cloud and scalable database, e.g., Pinecone. It is designed to scale horizontally to handle large volumes of embeddings.
1063 | - Supports nearest neighbor search, using cosine or Euclidean similarity.
1064 | - Supports filtering based on metadata.
1065 | - The following fields are stored for further mapping with Documents Storage:
1066 |   1. Document ID
1067 |   2. Version number
1068 |   3. Document's Metadata (document title, author, creation date)
1069 |   4. Model's Metadata (Baseline / Main embedding tool, model's release version)
1070 |   5. Embeddings representation
1071 | - Embeddings, metadata, and queries are encrypted to ensure security. Strict access control management.
1072 | 
1073 | **Database quantified requirements:**
1074 | - Should return top-10 nearest neighbors within 100ms for up to 1 million vectors.
1075 | - Should support at least 1000 Queries Per Second for nearest neighbor searches on a dataset of 1 million vectors.
1076 | 
1077 | ### **ii. Documents Storage**
1078 | 
1079 | A scalable cloud service (e.g. AWS S3) for scalable storage and files managment for original files uploaded by clients, including their versions. The service returns a URL (Document ID) for each uploaded file, which is stored in the embeddings database metadata.
1080 | 
1081 | ### **iii. Chat UI**
1082 | 
1083 | An intuitive and responsive interface for clients to query and receive results.
1084 | 
1085 | **iii.i. Frameworks and Technologies**
1086 | 1. **Frontend.** React.js: A well-known framework for building user interfaces with a modular, component-based architecture.
1087 | 2. **Backend.** Redux or Context API: for managing application state.
1088 | 3. **Backend integration.** Axios: for handling API requests and responses.
1089 | 4. **Real-Time Interaction.** Socket.io: For real-time, bi-directional communication between web clients and servers.
1090 | 5. **Styling.** CSS-in-JS libraries like styled-components or Emotion for component-based styling.
1091 | 
1092 | **iii.ii. Features**
1093 |   1. Clients can upload new documents, which automatically triggers embedding generation and storage.
1094 |   2. Consists of a positive/negative feedback on answers.
1095 |   3. Clients can report offensive or improper responses, which triggers another LLM as a fallback scenario.
1096 |   4. Allows to save a chat history and responses for future reference.
1097 | 
1098 | ### **iv. OCR**
1099 | 
1100 | MagicSharepoint utilizes Optical Character Recognition technology to convert text from image-based documents into machine-readable text. This offers users greater flexibility in input data formats. The algorithm takes an intermediate place between the user upload process and embeddings creation. [Data Cleaning](#iii-data-cleaning)
1101 | 
1102 | MagicSharepoint leverages an existing OCR solution - AWS Textract, considering the complexitity and high-accuracy it requires. This choice also facilitates scalability, reducing development time and maintenance overhead.
1103 | 
1104 | **Features.**
1105 | 1. **Document upload.** The system automatically identifies the format of uploaded file and OCR gets triggered if image-based inputs.
1106 | 2. **Engine.** A cloud solution is used to be scalable enough to handle large volumes of documents simultaneously.
1107 | 3. **Storage.** Both the original image and the extracted text is stored in the Documents Storage, linked by the same Document ID.
1108 | 4. **Multi-language Support.** The OCR engine supports multiple languages to serve wide range of clients.
1109 | 
1110 | ### **v. Backend API Design**
1111 | 
1112 | Below are the events when a corresponding API action gets triggered while interacting with a user.
1113 | 
1114 | **Documents Management**
1115 | - Upload a new document
1116 | - Delete a document or version
1117 | - Retrieve document metadata
1118 | - Retrieve all versions of a document
1119 | - Retrieve a specific version of a document
1120 | - Apply OCR technology for image-based documents, if required
1121 | 
1122 | **User Queries Management**
1123 | - Retrieve a query result
1124 | - Rate a query response
1125 | - Report an inappropriate response
1126 | 
1127 | **Embeddings Management**
1128 | - **Model level:** Generate embeddings for a new document - uses a pre-trained model and sends to the IO level component for storage.
1129 | - **IO level:** Update embeddings for a document version change, keeping previous embeddings corresponding to the same session ID.
1130 | 
1131 | **Chat Session Management**
1132 | - Start a new chat session
1133 | - End a session
1134 | - Retrieve chat history
1135 | - Save chat history
1136 | 
1137 | **Notifications and Alerts API**
1138 | - Inform users of system updates, document changes they worked with
1139 | - Real-time notifications for document processing status, e.g., "Your document is being uploaded. Please wait..." or "OCR processing started for your document"
1140 | - Error Handling and Logging Events. Examples are "File upload failed due to network timeout", "Document size exceeds limit" or "Unsupported file format"
1141 | 
1142 | ### **vi. Parallel Processing**
1143 | To handle simultaneous queries and ensure document processing tasks do not slow down user interactions, we adopt a parallel processing strategy that separates asynchronous tasks (e.g., embedding generation) from synchronous tasks (e.g., real-time user interactions).
1144 | 
1145 | - **Embedding Generation for Index Updating:** Asynchronous tasks, which are placed in an asynchronous task queue (e.g., Celery with RabbitMQ) and from there are taken for parallel processing. It is triggered upon Document upload, new version, or OCR completion. Processed by general workers.
1146 | - **Real-time User Interaction:** Synchronous tasks, which are prioritized for low latency. Processed by reserved workers for entire chat sessions to maintain context and improve response relevance.
1147 | 
1148 | ### **vii. SLAs**
1149 | 
1150 | **TODO**: separate SLA and Latency Expectations ideas
1151 | 
1152 | To control system performance and meet defined standards, the MagicSharepoint service is integrated with a monitoring tool.
1153 | 
1154 | Key validated components:
1155 | - **Response Time:** Guarantee first token response within 1 minute.
1156 | - **Uptime:** Ensure a high availability rate, aiming for 99.9% uptime.
1157 | 
1158 | **Time Estimates per Stage**
1159 | 
1160 | If Time Estimate is not met - we need to save related logs and highlight/mark them for future analysis.
1161 | 
1162 | 1. User Query Processing
1163 | - Intent Classification: 200-300ms
1164 | - Context Retrieval from Vector Database: 300-500ms
1165 | - Response Generation by LLM:
1166 |   - Network latency to the vendor LLM: 100-200ms (depending on vendor and location)
1167 |   - Token Generation: 200-500ms (varies based on response length)
1168 | - Total Estimated Time for User Query Processing: 800-1500ms
1169 | 
1170 | 2. Document Processing
1171 | - Document Upload and OCR (if required): 1-2 minutes
1172 | - Embedding Generation:
1173 |   - Text Processing and Embedding Creation: 1-3 minutes (depending on document size)
1174 |   - Database Update: 200-300ms
1175 | - Total Estimated Time for Document Processing: 2-5 minutes
1176 | 
1177 | **Infrastructure Requirements for Meeting SLAs**
1178 | - CPU Nodes: For general API handling, metadata storage, and other small tasks
1179 | - GPU Nodes: For intensive tasks such as embedding generation
1180 | - Fast SSDs: For quick read/write operations during document processing and storing original files
1181 | - High-speed Network: To ensure low latency between API services
1182 | 
1183 | ### **viii. Fallback Strategies**
1184 | 
1185 | Fallbacks are crucial for maintaining operational efficiency in the face of unforeseen circumstances. MagicSharepoint uses a multi-tiered fallback system to ensure seamless service:
1186 | 
1187 | **TODO**: add more details to illustrate multi-component nature of a problem
1188 | 
1189 | - **Primary fallback:** The primary model is served by the chosen vendor. It is used unless there is negative user feedback on the model outcome or if the latency is outside the accepted range.
1190 | - **Secondary fallback:** Our next layer of fallback involves using a pretrained LLM from Hugging Face, installed locally. This approach addresses both potential issues.
1191 |   
1192 | The system has latency and feedback-based switching, which reroutes requests to the secondary model. Once conditions improve, it switches back to the primary model. To simplify management, each service within MagicSharepoint handles its fallback mechanisms independently.
1193 | 
1194 | ### **XI. Monitoring**
1195 | 
1196 | **TODO**: update this part with main metrics to be monitored for each system.
1197 | 
1198 | **TODO**: select main metrics for each system.
1199 | 
1200 | #### Engineering Logging & Monitoring
1201 | 
1202 | 1. **Ingestion Layer**
1203 |     - Process and I/O timings
1204 |     - Code errors
1205 | 
1206 | 2. **Retrieval**
1207 |    - **Embedder**: Monitor preprocessing time, embedding model time, and utilization of embedding model instances.
1208 |    - **Database (DB)**: Monitor time taken for each retrieval operation and DB utilization.
1209 |    
1210 | 3. **Generation**
1211 |    - **LLM**: Monitor latency, cost, error rates, uptime, and token volume for input and output to predict scaling needs.
1212 | 
1213 | #### ML Logging & Monitoring
1214 | 
1215 | 1. **Ingestion Layer**
1216 |     - Every step of the ETL pipeline for document extraction must be fully logged to ensure the process is reproducible and help with issue resolution
1217 |     - OCR process must be logged separately, as it is a different system
1218 |     - Statistics for documents during ingestion should be monitored, including word count, character distribution, document length, paragraph length, detected languages, and percentage of tables or images
1219 |     - Monitor the preprocessing layer to bring awareness of non-ingested documents or documents with too many errors
1220 | 
1221 | 2. **Retrieval**
1222 |     - Log the details of each query, including the tokenizer used, document context found within a particular document version, and other relevant metadata for future analyses
1223 |     - Keep track of the indexes found and similarity scores
1224 | 
1225 | 3. **Chat History**: Store all chat history for thorough analysis and debugging, providing valuable insights into user interactions and system performance
1226 | 
1227 | 4. **Augmented Generation**
1228 |     - Quality of generated content through user feedback
1229 | 
1230 | 5. **Alerting Mechanisms**: Have alerting mechanisms for any anomalies or exceeded thresholds based on the metrics being monitored
1231 | 
1232 | #### Tooling
1233 | 
1234 | 1. **For RAG Operations - Langfuse Callback**
1235 |     - Integrates with LlamaIndex
1236 |     - Supports measuring the quality of the model through user feedback, both explicit and implicit
1237 |     - Calculates costs, latency, and total volume
1238 |     - For more information about analytics capabilities, see: [Langfuse Analytics Overview](https://langfuse.com/docs/analytics/overview)
1239 | 
1240 | 2. **For System Health Metrics, Ingestion Layer, Alerting - Prometheus & Grafana**
1241 |     - Prometheus is an open-source system monitoring and alerting toolkit
1242 |     - Grafana is used to visualize the data collected by Prometheus
1243 |     - Since LLMs logging is stored within Langfuse, there is no need to build additional solutions for this
1244 | 
1245 |     **Why not a standard ELK stack?**
1246 | 
1247 |     For more details, please read this great blog post: [Prometheus-vs-ELK](https://www.metricfire.com/blog/prometheus-vs-elk/)
1248 | 
1249 |     | Feature/Aspect                 | Prometheus                                        | ELK (Elasticsearch, Logstash, Kibana)              |
1250 |     |-------------------------------|---------------------------------------------|---------------------------------------------------|
1251 |     | **Primary Use Case**          | Metrics collection and monitoring           | Log management, analysis, and visualization       |
1252 |     | **Data Type**                 | Numeric time series data                    | Various data types (numeric, string, boolean, etc.)|
1253 |     | **Database Model**            | Time-series DB                              | Search engine with inverted index                 |
1254 |     | **Data Ingestion Method**     | Pull-based metrics collection via HTTP      | Log collection from various sources using Beats and Logstash |
1255 |     | **Data Retention**            | Short-term (default 15 days, configurable)  | Long-term                                          |
1256 |     | **Visualization Tool**        | Grafana                                     | Kibana                                             |
1257 |     | **Alerting**                  | Integrated with Prometheus                  | Extensions                                         |
1258 |     | **Operational Complexity**    | Lower (single-node)                         | Higher (clustering)                               |
1259 |     | **Scalability**               | Limited horizontal scaling                  | High horizontal and vertical scalability           |
1260 |     | **Setup and Configuration**   | Simple                                      | Complex                                            |
1261 | 
1262 |     **Pros for this solution:**
1263 |     1. Metric-Focused Monitoring: Prometheus is optimized for collecting and analyzing time-series data, making it ideal for tracking metrics
1264 |     2. Ease of Setup and Configuration: Prometheus's pull-based model simplifies the setup process
1265 |     3. Operational Simplicity: It is advantageous without needing a large, dedicated team to manage it
1266 |     4. Real-Time Alerts and Querying: Prometheus provides a powerful query language (PromQL) and supports real-time alerting
1267 | 
1268 |     **Cons:**
1269 |     1. No vertical scaling
1270 |     2. Limited logs data retention. This might become a problem if we change the RAG framework and want to store ML logs elsewhere
1271 | 
1272 | 3. **Code error reports - Sentry.io**
1273 |     - Sentry is a widely-used error tracking tool that helps developers monitor, fix, and optimize application performance
1274 |     - We may choose between self-hosted versions and the paid cloud version in the future
1275 | 
1276 | 
1277 | ### **XII. Serving and inference**
1278 | 
1279 | ![RAG Reliable & Interactive](docs/rag_reliable_interactive.png)
1280 | 
1281 | The inference part of a system would contain from 3 major on-premise services:
1282 | - **Embedding service**
1283 | - **OCR service**
1284 | - **Chat service**
1285 | 
1286 | Also on-remise we should have the following infrastructure services:
1287 | - **Load Balancer service**
1288 | - **Cacher service** 
1289 | 
1290 | While following parts of a solution are considered as external cloud services:
1291 | - **Vector database**
1292 | - **Document storage**
1293 | - **Metadata database**
1294 | - **LLM**
1295 | 
1296 | **TODO**: concider separating app nodes bs data nodes
1297 | 
1298 | ### **i. Serving architecture**
1299 | 
1300 | On-premise services would be hosted as REST API services hosted in Docker containers and orchestrated by Kubernetes cluster.
1301 | Chat service should support an option to stream a responce via https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events (default OpenAI tool to stream response).
1302 | 
1303 | #### **Embedding service**
1304 | 
1305 | Invoked upon every document receiving a version.
1306 | Should pull required metadata from other databases to enrich embeddings with metadata upon saving.
1307 | 
1308 | For every document:
1309 | - Invoke OCR service.
1310 |     - If document is Image based.
1311 | - Pull text representation from Document storage.
1312 | - Embedding service will generate embeddings.
1313 |     - on different aggregations - from document down to sentence level.
1314 | - Import embeddings with corresponding metadata into the Vector database.
1315 | 
1316 | #### **OCR service**
1317 | 
1318 | Invoked from Embedding service.
1319 | 
1320 | For every image document:
1321 | - Invoke OCR process.
1322 | - Import text representation into the Document storage
1323 | 
1324 | #### **Chat service**
1325 | 
1326 | Invoked on every question.
1327 | 
1328 | For every question:
1329 | - Retrieve scope metadata from Metadata database.
1330 | - Confirm scope with user on Chat UI.
1331 |     - If scope request was non-explicit.
1332 |     - ~~Can start generating response before client would confirm it.~~
1333 | - Retrieve chat history.
1334 |     - If any.
1335 | - Use internal cache.
1336 |     - If scope, chat history and question match.
1337 |     - No fuzzymatch
1338 | - Context search in Vector database.
1339 | - Construct Prompt & invoke LLM.
1340 | - Receive response ~~and stream generated tokens to the Chat UI~~.
1341 | - Invoke guardrails.
1342 | - Calculate performance and consumption statistics.
1343 | - Calculate automated quality metrics (if any).
1344 | - Make decision and execute it: 
1345 |     - Return answer to the Chat UI.
1346 |     - Request more details from user on Chat UI.
1347 |     - Adjust response according to standards.
1348 | - Save record to the internal cache.
1349 | - Save chat history to Metadata database.
1350 | 
1351 | ### **ii. Infrastructure**
1352 | 
1353 | **Load Balancer service** should be hosted on a regular node. It should be responsible for routing requests and start-stopping for **Embedding service** and **OCR service**. 
1354 | 
1355 | **LLM service** is an external service and already has auto-scaling functionality and load balancer. 
1356 | 
1357 | **Cacher service** would have Redis under the hood, and to be hosted on a regular node. 
1358 | 
1359 | **Embedding service** and **OCR service** should be hosted on GPU nodes.
1360 | As we expect around 500 new document versions per month, they could be hosted on Spot machines or be start-stopped on demand (if it will take less than 2 minutes).
1361 | Both services would be stopped if they won't receive new requests within 30 (??) minutes.
1362 | 
1363 | **Chat service** could be hosted on non-GPU node. More focus on the RAM to manage retrieved contexts and assemble prompts, then on CPU. Minimal RAM should be 4 gb. Will be connected with **Cacher service** to manage cached/frequently similar requests.
1364 | 
1365 | ### **iii. Monitoring**
1366 | 
1367 | Key **inference** metrics:
1368 | - ~~Time to first token~~
1369 | - Time to show full reponse.
1370 | - % of questions covered by internal cache.
1371 | - % of answers rejected by guardrail.
1372 | - % of answers requested more details from user.
1373 | - % of questions having empty context.
1374 | - % of explicit 'we can not answer' answers. 
1375 | - Average chat history lenght.
1376 | 


--------------------------------------------------------------------------------
/Design_Doc_Examples/Mock/EN/Mock_ML_System_Design_RAG_Chat_With_Doc_Versions/docs/rag_reliable.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ML-SystemDesign/MLSystemDesign/304d9a02711eaf503aa5fb467eb4ba99a538ca92/Design_Doc_Examples/Mock/EN/Mock_ML_System_Design_RAG_Chat_With_Doc_Versions/docs/rag_reliable.png


--------------------------------------------------------------------------------
/Design_Doc_Examples/Mock/EN/Mock_ML_System_Design_RAG_Chat_With_Doc_Versions/docs/rag_reliable_interactive.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ML-SystemDesign/MLSystemDesign/304d9a02711eaf503aa5fb467eb4ba99a538ca92/Design_Doc_Examples/Mock/EN/Mock_ML_System_Design_RAG_Chat_With_Doc_Versions/docs/rag_reliable_interactive.png


--------------------------------------------------------------------------------
/Design_Doc_Examples/Mock/EN/Mock_ML_System_Design_RAG_Chat_With_Doc_Versions/docs/rag_simple.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ML-SystemDesign/MLSystemDesign/304d9a02711eaf503aa5fb467eb4ba99a538ca92/Design_Doc_Examples/Mock/EN/Mock_ML_System_Design_RAG_Chat_With_Doc_Versions/docs/rag_simple.png


--------------------------------------------------------------------------------
/Design_Doc_Examples/Mock/EN/Mock_ML_System_Design_RAG_Chat_With_Doc_Versions/docs/retrieval_baseline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ML-SystemDesign/MLSystemDesign/304d9a02711eaf503aa5fb467eb4ba99a538ca92/Design_Doc_Examples/Mock/EN/Mock_ML_System_Design_RAG_Chat_With_Doc_Versions/docs/retrieval_baseline.png


--------------------------------------------------------------------------------
/Design_Doc_Examples/Mock/README.md:
--------------------------------------------------------------------------------
 1 | # Educational Mock ML System Design Examples
 2 | 
 3 | This directory contains educational mock examples of ML system design documents. These examples are created specifically for learning purposes and demonstrate best practices using simplified scenarios.
 4 | 
 5 | ## Current Examples
 6 | 
 7 | ### EN (English)
 8 | 
 9 | 1. **RAG Chat with Document Versions**
10 |    - Domain: Natural Language Processing/Document Management
11 |    - Problem: Building a RAG-based chat system for versioned documents
12 |    - Key Features:
13 |      - Document version management
14 |      - Retrieval Augmented Generation
15 |      - Multi-language support
16 |      - Real-time chat interface
17 | 
18 | ## Purpose of Mock Examples
19 | 
20 | Mock examples serve several educational purposes:
21 | 1. Demonstrate ML system design principles in simplified contexts
22 | 2. Show best practices without real-world complexities
23 | 3. Provide clear, focused examples for learning
24 | 4. Illustrate common patterns and anti-patterns
25 | 
26 | ## Contributing New Mock Examples
27 | 
28 | To contribute a new mock example:
29 | 
30 | 1. Choose a clear, educational scenario
31 | 2. Use the template from `templates/basic_ml_design_doc.md`
32 | 3. Keep the example focused and simplified
33 | 4. Include:
34 |    - Clear learning objectives
35 |    - Step-by-step explanations
36 |    - Common pitfalls to avoid
37 |    - Best practices highlighted
38 | 5. Follow the contribution guidelines in the root `CONTRIBUTING.md`
39 | 
40 | ## Mock Example Structure
41 | 
42 | Each mock example should:
43 | 1. Follow the standard template
44 | 2. Include clear educational goals
45 | 3. Provide simplified but realistic scenarios
46 | 4. Highlight key learning points
47 | 5. Include common challenges and solutions
48 | 
49 | ## Language Organization
50 | 
51 | - `EN/` - English examples
52 | - Additional language folders can be added following the same structure 


--------------------------------------------------------------------------------
/Design_Doc_Examples/README.md:
--------------------------------------------------------------------------------
 1 | # ML System Design Examples
 2 | 
 3 | This directory contains various ML system design documents organized into three categories:
 4 | 
 5 | ## Directory Structure
 6 | 
 7 | ```
 8 | ├── Mock/                  # Educational mock examples
 9 | │   └── EN/               # English mock examples
10 | ├── Examples/             # Real-world inspired examples
11 | │   └── EN/              # English examples
12 | └── Real/                # Actual production design docs
13 |     └── EN/              # English real examples
14 | ```
15 | 
16 | ## Categories
17 | 
18 | ### Mock Documents
19 | Located in `Mock/` directory. These are educational examples created specifically for learning purposes. They demonstrate best practices and common patterns but may use fictional companies or simplified scenarios.
20 | 
21 | ### Example Documents
22 | Located in `Examples/` directory. These are based on real-world scenarios but may be adapted or modified for educational purposes. They maintain realistic complexity while being accessible for learning.
23 | 
24 | ### Real Documents
25 | Located in `Real/` directory. These are actual production design documents from real projects (with sensitive information removed). They show how ML systems are designed in practice.
26 | 
27 | ## Language Organization
28 | 
29 | Each category has language-specific subdirectories:
30 | - `EN/` for English documents
31 | - Additional language folders can be added as needed
32 | 
33 | ## Contributing
34 | 
35 | When adding new design documents:
36 | 1. Choose the appropriate category (Mock/Examples/Real)
37 | 2. Place in the correct language subdirectory
38 | 3. Follow the template from `templates/basic_ml_design_doc.md`
39 | 4. Include all necessary diagrams and supporting files
40 | 5. Update this README if adding new language folders 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Machine Learning System Design
 2 | 
 3 | This repository is dedicated to Machine Learning System Design, featuring end-to-end examples and partially based on [this book](https://www.manning.com/books/machine-learning-system-design). While it doesn't offer a comprehensive teaching experience like the book, it provides a structure and a variety of design documents for your use.
 4 | 
 5 | ## Repository Structure
 6 | 
 7 | ```
 8 | ├── templates/                    # Templates and guidelines
 9 | │   ├── basic_ml_design_doc.md   # Basic template for ML design docs
10 | │   └── design_doc_checklist.md  # Review checklist for design docs
11 | ├── Design_Doc_Examples/         # Example design documents
12 | │   ├── EN/                      # English examples
13 | │   └── [other languages]/       # Examples in other languages
14 | ├── BookOutline.md              # Book chapter summaries and key concepts
15 | ├── CONTRIBUTING.md             # Contribution guidelines
16 | └── README.md                   # This file
17 | ```
18 | 
19 | ## Getting Started
20 | 
21 | 1. **New to ML System Design?**
22 |    - Start with the `BookOutline.md` for key concepts
23 |    - Review examples in `Design_Doc_Examples/`
24 |    - Use templates in `templates/` for your own designs
25 | 
26 | 2. **Want to Contribute?**
27 |    - Read `CONTRIBUTING.md` for guidelines
28 |    - Use the templates provided
29 |    - Follow the checklist before submitting
30 | 
31 | 3. **Looking for Examples?**
32 |    - Check `Design_Doc_Examples/` for real-world cases
33 |    - Each example follows our standard template
34 |    - Includes different domains and complexity levels
35 | 
36 | ## Templates
37 | 
38 | We provide two main templates:
39 | 1. `basic_ml_design_doc.md` - Standard template for ML system design documents
40 | 2. `design_doc_checklist.md` - Comprehensive review checklist
41 | 
42 | ## Contributing
43 | 
44 | We welcome contributions! Please see `CONTRIBUTING.md` for detailed guidelines. Key areas for contribution:
45 | - New design document examples
46 | - Template improvements
47 | - Best practices documentation
48 | - Code examples
49 | - Reviews and feedback
50 | 
51 | ## License
52 | 
53 | This repository is licensed under the MIT License - see the LICENSE file for details.
54 | 
55 | ## Acknowledgments
56 | 
57 | - Based on concepts from [Machine Learning System Design](https://www.manning.com/books/machine-learning-system-design)
58 | - Contributors to the examples and templates
59 | - ML community for feedback and improvements
60 | 
61 | 


--------------------------------------------------------------------------------
/templates/basic_ml_design_doc.md:
--------------------------------------------------------------------------------
  1 | # Machine Learning System Design Document Template
  2 | 
  3 | ## Executive Summary
  4 | - **Project Name**: [Name]
  5 | - **Problem Statement**: [Brief description of the problem]
  6 | - **Business Impact**: [Expected impact and KPIs]
  7 | - **Timeline**: [High-level timeline]
  8 | 
  9 | ## I. Problem Definition
 10 | 
 11 | ### i. Origin
 12 | - What is the core problem?
 13 | - Who are the stakeholders?
 14 | - What are the current limitations/challenges?
 15 | - What is the current workflow?
 16 | 
 17 | ### ii. Relevance & Reasons
 18 | - Why is this problem important?
 19 | - What are the expected benefits?
 20 | - What is the estimated business impact?
 21 | - How much does the current problem cost?
 22 | 
 23 | ### iii. Previous Work
 24 | - What existing solutions have been tried?
 25 | - What worked/didn't work?
 26 | - What can we learn from past attempts?
 27 | - Can we improve existing solutions?
 28 | 
 29 | ### iv. Other Issues & Risks
 30 | - What infrastructure requirements exist?
 31 | - What are the potential failure modes?
 32 | - What is the cost of mistakes?
 33 | - What checks and balances are needed?
 34 | 
 35 | ## II. Metrics and Losses
 36 | 
 37 | ### i. Metrics
 38 | - What are the key business metrics?
 39 | - What are the model performance metrics?
 40 | - How do these align with business goals?
 41 | - What are the trade-offs between metrics?
 42 | 
 43 | ### ii. Loss Functions
 44 | - What loss functions will be used?
 45 | - How do they relate to business metrics?
 46 | - What are the trade-offs?
 47 | - How will we handle edge cases?
 48 | 
 49 | ## III. Dataset
 50 | 
 51 | ### i. Data Sources
 52 | - What internal data sources are available?
 53 | - What external data can be used?
 54 | - What are the data quality issues?
 55 | - How fresh is the data?
 56 | 
 57 | ### ii. Data Labeling
 58 | - How will data be labeled?
 59 | - What is the labeling process?
 60 | - How will we ensure quality?
 61 | - What are the costs?
 62 | 
 63 | ### iii. Available Metadata
 64 | - What metadata is available?
 65 | - How will it be used?
 66 | - What additional context can it provide?
 67 | 
 68 | ### iv. Data Quality Issues
 69 | - What quality issues exist?
 70 | - How will they be addressed?
 71 | - What is the data cleaning process?
 72 | - How will we handle missing data?
 73 | 
 74 | ### v. ETL Pipeline
 75 | - How will data be collected?
 76 | - What transformations are needed?
 77 | - How will we handle updates?
 78 | - What is the refresh frequency?
 79 | 
 80 | ## IV. Validation Schema
 81 | 
 82 | ### i. Requirements
 83 | - What are the validation requirements?
 84 | - How will we prevent data leakage?
 85 | - What temporal constraints exist?
 86 | 
 87 | ### ii. Inference Process
 88 | - How will the model make predictions?
 89 | - What is the prediction horizon?
 90 | - What constraints must be considered?
 91 | 
 92 | ### iii. Inner and Outer Loops
 93 | - How will we validate the model?
 94 | - What cross-validation strategy will be used?
 95 | - How will we handle time series data?
 96 | 
 97 | ### iv. Update Frequency
 98 | - How often will the model be updated?
 99 | - What triggers an update?
100 | - How will we handle data drift?
101 | 
102 | ## V. Baseline Solution
103 | 
104 | ### i. Constant Baseline
105 | - What simple baselines will we use?
106 | - How will we measure against them?
107 | - What are the minimum acceptable results?
108 | 
109 | ### ii. Model Baselines
110 | - What model architectures will we try?
111 | - What are the trade-offs?
112 | - How will we compare them?
113 | 
114 | ### iii. Feature Baselines
115 | - What features will we start with?
116 | - How will we measure feature importance?
117 | - What feature engineering is needed?
118 | 
119 | ## VI. Error Analysis
120 | 
121 | ### i. Learning Curve Analysis
122 | - How will we analyze learning curves?
123 | - What patterns are we looking for?
124 | - How will we handle overfitting/underfitting?
125 | 
126 | ### ii. Residual Analysis
127 | - How will we analyze residuals?
128 | - What distributions do we expect?
129 | - How will we handle outliers?
130 | 
131 | ### iii. Best/Worst Case Analysis
132 | - How will we identify edge cases?
133 | - What are the failure modes?
134 | - How will we improve worst cases?
135 | 
136 | ## VII. Training Pipeline
137 | 
138 | ### i. Overview
139 | - What is the training architecture?
140 | - What tools will we use?
141 | - How will we ensure reproducibility?
142 | 
143 | ### ii. Data Preprocessing
144 | - What preprocessing is needed?
145 | - How will we handle feature engineering?
146 | - What normalization is required?
147 | 
148 | ### iii. Model Training
149 | - What is the training process?
150 | - How will we handle hyperparameters?
151 | - What hardware requirements exist?
152 | 
153 | ### iv. Experiment Tracking
154 | - How will we track experiments?
155 | - What metrics will we log?
156 | - How will we version models?
157 | 
158 | ## VIII. Features
159 | 
160 | ### i. Feature Selection Criteria
161 | - What criteria will we use?
162 | - How will we measure importance?
163 | - What are the computational constraints?
164 | 
165 | ### ii. Feature List
166 | - What features will we use?
167 | - What transformations are needed?
168 | - What are the dependencies?
169 | 
170 | ### iii. Feature Tests
171 | - How will we test features?
172 | - What quality checks are needed?
173 | - How will we handle drift?
174 | 
175 | ## IX. Measuring and Reporting
176 | 
177 | ### i. Measuring Results
178 | - How will we measure success?
179 | - What metrics will we track?
180 | - How will we report results?
181 | 
182 | ### ii. A/B Testing
183 | - What is the testing strategy?
184 | - How will we split traffic?
185 | - What are the success criteria?
186 | 
187 | ### iii. Reporting Results
188 | - What reports will be generated?
189 | - Who are the stakeholders?
190 | - How will results be communicated?
191 | 
192 | ## X. Integration
193 | 
194 | ### i. Fallback Strategies
195 | - What are the fallback plans?
196 | - When do we fall back?
197 | - How do we recover?
198 | 
199 | ### ii. API Design
200 | - What APIs will we expose?
201 | - What are the interfaces?
202 | - What are the SLAs?
203 | 
204 | ### iii. Release Cycle
205 | - How will we release updates?
206 | - What is the deployment strategy?
207 | - How will we handle rollbacks?
208 | 
209 | ### iv. Operational Concerns
210 | - How will we monitor the system?
211 | - What alerts are needed?
212 | - How will we handle incidents? 


--------------------------------------------------------------------------------
/templates/design_doc_checklist.md:
--------------------------------------------------------------------------------
  1 | # ML System Design Document Review Checklist
  2 | 
  3 | ## Problem Definition
  4 | - [ ] Clear problem statement with measurable objectives
  5 | - [ ] Well-defined scope and constraints
  6 | - [ ] Identified stakeholders and their requirements
  7 | - [ ] Justified business value and impact
  8 | - [ ] Analyzed existing solutions and their limitations
  9 | - [ ] Assessed risks and failure modes
 10 | - [ ] Estimated costs of mistakes
 11 | - [ ] Defined success criteria
 12 | 
 13 | ## Metrics and Losses
 14 | - [ ] Defined business metrics
 15 | - [ ] Selected appropriate model metrics
 16 | - [ ] Justified loss functions
 17 | - [ ] Aligned metrics with business goals
 18 | - [ ] Considered trade-offs
 19 | - [ ] Defined evaluation strategy
 20 | - [ ] Set up measurement framework
 21 | - [ ] Planned A/B testing approach
 22 | 
 23 | ## Data Considerations
 24 | - [ ] Identified all data sources (internal/external)
 25 | - [ ] Assessed data quality and freshness
 26 | - [ ] Documented data pipeline architecture
 27 | - [ ] Addressed data privacy and security
 28 | - [ ] Considered data versioning strategy
 29 | - [ ] Evaluated data storage requirements
 30 | - [ ] Planned data labeling process
 31 | - [ ] Documented metadata usage
 32 | - [ ] Designed ETL pipeline
 33 | - [ ] Set up data quality checks
 34 | 
 35 | ## Validation Strategy
 36 | - [ ] Defined validation requirements
 37 | - [ ] Designed validation schema
 38 | - [ ] Prevented data leakage
 39 | - [ ] Planned update frequency
 40 | - [ ] Set up cross-validation strategy
 41 | - [ ] Considered temporal aspects
 42 | - [ ] Documented validation process
 43 | - [ ] Planned for data drift
 44 | 
 45 | ## Baseline Solutions
 46 | - [ ] Defined constant baselines
 47 | - [ ] Selected model baselines
 48 | - [ ] Identified feature baselines
 49 | - [ ] Set minimum performance requirements
 50 | - [ ] Planned comparison methodology
 51 | - [ ] Documented baseline results
 52 | - [ ] Set up improvement metrics
 53 | 
 54 | ## Error Analysis
 55 | - [ ] Planned learning curve analysis
 56 | - [ ] Set up residual analysis
 57 | - [ ] Identified edge cases
 58 | - [ ] Planned monitoring of failure modes
 59 | - [ ] Designed error tracking
 60 | - [ ] Set up performance analysis
 61 | - [ ] Planned improvement process
 62 | 
 63 | ## Training Pipeline
 64 | - [ ] Designed training architecture
 65 | - [ ] Selected appropriate tools
 66 | - [ ] Planned data preprocessing
 67 | - [ ] Set up experiment tracking
 68 | - [ ] Defined model versioning
 69 | - [ ] Planned resource allocation
 70 | - [ ] Documented training process
 71 | - [ ] Set up monitoring
 72 | 
 73 | ## Feature Engineering
 74 | - [ ] Defined feature selection criteria
 75 | - [ ] Listed initial features
 76 | - [ ] Planned feature tests
 77 | - [ ] Set up feature monitoring
 78 | - [ ] Documented feature dependencies
 79 | - [ ] Planned feature updates
 80 | - [ ] Considered computational constraints
 81 | 
 82 | ## Integration
 83 | - [ ] Designed API interfaces
 84 | - [ ] Planned release cycle
 85 | - [ ] Set up fallback strategies
 86 | - [ ] Defined operational procedures
 87 | - [ ] Planned monitoring and alerts
 88 | - [ ] Documented deployment process
 89 | - [ ] Set up incident response
 90 | - [ ] Defined SLAs
 91 | 
 92 | ## Documentation
 93 | - [ ] Clear writing and organization
 94 | - [ ] Technical details sufficient
 95 | - [ ] Diagrams and visualizations
 96 | - [ ] References and citations
 97 | - [ ] Glossary of terms
 98 | - [ ] Version history
 99 | - [ ] Maintenance procedures
100 | - [ ] Update guidelines
101 | 
102 | ## System Architecture
103 | - [ ] Detailed infrastructure requirements
104 | - [ ] Scalability considerations
105 | - [ ] Latency requirements
106 | - [ ] Security measures
107 | - [ ] Integration points
108 | - [ ] Deployment strategy
109 | 
110 | ## Evaluation Strategy
111 | - [ ] Clear success metrics
112 | - [ ] A/B testing methodology
113 | - [ ] Performance benchmarks
114 | - [ ] Monitoring plan
115 | - [ ] Alert thresholds
116 | - [ ] Fallback strategies
117 | 
118 | ## Implementation Plan
119 | - [ ] Realistic timeline
120 | - [ ] Resource requirements
121 | - [ ] Dependencies identified
122 | - [ ] Risk assessment
123 | - [ ] Mitigation strategies
124 | - [ ] Success criteria
125 | 
126 | ## Maintenance & Operations
127 | - [ ] Monitoring setup
128 | - [ ] Update procedures
129 | - [ ] Backup strategies
130 | - [ ] Incident response plan
131 | - [ ] SLAs defined
132 | - [ ] Resource scaling plan 


--------------------------------------------------------------------------------