├── ANN.md
├── Ad.md
├── COTA.md
├── LICENSE
├── README.md
├── _images
│   ├── Faiss.jpg
│   ├── RankingFlow.jpg
│   ├── Under-the-Hood_Stills_Final.jpg
│   ├── ad_system_design.png
│   ├── architectural_ml_system.png
│   ├── building_blocks.png
│   ├── classic_setup.png
│   ├── de_ds_ml.png
│   ├── facebook-newsfeed-architecture.png
│   ├── faiss5.png
│   ├── image1-2-768x777.png
│   ├── image2.png
│   ├── image3.png
│   ├── image4-1-768x484.png
│   ├── image4-768x309.png
│   ├── image5-1-768x410.png
│   ├── image5-768x399.png
│   ├── image5.png
│   ├── image6-2-768x492.png
│   ├── image8.png
│   ├── nearest_neighbors.png
│   ├── offline_training_feature_store.png
│   ├── online_serving_feature_store.png
│   ├── pull_model.png
│   └── push_model.png
├── newsfeed.md
├── platform.md
└── ranking.md
/ANN.md:
--------------------------------------------------------------------------------
1 |
2 | # Using approximate nearest neighbor search in real-world applications
3 |
4 |
5 | > Similarity search
6 |
7 | - Traditional databases are made up of structured tables containing symbolic information.
8 | - For example, an image collection would be represented as a table with one row per indexed photo. Each row contains information such as an image identifier and descriptive text.
9 | - Rows can be linked to entries from other tables as well, such as an image with people in it being linked to a table of names.
10 | - Trained text embedding models (e.g., word2vec) or convolutional neural networks (CNNs) can generate high-dimensional vectors. These representations are much more powerful and flexible than a fixed symbolic representation, as we’ll explain in this post.
11 | - These new representations cannot be queried with SQL. First, the huge inflow of new multimedia items creates billions of vectors. Second, finding similar entries means finding similar high-dimensional vectors, which is inefficient, if not impossible, with standard query languages.
12 |
13 | Most often, we are interested in finding the most similar vectors. This is called k-nearest neighbor (KNN) search or similarity search and has all kinds of useful applications.
14 |
15 | However, nearest neighbor search is only one part of the process for many applications. In search and recommendation applications, the candidates from the KNN search are often combined with other facets of the query or request, such as some form of filtering, to refine the results.
16 |
17 | > The solution is to integrate the nearest neighbor search with filtering: https://github.com/vespa-engine/vespa
18 |
19 |
20 | ### Finding the (approximate) nearest neighbors
21 |
22 | - The representations can be visualized as points in a high-dimensional space, even though it’s difficult to envision a space with hundreds of dimensions.
23 | - We can use various distance metrics to measure the likeness or similarity between them. Examples are the dot (or inner) product, cosine angle, and Euclidean distance.
24 |
25 | 
26 |
27 | - Finding the nearest neighbors of a point is straightforward: just compute the similarity between the point and all other points using the distance metric (see the brute-force sketch after this list). Unfortunately, this approach doesn’t scale very well, particularly in time-critical settings such as online serving, where you have a large number of points to consider.
28 | - There are no known exact methods for finding nearest neighbors efficiently, so a good-enough solution for many applications is to trade accuracy for efficiency. In approximate nearest neighbor (ANN) search, we build index structures that narrow down the search space.
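For concreteness, here is a minimal brute-force KNN sketch in NumPy — the O(n·d)-per-query baseline the bullets above describe, before any ANN index is introduced. The array names and sizes are illustrative, not from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.random((10_000, 128), dtype=np.float32)  # 10k points, 128 dims
query = rng.random(128, dtype=np.float32)

# Euclidean distance from the query to every corpus vector: O(n*d) per query,
# which is exactly what fails to scale in time-critical online serving.
distances = np.linalg.norm(corpus - query, axis=1)
k = 5
nearest_ids = np.argsort(distances)[:k]
print(nearest_ids, distances[nearest_ids])
```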
29 |
30 | You can roughly divide the approaches used for ANN by whether or not they can be implemented using an inverted index.
31 |
32 | > The inverted index originates from information retrieval and is comparable to the index found at the back of many books. This index points from a word (or term) to the documents containing it.
33 |
34 | - Using k-means clustering, one can cluster all points and index them by which cluster they belong to.
35 | - A related approach is product quantization, which splits the vectors into products of lower-dimensional spaces. Yet another is locality-sensitive hashing, which uses hash functions to group similar vectors together. These approaches index the centroids or buckets.
36 | - A method that is not compatible with inverted indexes is HNSW (hierarchical navigable small world). HNSW is based on graph structures, is very efficient, and lets the graph be built incrementally at runtime (see the sketch below). This is in contrast to most other methods, which require offline, batch-oriented index building.
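As a hedged illustration of that incremental property, here is a small sketch using the HNSW index from Faiss (introduced later on this page); the dimension and `M` parameter are arbitrary example values, not tuned recommendations.

```python
import faiss
import numpy as np

d = 128
xb = np.random.random((10_000, d)).astype('float32')

# Graph-based HNSW index: no .train() step is required, and vectors can keep
# being added at runtime, unlike batch-built inverted-index approaches.
index = faiss.IndexHNSWFlat(d, 32)   # M = 32 controls graph connectivity
index.add(xb)

D, I = index.search(xb[:3], k=5)     # distances and ids of 5 nearest neighbors
```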
37 |
38 | > A good overview of tradeoffs for these can be found at: https://github.com/erikbern/ann-benchmarks/
39 |
40 |
41 | ### Nearest neighbors in search and recommendation
42 |
43 | In many applications, such as search and recommendation, the results of the nearest neighbor search are combined with additional facets of the request.
44 |
45 | > Text search
46 |
47 | Modern text search increasingly uses representation vectors, often called text embeddings or embedding vectors. Word2vec was an early example. More recently, sophisticated language-understanding models such as BERT and other Transformer-based models are increasingly used. These are capable of assigning different representations to a word depending on its context. For text search, the current state of the art uses different models to encode query vectors and document vectors. These representations are trained so that the inner product of the vectors is maximized for relevant results.
48 |
49 | > Recommendation
50 |
51 | It’s essential to learn the interests or preferences of the user. Such user profiles are represented by one or more vectors, as are the items that should be recommended. These vectors are often generated using some form of collaborative filtering. One method is matrix factorization, where the maximum inner product is used as the distance function. The problem of filtering is more evident for recommendation systems than for text search: the more numerous and restrictive the filters, the greater the probability that items retrieved from the ANN search are filtered away.
52 |
53 | > Serving ads
54 |
55 | Given a user profile and a context such as a search query or page content, the system should provide an advertisement relevant to the user. The advertisements are stored with advertiser-specific rules, for instance, who the ad or campaign should target.
56 |
57 |
58 | ### Faiss
59 |
60 | Facebook AI Similarity Search (Faiss) is a library that allows us to quickly search for multimedia documents that are similar to each other — a challenge where traditional query search engines fall short. Its nearest-neighbor search implementations for billion-scale data sets are some 8.5x faster than the previously reported state of the art.
61 |
62 | 
63 |
64 | - Faiss provides several similarity search methods that span a wide spectrum of usage trade-offs.
65 | - Faiss is optimized for memory usage and speed.
66 | - Faiss offers a state-of-the-art GPU implementation for the most relevant indexing methods.
67 |
68 |
69 | #### How FAISS Makes Search Efficient
70 |
71 | - The first of those efficiency savings comes from efficient usage of the GPU, so the search can process calculations in parallel.
72 | - Additionally, FAISS implements three additional steps in the indexing process. A preprocessing step, followed by two quantization operations — the **coarse** quantizer for inverted file indexing (IVF), and the **fine** quantizer for vector encoding.
73 |
74 |
75 | #### Preprocessing
76 |
77 | The very first step is to transform the raw vectors into a more friendly/efficient format. FAISS offers several options here (combined via index factory strings in the sketch after this list).
78 |
79 | - PCA : use principal component analysis to reduce the number of dimensions in our vectors.
80 | - L2norm : L2-normalize our vectors.
81 | - OPQ : rotates our vectors so they can be encoded more efficiently by the fine quantizer — if using product quantization (PQ).
82 | - Pad : pads input vectors with zeros up to a given target dimension.
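A hedged sketch of how these preprocessing options are typically expressed: Faiss `index_factory` strings that chain a vector transform, a coarse quantizer, and an encoding. The specific parameter values are illustrative, not recommendations.

```python
import faiss

d = 128
# Each factory string reads left to right: preprocessing, IVF, encoding.
idx_pca = faiss.index_factory(d, "PCA64,IVF256,Flat")   # PCA down to 64 dims
idx_opq = faiss.index_factory(d, "OPQ16,IVF256,PQ16")   # OPQ rotation before PQ
idx_l2n = faiss.index_factory(d, "L2norm,IVF256,Flat")  # L2-normalize vectors
```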
83 |
84 | #### Inverted File Indexing
85 |
86 | The next step is our inverted file (IVF) indexing process. Again, there are multiple options — but each aims to partition the data into similar clusters.
87 |
88 | This means that when we query FAISS, and our query is converted into a vector — it will be compared against these partition/cluster centroids.
89 |
90 | 
91 |
92 | Figure from:
93 |
94 |
95 | We compute the similarity metric between our query vector and each of these centroids — and once we find the nearest centroid (nprobe = 1), we access all of the full vectors within that centroid’s cell (and ignore all others).
96 |
97 | > The nprobe attribute defines how many nearby cells to search.
98 |
99 | Immediately, we have significantly reduced the required search area — reducing complexity and speeding up the search.
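A minimal sketch of this coarse-quantizer flow with Faiss’s `IndexIVFFlat`; the `nlist` and `nprobe` values here are arbitrary examples, and in practice they are tuned to trade recall against latency.

```python
import faiss
import numpy as np

d = 128
xb = np.random.random((100_000, d)).astype('float32')

nlist = 256                              # number of partitions (cells)
quantizer = faiss.IndexFlatL2(d)         # coarse quantizer over the centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)                          # k-means learns the nlist centroids
index.add(xb)

index.nprobe = 8                         # probe only the 8 nearest cells
D, I = index.search(xb[:5], k=10)
```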
100 |
101 |
102 | #### Vector Encoding
103 |
104 | This encoding process is carried out by our fine quantizer. The goal here is to reduce index memory size and increase search speed.
105 |
106 | There are several options:
107 |
108 | - Flat : Vectors are stored as is, without any encoding.
109 | - PQ : Applies product quantization.
110 | - SQ : Applies scalar quantization.
111 |
112 | It’s worth noting that even with the Flat encoding, FAISS is still going to be very fast.
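To make the fine-quantizer step concrete, here is a hedged sketch combining IVF with product quantization via `IndexIVFPQ`; again, all parameter values are illustrative.

```python
import faiss
import numpy as np

d = 128
xb = np.random.random((100_000, d)).astype('float32')

quantizer = faiss.IndexFlatL2(d)
# Each vector is compressed into m = 16 sub-codes of 8 bits apiece,
# shrinking index memory substantially at some cost in accuracy.
index = faiss.IndexIVFPQ(quantizer, d, 256, 16, 8)  # nlist=256, m=16, nbits=8
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], k=10)
```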
113 |
114 | All of these steps and improvements combine to create an incredibly fast similarity search engine — which on GPU is still unbeaten.
115 |
116 |
117 | > Facebook AI Similarity Search (Faiss): The Missing Manual: https://www.pinecone.io/learn/faiss-tutorial/
118 |
119 |
120 | ## References
121 |
122 | - [Faiss: A library for efficient similarity search](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/)
123 | - [Using approximate nearest neighbor search in real world applications](https://towardsdatascience.com/using-approximate-nearest-neighbor-search-in-real-world-applications-a75c351445d)
124 | - [Facebook AI Similarity Search](https://towardsdatascience.com/facebook-ai-similarity-search-7c564daee9eb)
--------------------------------------------------------------------------------
/Ad.md:
--------------------------------------------------------------------------------
1 |
2 | # Ad Click Prediction for Social Networks
3 |
4 |
5 | ## Requirements & Goals
6 |
7 | ### Functional requirements
8 |
9 | - Build a machine learning model to predict if an ad will be clicked. For simplicity, we will not focus on the cascade of classifiers that is commonly used in adtech.
10 | - ML model with good performance
11 |
12 | ### Non-functional requirements
13 |
14 | - System can scale to a larger number of users with low latency.
15 | - Imbalanced data: you can assume the Click-Through Rate (CTR) is very small in practice (1%-2%).
16 | - Serving: from the Real-Time Bidding (RTB) workflow diagram, it's important to have low latency (150 ms) for ad prediction.
17 |
18 |
19 | ### Calculate and estimation
20 |
21 | - Assumptions: 4K ad requests per second, which is about 10 billion ad requests per month.
22 | - Data: historical ad clicks data includes `[user, ads, click_or_not]`. With an estimated 1% CTR, it has 100 million clicked ads. We can start with 1 month of data for training and validation.
23 | - Train/validation data split: We split train/validation data so as to simulate the actual online system, for example, by splitting on time.
24 | - Features: naturally, the model needs to have enough capacity to learn patterns from big training data. In practice, it's common to have hundreds or even thousands of features.
25 | - Training: ability to retrain many times within one day to increase model performance in an online manner.
26 | - Serving: latency within 150ms per request at 4K requests per second.
27 | - Number of predictions: about a million per second (each ad request requires scoring many candidate ads).
28 |
29 |
30 | ### Metrics evaluation
31 |
32 | - During the training phase, we can focus on machine learning metrics instead of revenue or CTR metrics: offline metrics guide training, while revenue-related (online) metrics are usually monitored during deployment.
33 | - Normalized Cross-Entropy (NCE): the predictive log loss divided by the cross-entropy of the background CTR. This way, NCE is insensitive to the background CTR.
34 | - Calibration: the ratio of expected clicks (the sum of predicted probabilities) to actually observed clicks (see the sketch below).
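A minimal sketch of both metrics, assuming the common definitions (e.g., as in Facebook’s published CTR-prediction work); the function names and toy data are illustrative.

```python
import numpy as np

def normalized_cross_entropy(y_true, y_pred):
    """Log loss normalized by the entropy of the background CTR.
    Values below 1.0 mean the model beats a constant-CTR predictor."""
    eps = 1e-12
    p = np.clip(y_pred, eps, 1 - eps)
    log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    ctr = np.mean(y_true)  # background CTR
    background = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return log_loss / background

def calibration(y_true, y_pred):
    # Expected clicks (sum of predicted probabilities) over observed clicks;
    # a well-calibrated model is close to 1.0.
    return np.sum(y_pred) / np.sum(y_true)

y_true = np.array([0, 0, 0, 1])
y_pred = np.array([0.10, 0.20, 0.05, 0.60])
print(normalized_cross_entropy(y_true, y_pred), calibration(y_true, y_pred))
```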
35 |
36 | ### Modeling
37 |
38 | - Model: We can use a probabilistic sparse linear classifier (logistic regression). It's popular because of its computational efficiency and its ability to handle sparse features.
39 | - Feature engineering: advertiserID: it's easy to have millions of advertisers. One common way is to use an embedding as a distributed representation for advertiserID.
40 | - Data processing: One way is subsampling the majority (negative) class at different sub-sampling ratios; predictions must then be recalibrated, as sketched below. The key here is ensuring that the validation dataset has the same distribution as the test dataset.
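If negatives are downsampled at rate w, the model’s predicted probabilities are biased upward and need correction before serving. The sketch below uses the widely cited recalibration formula q = p / (p + (1 − p)/w) (e.g., from Facebook’s CTR paper); the numbers are only an example.

```python
def recalibrate(p: float, w: float) -> float:
    """Correct a prediction p from a model trained on negative-downsampled
    data, where w is the fraction of negatives kept (0 < w <= 1)."""
    return p / (p + (1 - p) / w)

# Keeping 10% of negatives (w = 0.1): a raw score of 0.5 maps back to
# roughly a 9% click probability on the true distribution.
print(recalibrate(0.5, 0.1))  # ~0.0909
```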
41 |
42 | ### Model deployment and testing
43 |
44 | - During the deployment phase, it's crucial to monitor the actual CTR and other revenue-related metrics.
45 | - Related to this topic, read more about A/B testing and multi-armed bandits.
46 |
47 | > A/B testing: compares the performance of two versions of content to see which one appeals more to visitors/viewers.
48 | > multi-armed bandit: dynamically allocates traffic to variations that are performing well, while allocating less traffic to underperforming variations
49 |
50 |
51 | ## High-level system design
52 |
53 |
54 |
55 | 
56 |
57 |
58 | It’s challenging to train models every few hours so that production always uses up-to-date data. Furthermore, those models need to be easy to improve through feature selection and hyperparameter tuning. This requires the ability to run both offline and online tests.
59 |
60 |
--------------------------------------------------------------------------------
/COTA.md:
--------------------------------------------------------------------------------
1 |
2 | # COTA: Improving Uber Customer Care with NLP & Machine Learning
3 |
4 | Customer Obsession Ticket Assistant (COTA) is a tool that uses machine learning and natural language processing (NLP) techniques to help agents deliver better customer support.
5 |
6 | When customers contact Uber for support, it is important that we route them to the best possible resolution in a timely manner. One way to facilitate this is to have users click through a hierarchy of issue types when they report an issue; this provides our agents with additional context around the issue, thereby enabling them to solve it more quickly, as detailed below:
7 |
8 | 
9 |
10 |
11 | ## Requirements & Goals
12 |
13 | ### Functional requirements
14 |
15 | - Making customer support easier and more accessible.
16 | - Determine the issue type
17 | - Identify the right resolution
18 |
19 |
20 | ### Non-functional requirements
21 |
22 | - Hundreds of thousands of tickets daily
23 | - 600+ cities worldwide
24 | - Supporting multiple languages
25 | - Handle an ever-increasing volume and diversity of support tickets
26 |
27 |
28 | ## High level system design
29 |
30 | - Built on top of customer support platform
31 | - Suggest the three most likely issue types and solutions based on ticket content and trip context
32 |
33 | 
34 |
35 |
36 | ### COTA Architecture
37 |
38 | 1. Once a new ticket enters the customer support platform (CSP), the back-end service collects all relevant features of the ticket.
39 | 2. The back-end service then sends these features to the machine learning model in Michelangelo.
40 | 3. The model predicts scores for each possible solution.
41 | 4. The back-end service receives the predictions and scores, and saves them to our Schemaless data store.
42 | 5. Once an agent opens a given ticket, the front-end service triggers the back-end service to check if there are any updates to the ticket. If there are no updates, the back-end service will retrieve the saved predictions; if there are updates, it will fetch the updated features and go through steps 2-4 again.
43 | 6. The back-end service returns the list of solutions ranked by the predicted score to the frontend.
44 | 7. The top three ranked solutions are suggested to agents; from there, agents make a selection and resolve the support ticket.
45 |
46 | To accomplish this goal, the machine learning model leverages features extracted from:
47 | - customer support messages
48 | - trip information
49 | - and customer selections in the ticket issue submission hierarchy outlined earlier.
50 |
51 |
52 | ### NLP pipeline
53 |
54 | Uber built a NLP pipeline to transform text across several different languages into useful features
55 |
56 | 
57 |
58 | - Preprocessing
59 |
60 | Cleaning the text by removing HTML tags, tokenizing the message's sentences, removing stopwords, and conducting lemmatization to convert words in different inflected forms into the same base form
61 |
62 | - Topic modeling: TF-IDF and LSA (Latent Semantic Analysis) to extract topics from rich text data
63 |
64 | - Feature engineering
65 |
66 | Topic modeling enables us to directly use the topic vectors as features to perform downstream classifications for issue type identification and solution selection.
67 |
68 | - With a very high-dimensional feature space and a large amount of data to process, training these models becomes challenging (a minimal sketch of the topic-modeling step follows this list)
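A hedged, toy-scale sketch of the TF-IDF + LSA step using scikit-learn. The example tickets and component count are made up; Uber’s real pipeline adds the multilingual preprocessing described above and far more components.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Toy ticket messages; real preprocessing (stopwords, lemmatization) omitted.
tickets = [
    "driver cancelled my trip and I was still charged",
    "how do I update the credit card on my account",
    "the app crashed while I was requesting a ride",
]

# TF-IDF followed by truncated SVD is the classic LSA recipe; 2 components
# here only because the toy corpus is tiny.
lsa = make_pipeline(TfidfVectorizer(stop_words="english"),
                    TruncatedSVD(n_components=2, random_state=0))
topic_vectors = lsa.fit_transform(tickets)  # dense features for classifiers
print(topic_vectors.shape)                  # (3, 2)
```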
69 |
70 |
71 | ### Pointwise ranking
72 |
73 | - Combines cosine similarity features with other ticket and trip features to match tickets to solutions
74 | - With over 1,000 possible solutions across hundreds of ticket types, the large solution space makes it challenging for the ranking algorithm to distinguish the fine differences between solutions
75 | - Learning to rank
76 |
77 | One solution:
78 |
79 | - Label the correct match between solution and ticket pair as **positive**
80 | - Sample a random subset of solutions that do not match with the ticket and label the pairs **negative**
81 | - Binary classification, e.g., a random forest (see the sketch below)
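A small sketch of this pointwise scheme: positive labels for true (ticket, solution) matches, sampled negatives otherwise, then a random forest scores every candidate solution at serving time. Feature contents, shapes, and labels here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical (ticket, solution) pair features, e.g., cosine similarity of
# their topic vectors plus trip features; sizes are illustrative.
n_pairs, n_features = 1_000, 8
X = rng.random((n_pairs, n_features))
y = rng.integers(0, 2, n_pairs)  # 1 = correct match, 0 = sampled negative

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# At serving time, score every candidate solution for one ticket and rank.
candidates = rng.random((10, n_features))
scores = clf.predict_proba(candidates)[:, 1]
top3 = np.argsort(scores)[::-1][:3]       # suggest the top three to agents
print(top3)
```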
82 |
83 |
84 | How did Uber measure whether COTA was successful at handling customer issues?
85 |
86 | - Uber performed A/B test experiments online on English language tickets.
87 | - A few key metrics were tracked, including model accuracy, average handle time, and customer satisfaction score
88 |
89 |
90 | ### Moving to COTA v2 with deep learning
91 |
92 | - Building a Spark-based deep learning pipeline to productize the second generation, COTA v2
93 | - Given that model performance decays over time, we also built a model management pipeline to automatically retrain models to keep them up to date
94 |
95 |
96 | COTA v1 (a) was built with topic-modeling-based traditional NLP and machine learning techniques that incorporate a mixture of textual, categorical, and numerical features, as shown below:
97 |
98 | 
99 |
100 | COTA v2 (b) supports a deep learning architecture with a mixture of input features.
101 |
102 | - NLP Pipeline was built to process incoming ticket messages.
103 | - Topic modeling was used to extract feature representation from the text feature.
104 | - The text feature goes through typical NLP preprocessing such as text cleaning and tokenization, and each word in the ticket is encoded using an embedding layer to convert the word to a dense representation.
105 | - Categorical features are encoded using an embedding layer to capture the closeness between different categories.
107 | - Numerical features are batch normalized to stabilize the training process.
108 | - Deep learning can improve the solution’s top-1 prediction accuracy by 16 percent (from 49 percent to 65 percent) for the Contact Type model, and 8 percent (from 47 percent to 55 percent) for the Reply model compared to COTA v1.
109 |
110 | ### Challenges with COTA V2
111 |
112 | - To leverage Spark for the NLP transformations in a distributed fashion
113 | - Spark computations are typically done using CPU clusters
114 | - Deep learning training runs more efficiently on a GPU-based infrastructure
115 | - The solution is to use both Spark transformations and GPU training, and to build a unified pipeline for training and serving the deep learning model
116 |
117 |
118 | ### COTA v2’s deep learning Spark Pipeline
119 |
120 | - Assign tasks to CPUs and GPUs based on which hardware would be most efficient.
121 | - Splitting the pipeline into two stages, one for Spark pre-processing and one for deep learning, seemed like the best way of allocating the workload.
122 | - By extending the concept of a Spark Pipeline, we can serve models for both batch prediction and real-time prediction services using our existing infrastructure.
123 |
124 |
125 | 
126 |
127 |
128 | > Training
129 |
130 | Model training is split into two stages:
131 |
132 | - Pre-processing transformations using Spark: All the transformations performed on the data during pre-processing are saved as Spark transformers, which are then used to build a Spark Pipeline for serving. We compute both fitted transformations (transformations that require persisting data, e.g., StringIndexer) and non-fitted transformations (e.g., cleaning up HTML tags from strings etc.) in the Spark cluster.
133 | - Deep learning training using TensorFlow: We leverage the pre-processed data to train the deep learning model using TensorFlow. The trained model from this stage is then merged with the Spark Pipeline generated in step (1). This produces the final Spark Pipeline encompassing the pre-processing transformers and the TensorFlow model, which can be used to run predictions. We are able to combine the Spark Pipeline with the TensorFlow model by implementing a special type of transformer called TFTransformer, which brings the TensorFlow model into Spark (a generic sketch of the pattern follows below).
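TFTransformer is Uber-internal, so the following is only a generic PySpark sketch of the underlying pattern: non-fitted and fitted transformations bundled into a single Pipeline whose serialized form can later gain an inference stage. All data and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StringIndexer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("trip issue", "cancelled trip refund please"),
     ("account issue", "update my payment method")],
    ["ticket_type", "message"])

# Tokenizer is a non-fitted transformation (a pure function of its input);
# StringIndexer is a fitted one (it must persist a learned label mapping).
pipeline_model = Pipeline(stages=[
    Tokenizer(inputCol="message", outputCol="tokens"),
    StringIndexer(inputCol="ticket_type", outputCol="type_idx"),
]).fit(df)

pipeline_model.transform(df).show()
# Uber's TFTransformer appends the TensorFlow model as one more stage of
# such a pipeline, so pre-processing and inference ship as one artifact.
```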
134 |
135 |
136 | > Serving
137 |
138 | - The Spark Pipeline built from training contains both pre-processing transformers and TensorFlow transformations.
139 | - We extended Michelangelo to support serving generic Spark Pipelines, and utilized the existing deployment and serving infrastructure to serve the deep learning model. The pipeline used for serving runs on a Java Virtual Machine (JVM).
140 | - Serving latency is p95 < 10 ms, which demonstrates the low-latency advantage of using an existing JVM serving infrastructure for deep learning models.
141 | - By extending Spark Pipelines to encapsulate deep learning models, we were able to leverage the best of both CPU and GPU-driven worlds: 1) the distributed computation of Spark transformations and low-latency serving of Spark Pipelines using CPUs and 2) the acceleration of deep learning model training using GPUs.
142 |
143 |
144 | ### Model lifecycle management Pipeline: keeping models fresh
145 |
146 | To prevent COTA v2 model performance from decaying over time, we built a model lifecycle management Pipeline (MLMP) on top of our DLSP.
147 |
148 | 
149 |
150 | Consisting of six jobs in total, it uses the existing APIs from Michelangelo to retrain the model. These jobs form a directed acyclic graph (DAG) with dependency indicated by the arrows:
151 |
152 | 1. Data ETL: This involves writing a data extraction, basic transformation, and loading (ETL) job to prepare data. It typically pulls data from several different data sources, converting it into the right format and putting it into a Hive database.
153 |
154 | 2. Spark Transformation: This step transforms raw data (textual, categorical, numerical, etc.) into Tensor format so that it can be consumed by a TensorFlow graph for model training.
155 |
156 | 3. Data Transfer: CPU clusters perform the Spark transformations, while deep learning training requires GPUs, so this step transfers the transformed data to the GPU clusters.
157 |
158 | 4. Deep Learning Training: Once the data is transferred to the GPU clusters, a job is triggered to open a GPU session with a custom Docker container and start the deep learning training process.
159 |
160 | 5. Model Merging: The Spark transformers from Step 2 and the TensorFlow model from Step 4 are merged to form the final model.
161 |
162 | 6. Model Deployment: The final model is deployed, and a `model_id` is generated as a reference to the newly deployed model. External microservices can hit the endpoint using the serving framework of `Thrift` by referencing the `model_id`.
163 |
164 |
165 |
166 | ## References
167 |
168 | - [COTA: Improving Uber Customer Care with NLP & Machine Learning](https://eng.uber.com/cota/)
169 | - [Scaling Uber’s Customer Support Ticket Assistant (COTA) System with Deep Learning](https://eng.uber.com/cota-v2/)
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Fei Ding
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Machine Learning System Design : An interview framework
3 |
4 | Interviewers will generally ask you to design a machine learning system for a particular task. This question is usually broad.
5 |
6 | - The first thing you need to do is to **ask questions to narrow down the scope** of the problem and ensure your system’s requirements.
7 | - You should also ask questions about **performance and capacity considerations** of the system.
8 |
9 | 
10 |
11 | Figure from: [Pooyan Jamshidi, UofSC, CSCE 585: Machine Learning Systems](https://pooyanjamshidi.github.io/mls/lectures/)
12 |
13 |
14 | 
15 |
16 | Figure from:
17 |
18 |
19 | ## Overview
20 |
21 | 1. Clarify Requirements
22 | 2. How the ML system fits into the overall product backend
23 | 3. Data Related Activities
24 | 4. Model Related Activities
25 | 5. Scaling
26 |
27 | ## Requirements & Goals
28 |
29 | ### Functional requirements
30 |
31 | > What is the goal? Any secondary goal?
32 |
33 | - e.g. for CTR, maximizing the number of clicks
34 | - A secondary goal might be the quality of the ads/content
35 |
36 | > Batch prediction
37 | - Hourly, weekly, etc
38 | - Processing accumulated data when you don't need immediate results, e.g. recommendation
39 | - High throughput
40 |
41 | > Online prediction
42 | - As soon as requests come
43 | - When predictions are needed as soon as data sample is generated e.g., fraud detection
44 | - Low latency
45 |
46 |
47 | ### Non-functional requirements
48 |
49 | 1. Reliability: when an ML system fails, it may not raise an error but instead silently return garbage outputs
50 | 2. Scalability: ask questions about the scale of the system - how many users, how much content?
51 | 3. Maintainability: the data distribution might change; how do we re-train and update the model?
52 | 4. Adaptability: new data with added features, or changes in business objectives - the system should be flexible
53 |
54 | ## Detailed Design
55 |
56 | ### Data Related Activities
57 |
58 | 1. Data exploration - what does the dataset look like?
59 | 2. Understand different features and their relationship with the target
60 | - Is the data balanced?
61 | - Are there missing values (less of an issue for some tree-based models)?
62 | - Are there unexpected values in one or more data columns? How do you know if it's a typo, and should you ignore it?
63 | 3. Feature Importance - partial dependency plot, SHAP values
64 | 4. ML Pipeline - Think of Data ingestion services/storage
65 | 5. ML Pipeline - Feature Engineering : encoding categorical features, embedding generation, etc.
66 | 6. ML Pipeline - Data split : train set, validation set, test set
67 |
68 | > Embeddings enable us to encode entities (e.g., words, docs, images, person) in a low-dimensional vector space in order to capture their semantic information.
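A tiny sketch of what an embedding boils down to operationally: a learned lookup table from entity id to dense vector, where geometric closeness stands in for semantic closeness. The table here is random for illustration; real systems learn the weights (word2vec, a neural network, etc.).

```python
import numpy as np

# Entity id -> row in the embedding table. Toy vocabulary and dimension.
vocab = {"user_42": 0, "doc_7": 1, "word_cat": 2}
emb_dim = 4
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), emb_dim))

vec = embedding_table[vocab["word_cat"]]  # the entity's dense representation

def cosine(a, b):
    # Semantic similarity becomes a geometric computation.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vec, embedding_table[vocab["doc_7"]]))
```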
69 |
70 | ### Model Related Activities
71 |
72 | 1. ML Pipeline - Model Train and Evaluation: How to select a model and hyperparameters?
73 | 2. ML Pipeline - Model Train and Evaluation: Once the model is built, analyze the bias-variance tradeoff; it will give you an idea of overfitting vs. underfitting, and of which approaches to take to make your model better.
74 | 3. Draw the ML pipeline
75 |
76 | 
77 |
78 | 4. Model Debug
79 | 5. Model Deployment
80 | 6. ML Pipeline - Performance Monitoring: Metrics
81 | - AUC, F1, MSE, Accuracy, NDCG for ranking problems etc.
82 | - When to use which metrics?
83 |
84 | You should carefully choose your system’s performance metrics for both online and offline testing. These metrics will differ depending on the problem your system is trying to solve.
85 |
86 | - For example, if you are performing binary classification, you will use the following offline metrics: Area Under Curve (AUC), log loss, precision, recall, and F1-score.
87 | - When deciding on online metrics, you may need both component-wise and end-to-end metrics.
88 | - **Component-wise metrics** are used to evaluate the performance of ML systems that are plugged in to and used to improve other ML systems.
89 | - **End-to-end metrics** evaluate a system’s performance after an ML model has been applied. For example, a metric for a search engine would be the users’ engagement and retention rate after your model has been plugged in.
90 |
91 | > Offline experiment
92 |
93 | - Model training challenge: different use cases in different teams, with different features and different model settings
94 | - Model training steps should be configurable by using configuration objects, formalizing the testing protocol, and keeping track of experiment results
95 | - Evaluation metrics (a quick sklearn sketch follows this list)
96 | - TP, FP, TN, FN
97 | - Confusion matrix
98 | - Accuracy, Precision, Recall/Sensitivity, Specificity, F-score (how do you choose among these? imbalanced datasets)
99 | - ROC curve (TPR vs FPR, threshold selection)
100 | - AUC (model comparison)
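A minimal sketch computing the listed metrics with scikit-learn on toy labels and scores; the 0.5 cutoff is only an example of the threshold-selection point above.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.2, 0.6, 0.8, 0.4, 0.1, 0.9])  # model probabilities
y_pred  = (y_score >= 0.5).astype(int)               # threshold selection

print(confusion_matrix(y_true, y_pred))              # [[TN, FP], [FN, TP]]
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))                      # useful on imbalanced data
print(roc_auc_score(y_true, y_score))                # threshold-free comparison
```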
101 |
102 |
103 | ### Architecture
104 |
105 | The next step is to design your system’s architecture. You need to think about the system’s components and how the data will **flow through those components**. In this step, you aim to design a model that can scale easily.
106 |
107 | 
108 |
109 |
110 | ### Model Serving
111 |
112 | - Embedded model
113 | - Model deployed as a separate service
114 | - Model published as data
115 |
116 |
117 | ### Testing and Quality in ML
118 |
119 | - Unit tests to check that features are calculated correctly: are numeric features normalized, one-hot vectors well-formed, and missing values handled?
120 | - Test that the exported model still produces the same results (offline vs online)
121 | - Validating the model quality, collecting and monitoring metrics
122 | - Validating model bias and fairness
123 | - Integration test: distribute a **holdout dataset** along with the model, and allow the model's performance to be reassessed against the holdout dataset after it is integrated
124 |
125 |
126 | ### Model deployment
127 |
128 | - Multiple models: more than one model performing the same task
129 | - Shadow models: deploy the new model side-by-side with the current one, sending the same production traffic to gather data on how the shadow model performs before promoting it into production
130 | - Competing models: (1) multiple versions of the model in production - like an A/B test, it takes some time to gather enough data to make statistically significant decisions; (2) an alternative for evaluating multiple competing models is multi-armed bandits
131 | - Online learning models: constantly learning in production, extra complexities, version model and data.
132 |
133 |
134 | ### Orchestration in ML pipelines
135 |
136 | - Provisioning of infrastructure and the execution of the ML pipelines to train models and capture metrics
137 | - Building, testing, and deploying data pipelines
138 | - Testing and validation to decide which models to promote
139 | - Provisioning of infrastructure and deployment of models to production
140 |
141 |
142 | ## Scaling
143 |
144 | To build a scalable system, your design needs to efficiently deal with a **large and continually increasing** amount of data.
145 |
146 | - Scaling for increased demand (same as in distributed systems): scaling web app and serving system, data partitioning
147 | - Data parallelism
148 | - Model parallelism
149 |
150 | For instance, an ML system that displays relevant ads to users can’t process every ad in the system at once. You could use the **funnel approach**, wherein each stage has fewer ads to process. This will yield a scalable system that quickly determines relevant ads for users despite the increase in data.
151 |
152 |
153 | ## Real-world examples
154 |
155 | - [Machine Learning Platforms](./platform.md)
156 | - [Ad Click Prediction for Social Networks](./Ad.md)
157 | - [COTA: Improving Uber Customer Care with NLP & Machine Learning](./COTA.md)
158 | - [Facebook Newsfeed Architecture](./newsfeed.md)
159 | - [How machine learning powers Facebook’s News Feed ranking algorithm](./ranking.md)
160 | - [Using approximate nearest neighbor search in real world applications](./ANN.md)
161 |
162 |
163 |
164 | ## Reference
165 | - [Machine Learning System Design : A framework for the interview day](https://leetcode.com/discuss/interview-question/system-design/566057/Machine-Learning-System-Design-%3A-A-framework-for-the-interview-day)
166 | - [A Look at Machine Learning System Design](https://www.analyticsvidhya.com/blog/2021/01/a-look-at-machine-learning-system-design/)
167 | - [Cracking the machine learning interview: System design approaches](https://www.educative.io/blog/cracking-machine-learning-interview-system-design)
168 | - [Machine Learning for AdTech in Action with Cyrille Dubarry and Han Ju](https://www.slideshare.net/databricks/machine-learning-for-adtech-in-action-with-cyrille-dubarry-and-han-ju)
169 | - [Designing Computer Systems for Machine Learning](https://pooyanjamshidi.github.io/mls/lectures/)
170 |
--------------------------------------------------------------------------------
/_images/Faiss.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/Faiss.jpg
--------------------------------------------------------------------------------
/_images/RankingFlow.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/RankingFlow.jpg
--------------------------------------------------------------------------------
/_images/Under-the-Hood_Stills_Final.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/Under-the-Hood_Stills_Final.jpg
--------------------------------------------------------------------------------
/_images/ad_system_design.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/ad_system_design.png
--------------------------------------------------------------------------------
/_images/architectural_ml_system.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/architectural_ml_system.png
--------------------------------------------------------------------------------
/_images/building_blocks.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/building_blocks.png
--------------------------------------------------------------------------------
/_images/classic_setup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/classic_setup.png
--------------------------------------------------------------------------------
/_images/de_ds_ml.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/de_ds_ml.png
--------------------------------------------------------------------------------
/_images/facebook-newsfeed-architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/facebook-newsfeed-architecture.png
--------------------------------------------------------------------------------
/_images/faiss5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/faiss5.png
--------------------------------------------------------------------------------
/_images/image1-2-768x777.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/image1-2-768x777.png
--------------------------------------------------------------------------------
/_images/image2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/image2.png
--------------------------------------------------------------------------------
/_images/image3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/image3.png
--------------------------------------------------------------------------------
/_images/image4-1-768x484.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/image4-1-768x484.png
--------------------------------------------------------------------------------
/_images/image4-768x309.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/image4-768x309.png
--------------------------------------------------------------------------------
/_images/image5-1-768x410.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/image5-1-768x410.png
--------------------------------------------------------------------------------
/_images/image5-768x399.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/image5-768x399.png
--------------------------------------------------------------------------------
/_images/image5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/image5.png
--------------------------------------------------------------------------------
/_images/image6-2-768x492.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/image6-2-768x492.png
--------------------------------------------------------------------------------
/_images/image8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/image8.png
--------------------------------------------------------------------------------
/_images/nearest_neighbors.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/nearest_neighbors.png
--------------------------------------------------------------------------------
/_images/offline_training_feature_store.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/offline_training_feature_store.png
--------------------------------------------------------------------------------
/_images/online_serving_feature_store.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/online_serving_feature_store.png
--------------------------------------------------------------------------------
/_images/pull_model.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/pull_model.png
--------------------------------------------------------------------------------
/_images/push_model.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ifding/ml-system-design/01ef9c2ac6d76b66ea2721cf28ee73c9f2480426/_images/push_model.png
--------------------------------------------------------------------------------
/newsfeed.md:
--------------------------------------------------------------------------------
1 |
2 | # Facebook Newsfeed Architecture
3 |
4 |
5 | ## Requirements & Goals
6 |
7 | ### Functional requirements
8 |
9 | - Display updates from friends and followers for each specific user
10 |
11 |
12 | ### Non-functional requirements
13 |
14 | - Show content to the user according to **its relevance** to each particular user
15 |
16 |
17 |
18 | ## Overview of the System
19 |
20 | - It uses a `ranking algorithm` to determine what content to show.
21 | - If we describe the system at a higher level, the chain of actions starts when the user `adds or updates` a post on Facebook.
22 | - This post is received by the `web server`, which then sends it to `application servers`.
23 | - These application servers coordinate with the `back-end database` to generate a newsfeed.
24 | - The generated newsfeed is then sent for storage in a `cache feed`.
25 | - Whenever the user asks for a more recent feed, a `feed notification` is sent to the servers; the feed is then retrieved by the servers and sent to the users that requested it.
26 |
27 |
28 | 
29 | Figure from
30 |
31 |
32 | ### System Design
33 |
34 | - The most important entity is the `user`, who will be assigned a unique ID along with all the necessary information that is required to create an account (such as birthday, email, etc).
35 | - The other important entity is the `feed item`, which will be assigned a unique `feed_id`, along with content and metadata attributes that should support images and videos.
36 | - There are two main relationships: a `user - user` relation and a `feed item - media` relation. This is because users can be friends with or follow other users, and each feed item will correspond to different media sources.
37 |
38 |
39 | ### Feed Generation
40 |
41 | - The feed generation process would take quite a lot of time, especially when dealing with users with a large number of followers.
42 | - The feed can be pre-generated and stored in a `cache memory`, it allows for the faster retrieval of feed items, it also allows for the generation of the feed for offline users or users with poor internet connection.
43 |
44 |
45 | ### Feed Publishing
46 |
47 | - `Feed publishing` is the step where feed data is displayed according to each specific user.
48 | - This can be a costly action, as the user may have a large number of friends and followers.
49 | - To deal with this, feed publishing has three approaches:
50 |
51 | 1. push
52 | 2. pull
53 | 3. hybrid model
54 |
55 |
56 | > Pull model or Fan-out-on-load
57 |
58 | - When a user creates a post, nothing is sent out immediately; the feed is generated when a friend reloads their timeline, and the pull model then `stores` the generated feed in memory.
59 | - The most recent feed is only loaded when the user requests a recent feed.
60 | - This approach reduces the amount of `write operation` from the system database.
61 | - The downside of this approach is that the user will not be able to view recent feeds unless they issue a request to the server.
62 | - Another problem could be the increase in `read` or `load` operations from the server, which may fail to load a user's newsfeed.
63 |
64 | 
65 |
66 |
67 | > Push model or Fan-out-on-write
68 |
69 | - Once a user creates a post, the push model `pushes` or `sends` this post to all the followers immediately.
70 | - This prevents the system from having to go through a user's entire friends and follower list to check for updates on their posts published.
71 | - The `read operations` on the system database are significantly reduced.
72 | - The downside appears when a user has a large number of friends, as that results in an increased number of write operations to the database.
73 |
74 | 
75 |
76 |
77 | > Hybrid model
78 |
79 | - It combines the beneficial features of the above two approaches and tries to provide a balanced approach between the two.
80 | - One method for achieving that is to allow only users with fewer followers to use the push model. For users with a larger number of followers, a pull model is used.
81 | - This can result in saving a huge number of resources.
82 |
83 |
84 | ### Ranking Newsfeed
85 |
86 | - After the feed is generated, each feed item is ranked according to the relevance for each specific user.
87 | - Facebook utilizes an `edge rank algorithm` for ranking all the feed items in the newsfeed for a particular user. The `edge` refers to every small activity on Facebook, such as posts, likes, shares, etc.
88 | - The algorithm utilizes this feature of the network and ranks each edge connected to the user according to relevance.
89 | - Edges with higher ranks will usually be displayed on top of the feed for the user.
90 |
91 | The rank for each feed item in Facebook's edge rank algorithm is described by,
92 |
93 | ```
94 | Rank = Affinity x Weight x Decay
95 | ```
96 |
97 | - `Affinity`: the closeness of the user to the creator of the edge. If a user frequently likes, comments on, or messages the edge creator, the value of affinity will be higher, resulting in a higher rank for the post.
98 | - `Weight`: content with heavier weight increases the rank. For example, a comment has a higher weight than a like, so a post with more comments is more likely to get a higher rank.
99 | - `Decay`: a measure of the age of the edge. The older the edge, the lower the value of decay and, eventually, the rank.
100 |
101 | These factors combine to give a suitable rank to stories for each specific user (a toy scoring sketch follows below). Once the posts are ranked, they are sent to memory or retrieved directly from servers to be displayed in the newsfeed when the user requests it through the feed publishing process.
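A toy sketch of the multiplicative score above. The exponential time decay is an assumption for illustration — Facebook has not published the exact decay function — and all parameter values are made up.

```python
import math

def edge_rank(affinity: float, weight: float, age_hours: float,
              half_life_hours: float = 24.0) -> float:
    """Toy EdgeRank-style score: Affinity x Weight x Decay, with an
    assumed exponential time decay (half-life in hours)."""
    decay = math.exp(-math.log(2) * age_hours / half_life_hours)
    return affinity * weight * decay

# A fresh comment from a close friend outranks an old like from an acquaintance.
print(edge_rank(affinity=0.9, weight=2.0, age_hours=1))   # high score
print(edge_rank(affinity=0.2, weight=1.0, age_hours=48))  # low score
```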
102 |
103 |
104 | ### Summary
105 |
106 | The Facebook newsfeed system is constantly updated and changed to optimize performance. However, the core technologies underlying the newsfeed system remain the same.
107 |
108 |
109 |
110 | ## References
111 |
112 | - [A Dive into the Facebook Newsfeed Architecture](https://algodaily.com/lessons/dive-into-facebook-newsfeed-architecture)
113 |
--------------------------------------------------------------------------------
/platform.md:
--------------------------------------------------------------------------------
1 |
2 | # Machine Learning Platforms
3 |
4 |
5 | - Review principal components of an ML platform
6 | - Identify key challenges of scaling an ML pipeline to a large number of heterogeneous models
7 | - Define solutions that scale model training and serving
8 |
9 |
10 | ## Requirements & Goals
11 |
12 | ### Functional requirements
13 |
14 | - Enable engineers and data scientists across the company to easily build and deploy machine learning solutions at scale.
15 | - ML at Uber: Uber Eats, ETAs, Autonomous Cars, Customer Support, Dispatch, Personalization, Demand Modeling, Dynamic Pricing
16 |
17 | ### Non-functional requirements
18 |
19 | - Build a production-ready ML pipeline that guarantees reliability, resiliency, responsiveness, and elasticity
20 | - Standardize workflows and tools
21 | - Provide scalable support for end-to-end ML workflow
22 | - Democratize and accelerate machine learning through ease of use
23 |
24 | We designed Michelangelo specifically to provide scalable, reliable, reproducible, easy-to-use, and automated tools to address the following six-step workflow:
25 |
26 | 1. Manage data
27 | 2. Train models
28 | 3. Evaluate models
29 | 4. Deploy models
30 | 5. Make predictions
31 | 6. Monitor predictions
32 |
33 |
34 | ## Key Platform Components
35 |
36 |
37 | ### Feature Store & Feature Engineering
38 |
39 | > Feature store problem
40 |
41 | - The hardest part of ML is finding good features
42 | - Same features are often used by different models built by different teams
43 |
44 | > Feature store solution
45 |
46 | - Centralized feature store for collecting and sharing features
47 | - Platform team curates core set of widely applicable features
48 | - Modelers contribute more features as part of ongoing model building
49 | - Meta-data for each feature to track ownership, how computed, where used, etc
50 | - Modelers select features by name & join key. Offline & online pipelines autoconfigured
51 |
52 | > Functionality of feature store
53 |
54 | - Allow users to easily add features into a shared feature store
56 | - Easy to consume, both online and offline (a minimal interface sketch follows below)
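Michelangelo’s actual feature store API is internal to Uber, so the following is a purely hypothetical sketch of the core put/get contract the bullets describe; every name in it is illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FeatureStore:
    # In-memory stand-in; the real online store is Cassandra and the
    # offline store is HDFS/Hive.
    online: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def put(self, entity_id: str, features: Dict[str, float]) -> None:
        self.online.setdefault(entity_id, {}).update(features)

    def get(self, entity_id: str, names: List[str]) -> Dict[str, float]:
        row = self.online.get(entity_id, {})
        return {name: row.get(name) for name in names}

store = FeatureStore()
store.put("restaurant:123", {"avg_prep_time_7d": 14.2, "avg_prep_time_1h": 9.5})
print(store.get("restaurant:123", ["avg_prep_time_7d"]))
```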
56 |
57 |
58 | > Pipeline for offline training with Feature Store
59 |
60 | 
61 |
62 |
63 | > Pipeline for online serving with Feature Store
64 |
65 | 
66 |
67 |
68 | > Batch precompute:
69 |
70 | - Bulk-precompute and regularly load historical features from HDFS into Cassandra.
71 | - 'restaurant’s average meal preparation time over the last seven days.'
72 |
73 | > Near-real-time compute:
74 |
75 | - Publish relevant metrics to Kafka and then run Samza-based streaming compute jobs to generate aggregate features at low latency. These features are then written directly to Cassandra for serving and logged back to HDFS for future training jobs.
76 | - 'restaurant’s average meal preparation time over the last one hour.'
77 |
78 | 
79 |
80 |
81 | ### Scalable Model Training
82 |
83 | - Support offline, large-scale distributed training of decision trees, linear and logistic models, unsupervised models (k-means), time series models, and deep neural networks.
84 | - The distributed model training system scales up to handle billions of samples and down to small datasets for quick iterations.
85 | - A model configuration specifies the model type, hyper-parameters, data source reference, and feature DSL expressions, as well as compute resource requirements (the number of machines, how much memory, whether or not to use GPUs, etc.); a hypothetical configuration sketch appears after the figure below.
86 | - After the model is trained, performance metrics (e.g., ROC curve and PR curve) are computed and combined into a model evaluation report.
87 | - At the end of training, the original configuration, the learned parameters, and the evaluation report are saved back to our model repository for analysis and deployment.
88 | - In addition to training single models, Michelangelo supports hyper-parameter search for all model types as well as partitioned models. With partitioned models, we automatically partition the training data based on configuration from the user and then train one model per partition.
89 |
90 |
91 | 
92 | > Model training jobs use Feature Store and training data repository data sets to train models and then push them to the model repository.
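Michelangelo’s real configuration format is not public, so this is a hypothetical Python sketch mirroring only the fields named above.

```python
# Hypothetical model configuration; all keys and values are illustrative.
model_config = {
    "model_type": "gradient_boosted_trees",
    "hyper_parameters": {"num_trees": 200, "max_depth": 8},
    "data_source": "hive://warehouse/trip_features",
    "feature_dsl": ["@feature:restaurant.avg_prep_time_7d",
                    "log(@basis:trip_distance)"],
    "compute_resources": {"machines": 16, "memory_gb": 64, "use_gpu": False},
    "partition_by": "city_id",  # optional: train one model per partition
}
```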
93 |
94 |
95 | ### Evaluate models
96 |
97 | Models are often trained as part of a methodical exploration process to identify the set of features, algorithms, and hyper-parameters that create the best model for their problem.
98 |
99 | For every model that is trained in Michelangelo, we store a versioned object in our model repository in Cassandra that contains a record of:
100 |
101 | - Who trained the model
102 | - Start and end time of the training job
103 | - Full model configuration (features used, hyper-parameter values, etc.)
104 | - Reference to training and test data sets
105 | - Distribution and relative importance of each feature
106 | - Model accuracy metrics
107 | - Standard charts and graphs for each model type (e.g. ROC curve, PR curve, and confusion matrix for a binary classifier)
108 | - Full learned parameters of the model
109 | - Summary statistics for model visualization
110 |
111 |
112 | ### Deploy models
113 |
114 | Michelangelo has end-to-end support for managing model deployment via the UI or API and three modes in which a model can be deployed:
115 |
116 | 1. **Offline deployment**. The model is deployed to an offline container and run in a Spark job to generate batch predictions either on-demand or on a repeating schedule.
117 |
118 | 2. **Online deployment**. The model is deployed to an online prediction service cluster (generally containing hundreds of machines behind a load balancer) where clients can send individual or batched prediction requests as network RPC calls.
119 |
120 | 3. **Library deployment**. The model is deployed to a serving container that is embedded as a library in another service and invoked via a Java API. (This mode is not shown below, but it works similarly to online deployment.)
121 |
122 | 
123 |
124 | - In all cases, the required model artifacts (metadata files, model parameter files, and compiled DSL expressions) are packaged in a ZIP archive and copied to the relevant hosts across Uber’s data centers using our standard code deployment infrastructure.
125 | - The prediction containers automatically load the new models from the disk and start handling prediction requests.
126 | - Many teams have automation scripts to schedule regular model retraining and deployment via Michelangelo’s API.
127 |
128 |
129 | > Referencing models
130 |
131 | - More than one model can be deployed at the same time to a given serving container. This allows safe transitions from old models to new models and side-by-side A/B testing of models.
132 | - At serving time, a model is identified by its UUID and an optional tag (or alias) that is specified during deployment.
133 | - In the case of an online model, the client service sends the feature vector along with the model UUID or model tag that it wants to use; in the case of a tag, the container will generate the prediction using the model most recently deployed to that tag.
134 | - In the case of batch models, all deployed models are used to score each batch data set and the prediction records contain the model UUID and optional tag so that consumers can filter as appropriate.
135 | - For A/B testing of models, users can simply deploy competing models either via UUIDs or tags and then use Uber’s experimentation framework from within the client service to send portions of the traffic to each model and track performance metrics.
136 |
137 |
138 | > Scale and latency
139 |
140 | - Since machine learning models are stateless and share nothing, they are trivial to scale out, both in online and offline serving modes.
141 | - In the case of online models, we can simply add more hosts to the prediction service cluster and let the load balancer spread the load.
142 | - In the case of offline predictions, we can add more Spark executors and let Spark manage the parallelism.
143 | - Online serving latency depends on the model type and complexity and whether or not the model requires features from the Cassandra feature store.
144 | - In the case of a model that does not need features from Cassandra, we typically see P95 latency of less than 5 milliseconds (ms).
145 | - In the case of models that do require features from Cassandra, we typically see P95 latency of less than 10ms.
146 |
147 |
148 | ### Make predictions
149 |
150 | - Once models are deployed and loaded by the serving container, they are used to make predictions based on feature data loaded from a data pipeline or directly from a client service.
151 | - The raw features are passed through the compiled DSL expressions which can modify the raw features and/or fetch additional features from the Feature Store.
152 | - The final feature vector is constructed and passed to the model for scoring.
153 | - In the case of online models, the prediction is returned to the client service over the network.
154 | - In the case of offline models, the predictions are written back to Hive, where they can be consumed by downstream batch jobs or accessed by users directly through SQL-based query tools.
155 |
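Putting the scoring path together, a minimal sketch. The DSL is modeled here as plain Python callables, and the feature-store keys and toy ETA model are made up for illustration:

```python
def make_prediction(raw_features, dsl_expressions, feature_store, model):
    """Raw features pass through the compiled DSL expressions, which may
    transform them and/or fetch extra features from the Feature Store,
    before the final vector reaches the model."""
    features = dict(raw_features)
    for expr in dsl_expressions:
        features.update(expr(features, feature_store))  # transform and/or fetch
    return model(features)

# Hypothetical example: one expression pulls a stored feature, another derives one.
feature_store = {"restaurant:123:avg_prep_time_7d": 8.5}
dsl = [
    lambda f, fs: {"avg_prep_time_7d": fs[f"restaurant:{f['restaurant_id']}:avg_prep_time_7d"]},
    lambda f, fs: {"is_peak_hour": 1 if 11 <= f["hour"] <= 13 else 0},
]
model = lambda f: 4.0 + f["avg_prep_time_7d"] + 2.0 * f["is_peak_hour"]  # toy ETA model
print(make_prediction({"restaurant_id": 123, "hour": 12}, dsl, feature_store, model))  # 14.5
```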
156 |
157 | 
158 |
159 |
160 | ### Monitor predictions
161 |
162 | - To make sure that a model keeps working well over time, it is critical to monitor its predictions and ensure that the data pipelines continue to send accurate data.
163 | - Michelangelo can automatically log and optionally hold back a percentage of the predictions that it makes and then later join those predictions to the observed outcomes (or labels) generated by the data pipeline.
164 |
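A small sketch of that join, assuming predictions are logged by request id and outcomes arrive later from the pipeline; the key scheme and the choice of mean absolute error are illustrative, not Michelangelo specifics:

```python
def monitor(predictions, outcomes):
    """Join held-back predictions with observed outcomes and compute an
    ongoing error metric (mean absolute error here, as an example)."""
    joined = [(p, outcomes[key]) for key, p in predictions.items() if key in outcomes]
    if not joined:
        return None
    return sum(abs(p - y) for p, y in joined) / len(joined)

# Hypothetical logged predictions keyed by request id, plus labels from the pipeline.
preds = {"req-1": 12.0, "req-2": 30.0}
labels = {"req-1": 10.0, "req-2": 33.0}
print(monitor(preds, labels))  # MAE = (2 + 3) / 2 = 2.5
```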
165 |
166 |
167 | ## References
168 |
169 | - [Meet Michelangelo: Uber’s Machine Learning Platform](https://eng.uber.com/michelangelo-machine-learning-platform/)
170 |
--------------------------------------------------------------------------------
/ranking.md:
--------------------------------------------------------------------------------
1 |
2 | # How machine learning powers Facebook’s News Feed ranking algorithm
3 |
4 |
5 | ## Requirements & Goals
6 |
7 | - Designing a personalized ranking system for more than 2 billion people (all with different interests)
8 | - Without machine learning (ML), people’s News Feeds could be flooded with content they don’t find as relevant or interesting
9 | - Predict which content will matter most to each person to support a more engaging and positive experience
10 | - Use state-of-the-art ML techniques, such as multitask learning on neural networks, embeddings, and offline learning systems
11 |
12 |
13 | 
14 |
15 | Figure from [How machine learning powers Facebook’s News Feed ranking algorithm](https://engineering.fb.com/2021/01/26/ml-applications/news-feed-ranking/)
16 |
17 |
18 | ## Building a ranking algorithm
19 |
20 | - A hypothetical person, Juan, logs in to Facebook and sees that:
21 | - His good friend Wei posted a **photo** of his cocker spaniel,
22 | - Another friend, Saanvi, posted a **video** from her morning run,
23 | - His favorite Page published an interesting **article** about the best way to view the Milky Way at night,
24 | - His favorite cooking **Group** posted four new sourdough recipes.
25 | - To rank some of these things higher than others in Juan’s News Feed, we need to learn **what matters most** to Juan and which content carries the highest value for him.
26 | - For each person on Facebook there are thousands of signals we need to evaluate to determine what that person might find most relevant, so the algorithm gets very complex in practice.
27 |
28 | > On Facebook, one concrete observable signal that an item has value for someone is if they click the like button.
29 |
30 | - Given various attributes we know about a post (who is tagged in a photo, when it was posted, etc.), we can use the characteristics of the post to predict whether Juan might like the post.
31 | - For example, if Juan tends to interact with Saanvi a lot or share the content Saanvi posts, and the running video is very recent (e.g., from this morning), we might see a high probability that Juan likes content like this.
32 | - On the other hand, perhaps Juan has previously engaged more with video content than photos, so the like prediction for Wei’s cocker spaniel photo might be lower.
33 | - In this case, our ranking algorithm would rank Saanvi’s running video higher than Wei’s cocker spaniel photo.
34 | - Multiple prediction models provide us with multiple predictions for Juan: a probability he’d engage with (e.g., like or leave a comment on) Wei’s cocker spaniel picture, Saanvi’s running video, the article shared on the Page, and the cooking Group posts.
35 |
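The article does not spell out how these per-event predictions are combined, but a weighted sum of predicted probabilities is a common way to collapse them into a single value (and a reasonable reading of the "single score" step later on); the events and weights below are purely illustrative:

```python
def post_value(predictions, weights):
    """Combine per-event probabilities (like, comment, share, ...) into one score."""
    return sum(weights[event] * p for event, p in predictions.items())

weights = {"like": 1.0, "comment": 3.0, "share": 5.0}  # illustrative only
saanvi_video = {"like": 0.60, "comment": 0.10, "share": 0.05}
wei_photo    = {"like": 0.45, "comment": 0.08, "share": 0.02}
print(post_value(saanvi_video, weights))  # 1.15
print(post_value(wei_photo, weights))     # 0.79, so the video ranks higher
```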
36 |
37 | ## Approximating the ideal ranking function in a scalable ranking system
38 |
39 | - We need to score all the posts available for more than 2 billion people (more than 1,000 posts per user, per day, on average), which is challenging.
40 | - We need to do this in real time: we need to know if an article has received a lot of likes, even if it was posted just minutes ago.
41 | - We also need to know if Juan liked a lot of other content a minute ago, so we can use this information optimally in ranking.
42 |
43 |
44 | Our system architecture uses a **Web/PHP layer**, which queries the **feed aggregator**. The role of the **feed aggregator** is to collect all relevant information about a post and analyze all the features (e.g., how many people have liked this post before) in order to predict the post’s value to the user, and then to compute the final ranking score by aggregating all the predictions.
45 |
46 | 
47 |
48 | Figure from [How machine learning powers Facebook’s News Feed ranking algorithm](https://engineering.fb.com/2021/01/26/ml-applications/news-feed-ranking/)
49 |
50 | - When someone opens up Facebook, regardless of the **front-end** interface (e.g., iPhone, Android phone, web browser), the interface will send a request to a **Web/PHP (front-end) layer**, which then queries the **feed aggregator (back-end layer)**.
51 | - After accepting a request from the front end, the feed aggregator fetches actions and objects, along with an object summary, from the **feed leaf databases** so that it can process, aggregate, rank, and return the resulting list of ranked FeedStories to the front end for rendering.
52 |
53 | Now let's review how the aggregator works:
54 |
55 | 1. Query inventory
56 |
57 | - Collect all the candidate posts we could possibly rank for Juan (the cocker spaniel picture, the running video, etc.).
58 | - The eligible inventory includes any non-deleted post, made since his last login, that was shared with Juan by a friend, Group, or Page he is connected to.
59 | - Fresh posts that Juan has not yet seen but that were ranked for him in his previous sessions are eligible again for him to see.
60 |
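A toy predicate for these eligibility rules (every field name below is an assumption made for illustration):

```python
from datetime import datetime

def is_eligible(post, viewer):
    """Non-deleted, from a connection, and either new since last login or never seen."""
    connected = post["author_id"] in viewer["connections"]  # friend, Group, or Page
    fresh = post["created_at"] >= viewer["last_login"] or not post["seen"]
    return (not post["deleted"]) and connected and fresh

juan = {"last_login": datetime(2021, 1, 26, 8), "connections": {"wei", "saanvi"}}
photo = {"author_id": "wei", "created_at": datetime(2021, 1, 26, 9),
         "seen": False, "deleted": False}
print(is_eligible(photo, juan))  # True
```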
61 |
62 | 2. Score each post for Juan for each prediction
63 |
64 | - Now that we have Juan’s inventory, we score each post using multitask neural nets.
65 | - There are many features we can use to predict the score, including the type of post, embeddings, and what the viewer tends to interact with.
66 | - To calculate this for more than 1,000 posts, for each of billions of users — all in real time — we run these models for all candidate stories in parallel on multiple machines, called predictors.
67 |
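A minimal sketch of that fan-out, using a thread pool to stand in for the fleet of predictor machines (the toy model and post fields are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def score_candidates(posts, model, workers=8):
    """Fan candidate posts out across workers, mimicking the 'predictor'
    machines that score all candidate stories in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(model, posts))

toy_model = lambda post: 0.01 * len(post["text"])  # stand-in for a multitask neural net
print(score_candidates([{"text": "sourdough recipe"}, {"text": "morning run"}], toy_model))
```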
68 |
69 | 3. Calculate a single score out of many predictions
70 |
71 | - Multiple passes are needed to save computational power and to apply rules, such as content type diversity (i.e., content type should be varied so that viewers don’t see redundant content types, such as multiple videos, one after another).
72 | - In pass 0, a lightweight model is run to select approximately 500 of the most relevant posts for Juan that are eligible for ranking. This helps us rank fewer stories with high recall in later passes so that we can use more powerful neural network models.
73 | - Pass 1 is the main scoring pass, where each story is scored independently and then all 500 eligible posts are ordered by score.
74 | - Finally, we have pass 2, which is the contextual pass. Here, contextual features, such as content-type diversity rules, are added to help diversify Juan’s News Feed.
75 |
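A compact sketch of the three passes, with a deliberately naive diversity rule standing in for the real contextual pass (all function names and fields here are assumptions):

```python
def rank_feed(candidates, light_model, main_model, diversify, k=500):
    """Three-pass ranking: a cheap pass-0 filter down to ~k posts, pass-1
    scoring with the heavier model, and a pass-2 contextual re-rank."""
    shortlist = sorted(candidates, key=light_model, reverse=True)[:k]  # pass 0
    scored = sorted(shortlist, key=main_model, reverse=True)           # pass 1
    return diversify(scored)                                           # pass 2

def no_adjacent_same_type(posts):
    """Toy pass-2 rule: avoid showing the same content type twice in a row."""
    out, remaining = [], list(posts)
    while remaining:
        pick = next((p for p in remaining if not out or p["type"] != out[-1]["type"]),
                    remaining[0])
        out.append(pick)
        remaining.remove(pick)
    return out

posts = [{"type": "video", "len": 30}, {"type": "video", "len": 20},
         {"type": "photo", "len": 10}]
ranked = rank_feed(posts, light_model=lambda p: p["len"],
                   main_model=lambda p: p["len"],
                   diversify=no_adjacent_same_type)
print([p["type"] for p in ranked])  # ['video', 'photo', 'video']
```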
76 |
77 | Once we’ve completed these ranking steps, we have a scored News Feed for Juan (and all the people using Facebook) in real time, ready for him to consume and enjoy.
78 |
79 |
80 | ## References
81 |
82 | - [How machine learning powers Facebook’s News Feed ranking algorithm](https://engineering.fb.com/2021/01/26/ml-applications/news-feed-ranking/)
83 |
--------------------------------------------------------------------------------