├── LICENSE
├── questions
│   ├── ml-coding.md
│   ├── ml-theory.md
│   ├── ml-algorithms.md
│   ├── regression.md
│   ├── applied-ml-cases.md
│   ├── ml-system-design.md
│   ├── optimization-techniques.md
│   ├── statistics-probability.md
│   ├── anomaly-detection.md
│   ├── reinforcement-learning.md
│   ├── time-series-clustering.md
│   ├── natural-language-processing.md
│   ├── computer-vision.md
│   └── generative-models.md
└── README.md
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2025 rohanmistry231
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/questions/ml-coding.md:
--------------------------------------------------------------------------------
1 | # ML Coding Questions
2 | 
3 | This file contains machine learning coding questions commonly asked in interviews at companies like **Uber**, **Google**, and various startups. These questions focus on implementing ML algorithms from scratch, often without third-party libraries like Scikit-Learn, to test your coding skills and conceptual understanding.
4 | 
5 | Below are the questions with detailed answers, including explanations and Python code where relevant.
6 | 
7 | ---
8 | 
9 | ## Table of Contents
10 | 
11 | 1. [Write an AUC from scratch using vanilla Python](#1-write-an-auc-from-scratch-using-vanilla-python)
12 | 2. [Write the K-Means algorithm using NumPy only](#2-write-the-k-means-algorithm-using-numpy-only)
13 | 3. [Code Gradient Descent from scratch using NumPy and SciPy only](#3-code-gradient-descent-from-scratch-using-numpy-and-scipy-only)
14 | 
15 | ---
16 | 
17 | ## 1. Write an AUC from scratch using vanilla Python
18 | 
19 | **Question**: [Uber] Implement the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) from scratch using only vanilla Python (no NumPy or other libraries).
20 | 
21 | **Answer**:
22 | 
23 | The AUC-ROC measures the performance of a binary classifier by calculating the area under the curve of true positive rate (TPR) vs. false positive rate (FPR) at various thresholds. To compute it from scratch:
24 | 
25 | 1. Sort predictions and true labels by predicted probabilities in descending order.
26 | 2. Calculate TPR (sensitivity) and FPR (1-specificity) for each threshold.
27 | 3. Approximate the area under the curve using the trapezoidal rule.
28 | 
29 | Here’s the implementation:
30 | 
31 | ```python
32 | def calculate_auc(y_true, y_pred):
33 |     # Ensure inputs are lists
34 |     y_true = list(y_true)
35 |     y_pred = list(y_pred)
36 | 
37 |     # Pair predictions with true labels and sort by predictions (descending)
38 |     pairs = sorted(zip(y_pred, y_true), reverse=True)
39 |     y_true = [label for _, label in pairs]
40 | 
41 |     # Initialize variables
42 |     tp = 0  # True positives
43 |     fp = 0  # False positives
44 |     tpr_list = []
45 |     fpr_list = []
46 |     total_pos = sum(y_true)  # Total actual positives
47 |     total_neg = len(y_true) - total_pos  # Total actual negatives
48 | 
49 |     # Edge case: no positives or negatives
50 |     if total_pos == 0 or total_neg == 0:
51 |         return 0.0
52 | 
53 |     # Calculate TPR and FPR for each threshold
54 |     for i in range(len(y_true)):
55 |         if y_true[i] == 1:
56 |             tp += 1
57 |         else:
58 |             fp += 1
59 |         tpr = tp / total_pos  # True positive rate
60 |         fpr = fp / total_neg  # False positive rate
61 |         tpr_list.append(tpr)
62 |         fpr_list.append(fpr)
63 | 
64 |     # Calculate AUC using trapezoidal rule
65 |     auc = 0.0
66 |     for i in range(1, len(tpr_list)):
67 |         auc += (fpr_list[i] - fpr_list[i-1]) * (tpr_list[i] + tpr_list[i-1]) / 2
68 | 
69 |     return auc
70 | 
71 | # Example usage
72 | y_true = [0, 1, 1, 0, 1]
73 | y_pred = [0.1, 0.8, 0.6, 0.3, 0.9]
74 | print(calculate_auc(y_true, y_pred))  # Output: 1.0 (every positive is scored above every negative)
75 | ```
76 | 
77 | **Explanation**:
78 | - The code avoids external libraries, using only Python’s built-in functions.
79 | - Sorting by predictions ensures we evaluate thresholds from high to low probability.
80 | - TPR = TP / (TP + FN), FPR = FP / (FP + TN). We track TP and FP incrementally.
81 | - The trapezoidal rule approximates the area by summing trapezoids formed by (FPR, TPR) points.
82 | - In an interview, explain the logic step-by-step and handle edge cases (e.g., no positives, or tied prediction scores, which this simple version does not treat specially).
83 | 
84 | ---
85 | 
86 | ## 2. Write the K-Means algorithm using NumPy only
87 | 
88 | **Question**: [Google] Implement the K-Means clustering algorithm from scratch using only NumPy (no Scikit-Learn).
89 | 
90 | **Answer**:
91 | 
92 | K-Means clustering groups data into `k` clusters by minimizing the variance within each cluster. The algorithm:
93 | 
94 | 1. Randomly initializes `k` centroids.
95 | 2. Assigns each point to the nearest centroid.
96 | 3. Updates centroids as the mean of assigned points.
97 | 4. Repeats until centroids stabilize or max iterations are reached.
98 | 
99 | Here’s the NumPy implementation:
100 | 
101 | ```python
102 | import numpy as np
103 | 
104 | def kmeans(X, k, max_iters=100, random_state=42):
105 |     # Set random seed for reproducibility
106 |     np.random.seed(random_state)
107 | 
108 |     # Randomly initialize k centroids
109 |     n_samples, n_features = X.shape
110 |     idx = np.random.choice(n_samples, k, replace=False)
111 |     centroids = X[idx].astype(float)  # Copy as float so mean updates are not truncated for integer inputs
112 | 
113 |     for _ in range(max_iters):
114 |         # Assign points to nearest centroid
115 |         distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
116 |         labels = np.argmin(distances, axis=0)
117 | 
118 |         # Store old centroids for convergence check
119 |         old_centroids = centroids.copy()
120 | 
121 |         # Update centroids
122 |         for i in range(k):
123 |             if np.sum(labels == i) > 0:  # Avoid empty clusters
124 |                 centroids[i] = np.mean(X[labels == i], axis=0)
125 | 
126 |         # Check for convergence
127 |         if np.all(old_centroids == centroids):
128 |             break
129 | 
130 |     return labels, centroids
131 | 
132 | # Example usage
133 | X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
134 | labels, centroids = kmeans(X, k=2)
135 | print("Labels:", labels)
136 | print("Centroids:", centroids)
137 | ```
138 | 
139 | **Explanation**:
140 | - **Initialization**: Randomly select `k` points as initial centroids using NumPy’s `random.choice`.
141 | - **Assignment**: Compute Euclidean distances from each point to centroids using vectorized operations.
142 | - **Update**: Calculate new centroids as the mean of points in each cluster.
143 | - **Convergence**: Stop if centroids don’t change or after `max_iters`.
144 | - In an interview, mention handling empty clusters (checked here) and computational efficiency via NumPy’s vectorization.
145 | 
146 | ---
147 | 
148 | ## 3. Code Gradient Descent from scratch using NumPy and SciPy only
149 | 
150 | **Question**: [Startup] Implement Gradient Descent from scratch for a simple linear regression model using only NumPy and SciPy (no Scikit-Learn).
151 | 
152 | **Answer**:
153 | 
154 | Gradient Descent optimizes a model’s parameters (e.g., weights and bias in linear regression) by minimizing a loss function (mean squared error here). The algorithm:
155 | 
156 | 1. Initializes parameters randomly.
157 | 2. Computes the gradient of the loss with respect to parameters.
158 | 3. Updates parameters in the opposite direction of the gradient.
159 | 4. Repeats until convergence or max iterations.
160 | 
161 | We’ll use NumPy for the computations; SciPy is permitted here but isn’t actually needed for this simple case.
162 | 163 | ```python 164 | import numpy as np 165 | 166 | def gradient_descent(X, y, learning_rate=0.01, max_iters=1000, tol=1e-6): 167 | # Add bias term (column of 1s) to X 168 | X = np.c_[np.ones(len(X)), X] 169 | 170 | # Initialize weights randomly 171 | np.random.seed(42) 172 | theta = np.random.randn(X.shape[1]) 173 | 174 | # Track loss for convergence 175 | prev_loss = float('inf') 176 | 177 | for _ in range(max_iters): 178 | # Forward pass: predictions 179 | y_pred = X @ theta 180 | 181 | # Compute mean squared error loss 182 | loss = np.mean((y_pred - y) ** 2) 183 | 184 | # Check convergence 185 | if abs(prev_loss - loss) < tol: 186 | break 187 | prev_loss = loss 188 | 189 | # Compute gradients 190 | gradients = 2 / len(X) * X.T @ (y_pred - y) 191 | 192 | # Update weights 193 | theta -= learning_rate * gradients 194 | 195 | return theta 196 | 197 | # Example usage 198 | X = np.array([[1], [2], [3], [4], [5]]) 199 | y = np.array([2, 4, 6, 8, 10]) # Linear: y = 2x 200 | theta = gradient_descent(X, y) 201 | print("Parameters (bias, weight):", theta) 202 | ``` 203 | 204 | **Explanation**: 205 | - **Setup**: Add a bias term to `X` and initialize weights randomly. 206 | - **Loss**: Use mean squared error (MSE) as the loss function. 207 | - **Gradients**: Compute partial derivatives of MSE w.r.t. weights using matrix operations. 208 | - **Update**: Adjust weights using the learning rate. 209 | - **Convergence**: Stop if loss stabilizes (within `tol`) or after `max_iters`. 210 | - In an interview, explain the choice of learning rate, potential issues (e.g., overshooting), and how to extend this to other loss functions. 211 | 212 | --- 213 | 214 | ## Notes 215 | 216 | - **Code Simplicity**: The implementations are concise yet complete, suitable for whiteboard or live-coding interviews. 217 | - **Edge Cases**: Each answer addresses edge cases (e.g., no positives in AUC, empty clusters in K-Means, convergence in Gradient Descent). 218 | - **Explanations**: Answers include step-by-step logic to demonstrate understanding, crucial for verbalizing in interviews. 219 | - **NumPy Efficiency**: Where allowed, vectorized operations reduce runtime, showing good coding practices. 220 | 221 | For additional practice, try modifying these implementations (e.g., add regularization to Gradient Descent or handle edge cases differently). 222 | 223 | --- 224 | 225 | **Next Steps**: Continue preparing with other categories like [ML Theory](ml-theory.md) or explore more coding challenges to solidify your skills! 🚀 -------------------------------------------------------------------------------- /questions/ml-theory.md: -------------------------------------------------------------------------------- 1 | # ML Theory (Breadth) Questions 2 | 3 | This file contains machine learning theory questions commonly asked in interviews at companies like **Amazon**, **Etsy**, and **McKinsey**. These questions assess your **broad understanding** of machine learning concepts, such as cross-validation, handling imbalanced datasets, and the bias-variance trade-off. They test your ability to articulate foundational principles clearly. 4 | 5 | Below are the questions with detailed answers, including explanations and, where relevant, mathematical intuition or practical insights. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [Explain how cross-validation works](#1-explain-how-cross-validation-works) 12 | 2. 
[How do you handle imbalanced labels in classification models](#2-how-do-you-handle-imbalanced-labels-in-classification-models) 13 | 3. [What is the bias-variance trade-off?](#3-what-is-the-bias-variance-trade-off) 14 | 15 | --- 16 | 17 | ## 1. Explain how cross-validation works 18 | 19 | **Question**: [Amazon] Describe the process of cross-validation and its purpose in machine learning. 20 | 21 | **Answer**: 22 | 23 | **Cross-validation** is a technique used to evaluate a machine learning model’s performance on unseen data, ensuring it generalizes well and isn’t overfitting to the training set. It works by splitting the dataset into multiple subsets, training the model on some subsets, and testing it on others, repeating this process to get a robust estimate of performance. 24 | 25 | The most common form is **k-fold cross-validation**: 26 | 27 | 1. **Split**: Divide the dataset into `k` equal-sized folds (e.g., `k=5` or `k=10`). 28 | 2. **Iterate**: For each fold: 29 | - Use `k-1` folds for training and the remaining fold for validation. 30 | - Train the model on the training folds and evaluate it (e.g., using accuracy, MSE) on the validation fold. 31 | 3. **Average**: Compute the average performance metric (and standard deviation) across all `k` folds to estimate the model’s generalization ability. 32 | 33 | **Purpose**: 34 | - **Generalization**: Assess how well the model performs on unseen data. 35 | - **Hyperparameter Tuning**: Compare different models or parameters to select the best configuration. 36 | - **Robustness**: Reduce the risk of overfitting by testing on multiple validation sets. 37 | 38 | **Example**: 39 | For a dataset with 100 samples and 5-fold CV: 40 | - Each fold has 20 samples. 41 | - Train on 80 samples, test on 20, repeat 5 times. 42 | - Average the accuracy scores (e.g., [0.85, 0.87, 0.84, 0.86, 0.88] → mean = 0.86). 43 | 44 | **Variations**: 45 | - **Stratified k-fold**: Ensures class distribution is preserved in each fold (useful for imbalanced data). 46 | - **Leave-One-Out (LOO)**: Uses `k = n` (n = number of samples), computationally expensive. 47 | - **Hold-out**: A single train-test split (less robust but faster). 48 | 49 | **Interview Tips**: 50 | - Mention trade-offs: Higher `k` gives better estimates but is computationally costly. 51 | - Explain why it’s better than a single train-test split (reduces variance in performance metrics). 52 | - Note edge cases, like ensuring stratification for classification tasks. 53 | 54 | --- 55 | 56 | ## 2. How do you handle imbalanced labels in classification models 57 | 58 | **Question**: [Etsy] Discuss techniques to address imbalanced labels in classification models and their trade-offs. 59 | 60 | **Answer**: 61 | 62 | Imbalanced labels occur when one class dominates the dataset (e.g., 90% negative, 10% positive), causing models to bias toward the majority class and perform poorly on the minority class. Here are common techniques to handle this, with trade-offs: 63 | 64 | 1. **Resampling Techniques**: 65 | - **Oversampling** (e.g., SMOTE): 66 | - **How**: Generate synthetic samples for the minority class (e.g., SMOTE interpolates new points). 67 | - **Pros**: Increases minority class representation without losing data. 68 | - **Cons**: Risk of overfitting (synthetic data may not reflect real-world distribution); computationally expensive. 69 | - **Undersampling**: 70 | - **How**: Randomly remove samples from the majority class. 71 | - **Pros**: Reduces dataset size, faster training. 
72 | - **Cons**: Loss of information, may discard useful data. 73 | - **Trade-off**: Oversampling preserves data but risks overfitting; undersampling is simpler but sacrifices information. 74 | 75 | 2. **Class Weights**: 76 | - **How**: Assign higher weights to the minority class in the loss function (e.g., `weight = 1 / frequency`). 77 | - **Pros**: No data modification, easy to implement in libraries like Scikit-Learn (`class_weight='balanced'`). 78 | - **Cons**: May not suffice for extreme imbalances; requires model support. 79 | - **Example**: In logistic regression, penalize misclassifying the minority class more heavily. 80 | 81 | 3. **Anomaly Detection Approach**: 82 | - **How**: Treat the minority class as anomalies and use algorithms like Isolation Forest or One-Class SVM. 83 | - **Pros**: Effective for extreme imbalances (e.g., fraud detection). 84 | - **Cons**: May not generalize to all classification tasks; requires rethinking the problem. 85 | 86 | 4. **Evaluation Metrics**: 87 | - **How**: Use metrics like precision, recall, F1-score, or AUC-ROC instead of accuracy, which can be misleading. 88 | - **Pros**: Better reflects performance on the minority class. 89 | - **Cons**: Requires careful interpretation (e.g., trade-off between precision and recall). 90 | - **Example**: AUC-ROC evaluates model performance across thresholds, less sensitive to imbalance. 91 | 92 | 5. **Data Collection**: 93 | - **How**: Gather more data for the minority class if possible. 94 | - **Pros**: Addresses the root cause, improves model robustness. 95 | - **Cons**: Often infeasible due to cost or availability. 96 | 97 | **Practical Example**: 98 | For a fraud detection dataset (1% fraud, 99% non-fraud): 99 | - Apply SMOTE to oversample fraud cases. 100 | - Use class weights in a Random Forest model. 101 | - Evaluate with F1-score and AUC-ROC to ensure minority class performance. 102 | 103 | **Interview Tips**: 104 | - Emphasize that the choice depends on the problem (e.g., SMOTE for moderate imbalance, anomaly detection for extreme cases). 105 | - Discuss trade-offs (e.g., oversampling vs. undersampling). 106 | - Mention real-world constraints, like computational resources or data availability. 107 | 108 | --- 109 | 110 | ## 3. What is the bias-variance trade-off? 111 | 112 | **Question**: [McKinsey] Explain the bias-variance trade-off and its implications for model selection. 113 | 114 | **Answer**: 115 | 116 | The **bias-variance trade-off** is a fundamental concept in machine learning that describes the balance between a model’s ability to fit the training data (bias) and its sensitivity to variations in the data (variance). It helps explain why models overfit or underfit and guides model selection. 117 | 118 | - **Bias**: 119 | - Measures how much a model’s predictions deviate from the true values due to oversimplification. 120 | - **High bias**: Model is too simple (e.g., linear regression on non-linear data), leading to underfitting. 121 | - **Example**: Predicting house prices with a constant value ignores patterns, resulting in high error. 122 | 123 | - **Variance**: 124 | - Measures how much a model’s predictions vary with different training sets. 125 | - **High variance**: Model is too complex (e.g., a deep neural network with few data points), overfitting to noise. 126 | - **Example**: A decision tree that memorizes training data performs poorly on new data. 127 | 128 | - **Trade-off**: 129 | - Simplifying a model reduces variance but increases bias. 
130 | - Complicating a model reduces bias but increases variance. 131 | - The goal is to find the **sweet spot** where total error (bias + variance + irreducible error) is minimized. 132 | 133 | **Mathematical Intuition**: 134 | Expected error = (Bias)² + Variance + Irreducible Error 135 | - **Bias²**: Systematic error from model assumptions. 136 | - **Variance**: Error from sensitivity to training data. 137 | - **Irreducible Error**: Noise inherent in the data, unavoidable. 138 | 139 | **Implications for Model Selection**: 140 | - **Simple Models** (e.g., linear regression): 141 | - High bias, low variance. 142 | - Good for small datasets or when interpretability matters. 143 | - Risk: Underfitting if data is complex. 144 | - **Complex Models** (e.g., deep neural networks): 145 | - Low bias, high variance. 146 | - Good for large, complex datasets. 147 | - Risk: Overfitting without enough data or regularization. 148 | - **Techniques to Balance**: 149 | - **Regularization**: (e.g., L1/L2) reduces variance by penalizing complexity. 150 | - **Cross-validation**: Estimates generalization error to choose the right complexity. 151 | - **Ensemble Methods**: (e.g., Random Forest) combine models to reduce variance while keeping bias low. 152 | 153 | **Practical Example**: 154 | For a dataset with non-linear patterns: 155 | - A linear model (high bias) may underfit, missing trends. 156 | - A deep decision tree (high variance) may overfit to noise. 157 | - A Random Forest with tuned depth balances bias and variance, achieving better generalization. 158 | 159 | **Interview Tips**: 160 | - Draw the bias-variance curve (error vs. model complexity) if possible. 161 | - Relate to real algorithms (e.g., linear models vs. neural networks). 162 | - Mention practical solutions like regularization or more data to reduce variance. 163 | 164 | --- 165 | 166 | ## Notes 167 | 168 | - **Clarity**: Answers are concise yet thorough, ideal for verbalizing in interviews. 169 | - **Practicality**: Each answer includes examples and trade-offs to show real-world application. 170 | - **Depth**: Explanations cover theory, intuition, and interview strategies (e.g., what to emphasize). 171 | - **Consistency**: Matches the style of `ml-coding.md` for a cohesive repository. 172 | 173 | For deeper practice, revisit these concepts with hands-on exercises (e.g., implement cross-validation manually) or explore related topics in [ML Algorithms](ml-algorithms.md). 🚀 174 | 175 | --- 176 | 177 | **Next Steps**: Keep building your ML knowledge with [ML Coding](ml-coding.md) or dive into [ML Algorithms](ml-algorithms.md) for deeper algorithm-specific questions! 🌟 -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 🚀 Machine Learning Interview Questions 2 | 3 | Welcome to the **Ultimate Guide to 120+ REAL Machine Learning Interview Questions**! 🎉 Whether you're aiming for a role at FAANG giants like **Google**, **Amazon**, or **Meta**, cutting-edge startups like **Stripe** or **Open AI**, or prestigious firms like **McKinsey**, this repository has you covered. Perfect for aspiring **Data Scientists**, **Machine Learning Engineers**, and **LLM Engineers**! 💻 4 | 5 | This collection spans **coding**, **theory**, **algorithms**, **system design**, and **production**, with questions sourced from real interviews. 
Each category links to a dedicated markdown file, and we'll add detailed answers soon to supercharge your prep! 📚✨ 6 | 7 | --- 8 | 9 | ## 📖 Table of Contents 10 | 11 | - [🌟 Introduction](#-introduction) 12 | - [🔍 ML Interview Areas](#-ml-interview-areas) 13 | - [💻 ML Coding](#ml-coding) 14 | - [📚 ML Theory (Breadth)](#ml-theory-breadth) 15 | - [🧠 ML Algorithms (Depth)](#ml-algorithms-depth) 16 | - [🏢 Applied ML Cases](#applied-ml-cases) 17 | - [⚙️ ML System Design](#ml-system-design) 18 | - [🛠️ Common Machine Learning Algorithms](#-common-machine-learning-algorithms) 19 | - [🔥 Detailed Question Categories](#-detailed-question-categories) 20 | - [🌱 Easy Machine Learning Questions](#easy-machine-learning-questions) 21 | - [🧹 Feature Engineering & Data Preprocessing](#feature-engineering--data-preprocessing) 22 | - [🌳 Tree-Based Model Questions](#tree-based-model-questions) 23 | - [🧬 Deep Learning Questions](#deep-learning-questions) 24 | - [📈 Regression Questions](#regression-questions) 25 | - [🚀 Advanced Machine Learning Questions](#advanced-machine-learning-questions) 26 | - [🏭 Production & MLOps Questions](#production--mlops-questions) 27 | - [⏳ Time Series & Clustering Questions](#time-series--clustering-questions) 28 | - [📊 Statistics & Probability Questions](#statistics--probability-questions) 29 | - [📝 Natural Language Processing Questions](#natural-language-processing-questions) 30 | - [📸 Computer Vision Questions](#computer-vision-questions) 31 | - [⚡ Optimization Techniques Questions](#optimization-techniques-questions) 32 | - [🚨 Anomaly Detection Questions](#anomaly-detection-questions) 33 | - [🤖 Reinforcement Learning Questions](#reinforcement-learning-questions) 34 | - [🎨 Generative Models Questions](#generative-models-questions) 35 | - [🌌 Generative AI Questions](#generative-ai-questions) 36 | - [💡 Preparation Tips](#-preparation-tips) 37 | - [🤝 Contributing](#-contributing) 38 | - [📜 License](#-license) 39 | 40 | --- 41 | 42 | ## 🌟 Introduction 43 | 44 | This repository is your **one-stop shop** for mastering machine learning interviews! 🎯 Designed to help you tackle roles like **Data Scientist**, **ML Engineer**, or **Product Analyst**, it compiles **real-world questions** from top-tier companies. From foundational concepts to production-ready systems, we've organized everything to make your prep **structured**, **fun**, and **effective**. 😄 45 | 46 | Questions are grouped into **core interview areas** and **specialized categories**, with links to detailed question lists. Answers will be added in separate files to keep things clean and focused. Let’s dive in and ace that interview! 🚀 47 | 48 | --- 49 | 50 | ## 🔍 ML Interview Areas 51 | 52 | Here are the **five key pillars** of machine learning interviews, each packed with questions to test your skills: 53 | 54 | ### 💻 ML Coding 55 | 56 | Get ready to code ML algorithms **from scratch**! 🚀 No Scikit-Learn allowed here—just pure Python, NumPy, or SciPy to showcase your coding chops and conceptual mastery. 57 | 58 | - 📄 [ML Coding Questions](questions/ml-coding.md) 59 | 60 | ### 📚 ML Theory (Breadth) 61 | 62 | Show off your **big-picture understanding** of ML! 🧠 From bias-variance trade-offs to cross-validation, these questions test your grasp of core concepts. 63 | 64 | - 📄 [ML Theory Questions](questions/ml-theory.md) 65 | 66 | ### 🧠 ML Algorithms (Depth) 67 | 68 | Dive **deep** into specific algorithms! 🔬 Whether it’s Random Forest or Gradient Boosting, you’ll need to know their mechanics, trade-offs, and quirks inside out. 
69 | 70 | - 📄 [ML Algorithms Questions](questions/ml-algorithms.md) 71 | 72 | ### 🏢 Applied ML Cases 73 | 74 | Solve **real-world business problems** with ML! 💼 Think fraud detection or propensity modeling, often coded in Jupyter or Colab to prove your practical skills. 75 | 76 | - 📄 [Applied ML Cases Questions](questions/applied-ml-cases.md) 77 | 78 | ### ⚙️ ML System Design 79 | 80 | Design **scalable, production-ready ML systems**! 🏭 From data pipelines to model deployment, these questions test your ability to build robust architectures. 81 | 82 | - 📄 [ML System Design Questions](questions/ml-system-design.md) 83 | 84 | --- 85 | 86 | ## 🛠️ Common Machine Learning Algorithms 87 | 88 | Master these **seven powerhouse algorithms** to shine in interviews! 🏆 Know their assumptions, applications, trade-offs, and how to tune them: 89 | 90 | - **Linear Regression** 📈 91 | - **Logistic Regression** ✅ 92 | - **Decision Tree** 🌳 93 | - **Random Forest** 🌲 94 | - **Gradient Boosted Trees** 🚀 95 | - **K-Means** 🗳️ 96 | - **Dense Neural Networks** 🧠 97 | 98 | 💡 **Pro Tip**: Practice writing **pseudocode** for these algorithms to nail ML coding and depth rounds! ✍️ 99 | 100 | --- 101 | 102 | ## 🔥 Detailed Question Categories 103 | 104 | Ready to go deeper? These categories break down questions into **specialized topics** for laser-focused prep! 🎯 105 | 106 | ### 🌱 Easy Machine Learning Questions 107 | 108 | Start with the **basics**! These foundational questions cover supervised vs. unsupervised learning, overfitting, and model validation. 109 | 110 | - 📄 [Easy ML Questions](questions/easy-ml.md) 111 | 112 | ### 🧹 Feature Engineering & Data Preprocessing 113 | 114 | Clean and transform data like a pro! 🧼 Learn to handle missing values, encode categorical variables, and reduce dimensionality. 115 | 116 | - 📄 [Feature Engineering Questions](questions/feature-engineering.md) 117 | 118 | ### 🌳 Tree-Based Model Questions 119 | 120 | Get cozy with **decision trees**, **random forests**, and boosting methods like **XGBoost** and **LightGBM**. 🌲 121 | 122 | - 📄 [Tree-Based Model Questions](questions/tree-based-models.md) 123 | 124 | ### 🧬 Deep Learning Questions 125 | 126 | Explore the world of **neural networks**! 🧠 From CNNs to transformers, these questions dive into attention mechanisms and more. 127 | 128 | - 📄 [Deep Learning Questions](questions/deep-learning.md) 129 | 130 | ### 📈 Regression Questions 131 | 132 | Master **linear**, **logistic**, **ridge**, and **lasso regression**, plus how to handle multicollinearity and outliers. 133 | 134 | - 📄 [Regression Questions](questions/regression.md) 135 | 136 | ### 🚀 Advanced Machine Learning Questions 137 | 138 | Tackle **complex challenges** like ensemble methods, model distillation, and handling massive datasets. 🧑‍🚀 139 | 140 | - 📄 [Advanced ML Questions](questions/advanced-ml.md) 141 | 142 | ### 🏭 Production & MLOps Questions 143 | 144 | Deploy and maintain models like a boss! 🛠️ Learn about model monitoring, A/B testing, and feature stores. 145 | 146 | - 📄 [Production & MLOps Questions](questions/production-mlops.md) 147 | 148 | ### ⏳ Time Series & Clustering Questions 149 | 150 | Forecast the future and group data! 📅 Dive into **ARIMA**, **SARIMA**, **K-Means**, and **DBSCAN**. 151 | 152 | - 📄 [Time Series & Clustering Questions](questions/time-series-clustering.md) 153 | 154 | ### 📊 Statistics & Probability Questions 155 | 156 | Strengthen your **mathematical foundations**! 📐 Master probability distributions, hypothesis testing, and Bayesian methods. 
157 | 158 | - 📄 [Statistics & Probability Questions](questions/statistics-probability.md) 159 | 160 | ### 📝 Natural Language Processing Questions 161 | 162 | Unlock the power of **text analysis**! 📚 From tokenization to transformers, tackle embeddings, sentiment analysis, and LLMs. 163 | 164 | - 📄 [Natural Language Processing Questions](questions/natural-language-processing.md) 165 | 166 | ### 📸 Computer Vision Questions 167 | 168 | See the world through **pixels**! 🖼️ Explore CNNs, object detection, and segmentation for image-based tasks. 169 | 170 | - 📄 [Computer Vision Questions](questions/computer-vision.md) 171 | 172 | ### ⚡ Optimization Techniques Questions 173 | 174 | Optimize like a pro! 🔧 Dive into gradient descent, second-order methods, and adaptive optimizers like Adam. 175 | 176 | - 📄 [Optimization Techniques Questions](questions/optimization-techniques.md) 177 | 178 | ### 🚨 Anomaly Detection Questions 179 | 180 | Spot the outliers! 🔍 Learn statistical and ML methods like Isolation Forest and autoencoders to detect anomalies. 181 | 182 | - 📄 [Anomaly Detection Questions](questions/anomaly-detection.md) 183 | 184 | ### 🤖 Reinforcement Learning Questions 185 | 186 | Train agents to make decisions! 🎮 Master MDPs, Q-learning, and modern algorithms like PPO for dynamic environments. 187 | 188 | - 📄 [Reinforcement Learning Questions](questions/reinforcement-learning.md) 189 | 190 | ### 🎨 Generative Models Questions 191 | 192 | Create new data from scratch! ✨ Understand VAEs, GANs, and diffusion models for generating images, text, and more. 193 | 194 | - 📄 [Generative Models Questions](questions/generative-models.md) 195 | 196 | ### 🌌 Generative AI Questions 197 | 198 | Ride the Gen AI wave! 🌟 Explore LLMs, multimodal models, prompt engineering, and ethical considerations for cutting-edge AI. 199 | 200 | - 📄 [Generative AI Questions](questions/generative-ai.md) 201 | 202 | --- 203 | 204 | ## 💡 Preparation Tips 205 | 206 | Here’s how to **crush your ML interview** with confidence! 😎 207 | 208 | - **🖥️ Code Like a Pro**: Write ML algorithms from scratch using Python, NumPy, or SciPy. 209 | - **🧠 Master Theory**: Nail concepts like bias-variance trade-off and cross-validation. 210 | - **🔍 Study Algorithms**: Understand the ins and outs of key algorithms (e.g., Random Forest vs. Gradient Boosting). 211 | - **🏢 Solve Business Problems**: Practice ML in real-world contexts to show business impact. 212 | - **⚙️ Learn System Design**: Get comfortable with scalable ML pipelines and deployment. 213 | - **🎭 Mock Interviews**: Simulate interviews to polish your answers and build confidence. 214 | - **🌐 Stay Current**: Keep up with ML trends, especially in LLMs and MLOps. 215 | 216 | --- 217 | 218 | ## 🤝 Contributing 219 | 220 | We’d love your help to make this repo even better! 🌟 To contribute: 221 | 222 | 1. **Fork** the repository. 223 | 2. Create a new branch (`git checkout -b feature/add-questions`). 224 | 3. Add your changes (new questions, answers, or resources). 225 | 4. Submit a **pull request** with a clear description. 226 | 227 | Have a cool question or answer? Share it with the community! 😄 228 | 229 | --- 230 | 231 | **Ready to ace your ML interview?** 🚀 Start exploring the question categories, practice daily, and watch your confidence soar! Let’s make that dream job yours! 
💼🎉 -------------------------------------------------------------------------------- /questions/ml-algorithms.md: -------------------------------------------------------------------------------- 1 | # ML Algorithms (Depth) Questions 2 | 3 | This file contains machine learning algorithm questions commonly asked in interviews at companies like **Amazon**. These questions assess your **in-depth understanding** of specific algorithms, focusing on their mechanics, assumptions, trade-offs, and differences. They test your ability to articulate detailed knowledge beyond general ML concepts. 4 | 5 | Below are the questions with comprehensive answers, including explanations, mathematical intuition where relevant, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [What is the pseudocode of the Random Forest model?](#1-what-is-the-pseudocode-of-the-random-forest-model) 12 | 2. [What is the variance and bias of the Random Forest model?](#2-what-is-the-variance-and-bias-of-the-random-forest-model) 13 | 3. [How is the Random Forest different from Gradient Boosted Trees?](#3-how-is-the-random-forest-different-from-gradient-boosted-trees) 14 | 15 | --- 16 | 17 | ## 1. What is the pseudocode of the Random Forest model? 18 | 19 | **Question**: [Amazon] Provide the pseudocode for the Random Forest algorithm. 20 | 21 | **Answer**: 22 | 23 | **Random Forest** is an ensemble learning method that combines multiple decision trees to improve predictive accuracy and reduce overfitting. It uses **bagging** (Bootstrap Aggregating) and **feature randomness** to create diverse trees, then aggregates their predictions (majority vote for classification, average for regression). 24 | 25 | **Pseudocode**: 26 | 27 | Algorithm: RandomForest(X, y, n_trees, max_depth, min_samples_split, max_features) 28 | Input: 29 | X: feature matrix (n_samples, n_features) 30 | y: target vector (n_samples) 31 | n_trees: number of trees 32 | max_depth: maximum depth of each tree (optional) 33 | min_samples_split: minimum samples to split a node 34 | max_features: number of features to consider at each split 35 | Output: 36 | forest: list of trained decision trees 37 | 38 | Initialize: 39 | forest = [] 40 | 41 | For i = 1 to n_trees: 42 | # Bootstrap sampling: sample n_samples with replacement 43 | indices = RandomSampleWithReplacement(n_samples) 44 | X_boot = X[indices] 45 | y_boot = y[indices] 46 | 47 | # Build a decision tree with feature randomness 48 | tree = DecisionTree(max_depth, min_samples_split) 49 | For each node in tree: 50 | If node satisfies stopping criteria (e.g., max_depth, min_samples_split): 51 | Make node a leaf, assign majority class (classification) or mean (regression) 52 | Else: 53 | # Randomly select max_features features 54 | candidate_features = RandomSubset(features, max_features) 55 | # Find best split based on criterion (e.g., Gini, entropy, MSE) 56 | best_feature, best_threshold = FindBestSplit(X_boot, y_boot, candidate_features) 57 | Split node into left and right children 58 | Recurse on children 59 | Add tree to forest 60 | 61 | Return forest 62 | 63 | Prediction: 64 | For a new sample x: 65 | predictions = [] 66 | For each tree in forest: 67 | prediction = Predict(tree, x) 68 | Append prediction to predictions 69 | If classification: 70 | Return majority vote of predictions 71 | If regression: 72 | Return mean of predictions 73 | 74 | **Explanation**: 75 | - **Bootstrap Sampling**: Each tree is trained on a random subset of the data (with replacement), ensuring 
diversity. 76 | - **Feature Randomness**: At each split, only a subset of features (`max_features`) is considered, reducing correlation between trees. 77 | - **Decision Tree**: Uses standard splitting criteria (e.g., Gini impurity for classification, MSE for regression). 78 | - **Aggregation**: Combines predictions to stabilize results and improve accuracy. 79 | 80 | **Interview Tips**: 81 | - Clarify parameters like `max_features` (often `sqrt(n_features)` for classification, `n_features/3` for regression). 82 | - Mention stopping criteria (e.g., `max_depth`, `min_samples_split`) to prevent overfitting. 83 | - Explain how bagging reduces variance compared to a single tree. 84 | - Be ready to sketch a tree split or discuss Gini/entropy if asked for details. 85 | 86 | --- 87 | 88 | ## 2. What is the variance and bias of the Random Forest model? 89 | 90 | **Question**: [Amazon] Discuss the bias and variance characteristics of the Random Forest model. 91 | 92 | **Answer**: 93 | 94 | **Bias** and **variance** describe a model’s error components, and Random Forest’s ensemble nature gives it distinct characteristics: 95 | 96 | - **Bias**: 97 | - **Definition**: Bias measures how much a model’s predictions deviate from true values due to oversimplification. 98 | - **Random Forest Bias**: Random Forest has **similar bias** to a single decision tree because each tree is grown deep (low bias) to capture complex patterns. However, since trees are unpruned or lightly constrained (e.g., limited `max_depth`), individual trees have low bias, and averaging doesn’t significantly increase it. 99 | - **Example**: For a non-linear dataset, Random Forest fits complex boundaries (low bias) unless restricted by parameters like shallow depth. 100 | 101 | - **Variance**: 102 | - **Definition**: Variance measures how sensitive a model’s predictions are to changes in the training data. 103 | - **Random Forest Variance**: Random Forest **reduces variance** compared to a single decision tree. A single tree overfits to training data noise (high variance), but Random Forest’s bagging and feature randomness create diverse trees. Averaging (regression) or voting (classification) smooths out individual tree errors, lowering overall variance. 104 | - **Example**: If one tree overpredicts due to noise, others balance it out, stabilizing predictions. 105 | 106 | - **Bias-Variance Trade-off**: 107 | - Random Forest achieves **low bias** (like deep trees) and **low variance** (via ensemble averaging), making it robust for many tasks. 108 | - Total error = (Bias)² + Variance + Irreducible Error. Random Forest minimizes variance while maintaining low bias, reducing total error compared to a single tree. 109 | 110 | - **Parameter Impact**: 111 | - **More trees (`n_estimators`)**: Further reduces variance without affecting bias much. 112 | - **Smaller `max_features`**: Increases diversity, reducing variance but potentially increasing bias if too small. 113 | - **Shallow trees (`max_depth`)**: Increases bias, reduces variance slightly. 114 | 115 | **Practical Example**: 116 | - On a noisy classification dataset: 117 | - A single decision tree might have low bias (fits training data well) but high variance (sensitive to noise). 118 | - A Random Forest with 100 trees has similar bias (still captures patterns) but lower variance (stable predictions across datasets). 
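To make the contrast concrete in an interview, you can describe (or run) a small experiment like the sketch below. It is only an illustration, assuming Scikit-Learn and a synthetic dataset (neither is part of the from-scratch answers elsewhere in this repo): train a single deep tree and a Random Forest on repeated bootstrap resamples and compare how much their test-set predictions fluctuate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy binary classification problem (assumed setup for illustration)
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

def prediction_variance(model_factory, n_runs=20):
    """Average per-sample variance of test predictions across bootstrap resamples."""
    rng = np.random.default_rng(0)
    preds = []
    for _ in range(n_runs):
        idx = rng.integers(0, len(X_train), len(X_train))  # bootstrap resample of the training set
        model = model_factory().fit(X_train[idx], y_train[idx])
        preds.append(model.predict_proba(X_test)[:, 1])
    return np.mean(np.var(preds, axis=0))

print("Single deep tree:", prediction_variance(lambda: DecisionTreeClassifier(random_state=0)))
print("Random Forest   :", prediction_variance(lambda: RandomForestClassifier(n_estimators=100, random_state=0)))
```

The forest’s per-sample prediction variance typically comes out several times smaller than the single tree’s, which is exactly the averaging effect described above, while its overall error level (bias) stays roughly comparable.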
119 | 120 | **Interview Tips**: 121 | - Contrast with a single tree: “A single tree has low bias but high variance; Random Forest keeps low bias and reduces variance through bagging.” 122 | - Mention that bias depends on tree depth and data complexity. 123 | - Be ready to discuss how hyperparameters tune the trade-off (e.g., `n_estimators`, `max_features`). 124 | - Sketch the bias-variance curve if asked, showing Random Forest’s advantage. 125 | 126 | --- 127 | 128 | ## 3. How is the Random Forest different from Gradient Boosted Trees? 129 | 130 | **Question**: [Amazon] Explain the key differences between Random Forest and Gradient Boosted Trees. 131 | 132 | **Answer**: 133 | 134 | **Random Forest** and **Gradient Boosted Trees** are both ensemble methods that combine decision trees, but they differ in how trees are built and combined, leading to distinct strengths and weaknesses. 135 | 136 | 1. **Methodology**: 137 | - **Random Forest**: 138 | - Uses **bagging** (Bootstrap Aggregating). 139 | - Trains multiple trees **independently** on random subsets of data (with replacement) and features. 140 | - Combines predictions via **majority vote** (classification) or **averaging** (regression). 141 | - Goal: Reduce variance by averaging uncorrelated trees. 142 | - **Gradient Boosted Trees**: 143 | - Uses **boosting**. 144 | - Trains trees **sequentially**, where each tree corrects errors of previous ones by fitting to the residual errors (gradients of the loss function). 145 | - Combines predictions via **weighted sum** of trees. 146 | - Goal: Reduce bias and variance by iteratively improving the model. 147 | 148 | 2. **Tree Construction**: 149 | - **Random Forest**: 150 | - Trees are typically **deep** (low bias) to capture complex patterns. 151 | - Uses **feature randomness** (`max_features`) to ensure diversity. 152 | - Each tree is trained on a bootstrap sample, reducing correlation. 153 | - **Gradient Boosted Trees**: 154 | - Trees are usually **shallow** (weak learners) to avoid overfitting. 155 | - All features are considered at each split (unless specified), focusing on error correction. 156 | - Uses the full dataset (or subsamples) with weights adjusted per iteration. 157 | 158 | 3. **Bias and Variance**: 159 | - **Random Forest**: 160 | - **Low bias**: Deep trees fit data well. 161 | - **Low variance**: Bagging averages out errors from uncorrelated trees. 162 | - Better at handling noisy data due to independence. 163 | - **Gradient Boosted Trees**: 164 | - **Lower bias**: Sequential correction refines predictions, capturing complex patterns. 165 | - **Higher variance**: Sequential dependence makes it sensitive to noise unless regularized. 166 | - More prone to overfitting without tuning. 167 | 168 | 4. **Training Speed**: 169 | - **Random Forest**: 170 | - **Faster**: Trees are trained in parallel, and computation scales well. 171 | - Less sensitive to hyperparameters, easier to tune. 172 | - **Gradient Boosted Trees**: 173 | - **Slower**: Sequential training means each tree waits for the previous one. 174 | - Requires careful tuning (e.g., learning rate, tree depth). 175 | 176 | 5. **Performance**: 177 | - **Random Forest**: 178 | - Excels in **general-purpose tasks** with noisy or high-dimensional data. 179 | - Robust out-of-the-box, less tuning needed. 180 | - **Gradient Boosted Trees**: 181 | - Often **outperforms** Random Forest on structured/tabular data with careful tuning. 182 | - Preferred in competitions (e.g., XGBoost, LightGBM) for maximizing predictive accuracy. 
183 | 184 | 6. **Hyperparameters**: 185 | - **Random Forest**: 186 | - Key parameters: `n_estimators`, `max_features`, `max_depth`. 187 | - Less sensitive to overfitting. 188 | - **Gradient Boosted Trees**: 189 | - Key parameters: `n_estimators`, `learning_rate`, `max_depth`, `subsample`. 190 | - Requires balancing `learning_rate` and `n_estimators` to avoid overfitting. 191 | 192 | **Practical Example**: 193 | - **Fraud Detection**: 194 | - **Random Forest**: Good for noisy, high-dimensional data; quick to train and robust. 195 | - **Gradient Boosted Trees**: Better if tuned to focus on rare fraud cases, but needs regularization to avoid overfitting. 196 | - **Choice**: Use Random Forest for quick prototyping; use Gradient Boosting (e.g., XGBoost) for maximizing accuracy with tuning. 197 | 198 | **Interview Tips**: 199 | - Emphasize **bagging vs. boosting** as the core difference. 200 | - Compare strengths: “Random Forest is robust and fast; Gradient Boosting is powerful but needs tuning.” 201 | - Mention popular implementations (e.g., Random Forest in Scikit-Learn, XGBoost/LightGBM for boosting). 202 | - Be ready to discuss when to use each (e.g., Random Forest for noisy data, Gradient Boosting for structured data). 203 | 204 | --- 205 | 206 | ## Notes 207 | 208 | - **Depth**: Answers dive into algorithmic details, suitable for “depth” rounds where interviewers expect thorough knowledge. 209 | - **Clarity**: Explanations are structured to articulate complex ideas simply, ideal for verbalizing in interviews. 210 | - **Practicality**: Includes examples and trade-offs to show real-world application (e.g., Random Forest vs. Gradient Boosting use cases). 211 | - **Consistency**: Matches the style of `ml-coding.md` and `ml-theory.md` for a cohesive repository. 212 | 213 | For deeper practice, implement Random Forest or Gradient Boosting manually (see [ML Coding](ml-coding.md)) or explore related topics in [Applied ML Cases](applied-ml-cases.md). 🚀 214 | 215 | --- 216 | 217 | **Next Steps**: Strengthen your prep with [ML Theory](ml-theory.md) for broader concepts or dive into [Applied ML Cases](applied-ml-cases.md) for business-focused problems! 🌟 -------------------------------------------------------------------------------- /questions/regression.md: -------------------------------------------------------------------------------- 1 | # Regression Questions 2 | 3 | This file contains regression-related questions commonly asked in interviews at companies like **Google**, **Amazon**, and others. These questions assess your **understanding** of regression techniques, their assumptions, evaluation metrics, and practical applications. They test your ability to articulate regression concepts and apply them to real-world problems. 4 | 5 | Below are the questions with detailed answers, including explanations, mathematical intuition where relevant, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [What is regression analysis?](#1-what-is-regression-analysis) 12 | 2. [What is the difference between linear regression and logistic regression?](#2-what-is-the-difference-between-linear-regression-and-logistic-regression) 13 | 3. [What are the assumptions of linear regression?](#3-what-are-the-assumptions-of-linear-regression) 14 | 4. [How do you evaluate the performance of a regression model?](#4-how-do-you-evaluate-the-performance-of-a-regression-model) 15 | 5. 
[What is multicollinearity and how do you detect it?](#5-what-is-multicollinearity-and-how-do-you-detect-it) 16 | 6. [What is the difference between Ridge and Lasso regression?](#6-what-is-the-difference-between-ridge-and-lasso-regression) 17 | 7. [What is polynomial regression and when would you use it?](#7-what-is-polynomial-regression-and-when-would-you-use-it) 18 | 8. [How does quantile regression differ from ordinary least squares regression?](#8-how-does-quantile-regression-differ-from-ordinary-least-squares-regression) 19 | 20 | --- 21 | 22 | ## 1. What is regression analysis? 23 | 24 | **Answer**: 25 | 26 | **Regression analysis** is a statistical method used to model the relationship between a dependent variable (target) and one or more independent variables (features) to predict continuous outcomes. 27 | 28 | - **Key Points**: 29 | - **Goal**: Estimate how features influence the target (e.g., predict house prices from size, location). 30 | - **Types**: 31 | - **Linear**: Assumes linear relationship (e.g., `y = w0 + w1*x1`). 32 | - **Non-Linear**: Captures complex patterns (e.g., polynomial regression). 33 | - **Logistic**: For binary outcomes (not strictly regression, but related). 34 | - **Components**: 35 | - Model (e.g., linear equation). 36 | - Loss function (e.g., Mean Squared Error). 37 | - Optimization (e.g., least squares). 38 | 39 | - **Use Cases**: 40 | - Predict sales, stock prices, temperature. 41 | - Understand feature impact (e.g., “How does age affect income?”). 42 | 43 | **Example**: 44 | - Task: Predict house price from square footage. 45 | - Model: `price = 50,000 + 200 * sqft`. 46 | - Result: Predicts price for new houses. 47 | 48 | **Interview Tips**: 49 | - Clarify scope: “Focuses on continuous outputs, unlike classification.” 50 | - Mention flexibility: “Can be linear or non-linear.” 51 | - Be ready to sketch: “Show y vs. x with a fitted line.” 52 | 53 | --- 54 | 55 | ## 2. What is the difference between linear regression and logistic regression? 56 | 57 | **Answer**: 58 | 59 | - **Linear Regression**: 60 | - **Purpose**: Predicts a continuous outcome. 61 | - **Model**: `y = w0 + w1*x1 + ... + wn*xn`. 62 | - **Output**: Unbounded real numbers (e.g., price, temperature). 63 | - **Loss**: Mean Squared Error (MSE): `1/n * Σ(y_pred - y_true)²`. 64 | - **Use Case**: Predict house prices, sales. 65 | - **Assumption**: Linear relationship, Gaussian errors. 66 | 67 | - **Logistic Regression**: 68 | - **Purpose**: Predicts probability of a binary outcome (extendable to multiclass). 69 | - **Model**: `P(y=1) = 1/(1 + e^-(w0 + w1*x1 + ...))` (sigmoid function). 70 | - **Output**: Probability [0, 1], thresholded for class (e.g., 0.5). 71 | - **Loss**: Log loss (cross-entropy): `-1/n * Σ[y_true * log(y_pred) + (1-y_true) * log(1-y_pred)]`. 72 | - **Use Case**: Predict churn (yes/no), spam detection. 73 | - **Assumption**: Log-odds are linear. 74 | 75 | - **Key Differences**: 76 | - **Output**: Linear → continuous; Logistic → probability. 77 | - **Task**: Linear for regression; Logistic for classification. 78 | - **Loss**: Linear uses MSE; Logistic uses log loss. 79 | - **Non-Linearity**: Logistic applies sigmoid to linear combination. 80 | 81 | **Example**: 82 | - Linear: Predict weight from height (e.g., 70 kg). 83 | - Logistic: Predict if someone exercises (yes/no) based on age. 
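To show the mechanical difference, here is a minimal NumPy sketch (illustrative only; the feature values and weights are made up) computing the same linear combination twice, once used directly as a continuous prediction and once passed through a sigmoid to produce a probability:

```python
import numpy as np

x = np.array([1.0, 65.0])    # hypothetical features: [bias term, age]
w = np.array([-3.0, 0.05])   # hypothetical learned weights

linear_output = x @ w                        # linear regression: unbounded continuous prediction
logistic_output = 1 / (1 + np.exp(-x @ w))   # logistic regression: probability in [0, 1]

print(linear_output)    # 0.25  -> e.g., a continuous target value
print(logistic_output)  # ~0.56 -> P(y=1); threshold at 0.5 to get a class label
```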
84 | 85 | **Interview Tips**: 86 | - Emphasize output: “Logistic outputs probabilities, not raw values.” 87 | - Clarify naming: “Logistic is classification despite ‘regression’ name.” 88 | - Be ready to derive: “Show sigmoid transforming linear model.” 89 | 90 | --- 91 | 92 | ## 3. What are the assumptions of linear regression? 93 | 94 | **Answer**: 95 | 96 | Linear regression relies on several assumptions for valid results: 97 | 98 | 1. **Linearity**: 99 | - The relationship between features and target is linear (`y = w0 + w1*x1 + ...`). 100 | - Check: Scatter plots, residual plots (no patterns). 101 | 2. **Independence**: 102 | - Observations are independent of each other. 103 | - Check: Study design (e.g., no repeated measures). 104 | 3. **Homoscedasticity**: 105 | - Constant variance of residuals across feature values. 106 | - Check: Residual vs. fitted plot (even spread). 107 | 4. **Normality**: 108 | - Residuals are normally distributed (for inference, not prediction). 109 | - Check: Q-Q plot, Shapiro-Wilk test. 110 | 5. **No Multicollinearity**: 111 | - Features are not highly correlated with each other. 112 | - Check: Variance Inflation Factor (VIF), correlation matrix. 113 | 6. **No Extreme Outliers**: 114 | - Outliers can skew coefficients. 115 | - Check: Boxplots, leverage statistics. 116 | 117 | **Example**: 118 | - Task: Predict sales from ad spend. 119 | - Violation: Non-linear pattern in residuals → use polynomial regression. 120 | 121 | **Interview Tips**: 122 | - Prioritize key assumptions: “Linearity and homoscedasticity are critical.” 123 | - Mention checks: “I’d plot residuals to verify.” 124 | - Discuss fixes: “Transform features for non-linearity.” 125 | 126 | --- 127 | 128 | ## 4. How do you evaluate the performance of a regression model? 129 | 130 | **Answer**: 131 | 132 | Evaluating a regression model involves metrics that measure prediction error and fit quality: 133 | 134 | - **Mean Squared Error (MSE)**: 135 | - `1/n * Σ(y_pred - y_true)²`. 136 | - Pros: Penalizes large errors, differentiable. 137 | - Cons: Sensitive to outliers, scale-dependent. 138 | - **Root Mean Squared Error (RMSE)**: 139 | - `√MSE`. 140 | - Pros: Interpretable in target units (e.g., dollars). 141 | - Cons: Still outlier-sensitive. 142 | - **Mean Absolute Error (MAE)**: 143 | - `1/n * Σ|y_pred - y_true|`. 144 | - Pros: Robust to outliers, intuitive. 145 | - Cons: Less sensitive to large errors. 146 | - **R-Squared (R²)**: 147 | - `1 - (Σ(y_pred - y_true)² / Σ(y_true - mean(y_true))²)`. 148 | - Pros: Measures variance explained (0 to 1). 149 | - Cons: Can mislead with non-linear models. 150 | - **Adjusted R-Squared**: 151 | - Adjusts R² for number of features. 152 | - Pros: Penalizes overfitting. 153 | - **Residual Analysis**: 154 | - Plot residuals vs. predicted or features. 155 | - Pros: Diagnoses assumption violations (e.g., non-linearity). 156 | 157 | **Example**: 158 | - Model: Predict house prices. 159 | - Metrics: RMSE = $10,000, R² = 0.85. 160 | - Residuals: Random scatter → good fit. 161 | 162 | **Interview Tips**: 163 | - Tailor metrics: “RMSE for interpretability, MAE for robustness.” 164 | - Discuss context: “Business prefers dollar errors (RMSE).” 165 | - Be ready to plot: “Show residuals to check fit.” 166 | 167 | --- 168 | 169 | ## 5. What is multicollinearity and how do you detect it? 170 | 171 | **Answer**: 172 | 173 | **Multicollinearity** occurs when independent variables in a regression model are highly correlated, leading to unstable or misleading coefficients. 
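A quick way to build intuition is a small simulation (illustrative, with synthetic data): when two features are nearly identical, refitting on a different sample can swing the individual coefficients wildly, even though their combined effect and the predictions barely change. The impact and detection methods are detailed below.

```python
import numpy as np

def fit_ols(seed, n=200):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)    # x2 is almost an exact copy of x1
    y = 3 * x1 + rng.normal(scale=0.5, size=n)  # true signal depends only on x1
    X = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares fit
    return coef  # [intercept, weight on x1, weight on x2]

print(fit_ols(1))  # e.g. large opposite-signed weights on x1 and x2 that roughly sum to 3
print(fit_ols(2))  # on another sample the split between x1 and x2 can look completely different
```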
174 | 175 | - **Impact**: 176 | - Inflates standard errors, making coefficients insignificant. 177 | - Hard to interpret feature importance (e.g., which variable drives `y`?). 178 | - Does not affect predictions, only inference. 179 | 180 | - **Detection**: 181 | - **Correlation Matrix**: 182 | - Compute Pearson correlation between features. 183 | - Threshold: `|r| > 0.8` suggests issue. 184 | - **Variance Inflation Factor (VIF)**: 185 | - `VIF_i = 1/(1 - R²_i)`, where `R²_i` is from regressing feature `i` on others. 186 | - Threshold: VIF > 5 or 10 indicates multicollinearity. 187 | - **Condition Number**: 188 | - Ratio of largest to smallest eigenvalue of feature matrix. 189 | - High value (>30) suggests instability. 190 | 191 | - **Solutions**: 192 | - Remove one correlated feature. 193 | - Combine features (e.g., PCA, average). 194 | - Use regularized models (Ridge, Lasso). 195 | 196 | **Example**: 197 | - Features: `height_cm`, `height_m` (r = 1.0). 198 | - VIF: >1000. 199 | - Action: Drop `height_m`. 200 | 201 | **Interview Tips**: 202 | - Clarify impact: “Affects inference, not predictions.” 203 | - Suggest VIF: “Most reliable for detection.” 204 | - Be ready to compute: “Show VIF formula.” 205 | 206 | --- 207 | 208 | ## 6. What is the difference between Ridge and Lasso regression? 209 | 210 | **Answer**: 211 | 212 | - **Ridge Regression**: 213 | - **How**: Adds L2 penalty to linear regression loss: `MSE + λ * Σw_i²`. 214 | - **Effect**: Shrinks coefficients toward zero, but rarely to zero. 215 | - **Use Case**: Handles multicollinearity, stabilizes coefficients. 216 | - **Pros**: Robust to correlated features. 217 | - **Cons**: Keeps all features (less interpretable). 218 | 219 | - **Lasso Regression**: 220 | - **How**: Adds L1 penalty: `MSE + λ * Σ|w_i|`. 221 | - **Effect**: Shrinks coefficients, sets some to exactly zero (feature selection). 222 | - **Use Case**: Sparse models, high-dimensional data. 223 | - **Pros**: Selects features automatically. 224 | - **Cons**: Unstable with highly correlated features. 225 | 226 | - **Key Differences**: 227 | - **Penalty**: Ridge uses L2 (squared); Lasso uses L1 (absolute). 228 | - **Feature Selection**: Lasso eliminates features; Ridge shrinks them. 229 | - **Stability**: Ridge better for multicollinearity; Lasso may pick one of correlated pair. 230 | - **Geometry**: Ridge constrains weights to a sphere; Lasso to a diamond. 231 | 232 | **Example**: 233 | - Dataset: 100 features, some correlated. 234 | - Ridge: Keeps all features, small weights. 235 | - Lasso: Selects 20 features, others zero. 236 | 237 | **Interview Tips**: 238 | - Explain penalty: “L1 promotes sparsity, L2 smoothness.” 239 | - Mention Elastic Net: “Combines both for balance.” 240 | - Be ready to sketch: “Show L1 vs. L2 constraint shapes.” 241 | 242 | --- 243 | 244 | ## 7. What is polynomial regression and when would you use it? 245 | 246 | **Answer**: 247 | 248 | **Polynomial regression** extends linear regression by modeling non-linear relationships using polynomial terms of features. 249 | 250 | - **How It Works**: 251 | - Instead of `y = w0 + w1*x`, use `y = w0 + w1*x + w2*x² + ... + wn*x^n`. 252 | - Fit using least squares, treating `x²`, `x³` as new features. 253 | - Can include interactions (e.g., `x1 * x2`). 254 | 255 | - **When to Use**: 256 | - **Non-Linear Data**: When scatter plots show curvature (e.g., sales vs. time). 257 | - **Simple Non-Linearity**: Polynomial terms suffice (vs. complex models like neural nets). 
258 | - **Interpretability**: Need to explain relationship (vs. black-box models). 259 | - **Small Datasets**: Avoids overfitting of deep models. 260 | 261 | - **Limitations**: 262 | - **Overfitting**: High-degree polynomials fit noise (e.g., degree > 5). 263 | - **Extrapolation**: Poor outside training range. 264 | - **Computation**: High-degree terms increase complexity. 265 | 266 | **Example**: 267 | - Task: Predict car speed vs. time (curved). 268 | - Model: `speed = w0 + w1*time + w2*time²`. 269 | - Result: Captures acceleration curve. 270 | 271 | **Interview Tips**: 272 | - Clarify fit: “It’s still linear in weights, just non-linear in features.” 273 | - Discuss degree: “Choose via cross-validation to avoid overfitting.” 274 | - Be ready to plot: “Show linear vs. polynomial fit.” 275 | 276 | --- 277 | 278 | ## 8. How does quantile regression differ from ordinary least squares regression? 279 | 280 | **Answer**: 281 | 282 | - **Ordinary Least Squares (OLS) Regression**: 283 | - **Goal**: Minimize mean squared error to predict the mean of the target. 284 | - **Loss**: `1/n * Σ(y_pred - y_true)²`. 285 | - **Output**: Conditional mean (`E[y|x]`). 286 | - **Assumption**: Homoscedastic errors, focuses on central tendency. 287 | - **Use Case**: Predict average house price. 288 | 289 | - **Quantile Regression**: 290 | - **Goal**: Predict specific quantiles of the target (e.g., median, 90th percentile). 291 | - **Loss**: Asymmetric (pinball) absolute error: `Σρ_τ(y_true - y_pred)`, where `ρ_τ(u) = u * (τ - I(u<0))` and `τ` is the quantile (e.g., 0.5 for median). 292 | - **Output**: Conditional quantile (e.g., `Q_τ(y|x)`). 293 | - **Assumption**: No assumption on error distribution, robust to heteroscedasticity. 294 | - **Use Case**: Predict price ranges, model tails (e.g., high-risk cases). 295 | 296 | - **Key Differences**: 297 | - **Focus**: OLS predicts mean; quantile predicts quantiles. 298 | - **Loss**: OLS uses squared error; quantile uses asymmetric absolute error. 299 | - **Robustness**: Quantile is robust to outliers, heteroscedasticity. 300 | - **Interpretation**: Quantile models entire distribution (e.g., median vs. 95th percentile). 301 | 302 | **Example**: 303 | - Task: Predict income. 304 | - OLS: Average income = $50K. 305 | - Quantile (τ=0.9): 90th percentile = $100K. 306 | 307 | **Interview Tips**: 308 | - Explain quantiles: “Captures distribution, not just mean.” 309 | - Highlight robustness: “Great for skewed or outlier-heavy data.” 310 | - Be ready to derive: “Show quantile loss function.” 311 | 312 | --- 313 | 314 | ## Notes 315 | 316 | - **Focus**: Answers emphasize regression-specific concepts, ideal for ML interviews. 317 | - **Clarity**: Explanations are structured for verbal delivery, with examples and trade-offs. 318 | - **Depth**: Includes mathematical intuition (e.g., loss functions) and practical tips (e.g., VIF for multicollinearity). 319 | - **Consistency**: Matches the style of previous files for a cohesive repository. 320 | 321 | For deeper practice, try implementing regression models (see [ML Coding](ml-coding.md)) or explore [ML System Design](ml-system-design.md) for scaling regression solutions. 🚀 322 | 323 | --- 324 | 325 | **Next Steps**: Build on these skills with [Statistics & Probability](statistics-probability.md) for foundational math or revisit [Deep Learning](deep-learning.md) for advanced regression models!
🌟 -------------------------------------------------------------------------------- /questions/applied-ml-cases.md: -------------------------------------------------------------------------------- 1 | # Applied ML Cases Questions 2 | 3 | This file contains applied machine learning case questions commonly asked in interviews at companies like **Google**, **PayPal**, and **Apple**. These questions assess your ability to **solve real-world business problems** using machine learning, focusing on translating business needs into ML solutions. They test your practical skills, often requiring verbal explanations or hands-on coding in environments like Jupyter or Colab. 4 | 5 | Below are the questions with detailed answers, including step-by-step approaches, key considerations, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [Given a dataset that contains purchase history on PlayStore, how would you build a propensity score model?](#1-given-a-dataset-that-contains-purchase-history-on-playstore-how-would-you-build-a-propensity-score-model) 12 | 2. [How would you build a fraud model without labels?](#2-how-would-you-build-a-fraud-model-without-labels) 13 | 3. [How would you identify meaningful segmentation?](#3-how-would-you-identify-meaningful-segmentation) 14 | 15 | --- 16 | 17 | ## 1. Given a dataset that contains purchase history on PlayStore, how would you build a propensity score model? 18 | 19 | **Question**: [Google] You have a dataset with purchase history on the PlayStore. Describe how you would build a propensity score model to predict the likelihood of a user making a purchase. 20 | 21 | **Answer**: 22 | 23 | A **propensity score model** predicts the probability that a user will perform a specific action (here, making a purchase). For the PlayStore purchase history dataset, the goal is to estimate the likelihood of a user buying an app or in-app item. Here’s a step-by-step approach: 24 | 25 | 1. **Problem Definition**: 26 | - **Objective**: Binary classification to predict `P(purchase = 1 | user features)`. 27 | - **Output**: Probability score (0 to 1) for each user. 28 | - **Metric**: AUC-ROC (for ranking users), precision/recall (for business impact). 29 | 30 | 2. **Data Exploration**: 31 | - **Dataset**: Assume columns like `user_id`, `app_id`, `timestamp`, `price`, `category`, `past_purchases`, `session_duration`, `device_type`. 32 | - **Target**: Create a binary label (`1` for purchased, `0` for no purchase within a time window). 33 | - **EDA**: Check for class imbalance (purchases are rare), missing values, and correlations (e.g., session time vs. purchase). 34 | 35 | 3. **Feature Engineering**: 36 | - **User Features**: 37 | - Frequency of app visits, time since last purchase. 38 | - Total spend, average purchase value. 39 | - Session metrics (e.g., avg. session duration, clicks per session). 40 | - **App Features**: 41 | - App category (e.g., games, productivity), price, ratings. 42 | - **Behavioral Features**: 43 | - Recency, frequency, monetary (RFM) scores. 44 | - Interaction patterns (e.g., viewed but didn’t buy). 45 | - **Temporal Features**: 46 | - Day of week, time of day for purchases. 47 | - Handle categorical variables (e.g., one-hot encode `category`, `device_type`). 48 | 49 | 4. **Data Preprocessing**: 50 | - **Imbalanced Data**: Use oversampling (SMOTE), undersampling, or class weights due to rare purchases. 
51 | - **Missing Values**: Impute numerical features (e.g., median for session time) and categorical features (e.g., mode for device). 52 | - **Scaling**: Standardize numerical features (e.g., spend, session duration) for models like logistic regression. 53 | 54 | 5. **Model Selection**: 55 | - **Algorithm**: Start with **logistic regression** (interpretable, outputs probabilities) or **Random Forest/XGBoost** (handles non-linearity, interactions). 56 | - **Why Logistic Regression**: Easy to interpret coefficients (e.g., “+10 min session time increases purchase odds by X%”). 57 | - **Why XGBoost**: Captures complex patterns, often performs well for imbalanced data. 58 | 59 | 6. **Training and Evaluation**: 60 | - **Split**: Use train-validation-test split (e.g., 70-15-15) or time-based split (train on older data, test on recent). 61 | - **Cross-Validation**: 5-fold CV to tune hyperparameters (e.g., regularization for logistic regression, `max_depth` for XGBoost). 62 | - **Metrics**: 63 | - **Primary**: AUC-ROC to evaluate ranking ability. 64 | - **Secondary**: Precision/recall/F1 for minority class (purchases). 65 | - **Business**: Expected revenue from targeting top N% of users. 66 | 67 | 7. **Model Interpretation**: 68 | - Use feature importance (e.g., XGBoost’s gain) or SHAP values to identify key drivers (e.g., high spend, frequent sessions). 69 | - Explain to stakeholders: “Users with longer sessions are 2x more likely to buy.” 70 | 71 | 8. **Deployment**: 72 | - Output propensity scores daily for users. 73 | - Integrate into PlayStore for personalized recommendations (e.g., target high-propensity users with promotions). 74 | - Monitor model drift (e.g., changes in purchase behavior). 75 | 76 | **Example**: 77 | - Dataset: 1M users, 5% purchased. 78 | - Features: `past_spend`, `session_time`, `app_category`. 79 | - Model: XGBoost with class weights, AUC-ROC = 0.85. 80 | - Action: Target top 10% of users, increasing conversions by 20%. 81 | 82 | **Interview Tips**: 83 | - Emphasize **business impact**: “The model helps prioritize marketing spend.” 84 | - Discuss **imbalanced data**: “Purchases are rare, so I’d use class weights or SMOTE.” 85 | - Be ready to sketch feature engineering or explain model choice trade-offs. 86 | - Mention scalability: “For millions of users, I’d optimize feature computation.” 87 | 88 | --- 89 | 90 | ## 2. How would you build a fraud model without labels? 91 | 92 | **Question**: [PayPal] Describe how you would build a fraud detection model when no labeled data (fraud vs. non-fraud) is available. 93 | 94 | **Answer**: 95 | 96 | Building a fraud detection model without labels requires **unsupervised** or **semi-supervised** learning, as we can’t rely on explicit fraud/non-fraud labels. Fraud is typically rare and anomalous, so we’ll treat it as an outlier detection problem. Here’s the approach: 97 | 98 | 1. **Problem Definition**: 99 | - **Objective**: Identify transactions that deviate from normal behavior, likely indicating fraud. 100 | - **Output**: Anomaly scores or binary flags (anomalous vs. normal). 101 | - **Metric**: Since no labels, use proxy metrics like anomaly rate or manual review feedback. 102 | 103 | 2. **Data Exploration**: 104 | - **Dataset**: Assume columns like `user_id`, `transaction_amount`, `timestamp`, `merchant`, `location`, `device_id`. 105 | - **EDA**: Analyze distributions (e.g., amount, frequency), detect patterns (e.g., typical user behavior), and check for missing values. 106 | 107 | 3. 
**Feature Engineering**: 108 | - **Transaction Features**: 109 | - Amount, time since last transaction, transaction frequency. 110 | - Merchant category, cross-border flag. 111 | - **User Behavior**: 112 | - Average spend, usual transaction locations. 113 | - Device changes, login frequency. 114 | - **Temporal Features**: 115 | - Hour of day, day of week (fraud may peak at odd hours). 116 | - **Aggregates**: 117 | - Rolling averages (e.g., spend over last 7 days). 118 | - Deviations (e.g., current amount vs. user’s average). 119 | 120 | 4. **Data Preprocessing**: 121 | - **Scaling**: Standardize numerical features (e.g., amount, frequency) for distance-based algorithms. 122 | - **Encoding**: One-hot encode categorical features (e.g., merchant type). 123 | - **Missing Values**: Impute or flag missing data (e.g., missing location as a feature). 124 | 125 | 5. **Model Selection**: 126 | - **Unsupervised**: 127 | - **Isolation Forest**: Efficient for high-dimensional data, assumes anomalies are “isolated.” 128 | - **DBSCAN**: Clusters normal transactions, flags outliers as noise. 129 | - **Autoencoders**: Learn to reconstruct normal transactions; high reconstruction error indicates anomalies. 130 | - **Semi-Supervised** (if partial labels emerge): 131 | - Train on “normal” data (assume most transactions are non-fraudulent). 132 | - Flag deviations as potential fraud. 133 | - **Why Isolation Forest**: Fast, scalable, and effective for rare anomalies like fraud. 134 | 135 | 6. **Training and Evaluation**: 136 | - **Training**: Fit the model on the full dataset (unsupervised) or assumed normal data (semi-supervised). 137 | - **Hyperparameters**: Tune `contamination` (expected anomaly rate, e.g., 0.01%) for Isolation Forest. 138 | - **Evaluation**: 139 | - No labels, so inspect top anomalies manually or with domain experts. 140 | - Proxy metrics: Compare anomaly patterns (e.g., high amounts, unusual locations) to known fraud heuristics. 141 | - If labels become available (e.g., manual reviews), compute precision/recall. 142 | 143 | 7. **Model Interpretation**: 144 | - Highlight features driving anomalies (e.g., “Transaction of $5000 vs. user’s avg. $50”). 145 | - Provide explainable outputs for fraud investigators (e.g., anomaly score + key features). 146 | 147 | 8. **Deployment**: 148 | - Score transactions in real-time for flagging. 149 | - Implement a feedback loop: Manual reviews refine the model (e.g., confirmed fraud becomes labels). 150 | - Monitor for drift (e.g., new fraud patterns). 151 | 152 | **Example**: 153 | - Dataset: 10M transactions, no labels. 154 | - Model: Isolation Forest flags 0.1% as anomalies. 155 | - Findings: Anomalies include large cross-border transactions at 3 AM. 156 | - Action: Flag for review, reducing fraud risk. 157 | 158 | **Interview Tips**: 159 | - Emphasize **unsupervised learning**: “Without labels, I’d use anomaly detection.” 160 | - Discuss **feedback loops**: “Manual reviews can create labels for future supervised models.” 161 | - Mention **scalability**: “Isolation Forest handles millions of transactions efficiently.” 162 | - Be ready to pivot: “If partial labels emerge, I’d switch to semi-supervised.” 163 | 164 | --- 165 | 166 | ## 3. How would you identify meaningful segmentation? 167 | 168 | **Question**: [Apple] Explain how you would identify meaningful customer segments in a dataset. 
169 | 170 | **Answer**: 171 | 172 | **Customer segmentation** divides users into groups with similar characteristics for targeted strategies (e.g., marketing, product design). “Meaningful” segments are actionable, distinct, and aligned with business goals. Here’s a step-by-step approach: 173 | 174 | 1. **Problem Definition**: 175 | - **Objective**: Group customers into clusters based on behavior, demographics, or preferences. 176 | - **Output**: Cluster labels for each customer, with interpretable segment profiles. 177 | - **Metric**: Silhouette score (cluster cohesion), business KPIs (e.g., revenue per segment). 178 | 179 | 2. **Data Exploration**: 180 | - **Dataset**: Assume columns like `user_id`, `age`, `location`, `purchase_history`, `app_usage`, `preferences`. 181 | - **EDA**: Check distributions (e.g., age, spend), correlations, and missing values. Identify key variables for segmentation. 182 | 183 | 3. **Feature Engineering**: 184 | - **Demographics**: Age, income, location (urban/rural). 185 | - **Behavioral**: 186 | - Frequency/recency of purchases, avg. spend. 187 | - App engagement (e.g., sessions per week, features used). 188 | - **Preferences**: Favorite categories, survey responses. 189 | - **Aggregates**: RFM scores (recency, frequency, monetary). 190 | - Standardize features to ensure equal weighting. 191 | 192 | 4. **Data Preprocessing**: 193 | - **Scaling**: Normalize numerical features (e.g., spend, sessions) for distance-based clustering. 194 | - **Encoding**: One-hot encode categorical features (e.g., location). 195 | - **Dimensionality Reduction**: Use PCA if high-dimensional to reduce noise while preserving variance. 196 | 197 | 5. **Model Selection**: 198 | - **Algorithms**: 199 | - **K-Means**: Simple, effective for spherical clusters. 200 | - **Hierarchical Clustering**: Captures nested structures, interpretable dendrograms. 201 | - **Gaussian Mixture Models (GMM)**: Flexible for non-spherical clusters. 202 | - **DBSCAN**: Identifies outliers, but may not suit all segmentation tasks. 203 | - **Why K-Means**: Fast, scalable, and works well with clear business features (e.g., spend, engagement). 204 | - **Number of Clusters**: Use elbow method (within-cluster variance) or silhouette score to choose `k`. 205 | 206 | 6. **Training and Evaluation**: 207 | - **Training**: Run clustering (e.g., K-Means with `k=4`) on preprocessed data. 208 | - **Evaluation**: 209 | - **Quantitative**: Silhouette score (higher = better separation), within-cluster variance. 210 | - **Qualitative**: Inspect segment profiles (e.g., “high-spend, young users”). 211 | - **Business**: Validate with KPIs (e.g., segment A drives 50% revenue). 212 | - Iterate on `k` or features if segments aren’t actionable. 213 | 214 | 7. **Interpretation**: 215 | - Profile segments: “Segment 1: Young, frequent buyers; Segment 2: Older, low engagement.” 216 | - Visualize: Scatter plots (PCA components) or bar charts (avg. spend per segment). 217 | - Map to business actions: “Target Segment 1 with premium offers.” 218 | 219 | 8. **Deployment**: 220 | - Assign new customers to segments using trained model. 221 | - Integrate into CRM for personalized campaigns. 222 | - Monitor segment stability over time (e.g., re-cluster quarterly). 223 | 224 | **Example**: 225 | - Dataset: 100K Apple users. 226 | - Features: `age`, `spend`, `app_usage`. 227 | - Model: K-Means, `k=3`. 228 | - Segments: “Power users” (high spend, frequent), “Casual users” (moderate), “Inactive” (low). 
229 | - Action: Upsell to power users, re-engage inactives. 230 | 231 | **Interview Tips**: 232 | - Emphasize **business alignment**: “Segments must drive actionable strategies.” 233 | - Discuss **interpretability**: “I’d ensure clusters are explainable to stakeholders.” 234 | - Mention **validation**: “Silhouette score ensures clusters are distinct.” 235 | - Be ready to pivot: “If segments aren’t meaningful, I’d adjust features or k.” 236 | 237 | --- 238 | 239 | ## Notes 240 | 241 | - **Practicality**: Answers outline actionable steps, bridging ML and business needs. 242 | - **Clarity**: Explanations are structured for verbal delivery, with clear methodology and trade-offs. 243 | - **Depth**: Includes technical details (e.g., SMOTE, Isolation Forest) and business context (e.g., revenue impact). 244 | - **Consistency**: Matches the style of `ml-coding.md`, `ml-theory.md`, and `ml-algorithms.md` for a cohesive repository. 245 | 246 | For deeper practice, try coding a propensity model (see [ML Coding](ml-coding.md)) or explore [ML System Design](ml-system-design.md) for scaling these solutions. 🚀 247 | 248 | --- 249 | 250 | **Next Steps**: Build on these skills with [ML System Design](ml-system-design.md) for production-ready solutions or revisit [ML Theory](ml-theory.md) for foundational concepts! 🌟 -------------------------------------------------------------------------------- /questions/ml-system-design.md: -------------------------------------------------------------------------------- 1 | # ML System Design Questions 2 | 3 | This file contains machine learning system design questions commonly asked in interviews at companies like **Google**, **Uber**, and **Amazon**. These questions assess your ability to **design scalable, production-ready ML systems**, covering functional and non-functional requirements, architecture, data pipelines, model training, evaluation, and deployment. They test your end-to-end understanding of building robust ML solutions. 4 | 5 | Below are the questions with detailed answers, including step-by-step designs, key considerations, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [Build a real-time translation system](#1-build-a-real-time-translation-system) 12 | 2. [Build a real-time ETA model](#2-build-a-real-time-eta-model) 13 | 3. [Build a recommender system for product search](#3-build-a-recommender-system-for-product-search) 14 | 15 | --- 16 | 17 | ## 1. Build a real-time translation system 18 | 19 | **Question**: [Google] Design a real-time translation system to translate text or speech across languages with low latency. 20 | 21 | **Answer**: 22 | 23 | A **real-time translation system** converts text or speech from one language to another instantly, requiring low latency, high accuracy, and scalability. Here’s a detailed design: 24 | 25 | 1. **Requirements**: 26 | - **Functional**: 27 | - Input: Text or speech in source language. 28 | - Output: Translated text or speech in target language. 29 | - Supported languages: e.g., 50+ languages. 30 | - **Non-Functional**: 31 | - Latency: <200ms for text, <1s for speech-to-speech. 32 | - Scalability: Handle millions of requests per second. 33 | - Accuracy: Comparable to state-of-the-art models (e.g., BLEU score >30). 34 | - Availability: 99.9% uptime. 35 | 36 | 2. **Architecture Overview**: 37 | - **Components**: 38 | - **Speech-to-Text (STT)**: Converts speech input to text (if speech-based). 39 | - **Text Translation Model**: Translates source text to target text. 
40 | - **Text-to-Speech (TTS)**: Converts translated text to speech (if needed). 41 | - **Orchestrator**: Manages pipeline (STT → Translation → TTS). 42 | - **Caching Layer**: Stores frequent translations for speed. 43 | - **Load Balancer**: Distributes requests across servers. 44 | - **Monitoring**: Tracks latency, errors, and model drift. 45 | - **Flow**: 46 | - User sends input (text/speech) via API. 47 | - Load balancer routes to orchestrator. 48 | - Pipeline processes: STT (if speech), translate, TTS (if speech output). 49 | - Response returned to user. 50 | 51 | 3. **Data Preparation**: 52 | - **Dataset**: Use parallel corpora (e.g., OPUS, CommonCrawl) with millions of sentence pairs per language. 53 | - **Preprocessing**: 54 | - Tokenize text, normalize (e.g., lowercase, remove special characters). 55 | - Handle rare languages with transfer learning from high-resource languages. 56 | - **Features**: Token embeddings, language IDs, context (if conversational). 57 | 58 | 4. **Model Training**: 59 | - **Algorithm**: Transformer-based model (e.g., mBART, T5) for sequence-to-sequence translation. 60 | - **Why Transformer**: State-of-the-art for multilingual translation, handles long dependencies. 61 | - **Training**: 62 | - Pretrain on large corpora (e.g., Wikipedia dumps). 63 | - Fine-tune on domain-specific data (e.g., user queries). 64 | - Use mixed-precision training for efficiency. 65 | - **Optimization**: Quantization (e.g., INT8) and pruning to reduce latency. 66 | - **Hardware**: Train on GPUs/TPUs, deploy on CPUs/GPUs for inference. 67 | 68 | 5. **Model Evaluation**: 69 | - **Metrics**: 70 | - **BLEU**: Measures translation quality against reference translations. 71 | - **Latency**: End-to-end response time. 72 | - **User Satisfaction**: A/B test with human feedback. 73 | - **Offline**: Evaluate on held-out test set (e.g., WMT benchmarks). 74 | - **Online**: Monitor real-time metrics (e.g., error rate, latency). 75 | 76 | 6. **Model Productionization**: 77 | - **Serving**: 78 | - Use ONNX or TensorRT for optimized inference. 79 | - Deploy on Kubernetes with auto-scaling for traffic spikes. 80 | - **Caching**: Store frequent translations (e.g., “Hello” → “Hola”) in Redis. 81 | - **STT/TTS**: Integrate third-party APIs (e.g., Google STT, Amazon Polly) or custom models. 82 | - **API**: REST/gRPC endpoint (e.g., `POST /translate {text, source_lang, target_lang}`). 83 | - **Latency Optimization**: 84 | - Batch small requests during inference. 85 | - Use edge servers for low-latency delivery. 86 | - **Fault Tolerance**: Replicate services across regions, retry failed requests. 87 | 88 | 7. **Monitoring and Maintenance**: 89 | - **Metrics**: Latency, throughput, error rate, BLEU score drift. 90 | - **Drift**: Retrain if new slang or phrases emerge. 91 | - **Logging**: Store anonymized requests for analysis. 92 | - **A/B Testing**: Compare new models against baseline. 93 | 94 | **Example**: 95 | - Input: “Hello, how are you?” (English speech). 96 | - Flow: STT → Text (“Hello, how are you?”) → Translation → Text (“Hola, ¿cómo estás?”) → TTS → Spanish speech. 97 | - Latency: 500ms end-to-end, scalable to 1M users. 98 | 99 | **Interview Tips**: 100 | - Start with **requirements**: Clarify text vs. speech, latency goals. 101 | - Sketch **architecture**: Draw pipeline (STT → Translation → TTS). 102 | - Discuss **trade-offs**: Custom STT vs. third-party, model size vs. latency. 103 | - Highlight **scalability**: Mention caching, auto-scaling, edge servers. 104 | 105 | --- 106 | 107 | ## 2. 
Build a real-time ETA model 108 | 109 | **Question**: [Uber] Design a real-time ETA (Estimated Time of Arrival) model for a ride-sharing platform. 110 | 111 | **Answer**: 112 | 113 | A **real-time ETA model** predicts the time it takes for a driver to reach a passenger or complete a trip, requiring low latency, high accuracy, and robustness to dynamic conditions. Here’s the design: 114 | 115 | 1. **Requirements**: 116 | - **Functional**: 117 | - Input: Pickup/drop-off coordinates, trip details. 118 | - Output: ETA in minutes (e.g., 12.5 min). 119 | - **Non-Functional**: 120 | - Latency: <100ms per request. 121 | - Scalability: Handle millions of requests per minute. 122 | - Accuracy: Within ±2 minutes for 90% of predictions. 123 | - Availability: 99.99% uptime. 124 | 125 | 2. **Architecture Overview**: 126 | - **Components**: 127 | - **Data Ingestion**: Real-time feeds (GPS, traffic, weather). 128 | - **Feature Store**: Precomputed features (e.g., historical ETAs). 129 | - **ETA Model**: Predicts travel time. 130 | - **Routing Engine**: Computes optimal paths. 131 | - **Orchestrator**: Combines features, model, and routing. 132 | - **Caching Layer**: Stores recent ETAs for similar routes. 133 | - **Monitoring**: Tracks accuracy, latency, drift. 134 | - **Flow**: 135 | - User requests ETA via app. 136 | - Orchestrator pulls features, queries model and routing engine. 137 | - Response (ETA) returned to user. 138 | 139 | 3. **Data Preparation**: 140 | - **Dataset**: 141 | - Historical trips: `start/end coordinates`, `distance`, `duration`, `timestamp`. 142 | - Real-time: GPS traces, traffic (e.g., Google Maps API), weather. 143 | - Contextual: Road type, time of day, events (e.g., concerts). 144 | - **Preprocessing**: 145 | - Clean GPS noise (e.g., Kalman filter). 146 | - Aggregate traffic (e.g., avg. speed per road segment). 147 | - **Features**: 148 | - **Trip**: Distance, number of turns, road types. 149 | - **Dynamic**: Traffic speed, weather conditions. 150 | - **Temporal**: Hour, day, rush hour flags. 151 | - **Historical**: Avg. ETA for similar routes. 152 | 153 | 4. **Model Training**: 154 | - **Algorithm**: Gradient Boosted Trees (e.g., XGBoost) or Neural Networks. 155 | - **Why XGBoost**: Handles non-linear interactions, fast inference, robust to missing data. 156 | - **Training**: 157 | - Target: Actual trip duration (minutes). 158 | - Train on recent data (e.g., last 3 months) to capture trends. 159 | - Use weighted loss for outliers (e.g., heavy traffic). 160 | - **Optimization**: Feature selection to reduce inference time. 161 | 162 | 5. **Model Evaluation**: 163 | - **Metrics**: 164 | - **MAE**: Mean absolute error (target <2 min). 165 | - **Percentile Accuracy**: 90th percentile error. 166 | - **Latency**: Inference time. 167 | - **Offline**: Test on held-out trips. 168 | - **Online**: A/B test with drivers (e.g., ETA vs. actual). 169 | 170 | 6. **Model Productionization**: 171 | - **Serving**: 172 | - Deploy model on Kubernetes with gRPC for low-latency inference. 173 | - Use ONNX for cross-platform compatibility. 174 | - **Feature Store**: Store precomputed features (e.g., historical ETAs) in Redis. 175 | - **Routing Engine**: Integrate with OSRM or Google Maps for path planning. 176 | - **Caching**: Cache ETAs for frequent routes (e.g., airport to downtown). 177 | - **Latency Optimization**: 178 | - Precompute static features (e.g., road distances). 179 | - Parallelize model and routing queries. 180 | - **Fault Tolerance**: Fallback to historical averages if model fails. 
181 | 182 | 7. **Monitoring and Maintenance**: 183 | - **Metrics**: MAE, latency, cache hit rate, request volume. 184 | - **Drift**: Retrain weekly for new traffic patterns. 185 | - **Alerts**: Trigger if MAE exceeds threshold. 186 | - **Logging**: Store predictions for analysis. 187 | 188 | **Example**: 189 | - Input: Trip from downtown to airport, 5 PM, rainy. 190 | - Features: Distance (10 km), traffic (slow), historical ETA (15 min). 191 | - Output: ETA = 18 min, MAE = 1.5 min. 192 | - Scalability: Handles 10M daily requests. 193 | 194 | **Interview Tips**: 195 | - Clarify **scope**: “Is this driver-to-passenger or full trip ETA?” 196 | - Draw **pipeline**: Ingestion → Features → Model → Routing. 197 | - Discuss **real-time challenges**: “Traffic changes fast, so I’d use a feature store.” 198 | - Highlight **accuracy vs. latency**: “Caching reduces latency but risks stale data.” 199 | 200 | --- 201 | 202 | ## 3. Build a recommender system for product search 203 | 204 | **Question**: [Amazon] Design a recommender system to enhance product search results on an e-commerce platform. 205 | 206 | **Answer**: 207 | 208 | A **recommender system for product search** suggests relevant products based on user queries, improving search quality and conversion rates. It combines search relevance with personalization. Here’s the design: 209 | 210 | 1. **Requirements**: 211 | - **Functional**: 212 | - Input: User query (e.g., “wireless headphones”), user profile. 213 | - Output: Ranked list of products. 214 | - **Non-Functional**: 215 | - Latency: <200ms per query. 216 | - Scalability: Handle billions of searches daily. 217 | - Relevance: High click-through rate (CTR), conversion rate. 218 | - Availability: 99.9% uptime. 219 | 220 | 2. **Architecture Overview**: 221 | - **Components**: 222 | - **Query Processor**: Parses and normalizes queries. 223 | - **Search Index**: Retrieves candidate products. 224 | - **Recommender Model**: Reranks candidates with personalization. 225 | - **Feature Store**: Stores user/product features. 226 | - **Caching Layer**: Caches frequent query results. 227 | - **Orchestrator**: Combines search and recommendation. 228 | - **Monitoring**: Tracks CTR, latency, relevance. 229 | - **Flow**: 230 | - User enters query. 231 | - Query processor fetches candidates from search index. 232 | - Recommender reranks based on user profile. 233 | - Top products returned. 234 | 235 | 3. **Data Preparation**: 236 | - **Dataset**: 237 | - User data: `user_id`, `search_history`, `purchases`, `clicks`. 238 | - Product data: `product_id`, `title`, `category`, `price`, `ratings`. 239 | - Interaction data: Query-product pairs, clicks, purchases. 240 | - **Preprocessing**: 241 | - Tokenize queries, remove stop words. 242 | - Normalize product titles (e.g., stemming). 243 | - **Features**: 244 | - **Query**: Keywords, embeddings (e.g., BERT). 245 | - **User**: Purchase history, preferred categories, price range. 246 | - **Product**: Category, price, popularity, ratings. 247 | - **Context**: Time, device type. 248 | 249 | 4. **Model Training**: 250 | - **Algorithm**: 251 | - **Collaborative Filtering**: Matrix factorization (e.g., ALS) for user-item interactions. 252 | - **Content-Based**: Embeddings (e.g., product titles via BERT). 253 | - **Two-Tower Model**: Neural network with user and item towers for ranking. 254 | - **Why Two-Tower**: Balances personalization and relevance, scalable for inference. 255 | - **Training**: 256 | - Target: Click or purchase (binary classification/ranking). 
257 | - Loss: Pairwise ranking (e.g., BPR) or pointwise (e.g., cross-entropy). 258 | - Train on recent interactions (e.g., last 6 months). 259 | 260 | 5. **Model Evaluation**: 261 | - **Metrics**: 262 | - **NDCG**: Measures ranking quality. 263 | - **CTR**: Clicks per impression. 264 | - **Conversion Rate**: Purchases per recommendation. 265 | - **Offline**: Evaluate on held-out interactions. 266 | - **Online**: A/B test with live users. 267 | 268 | 6. **Model Productionization**: 269 | - **Serving**: 270 | - Deploy on TensorFlow Serving or TorchServe. 271 | - Use Elasticsearch for search index, Redis for feature store. 272 | - **Caching**: Cache top queries (e.g., “iPhone”) and their results. 273 | - **Latency Optimization**: 274 | - Precompute embeddings for products. 275 | - Limit candidates (e.g., top 100 from search index). 276 | - **Scalability**: 277 | - Shard search index by category. 278 | - Auto-scale inference servers for peak traffic. 279 | - **Fault Tolerance**: Fallback to non-personalized search if model fails. 280 | 281 | 7. **Monitoring and Maintenance**: 282 | - **Metrics**: NDCG, CTR, latency, cache hit rate. 283 | - **Drift**: Retrain weekly for new products/queries. 284 | - **Cold Start**: Use content-based features for new users/products. 285 | - **Logging**: Store queries and clicks for analysis. 286 | 287 | **Example**: 288 | - Query: “wireless headphones.” 289 | - Features: User prefers electronics, $50-$100 range. 290 | - Output: Top 5 headphones, ranked by relevance + user history. 291 | - Result: CTR improves by 15% vs. baseline search. 292 | 293 | **Interview Tips**: 294 | - Clarify **personalization**: “Should it prioritize user history or query match?” 295 | - Draw **pipeline**: Query → Search → Rerank → Response. 296 | - Discuss **cold start**: “For new users, I’d use content-based filtering.” 297 | - Highlight **scalability**: “Sharding and caching handle billions of queries.” 298 | 299 | --- 300 | 301 | ## Notes 302 | 303 | - **Comprehensiveness**: Answers cover end-to-end design, from requirements to monitoring. 304 | - **Practicality**: Designs balance accuracy, latency, and scalability for real-world systems. 305 | - **Clarity**: Explanations are structured for verbal delivery, with clear components and trade-offs. 306 | - **Consistency**: Matches the style of `ml-coding.md`, `ml-theory.md`, `ml-algorithms.md`, and `applied-ml-cases.md`. 307 | 308 | For deeper practice, explore [ML Coding](ml-coding.md) for implementation details or [Applied ML Cases](applied-ml-cases.md) for business-focused problems. 🚀 309 | 310 | --- 311 | 312 | **Next Steps**: Solidify your prep with [Easy ML Questions](easy-ml.md) for fundamentals or dive into [Feature Engineering](feature-engineering.md) for data prep techniques! 🌟 -------------------------------------------------------------------------------- /questions/optimization-techniques.md: -------------------------------------------------------------------------------- 1 | # Optimization Techniques Questions 2 | 3 | This file contains optimization techniques questions commonly asked in machine learning interviews at companies like **Google**, **Amazon**, **Meta**, and others. These questions assess your **understanding** of optimization algorithms, their mechanics, and their application in training ML models, covering topics like gradient descent, second-order methods, and regularization. 
4 | 5 | Below are the questions with detailed answers, including explanations, mathematical intuition, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [What is gradient descent, and how does it work?](#1-what-is-gradient-descent-and-how-does-it-work) 12 | 2. [What is the difference between batch, mini-batch, and stochastic gradient descent?](#2-what-is-the-difference-between-batch-mini-batch-and-stochastic-gradient-descent) 13 | 3. [Why might gradient descent fail to converge, and how can you address it?](#3-why-might-gradient-descent-fail-to-converge-and-how-can-you-address-it) 14 | 4. [What is the role of learning rate in optimization?](#4-what-is-the-role-of-learning-rate-in-optimization) 15 | 5. [Explain momentum in the context of gradient descent](#5-explain-momentum-in-the-context-of-gradient-descent) 16 | 6. [What is the Adam optimizer, and why is it widely used?](#6-what-is-the-adam-optimizer-and-why-is-it-widely-used) 17 | 7. [What are second-order optimization methods, and how do they differ from first-order methods?](#7-what-are-second-order-optimization-methods-and-how-do-they-differ-from-first-order-methods) 18 | 8. [What is the role of regularization in optimization?](#8-what-is-the-role-of-regularization-in-optimization) 19 | 20 | --- 21 | 22 | ## 1. What is gradient descent, and how does it work? 23 | 24 | **Answer**: 25 | 26 | **Gradient descent** is an iterative optimization algorithm used to minimize a loss function by updating model parameters in the direction of the negative gradient. 27 | 28 | - **How It Works**: 29 | - **Objective**: Minimize loss `J(θ)`, where `θ` is parameters. 30 | - **Gradient**: Compute `∇J(θ)` (partial derivatives w.r.t. `θ`). 31 | - **Update Rule**: `θ ← θ - η * ∇J(θ)`, where `η` is learning rate. 32 | - **Iteration**: Repeat until convergence (e.g., small gradient or max steps). 33 | - **Intuition**: Move downhill on loss surface toward minimum. 34 | 35 | - **Key Aspects**: 36 | - **Loss Surface**: Convex (one minimum) or non-convex (multiple minima). 37 | - **Convergence**: Guaranteed for convex functions with proper `η`. 38 | - **ML Context**: Train models (e.g., linear regression, neural nets). 39 | 40 | **Example**: 41 | - Task: Fit linear regression `y = θx`. 42 | - Loss: `J(θ) = Σ(y_i - θx_i)²`. 43 | - Gradient: `∇J(θ) = -Σx_i(y_i - θx_i)`. 44 | - Update: Adjust `θ` to reduce error. 45 | 46 | **Interview Tips**: 47 | - Explain intuition: “Follow steepest descent to minimize loss.” 48 | - Mention variants: “Batch, mini-batch, stochastic.” 49 | - Be ready to derive: “Show update for linear regression.” 50 | 51 | --- 52 | 53 | ## 2. What is the difference between batch, mini-batch, and stochastic gradient descent? 54 | 55 | **Answer**: 56 | 57 | - **Batch Gradient Descent**: 58 | - **How**: Compute gradient over entire dataset. 59 | - **Update**: `θ ← θ - η * (1/n) Σ_1^n ∇J(θ, x_i, y_i)`. 60 | - **Pros**: 61 | - Stable updates, accurate gradients. 62 | - Converges to global minimum (convex loss). 63 | - **Cons**: 64 | - Slow for large datasets (full pass per update). 65 | - High memory usage. 66 | - **Use Case**: Small datasets, convex problems. 67 | 68 | - **Mini-Batch Gradient Descent**: 69 | - **How**: Compute gradient over small subsets (batches). 70 | - **Update**: `θ ← θ - η * (1/b) Σ_i∈batch ∇J(θ, x_i, y_i)`, `b` is batch size. 71 | - **Pros**: 72 | - Balances speed and stability. 73 | - Leverages GPU parallelism. 74 | - **Cons**: 75 | - Batch size tuning needed. 
76 | - Some noise in gradients. 77 | - **Use Case**: Standard for deep learning (e.g., batch size 32). 78 | 79 | - **Stochastic Gradient Descent (SGD)**: 80 | - **How**: Compute gradient for one sample at a time. 81 | - **Update**: `θ ← θ - η * ∇J(θ, x_i, y_i)`, random `i`. 82 | - **Pros**: 83 | - Fast updates, low memory. 84 | - Escapes local minima (non-convex). 85 | - **Cons**: 86 | - Noisy gradients, unstable convergence. 87 | - Needs careful learning rate. 88 | - **Use Case**: Online learning, large datasets. 89 | 90 | - **Key Differences**: 91 | - **Data Used**: Batch (all), mini-batch (subset), SGD (one). 92 | - **Stability**: Batch is smoothest, SGD is noisiest. 93 | - **Speed**: SGD fastest per update, batch slowest. 94 | - **ML Context**: Mini-batch for most neural nets, SGD for streaming. 95 | 96 | **Example**: 97 | - Dataset: 1M images. 98 | - Batch: Too slow (1M gradients/update). 99 | - Mini-batch: Train with 128 images/update. 100 | - SGD: Update per image, fast but noisy. 101 | 102 | **Interview Tips**: 103 | - Highlight trade-offs: “Mini-batch balances speed and accuracy.” 104 | - Discuss ML: “Mini-batch suits GPUs.” 105 | - Be ready to compare: “Show noise vs. convergence.” 106 | 107 | --- 108 | 109 | ## 3. Why might gradient descent fail to converge, and how can you address it? 110 | 111 | **Answer**: 112 | 113 | Gradient descent may fail to converge due to several issues: 114 | 115 | - **Issues**: 116 | - **Learning Rate Too High**: 117 | - Overshoots minimum, causes divergence. 118 | - **Fix**: Reduce `η` or use adaptive rates (e.g., Adam). 119 | - **Learning Rate Too Low**: 120 | - Slow progress, stuck in plateaus. 121 | - **Fix**: Increase `η` or use learning rate schedules. 122 | - **Non-Convex Loss**: 123 | - Local minima, saddle points trap updates. 124 | - **Fix**: Use momentum, SGD noise, or initialization tricks. 125 | - **Vanishing/Exploding Gradients**: 126 | - Deep nets: Gradients too small/large. 127 | - **Fix**: Gradient clipping, normalization (e.g., BatchNorm). 128 | - **Numerical Instability**: 129 | - Floating-point errors in large models. 130 | - **Fix**: Use stable loss (e.g., log-sum-exp), mixed precision. 131 | - **Bad Initialization**: 132 | - Poor starting `θ` leads to slow or divergent paths. 133 | - **Fix**: Xavier/He initialization. 134 | 135 | - **General Fixes**: 136 | - **Learning Rate Schedules**: Decay `η` (e.g., exponential, cosine annealing). 137 | - **Adaptive Optimizers**: Adam, RMSprop adjust `η` per parameter. 138 | - **Early Stopping**: Halt if loss plateaus. 139 | - **Monitoring**: Track loss/gradients to debug. 140 | 141 | **Example**: 142 | - Problem: Neural net loss oscillates. 143 | - Cause: `η = 0.1` too high. 144 | - Fix: Set `η = 0.001`, use Adam. 145 | 146 | **Interview Tips**: 147 | - Prioritize learning rate: “Most common convergence issue.” 148 | - Suggest diagnostics: “Plot loss to spot problems.” 149 | - Be ready to sketch: “Show overshooting vs. stuck.” 150 | 151 | --- 152 | 153 | ## 4. What is the role of learning rate in optimization? 154 | 155 | **Answer**: 156 | 157 | The **learning rate** (`η`) controls the step size of parameter updates in gradient descent, balancing speed and stability. 158 | 159 | - **Role**: 160 | - **Update Size**: Scales gradient: `θ ← θ - η * ∇J(θ)`. 161 | - **Convergence**: 162 | - High `η`: Fast but risks overshooting/divergence. 163 | - Low `η`: Stable but slow, may stall. 164 | - **Trade-Off**: Optimal `η` minimizes iterations to minimum. 
165 | 166 | - **In Practice**: 167 | - **Tuning**: 168 | - Start with defaults (e.g., 0.001 for Adam). 169 | - Grid search or learning rate finder (e.g., cyclical rates). 170 | - **Schedules**: 171 | - Decay: Reduce `η` over time (e.g., `η_t = η_0 / (1 + kt)`). 172 | - Cyclical: Vary `η` between bounds. 173 | - **Adaptive Methods**: Adam, Adagrad adjust `η` per parameter. 174 | - **ML Context**: Critical for neural nets (e.g., CNNs, transformers). 175 | 176 | **Example**: 177 | - Task: Train ResNet. 178 | - `η = 0.1`: Diverges (loss explodes). 179 | - `η = 0.001`: Converges in 10 epochs. 180 | 181 | **Interview Tips**: 182 | - Explain balance: “Too high diverges, too low crawls.” 183 | - Mention schedules: “Decay helps late-stage convergence.” 184 | - Be ready to tune: “Describe grid search for `η`.” 185 | 186 | --- 187 | 188 | ## 5. Explain momentum in the context of gradient descent 189 | 190 | **Answer**: 191 | 192 | **Momentum** accelerates gradient descent by incorporating past gradients, smoothing updates and speeding convergence. 193 | 194 | - **How It Works**: 195 | - **Standard GD**: `θ ← θ - η * ∇J(θ)`. 196 | - **Momentum**: 197 | - Track velocity: `v_t = γ * v_{t-1} + η * ∇J(θ)`. 198 | - Update: `θ ← θ - v_t`. 199 | - `γ` (e.g., 0.9) is momentum term. 200 | - **Intuition**: Like a ball rolling downhill, builds speed in consistent directions. 201 | 202 | - **Benefits**: 203 | - **Faster Convergence**: Accelerates in flat regions. 204 | - **Smoother Path**: Reduces oscillations in steep valleys. 205 | - **Escape Local Minima**: Momentum carries past small bumps. 206 | - **ML Context**: Key for deep nets with noisy gradients. 207 | 208 | - **Variants**: 209 | - **Nesterov Momentum**: 210 | - Look-ahead gradient: `v_t = γ * v_{t-1} + η * ∇J(θ - γ * v_{t-1})`. 211 | - More accurate updates. 212 | 213 | **Example**: 214 | - Loss: Narrow valley. 215 | - GD: Zigzags slowly. 216 | - Momentum: Smooths path, converges faster. 217 | 218 | **Interview Tips**: 219 | - Use analogy: “Momentum is like inertia.” 220 | - Highlight benefits: “Speeds up and stabilizes.” 221 | - Be ready to derive: “Show velocity update.” 222 | 223 | --- 224 | 225 | ## 6. What is the Adam optimizer, and why is it widely used? 226 | 227 | **Answer**: 228 | 229 | **Adam** (Adaptive Moment Estimation) is an optimization algorithm combining momentum and adaptive learning rates, widely used for its efficiency and robustness. 230 | 231 | - **How It Works**: 232 | - **Moments**: 233 | - First moment (mean): `m_t = β_1 * m_{t-1} + (1 - β_1) * ∇J(θ)` (momentum). 234 | - Second moment (variance): `v_t = β_2 * v_{t-1} + (1 - β_2) * (∇J(θ))²` (RMSprop). 235 | - **Bias Correction**: 236 | - Adjust for initialization: `m_t' = m_t / (1 - β_1^t)`, `v_t' = v_t / (1 - β_2^t)`. 237 | - **Update**: 238 | - `θ ← θ - η * m_t' / (√v_t' + ε)`, `ε` prevents division by zero. 239 | - **Hyperparameters**: 240 | - `β_1 = 0.9`, `β_2 = 0.999`, `η = 0.001`, `ε = 10^-8`. 241 | 242 | - **Why Widely Used**: 243 | - **Adaptive Rates**: Per-parameter learning rates via `v_t`. 244 | - **Fast Convergence**: Combines momentum’s speed with RMSprop’s stability. 245 | - **Robustness**: Works well across tasks (e.g., CNNs, transformers). 246 | - **Low Tuning**: Default parameters often suffice. 247 | - **ML Context**: Standard for deep learning (e.g., PyTorch, TensorFlow). 248 | 249 | **Example**: 250 | - Task: Train BERT. 251 | - Adam: Converges in 3 epochs vs. SGD’s 10. 
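To ground the update rule above, here is a minimal NumPy sketch of Adam using the default hyperparameters listed; the toy quadratic loss, larger learning rate, and step count in the demo call are illustrative assumptions, not a production implementation.

```python
# Minimal Adam sketch: momentum (first moment) + adaptive scaling (second moment).
import numpy as np

def adam(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(theta)  # first moment estimate (mean of gradients)
    v = np.zeros_like(theta)  # second moment estimate (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy demo: minimize J(θ) = ||θ||², whose gradient is 2θ.
theta = adam(lambda th: 2 * th, np.array([5.0, -3.0]), lr=0.1, steps=500)
print(theta)  # should end up close to [0, 0]
```

In practice you would reach for a built-in implementation (e.g., `torch.optim.Adam` or `tf.keras.optimizers.Adam`) rather than hand-rolling the loop, but being able to write it out is a common interview ask.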
252 | 253 | **Interview Tips**: 254 | - Break down steps: “Momentum plus adaptive scaling.” 255 | - Highlight defaults: “0.001 works for most cases.” 256 | - Be ready to compare: “Vs. SGD: faster, less tuning.” 257 | 258 | --- 259 | 260 | ## 7. What are second-order optimization methods, and how do they differ from first-order methods? 261 | 262 | **Answer**: 263 | 264 | - **Second-Order Methods**: 265 | - **How**: Use second derivatives (Hessian) to capture curvature of loss surface. 266 | - **Examples**: 267 | - **Newton’s Method**: 268 | - Update: `θ ← θ - H^-1 * ∇J(θ)`, `H` is Hessian. 269 | - Exploits curvature for precise steps. 270 | - **Quasi-Newton** (e.g., BFGS): 271 | - Approximate Hessian to reduce computation. 272 | - **Pros**: 273 | - Faster convergence (fewer iterations). 274 | - Handles ill-conditioned surfaces (e.g., narrow valleys). 275 | - **Cons**: 276 | - Computationally expensive (`O(n²)` for Hessian). 277 | - Memory-intensive for large models. 278 | - **Use Case**: Small-scale problems, logistic regression. 279 | 280 | - **First-Order Methods**: 281 | - **How**: Use only first derivatives (gradient). 282 | - **Examples**: Gradient descent, Adam, SGD. 283 | - **Pros**: 284 | - Scalable to large models (e.g., deep nets). 285 | - Low memory (`O(n)` for gradients). 286 | - **Cons**: 287 | - Slower convergence (many iterations). 288 | - Sensitive to learning rate, curvature. 289 | - **Use Case**: Deep learning, large datasets. 290 | 291 | - **Key Differences**: 292 | - **Information**: Second-order uses curvature; first-order uses slope. 293 | - **Complexity**: Second-order is `O(n²)`; first-order is `O(n)`. 294 | - **Scalability**: First-order for big models; second-order for small. 295 | - **ML Context**: First-order (Adam) dominates due to scale; second-order for niche cases. 296 | 297 | **Example**: 298 | - Task: Optimize small neural net. 299 | - Newton: Converges in 5 steps. 300 | - SGD: Needs 100 steps. 301 | 302 | **Interview Tips**: 303 | - Explain Hessian: “Curvature guides better steps.” 304 | - Discuss limits: “Second-order too slow for deep learning.” 305 | - Be ready to derive: “Show Newton’s update.” 306 | 307 | --- 308 | 309 | ## 8. What is the role of regularization in optimization? 310 | 311 | **Answer**: 312 | 313 | **Regularization** adds constraints to optimization to prevent overfitting, improve generalization, and stabilize training. 314 | 315 | - **Role**: 316 | - **Penalize Complexity**: Add term to loss: `J_total = J(θ) + λR(θ)`. 317 | - `J(θ)`: Original loss (e.g., MSE). 318 | - `R(θ)`: Regularizer (e.g., L2 norm). 319 | - `λ`: Controls regularization strength. 320 | - **Reduce Overfitting**: Discourage large weights, favor simpler models. 321 | - **Stabilize Optimization**: Smooth loss surface, avoid numerical issues. 322 | 323 | - **Common Types**: 324 | - **L2 Regularization** (Weight Decay): 325 | - `R(θ) = ||θ||₂²`. 326 | - Shrinks weights, prevents dominance by few features. 327 | - **L1 Regularization**: 328 | - `R(θ) = ||θ||₁`. 329 | - Promotes sparsity (e.g., feature selection). 330 | - **Dropout**: 331 | - Randomly drop neurons during training. 332 | - Implicitly regularizes neural nets. 333 | - **Early Stopping**: 334 | - Halt training when validation loss plateaus. 335 | - Prevents overfitting without modifying loss. 336 | 337 | - **In Optimization**: 338 | - **Gradient Update**: `∇J_total = ∇J(θ) + λ∇R(θ)`. 339 | - Example: L2 adds `2λθ` to gradient. 340 | - **Effect**: Balances fit vs. simplicity, guides to robust minima. 
341 | - **ML Context**: Essential for deep nets, high-dimensional data. 342 | 343 | **Example**: 344 | - Task: Train CNN. 345 | - Without L2: Overfits (train acc=0.99, test=0.7). 346 | - With L2 (`λ=0.01`): Generalizes (test acc=0.85). 347 | 348 | **Interview Tips**: 349 | - Link to overfitting: “Regularization simplifies models.” 350 | - Compare L1/L2: “L1 for sparsity, L2 for smoothness.” 351 | - Be ready to derive: “Show L2 gradient term.” 352 | 353 | --- 354 | 355 | ## Notes 356 | 357 | - **Focus**: Answers cover optimization techniques critical for ML interviews. 358 | - **Clarity**: Explanations are structured for verbal delivery, with examples and trade-offs. 359 | - **Depth**: Includes mathematical rigor (e.g., Adam updates, Hessian) and ML applications (e.g., deep learning). 360 | - **Consistency**: Matches the style of previous files for a cohesive repository. 361 | 362 | For deeper practice, apply these to neural nets (see [Deep Learning](deep-learning.md)) or explore [Production MLOps](production-mlops.md) for scaling optimization. 🚀 363 | 364 | --- 365 | 366 | **Next Steps**: Build on these skills with [Computer Vision](computer-vision.md) for CNN optimization or revisit [Statistics & Probability](statistics-probability.md) for loss function math! 🌟 -------------------------------------------------------------------------------- /questions/statistics-probability.md: -------------------------------------------------------------------------------- 1 | # Statistics and Probability Questions 2 | 3 | This file contains statistics and probability questions commonly asked in machine learning interviews at companies like **Google**, **Amazon**, **Meta**, and others. These questions assess your **foundational understanding** of statistical concepts, probability theory, and their applications in ML, such as model evaluation, uncertainty quantification, and data analysis. 4 | 5 | Below are the questions with detailed answers, including explanations, mathematical intuition, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [What is the difference between population and sample?](#1-what-is-the-difference-between-population-and-sample) 12 | 2. [Explain the Central Limit Theorem and its significance in machine learning](#2-explain-the-central-limit-theorem-and-its-significance-in-machine-learning) 13 | 3. [What is a p-value, and how is it used in hypothesis testing?](#3-what-is-a-p-value-and-how-is-it-used-in-hypothesis-testing) 14 | 4. [What is the difference between Type I and Type II errors?](#4-what-is-the-difference-between-type-i-and-type-ii-errors) 15 | 5. [What are bias and variance in the context of machine learning?](#5-what-are-bias-and-variance-in-the-context-of-machine-learning) 16 | 6. [What is the difference between a probability density function and a cumulative distribution function?](#6-what-is-the-difference-between-a-probability-density-function-and-a-cumulative-distribution-function) 17 | 7. [Explain Bayes’ Theorem and its applications in machine learning](#7-explain-bayes-theorem-and-its-applications-in-machine-learning) 18 | 8. [What is the difference between correlation and causation?](#8-what-is-the-difference-between-correlation-and-causation) 19 | 20 | --- 21 | 22 | ## 1. What is the difference between population and sample? 23 | 24 | **Answer**: 25 | 26 | - **Population**: 27 | - **Definition**: The entire set of individuals or observations that the study aims to describe (e.g., all users of an app). 
28 | - **Parameters**: Described by true values (e.g., population mean `μ`, variance `σ²`). 29 | - **Use**: Ideal for analysis but often impractical due to size or cost. 30 | - **Example**: All global smartphone users. 31 | 32 | - **Sample**: 33 | - **Definition**: A subset of the population selected for analysis (e.g., 1000 surveyed users). 34 | - **Statistics**: Estimates population parameters (e.g., sample mean `x̄`, variance `s²`). 35 | - **Use**: Practical for inference, assuming proper sampling (e.g., random). 36 | - **Example**: 1000 randomly selected smartphone users. 37 | 38 | - **Key Differences**: 39 | - **Scope**: Population is complete; sample is partial. 40 | - **Values**: Population has true parameters; sample has estimates. 41 | - **Purpose**: Population defines truth; sample approximates it. 42 | - **ML Context**: Models train on samples (e.g., datasets) to generalize to populations. 43 | 44 | **Example**: 45 | - Population: All customer transactions. 46 | - Sample: 10,000 transactions for churn prediction model. 47 | 48 | **Interview Tips**: 49 | - Clarify inference: “Samples estimate population traits.” 50 | - Mention sampling: “Random sampling reduces bias.” 51 | - Be ready to discuss: “Impact of bad sampling on ML.” 52 | 53 | --- 54 | 55 | ## 2. Explain the Central Limit Theorem and its significance in machine learning 56 | 57 | **Answer**: 58 | 59 | The **Central Limit Theorem (CLT)** states that the distribution of the sample mean (or sum) of a sufficiently large number of independent, identically distributed (i.i.d.) random variables approaches a normal distribution, regardless of the underlying distribution, provided the variance is finite. 60 | 61 | - **Key Points**: 62 | - **Conditions**: 63 | - Sample size `n` is large (typically `n ≥ 30`). 64 | - Variables are i.i.d. 65 | - Finite mean `μ` and variance `σ²`. 66 | - **Result**: Sample mean `x̄ ~ N(μ, σ²/n)` (standard error = `σ/√n`). 67 | - **Intuition**: Averages smooth out quirks of the original distribution. 68 | 69 | - **Significance in ML**: 70 | - **Confidence Intervals**: Estimate model metrics (e.g., accuracy) with normal-based intervals. 71 | - **Hypothesis Testing**: Use normal assumptions for p-values in A/B tests. 72 | - **Data Analysis**: Justifies normality of aggregated metrics (e.g., average user time). 73 | - **Feature Engineering**: Supports scaling/transformations assuming normality. 74 | - **Gradient Descent**: Errors in large batches approximate normality, aiding optimization. 75 | 76 | **Example**: 77 | - Data: User session times (skewed). 78 | - CLT: Mean of 100 sessions ~ normal, enabling t-tests for significance. 79 | 80 | **Interview Tips**: 81 | - Emphasize normality: “CLT makes means normal for large `n`.” 82 | - Link to ML: “Critical for testing model performance.” 83 | - Be ready to sketch: “Show skewed data → normal means.” 84 | 85 | --- 86 | 87 | ## 3. What is a p-value, and how is it used in hypothesis testing? 88 | 89 | **Answer**: 90 | 91 | A **p-value** is the probability of observing data (or more extreme) under the null hypothesis, used to assess evidence against it. 92 | 93 | - **How It Works**: 94 | - **Null Hypothesis (H₀)**: Assumes no effect (e.g., “new model = old model”). 95 | - **Alternative Hypothesis (H₁)**: Claims an effect (e.g., “new model > old”). 96 | - **Test Statistic**: Compute metric (e.g., t-statistic) from data. 97 | - **P-value**: `P(data | H₀)`—small p-value suggests H₀ is unlikely. 98 | - **Threshold (α)**: Reject H₀ if p < α (e.g., 0.05). 
99 | 100 | - **In Hypothesis Testing**: 101 | - **Steps**: 102 | 1. Define H₀, H₁. 103 | 2. Choose test (e.g., t-test, chi-squared). 104 | 3. Compute p-value. 105 | 4. Decide: Reject H₀ if p < α, else fail to reject. 106 | - **Interpretation**: Small p-value indicates strong evidence against H₀. 107 | 108 | - **ML Use Cases**: 109 | - A/B testing: Compare model performance (e.g., click rates). 110 | - Feature selection: Test feature significance. 111 | - Model evaluation: Check if improvements are significant. 112 | 113 | **Example**: 114 | - Test: Does new model increase accuracy? 115 | - H₀: No difference. 116 | - P-value = 0.03 < 0.05 → Reject H₀, new model is better. 117 | 118 | **Interview Tips**: 119 | - Clarify meaning: “P-value measures evidence, not truth.” 120 | - Avoid pitfalls: “Small p doesn’t prove H₁.” 121 | - Be ready to compute: “Explain t-test p-value.” 122 | 123 | --- 124 | 125 | ## 4. What is the difference between Type I and Type II errors? 126 | 127 | **Answer**: 128 | 129 | - **Type I Error (False Positive)**: 130 | - **Definition**: Reject the null hypothesis (H₀) when it is true. 131 | - **Probability**: Denoted `α` (significance level, e.g., 0.05). 132 | - **Example**: Conclude a model improves accuracy (H₀: no improvement) when it doesn’t. 133 | - **Impact**: Overconfidence in results, deploying ineffective models. 134 | - **ML Context**: False alarm in fraud detection (flag innocent transaction). 135 | 136 | - **Type II Error (False Negative)**: 137 | - **Definition**: Fail to reject H₀ when it is false. 138 | - **Probability**: Denoted `β` (1 - power). 139 | - **Example**: Miss that a model improves accuracy (H₀ false) when it does. 140 | - **Impact**: Missed opportunities, underutilizing better models. 141 | - **ML Context**: Miss fraud in detection (let guilty transaction pass). 142 | 143 | - **Key Differences**: 144 | - **Error Type**: Type I = wrong rejection; Type II = wrong acceptance. 145 | - **Probability**: Type I controlled by `α`; Type II by `β`. 146 | - **Trade-Off**: Reducing Type I (lower `α`) increases Type II, and vice versa. 147 | - **Priority**: Depends on context (e.g., medical tests prioritize low Type II). 148 | 149 | **Example**: 150 | - Fraud Model: 151 | - Type I: Flag innocent user (H₀: not fraud, rejected). 152 | - Type II: Miss fraudster (H₀: not fraud, accepted). 153 | 154 | **Interview Tips**: 155 | - Use terms: “Type I is false positive, Type II is false negative.” 156 | - Discuss trade-offs: “Lower α reduces Type I but risks Type II.” 157 | - Be ready to contextualize: “Fraud needs low Type II.” 158 | 159 | --- 160 | 161 | ## 5. What are bias and variance in the context of machine learning? 162 | 163 | **Answer**: 164 | 165 | - **Bias**: 166 | - **Definition**: Error due to overly simplistic models (underfitting). 167 | - **Cause**: Model assumes too much (e.g., linear model for non-linear data). 168 | - **Effect**: High training error, poor fit to data. 169 | - **Example**: Linear regression on quadratic data misses curves. 170 | - **Math**: `Bias = E[f̂(x)] - f(x)`, where `f̂` is model, `f` is truth. 171 | 172 | - **Variance**: 173 | - **Definition**: Error due to sensitivity to training data (overfitting). 174 | - **Cause**: Model too complex (e.g., deep tree fits noise). 175 | - **Effect**: Low training error, high test error. 176 | - **Example**: Decision tree memorizes training points, fails on new data. 177 | - **Math**: `Variance = E[(f̂(x) - E[f̂(x)])²]`. 
178 | 179 | - **Bias-Variance Trade-Off**: 180 | - **Goal**: Minimize total error = `Bias² + Variance + Irreducible Error`. 181 | - **High Bias**: Simple models (e.g., linear regression). 182 | - **High Variance**: Complex models (e.g., deep nets). 183 | - **Balance**: Use regularization, ensembles, or model selection. 184 | 185 | **Example**: 186 | - Task: Predict house prices. 187 | - High Bias: Linear model (misses patterns, error=10K). 188 | - High Variance: 100-layer net (fits noise, test error=15K). 189 | - Balanced: Ridge regression (error=5K). 190 | 191 | **Interview Tips**: 192 | - Explain intuition: “Bias underfits, variance overfits.” 193 | - Mention trade-off: “Tune complexity for balance.” 194 | - Be ready to sketch: “Show error vs. complexity curve.” 195 | 196 | --- 197 | 198 | ## 6. What is the difference between a probability density function and a cumulative distribution function? 199 | 200 | **Answer**: 201 | 202 | - **Probability Density Function (PDF)**: 203 | - **Definition**: For continuous random variables, describes the likelihood of a variable taking a specific value. 204 | - **Properties**: 205 | - `f(x) ≥ 0` (non-negative). 206 | - Integral over all `x`: `∫f(x)dx = 1`. 207 | - Probability over interval: `P(a ≤ X ≤ b) = ∫_a^b f(x)dx`. 208 | - **Use**: Model distributions (e.g., Gaussian for errors). 209 | - **Example**: Normal PDF `f(x) = (1/√(2πσ²))e^(-(x-μ)²/(2σ²))`. 210 | 211 | - **Cumulative Distribution Function (CDF)**: 212 | - **Definition**: Gives the probability that a random variable is less than or equal to a value: `F(x) = P(X ≤ x)`. 213 | - **Properties**: 214 | - `0 ≤ F(x) ≤ 1`. 215 | - Monotonically increasing. 216 | - `F(x) = ∫_{-∞}^x f(t)dt` (for continuous variables). 217 | - **Use**: Compute probabilities, quantiles (e.g., 95th percentile). 218 | - **Example**: Normal CDF gives `P(X ≤ 1)`. 219 | 220 | - **Key Differences**: 221 | - **Role**: PDF gives density at a point; CDF gives cumulative probability. 222 | - **Output**: PDF can exceed 1 (density); CDF is [0,1]. 223 | - **Computation**: PDF is derivative of CDF; CDF is integral of PDF. 224 | - **ML Context**: PDF for likelihoods (e.g., GMM); CDF for thresholds (e.g., anomaly detection). 225 | 226 | **Example**: 227 | - Normal Distribution: 228 | - PDF: Peak at mean, describes spread. 229 | - CDF: Sigmoid-like, gives `P(X < 0)`. 230 | 231 | **Interview Tips**: 232 | - Clarify continuous: “PDF for continuous, PMF for discrete.” 233 | - Link to ML: “PDF in loss, CDF in thresholds.” 234 | - Be ready to plot: “Show PDF vs. CDF curves.” 235 | 236 | --- 237 | 238 | ## 7. Explain Bayes’ Theorem and its applications in machine learning 239 | 240 | **Answer**: 241 | 242 | **Bayes’ Theorem** describes how to update probabilities based on new evidence, fundamental to probabilistic reasoning. 243 | 244 | - **Formula**: 245 | - `P(A|B) = [P(B|A) * P(A)] / P(B)`. 246 | - **Terms**: 247 | - `P(A|B)`: Posterior (probability of A given B). 248 | - `P(B|A)`: Likelihood (probability of B given A). 249 | - `P(A)`: Prior (initial belief about A). 250 | - `P(B)`: Evidence (normalizing constant). 251 | 252 | - **Intuition**: 253 | - Combines prior knowledge with observed data to refine beliefs. 254 | - Example: Update disease probability given test result. 255 | 256 | - **Applications in ML**: 257 | - **Naive Bayes Classifier**: 258 | - Assumes feature independence. 259 | - Uses Bayes to compute `P(class|features)`. 260 | - Example: Spam detection. 
261 | - **Bayesian Inference**: 262 | - Update model parameters (e.g., Gaussian Process priors). 263 | - Example: Hyperparameter tuning. 264 | - **Probabilistic Models**: 265 | - VAEs, Bayesian neural nets model uncertainty. 266 | - Example: Predict with confidence intervals. 267 | - **Anomaly Detection**: 268 | - Compute `P(data|normal)` vs. `P(data|anomaly)`. 269 | - **Recommendation Systems**: 270 | - Update user preferences based on interactions. 271 | 272 | **Example**: 273 | - Task: Diagnose disease (1% prevalence, test 95% accurate). 274 | - Bayes: `P(disease|positive) = [P(positive|disease) * P(disease)] / P(positive)`. 275 | 276 | **Interview Tips**: 277 | - Break down formula: “Prior, likelihood, posterior.” 278 | - Highlight ML: “Naive Bayes is a direct application.” 279 | - Be ready to derive: “Compute spam example.” 280 | 281 | --- 282 | 283 | ## 8. What is the difference between correlation and causation? 284 | 285 | **Answer**: 286 | 287 | - **Correlation**: 288 | - **Definition**: Measures the strength and direction of a linear relationship between two variables. 289 | - **Metric**: Pearson correlation coefficient `r` (-1 to 1). 290 | - `r = cov(X,Y) / (σ_X * σ_Y)`. 291 | - **Properties**: 292 | - No implication of cause. 293 | - Can be spurious (e.g., ice cream sales vs. drownings). 294 | - **ML Use**: Feature selection, exploratory analysis. 295 | - **Example**: Height and weight (r = 0.7). 296 | 297 | - **Causation**: 298 | - **Definition**: One variable directly influences another (cause → effect). 299 | - **Establishing**: 300 | - Randomized experiments (e.g., RCTs). 301 | - Causal inference (e.g., propensity scoring, DAGs). 302 | - **Properties**: 303 | - Requires control for confounders. 304 | - Harder to prove than correlation. 305 | - **ML Use**: Policy decisions, treatment effects. 306 | - **Example**: Medicine improves health (proven via trial). 307 | 308 | - **Key Differences**: 309 | - **Implication**: Correlation shows association; causation shows effect. 310 | - **Proof**: Correlation is statistical; causation needs experiments or causal models. 311 | - **Risk**: Assuming correlation = causation leads to errors. 312 | - **ML Context**: Correlation for patterns, causation for actions. 313 | 314 | **Example**: 315 | - Correlation: More ads, higher sales (r = 0.8). 316 | - Causation: Prove ads drive sales via A/B test. 317 | 318 | **Interview Tips**: 319 | - Stress caution: “Correlation doesn’t imply causation.” 320 | - Mention confounders: “Hidden variables cause spurious links.” 321 | - Be ready to clarify: “Use RCTs for causation.” 322 | 323 | --- 324 | 325 | ## Notes 326 | 327 | - **Focus**: Answers cover core stats and probability, critical for ML interviews. 328 | - **Clarity**: Explanations are structured for verbal delivery, with examples and trade-offs. 329 | - **Depth**: Includes mathematical rigor (e.g., Bayes, CLT) and ML applications (e.g., hypothesis testing in A/B tests). 330 | - **Consistency**: Matches the style of previous files for a cohesive repository. 331 | 332 | For deeper practice, apply these concepts in model evaluation (see [ML Algorithms](ml-algorithms.md)) or explore [Advanced ML](advanced-ml.md) for Bayesian methods. 🚀 333 | 334 | --- 335 | 336 | **Next Steps**: Build on these skills with [ML System Design](ml-system-design.md) for scaling statistical models or revisit [Time Series & Clustering](time-series-clustering.md) for time-based stats! 
🌟 -------------------------------------------------------------------------------- /questions/anomaly-detection.md: -------------------------------------------------------------------------------- 1 | # Anomaly Detection Questions 2 | 3 | This file contains anomaly detection questions commonly asked in machine learning interviews at companies like **Google**, **Amazon**, **Meta**, and others. These questions assess your **understanding** of statistical and ML methods for identifying outliers or unusual patterns in data, covering techniques, metrics, and applications. 4 | 5 | Below are the questions with detailed answers, including explanations, technical details, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [What is anomaly detection, and what are its common applications?](#1-what-is-anomaly-detection-and-what-are-its-common-applications) 12 | 2. [What is the difference between supervised and unsupervised anomaly detection?](#2-what-is-the-difference-between-supervised-and-unsupervised-anomaly-detection) 13 | 3. [How would you use statistical methods for anomaly detection?](#3-how-would-you-use-statistical-methods-for-anomaly-detection) 14 | 4. [What is the Isolation Forest algorithm, and how does it work?](#4-what-is-the-isolation-forest-algorithm-and-how-does-it-work) 15 | 5. [Explain the concept of autoencoders for anomaly detection](#5-explain-the-concept-of-autoencoders-for-anomaly-detection) 16 | 6. [What is the role of distance-based methods in anomaly detection?](#6-what-is-the-role-of-distance-based-methods-in-anomaly-detection) 17 | 7. [How do you evaluate the performance of an anomaly detection model?](#7-how-do-you-evaluate-the-performance-of-an-anomaly-detection-model) 18 | 8. [What are the challenges of anomaly detection in high-dimensional data?](#8-what-are-the-challenges-of-anomaly-detection-in-high-dimensional-data) 19 | 20 | --- 21 | 22 | ## 1. What is anomaly detection, and what are its common applications? 23 | 24 | **Answer**: 25 | 26 | **Anomaly detection** identifies data points, events, or patterns that deviate significantly from expected behavior, also known as outliers or novelties. 27 | 28 | - **Types**: 29 | - **Point Anomalies**: Single data point deviates (e.g., unusual transaction). 30 | - **Contextual Anomalies**: Deviates in context (e.g., high temp in winter). 31 | - **Collective Anomalies**: Group deviates (e.g., network attack pattern). 32 | 33 | - **Common Applications**: 34 | - **Fraud Detection**: Spot unusual credit card transactions. 35 | - **Cybersecurity**: Detect intrusions or malware. 36 | - **Healthcare**: Identify abnormal patient vitals (e.g., ECG spikes). 37 | - **Industrial Monitoring**: Flag machine failures (e.g., sensor outliers). 38 | - **Finance**: Monitor stock market irregularities. 39 | - **ML Context**: Clean datasets, improve model robustness. 40 | 41 | - **Approaches**: 42 | - Statistical (e.g., z-score). 43 | - ML (e.g., Isolation Forest, autoencoders). 44 | - Rule-based (e.g., thresholds). 45 | 46 | **Example**: 47 | - Task: Detect fraud. 48 | - Anomaly: $10,000 transaction vs. usual $100. 49 | 50 | **Interview Tips**: 51 | - Clarify types: “Point, contextual, collective matter.” 52 | - Link to domains: “Fraud and security are big.” 53 | - Be ready to list: “Name 3-5 use cases.” 54 | 55 | --- 56 | 57 | ## 2. What is the difference between supervised and unsupervised anomaly detection? 
58 | 59 | **Answer**: 60 | 61 | - **Supervised Anomaly Detection**: 62 | - **How**: Train model on labeled data (normal vs. anomalous). 63 | - **Process**: 64 | - Dataset: Examples of both classes (e.g., 99% normal, 1% fraud). 65 | - Model: Classifier (e.g., SVM, neural net). 66 | - Predict: Classify new points as normal/anomalous. 67 | - **Pros**: 68 | - High accuracy with good labels. 69 | - Leverages known patterns. 70 | - **Cons**: 71 | - Needs labeled data (rare for anomalies). 72 | - Imbalanced classes (few anomalies). 73 | - **Use Case**: Fraud detection with historical labels. 74 | 75 | - **Unsupervised Anomaly Detection**: 76 | - **How**: Identify anomalies without labels, assuming most data is normal. 77 | - **Process**: 78 | - Model: Learn normal patterns (e.g., clustering, autoencoders). 79 | - Flag: Points deviating from normal (e.g., high reconstruction error). 80 | - **Pros**: 81 | - No labels needed, widely applicable. 82 | - Handles unseen anomalies. 83 | - **Cons**: 84 | - Harder to tune thresholds. 85 | - May miss subtle anomalies. 86 | - **Use Case**: Industrial monitoring with no prior failures. 87 | 88 | - **Key Differences**: 89 | - **Labels**: Supervised uses them; unsupervised doesn’t. 90 | - **Data**: Supervised needs balanced labels; unsupervised assumes normal majority. 91 | - **Flexibility**: Unsupervised for new anomalies; supervised for known ones. 92 | - **ML Context**: Unsupervised is common due to label scarcity. 93 | 94 | **Example**: 95 | - Supervised: Train SVM on labeled fraud data. 96 | - Unsupervised: Use Isolation Forest on unlabeled sensor data. 97 | 98 | **Interview Tips**: 99 | - Highlight labels: “Supervised needs costly annotations.” 100 | - Discuss trade-offs: “Unsupervised is flexible but noisier.” 101 | - Be ready to suggest: “Unsupervised for new systems.” 102 | 103 | --- 104 | 105 | ## 3. How would you use statistical methods for anomaly detection? 106 | 107 | **Answer**: 108 | 109 | Statistical methods identify anomalies by modeling data distributions and flagging points with low probability. 110 | 111 | - **Methods**: 112 | - **Z-Score**: 113 | - **How**: Compute `z = (x - μ) / σ`, where `μ` is mean, `σ` is std dev. 114 | - **Threshold**: Flag if `|z| > 3` (e.g., >3 std devs). 115 | - **Use**: Univariate, normal-like data. 116 | - **Percentiles**: 117 | - **How**: Flag points outside extreme percentiles (e.g., <1% or >99%). 118 | - **Use**: Non-parametric, skewed data. 119 | - **Gaussian Mixture Models (GMM)**: 120 | - **How**: Fit multiple Gaussians, compute likelihood. 121 | - **Threshold**: Low `P(x)` = anomaly. 122 | - **Use**: Multivariate, complex distributions. 123 | - **Grubbs’ Test**: 124 | - **How**: Test for single outlier in normal data. 125 | - **Use**: Small datasets. 126 | - **Mahalanobis Distance**: 127 | - **How**: Measure distance accounting for covariance: `D = √((x-μ)^T Σ^-1 (x-μ))`. 128 | - **Threshold**: High `D` = anomaly. 129 | - **Use**: Multivariate normal data. 130 | 131 | - **Steps**: 132 | 1. Model data (e.g., fit Gaussian). 133 | 2. Compute anomaly score (e.g., z-score, likelihood). 134 | 3. Set threshold (e.g., 99th percentile). 135 | 4. Flag outliers. 136 | 137 | - **Pros**: 138 | - Interpretable, mathematically grounded. 139 | - Works with small data. 140 | - **Cons**: 141 | - Assumes distribution (e.g., normality). 142 | - Struggles with high dimensions. 143 | 144 | **Example**: 145 | - Data: Server response times. 146 | - Z-Score: Flag time with `z > 3` (e.g., 5s vs. mean 1s). 
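A minimal z-score sketch in NumPy, assuming μ and σ are estimated from a history of normal response times (all values below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical historical response times (seconds), roughly normal around 1s
history = rng.normal(loc=1.0, scale=0.1, size=1000)

mu, sigma = history.mean(), history.std()

# Score new observations against the historical distribution
new_times = np.array([0.95, 1.08, 5.0, 1.02])
z_scores = (new_times - mu) / sigma

anomalies = new_times[np.abs(z_scores) > 3]
print("z-scores:", np.round(z_scores, 1))
print("flagged anomalies:", anomalies)  # 5.0s is far past the |z| > 3 cutoff
```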
147 | 148 | **Interview Tips**: 149 | - Start simple: “Z-score is intuitive baseline.” 150 | - Mention limits: “Normality assumption often fails.” 151 | - Be ready to compute: “Show z-score formula.” 152 | 153 | --- 154 | 155 | ## 4. What is the Isolation Forest algorithm, and how does it work? 156 | 157 | **Answer**: 158 | 159 | **Isolation Forest** is an unsupervised anomaly detection algorithm that isolates anomalies by randomly partitioning data, leveraging that anomalies are easier to isolate. 160 | 161 | - **How It Works**: 162 | - **Concept**: Anomalies have fewer neighbors, require fewer splits to isolate. 163 | - **Process**: 164 | 1. **Build Trees**: 165 | - Create multiple random trees. 166 | - For each tree: 167 | - Randomly select feature. 168 | - Randomly split between min/max values. 169 | - Repeat until points are isolated or max depth. 170 | 2. **Path Length**: 171 | - Measure depth to isolate each point. 172 | - Anomalies have shorter paths (fewer splits). 173 | 3. **Score**: 174 | - Average path length across trees. 175 | - Normalize: `score ≈ 1` (anomaly), `≈ 0` (normal). 176 | - **Math**: Score = `2^(-E(h(x))/c(n))`, where `E(h(x))` is avg path length, `c(n)` is avg path for `n` points. 177 | 178 | - **Pros**: 179 | - Fast, scales to large datasets (`O(n log n)`). 180 | - Handles high dimensions. 181 | - No distribution assumptions. 182 | - **Cons**: 183 | - Random splits may miss complex anomalies. 184 | - Less interpretable than statistical methods. 185 | 186 | **Example**: 187 | - Data: Network traffic. 188 | - Isolation Forest: Flags packet with short path (e.g., 2 splits vs. 10). 189 | 190 | **Interview Tips**: 191 | - Explain intuition: “Anomalies are quick to isolate.” 192 | - Compare: “Vs. DBSCAN: faster, less parameter tuning.” 193 | - Be ready to sketch: “Show tree splitting point.” 194 | 195 | --- 196 | 197 | ## 5. Explain the concept of autoencoders for anomaly detection 198 | 199 | **Answer**: 200 | 201 | **Autoencoders** are neural networks trained to reconstruct input data, used for anomaly detection by flagging points with high reconstruction error. 202 | 203 | - **How They Work**: 204 | - **Architecture**: 205 | - Encoder: Compress input `x` to latent space `z` (e.g., `z = f(x)`). 206 | - Decoder: Reconstruct `x'` from `z` (e.g., `x' = g(z)`). 207 | - Bottleneck: `z` has lower dimension, forces learning. 208 | - **Training**: 209 | - Minimize loss: `L = ||x - x'||²` (MSE). 210 | - Train on normal data to learn typical patterns. 211 | - **Detection**: 212 | - Compute reconstruction error: `error = ||x - x'||²`. 213 | - Threshold: High error = anomaly (poorly reconstructed). 214 | - **Variants**: 215 | - Variational Autoencoder (VAE): Probabilistic latent space. 216 | - Denoising Autoencoder: Reconstruct from noisy input. 217 | 218 | - **Pros**: 219 | - Captures complex, non-linear patterns. 220 | - Scales to high-dimensional data (e.g., images). 221 | - Flexible with deep architectures. 222 | - **Cons**: 223 | - Needs normal-heavy data for training. 224 | - Compute-intensive, hard to tune. 225 | 226 | **Example**: 227 | - Task: Detect faulty machine parts. 228 | - Autoencoder: Trained on normal images, flags high-error defects. 229 | 230 | **Interview Tips**: 231 | - Highlight error: “Anomalies don’t reconstruct well.” 232 | - Link to ML: “Like unsupervised feature learning.” 233 | - Be ready to sketch: “Show encoder → bottleneck → decoder.” 234 | 235 | --- 236 | 237 | ## 6. What is the role of distance-based methods in anomaly detection? 
238 | 239 | **Answer**: 240 | 241 | **Distance-based methods** identify anomalies as points far from others in feature space, assuming anomalies are isolated. 242 | 243 | - **How They Work**: 244 | - **Distance Metric**: Compute distance (e.g., Euclidean, Manhattan). 245 | - **Approaches**: 246 | - **K-Nearest Neighbors (k-NN)**: 247 | - Compute distance to `k`-th nearest neighbor. 248 | - Threshold: High distance = anomaly. 249 | - **Local Outlier Factor (LOF)**: 250 | - Compare local density to neighbors’ density. 251 | - Score: High LOF = anomaly (lower density). 252 | - **Mahalanobis Distance**: 253 | - Use covariance: `D = √((x-μ)^T Σ^-1 (x-μ))`. 254 | - Flag high `D` as outliers. 255 | - **Thresholding**: Set cutoff (e.g., top 1% distances). 256 | 257 | - **Role**: 258 | - **Identify Outliers**: Detect points in sparse regions. 259 | - **Handle Multivariate**: Account for feature correlations (e.g., Mahalanobis). 260 | - **Simple Intuition**: Anomalies are “far” from normal clusters. 261 | 262 | - **Pros**: 263 | - Intuitive, no distribution assumptions. 264 | - Effective for low-to-moderate dimensions. 265 | - **Cons**: 266 | - Scales poorly (`O(n²)` for k-NN). 267 | - Fails in high dimensions (curse of dimensionality). 268 | 269 | **Example**: 270 | - Data: User logins. 271 | - k-NN: Flag login with high distance to 5th neighbor. 272 | 273 | **Interview Tips**: 274 | - Explain density: “Anomalies live in empty spaces.” 275 | - Mention limits: “High dimensions need preprocessing.” 276 | - Be ready to compute: “Show Euclidean distance.” 277 | 278 | --- 279 | 280 | ## 7. How do you evaluate the performance of an anomaly detection model? 281 | 282 | **Answer**: 283 | 284 | Evaluating anomaly detection models is tricky due to imbalance (few anomalies) and lack of labels in unsupervised cases. 285 | 286 | - **Metrics**: 287 | - **Supervised (Labeled Data)**: 288 | - **Precision/Recall/F1**: 289 | - Precision: `TP / (TP + FP)` (correct anomalies). 290 | - Recall: `TP / (TP + FN)` (found anomalies). 291 | - F1: `2 * (P * R) / (P + R)`. 292 | - **ROC-AUC**: Area under ROC curve (TPR vs. FPR). 293 | - **Use**: F1 for imbalance, AUC for threshold trade-offs. 294 | - **Unsupervised (No Labels)**: 295 | - **Reconstruction Error**: High error = anomaly (e.g., autoencoders). 296 | - **Distance/Score**: Rank points, evaluate top `k` (e.g., Isolation Forest). 297 | - **Use**: Compare to known anomalies if available. 298 | - **Precision@K**: Fraction of top `K` predictions that are anomalies. 299 | - **Use**: When only some anomalies are verified. 300 | 301 | - **Techniques**: 302 | - **Confusion Matrix**: Analyze TP, FP, TN, FN (supervised). 303 | - **Threshold Tuning**: Adjust cutoff to balance precision/recall. 304 | - **Visualization**: Plot scores, inspect outliers (e.g., t-SNE). 305 | - **Domain Feedback**: Validate with experts (e.g., fraud team). 306 | 307 | - **Challenges**: 308 | - Imbalance: Few anomalies skew accuracy. 309 | - No Labels: Hard to quantify unsupervised performance. 310 | - Context: Anomalies vary by domain (e.g., fraud vs. sensor). 311 | 312 | **Example**: 313 | - Task: Detect network intrusions. 314 | - Metrics: F1 = 0.7 (supervised), Precision@100 = 0.8 (unsupervised). 315 | 316 | **Interview Tips**: 317 | - Stress imbalance: “F1 over accuracy for rare anomalies.” 318 | - Discuss unsupervised: “Use domain knowledge to validate.” 319 | - Be ready to plot: “Show ROC curve.” 320 | 321 | --- 322 | 323 | ## 8. 
What are the challenges of anomaly detection in high-dimensional data? 324 | 325 | **Answer**: 326 | 327 | High-dimensional data poses unique challenges for anomaly detection: 328 | 329 | - **Curse of Dimensionality**: 330 | - **Issue**: Distances become uniform, making outliers hard to spot. 331 | - **Fix**: Dimensionality reduction (e.g., PCA, t-SNE). 332 | - **Sparsity**: 333 | - **Issue**: Data spreads thinly, reducing density differences. 334 | - **Fix**: Feature selection (e.g., remove low-variance features). 335 | - **Computational Cost**: 336 | - **Issue**: Algorithms like k-NN scale poorly (`O(n²d)`). 337 | - **Fix**: Use approximate methods (e.g., random projections). 338 | - **Irrelevant Features**: 339 | - **Issue**: Noise dilutes anomaly signals. 340 | - **Fix**: Feature engineering, robust stats (e.g., Mahalanobis). 341 | - **Complex Patterns**: 342 | - **Issue**: Anomalies may only appear in subspaces. 343 | - **Fix**: Subspace clustering, deep methods (e.g., autoencoders). 344 | - **Label Scarcity**: 345 | - **Issue**: High-dimensional data hard to label, limits supervised methods. 346 | - **Fix**: Unsupervised (e.g., Isolation Forest), semi-supervised. 347 | 348 | **Example**: 349 | - Data: 1000D image features. 350 | - Challenge: k-NN slow, distances meaningless. 351 | - Fix: PCA to 50D, run Isolation Forest. 352 | 353 | **Interview Tips**: 354 | - Highlight dimensionality: “Uniform distances kill detection.” 355 | - Suggest fixes: “PCA or autoencoders help.” 356 | - Be ready to explain: “Why high-D breaks k-NN.” 357 | 358 | --- 359 | 360 | ## Notes 361 | 362 | - **Focus**: Answers cover anomaly detection techniques, ideal for ML interviews. 363 | - **Clarity**: Explanations are structured for verbal delivery, with examples and trade-offs. 364 | - **Depth**: Includes statistical (e.g., z-score), ML (e.g., autoencoders), and practical tips (e.g., evaluation). 365 | - **Consistency**: Matches the style of previous files for a cohesive repository. 366 | 367 | For deeper practice, apply these to fraud detection (see [Applied ML Cases](applied-ml-cases.md)) or explore [Statistics & Probability](statistics-probability.md) for statistical foundations. 🚀 368 | 369 | --- 370 | 371 | **Next Steps**: Build on these skills with [Optimization Techniques](optimization-techniques.md) for model training or revisit [Time Series & Clustering](time-series-clustering.md) for time-based anomalies! 🌟 -------------------------------------------------------------------------------- /questions/reinforcement-learning.md: -------------------------------------------------------------------------------- 1 | # Reinforcement Learning Questions 2 | 3 | This file contains reinforcement learning (RL) questions commonly asked in machine learning interviews at companies like **Google**, **Amazon**, **Meta**, and others. These questions assess your **understanding** of RL concepts, algorithms, and applications, covering topics like Markov Decision Processes (MDPs), Q-learning, and modern methods like PPO. 4 | 5 | Below are the questions with detailed answers, including explanations, mathematical intuition, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [What is reinforcement learning, and how does it differ from supervised learning?](#1-what-is-reinforcement-learning-and-how-does-it-differ-from-supervised-learning) 12 | 2. [What is a Markov Decision Process (MDP)?](#2-what-is-a-markov-decision-process-mdp) 13 | 3. 
[Explain the difference between value-based and policy-based RL methods](#3-explain-the-difference-between-value-based-and-policy-based-rl-methods) 14 | 4. [What is Q-learning, and how does it work?](#4-what-is-q-learning-and-how-does-it-work) 15 | 5. [What is the exploration-exploitation trade-off in RL?](#5-what-is-the-exploration-exploitation-trade-off-in-rl) 16 | 6. [What is the Proximal Policy Optimization (PPO) algorithm?](#6-what-is-the-proximal-policy-optimization-ppo-algorithm) 17 | 7. [What are the advantages and disadvantages of model-based vs. model-free RL?](#7-what-are-the-advantages-and-disadvantages-of-model-based-vs-model-free-rl) 18 | 8. [How do you evaluate the performance of a reinforcement learning agent?](#8-how-do-you-evaluate-the-performance-of-a-reinforcement-learning-agent) 19 | 20 | --- 21 | 22 | ## 1. What is reinforcement learning, and how does it differ from supervised learning? 23 | 24 | **Answer**: 25 | 26 | **Reinforcement Learning (RL)** is a paradigm where an agent learns to make decisions by interacting with an environment, maximizing cumulative rewards through trial and error. 27 | 28 | - **How RL Works**: 29 | - **Components**: 30 | - Agent: Decision-maker. 31 | - Environment: System agent interacts with. 32 | - State (s): Current situation. 33 | - Action (a): Choice made by agent. 34 | - Reward (r): Feedback from environment. 35 | - **Goal**: Learn policy `π(a|s)` to maximize expected reward `E[Σ γ^t r_t]`, where `γ` is discount factor. 36 | - **Process**: Agent explores, observes rewards, updates policy. 37 | 38 | - **Vs. Supervised Learning**: 39 | - **Data**: 40 | - RL: No labeled dataset; learns from rewards. 41 | - Supervised: Labeled input-output pairs. 42 | - **Feedback**: 43 | - RL: Delayed, sparse rewards. 44 | - Supervised: Immediate error (e.g., loss). 45 | - **Goal**: 46 | - RL: Optimize long-term reward. 47 | - Supervised: Minimize prediction error. 48 | - **Exploration**: 49 | - RL: Balances exploration/exploitation. 50 | - Supervised: Uses fixed dataset. 51 | - **Use Case**: 52 | - RL: Robotics, games. 53 | - Supervised: Classification, regression. 54 | 55 | **Example**: 56 | - RL: Train robot to walk (reward for steps). 57 | - Supervised: Predict house prices (labeled data). 58 | 59 | **Interview Tips**: 60 | - Clarify feedback: “RL learns from rewards, not labels.” 61 | - Highlight exploration: “Unique to RL.” 62 | - Be ready to compare: “Show RL vs. supervised task.” 63 | 64 | --- 65 | 66 | ## 2. What is a Markov Decision Process (MDP)? 67 | 68 | **Answer**: 69 | 70 | A **Markov Decision Process (MDP)** is a mathematical framework for modeling sequential decision-making under uncertainty, foundational to RL. 71 | 72 | - **Components**: 73 | - **States (S)**: Set of possible situations (e.g., game board). 74 | - **Actions (A)**: Choices available (e.g., move left). 75 | - **Transition Probability**: `P(s'|s,a)` (probability of next state). 76 | - **Reward Function**: `R(s,a,s')` (reward for action). 77 | - **Discount Factor (γ)**: `0 ≤ γ ≤ 1`, balances immediate vs. future rewards. 78 | - **Policy (π)**: `π(a|s)`, strategy mapping states to actions. 79 | 80 | - **Markov Property**: 81 | - Future depends only on current state: `P(s_{t+1}|s_t,a_t) = P(s_{t+1}|s_1,...,s_t,a_1,...,a_t)`. 82 | 83 | - **Goal**: 84 | - Find optimal policy `π*` to maximize expected return: `E[Σ γ^t r_t]`. 85 | 86 | - **RL Context**: 87 | - MDPs formalize environments (e.g., robot navigation). 88 | - Algorithms (e.g., Q-learning) solve MDPs. 
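To make the components concrete, here is a minimal sketch of a made-up two-state MDP (the states, actions, transition probabilities, and rewards are all invented) solved with a few sweeps of value iteration:

```python
# Toy MDP: P[s][a] = list of (next_state, probability, reward) tuples
states = ["cool", "hot"]
actions = ["work", "rest"]
gamma = 0.9

P = {
    "cool": {"work": [("cool", 0.7, 2.0), ("hot", 0.3, 2.0)],
             "rest": [("cool", 1.0, 0.5)]},
    "hot":  {"work": [("hot", 0.8, -1.0), ("cool", 0.2, -1.0)],
             "rest": [("cool", 0.6, 0.0), ("hot", 0.4, 0.0)]},
}

# Value iteration: V(s) <- max_a sum_{s'} P(s'|s,a) * [R + gamma * V(s')]
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: max(sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])
                for a in actions)
         for s in states}

# Greedy policy derived from the converged values
policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                            for s2, p, r in P[s][a]))
          for s in states}
print(V)
print(policy)
```

Selecting actions greedily over the converged values gives the optimal policy `π*` for this toy problem.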
89 | 90 | **Example**: 91 | - MDP: Chess game. 92 | - States: Board positions. 93 | - Actions: Legal moves. 94 | - Rewards: +1 for win, 0 else. 95 | - Policy: Choose best move. 96 | 97 | **Interview Tips**: 98 | - List components: “States, actions, rewards, transitions.” 99 | - Stress Markov: “Only current state matters.” 100 | - Be ready to model: “Define MDP for a simple game.” 101 | 102 | --- 103 | 104 | ## 3. Explain the difference between value-based and policy-based RL methods 105 | 106 | **Answer**: 107 | 108 | - **Value-Based Methods**: 109 | - **How**: Learn value function to estimate expected rewards. 110 | - State value: `V(s) = E[Σ γ^t r_t | s]`. 111 | - Action value: `Q(s,a) = E[Σ γ^t r_t | s,a]`. 112 | - **Policy**: Derived implicitly (e.g., `π(s) = argmax_a Q(s,a)`). 113 | - **Examples**: 114 | - Q-learning: Update `Q` table. 115 | - Deep Q-Networks (DQN): Approximate `Q` with neural net. 116 | - **Pros**: 117 | - Stable for discrete actions. 118 | - Converges well with enough data. 119 | - **Cons**: 120 | - Struggles with continuous actions. 121 | - Overestimates values (e.g., max bias). 122 | - **Use Case**: Games (e.g., Atari). 123 | 124 | - **Policy-Based Methods**: 125 | - **How**: Directly learn policy `π(a|s;θ)` parameterized by `θ`. 126 | - Optimize: Maximize `J(θ) = E[Σ γ^t r_t]` via gradient ascent. 127 | - **Examples**: 128 | - REINFORCE: Policy gradient. 129 | - PPO: Constrained policy updates. 130 | - **Pros**: 131 | - Handles continuous actions. 132 | - Better for stochastic policies. 133 | - **Cons**: 134 | - High variance in gradients. 135 | - Slower convergence. 136 | - **Use Case**: Robotics (e.g., arm control). 137 | 138 | - **Key Differences**: 139 | - **Learning**: 140 | - Value: Estimate `V` or `Q`, derive `π`. 141 | - Policy: Optimize `π` directly. 142 | - **Output**: 143 | - Value: Action scores. 144 | - Policy: Action probabilities. 145 | - **Action Space**: 146 | - Value: Discrete-friendly. 147 | - Policy: Continuous-friendly. 148 | - **ML Context**: Value for games, policy for control. 149 | 150 | **Example**: 151 | - Value: DQN picks best chess move. 152 | - Policy: PPO learns robot walking gait. 153 | 154 | **Interview Tips**: 155 | - Clarify approach: “Value indirect, policy direct.” 156 | - Discuss trade-offs: “Value for discrete, policy for continuous.” 157 | - Be ready to sketch: “Show Q-table vs. policy net.” 158 | 159 | --- 160 | 161 | ## 4. What is Q-learning, and how does it work? 162 | 163 | **Answer**: 164 | 165 | **Q-learning** is a model-free, value-based RL algorithm that learns an action-value function `Q(s,a)` to find the optimal policy. 166 | 167 | - **How It Works**: 168 | - **Goal**: Estimate `Q(s,a) = E[Σ γ^t r_t | s,a]`. 169 | - **Update Rule** (Bellman equation): 170 | - `Q(s,a) ← Q(s,a) + α * [r + γ * max_a' Q(s',a') - Q(s,a)]`. 171 | - `α`: Learning rate. 172 | - `r`: Reward. 173 | - `γ`: Discount factor. 174 | - `s'`: Next state. 175 | - **Process**: 176 | 1. Initialize `Q` (e.g., zeros). 177 | 2. Choose action (e.g., ε-greedy). 178 | 3. Observe `r`, `s'`. 179 | 4. Update `Q`. 180 | 5. Repeat until convergence. 181 | - **Policy**: `π(s) = argmax_a Q(s,a)`. 182 | 183 | - **Key Features**: 184 | - **Off-Policy**: Learns optimal `Q` regardless of exploration. 185 | - **Tabular**: Stores `Q` for discrete states/actions. 186 | - **Convergence**: Guaranteed for finite MDPs with proper `α`, `ε`. 187 | 188 | - **Pros**: 189 | - Simple, effective for small problems. 190 | - Model-free, no environment knowledge needed. 
191 | - **Cons**: 192 | - Scales poorly (large state/action spaces). 193 | - Slow for complex tasks. 194 | 195 | **Example**: 196 | - Task: Navigate gridworld. 197 | - Q-learning: Updates `Q` for each move, learns path to goal. 198 | 199 | **Interview Tips**: 200 | - Explain update: “Q moves toward reward + future value.” 201 | - Highlight off-policy: “Learns even with random actions.” 202 | - Be ready to code: “Show Q-table update.” 203 | 204 | --- 205 | 206 | ## 5. What is the exploration-exploitation trade-off in RL? 207 | 208 | **Answer**: 209 | 210 | The **exploration-exploitation trade-off** is the balance between trying new actions (exploration) to learn better policies and choosing known high-reward actions (exploitation). 211 | 212 | - **Exploration**: 213 | - **Why**: Discover unknown states/rewards. 214 | - **Risk**: Waste time on poor actions. 215 | - **Methods**: 216 | - ε-greedy: Random action with probability `ε`. 217 | - Softmax: Sample based on `Q` scores. 218 | - UCB: Prioritize uncertainty. 219 | - **Exploitation**: 220 | - **Why**: Maximize immediate reward. 221 | - **Risk**: Miss better long-term policies. 222 | - **Method**: Choose `argmax_a Q(s,a)` or highest `π(a|s)`. 223 | 224 | - **Balancing**: 225 | - **ε-Greedy**: Start with high `ε` (e.g., 0.1), decay over time. 226 | - **Annealing**: Reduce exploration as learning stabilizes. 227 | - **Intrinsic Rewards**: Add curiosity bonuses. 228 | - **ML Context**: Critical for RL convergence (e.g., DQN, PPO). 229 | 230 | **Example**: 231 | - Game: Slot machines. 232 | - Exploit: Pull known best arm. 233 | - Explore: Try new arms to find better. 234 | 235 | **Interview Tips**: 236 | - Use analogy: “Like choosing a new restaurant vs. favorite.” 237 | - Discuss decay: “Exploration fades as Q improves.” 238 | - Be ready to suggest: “ε-greedy is simple baseline.” 239 | 240 | --- 241 | 242 | ## 6. What is the Proximal Policy Optimization (PPO) algorithm? 243 | 244 | **Answer**: 245 | 246 | **Proximal Policy Optimization (PPO)** is a policy-based RL algorithm that balances sample efficiency, stability, and performance, widely used in complex tasks. 247 | 248 | - **How It Works**: 249 | - **Policy Gradient**: 250 | - Optimize `π(a|s;θ)` to maximize `J(θ) = E[Σ γ^t r_t]`. 251 | - Gradient: `∇J ≈ Σ ∇_θ log π(a|s;θ) * A`, where `A` is advantage. 252 | - **Clipped Objective**: 253 | - Limit policy updates to avoid large changes. 254 | - Objective: `L = E[min(r_t * A, clip(r_t, 1-ε, 1+ε) * A)]`. 255 | - `r_t = π_θ(a|s) / π_old(a|s)`, `ε` (e.g., 0.2) bounds ratio. 256 | - **Advantage**: 257 | - Estimate `A = Q(s,a) - V(s)` using value function `V`. 258 | - **Process**: 259 | 1. Collect trajectories with current `π`. 260 | 2. Compute advantages. 261 | 3. Optimize clipped objective (multiple epochs). 262 | 4. Update `π`, repeat. 263 | 264 | - **Why Popular**: 265 | - **Stability**: Clipping prevents destructive updates. 266 | - **Efficiency**: Reuses samples, good for on-policy. 267 | - **Versatility**: Works for discrete/continuous actions. 268 | - **ML Context**: Standard for robotics, games (e.g., OpenAI). 269 | 270 | **Example**: 271 | - Task: Train robot arm. 272 | - PPO: Learns smooth motions, avoids wild swings. 273 | 274 | **Interview Tips**: 275 | - Highlight clipping: “Keeps updates safe.” 276 | - Compare: “Vs. TRPO: simpler, still robust.” 277 | - Be ready to sketch: “Show clipped vs. unclipped loss.” 278 | 279 | --- 280 | 281 | ## 7. What are the advantages and disadvantages of model-based vs. model-free RL? 
282 | 283 | **Answer**: 284 | 285 | - **Model-Based RL**: 286 | - **How**: Learn model of environment (`P(s'|s,a)`, `R(s,a)`), plan actions. 287 | - **Examples**: AlphaZero, MuZero. 288 | - **Advantages**: 289 | - **Sample Efficiency**: Simulate transitions, need fewer real samples. 290 | - **Planning**: Optimize over predicted futures. 291 | - **Transferable**: Model reusable across tasks. 292 | - **Disadvantages**: 293 | - **Model Errors**: Inaccurate models hurt performance. 294 | - **Complexity**: Hard to learn dynamics for complex environments. 295 | - **Compute**: Planning can be slow (e.g., tree search). 296 | - **Use Case**: Games with clear rules (e.g., chess). 297 | 298 | - **Model-Free RL**: 299 | - **How**: Learn policy/value directly from experience, no environment model. 300 | - **Examples**: DQN, PPO. 301 | - **Advantages**: 302 | - **Simplicity**: No need to model complex dynamics. 303 | - **Robustness**: Works with real-world noise. 304 | - **Scalable**: Easier for high-dimensional tasks. 305 | - **Disadvantages**: 306 | - **Sample Inefficiency**: Needs many interactions. 307 | - **Overfitting**: Policy may not generalize. 308 | - **Instability**: Sensitive to hyperparameters. 309 | - **Use Case**: Robotics, real-time control. 310 | 311 | - **Key Differences**: 312 | - **Model**: Model-based learns dynamics; model-free doesn’t. 313 | - **Efficiency**: Model-based uses fewer samples; model-free needs more. 314 | - **Complexity**: Model-based harder to implement; model-free simpler. 315 | - **ML Context**: Model-free dominates (e.g., PPO); model-based for structured tasks. 316 | 317 | **Example**: 318 | - Model-Based: Simulate chess moves. 319 | - Model-Free: Learn moves via trial and error. 320 | 321 | **Interview Tips**: 322 | - Stress efficiency: “Model-based saves samples.” 323 | - Discuss limits: “Model errors kill planning.” 324 | - Be ready to compare: “Show sample needs.” 325 | 326 | --- 327 | 328 | ## 8. How do you evaluate the performance of a reinforcement learning agent? 329 | 330 | **Answer**: 331 | 332 | Evaluating an RL agent measures its ability to maximize rewards and generalize across environments. 333 | 334 | - **Metrics**: 335 | - **Cumulative Reward**: 336 | - Sum of rewards: `Σ r_t` or discounted `Σ γ^t r_t`. 337 | - **Use**: Primary metric, shows policy quality. 338 | - **Average Reward**: 339 | - Mean reward per episode or step. 340 | - **Use**: Stabilizes noisy rewards. 341 | - **Success Rate**: 342 | - Fraction of episodes achieving goal (e.g., win game). 343 | - **Use**: Task-specific (e.g., robot reaches target). 344 | - **Convergence**: 345 | - Track value/policy stability (e.g., `Q` changes). 346 | - **Use**: Assess learning progress. 347 | - **Exploration Metrics**: 348 | - Measure exploration (e.g., entropy of `π`). 349 | - **Use**: Ensure balance with exploitation. 350 | 351 | - **Techniques**: 352 | - **Test Policy**: 353 | - Run `π` greedily (no exploration) on test environment. 354 | - Average rewards over episodes (e.g., 100 runs). 355 | - **Benchmarking**: 356 | - Compare to baselines (e.g., random, human, prior algorithms). 357 | - **Robustness**: 358 | - Test on varied environments (e.g., different dynamics). 359 | - **Visualization**: 360 | - Plot reward curves, state visits, or actions. 361 | - **Statistical Analysis**: 362 | - Confidence intervals for rewards (e.g., mean ± std dev). 363 | 364 | - **Challenges**: 365 | - **Noise**: Rewards vary (e.g., random environments). 366 | - **Partial Observability**: Hard to assess in POMDPs. 
367 | - **Long Horizons**: Delayed rewards skew metrics. 368 | 369 | **Example**: 370 | - Task: Train game agent. 371 | - Metrics: Avg reward = 500, success rate = 80%. 372 | - Analysis: Plot reward curve to confirm convergence. 373 | 374 | **Interview Tips**: 375 | - Prioritize rewards: “Cumulative reward is king.” 376 | - Discuss robustness: “Test on new environments.” 377 | - Be ready to plot: “Show reward vs. episode.” 378 | 379 | --- 380 | 381 | ## Notes 382 | 383 | - **Focus**: Answers cover RL fundamentals and advanced methods, ideal for ML interviews. 384 | - **Clarity**: Explanations are structured for verbal delivery, with examples and trade-offs. 385 | - **Depth**: Includes mathematical rigor (e.g., Q-learning, PPO) and practical tips (e.g., evaluation). 386 | - **Consistency**: Matches the style of previous files for a cohesive repository. 387 | 388 | For deeper practice, apply RL to robotics (see [Applied ML Cases](applied-ml-cases.md)) or explore [Optimization Techniques](optimization-techniques.md) for policy gradients. 🚀 389 | 390 | --- 391 | 392 | **Next Steps**: Build on these skills with [Computer Vision](computer-vision.md) for RL in visual tasks or revisit [Statistics & Probability](statistics-probability.md) for expected rewards! 🌟 -------------------------------------------------------------------------------- /questions/time-series-clustering.md: -------------------------------------------------------------------------------- 1 | # Time Series and Clustering Questions 2 | 3 | This file contains questions about time series analysis and clustering, commonly asked in interviews at companies like **Google**, **Amazon**, **Meta**, and others. These questions assess your **understanding** of modeling temporal data and grouping similar data points, covering techniques, metrics, and applications. 4 | 5 | Below are the questions with detailed answers, including explanations, mathematical intuition where relevant, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [What is time series analysis, and what are its key components?](#1-what-is-time-series-analysis-and-what-are-its-key-components) 12 | 2. [What is the difference between AR, MA, and ARIMA models?](#2-what-is-the-difference-between-ar-ma-and-arima-models) 13 | 3. [What is stationarity, and why is it important in time series analysis?](#3-what-is-stationarity-and-why-is-it-important-in-time-series-analysis) 14 | 4. [What are some common methods for handling missing values in time series data?](#4-what-are-some-common-methods-for-handling-missing-values-in-time-series-data) 15 | 5. [What is the difference between clustering and classification?](#5-what-is-the-difference-between-clustering-and-classification) 16 | 6. [What are the different types of clustering algorithms?](#6-what-are-the-different-types-of-clustering-algorithms) 17 | 7. [How do you choose the optimal number of clusters in k-means clustering?](#7-how-do-you-choose-the-optimal-number-of-clusters-in-k-means-clustering) 18 | 8. [What are the advantages and disadvantages of hierarchical clustering?](#8-what-are-the-advantages-and-disadvantages-of-hierarchical-clustering) 19 | 20 | --- 21 | 22 | ## 1. What is time series analysis, and what are its key components? 23 | 24 | **Answer**: 25 | 26 | **Time series analysis** involves studying data points collected over time to identify patterns, forecast future values, or understand underlying structures. 
27 | 28 | - **Key Components**: 29 | - **Trend**: Long-term increase or decrease (e.g., rising sales over years). 30 | - **Seasonality**: Repeating patterns at fixed intervals (e.g., holiday sales spikes). 31 | - **Cyclical Patterns**: Longer, non-fixed fluctuations (e.g., economic cycles). 32 | - **Noise**: Random variations (e.g., unpredictable daily fluctuations). 33 | - **Level**: Baseline value without trend/seasonality. 34 | 35 | - **Goals**: 36 | - Forecasting (e.g., predict stock prices). 37 | - Anomaly detection (e.g., detect server outages). 38 | - Decomposition (e.g., separate trend from seasonality). 39 | 40 | - **Techniques**: 41 | - Statistical: ARIMA, Exponential Smoothing. 42 | - ML: LSTM, Prophet. 43 | - Decomposition: STL (Seasonal-Trend Decomposition). 44 | 45 | **Example**: 46 | - Data: Monthly sales. 47 | - Analysis: Identify upward trend, December spikes (seasonality), random noise. 48 | 49 | **Interview Tips**: 50 | - Clarify components: “Trend, seasonality, noise are core.” 51 | - Mention use cases: “Forecasting and anomaly detection.” 52 | - Be ready to sketch: “Show trend + seasonal curve.” 53 | 54 | --- 55 | 56 | ## 2. What is the difference between AR, MA, and ARIMA models? 57 | 58 | **Answer**: 59 | 60 | - **AR (AutoRegressive)**: 61 | - **How**: Models value at time `t` as a linear combination of past values. 62 | - **Formula**: `y_t = c + φ_1*y_{t-1} + ... + φ_p*y_{t-p} + ε_t`. 63 | - **Order**: `p` (lags). 64 | - **Use Case**: Predict stock prices with momentum. 65 | - **Intuition**: “Past values predict future.” 66 | 67 | - **MA (Moving Average)**: 68 | - **How**: Models value as a linear combination of past errors. 69 | - **Formula**: `y_t = c + ε_t + θ_1*ε_{t-1} + ... + θ_q*ε_{t-q}`. 70 | - **Order**: `q` (error lags). 71 | - **Use Case**: Smooth noisy series (e.g., temperature). 72 | - **Intuition**: “Corrects based on past surprises.” 73 | 74 | - **ARIMA (AutoRegressive Integrated Moving Average)**: 75 | - **How**: Combines AR and MA, with differencing to handle non-stationarity. 76 | - **Formula**: ARIMA(p,d,q) where `d` is differencing order. 77 | - `p`: AR terms. 78 | - `d`: Differencing to make stationary. 79 | - `q`: MA terms. 80 | - **Use Case**: Forecast sales with trend and noise. 81 | - **Intuition**: “AR + MA, adjusted for trends.” 82 | 83 | - **Key Differences**: 84 | - **Scope**: AR uses past values, MA uses past errors, ARIMA combines both + differencing. 85 | - **Stationarity**: AR/MA assume stationarity; ARIMA handles non-stationary via `d`. 86 | - **Complexity**: ARIMA is more flexible but harder to tune. 87 | 88 | **Example**: 89 | - AR(1): `y_t = 0.5*y_{t-1} + ε_t` (simple trend). 90 | - MA(1): `y_t = ε_t + 0.3*ε_{t-1}` (noise smoothing). 91 | - ARIMA(1,1,1): Differences data, combines AR and MA. 92 | 93 | **Interview Tips**: 94 | - Explain terms: “AR for values, MA for errors.” 95 | - Mention stationarity: “ARIMA’s `d` fixes trends.” 96 | - Be ready to derive: “Show ARIMA(p,d,q) equation.” 97 | 98 | --- 99 | 100 | ## 3. What is stationarity, and why is it important in time series analysis? 101 | 102 | **Answer**: 103 | 104 | **Stationarity** means a time series’ statistical properties (mean, variance, autocorrelation) are constant over time. 105 | 106 | - **Types**: 107 | - **Strict Stationarity**: All moments are time-invariant (rarely used). 108 | - **Weak Stationarity**: Constant mean, variance, and autocovariance. 
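As a rough hands-on check of weak stationarity (a sketch with synthetic data, assuming NumPy and pandas are available), compare summary statistics across time windows before and after differencing:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
trending = pd.Series(0.05 * np.arange(n) + rng.normal(size=n))  # non-stationary: drifting mean
differenced = trending.diff().dropna()                          # first difference (ARIMA's d=1)

for name, s in [("raw", trending), ("differenced", differenced)]:
    first, second = s.iloc[: len(s) // 2], s.iloc[len(s) // 2:]
    print(f"{name:11s} mean {first.mean():6.2f} -> {second.mean():6.2f} | "
          f"std {first.std():.2f} -> {second.std():.2f}")
```

The raw series’ mean shifts between the two halves while the differenced series stays roughly constant; the formal tests discussed below (e.g., ADF) make this check rigorous.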
109 | 110 | - **Why Important**: 111 | - **Model Assumptions**: Many models (e.g., AR, MA) assume stationarity for valid predictions. 112 | - **Predictability**: Stationary series have stable patterns, easier to forecast. 113 | - **Simplifies Analysis**: Removes trends/seasonality for modeling. 114 | 115 | - **Testing**: 116 | - **Visual**: Plot series, check for trends/seasonality. 117 | - **Statistical Tests**: 118 | - Augmented Dickey-Fuller (ADF): Null hypothesis is non-stationary. 119 | - KPSS: Null is stationary. 120 | - **Autocorrelation**: Check ACF plot for decay. 121 | 122 | - **Achieving Stationarity**: 123 | - **Differencing**: Subtract `y_{t-1}` from `y_t` (e.g., ARIMA’s `d`). 124 | - **Transformations**: Log, square root to stabilize variance. 125 | - **Detrending**: Remove linear/polynomial trend. 126 | - **Deseasonalizing**: Subtract seasonal component. 127 | 128 | **Example**: 129 | - Series: Stock prices (trending, non-stationary). 130 | - Fix: Difference once, ADF p-value < 0.05 → stationary. 131 | 132 | **Interview Tips**: 133 | - Clarify definition: “Constant mean and variance.” 134 | - Emphasize models: “ARIMA needs it for AR/MA parts.” 135 | - Be ready to test: “Explain ADF test intuition.” 136 | 137 | --- 138 | 139 | ## 4. What are some common methods for handling missing values in time series data? 140 | 141 | **Answer**: 142 | 143 | Handling missing values in time series maintains temporal structure and avoids bias. Common methods: 144 | 145 | - **Forward Fill**: 146 | - Use last observed value: `y_t = y_{t-1}`. 147 | - Pros: Simple, preserves trends. 148 | - Cons: Assumes stability, poor for long gaps. 149 | - **Backward Fill**: 150 | - Use next observed value: `y_t = y_{t+1}`. 151 | - Pros: Works if future data is available. 152 | - Cons: Not real-time, same limits as forward. 153 | - **Linear Interpolation**: 154 | - Estimate between points: `y_t = y_{t-1} + (y_{t+1} - y_{t-1}) * (t - t-1)/(t+1 - t-1)`. 155 | - Pros: Smooth, captures trends. 156 | - Cons: Assumes linearity, fails for non-linear patterns. 157 | - **Spline/Polynomial Interpolation**: 158 | - Fit smooth curve to gaps. 159 | - Pros: Handles non-linear patterns. 160 | - Cons: Risk of overfitting, complex. 161 | - **Model-Based Imputation**: 162 | - Use time series model (e.g., ARIMA, Kalman filter) to predict missing values. 163 | - Pros: Respects temporal structure. 164 | - Cons: Computationally intensive, model-dependent. 165 | - **Mean/Median Imputation**: 166 | - Use series mean or median. 167 | - Pros: Simple, stable. 168 | - Cons: Ignores temporal patterns, adds bias. 169 | - **Domain-Specific**: 170 | - Use external data (e.g., weather for temperature gaps). 171 | - Pros: Accurate if correlated. 172 | - Cons: Needs extra data. 173 | 174 | **Example**: 175 | - Data: Hourly sensor readings, 5% missing. 176 | - Method: Linear interpolation for short gaps, ARIMA for longer gaps. 177 | 178 | **Interview Tips**: 179 | - Prioritize temporal: “Time series needs context, not just means.” 180 | - Suggest trade-offs: “Interpolation is fast, models are accurate.” 181 | - Be ready to code: “Describe interpolation logic.” 182 | 183 | --- 184 | 185 | ## 5. What is the difference between clustering and classification? 186 | 187 | **Answer**: 188 | 189 | - **Clustering**: 190 | - **Type**: Unsupervised learning. 191 | - **Goal**: Group similar data points into clusters without labels. 192 | - **How**: Optimize similarity (e.g., minimize distance within clusters). 
193 | - **Output**: Cluster assignments (e.g., group 1, group 2). 194 | - **Example**: Segment customers by behavior. 195 | - **Algorithms**: K-means, DBSCAN, hierarchical. 196 | 197 | - **Classification**: 198 | - **Type**: Supervised learning. 199 | - **Goal**: Predict predefined class labels for data points. 200 | - **How**: Train on labeled data to minimize prediction error. 201 | - **Output**: Class labels (e.g., spam/not spam). 202 | - **Example**: Classify emails as spam. 203 | - **Algorithms**: Logistic regression, SVM, neural networks. 204 | 205 | - **Key Differences**: 206 | - **Labels**: Clustering has none; classification uses labeled data. 207 | - **Objective**: Clustering finds structure; classification predicts labels. 208 | - **Evaluation**: Clustering uses internal metrics (e.g., silhouette); classification uses accuracy, F1. 209 | - **Use Case**: Clustering for exploration; classification for prediction. 210 | 211 | **Example**: 212 | - Clustering: Group users by purchase patterns (no labels). 213 | - Classification: Predict if user will buy (yes/no, labeled). 214 | 215 | **Interview Tips**: 216 | - Clarify supervision: “Clustering is unsupervised, classification supervised.” 217 | - Mention metrics: “Silhouette for clustering, F1 for classification.” 218 | - Be ready to compare: “Clustering to explore, classification to decide.” 219 | 220 | --- 221 | 222 | ## 6. What are the different types of clustering algorithms? 223 | 224 | **Answer**: 225 | 226 | Clustering algorithms group data based on similarity, differing in approach and assumptions: 227 | 228 | - **Centroid-Based**: 229 | - **How**: Assign points to clusters based on distance to centroids. 230 | - **Example**: K-means. 231 | - Minimize within-cluster variance. 232 | - Iteratively update centroids. 233 | - **Pros**: Fast, scalable. 234 | - **Cons**: Assumes spherical clusters, needs `k`. 235 | - **Hierarchical**: 236 | - **How**: Build tree of clusters (dendrogram) via merging (agglomerative) or splitting (divisive). 237 | - **Example**: Agglomerative clustering. 238 | - Use linkage (e.g., single, complete). 239 | - **Pros**: No need for `k`, captures hierarchy. 240 | - **Cons**: Slow for large data, sensitive to noise. 241 | - **Density-Based**: 242 | - **How**: Group points in dense regions, ignore sparse areas. 243 | - **Example**: DBSCAN. 244 | - Core points, border points, noise. 245 | - **Pros**: Finds arbitrary shapes, handles outliers. 246 | - **Cons**: Struggles with varying densities, parameter-sensitive. 247 | - **Distribution-Based**: 248 | - **How**: Assume data follows a distribution (e.g., Gaussian). 249 | - **Example**: Gaussian Mixture Models (GMM). 250 | - Fit mixture of Gaussians using EM. 251 | - **Pros**: Probabilistic, flexible shapes. 252 | - **Cons**: Computationally heavy, assumes distribution. 253 | - **Graph-Based**: 254 | - **How**: Treat data as graph, cluster via connectivity. 255 | - **Example**: Spectral clustering. 256 | - Use eigenvalues of similarity matrix. 257 | - **Pros**: Captures complex structures. 258 | - **Cons**: Scales poorly, needs similarity metric. 259 | 260 | **Example**: 261 | - Data: Customer purchases. 262 | - K-means: 3 spherical groups. 263 | - DBSCAN: Irregular groups, outliers ignored. 
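A small scikit-learn sketch contrasting a centroid-based and a density-based algorithm on synthetic half-moon data (the `eps` and `min_samples` settings are illustrative, not tuned):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaving half-moons: non-spherical clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("k-means cluster sizes:", np.bincount(kmeans_labels))
print("DBSCAN cluster sizes :", np.bincount(dbscan_labels[dbscan_labels >= 0]))
print("DBSCAN noise points  :", np.sum(dbscan_labels == -1))
# k-means imposes roughly spherical clusters; DBSCAN can follow the moon shapes.
```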
264 | 265 | **Interview Tips**: 266 | - Categorize clearly: “Centroid, hierarchical, density, etc.” 267 | - Discuss trade-offs: “K-means is fast, DBSCAN finds shapes.” 268 | - Be ready to suggest: “DBSCAN for noisy data, GMM for soft clusters.” 269 | 270 | --- 271 | 272 | ## 7. How do you choose the optimal number of clusters in k-means clustering? 273 | 274 | **Answer**: 275 | 276 | Choosing the optimal number of clusters (`k`) in k-means balances fit and complexity. Methods: 277 | 278 | - **Elbow Method**: 279 | - **How**: Plot within-cluster sum of squares (WSS) vs. `k`. 280 | - **Logic**: WSS decreases as `k` increases; look for “elbow” where adding clusters yields diminishing returns. 281 | - **Pros**: Simple, visual. 282 | - **Cons**: Subjective, no clear elbow sometimes. 283 | - **Silhouette Score**: 284 | - **How**: Measure cohesion (distance to own cluster) vs. separation (distance to others). 285 | - **Formula**: `s(i) = (b(i) - a(i)) / max(a(i), b(i))`, where `a(i)` is intra-cluster distance, `b(i)` is nearest other cluster. 286 | - **Logic**: Maximize average silhouette score (range [-1, 1]). 287 | - **Pros**: Quantifies quality, less subjective. 288 | - **Cons**: Computationally intensive for large data. 289 | - **Gap Statistic**: 290 | - **How**: Compare WSS to expected WSS under null distribution (random data). 291 | - **Logic**: Choose `k` where gap is largest (real clusters vs. noise). 292 | - **Pros**: Statistically grounded. 293 | - **Cons**: Complex, requires null sampling. 294 | - **Domain Knowledge**: 295 | - **How**: Use business context (e.g., 3 customer segments expected). 296 | - **Pros**: Practical, interpretable. 297 | - **Cons**: Not data-driven. 298 | - **Cross-Validation**: 299 | - **How**: Split data, evaluate clustering stability for different `k`. 300 | - **Pros**: Robust. 301 | - **Cons**: Less common, high compute. 302 | 303 | **Example**: 304 | - Data: User activity logs. 305 | - Elbow: Suggests `k=3` (sharp bend). 306 | - Silhouette: Confirms `k=3` (score = 0.6 vs. 0.4 for `k=4`). 307 | 308 | **Interview Tips**: 309 | - Prioritize elbow/silhouette: “Most common in practice.” 310 | - Explain intuition: “Balance fit and simplicity.” 311 | - Be ready to plot: “Show WSS curve with elbow.” 312 | 313 | --- 314 | 315 | ## 8. What are the advantages and disadvantages of hierarchical clustering? 316 | 317 | **Answer**: 318 | 319 | - **Advantages**: 320 | - **No Need for `k`**: Number of clusters chosen post-hoc via dendrogram cut. 321 | - **Hierarchical Structure**: Captures nested relationships (e.g., sub-groups within clusters). 322 | - **Flexible Linkage**: Options like single, complete, average linkage suit different data. 323 | - **Visualizable**: Dendrogram shows merging process, aids interpretation. 324 | - **Deterministic**: No randomness (unlike k-means initialization). 325 | 326 | - **Disadvantages**: 327 | - **Scalability**: `O(n²)` or `O(n³)` complexity, slow for large datasets. 328 | - **Sensitive to Noise**: Outliers can distort merges (e.g., single linkage chains). 329 | - **Irreversible**: Merging decisions can’t be undone, leading to suboptimal clusters. 330 | - **Memory Intensive**: Stores distance matrix, impractical for big data. 331 | - **Linkage Choice**: Results vary with linkage (e.g., single vs. complete), needs tuning. 332 | 333 | **Example**: 334 | - Data: Gene expression (500 samples). 335 | - Advantage: Dendrogram shows biological hierarchies. 336 | - Disadvantage: Takes 10 mins vs. k-means’ 1 min. 
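A short SciPy sketch of agglomerative clustering on synthetic data, cutting the dendrogram post-hoc to get flat clusters (the Ward linkage and the choice of 3 clusters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three synthetic groups in 2D
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(20, 2)),
    rng.normal(loc=[3, 3], scale=0.3, size=(20, 2)),
    rng.normal(loc=[0, 4], scale=0.3, size=(20, 2)),
])

# Agglomerative clustering with Ward linkage (builds the dendrogram bottom-up)
Z = linkage(X, method="ward")

# Cut the tree into 3 flat clusters (k is chosen after the fact, a key advantage)
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])  # labels start at 1
```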
337 | 338 | **Interview Tips**: 339 | - Highlight dendrogram: “Great for visualizing structure.” 340 | - Discuss limits: “Not for big data due to speed.” 341 | - Be ready to compare: “Vs. k-means: slower but no `k` needed.” 342 | 343 | --- 344 | 345 | ## Notes 346 | 347 | - **Focus**: Answers cover time series and clustering, ideal for specialized ML interviews. 348 | - **Clarity**: Explanations are structured for verbal delivery, with examples and trade-offs. 349 | - **Depth**: Includes technical details (e.g., ARIMA math, silhouette score) and practical tips (e.g., missing value imputation). 350 | - **Consistency**: Matches the style of previous files for a cohesive repository. 351 | 352 | For deeper practice, implement time series models or clustering algorithms (see [ML Coding](ml-coding.md)) or explore [Production MLOps](production-mlops.md) for deploying such solutions. 🚀 353 | 354 | --- 355 | 356 | **Next Steps**: Build on these skills with [Statistics & Probability](statistics-probability.md) for foundational math or revisit [Deep Learning](deep-learning.md) for neural network-based time series models! 🌟 -------------------------------------------------------------------------------- /questions/natural-language-processing.md: -------------------------------------------------------------------------------- 1 | # Natural Language Processing Questions 2 | 3 | This file contains natural language processing (NLP) questions commonly asked in machine learning interviews at companies like **Google**, **Amazon**, **Meta**, and others. These questions assess your **understanding** of NLP techniques, models, and applications, covering topics like text preprocessing, embeddings, transformers, and evaluation. 4 | 5 | Below are the questions with detailed answers, including explanations, technical details, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [What are the key steps in preprocessing text data for NLP tasks?](#1-what-are-the-key-steps-in-preprocessing-text-data-for-nlp-tasks) 12 | 2. [What is tokenization, and why is it important in NLP?](#2-what-is-tokenization-and-why-is-it-important-in-nlp) 13 | 3. [Explain the difference between bag-of-words and TF-IDF representations](#3-explain-the-difference-between-bag-of-words-and-tf-idf-representations) 14 | 4. [What are word embeddings, and how do they improve NLP models?](#4-what-are-word-embeddings-and-how-do-they-improve-nlp-models) 15 | 5. [What is the difference between RNNs and transformers for NLP tasks?](#5-what-is-the-difference-between-rnns-and-transformers-for-nlp-tasks) 16 | 6. [Explain the attention mechanism in the context of NLP](#6-explain-the-attention-mechanism-in-the-context-of-nlp) 17 | 7. [What is BERT, and how does it differ from traditional word embeddings?](#7-what-is-bert-and-how-does-it-differ-from-traditional-word-embeddings) 18 | 8. [How do you evaluate the performance of an NLP model?](#8-how-do-you-evaluate-the-performance-of-an-nlp-model) 19 | 20 | --- 21 | 22 | ## 1. What are the key steps in preprocessing text data for NLP tasks? 23 | 24 | **Answer**: 25 | 26 | Preprocessing text data transforms raw text into a format suitable for NLP models. Key steps: 27 | 28 | - **Lowercasing**: 29 | - Convert text to lowercase (e.g., “Hello” → “hello”). 30 | - **Why**: Reduces vocabulary size, ensures consistency. 31 | - **Tokenization**: 32 | - Split text into tokens (words, subwords, or characters). 33 | - **Why**: Enables numerical representation. 
34 | - **Tool**: NLTK, spaCy, or model-specific (e.g., BERT tokenizer). 35 | - **Removing Noise**: 36 | - Strip punctuation, special characters, or URLs. 37 | - **Why**: Reduces irrelevant features. 38 | - **Stop Word Removal**: 39 | - Remove common words (e.g., “the,” “is”). 40 | - **Why**: Focuses on meaningful terms, but context-dependent (e.g., keep for transformers). 41 | - **Stemming/Lemmatization**: 42 | - Reduce words to root form (e.g., “running” → “run”). 43 | - **Why**: Normalizes variations, but lemmatization is context-aware. 44 | - **Tool**: Porter Stemmer, WordNet lemmatizer. 45 | - **Handling Numbers**: 46 | - Normalize or remove numbers (e.g., “2023” → “”). 47 | - **Why**: Generalizes numerical data. 48 | - **N-grams** (optional): 49 | - Extract multi-word sequences (e.g., “machine learning”). 50 | - **Why**: Captures phrases for some tasks. 51 | - **Encoding**: 52 | - Convert tokens to IDs for model input (e.g., vocabulary indices). 53 | - **Why**: Models require numerical input. 54 | 55 | **Example**: 56 | - Raw: “I’m running to the Store in 2023!” 57 | - Processed: Tokens = [“run”, “store”] (after lowercasing, removing noise, lemmatizing). 58 | 59 | **Interview Tips**: 60 | - Tailor steps: “Depends on task—transformers need less cleanup.” 61 | - Mention tools: “spaCy for pipelines, BERT for tokenization.” 62 | - Be ready to code: “Show tokenization in Python.” 63 | 64 | --- 65 | 66 | ## 2. What is tokenization, and why is it important in NLP? 67 | 68 | **Answer**: 69 | 70 | **Tokenization** is the process of splitting text into smaller units (tokens), such as words, subwords, or characters, to enable numerical processing by NLP models. 71 | 72 | - **Types**: 73 | - **Word**: Split on spaces/punctuation (e.g., “I love NLP” → [“I”, “love”, “NLP”]). 74 | - **Subword**: Break words into pieces (e.g., “playing” → [“play”, “##ing”]). 75 | - Used in BERT, WordPiece, BPE. 76 | - **Character**: Split into individual characters (e.g., “NLP” → [“N”, “L”, “P”]). 77 | - **Why Important**: 78 | - **Numerical Input**: Models require tokens to map to IDs (e.g., vocabulary). 79 | - **Granularity**: Captures meaning at right level (e.g., subwords handle rare words). 80 | - **Context Preservation**: Maintains structure for tasks like translation. 81 | - **Vocabulary Size**: Balances size vs. coverage (subwords reduce OOV). 82 | - **Challenges**: 83 | - Ambiguity (e.g., “U.S.” vs. “us”). 84 | - Language-specific rules (e.g., Chinese segmentation). 85 | - Reversible tokenization for generation. 86 | 87 | **Example**: 88 | - Text: “unhappiness”. 89 | - Subword: [“un”, “##happiness”] (BERT-style). 90 | - Benefit: Handles “happy” and “unhappy” consistently. 91 | 92 | **Interview Tips**: 93 | - Explain types: “Subword is key for modern NLP.” 94 | - Link to models: “BERT uses WordPiece for flexibility.” 95 | - Be ready to sketch: “Show text → tokens → IDs.” 96 | 97 | --- 98 | 99 | ## 3. Explain the difference between bag-of-words and TF-IDF representations 100 | 101 | **Answer**: 102 | 103 | - **Bag-of-Words (BoW)**: 104 | - **How**: Represent text as a vector of word counts or presence (binary). 105 | - **Process**: 106 | - Build vocabulary (e.g., all unique words). 107 | - Vectorize: Count occurrences per document (e.g., [2, 0, 1] for words). 108 | - **Pros**: Simple, captures word frequency. 109 | - **Cons**: Ignores word order, no semantic meaning, high-dimensional. 110 | - **Use Case**: Basic text classification (e.g., spam detection). 
111 | - **Example**: “cat dog” → [1, 1, 0] (vocab: cat, dog, bird). 112 | 113 | - **TF-IDF (Term Frequency-Inverse Document Frequency)**: 114 | - **How**: Weight words by frequency in document (TF) and rarity across corpus (IDF). 115 | - **Formula**: 116 | - TF = `count(word, doc) / len(doc)`. 117 | - IDF = `log(N / n_t)`, where `N` is docs, `n_t` is docs with word. 118 | - TF-IDF = `TF * IDF`. 119 | - **Pros**: Downweights common words (e.g., “the”), highlights distinctive terms. 120 | - **Cons**: Still ignores order, sparse vectors. 121 | - **Use Case**: Document retrieval, topic modeling. 122 | - **Example**: “cat” in one doc, rare in corpus → high TF-IDF. 123 | 124 | - **Key Differences**: 125 | - **Weighting**: BoW uses raw counts; TF-IDF adjusts for rarity. 126 | - **Meaning**: BoW treats all words equally; TF-IDF emphasizes uniqueness. 127 | - **Sparsity**: Both sparse, but TF-IDF reduces noise from frequent words. 128 | - **ML Context**: BoW for simple models; TF-IDF for search/relevance. 129 | 130 | **Example**: 131 | - Docs: “cat dog”, “dog bird”. 132 | - BoW: [1, 1, 0], [0, 1, 1]. 133 | - TF-IDF: “cat” gets higher weight (rare), “dog” lower (common). 134 | 135 | **Interview Tips**: 136 | - Clarify limits: “Both lose context, unlike embeddings.” 137 | - Explain IDF: “Rarity boosts important words.” 138 | - Be ready to compute: “Show TF-IDF for a small corpus.” 139 | 140 | --- 141 | 142 | ## 4. What are word embeddings, and how do they improve NLP models? 143 | 144 | **Answer**: 145 | 146 | **Word embeddings** are dense, low-dimensional vector representations of words that capture semantic meaning, learned from text data. 147 | 148 | - **How They Work**: 149 | - Map words to vectors (e.g., 300-dim) in a continuous space. 150 | - Similar words (e.g., “king”, “queen”) are close in vector space. 151 | - Learned via models like: 152 | - **Word2Vec**: Predict word given context (CBOW) or vice versa (Skip-gram). 153 | - **GloVe**: Factorize co-occurrence matrix. 154 | - **FastText**: Include subword info for rare words. 155 | 156 | - **Why They Improve NLP**: 157 | - **Semantic Similarity**: Capture relationships (e.g., “king - man + woman ≈ queen”). 158 | - **Dimensionality Reduction**: Dense (vs. sparse BoW), reduces parameters. 159 | - **Generalization**: Handle unseen words via similarity (e.g., FastText subwords). 160 | - **Transfer Learning**: Pretrained embeddings (e.g., GloVe) boost small datasets. 161 | - **Contextual Models**: Lead to advanced embeddings (e.g., BERT, contextual). 162 | 163 | - **Limitations**: 164 | - Static embeddings lack context (e.g., “bank” as river vs. finance). 165 | - Compute-intensive to train from scratch. 166 | 167 | **Example**: 168 | - Task: Sentiment analysis. 169 | - GloVe: “happy” = [0.1, 0.5, ...], “joy” nearby → model learns positive sentiment. 170 | 171 | **Interview Tips**: 172 | - Highlight semantics: “Embeddings encode meaning, not just counts.” 173 | - Compare: “BoW is sparse, embeddings are dense.” 174 | - Be ready to sketch: “Show ‘king’, ‘queen’ in 2D space.” 175 | 176 | --- 177 | 178 | ## 5. What is the difference between RNNs and transformers for NLP tasks? 179 | 180 | **Answer**: 181 | 182 | - **RNNs (Recurrent Neural Networks)**: 183 | - **How**: Process sequences sequentially, maintaining a hidden state. 184 | - Update: `h_t = f(W_h * h_{t-1} + W_x * x_t + b)`. 185 | - **Variants**: LSTM, GRU mitigate vanishing gradients. 186 | - **Pros**: 187 | - Good for short sequences. 188 | - Memory-efficient for small models. 
189 | - **Cons**: 190 | - Sequential processing → slow training/inference. 191 | - Struggles with long-range dependencies (e.g., 100+ tokens). 192 | - **Use Case**: Early NLP (e.g., sentiment with short texts). 193 | - **Example**: Predict next word in “I love…” using LSTM. 194 | 195 | - **Transformers**: 196 | - **How**: Process sequences in parallel using self-attention. 197 | - Attention: `Attention(Q,K,V) = softmax(QK^T/√d_k)V`. 198 | - Stack encoder/decoder layers for tasks. 199 | - **Pros**: 200 | - Captures long-range dependencies (e.g., sentence-wide context). 201 | - Parallelizable → faster training. 202 | - Scales to large models (e.g., BERT, GPT). 203 | - **Cons**: 204 | - Memory-intensive (quadratic w.r.t. sequence length). 205 | - Requires large data to train. 206 | - **Use Case**: Modern NLP (e.g., translation, QA). 207 | - **Example**: BERT understands “bank” context in sentence. 208 | 209 | - **Key Differences**: 210 | - **Processing**: RNNs are sequential; transformers are parallel. 211 | - **Dependencies**: RNNs struggle with long-range; transformers excel. 212 | - **Scalability**: Transformers dominate large-scale NLP; RNNs for niche cases. 213 | - **ML Context**: Transformers replaced RNNs for most tasks (e.g., BERT vs. LSTM). 214 | 215 | **Example**: 216 | - RNN: Fails to link “he” to “John” in long text. 217 | - Transformer: Captures link via attention, better accuracy. 218 | 219 | **Interview Tips**: 220 | - Emphasize attention: “Transformers focus on relevant tokens.” 221 | - Discuss speed: “Parallelism makes transformers faster.” 222 | - Be ready to sketch: “Show RNN loop vs. transformer attention.” 223 | 224 | --- 225 | 226 | ## 6. Explain the attention mechanism in the context of NLP 227 | 228 | **Answer**: 229 | 230 | The **attention mechanism** allows NLP models to focus on relevant parts of a sequence when processing or generating text, improving context understanding. 231 | 232 | - **How It Works**: 233 | - **Scaled Dot-Product Attention**: 234 | - Input: Query (`Q`), Key (`K`), Value (`V`) vectors for each token. 235 | - Compute: `Attention = softmax(QK^T/√d_k)V`. 236 | - Output: Weighted sum of values, emphasizing important tokens. 237 | - **Multi-Head Attention**: 238 | - Run multiple attention layers in parallel, concatenate outputs. 239 | - Captures different relationships (e.g., syntax, semantics). 240 | - **Self-Attention**: 241 | - Tokens attend to each other in same sequence (e.g., sentence). 242 | - **Math**: 243 | - Similarity: `QK^T` measures token relevance. 244 | - Scaling: `√d_k` stabilizes gradients. 245 | 246 | - **In NLP**: 247 | - **Transformers**: Core of BERT, GPT, enabling context-aware embeddings. 248 | - **Tasks**: 249 | - Translation: Focus on source words for target. 250 | - QA: Attend to relevant sentence parts. 251 | - Summarization: Highlight key phrases. 252 | - **Benefits**: 253 | - Long-range dependencies (unlike RNNs). 254 | - Interpretable weights (e.g., see what “it” refers to). 255 | 256 | **Example**: 257 | - Sentence: “The cat, which is black, sleeps.” 258 | - Attention: “Sleeps” attends to “cat,” not “black,” for meaning. 259 | 260 | **Interview Tips**: 261 | - Simplify intuition: “Attention weighs important words.” 262 | - Link to transformers: “Powers modern NLP models.” 263 | - Be ready to derive: “Show attention matrix calculation.” 264 | 265 | --- 266 | 267 | ## 7. What is BERT, and how does it differ from traditional word embeddings? 
268 | 269 | **Answer**: 270 | 271 | **BERT** (Bidirectional Encoder Representations from Transformers) is a pretrained transformer model that generates contextual word embeddings for NLP tasks. 272 | 273 | - **How BERT Works**: 274 | - **Architecture**: Stack of transformer encoders (e.g., 12 layers in BERT-base). 275 | - **Pretraining**: 276 | - **Masked Language Model (MLM)**: Predict masked words (15% of tokens). 277 | - **Next Sentence Prediction (NSP)**: Predict if sentences follow each other. 278 | - **Fine-Tuning**: Adapt to tasks (e.g., classification, QA) with task-specific data. 279 | - **Output**: Contextual embeddings for each token, varying by sentence. 280 | 281 | - **Traditional Word Embeddings**: 282 | - **Examples**: Word2Vec, GloVe, FastText. 283 | - **How**: Static vectors per word, trained on co-occurrence or prediction. 284 | - **Properties**: Same vector for “bank” in all contexts. 285 | 286 | - **Key Differences**: 287 | - **Contextuality**: 288 | - BERT: Dynamic embeddings (e.g., “bank” differs in “river bank” vs. “bank account”). 289 | - Traditional: Fixed embeddings, no context. 290 | - **Directionality**: 291 | - BERT: Bidirectional, considers full sentence. 292 | - Traditional: Unidirectional or context-free. 293 | - **Training**: 294 | - BERT: Pretrained on large corpus, fine-tuned. 295 | - Traditional: Trained once, used as-is or lightly adapted. 296 | - **Performance**: 297 | - BERT: Superior for tasks like QA, sentiment (context matters). 298 | - Traditional: Simpler, faster for basic tasks. 299 | - **ML Context**: BERT powers modern NLP; traditional embeddings for lightweight models. 300 | 301 | **Example**: 302 | - Sentence: “I bank money.” 303 | - BERT: “bank” embedding reflects finance. 304 | - GloVe: Same “bank” vector for all uses. 305 | 306 | **Interview Tips**: 307 | - Highlight context: “BERT adapts to sentence meaning.” 308 | - Compare size: “BERT is heavy, GloVe is light.” 309 | - Be ready to sketch: “Show BERT’s transformer layers.” 310 | 311 | --- 312 | 313 | ## 8. How do you evaluate the performance of an NLP model? 314 | 315 | **Answer**: 316 | 317 | Evaluating NLP models depends on the task, using metrics to measure accuracy, robustness, and generalization: 318 | 319 | - **Classification Tasks** (e.g., sentiment analysis): 320 | - **Accuracy**: Fraction of correct predictions. 321 | - **Precision/Recall/F1**: 322 | - Precision: `TP / (TP + FP)` (correct positives). 323 | - Recall: `TP / (TP + FN)` (captured positives). 324 | - F1: `2 * (P * R) / (P + R)` (harmonic mean). 325 | - **Use**: F1 for imbalanced data (e.g., spam detection). 326 | - **Sequence Labeling** (e.g., NER): 327 | - **Token-Level F1**: Evaluate per token (e.g., “B-PER”, “I-PER”). 328 | - **Entity-Level F1**: Match full entities (e.g., “John Doe”). 329 | - **Use**: Entity F1 for strict matching. 330 | - **Generation Tasks** (e.g., translation, summarization): 331 | - **BLEU**: Measures n-gram overlap with reference. 332 | - **ROUGE**: Recall-oriented for summaries (e.g., ROUGE-L for longest sequence). 333 | - **METEOR**: Considers synonyms, stemming. 334 | - **Use**: BLEU for translation, ROUGE for summarization. 335 | - **Question Answering**: 336 | - **Exact Match (EM)**: Full answer match. 337 | - **F1 Score**: Overlap of answer tokens. 338 | - **Use**: EM for strict evaluation, F1 for partial credit. 339 | - **Embedding-Based** (e.g., similarity): 340 | - **Cosine Similarity**: Measure vector alignment. 341 | - **Correlation**: Compare to human judgments. 
342 | - **Use**: Evaluate semantic search. 343 | - **Human Evaluation**: 344 | - Subjective scoring for fluency, coherence (e.g., chatbot responses). 345 | - **Use**: When metrics fail (e.g., creative tasks). 346 | - **General Practices**: 347 | - **Cross-Validation**: Ensure robustness. 348 | - **Error Analysis**: Inspect mispredictions (e.g., confusion matrix). 349 | - **Domain-Specific**: Align metrics with goals (e.g., latency for real-time). 350 | 351 | **Example**: 352 | - Task: Sentiment classification. 353 | - Metrics: F1 = 0.85 (handles imbalance), accuracy = 0.88. 354 | - Analysis: Check false positives for improvement. 355 | 356 | **Interview Tips**: 357 | - Match metric to task: “F1 for classification, BLEU for translation.” 358 | - Discuss limits: “BLEU misses fluency, needs human eval.” 359 | - Be ready to compute: “Show F1 formula.” 360 | 361 | --- 362 | 363 | ## Notes 364 | 365 | - **Focus**: Answers cover NLP fundamentals and advanced models, ideal for ML interviews. 366 | - **Clarity**: Explanations are structured for verbal delivery, with examples and trade-offs. 367 | - **Depth**: Includes technical details (e.g., attention math, BERT pretraining) and practical tips (e.g., preprocessing pipelines). 368 | - **Consistency**: Matches the style of previous files for a cohesive repository. 369 | 370 | For deeper practice, implement NLP models (see [ML Coding](ml-coding.md)) or explore [Deep Learning](deep-learning.md) for transformer foundations. 🚀 371 | 372 | --- 373 | 374 | **Next Steps**: Build on these skills with [Statistics & Probability](statistics-probability.md) for NLP evaluation metrics or revisit [Production MLOps](production-mlops.md) for deploying NLP models! 🌟 -------------------------------------------------------------------------------- /questions/computer-vision.md: -------------------------------------------------------------------------------- 1 | # Computer Vision Questions 2 | 3 | This file contains computer vision questions commonly asked in machine learning interviews at companies like **Google**, **Amazon**, **Meta**, and others. These questions assess your **understanding** of image processing, deep learning models, and vision-specific techniques, covering topics like convolutional neural networks (CNNs), object detection, and evaluation metrics. 4 | 5 | Below are the questions with detailed answers, including explanations, technical details, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [What is the role of convolutional neural networks (CNNs) in computer vision?](#1-what-is-the-role-of-convolutional-neural-networks-cnns-in-computer-vision) 12 | 2. [Explain the difference between convolution and pooling layers](#2-explain-the-difference-between-convolution-and-pooling-layers) 13 | 3. [What is transfer learning, and how is it used in computer vision?](#3-what-is-transfer-learning-and-how-is-it-used-in-computer-vision) 14 | 4. [What is the difference between object detection and image classification?](#4-what-is-the-difference-between-object-detection-and-image-classification) 15 | 5. [Explain the YOLO algorithm for object detection](#5-explain-the-yolo-algorithm-for-object-detection) 16 | 6. [What is semantic segmentation, and how does it differ from instance segmentation?](#6-what-is-semantic-segmentation-and-how-does-it-differ-from-instance-segmentation) 17 | 7. 
[What are some common data augmentation techniques for computer vision?](#7-what-are-some-common-data-augmentation-techniques-for-computer-vision) 18 | 8. [How do you evaluate the performance of a computer vision model?](#8-how-do-you-evaluate-the-performance-of-a-computer-vision-model) 19 | 20 | --- 21 | 22 | ## 1. What is the role of convolutional neural networks (CNNs) in computer vision? 23 | 24 | **Answer**: 25 | 26 | **Convolutional Neural Networks (CNNs)** are deep learning models designed to process structured grid-like data, such as images, and are foundational to computer vision tasks. 27 | 28 | - **Role**: 29 | - **Feature Extraction**: Learn hierarchical features (e.g., edges, textures, objects) via convolutional layers. 30 | - **Spatial Invariance**: Capture patterns regardless of position (e.g., a cat anywhere in the image). 31 | - **Task Versatility**: Used for classification, detection, segmentation, and more. 32 | - **End-to-End Learning**: Map raw pixels to outputs without manual feature engineering. 33 | 34 | - **Components**: 35 | - **Convolutional Layers**: Apply filters to extract features (e.g., 3x3 kernels). 36 | - **Pooling Layers**: Downsample to reduce dimensionality, enhance robustness. 37 | - **Fully Connected Layers**: Aggregate features for final predictions. 38 | - **Activation Functions**: Introduce non-linearity (e.g., ReLU). 39 | 40 | - **Why Effective**: 41 | - Exploit local correlations (pixels nearby are related). 42 | - Reduce parameters via weight sharing (unlike dense layers). 43 | - Scale to large images with deep architectures. 44 | 45 | **Example**: 46 | - Task: Classify cats vs. dogs. 47 | - CNN: Learns edges (layer 1), shapes (layer 2), cat faces (layer 5). 48 | 49 | **Interview Tips**: 50 | - Emphasize features: “CNNs learn patterns automatically.” 51 | - Highlight invariance: “Handles translations well.” 52 | - Be ready to sketch: “Show conv → pool → dense layers.” 53 | 54 | --- 55 | 56 | ## 2. Explain the difference between convolution and pooling layers 57 | 58 | **Answer**: 59 | 60 | - **Convolutional Layer**: 61 | - **Purpose**: Extract features by applying learnable filters to input. 62 | - **How**: 63 | - Filter (e.g., 3x3) slides over image, computing dot products. 64 | - Output: Feature map highlighting patterns (e.g., edges). 65 | - **Math**: `output[i,j] = Σ_m Σ_n input[i+m,j+n] * filter[m,n] + bias`. 66 | - **Parameters**: Filter weights, biases (learned via backprop). 67 | - **Pros**: Captures local patterns (e.g., corners, textures). 68 | - **Cons**: Computationally intensive, many parameters for deep layers. 69 | - **Example**: Detect vertical edges in an image. 70 | 71 | - **Pooling Layer**: 72 | - **Purpose**: Downsample feature maps to reduce size, improve robustness. 73 | - **How**: 74 | - Apply operation (e.g., max, average) over a region (e.g., 2x2). 75 | - Output: Smaller map (e.g., 2x2 max-pooling halves dimensions). 76 | - **Types**: 77 | - Max Pooling: Take maximum value. 78 | - Average Pooling: Compute mean. 79 | - **Parameters**: None (fixed operation). 80 | - **Pros**: Reduces compute, prevents overfitting, adds invariance. 81 | - **Cons**: Loses some spatial info. 82 | - **Example**: Max-pool a 4x4 feature map to 2x2, keeping strongest signals. 83 | 84 | - **Key Differences**: 85 | - **Function**: Convolution extracts features; pooling downsamples. 86 | - **Learnable**: Convolution has weights; pooling is fixed. 87 | - **Output**: Convolution preserves depth; pooling reduces dimensions. 
88 | - **Use**: Convolution for pattern detection; pooling for efficiency. 89 | 90 | **Example**: 91 | - Input: 28x28 image. 92 | - Conv: 3x3 filter → 26x26 feature map (32 filters). 93 | - Pool: 2x2 max-pool → 13x13 map. 94 | 95 | **Interview Tips**: 96 | - Clarify roles: “Conv learns, pooling simplifies.” 97 | - Mention trade-offs: “Pooling loses detail but speeds up.” 98 | - Be ready to compute: “Show 3x3 conv on 5x5 input.” 99 | 100 | --- 101 | 102 | ## 3. What is transfer learning, and how is it used in computer vision? 103 | 104 | **Answer**: 105 | 106 | **Transfer learning** involves using a pretrained model (trained on a large dataset) as a starting point for a new task, adapting it with minimal training. 107 | 108 | - **How It Works**: 109 | - **Pretrained Model**: Trained on large dataset (e.g., ImageNet with 1M images). 110 | - Example: ResNet, VGG, EfficientNet. 111 | - **Feature Extraction**: 112 | - Use pretrained layers to extract general features (e.g., edges, shapes). 113 | - Freeze early layers, keeping weights fixed. 114 | - **Fine-Tuning** (optional): 115 | - Train later layers or entire model on new data. 116 | - Adjust with small learning rate to avoid overfitting. 117 | - **Output Layer**: Replace final layer (e.g., 1000-class softmax → 2-class). 118 | 119 | - **Why Used in Vision**: 120 | - **Data Scarcity**: Small datasets (e.g., 1000 medical images) benefit from pretrained features. 121 | - **Speed**: Reduces training time (hours vs. days). 122 | - **Performance**: Leverages learned patterns (e.g., textures), boosts accuracy. 123 | - **Accessibility**: Pretrained models are widely available (e.g., PyTorch Hub). 124 | 125 | - **Process**: 126 | 1. Load pretrained model (e.g., ResNet50). 127 | 2. Freeze convolutional base. 128 | 3. Replace head (e.g., for 10 classes). 129 | 4. Train on new data, optionally unfreeze layers. 130 | 131 | **Example**: 132 | - Task: Classify X-ray images (normal vs. pneumonia). 133 | - Transfer: Use ImageNet-pretrained ResNet, fine-tune on 5000 X-rays. 134 | 135 | **Interview Tips**: 136 | - Highlight efficiency: “Saves time and data.” 137 | - Discuss layers: “Early layers are generic, later task-specific.” 138 | - Be ready to code: “Show PyTorch transfer learning.” 139 | 140 | --- 141 | 142 | ## 4. What is the difference between object detection and image classification? 143 | 144 | **Answer**: 145 | 146 | - **Image Classification**: 147 | - **Goal**: Assign a single label to an entire image. 148 | - **Input**: Image (e.g., 224x224 pixels). 149 | - **Output**: Class probability (e.g., “dog” with 0.9). 150 | - **How**: CNN predicts one label via softmax. 151 | - **Use Case**: Identify scene type (e.g., beach vs. forest). 152 | - **Example**: “This is a cat image.” 153 | - **Models**: ResNet, VGG, EfficientNet. 154 | 155 | - **Object Detection**: 156 | - **Goal**: Identify and localize multiple objects in an image. 157 | - **Input**: Image. 158 | - **Output**: Bounding boxes + class labels + confidence scores (e.g., “cat at [x,y,w,h], 0.95”). 159 | - **How**: Models predict boxes and classes (e.g., via anchors, grids). 160 | - **Use Case**: Autonomous driving (detect cars, pedestrians). 161 | - **Example**: “Cat at top-left, dog at bottom-right.” 162 | - **Models**: YOLO, Faster R-CNN, SSD. 163 | 164 | - **Key Differences**: 165 | - **Scope**: Classification labels whole image; detection localizes objects. 166 | - **Output**: Classification gives one class; detection gives boxes + classes. 
167 | - **Complexity**: Detection is harder (spatial + classification). 168 | - **ML Context**: Classification for simple tasks; detection for localization. 169 | 170 | **Example**: 171 | - Classification: Image → “positive for tumor.” 172 | - Detection: Image → “tumor at [100,150,50,50].” 173 | 174 | **Interview Tips**: 175 | - Clarify output: “Detection adds spatial info.” 176 | - Mention models: “YOLO for detection, ResNet for classification.” 177 | - Be ready to sketch: “Show image with box vs. single label.” 178 | 179 | --- 180 | 181 | ## 5. Explain the YOLO algorithm for object detection 182 | 183 | **Answer**: 184 | 185 | **YOLO (You Only Look Once)** is a real-time object detection algorithm that predicts bounding boxes and class probabilities in a single pass, optimized for speed and accuracy. 186 | 187 | - **How It Works**: 188 | - **Input**: Image (e.g., 416x416). 189 | - **Grid Division**: Split image into `S x S` grid (e.g., 13x13). 190 | - **Predictions per Cell**: 191 | - `B` bounding boxes with coordinates `[x, y, w, h]`. 192 | - Confidence score: `P(object) * IoU` (Intersection over Union). 193 | - `C` class probabilities (e.g., dog, cat). 194 | - **Output**: Tensor of predictions (e.g., `S x S x (B * 5 + C)`). 195 | - **Post-Processing**: 196 | - Non-Max Suppression (NMS): Remove overlapping boxes. 197 | - Threshold confidence to filter weak detections. 198 | - **Architecture**: 199 | - CNN backbone (e.g., DarkNet, EfficientNet). 200 | - Multi-scale predictions (e.g., YOLOv3 detects at different resolutions). 201 | 202 | - **Key Features**: 203 | - **Single Pass**: Unlike two-stage models (e.g., Faster R-CNN), YOLO is fast. 204 | - **Global Context**: Considers entire image, reducing background errors. 205 | - **Versions**: YOLOv1 (2015) to YOLOv8 (2023), improving accuracy/speed. 206 | 207 | - **Pros**: 208 | - Fast (e.g., 60 FPS for YOLOv5). 209 | - Unified model, easy to train. 210 | - Good for real-time (e.g., video). 211 | - **Cons**: 212 | - Struggles with small objects or dense scenes. 213 | - Lower precision than two-stage models. 214 | 215 | **Example**: 216 | - Task: Detect cars in traffic cam. 217 | - YOLO: Outputs boxes for each car with class “car” and confidence. 218 | 219 | **Interview Tips**: 220 | - Emphasize speed: “YOLO’s single pass is key.” 221 | - Compare: “Vs. Faster R-CNN: faster but less precise.” 222 | - Be ready to sketch: “Show grid and boxes.” 223 | 224 | --- 225 | 226 | ## 6. What is semantic segmentation, and how does it differ from instance segmentation? 227 | 228 | **Answer**: 229 | 230 | - **Semantic Segmentation**: 231 | - **Goal**: Assign a class label to every pixel in an image. 232 | - **Input**: Image. 233 | - **Output**: Pixel-wise class map (e.g., “sky,” “tree,” “road”). 234 | - **How**: CNNs with upsampling (e.g., encoder-decoder like U-Net). 235 | - Predict class per pixel via softmax. 236 | - **Use Case**: Autonomous driving (label road vs. sidewalk). 237 | - **Example**: Color all cars red, all roads gray. 238 | - **Models**: U-Net, DeepLab, FCN. 239 | 240 | - **Instance Segmentation**: 241 | - **Goal**: Assign a class label and unique ID to each object instance per pixel. 242 | - **Input**: Image. 243 | - **Output**: Pixel-wise map with instance IDs (e.g., “car1,” “car2”). 244 | - **How**: Combines detection and segmentation (e.g., predict boxes + masks). 245 | - **Use Case**: Robotics (distinguish individual objects). 246 | - **Example**: Separate two cars with different colors. 247 | - **Models**: Mask R-CNN, YOLACT. 
248 | 249 | - **Key Differences**: 250 | - **Granularity**: Semantic labels classes; instance labels objects. 251 | - **Output**: Semantic has one class per pixel; instance has class + ID. 252 | - **Task**: Semantic is simpler (no instance separation); instance is detection + segmentation. 253 | - **ML Context**: Semantic for scene understanding; instance for object interaction. 254 | 255 | **Example**: 256 | - Image: Two cats. 257 | - Semantic: All cat pixels = “cat.” 258 | - Instance: Cat1 pixels = “cat1,” Cat2 = “cat2.” 259 | 260 | **Interview Tips**: 261 | - Clarify instances: “Instance segmentation tracks individuals.” 262 | - Mention models: “U-Net for semantic, Mask R-CNN for instance.” 263 | - Be ready to sketch: “Show pixel labels vs. instance masks.” 264 | 265 | --- 266 | 267 | ## 7. What are some common data augmentation techniques for computer vision? 268 | 269 | **Answer**: 270 | 271 | **Data augmentation** artificially increases dataset size by applying transformations to images, improving model robustness and generalization. 272 | 273 | - **Geometric Transformations**: 274 | - **Rotation**: Rotate image (e.g., ±30°). 275 | - **Why**: Handles tilted objects. 276 | - **Translation**: Shift image (e.g., ±10 pixels). 277 | - **Why**: Objects not centered. 278 | - **Scaling/Zoom**: Resize (e.g., 0.8x to 1.2x). 279 | - **Why**: Varying object sizes. 280 | - **Flipping**: Horizontal/vertical flip. 281 | - **Why**: Symmetry (e.g., left/right faces). 282 | - **Shearing**: Skew image. 283 | - **Why**: Perspective changes. 284 | - **Color Transformations**: 285 | - **Brightness**: Adjust intensity (e.g., ±20%). 286 | - **Why**: Lighting variations. 287 | - **Contrast**: Stretch intensity range. 288 | - **Why**: Different exposures. 289 | - **Hue/Saturation**: Alter colors. 290 | - **Why**: Color shifts (e.g., sunlight). 291 | - **Noise and Blur**: 292 | - **Gaussian Noise**: Add random noise. 293 | - **Why**: Sensor imperfections. 294 | - **Blur**: Apply Gaussian blur. 295 | - **Why**: Out-of-focus images. 296 | - **Cutout/Dropout**: 297 | - Randomly mask patches. 298 | - **Why**: Forces focus on other regions. 299 | - **Mixup/Cutmix**: 300 | - Blend images/labels (Mixup) or paste patches (Cutmix). 301 | - **Why**: Improves generalization. 302 | - **Task-Specific**: 303 | - Crop for detection (focus on objects). 304 | - Elastic distortions for OCR (mimic handwriting). 305 | 306 | **Example**: 307 | - Task: Classify dogs. 308 | - Augmentation: Rotate, flip, adjust brightness → model handles varied poses/lighting. 309 | 310 | **Interview Tips**: 311 | - Link to robustness: “Augmentation mimics real-world noise.” 312 | - Balance: “Too much augmentation distorts data.” 313 | - Be ready to code: “Show PIL/PyTorch augmentation.” 314 | 315 | --- 316 | 317 | ## 8. How do you evaluate the performance of a computer vision model? 318 | 319 | **Answer**: 320 | 321 | Evaluating computer vision models depends on the task, using metrics to measure accuracy, robustness, and generalization: 322 | 323 | - **Image Classification**: 324 | - **Accuracy**: Fraction of correct predictions. 325 | - **Precision/Recall/F1**: 326 | - Precision: `TP / (TP + FP)` (correct positives). 327 | - Recall: `TP / (TP + FN)` (captured positives). 328 | - F1: `2 * (P * R) / (P + R)`. 329 | - **Top-k Accuracy**: Correct class in top k predictions. 330 | - **Use**: F1 for imbalanced data (e.g., rare diseases). 
331 | - **Object Detection**: 332 | - **mAP (Mean Average Precision)**: 333 | - Compute AP per class at IoU threshold (e.g., 0.5). 334 | - Average across classes. 335 | - **IoU**: `Intersection / Union` of predicted vs. true boxes. 336 | - **Use**: mAP@0.5 for standard detection (e.g., COCO). 337 | - **Semantic Segmentation**: 338 | - **Pixel Accuracy**: Fraction of correct pixels. 339 | - **mIoU (Mean Intersection over Union)**: 340 | - Average IoU per class: `IoU = TP / (TP + FP + FN)`. 341 | - **Use**: mIoU for class imbalance (e.g., Cityscapes). 342 | - **Instance Segmentation**: 343 | - **mAP with Masks**: Like detection, but for mask IoU. 344 | - **Use**: COCO-style mAP for Mask R-CNN. 345 | - **General Metrics**: 346 | - **Confusion Matrix**: Analyze errors (e.g., false positives). 347 | - **ROC-AUC**: For binary tasks, measure threshold trade-offs. 348 | - **Non-Metric Evaluation**: 349 | - **Visualization**: Inspect predictions (e.g., bounding boxes, masks). 350 | - **Error Analysis**: Identify failure modes (e.g., small objects missed). 351 | - **Practical Considerations**: 352 | - **Latency**: Ensure real-time performance (e.g., <100ms). 353 | - **Robustness**: Test on augmented/noisy data. 354 | - **Domain-Specific**: Align metrics with goals (e.g., safety for autonomous driving). 355 | 356 | **Example**: 357 | - Task: Detect pedestrians. 358 | - Metrics: mAP@0.5 = 0.85, inference time = 50ms. 359 | - Analysis: Check missed small pedestrians. 360 | 361 | **Interview Tips**: 362 | - Match metric to task: “mAP for detection, mIoU for segmentation.” 363 | - Discuss trade-offs: “Accuracy vs. latency matters.” 364 | - Be ready to compute: “Show IoU formula.” 365 | 366 | --- 367 | 368 | ## Notes 369 | 370 | - **Focus**: Answers cover core computer vision concepts, ideal for ML interviews. 371 | - **Clarity**: Explanations are structured for verbal delivery, with examples and trade-offs. 372 | - **Depth**: Includes technical details (e.g., YOLO grid, CNN math) and practical tips (e.g., augmentation pipelines). 373 | - **Consistency**: Matches the style of previous files for a cohesive repository. 374 | 375 | For deeper practice, implement vision models (see [ML Coding](ml-coding.md)) or explore [Deep Learning](deep-learning.md) for CNN foundations. 🚀 376 | 377 | --- 378 | 379 | **Next Steps**: Build on these skills with [Natural Language Processing](natural-language-processing.md) for multimodal tasks or revisit [Production MLOps](production-mlops.md) for deploying vision models! 🌟 -------------------------------------------------------------------------------- /questions/generative-models.md: -------------------------------------------------------------------------------- 1 | # Generative Models Questions 2 | 3 | This file contains generative models questions commonly asked in machine learning interviews at companies like **Google**, **Amazon**, **Meta**, and others. These questions assess your **understanding** of models that generate new data, such as images, text, or audio, covering techniques like GANs, VAEs, and diffusion models, their mechanics, and applications. 4 | 5 | Below are the questions with detailed answers, including explanations, mathematical intuition, and practical insights for interviews. 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | 1. [What are generative models, and how do they differ from discriminative models?](#1-what-are-generative-models-and-how-do-they-differ-from-discriminative-models) 12 | 2. 
[What is a Variational Autoencoder (VAE), and how does it work?](#2-what-is-a-variational-autoencoder-vae-and-how-does-it-work) 13 | 3. [Explain the concept of Generative Adversarial Networks (GANs)](#3-explain-the-concept-of-generative-adversarial-networks-gans) 14 | 4. [What are the challenges in training GANs, and how can they be addressed?](#4-what-are-the-challenges-in-training-gans-and-how-can-they-be-addressed) 15 | 5. [What is a diffusion model, and how does it generate data?](#5-what-is-a-diffusion-model-and-how-does-it-generate-data) 16 | 6. [What is the difference between VAE and GAN in terms of output quality and training?](#6-what-is-the-difference-between-vae-and-gan-in-terms-of-output-quality-and-training) 17 | 7. [How do you evaluate the performance of generative models?](#7-how-do-you-evaluate-the-performance-of-generative-models) 18 | 8. [What are some common applications of generative models in industry?](#8-what-are-some-common-applications-of-generative-models-in-industry) 19 | 20 | --- 21 | 22 | ## 1. What are generative models, and how do they differ from discriminative models? 23 | 24 | **Answer**: 25 | 26 | - **Generative Models**: 27 | - **Definition**: Learn the joint probability distribution `P(X,Y)` or data distribution `P(X)` to generate new samples resembling the training data. 28 | - **Goal**: Model how data is created (e.g., generate images, text). 29 | - **How**: Capture underlying patterns to sample new instances. 30 | - **Examples**: GANs, VAEs, diffusion models, autoregressive models. 31 | - **Output**: New data (e.g., fake face image). 32 | - **Use Case**: Image synthesis, data augmentation. 33 | 34 | - **Discriminative Models**: 35 | - **Definition**: Learn the conditional probability `P(Y|X)` to predict labels given input. 36 | - **Goal**: Classify or regress (e.g., label image as cat/dog). 37 | - **How**: Focus on decision boundaries between classes. 38 | - **Examples**: Logistic regression, SVM, CNNs for classification. 39 | - **Output**: Label or score (e.g., “dog” with 0.9). 40 | - **Use Case**: Classification, detection. 41 | 42 | - **Key Differences**: 43 | - **Distribution**: 44 | - Generative: Models `P(X)` or `P(X,Y)` (full data). 45 | - Discriminative: Models `P(Y|X)` (labels given data). 46 | - **Task**: 47 | - Generative: Create new data. 48 | - Discriminative: Predict labels. 49 | - **Complexity**: 50 | - Generative: Harder, needs data structure. 51 | - Discriminative: Simpler, focuses on boundaries. 52 | - **ML Context**: Generative for creation (e.g., art), discriminative for decisions (e.g., spam detection). 53 | 54 | **Example**: 55 | - Generative: VAE generates handwritten digits. 56 | - Discriminative: CNN classifies digits as 0-9. 57 | 58 | **Interview Tips**: 59 | - Clarify distributions: “Generative models P(X), discriminative P(Y|X).” 60 | - Highlight tasks: “Generate vs. classify.” 61 | - Be ready to compare: “Show cat image: generate vs. label.” 62 | 63 | --- 64 | 65 | ## 2. What is a Variational Autoencoder (VAE), and how does it work? 66 | 67 | **Answer**: 68 | 69 | A **Variational Autoencoder (VAE)** is a generative model that combines neural networks with Bayesian inference to learn a latent representation of data and generate new samples. 70 | 71 | - **How It Works**: 72 | - **Architecture**: 73 | - **Encoder**: Maps input `x` to latent distribution `q(z|x)` (mean `μ`, variance `σ²`). 74 | - **Latent Space**: Sample `z ~ N(μ, σ²)` using reparameterization (e.g., `z = μ + σ * ε`, `ε ~ N(0,1)`). 
75 | - **Decoder**: Maps `z` to reconstructed `p(x|z)`. 76 | - **Objective**: 77 | - Minimize loss: `L = Reconstruction Loss + KL Divergence`. 78 | - **Reconstruction**: `E[log p(x|z)]` (e.g., MSE for images). 79 | - **KL Divergence**: `D_KL(q(z|x) || p(z))`, regularizes `q(z|x)` to `p(z) ~ N(0,1)`. 80 | - **Training**: 81 | - Optimize via gradient descent. 82 | - Sample `z` to generate new data. 83 | - **Generation**: 84 | - Sample `z ~ N(0,1)`, pass through decoder. 85 | 86 | - **Key Features**: 87 | - **Probabilistic**: Latent `z` follows distribution, enables sampling. 88 | - **Regularized**: KL term ensures smooth latent space. 89 | - **Continuous**: Interpolates between data points. 90 | 91 | - **Pros**: 92 | - Stable training (vs. GANs). 93 | - Interpretable latent space. 94 | - **Cons**: 95 | - Blurry outputs (due to MSE loss). 96 | - Limited quality vs. GANs. 97 | 98 | **Example**: 99 | - Task: Generate faces. 100 | - VAE: Encodes face to `z`, decodes to similar face. 101 | 102 | **Interview Tips**: 103 | - Explain components: “Encoder, latent, decoder.” 104 | - Highlight KL: “Regularizes for smooth sampling.” 105 | - Be ready to derive: “Show VAE loss function.” 106 | 107 | --- 108 | 109 | ## 3. Explain the concept of Generative Adversarial Networks (GANs) 110 | 111 | **Answer**: 112 | 113 | **Generative Adversarial Networks (GANs)** are generative models where two neural networks—a generator and a discriminator—compete in a game to produce realistic data. 114 | 115 | - **How They Work**: 116 | - **Generator (G)**: 117 | - Input: Random noise `z ~ p(z)` (e.g., `N(0,1)`). 118 | - Output: Fake data `G(z)` (e.g., image). 119 | - Goal: Fool discriminator. 120 | - **Discriminator (D)**: 121 | - Input: Real data `x` or fake `G(z)`. 122 | - Output: Probability `D(x)` (real) or `D(G(z))` (fake). 123 | - Goal: Distinguish real vs. fake. 124 | - **Training**: 125 | - **Loss** (min-max game): 126 | - `min_G max_D E[log D(x)] + E[log (1 - D(G(z)))]`. 127 | - D maximizes correct classification. 128 | - G minimizes D’s ability to spot fakes. 129 | - Alternate updates: 130 | - Train D to improve detection. 131 | - Train G to reduce `1 - D(G(z))`. 132 | - **Equilibrium**: G produces data indistinguishable from real (`D(x) ≈ D(G(z)) ≈ 0.5`). 133 | 134 | - **Key Features**: 135 | - **Adversarial**: Competition drives quality. 136 | - **Flexible**: No explicit distribution modeling. 137 | - **High-Quality**: Sharp, realistic outputs. 138 | 139 | **Example**: 140 | - Task: Generate art. 141 | - GAN: G creates paintings, D critiques vs. real art. 142 | 143 | **Interview Tips**: 144 | - Use game analogy: “Generator fakes, discriminator judges.” 145 | - Highlight loss: “Min-max balances both.” 146 | - Be ready to sketch: “Show G → D pipeline.” 147 | 148 | --- 149 | 150 | ## 4. What are the challenges in training GANs, and how can they be addressed? 151 | 152 | **Answer**: 153 | 154 | Training GANs is notoriously difficult due to their adversarial nature. Common challenges and fixes: 155 | 156 | - **Challenges**: 157 | - **Mode Collapse**: 158 | - **Issue**: Generator produces limited variety (e.g., same face). 159 | - **Fix**: 160 | - Use Wasserstein GAN (WGAN): Replace loss with Earth Mover’s distance. 161 | - Mini-batch discrimination: Encourage diversity. 162 | - **Non-Convergence**: 163 | - **Issue**: G and D oscillate, no equilibrium. 164 | - **Fix**: 165 | - Gradient penalty (WGAN-GP): Stabilize training. 166 | - Label smoothing: Soften D’s targets (e.g., 0.9 vs. 1). 
167 | - **Vanishing Gradients**: 168 | - **Issue**: D too strong, G gets no signal. 169 | - **Fix**: 170 | - Use leaky ReLU in D. 171 | - Alternate training steps (e.g., train D once, G twice). 172 | - **Training Imbalance**: 173 | - **Issue**: D or G dominates, halting progress. 174 | - **Fix**: 175 | - Balance learning rates (e.g., lower for D). 176 | - Monitor losses to adjust steps. 177 | - **High Compute**: 178 | - **Issue**: Deep GANs need GPUs, long training. 179 | - **Fix**: Use progressive growing (e.g., start small, scale up). 180 | 181 | - **General Tips**: 182 | - Normalize inputs (e.g., [-1,1] for images). 183 | - Use stable architectures (e.g., DCGAN). 184 | - Monitor generated samples visually. 185 | 186 | **Example**: 187 | - Problem: GAN generates same dog breed. 188 | - Fix: Add WGAN loss, diversity improves. 189 | 190 | **Interview Tips**: 191 | - Prioritize mode collapse: “Biggest GAN headache.” 192 | - Suggest fixes: “WGAN-GP is go-to stabilizer.” 193 | - Be ready to debug: “Describe training loss curves.” 194 | 195 | --- 196 | 197 | ## 5. What is a diffusion model, and how does it generate data? 198 | 199 | **Answer**: 200 | 201 | **Diffusion models** are generative models that learn to reverse a noise-adding process to generate data from random noise, producing high-quality outputs. 202 | 203 | - **How They Work**: 204 | - **Forward Process** (Noise Addition): 205 | - Start with data `x_0` (e.g., image). 206 | - Add Gaussian noise over `T` steps: `x_t = √(1-β_t) * x_{t-1} + √β_t * ε`. 207 | - `β_t`: Noise schedule (increases with `t`). 208 | - End: `x_T ~ N(0,1)` (pure noise). 209 | - **Reverse Process** (Denoising): 210 | - Learn to reverse: `p(x_{t-1}|x_t) = N(μ_θ(x_t, t), Σ_θ(x_t, t))`. 211 | - Train model (e.g., U-Net) to predict noise `ε_θ(x_t, t)`. 212 | - Objective: Minimize `E[||ε - ε_θ(x_t, t)||²]`. 213 | - **Generation**: 214 | - Start with noise `x_T ~ N(0,1)`. 215 | - Iteratively denoise `x_{t-1} ← model(x_t, t)` for `T` steps. 216 | - Output: `x_0` (generated data). 217 | 218 | - **Key Features**: 219 | - **Iterative**: Multiple denoising steps (vs. GAN’s single pass). 220 | - **High-Quality**: Matches or beats GANs. 221 | - **Stable**: No adversarial training. 222 | 223 | - **Pros**: 224 | - Robust training, no mode collapse. 225 | - Excellent for images, audio. 226 | - **Cons**: 227 | - Slow generation (100s of steps). 228 | - High compute for training. 229 | 230 | **Example**: 231 | - Task: Generate faces. 232 | - Diffusion: Denoises random noise to sharp face over 1000 steps. 233 | 234 | **Interview Tips**: 235 | - Explain process: “Noise to data via learned reversal.” 236 | - Highlight quality: “Rivals GANs, more stable.” 237 | - Be ready to sketch: “Show forward/reverse steps.” 238 | 239 | --- 240 | 241 | ## 6. What is the difference between VAE and GAN in terms of output quality and training? 242 | 243 | **Answer**: 244 | 245 | - **Output Quality**: 246 | - **VAE**: 247 | - **Characteristics**: Often blurry, less detailed. 248 | - **Reason**: MSE reconstruction loss smooths outputs, KL regularization limits expressiveness. 249 | - **Example**: Generated faces look soft, lack fine textures. 250 | - **Use Case**: When interpretability > sharpness (e.g., latent interpolation). 251 | - **GAN**: 252 | - **Characteristics**: Sharp, realistic, high-fidelity. 253 | - **Reason**: Adversarial loss optimizes for perceptual similarity, not pixel-wise error. 254 | - **Example**: Generated faces have crisp details (e.g., eyes, hair). 
255 | - **Use Case**: Photorealistic synthesis (e.g., art, deepfakes). 256 | 257 | - **Training**: 258 | - **VAE**: 259 | - **Process**: Optimize single loss (`Reconstruction + KL`) with gradient descent. 260 | - **Stability**: Converges reliably, no competing objectives. 261 | - **Speed**: Faster, simpler optimization. 262 | - **Challenges**: Balancing reconstruction vs. KL, tuning latent size. 263 | - **GAN**: 264 | - **Process**: Min-max game between G and D, alternating updates. 265 | - **Stability**: Prone to mode collapse, non-convergence. 266 | - **Speed**: Slower, needs careful tuning (e.g., learning rates). 267 | - **Challenges**: Balancing G/D, avoiding vanishing gradients. 268 | 269 | - **Key Differences**: 270 | - **Quality**: GANs > VAEs (sharper but riskier). 271 | - **Training**: VAEs easier, GANs trickier. 272 | - **Latent Space**: VAEs structured (Gaussian); GANs unstructured (noise). 273 | - **ML Context**: VAEs for stable tasks, GANs for high-quality visuals. 274 | 275 | **Example**: 276 | - VAE: Generates blurry digits, trains in 1 hour. 277 | - GAN: Generates sharp digits, trains in 3 hours with tuning. 278 | 279 | **Interview Tips**: 280 | - Contrast outputs: “VAE blurry, GAN crisp.” 281 | - Discuss stability: “VAE trains smoothly, GAN fights.” 282 | - Be ready to sketch: “Show VAE vs. GAN loss.” 283 | 284 | --- 285 | 286 | ## 7. How do you evaluate the performance of generative models? 287 | 288 | **Answer**: 289 | 290 | Evaluating generative models is challenging due to subjective quality and lack of direct metrics. Common approaches: 291 | 292 | - **Quantitative Metrics**: 293 | - **Inception Score (IS)**: 294 | - **How**: Use pretrained classifier (e.g., InceptionV3) to score diversity and clarity. 295 | - **Pros**: Correlates with human judgment. 296 | - **Cons**: Biased to classifier’s domain. 297 | - **Fréchet Inception Distance (FID)**: 298 | - **How**: Compare feature distributions (real vs. fake) via InceptionV3. 299 | - **Formula**: `FID = ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2(Σ_r Σ_g)^0.5)`. 300 | - **Pros**: Sensitive to quality, diversity. 301 | - **Cons**: Needs large samples, domain-specific. 302 | - **Precision/Recall**: 303 | - **How**: Measure coverage (recall) and fidelity (precision) of generated data. 304 | - **Pros**: Balances mode collapse vs. quality. 305 | - **Cons**: Complex to compute. 306 | - **Log-Likelihood** (VAEs, diffusion): 307 | - **How**: Estimate `log p(x)` for generated data. 308 | - **Cons**: Doesn’t reflect perceptual quality. 309 | 310 | - **Qualitative Evaluation**: 311 | - **Human Judgment**: 312 | - Score samples for realism, diversity (e.g., 1-5 scale). 313 | - **Use**: Gold standard for creative tasks. 314 | - **Visual Inspection**: 315 | - Check samples for artifacts, variety. 316 | - **Use**: Debug mode collapse, blurriness. 317 | - **Interpolation**: 318 | - Test latent space (e.g., smooth transitions in VAEs). 319 | - **Use**: Assess structure, generalization. 320 | 321 | - **Task-Specific**: 322 | - **Downstream Tasks**: Use generated data for classification, measure accuracy. 323 | - **Domain Metrics**: PSNR/SSIM for images, BLEU for text. 324 | 325 | - **Challenges**: 326 | - **Subjectivity**: Metrics vs. human perception misalign. 327 | - **Diversity vs. Quality**: Hard to balance (e.g., FID misses modes). 328 | - **Compute**: FID needs many samples. 329 | 330 | **Example**: 331 | - Task: Generate faces. 332 | - Metrics: FID = 10 (good), IS = 3.5 (decent). 333 | - Visual: Check for diverse expressions. 
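Where a concrete computation helps, here is a minimal sketch of the FID formula above, assuming the real and generated feature vectors have already been extracted with a pretrained network such as InceptionV3 — the random arrays below are stand-ins for those activations, not real data:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real, feats_gen):
    """FID between two feature sets of shape (n_samples, n_features)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    diff = mu_r - mu_g
    covmean = sqrtm(sigma_r @ sigma_g)   # matrix square root of the covariance product
    if np.iscomplexobj(covmean):         # discard tiny imaginary parts from numerical error
        covmean = covmean.real

    # ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^0.5)
    return diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean)

# Placeholder "features" standing in for InceptionV3 activations
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 64))
fake_feats = rng.normal(loc=0.1, size=(500, 64))
print(f"FID ≈ {frechet_inception_distance(real_feats, fake_feats):.3f}")
```

In practice, libraries such as `torchmetrics` or `clean-fid` wrap this computation together with the feature extraction, but the statistics above are what the score reduces to.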
334 | 335 | **Interview Tips**: 336 | - Prioritize FID: “Best for image quality.” 337 | - Mention limits: “Metrics don’t capture everything.” 338 | - Be ready to compute: “Explain FID formula.” 339 | 340 | --- 341 | 342 | ## 8. What are some common applications of generative models in industry? 343 | 344 | **Answer**: 345 | 346 | Generative models power creative and practical applications across industries: 347 | 348 | - **Image Synthesis**: 349 | - **Use**: Create realistic images (e.g., faces, landscapes). 350 | - **Models**: GANs (StyleGAN), diffusion (DALL·E 2). 351 | - **Example**: Generate product mockups for e-commerce. 352 | - **Data Augmentation**: 353 | - **Use**: Generate synthetic data to boost ML training. 354 | - **Models**: VAEs, GANs. 355 | - **Example**: Augment medical images for rare diseases. 356 | - **Text-to-Image Generation**: 357 | - **Use**: Convert prompts to visuals (e.g., “cat in space”). 358 | - **Models**: Diffusion (Stable Diffusion), GANs. 359 | - **Example**: Design art for games via prompts. 360 | - **Super-Resolution**: 361 | - **Use**: Upscale low-res images. 362 | - **Models**: GANs (SRGAN), diffusion. 363 | - **Example**: Enhance satellite imagery. 364 | - **Video Generation**: 365 | - **Use**: Create or edit videos (e.g., deepfakes, animations). 366 | - **Models**: GANs, diffusion. 367 | - **Example**: Auto-generate marketing videos. 368 | - **Music and Audio**: 369 | - **Use**: Compose music, synthesize voices. 370 | - **Models**: VAEs, autoregressive (WaveNet). 371 | - **Example**: AI music for streaming platforms. 372 | - **Text Generation**: 373 | - **Use**: Generate stories, code, or dialogues. 374 | - **Models**: Transformers (GPT), VAEs. 375 | - **Example**: Chatbots, automated content. 376 | - **Drug Discovery**: 377 | - **Use**: Generate molecular structures. 378 | - **Models**: VAEs, GANs. 379 | - **Example**: Design new compounds for pharma. 380 | - **Anomaly Detection**: 381 | - **Use**: Model normal data, flag outliers. 382 | - **Models**: VAEs, GANs. 383 | - **Example**: Detect defects in manufacturing. 384 | 385 | **Example**: 386 | - Industry: Gaming. 387 | - Application: Use StyleGAN to create unique NPC faces. 388 | 389 | **Interview Tips**: 390 | - List variety: “Images, text, audio, molecules.” 391 | - Link to impact: “Augmentation saves data costs.” 392 | - Be ready to brainstorm: “Suggest a new use case.” 393 | 394 | --- 395 | 396 | ## Notes 397 | 398 | - **Focus**: Answers cover generative models critical for ML interviews. 399 | - **Clarity**: Explanations are structured for verbal delivery, with examples and trade-offs. 400 | - **Depth**: Includes mathematical rigor (e.g., VAE loss, GAN game) and practical tips (e.g., FID evaluation). 401 | - **Consistency**: Matches the style of previous files for a cohesive repository. 402 | 403 | For deeper practice, apply these to image tasks (see [Computer Vision](computer-vision.md)) or explore [Anomaly Detection](anomaly-detection.md) for generative-based outliers. 🚀 404 | 405 | --- 406 | 407 | **Next Steps**: Build on these skills with [Natural Language Processing](natural-language-processing.md) for text generation or revisit [Optimization Techniques](optimization-techniques.md) for training generative models! 🌟 --------------------------------------------------------------------------------