├── Chips_Machine_Learning_Interviews_Book ├── chapter5.md ├── chapter6.md ├── chapter7.md ├── chapter8.md └── ml-interviews-book.md ├── MLE_Interview_QnA ├── AgenticAIQnA.md ├── DataScienceQnA.md ├── PythonQnA.md └── case_study_company_A.md └── README.md /Chips_Machine_Learning_Interviews_Book/chapter5.md: -------------------------------------------------------------------------------- 1 | ## 5 Math 2 | 3 | ### 5.1.1 Vectors 4 | 5 | 1. Dot product 6 | 1. [E] What’s the geometric interpretation of the dot product of two vectors? 7 | ```text 8 | The dot product of two vectors can be interpreted geometrically as the product of the magnitudes of the two vectors and the cosine of the angle between them. This can be written as: dot product = magnitude of vector 1 * magnitude of vector 2 * cos(angle between vectors). Geometrically, the dot product of two vectors tells us about the orientation of the vectors relative to each other. If the dot product is positive, it means the vectors are oriented in roughly the same direction. If the dot product is negative, it means the vectors are oriented in roughly opposite directions (the angle between them is greater than 90 degrees). If the dot product is zero, it means the vectors are orthogonal (perpendicular) to each other. 9 | 10 | The dot product has many useful properties and applications. For example, it can be used to find the projection of one vector onto another, to test whether two vectors are orthogonal, and to compute the angle between two vectors. It is often introduced alongside the cross product, which for three-dimensional vectors instead produces a vector perpendicular to both inputs, with a magnitude equal to magnitude of vector 1 * magnitude of vector 2 * sin(angle between vectors). 11 | ``` 12 | 2. [E] Given a vector , find vector of unit length such that the dot product of and is maximum. 13 | ```text 14 | To find the vector of unit length that maximizes the dot product with a given vector, you can simply normalize the given vector. 
Normalizing a vector means dividing it by its magnitude, so that the resulting vector has a magnitude of 1. This is done by dividing each component of the vector by the magnitude of the vector. If the given vector is X, the normalized vector would be: normalized vector = (1/magnitude of X) * X. 15 | 16 | For example, if X is a two-dimensional vector with components (x1, x2), the normalized vector of X would be: normalized vector of X = (1/sqrt(x1^2 + x2^2)) * (x1, x2) 17 | 18 | The dot product of the normalized vector with the original vector is the maximum, because for any unit vector the dot product with X equals magnitude of X * cos(angle between them), and the cosine reaches its largest value, 1, only when the unit vector points in the same direction as X. The maximum value is therefore the magnitude of X. This assumes X is not the zero vector; if it is, every unit vector gives a dot product of 0. 19 | ``` 20 | 21 | 2. Outer product 22 | 1. [E] Given two vectors and . Calculate the outer product ? 23 | ```text 24 | NOT ANSWERED 25 | ``` 26 | 2. [M] Give an example of how the outer product can be useful in ML. 27 | ```text 28 | NOT ANSWERED 29 | ``` 30 | 31 | 3. [E] What does it mean for two vectors to be linearly independent? 32 | ```text 33 | NOT ANSWERED 34 | ``` 35 | 4. [M] Given two sets of vectors and . How do you check that they share the same basis? 36 | ```text 37 | NOT ANSWERED 38 | ``` 39 | 5. [M] Given vectors, each of dimensions. What is the dimension of their span? 40 | ```text 41 | NOT ANSWERED 42 | ``` 43 | 6. Norms and metrics 44 | 1. [E] What's a norm? What is ? 45 | ```text 46 | NOT ANSWERED 47 | ``` 48 | 2. [M] How do norm and metric differ? Given a norm, make a metric. Given a metric, can we make a norm? 
49 | ```text 50 | NOT ANSWERED 51 | ``` 52 | 53 | 54 | ### 5.2.1 Probability -------------------------------------------------------------------------------- /Chips_Machine_Learning_Interviews_Book/chapter6.md: -------------------------------------------------------------------------------- 1 | ## Chapter 6. Computer Science 2 | 3 | ### 4 | 5 | -------------------------------------------------------------------------------- /Chips_Machine_Learning_Interviews_Book/chapter7.md: -------------------------------------------------------------------------------- 1 | ## Chapter 7. Machine learning workflows 2 | 3 | ### 7.1 Basics 4 | 1. [E] Explain supervised, unsupervised, weakly supervised, semi-supervised, and active learning. 5 | ```text 6 | - Supervised learning is a type of machine learning where the model is trained on labeled data, meaning that the data used to train the model is already tagged with correct output labels. The model makes predictions based on this labeled data and the goal is for the model to make predictions on new, unseen data that is drawn from the same distribution as the training data. Examples of supervised learning include classification tasks, such as identifying spam emails, and regression tasks, such as predicting the price of a house given its characteristics. 7 | 8 | - Unsupervised learning is a type of machine learning where the model is not given any labeled training data and must discover the underlying structure of the data through techniques such as clustering. The goal of unsupervised learning is to find patterns or relationships in the data, rather than to make predictions on new, unseen data. Examples of unsupervised learning include dimensionality reduction and density estimation. 9 | 10 | - Weakly supervised learning is a type of machine learning that is intermediate between supervised and unsupervised learning. In weakly supervised learning, the model is given some labeled data, but the labels are not necessarily complete or accurate. 
This can be the case when it is expensive or time-consuming to label the data, or when the data is noisy or incomplete. Weakly supervised learning algorithms try to make the most of the available labels to learn about the underlying structure of the data. 11 | 12 | - Semi-supervised learning is a type of machine learning that is intermediate between supervised and unsupervised learning. In semi-supervised learning, the model is given a small amount of labeled data and a large amount of unlabeled data. The goal is to use the labeled data to learn about the structure of the data, and then use this learned structure to label the unlabeled data. Semi-supervised learning algorithms try to make the most of the available labeled data and use it to infer the structure of the unlabeled data. 13 | 14 | - Active learning is a type of machine learning where the model is able to query a labeling source (typically a human annotator) and request labels for specific data points that it is unsure about. The goal of active learning is to improve the efficiency of the model by only labeling the data that is most valuable for improving the model's performance. This is especially useful when labeling data is expensive or time-consuming, as it allows the model to focus on the most important data points. 15 | ``` 16 | 17 | 2. Empirical risk minimization. 18 | 1. [E] What’s the risk in empirical risk minimization? 19 | ```text 20 | In empirical risk minimization, the risk is the expected loss of a model over the true data distribution. That distribution is unknown, so the method instead minimizes the empirical risk: the average loss over the training data. The goal is to find the model parameters that minimize this average. The main danger of relying on the empirical risk is overfitting, which occurs when the model is too complex and fits the training data too closely, leading to poor generalization to new, unseen data. Overfitting can occur when the model has too many parameters relative to the size of the training data, or when the training data is noisy or contains outliers. 
To mitigate the risk of overfitting, it is important to use appropriate model complexity and regularization techniques, such as using a simpler model or adding a regularization term to the loss function. It is also important to evaluate the model's performance on a held-out validation dataset to ensure that it generalizes well to new data. 21 | Another risk of empirical risk minimization is the assumption that the training data is representative of the underlying distribution of the data. If the training data is not representative, the model's performance on new, unseen data may be poor. To mitigate this risk, it is important to carefully choose the training data and ensure that it is representative of the distribution of data that the model will be applied to. 22 | ``` 23 | 2. [E] Why is it empirical? 24 | ```text 25 | Empirical risk minimization is called empirical because it is based on the empirical distribution of the data, which is the distribution of the data that is actually observed in the training set. 26 | ``` 27 | 3. [E] How do we minimize that risk? 28 | ```text 29 | Empirical risk minimization involves finding the model parameters that minimize the average loss over the training data. This can be done using optimization algorithms such as gradient descent, which involves iteratively updating the model parameters in the direction that reduces the loss. 30 | 31 | To minimize the empirical risk, the loss function is defined based on the task at hand and the desired properties of the model. For example, in a classification task, the loss function could be the cross-entropy loss, which measures the distance between the predicted class probabilities and the true class labels. In a regression task, the loss function could be the mean squared error, which measures the difference between the predicted and true values. 32 | ``` 33 | 34 | 3. 
[E] Occam's razor states that when the simple explanation and complex explanation both work equally well, the simple explanation is usually correct. How do we apply this principle in ML? 35 | ```text 36 | In machine learning, Occam's Razor is often applied in the context of model selection, where it suggests that, all else being equal, a simpler model with fewer parameters is generally preferred over a more complex model with more parameters. This is because a simpler model is less likely to overfit the data, and is therefore more likely to generalize well to new, unseen data. 37 | 38 | For example, suppose you are trying to build a machine learning model to predict the price of a house based on a set of features such as the size of the house, the number of bedrooms, and the location. If you have a choice between two models, one with a single linear regression model and one with a more complex model such as a neural network, Occam's razor suggests that you should generally prefer the simpler linear regression model, unless the additional complexity of the neural network is justified by the data. 39 | 40 | In summary, Occam's razor is a principle that suggests that the simplest explanation for a phenomenon is generally the most likely to be correct. In machine learning, it is often applied in the context of model selection, where it suggests that a simpler model with fewer parameters is generally preferred over a more complex model with more parameters. 41 | ``` 42 | 4. [E] What are the conditions that allowed deep learning to gain popularity in the last decade? 43 | ```text 44 | There are several factors that have contributed to the popularity of deep learning in recent years: 45 | 46 | - Increased computational power: Deep learning algorithms require a lot of computational power to train. 
In the last decade, there has been a significant increase in the availability of powerful hardware, such as graphics processing units (GPUs), which are well-suited for deep learning tasks. This has made it possible to train deep learning models on large datasets. 47 | 48 | - Availability of large datasets: In order to train effective deep learning models, large amounts of labeled data are needed. In recent years, there has been an explosion of data available on the internet, as well as an increase in the number of companies and organizations collecting and sharing data. This has made it possible to train deep learning models on a wide range of tasks. 49 | 50 | - Improvements in algorithms: There have been significant advances in the development of deep learning algorithms in recent years. Researchers have been able to improve the performance of deep learning models by developing new architectures and training techniques. 51 | 52 | - Broader adoption of machine learning: Deep learning is a subfield of machine learning, which has also gained popularity in recent years. The success of deep learning has contributed to the broader adoption of machine learning in a variety of industries and applications. 53 | 54 | - Real-world successes: The success of deep learning in a variety of real-world tasks, such as image and speech recognition, has contributed to its popularity. Deep learning models have achieved state-of-the-art performance on many tasks, and this has led to a growing interest in using deep learning for a wide range of applications. 55 | ``` 56 | 5. [M] If we have a wide NN and a deep NN with the same number of parameters, which one is more expressive and why? 57 | ```text 58 | A wide neural network refers to a neural network with a large number of units or neurons in each layer, while a deep neural network refers to a neural network with a large number of layers. 
For a fixed number of parameters, a deep neural network is generally considered more expressive than a wide one, because each layer composes the features computed by the previous layers, so the network can represent hierarchical functions compactly. 59 | Theoretical depth-separation results show that some functions a deep network represents with a modest number of units would require exponentially many units in a shallow network. However, it is important to note that this is not always an advantage in practice: very deep networks can be harder to train (for example because of vanishing gradients), and the relative expressiveness of a wide versus deep neural network will depend on the specific architecture and the data being modeled. 60 | ``` 61 | 6. [H] The Universal Approximation Theorem states that a neural network with 1 hidden layer can approximate any continuous function for inputs within a specific range. Then why can’t a simple neural network reach an arbitrarily small positive error? 62 | ```text 63 | The Universal Approximation Theorem is an existence result: it says that suitable weights exist, not how many hidden units are needed or whether training will find those weights. It therefore does not mean that a neural network with a single hidden layer can reach an arbitrarily small positive error on all tasks. There are several reasons why this may not be possible: 64 | 65 | 1. Limited capacity: A neural network with a single hidden layer of a fixed, practical size has limited capacity, and the number of hidden units required to capture a very complex target function can be impractically large. As a result, it may not be able to reach an arbitrarily small positive error. 66 | 67 | 2. Overfitting: If a neural network is trained on a limited amount of data, it may overfit to the training data and perform poorly on unseen data. This can lead to a higher error on the test set. 68 | 69 | 3. Local optima: During training, the neural network may get stuck in a local minimum of the loss function, which can result in a suboptimal solution. This can lead to a higher error on the test set. 
70 | 71 | 4. Lack of sufficient data: In some cases, there may not be enough labeled data available to train a neural network to an arbitrarily small positive error. 72 | ``` 73 | 7. [E] What are saddle points and local minima? Which are thought to cause more problems for training large NNs? 74 | ```text 75 | 76 | ``` 77 | 8. Hyperparameters. 78 | 1. [E] What are the differences between parameters and hyperparameters? 79 | ```text 80 | Parameters are the internal variables of a model that are learned during training. These include the weights and biases of a neural network, which are adjusted based on the training data to optimize the model's performance. Parameters are typically adjusted by the optimization algorithm during training, and they are specific to a particular task or dataset. 81 | 82 | Hyperparameters, on the other hand, are external variables that are set before training begins. They control the overall behavior of the model and are not learned during training. Examples of hyperparameters include the learning rate, the batch size, and the number of hidden units in a neural network. Hyperparameters are typically set by the practitioner, and they can significantly influence the model's performance. 83 | ``` 84 | 2. [E] Why is hyperparameter tuning important? 85 | ```text 86 | Hyperparameter tuning is an important step in the process of training a deep learning model because it can significantly influence the model's performance. Hyperparameters control the overall behavior of a model, and choosing the right values for them can make a big difference in the model's ability to fit the data and generalize to new situations. 87 | 88 | For example, in a neural network, the learning rate is a hyperparameter that controls the step size that the optimization algorithm takes when adjusting the weights and biases. If the learning rate is too high, the optimization algorithm may overshoot the minimum and oscillate, leading to slow or unstable convergence. 
If the learning rate is too low, the optimization algorithm may take too long to converge to a good solution. Choosing the right learning rate is therefore critical for successful training. 89 | 90 | Hyperparameter tuning can be time-consuming and require a lot of trial and error, but it is generally worth the effort because it can significantly improve the performance of a deep learning model. There are several methods for tuning hyperparameters, including manual search, grid search, and random search. It is also possible to use more advanced methods such as Bayesian optimization or evolutionary algorithms. 91 | 92 | Overall, hyperparameter tuning is important in deep learning because it allows practitioners to fine-tune the behavior of a model and improve its performance on a particular task or dataset. 93 | ``` 94 | 3. [M] Explain algorithm for tuning hyperparameters. 95 | ```text 96 | There are several algorithms that can be used for tuning hyperparameters in deep learning, including manual search, grid search, random search, and more advanced methods such as Bayesian optimization and evolutionary algorithms. Here is a brief overview of each algorithm: 97 | 98 | Manual search: This is the most basic method for tuning hyperparameters. It involves manually adjusting the values of the hyperparameters and evaluating the model's performance on a validation set. This process can be repeated until satisfactory performance is achieved. 99 | 100 | Grid search: This is a systematic method for exploring a range of hyperparameter values. The practitioner specifies a set of values for each hyperparameter, and the algorithm trains a model for each combination of hyperparameter values. The performance of each model is evaluated using a validation set, and the combination of hyperparameters that yields the best performance is selected. 
101 | 102 | Random search: This method involves sampling random combinations of hyperparameter values and training a model for each combination. The performance of each model is evaluated using a validation set, and the combination of hyperparameters that yields the best performance is selected. This method can be more efficient than grid search because it does not require training a model for every combination of hyperparameter values. 103 | 104 | Bayesian optimization: This is a more advanced method that uses Bayesian statistics to model the underlying function that relates the hyperparameters to the model's performance. The algorithm uses this model to iteratively select the most promising combinations of hyperparameter values to try next, based on the previous results. 105 | 106 | Evolutionary algorithms: These are optimization algorithms that mimic the process of natural evolution to find the best combination of hyperparameter values. They involve generating a population of random hyperparameter values, evaluating the performance of each individual, and using evolutionary operators such as crossover and mutation to generate new individuals for the next generation. The process is repeated until satisfactory performance is achieved. 107 | 108 | Overall, the choice of algorithm for tuning hyperparameters will depend on the specific requirements of the task and the available resources. Some algorithms may be more suitable for certain types of tasks or hyperparameter ranges, and some may be more computationally efficient than others. It is often a good idea to try a few different algorithms to see which one works best for a particular task. 109 | ``` 110 | 111 | 9. Classification vs. regression. 112 | 1. [E] What makes a classification problem different from a regression problem? 113 | ```text 114 | In a classification problem, the output variable is a categorical variable, which means that it can take on a finite number of values or categories. 
For example, a model might be trained to classify email messages as spam or not spam, or to classify images of animals as cats, dogs, or birds. In a classification problem, the goal is to predict the class or category that an input belongs to. 115 | 116 | In a regression problem, the output variable is a continuous variable, which means that it can take on any value within a range. For example, a model might be trained to predict the price of a house given its features (e.g., size, location, number of bedrooms), or to predict the stock price of a company given its financial data. In a regression problem, the goal is to predict a continuous value. 117 | 118 | In summary, the main difference between classification and regression is the type of output variable that is being predicted. Classification involves predicting a categorical variable, while regression involves predicting a continuous variable. 119 | ``` 120 | 2. [E] Can a classification problem be turned into a regression problem and vice versa? 121 | ```text 122 | Yes, it is often possible to convert a classification problem into a regression problem and vice versa. Here are some examples of how this can be done: 123 | 124 | Classification to regression: One way to convert a classification problem into a regression problem is to use a method called "ordinal encoding." This involves assigning a numerical value to each class, such that the difference between the values reflects the relative order or importance of the classes. For example, if a model is trained to classify email messages as spam or not spam, the classes could be encoded as 0 (not spam) and 1 (spam). The model could then be trained to predict the encoded values as a continuous output. 125 | 126 | Regression to classification: One way to convert a regression problem into a classification problem is to define a set of threshold values that divide the output range into discrete classes. 
For example, if a model is trained to predict the price of a house, the output range could be divided into classes such as "cheap," "moderate," and "expensive." The model could then be trained to predict the class that a given input belongs to. 127 | 128 | It is important to note that these conversion methods are not always appropriate and may not produce good results in all cases. The suitability of a particular conversion method will depend on the specifics of the task and the characteristics of the data. In general, it is usually best to use the appropriate type of problem for the task at hand, rather than trying to convert it to a different type. 129 | ``` 130 | 10. Parametric vs. non-parametric methods. 131 | 1. [E] What’s the difference between parametric methods and non-parametric methods? Give an example of each method. 132 | ```text 133 | In machine learning, parametric methods and non-parametric methods are two types of techniques that can be used to model and make predictions from data. 134 | 135 | Parametric methods are based on the assumption that the data follows a specific functional form or probability distribution with a fixed number of parameters, which does not grow with the amount of training data. These methods involve estimating the parameters from the data and using them to make predictions. Examples of parametric methods include linear regression, logistic regression, and linear discriminant analysis. 136 | 137 | Non-parametric methods, on the other hand, make few assumptions about the underlying distribution of the data. These methods can be more flexible than parametric methods because they do not rely on a fixed set of parameters; their effective complexity can grow with the amount of training data. Examples of non-parametric methods include decision trees, k-nearest neighbors, and kernel support vector machines. 138 | 139 | In general, parametric methods are typically faster and more computationally efficient than non-parametric methods, but they may be less flexible and may not perform as well on complex or irregular data. 
Non-parametric methods can be more flexible, but they may require more data and computational resources to train. 140 | 141 | It is important to note that the choice between parametric and non-parametric methods will depend on the specific requirements of the task and the characteristics of the data. In some cases, a parametric method may be the best choice, while in other cases a non-parametric method may be more appropriate. 142 | ``` 143 | 2. [H] When should we use one and when should we use the other? 144 | 11. [M] Why does ensembling independently trained models generally improve performance? 145 | ```text 146 | Ensembling is a machine learning technique that involves combining the predictions of multiple independently trained models to make a final prediction. Ensembling is often used to improve the performance of a model because it can reduce the variance and bias of the final prediction. 147 | 148 | One reason why ensembling can improve performance is that it can reduce variance. When multiple models are trained independently, they are likely to make different errors due to randomness in the training data. By combining the predictions of these models, the overall error is likely to be reduced because the errors will tend to cancel out. This can lead to more stable and consistent predictions. 149 | 150 | Another reason why ensembling can improve performance is that it can reduce bias. Individual models may have a bias towards certain patterns or features in the data, and this can limit their ability to generalize to new data. By combining the predictions of multiple models, the overall bias is likely to be reduced because the models are likely to have different biases. This can lead to better generalization and improved performance on new data. 151 | 152 | Overall, ensembling is a powerful technique that can improve the performance of a model by reducing variance and bias. 
It is often used in conjunction with other techniques such as model selection and hyperparameter optimization to further improve the performance of a model. 153 | ``` 154 | 12. [M] Why does L1 regularization tend to lead to sparsity while L2 regularization pushes weights closer to 0? 155 | 13. [E] Why does an ML model’s performance degrade in production? 156 | 14. [M] What problems might we run into when deploying large machine learning models? 157 | 15. Your model performs really well on the test set but poorly in production. 158 | 1. [M] What are your hypotheses about the causes? 159 | 2. [H] How do you validate whether your hypotheses are correct? 160 | 3. [M] Imagine your hypotheses about the causes are correct. What would you do to address them? 161 | 162 | ### 7.2 Sampling and creating training data 163 | 164 | 1. [E] If you have 6 shirts and 4 pairs of pants, how many ways are there to choose 2 shirts and 1 pair of pants? 165 | 2. [M] What is the difference between sampling with vs. without replacement? Name an example of when you would use one rather than the other? 166 | 3. [M] Explain Markov chain Monte Carlo sampling. 167 | 4. [M] If you need to sample from high-dimensional data, which sampling method would you choose? 168 | 5. [H] Suppose we have a classification task with many classes. An example is when you have to predict the next word in a sentence -- the next word can be one of many, many possible words. If we have to calculate the probabilities for all classes, it’ll be prohibitively expensive. Instead, we can calculate the probabilities for a small set of candidate classes. This method is called candidate sampling. Name and explain some of the candidate sampling algorithms. 169 | 6. Suppose you want to build a model to classify whether a Reddit comment violates the website’s rule. You have 10 million unlabeled comments from 10K users over the last 24 months and you want to label 100K of them. 170 | 1. 
[M] How would you sample 100K comments to label? 171 | 2. [M] Suppose you get back 100K labeled comments from 20 annotators and you want to look at some labels to estimate the quality of the labels. How many labels would you look at? How would you sample them? 172 | 7. [M] Suppose you work for a news site that historically has translated only 1% of all its articles. Your coworker argues that we should translate more articles into Chinese because translations help with the readership. On average, your translated articles have twice as many views as your non-translated articles. What might be wrong with this argument? 173 | 8. [M] How do you determine whether two sets of samples (e.g. train and test splits) come from the same distribution? 174 | 175 | 9. [H] How do you know you’ve collected enough samples to train your ML model? 176 | 10. [M] How do you determine outliers in your data samples? What do you do with them? 177 | 11. Sample duplication 178 | 1. [M] When should you remove duplicate training samples? When shouldn’t you? 179 | 2. [M] What happens if we accidentally duplicate every data point in your train set or in your test set? 180 | 12. Missing data 181 | 1. [H] In your dataset, two out of 20 variables have more than 30% missing values. What would you do? 182 | 2. [M] How might techniques that handle missing data make selection bias worse? How do you handle this bias? 183 | 13. [M] Why is randomization important when designing experiments (experimental design)? 184 | 185 | 14. Class imbalance. 186 | 1. [E] How would class imbalance affect your model? 187 | 2. [E] Why is it hard for ML models to perform well on data with class imbalance? 188 | 3. [M] Imagine you want to build a model to detect skin lesions from images. In your training dataset, only 1% of your images show signs of lesions. After training, your model seems to make a lot more false negatives than false positives. What are some of the techniques you'd use to improve your model? 189 | 15. 
Training data leakage. 190 | 1. [M] Imagine you're working with a binary task where the positive class accounts for only 1% of your data. You decide to oversample the rare class then split your data into train and test splits. Your model performs well on the test split but poorly in production. What might have happened? 191 | 2. [M] You want to build a model to classify whether a comment is spam or not spam. You have a dataset of a million comments over the period of 7 days. You decide to randomly split all your data into the train and test splits. Your co-worker points out that this can lead to data leakage. How? 192 | 193 | 16. [M] How does data sparsity affect your models? 194 | 17. Feature leakage 195 | 1. [E] What are some causes of feature leakage? 196 | 2. [E] Why does normalization help prevent feature leakage? 197 | 3. [M] How do you detect feature leakage? 198 | 199 | 18. [M] Suppose you want to build a model to classify whether a tweet spreads misinformation. You have 100K labeled tweets over the last 24 months. You decide to randomly shuffle your data and pick 80% to be the train split, 10% to be the valid split, and 10% to be the test split. What might be the problem with this way of partitioning? 200 | 19. [M] You’re building a neural network and you want to use both numerical and textual features. How would you process those different features? 201 | 20. [H] Your model has been performing fairly well using just a subset of features available in your data. Your boss decided that you should use all the features available instead. What might happen to the training error? What might happen to the test error? 202 | 203 | 204 | ### 7.3 Objective functions, metrics, and evaluation 205 | 206 | 1. Convergence. 207 | 1. [E] When we say an algorithm converges, what does convergence mean? 208 | 2. [E] How do we know when a model has converged? 209 | 2. [E] Draw the loss curves for overfitting and underfitting. 210 | 3. Bias-variance trade-off 211 | 1. 
[E] What’s the bias-variance trade-off?
    2. [M] How’s this tradeoff related to overfitting and underfitting?
    3. [M] How do you know that your model is high variance, low bias? What would you do in this case?
    4. [M] How do you know that your model is low variance, high bias? What would you do in this case?
4. Cross-validation.
    1. [E] Explain different methods for cross-validation.
    2. [M] Why don’t we see more cross-validation in deep learning?
5. Train, valid, test splits.
    1. [E] What’s wrong with training and testing a model on the same data?
    2. [E] Why do we need a validation set on top of a train set and a test set?
    3. [M] Your model’s loss curves on the train, valid, and test sets look like this. What might have been the cause of this? What would you do?
6. [E] Your team is building a system to aid doctors in predicting whether a patient has cancer or not from their X-ray scan. Your colleague announces that the problem is solved now that they’ve built a system that can predict with 99.99% accuracy. How would you respond to that claim?
7. F1 score.
    1. [E] What’s the benefit of F1 over accuracy?
    2. [M] Can we still use F1 for a problem with more than two classes? How?
8. Given a binary classifier that outputs the following confusion matrix.
    1. [E] Calculate the model’s precision, recall, and F1.
    2. [M] What can we do to improve the model’s performance?
9. Consider a classification where 99% of data belongs to class A and 1% of data belongs to class B.
    1. [M] If your model predicts A 100% of the time, what would the F1 score be? Hint: The F1 score when A is mapped to 0 and B to 1 is different from the F1 score when A is mapped to 1 and B to 0.
    2. [M] If we have a model that predicts A and B at random (uniformly), what would the expected F1 be?

10.
[M] For logistic regression, why is log loss recommended over MSE (mean squared error)?
11. [M] When should we use RMSE (Root Mean Squared Error) over MAE (Mean Absolute Error) and vice versa?
12. [M] Show that the negative log-likelihood and cross-entropy are the same for binary classification tasks.
13. [M] For classification tasks with more than two labels (e.g. MNIST with 10 labels), why is cross-entropy a better loss function than MSE?
14. [E] Consider a language with an alphabet of 27 characters. What would be the maximal entropy of this language?
15. [E] A lot of machine learning models aim to approximate probability distributions. Let’s say P is the distribution of the data and Q is the distribution learned by our model. How do we measure how close Q is to P?
16. MPE (Most Probable Explanation) vs. MAP (Maximum A Posteriori)
    1. [E] How do MPE and MAP differ?
    2. [H] Give an example of when they would produce different results.
17. [E] Suppose you want to build a model to predict the price of a stock in the next 8 hours and that the predicted price should never be off more than 10% from the actual price. Which metric would you use?

--------------------------------------------------------------------------------
/Chips_Machine_Learning_Interviews_Book/chapter8.md:
--------------------------------------------------------------------------------

## Chapter 8. Machine learning algorithms

--------------------------------------------------------------------------------
/Chips_Machine_Learning_Interviews_Book/ml-interviews-book.md:
--------------------------------------------------------------------------------

# Answering ml-interview-book questions
Most of these questions will be answered by ChatGPT and must not be considered 100% correct, so please review the answers and, if you disagree, open an issue or contact the author.

### Book Chapters

* [Chapter 5](chapter5.md)
* [Chapter 6](chapter6.md)
* [Chapter 7](chapter7.md)
* [Chapter 8](chapter8.md)

--------------------------------------------------------------------------------
/MLE_Interview_QnA/AgenticAIQnA.md:
--------------------------------------------------------------------------------

# Agentic AI QnA

### What is an agent in LangChain?
- An agent is a framework in LangChain that decides **what actions to take** based on user input.
- It can use **tools** like search or calculators.
- Agents rely on an LLM to choose the next tool or action based on intermediate results.

---
### What tools did your AI support agent use?
- A **retriever tool** (using FAISS vector store) to pull answers from product docs.
- A fallback tool to **escalate** queries to humans (mocked as a log function).
- Optionally, a **search tool** to simulate looking up live data.

---

### What is Retrieval-Augmented Generation (RAG)?
- RAG combines **retrieved documents** with an **LLM response**.
- Helps ground LLM answers in trusted content (like internal FAQs).
- Reduces hallucination risk and improves accuracy in niche domains.

---
### How did you store and use memory in the agent?
- I used **ConversationBufferMemory** to retain the chat history.
- This helped the bot maintain context across multiple user turns.
- LangChain allows easily injecting memory into agents or chains.

--------------------------------------------------------------------------------
/MLE_Interview_QnA/DataScienceQnA.md:
--------------------------------------------------------------------------------

# DS Interview QnA

## What is Logistic Regression?
- A **linear model** used for **binary (or multiclass) classification**.
- Outputs probabilities using the **sigmoid function** (softmax in the multiclass case).
- Assumes a **linear relationship** between input features and the log-odds of the outcome.
- Fast, interpretable, and works well with linearly separable data.

## What is Random Forest?
- An **ensemble of decision trees** using bootstrap aggregation (**bagging**).
- Reduces variance and avoids overfitting of single decision trees.
- Handles non-linear relationships, can cope with missing data in some implementations, and provides feature importance estimates.
- Slower and less interpretable than linear models.

## What is XGBoost?
- A **gradient boosting** framework that builds trees sequentially to correct previous errors.
- Highly accurate and robust to overfitting with proper regularization.
- Requires careful tuning (e.g., learning rate, depth, early stopping).
- Often the top choice in ML competitions (e.g., Kaggle).

## Why might you choose logistic regression over random forest or XGBoost?
- When the data is (approximately) **linearly separable**.
- When you need:
  - **Model interpretability** (e.g., feature coefficients)
  - **Speed** in training/prediction
  - **Simplicity** and fewer hyperparameters
  - **Baseline performance** for comparison



## What is the difference between supervised and unsupervised learning?
- **Supervised learning** uses labeled data to train models that map inputs to known outputs (e.g., classification, regression).
- **Unsupervised learning** uses unlabeled data to discover hidden patterns or groupings (e.g., clustering, dimensionality reduction).

#### Key Differences:
- Supervised learning requires **ground truth labels**; unsupervised does not.
- Evaluation in supervised learning uses accuracy, precision, recall, etc.; unsupervised uses metrics like silhouette score or domain-based validation.

---
## What is overfitting in machine learning, and how can you prevent it?
41 | - **Overfitting** happens when a model learns the training data too well, including noise, and performs poorly on unseen data. 42 | - It leads to low training error but high test error — the model fails to generalize. 43 | 44 | #### Prevention techniques: 45 | - **Cross-validation** 46 | - **Regularization** (L1, L2) 47 | - **Early stopping** 48 | - **Simpler models** or **fewer features** 49 | - **Pruning** (for trees), **dropout** (for neural nets) 50 | - **Increasing training data** 51 | 52 | --- 53 | ## What is the difference between precision and recall? 54 | - **Precision** measures the proportion of true positives among all predicted positives. 55 | `Precision = TP / (TP + FP)` 56 | - **Recall** measures the proportion of true positives among all actual positives. 57 | `Recall = TP / (TP + FN)` 58 | 59 | #### Key Differences: 60 | - Precision focuses on **how accurate** the positive predictions are. 61 | - Recall focuses on **how complete** the positive predictions are. 62 | - High precision = fewer false positives; high recall = fewer false negatives. 63 | 64 | --- 65 | ## What is the bias-variance tradeoff in machine learning? 66 | - The **bias-variance tradeoff** describes the balance between two sources of error that affect model performance on unseen data: 67 | - **Bias**: Error from erroneous assumptions in the model (underfitting). 68 | - **Variance**: Error from excessive sensitivity to training data (overfitting). 69 | 70 | #### Tradeoff: 71 | - **High bias, low variance**: Simple model, poor on both training and test sets. 72 | - **Low bias, high variance**: Complex model, great on training but poor on test data. 73 | - Goal: Find the sweet spot with **low total error** by balancing bias and variance. 74 | 75 | --- 76 | ## What is regularization in machine learning, and why is it used? 77 | - **Regularization** is a technique used to prevent overfitting by adding a penalty term to the loss function that discourages complex models. 
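
The "penalty term added to the loss" idea can be sketched in a few lines. This is a minimal illustration, not a library API: `mse`, `penalized_loss`, the strength `lam`, and the toy numbers are all hypothetical names and values chosen for the example.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def penalized_loss(y_true, y_pred, weights, lam=0.1, kind="l2"):
    """Data loss plus a penalty that grows with the size of the weights."""
    if kind == "l1":
        penalty = sum(abs(w) for w in weights)  # sum of absolute values
    elif kind == "l2":
        penalty = sum(w * w for w in weights)   # sum of squares
    else:
        raise ValueError(f"unknown penalty kind: {kind}")
    return mse(y_true, y_pred) + lam * penalty


# Toy values (assumed for illustration only).
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
weights = [0.5, -2.0]

base = mse(y_true, y_pred)
l1_loss = penalized_loss(y_true, y_pred, weights, lam=0.1, kind="l1")
l2_loss = penalized_loss(y_true, y_pred, weights, lam=0.1, kind="l2")
```

With the same predictions, the loss is higher when the weights are larger, so the optimizer is pushed toward smaller weights unless they buy a real reduction in data loss.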
78 | 79 | #### Types: 80 | - **L1 Regularization (Lasso)**: Adds the absolute value of coefficients to the loss function. Encourages sparsity (some weights become zero). 81 | - **L2 Regularization (Ridge)**: Adds the squared value of coefficients to the loss function. Penalizes large weights but keeps all features. 82 | 83 | #### Purpose: 84 | - Controls model complexity. 85 | - Helps improve generalization to unseen data. 86 | 87 | --- 88 | 89 | ## What is hyperparameter tuning, and what are some common methods to perform it? 90 | 91 | - **Hyperparameter tuning** is the process of **finding the optimal values** for model hyperparameters — settings not learned during training. 92 | 93 | #### Examples of hyperparameters: 94 | - Learning rate (in NNs or gradient boosting) 95 | - Number of layers or neurons (in neural nets) 96 | - Max depth or number of trees (in decision trees, random forests) 97 | - Regularization strength (L1/L2 penalties) 98 | - Batch size, dropout rate, etc. 99 | 100 | #### Common tuning methods: 101 | - **Grid Search**: tries all combinations from a predefined grid. 102 | - **Random Search**: randomly samples combinations — often faster than grid search. 103 | - **Bayesian Optimization**: uses past evaluation results to choose the next promising set. 104 | - **Automated tools**: e.g. `Optuna`, `Ray Tune`, `scikit-learn`’s `GridSearchCV`. 105 | 106 | ✅ Optimizers like Adam are used **within** models during training — not for tuning hyperparameters externally. 107 | 108 | --- 109 | ## What is the difference between bagging and boosting? 110 | - Both are **ensemble methods** often applied to decision trees (like Random Forests and XGBoost) that combine multiple models to improve performance, but they do so differently. 111 | 112 | #### Bagging (Bootstrap Aggregating): 113 | - Trains multiple models **independently** on random **subsets** of the data (with replacement). 
- Combines predictions via **averaging** (regression) or **majority vote** (classification).
- Goal: **Reduce variance** (e.g., Random Forest).

#### Boosting:
- Trains models **sequentially**, where each model tries to **correct the errors** of the previous one.
- Uses weighted data or errors to focus learning.
- Goal: primarily **reduce bias**, often lowering variance as well (e.g., AdaBoost, XGBoost).

#### Key Difference:
- Bagging: parallel, reduces variance.
- Boosting: sequential, reduces bias.

## What is the difference between Random Forest and XGBoost?

#### Random Forest:
- Based on **bagging** (parallel ensemble of decision trees).
- Uses **random subsets** of data and features.
- Averages predictions (regression) or takes majority vote (classification).
- **Reduces variance**, robust to overfitting, minimal tuning.
- Fast to train and **easy to parallelize**.

✅ Use it for strong baselines or when interpretability is less critical.

---

#### XGBoost:
- Based on **boosting** (sequential ensemble of trees).
- Each tree tries to **fix errors** made by the previous tree.
- Uses **gradient descent** on a loss function.
- Supports **regularization**, early stopping, and custom objectives.
- **Often more accurate**, but requires **careful tuning**.

✅ Use it when you need **maximum performance**, especially on structured/tabular data.

---
## What is cross-validation, and why is it used?
- **Cross-validation** is a technique to assess how well a model generalizes by splitting the training data into multiple **folds**.

#### How it works:
- Common method: **k-fold cross-validation**, which splits data into *k* subsets, trains on *k-1*, and validates on the remaining one, repeating *k* times.
- Final score is the **average performance** across all folds.
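
The k-fold procedure described above can be sketched without any library. This is an illustrative sketch under stated assumptions: `k_fold_indices` and `k_fold_score` are hypothetical helper names, the "model" is just the training mean, the score is mean absolute error, and the folds are contiguous with no shuffling (real code would usually shuffle or stratify first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous, near-equal folds (no shuffling)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more sample
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


def k_fold_score(ys, k):
    """Toy example: 'fit' the training mean on k-1 folds, score MAE on the held-out fold."""
    fold_scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train_idx) / len(train_idx)
        mae = sum(abs(ys[i] - prediction) for i in val_idx) / len(val_idx)
        fold_scores.append(mae)
    # The reported score is the average over the k validation folds.
    return sum(fold_scores) / k, fold_scores


ys = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 4.0, 6.0, 5.0, 5.0]  # toy targets (assumed)
mean_mae, fold_maes = k_fold_score(ys, k=5)
```

Every sample is used for validation exactly once and for training *k-1* times, which is what makes the averaged score less dependent on one particular split.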
155 | 156 | #### Purpose: 157 | - Provides a **more reliable estimate** of model performance. 158 | - Helps **detect overfitting or underfitting**. 159 | - Useful for **model selection and hyperparameter tuning**. 160 | 161 | --- 162 | ## What is the role of an activation function in a neural network? 163 | - An **activation function** introduces **non-linearity** into a neural network, enabling it to learn complex patterns and functions. 164 | - Applied to the output of each neuron before passing it to the next layer. 165 | 166 | #### Common activation functions: 167 | - **ReLU** (Rectified Linear Unit): `f(x) = max(0, x)` 168 | - **Sigmoid**: Maps values to (0, 1), useful for binary classification. 169 | - **Tanh**: Maps values to (–1, 1), centered around 0. 170 | 171 | #### Why it matters: 172 | - Without activation functions, a neural network behaves like a **linear model**, no matter how many layers it has. 173 | 174 | --- 175 | ## What is the vanishing gradient problem in deep learning? 176 | - The **vanishing gradient problem** occurs when gradients become **very small** during backpropagation, especially in deep networks. 177 | - This causes **early layers** to learn extremely slowly or not at all, because their weights are barely updated. 178 | 179 | #### Common causes: 180 | - Using activation functions like **sigmoid** or **tanh**, which squash values into small ranges and produce tiny gradients. 181 | 182 | #### Solutions: 183 | - Use **ReLU** or variants (Leaky ReLU, ELU). 184 | - **Batch normalization** to stabilize layer inputs. 185 | - **Residual connections** (e.g., in ResNets) to improve gradient flow. 186 | 187 | --- 188 | 189 | ## What is a learning curve in machine learning, and what can it tell you about your model? 190 | 191 | - A **learning curve** plots **model performance** (e.g., accuracy or loss) versus **training set size** or **training epochs**. 
192 | 193 | #### Common types: 194 | - **Training curve**: performance on the training data 195 | - **Validation curve**: performance on unseen/validation data 196 | 197 | #### What it tells you: 198 | - If both training and validation errors are high → **underfitting** 199 | - If training error is low but validation error is high → **overfitting** 200 | - If both errors converge and are low → good **generalization** 201 | 202 | #### Helps diagnose model capacity and whether adding more data might help. 203 | 204 | --- 205 | ## What is the difference between ReLU and Sigmoid activation functions? 206 | - **ReLU (Rectified Linear Unit)**: `f(x) = max(0, x)` 207 | - Output range: [0, ∞) 208 | - Computationally efficient and helps avoid vanishing gradients. 209 | - Sparse activation (many neurons output 0), improving efficiency. 210 | 211 | - **Sigmoid**: `f(x) = 1 / (1 + e^(-x))` 212 | - Output range: (0, 1) 213 | - Useful for binary classification. 214 | - Prone to **vanishing gradients** and **slow convergence** in deep networks. 215 | 216 | #### Key Difference: 217 | - ReLU is better for deep networks due to faster convergence and stronger gradient flow. 218 | - Sigmoid squashes values, which can lead to **gradient saturation** and slow learning. 219 | 220 | --- 221 | ## What is the purpose of backpropagation in neural networks? 222 | - **Backpropagation** is the algorithm used to **train neural networks** by adjusting weights to minimize the loss function. 223 | 224 | #### How it works: 225 | - Calculates the **error** at the output layer. 226 | - Uses the **chain rule** to propagate that error backward through the network. 227 | - Updates weights using the **gradient of the loss** with respect to each weight. 228 | 229 | #### Purpose: 230 | - Enables the model to learn by minimizing error over time. 231 | - Combined with **gradient descent** or similar optimizers to update weights effectively. 
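
As a rough sketch of the backpropagation steps above, here is the chain rule applied to a single sigmoid neuron with a squared-error loss, checked against a numerical gradient. The one-neuron setup, the function names, and the values of `w`, `b`, `x`, `y`, and `lr` are assumptions made purely for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Forward pass of one neuron: pre-activation z and activation a."""
    z = w * x + b
    return z, sigmoid(z)


def loss(a, y):
    """Squared-error loss for a single example."""
    return 0.5 * (a - y) ** 2


def backward(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw (and dz/db = 1)."""
    _, a = forward(w, b, x)
    dL_da = a - y            # error at the output
    da_dz = a * (1.0 - a)    # derivative of the sigmoid
    return dL_da * da_dz * x, dL_da * da_dz


w, b, x, y = 0.5, -0.1, 2.0, 1.0
grad_w, grad_b = backward(w, b, x, y)

# Central finite difference as an independent check of the analytic gradient.
eps = 1e-6
numeric_grad_w = (loss(forward(w + eps, b, x)[1], y)
                  - loss(forward(w - eps, b, x)[1], y)) / (2 * eps)

# One gradient-descent update using the backpropagated gradients.
lr = 0.1
loss_before = loss(forward(w, b, x)[1], y)
new_w, new_b = w - lr * grad_w, b - lr * grad_b
loss_after = loss(forward(new_w, new_b, x)[1], y)
```

In a multi-layer network the same product of local derivatives is carried backward layer by layer; frameworks automate this, so the sketch is only meant to show where the gradients come from.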
232 | 233 | --- 234 | ## What are the main components of a neural network's training loop? 235 | 1. **Forward Pass**: 236 | - Input data passes through the network to compute predictions. 237 | 238 | 2. **Loss Computation**: 239 | - A **loss function** (e.g., MSE, cross-entropy) measures the difference between predicted and actual values. 240 | 241 | 3. **Backward Pass (Backpropagation)**: 242 | - Computes **gradients** of the loss with respect to each weight using the **chain rule**. 243 | 244 | 4. **Weight Update**: 245 | - **Optimizer** (e.g., SGD, Adam) updates the weights using the gradients and the **learning rate**. 246 | 247 | 5. **Repeat**: 248 | - Loop through multiple **epochs** (passes through the full dataset) until convergence or early stopping. 249 | 250 | --- 251 | ## What is an epoch, a batch, and an iteration in neural network training? 252 | 253 | - **Epoch**: One full pass through the entire training dataset. 254 | - **Batch**: A subset of the training data used to update the model once (used due to memory limits or efficiency). 255 | - **Iteration**: One update of the model weights — corresponds to one batch pass. 256 | 257 | #### Relationship: 258 | - If dataset has 1,000 samples and batch size is 100: 259 | - 1 epoch = 10 iterations. 260 | 261 | --- 262 | ## What is the purpose of using dropout in neural networks? 263 | - **Dropout** is a regularization technique used to **reduce overfitting** by randomly "dropping out" (i.e., setting to zero) a fraction of neurons during training. 264 | 265 | #### How it works: 266 | - At each training step, neurons are randomly disabled with a certain **probability (dropout rate)**. 267 | - This prevents the network from becoming too reliant on any one neuron and encourages redundancy. 268 | 269 | #### Effects: 270 | - Improves generalization. 271 | - Acts like training many smaller networks and averaging them at test time. 
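
The dropout mechanism described above can be sketched as follows. This shows the "inverted dropout" variant, in which surviving activations are scaled up during training so that nothing needs rescaling at test time; the function name `dropout` and its parameters are hypothetical, not a specific framework's API.

```python
import random


def dropout(activations, rate, training=True, rng=None):
    """Inverted dropout: zero each activation with probability `rate` during
    training and scale survivors by 1 / (1 - rate); identity at inference."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)                       # seeded only for reproducibility
hidden = [1.0] * 10_000                      # toy layer output (assumed)
train_out = dropout(hidden, rate=0.5, rng=rng)
test_out = dropout(hidden, rate=0.5, training=False)
```

Because survivors are scaled by `1 / (1 - rate)`, the expected activation is the same in training and inference, which is why the layer can simply be switched off at test time.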
272 | 273 | --- 274 | ## What is the role of the loss function in training neural networks? 275 | - The **loss function** measures how well the neural network’s predictions match the true target values. 276 | 277 | #### Purpose: 278 | - Guides training by quantifying the **prediction error**. 279 | - During **backpropagation**, the gradient of the loss function is computed with respect to model weights. 280 | - These gradients are then used by the optimizer to **update weights** and minimize the loss over time. 281 | 282 | #### Common examples: 283 | - **Mean Squared Error (MSE)** for regression. 284 | - **Cross-Entropy Loss** for classification. 285 | --- 286 | ## What is the difference between a convolutional layer and a fully connected (dense) layer? 287 | 288 | - A **convolutional layer** applies filters (kernels) that **slide over input data**, capturing spatial or local patterns (e.g., edges, textures). 289 | - A **fully connected (dense) layer** connects **every input neuron to every output neuron**, capturing global relationships. 290 | 291 | #### Convolutional Layer: 292 | - Maintains **spatial structure** (height, width). 293 | - Fewer parameters due to **weight sharing**. 294 | - Common in image and signal processing. 295 | 296 | #### Fully Connected Layer: 297 | - **Flattens input** and processes it as a 1D vector. 298 | - Large number of parameters. 299 | - Typically used in **final classification layers** of a neural network. 300 | 301 | #### Key Difference: 302 | - Convolutional: local feature detection. 303 | - Dense: global pattern integration. 304 | 305 | --- 306 | ## What is transfer learning, and why is it useful in deep learning? 307 | 308 | - **Transfer learning** is a technique where a model trained on one task is **reused or fine-tuned** on a **related but different task**. 309 | 310 | #### How it works: 311 | - A model (e.g., trained on ImageNet) learns general features (like edges, textures). 
312 | - These learned weights are reused for a new task (e.g., medical imaging) by: 313 | - **Freezing** earlier layers. 314 | - **Fine-tuning** later layers on the new dataset. 315 | 316 | #### Benefits: 317 | - **Reduces training time**. 318 | - **Requires less labeled data**. 319 | - Often achieves **better performance** on small datasets. 320 | 321 | --- 322 | ## What is early stopping, and how does it help during training? 323 | 324 | - **Early stopping** is a regularization technique that stops training when the model’s **performance on a validation set stops improving**, preventing overfitting. 325 | 326 | #### How it works: 327 | - Monitor validation loss (or another metric). 328 | - If it **doesn’t improve after a set number of epochs** (called "patience"), training is stopped early. 329 | 330 | #### Benefits: 331 | - **Prevents overfitting** by not over-training. 332 | - **Saves computation time** by avoiding unnecessary epochs. 333 | 334 | --- 335 | ## What are precision, recall, and F1-score, and how are they related? 336 | 337 | - **Precision**: Measures how many of the predicted positives are actually correct. 338 | `Precision = TP / (TP + FP)` 339 | 340 | - **Recall**: Measures how many actual positives were correctly identified. 341 | `Recall = TP / (TP + FN)` 342 | 343 | - **F1-score**: Harmonic mean of precision and recall — balances both metrics. 344 | `F1 = 2 * (Precision * Recall) / (Precision + Recall)` 345 | 346 | #### Relationship: 347 | - Precision favors **fewer false positives**. 348 | - Recall favors **fewer false negatives**. 349 | - F1-score is useful when you want a **balanced** measure and classes are **imbalanced**. 350 | 351 | --- 352 | ## What is gradient descent, and how is it used in training machine learning models? 353 | 354 | - **Gradient descent** is an optimization algorithm used to **minimize the loss function** by iteratively updating model parameters in the direction of the **negative gradient**. 
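
A minimal sketch of stepping against the gradient, assuming a one-dimensional toy objective `f(w) = (w - 3)^2` whose minimum is at `w = 3`. The function names, starting point, and learning rates are illustrative choices, not part of any library.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly move the parameter in the direction of the negative gradient."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        history.append(w)
    return w, history


def grad_f(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


# A moderate step size converges toward the minimum at w = 3.
w_final, history = gradient_descent(grad_f, w0=0.0, learning_rate=0.1, steps=100)

# A step size that is too large overshoots further on every update and diverges.
w_diverged, _ = gradient_descent(grad_f, w0=0.0, learning_rate=1.1, steps=20)
```

The same loop applies to a vector of weights; only the gradient computation changes.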
355 | 356 | #### How it works: 357 | - Compute the **gradient** of the loss with respect to model weights. 358 | - Update weights: 359 | `w := w - learning_rate * gradient` 360 | 361 | #### Role of Learning Rate: 362 | - The **learning rate (LR)** determines the **step size** at each update. 363 | - Too high → may overshoot the minimum. 364 | - Too low → slow convergence. 365 | 366 | #### Variants: 367 | - **Batch Gradient Descent**: uses the full dataset. 368 | - **Stochastic Gradient Descent (SGD)**: uses one sample at a time. 369 | - **Mini-batch Gradient Descent**: uses small batches (most common). 370 | 371 | --- 372 | ## What is the purpose of the softmax function in neural networks? 373 | 374 | - **Softmax** transforms a vector of raw outputs (logits) into a **probability distribution** over multiple classes. 375 | 376 | #### How it works: 377 | - Each output value is exponentiated and divided by the sum of all exponentiated outputs: 378 | `softmax(z_i) = exp(z_i) / Σ exp(z_j)` 379 | - Ensures all output values are in the range **(0, 1)** and sum to **1**. 380 | 381 | #### Use case: 382 | - Commonly used in the **output layer** of multi-class classification models. 383 | - Allows for **probabilistic interpretation** of model predictions. 384 | 385 | --- 386 | ## What is weight initialization, and why is it important in neural networks? 387 | 388 | - **Weight initialization** is the process of assigning **initial values** to the model’s weights before training begins. 389 | 390 | #### Why it matters: 391 | - Poor initialization can lead to **vanishing or exploding activations/gradients**. 392 | - Good initialization helps with **faster convergence** and more **stable training**. 393 | 394 | #### Common strategies: 395 | - **Random Initialization**: Common, but must be scaled properly. 396 | - **Xavier/Glorot Initialization**: For tanh activation, keeps variance stable across layers. 
397 | - **He Initialization**: Designed for ReLU, avoids vanishing gradients by scaling variance based on number of inputs. 398 | 399 | #### Goal: 400 | - Ensure that signals **flow well** through the network during the initial stages of training. 401 | 402 | --- 403 | ## What are common feature selection techniques in machine learning? 404 | 405 | - **Filter methods**: Use statistical tests to rank features by relevance. 406 | - Examples: Chi-squared test, ANOVA, correlation coefficient. 407 | 408 | - **Wrapper methods**: Use model performance to evaluate subsets of features. 409 | - Examples: Recursive Feature Elimination (RFE), forward/backward selection. 410 | 411 | - **Embedded methods**: Feature selection happens during model training. 412 | - Examples: Lasso (L1), tree-based models (feature importances). 413 | 414 | --- 415 | 416 | ## How do you handle class imbalance in classification problems? 417 | 418 | - **Resampling techniques**: 419 | - **Oversampling** (e.g., SMOTE) 420 | - **Undersampling** (e.g., RandomUnderSampler) 421 | 422 | - **Class weighting**: Assign higher weight to minority class in loss function. 423 | 424 | - **Use proper metrics**: 425 | - Don’t rely on accuracy. 426 | - Prefer precision, recall, F1-score, ROC-AUC. 427 | 428 | - **Algorithm choice**: Tree-based models often handle imbalance better. 429 | 430 | --- 431 | 432 | ## What metrics can you use beyond accuracy to evaluate classifiers? 433 | 434 | - **Precision, Recall, F1-score**: For imbalanced classes. 435 | - **ROC-AUC**: Measures model's ability to distinguish classes across thresholds. 436 | - **Log Loss**: Penalizes confident wrong predictions. 437 | - **Confusion Matrix**: Visualizes TP, FP, TN, FN. 438 | 439 | ✅ Use multiple metrics to understand performance holistically. 440 | 441 | ## What are common model interpretability tools? 442 | 443 | - **Feature importance**: Available in tree models (e.g., `feature_importances_` in scikit-learn). 
- **Permutation importance**: Measures drop in performance when a feature is shuffled.
- **SHAP (SHapley Additive exPlanations)**:
  - Model-agnostic
  - Explains individual predictions and global feature impact.
- **LIME**: Local surrogate models explain single predictions.

✅ Useful for trust, debugging, and communicating results.

## What are best practices for deploying ML models?

- **Preprocessing pipeline**:
  - Use `sklearn.pipeline.Pipeline` to combine scaling, feature selection, modeling.

- **Model versioning**:
  - Track model changes (e.g., with MLflow, DVC).

- **Validation before deployment**:
  - Cross-validation, holdout test sets, A/B testing in production.

- **Monitoring after deployment**:
  - Track model drift, input distributions, and prediction quality.

- **Automation**:
  - Use tools like Airflow or MLflow for reproducibility and retraining.

# Data preprocessing

## What is data preprocessing in machine learning?

- Data preprocessing is the step of **cleaning, transforming, and preparing raw data** before feeding it to a model.

#### Common steps:
- **Missing value handling**: fill (impute), drop, or flag
- **Feature scaling**: normalization, standardization
- **Encoding categorical variables**: one-hot, label encoding
- **Outlier detection/removal**
- **Text/image preprocessing**: tokenization, resizing, etc.

✅ The goal is to make data **consistent, clean, and suitable** for modeling.


## What is a pipeline in scikit-learn?

- A **Pipeline** is a way to chain preprocessing and modeling steps together so they run **sequentially and consistently**.

#### Benefits:
- Ensures **reproducibility**
- Prevents **data leakage**
- Simplifies **cross-validation and grid search**

#### Example:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Each step is a (name, estimator) pair; steps run in order.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# X_train and y_train are assumed to be an existing training split.
pipeline.fit(X_train, y_train)
```

## Why is it important to include preprocessing steps inside a pipeline?

- Including preprocessing in a pipeline ensures:
  - ✅ **No data leakage**: preprocessing is only fit on training data during cross-validation or grid search.
  - ✅ **Reproducibility**: the same transformations are applied consistently to train and test data.
  - ✅ **Clean code**: simplifies experimentation and reduces bugs.

#### Example:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# 'scaler' swaps the whole step; 'model__C' sets C on the 'model' step.
param_grid = {
    'scaler': [StandardScaler(), MinMaxScaler()],
    'model__C': [0.1, 1, 10]
}

# X_train and y_train are assumed to be an existing training split.
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
```

## What are two common methods for encoding categorical features, and when would you use each?

1. **One-Hot Encoding**:
   - Creates a new binary column for each category.
   - Use when:
     - The number of unique categories is **small**
     - The model **doesn’t assume ordinal relationships**
   - Works well with **tree-based models and linear models**.

2.
**Ordinal/Label Encoding**:
   - Converts categories to **integer values** (e.g., red → 0, green → 1).
   - Use only when:
     - There is a **meaningful order** to the categories (e.g., low < medium < high)
     - Or with **tree-based models** (they’re robust to the encoded values)

✅ Other advanced methods (less common in simple pipelines):
- **Target encoding / frequency encoding**
- **Hashing encoder**
- **Embeddings** (for deep learning)

#### Example:
```python
from sklearn.preprocessing import OneHotEncoder

# `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2;
# on older versions use OneHotEncoder(sparse=False) instead.
encoder = OneHotEncoder(sparse_output=False)
# X is assumed to be a pandas DataFrame with a 'color' column.
X_encoded = encoder.fit_transform(X[['color']])
```

## How to encode ordinal features?

- Use **Ordinal Encoding** if the feature has a **natural order** (e.g., Low < Medium < High).
- Keeps order information while representing categories numerically.

## What are some common strategies for handling missing values in a dataset?

#### 1. **Deletion**:
- Drop rows (`df.dropna()`) or columns (`df.dropna(axis=1)`) with missing values.
- Use when:
  - Missingness is minimal or random.
  - Dropping won't significantly harm model performance.

#### 2. **Imputation**:
- **Mean/Median/Mode Imputation**:
  - For numerical data → use mean or median.
  - For categorical data → use mode (most frequent).
- **Forward/Backward Fill**:
  - Fill with previous (`ffill`) or next (`bfill`) values, which is useful in time series.
- **Rolling/Window Average**:
  - Fill with average of a moving window.
- **Custom constant or zero**:
  - Risky unless zero is a meaningful value.

#### 3. **Model-Based Imputation**:
- Predict missing values using another model (e.g., k-NN, regression, or IterativeImputer in sklearn).

#### 4. **Indicator Columns**:
- Add a binary column indicating whether a value was missing.
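
Strategies 2 and 4 are often combined: impute the value and keep a flag recording that it was missing. The sketch below uses plain Python with `None` as the missing marker; `impute_with_indicator` and the `ages` column are hypothetical names for illustration. In a real project the fill value would be computed on the training split only and reused for validation and test data.

```python
from statistics import median


def impute_with_indicator(values):
    """Fill missing entries (None) with the median of the observed values
    and return a parallel 0/1 list marking which entries were missing."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


ages = [25, None, 40, 31, None]              # toy column (assumed)
filled, was_missing = impute_with_indicator(ages)
```

The indicator lets a model learn from the missingness itself, which matters when values are not missing at random.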
593 | 594 | #### Example (sklearn pipeline-friendly): 595 | ```python 596 | from sklearn.impute import SimpleImputer 597 | 598 | imputer = SimpleImputer(strategy='mean') 599 | X_filled = imputer.fit_transform(X) 600 | ``` 601 | 602 | # Feature Engineering 603 | 604 | ### What is feature engineering, and what are some examples? 605 | 606 | - **Feature engineering** is the process of **transforming raw data into features** that better represent the underlying patterns to improve model performance. 607 | 608 | #### Examples: 609 | - **From text**: tokenize, clean, embed (e.g., TF-IDF, word embeddings) 610 | - **From dates**: extract day, month, hour, weekday, time of day 611 | - **From numeric data**: log transforms, binning, interaction terms 612 | - **From categories**: encode frequency, combine rare levels, group hierarchies 613 | - **Domain-specific**: ratios (e.g., price per square foot), flags (e.g., is_weekend) 614 | 615 | ✅ Good feature engineering can significantly boost model accuracy — often more than changing the algorithm! 616 | 617 | ## Why is feature scaling important in machine learning, and when should you apply it? 618 | 619 | - **Feature scaling** transforms numeric features to a common scale **without distorting their relative differences**. 620 | - This is important because many ML algorithms assume features are on similar scales. 
621 | 622 | #### When to scale: 623 | ✅ Required for: 624 | - Distance-based models (e.g., KNN, SVM, KMeans) 625 | - Gradient descent–based models (e.g., logistic regression, neural nets) 626 | 627 | 🚫 Not needed for: 628 | - Tree-based models (e.g., Random Forest, XGBoost) — they’re scale-invariant 629 | 630 | #### Common methods: 631 | - **StandardScaler**: zero mean, unit variance (default for many) 632 | - **MinMaxScaler**: scales to [0, 1] 633 | - **RobustScaler**: uses median and IQR, useful for outliers 634 | 635 | #### Example: 636 | ```python 637 | from sklearn.preprocessing import StandardScaler 638 | 639 | scaler = StandardScaler() 640 | X_scaled = scaler.fit_transform(X) 641 | ``` 642 | ## What is dimensionality reduction, and why is it useful? 643 | 644 | - **Dimensionality reduction** is the process of reducing the number of input features (dimensions) while **preserving as much information as possible**. 645 | 646 | #### Why it’s useful: 647 | - Reduces **model complexity** 648 | - Helps combat the **curse of dimensionality** 649 | - Speeds up training 650 | - Improves visualization 651 | - Can reduce overfitting 652 | 653 | #### Common methods: 654 | - **PCA (Principal Component Analysis)**: linear technique that finds directions (components) of maximum variance 655 | - **t-SNE / UMAP**: non-linear techniques for visualization 656 | 657 | ✅ Especially useful for high-dimensional data like text vectors, images, or gene expression datasets. 658 | 659 | ## What are some common preprocessing steps when working with raw text data? 660 | 661 | 1. **Lowercasing** – makes text case-insensitive. 662 | 2. **Removing punctuation and stopwords** – cleans irrelevant or common filler words. 663 | 3. **Tokenization** – splits text into words or subwords. 664 | 4. **Stemming/Lemmatization** – reduces words to their base/root form. 665 | 5. 
**Vectorization** – converts text into numeric format: 666 | - **Bag of Words (CountVectorizer)** 667 | - **TF-IDF** 668 | - **Embeddings** (Word2Vec, GloVe, BERT) 669 | 670 | ✅ These steps prepare text for modeling by normalizing, reducing noise, and making it numerically representable. 671 | 672 | # Neural Networks (NN) 673 | 674 | ## What are the differences between CNNs and RNNs? 675 | 676 | - **CNNs (Convolutional Neural Networks)** are designed to process **spatial data**, such as images. 677 | - **RNNs (Recurrent Neural Networks)** are designed to process **sequential data**, such as time series or text. 678 | 679 | #### CNNs: 680 | - Use **filters (kernels)** to detect local patterns (edges, textures). 681 | - Good for handling **fixed-size inputs** with spatial hierarchy. 682 | - Layers include **convolution**, **pooling**, and **fully connected** layers. 683 | 684 | #### RNNs: 685 | - Use **recurrent connections** to maintain information across time steps. 686 | - Ideal for **sequences** where context matters (e.g., text, audio). 687 | - Variants include **LSTM** and **GRU** to address vanishing gradients. 688 | 689 | #### Key Difference: 690 | - CNNs focus on **spatial relationships**. 691 | - RNNs focus on **temporal or sequential relationships**. 692 | 693 | --- 694 | ## What are LSTM networks, and how do they improve over standard RNNs? 695 | 696 | - **LSTM (Long Short-Term Memory)** networks are a type of RNN designed to **capture long-term dependencies** in sequential data. 697 | 698 | #### Key Components: 699 | - **Cell state**: Acts as memory that runs through the sequence. 700 | - **Gates**: Control the flow of information: 701 | - **Forget gate**: Decides what to discard. 702 | - **Input gate**: Decides what new information to store. 703 | - **Output gate**: Controls what gets output. 704 | 705 | #### Improvements over RNNs: 706 | - **Mitigate vanishing gradient** problem. 707 | - **Maintain context** over longer sequences. 
708 | - More effective for tasks like language modeling, time series forecasting, and speech recognition. 709 | 710 | --- 711 | 712 | ### What is a CNN (Convolutional Neural Network)? 713 | 714 | A **CNN** is a type of deep neural network designed to work with **grid-like data**, such as images. 715 | 716 | It automatically learns **spatial features** (like edges, textures, shapes) by using special layers instead of fully connected ones. 717 | 718 | --- 719 | 720 | ### 🔧 Main Components of a CNN: 721 | 722 | #### 1. 🧱 Convolutional Layer 723 | - Applies **filters (kernels)** that slide over the image. 724 | - Each filter detects a specific **pattern** (e.g., edge, corner). 725 | - Output is a **feature map**. 726 | 727 | #### 2. 🧼 Activation Function (usually ReLU) 728 | - Adds **non-linearity** to the output of convolutions. 729 | 730 | #### 3. 🧽 Pooling Layer (e.g., Max Pooling) 731 | - **Downsamples** the feature maps to reduce size and computation. 732 | - Helps the network focus on **important features**. 733 | 734 | #### 4. 🧠 Fully Connected Layer (Dense) 735 | - Flattens the feature maps and passes them to one or more dense layers for final **classification or regression**. 736 | 737 | #### 5. 📊 Output Layer 738 | - Produces the final prediction (e.g., class probabilities using softmax). 739 | 740 | --- 741 | 742 | ### 🖼️ Summary: 743 | 744 | CNNs work like a visual processing pipeline: 745 | ``Input Image → Convolutions → Activations → Pooling → Dense Layers → Output`` 746 | 747 | ✅ Great for: image classification, object detection, medical imaging, etc. 748 | 749 | 750 | # Transformers 751 | 752 | ### What problem were Transformers originally designed to solve, and what made them different from RNNs? 753 | - Transformers were introduced in the paper **“Attention Is All You Need” (Vaswani et al., 2017)** for **machine translation**. 
754 | - Unlike RNNs, which process input **sequentially**, Transformers process **entire sequences in parallel** using **self-attention**. 755 | 756 | #### Key differences from RNNs: 757 | - **No recurrence**: uses **self-attention** to model relationships between tokens, regardless of position. 758 | - **Faster training**: enables parallelization across time steps. 759 | - **Better long-range dependency modeling**: any two tokens are connected directly, which avoids the vanishing gradients and sequential bottlenecks of recurrence. 760 | 761 | ✅ The Transformer architecture forms the basis for BERT, GPT, T5, and more. 762 | 763 | --- 764 | 765 | ### What is self-attention in Transformers, and why is it important? 766 | 767 | - **Self-attention** allows the model to assign **importance scores (weights)** to different tokens in a sequence **relative to each other**. 768 | - This lets the model capture **contextual relationships**, regardless of distance in the input. 769 | 770 | - For each token, self-attention computes: 771 | - How **important** every other token is to it. 772 | - This is done by calculating **attention scores** between all token pairs. 773 | 774 | #### How it works: 775 | - For each token, the model computes: 776 | - A **query**, **key**, and **value** vector. 777 | - The attention output is computed as: 778 | `Attention(Q, K, V) = softmax(QKᵀ / √d_k) * V`, where `d_k` is the key dimension and the softmax term gives the attention weights. 779 | 780 | - This mechanism enables: 781 | - **Parallelism** (unlike RNNs) 782 | - Modeling of **long-range dependencies** 783 | - Dynamic focus on **relevant context** (e.g., in translation: "bank" in “river bank” vs “credit bank”) 784 | 785 | ✅ Self-attention is the heart of the Transformer architecture. 786 | 787 | ### What is positional encoding in Transformers, and why is it needed? 788 | 789 | - **Positional encoding** provides information about the **order of tokens** in a sequence. 790 | - Transformers have **no recurrence or convolution**, so they need **explicit position signals** to understand word order.
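
To make the need for position signals concrete, here is a minimal NumPy sketch. It assumes a simplified self-attention in which queries, keys, and values are the token vectors themselves (no learned projections), and the `positions` array is a random, hypothetical stand-in for a real positional encoding.

```python
import numpy as np

def self_attention(x):
    # Simplified self-attention: Q = K = V = x (no learned projections).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))      # 4 tokens, 8-dimensional embeddings
positions = rng.normal(size=(4, 8))   # hypothetical position vectors
perm = np.array([2, 0, 3, 1])         # a reordering of the 4 tokens

# Without position signals: shuffling the tokens only shuffles the outputs.
out = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])
order_ignored = np.allclose(out_shuffled, out[perm])

# With position vectors added: the same shuffle now changes the result.
out_pos = self_attention(tokens + positions)
out_pos_shuffled = self_attention(tokens[perm] + positions)
order_matters = not np.allclose(out_pos_shuffled, out_pos[perm])

print(order_ignored, order_matters)
```

The first check shows that plain self-attention treats its input as an unordered set; the second shows that adding a per-position vector makes the output depend on where each token sits.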
791 | 792 | #### Why it matters: 793 | - In language, word order affects meaning (e.g., “dog bites man” vs “man bites dog”). 794 | 795 | #### How it's implemented: 796 | - Uses fixed or learned vectors added to the token embeddings. 797 | - Original Transformer used **sinusoidal encodings**: 798 | - For each position \( pos \) and dimension \( i \): 799 | - \( PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d}) \) 800 | - \( PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d}) \) 801 | 802 | ✅ Without positional encoding, the model would treat all tokens as unordered. 803 | 804 | ### What do the encoder and decoder do in a Transformer? 805 | 806 | Think of it like **a translator**: 807 | 808 | #### 🧠 Encoder = Understands the input 809 | - Reads the input sentence (e.g., in English). 810 | - Figures out what the sentence **means**. 811 | - It builds a **contextual summary** of the input. 812 | 813 | #### 🗣️ Decoder = Writes the output 814 | - Uses what the encoder understood. 815 | - Starts generating the translated sentence (e.g., in French), **word by word**. 816 | - It uses: 817 | - What it has **already generated** 818 | - What the encoder said the input **meant** 819 | 820 | ✅ Together: 821 | - The encoder **reads and understands**. 822 | - The decoder **writes a response** using that understanding. -------------------------------------------------------------------------------- /MLE_Interview_QnA/PythonQnA.md: -------------------------------------------------------------------------------- 1 | # Python for DS/MLE Interview QnA 2 | I'm preparing for Data Science & MLE interviews. I want to use this chat to go through Q&A flashcards only (no deep implementation). Focus on interview-style questions and concise, accurate answers. 3 | 4 | 5 | ## OOP concepts 6 | 7 | ### What is encapsulation in OOP? 8 | - Encapsulation is the bundling of data (attributes) and methods that operate on that data into a single unit (a class). 
9 | - It also refers to **restricting direct access** to some of an object’s components, usually by naming conventions. 10 | 11 | #### In Python: 12 | - Prefixing with a single underscore `_var` is a **convention** indicating internal use. 13 | - Prefixing with double underscores `__var` triggers **name mangling**, making access more difficult from outside the class. 14 | 15 | #### Example: 16 | ```python 17 | class MyClass: 18 | def __init__(self): 19 | self._internal = "protected" 20 | self.__private = "private" 21 | 22 | obj = MyClass() 23 | print(obj._internal) # accessible, but intended as protected 24 | print(obj._MyClass__private) # name-mangled, not easily accessed 25 | ``` 26 | 27 | 28 | --- 29 | ### What is inheritance in OOP? 30 | - Inheritance allows a class (child) to **inherit attributes and methods** from another class (parent). 31 | - It promotes code reuse and supports hierarchical relationships. 32 | 33 | #### Types: 34 | - **Single Inheritance**: One child, one parent 35 | - **Multiple Inheritance**: One child, multiple parents 36 | - **Multilevel Inheritance**: Inheritance chain across multiple classes 37 | 38 | #### Example: 39 | ```python 40 | class Animal: 41 | def speak(self): 42 | return "Some sound" 43 | 44 | class Dog(Animal): 45 | def speak(self): 46 | return "Bark" 47 | 48 | d = Dog() 49 | print(d.speak()) # Output: Bark 50 | ``` 51 | 52 | 53 | --- 54 | ### What is polymorphism in OOP? 55 | - Polymorphism allows different classes to implement the same method interface in different ways. 56 | - It enables writing code that works on objects of different types as long as they implement the expected behavior. 
57 | 58 | #### Example: 59 | ```python 60 | class Dog: 61 | def speak(self): 62 | return "Bark" 63 | 64 | class Cat: 65 | def speak(self): 66 | return "Meow" 67 | 68 | def animal_sound(animal): 69 | print(animal.speak()) 70 | 71 | animal_sound(Dog()) # Bark 72 | animal_sound(Cat()) # Meow 73 | ``` 74 | 75 | 76 | --- 77 | ### What is abstraction in OOP? 78 | - Abstraction lets you define a blueprint (interface) for a group of related classes while hiding the implementation details. 79 | - It helps enforce a contract: every subclass **must implement** certain methods. 80 | 81 | #### Real-world Example: 82 | Imagine building a payment system that can support different payment methods (e.g., PayPal, Credit Card), but the caller shouldn't care how each works internally. 83 | 84 | ```python 85 | from abc import ABC, abstractmethod 86 | 87 | class PaymentMethod(ABC): 88 | @abstractmethod 89 | def pay(self, amount): 90 | pass 91 | 92 | class CreditCard(PaymentMethod): 93 | def pay(self, amount): 94 | print(f"Paying ${amount} using Credit Card.") 95 | 96 | class PayPal(PaymentMethod): 97 | def pay(self, amount): 98 | print(f"Paying ${amount} using PayPal.") 99 | 100 | def checkout(payment_method: PaymentMethod, amount: float): 101 | payment_method.pay(amount) 102 | 103 | checkout(CreditCard(), 100) # Paying $100 using Credit Card. 104 | checkout(PayPal(), 50) # Paying $50 using PayPal. 105 | ``` 106 | - The PaymentMethod is abstract: it defines a method pay() but does not implement it. 107 | - Subclasses (CreditCard, PayPal) provide their own implementations. 108 | - The caller (checkout) doesn’t care which payment method is used — that’s abstraction. 109 | 110 | --- 111 | ## Python basics 112 | 113 | ### What is the difference between a function and a method in Python? 114 | - A **function** is an independent block of code defined using `def` and not tied to any object. 
115 | - A **method** is a function that is associated with an object (usually defined within a class) and takes `self` or `cls` as the first parameter. 116 | 117 | --- 118 | ### What are the `self` and `cls` parameters in Python? 119 | - `self` refers to the instance of the class and is used in **instance methods** to access or modify object attributes. 120 | - `cls` refers to the class itself and is used in **class methods** to access or modify class-level data. 121 | 122 | --- 123 | ### What’s the difference between `@classmethod` and `@staticmethod`? 124 | `@classmethod` receives the class (`cls`) as the first argument and can modify class state. 125 | `@staticmethod` receives no implicit first argument and behaves like a regular function inside the class. 126 | 127 | #### Example: 128 | ```python 129 | class Counter: 130 | count = 0 # class-level attribute 131 | 132 | def __init__(self): 133 | # instance method using `self` 134 | self.id = Counter.count 135 | Counter.increment() 136 | 137 | @classmethod 138 | def increment(cls): 139 | # class method using `cls` 140 | cls.count += 1 141 | ``` 142 | 143 | --- 144 | ### What is the difference between `is` and `==` in Python? 145 | - `==` checks for **value equality** — whether two objects have the same contents. 146 | - `is` checks for **object identity** — whether two references point to the **same object in memory**. 147 | 148 | #### Example: 149 | ```python 150 | a = [1, 2, 3] 151 | b = [1, 2, 3] 152 | 153 | a == b # True: same values 154 | a is b # False: different objects 155 | 156 | c = a 157 | a is c # True: same object 158 | ``` 159 | 160 | --- 161 | ### What are decorators in Python? 162 | - Decorators are functions that modify the behavior of other functions or methods without changing their code. 163 | - They are often used for **logging**, **access control**, **timing**, and **caching**. 
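
One of the use cases listed above, timing, can be sketched as follows. This is an illustrative example only; the names `timed` and `add` are hypothetical. It also uses `functools.wraps` so the decorated function keeps its original name and docstring, and `*args, **kwargs` so the wrapper works for any signature.

```python
import functools
import time

def timed(func):
    @functools.wraps(func)  # preserve func.__name__ and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result  # pass the original return value through
    return wrapper

@timed
def add(a, b):
    return a + b

total = add(2, 3)
```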
164 | 165 | ```python 166 | def my_decorator(func): 167 | def wrapper(): 168 | print("Before function call") 169 | func() 170 | print("After function call") 171 | return wrapper 172 | 173 | @my_decorator 174 | def say_hello(): 175 | print("Hello!") 176 | 177 | say_hello() 178 | ``` 179 | ```text 180 | Before function call 181 | Hello! 182 | After function call 183 | ``` 184 | 185 | --- 186 | ### What is a thread in Python? 187 | - A thread is the smallest unit of execution within a process. 188 | - Multiple threads in a process share the same memory and resources. 189 | - Threads are useful for performing tasks concurrently, especially **I/O-bound operations**. 190 | - In Python (CPython), due to the Global Interpreter Lock (GIL), threads cannot achieve true parallelism for CPU-bound tasks. 191 | 192 | #### Common use cases: 193 | - Handling multiple client connections in a server 194 | - Performing background tasks (e.g., logging, downloading) 195 | - Running I/O operations without blocking the main program 196 | 197 | --- 198 | ### What is the difference between multithreading and multiprocessing in Python? 199 | 200 | - **Multithreading** uses multiple threads within a single process. 201 | - Threads share the same memory space. 202 | - Limited by the Global Interpreter Lock (GIL) in CPython, so it's not ideal for CPU-bound tasks. 203 | - Best suited for I/O-bound tasks (e.g., file or network operations). 204 | 205 | - **Multiprocessing** uses separate processes. 206 | - Each process has its own Python interpreter and memory space. 207 | - Avoids the GIL, so it's better for CPU-bound tasks (e.g., data processing, model training). 208 | - More memory-intensive than threads. 
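
The I/O-bound case described above can be sketched with the standard library's `concurrent.futures`. This is a minimal illustration, with `fake_download` as a hypothetical stand-in (a `time.sleep`) for a blocking network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(i):
    time.sleep(0.2)  # stand-in for a blocking network call
    return i * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in input order, even if tasks finish out of order
    results = list(pool.map(fake_download, range(4)))
elapsed = time.perf_counter() - start

print(results, f"{elapsed:.2f}s")
```

Because the threads spend their time waiting rather than computing, the four calls overlap and the total is close to one sleep rather than four. For CPU-bound work, `ProcessPoolExecutor` from the same module exposes the same `map` interface but runs tasks in separate processes.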
209 | 210 | 211 | #### Summary: 212 | | Feature | Multithreading | Multiprocessing | 213 | |------------|---------------------------|-----------------------| 214 | | Memory | Shared | Separate | 215 | | GIL Impact | Yes (no parallel CPU work) | No (true parallelism) | 216 | | Use Case | I/O-bound tasks | CPU-bound tasks | 217 | | Overhead | Lower | Higher | 218 | 219 | ```python 220 | import threading 221 | 222 | def task(): 223 | print("Thread running") 224 | 225 | t = threading.Thread(target=task) 226 | t.start() # begin running task() in a new thread 227 | t.join() # wait for the thread to finish 228 | ``` 229 | 230 | --- 231 | ### What are `async` and `await` in Python? 232 | - `async` and `await` are used to write asynchronous, non-blocking code using coroutines. 233 | - Useful for **I/O-bound** tasks like network calls, file operations, or APIs where waiting would otherwise block the program. 234 | 235 | #### Key Concepts: 236 | - `async def` defines a **coroutine function**. 237 | - `await` pauses the coroutine until the awaited task completes, letting the event loop run other tasks in the meantime. 238 | 239 | #### Example: 240 | ```python 241 | import asyncio 242 | 243 | async def say_hello(): 244 | print("Hello") 245 | await asyncio.sleep(1) 246 | print("World") 247 | 248 | asyncio.run(say_hello()) 249 | ``` 250 | 251 | --- 252 | ### What is a simple example of a generator function in Python? 253 | - A generator function uses the `yield` keyword to return values one at a time. 254 | - It produces values lazily and maintains state between calls. 255 | 256 | #### Example: 257 | ```python 258 | def count_up_to(n): 259 | i = 1 260 | while i <= n: 261 | yield i 262 | i += 1 263 | 264 | gen = count_up_to(3) 265 | print(next(gen)) # 1 266 | print(next(gen)) # 2 267 | print(next(gen)) # 3 268 | ``` 269 | 270 | --- 271 | ### What is the difference between list comprehensions and generator expressions in Python? 272 | - **List comprehensions** return a full list in memory. 273 | - **Generator expressions** return an iterator that yields items lazily (one at a time).
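
The memory difference can be observed directly. A small sketch using only the standard library (the exact byte counts vary by Python version, so none are shown):

```python
import sys

n = 100_000
squares_list = [x * x for x in range(n)]   # all values stored at once
squares_gen = (x * x for x in range(n))    # values produced on demand

list_bytes = sys.getsizeof(squares_list)   # grows with n
gen_bytes = sys.getsizeof(squares_gen)     # small and independent of n

total = sum(squares_gen)       # consumes the generator
leftover = list(squares_gen)   # a generator can only be iterated once

print(list_bytes, gen_bytes, leftover)
```

Note that `sys.getsizeof` reports only the container's own size, not the integers it references, so the real gap is larger still. The empty `leftover` list shows the trade-off: a generator is exhausted after one pass, while a list can be reused.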
274 | 275 | #### Key Differences: 276 | | Feature | List Comprehension | Generator Expression | 277 | |--------------|----------------------------|-----------------------------------| 278 | | Syntax | `[x for x in iterable]` | `(x for x in iterable)` | 279 | | Memory usage | Loads all items at once | Generates items on demand | 280 | | Performance | Faster for small data sets | Better for large or infinite data | 281 | | Return type | `list` | `generator` | 282 | 283 | #### Example: 284 | ```python 285 | # List comprehension 286 | squares = [x**2 for x in range(5)] # [0, 1, 4, 9, 16] 287 | 288 | # Generator expression 289 | squares_gen = (x**2 for x in range(5)) # use next() or a loop to consume 290 | ``` 291 | 292 | --- 293 | ### What is Pydantic and why is it used? 294 | - Pydantic is a Python library for **data validation and settings management** using Python type hints. 295 | - It ensures that input data matches specified types and structures, raising clear validation errors when it doesn't. 296 | - Commonly used with FastAPI and other data pipelines to validate and parse structured data. 297 | 298 | #### Example: 299 | ```python 300 | from pydantic import BaseModel 301 | 302 | class User(BaseModel): 303 | name: str 304 | age: int 305 | 306 | user = User(name="Alice", age=30) # valid 307 | user = User(name="Bob", age="30") # also valid, auto-converts string to int 308 | 309 | ``` 310 | 311 | --- 312 | ### How does Pydantic handle nested models? 313 | - Pydantic supports nested models by allowing one `BaseModel` to be used as a field in another. 314 | - It automatically validates and parses the nested structure. 
315 | 316 | #### Example: 317 | ```python 318 | from pydantic import BaseModel 319 | 320 | class Address(BaseModel): 321 | city: str 322 | zip_code: str 323 | 324 | class User(BaseModel): 325 | name: str 326 | address: Address 327 | 328 | user = User(name="Alice", address={"city": "New York", "zip_code": "10001"}) 329 | ``` 330 | 331 | --- 332 | ### What is the difference between a NumPy array and a pandas DataFrame? 333 | - A **NumPy array** is a multidimensional, fixed-type array for numerical computation. 334 | - A **1D NumPy array** is conceptually similar to a **mathematical vector** 335 | - A NumPy array can also be **2D (matrices)** or **n-dimensional (tensors)**. 336 | - A **pandas DataFrame** is a 2D labeled data structure with columns that can have different data types. 337 | 338 | #### Summary: 339 | | Feature | NumPy array | pandas DataFrame | 340 | |--------------|---------------------------|---------------------------| 341 | | Data type | Homogeneous | Heterogeneous | 342 | | Labels | None (index-based) | Row and column labels | 343 | | Use case | Numerical computation | Tabular data manipulation | 344 | | Dependencies | Core scientific computing | Built on top of NumPy | 345 | 346 | ### What are vectorized operations in NumPy? 347 | - Vectorized operations apply functions to entire arrays without using explicit loops. 348 | - They are faster and more efficient because they use optimized C code under the hood. 349 | - This is a key advantage of NumPy over native Python lists. 350 | 351 | #### Example: 352 | ```python 353 | import numpy as np 354 | 355 | a = np.array([1, 2, 3]) 356 | b = np.array([4, 5, 6]) 357 | 358 | c = a + b # vectorized addition: [5, 7, 9] 359 | ``` 360 | --- 361 | 362 | ## Pandas 363 | 364 | ### What’s the difference between `.loc[]` and `.iloc[]` in pandas? 365 | - `.loc[]` is **label-based** indexing: access rows/columns by names or boolean masks. 
366 | - `.iloc[]` is **position-based** indexing: access rows/columns by integer positions. 367 | 368 | #### Example: 369 | ```python 370 | import pandas as pd 371 | 372 | df = pd.DataFrame({ 373 | "name": ["Alice", "Bob"], 374 | "age": [25, 30] 375 | }, index=["a", "b"]) 376 | 377 | df.loc["a"] # Access row with label "a" 378 | df.iloc[0] # Access first row by position 379 | ``` 380 | 381 | --- 382 | ### What is the difference between `df.apply()` and `df.map()` in pandas? 383 | - `Series.map()` applies a function **element-wise** to a Series (usually one column). Since pandas 2.1, `DataFrame.map()` also exists and applies a function element-wise to every cell (it replaces `applymap()`). 384 | - `apply()` works on **both Series and DataFrames** and can apply a function **row-wise or column-wise**. 385 | 386 | #### Example of `map()` and `apply()`: 387 | ```python 388 | df["col"].map(lambda x: x * 2) # element-wise on a Series 389 | df["col"].apply(lambda x: x * 2) # same result on a Series 390 | df.apply(sum, axis=0) # sum of each column 391 | df.apply(sum, axis=1) # sum of each row 392 | ``` 393 | 403 | --- 404 | ### What are window functions in pandas and how do you use them? 405 | - **Window functions** perform operations over a sliding window of rows, commonly used for **rolling statistics** or **rankings**.
406 | 407 | #### Types of window functions: 408 | - `rolling()` – fixed-size moving window 409 | - `expanding()` – growing window 410 | - `ewm()` – exponentially weighted window 411 | - `rank()`, `cumsum()`, `shift()` – cumulative/transform window functions 412 | 413 | 414 | #### Example (Rolling Mean): 415 | ```python 416 | # assumes an existing DataFrame `df` with a "sales" column | df["rolling_avg"] = df["sales"].rolling(window=3).mean() 417 | ``` 418 | 419 | ```python 420 | import pandas as pd 421 | 422 | 423 | # Sample data 424 | data = { 425 | "date": pd.date_range(start="2024-01-01", periods=10, freq='D'), 426 | "sales": [100, 110, 90, 120, 130, 125, 140, 135, 150, 160] 427 | } 428 | df = pd.DataFrame(data) 429 | 430 | # Rolling window: moving average over 3 days (first 2 rows are NaN) 431 | df["rolling_mean"] = df["sales"].rolling(window=3).mean() 432 | 433 | # Expanding window: cumulative mean from the start 434 | df["expanding_mean"] = df["sales"].expanding().mean() 435 | 436 | # Exponentially weighted mean with span=3 437 | df["ewm_mean"] = df["sales"].ewm(span=3, adjust=False).mean() 438 | 439 | # Rank of sales 440 | df["rank"] = df["sales"].rank() 441 | 442 | # Cumulative sum 443 | df["cumsum"] = df["sales"].cumsum() 444 | 445 | # Shifted sales (previous day's sales) 446 | df["prev_day_sales"] = df["sales"].shift(1) 447 | ``` 448 | 449 | --- 450 | ### What does the `pivot()` function do in pandas, and how is it different from `melt()`? 451 | 452 | - `pivot()` **reshapes data from long to wide format**, turning unique values from one column into new column headers. 453 | - `melt()` **reshapes data from wide to long format**, unpivoting column headers into a single column. 454 | 455 | #### Example: `pivot()` and `melt()` 456 | ```python 457 | df.pivot(index="date", columns="product", values="sales") # long → wide 458 | df.melt(id_vars="date", value_vars=["A", "B"]) # wide → long 459 | ``` 460 | The `melt()` call turns columns A and B into row entries under a new `variable` column, with their values in a `value` column.
461 | In short: 462 | - pivot → wide format (more columns) 463 | - melt → long format (more rows) 464 | 465 | --- 466 | ### What is the difference between shallow copy and deep copy in Python? 467 | - A **shallow copy** creates a new object but **copies references** to the original objects inside it. 468 | - A **deep copy** creates a new object and **recursively copies all nested objects**, so they are fully independent. 469 | 470 | #### Example: 471 | ```python 472 | import copy 473 | 474 | original = [[1, 2], [3, 4]] 475 | shallow = copy.copy(original) 476 | deep = copy.deepcopy(original) 477 | 478 | original[0][0] = 99 479 | 480 | print(shallow[0][0]) # 99 (affected) 481 | print(deep[0][0]) # 1 (unchanged) 482 | ``` 483 | 484 | --- 485 | ### What are Python context managers used for, and how do you create a custom one? 486 | 487 | - Context managers handle **setup and cleanup** actions automatically, often used with the `with` statement. 488 | - Common use cases: managing **files**, **database connections**, **locks**, or **temporary resources**. 489 | 490 | #### Example (built-in): 491 | ```python 492 | with open("file.txt", "r") as f: 493 | data = f.read() 494 | ``` -------------------------------------------------------------------------------- /MLE_Interview_QnA/case_study_company_A.md: -------------------------------------------------------------------------------- 1 | # Company A Questions and Answers (beta) 2 | 3 | ### Machine Learning 4 | 1. Describe Supervised, unsupervised, semi supervised learning with examples. 5 | 2. How would you use a model trained with only a few examples that has not optimal scores in order to train an unsupervised model for a similar task? 6 | 7 | One approach to using a model trained on a small dataset to train an unsupervised model for a similar task would be to use the model as a feature extractor. 
A feature extractor is a model that is trained to extract useful features from input data, which can then be used as input to an unsupervised model. Here's how you might use a small, poorly performing supervised model as a feature extractor: 8 | 9 | * Train the small, supervised model on the available labeled data. 10 | * Extract the hidden layer activations of the model when it is presented with new, unlabeled data. These activations can be thought of as a representation of the input data in the feature space of the model. 11 | * Use the extracted features as input to an unsupervised model, such as a clustering algorithm or an autoencoder. 12 | * Train the unsupervised model on the extracted features. 13 | * Using the small, supervised model as a feature extractor can allow you to leverage the knowledge learned by the model, even if it is not performing well on the original task. The unsupervised model can then use the extracted features to discover patterns or relationships in the data that might not be apparent from the raw input data. 14 | 15 | It is important to keep in mind that this approach will only work if the supervised model is learning useful features that are relevant to the task at hand. If the model is not learning useful features, the extracted features may not be helpful for training the unsupervised model. 16 | 17 | 3. Is accuracy a good measure to evaluate a binary classification model? 18 | 19 | Accuracy is a commonly used metric to evaluate the performance of a binary classification model, but it is not always the best metric to use. Accuracy is defined as the number of correct predictions made by the model divided by the total number of predictions made. It is a simple and straightforward metric to compute, but it can be misleading if the classes in the dataset are imbalanced. 
For example, if the dataset contains 99% negative examples and 1% positive examples, a model that always predicts negative would have an accuracy of 99%, even though it is not making any useful predictions. In cases where the classes are imbalanced, it is often more informative to use metrics that take into account the class distribution, such as precision, recall, and the F1 score. These metrics give a more detailed view of the model's performance on each class. It is important to consider the context in which the model will be used when choosing an evaluation metric. For example, if the model is being used to predict whether a patient has a certain disease, false negatives (predictions of "no disease" when the patient actually has the disease) may be more concerning than false positives, in which case recall might be the more important metric to focus on. 20 | 21 | 4. Given a dataset with thousands of features, how do you decide which features are most useful to keep when training a model? 22 | 5. Explain how SVM works. 23 | 6. Explain L1 and L2 regularization. 24 | 7. How can you avoid overfitting? 25 | 8. What is the purpose of the activation function? 26 | 9. Given 2 sets of inputs x and y, calculate the output of the ReLU and sigmoid activation functions. 27 | 28 | 29 | The rectified linear unit (ReLU) activation function is defined as f(x) = max(0, x), so for input x=0.5, the output of the ReLU function would be 0.5. 30 | The sigmoid activation function is defined as f(x) = 1 / (1 + e^(-x)), so for input y=2, the output of the sigmoid function would be approximately 0.8808. 31 | 32 | 10. Where should we use the sigmoid and where the ReLU activation function? 33 | 34 | The ReLU activation function is typically used in the hidden layers of a neural network, as it helps to introduce non-linearity and alleviate the vanishing gradient problem.
ReLU is also computationally efficient, as it only requires a simple threshold operation. It is commonly used in image recognition, speech recognition, and natural language processing tasks.

On the other hand, the sigmoid activation function is mostly used in the output layer of a binary classification neural network. The sigmoid function outputs a value between 0 and 1, which can be interpreted as the probability of an instance belonging to the positive class. For the same reason it is used in logistic regression, where we want to predict the probability of success.

It's important to note that ReLU outputs zero for every negative input, so a unit whose inputs are mostly negative receives no gradient and can stop learning (the "dying ReLU" problem). In these cases, Leaky ReLU and ELU are some of the alternatives.

11. Explain in as much detail as possible the pipeline for training a text classification model (text pre-processing, feature extraction, model selection, training process).
12. Let's say you need to create a recommendation system for online ads for a site's users. The training dataset consists of user preference and label data. Explain what type of model you will use and how you will train it.

There are several types of models that could be used to create a recommendation system for online ads, and the choice of model will depend on the specific requirements of the system and the characteristics of the training dataset. Here are some potential approaches:

* Collaborative filtering: Collaborative filtering is a method of making recommendations based on the preferences of similar users. This approach would involve training a model that takes as input a user's past ad clicks and outputs a list of recommended ads. The model could be trained using a matrix factorization algorithm, such as singular value decomposition (SVD), or a neural network.

* Content-based filtering: Content-based filtering is a method of making recommendations based on the characteristics of the items being recommended. This approach would involve training a model that takes as input the characteristics of an ad (e.g., its category, title, and description) and outputs a score indicating how relevant the ad is to a given user. The model could be trained using a supervised learning algorithm, such as a support vector machine (SVM) or a decision tree.

* Hybrid approach: It is also possible to combine the above approaches by using a hybrid model that combines collaborative filtering and content-based filtering. This could involve training separate models for collaborative filtering and content-based filtering and combining their outputs to generate recommendations.

To train the model, you would need to split the dataset into a training set and a validation set. The model would then be trained on the training set and evaluated on the validation set to determine its performance. The model's hyperparameters (e.g., the learning rate, the size of the hidden layers) could be tuned using techniques such as grid search or random search to find the best combination of hyperparameters for the task at hand. Once the model has been trained and validated, it can be deployed in the recommendation system.

13. Let's say that you want to train a Naive Bayes model as a baseline for the online ads recommendation system. How would you train a Naive Bayes model if your data consists of millions of classes?

Training a Naive Bayes model with millions of classes can be challenging because the model keeps a separate set of per-feature parameters for every class and has to score every class at prediction time, so memory and computation grow with the number of classes as well as with the dimensionality of the data.
Here are some potential approaches to training a Naive Bayes model with a large number of classes:

* Use a variant of Naive Bayes that is more suited to high-dimensional data: There are several variants of Naive Bayes that are better suited to handling high-dimensional data than the standard Naive Bayes model. These include the Complement Naive Bayes (CNB) and the Multinomial Naive Bayes (MNB) models. CNB is particularly effective at handling imbalanced datasets, while MNB is better suited to sparse data.

* Use feature selection to reduce the dimensionality of the data: One way to reduce the dimensionality of the data is to use feature selection techniques to identify the most relevant features and discard the rest. This can help to reduce the complexity of the model and make it more tractable to train.

* Use a batch training approach: Instead of training the model on the entire dataset at once, you can split the dataset into smaller batches and update the model one batch at a time. For count-based variants of Naive Bayes the statistics are simple sums, so they can be accumulated incrementally. This reduces the memory requirements of training, making it more feasible on a large dataset.

* Use distributed training: If the data is too large to fit on a single machine, you can use distributed training to train the model across multiple machines. This can help to speed up the training process and make it more scalable.

It is important to keep in mind that the performance of a Naive Bayes model may not be competitive with more advanced models on a large, high-dimensional dataset. If the goal is to achieve the highest possible performance, it may be necessary to consider using a different type of model.

14. Why do we call Naive Bayes "naive"?

The Naive Bayes model is called "naive" because it makes a strong assumption about the independence of the features in the data.
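
In symbols, using notation introduced here only for illustration ($x_1, \dots, x_n$ for the features and $y$ for the class label), the assumption is that the class-conditional likelihood factorizes:

$$
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
$$

so that prediction reduces to $\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$.
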
Specifically, the model assumes that all of the features are independent of each other given the class label. This assumption is often unrealistic in practice, as features are usually correlated with each other even within a single class.

For example, consider a dataset of emails that are classified as either spam or not spam. The model assumes that the presence of certain words (e.g., "Viagra") is independent of the presence of other words (e.g., "free"), given the class label. However, in reality, the presence of certain words is often correlated with the presence of other words.

Despite the unrealistic assumption of feature independence, the Naive Bayes model can still be very effective in practice, for example in text classification tasks such as spam filtering. This is because the model is simple and easy to implement, and it can perform well with relatively little data.

15. What if you wanted to group site users together based on their preferences, and how would this help build a recommendation model?

We could use a clustering algorithm to group similar users together. Clustering is a technique used to group similar data points together. There are several different clustering algorithms that could be used for this task, such as k-means, hierarchical clustering, or density-based clustering. These algorithms work by analyzing the attributes of the users, such as:
- browsing history
- purchase history
- demographic information
- other similar attributes

Once users are grouped together based on their preferences, this information can be used to build a recommendation model. Additionally, clustering can be used to identify patterns and trends in user preferences, which can be used to develop targeted marketing strategies or to improve the overall user experience on the website.

### Deep Learning
16. Why do we need multiple hidden layers?

In deep learning, multiple hidden layers are used to allow the model to learn more complex patterns in the data. In theory, a single hidden layer with enough units can approximate any continuous function on a bounded input range (the universal approximation theorem). However, as the complexity of the function increases, the number of hidden units required to represent it can become impractically large.

Using multiple hidden layers allows the model to learn more complex patterns by composing simple patterns learned by the lower layers. For example, a model with three hidden layers could learn to recognize edges in the first hidden layer, shapes in the second hidden layer, and objects in the third hidden layer. Each layer builds on the representation learned by the previous layer, allowing the model to learn more abstract and complex patterns.

There are trade-offs to using multiple hidden layers, however. Adding more layers can increase the model's capacity and improve its ability to learn complex patterns, but it can also make the model more prone to overfitting, particularly if the model is not regularized appropriately. It is important to strike a balance between the model's capacity and its generalization ability.

17. What should we expect when we make a neural network very deep, with hundreds of layers?

Making a neural network very deep, with hundreds of layers, can lead to a number of different outcomes, depending on the characteristics of the data and the model architecture. Here are a few possible scenarios:

* Improved performance: In some cases, adding more layers to a neural network can improve its performance on the task at hand. This is because a deeper network can learn more complex patterns in the data, allowing it to make more accurate predictions.
However, this improvement in performance is not always guaranteed, and the benefits of adding more layers can diminish as the network becomes deeper.

* Overfitting: Adding more layers to a neural network can also increase the risk of overfitting, particularly if the model is not regularized appropriately. Overfitting occurs when the model performs well on the training data but poorly on unseen data, and it can be caused by the model having too much capacity to fit the noise in the training data.

* Increased training time: Training a very deep network can be computationally intensive, and it may take significantly longer to train compared to a shallower network. This can be a particular issue if the model is being trained on a large dataset or on a resource-constrained machine.

* Decreased training stability: Very deep networks can also be more sensitive to the initialization of the weights and the choice of optimization algorithm. This can make it more difficult to train the model and can lead to unstable training dynamics, such as vanishing or exploding gradients, oscillations, or divergence.

Overall, making a neural network very deep, with hundreds of layers, can lead to improved performance in some cases, but it is important to carefully consider the trade-offs and ensure that the model is properly regularized and optimized.

18. What should we expect when we use a very large input layer in a neural network?

Adding a very large input layer to a neural network can lead to a number of different outcomes, depending on the characteristics of the data and the model architecture. Here are a few possible scenarios:

* Improved performance: In some cases, a larger input layer can improve the performance of the model on the task at hand.
This is because a larger input layer can allow the model to consider more features of the input data, which may contain important information that can be used to make more accurate predictions. However, this improvement in performance is not always guaranteed, and the benefits of adding more input units can diminish as the input layer becomes larger.

* Increased training time: Training a model with a very large input layer can be computationally intensive, and it may take significantly longer to train compared to a model with a smaller input layer. This can be a particular issue if the model is being trained on a large dataset or on a resource-constrained machine.

* Overfitting: A larger input layer can also increase the risk of overfitting, particularly if the model is not regularized appropriately. Overfitting occurs when the model performs well on the training data but poorly on unseen data, and it can be caused by the model having too much capacity to fit the noise in the training data.

* Decreased training stability: A model with a very large input layer can also be more sensitive to the initialization of the weights and the choice of optimization algorithm. This can make it more difficult to train the model and can lead to unstable training dynamics, such as oscillations or divergence.

Overall, adding a very large input layer to a neural network can lead to improved performance in some cases, but it is important to carefully consider the trade-offs and ensure that the model is properly regularized and optimized.

19. What if we don't use any activation function in a deep neural network?

If no activation function is used in a deep neural network, the whole network collapses to a single linear (affine) transformation of the input data, no matter how many layers it has. Without an activation function, the network will not be able to introduce non-linearity and represent complex relationships between the input and output data.
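
To see why, here is a minimal sketch in plain Python showing that two stacked layers without an activation give the same output as one layer whose weight matrix is the product of the two. The weights `W1`, `W2` and the input `x` are hypothetical values chosen for illustration, and biases are left out for brevity (with biases the result is still a single affine map).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


# Two "layers" with no activation function and no bias (hypothetical weights).
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 2 inputs -> 3 hidden units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]    # 3 hidden units -> 2 outputs
x = [0.5, -2.0]

two_layer = matvec(W2, matvec(W1, x))    # pass x through both layers
collapsed = matvec(matmul(W2, W1), x)    # one layer with weights W2 @ W1

print(two_layer)  # [-4.5, 5.5]
print(collapsed)  # [-4.5, 5.5]
```

The extra layer adds parameters but no expressive power: any function the two-layer version can compute is already computable by the single matrix `W2 @ W1`.
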
A network without activation functions therefore cannot model data that is not linearly separable, which includes most image, speech, and natural language tasks.
Backpropagation itself still works in such a network, because the gradients of a linear map are well defined and the weights can be updated, but training can only ever fit a linear function, so the extra layers add parameters without adding expressive power.

20. What are the pros and cons of using BiLSTM or RNN layers instead of a deep MLP?
21. Explain what an autoencoder is.

### Data Structures
22. What type of data structure does Python use for dictionaries?

Hash maps

23. What are the advantages of using a hash map?
24. What is the time complexity of searching for a value in an array?

### Algorithms
25. Code a “find 2 values that sum up to a target in an array” program.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# MLE Interview Questions
*Interview Questions for Machine Learning Engineer role. Note that this is a work in progress.*

## [MLE Interview QnA](MLE_Interview_QnA)

*My collection of Machine Learning Engineer interview questions and corresponding answers. No question shall remain unanswered.*
1. [Python QnA](MLE_Interview_QnA/PythonQnA.md)
2. [Data Science QnA](MLE_Interview_QnA/DataScienceQnA.md)
3. [MLOps QnA]()
4. [Agentic AI QnA]()
5. [Case Study - Company A QnA](MLE_Interview_QnA/case_study_company_A.md)
6. [Chip Huyen's book QnA](Chips_Machine_Learning_Interviews_Book/ml-interviews-book.md)
--------------------------------------------------------------------------------