├── Deep Learning Questions & Answers for Data Scientists.md ├── Figures ├── 1657144303401.jpg ├── Batch normalization.jpg ├── Betweem&IN.png ├── CNN_architecture.png ├── Derivation-of-the-Overall-Silhouette-Coefficient-OverallSil.png ├── Elbow diagram.png ├── Hard Vs soft voting.png ├── Joins in SQL.png ├── KPI.png ├── Kerenl trick.png ├── Linear regression assumptions.jpg ├── Monte Carlo Methods.png ├── Oversampling.png ├── Python questions 8.png ├── Roc_curve.svg.png ├── Sampling bias.png ├── Screenshot 2022-07-31 201033.png ├── Tanh-and-its-gradient-plot.png ├── Therotical Explanation Q 7 Probability.jfif ├── cross validation.png ├── ezgif.com-gif-maker.jpg ├── gradient descent vs batch gradient descent.png ├── grid_search_cross_validation.png ├── long-tailed distribution.jpg ├── precision Vs recall Vs F Score.png └── readme.md ├── Machine Learning Interview Questions & Answers for Data Scientists.md ├── Probability Interview Questions & Answers for Data Scientists.md ├── Python Interview Questions & Answers for Data Scientists.md ├── README.md ├── Resume Based Questions.md ├── SQL & DB Interview Questions & Answers for Data Scientists.md └── Statistics Interview Questions & Answers for Data Scientists.md /Deep Learning Questions & Answers for Data Scientists.md: -------------------------------------------------------------------------------- 1 | # Deep Learning Interview Questions for Data Scientists # 2 | 3 | ## Questions ## 4 | 5 | ## Deep Neural Networks ## 6 | * [Q1: What are autoencoders? Explain the different layers of autoencoders and mention three practical usages of them?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q1-what-are-autoencoders-explain-the-different-layers-of-autoencoders-and-mention-three-practical-usages-of-them) 7 | * [Q2: What is an activation function and discuss the use of an activation function? Explain three different types of activation functions?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q2-what-is-an-activation-function-and-discuss-the-use-of-an-activation-function-explain-three-different-types-of-activation-functions) 8 | * [Q3: You are using a deep neural network for a prediction task. After training your model, you notice that it is strongly overfitting the training set and that the performance on the test isn’t good. 
What can you do to reduce overfitting?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q3-you-are-using-a-deep-neural-network-for-a-prediction-task-after-training-your-model-you-notice-that-it-is-strongly-overfitting-the-training-set-and-that-the-performance-on-the-test-isnt-good-what-can-you-do-to-reduce-overfitting) 9 | * [Q4: Why should we use Batch Normalization?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q4-why-should-we-use-batch-normalization) 10 | * [Q5: How to know whether your model is suffering from the problem of Exploding Gradients?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q5-how-to-know-whether-your-model-is-suffering-from-the-problem-of-exploding-gradients) 11 | * [Q6: Can you name and explain a few hyperparameters used for training a neural network?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q6-can-you-name-and-explain-a-few-hyperparameters-used-for-training-a-neural-network) 12 | * [Q7: Can you explain the parameter sharing concept in deep learning?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q7-can-you-explain-the-parameter-sharing-concept-in-deep-learning) 13 | * [Q8: Describe the architecture of a typical Convolutional Neural Network (CNN)?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q8-describe-the-architecture-of-a-typical-convolutional-neural-network-cnn) 14 | * [Q9: What is the Vanishing Gradient Problem in Artificial Neural Networks and How to fix it?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q9-what-is-the-vanishing-gradient-problem-in-artificial-neural-networks-and-how-to-fix-it) 15 | * [Q10: When it comes to training an artificial neural network, what could be the reason why the loss doesn't decrease in a few epochs?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q10--when-it-comes-to-training-an-artificial-neural-network-what-could-be-the-reason-why-the-loss-doesnt-decrease-in-a-few-epochs) 16 | * [Q11: Why Sigmoid or Tanh is not preferred to be used as the activation function in the hidden layer of the neural network?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q11-why-sigmoid-or-tanh-is-not-preferred-to-be-used-as-the-activation-function-in-the-hidden-layer-of-the-neural-network) 17 | * [Q12: Discuss in what context it is recommended to use transfer learning and when it is not.](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q12-discuss-in-what-context-it-is-recommended-to-use-transfer-learning-and-when-it-is-not) 18 | * [Q13: Discuss the vanishing 
gradient in RNN and How they can be solved.](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q13-discuss-the-vanishing-gradient-in-rnn-and-how-they-can-be-solved) 19 | * [Q14: What are the main gates in LSTM and what are their tasks?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q14-what-are-the-main-gates-in-lstm-and-what-are-their-tasks) 20 | * [Q15: Is it a good idea to use CNN to classify 1D signals? ](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q15-is-it-a-good-idea-to-use-cnn-to-classify1dsignal) 21 | * [Q16: How does L1/L2 regularization affect a neural network?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q16-how-does-l1l2-regularization-affect-a-neural-network) 22 | * [Q17: 𝐇𝐨𝐰 𝐰𝐨𝐮𝐥𝐝 𝐲𝐨𝐮 𝐜𝐡𝐚𝐧𝐠𝐞 𝐚 𝐩𝐫𝐞-𝐭𝐫𝐚𝐢𝐧𝐞𝐝 𝐧𝐞𝐮𝐫𝐚𝐥 𝐧𝐞𝐭𝐰𝐨𝐫𝐤 𝐟𝐫𝐨𝐦 𝐜𝐥𝐚𝐬𝐬𝐢𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝐭𝐨 𝐫𝐞𝐠𝐫𝐞𝐬𝐬𝐢𝐨𝐧?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q17-%F0%9D%90%87%F0%9D%90%A8%F0%9D%90%B0-%F0%9D%90%B0%F0%9D%90%A8%F0%9D%90%AE%F0%9D%90%A5%F0%9D%90%9D-%F0%9D%90%B2%F0%9D%90%A8%F0%9D%90%AE-%F0%9D%90%9C%F0%9D%90%A1%F0%9D%90%9A%F0%9D%90%A7%F0%9D%90%A0%F0%9D%90%9E-%F0%9D%90%9A-%F0%9D%90%A9%F0%9D%90%AB%F0%9D%90%9E-%F0%9D%90%AD%F0%9D%90%AB%F0%9D%90%9A%F0%9D%90%A2%F0%9D%90%A7%F0%9D%90%9E%F0%9D%90%9D-%F0%9D%90%A7%F0%9D%90%9E%F0%9D%90%AE%F0%9D%90%AB%F0%9D%90%9A%F0%9D%90%A5-%F0%9D%90%A7%F0%9D%90%9E%F0%9D%90%AD%F0%9D%90%B0%F0%9D%90%A8%F0%9D%90%AB%F0%9D%90%A4-%F0%9D%90%9F%F0%9D%90%AB%F0%9D%90%A8%F0%9D%90%A6-%F0%9D%90%9C%F0%9D%90%A5%F0%9D%90%9A%F0%9D%90%AC%F0%9D%90%AC%F0%9D%90%A2%F0%9D%90%9F%F0%9D%90%A2%F0%9D%90%9C%F0%9D%90%9A%F0%9D%90%AD%F0%9D%90%A2%F0%9D%90%A8%F0%9D%90%A7-%F0%9D%90%AD%F0%9D%90%A8-%F0%9D%90%AB%F0%9D%90%9E%F0%9D%90%A0%F0%9D%90%AB%F0%9D%90%9E%F0%9D%90%AC%F0%9D%90%AC%F0%9D%90%A2%F0%9D%90%A8%F0%9D%90%A7) 23 | * [Q18: What might happen if you set the momentum hyperparameter too close to 1 (e.g., 0.9999) when using an SGD optimizer?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q18-what-might-happen-if-you-set-the-momentum-hyperparameter-too-close-to-1-eg-09999-when-using-an-sgd-optimizer) 24 | * [Q19: What are the hyperparameters that can be optimized for the batch normalization layer?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q19-what-are-the-hyperparameters-that-can-be-optimized-for-the-batch-normalization-layer) 25 | * [Q20: What is the effect of dropout on the training and prediction speed of your deep learning model?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q20-what-is-the-effect-of-dropout-on-the-training-and-prediction-speed-of-your-deep-learning-model) 26 | * [Q21: What is the advantage of deep learning over traditional machine 
learning?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q21-what-is-the-advantage-of-deep-learning-over-traditional-machine-learning) 27 | * [Q22: What is a depthwise Separable layer and what are its advantages?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q22-what-is-a-depthwise-separable-layer-and-what-are-its-advantages) 28 | 29 | ## Natural Language Processing ## 30 | * [Q23: What is a transformer architecture, and why is it widely used in natural language processing tasks?]() 31 | * [Q24: Explain the key components of a transformer model.]() 32 | * [Q25: What is self-attention, and how does it work in transformers?]() 33 | * [Q26: What are the advantages of transformers over traditional sequence-to-sequence models?]() 34 | * [Q27: How does the attention mechanism help transformers capture long-range dependencies in sequences?]() 35 | * [Q28: What are the limitations of transformers, and what are some potential solutions?]() 36 | * [Q29: How are transformers trained, and what is the role of pre-training and fine-tuning?]() 37 | * [Q30: What is BERT (Bidirectional Encoder Representations from Transformers), and how does it improve language understanding tasks?]() 38 | * [Q31: Describe the process of generating text using a transformer-based language model.]() 39 | * [Q32: What are some challenges or ethical considerations associated with large language models?]() 40 | * [Q33: Explain the concept of transfer learning and how it can be applied to transformers.]() 41 | * [Q34: How can transformers be used for tasks other than natural language processing, such as computer vision?]() 42 | 43 | ## Computer Vision ## 44 | 45 | * [Q35: What is computer vision, and why is it important?]() 46 | * [Q36: Explain the concept of image segmentation and its applications.]() 47 | * [Q37: What is object detection, and how does it differ from image classification?]() 48 | * [Q38: Describe the steps involved in building an image recognition system.]() 49 | * [Q39: What are the challenges in implementing real-time object tracking?]() 50 | * [Q40: Can you explain the concept of feature extraction in computer vision?]() 51 | * [Q41: What is optical character recognition (OCR), and what are its main applications?]() 52 | * [Q42: How does a convolutional neural network (CNN) differ from a traditional neural network in the context of computer vision?]() 53 | * [Q43: What is the purpose of data augmentation in computer vision, and what techniques can be used?]() 54 | * [Q44: Discuss some popular deep learning frameworks or libraries used for computer vision tasks.]() 55 | 56 | 57 | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- 58 | 59 | ## Questions & Answers ## 60 | 61 | ### Q1: What are autoencoders? Explain the different layers of autoencoders and mention three practical usages of them? ### 62 | 63 | Answer: 64 | 65 | Autoencoders are one of the deep learning types used for unsupervised learning. There are key layers of autoencoders, which are the input layer, encoder, bottleneck hidden layer, decoder, and output. 
66 | 67 | The three layers of the autoencoder are:- 68 | 1) Encoder - Compresses the input data to an encoded representation which is typically much smaller than the input data. 69 | 2) Latent Space Representation/ Bottleneck/ Code - Compact summary of the input containing the most important features 70 | 3) Decoder - Decompresses the knowledge representation and reconstructs the data back from its encoded form. 71 | Then a loss function is used at the top to compare the input and output images. 72 | NOTE- It's a requirement that the dimensionality of the input and output be the same. Everything in the middle can be played with. 73 | 74 | Autoencoders have a wide variety of usage in the real world. The following are some of the popular ones: 75 | 76 | 1. Transformers and Big Bird (Autoencoders is one of these components in both algorithms): Text Summarizer, Text Generator 77 | 2. Image compression 78 | 3. Nonlinear version of PCA 79 | 80 | 81 | ### Q2: What is an activation function and discuss the use of an activation function? Explain three different types of activation functions? ### 82 | 83 | Answer: 84 | 85 | In mathematical terms, the activation function serves as a gate between the current neuron input and its output, going to the next level. Basically, it decides whether neurons should be activated or not. 86 | It is used to introduce non-linearity into a model. 87 | 88 | Activation functions are added to introduce non-linearity to the network, it doesn't matter how many layers or how many neurons your net has, the output will be linear combinations of the input in the absence of activation functions. In other words, activation functions are what make a linear regression model different from a neural network. We need non-linearity, to capture more complex features and model more complex variations that simple linear models can not capture. 89 | 90 | There are a lot of activation functions: 91 | 92 | * Sigmoid function: f(x) = 1/(1+exp(-x)) 93 | 94 | The output value of it is between 0 and 1, we can use it for classification. It has some problems like the gradient vanishing on the extremes, also it is computationally expensive since it uses exp. 95 | 96 | * Relu: f(x) = max(0,x) 97 | 98 | it returns 0 if the input is negative and the value of the input if the input is positive. It solves the problem of vanishing gradient for the positive side, however, the problem is still on the negative side. It is fast because we use a linear function in it. 99 | 100 | * Leaky ReLU: 101 | 102 | F(x)= ax, x<0 103 | F(x)= x, x>=0 104 | 105 | It solves the problem of vanishing gradient on both sides by returning a value “a” on the negative side and it does the same thing as ReLU for the positive side. 106 | 107 | * Softmax: it is usually used at the last layer for a classification problem because it returns a set of probabilities, where the sum of them is 1. Moreover, it is compatible with cross-entropy loss, which is usually the loss function for classification problems. 108 | 109 | 110 | ### Q3: You are using a deep neural network for a prediction task. After training your model, you notice that it is strongly overfitting the training set and that the performance on the test isn’t good. What can you do to reduce overfitting? ### 111 | 112 | To reduce overfitting in a deep neural network changes can be made in three places/stages: The input data to the network, the network architecture, and the training process: 113 | 114 | 1. 
The input data to the network: 115 | 116 | * Check if all the features are available and reliable 117 | * Check if the training sample distribution is the same as the validation and test set distribution, because if the validation distribution differs from the training distribution, it is hard for the model to predict patterns it has never seen. 118 | * Check for train/validation data contamination (or leakage) 119 | * Check that the dataset size is sufficient; if not, try data augmentation to increase it 120 | * Check that the dataset is balanced 121 | 122 | 2. Network architecture: 123 | * Overfitting could be due to model complexity. Question each component: 124 | * Can fully connected layers be replaced with convolutional + pooling layers? 125 | * What is the justification for the number of layers and number of neurons chosen? Given how hard it is to tune these, can a pre-trained model be used? 126 | * Add regularization - lasso (L1), ridge (L2), or elastic net (both) 127 | * Add dropout 128 | * Add batch normalization 129 | 130 | 3. The training process: 131 | * Improvements in the validation loss should decide when to stop training. Use callbacks for early stopping when there is no significant change in the validation loss, and restore the best weights (restore_best_weights). 132 | 133 | ### Q4: Why should we use Batch Normalization? ### 134 | 135 | Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. 136 | 137 | Usually, a dataset is fed into the network in the form of batches, and the distribution of the data differs from batch to batch. Because of this, there is a chance of vanishing or exploding gradients during backpropagation. To combat these issues, we can add a batch normalization (BN) layer, usually placed after fully connected (or convolutional) layers and before the activation function, so that the inputs to the next layer are standardized. 138 | 139 | 140 | Batch Normalisation has the following effects on the Neural Network: 141 | 142 | 1. Robust training of the deeper layers of the network. 143 | 2. A more covariate-shift-proof NN architecture. 144 | 3. Has a slight regularisation effect. 145 | 4. Centred and controlled values of the activations. 146 | 5. Tries to prevent exploding/vanishing gradients. 147 | 6. Faster training/convergence to the minimum of the loss function. 148 | 149 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Batch%20normalization.jpg) 150 | 151 | ### Q5: How to know whether your model is suffering from the problem of Exploding Gradients? ### 152 | 153 | By taking incremental steps towards the minimal value, the gradient descent algorithm aims to minimize the error. The weights and biases in a neural network are updated using this process. However, at times, the steps grow excessively large, resulting in larger and larger updates to the weights and bias terms, to the point where the weights overflow (or become NaN, that is, Not a Number). This is known as an exploding gradient, and it makes training unstable. 154 | 155 | There are some subtle signs that you may be suffering from exploding gradients during the training of your network, such as: 156 | 157 | 1. The model is unable to get traction on your training data (e.g., poor loss). 158 | 2. The model is unstable, resulting in large changes in loss from update to update. 159 | 3. The model loss goes to NaN during training. 160 | 161 | If you have these types of problems, you can dig deeper to see if you have a problem with exploding gradients. There are some less subtle signs that you can use to confirm that you have exploding gradients: 162 | 163 | 1. The model weights quickly become very large during training. 164 | 2. The model weights go to NaN values during training. 165 | 3. The error gradient values are consistently above 1.0 for each node and layer during training.
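One practical way to check for these signs is to log the global gradient norm during training. Below is a minimal, hypothetical TensorFlow/Keras sketch of such a check; the `model`, `loss_fn`, and `optimizer` objects are placeholders, not part of the original answer:

```python
import tensorflow as tf

# Placeholder objects: build your own model, loss, and optimizer first.
@tf.function
def train_step(model, loss_fn, optimizer, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    grad_norm = tf.linalg.global_norm(grads)         # log this value each step
    # grads, _ = tf.clip_by_global_norm(grads, 1.0)  # gradient clipping is a common mitigation
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss, grad_norm
```

If the logged norm grows by orders of magnitude from step to step (or becomes NaN), exploding gradients are the likely cause; gradient clipping, a lower learning rate, or batch normalization are typical remedies.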
166 | 167 | ### Q6: Can you name and explain a few hyperparameters used for training a neural network? ### 168 | 169 | Answer: 170 | 171 | Hyperparameters are any settings of the model that affect its performance but, unlike parameters (weights and biases), are not learned from the data; the only way to change them is manually, by the user. 172 | 173 | 174 | 1. Number of nodes: the number of neurons in each layer. 175 | 176 | 2. Batch normalization: normalization/standardization of the inputs to a layer. 177 | 178 | 3. Learning rate: the rate at which the weights are updated. 179 | 180 | 4. Dropout rate: the percentage of nodes to drop temporarily during the forward pass. 181 | 182 | 5. Kernel: the filter matrix with which the dot product over the image array is performed (its size is the tunable hyperparameter). 183 | 184 | 6. Activation function: defines how the weighted sum of inputs is transformed into outputs (e.g., tanh, sigmoid, softmax, ReLU, etc.). 185 | 186 | 7. Number of epochs: the number of complete passes over the training data. 187 | 188 | 8. Batch size: the number of samples passed through the network before the weights are updated. E.g., if the dataset has 1000 records and we set a batch size of 100, the dataset will be divided into 10 batches, which are propagated through the network one after another. 189 | 190 | 9. Momentum: momentum can be seen as a learning rate adaptation technique that adds a fraction of the past update vector to the current update vector. This helps damp oscillations and speeds up progress towards the minimum. 191 | 192 | 10. Optimizers: they focus on getting the learning rate right. 193 | 194 | * Adagrad optimizer: Adagrad uses a large learning rate for infrequent features and a smaller learning rate for frequent features. 195 | 196 | * Other optimizers, like Adadelta, RMSProp, and Adam, make further improvements to fine-tuning the learning rate and momentum to get to the optimal weights and biases. Thus getting the learning rate right is key to well-trained models. 197 | 198 | 11. Learning rate: controls how much to update the weight and bias (w+b) terms after training on each batch. Several helpers are used to get the learning rate right.
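To make these concrete, here is a minimal, hypothetical Keras sketch showing where several of these hyperparameters appear; the specific values (layer sizes, rates, batch size, epochs) are illustrative only, not recommendations:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),  # number of nodes, activation function
    tf.keras.layers.BatchNormalization(),                              # batch normalization
    tf.keras.layers.Dropout(0.3),                                      # dropout rate
    tf.keras.layers.Dense(10, activation="softmax"),
])

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # learning rate, momentum
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Batch size and number of epochs are passed to fit(); x_train / y_train are placeholder arrays.
# model.fit(x_train, y_train, batch_size=32, epochs=10, validation_split=0.1)
```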
199 | 200 | ### Q7: Can you explain the parameter sharing concept in deep learning? ### 201 | Answer: 202 | Parameter sharing is the method of sharing weights across all neurons in a particular feature map. It therefore helps to reduce the number of parameters in the whole system, making it computationally cheaper. It basically means that the same parameters are used to represent different transformations in the system, so the same matrix elements may be updated multiple times during backpropagation from varied gradients. The same set of elements facilitates transformations at more than one layer, instead of at a single layer as is conventional. This is usually done in architectures like Siamese networks that tend to have parallel trunks trained simultaneously. In that case, using shared weights in a few layers (usually the bottom layers) helps the model converge better. This behavior can be attributed to the more diverse feature representations learned by the system, since neurons corresponding to the same features are triggered in varied scenarios, which helps the model generalize better. 203 | 204 | 205 | 206 | Note that sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have some specific centered structure, where we should expect, for example, that completely different features should be learned on one side of the image than another. 207 | 208 | One practical example is when the input is faces that have been centered in the image. You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case, it is common to relax the parameter sharing scheme and instead simply call the layer a Locally-Connected Layer. 209 | 210 | ### Q8: Describe the architecture of a typical Convolutional Neural Network (CNN)? ### 211 | 212 | Answer: 213 | 214 | In a typical CNN architecture, a few convolutional layers are connected in a cascade style. Each convolutional layer is followed by a Rectified Linear Unit (ReLU) layer or other activation function, then a pooling layer*, then one or more convolutional layers (+ReLU), then another pooling layer. 215 | 216 | The output of each convolutional layer is a set of feature maps, each generated by a single kernel (filter). The feature maps define the input to the next layer. A common trend is to keep increasing the number of filters as the spatial size of the image keeps dropping as it passes through the convolutional and pooling layers. The size of each kernel is usually 3×3, because stacked 3×3 kernels can extract the same kind of features as larger kernels while using fewer parameters and computations. 217 | 218 | After that, the final small image with a large number of filters (which is a 3D output from the above layers) is flattened and passed through fully connected layers. At last, we use a softmax layer with the required number of nodes for classification, or use the output of the fully connected layers for some other purpose depending on the task. 219 | 220 | The number of these layers can increase depending on the complexity of the data, and when they increase you need more data. Stride, padding, filter size, type of pooling, etc. are all hyperparameters and need to be chosen (perhaps based on some previously built successful models). 221 | 222 | *Pooling: a way to reduce the number of features by choosing one number to represent a neighborhood of values. It has several types: max pooling, average pooling, and global average pooling. 223 | 224 | * Max pooling: takes a window (2×2, for example), represents that window by the maximum value inside it, then slides over the image repeating the same operation. 225 | * Average pooling: the same as max pooling, but takes the average of the window. 226 | 227 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/CNN_architecture.png)
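As an illustration of this layout, here is a minimal, hypothetical Keras sketch of such an architecture; the input shape, filter counts, and number of classes are arbitrary example choices:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),  # conv + ReLU
    tf.keras.layers.MaxPooling2D((2, 2)),                                            # pooling
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),   # more filters as the image shrinks
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                                # flatten the 3D feature maps
    tf.keras.layers.Dense(128, activation="relu"),            # fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),          # softmax output for 10 classes
])
model.summary()
```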
228 | 229 | ### Q9: What is the Vanishing Gradient Problem in Artificial Neural Networks and How to fix it? ### 230 | 231 | Answer: 232 | 233 | The vanishing gradient problem is encountered in artificial neural networks with gradient-based learning methods and backpropagation. In these learning methods, each of the weights of the neural network receives an update proportional to the partial derivative of the error function with respect to the current weight in each iteration of training. Sometimes the gradients become vanishingly small, which prevents the weights from changing value. 234 | 235 | When the neural network has many hidden layers, the gradients in the earlier layers become very small as we multiply the derivatives of each layer. As a result, learning in the earlier layers becomes very slow. 𝐓𝐡𝐢𝐬 𝐜𝐚𝐧 𝐜𝐚𝐮𝐬𝐞 𝐭𝐡𝐞 𝐧𝐞𝐮𝐫𝐚𝐥 𝐧𝐞𝐭𝐰𝐨𝐫𝐤 𝐭𝐨 𝐬𝐭𝐨𝐩 𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠. This vanishing gradient problem happens when training neural networks with many layers because the gradient diminishes dramatically as it propagates backward through the network. 236 | 237 | Some ways to fix it are: 238 | 1. Use skip/residual connections. 239 | 2. Use ReLU or Leaky ReLU instead of sigmoid and tanh activation functions. 240 | 3. Use models that help propagate gradients to earlier time steps, like GRUs and LSTMs. 241 | 242 | ### Q10: When it comes to training an artificial neural network, what could be the reason why the loss doesn't decrease in a few epochs? ### 243 | 244 | Answer: 245 | 246 | Some of the reasons why the loss doesn't decrease after a few epochs are: 247 | 248 | a) The model is under-fitting the training data. 249 | 250 | b) The learning rate of the model is too large. 251 | 252 | c) The initialization is not proper (for example, initializing all the weights to 0 prevents the network from learning any useful function). 253 | 254 | d) The regularisation hyperparameter is too large. 255 | 256 | e) The classic case of vanishing gradients. 257 | 258 | ### Q11: Why Sigmoid or Tanh is not preferred to be used as the activation function in the hidden layer of the neural network? ### 259 | 260 | Answer: 261 | 262 | A common problem with the tanh and sigmoid functions is that they saturate. Once saturated, their gradients become very small, so the learning algorithm can barely adapt the weights and improve the performance of the model. 263 | Thus, sigmoid or tanh activation functions can prevent the neural network from learning effectively, leading to a vanishing gradient problem. The vanishing gradient problem can be addressed by using the Rectified Linear Unit activation function (ReLU) instead of sigmoid and tanh. 264 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Tanh-and-its-gradient-plot.png) 265 | 266 | ### Q12: Discuss in what context it is recommended to use transfer learning and when it is not. ### 267 | 268 | Answer: 269 | 270 | Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task. It is a popular approach in deep learning, where pre-trained models are used as the starting point for computer vision and natural language processing tasks, given the vast computing and time resources required to develop neural network models on these problems and the huge jumps in skill that they provide on related problems. 271 | 272 | Transfer learning is used for tasks where the data is too little to train a full-scale model from the beginning. In transfer learning, well-trained, well-constructed networks that have learned over large datasets are used to boost performance on the new dataset. 273 | 274 | 𝐓𝐫𝐚𝐧𝐬𝐟𝐞𝐫 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐜𝐚𝐧 𝐛𝐞 𝐮𝐬𝐞𝐝 𝐢𝐧 𝐭𝐡𝐞 𝐟𝐨𝐥𝐥𝐨𝐰𝐢𝐧𝐠 𝐜𝐚𝐬𝐞𝐬: 275 | 1. The downstream task has a very small amount of data available; then we can try using the pre-trained model's weights, replacing the last layer with new layers which we will train (see the sketch after this answer). 276 | 277 | 2. In some cases, like vision-related tasks, the initial layers have a common behavior of detecting edges, then slightly more complex but still abstract features, and so on, which is common to all vision tasks; hence a pre-trained model's initial layers can be used directly. The same thing holds for language models too: for example, a model trained on a large Hindi corpus can be transferred and used for other Indo-Aryan languages with few resources available. 278 | 279 | 280 | 𝐂𝐚𝐬𝐞𝐬 𝐰𝐡𝐞𝐧 𝐭𝐫𝐚𝐧𝐬𝐟𝐞𝐫 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐬𝐡𝐨𝐮𝐥𝐝 𝐧𝐨𝐭 𝐛𝐞 𝐮𝐬𝐞𝐝: 281 | 282 | 1. The first and most important consideration is the cost: is it cost-effective, or can we reach similar performance without using it? 283 | 284 | 2. The pre-trained model has no relation to the downstream task. 285 | 286 | 3. If latency is a big constraint (mostly in NLP), then transfer learning is not always the best option. However, with platforms such as TensorFlow Lite and with model distillation, latency is much less of a problem nowadays.
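A minimal, hypothetical Keras sketch of the recipe described in case 1 above (freeze a pre-trained backbone, replace the head, and train only the new layers); the backbone choice, input size, and class count are placeholder assumptions:

```python
import tensorflow as tf

# Pre-trained backbone (ImageNet weights) without its original classification head.
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained feature-extraction layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # new head for a hypothetical 5-class task
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(small_dataset, epochs=5)  # small_dataset is a placeholder tf.data.Dataset
```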
287 | 288 | ### Q13: Discuss the vanishing gradient in RNN and How they can be solved. ### 289 | 290 | Answer: 291 | 292 | In sequence-to-sequence models such as RNNs, the input sentences might have long-term dependencies. For example, in "The boy who was wearing a red t-shirt, blue jeans, black shoes, and a white cap, and who lives at ..., and is 10 years old ..., is a genius", the verb "is" depends on "boy" (if we instead said "The boys, ..., are geniuses", the verb would change). When training an RNN we do backward propagation both through the layers and backward through time. Without focusing too much on the mathematics, during backward propagation we repeatedly multiply gradients that are either > 1 or < 1; if the gradients are < 1 and we have about 100 steps backward in time, then multiplying 100 numbers that are < 1 results in a very, very tiny gradient (0.1 * 0.1 * 0.1 * ... a hundred times = 10^(-100)), causing almost no change in the weights as we go backward in time. In our example, this means the word "is" barely affects its main dependency, the word "boy", while learning the meaning of the sentence, because of the long description in between. 293 | 294 | Models like Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) networks were proposed; the main idea of these models is to use gates to help the network determine which information to keep and which to discard during learning. Later, Transformers were proposed, which rely on the self-attention mechanism to capture the dependencies between words in the sequence. 295 | 296 | ### Q14: What are the main gates in LSTM and what are their tasks? ### 297 | 298 | Answer: 299 | There are 3 main types of gates in an LSTM model, as follows: 300 | - Forget Gate 301 | - Input/Update Gate 302 | - Output Gate 303 | 1) Forget Gate:- helps decide which information to keep and which to throw out of the cell state. 304 | 2) Input/Update Gate:- helps determine whether new information (from the previous hidden state and the new input data) should be added to the long-term memory cell. 305 | 3) Output Gate:- produces the new hidden state. 306 | 307 | Common to all these gates: they each take as inputs the current temporal state/input/word/observation and the previous hidden-state output, and a sigmoid activation is used in all of them (see the gate equations below). 308 | 309 | 310 | ![The-LSTM-unit-contain-a-forget-gate-output-gate-and-input-gate-The-yellow-circle_W640](https://user-images.githubusercontent.com/72076328/191361816-70d9469e-ce48-44df-8992-6a48bff736ab.jpg)
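For reference, the standard LSTM gate equations in a common formulation ($W$, $b$ are learned weights and biases, $\sigma$ the sigmoid, $\odot$ element-wise multiplication, and $[h_{t-1}, x_t]$ the concatenated previous hidden state and current input):

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input/update gate)} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(new cell state)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(new hidden state)}
\end{aligned}
$$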
311 | 312 | ### Q15: Is it a good idea to use CNN to classify 1D signal? ### 313 | 314 | Answer: 315 | For time-series data, where we assume temporal dependence between the values, convolutional neural networks (CNNs) are one of the possible approaches. However, the most popular approach to such data is to use recurrent neural networks (RNNs); you can alternatively use CNNs, or a hybrid approach (quasi-recurrent neural networks, QRNN). 316 | 317 | With **CNN**, you would use sliding windows of some width that look at certain (learned) patterns in the data, and stack such windows on top of each other, so that higher-level windows look for patterns within the lower-level patterns. Using such sliding windows may be helpful for finding things such as repeating patterns within the data. One drawback is that it doesn't take into account the temporal or sequential aspect of the 1D signal, which can be very important for prediction. 318 | 319 | With **RNN**, you would use a cell that takes as input the previous hidden state and the current input value, and returns an output and a new hidden state, so the information flows via the hidden states and the temporal dependencies are taken into account. 320 | 321 | **QRNN** layers mix both approaches. 322 | 323 | ### Q16: How does L1/L2 regularization affect a neural network? ### 324 | 325 | Answer: 326 | 327 | Overfitting occurs in more complex neural network models (many layers, many neurons), and the complexity of the neural network can be reduced by using L1 and L2 regularization as well as dropout and data augmentation. 328 | L1 regularization forces some weight parameters to become exactly zero, while L2 regularization forces the weight parameters towards zero but never exactly zero (this is also known as weight decay). 329 | 330 | Smaller weight parameters make some neurons negligible, so the neural network becomes less complex and overfits less. 331 | 332 | Regularisation has the following benefits: 333 | - Reducing the variance of the model over unseen data. 334 | - Makes it feasible to fit much more complicated models without overfitting. 335 | - Reduces the magnitude of weights and biases. 336 | - L1 learns sparse models, that is, many weights turn out to be 0.
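As a minimal, hypothetical Keras illustration of where these penalties are attached (the regularization factors are arbitrary example values, not tuned recommendations):

```python
import tensorflow as tf

# L1 penalty: adds 1e-5 * sum(|w|) to the loss -> tends to drive many weights to exactly zero (sparse model)
l1_layer = tf.keras.layers.Dense(64, activation="relu",
                                 kernel_regularizer=tf.keras.regularizers.L1(1e-5))

# L2 penalty (weight decay): adds 1e-4 * sum(w**2) to the loss -> shrinks weights towards zero, never exactly zero
l2_layer = tf.keras.layers.Dense(64, activation="relu",
                                 kernel_regularizer=tf.keras.regularizers.L2(1e-4))
```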
337 | 338 | 339 | ### Q17: 𝐇𝐨𝐰 𝐰𝐨𝐮𝐥𝐝 𝐲𝐨𝐮 𝐜𝐡𝐚𝐧𝐠𝐞 𝐚 𝐩𝐫𝐞-𝐭𝐫𝐚𝐢𝐧𝐞𝐝 𝐧𝐞𝐮𝐫𝐚𝐥 𝐧𝐞𝐭𝐰𝐨𝐫𝐤 𝐟𝐫𝐨𝐦 𝐜𝐥𝐚𝐬𝐬𝐢𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝐭𝐨 𝐫𝐞𝐠𝐫𝐞𝐬𝐬𝐢𝐨𝐧? ### 340 | 341 | Answer: 342 | We would use transfer learning, where we can use our knowledge about one task to do another. The first set of layers of a neural network are usually feature-extraction layers and will be useful for all tasks with the same input distribution. So, we should replace the last fully connected layer and the softmax responsible for classification with a single output neuron for regression, or with a fully connected layer followed by a single output neuron for regression. 343 | 344 | We can optionally freeze the first set of layers if we have little data or want to converge fast. Then we can train the network with the data we have, using a loss suitable for the regression problem, making use of the robust feature extraction (the first set of layers) of a model pre-trained on huge data.
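A minimal, hypothetical Keras sketch of this head swap; the backbone, extra layer size, and loss are placeholder choices:

```python
import tensorflow as tf

# Pre-trained classification backbone with its softmax head removed.
backbone = tf.keras.applications.ResNet50(include_top=False, pooling="avg", weights="imagenet")
backbone.trainable = False  # optionally freeze the feature-extraction layers

inputs = tf.keras.Input(shape=(224, 224, 3))
features = backbone(inputs)
x = tf.keras.layers.Dense(64, activation="relu")(features)  # optional extra fully connected layer
outputs = tf.keras.layers.Dense(1)(x)                        # single linear neuron for regression

regressor = tf.keras.Model(inputs, outputs)
regressor.compile(optimizer="adam", loss="mse")  # regression loss instead of cross-entropy
```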
345 | 346 | ### Q18: What might happen if you set the momentum hyperparameter too close to 1 (e.g., 0.9999) when using an SGD optimizer? ### 347 | 348 | Answer: 349 | 350 | If the momentum hyperparameter is set too close to 1 (e.g., 0.9999) when using an SGD optimizer, then the algorithm will likely pick up a lot of speed, hopefully moving roughly toward the global minimum, but its momentum will carry it right past the minimum. 351 | 352 | Then it will slow down and come back, accelerate again, overshoot again, and so on. It may oscillate this way many times before converging, so overall it will take much longer to converge than with a smaller momentum value. 353 | 354 | Also, since momentum updates the weights based on an "exponential moving average" of all the previous gradients instead of the current gradient only, it in some sense combats the instability of the gradients that comes with stochastic gradient descent. The higher the momentum term, the stronger the influence of previous gradients on the current optimization step (with the more recent gradients having a stronger influence). Setting a momentum term close to 1 results in an update that is almost a sum of all the previous gradients, which might lead to an exploding gradient scenario. 355 | ![1667318817187](https://user-images.githubusercontent.com/72076328/199281009-a2051070-bf73-4478-9616-c5688d7eceb0.jpg) 356 | 357 | ### Q19: What are the hyperparameters that can be optimized for the batch normalization layer? ### 358 | 359 | Answer: The $\gamma$ (scale) and $\beta$ (shift) parameters of the batch normalization layer are learned end to end by the network. 360 | In batch normalization, the outputs of the intermediate layers are normalized to have a mean of 0 and a standard deviation of 1. Rescaling by $\gamma$ and shifting by $\beta$ lets the network change the mean and standard deviation to other values. 361 | 362 | ### Q20: What is the effect of dropout on the training and prediction speed of your deep learning model? ### 363 | 364 | Answer: Dropout is a regularization technique that zeroes out some of the units (activations) and scales up the rest by a factor of 1/(1-p). For example, if a dropout layer is initialized with p=0.5, then on average half of the units are zeroed out and the rest are scaled by a factor of 2. This layer is only enabled during training and is disabled during validation and testing; hence validation and testing are faster. The reason it is applied only during training is that we want to reduce the effective complexity of the model so that it doesn't overfit. Once the model is trained, it doesn't make sense to keep that layer enabled. 365 | 366 | 367 | ### Q21: What is the advantage of deep learning over traditional machine learning? ### 368 | 369 | Answer: 370 | 371 | Deep learning offers several advantages over traditional machine learning approaches, including: 372 | 373 | 1. Ability to process large amounts of data: Deep learning models can analyze and process massive amounts of data quickly and accurately, making them ideal for tasks such as image recognition or natural language processing. 374 | 375 | 2. Automated feature extraction: In traditional machine learning, feature engineering is a crucial step in the model-building process. Deep learning models, on the other hand, can automatically learn and extract features from the raw data, reducing the need for human intervention. 376 | 377 | 3. Better accuracy: Deep learning models have been shown to achieve higher accuracy in complex tasks such as speech recognition and image classification when compared to traditional machine learning models. 378 | 379 | 4. Adaptability to new data: Deep learning models can adapt and learn from new data, making them suitable for use in dynamic and ever-changing environments. 380 | 381 | While deep learning does have its advantages, it also has some limitations, such as requiring large amounts of data and computational resources, making it unsuitable for some applications. 382 | 383 | ### Q22: What is a depthwise Separable layer and what are its advantages? ### 384 | 385 | Answer: 386 | 387 | Standard neural network convolution layers involve a lot of multiplications, which makes them expensive and less suitable for deployment on constrained hardware. 388 | 389 | ![image](https://user-images.githubusercontent.com/16001446/214198300-f9b1edcf-7b5f-4fd0-a574-edd1b6feefc2.png) 390 | 391 | In the above scenario, we have an input image of 12x12x3 pixels and we apply a 5x5 convolution (no padding, stride = 1). We stack 256 such kernels 392 | so that we get an output of dimensions 8x8x256. 393 | 394 | Here, there are 256 5x5x3 kernels that move 8x8 times, which leads to 256x3x5x5x8x8 = 1,228,800 multiplications. 395 | 396 | Depthwise separable convolution separates this process into two parts: a depthwise convolution and a pointwise convolution. 397 | 398 | In depthwise convolution, we apply a kernel to each channel of the image in parallel. 399 | 400 | ![image](https://user-images.githubusercontent.com/16001446/214199199-63fc7784-b9de-4ca8-ade7-1d4b522ee3b2.png) 401 | 402 | We end up with 3 different 8x8x1 outputs (one per channel of the image). These are stacked together to form an 8x8x3 image. 403 | 404 | Pointwise convolution now converts this 8x8x3 output of the depthwise convolution into an 8x8x1 output using a 1x1x3 kernel. 405 | 406 | ![image](https://user-images.githubusercontent.com/16001446/214199695-afdf7c3c-0c9d-4ed5-916f-b7afb5eada8f.png) 407 | 408 | Stacking 256 such 1x1x3 kernels gives us the final 8x8x256 output, the same shape as the standard convolution. 409 | 410 | ![image](https://user-images.githubusercontent.com/16001446/214199795-bacbdb57-59bd-4b36-ba71-7d65fffcfa8b.png) 411 | 412 | Total number of multiplications: 413 | 414 | For the depthwise convolution, we have 3 5x5x1 kernels moving 8x8 times, totalling 3x5x5x8x8 = 4,800 multiplications. 415 | 416 | For the pointwise convolution, we have 256 1x1x3 kernels moving 8x8 times, which is a total of 256x1x1x3x8x8 = 49,152 multiplications. 417 | 418 | Total number of multiplications = 4,800 + 49,152 = 53,952 multiplications, which is far lower than the 1,228,800 of the standard convolution case. 419 | 420 | Reference: https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728 421 | 422 | 423 | # Natural Language Processing # 424 | 425 | ## Q23: What is a transformer architecture, and why is it widely used in natural language processing tasks? ## 426 | Answer: 427 | The key components of a transformer architecture are as follows: 428 | 429 | 1. Encoder: The encoder processes the input sequence, such as a sentence or a document, and transforms it into a set of representations that capture the contextual information of each input element. The encoder consists of multiple identical layers, each containing a self-attention mechanism and position-wise feed-forward neural networks. The self-attention mechanism allows the model to attend to different parts of the input sequence while encoding it. 430 | 431 | 2. Decoder: The decoder takes the encoded representations generated by the encoder and generates an output sequence.
It also consists of multiple identical layers, each containing a self-attention mechanism and additional cross-attention mechanisms. The cross-attention mechanisms enable the decoder to attend to relevant parts of the encoded input sequence when generating the output. 432 | 433 | 3. Self-Attention: Self-attention is a mechanism that allows the transformer to weigh the importance of different elements in the input sequence when generating representations. It computes attention scores between each element and every other element in the sequence, resulting in a weighted sum of the values. This process allows the model to capture dependencies and relationships between different elements in the sequence. 434 | 435 | 4. Positional Encoding: Transformers incorporate positional encoding to provide information about the order or position of elements in the input sequence. This encoding is added to the input embeddings and allows the model to understand the sequential nature of the data. 436 | 437 | 5. Feed-Forward Networks: Transformers utilize feed-forward neural networks to process the representations generated by the attention mechanisms. These networks consist of multiple layers of fully connected neural networks with activation functions, enabling non-linear transformations of the input representations. 438 | 439 | ![Screenshot-from-2019-06-17-19-53-10](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/assets/72076328/44c96916-f6cc-4b69-b163-ef94f027eb2e) 440 | 441 | The transformer architecture is widely used in NLP tasks due to several reasons: 442 | 443 | Self-Attention Mechanism: Transformers leverage a self-attention mechanism that allows the model to focus on different parts of the input sequence during processing. This mechanism enables the model to capture long-range dependencies and contextual information efficiently, making it particularly effective for tasks that involve understanding and generating natural language. 444 | 445 | Parallelization: Transformers can process the elements of a sequence in parallel, as opposed to recurrent neural networks (RNNs) that require sequential processing. This parallelization greatly accelerates training and inference, making transformers more computationally efficient. 446 | 447 | Scalability: Transformers scale well with the length of input sequences, thanks to the self-attention mechanism. Unlike RNNs, transformers do not suffer from the vanishing or exploding gradient problem, which can hinder the modeling of long sequences. This scalability makes transformers suitable for tasks that involve long texts or documents. 448 | 449 | Transfer Learning: Transformers have shown great success in pre-training and transfer learning. Models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) are pre-trained on massive amounts of text data, enabling them to learn rich representations of language. These pre-trained models can then be fine-tuned on specific downstream tasks with comparatively smaller datasets, leading to better generalization and improved performance. 450 | 451 | Contextual Understanding: Transformers excel in capturing the contextual meaning of words and sentences. By considering the entire input sequence simultaneously, transformers can generate more accurate representations that incorporate global context, allowing for better language understanding and generation. 452 | 453 | 454 | ## Q24: Explain the key components of a transformer model. 
## 455 | Answer: 456 | 457 | A transformer model consists of several key components that work together to process and generate representations for input sequences. The main components of a transformer model are as follows: 458 | 459 | * Encoder: The encoder is responsible for processing the input sequence and generating representations that capture the contextual information of each element. It consists of multiple identical layers, typically stacked on top of each other. Each layer contains two sub-layers: a self-attention mechanism and a position-wise feed-forward neural network. 460 | 461 | * Self-Attention Mechanism: This mechanism allows the model to attend to different parts of the input sequence while encoding it. It computes attention scores between each element and every other element in the sequence, resulting in a weighted sum of values. This process allows the model to capture dependencies and relationships between different elements. 462 | 463 | * Position-wise Feed-Forward Neural Network: After the self-attention mechanism, a feed-forward neural network is applied to each position separately. It consists of fully connected layers with activation functions, enabling non-linear transformations of the input representations. 464 | 465 | * Decoder: The decoder takes the encoded representations generated by the encoder and generates an output sequence. It also consists of multiple identical layers, each containing sub-layers such as self-attention, cross-attention, and position-wise feed-forward networks. 466 | 467 | * Self-Attention Mechanism: Similar to the encoder, the decoder uses self-attention to attend to different parts of the decoded sequence while generating the output. It allows the decoder to consider the previously generated elements in the output sequence when generating the next element. 468 | 469 | * Cross-Attention Mechanism: In addition to self-attention, the decoder employs cross-attention to attend to relevant parts of the encoded input sequence. It allows the decoder to align and extract information from the encoded sequence when generating the output. 470 | 471 | * Self-Attention and Cross-Attention: These attention mechanisms are fundamental components of the transformer architecture. They enable the model to weigh the importance of different elements in the input and output sequences when generating representations. Attention scores are computed by measuring the compatibility between elements, and the weighted sum of values is used to capture contextual dependencies. 472 | 473 | * Positional Encoding: Transformers incorporate positional encoding to provide information about the order or position of elements in the input sequence. It is added to the input embeddings and allows the model to understand the sequential nature of the data. 474 | 475 | * Residual Connections and Layer Normalization: Transformers employ residual connections and layer normalization to facilitate the flow of information and improve gradient propagation. Residual connections enable the model to capture both high-level and low-level features, while layer normalization normalizes the inputs to each layer, improving the stability and performance of the model. 476 | 477 | These components collectively enable the transformer model to process and generate representations for input sequences in an efficient and effective manner. 
The self-attention mechanisms, along with the feed-forward networks and positional encoding, allow the model to capture long-range dependencies, handle the parallel processing, and generate high-quality representations, making transformers highly successful in natural language processing tasks. 478 | 479 | 480 | ## Q25: What is self-attention, and how does it work in transformers? ## 481 | Answer: 482 | 483 | ## Q26: What are the advantages of transformers over traditional sequence-to-sequence models? ## 484 | Answer: 485 | Transformers have several advantages over traditional sequence-to-sequence models, such as recurrent neural networks (RNNs), when it comes to natural language processing tasks. Here are some key advantages: 486 | 487 | * Long-range dependencies: Transformers are capable of capturing long-range dependencies in sequences more effectively compared to RNNs. This is because RNNs suffer from vanishing or exploding gradient problems when processing long sequences, which limits their ability to capture long-term dependencies. Transformers address this issue by using self-attention mechanisms that allow for capturing relationships between any two positions in a sequence, regardless of their distance. 488 | 489 | * Parallelization: Transformers can process inputs in parallel, making them more efficient in terms of computational time compared to RNNs. In RNNs, the sequential nature of computation limits parallelization since each step depends on the previous step's output. Transformers, on the other hand, process all positions in a sequence simultaneously, enabling efficient parallelization across different positions. 490 | 491 | * Scalability: Transformers are highly scalable and can handle larger input sequences without significantly increasing computational requirements. In RNNs, the computational complexity grows linearly with the length of the input sequence, making it challenging to process long sequences efficiently. Transformers, with their parallel processing and self-attention mechanisms, maintain a constant computational complexity, making them suitable for longer sequences. 492 | 493 | * Global context understanding: Transformers capture global context information effectively due to their attention mechanisms. Each position in the sequence attends to all other positions, allowing for a comprehensive understanding of the entire sequence during the encoding and decoding process. This global context understanding aids in various NLP tasks, such as machine translation, where the translation of a word can depend on the entire source sentence. 494 | 495 | * Transfer learning and fine-tuning: Transformers facilitate transfer learning and fine-tuning, which is the ability to pre-train models on large-scale datasets and then adapt them to specific downstream tasks with smaller datasets. Pretraining transformers on massive amounts of data, such as in models like BERT or GPT, helps capture rich language representations that can be fine-tuned for a wide range of NLP tasks, providing significant performance gains. 496 | 497 | ## Q27: How does the attention mechanism help transformers capture long-range dependencies in sequences? ## 498 | Answer: 499 | The attention mechanism in transformers plays a crucial role in capturing long-range dependencies in sequences. It allows each position in a sequence to attend to other positions, enabling the model to focus on relevant parts of the input during both the encoding and decoding stages. 
Here's how the attention mechanism works in transformers: 500 | 501 | * Self-Attention: Self-attention, also known as intra-attention, is the key component of the attention mechanism in transformers. It computes the importance, or attention weight, that each position in the sequence should assign to other positions. This attention weight determines how much information a position should gather from other positions. 502 | 503 | * Query, Key, and Value: To compute self-attention, each position in the sequence is associated with three learned vectors: query, key, and value. These vectors are derived from the input embeddings and transformed through linear transformations. The query vector is used to search for relevant information, the key vector represents the positions to which the query attends, and the value vector holds the information content of each position. 504 | 505 | * Attention Scores: The attention mechanism calculates attention scores between the query vector of a position and the key vectors of all other positions in the sequence. The attention scores quantify the relevance or similarity between positions. They are obtained by taking the dot product between the query and key vectors and scaling it by a factor of the square root of the dimensionality of the key vectors. 506 | 507 | * Attention Weights: The attention scores are then normalized using the softmax function to obtain attention weights. These weights determine the contribution of each position to the final representation of the current position. Positions with higher attention weights have a stronger influence on the current position's representation. 508 | 509 | * Weighted Sum: Finally, the attention weights are used to compute a weighted sum of the value vectors. This aggregation of values gives the current position a comprehensive representation that incorporates information from all relevant positions, capturing the long-range dependencies effectively. 510 | 511 | By allowing each position to attend to other positions, the attention mechanism provides a mechanism for information to flow across the entire sequence. This enables transformers to capture dependencies between distant positions, even in long sequences, without suffering from the limitations of vanishing or exploding gradients that affect traditional recurrent neural networks. Consequently, transformers excel in modeling complex relationships and dependencies in sequences, making them powerful tools for various tasks, including natural language processing and computer vision. 512 | 513 | ## Q28: What are the limitations of transformers, and what are some potential solutions? ## 514 | Answer: 515 | While transformers have revolutionized many natural language processing tasks, they do have certain limitations. Here are some notable limitations of transformers and potential solutions: 516 | 517 | * Sequential Computation: Transformers process the entire sequence in parallel, which limits their ability to model sequential information explicitly. This can be a disadvantage when tasks require strong sequential reasoning. Potential solutions include incorporating recurrent connections into transformers or using hybrid models that combine the strengths of transformers and recurrent neural networks. 518 | 519 | * Memory and Computational Requirements: Transformers consume more memory and computational resources compared to traditional sequence models, especially for large-scale models and long sequences. 
This limits their scalability and deployment on resource-constrained devices. Solutions involve developing more efficient architectures, such as sparse attention mechanisms or approximations, to reduce memory and computational requirements without sacrificing performance significantly. 520 | 521 | * Lack of Interpretability: Transformers are often considered as black-box models, making it challenging to interpret the reasoning behind their predictions. Understanding the decision-making process of transformers is an ongoing research area. Techniques such as attention visualization, layer-wise relevance propagation, and saliency maps can provide insights into the model's attention and contribution to predictions, enhancing interpretability. 522 | 523 | * Handling Out-of-Distribution Data: Transformers can struggle with data that significantly deviates from the distribution seen during training. They may make overconfident predictions or produce incorrect outputs when faced with out-of-distribution samples. Solutions include exploring uncertainty estimation techniques, robust training approaches, or incorporating external knowledge sources to improve generalization and handle out-of-distribution scenarios. 524 | 525 | * Limited Contextual Understanding: Transformers rely heavily on context information to make predictions. However, they can still struggle with understanding the broader context, especially in scenarios with complex background knowledge or multi-modal data. Incorporating external knowledge bases, leveraging graph neural networks, or combining transformers with other modalities like images or graphs can help improve contextual understanding and capture richer representations. 526 | 527 | * Training Data Requirements: Transformers typically require large amounts of labeled data for effective training due to their high capacity. Acquiring labeled data can be expensive and time-consuming, limiting their applicability to domains with limited labeled datasets. Solutions include exploring semi-supervised learning, active learning, or transfer learning techniques to mitigate the data requirements and leverage pretraining on large-scale datasets. 528 | 529 | Researchers and practitioners are actively working on addressing these limitations to further enhance the capabilities and applicability of transformers in various domains. As the field progresses, we can expect continued advancements and novel solutions to overcome these challenges. 530 | 531 | 532 | ## Q29: How are transformers trained, and what is the role of pre-training and fine-tuning? ## 533 | Answer: 534 | 535 | ## Q30: What is BERT (Bidirectional Encoder Representations from Transformers), and how does it improve language understanding tasks? ## 536 | Answer: 537 | BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based neural network model introduced by Google in 2018. It is designed to improve the understanding of natural language in various language processing tasks, such as question answering, sentiment analysis, named entity recognition, and more. 538 | 539 | BERT differs from previous language models in its ability to capture the context of a word by considering both the left and right context in a sentence. Traditional language models, like the ones based on recurrent neural networks, process text in a sequential manner, making it difficult to capture the full context. 540 | 541 | BERT, on the other hand, is a "pre-trained" model that is trained on a large corpus of unlabeled text data. 
During pre-training, BERT learns to predict missing words in sentences by considering the surrounding words on both sides. This bidirectional training allows BERT to capture contextual information effectively. 542 | 543 | Once pre-training is complete, BERT is fine-tuned on specific downstream tasks. This fine-tuning involves training the model on labeled data from a particular task, such as sentiment analysis or named entity recognition. During fine-tuning, BERT adapts its pre-trained knowledge to the specific task, further improving its understanding and performance. 544 | 545 | The key advantages of BERT include: 546 | 547 | 1. Contextual understanding: BERT can capture the contextual meaning of words by considering both the preceding and following words in a sentence, leading to better language understanding. 548 | 549 | 2. Transfer learning: BERT is pre-trained on a large corpus of unlabeled data, enabling it to learn general language representations. These pre-trained representations can then be fine-tuned for specific tasks, even with limited labeled data. 550 | 551 | 3. Versatility: BERT can be applied to a wide range of natural language processing tasks. By fine-tuning the model on specific tasks, it can achieve state-of-the-art performance in tasks such as question answering, text classification, and more. 552 | 553 | 4. Handling ambiguity: BERT's bidirectional nature helps it handle ambiguous language constructs more effectively. It can make more informed predictions by considering the context from both directions. 554 | ![0_ViwaI3Vvbnd-CJSQ](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/assets/72076328/74d6f513-8626-478c-aabb-a91ee1f70bbe) 555 | 556 | ## Q31: Describe the process of generating text using a transformer-based language model. ## 557 | Answer: 558 | 559 | ## Q32: What are some challenges or ethical considerations associated with large language models? ## 560 | Answer: 561 | 562 | ## Q33: Explain the concept of transfer learning and how it can be applied to transformers. ## 563 | Answer: 564 | 565 | Transfer learning is a machine learning technique where knowledge gained from training on one task is leveraged to improve performance on another related task. Instead of training a model from scratch on a specific task, transfer learning enables the use of pre-trained models as a starting point for new tasks. 566 | 567 | In the context of transformers, transfer learning has been highly successful, particularly with models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). 568 | 569 | Here's how transfer learning is applied to transformers: 570 | 571 | 1. Pre-training: In the pre-training phase, a transformer model is trained on a large corpus of unlabeled text data. The model is trained to predict missing words in a sentence (masked language modeling) or to predict the next word in a sequence (causal language modeling). This process enables the model to learn general language patterns, syntactic structures, and semantic relationships. 572 | 573 | 2. Fine-tuning: Once the transformer model is pre-trained, it can be fine-tuned on specific downstream tasks with smaller labeled datasets. Fine-tuning involves retraining the pre-trained model on task-specific labeled data. The model's parameters are adjusted to optimize performance on the specific task, while the pre-trained knowledge acts as a strong initialization for the fine-tuning process. 574 | 575 | a. 
Task-specific architecture: During fine-tuning, the architecture of the pre-trained transformer model is often modified or extended to accommodate the specific requirements of the downstream task. For example, in sentiment analysis, an additional classification layer may be added on top of the pre-trained model to classify text sentiment. 576 | 577 | b. Few-shot or zero-shot learning: Transfer learning with transformers allows for few-shot or even zero-shot learning scenarios. Few-shot learning refers to training a model on a small amount of labeled data, which is beneficial when data availability is limited. Zero-shot learning refers to using the pre-trained model directly on a task for which it hasn't been explicitly trained, but the model can still generate meaningful predictions based on its understanding of language. 578 | 579 | Transfer learning with transformers offers several advantages: 580 | 581 | 1. Reduced data requirements: Pre-training on large unlabeled datasets allows the model to capture general language understanding, reducing the need for massive amounts of labeled task-specific data. 582 | 583 | 2. Improved generalization: The pre-trained model has learned rich representations of language from extensive pre-training, enabling it to generalize well to new tasks and domains. 584 | 585 | 3. Efficient training: Fine-tuning a pre-trained model requires less computational resources and training time compared to training from scratch. 586 | 587 | 4. State-of-the-art performance: Transfer learning with transformers has achieved state-of-the-art performance on a wide range of NLP tasks, including text classification, named entity recognition, question answering, machine translation, and more. 588 | 589 | By leveraging the knowledge encoded in pre-trained transformers, transfer learning enables faster and more effective development of models for specific NLP tasks, even with limited labeled data. 590 | 591 | 592 | ## Q34: How can transformers be used for tasks other than natural language processing, such as computer vision? ## 593 | Answer: 594 | 595 | 596 | 597 | # Computer Vision # 598 | 599 | ## Q35: What is computer vision, and why is it important? 600 | Answer: 601 | 602 | ## Q36: Explain the concept of image segmentation and its applications. 603 | Answer: 604 | 605 | ## Q37: What is object detection, and how does it differ from image classification? 606 | Answer: 607 | 608 | ## Q38: Describe the steps involved in building an image recognition system. 609 | Answer: 610 | 611 | ## Q39: What are the challenges in implementing real-time object tracking? 612 | Answer: 613 | 614 | ## Q40: Can you explain the concept of feature extraction in computer vision? 615 | Answer: 616 | 617 | ## Q41: What is optical character recognition (OCR), and what are its main applications? 618 | Answer: 619 | 620 | ## Q42: How does a convolutional neural network (CNN) differ from a traditional neural network in the context of computer vision? 621 | Answer: 622 | 623 | ## Q43: What is the purpose of data augmentation in computer vision, and what techniques can be used? 624 | Answer: 625 | 626 | The purpose of data augmentation in computer vision is to artificially increase the size and diversity of a training dataset by applying various transformations to the original images. Data augmentation helps prevent overfitting and improves the generalization ability of deep learning models by exposing them to a broader range of variations and patterns present in the data. 
It also reduces the risk of the model memorizing specific examples in the training data. 627 | 628 | By applying different augmentation techniques, the model becomes more robust and capable of handling variations in the real-world test data that may not be present in the original training set. Common data augmentation techniques include: 629 | 630 | 1. Horizontal Flipping: Flipping images horizontally, i.e., left to right, or vice versa. This is particularly useful for tasks where the orientation of objects doesn't affect their interpretation, such as object detection or image classification. 631 | 632 | 2. Vertical Flipping: Similar to horizontal flipping but flipping images from top to bottom. 633 | 634 | 3. Random Rotation: Rotating images by a random angle. This can be helpful to simulate objects at different angles and orientations. 635 | 636 | 4. Random Crop: Taking random crops from the input images. This forces the model to focus on different parts of the image and helps in handling varying object scales. 637 | 638 | 5. Scaling and Resizing: Rescaling images to different sizes or resizing them while maintaining the aspect ratio. This augmentation helps the model handle objects of varying sizes. 639 | 640 | 6. Color Jittering: Changing the brightness, contrast, saturation, and hue of the images randomly. This augmentation can help the model become more robust to changes in lighting conditions. 641 | 642 | 7. Gaussian Noise: Adding random Gaussian noise to the images, which simulates noisy environments and enhances the model's noise tolerance. 643 | 644 | 8. Elastic Transformations: Applying local deformations to the image, simulating distortions that might occur due to variations in the imaging process. 645 | 646 | 9. Cutout: Randomly masking out portions of the image with black pixels. This helps the model learn to focus on other informative parts of the image. 647 | 648 | 10. Mixup: Combining two or more images and their corresponding labels in a weighted manner to create new training examples. This encourages the model to learn from the combined patterns of multiple images. 649 | 650 | It's important to note that the choice of data augmentation techniques depends on the specific computer vision task and the characteristics of the dataset. Additionally, augmentation should be applied only during the training phase and not during testing or evaluation to ensure that the model generalizes well to unseen data. 651 | 652 | 653 | 654 | ## Q44: Discuss some popular deep learning frameworks or libraries used for computer vision tasks. 
655 | Answer: 656 | -------------------------------------------------------------------------------- /Figures/1657144303401.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/1657144303401.jpg -------------------------------------------------------------------------------- /Figures/Batch normalization.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Batch normalization.jpg -------------------------------------------------------------------------------- /Figures/Betweem&IN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Betweem&IN.png -------------------------------------------------------------------------------- /Figures/CNN_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/CNN_architecture.png -------------------------------------------------------------------------------- /Figures/Derivation-of-the-Overall-Silhouette-Coefficient-OverallSil.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Derivation-of-the-Overall-Silhouette-Coefficient-OverallSil.png -------------------------------------------------------------------------------- /Figures/Elbow diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Elbow diagram.png -------------------------------------------------------------------------------- /Figures/Hard Vs soft voting.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Hard Vs soft voting.png -------------------------------------------------------------------------------- /Figures/Joins in SQL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Joins in SQL.png -------------------------------------------------------------------------------- /Figures/KPI.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/KPI.png -------------------------------------------------------------------------------- /Figures/Kerenl trick.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Kerenl trick.png -------------------------------------------------------------------------------- /Figures/Linear regression assumptions.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Linear regression assumptions.jpg -------------------------------------------------------------------------------- /Figures/Monte Carlo Methods.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Monte Carlo Methods.png -------------------------------------------------------------------------------- /Figures/Oversampling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Oversampling.png -------------------------------------------------------------------------------- /Figures/Python questions 8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Python questions 8.png -------------------------------------------------------------------------------- /Figures/Roc_curve.svg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Roc_curve.svg.png -------------------------------------------------------------------------------- /Figures/Sampling bias.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Sampling bias.png -------------------------------------------------------------------------------- /Figures/Screenshot 2022-07-31 201033.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Screenshot 2022-07-31 201033.png -------------------------------------------------------------------------------- /Figures/Tanh-and-its-gradient-plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Tanh-and-its-gradient-plot.png -------------------------------------------------------------------------------- /Figures/Therotical Explanation Q 7 Probability.jfif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/Therotical Explanation Q 7 Probability.jfif -------------------------------------------------------------------------------- /Figures/cross validation.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/cross validation.png -------------------------------------------------------------------------------- /Figures/ezgif.com-gif-maker.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/ezgif.com-gif-maker.jpg -------------------------------------------------------------------------------- /Figures/gradient descent vs batch gradient descent.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/gradient descent vs batch gradient descent.png -------------------------------------------------------------------------------- /Figures/grid_search_cross_validation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/grid_search_cross_validation.png -------------------------------------------------------------------------------- /Figures/long-tailed distribution.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/long-tailed distribution.jpg -------------------------------------------------------------------------------- /Figures/precision Vs recall Vs F Score.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rohanmistry231/Data-Science-Interview-Questions-Answers/20ebf4e6fb2deac82163e0b96a3420a29bf4b85f/Figures/precision Vs recall Vs F Score.png -------------------------------------------------------------------------------- /Figures/readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Machine Learning Interview Questions & Answers for Data Scientists.md: -------------------------------------------------------------------------------- 1 | # Machine Learning Interview Questions & Answers for Data Scientists # 2 | 3 | ## Questions ## 4 | * [Q1: Mention three ways to make your model robust to outliers?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q1-mention-three-ways-to-make-your-model-robust-to-outliers) 5 | * [Q2: Describe the motivation behind random forests and mention two reasons why they are better than individual decision trees?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q2-describe-the-motivation-behind-random-forests-and-mention-two-reasons-why-they-are-better-than-individual-decision-trees) 6 | * [Q3: What are the differences and similarities between gradient boosting and random forest? 
and what are the advantages and disadvantages of each when compared to each other?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q3-what-are-the-differences-and-similarities-between-gradient-boosting-and-random-forest-and-what-are-the-advantage-and-disadvantages-of-each-when-compared-to-each-other) 7 | * [Q4: What are L1 and L2 regularization? What are the differences between the two?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q4-what-are-l1-and-l2-regularization-what-are-the-differences-between-the-two) 8 | * [Q5: What are the Bias and Variance in a Machine Learning Model and explain the bias-variance trade-off?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q5-what-are-the-bias-and-variance-in-a-machine-learning-model-and-explain-the-bias-variance-trade-off) 9 | * [Q6: Mention three ways to handle missing or corrupted data in a dataset?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q6-mention-three-ways-to-handle-missing-or-corrupted-data-in-a-dataset) 10 | * [Q7: Explain briefly the logistic regression model and state an example of when you have used it recently?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q7-explain-briefly-the-logistic-regression-model-and-state-an-example-of-when-you-have-used-it-recently) 11 | * [Q8: Explain briefly batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. 
and what are the pros and cons for each of them?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q8-explain-briefly-batch-gradient-descent-stochastic-gradient-descent-and-mini-batch-gradient-descent-and-what-are-the-pros-and-cons-for-each-of-them) 12 | * [Q9: Explain what is information gain and entropy in the context of decision trees?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q9-explain-what-is-information-gain-and-entropy-in-the-context-of-decision-trees) 13 | * [Q10: Explain the linear regression model and discuss its assumption?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q10-explain-the-linear-regression-model-and-discuss-its-assumption) 14 | * [Q11: Explain briefly the K-Means clustering and how can we find the best value of K?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q11-explain-briefly-the-k-means-clustering-and-how-can-we-find-the-best-value-of-k) 15 | * [Q12: Define Precision, recall, and F1 and discuss the trade-off between them?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q12-define-precision-recall-and-f1-and-discuss-the-trade-off-between-them) 16 | * [Q13: What are the differences between a model that minimizes squared error and the one that minimizes the absolute error? 
and in which cases each error metric would be more appropriate?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q13-what-are-the-differences-between-a-model-that-minimizes-squared-error-and-the-one-that-minimizes-the-absolute-error-and-in-which-cases-each-error-metric-would-be-more-appropriate) 17 | * [Q14: Define and compare parametric and non-parametric models and give two examples for each of them?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q14-define-and-compare-parametric-and-non-parametric-models-and-give-two-examples-for-each-of-them) 18 | * [Q15: Explain the kernel trick in SVM and why we use it and how to choose what kernel to use?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q15-explain-the-kernel-trick-in-svm-and-why-we-use-it-and-how-to-choose-what-kernel-to-use) 19 | * [Q16: Define the cross-validation process and the motivation behind using it?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q16-define-the-cross-validation-process-and-the-motivation-behind-using-it) 20 | * [Q17: You are building a binary classifier and you found that the data is imbalanced, what should you do to handle this situation?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q17-you-are-building-a-binary-classifier-and-you-found-that-the-data-is-imbalanced-what-should-you-do-to-handle-this-situation) 21 | * [Q18: You are working on a clustering problem, what are different evaluation metrics that can be used, and how to choose between them?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q18-you-are-working-on-a-clustering-problem-what-are-different-evaluation-metrics-that-can-be-used-and-how-to-choose-between-them) 22 | * [Q19: What is the ROC curve and when should you use it?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q19-what-is-the-roc-curve-and-when-should-you-use-it) 23 | * [Q20: What is the difference between hard and soft voting classifiers in the context of ensemble learners?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q20-what-is-the-difference-between-hard-and-soft-voting-classifiers-in-the-context-of-ensemble-learners) 24 | * [Q21: What is boosting in the context of ensemble learners discuss two famous boosting methods](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q21-what-is-boosting-in-the-context-of-ensemble-learners-discuss-two-famous-boosting-methods) 25 | * [Q22: How can you evaluate the performance of a dimensionality reduction algorithm on your 
dataset?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q22-how-can-you-evaluate-the-performance-of-a-dimensionality-reduction-algorithm-on-your-dataset) 26 | * [Q23: Define the curse of dimensionality and how to solve it. ](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q23-define-the-curse-of-dimensionality-and-how-to-solve-it) 27 | * [Q24: In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q24-in-what-cases-would-you-use-vanilla-pca-incremental-pca-randomized-pca-or-kernel-pca) 28 | * [Q25: Discuss two clustering algorithms that can scale to large datasets](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q25-discuss-two-clustering-algorithms-that-can-scale-to-large-datasets) 29 | * [Q26: Do you need to scale your data if you will be using the SVM classifier and discuss your answer](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q26-do-you-need-to-scale-your-data-if-you-will-be-using-the-svm-classifier-and-discus-your-answer) 30 | * [Q27: What are Loss Functions and Cost Functions? Explain the key Difference Between them.](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q27-what-are-loss-functions-and-cost-functions-explain-the-key-difference-between-them) 31 | * [Q28: What is the importance of batch in machine learning and explain some batch depend on gradient descent algorithm?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q28-what-is-the-importance-of-batch-in-machine-learning-and-explain-some-batch-depend-gradient-descent-algorithm) 32 | * [Q29: What are the different methods to split a tree in a decision tree algorithm?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q29-what-are-the-different-methods-to-split-a-tree-in-a-decision-tree-algorithm) 33 | * [Q30: Why boosting is a more stable algorithm as compared to other ensemble algorithms? 
](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q30-why-boosting-is-a-more-stable-algorithm-as-compared-to-other-ensemble-algorithms) 34 | * [Q31: What is active learning and discuss one strategy of it?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q31-what-is-active-learning-and-discuss-one-strategy-of-it) 35 | * [Q32: What are the different approaches to implementing recommendation systems?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20&%20Answers%20for%20Data%20Scientists.md#:~:text=the%20best%20instances.-,Q32%3A%20What%20are%20the%20different%20approaches%20to%20implementing%20recommendation%20systems%3F,-Answer%3A) 36 | * [Q33: What are the evaluation metrics that can be used for multi-label classification?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20&%20Answers%20for%20Data%20Scientists.md#:~:text=Q33%3A%20What%20are%20the%20evaluation%20metrics%20that%20can%20be%20used%20for%20multi%2Dlabel%20classification%3F) 37 | * [Q34: What is the difference between concept and data drift and how to overcome each of them?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20&%20Answers%20for%20Data%20Scientists.md#:~:text=at%20k%20(MAP%40k)-,Q34%3A%20What%20is%20the%20difference%20between%20concept%20and%20data%20drift%20and%20how%20to%20overcome%20each%20of%20them%3F,-Answer%3A) 38 | * [Q35: Can you explain the ARIMA model and its components?]() 39 | * [Q36: What are the assumptions made by the ARIMA model?]() 40 | 41 | ------------------------------------------------------------------------------------------------------------------------------------------------------------- 42 | 43 | ## Questions & Answers ## 44 | 45 | ### Q1: Mention three ways to make your model robust to outliers. ### 46 | 47 | Investigating the outliers is always the first step in understanding how to treat them. After you understand the nature of why the outliers occurred you can apply one of the several methods mentioned [here](https://365datascience.com/career-advice/job-interview-tips/machine-learning-interview-questions-and-answers/#11:~:text=for%20large%20datasets.-,Bonus%20Question%3A%20Discuss%20how%20to%20make%20your%20model%20robust%20to%20outliers.,-There%20are%20several). 48 | 49 | ### Q2: Describe the motivation behind random forests and mention two reasons why they are better than individual decision trees. ### 50 | 51 | The motivation behind random forest or ensemble models in general in layman's terms, Let's say we have a question/problem to solve we bring 100 people and ask each of them the question/problem and record their solution. The rest of the answer is [here](https://365datascience.com/career-advice/job-interview-tips/machine-learning-interview-questions-and-answers/#2:~:text=2.%20Describe%20the%20motivation%20behind%20random%20forests%20and%20mention%20two%20reasons%20why%20they%20are%20better%20than%20individual%20decision%C2%A0trees.) 52 | 53 | ### Q3: What are the differences and similarities between gradient boosting and random forest? 
and what are the advantages and disadvantages of each when compared to each other? ### 54 | 55 | Similarities: 56 | 1. Both these algorithms are decision-tree-based algorithms 57 | 2. Both these algorithms are ensemble algorithms 58 | 3. Both are flexible models and do not need much data preprocessing. 59 | 60 | The rest of the answer is [here](https://365datascience.com/career-advice/job-interview-tips/machine-learning-interview-questions-and-answers/#2:~:text=3.%20What%20are%20the%20differences%20and%20similarities%20between%20gradient%20boosting%20and%20random%20forest%3F%20And%20what%20are%20the%20advantages%20and%20disadvantages%20of%20each%20when%20compared%20to%20each%C2%A0other%3F) 61 | 62 | ### Q4: What are L1 and L2 regularization? What are the differences between the two? ### 63 | 64 | Answer: 65 | 66 | Regularization is a technique used to avoid overfitting by trying to make the model simpler. The rest of the answer is [here](https://365datascience.com/career-advice/job-interview-tips/machine-learning-interview-questions-and-answers/#2:~:text=6.%20What%20are%20L1%20and%20L2%20regularizations%3F%20What%20are%20the%20differences%20between%20the%C2%A0two%3F) 67 | ### Q5: What are the Bias and Variance in a Machine Learning Model and explain the bias-variance trade-off? ### 68 | 69 | Answer: 70 | 71 | The goal of any supervised machine learning model is to estimate the mapping function (f) that predicts the target variable (y) given input (x). The prediction error can be broken down into three parts: 72 | The rest of the answer is [here](https://365datascience.com/career-advice/job-interview-tips/machine-learning-interview-questions-and-answers/#2:~:text=8.%20What%20are%20the%20bias%20and%20variance%20in%20a%20machine%20learning%20model%20and%20explain%20the%20bias%2Dvariance%20trade%2Doff%3F) 73 | 74 | ### Q6: Mention three ways to handle missing or corrupted data in a dataset. ### 75 | 76 | Answer: 77 | 78 | In general, real-world data often has a lot of missing values. The cause of missing values can be data corruption or failure to record data. The rest of the answer is [here](https://365datascience.com/career-advice/job-interview-tips/machine-learning-interview-questions-and-answers/#9:~:text=10.-,Mention%20three%20ways%20to%20handle%20missing%20or%20corrupted%20data%20in%20a%C2%A0dataset.,-In%20general%2C%20real) 79 | 80 | ### Q7: Explain briefly the logistic regression model and state an example of when you have used it recently. ### 81 | 82 | Answer: 83 | 84 | Logistic regression is used to calculate the probability of occurrence of an event in the form of a dependent output variable based on independent input variables. Logistic regression is commonly used to estimate the probability that an instance belongs to a particular class. If the probability is bigger than 0.5, then the instance will belong to that class (positive), and if it is below 0.5, it will belong to the other class. This makes it a binary classifier. 85 | 86 | It is important to remember that logistic regression isn't a classification model in itself; it's an ordinary regression algorithm that was developed and used before machine learning, but it can be used for classification when we put a threshold on its output to determine specific categories. 87 | 88 | There are many classification applications for it: 89 | 90 | classifying emails as spam or not, identifying whether a patient is healthy or not, and so on.
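As a minimal sketch of this idea (using scikit-learn on a synthetic, made-up dataset standing in for a spam vs. not-spam problem, so the feature count and numbers are only illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical binary data standing in for "spam vs. not spam"
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic regression outputs a probability; thresholding it at 0.5 gives the class
clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]   # estimated P(class = 1) for each test instance
preds = (probs >= 0.5).astype(int)        # the same rule predict() applies by default
print("First 5 hard predictions:", preds[:5])
print("Test accuracy:", clf.score(X_test, y_test))
```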
91 | 92 | ### Q8: Explain briefly batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. and what are the pros and cons for each of them? ### 93 | 94 | Gradient descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of gradient descent is to tweak parameters iteratively in order to minimize a cost function. 95 | 96 | Batch Gradient Descent: 97 | In Batch Gradient Descent, the whole training data is used to minimize the loss function by calculating the gradient (the direction of descent) on the full dataset and taking a step toward the nearest minimum. 98 | 99 | Pros: 100 | Since the whole dataset is used to calculate the gradient, it will be stable and reach the minimum of the cost function without bouncing (if the learning rate is chosen correctly). 101 | 102 | Cons: 103 | 104 | Since batch gradient descent uses all the training set to compute the gradient at every step, it will be very slow, especially if the size of the training data is large. 105 | 106 | 107 | Stochastic Gradient Descent: 108 | 109 | Stochastic Gradient Descent picks up a random instance in the training data set at every step and computes the gradient based only on that single instance. 110 | 111 | Pros: 112 | 1. It makes the training much faster as it only works on one instance at a time. 113 | 2. It becomes easier to train on large datasets. 114 | 115 | Cons: 116 | 117 | Due to the stochastic (random) nature of this algorithm, it is much less regular than batch gradient descent. Instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, not settling down there. So once the algorithm stops, the final parameters are good but not optimal. For this reason, it is important to use a learning schedule (gradually reducing the learning rate) to overcome this randomness. 118 | 119 | Mini-batch Gradient Descent: 120 | 121 | At each step, instead of computing the gradients on the whole dataset as in Batch Gradient Descent or on one random instance as in Stochastic Gradient Descent, this algorithm computes the gradients on small random sets of instances called mini-batches. 122 | 123 | Pros: 124 | 1. The algorithm's progress in parameter space is less erratic than with Stochastic Gradient Descent, especially with large mini-batches. 125 | 2. You can get a performance boost from hardware optimization of matrix operations, especially when using GPUs. 126 | 127 | Cons: 128 | 1. It might be difficult to escape from local minima. 129 | 130 | ![alt text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/gradient%20descent%20vs%20batch%20gradient%20descent.png) 131 | 132 | ### Q9: Explain what is information gain and entropy in the context of decision trees. ### 133 | Entropy and Information Gain are two key metrics used when constructing a decision tree model to determine which feature to split on at each node and what the best split is. 134 | 135 | The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until we reach a small enough set that contains data points that fall under one label. 136 | 137 | Entropy is the measure of impurity, disorder, or uncertainty in a bunch of examples. Entropy controls how a Decision Tree decides to split the data.
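For intuition, here is a small sketch (with made-up label counts) of how the entropy of a node, and the reduction in entropy achieved by a candidate split, could be computed:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical parent node and one candidate split into two children
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
left = np.array([1, 1, 1, 1, 0])   # mostly class 1
right = np.array([0, 0, 0, 0, 0])  # pure class 0

# Entropy after the split is the size-weighted average of the children's entropies
children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print("Parent entropy:  ", round(entropy(parent), 3))
print("Reduction (gain):", round(entropy(parent) - children, 3))
```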
138 | Information gain calculates the reduction in entropy or surprise from transforming a dataset in some way. It is commonly used in the construction of decision trees from a training dataset, by evaluating the information gain for each variable and selecting the variable that maximizes the information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification. 139 | 140 | ### Q10: Explain the linear regression model and discuss its assumption. ### 141 | Linear regression is a supervised statistical model to predict dependent variable quantity based on independent variables. 142 | Linear regression is a parametric model and the objective of linear regression is that it has to learn coefficients using the training data and predict the target value given only independent values. 143 | 144 | Some of the linear regression assumptions and how to validate them: 145 | 146 | 1. Linear relationship between independent and dependent variables 147 | 2. Independent residuals and the constant residuals at every x 148 | We can check for 1 and 2 by plotting the residuals(error terms) against the fitted values (upper left graph). Generally, we should look for a lack of patterns and a consistent variance across the horizontal line. 149 | 3. Normally distributed residuals 150 | We can check for this using a couple of methods: 151 | * Q-Q-plot(upper right graph): If data is normally distributed, points should roughly align with the 45-degree line. 152 | * Boxplot: it also helps visualize outliers 153 | * Shapiro–Wilk test: If the p-value is lower than the chosen threshold, then the null hypothesis (Data is normally distributed) is rejected. 154 | 4. Low multicollinearity 155 | * you can calculate the VIF (Variable Inflation Factors) using your favorite statistical tool. If the value for each covariate is lower than 10 (some say 5), you're good to go. 156 | 157 | The figure below summarizes these assumptions. 158 | ![alt text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Linear%20regression%20assumptions.jpg) 159 | 160 | ### Q11: Explain briefly the K-Means clustering and how can we find the best value of K? ### 161 | K-Means is a well-known clustering algorithm. K-means clustering is often used because it is easy to interpret and implement. The rest of the answer is [here](https://365datascience.com/career-advice/job-interview-tips/machine-learning-interview-questions-and-answers/#2:~:text=4.%20Briefly%20explain%20the%20K%2DMeans%20clustering%20and%20how%20can%20we%20find%20the%20best%20value%20of%C2%A0K.) 162 | 163 | ### Q12: Define Precision, recall, and F1 and discuss the trade-off between them? ### 164 | 165 | Precision and recall are two classification evaluation metrics that are used beyond accuracy. The rest of the answer is [here](https://365datascience.com/career-advice/job-interview-tips/machine-learning-interview-questions-and-answers/#9:~:text=9.-,Define%20precision%2C%20recall%2C%20and%20F1%20and%20discuss%20the%20trade%2Doff%20between%C2%A0them.,-Precision%20and%20recall) 166 | 167 | ### Q13: What are the differences between a model that minimizes squared error and the one that minimizes the absolute error? and in which cases each error metric would be more appropriate? ### 168 | 169 | Both mean square error (MSE) and mean absolute error (MAE) measures the distances between vectors and express average model prediction in units of the target variable. 
Both can range from 0 to infinity, the lower they are the better the model. 170 | 171 | The main difference between them is that in MSE the errors are squared before being averaged while in MAE they are not. This means that a large weight will be given to large errors. MSE is useful when large errors in the model are trying to be avoided. This means that outliers affect MSE more than MAE, that is why MAE is more robust to outliers. 172 | Computation-wise MSE is easier to use as the gradient calculation will be more straightforward than MAE, which requires linear programming to calculate it. 173 | 174 | ### Q14: Define and compare parametric and non-parametric models and give two examples for each of them? ### 175 | 176 | Answer: 177 | 178 | **Parametric models** assume that the dataset comes from a certain function with some set of parameters that should be tuned to reach the optimal performance. For such models, the number of parameters is determined prior to training, thus the degree of freedom is limited and reduces the chances of overfitting. 179 | 180 | Ex. Linear Regression, Logistic Regression, LDA 181 | 182 | **Nonparametric models** don't assume anything about the function from which the dataset was sampled. For these models, the number of parameters is not determined prior to training, thus they are free to generalize the model based on the data. Sometimes these models overfit themselves while generalizing. To generalize they need more data in comparison with Parametric Models. They are relatively more difficult to interpret compared to Parametric Models. 183 | 184 | Ex. Decision Tree, Random Forest. 185 | 186 | ### Q15: Explain the kernel trick in SVM. Why do we use it and how to choose what kernel to use? ### 187 | Answer: 188 | Kernels are used in SVM to map the original input data into a particular higher dimensional space where it will be easier to find patterns in the data and train the model with better performance. 189 | 190 | For eg.: If we have binary class data which form a ring-like pattern (inner and outer rings representing two different class instances) when plotted in 2D space, a linear SVM kernel will not be able to differentiate the two classes well when compared to an RBF (radial basis function) kernel, mapping the data into a particular higher dimensional space where the two classes are clearly separable. 191 | 192 | Typically without the kernel trick, in order to calculate support vectors and support vector classifiers, we need first to transform data points one by one to the higher dimensional space, do the calculations based on SVM equations in the higher dimensional space, and then return the results. The ‘trick’ in the kernel trick is that we design the kernels based on some conditions as mathematical functions that are equivalent to a dot product in the higher dimensional space without even having to transform data points to the higher dimensional space. i.e. we can calculate support vectors and support vector classifiers in the same space where the data is provided which saves a lot of time and calculations. 193 | 194 | Having domain knowledge can be very helpful in choosing the optimal kernel for your problem, however, in the absence of such knowledge following this default rule can be helpful: 195 | For linear problems, we can try linear or logistic kernels, and for nonlinear problems, we can use RBF or Gaussian kernels. 
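As a small illustrative sketch of this (scikit-learn on synthetic ring-shaped data, mirroring the two-rings example above; the exact accuracies will vary), a linear kernel struggles on the rings while an RBF kernel separates them easily:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "test accuracy:", round(clf.score(X_test, y_test), 3))

# The RBF kernel implicitly works in a higher-dimensional feature space
# without ever transforming the points explicitly -- the kernel trick.
```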
196 | 197 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Kerenl%20trick.png) 198 | 199 | ### Q16: Define the cross-validation process and the motivation behind using it. ### 200 | Cross-validation is a technique used to assess the performance of a learning model on several subsamples of the training data. In general, we split the data into train and test sets, where we use the training data to train our model, the test data to evaluate the performance of the model on unseen data, and a validation set for choosing the best hyperparameters. Now, a random split in most cases (for large datasets) is fine. However, for smaller datasets, it is susceptible to losing important information present in the part of the data the model was not trained on. Hence, cross-validation, though computationally expensive, combats this issue. 201 | 202 | The process of cross-validation is as follows: 203 | 204 | 1. Define k, the number of folds. 205 | 2. Randomly shuffle the data into k equally-sized blocks (folds). 206 | 3. For each fold i from 1 to k, train the model on all the folds except fold i and test it on fold i. 207 | 4. Average the k validation/test errors from the previous step to get an estimate of the error. 208 | 209 | This process aims to accomplish the following: 210 | 1- Prevent overfitting during training by avoiding training and testing on the same subset of the data points. 211 | 212 | 2- Avoid information loss by using a certain subset of the data for validation only. This is important for small datasets. 213 | 214 | Cross-validation is always a good choice for small datasets; if used on large datasets, the computational cost will increase depending on the number of folds. 215 | 216 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/cross%20validation.png) 217 | 218 | ### Q17: You are building a binary classifier and you found that the data is imbalanced, what should you do to handle this situation? ### 219 | Answer: 220 | If there is a data imbalance there are several measures we can take to train a fairer binary classifier: 221 | 222 | **1. Pre-Processing:** 223 | 224 | * Check whether you can get more data or not. 225 | 226 | * Use sampling techniques (oversample the minority class, downsample the majority class, or take a hybrid approach). We can also use data augmentation to add more data points for the minority class, with small deviations/changes leading to new data points that are similar to the ones they are derived from. The most common/popular technique is SMOTE (Synthetic Minority Oversampling Technique). 227 | 228 | * Suppression: Though not recommended, we can drop some features directly responsible for the imbalance. 229 | 230 | * Learning Fair Representation: Project the training examples onto a subspace or plane that minimizes the data imbalance. 231 | 232 | * Re-Weighting: We can assign some weights to each training example to reduce the imbalance in the data. 233 | 234 | **2. In-Processing:** 235 | 236 | * Regularisation: We can add score terms to the loss function that measure the data imbalance; minimizing the loss function will then also minimize the degree of imbalance with respect to the chosen score, which indirectly minimizes other metrics that measure the degree of data imbalance.
237 | 238 | * Adversarial Debiasing: Here we use the adversarial notion to train the model, where the discriminator tries to detect if there are signs of data imbalance in the data predicted by the generator, and hence the generator learns to generate data that is less prone to imbalance. 239 | 240 | **3. Post-Processing:** 241 | * Odds-Equalization: Here we post-process the trained model's predictions so that the odds (error rates) are equalized across the classes affected by the imbalance, correcting for that imbalance in the trained model. 242 | 243 | * Choose appropriate performance metrics. For example, accuracy is not a correct metric to use when classes are imbalanced. Instead, use precision, recall, the F1 score, and the ROC curve. Usually, the F1 score is a good choice if both precision and recall are important. 244 | 245 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Oversampling.png) 246 | 247 | ### Q18: You are working on a clustering problem, what are different evaluation metrics that can be used, and how to choose between them? ### 248 | 249 | Answer: 250 | 251 | Clusters are evaluated based on some similarity or dissimilarity measure such as the distance between cluster points. If the clustering algorithm separates dissimilar observations apart and groups similar observations together, then it has performed well. The two most popular evaluation metrics for clustering algorithms are the 𝐒𝐢𝐥𝐡𝐨𝐮𝐞𝐭𝐭𝐞 𝐜𝐨𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭 and 𝐃𝐮𝐧𝐧’𝐬 𝐈𝐧𝐝𝐞𝐱. 252 | 253 | 𝐒𝐢𝐥𝐡𝐨𝐮𝐞𝐭𝐭𝐞 𝐜𝐨𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭 254 | The Silhouette Coefficient is defined for each sample and is composed of two scores: 255 | a: The mean distance between a sample and all other points in the same cluster. 256 | b: The mean distance between a sample and all other points in the next nearest cluster. 257 | 258 | S = (b-a) / max(a,b) 259 | 260 | The 𝐒𝐢𝐥𝐡𝐨𝐮𝐞𝐭𝐭𝐞 𝐜𝐨𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭 for a set of samples is given as the mean of the Silhouette Coefficient for each sample. The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters. The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster. 261 | 262 | Dunn’s Index 263 | 264 | Dunn’s Index (DI) is another metric for evaluating a clustering algorithm. Dunn’s Index is equal to the minimum inter-cluster distance divided by the maximum cluster size. Note that large inter-cluster distances (better separation) and smaller cluster sizes (more compact clusters) lead to a higher DI value. A higher DI implies better clustering. It assumes that better clustering means that clusters are compact and well-separated from other clusters. 265 | 266 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Derivation-of-the-Overall-Silhouette-Coefficient-OverallSil.png) 267 | 268 | ### Q19: What is the ROC curve and when should you use it? ### 269 | 270 | Answer: 271 | 272 | The ROC curve (Receiver Operating Characteristic curve) is a graphical representation of the model's performance, where we plot the True Positive Rate (TPR) against the False Positive Rate (FPR) for different classification threshold values between 0 and 1 applied to the model's output. 273 | 274 | This ROC curve is mainly used to compare two or more models, as shown in the figure below. Now, it is easy to see that a reasonable model will always give an FPR (since it is an error rate) lower than its TPR, so the curve hugs the upper-left corner of the unit square (0 to 1 on the TPR axis and 0 to 1 on the FPR axis).
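As a minimal sketch of how these quantities are usually computed in practice (scikit-learn on a made-up imbalanced dataset; names and numbers are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced binary dataset (about 90% negatives)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The ROC curve needs scores/probabilities, not hard class predictions
scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))
```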
275 | 276 | The more the AUC(area under the curve) for a model's ROC curve, the better the model in terms of prediction accuracy in terms of TPR and FPR. 277 | 278 | Here are some benefits of using the ROC Curve : 279 | 280 | * Can help prioritize either true positives or true negatives depending on your case study (Helps you visually choose the best hyperparameters for your case) 281 | 282 | * Can be very insightful when we have unbalanced datasets 283 | 284 | * Can be used to compare different ML models by calculating the area under the ROC curve (AUC) 285 | 286 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Roc_curve.svg.png) 287 | 288 | ### Q20: What is the difference between hard and soft voting classifiers in the context of ensemble learners? ### 289 | 290 | Answer: 291 | 292 | * Hard Voting: We take into account the class predictions for each classifier and then classify an input based on the maximum votes to a particular class. 293 | 294 | * Soft Voting: We take into account the probability predictions for each class by each classifier and then classify an input to the class with maximum probability based on the average probability (averaged over the classifier's probabilities) for that class. 295 | 296 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Hard%20Vs%20soft%20voting.png) 297 | 298 | ### Q21: What is boosting in the context of ensemble learners discuss two famous boosting methods ### 299 | 300 | Answer: 301 | 302 | Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. 303 | 304 | There are many boosting methods available, but by far the most popular are: 305 | 306 | * Adaptive Boosting: One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor under-fitted. This results in new predictors focusing more and more on the hard cases. 307 | * Gradient Boosting: Another very popular Boosting algorithm is Gradient Boosting. Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration as AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor. 308 | 309 | ![1661788022018](https://user-images.githubusercontent.com/72076328/187241588-6cc3166f-a3e0-46b9-a0ce-e3d9ef9f0228.jpg) 310 | 311 | ### Q22: How can you evaluate the performance of a dimensionality reduction algorithm on your dataset? ### 312 | 313 | Answer: 314 | 315 | Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation. 316 | 317 | Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm (e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much information, then the algorithm should perform just as well as when using the original dataset. 
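
As a rough illustration of the reconstruction-error idea in Q22, here is a minimal sketch using scikit-learn's PCA on its bundled digits dataset (the dataset, the 20-component choice, and the use of mean squared error are assumptions made for the example, not part of the original answer):

```
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)              # 1797 samples, 64 features

pca = PCA(n_components=20).fit(X)                # keep 20 of the 64 dimensions
X_reduced = pca.transform(X)                     # compressed representation
X_recovered = pca.inverse_transform(X_reduced)   # reverse transformation

reconstruction_mse = np.mean((X - X_recovered) ** 2)
print("Reconstruction MSE:", reconstruction_mse)
print("Variance kept:", pca.explained_variance_ratio_.sum())
```

The same check applies to any reducer that exposes an inverse transformation; when it does not, measuring the downstream model's performance, as described above, is the practical fallback.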
318 | 319 | ### Q23: Define the curse of dimensionality and how to solve it. ### 320 | 321 | Answer: 322 | Curse of dimensionality represents the situation when the amount of data is too few to be represented in a high-dimensional space, as it will be highly scattered in that high-dimensional space and it becomes more probable that we overfit this data. If we increase the number of features, we are implicitly increasing model complexity and if we increase model complexity we need more data. 323 | 324 | Possible solutions are: 325 | Remove irrelevant features not discriminating classes correlated or features not resulting in much improvement, we can use: 326 | 327 | * Feature selection(select the most important ones). 328 | * Feature extraction(transform current feature dimensionality into a lower dimension preserving the most possible amount of information like PCA ). 329 | 330 | ![Curse of dim'](https://user-images.githubusercontent.com/72076328/188653089-8999ea59-9511-4d52-baff-15a652e117a9.png) 331 | 332 | ### Q24: In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA? ### 333 | 334 | 335 | Answer: 336 | 337 | Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don't fit in memory, but it is slower than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks when you need to apply PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Kernel PCA is useful for nonlinear datasets. 338 | 339 | 340 | ### Q25: Discuss two clustering algorithms that can scale to large datasets ### 341 | 342 | Answer: 343 | 344 | **Minibatch Kmeans:** Instead of using the full dataset at each iteration, the algorithm 345 | is capable of using mini-batches, moving the centroids just slightly at each iteration. 346 | This speeds up the algorithm typically by a factor of 3 or 4 and makes it 347 | possible to cluster huge datasets that do not fit in memory. Scikit-Learn implements 348 | this algorithm in the MiniBatchKMeans class. 349 | 350 | **Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)**  351 | is a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered instead of clustering the larger dataset. 352 | 353 | ### Q26: Do you need to scale your data if you will be using the SVM classifier and discus your answer ### 354 | Answer: 355 | Yes, feature scaling is required for SVM and all margin-based classifiers since the optimal hyperplane (the decision boundary) is dependent on the scale of the input features. In other words, the distance between two observations will differ for scaled and non-scaled cases, leading to different models being generated. 356 | 357 | This can be seen in the figure below, when the features have different scales, we can see that the decision boundary and the support vectors are only classifying the X1 features without taking into consideration the X0 feature, however after scaling the data to the same scale the decision boundaries and support vectors are looking much better and the model is taking into account both features. 
358 | 359 | To scale the data, normalization, and standardization are the most popular approaches. 360 | ![SVM scaled Vs non scaled](https://user-images.githubusercontent.com/72076328/192571498-4a939472-7bb1-4bf2-963f-a6e6394802ba.png) 361 | 362 | ### Q27: What are Loss Functions and Cost Functions? Explain the key Difference Between them. ### 363 | 364 | Answer: 365 | The loss function is the measure of the performance of the model on a single training example, whereas the cost function is the average loss function over all training examples or across the batch in the case of mini-batch gradient descent. 366 | 367 | Some examples of loss functions are Mean Squared Error, Binary Cross Entropy, etc. 368 | 369 | Whereas, the cost function is the average of the above loss functions over training examples. 370 | 371 | ### Q28: What is the importance of batch in machine learning and explain some batch-dependent gradient descent algorithms? ### 372 | 373 | Answer: 374 | In the memory, the dataset can load either completely at once or in the form of a set. If we have a huge size of the dataset, then loading the whole data into memory will reduce the training speed, hence batch term is introduced. 375 | 376 | Example: image data contains 1,00,000 images, we can load this into 3125 batches where 1 batch = 32 images. So instead of loading the whole 1,00,000 images in memory, we can load 32 images 3125 times which requires less memory. 377 | 378 | In summary, a batch is important in two ways: (1) Efficient memory consumption. (2) Improve training speed. 379 | 380 | There are 3 types of gradient descent algorithms based on batch size: (1) Stochastic gradient descent (2) Batch gradient descent (3) Mini Batch gradient descent 381 | 382 | If the whole data is in a single batch, it is called batch gradient descent. If the single data points are equal to one batch i.e. number of batches = number of data instances, it is called stochastic gradient descent. If the number of batches is less than the number of data points or greater than 1, it is known as mini-batch gradient descent. 383 | 384 | ### Q29: What are the different methods to split a tree in a decision tree algorithm? ### 385 | 386 | Answer: 387 | 388 | Decision trees can be of two types regression and classification. 389 | For classification, classification accuracy created a lot of instability. So the following loss functions are used: 390 | - Gini's Index 391 | Gini impurity is used to predict the likelihood of a randomly chosen example being incorrectly classified by a particular node. It’s referred to as an “impurity” measure because it demonstrates how the model departs from a simple division. 392 | 393 | - Cross-entropy or Information Gain 394 | Information gain refers to the process of identifying the most important features/attributes that convey the most information about a class. The entropy principle is followed with the goal of reducing entropy from the root node to the leaf nodes. Information gain is the difference in entropy before and after splitting, which describes the impurity of in-class items. 395 | 396 | 397 | For regression, the good old mean squared error serves as a good loss function which is minimized by splits of the input features and predicting the mean value of the target feature on the subspaces resulting from the split. But finding the split that results in the minimum possible residual sum of squares is computationally infeasible, so a greedy top-down approach is taken i.e. 
the splits are made at a level from top to down which results in maximum reduction of RSS. We continue this until some maximum depth or number of leaves is attained. 398 | 399 | ### Q30: Why boosting is a more stable algorithm as compared to other ensemble algorithms? ### 400 | 401 | Answer: 402 | 403 | Boosting algorithms focus on errors found in previous iterations until they become obsolete. Whereas in bagging there is no corrective loop. That’s why boosting is a more stable algorithm compared to other ensemble algorithms. 404 | 405 | ### Q31: What is active learning and discuss one strategy of it? ### 406 | 407 | Answer: 408 | Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs. In statistics literature, it is sometimes referred to as optimal experimental design. 409 | 410 | 1. Stream-based sampling 411 | In stream-based selective sampling, unlabelled data is continuously fed to an active learning system, where the learner decides whether to send the same to a human oracle or not based on a predefined learning strategy. This method is apt in scenarios where the model is in production and the data sources/distributions vary over time. 412 | 413 | 2. Pool-based sampling 414 | In this case, the data samples are chosen from a pool of unlabelled data based on the informative value scores and sent for manual labeling. Unlike stream-based sampling, oftentimes, the entire unlabelled dataset is scrutinized for the selection of the best instances. 415 | 416 | ![1669836673164](https://user-images.githubusercontent.com/72076328/204896144-43b2181a-d9ce-471b-95d0-44c0f7bb3025.jpg) 417 | 418 | ### Q32: What are the different approaches to implementing recommendation systems? ### 419 | Answer: 420 | 1. 𝐂𝐨𝐧𝐭𝐞𝐧𝐭-𝐁𝐚𝐬𝐞𝐝 𝐅𝐢𝐥𝐭𝐞𝐫𝐢𝐧𝐠: Content-Based Filtering depends on similarities of items and users' past activities on the website to recommend any product or service. 421 | 422 | This filter helps in avoiding a cold start for any new products as it doesn't rely on other users' feedback, it can recommend products based on similarity factors. However, content-based filtering needs a lot of domain knowledge so that the recommendations made are 100 percent accurate. 423 | 424 | 2. 𝐂𝐨𝐥𝐥𝐚𝐛𝐨𝐫𝐚𝐭𝐢𝐯𝐞-𝐁𝐚𝐬𝐞𝐝 𝐅𝐢𝐥𝐭𝐞𝐫𝐢𝐧𝐠: The primary job of a collaborative filtering system is to overcome the shortcomings of content-based filtering. 425 | 426 | So, instead of focusing on just one user, the collaborative filtering system focuses on all the users and clusters them according to their interests. 427 | 428 | Basically, it recommends a product 'x' to user 'a' based on the interest of user 'b'; users 'a' and 'b' must have had similar interests in the past, which is why they are clustered together. 429 | 430 | The domain knowledge that is required for collaborative filtering is less, recommendations made are more accurate and it can adapt to the changing tastes of users over time. However, collaborative filtering faces the problem of a cold start as it heavily relies on feedback or activity from other users. 431 | 432 | 3. 𝐇𝐲𝐛𝐫𝐢𝐝 𝐟𝐢𝐥𝐭𝐞𝐫𝐢𝐧𝐠: A mixture of content and collaborative methods. Uses descriptors and interactions. 433 | 434 | More modern approaches typically fall into the hybrid filtering category and tend to work in two stages: 435 | 436 | 1). 
A candidate generation phase where we coarsely generate candidates from a corpus of hundreds of thousands, millions, or billions of items down to a few hundred or thousand 437 | 438 | 2) A ranking phase where we re-rank the candidates into a final top-n set to be shown to the user. Some systems employ multiple candidate generation methods and rankers. 439 | 440 | ### Q33: What are the evaluation metrics that can be used for multi-label classification? ### 441 | 442 | Answer: 443 | 444 | Multi-label classification is a type of classification problem where each instance can be assigned to multiple classes or labels simultaneously. 445 | 446 | The evaluation metrics for multi-label classification are designed to measure the performance of a multi-label classifier in predicting the correct set of labels for each instance. 447 | Some commonly used evaluation metrics for multi-label classification are: 448 | 449 | 1. Hamming Loss: Hamming Loss is the fraction of labels that are incorrectly predicted. It is defined as the average number of labels that are predicted incorrectly per instance. 450 | 451 | 2. Accuracy: Accuracy is the fraction of instances that are correctly predicted. In multi-label classification, accuracy is calculated as the percentage of instances for which all labels are predicted correctly. 452 | 453 | 3. Precision, Recall, F1-Score: These metrics can be applied to each label separately, treating the classification of each label as a separate binary classification problem. Precision measures the proportion of predicted positive labels that are correct, recall measures the proportion of actual positive labels that are correctly predicted, and F1-score is the harmonic mean of precision and recall. 454 | 455 | 4. Macro-F1, Micro-F1: Macro-F1 and Micro-F1 are two types of F1-score metrics that take into account the label imbalance in the dataset. Macro-F1 calculates the F1-score for each label and then averages them, while Micro-F1 calculates the overall F1-score by aggregating the true positive, false positive, and false negative counts across all labels. 456 | 457 | There are other metrics that can be used such as: 458 | * Precision at k (P@k) 459 | * Average precision at k (AP@k) 460 | * Mean average precision at k (MAP@k) 461 | 462 | 463 | ### Q34: What is the difference between concept and data drift and how to overcome each of them? ### 464 | 465 | Answer: 466 | 467 | Concept drift and data drift are two different types of problems that can occur in machine learning systems. 468 | 469 | Concept drift refers to changes in the underlying relationships between the input data and the target variable over time. This means that the distribution of the data that the model was trained on no longer matches the distribution of the data it is being tested on. For example, a spam filter model that was trained on emails from several years ago may not be as effective at identifying spam emails from today because the language and tactics used in spam emails may have changed. 470 | 471 | Data drift, on the other hand, refers to changes in the input data itself over time. This means that the values of the input feature that the model was trained on no longer match the values of the input features in the data it is being tested on. For example, a model that was trained on data from a particular geographical region may not be as effective at predicting outcomes for data from a different region. 
472 | 473 | To overcome concept drift, one approach is to use online learning methods that allow the model to adapt to new data as it arrives. This involves continually training the model on the most recent data while using historical data to maintain context. Another approach is to periodically retrain the model using a representative sample of the most recent data. 474 | 475 | To overcome data drift, one approach is to monitor the input data for changes and retrain the model when significant changes are detected. This may involve setting up a monitoring system that alerts the user when the data distribution changes beyond a certain threshold. 476 | 477 | Another approach is to preprocess the input data to remove or mitigate the effects of the features changing over time so that the model can continue learning from the remaining features. 478 | ![ezgif com-webp-to-jpg (7)](https://user-images.githubusercontent.com/72076328/221916192-7a9fcf21-8e5f-4ddc-bd90-ef1bdabf1d3f.jpg) 479 | 480 | ### Q35: Can you explain the ARIMA model and its components? ### 481 | Answer: 482 | The ARIMA model, which stands for Autoregressive Integrated Moving Average, is a widely used time series forecasting model. It combines three key components: Autoregression (AR), Differencing (I), and Moving Average (MA). 483 | 484 | * Autoregression (AR): 485 | The autoregressive component captures the relationship between an observation in a time series and a certain number of lagged observations. It assumes that the value at a given time depends linearly on its own previous values. The "p" parameter in ARIMA(p, d, q) represents the order of autoregressive terms. For example, ARIMA(1, 0, 0) refers to a model with one autoregressive term. 486 | 487 | * Differencing (I): 488 | Differencing is used to make a time series stationary by removing trends or seasonality. It calculates the difference between consecutive observations to eliminate any non-stationary behavior. The "d" parameter in ARIMA(p, d, q) represents the order of differencing. For instance, ARIMA(0, 1, 0) indicates that differencing is applied once. 489 | 490 | * Moving Average (MA): 491 | The moving average component takes into account the dependency between an observation and a residual error from a moving average model applied to lagged observations. It assumes that the value at a given time depends linearly on the error terms from previous time steps. The "q" parameter in ARIMA(p, d, q) represents the order of the moving average terms. For example, ARIMA(0, 0, 1) signifies a model with one moving average term. 492 | 493 | By combining these three components, the ARIMA model can capture both autoregressive patterns, temporal dependencies, and stationary behavior in a time series. The parameters p, d, and q are typically determined through techniques like the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). 494 | 495 | It's worth noting that there are variations of the ARIMA model, such as SARIMA (Seasonal ARIMA), which incorporates additional seasonal components for modeling seasonal patterns in the data. 496 | 497 | ARIMA models are widely used in forecasting applications, but they do make certain assumptions about the underlying data, such as linearity and stationarity. It's important to validate these assumptions and adjust the model accordingly if they are not met. 
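
To make the p, d, and q components concrete, here is a minimal sketch using the ARIMA implementation in statsmodels on a synthetic trending series (the library choice, the synthetic data, and the (1, 1, 1) order are assumptions for illustration only):

```
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic series with drift, so one round of differencing (d=1) is reasonable
rng = np.random.default_rng(42)
y = pd.Series(np.cumsum(rng.normal(loc=0.5, scale=1.0, size=200)))

# ARIMA(p=1, d=1, q=1): one autoregressive lag, one difference, one moving-average term
model = ARIMA(y, order=(1, 1, 1))
result = model.fit()

print(result.summary())             # estimated AR/MA coefficients and diagnostics
print("AIC:", result.aic)           # compare candidate (p, d, q) orders via AIC/BIC
print(result.forecast(steps=10))    # forecast the next 10 periods
```

In practice you would fit several candidate orders (or a seasonal SARIMA variant if the series is seasonal) and keep the one with the lowest AIC/BIC.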
498 | ![1-1](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/assets/72076328/12707951-bdf5-4cd1-9efd-c60c465007a3) 499 | 500 | ### Q36: What are the assumptions made by the ARIMA model? ### 501 | Answer: 502 | 503 | The ARIMA model makes several assumptions about the underlying time series data. These assumptions are important to ensure the validity and accuracy of the model's results. Here are the key assumptions: 504 | 505 | Stationarity: The ARIMA model assumes that the time series is stationary. Stationarity means that the statistical properties of the data, such as the mean and variance, remain constant over time. This assumption is crucial for the autoregressive and moving average components to hold. If the time series is non-stationary, differencing (the "I" component) is applied to transform it into a stationary series. 506 | 507 | Linearity: The ARIMA model assumes that the relationship between the observations and the lagged values is linear. It assumes that the future values of the time series can be modeled as a linear combination of past values and error terms. 508 | 509 | No Autocorrelation in Residuals: The ARIMA model assumes that the residuals (the differences between the predicted values and the actual values) do not exhibit any autocorrelation. In other words, the errors are not correlated with each other. 510 | 511 | Normally Distributed Residuals: The ARIMA model assumes that the residuals follow a normal distribution with a mean of zero. This assumption is necessary for statistical inference, parameter estimation, and hypothesis testing. 512 | 513 | It's important to note that while these assumptions are commonly made in ARIMA modeling, they may not always hold in real-world scenarios. It's essential to assess the data and, if needed, apply transformations or consider alternative models that relax some of these assumptions. Additionally, diagnostics tools, such as residual analysis and statistical tests, can help evaluate the adequacy of the assumptions and the model's fit to the data. 514 | -------------------------------------------------------------------------------- /Probability Interview Questions & Answers for Data Scientists.md: -------------------------------------------------------------------------------- 1 | # Probability Interview Questions & Answers for Data Scientists # 2 | 3 | ## Questions ## 4 | * [Q1: You and your friend are playing a game with a fair coin. The two of you will continue to toss the coin until the sequence HH or TH shows up. If HH shows up first, you win, and if TH shows up first your friend win. What is the probability of you winning the game?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q1-you-and-your-friend-are-playing-a-game-with-a-fair-coin-the-two-of-you-will-continue-to-toss-the-coin-until-the-sequence-hh-or-th-shows-up-if-hh-shows-up-first-you-win-and-if-th-shows-up-first-your-friend-win-what-is-the-probability-of-you-winning-the-game) 5 | * [Q2: If you roll a dice three times, what is the probability to get two consecutive threes?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q2-if-you-roll-a-dice-three-times-what-is-the-probability-to-get-two-consecutive-threes) 6 | * [Q3: Suppose you have ten fair dice. 
If you randomly throw them simultaneously, what is the probability that the sum of all of the top faces is divisible by six?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q3-suppose-you-have-ten-fair-dice-if-you-randomly-throw-them-simultaneously-what-is-the-probability-that-the-sum-of-all-of-the-top-faces-is-divisible-by-six) 7 | * [Q4: If you have three draws from a uniformly distributed random variable between 0 and 2, what is the probability that the median of three numbers is greater than 1.5?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q4-if-you-have-three-draws-from-a-uniformly-distributed-random-variable-between-0-and-2-what-is-the-probability-that-the-median-of-three-numbers-is-greater-than-15) 8 | * [Q5: Assume you have a deck of 100 cards with values ranging from 1 to 100 and you draw two cards randomly without replacement, what is the probability that the number of one of them is double the other?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q5-assume-you-have-a-deck-of-100-cards-with-values-ranging-from-1-to-100-and-you-draw-two-cards-randomly-without-replacement-what-is-the-probability-that-the-number-of-one-of-them-is-double-the-other) 9 | * [Q6: What is the difference between the Bernoulli and Binomial distribution?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q6-what-is-the-difference-between-the-bernoulli-and-binomial-distribution) 10 | * [Q7: If there are 30 people in a room, what is the probability that everyone has different birthdays?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q7-if-there-are-30-people-in-a-room-what-is-the-probability-that-everyone-has-different-birthdays) 11 | * [Q8: Assume two coins, one fair and the other is unfair. You pick one at random, flip it five times, and observe that it comes up as tails all five times. What is the probability that you are fliping the unfair coin? ](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q8-assume-two-coins-one-fair-and-the-other-is-unfair-you-pick-one-at-random-flip-it-five-times-and-observe-that-it-comes-up-as-tails-all-five-times-what-is-the-probability-that-you-are-fliping-the-unfair-coin) 12 | * [Q9: Assume you take a stick of length 1 and you break it uniformly at random into three parts. What is the probability that the three pieces can be used to form a triangle?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q9-assume-you-take-a-stick-of-length-1-and-you-break-it-uniformly-at-random-into-three-parts-what-is-the-probability-that-the-three-pieces-can-be-used-to-form-a-triangle) 13 | * [Q10: Say you draw a circle and choose two chords at random. 
What is the probability that those chords will intersect?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q10-say-you-draw-a-circle-and-choose-two-chords-at-random-what-is-the-probability-that-those-chords-will-intersect) 14 | * [Q11: If there’s a 15% probability that you might see at least one airplane in a five-minute interval, what is the probability that you might see at least one airplane in a period of half an hour?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q11-if-theres-a-15-probability-that-you-might-see-at-least-one-airplane-in-a-five-minute-interval-what-is-the-probability-that-you-might-see-at-least-one-airplane-in-a-period-of-half-an-hour) 15 | * [Q12: Say you are given an unfair coin, with an unknown bias towards heads or tails. How can you generate fair odds using this coin? ](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q12-say-you-are-given-an-unfair-coin-with-an-unknown-bias-towards-heads-or-tails-how-can-you-generate-fair-odds-using-this-coin) 16 | * [Q13: According to hospital records, 75% of patients suffering from a disease die from that disease. Find out the probability that 4 out of the 6 randomly selected patients survive.](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q13-according-to-hospital-records-75-of-patients-suffering-from-a-disease-die-from-that-disease-find-out-the-probability-that-4-out-of-the-6-randomly-selected-patients-survive) 17 | * [Q14: Discuss some methods you will use to estimate the Parameters of a Probability Distribution](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q14-discuss-some-methods-you-will-use-to-estimate-the-parameters-of-a-probability-distribution) 18 | * [Q15: You have 40 cards in four colors, 10 reds, 10 greens, 10 blues, and ten yellows. Each color has a number from 1 to 10. 
When you pick two cards without replacement, what is the probability that the two cards are not in the same color and not in the same number?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q15-you-have-40-cards-in-four-colors-10-reds-10-greens-10-blues-and-ten-yellows-each-color-has-a-number-from-1-to-10-when-you-pick-two-cards-without-replacement-what-is-the-probability-that-the-two-cards-are-not-in-the-same-color-and-not-in-the-same-number) 19 | * [Q16: Can you explain the difference between frequentist and Bayesian probability approaches?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20&%20Answers%20for%20Data%20Scientists.md#:~:text=39%20%3D%209/13-,Q16%3A%20Can%20you%20explain%20the%20difference%20between%20frequentist%20and%20Bayesian%20probability%20approaches%3F,-Answer%3A) 20 | * [Q17: Explain the Difference Between Probability and Likelihood](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20&%20Answers%20for%20Data%20Scientists.md#:~:text=with%20new%20data.-,Q17%3A%20Explain%20the%20Difference%20Between%20Probability%20and%20Likelihood,Give%20feedback,-Footer) 21 | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ 22 | ## Questions & Answers ## 23 | 24 | 25 | ### Q1: You and your friend are playing a game with a fair coin. The two of you will continue to toss the coin until the sequence HH or TH shows up. If HH shows up first, you win, and if TH shows up first your friend win. What is the probability of you winning the game? ### 26 | 27 | Answer: 28 | 29 | First flip is either heads or tails. If the second flip is heads we have a winner no matter what. Hence we have a 1/2 chance of game ending on the second flip. 30 | If first flip is H, and the second flip is H, then player 1 wins. If first flip is H, and second flip is T, the game goes on. Generalizing, if the last flip was T, then HH will never occur and player 1 has no chance of wining. Either the game goes on OR player 2 wins OR the game goes on AND player 2 wins. 31 | Player 1 can only win if the first flip is H and the second flip is H. Consider the following four scenarios 32 | - HH : Player 1 wins 33 | - HT ... ? : HH will never occur before TH. Player 2 wins. 34 | - TT ... ? : HH will never occur before TH. Player 2 wins. 35 | - TH : Player 2 wins. 36 | Hence probability of player 1 wining is 1/4 and probability of player 2 wining is 3/4. 37 | 38 | 39 | ### Q2: If you roll a dice three times, what is the probability to get two consecutive threes? ### 40 | 41 | The right answer is 11/216 42 | 43 | There are different ways to answer this question: 44 | 45 | 1. If we roll a dice three times we can get two consecutive 3’s in three ways: 46 | 1. The first two rolls are 3s and the third is any other number with a probability of 1/6 * 1/6 * 5/6. 47 | 48 | 2. The first one is not three while the other two rolls are 3s with a probability of 5/6 * 1/6 * 1/6 49 | 50 | 3. 
The last one is that the three rolls are 3s with probability 1/6 ^ 3 51 | 52 | So the final result is 2 * (5/6 * (1/6)^2) + (1/6)*3 = 11/216 53 | 54 | By Inclusion-Exclusion Principle: 55 | 56 | Probability of at least two consecutive threes 57 | = Probability of two consecutive threes in first two rolls + Probability of two consecutive threes in last two rolls - Probability of three consecutive threes 58 | 59 | = 2 * Probability of two consecutive threes in first two rolls - Probability of three consecutive threes 60 | = 2 * (1/6) * (1/6) - (1/6) * (1/6) * (1/6) = 11/216 61 | 62 | It can be seen also like this: 63 | 64 | The sample space is made of (x, y, z) tuples where each letter can take a value from 1 to 6, therefore the sample space has 6x6x6=216 values, and the number of outcomes that are considered two consecutive threes is (3,3, X) or (X, 3, 3), the number of possible outcomes is therefore 6 for the first scenario (3,3,1) till 65 | (3,3,6) and 6 for the other scenario (1,3,3) till (6,3,3) and subtract the duplicate (3,3,3) which appears in both, and this leaves us with a probability of 11/216. 66 | 67 | ### Q3: Suppose you have ten fair dice. If you randomly throw them simultaneously, what is the probability that the sum of all of the top faces is divisible by six? ### 68 | Answer: 69 | 1/6 70 | 71 | Explanation: 72 | With 10 dices, the possible sums divisible by 6 are 12, 18, 24, 30, 36, 42, 48, 54, and 60. You don't actually need to calculate the probability of getting each of these numbers as the final sums from 10 dices because no matter what the sum of the first 9 numbers is, you can still choose a number between 1 to 6 on the last die and add to that previous sum to make the final sum divisible by 6. Therefore, we only care about the last die. And the probability to get that number on the last die is 1/6. So the answer is 1/6 73 | 74 | 75 | ### Q4: If you have three draws from a uniformly distributed random variable between 0 and 2, what is the probability that the median of three numbers is greater than 1.5? ### 76 | The right answer is 5/32 or 0.156. There are different methods to solve it: 77 | 78 | * **Method 1:** 79 | 80 | To get a median greater than 1.5 at least two of the three numbers must be greater than 1.5. The probability of one number being greater than 1.5 in this distribution is 0.25. Then, using the binomial distribution with three trials and a success probability of 0.25 we compute the probability of 2 or more successes to get the probability of the median is more than 1.5, which would be about 15.6%. 81 | 82 | * **Method2 :** 83 | 84 | A median greater than 1.5 will occur when o all three uniformly distributed random numbers are greater than 1.5 or 1 uniform distributed random number between 0 and 1.5 and the other two are greater than 1.5. 85 | 86 | So, the probability of the above event is 87 | = {(2 - 1.5) / 2}^3 + (3 choose 1)(1.5/2)(0.5/2)^2 88 | = 10/64 = 5/32 89 | 90 | * **Method3:** 91 | 92 | Using the Monte Carlo method as shown in the figure below: 93 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Monte%20Carlo%20Methods.png) 94 | 95 | ### Q5: Assume you have a deck of 100 cards with values ranging from 1 to 100 and you draw two cards randomly without replacement, what is the probability that the number of one of them is double the other? 
### 96 | 97 | There are a total of (100 C 2) = 4950 ways to choose two cards at random from the 100 cards and there are only 50 pairs of these 4950 ways that you will get one number and it's double. Therefore the probability that the number of one of them is double the other is 50/4950. 98 | 99 | ### Q6: What is the difference between the Bernoulli and Binomial distribution? ### 100 | Answer: 101 | 102 | Bernoulli and Binomial are both types of probability distributions. 103 | 104 | The function of Bernoulli is given by 105 | 106 | p(x) =p^x * q^(1-x) , x=[0,1] 107 | 108 | Mean is p 109 | 110 | Variance p*(1-p) 111 | 112 | The function Binomial is given by: 113 | 114 | p(x) = nCx p^x q^(n-x) x=[0,1,2...n] 115 | 116 | Mean : np 117 | 118 | Variance :npq 119 | 120 | Where p and q are the probability of success and probability of failure respectively, n is the number of independent trials and x is the number of successes. 121 | 122 | As we can see sample space( x ) for Bernoulli distribution is Binary (2 outcomes), and just a single trial. 123 | 124 | Eg: A loan sanction for a person can be either a success or a failure, with no other possibility. (Hence single trial). 125 | 126 | Whereas for Binomial the sample space(x) ranges from 0 -n. 127 | 128 | Eg. Tossing a coin 6 times, what is the probability of getting 2 or a few heads? 129 | 130 | Here sample space is x=[0,1,2] and more than 1 trial and n=6(finite) 131 | 132 | In short, Bernoulli Distribution is a single trial version of Binomial Distribution. 133 | 134 | ### Q7: If there are 30 people in a room, what is the probability that everyone has different birthdays? ### 135 | 136 | The sample space is 365^30 and the number of events is 365p30 because we need to choose persons without replacement to get everyone to have a unique birthday therefore the Prob = 356p30 / 365^30 = 0.2936 137 | 138 | A theoretical explanation is provided in the figure below thanks to Fazil Mohammed. 139 | 140 | > Note: Why do we use permutations and not combinations here?

141 | When calculating 365C30, you are saying: “Out of 365 days, I'm choosing 30 distinct days, but I don't care in what order they are assigned to people.” This treats the selection of birthdays as unordered, which isn't the case in the birthday problem, because who gets which birthday is important.
142 | For example, if you selected 30 distinct birthdays (as in a combination), this would only tell you which 30 birthdays are used, but it wouldn't account for the fact that different people being assigned different birthdays creates different outcomes. In contrast, with permutations, we are considering the specific assignment of each person to a particular birthday, where the order matters because we care about which person gets which birthday. i.e. if Person A is born on 01/01 and person B is born on 02/02, its different than if Person A is born on 02/02 and person B is born on 01/01. 143 | 144 | 145 | Interesting facts provided by Rishi Dey Chowdhury: 146 | 1. With just 23 people there is over 50% chance of a birthday match and with 57 people the match probability exceeds 99%. One intuition to think of why with such a low number of people the probability of a match is so high. It's because for a match we require a pair of people and 23 choose 2 is 23*11 = 253 which is a relatively big number and ya 50% sounds like a decent probability of a match for this case. 147 | 148 | 2. Another interesting fact is if the assumption of equal probability of birthday of a person on any day out of 365 is violated and there is a non-equal probability of birthday of a person among days of the year then, it is even more likely to have a birthday match. 149 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Therotical%20Explanation%20Q%207%20Probability.jfif) 150 | 151 | ### Q8: Assume two coins, one fair and the other is unfair. You pick one at random, flip it five times, and observe that it comes up as tails all five times. What is the probability that you are fliping the unfair coin? Assume that the unfair coin always results in tails. ### 152 | 153 | Answer: 154 | 155 | Let's use Baye’s theorem let U denote the case where you are flipping the unfair coin and F denote the case where you are flipping the fair coin. Since the coin is chosen randomly, we know that P(U)=P(F)=0.5. Let 5T denote the event of flipping 5 tails in a row. 156 | 157 | Then, we are interested in solving for P(U|5T) (the probability that you are flipping the unfair coin given that you obtained 5 tails). Since the unfair coin always results in tails, therefore P(5T|U) = 1 and also P(5T|F) =1/2⁵ = 1/32 by the definition of a fair coin. 158 | 159 | Lets apply Bayes theorem where P(U|5T) = P(5T|U) * P(U) / P(5T|U)* P(U) + P(5T|F)* P(F) = 0.5 / 0.5 +0.5* 1/32 = 0.97 160 | 161 | Therefore the probability that you picked the unfair coin is 97% 162 | 163 | ### Q9: Assume you take a stick of length 1 and you break it uniformly at random into three parts. What is the probability that the three pieces can be used to form a triangle? ### 164 | 165 | Answer: 166 | The right answer is 0.25 167 | 168 | Let's say, x and y are the lengths of the two parts, so the length of the third part will be 1-x-y 169 | 170 | As per the triangle inequality theorem, the sum of two sides should always be greater than the third side. Therefore, no two lengths can be more than 1/2. 171 | x<1/2 172 | y<1/2 173 | 174 | To achieve this the first breaking point (X) should before the 0.5 mark on the stick and the second breaking point (Y) should be after the 0.5 mark on the stick. 
175 | 
176 | P(X < 0.5) = (0.5 - 0) / (1 - 0) = 0.5 
177 | 
178 | P(Y > 0.5) = (1 - 0.5) / (1 - 0) = 0.5 
179 | 
180 | Hence, the overall probability = P(X < 0.5) * P(Y > 0.5) = 1/4 = 0.25 
181 | 
182 | ### Q10: Say you draw a circle and choose two chords at random. What is the probability that those chords will intersect? ### 
183 | 
184 | Answer: 
185 | To make 2 chords, 4 points are necessary, and 4 points can be paired into two chords in only 3 different ways. Out of these 3 pairings, there is exactly one in which the two chords intersect, hence the answer is 1/3. 
186 | Let's assume that P1, P2, P3, and P4 are the four points; then the 3 possible pairings of chords are (P1 P2)(P3 P4), (P1 P3)(P4 P2), and (P1 P4)(P2 P3), and for any placement of the four points on the circle exactly one of these pairings gives two chords that intersect. 
187 | 
188 | ![Probability question 70](https://user-images.githubusercontent.com/72076328/189387820-1a4fb356-d8a9-4054-9475-09e3ad5bc872.png) 
189 | 
190 | 
191 | ### Q11: If there’s a 15% probability that you might see at least one airplane in a five-minute interval, what is the probability that you might see at least one airplane in a period of half an hour? ### 
192 | 
193 | Answer: 
194 | 
195 | Probability of at least one plane in a 5-minute interval = 0.15 
196 | Probability of no plane in a 5-minute interval = 0.85 
197 | Probability of seeing at least one plane in 30 minutes = 1 - Probability of not seeing any plane in 30 minutes 
198 | = 1 - (0.85)^6 = 0.6228 
199 | 
200 | This problem can also be solved using the Poisson distribution. Refer to this [blog post](https://towardsdatascience.com/shooting-star-problem-simple-solution-and-poisson-process-demonstration-739e94184edf). 
201 | 
202 | ### Q12: Say you are given an unfair coin, with an unknown bias towards heads or tails. How can you generate fair odds using this coin? ### 
203 | 
204 | Answer: 
205 | 
206 | ![propability_83](https://user-images.githubusercontent.com/72076328/194715951-800d2018-28dc-482c-adf8-34a93ce9b8a1.jpg) 
207 | 
208 | ### Q13: According to hospital records, 75% of patients suffering from a disease die from that disease. Find out the probability that 4 out of the 6 randomly selected patients survive. ### 
209 | 
210 | Answer: 
211 | This has to be a binomial since there are only 2 outcomes – death or life. 
212 | 
213 | Here n = 6 and x = 4. 
214 | 
215 | p = 0.25 (probability of survival), q = 0.75 (probability of death) 
216 | 
217 | Using the probability mass function equation: 
218 | 
219 | P(X) = nCx * p^x * q^(n-x) 
220 | 
221 | 
222 | Then: 
223 | 
224 | P(4) = 6C4 * (0.25)^4 * (0.75)^2 ≈ 0.033 
225 | 
226 | ### Q14: Discuss some methods you will use to estimate the Parameters of a Probability Distribution ### 
227 | 
228 | Answer: 
229 | 
230 | There are different ways you can go about this. The following are some methods; one may choose only one of these or a combination, depending on the observed data. 
231 | - Method of moments 
232 | - Maximum Likelihood Estimation 
233 | - Bayesian Estimation 
234 | - Least Squares Estimation 
235 | - Method of Least Absolute Deviation 
236 | - Chi-squared Test 
237 | 
238 | ### Q15: You have 40 cards in four colors, 10 reds, 10 greens, 10 blues, and ten yellows. Each color has a number from 1 to 10. When you pick two cards without replacement, what is the probability that the two cards are not in the same color and not in the same number? ### 
239 | 
240 | Answer: 
241 | 
242 | Since it doesn't matter how you choose the first card, choose one card at random. 
243 | Now, all we have to care about is the restriction on the second card. It can't be the same number (i.e. 
3 cards from the other colors can't be chosen in favorable cases) and also can't be the same color (i.e. 9 cards from the same color can't be chosen keep in mind we have already picked one). 244 | 245 | So, the number of favorable choices for the 2nd card is (39-12)/39 = 27/39 = 9/13 246 | 247 | ![1668961881451](https://user-images.githubusercontent.com/72076328/202913961-f94f17b1-dc41-45b2-ba51-389583431d7b.jpg) 248 | 249 | 250 | ### Q16: Can you explain the difference between frequentist and Bayesian probability approaches? ### 251 | 252 | Answer: 253 | 254 | The frequentist approach to probability defines probability as the long-run relative frequency of an event in an infinite number of trials. It views probabilities as fixed and objective, determined by the data at hand. In this approach, the parameters of a model are treated as fixed and unknown and estimated using methods like maximum likelihood estimation. 255 | 256 | 257 | 258 | On the other hand, Bayesian probability defines probability as a degree of belief, or the degree of confidence, in an event. It views probabilities as subjective and personal, representing an individual's beliefs. In this approach, the parameters of a model are treated as random variables with prior beliefs, which are updated as new data becomes available to form a posterior belief. 259 | 260 | 261 | 262 | In summary, the frequentist approach deals with fixed and objective probabilities and uses methods like estimation, while the Bayesian approach deals with subjective and personal probabilities and uses methods like updating prior beliefs with new data. 263 | 264 | ### Q17: Explain the Difference Between Probability and Likelihood ### 265 | Probability and likelihood are two concepts that are often used in statistics and data analysis, but they have different meanings and uses. 266 | 267 | Probability is the measure of the likelihood of an event occurring. It is a number between 0 and 1, with 0 indicating an impossible event and 1 indicating a certain event. For example, the probability of flipping a coin and getting heads is 0.5. 268 | 269 | The likelihood, on the other hand, is the measure of how well a statistical model or hypothesis fits a set of observed data. It is not a probability, but rather a measure of how plausible the data is given the model or hypothesis. For example, if we have a hypothesis that the average height of people in a certain population is 6 feet, the likelihood of observing a random sample of people with an average height of 5 feet would be low. 270 | -------------------------------------------------------------------------------- /Python Interview Questions & Answers for Data Scientists.md: -------------------------------------------------------------------------------- 1 | # Python Questions # 2 | ## Questions: ## 3 | * [Q1: Given two arrays, write a python function to return the intersection of the two? For example, X = [1,5,9,0] and Y = [3,0,2,9] it should return [9,0]](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q1-given-two-arrays-write-a-python-function-to-return-the-intersection-of-the-two-for-example-x--1590-and-y--3029-it-should-return-90) 4 | * [Q2: Given an array, find all the duplicates in this array? 
For example: input: [1,2,3,1,3,6,5] output: [1,3]](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q2-given-an-array-find-all-the-duplicates-in-this-array-for-example-input-1231365-output-13) 5 | * [Q3: Given an integer array, return the maximum product of any three numbers in the array?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q3-given-an-integer-array-return-the-maximum-product-of-any-three-numbers-in-the-array) 6 | * [Q4: Given an integer array, find the sum of the largest contiguous subarray within the array. For example, given the array A = [0,-1,-5,-2,3,14] it should return 17 because of [3,14]. Note that if all the elements are negative it should return zero.](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q4-given-an-integer-array-find-the-sum-of-the-largest-contiguous-subarray-within-the-array-for-example-given-the-array-a--0-1-5-2314-it-should-return-17-because-of-314-note-that-if-all-the-elements-are-negative-it-should-return-zero) 7 | * [Q5: Define tuples and lists in Python What are the major differences between them?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q5-define-tuples-and-lists-in-python-what-are-the-major-differences-between-them) 8 | * [Q6: Compute the Euclidean Distance Between Two Series?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q6-compute-the-euclidean-distance-between-two-series) 9 | * [Q7: Given an integer n and an integer K, output a list of all of the combination of k numbers chosen from 1 to n. For example, if n=3 and k=2, return [1,2][1,3],[2,3]](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q7-given-an-integer-n-and-an-integer-k-output-a-list-of-all-of-the-combination-of-k-numbers-chosen-from-1-to-n-for-example-if-n3-and-k2-return-121323) 10 | * [Q8: Write a function to generate N samples from a normal distribution and plot them on the histogram](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q8-write-a-function-to-generate-n-samples-from-a-normal-distribution-and-plot-them-on-the-histogram) 11 | * [Q9: What is the difference between apply and applymap function in pandas? ](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q9-what-is-the-difference-between-apply-and-applymap-function-in-pandas) 12 | * [Q10: Given a string, return the first recurring character in it, or “None” if there is no recurring character. 
Example: input = "pythoninterviewquestion" , output = "n"](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q10-given-a-string-return-the-first-recurring-character-in-it-or-none-if-there-is-no-recurring-character-example-input--pythoninterviewquestion--output--n) 13 | * [Q11: Given a positive integer X return an integer that is a factorial of X. If a negative integer is provided, return -1. Implement the solution by using a recursive function.](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q11-given-a-positive-integerx-return-an-integer-that-is-a-factorial-ofx-if-a-negative-integer-is-provided-return--1-implement-the-solution-by-using-a-recursive-function) 14 | * [Q12: Given an m-by-n matrix with positive integers, determine the length of the longest path of increasing within the matrix. For example, consider the input matrix:](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q12-given-an-m-by-n-matrix-with-positive-integers-determine-the-length-of-the-longest-path-of-increasing-within-the-matrix-for-example-consider-the-input-matrix) 15 | 16 | [ 1 2 3 ] 17 | 18 | [ 4 5 6 ] 19 | 20 | [ 7 8 9 ] 21 | 22 | The answer should be 5 since the longest path would be 1-2-5-6-9]() 23 | 24 | * [Q13: 𝐄𝐱𝐩𝐥𝐚𝐢𝐧 𝐰𝐡𝐚𝐭 𝐅𝐥𝐚𝐬𝐤 𝐢𝐬 𝐚𝐧𝐝 𝐢𝐭𝐬 𝐛𝐞𝐧𝐞𝐟𝐢𝐭𝐬](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q13-%F0%9D%90%84%F0%9D%90%B1%F0%9D%90%A9%F0%9D%90%A5%F0%9D%90%9A%F0%9D%90%A2%F0%9D%90%A7-%F0%9D%90%B0%F0%9D%90%A1%F0%9D%90%9A%F0%9D%90%AD-%F0%9D%90%85%F0%9D%90%A5%F0%9D%90%9A%F0%9D%90%AC%F0%9D%90%A4-%F0%9D%90%A2%F0%9D%90%AC-%F0%9D%90%9A%F0%9D%90%A7%F0%9D%90%9D-%F0%9D%90%A2%F0%9D%90%AD%F0%9D%90%AC-%F0%9D%90%9B%F0%9D%90%9E%F0%9D%90%A7%F0%9D%90%9E%F0%9D%90%9F%F0%9D%90%A2%F0%9D%90%AD%F0%9D%90%AC) 25 | * [Q14: 𝐖𝐡𝐚𝐭 𝐢𝐬 𝐭𝐡𝐞 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐛𝐞𝐭𝐰𝐞𝐞𝐧 𝐥𝐢𝐬𝐭𝐬, 𝐚𝐫𝐫𝐚𝐲𝐬, 𝐚𝐧𝐝 𝐬𝐞𝐭𝐬 𝐢𝐧 𝐏𝐲𝐭𝐡𝐨𝐧, 𝐚𝐧𝐝 𝐰𝐡𝐞𝐧 𝐲𝐨𝐮 𝐬𝐡𝐨𝐮𝐥𝐝 𝐮𝐬𝐞 𝐞𝐚𝐜𝐡 𝐨𝐟 𝐭𝐡𝐞𝐦?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q14-%F0%9D%90%96%F0%9D%90%A1%F0%9D%90%9A%F0%9D%90%AD-%F0%9D%90%A2%F0%9D%90%AC-%F0%9D%90%AD%F0%9D%90%A1%F0%9D%90%9E-%F0%9D%90%9D%F0%9D%90%A2%F0%9D%90%9F%F0%9D%90%9F%F0%9D%90%9E%F0%9D%90%AB%F0%9D%90%9E%F0%9D%90%A7%F0%9D%90%9C%F0%9D%90%9E-%F0%9D%90%9B%F0%9D%90%9E%F0%9D%90%AD%F0%9D%90%B0%F0%9D%90%9E%F0%9D%90%9E%F0%9D%90%A7-%F0%9D%90%A5%F0%9D%90%A2%F0%9D%90%AC%F0%9D%90%AD%F0%9D%90%AC-%F0%9D%90%9A%F0%9D%90%AB%F0%9D%90%AB%F0%9D%90%9A%F0%9D%90%B2%F0%9D%90%AC-%F0%9D%90%9A%F0%9D%90%A7%F0%9D%90%9D-%F0%9D%90%AC%F0%9D%90%9E%F0%9D%90%AD%F0%9D%90%AC-%F0%9D%90%A2%F0%9D%90%A7-%F0%9D%90%8F%F0%9D%90%B2%F0%9D%90%AD%F0%9D%90%A1%F0%9D%90%A8%F0%9D%90%A7-%F0%9D%90%9A%F0%9D%90%A7%F0%9D%90%9D-%F0%9D%90%B0%F0%9D%90%A1%F0%9D%90%9E%F0%9D%90%A7-%F0%9D%90%B2%F0%9D%90%A8%F0%9D%90%AE-%F0%9D%90%AC%F0%9D%90%A1%F0%9D%90%A8%F0%9D%90%AE%F0%9D%90%A5%F0%9D%90%9D-%F0%9D%90%AE%F0%9D%90%AC%F0%9D%90%9E-%F0%9D%90%9E%F0%9D%90%9A%F0%9D%90%9C%F0%9D%90%A1-%F0%9D%90%A8%F0%9D%90%9F-%F0%9D%90%AD%F0%9D%90%A1%F0%9D%90%9E%F0%9D%90%A6) 26 | * [Q15: What are some common ways to handle missing data in Python, and which method do you prefer and why?]() 27 | 
---------------------------------------------------------------------------------------------------------------------------------------------------------------- 
28 | ## Questions & Answers: ## 
29 | 
30 | ### Q1: Given two arrays, write a python function to return the intersection of the two? For example, X = [1,5,9,0] and Y = [3,0,2,9] it should return [9,0] ### 
31 | 
32 | Answer: 
33 | ``` 
34 | list(set(X).intersection(set(Y)))  # e.g. [0, 9]; element order is not guaranteed with sets 
35 | ``` 
36 | 
37 | ### Q2: Given an array, find all the duplicates in this array? For example: input: [1,2,3,1,3,6,5] output: [1,3] ### 
38 | 
39 | Answer: 
40 | ``` 
41 | seen = set() 
42 | res = set() 
43 | for i in arr:        # arr is the input array, e.g. [1, 2, 3, 1, 3, 6, 5] 
44 |     if i in seen:    # seen before -> it is a duplicate 
45 |         res.add(i) 
46 |     else: 
47 |         seen.add(i) 
48 | print(res) 
49 | ``` 
50 | 
51 | 
52 | ### Q3: Given an integer array, return the maximum product of any three numbers in the array? ### 
53 | 
54 | Answer: 
55 | 
56 | ``` 
57 | import heapq 
58 | 
59 | def max_three(arr): 
60 |     a = heapq.nlargest(3, arr)   # three largest numbers, for the all-positive case 
61 |     b = heapq.nsmallest(2, arr)  # two smallest numbers, for the case of two large negatives 
62 |     return max(a[2]*a[1]*a[0], b[1]*b[0]*a[0]) 
63 | ``` 
64 | 
65 | ### Q4: Given an integer array, find the sum of the largest contiguous subarray within the array. For example, given the array A = [0,-1,-5,-2,3,14] it should return 17 because of [3,14]. Note that if all the elements are negative it should return zero. 
66 | 
67 | ``` 
68 | def max_subarray(arr): 
69 |     n = len(arr) 
70 |     max_sum = 0    # starting at zero makes an all-negative array return zero 
71 |     curr_sum = 0 
72 |     for i in range(n): 
73 |         curr_sum += arr[i] 
74 |         max_sum = max(max_sum, curr_sum) 
75 |         if curr_sum < 0: 
76 |             curr_sum = 0 
77 |     return max_sum 
78 | 
79 | ``` 
80 | 
81 | ### Q5: Define tuples and lists in Python What are the major differences between them? ### 
82 | 
83 | Answer: 
84 | 
85 | Lists: 
86 | In Python, a list is created by placing elements inside square brackets [], separated by commas. A list can have any number of items and they may be of different types (integer, float, string, etc.). A list can also have another list as an item. This is called a nested list. 
87 | 
88 | 1. Lists are mutable 
89 | 2. Lists are better for performing operations, such as insertion and deletion. 
90 | 3. Lists consume more memory 
91 | 4. Lists have several built-in methods 
92 | 
93 | 
94 | Tuples: 
95 | A tuple is a collection of objects which is ordered and immutable. Tuples are sequences, just like lists. The differences between tuples and lists are that tuples cannot be changed, unlike lists, and tuples use parentheses, whereas lists use square brackets. 
96 | 
97 | 1. Tuples are immutable 
98 | 2. The tuple data type is appropriate for accessing elements 
99 | 3. Tuples consume less memory as compared to the list 
100 | 4. Tuples do not have many built-in methods. 
101 | 
102 | 
103 | * Mutable = we can change, add, delete and modify stuff 
104 | * Immutable = we cannot change, add, delete and modify stuff 
105 | 
106 | ### Q6: Compute the Euclidean Distance Between Two Series? ### 
107 | ``` 
import numpy as np

# e.g. for two pandas/NumPy series s1 and s2 of equal length
dist = np.sqrt(((s1 - s2) ** 2).sum())   # equivalently: np.linalg.norm(s1 - s2)
108 | ``` 
109 | 
110 | ### Q7: Given an integer n and an integer K, output a list of all of the combination of k numbers chosen from 1 to n. 
### Q7: Given an integer n and an integer K, output a list of all of the combination of k numbers chosen from 1 to n. For example, if n=3 and k=2, return [1,2],[1,3],[2,3] ###

Answer:
```
from itertools import combinations

def find_combination(k, n):
    list_num = []
    comb = combinations(range(1, n + 1), k)   # all k-element combinations of 1..n
    for i in comb:
        list_num.append(list(i))
    print("(k:{}, n:{}):".format(k, n))
    print(list_num, "\n")
    return list_num
```

### Q8: Write a function to generate N samples from a normal distribution and plot them on the histogram ###

Answer:
Using built-in libraries:
```
import numpy as np
import matplotlib.pyplot as plt

N = 1000                   # number of samples to draw
x = np.random.randn(N)     # N samples from a standard normal distribution
plt.hist(x, bins=30)
plt.show()
```
From scratch:
![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Python%20questions%208.png)


### Q9: What is the difference between apply and applymap function in pandas? ###

Answer:

Both methods accept only callables as arguments, but what sets them apart is that applymap is defined on DataFrames and works element-wise, while apply is defined on DataFrames as well as Series and can work row/column-wise as well as element-wise. In terms of use cases, applymap is used for element-wise transformations, while apply is used for more complex operations and aggregations. applymap only returns a DataFrame, while apply can return a scalar value, a Series, or a DataFrame.


### Q10: Given a string, return the first recurring character in it, or “None” if there is no recurring character. Example: input = "pythoninterviewquestion" , output = "n" ###

Answer:
```
input_string = "pythoninterviewquestion"

def first_recurring(input_str):
    seen = ""
    for letter in input_str:
        seen = seen + letter
        if seen.count(letter) > 1:   # this letter already occurred earlier
            return letter
    return None

print(first_recurring(input_string))   # n
```

### Q11: Given a positive integer X return an integer that is a factorial of X. If a negative integer is provided, return -1. Implement the solution by using a recursive function. ###

Answer:
```
def factorial(x):
    # Edge cases
    if x < 0:
        return -1
    if x == 0:
        return 1

    # Exit condition - x = 1
    if x == 1:
        return x
    else:
        # Recursive part
        return x * factorial(x - 1)
```

### Q12: Given an m-by-n matrix with positive integers, determine the length of the longest path of increasing within the matrix. For example, consider the input matrix:

[ 1 2 3 ]

[ 4 5 6 ]

[ 7 8 9 ]

The answer should be 5 since the longest path would be 1-2-5-6-9


Answer:

```
MAX = 10

def Longest_Increasing_Path(dp, mat, n, m, x, y):

    # If the value is not calculated yet.
    if (dp[x][y] < 0):
        result = 0

        # If we reach the bottom-right cell, the path length is 1.
        if (x == n - 1 and y == m - 1):
            dp[x][y] = 1
            return dp[x][y]

        # If we reach the last row or the last column.
        if (x == n - 1 or y == m - 1):
            result = 1

        # If the value in the cell below is greater.
        if (x + 1 < n and mat[x][y] < mat[x + 1][y]):
            result = 1 + Longest_Increasing_Path(dp, mat, n,
                                                 m, x + 1, y)

        # If the value in the cell to the right is greater.
        if (y + 1 < m and mat[x][y] < mat[x][y + 1]):
            result = max(result, 1 + Longest_Increasing_Path(dp, mat, n,
                                                             m, x, y + 1))
        dp[x][y] = result

    return dp[x][y]

# Wrapper function
def wrapper(mat, n, m):
    dp = [[-1 for i in range(MAX)]
          for i in range(MAX)]
    return Longest_Increasing_Path(dp, mat, n, m, 0, 0)
```

### Q13: 𝐄𝐱𝐩𝐥𝐚𝐢𝐧 𝐰𝐡𝐚𝐭 𝐅𝐥𝐚𝐬𝐤 𝐢𝐬 𝐚𝐧𝐝 𝐢𝐭𝐬 𝐛𝐞𝐧𝐞𝐟𝐢𝐭𝐬 ###

Answer:

Flask is a web framework. This means Flask provides you with tools, libraries, and technologies that allow you to build a web application. This web application can be some web pages, a blog, a wiki, or go as big as a web-based calendar application or a commercial website.

Benefits of Flask:
1. Scalable
Flask's status as a microframework means that it can be used to grow a tech project such as a web app very quickly.

2. Flexible
It allows the project to be rearranged and moved around. It also makes sure that the project structure does not collapse when a part is altered.

3. Easy to navigate
At its core, the microframework is easy for web developers to understand, giving them more control over their code and over what is possible.

4. Lightweight
Flask does not rely on a large number of extensions to function, which keeps the core small and gives web developers a certain level of control. Further, Flask also supports modular programming, where the functionality can be split into several interchangeable modules; each module acts as an independent entity and executes a part of the functionality.

### Q14: 𝐖𝐡𝐚𝐭 𝐢𝐬 𝐭𝐡𝐞 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐛𝐞𝐭𝐰𝐞𝐞𝐧 𝐥𝐢𝐬𝐭𝐬, 𝐚𝐫𝐫𝐚𝐲𝐬, 𝐚𝐧𝐝 𝐬𝐞𝐭𝐬 𝐢𝐧 𝐏𝐲𝐭𝐡𝐨𝐧, 𝐚𝐧𝐝 𝐰𝐡𝐞𝐧 𝐲𝐨𝐮 𝐬𝐡𝐨𝐮𝐥𝐝 𝐮𝐬𝐞 𝐞𝐚𝐜𝐡 𝐨𝐟 𝐭𝐡𝐞𝐦? ####

Answer:

All three are data structures that can store collections of data, but with some differences.

Lists are denoted by [ ], sets by { }, and tuples by ( ); arrays do not have their own literal and are created with a constructor such as numpy.array([...]).

𝐋𝐢𝐬𝐭: a built-in data type in Python that stores data in sequence, with a very rich API that allows insertion, removal, retrieval, and expansion. One of its benefits is that it allows many data types in the same list, as it can store strings, integers, floats, or any other derived objects. One of its cons is that it is very slow when used for numerical computation.

𝐀𝐫𝐫𝐚𝐲: an array, on the other hand, can only store a single data type (integers only, floats only, or any single derived object type), but unlike lists it is very efficient in terms of speed and memory usage (NumPy is one of the best libraries implementing array operations; it is a very rich library that solves many problems in numerical computation, such as vectorization, broadcasting, etc.).

𝐒𝐞𝐭: also a built-in data type in Python that can store mixed data types, but it does not allow duplicates; if duplicates are inserted, only one copy is kept. It provides a lot of methods such as unions, differences, and intersections.
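A small illustration of the differences described in Q14; a sketch, with arbitrary example values:

```
import numpy as np

my_list = [1, "two", 3.0, 3.0]      # mixed types, duplicates allowed, mutable
my_array = np.array([1, 2, 3, 3])   # single dtype, fast vectorized math
my_set = {1, 2, 3, 3}               # duplicates collapse, supports set algebra

print(my_list)             # [1, 'two', 3.0, 3.0]
print(my_array * 2)        # [2 4 6 6]  (element-wise, no explicit loop)
print(my_set)              # {1, 2, 3}
print(my_set & {2, 3, 4})  # {2, 3} - set intersection
```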
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Data-Science-Interview-Questions-Answers
A curated list of data science interview questions and answers

[![GitHub license](https://img.shields.io/github/license/youssefHosni/Data-Science-Interview-Questions-Answers.svg)](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/master/LICENSE)
[![GitHub contributors](https://img.shields.io/github/contributors/youssefHosni/Data-Science-Interview-Questions-Answers.svg)](https://GitHub.com/youssefHosni/Data-Science-Interview-Questions-Answers/graphs/contributors/)
[![GitHub issues](https://img.shields.io/github/issues/youssefHosni/Data-Science-Interview-Questions-Answers.svg)](https://GitHub.com/youssefHosni/Data-Science-Interview-Questions-Answers/issues/)
[![GitHub pull-requests](https://img.shields.io/github/issues-pr/youssefHosni/Data-Science-Interview-Questions-Answers.svg)](https://GitHub.com/youssefHosni/Data-Science-Interview-Questions-Answers/pulls/)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)

[![GitHub watchers](https://img.shields.io/github/watchers/youssefHosni/Data-Science-Interview-Questions-Answers.svg?style=social&label=Watch)](https://GitHub.com/youssefHosni/Data-Science-Interview-Questions-Answers/watchers/)
[![GitHub forks](https://img.shields.io/github/forks/youssefHosni/Data-Science-Interview-Questions-Answers.svg?style=social&label=Fork)](https://GitHub.com/youssefHosni/Data-Science-Interview-Questions-Answers/network/)
[![GitHub stars](https://img.shields.io/github/stars/youssefHosni/Data-Science-Interview-Questions-Answers.svg?style=social&label=Star)](https://GitHub.com/youssefHosni/Data-Science-Interview-Questions-Answers/stargazers/)


I started an initiative on LinkedIn in which I post daily data science interview questions. For better access, the questions and answers will be updated in this repo.
The questions can be divided into six categories: machine learning questions, deep learning questions, statistics questions, probability questions, python questions, and resume-based questions. If you would like to participate in these questions and answers, follow me on [LinkedIn](https://www.linkedin.com/in/youssef-hosni-b2960b135/)


I started this initiative on LinkedIn in May 2022, posting daily data science interview questions; I summarize the answers I receive on each post and share them the next day in a follow-up post. The questions are prepared by me, in addition to others that I received from my connections on LinkedIn. You are more than welcome to send me questions on my [LinkedIn](https://www.linkedin.com/in/youssef-hosni-b2960b135/) profile or email me at Youssef.Hosni95@outlook.com.

The main goal of this initiative is to revise the basics of data science, be prepared for your next interview, and know the expected questions in a data science interview. For better access, the questions and answers are gathered here in this GitHub repository and in these [medium articles](https://youssefraafat57.medium.com/list/data-science-interview-questions-6789a80bdb14). They will be updated with new questions daily.
22 | 23 | The questions are divided into seven categories: 24 | 25 | * [Machine Learning Interview Questions & Answers for Data Scientists](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Machine%20Learning%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md) 26 | * [Deep Learning Interview Questions & Answers for Data Scientists](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Deep%20Learning%20Questions%20&%20Answers%20for%20Data%20Scientists.md) 27 | * [Top LLM Interview Questions & Answers for Data Scientists](https://levelup.gitconnected.com/top-large-language-models-llms-interview-questions-answers-d7b83f94c4e?sk=ba9875db71eb42aa0c5fa717f2dd7bd0) 28 | * [Top Computer Vision Interview Questions & Answers for Data Scientists Part 1](https://levelup.gitconnected.com/top-computer-vision-interview-questions-answers-part-1-7eddf45cfdf7?sk=f0b106cf3aab70fa27f07c61d5bc3965) 29 | * [Top Computer Vision Interview Questions & Answers for Data Scientists Part 2](https://levelup.gitconnected.com/top-computer-vision-interview-questions-answers-part-2-107244fc4289?sk=661863bf1a32af631451c9b43bce8868) 30 | * [Top Computer Vision Interview Questions & Answers for Data Scientists Part 3](https://levelup.gitconnected.com/top-computer-vision-interview-questions-answers-part-3-1e43909131b2?sk=9a10e41649c4c6a2088903e4d2db2a72) 31 | * [Statistics Interview Questions & Answers for Data Scientists](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md) 32 | * [Probability Interview Questions & Answers for Data Scientists](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Probability%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md) 33 | * [Python Interview Questions & Answers for Data Scientists](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Python%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md) 34 | * [SQL & DB Interview Questions & Answers for Data Scientists](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/SQL%20%26%20DB%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md) 35 | * [Resume Based Questions](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Resume%20Based%20Questions.md) 36 | -------------------------------------------------------------------------------- /Resume Based Questions.md: -------------------------------------------------------------------------------- 1 | # Resume Based Questions # 2 | 3 | ## Questions ## 4 | 5 | * Q1: Discuss a challenging problem you faced while working on a data science project and how did you solve it? 6 | * Q2: Question: Explain a data science project that you are most proud to work on? 
7 | 8 | -------------------------------------------------------------------------------- /SQL & DB Interview Questions & Answers for Data Scientists.md: -------------------------------------------------------------------------------- 1 | # SQL & DB Interview Questions & Answers for Data Scientists # 2 | 3 | ## Questions ## 4 | * [Q1: What are joins in SQL and discuss its types?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/SQL%20%26%20DB%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q1-what-are-joins-in-sql-and-discuss-its-types) 5 | * [Q2: Define the primary, foreign, and unique keys and the differences between them?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/SQL%20%26%20DB%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q2-define-the-primary-foreign-and-unique-keys-and-the-differences-between-them) 6 | * [Q3: What is the difference between BETWEEN and IN operators in SQL?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/SQL%20%26%20DB%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q3-what-is-the-difference-between-between-and-in-operators-in-sql) 7 | * [Q4: Assume you have the given table below which contains information on user logins. Write a query to obtain the number of reactivated users (Users who did not log in the previous month and then logged in the current month)](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/SQL%20%26%20DB%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q4-assume-you-have-the-given-table-below-which-contains-information-on-user-logins-write-a-query-to-obtain-the-number-of-reactivated-users-users-who-did-not-log-in-the-previous-month-and-then-logged-in-the-current-month) 8 | ![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Screenshot%202022-07-31%20201033.png) 9 | * [Q5: Describe the advantages and disadvantages of relational database vs NoSQL databases](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/SQL%20%26%20DB%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q5-describe-the-advantages-and-disadvantages-of-relational-database-vs-nosql-databases) 10 | * [Q6: Assume you are given the table below on user transactions. Write a query to obtain the third transaction of every user](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/SQL%20%26%20DB%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q6-assume-you-are-given-the-table-below-on-user-transactions-write-a-query-to-obtain-the-third-transaction-of-every-user) 11 | ![1661352126442](https://user-images.githubusercontent.com/72076328/186479577-da475779-b4de-45ef-b1ec-79ca5df0dad5.png) 12 | * [Q7: What do you understand by Self Join? 
Explain using an example](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/SQL%20%26%20DB%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q7-what-do-you-understand-by-self-join-explain-using-an-example) 13 | * [Q8: Write an SQL query to join 3 tables](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/SQL%20%26%20DB%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q8-write-an-sql-query-to-join-3-tables) 14 | * [Q9: Write a SQL query to get the third-highest salary of an employee from employee_table and arrange them in descending order.](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/SQL%20%26%20DB%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q9-write-a-sql-query-to-get-the-third-highest-salary-of-an-employee-from-employee_table-and-arrange-them-in-descending-order) 15 | * [Q10: What is the difference between temporary tables and common table expressions?](https://github.com/soopertramp/Data-Science-Interview-Questions-Answers/blob/main/SQL%20&%20DB%20Interview%20Questions%20&%20Answers%20for%20Data%20Scientists.md#q10-what-is-the-difference-between-temporary-tables-and-common-table-expressions) 16 | * [Q11: Why use Right Join When Left Join can suffice the requirement?](https://github.com/soopertramp/Data-Science-Interview-Questions-Answers/blob/main/SQL%20&%20DB%20Interview%20Questions%20&%20Answers%20for%20Data%20Scientists.md#q11-why-use-right-join-when-left-join-can-suffice-the-requirement) 17 | * [Q12: Why Rank skips sequence?](https://github.com/soopertramp/Data-Science-Interview-Questions-Answers/blob/main/SQL%20&%20DB%20Interview%20Questions%20&%20Answers%20for%20Data%20Scientists.md#q12-why-rank-skips-sequence) 18 | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 19 | 20 | ## Questions & Answers ## 21 | 22 | ### Q1: What are joins in SQL and discuss its types? ### 23 | A JOIN clause is used to combine rows from two or more tables, based on a related column between them. It is used to merge two tables or retrieve data from there. There are 4 types of joins: inner join left join, right join, and full join. 24 | 25 | * Inner join: Inner Join in SQL is the most common type of join. It is used to return all the rows from multiple tables where the join condition is satisfied. 26 | * Left Join: Left Join in SQL is used to return all the rows from the left table but only the matching rows from the right table where the join condition is fulfilled. 27 | * Right Join: Right Join in SQL is used to return all the rows from the right table but only the matching rows from the left table where the join condition is fulfilled. 28 | * Full Join: Full join returns all the records when there is a match in any of the tables. Therefore, it returns all the rows from the left-hand side table and all the rows from the right-hand side table. 29 | ![alt text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Joins%20in%20SQL.png) 30 | 31 | ### Q2: Define the primary, foreign, and unique keys and the differences between them? 
###

**Primary key:** A key that is used to uniquely identify each row or record in the table; it can be a single column or a composite primary key that contains more than one column.

* The primary key doesn't accept null or repeated values
* The purpose of the primary key is to enforce the entity's integrity
* There is only one PK in each table
* Every row must have a unique primary key

**Foreign key:** A key that is used to identify, show, or describe the relationship between tuples of two tables. It acts as a cross-reference between tables because it references the primary key of another table, thereby establishing a link between them.

* The purpose of the foreign key is to maintain referential (data) integrity
* It can contain null values; its non-null values must match primary (or unique) key values in the referenced table

**Unique key:** A key that can identify each row in the table, like the primary key, but it can contain null values (in some database systems, only one)

* Every table can have more than one unique key

### Q3: What is the difference between BETWEEN and IN operators in SQL? ###
Answer:

The SQL **BETWEEN** operator selects values within a given range. It is inclusive of both ends of the range: the begin and end values are included. The values can be text, dates, numbers, or other types.

For example, select * from tablename where price BETWEEN 10 and 100;

The **IN** operator is used to select rows in which a certain column's value matches any value in a given list. It is used with the WHERE clause.

For example, select * from tablename where country IN ('USA', 'Canada');

IN is mainly used for categorical values (it can be used with numerical ones as well), whereas BETWEEN is used for numerical or date ranges.
![alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Betweem%26IN.png)

### Q4: Assume you have the given table below which contains information on user logins. Write a query to obtain the number of reactivated users (Users who did not log in the previous month and then logged in the current month) ###
![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Screenshot%202022-07-31%20201033.png)

Answer:
First, we look at the users who logged in during the current month. To obtain the previous month, we subtract an 𝐈𝐍𝐓𝐄𝐑𝐕𝐀𝐋 of 1 month from the current month's login date. Then, we use a correlated 𝐖𝐇𝐄𝐑𝐄 𝐍𝐎𝐓 𝐄𝐗𝐈𝐒𝐓𝐒 subquery against the previous month's interval to keep only the users who had no login in the previous month. Finally, we 𝗖𝗢𝗨𝗡𝗧 the number of users satisfying this condition. The query below assumes the table has user_id and login_date columns.

```
SELECT
    DATE_TRUNC('month', current_month.login_date) AS current_month,
    COUNT(DISTINCT current_month.user_id) AS num_reactivated_users
FROM
    user_logins current_month
WHERE
    NOT EXISTS (
        SELECT
            1
        FROM
            user_logins last_month
        WHERE
            last_month.user_id = current_month.user_id
            AND DATE_TRUNC('month', last_month.login_date) = DATE_TRUNC('month', current_month.login_date) - INTERVAL '1 month'
    )
GROUP BY
    DATE_TRUNC('month', current_month.login_date)
```
### Q5: Describe the advantages and disadvantages of relational database vs NoSQL databases ###

Answer:

**Advantages of Relational Databases:** Ensure data integrity through a defined schema and ACID properties. Easy to get started with and use for small-scale applications. Lends itself well to vertical scaling. Uses an almost standard query language, making learning or switching between types of relational databases easy.
**Advantages of NoSQL Databases:** Offers more flexibility in data formats and representations, which makes working with unstructured or semi-structured data easier. Hence, it is useful when the data schema is still evolving or when new features/functionality are added rapidly, as in a startup environment, and it scales well through horizontal scaling. Lends itself better to applications that need to be highly available.

**Disadvantages of Relational Databases:** The data schema needs to be known in advance. Altering schemas is possible, but frequent changes to the schema for large tables can cause performance issues. Horizontal scaling is relatively difficult, leading to eventual performance bottlenecks.

**Disadvantages of NoSQL Databases:** As outlined by the BASE framework, weaker guarantees of data correctness are made due to the soft-state and eventual-consistency properties. Managing consistency can also be difficult due to the lack of a predefined schema that's strictly adhered to. Depending on the type of NoSQL database, it can be challenging for the database to handle certain types of complex queries or access patterns.


![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/ezgif.com-gif-maker.jpg)


### Q6: Assume you are given the table below on user transactions. Write a query to obtain the third transaction of every user ###

![1661352126442](https://user-images.githubusercontent.com/72076328/186479648-5dcbf0a3-46e3-46b4-99ab-94feeace3dca.png)

Answer:
First, we obtain the transaction numbers for each user. We can do this by using the ROW_NUMBER window function, where we PARTITION by the user_id and ORDER by the transaction_date fields, calling the resulting field a transaction number. From there, we can simply take all transactions having a transaction number equal to 3.
![1661352088335](https://user-images.githubusercontent.com/72076328/186479695-5d2b7f36-5703-489d-87e3-6bad6ee1a9b7.jpg)

### Q7: What do you understand by Self Join? Explain using an example ###

Answer:

A self-join is, as its name implies, joining a table to itself. This process may come in handy in a number of cases, such as:

1- Comparing the table's rows to themselves:

It is as if we had two copies of the same table and joined them together on a given condition to reach the required output query.

Ex. If we have a store database with a clients table holding a bunch of demographics, we could self-join the clients table to get clients who are located in the same city / made a purchase on the same day / etc.

2- Querying a table that has hierarchical data:

Meaning, the table has a primary key that has a one-to-many relationship with a foreign key inside the same table; in other words, the table has data that refers to the same table. We could use a self-join in order to have a clear look at the data by matching its keys.

Ex. The organizational structure of a company may contain an employee table that has an employee id and a manager id (the manager is also an employee, hence has an employee id too) in the same table. Using a self-join on this table would allow us to reference every employee directly to his manager (a sketch of this query follows below).

P.S. we would need to take care of duplicates that may occur and consider them in the join conditions.
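A minimal sketch of the employee/manager case above. The table and column names (employees, employee_id, name, manager_id) are hypothetical, not taken from the original answer:

```
SELECT
    emp.employee_id,
    emp.name AS employee_name,
    mgr.name AS manager_name
FROM employees AS emp
LEFT JOIN employees AS mgr        -- the same table aliased twice = self join
       ON emp.manager_id = mgr.employee_id;
```

Using a LEFT JOIN keeps employees that have no manager (e.g., the CEO), with NULL in the manager_name column.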
### Q8: Write an SQL query to join 3 tables ###
![1668274347333](https://user-images.githubusercontent.com/72076328/201538710-264494b8-62e7-4e36-8487-be449c1b441a.jpg)


### Q9: Write a SQL query to get the third-highest salary of an employee from employee_table and arrange them in descending order. ###

Answer:

### Q10: What is the difference between temporary tables and common table expressions? ###

Answer:

𝗧𝗲𝗺𝗽𝗼𝗿𝗮𝗿𝘆 𝘁𝗮𝗯𝗹𝗲𝘀 and 𝗖𝗧𝗘s are both used to store intermediate results in MySQL, but there are some key differences between the two:

𝗗𝗲𝗳𝗶𝗻𝗶𝘁𝗶𝗼𝗻: A temporary table is a physical table that is created in the database and persists until it is explicitly dropped or the session ends. A CTE is a virtual table that is defined only within the scope of a single SQL statement.

𝗦𝘁𝗼𝗿𝗮𝗴𝗲: Temporary tables are stored in the database and can occupy physical disk space. CTEs are not stored as database objects and exist only for the duration of the query.

𝗔𝗰𝗰𝗲𝘀𝘀: In MySQL, a temporary table is visible only to the session that created it, although it can be reused by multiple queries within that session. CTEs are only accessible within the scope of the query in which they are defined.

𝗟𝗶𝗳𝗲𝘀𝗽𝗮𝗻: Temporary tables persist until they are explicitly dropped or the session ends. CTEs are only available for the duration of the query in which they are defined and are then discarded.

𝗦𝘆𝗻𝘁𝗮𝘅: Temporary tables are created using the CREATE TEMPORARY TABLE statement, while CTEs are defined using the WITH clause.

𝗣𝘂𝗿𝗽𝗼𝘀𝗲: Temporary tables are typically used to store intermediate results that will be used in multiple queries, while CTEs are used to simplify complex queries by breaking them down into smaller, more manageable parts.

In summary, temporary tables are physical tables that persist for the duration of the session that created them and can be reused by several queries in that session, while CTEs are virtual tables that exist only within the scope of a single query and are discarded once the query is complete. Both temporary tables and CTEs can be useful tools for simplifying complex queries and storing intermediate results.

### Q11: Why use Right Join When Left Join can suffice the requirement? ###

Answer:
In MySQL, the 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 𝗮𝗻𝗱 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 are used to retrieve data from multiple tables by joining them based on a specified condition.

Generally, the 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 is used more frequently than the 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 because it returns all the rows from the left table and matching rows from the right table, or NULL values if there is no match.

In most cases, a 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 is sufficient to meet the requirement of retrieving all the data from the left table and matching data from the right table.

However, there may be situations where using a 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 is more appropriate.

Here are a few examples:

𝟭. 𝗪𝗵𝗲𝗻 𝘁𝗵𝗲 𝗽𝗿𝗶𝗺𝗮𝗿𝘆 𝘁𝗮𝗯𝗹𝗲 𝗶𝘀 𝘁𝗵𝗲 𝗿𝗶𝗴𝗵𝘁 𝘁𝗮𝗯𝗹𝗲: If the right table contains the primary data that needs to be retrieved, and the left table contains supplementary data, a 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 can be used to retrieve all the data from the right table and matching data from the left table.

𝟮. 𝗪𝗵𝗲𝗻 𝘁𝗵𝗲 𝗾𝘂𝗲𝗿𝘆 𝗻𝗲𝗲𝗱𝘀 𝘁𝗼 𝗯𝗲 𝗼𝗽𝘁𝗶𝗺𝗶𝘇𝗲𝗱: In some cases, a 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 may be more efficient than a 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 because the database optimizer can choose the most efficient join order based on the query structure and the available indexes.

𝟯.
𝗪𝗵𝗲𝗻 𝘂𝘀𝗶𝗻𝗴 𝗼𝘂𝘁𝗲𝗿 𝗷𝗼𝗶𝗻𝘀: If the query requires an outer join, a 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 may be used to return all the rows from the right table, including those with no matching rows in the left table.
It's important to note that while a 𝗥𝗜𝗚𝗛𝗧 𝗝𝗢𝗜𝗡 can provide additional functionality in certain cases, it may also make the query more complex and difficult to read. In most cases, a 𝗟𝗘𝗙𝗧 𝗝𝗢𝗜𝗡 is the preferred method for joining tables in MySQL.

### Q12: Why Rank skips sequence? ###

Answer:
In MySQL, ranking window functions may produce rank values that skip numbers in the sequence, depending on the function used and the data. The `RANK()` function assigns the same rank to duplicate (tied) values and then skips the following ranks, whereas the `DENSE_RANK()` function also assigns the same rank to ties but never skips a rank.

Here are some of the reasons why the rank values may skip a sequence in MySQL:

1. 𝗧𝗵𝗲 `𝗥𝗔𝗡𝗞()` function skips ranks when there are ties. For example, if two rows share the same value in the ranking column, both are assigned the same rank, and the next distinct value receives a rank incremented by the number of tied rows (1, 1, 3, ...), so rank 2 is skipped.

2. 𝗧𝗵𝗲 `𝗗𝗘𝗡𝗦𝗘_𝗥𝗔𝗡𝗞()` function, in contrast, does not skip ranks. For example, if there are three rows with the same value in the ranking column and then a row with a higher value, the tied rows all get rank 1 and the next row gets rank 2.

3. The query may have filtering or grouping clauses that affect the ranking. For example, if a query filters out some rows or groups them by a different column, the ranking may not be sequential.

It's important to note that the exact behavior of ranking functions can vary between database systems, so the same query may produce different results in different database systems.
--------------------------------------------------------------------------------
/Statistics Interview Questions & Answers for Data Scientists.md:
--------------------------------------------------------------------------------
# Statistics Interview Questions & Answers for Data Scientists #

## Questions ##
* [Q1: Explain the central limit theorem and give examples of when you can use it in a real-world problem?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q1-explain-the-central-limit-theorem-and-give-examples-of-when-you-can-use-it-in-a-real-world-problem)
* [Q2: Briefly explain the A/B testing and its application? What are some common pitfalls encountered in A/B testing?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q2-briefly-explain-the-ab-testing-and-its-application-what-are-some-common-pitfalls-encountered-in-ab-testing)
* [Q3: Describe briefly the hypothesis testing and p-value in layman’s term?
And give a practical application for them ?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q3-describe-briefly-the-hypothesis-testing-and-p-value-in-laymans-term-and-give-a-practical-application-for-them-) 7 | * [Q4: Given a left-skewed distribution that has a median of 60, what conclusions can we draw about the mean and the mode of the data?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q4-given-a-left-skewed-distribution-that-has-a-median-of-60-what-conclusions-can-we-draw-about-the-mean-and-the-mode-of-the-data) 8 | * [Q5: What is the meaning of selection bias and how to avoid it?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q5-what-is-the-meaning-of-selection-bias-and-how-to-avoid-it) 9 | * [Q6: Explain the long-tailed distribution and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q6-explain-the-long-tailed-distribution-and-provide-three-examples-of-relevant-phenomena-that-have-long-tails-why-are-they-important-in-classification-and-regression-problems) 10 | * [Q7: What is the meaning of KPI in statistics](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q7-what-is-the-meaning-of-kpi-in-statistics) 11 | * [Q8: Say you flip a coin 10 times and observe only one head. What would be the null hypothesis and p-value for testing whether the coin is fair or not?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q8-say-you-flip-a-coin-10-times-and-observe-only-one-head-what-would-be-the-null-hypothesis-and-p-value-for-testing-whether-the-coin-is-fair-or-not) 12 | * [Q9: You are testing hundreds of hypotheses, each with a t-test. What considerations would you take into account when doing this?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q9-you-are-testing-hundreds-of-hypotheses-each-with-a-t-test-what-considerations-would-you-take-into-account-when-doing-this) 13 | * [Q10: What general conditions must be satisfied for the central limit theorem to hold?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q10-what-general-conditions-must-be-satisfied-for-the-central-limit-theorem-to-hold) 14 | * [Q11: What is skewness discuss two methods to measure it?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q11-what-is-skewness-discuss-two-methods-to-measure-it) 15 | * [Q12: You sample from a uniform distribution [0, d] n times. 
What is your best estimate of d?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q12-you-sample-from-a-uniform-distribution-0-d-n-times-what-is-your-best-estimate-of-d) 16 | * [Q13: Discuss the Chi-square, ANOVA, and t-test](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q13-discuss-the-chi-square-anova-and-t-test) 17 | * [Q14: Say you have two subsets of a dataset for which you know their means and standard deviations. How do you calculate the blended mean and standard deviation of the total dataset? Can you extend it to K subsets?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q14-say-you-have-two-subsets-of-a-dataset-for-which-you-know-their-means-and-standard-deviations-how-do-you-calculate-the-blended-mean-and-standard-deviation-of-the-total-dataset-can-you-extend-it-to-k-subsets) 18 | * [Q15: What is the relationship between the significance level and the confidence level in Statistics?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#q15-what-is-the-relationship-between-the-significance-level-and-the-confidence-level-in-statistics) 19 | * [Q16: What is the Law of Large Numbers in statistics and how it can be used in data science ?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20%26%20Answers%20for%20Data%20Scientists.md#:~:text=written%20in%20percentages.-,Q16%3A%20What%20is%20the%20Law%20of%20Large%20Numbers%20in%20statistics%20and%20how%20it%20can%20be%20used%20in%20data%20science%20%3F,-Answer%3A%20The%20law) 20 | * [Q17: What is the difference between a confidence interval and a prediction interval, and how do you calculate them?](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers/blob/main/Statistics%20Interview%20Questions%20&%20Answers%20for%20Data%20Scientists.md#:~:text=and%20hypothesis%20testing.-,Q17%3A%20What%20is%20the%20difference%20between%20a%20confidence%20interval%20and%20a%20prediction%20interval%2C%20and%20how%20do%20you%20calculate%20them%3F,-Answer%3A) 21 | * [Q18: What are the differences between the z-test and t-test?]() 22 | * [Q19: When to use a z-test Vs a t-test?]() 23 | * [Q20: Given a specific dataset, how do you calculate t-statistic or z-statistics?]() 24 | 25 | ## Questions & Answers ## 26 | 27 | ### Q1: Explain the central limit theorem and give examples of when you can use it in a real-world problem. ### 28 | 29 | Answers: 30 | 31 | The center limit theorem states that if any random variable, regardless of the distribution, is sampled a large enough time, the sample mean will be approximately normally distributed. This allows for studying the properties of any statistical distribution as long as there is a large enough sample size. 32 | 33 | Important remark from Adrian Olszewski: 34 | ⚠️ we can rely on the CLT with means (because it applies to any unbiased statistic) only if expressing data in this way makes sense. And it makes sense *ONLY* in the case of unimodal and symmetric data, coming from additive processes. 
So forget skewed, multi-modal data with mixtures of distributions, data coming from multiplicative processes, and non-trivial mean-variance relationships. Those are the cases where the arithmetic mean is meaningless. Thus, using the CLT or, e.g., the bootstrap will just give valid answers to an invalid question.

⚠️ the distribution of means isn't enough. Every single kind of inference requires the entire test statistic to follow a certain distribution. And the test statistic also includes the estimate of variance. Never assume that a sample size sufficient for the mean will suffice for the entire test statistic. See an excerpt from Rand Wilcox attached. In particular, never believe in magic numbers like N=30.

⚠️ think first about how to sensibly describe your data, state the hypothesis of interest, and then apply a valid method.


Examples of real-world usage of CLT:

1. The CLT can be used at any company with a large amount of data. Consider a company like Uber/Lyft that wants to test, using hypothesis testing, whether adding a new feature will increase the number of booked rides or not. If we have a large number of individual rides X, where each X is a Bernoulli random variable (since the rider either books a ride or not), we can estimate the statistical properties of the total number of bookings. Understanding and estimating these statistical properties plays a significant role in applying hypothesis testing to the data and knowing whether adding the new feature will increase the number of booked rides or not.

2. Manufacturing plants often use the central limit theorem to estimate how many products produced by the plant are defective.

### Q2: Briefly explain the A/B testing and its application? What are some common pitfalls encountered in A/B testing? ###
A/B testing helps us determine whether a change in something causes a statistically significant change in performance or not. In other words, you aim to statistically estimate the impact of a given change within your digital product (for example). You measure success and counter metrics on at least 1 treatment vs 1 control group (there can be more than 1 experiment group for multivariate tests).

Applications:
1. Consider the example of a general store that sells bread packets but not butter, for a year. If we want to check whether its sales depend on butter or not, then suppose the store also sells butter and the sales for the next year are observed. Now we can determine whether selling butter significantly increases/decreases, or doesn't affect, the sale of bread.

2. While developing the landing page of a website you create 2 different versions of the page. You define a criterion for success, e.g., conversion rate. Then define your hypotheses:
Null hypothesis (H): No difference between the performance of the 2 versions. Alternative hypothesis (H'): version A will perform better than B.

NOTE: You will have to split your traffic randomly (to avoid sampling bias) into the 2 versions. The split doesn't have to be symmetric; you just need to set the minimum sample size for each version to avoid an underpowered test.

Now if version A gives better results than version B, we will still have to statistically prove that the results derived from our sample represent the entire population. One of the most common tests used to do so is the 2-sample t-test, where we use the significance level (alpha) and the p-value to decide which hypothesis is supported: if the p-value is smaller than alpha, we reject the null hypothesis and conclude that the difference between the two versions is statistically significant.
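A minimal sketch of that two-sample t-test step, using simulated conversion data; the conversion rates, sample sizes, and variable names are made up for illustration:

```
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated conversions: 1 = converted, 0 = did not convert
version_a = rng.binomial(1, 0.12, size=5000)   # assumed 12% conversion rate
version_b = rng.binomial(1, 0.10, size=5000)   # assumed 10% conversion rate

# Two-sample t-test (Welch's variant, which does not assume equal variances)
t_stat, p_value = stats.ttest_ind(version_a, version_b, equal_var=False)

alpha = 0.05
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the two versions perform differently.")
else:
    print("Fail to reject the null hypothesis.")
```

For purely binary outcomes like conversions, a two-proportion z-test (e.g., statsmodels' proportions_ztest) is an equally common choice and gives essentially the same conclusion at these sample sizes.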
![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/1657144303401.jpg)

### Q5: What is the meaning of selection bias and how to avoid it? ###
Answer:

Sampling bias is the phenomenon that occurs when a research study design fails to collect a representative sample of a target population. This typically occurs because the selection criteria for respondents failed to capture a wide enough sampling frame to represent all viewpoints.

The cause of sampling bias almost always owes to one of two conditions.
1. Poor methodology: In most cases, non-representative samples pop up when researchers set improper parameters for survey research. The most accurate and repeatable sampling method is simple random sampling where a large number of respondents are chosen at random. When researchers stray from random sampling (also called probability sampling), they risk injecting their own selection bias into recruiting respondents.

2. Poor execution: Sometimes data researchers craft scientifically sound sampling methods, but their work is undermined when field workers cut corners. By reverting to convenience sampling (where the only people studied are those who are easy to reach) or giving up on reaching non-responders, a field worker can jeopardize the careful methodology set up by data scientists.

The best way to avoid sampling bias is to stick to probability-based sampling methods. These include simple random sampling, systematic sampling, cluster sampling, and stratified sampling. In these methodologies, respondents are only chosen through processes of random selection, even if they are sometimes sorted into demographic groups along the way.
![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/Sampling%20bias.png)

### Q6: Explain the long-tailed distribution and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems? ###

Answer:
A long-tailed distribution is a type of heavy-tailed distribution that has a tail (or tails) that drop off gradually and asymptotically.

Three examples of relevant phenomena that have long tails:

1. Frequencies of languages spoken
2. Population of cities
3. Pageviews of articles

All of these follow something close to the 80-20 rule: 80% of outcomes (or outputs) result from 20% of all causes (or inputs) for any given event. This 20% forms the long tail in the distribution.

It’s important to be mindful of long-tailed distributions in classification and regression problems because the least frequently occurring values make up the majority of the population. This can ultimately change the way that you deal with outliers, and it also conflicts with machine learning techniques that assume the data is normally distributed.
![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/long-tailed%20distribution.jpg)

### Q7: What is the meaning of KPI in statistics ###
Answer:

**KPI** stands for key performance indicator, a quantifiable measure of performance over time for a specific objective. KPIs provide targets for teams to shoot for, milestones to gauge progress, and insights that help people across the organization make better decisions.
From finance and HR to marketing and sales, key performance indicators help every area of the business move forward at the strategic level.

KPIs are an important way to ensure your teams are supporting the overall goals of the organization. Here are some of the biggest reasons why you need key performance indicators.

* Keep your teams aligned: Whether measuring project success or employee performance, KPIs keep teams moving in the same direction.
* Provide a health check: Key performance indicators give you a realistic look at the health of your organization, from risk factors to financial indicators.
* Make adjustments: KPIs help you clearly see your successes and failures so you can do more of what’s working, and less of what’s not.
* Hold your teams accountable: Make sure everyone provides value with key performance indicators that help employees track their progress and help managers move things along.

Types of KPIs
Key performance indicators come in many flavors. While some are used to measure monthly progress against a goal, others have a longer-term focus. The one thing all KPIs have in common is that they’re tied to strategic goals. Here’s an overview of some of the most common types of KPIs.

* **Strategic**: These big-picture key performance indicators monitor organizational goals. Executives typically look to one or two strategic KPIs to find out how the organization is doing at any given time. Examples include return on investment, revenue and market share.
* **Operational:** These KPIs typically measure performance in a shorter time frame, and are focused on organizational processes and efficiencies. Some examples include sales by region, average monthly transportation costs and cost per acquisition (CPA).
* **Functional Unit:** Many key performance indicators are tied to specific functions, such as finance or IT. While IT might track time to resolution or average uptime, finance KPIs track gross profit margin or return on assets. These functional KPIs can also be classified as strategic or operational.
* **Leading vs Lagging:** Regardless of the type of key performance indicator you define, you should know the difference between leading indicators and lagging indicators. While leading KPIs can help predict outcomes, lagging KPIs track what has already happened. Organizations use a mix of both to ensure they’re tracking what’s most important.


![Alt_text](https://github.com/youssefHosni/Data-Science-Interview-Questions/blob/main/Figures/KPI.png)

### Q8: Say you flip a coin 10 times and observe only one head. What would be the null hypothesis and p-value for testing whether the coin is fair or not? ###

Answer:

The null hypothesis is that the coin is fair, and the alternative hypothesis is that the coin is biased. The p-value is the probability of observing results at least as extreme as those obtained, given that the null hypothesis is true, in this case, that the coin is fair.

In total, for 10 flips of a coin there are 2^10 = 1024 equally likely outcomes, and in only 10 of them are there 9 tails and one head, so the probability of exactly one head is 10/1024 = 0.0098.

Counting the outcomes that are at least as extreme (zero or one head) gives a one-sided p-value of 11/1024 = 0.0107, or about 0.021 for a two-sided test. Therefore, with a significance level set, for example, at 0.05, we can reject the null hypothesis that the coin is fair.
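A quick check of those numbers with SciPy (a sketch; the two-sided value uses the fair-coin symmetry):

```
from scipy import stats

n, p = 10, 0.5

p_exactly_one = stats.binom.pmf(1, n, p)   # P(X = 1)  ~ 0.0098
p_one_sided = stats.binom.cdf(1, n, p)     # P(X <= 1) ~ 0.0107
p_two_sided = 2 * p_one_sided              # ~ 0.0215

print(p_exactly_one, p_one_sided, p_two_sided)
# In recent SciPy versions, stats.binomtest(1, n=10, p=0.5).pvalue
# returns the exact two-sided p-value directly.
```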
### Q9: You are testing hundreds of hypotheses, each with a t-test. What considerations would you take into account when doing this? ###

Answer:
The main consideration when we have a large number of tests is that the probability of getting at least one significant result due to chance alone increases. This inflates the type 1 error (rejecting the null hypothesis when it's actually true).

Therefore we need to correct for multiple testing, for example with the Bonferroni correction. Ex. If our significance level is 0.05 and we run K = 100 tests, the chance of at least one value falling inside the rejection region by chance alone is about 1 - 0.95^100 ≈ 99%, so each individual test should instead use a corrected significance level alpha star = significance level / K = 0.05 / 100 = 0.0005.


### Q10: What general conditions must be satisfied for the central limit theorem to hold? ###

Answer:

In order to apply the central limit theorem, there are four conditions that must be met:

1. **Randomization:** The data must be sampled randomly such that every member in a population has an equal probability of being selected to be in the sample.

2. **Independence:** The sample values must be independent of each other.

3. **The 10% Condition:** When the sample is drawn without replacement, the sample size should be no larger than 10% of the population.

4. **Large Sample Condition:** The sample size needs to be sufficiently large.

### Q11: What is skewness discuss two methods to measure it? ###

Answer:

Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be quantified as a representation of the extent to which a given distribution varies from a normal distribution.
There are two main types of skewness: negative skew, which refers to a longer or fatter tail on the left side of the distribution, and positive skew, which refers to a longer or fatter tail on the right. These two skews describe the direction or weight of the distribution.

The mean of positively skewed data will be greater than the median. In a negatively skewed distribution, the exact opposite is the case: the mean of negatively skewed data will be less than the median. If the data graphs symmetrically, the distribution has zero skewness, regardless of how long or fat the tails are.

There are several ways to measure skewness. Pearson’s first and second coefficients of skewness are two common methods. Pearson’s first coefficient of skewness, or Pearson mode skewness, subtracts the mode from the mean and divides the difference by the standard deviation. Pearson’s second coefficient of skewness, or Pearson median skewness, subtracts the median from the mean, multiplies the difference by three, and divides the product by the standard deviation.

![1663943424873](https://user-images.githubusercontent.com/72076328/191984720-5b267ab0-9ed3-4315-8443-62d662822796.jpg)

### Q12: You sample from a uniform distribution [0, d] n times. What is your best estimate of d? ###

Answer:

Intuitively, it is the maximum of the sample points. The mathematical proof is in the figure below:

![1665416540418](https://user-images.githubusercontent.com/72076328/194904988-e35754a7-6331-4989-9523-e122edb2a065.jpg)
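A small simulation of the estimator discussed above (a sketch; the true d, the sample size, and the number of trials are arbitrary). As an aside not covered in the original answer, the sample maximum slightly underestimates d on average, and rescaling it by (n + 1)/n corrects this:

```
import numpy as np

rng = np.random.default_rng(0)
d_true, n, trials = 10.0, 50, 10_000

samples = rng.uniform(0, d_true, size=(trials, n))
max_est = samples.max(axis=1)             # maximum-likelihood estimate of d
rescaled_est = (n + 1) / n * max_est      # bias-corrected estimate (aside)

print(max_est.mean())        # ~ 9.80, slightly below the true d = 10
print(rescaled_est.mean())   # ~ 10.00
```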
### Q13: Discuss the Chi-square, ANOVA, and t-test ###

Answer:

The Chi-square test is a statistical method used to determine whether the observed frequencies of categorical variables differ from the expected frequencies, i.e., whether there is a relationship between categorical variables in the dataset.

Example: A food delivery company wants to find the relationship between gender, location, and food choices of people.

It is used to determine whether the difference between 2 categorical variables is:

* Due to chance or

* Due to a relationship

Analysis of Variance (ANOVA) is a statistical method used to compare the means (or averages) of different groups by analyzing their variances. A range of scenarios uses it to determine whether there is any difference between the means of several groups.

The t-test is a statistical method for comparing the means of two groups of (approximately) normally distributed samples.

It comes in various types, such as:

1. One sample t-test:

Used to compare the mean of a sample to a hypothesized population mean.

2. Two sample t-tests:

Used to compare the means of two independent samples and test whether their populations are statistically different.

3. Paired t-test:

Used to compare the means of two related measurements taken on the same group (for example, before and after a treatment).

### Q14: Say you have two subsets of a dataset for which you know their means and standard deviations. How do you calculate the blended mean and standard deviation of the total dataset? Can you extend it to K subsets? ###

Answer:

For two subsets with sizes n1 and n2, means m1 and m2, and (population, i.e., n-divisor) standard deviations s1 and s2:

* Blended mean: m = (n1*m1 + n2*m2) / (n1 + n2)
* Blended variance: s^2 = [n1*(s1^2 + m1^2) + n2*(s2^2 + m2^2)] / (n1 + n2) - m^2, and the blended standard deviation is the square root of s^2. This works because sk^2 + mk^2 is the mean of the squared values in subset k, so the formula recombines the overall mean of squares and subtracts the square of the overall mean.

The same idea extends to K subsets: m = (sum over k of nk*mk) / (sum of nk) and s^2 = [sum over k of nk*(sk^2 + mk^2)] / (sum of nk) - m^2. (If the reported standard deviations use the n-1 divisor, convert them to the n-divisor form first, or adjust the formula accordingly.)


### Q15: What is the relationship between the significance level and the confidence level in Statistics? ###

Answer:
Confidence level = 1 - significance level.

It's closely related to hypothesis testing and confidence intervals.

⏺ Significance level, in the hypothesis testing literature, means the probability of a Type-I error one is willing to tolerate.

⏺ Confidence level, in the confidence interval literature, refers to the long-run proportion of such intervals that contain the true parameter value. They are usually written as percentages.

### Q16: What is the Law of Large Numbers in statistics and how it can be used in data science ? ###

Answer:
The law of large numbers states that as the number of trials in a random experiment increases, the average of the results obtained from the experiment approaches the expected value. In statistics, it's used to describe the relationship between sample size and the accuracy of statistical estimates.

In data science, the law of large numbers is used to understand the behavior of random variables over many trials. It's often applied in areas such as predictive modeling, risk assessment, and quality control to ensure that data-driven decisions are based on a robust and accurate representation of the underlying patterns in the data.

The law of large numbers helps to guarantee that the average of the results from a large number of independent and identically distributed trials will converge to the expected value, providing a foundation for statistical inference and hypothesis testing.
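A quick simulation of the law of large numbers described in Q16 (a sketch; the coin-flip setup and sample sizes are arbitrary):

```
import numpy as np

rng = np.random.default_rng(1)
flips = rng.integers(0, 2, size=100_000)   # fair coin: 0 = tails, 1 = heads

# The running mean drifts toward the expected value 0.5 as n grows
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, flips[:n].mean())
```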
255 | 256 | ### Q17: What is the difference between a confidence interval and a prediction interval, and how do you calculate them? ### 257 | 258 | Answer: 259 | 260 | A confidence interval is a range of values that is likely to contain the true value of a population parameter with a certain level of confidence. It is used to estimate the precision or accuracy of a sample statistic, such as a mean or a proportion, based on a sample from a larger population. 261 | 262 | For example, if we want to estimate the average height of all adults in a certain region, we can take a random sample of individuals from that region and calculate the sample mean height. Then we can construct a confidence interval for the true population mean height, based on the sample mean and the sample size, with a certain level of confidence, such as 95%. This means that if we repeat the sampling process many times, 95% of the resulting intervals will contain the true population mean height. 263 | 264 | The formula for a confidence interval is: 265 | confidence interval = sample statistic +/- margin of error 266 | 267 | The margin of error depends on the sample size, the standard deviation of the population (or the sample, if the population standard deviation is unknown), and the desired level of confidence. For example, if the sample size is larger or the standard deviation is smaller, the margin of error will be smaller, resulting in a narrower confidence interval. 268 | 269 | A prediction interval is a range of values that is likely to contain a future observation or outcome with a certain level of confidence. It is used to estimate the uncertainty or variability of a future value based on a statistical model and the observed data. 270 | 271 | For example, if we have a regression model that predicts the sales of a product based on its price and advertising budget, we can use a prediction interval to estimate the range of possible sales for a new product with a certain price and advertising budget, with a certain level of confidence, such as 95%. This means that if we repeat the prediction process many times, 95% of the resulting intervals will contain the true sales value. 272 | 273 | The formula for a prediction interval is: 274 | prediction interval = point estimate +/- margin of error 275 | 276 | The point estimate is the predicted value of the outcome variable based on the model and the input variables. The margin of error depends on the residual standard deviation of the model, which measures the variability of the observed data around the predicted values, and the desired level of confidence. For example, if the residual standard deviation is larger or the level of confidence is higher, the margin of error will be larger, resulting in a wider prediction interval. 277 | ![4](https://user-images.githubusercontent.com/72076328/227254955-b57bd42a-b51b-4b4a-abab-1adb059eca98.png) 278 | 279 | 280 | --------------------------------------------------------------------------------