├── .gitignore ├── Coding.ipynb ├── DataScience_Interview_Questions.pdf ├── README.md ├── communication.md ├── data-analysis.md ├── predictive-modeling.md ├── probability.md ├── product-metrics.md ├── programming.md └── statistical-inference.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # zip files 10 | *.tar.gz 11 | images/* 12 | 13 | # Trivial files 14 | *.DS_Store 15 | 16 | # Distribution / packaging 17 | .Python 18 | build/ 19 | develop-eggs/ 20 | dist/ 21 | downloads/ 22 | eggs/ 23 | .eggs/ 24 | lib/ 25 | lib64/ 26 | parts/ 27 | sdist/ 28 | var/ 29 | wheels/ 30 | *.egg-info/ 31 | .installed.cfg 32 | *.egg 33 | MANIFEST 34 | 35 | # PyInstaller 36 | # Usually these files are written by a python script from a template 37 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 38 | *.manifest 39 | *.spec 40 | 41 | # Installer logs 42 | pip-log.txt 43 | pip-delete-this-directory.txt 44 | 45 | # Unit test / coverage reports 46 | htmlcov/ 47 | .tox/ 48 | .coverage 49 | .coverage.* 50 | .cache 51 | nosetests.xml 52 | coverage.xml 53 | *.cover 54 | .hypothesis/ 55 | .pytest_cache/ 56 | 57 | # Translations 58 | *.mo 59 | *.pot 60 | 61 | # Django stuff: 62 | *.log 63 | local_settings.py 64 | db.sqlite3 65 | 66 | # Flask stuff: 67 | instance/ 68 | .webassets-cache 69 | 70 | # Scrapy stuff: 71 | .scrapy 72 | 73 | # Sphinx documentation 74 | docs/_build/ 75 | 76 | # PyBuilder 77 | target/ 78 | 79 | # Jupyter Notebook 80 | .ipynb_checkpoints 81 | 82 | # pyenv 83 | .python-version 84 | 85 | # celery beat schedule file 86 | celerybeat-schedule 87 | 88 | # SageMath parsed files 89 | *.sage.py 90 | 91 | # Environments 92 | .env 93 | .venv 94 | env/ 95 | venv/ 96 | ENV/ 97 | env.bak/ 98 | venv.bak/ 99 | 100 | # Spyder project settings 101 | .spyderproject 102 | .spyproject 103 
| 104 | # Rope project settings 105 | .ropeproject 106 | 107 | # mkdocs documentation 108 | /site 109 | 110 | # mypy 111 | .mypy_cache/ 112 | -------------------------------------------------------------------------------- /Coding.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "1. Write a function to calculate all possible assignment vectors of 2n users, where n users are assigned to group 0 (control), and n users are assigned to group 1 (treatment)." 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "name": "stdout", 17 | "output_type": "stream", 18 | "text": [ 19 | "[[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1]]\n" 20 | ] 21 | } 22 | ], 23 | "source": [ 24 | "def n_choose_k(n, k):\n", 25 | " \"\"\" function to choose k from n \"\"\"\n", 26 | " if k == 1:\n", 27 | " ans = []\n", 28 | " for i in range(n):\n", 29 | " tmp = [0] * n\n", 30 | " tmp[i] = 1\n", 31 | " ans.append(tmp)\n", 32 | " return ans\n", 33 | " \n", 34 | " if k == n:\n", 35 | " return [[1] * n]\n", 36 | " \n", 37 | " ans = []\n", 38 | " space = n - k + 1\n", 39 | " for i in range(space):\n", 40 | " assignment = [0] * (i + 1)\n", 41 | " assignment[i] = 1\n", 42 | " for c in n_choose_k(n - i - 1, k - 1):\n", 43 | " ans.append(assignment + c)\n", 44 | " return ans\n", 45 | "\n", 46 | "# test: choose 2 from 4\n", 47 | "print(n_choose_k(4, 2))" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [] 56 | } 57 | ], 58 | "metadata": { 59 | "kernelspec": { 60 | "display_name": "Python 3", 61 | "language": "python", 62 | "name": "python3" 63 | }, 64 | "language_info": { 65 | "codemirror_mode": { 66 | "name": "ipython", 67 | "version": 3 68 | }, 69 | "file_extension": ".py", 70 | 
"mimetype": "text/x-python", 71 | "name": "python", 72 | "nbconvert_exporter": "python", 73 | "pygments_lexer": "ipython3", 74 | "version": "3.6.5" 75 | } 76 | }, 77 | "nbformat": 4, 78 | "nbformat_minor": 2 79 | } 80 | -------------------------------------------------------------------------------- /DataScience_Interview_Questions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JifuZhao/120-DS-Interview-Questions/e829fbb44598b51904faa3f62e1c240321c2b824/DataScience_Interview_Questions.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 120 Data Science Interview Questions 2 | 3 | Here are the answers to [120 Data Science Interview Questions](http://www.datasciencequestions.com/). 4 | 5 | - [Predictive Modeling](predictive-modeling.md) 6 | - [Programming](programming.md) 7 | - [Probability](probability.md) 8 | - [Statistical Inference](statistical-inference.md) 9 | - [Data Analysis](data-analysis.md) 10 | - [Product Metrics](product-metrics.md) 11 | - [Communication](communication.md) 12 | 13 | Some of the answers above are modified from Kojin's original collection [kojino/120-Data-Science-Interview-Questions](https://github.com/kojino/120-Data-Science-Interview-Questions). 14 | 15 | Another solution is from [Nitish-McQueen](https://github.com/Nitish-McQueen): [Data Science Interview Questions](./DataScience_Interview_Questions.pdf) 16 | 17 | Many other materials are useful for data science interviews; feel free to check my adapted collection: [JifuZhao/FreeML](https://github.com/JifuZhao/FreeML). 18 | 19 | Quora has a good list of questions: [https://datascienceinterview.quora.com/Answers-1](https://datascienceinterview.quora.com/Answers-1). 20 | 21 | Feel free to send me a pull request if you find any mistakes or have better answers. 
22 | -------------------------------------------------------------------------------- /communication.md: -------------------------------------------------------------------------------- 1 | ## Communication (5 questions) 2 | 3 | #### 1. Explain to me a technical concept related to the role that you’re interviewing for. 4 | - A/B test, PCA, data science, machine learning, neural networks 5 | 6 | #### 2. Introduce me to something you’re passionate about. 7 | - Data science 8 | 9 | #### 3. How would you explain an A/B test to an engineer with no statistics background? A linear regression? 10 | - A/B testing, or more broadly, multivariate testing, is the testing of different elements of a user's experience to determine which variation helps the business achieve its goal more effectively (e.g., increasing conversions). The elements tested can be copy on a website, button colors, different user interfaces, different email subject lines, calls to action, offers, etc. 11 | 12 | #### 4. How would you explain a confidence interval to an engineer with no statistics background? What does 95% confidence mean? 13 | - [link](https://www.quora.com/What-is-a-confidence-interval-in-laymans-terms) 14 | 15 | #### 5. How would you explain to a group of senior executives why data is important? 16 | - Examples 17 | -------------------------------------------------------------------------------- /data-analysis.md: -------------------------------------------------------------------------------- 1 | ## Data Analysis (27 questions) 2 | 3 | #### 1. (Given a Dataset) Analyze this dataset and tell me what you can learn from it. 4 | - Typical data cleaning and visualization 5 | 6 | #### 2. What is R^2? What are some other metrics that could be better than R^2 and why? 7 | - A goodness-of-fit measure: variance explained by the regression / total variance 8 | - the more predictors you add, the higher R^2 becomes. 
9 | - hence use adjusted R^2 which adjusts for the degrees of freedom 10 | - or train error metrics 11 | 12 | #### 3. What is the curse of dimensionality? 13 | - High dimensionality makes clustering hard, because having lots of dimensions means that everything is "far away" from everything else. 14 | - For example, to cover a fixed fraction of the volume of the data, we need to capture a very wide range of each variable as the number of variables increases. 15 | - Almost all samples lie close to the edge of the sample space, which is bad news because prediction is much more difficult near the edges of the training sample. 16 | - The sampling density decreases exponentially as p increases, so the data becomes much sparser unless significantly more data is collected. 17 | - We can use PCA to reduce dimensionality 18 | 19 | #### 4. Is more data always better? 20 | - Statistically 21 | - It depends on the quality of your data. For example, if your data is biased, just getting more data won’t help. 22 | - It depends on your model. If your model suffers from high bias, getting more data won’t improve your test results beyond a point. You’d need to add more features, etc. 23 | - Practically 24 | - More data usually benefits the model 25 | - There’s also a tradeoff between having more data and the additional storage, computational power, and memory it requires. Hence, always think about the cost of having more data. 26 | 27 | #### 5. What are advantages of plotting your data before performing analysis? 28 | - Data sets have errors. You won't find them all but you might find some. That 212 year old man. That 9 foot tall woman. 29 | - Variables can have skewness, outliers, etc. Then the arithmetic mean might not be useful, which means the standard deviation isn't useful. 30 | - Variables can be multimodal! If a variable is multimodal then anything based on its mean or median is going to be suspect. 31 | 32 | #### 6. 
How can you make sure that you don’t analyze something that ends up meaningless? 33 | - Proper exploratory data analysis. 34 | - In every data analysis task, there's the exploratory phase where you're just graphing things, testing things on small sets of the data, summarizing simple statistics, and getting rough ideas of what hypotheses you might want to pursue further. 35 | - Then there's the exploitation phase, where you look deeply into a set of hypotheses. 36 | - The exploratory phase will generate lots of possible hypotheses, and the exploitation phase will let you really understand a few of them. Balance the two and you'll prevent yourself from wasting time on many things that end up meaningless, although not all. 37 | 38 | #### 7. What is the role of trial and error in data analysis? What is the role of making a hypothesis before diving in? 39 | - Data analysis is a repeated cycle of setting up a new hypothesis and trying to refute the null hypothesis. 40 | - The scientific method is eminently inductive: we elaborate a hypothesis, test it and refute it or not. As a result, we come up with new hypotheses which are in turn tested and so on. This is an iterative process, as science always is. 41 | 42 | #### 8. How can you determine which features are the most important in your model? 43 | - For linear regression, look at the p-values of the coefficients 44 | - Run the features through a Gradient Boosting Machine or Random Forest to generate plots of relative importance and information gain for each feature in the ensembles. 45 | - Look at the variables added in forward variable selection 46 | 47 | #### 9. How do you deal with some of your predictors being missing? 48 | - Remove rows with missing values - This works well if 49 | - the values are missing randomly (see [Vinay Prabhu's answer](https://www.quora.com/How-can-I-deal-with-missing-values-in-a-predictive-model/answer/Vinay-Prabhu-7) for more details on this) 50 | - you don't lose too much of the dataset after doing so. 
51 | - Build another predictive model to predict the missing values 52 | - This could be a whole project in itself, so simple techniques are usually used here. 53 | - Use a model that can incorporate missing data 54 | - Like a random forest, or any tree-based method. 55 | 56 | #### 10. You have several variables that are positively correlated with your response, and you think combining all of the variables could give you a good prediction of your response. However, you see that in the multiple linear regression, one of the weights on the predictors is negative. What could be the issue? 57 | - The likely issue is multicollinearity, a situation in which two or more explanatory variables in a [multiple regression](https://en.wikipedia.org/wiki/Multiple_regression "Multiple regression") model are highly linearly related. 58 | - One option is to leave the model as is, despite multicollinearity. The presence of multicollinearity doesn't affect the efficiency of extrapolating the fitted model to new data provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based. 59 | - Another option is principal component regression 60 | 61 | #### 11. Let’s say you’re given an unfeasible amount of predictors in a predictive modeling task. What are some ways to make the prediction more feasible? 62 | - PCA 63 | 64 | #### 12. Now you have a feasible amount of predictors, but you’re fairly sure that you don’t need all of them. How would you perform feature selection on the dataset? 65 | - ridge / lasso / elastic net regression 66 | - Univariate feature selection, where a statistical test is applied to each feature individually. You retain only the best features according to the test outcome scores. 67 | - Recursive Feature Elimination: 68 | - First, train a model with all the features and evaluate its performance on held-out data. 69 | - Then drop, say, the 10% weakest features (e.g. 
the features with the smallest absolute coefficients in a linear model) and retrain on the remaining features. 70 | - Iterate until you observe a sharp drop in the predictive accuracy of the model. 71 | 72 | #### 13. Your linear regression didn’t run and communicates that there are an infinite number of best estimates for the regression coefficients. What could be wrong? 73 | - p > n: there are more predictors than observations. 74 | - If some of the explanatory variables are perfectly correlated (positively or negatively) then the coefficients would not be unique. 75 | 76 | #### 14. You run your regression on different subsets of your data, and find that in each subset, the beta value for a certain variable varies wildly. What could be the issue here? 77 | - The dataset might be heterogeneous. In that case, cluster the data into subsets and fit a separate model for each subset. Alternatively, use nonparametric models (e.g., trees), which can deal with heterogeneity quite nicely. 78 | 79 | #### 15. What is the main idea behind ensemble learning? If I had many different models that predicted the same response variable, what might I want to do to incorporate all of the models? Would you expect this to perform better than an individual model or worse? 80 | - The assumption is that a group of weak learners can be combined to form a strong learner. 81 | - Hence the combined model is expected to perform better than an individual model. 82 | - Assumptions: 83 | - average out biases 84 | - reduce variance 85 | - Bagging works because some underlying learning algorithms are unstable: slightly different inputs lead to very different outputs. If you can take advantage of this instability by running multiple instances, it can be shown that the reduced instability leads to lower error. 
If you want to understand why, the original bagging paper ([http://www.springerlink.com/](http://www.springerlink.com/content/l4780124w2874025/)) has a section called "why bagging works". 86 | - Boosting works because of the focus on better defining the "decision edge". By re-weighting examples near the margin (the positive and negative examples) you get a reduced error (see http://citeseerx.ist.psu.edu/vie...) 87 | - Use the outputs of your models as inputs to a meta-model. 88 | 89 | For example, if you're doing binary classification, you can use all the probability outputs of your individual models as inputs to a final logistic regression (or any model, really) that can combine the probability estimates. 90 | 91 | One very important point is to make sure that the outputs of your models are out-of-sample predictions. This means that the predicted value for any row in your data frame should NOT depend on the actual value for that row. 92 | 93 | #### 16. Given that you have wifi data in your office, how would you determine which rooms and areas are underutilized and over-utilized? 94 | - If more data is used in one room than in the others, that room is over-utilized. 95 | - Account for the room capacity and normalize the data. 96 | 97 | #### 17. How could you use GPS data from a car to determine the quality of a driver? 98 | - Speed 99 | - Driving paths 100 | 101 | #### 18. Given accelerometer, altitude, and fuel usage data from a car, how would you determine the optimum acceleration pattern to drive over hills? 102 | - Historical data? 103 | 104 | #### 19. Given position data of NBA players in a season’s games, how would you evaluate a basketball player’s defensive ability? 105 | - Evaluate the player's positioning on the court. 106 | 107 | #### 20. How would you quantify the influence of a Twitter user? 108 | - Use something like PageRank, with each user corresponding to a web page and following a user equivalent to linking to a page. 109 | 110 | #### 21. 
Given location data of golf balls in games, how would you construct a model that can advise golfers where to aim? 111 | - winning probability for different positions 112 | 113 | #### 22. You have 100 mathletes and 100 math problems. Each mathlete gets to choose 10 problems to solve. Given data on who got what problem correct, how would you rank the problems in terms of difficulty? 114 | - One way you could do this is by storing a "skill level" for each user and a "difficulty level" for each problem. We assume that the probability that a user solves a problem only depends on the skill of the user and the difficulty of the problem. Then we maximize the likelihood of the data to find the hidden skill and difficulty levels. 115 | - The Rasch model for dichotomous data takes the form: 116 | $$\Pr\{X_{ni}=1\} = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)},$$ 117 | where $\beta_n$ is the ability of person $n$ and $\delta_i$ is the difficulty of item $i$. 118 | 119 | #### 23. You have 5000 people that rank 10 sushis in terms of saltiness. How would you aggregate this data to estimate the true saltiness rank of each sushi? 120 | - Some people would take the mean rank of each sushi. If I wanted something simple, I would use the median, since ranks are (strictly speaking) ordinal and not interval, so adding them is a bit risque (but people do it all the time and you probably won't be far wrong). 121 | 122 | #### 24. Given data on congressional bills and which congressional representatives co-sponsored the bills, how would you determine which other representatives are most similar to yours in voting behavior? How would you evaluate who is the most liberal? Most republican? Most bipartisan? 123 | - collaborative filtering. 
Given each representative's votes, we can calculate the similarity between your representative and every other representative and select the most similar one 124 | - for the liberal and Republican groups, compute the mean vote vector of each group and find the representative closest to that center point 125 | 126 | #### 25. How would you come up with an algorithm to detect plagiarism in online content? 127 | - reduce the text to a more compact form (e.g. fingerprinting, bag of words), then compare it with other texts by calculating the similarity 128 | 129 | #### 26. You have data on all purchases of customers at a grocery store. Describe to me how you would program an algorithm that would cluster the customers into groups. How would you determine the appropriate number of clusters to include? 130 | - K-means 131 | - choose a small value of k that still has a low SSE (elbow method) 132 | - [Elbow method](https://bl.ocks.org/rpgove/0060ff3b656618e9136b) 133 | 134 | #### 27. Let’s say you’re building the music recommendation engine at Spotify to recommend people music based on past listening history. How would you approach this problem? 135 | - content-based filtering 136 | - collaborative filtering 137 | -------------------------------------------------------------------------------- /predictive-modeling.md: -------------------------------------------------------------------------------- 1 | ## Predictive Modeling (19 questions) 2 | 3 | #### 1. (Given a Dataset) Analyze this dataset and give me a model that can predict this response variable. 4 | - Problem Determination -> Data Cleaning -> Feature Engineering -> Modeling 5 | - Benchmark Models 6 | - Linear Regression (Ridge or Lasso) for regression 7 | - Logistic Regression for Classification 8 | - Advanced Models 9 | - Random Forest, Boosting Trees, and so on 10 | - Scikit-Learn, XGBoost, LightGBM, CatBoost 11 | - Determine if the problem is classification or regression 12 | - Plot and visualize the data. 
13 | - Start by fitting a simple model (multivariate regression, logistic regression), do some feature engineering accordingly, and then try some more complicated models. Always split the dataset into training, validation, and test sets, and use cross validation to check model performance. 14 | - Favor simple models that run quickly and that you can easily explain. 15 | - Mention cross validation as a means to evaluate the model. 16 | 17 | #### 2. What could be some issues if the distribution of the test data is significantly different than the distribution of the training data? 18 | - A model that has high training accuracy might have low test accuracy. Without further knowledge, it is hard to know which dataset represents the population data and thus the generalizability of the algorithm is hard to measure. This should be mitigated by repeated splitting of train vs. test dataset (as in cross validation). 19 | - A change in data distribution is called dataset shift. If the train and test data have different distributions, then the classifier would likely overfit to the train data. 20 | - This issue can be overcome by using a more general learning method. 21 | - This can occur when: 22 | - P(y|x) is the same but P(x) is different. (covariate shift) 23 | - P(y|x) is different. (concept shift) 24 | - The causes can be: 25 | - Training samples are obtained in a biased way. (sample selection bias) 26 | - Train is different from test because of temporal or spatial changes. (non-stationary environments) 27 | - Solution to covariate shift 28 | - importance-weighted cross validation 29 | 30 | #### 3. What are some ways I can make my model more robust to outliers? 31 | - We can use regularization such as L1 or L2 to reduce variance (at the cost of increased bias). 32 | - Changes to the algorithm: 33 | - Use tree-based methods instead of regression methods as they are more resistant to outliers. For statistical tests, use nonparametric tests instead of parametric ones. 
34 | - Use robust error metrics such as MAE or Huber Loss instead of MSE. 35 | - Changes to the data: 36 | - Winsorizing the data 37 | - Transforming the data (e.g. log) 38 | - Remove them only if you’re certain they’re anomalies and not worth predicting 39 | 40 | #### 4. What are some differences you would expect in a model that minimizes squared error, versus a model that minimizes absolute error? In which cases would each error metric be appropriate? 41 | - MSE penalizes large errors more heavily, so it is more sensitive to outliers. MAE is more robust in that sense, but the model is harder to fit because the absolute value is not differentiable at zero and there is no closed-form solution. So when there is less variability in the model and the model is computationally easy to fit, we should use MAE, and if that’s not the case, we should use MSE. 42 | - MSE: the gradient is easy to compute. MAE: not differentiable at zero, so fitting typically requires linear programming or subgradient methods 43 | - MAE is more robust to outliers. If the consequences of large errors are great, use MSE 44 | - MSE corresponds to maximizing the likelihood under Gaussian errors 45 | 46 | #### 5. What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced? What if there are more than 2 groups? 47 | - Accuracy: proportion of instances you predict correctly. 48 | - Pros: intuitive, easy to explain 49 | - Cons: works poorly when the class labels are imbalanced and the signal from the data is weak 50 | - ROC curve and AUC: plot false-positive-rate (fpr) on the x axis and true-positive-rate (tpr) on the y axis for different thresholds. Given a random positive instance and a random negative instance, the AUC is the probability that you can identify who's who. 51 | - Pros: Works well when testing the ability to distinguish the two classes. 52 | - Cons: can’t interpret predictions as probabilities (because AUC is determined by rankings), so can’t explain the uncertainty of the model, and it doesn't directly apply to the multi-class case. 
53 | - logloss/deviance/cross entropy: 54 | - Pros: error metric based on probabilities 55 | - Cons: very sensitive to confident false positives and false negatives 56 | - When there are more than 2 groups, we can have k binary classifications and add them up for logloss. Some metrics, like AUC, are directly applicable only in the binary case. 57 | 58 | #### 6. What are various ways to predict a binary response variable? Can you compare two of them and tell me when one would be more appropriate? What’s the difference between these? (SVM, Logistic Regression, Naive Bayes, Decision Tree, etc.) 59 | - Things to look at: N (number of samples), P (number of features), whether the problem is linearly separable, whether the features are independent, tendency to overfit, speed, performance, memory usage and so on. 60 | - Logistic Regression 61 | - features roughly linear, problem roughly linearly separable 62 | - robust to noise; use L1 or L2 regularization for model selection and to avoid overfitting 63 | - the outputs come as probabilities 64 | - efficient and the computation can be distributed 65 | - can be used as a baseline for other algorithms 66 | - (-) cannot handle categorical features directly; they must be encoded first 67 | - SVM 68 | - with a nonlinear kernel, can deal with problems that are not linearly separable 69 | - (-) slow to train; not really efficient for most industry-scale applications 70 | - Naive Bayes 71 | - computationally efficient when P is large by alleviating the curse of dimensionality 72 | - works surprisingly well in some cases even if the independence assumption doesn’t hold 73 | - with word frequencies as features, the independence assumption can be seen as reasonable. 
So the algorithm can be used in text categorization 74 | - (-) assumes the features are conditionally independent given the class 75 | - Tree Ensembles 76 | - good for large N and large P, can deal with categorical features very well 77 | - nonparametric, so no need to worry about outliers 78 | - GBTs usually work better but the parameters are harder to tune 79 | - RF works out of the box, but usually performs worse than GBT 80 | - Deep Learning 81 | - works well for some classification tasks (e.g. image) 82 | - used to squeeze extra performance out of the problem 83 | 84 | #### 7. What is regularization and where might it be helpful? What is an example of using regularization in a model? 85 | - Regularization adds a penalty on model complexity to the loss function. It is useful for reducing variance in the model, meaning avoiding overfitting. 86 | - For example, we can use L1 regularization in Lasso regression to penalize large coefficients and automatically select features, or we can use L2 regularization in Ridge regression to shrink the feature coefficients. 87 | 88 | #### 8. Why might it be preferable to include fewer predictors over many? 89 | - When we add irrelevant features, it increases the model's tendency to overfit because those features introduce more noise. When two variables are correlated, they might be harder to interpret in the case of regression, etc. 90 | - curse of dimensionality 91 | - adding random noise makes the model more complicated but no more useful 92 | - computational cost 93 | - Ask someone for more details. 94 | 95 | #### 9. Given training data on tweets and their retweets, how would you predict the number of retweets of a given tweet after 7 days after only observing 2 days worth of data? 96 | - Build a time series model on the training data with a seven-day cycle and then apply it to new data with only 2 days of observations. 97 | - Ask someone for more details. 
98 | - Build a regression function to estimate the number of retweets as a function of time t 99 | - to determine if one regression function can be built, see if there are clusters in terms of the trends in the number of retweets 100 | - if a single function is not enough, we have to add features to the regression function 101 | - features + # of retweets on the first and the second day -> predict the seventh day 102 | - https://en.wikipedia.org/wiki/Dynamic_time_warping 103 | 104 | #### 10. How could you collect and analyze data to use social media to predict the weather? 105 | - We can collect social media data using the Twitter, Facebook, and Instagram APIs. 106 | - Then, for example, for Twitter, we can construct features from each tweet, e.g. the tweet date, the number of favorites and retweets, and of course, features created from the tweet content itself. 107 | - Then use a multivariate time series model to predict the weather. 108 | - Ask someone for more details. 109 | 110 | #### 11. How would you construct a feed to show relevant content for a site that involves user interactions with items? 111 | - We can do so by building a recommendation engine. 112 | - The easiest approach is to show content that is popular with other users, which is still a valid strategy if, for example, the content consists of news articles. 113 | - To be more accurate, we can build a content-based or collaborative filtering system. If there’s enough user usage data, we can try collaborative filtering and recommend content that other similar users have consumed. If there isn’t, we can recommend similar items based on vectorization of items (content-based filtering). 114 | 115 | #### 12. How would you design the people you may know feature on LinkedIn or Facebook? 
116 | - Find pairs of people who are not yet connected but are strongly linked in a weighted connection graph 117 | - Define similarity as how strongly the two people are connected 118 | - Given a certain feature, we can calculate the similarity based on 119 | - friend connections (neighbors) 120 | - check-ins: people who are frequently at the same location 121 | - same college, workplace 122 | - Randomly drop edges from the graph to test the performance of the algorithm 123 | - Ref. News Feed Optimization 124 | - Affinity score: how close the content creator and the users are 125 | - Weight: weight for the edge type (comment, like, tag, etc.). Emphasis on features the company wants to promote 126 | - Time decay: the older the interaction, the less important 127 | 128 | #### 13. How would you predict who someone may want to send a Snapchat or Gmail to? 129 | - for each candidate recipient, assign a score of how likely the user is to send them an email 130 | - the rest is feature engineering: 131 | - number of past emails, how many responses, the last time they exchanged an email, whether the last email ends with a question mark, features about the other users, etc. 132 | - Ask someone for more details. 133 | - People to whom the user sent the most emails in the past, weighted by time decay. 134 | 135 | #### 14. How would you suggest to a franchise where to open a new store? 136 | - build a master dataset with local demographic information available for each location. 137 | - local income levels, proximity to traffic, weather, population density, proximity to other businesses 138 | - a reference dataset on local, regional, and national macroeconomic conditions (e.g. unemployment, inflation, prime interest rate, etc.) 139 | - any data on the local franchise owner-operators, to the degree the manager 140 | - identify a set of KPIs acceptable to the management that had requested the analysis concerning the most desirable factors surrounding a franchise 141 | - quarterly operating profit, ROI, EVA, pay-down rate, etc. 
142 | - run econometric models to understand the relative significance of each variable
143 | - run machine learning algorithms to predict the performance of each location candidate
144 |
145 | #### 15. In a search engine, given partial data on what the user has typed, how would you predict the user’s eventual search query?
146 | - Based on how often words have followed a given sequence of words in the past, we can estimate conditional probabilities for the candidate next word sequences (n-gram model). The sequences with the highest conditional probabilities are shown as the top candidates.
147 | - To further improve this algorithm,
148 |     - we can put more weight on past sequences that showed up more recently and near the user's location, to account for trends
149 |     - show the user's recent searches that match the partial data
150 | - Personalize and localize the search
151 |     - Use the user's historical search data
152 |     - Use the historical data from the local region
153 |
154 | #### 16. Given a database of all previous alumni donations to your university, how would you predict which recent alumni are most likely to donate?
155 | - Based on frequency and amount of donations, graduation year, major, etc., construct a supervised regression (or binary classification) algorithm.
156 |
157 | #### 17. You’re Uber and you want to design a heatmap to recommend to drivers where to wait for a passenger. How would you approach this?
158 | - Based on the past pickup location of passengers around the same time of the day, day of the week (month, year), construct
159 | - Ask someone for more details.
160 | - Based on the number of past pickups
161 |     - account for periodicity (seasonal, monthly, weekly, daily, hourly)
162 |     - special events (concerts, festivals, etc.) from tweets
163 |
164 | #### 18. How would you build a model to predict a March Madness bracket?
165 | - One vector each for team A and B.
Take the difference of the two vectors and use it as the input to a model that predicts the probability that team A wins. Train the model on past tournament data and predict a new tournament by running the trained model for each round of the tournament
166 | - Some extensions:
167 |     - Experiment with different ways of consolidating the 2 team vectors into one (e.g. concatenating, averaging, etc.)
168 |     - Consider using an RNN-type model that looks at time series data.
169 |
170 | #### 19. You want to run a regression to predict the probability of a flight delay, but there are flights with delays of up to 12 hours that are really messing up your model. How can you address this?
171 | - This is equivalent to making the model more robust to outliers.
172 | - See Q3.
173 |
--------------------------------------------------------------------------------
/probability.md:
--------------------------------------------------------------------------------
1 | ## Probability (20 questions)
2 |
3 | #### 1. Bobo the amoeba has a 25%, 25%, and 50% chance of producing 0, 1, or 2 offspring, respectively. Each of Bobo’s descendants also has the same probabilities. What is the probability that Bobo’s lineage dies out?
4 | - p = 1/4 + 1/4*p + 1/2*p^2 => p = 1/2 (the smaller root of the quadratic; the other root is p = 1)
5 |
6 | #### 2. In any 15-minute interval, there is a 20% probability that you will see at least one shooting star. What is the probability that you see at least one shooting star in the period of an hour?
7 | - 1-(0.8)^4 = 0.5904
8 | - Or, we can use Poisson processes
9 |
10 | #### 3. How can you generate a random number between 1 - 7 with only a die?
11 | - [Quora Answer](https://www.quora.com/How-can-you-generate-a-random-number-between-1-7-with-only-a-die-1)
12 |
13 | #### 4. How can you get a fair coin toss if someone hands you a coin that is weighted to come up heads more often than tails?
14 | - Flip twice:
15 |     - HT --> H
16 |     - TH --> T
17 |     - If HH or TT, repeat.
18 |
19 | #### 5.
You have a 50-50 mixture of two normal distributions with the same standard deviation. How far apart do the means need to be in order for this distribution to be bimodal?
20 | - more than two standard deviations
21 |
22 | #### 6. Given draws from a normal distribution with known parameters, how can you simulate draws from a uniform distribution?
23 | - Apply the CDF of that normal distribution (using the known mean and standard deviation) to each draw; by the probability integral transform, the results are Uniform(0, 1)
24 |
25 | #### 7. A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls?
26 | - gg, gb, bg --> 1/3
27 |
28 | #### 8. You have a group of couples that decide to have children until they have their first girl, after which they stop having children. What is the expected gender ratio of the children that are born? What is the expected number of children each couple will have?
29 | - Geometric distribution with p = 0.5
30 | - The gender ratio is 1:1. The expected number of children is 2.
31 | - Let X be the number of children a couple has up to and including the first girl (each birth is a girl with probability 1/2). X follows a geometric distribution with p = 1/2, so E[X] = 1/p = 2
32 |
33 | #### 9. How many ways can you split 12 people into 3 teams of 4?
34 | - The outcome follows a multinomial distribution with n=12 and k=3, but the teams are indistinguishable
35 | - C(12, 8) * C(8, 4) * C(4, 4) / 3!
36 | - 12! / (4!)^3 / 3! = 5775
37 |
38 | #### 10. Your hash function assigns each object to a number between 1:10, each with equal probability. With 10 objects, what is the probability of a hash collision? What is the expected number of hash collisions? What is the expected number of hashes that are unused?
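The three quantities this question asks for can be checked with exact arithmetic. The sketch below is a minimal illustration, not a definitive answer: the function name `hash_stats` is made up for this example, and it assumes one particular convention for "number of collisions" (an object counts as a collision when it lands in an already-occupied bucket), which is flagged in the comments.

```python
from fractions import Fraction


def hash_stats(n_objects=10, n_buckets=10):
    """Exact results for independent, uniform hashing into n_buckets."""
    # P(no collision): every object lands in a bucket not used so far.
    p_distinct = Fraction(1)
    for i in range(n_objects):
        p_distinct *= Fraction(n_buckets - i, n_buckets)
    p_collision = 1 - p_distinct

    # A given bucket is unused if all objects miss it; sum over buckets
    # by linearity of expectation.
    expected_unused = n_buckets * Fraction(n_buckets - 1, n_buckets) ** n_objects
    expected_used = n_buckets - expected_unused

    # Assumed convention: a collision is an object that lands in an
    # already-occupied bucket, so collisions = objects - used buckets.
    expected_collisions = n_objects - expected_used
    return p_collision, expected_unused, expected_collisions


p_collision, expected_unused, expected_collisions = hash_stats()
print(float(p_collision), float(expected_unused), float(expected_collisions))
```

With as many objects as buckets, the expected number of collisions under this convention equals the expected number of unused buckets, because every unused bucket corresponds to exactly one object that doubled up somewhere else.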
39 | - the probability of a hash collision: 1-(10!/10^10) ≈ 0.9996
40 | - the expected number of hash collisions: counting a collision each time an object lands in an already-occupied slot, 10 - 10(1 - (1-1/10)^10) = 10*(9/10)^10 ≈ 3.49, where 10(1 - (1-1/10)^10) ≈ 6.51 is the expected number of used hashes
41 | - [Quora Reference](https://www.quora.com/Your-hash-function-assigns-each-object-to-a-number-between-1-10-each-with-equal-probability-With-10-objects-what-is-the-probability-of-a-hash-collision-What-is-the-expected-number-of-hash-collisions-What-is-the-expected-number-of-hashes-that-are-unused)
42 | - the expected number of hashes that are unused: 10*(9/10)^10 ≈ 3.49
43 |
44 | #### 11. You call 2 UberX’s and 3 Lyfts. If the time that each takes to reach you is IID, what is the probability that all the Lyfts arrive first? What is the probability that all the UberX’s arrive first?
45 | - Lyfts arrive first: 2! * 3! / 5! = 1/10
46 | - Ubers arrive first: same
47 |
48 | #### 12. I write a program that should print out all the numbers from 1 to 300, but prints out Fizz instead if the number is divisible by 3, Buzz instead if the number is divisible by 5, and FizzBuzz if the number is divisible by 3 and 5. What is the total number of numbers that is either Fizzed, Buzzed, or FizzBuzzed?
49 | - 100+60-20=140
50 |
51 | #### 13. On a dating site, users can select 5 out of 24 adjectives to describe themselves. A match is declared between two users if they match on at least 4 adjectives. If Alice and Bob randomly pick adjectives, what is the probability that they form a match?
52 | - (24C5 * (1 + 5*(24-5))) / (24C5 * 24C5) = 96/42504 = 4/1771
53 |
54 | #### 14. A lazy high school senior types up applications and envelopes to n different colleges, but puts the applications randomly into the envelopes. What is the expected number of applications that went to the right college?
55 | - 1
56 |
57 | #### 15. Let’s say you have a very tall father. On average, what would you expect the height of his son to be? Taller, equal, or shorter? What if you had a very short father?
58 | - Shorter than the father on average; with a very short father, taller. Regression to the mean
59 |
60 | #### 16.
What’s the expected number of coin flips until you get two heads in a row? What’s the expected number of coin flips until you get two tails in a row?
61 | - x = 0.25 * 2 + 0.25 * (x + 2) + 0.5 * (x + 1) --> x = 6 (by symmetry, the same for two tails)
62 | - [Quora Reference](https://www.quora.com/What-is-the-expected-number-of-coin-flips-until-you-get-two-heads-in-a-row)
63 |
64 | #### 17. Let’s say we play a game where I keep flipping a coin until I get heads. If the first time I get heads is on the nth coin, then I pay you 2^(n-1) dollars. How much would you pay me to play this game?
65 | - less than $3 in practice, even though the expected payoff is infinite (each n contributes (1/2)^n * 2^(n-1) = 1/2; this is the St. Petersburg paradox)
66 | - [Quora reference](https://www.quora.com/I-will-flip-a-coin-until-I-get-my-first-heads-I-will-then-pay-you-2-n-1-where-n-is-the-total-number-of-coins-I-flipped-How-much-would-you-pay-me-to-play-this-game-You-can-only-play-once)
67 |
68 | #### 18. You have two coins, one of which is fair and comes up heads with a probability 1/2, and the other which is biased and comes up heads with probability 3/4. You randomly pick a coin and flip it twice, and get heads both times. What is the probability that you picked the fair coin?
69 | - 4/13
70 | - Bayesian method: (1/2 * 1/4) / (1/2 * 1/4 + 1/2 * 9/16) = 4/13
71 |
72 | #### 19. You have a 0.1% chance of picking up a coin with both heads, and a 99.9% chance that you pick up a fair coin. You flip your coin and it comes up heads 10 times. What’s the chance that you picked up the fair coin, given the information that you observed?
73 | - Bayesian method: 0.999 * (1/2)^10 / (0.999 * (1/2)^10 + 0.001 * 1) ≈ 0.494
74 |
75 | #### 20. What is a p-value?
76 | - https://en.wikipedia.org/wiki/P-value
77 |
--------------------------------------------------------------------------------
/product-metrics.md:
--------------------------------------------------------------------------------
1 | ## Product Metrics (15 questions)
2 |
3 | #### 1. What would be good metrics of success for an advertising-driven consumer product? (Buzzfeed, YouTube, Google Search, etc.) A service-driven consumer product? (Uber, Flickr, Venmo, etc.)
4 | * advertising-driven: page views and daily actives, CTR, CPC (cost per click)
5 |     * click-ads
6 |     * display-ads
7 | * service-driven: number of purchases, conversion rate
8 |
9 | #### 2. What would be good metrics of success for a productivity tool? (Evernote, Asana, Google Docs, etc.) A MOOC? (edX, Coursera, Udacity, etc.)
10 | * Productivity tool: same as premium subscriptions
11 | * MOOC: same as premium subscriptions, completion rate
12 |
13 | #### 3. What would be good metrics of success for an e-commerce product? (Etsy, Groupon, Birchbox, etc.) A subscription product? (Netflix, Birchbox, Hulu, etc.) Premium subscriptions? (OKCupid, LinkedIn, Spotify, etc.)
14 | * e-commerce: number of purchases, conversion rate, hourly, daily, weekly, monthly, quarterly, and annual sales, cost of goods sold, inventory levels, site traffic, unique visitors versus returning visitors, customer service phone call count, average resolution time
15 | * subscription
16 |     * churn, CoCA, ARPU, MRR, LTV
17 | * premium subscriptions:
18 |     * subscription rate
19 |
20 | #### 4. What would be good metrics of success for a consumer product that relies heavily on engagement and interaction? (Snapchat, Pinterest, Facebook, etc.) A messaging product? (GroupMe, Hangouts, Snapchat, etc.)
21 | * engagement- and interaction-heavy product: active-user (AU) ratios, email summary by type, push notification summary by type, resurrection ratio
22 | * messaging product:
23 |     * daily, monthly active users
24 |
25 | #### 5. What would be good metrics of success for a product that offered in-app purchases? (Zynga, Angry Birds, other gaming apps)
26 | * Average Revenue Per Paid User
27 | * Average Revenue Per User
28 |
29 | #### 6. A certain metric is violating your expectations by going down or up more than you expect. How would you try to identify the cause of the change?
30 | * Break the KPI down into its component metrics and find where the change is
31 | * Then break that component down further by channel, user cluster, etc., and relate the change to any campaigns or changes in user behavior in that segment
32 |
33 | #### 7. Growth for total number of tweets sent has been slow this month. What data would you look at to determine the cause of the problem?
34 | * Historical data, especially historical data for the same month
35 | * External data, such as economic data, political data, data about competitors
36 |
37 | #### 8. You’re a restaurant and are approached by Groupon to run a deal. What data would you ask from them in order to determine whether or not to do the deal?
38 | * for similar restaurants (they should define similarity), the average increase in revenue per coupon and the average increase in customers per coupon
39 |
40 | #### 9. You are tasked with improving the efficiency of a subway system. Where would you start?
41 | * define efficiency
42 |
43 | #### 10. Say you are working on Facebook News Feed. What would be some metrics that you think are important? How would you make the news each person gets more relevant?
44 | * rate for each action, duration users stay, CTR for sponsored feed posts
45 | * ref. News Feed Optimization
46 |     * Affinity score: how close the content creator and the users are
47 |     * Weight: weight for the edge type (comment, like, tag, etc.). Emphasis on features the company wants to promote
48 |     * Time decay: the older the less important
49 |
50 | #### 11. How would you measure the impact that sponsored stories on Facebook News Feed have on user engagement? How would you determine the optimum balance between sponsored stories and organic content on a user’s News Feed?
51 | * A/B test different balance ratios and see
52 |
53 | #### 12. You are on the data science team at Uber and you are asked to start thinking about surge pricing.
What would be the objectives of such a product and how would you start looking into this?
54 | * There is a gradual, step-function-type scaling mechanism that applies until the imbalance of requests to drivers is alleviated, and then reverses as too many drivers come online, enticed by the surge pricing structure.
55 | * I would bet the algorithm is custom tailored and calibrated to each location, as price elasticities almost certainly vary across cities depending on a multitude of variables: income, distance/sprawl, traffic patterns, car ownership, etc. With the massive troves of user data that Uber has probably collected, they have most likely tweaked the algorithms for each city to adjust for these varying sensitivities to surge pricing. Throw in some machine learning and incredibly rich data and you've got yourself an incredible, constantly evolving algorithm.
56 |
57 | #### 13. Say that you are Netflix. How would you determine what original series you should invest in and create?
58 | * Netflix uses data to estimate the potential market size for an original series before giving it the go-ahead.
59 |
60 | #### 14. What kind of services would find churn (metric that tracks how many customers leave the service) helpful? How would you calculate churn?
61 | * subscription-based services
62 |
63 | #### 15. Let’s say that you’re scheduling content for a content provider on television. How would you determine the best times to schedule content?
64 | * Based on similar products and their corresponding broadcast popularity
65 |
--------------------------------------------------------------------------------
/programming.md:
--------------------------------------------------------------------------------
1 | ## Programming (14 questions)
2 |
3 | #### 1. Write a function to calculate all possible assignment vectors of 2n users, where n users are assigned to group 0 (control), and n users are assigned to group 1 (treatment).
4 | - Recursive programming (sol in code)
5 | ```python
6 | def n_choose_k(n, k):
7 |     """Return every length-n 0/1 vector with exactly k ones (assumes 1 <= k <= n)."""
8 |     if k == 1:
9 |         ans = []
10 |         for i in range(n):
11 |             tmp = [0] * n
12 |             tmp[i] = 1
13 |             ans.append(tmp)
14 |         return ans
15 |
16 |     if k == n:
17 |         return [[1] * n]
18 |
19 |     ans = []
20 |     space = n - k + 1
21 |     for i in range(space):
22 |         assignment = [0] * (i + 1)
23 |         assignment[i] = 1
24 |         for c in n_choose_k(n - i - 1, k - 1):
25 |             ans.append(assignment + c)
26 |     return ans
27 |
28 | # test: choose 2 from 4, i.e. 2n = 4 users with n = 2 in treatment
29 | print(n_choose_k(4, 2))
30 | ```
31 |
32 | #### 2. Given a list of tweets, determine the top 10 most used hashtags.
33 | - Store the count of each hashtag in a dictionary and use a priority queue to solve the top-k problem
34 | - An extension is the top-k problem using Hadoop/MapReduce
35 |
36 | #### 3. Program an algorithm to find the best approximate solution to the knapsack problem in a given time.
37 | - [https://en.wikipedia.org/wiki/Knapsack_problem](https://en.wikipedia.org/wiki/Knapsack_problem)
38 | - Greedy solution (take as much as possible of the item with the best v/w ratio, then move on to the next)
39 | - Dynamic programming
40 |
41 | #### 4. Program an algorithm to find the best approximate solution to the traveling salesman problem in a given time.
42 | - [https://en.wikipedia.org/wiki/Travelling_salesman_problem](https://en.wikipedia.org/wiki/Travelling_salesman_problem)
43 | - Greedy
44 | - Dynamic programming
45 |
46 | #### 5. You have a stream of data coming in of size n, but you don’t know what n is ahead of time. Write an algorithm that will take a random sample of k elements. Can you write one that takes O(k) space?
47 | - [Reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling)
48 |
49 | #### 6. Write an algorithm that can calculate the square root of a number.
50 | - Binary search or Newton's method
51 |
52 | #### 7. Given a list of numbers, can you return the outliers?
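One common convention for this question is the interquartile-range (IQR) rule, sketched below. This is a minimal illustration under assumptions: the function name `iqr_outliers` is made up for this example, the cutoff `k = 1.5` is the conventional choice rather than anything the question specifies, and the quartiles come from the standard library's `statistics.quantiles` with its default method.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Assumption: k = 1.5 is only the conventional default; a percentile
    cutoff would work the same way with different fences.
    """
    if len(values) < 2:
        return []  # quartiles are undefined for fewer than two points
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

Unlike a fixed "top and bottom 2.5%" cut, this rule returns nothing when the data contain no extreme points, which is often the more useful behavior.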
53 | - Sort, then select the highest and the lowest 2.5%
54 | - Visualization can help a lot
55 |
56 | #### 8. When can parallelism make your algorithms run faster? When could it make your algorithms run slower?
57 | - Ask someone for more details.
58 | - compute in parallel when communication cost < computation cost
59 | - ensemble trees
60 | - minibatch
61 | - cross validation
62 | - forward propagation
63 | - minibatch
64 | - not suitable for online learning
65 |
66 | #### 9. What are the different types of joins? What are the differences between them?
67 | - Inner Join, Left Join, Right Join, Outer Join, Self Join
68 |
69 | #### 10. Why might a join on a subquery be slow? How might you speed it up?
70 | - Change the subquery to a join.
71 | - [Stack Overflow Answers](https://stackoverflow.com/questions/31724903/why-might-a-join-on-a-subquery-be-slow-what-could-be-done-to-make-it-faster-s)
72 |
73 | #### 11. Describe the difference between primary keys and foreign keys in a SQL database.
74 | - Primary keys are columns whose value combinations must be unique in a specific table so that each row can be referenced uniquely.
75 | - Foreign keys are columns that reference columns (often primary keys) in other tables.
76 |
77 | #### 12. Given a **COURSES** table with columns **course_id** and **course_name**, a **FACULTY** table with columns **faculty_id** and **faculty_name**, and a **COURSE_FACULTY** table with columns **faculty_id** and **course_id**, how would you return a list of faculty who teach a course given the name of a course?
78 | ```SQL
79 | SELECT f.faculty_name
80 | FROM COURSES c
81 | JOIN COURSE_FACULTY cf
82 |     ON c.course_id = cf.course_id
83 | JOIN FACULTY f
84 |     ON f.faculty_id = cf.faculty_id
85 | WHERE c.course_name = 'xxx';
86 | ```
87 |
88 | #### 13. Given an **IMPRESSIONS** table with **ad_id**, **click** (an indicator that the ad was clicked), and **date**, write a SQL query that will tell me the click-through-rate of each ad by month.
89 | ```SQL
90 | SELECT ad_id, YEAR(date), MONTH(date), AVG(click) AS ctr
91 | FROM IMPRESSIONS
92 | GROUP BY ad_id, YEAR(date), MONTH(date);
93 | ```
94 |
95 | #### 14. Write a query that returns the name of each department and a count of the number of employees in each:
96 | - **EMPLOYEES** containing: **Emp_ID** (Primary key) and **Emp_Name**
97 | - **EMPLOYEE_DEPT** containing: **Emp_ID** (Foreign key) and **Dept_ID** (Foreign key)
98 | - **DEPTS** containing: **Dept_ID** (Primary key) and **Dept_Name**
99 | ```SQL
100 | SELECT d.Dept_Name, COUNT(ed.Emp_ID)
101 | FROM DEPTS d
102 | LEFT JOIN EMPLOYEE_DEPT ed
103 |     ON d.Dept_ID = ed.Dept_ID
104 | GROUP BY d.Dept_Name;
105 | ```
106 |
--------------------------------------------------------------------------------
/statistical-inference.md:
--------------------------------------------------------------------------------
1 | ## Statistical Inference (15 questions)
2 |
3 | #### 1. In an A/B test, how can you check if assignment to the various buckets was truly random?
4 | - Plot the distributions of multiple features for both A and B and make sure that they have the same shape. More rigorously, we can conduct a permutation test to see if the distributions are the same.
5 | - MANOVA to compare different means
6 |
7 | #### 2. What might be the benefits of running an A/A test, where you have two buckets that are exposed to the exact same product?
8 | - Verify the sampling algorithm is random.
9 |
10 | #### 3. What would be the hazards of letting users sneak a peek at the other bucket in an A/B test?
11 | - The user might not act the same as they would have had they not seen the other bucket. You are essentially adding an additional variable, whether the user peeked at the other bucket, which is not random across groups.
12 |
13 | #### 4. What would be some issues if blogs decide to cover one of your experimental groups?
14 | - Same as the previous question. The above problem can happen on a larger scale.
15 |
16 | #### 5.
How would you conduct an A/B test on an opt-in feature?
17 | - Ask someone for more details.
18 |
19 | #### 6. How would you run an A/B test for many variants, say 20 or more?
20 | - one control and 20 treatment groups, if the sample size for each group is big enough.
21 | - With this many comparisons, false positives become likely. Ways to attempt to correct for this include changing your confidence level (e.g. Bonferroni correction) or doing family-wide tests before you dive into the individual metrics (e.g. Fisher's Protected LSD).
22 |
23 | #### 7. How would you run an A/B test if the observations are extremely right-skewed?
24 | - lower the variability by modifying the KPI
25 | - cap values
26 | - percentile metrics
27 | - log transform
28 |
29 |
30 | #### 8. I have two different experiments that both change the sign-up button to my website. I want to test them at the same time. What kinds of things should I keep in mind?
31 | - mutually exclusive user groups -> ok
32 |
33 | #### 9. What is a p-value? What is the difference between type-1 and type-2 error?
34 | - [en.wikipedia.org/wiki/P-value](https://en.wikipedia.org/wiki/P-value)
35 | - type-1 error: rejecting H0 when H0 is true
36 | - type-2 error: not rejecting H0 when Ha is true
37 |
38 | #### 10. You are AirBnB and you want to test the hypothesis that a greater number of photographs increases the chances that a buyer selects the listing. How would you test this hypothesis?
39 | - For randomly selected listings with more than one picture, hide one random picture for group A, and show all for group B. Compare the booking rate for the two groups.
40 | - Ask someone for more details.
41 |
42 | #### 11. How would you design an experiment to determine the impact of latency on user engagement?
43 | - The best way I know to quantify the impact of performance is to isolate just that factor using a slowdown experiment, i.e., add a delay in an A/B test.
44 |
45 | #### 12. What is maximum likelihood estimation? Could there be any case where it doesn’t exist?
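As a concrete instance of the first half of this question, the sketch below fits an exponential distribution by maximum likelihood. It is an illustration only: the exponential model, the function names, and the sample data are assumptions chosen because the maximizer has a closed form (rate = 1 / sample mean) that is easy to check against the log-likelihood.

```python
import math


def exp_log_likelihood(rate, data):
    """Log-likelihood of i.i.d. Exponential(rate) observations."""
    return len(data) * math.log(rate) - rate * sum(data)


def exp_mle(data):
    """Closed-form maximizer of the log-likelihood above: n / sum(x)."""
    return len(data) / sum(data)


# Hypothetical waiting times.
data = [0.5, 1.2, 0.3, 2.0, 1.0]
rate_hat = exp_mle(data)

# Sanity check: the closed form beats nearby candidate rates.
for candidate in (0.5 * rate_hat, 0.9 * rate_hat, 1.1 * rate_hat, 2.0 * rate_hat):
    assert exp_log_likelihood(rate_hat, data) > exp_log_likelihood(candidate, data)

print(rate_hat)
```

The closed form comes from setting the derivative of the log-likelihood, n / rate - sum(x), to zero. When no such interior maximum exists, for example when the likelihood is unbounded, the MLE does not exist.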
46 | - A method for parameter optimization (fitting a model). We choose the parameters so as to maximize the likelihood function (how likely the observed data are under our model with those parameters).
47 | - maximum likelihood estimation (MLE) is a method of [estimating](https://en.wikipedia.org/wiki/Estimator "Estimator") the [parameters](https://en.wikipedia.org/wiki/Statistical_parameter "Statistical parameter") of a [statistical model](https://en.wikipedia.org/wiki/Statistical_model "Statistical model") given observations, by finding the parameter values that maximize the [likelihood](https://en.wikipedia.org/wiki/Likelihood "Likelihood") of making the observations given the parameters. MLE can be seen as a special case of the [maximum a posteriori estimation](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation "Maximum a posteriori estimation") (MAP) that assumes a [uniform](https://en.wikipedia.org/wiki/Uniform_distribution_\(continuous\) "Uniform distribution \(continuous\)") [prior distribution](https://en.wikipedia.org/wiki/Prior_probability "Prior probability") of the parameters, or as a variant of the MAP that ignores the prior and which therefore is [unregularized](https://en.wikipedia.org/wiki/Regularization_\(mathematics\) "Regularization \(mathematics\)").
48 | - for Gaussian mixtures (where the likelihood is unbounded when a component collapses onto a single data point) and nonparametric models, it doesn’t exist
49 |
50 | #### 13. What’s the difference between a MAP, MOM, MLE estimator? In which cases would you want to use each?
51 | - MAP chooses the parameter value that maximizes the posterior distribution, which combines the prior distribution with the likelihood of the data. MLE is a special case of MAP where the prior is an uninformative uniform distribution.
52 | - MOM sets the sample moments equal to the theoretical moments and solves for the parameters. MOM is not used much anymore because maximum likelihood estimators have higher probability of being close to the quantities to be estimated and are more often unbiased.
53 |
54 | #### 14.
What is a confidence interval and how do you interpret it?
55 | - For example, a 95% confidence interval is an interval constructed so that, if the same procedure is repeated on many samples each drawn in the same way, the constructed intervals include the true mean 95% of the time.
56 | - If confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level.
57 |
58 | #### 15. What is unbiasedness as a property of an estimator? Is this always a desirable property when performing inference? What about in data analysis or predictive modeling?
59 | - Unbiasedness means that the expectation of the estimator is equal to the population value we are estimating. This is desirable in inference because the goal is to explain the dataset as accurately as possible. However, it is not always desirable for data analysis or predictive modeling because of the bias-variance tradeoff. We sometimes want to prioritize generalizability and avoid overfitting by reducing variance at the cost of increased bias.
60 |
--------------------------------------------------------------------------------