"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "### 8. I have two different experiments that both change the sign-up button to my website. I want to test them at the same time. What kinds of things should I keep in mind?"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "Solution\n",
154 | "\n",
155 | " - exclusive ➞ ok"
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "### 9. What is a p-value? What is the difference between type-1 and type-2 error?"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "Solution\n",
170 | "\n",
171 | " - **[en.wikipedia.org/wiki/P-value](https://en.wikipedia.org/wiki/P-value)**\n",
172 | " - type-1 error: rejecting Ho when Ho is true\n",
173 | " - type-2 error: not rejecting Ho when Ha is true"
174 | ]
175 | },
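{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal simulation sketch (the sample size, effect size, and alpha below are arbitrary choices, not from the question): it estimates the type-1 error rate when H0 is true and the type-2 error rate when Ha is true, using a two-sample t-test.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"rng = np.random.default_rng(0)\n",
"alpha, n, trials = 0.05, 50, 2000\n",
"\n",
"# Type-1 error: H0 is true (both groups share the same mean), yet we reject\n",
"false_rejects = 0\n",
"for _ in range(trials):\n",
"    a = rng.normal(0.0, 1.0, n)\n",
"    b = rng.normal(0.0, 1.0, n)\n",
"    false_rejects += stats.ttest_ind(a, b).pvalue < alpha\n",
"print('estimated type-1 error:', false_rejects / trials)   # close to alpha\n",
"\n",
"# Type-2 error: Ha is true (the means differ by 0.3), yet we fail to reject\n",
"misses = 0\n",
"for _ in range(trials):\n",
"    a = rng.normal(0.0, 1.0, n)\n",
"    b = rng.normal(0.3, 1.0, n)\n",
"    misses += stats.ttest_ind(a, b).pvalue >= alpha\n",
"print('estimated type-2 error:', misses / trials)\n",
"```"
]
},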
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "### 10. You are AirBnB and you want to test the hypothesis that a greater number of photographs increases the chances that a buyer selects the listing. How would you test this hypothesis?"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "Solution\n",
188 | "\n",
189 | " - For randomly selected listings with more than 1 pictures, hide 1 random picture for group A, and show all for group B. Compare the booking rate for the two groups.\n",
190 | " - Ask someone for more details."
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "### 11. How would you design an experiment to determine the impact of latency on user engagement?"
198 | ]
199 | },
200 | {
201 | "cell_type": "markdown",
202 | "metadata": {},
203 | "source": [
204 | "Solution\n",
205 | "\n",
206 | " - The best way I know to quantify the impact of performance is to isolate just that factor using a slowdown experiment, i.e., add a delay in an A/B test."
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "### 12. 12. What is maximum likelihood estimation? Could there be any case where it doesn’t exist?"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "Solution\n",
221 | "\n",
222 | " - A method for parameter optimization (fitting a model). We choose parameters so as to maximize the likelihood function (how likely the outcome would happen given the current data and our model).\n",
223 | " - maximum likelihood estimation (MLE) is a method of **[estimating](https://en.wikipedia.org/wiki/Estimator \"Estimator\")** the **[parameters](https://en.wikipedia.org/wiki/Statistical_parameter \"Statistical parameter\")** of a **[statistical model](https://en.wikipedia.org/wiki/Statistical_model \"Statistical model\")** given observations, by finding the parameter values that maximize the **[likelihood](https://en.wikipedia.org/wiki/Likelihood \"Likelihood\")** of making the observations given the parameters. MLE can be seen as a special case of the **[maximum a posteriori estimation](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation \"Maximum a posteriori estimation\")** (MAP) that assumes a **[uniform](https://en.wikipedia.org/wiki/Uniform_distribution_\\(continuous\\) \"Uniform distribution \\(continuous\\)\")** **[prior distribution](https://en.wikipedia.org/wiki/Prior_probability \"Prior probability\")** of the parameters, or as a variant of the MAP that ignores the prior and which therefore is **[unregularized](https://en.wikipedia.org/wiki/Regularization_\\(mathematics\\) \"Regularization \\(mathematics\\)\")**.\n",
224 | " - for gaussian mixtures, non parametric models, it doesn’t exist"
225 | ]
226 | },
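{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small illustrative sketch (simulated exponential data; the true rate is an arbitrary choice): maximize the log-likelihood numerically and compare with the closed-form MLE, which for the exponential distribution is 1 / sample mean.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.optimize import minimize_scalar\n",
"\n",
"rng = np.random.default_rng(1)\n",
"x = rng.exponential(scale=2.0, size=1000)   # true rate = 0.5\n",
"\n",
"def neg_log_likelihood(rate):\n",
"    # exponential density: rate * exp(-rate * x); we minimize the negative log-likelihood\n",
"    return -(len(x) * np.log(rate) - rate * x.sum())\n",
"\n",
"res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method='bounded')\n",
"print('numerical MLE  :', res.x)\n",
"print('closed-form MLE:', 1 / x.mean())   # the two should agree closely\n",
"```"
]
},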
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "### 13. What’s the difference between a MAP, MOM, MLE estimator? In which cases would you want to use each?"
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "Solution\n",
239 | "\n",
240 | " - MAP estimates the posterior distribution given the prior distribution and data which maximizes the likelihood function. MLE is a special case of MAP where the prior is uninformative uniform distribution.\n",
241 | " - MOM sets moment values and solves for the parameters. MOM is not used much anymore because maximum likelihood estimators have higher probability of being close to the quantities to be estimated and are more often unbiased."
242 | ]
243 | },
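{
"cell_type": "markdown",
"metadata": {},
"source": [
"A toy Bernoulli (coin-flip) sketch with made-up numbers: the MLE is k/n, the MAP estimate under a Beta(a, b) prior has a simple closed form, and the method-of-moments estimate comes from matching the first moment (for Bernoulli data it coincides with the MLE).\n",
"\n",
"```python\n",
"# estimate the heads probability p from k heads in n flips\n",
"k, n = 7, 10\n",
"\n",
"# MLE: maximize p**k * (1 - p)**(n - k)  ->  k / n\n",
"mle = k / n\n",
"\n",
"# MAP with a Beta(a, b) prior (a = b = 2 gives a mild pull toward 0.5)\n",
"a, b = 2, 2\n",
"map_est = (k + a - 1) / (n + a + b - 2)\n",
"\n",
"# Method of moments: set E[X] = p equal to the sample mean, again k / n\n",
"mom = k / n\n",
"\n",
"print(f'MLE={mle:.3f}  MAP={map_est:.3f}  MOM={mom:.3f}')\n",
"```"
]
},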
244 | {
245 | "cell_type": "markdown",
246 | "metadata": {},
247 | "source": [
248 | "### 14. What is a confidence interval and how do you interpret it?"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "Solution\n",
256 | "\n",
257 | " - For example, 95% confidence interval is an interval that when constructed for a set of samples each sampled in the same way, the constructed intervals include the true mean 95% of the time.\n",
258 | " - if confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level."
259 | ]
260 | },
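{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick simulation of that interpretation (normal population with an arbitrary mean and sample size): build a 95% t-interval for the mean on many independent samples and check how often the true mean is covered.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"rng = np.random.default_rng(2)\n",
"true_mean, n, trials = 10.0, 40, 5000\n",
"covered = 0\n",
"for _ in range(trials):\n",
"    sample = rng.normal(true_mean, 3.0, n)\n",
"    se = sample.std(ddof=1) / np.sqrt(n)\n",
"    margin = stats.t.ppf(0.975, df=n - 1) * se\n",
"    covered += sample.mean() - margin <= true_mean <= sample.mean() + margin\n",
"print('coverage:', covered / trials)   # close to 0.95\n",
"```"
]
},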
261 | {
262 | "cell_type": "markdown",
263 | "metadata": {},
264 | "source": [
265 | "### 15. What is unbiasedness as a property of an estimator? Is this always a desirable property when performing inference? What about in data analysis or predictive modeling?"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "Solution\n",
273 | "\n",
274 | " - Unbiasedness means that the expectation of the estimator is equal to the population value we are estimating. This is desirable in inference because the goal is to explain the dataset as accurately as possible. However, this is not always desirable for data analysis or predictive modeling as there is the bias variance tradeoff. We sometimes want to prioritize the generalizability and avoid overfitting by reducing variance and thus increasing bias."
275 | ]
276 | },
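{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of what bias means in practice: the variance estimator that divides by n is biased downward, while the one that divides by n - 1 is unbiased; averaging each over many simulated samples makes the difference visible.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(3)\n",
"n, trials = 10, 20000                      # true variance is 4\n",
"biased, unbiased = [], []\n",
"for _ in range(trials):\n",
"    x = rng.normal(0.0, 2.0, n)\n",
"    biased.append(x.var(ddof=0))           # divide by n\n",
"    unbiased.append(x.var(ddof=1))         # divide by n - 1\n",
"print('mean of biased estimator  :', np.mean(biased))     # about 3.6 = 4 * (n-1)/n\n",
"print('mean of unbiased estimator:', np.mean(unbiased))   # about 4.0\n",
"```"
]
},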
277 | {
278 | "cell_type": "code",
279 | "execution_count": null,
280 | "metadata": {},
281 | "outputs": [],
282 | "source": []
283 | }
284 | ],
285 | "metadata": {
286 | "hide_input": false,
287 | "kernelspec": {
288 | "display_name": "Python 3",
289 | "language": "python",
290 | "name": "python3"
291 | },
292 | "language_info": {
293 | "codemirror_mode": {
294 | "name": "ipython",
295 | "version": 3
296 | },
297 | "file_extension": ".py",
298 | "mimetype": "text/x-python",
299 | "name": "python",
300 | "nbconvert_exporter": "python",
301 | "pygments_lexer": "ipython3",
302 | "version": "3.8.8"
303 | },
304 | "toc": {
305 | "base_numbering": 1,
306 | "nav_menu": {},
307 | "number_sections": true,
308 | "sideBar": true,
309 | "skip_h1_title": false,
310 | "title_cell": "Table of Contents",
311 | "title_sidebar": "Contents",
312 | "toc_cell": false,
313 | "toc_position": {},
314 | "toc_section_display": true,
315 | "toc_window_display": false
316 | },
317 | "varInspector": {
318 | "cols": {
319 | "lenName": 16,
320 | "lenType": 16,
321 | "lenVar": 40
322 | },
323 | "kernels_config": {
324 | "python": {
325 | "delete_cmd_postfix": "",
326 | "delete_cmd_prefix": "del ",
327 | "library": "var_list.py",
328 | "varRefreshCmd": "print(var_dic_list())"
329 | },
330 | "r": {
331 | "delete_cmd_postfix": ") ",
332 | "delete_cmd_prefix": "rm(",
333 | "library": "var_list.r",
334 | "varRefreshCmd": "cat(var_dic_list()) "
335 | }
336 | },
337 | "types_to_exclude": [
338 | "module",
339 | "function",
340 | "builtin_function_or_method",
341 | "instance",
342 | "_Feature"
343 | ],
344 | "window_display": false
345 | }
346 | },
347 | "nbformat": 4,
348 | "nbformat_minor": 2
349 | }
350 |
--------------------------------------------------------------------------------
/06_Data_Analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "All the IPython Notebooks in **Data Science Interview Questions** lecture series by **[Dr. Milaan Parmar](https://www.linkedin.com/in/milaanparmar/)** are available @ **[GitHub](https://github.com/milaan9/DataScience_Interview_Questions)**\n",
9 | ""
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "# Data Analysis ➞ 27 Questions"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "### 1. (Given a Dataset) Analyze this dataset and tell me what you can learn from it."
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {
29 | "ExecuteTime": {
30 | "end_time": "2021-09-21T13:31:28.708336Z",
31 | "start_time": "2021-09-21T13:31:28.699521Z"
32 | }
33 | },
34 | "source": [
35 | "Solution\n",
36 | "\n",
37 | "- Typical data cleaning and visualization."
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "### 2. What is `R2`? What are some other metrics that could be better than `R2` and why?"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "Solution\n",
52 | "\n",
53 | "- goodness of fit measure. variance explained by the regression / total variance.\n",
54 | " \n",
55 | " - the more predictors you add, the higher $R^2$ becomes.\n",
56 | " - hence use adjusted $R^2$ which adjusts for the degrees of freedom. \n",
57 | " - or train error metrics."
58 | ]
59 | },
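{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch with simulated data and scikit-learn: $R^2$ keeps rising as irrelevant noise predictors are added, while adjusted $R^2$ penalizes them.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"rng = np.random.default_rng(4)\n",
"n = 200\n",
"x = rng.normal(size=(n, 1))\n",
"y = 2.0 * x[:, 0] + rng.normal(size=n)\n",
"\n",
"def r2_and_adjusted(X, y):\n",
"    r2 = LinearRegression().fit(X, y).score(X, y)       # 1 - SS_res / SS_tot\n",
"    p = X.shape[1]\n",
"    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - p - 1)\n",
"    return round(r2, 3), round(adj, 3)\n",
"\n",
"print('1 real predictor        :', r2_and_adjusted(x, y))\n",
"noise = rng.normal(size=(n, 20))                        # 20 irrelevant predictors\n",
"print('plus 20 noise predictors:', r2_and_adjusted(np.hstack([x, noise]), y))\n",
"```"
]
},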
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "### 3. What is the curse of dimensionality?"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "Solution\n",
72 | "\n",
73 | "- High dimensionality makes clustering hard, because having lots of dimensions means that everything is **\"far away\"** from each other.\n",
74 | " - For example, to cover a fraction of the volume of the data we need to capture a very wide range for each variable as the number of variables increases.\n",
75 | " - All samples are close to the edge of the sample. And this is a bad news because prediction is much more difficult near the edges of the training sample.\n",
76 | " - The sampling density decreases exponentially as p increases and hence the data becomes much more sparse without significantly more data. \n",
77 | " - We should conduct PCA to reduce dimensionality."
78 | ]
79 | },
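{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numerical illustration (a sketch with arbitrary sizes): as the dimension grows, the smallest and largest pairwise distances among random points become nearly indistinguishable, which is one way in which everything ends up far away from everything else.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(5)\n",
"n = 100\n",
"for d in (2, 10, 100, 1000):\n",
"    points = rng.uniform(size=(n, d))\n",
"    diffs = points[:, None, :] - points[None, :, :]       # pairwise differences\n",
"    dists = np.sqrt((diffs ** 2).sum(axis=-1))\n",
"    dists = dists[np.triu_indices(n, k=1)]                # distinct pairs only\n",
"    print(f'd={d:5d}  min/max distance ratio = {dists.min() / dists.max():.3f}')\n",
"```"
]
},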
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "### 4. Is more data always better?"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "Solution\n",
92 | "\n",
93 | "- **Statistically**\n",
94 | " - It depends on the quality of your data, for example, if your data is biased, just getting more data won’t help.\n",
95 | " - It depends on your model. If your model suffers from high bias, getting more data won’t improve your test results beyond a point. You’d need to add more features, etc.\n",
96 | " \n",
97 | " - **Practically**\n",
98 | " - More data usually benefit the models.\n",
99 | " - Also there’s a tradeoff between having more data and the additional storage, computational power, memory it requires. Hence, always think about the cost of having more data."
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "### 5. What are advantages of plotting your data before performing analysis?"
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "Solution\n",
114 | "\n",
115 | "- Data sets have errors. You won't find them all but you might find some. That 212 year old man. That 9 foot tall woman. \n",
116 | " - Variables can have skewness, outliers, etc. Then the arithmetic mean might not be useful, which means the standard deviation isn't useful.\n",
117 | " - Variables can be multimodal! If a variable is multimodal then anything based on its mean or median is going to be suspect."
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "### 6. How can you make sure that you don’t analyze something that ends up meaningless?"
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "Solution\n",
132 | "\n",
133 | "- Proper exploratory data analysis. \n",
134 | " - In every data analysis task, there's the exploratory phase where you're just graphing things, testing things on small sets of the data, summarizing simple statistics, and getting rough ideas of what hypotheses you might want to pursue further. \n",
135 | " - Then there's the exploratory phase, where you look deeply into a set of hypotheses. \n",
136 | " - The exploratory phase will generate lots of possible hypotheses, and the exploratory phase will let you really understand a few of them. Balance the two and you'll prevent yourself from wasting time on many things that end up meaningless, although not all."
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {},
142 | "source": [
143 | "### 7. What is the role of trial and error in data analysis? What is the the role of making a hypothesis before diving in?"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "Solution\n",
151 | "\n",
152 | "- data analysis is a repetition of setting up a new hypothesis and trying to refute the null hypothesis.\n",
153 | " - The scientific method is eminently inductive: we elaborate a hypothesis, test it and refute it or not. As a result, we come up with new hypotheses which are in turn tested and so on. This is an iterative process, as science always is."
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "### 8. How can you determine which features are the most important in your model?"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "Solution\n",
168 | "\n",
169 | "- Linear regression can use p-value\n",
170 | " - run the features though a Gradient Boosting Machine or Random Forest to generate plots of relative importance and information gain for each feature in the ensembles.\n",
171 | " - Look at the variables added in forward variable selection. "
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "### 9. How do you deal with some of your predictors being missing?"
179 | ]
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "metadata": {},
184 | "source": [
185 | "Solution\n",
186 | "\n",
187 | "- Remove rows with missing values - This works well if\n",
188 | " - the values are missing randomly (see [Vinay Prabhu's answer](https://www.quora.com/How-can-I-deal-with-missing-values-in-a-predictive-model/answer/Vinay-Prabhu-7) for more details on this)\n",
189 | " - if you don't lose too much of the dataset after doing so.\n",
190 | " - Build another predictive model to predict the missing values.\n",
191 | " - This could be a whole project in itself, so simple techniques are usually used here.\n",
192 | " - Use a model that can incorporate missing data. \n",
193 | " - Like a random forest, or any tree-based method."
194 | ]
195 | },
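{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the simpler options with pandas and scikit-learn (the tiny toy frame and column names are made up): drop incomplete rows, impute, or hand the data to a model that tolerates NaNs directly.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.impute import SimpleImputer\n",
"\n",
"df = pd.DataFrame({'age': [25, np.nan, 40, 31], 'income': [50, 64, np.nan, 58]})\n",
"\n",
"dropped = df.dropna()                           # option 1: drop incomplete rows\n",
"imputer = SimpleImputer(strategy='median')      # option 2: simple imputation\n",
"imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)\n",
"print(dropped)\n",
"print(imputed)\n",
"\n",
"# option 3: tree-based models such as sklearn.ensemble.HistGradientBoostingRegressor\n",
"# accept NaNs in the input, so no imputation step is needed\n",
"```"
]
},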
196 | {
197 | "cell_type": "markdown",
198 | "metadata": {},
199 | "source": [
200 | "### 10. You have several variables that are positively correlated with your response, and you think combining all of the variables could give you a good prediction of your response. However, you see that in the multiple linear regression, one of the weights on the predictors is negative. What could be the issue?"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "Solution\n",
208 | "\n",
209 | " - Multicollinearity refers to a situation in which two or more explanatory variables in a [multiple regression](https://en.wikipedia.org/wiki/Multiple_regression \"Multiple regression\") model are highly linearly related. \n",
210 | " - Leave the model as is, despite multicollinearity. The presence of multicollinearity doesn't affect the efficiency of extrapolating the fitted model to new data provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based.\n",
211 | " - principal component regression"
212 | ]
213 | },
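{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common diagnostic here is the variance inflation factor (VIF). A rough sketch using only numpy and scikit-learn (simulated predictors, one of which nearly duplicates another): regress each predictor on the rest and compute 1 / (1 - R^2).\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"rng = np.random.default_rng(6)\n",
"n = 500\n",
"x1 = rng.normal(size=n)\n",
"x2 = x1 + rng.normal(scale=0.1, size=n)    # nearly a copy of x1\n",
"x3 = rng.normal(size=n)\n",
"X = np.column_stack([x1, x2, x3])\n",
"\n",
"def vif(X, j):\n",
"    others = np.delete(X, j, axis=1)\n",
"    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])\n",
"    return 1.0 / (1.0 - r2)\n",
"\n",
"for j in range(X.shape[1]):\n",
"    print(f'VIF for predictor {j + 1}: {vif(X, j):.1f}')   # the first two are huge\n",
"```"
]
},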
214 | {
215 | "cell_type": "markdown",
216 | "metadata": {},
217 | "source": [
218 | "### 11. Let’s say you’re given an unfeasible amount of predictors in a predictive modeling task. What are some ways to make the prediction more feasible?"
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {},
224 | "source": [
225 | "Solution\n",
226 | "\n",
227 | " - PCA"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "### 12. Now you have a feasible amount of predictors, but you’re fairly sure that you don’t need all of them. How would you perform feature selection on the dataset?"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "Solution\n",
242 | "\n",
243 | " - ridge / lasso / elastic net regression.\n",
244 | " - Univariate Feature Selection where a statistical test is applied to each feature individually. You retain only the best features according to the test outcome scores.\n",
245 | " - Recursive Feature Elimination: \n",
246 | " - First, train a model with all the feature and evaluate its performance on held out data.\n",
247 | " - Then drop let say the 10% weakest features (e.g. the feature with least absolute coefficients in a linear model) and retrain on the remaining features.\n",
248 | " - Iterate until you observe a sharp drop in the predictive accuracy of the model."
249 | ]
250 | },
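{
"cell_type": "markdown",
"metadata": {},
"source": [
"A compact sketch of two of these options with scikit-learn on synthetic data (all settings are arbitrary): a lasso that zeroes out weak coefficients, and recursive feature elimination down to a fixed number of features.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.feature_selection import RFE\n",
"from sklearn.linear_model import Lasso, LinearRegression\n",
"\n",
"X, y = make_regression(n_samples=300, n_features=20, n_informative=5,\n",
"                       noise=5.0, random_state=0)\n",
"\n",
"# lasso: features whose coefficients survive the L1 penalty are kept\n",
"lasso = Lasso(alpha=1.0).fit(X, y)\n",
"print('lasso keeps:', [i for i, c in enumerate(lasso.coef_) if abs(c) > 1e-6])\n",
"\n",
"# recursive feature elimination: repeatedly drop the weakest features\n",
"rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)\n",
"print('RFE keeps  :', [i for i, s in enumerate(rfe.support_) if s])\n",
"```"
]
},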
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "### 13. Your linear regression didn’t run and communicates that there are an infinite number of best estimates for the regression coefficients. What could be wrong?"
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "Solution\n",
263 | "\n",
264 | " - p > n.\n",
265 | " - If some of the explanatory variables are perfectly correlated (positively or negatively) then the coefficients would not be unique. "
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "### 14. You run your regression on different subsets of your data, and find that in each subset, the beta value for a certain variable varies wildly. What could be the issue here?"
273 | ]
274 | },
275 | {
276 | "cell_type": "markdown",
277 | "metadata": {},
278 | "source": [
279 | "Solution\n",
280 | "\n",
281 | " - The dataset might be heterogeneous. In which case, it is recommended to cluster datasets into different subsets wisely, and then draw different models for different subsets. Or, use models like non parametric models (trees) which can deal with heterogeneity quite nicely."
282 | ]
283 | },
284 | {
285 | "cell_type": "markdown",
286 | "metadata": {},
287 | "source": [
288 | "### 15. What is the main idea behind ensemble learning? If I had many different models that predicted the same response variable, what might I want to do to incorporate all of the models? Would you expect this to perform better than an individual model or worse?"
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "metadata": {},
294 | "source": [
295 | "Solution\n",
296 | "\n",
297 | " - The assumption is that a group of weak learners can be combined to form a strong learner.\n",
298 | " - Hence the combined model is expected to perform better than an individual model.\n",
299 | " - Assumptions:\n",
300 | " - average out biases\n",
301 | " - reduce variance\n",
302 | " - Bagging works because some underlying learning algorithms are unstable: slightly different inputs leads to very different outputs. If you can take advantage of this instability by running multiple instances, it can be shown that the reduced instability leads to lower error. If you want to understand why, the original bagging paper( [http://www.springerlink.com/](http://www.springerlink.com/content/l4780124w2874025/)) has a section called \"why bagging works\"\n",
303 | " - Boosting works because of the focus on better defining the \"decision edge\". By re-weighting examples near the margin (the positive and negative examples) you get a reduced error (see http://citeseerx.ist.psu.edu/vie...)\n",
304 | " - Use the outputs of your models as inputs to a meta-model. \n",
305 | "\n",
306 | "**For example:** if you're doing binary classification, you can use all the probability outputs of your individual models as inputs to a final logistic regression (or any model, really) that can combine the probability estimates. \n",
307 | "\n",
308 | "One very important point is to make sure that the output of your models are out-of-sample predictions. This means that the predicted value for any row in your data-frame should NOT depend on the actual value for that row."
309 | ]
310 | },
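{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief sketch of the meta-model idea with scikit-learn (synthetic data, arbitrary base models): cross_val_predict supplies out-of-sample probabilities from each base model, and those probabilities become the inputs of a logistic-regression meta-model.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import cross_val_predict, cross_val_score\n",
"\n",
"X, y = make_classification(n_samples=1000, n_features=20, random_state=0)\n",
"\n",
"base_models = [RandomForestClassifier(n_estimators=100, random_state=0),\n",
"               LogisticRegression(max_iter=1000)]\n",
"\n",
"# out-of-sample probability predictions from each base model\n",
"meta_features = np.column_stack([\n",
"    cross_val_predict(m, X, y, cv=5, method='predict_proba')[:, 1]\n",
"    for m in base_models])\n",
"\n",
"# a simple meta-model combines the base-model probabilities\n",
"meta_model = LogisticRegression()\n",
"print('stacked CV accuracy:', cross_val_score(meta_model, meta_features, y, cv=5).mean())\n",
"```"
]
},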
311 | {
312 | "cell_type": "markdown",
313 | "metadata": {},
314 | "source": [
315 | "### 16. Given that you have wi-fi data in your office, how would you determine which rooms and areas are underutilized and over-utilized?"
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "metadata": {},
321 | "source": [
322 | "Solution\n",
323 | "\n",
324 | " - If the data is more used in one room, then that one is over utilized!\n",
325 | " - Maybe account for the room capacity and normalize the data."
326 | ]
327 | },
328 | {
329 | "cell_type": "markdown",
330 | "metadata": {},
331 | "source": [
332 | "### 17. How could you use GPS data from a car to determine the quality of a driver?"
333 | ]
334 | },
335 | {
336 | "cell_type": "markdown",
337 | "metadata": {},
338 | "source": [
339 | "Solution\n",
340 | "\n",
341 | " - Speed\n",
342 | " - Driving paths"
343 | ]
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {},
348 | "source": [
349 | "### 18. Given accelerometer, altitude, and fuel usage data from a car, how would you determine the optimum acceleration pattern to drive over hills?"
350 | ]
351 | },
352 | {
353 | "cell_type": "markdown",
354 | "metadata": {},
355 | "source": [
356 | "Solution\n",
357 | "\n",
358 | " - Historical data?"
359 | ]
360 | },
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "### 19. Given position data of NBA players in a season’s games, how would you evaluate a basketball player’s defensive ability?"
366 | ]
367 | },
368 | {
369 | "cell_type": "markdown",
370 | "metadata": {},
371 | "source": [
372 | "Solution\n",
373 | "\n",
374 | " - Evaluate his positions in the court."
375 | ]
376 | },
377 | {
378 | "cell_type": "markdown",
379 | "metadata": {},
380 | "source": [
381 | "### 20. How would you quantify the influence of a Twitter user?"
382 | ]
383 | },
384 | {
385 | "cell_type": "markdown",
386 | "metadata": {},
387 | "source": [
388 | "Solution\n",
389 | "\n",
390 | " - like page rank with each user corresponding to the webpages and linking to the page equivalent to following."
391 | ]
392 | },
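{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny sketch of that analogy using networkx (the follower edges below are made up): build a directed graph with an edge from follower to followee and rank users by their PageRank score.\n",
"\n",
"```python\n",
"import networkx as nx\n",
"\n",
"# edge (a, b) means a follows b; made-up follower relationships\n",
"follows = [('alice', 'dana'), ('bob', 'dana'), ('carol', 'dana'),\n",
"           ('dana', 'erin'), ('erin', 'dana'), ('bob', 'carol')]\n",
"\n",
"g = nx.DiGraph()\n",
"g.add_edges_from(follows)\n",
"scores = nx.pagerank(g)            # influence ~ PageRank in the follower graph\n",
"for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):\n",
"    print(user, round(score, 3))\n",
"```"
]
},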
393 | {
394 | "cell_type": "markdown",
395 | "metadata": {},
396 | "source": [
397 | "### 21. Given location data of golf balls in games, how would construct a model that can advise golfers where to aim?"
398 | ]
399 | },
400 | {
401 | "cell_type": "markdown",
402 | "metadata": {},
403 | "source": [
404 | "Solution\n",
405 | "\n",
406 | " - winning probability for different positions."
407 | ]
408 | },
409 | {
410 | "cell_type": "markdown",
411 | "metadata": {},
412 | "source": [
413 | "### 22. You have 100 mathletes and 100 math problems. Each mathlete gets to choose 10 problems to solve. Given data on who got what problem correct, how would you rank the problems in terms of difficulty?"
414 | ]
415 | },
416 | {
417 | "cell_type": "markdown",
418 | "metadata": {},
419 | "source": [
420 | "Solution\n",
421 | "\n",
422 | " - One way you could do this is by storing a \"skill level\" for each user and a \"difficulty level\" for each problem. We assume that the probability that a user solves a problem only depends on the skill of the user and the difficulty of the problem.* Then we maximize the likelihood of the data to find the hidden skill and difficulty levels.\n",
423 | " - The Rasch model for dichotomous data takes the form: \n",
424 | " \n",
425 | "$ {\\displaystyle \\Pr {X_{ni}=1\\\\} = {\\frac {\\exp({\\beta_{n}}-{\\delta_{i}})}{1+\\exp({\\beta_{n}}-{\\delta_{i}})}},} $\n",
426 | "\n",
427 | "where is the ability of person and is the difficulty of item."
428 | ]
429 | },
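{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough sketch of fitting the Rasch model by maximum likelihood with scipy (fully simulated responses here; a tiny ridge penalty is added only to pin down the otherwise unidentified location of the parameters):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.optimize import minimize\n",
"from scipy.special import expit                  # logistic function\n",
"\n",
"rng = np.random.default_rng(7)\n",
"n_users, n_items = 100, 30\n",
"skill_true = rng.normal(size=n_users)\n",
"diff_true = rng.normal(size=n_items)\n",
"prob = expit(skill_true[:, None] - diff_true[None, :])\n",
"correct = (rng.uniform(size=prob.shape) < prob).astype(float)\n",
"\n",
"def neg_log_likelihood(params):\n",
"    skill, diff = params[:n_users], params[n_users:]\n",
"    p = expit(skill[:, None] - diff[None, :])\n",
"    ll = correct * np.log(p) + (1 - correct) * np.log(1 - p)\n",
"    return -ll.sum() + 0.01 * (params ** 2).sum()   # small penalty for identifiability\n",
"\n",
"res = minimize(neg_log_likelihood, np.zeros(n_users + n_items), method='L-BFGS-B')\n",
"diff_hat = res.x[n_users:]\n",
"print('correlation with true difficulties:', np.corrcoef(diff_hat, diff_true)[0, 1])\n",
"```"
]
},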
430 | {
431 | "cell_type": "markdown",
432 | "metadata": {},
433 | "source": [
434 | "### 23. You have 5000 people that rank 10 sushis in terms of saltiness. How would you aggregate this data to estimate the true saltiness rank in each sushi?"
435 | ]
436 | },
437 | {
438 | "cell_type": "markdown",
439 | "metadata": {},
440 | "source": [
441 | "Solution\n",
442 | "\n",
443 | " - Some people would take the mean rank of each sushi. If I wanted something simple, I would use the median, since ranks are (strictly speaking) ordinal and not interval, so adding them is a bit risque (but people do it all the time and you probably won't be far wrong)."
444 | ]
445 | },
446 | {
447 | "cell_type": "markdown",
448 | "metadata": {},
449 | "source": [
450 | "### 24. Given data on congressional bills and which congressional representatives co-sponsored the bills, how would you determine which other representatives are most similar to yours in voting behavior? How would you evaluate who is the most liberal? Most republican? Most bipartisan?"
451 | ]
452 | },
453 | {
454 | "cell_type": "markdown",
455 | "metadata": {},
456 | "source": [
457 | "Solution\n",
458 | "\n",
459 | " - collaborative filtering. you have your votes and we can calculate the similarity for each representatives and select the most similar representative.\n",
460 | " - for liberal and republican parties, find the mean vector and find the representative closest to the center point."
461 | ]
462 | },
463 | {
464 | "cell_type": "markdown",
465 | "metadata": {},
466 | "source": [
467 | "### 25. How would you come up with an algorithm to detect plagiarism in online content?"
468 | ]
469 | },
470 | {
471 | "cell_type": "markdown",
472 | "metadata": {},
473 | "source": [
474 | "Solution\n",
475 | "\n",
476 | " - reduce the text to a more compact form (e.g. fingerprinting, bag of words) then compare those with other texts by calculating the similarity."
477 | ]
478 | },
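{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the bag-of-words variant with scikit-learn (the toy documents are made up): vectorize with TF-IDF and flag document pairs whose cosine similarity exceeds a threshold.\n",
"\n",
"```python\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"docs = ['the quick brown fox jumps over the lazy dog',\n",
"        'a quick brown fox jumped over a lazy dog',\n",
"        'completely unrelated text about gradient boosting']\n",
"\n",
"tfidf = TfidfVectorizer().fit_transform(docs)\n",
"sim = cosine_similarity(tfidf)                 # pairwise document similarity\n",
"threshold = 0.4                                # tuning choice\n",
"for i in range(len(docs)):\n",
"    for j in range(i + 1, len(docs)):\n",
"        if sim[i, j] > threshold:\n",
"            print(f'possible match: doc {i} vs doc {j} (similarity {sim[i, j]:.2f})')\n",
"```"
]
},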
479 | {
480 | "cell_type": "markdown",
481 | "metadata": {},
482 | "source": [
483 | "### 26. You have data on all purchases of customers at a grocery store. Describe to me how you would program an algorithm that would cluster the customers into groups. How would you determine the appropriate number of clusters to include?"
484 | ]
485 | },
486 | {
487 | "cell_type": "markdown",
488 | "metadata": {},
489 | "source": [
490 | "Solution\n",
491 | "\n",
492 | " - K-means\n",
493 | " - choose a small value of k that still has a low SSE (elbow method)\n",
494 | " - [Elbow method](https://bl.ocks.org/rpgove/0060ff3b656618e9136b)"
495 | ]
496 | },
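{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of the elbow heuristic with scikit-learn (synthetic stand-in for per-customer purchase features): print the within-cluster SSE (inertia) for a range of k and look for the bend.\n",
"\n",
"```python\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# stand-in for per-customer features (e.g. spend by category, visit frequency)\n",
"X, _ = make_blobs(n_samples=500, centers=4, n_features=6, random_state=0)\n",
"X = StandardScaler().fit_transform(X)\n",
"\n",
"for k in range(1, 9):\n",
"    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)\n",
"    print(k, round(km.inertia_, 1))   # the SSE flattens out after the elbow\n",
"```"
]
},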
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
501 | "### 27. Let’s say you’re building the recommended music engine at Spotify to recommend people music based on past listening history. How would you approach this problem?"
502 | ]
503 | },
504 | {
505 | "cell_type": "markdown",
506 | "metadata": {},
507 | "source": [
508 | "Solution\n",
509 | "\n",
510 | " - content-based filtering\n",
511 | " - collaborative filtering"
512 | ]
513 | }
527 | ],
528 | "metadata": {
529 | "hide_input": false,
530 | "kernelspec": {
531 | "display_name": "Python 3",
532 | "language": "python",
533 | "name": "python3"
534 | },
535 | "language_info": {
536 | "codemirror_mode": {
537 | "name": "ipython",
538 | "version": 3
539 | },
540 | "file_extension": ".py",
541 | "mimetype": "text/x-python",
542 | "name": "python",
543 | "nbconvert_exporter": "python",
544 | "pygments_lexer": "ipython3",
545 | "version": "3.8.8"
546 | },
547 | "toc": {
548 | "base_numbering": 1,
549 | "nav_menu": {},
550 | "number_sections": true,
551 | "sideBar": true,
552 | "skip_h1_title": false,
553 | "title_cell": "Table of Contents",
554 | "title_sidebar": "Contents",
555 | "toc_cell": false,
556 | "toc_position": {},
557 | "toc_section_display": true,
558 | "toc_window_display": false
559 | },
560 | "varInspector": {
561 | "cols": {
562 | "lenName": 16,
563 | "lenType": 16,
564 | "lenVar": 40
565 | },
566 | "kernels_config": {
567 | "python": {
568 | "delete_cmd_postfix": "",
569 | "delete_cmd_prefix": "del ",
570 | "library": "var_list.py",
571 | "varRefreshCmd": "print(var_dic_list())"
572 | },
573 | "r": {
574 | "delete_cmd_postfix": ") ",
575 | "delete_cmd_prefix": "rm(",
576 | "library": "var_list.r",
577 | "varRefreshCmd": "cat(var_dic_list()) "
578 | }
579 | },
580 | "types_to_exclude": [
581 | "module",
582 | "function",
583 | "builtin_function_or_method",
584 | "instance",
585 | "_Feature"
586 | ],
587 | "window_display": false
588 | }
589 | },
590 | "nbformat": 4,
591 | "nbformat_minor": 2
592 | }
593 |
--------------------------------------------------------------------------------
/07_Product_Metrics.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "All the IPython Notebooks in **Data Science Interview Questions** lecture series by **[Dr. Milaan Parmar](https://www.linkedin.com/in/milaanparmar/)** are available @ **[GitHub](https://github.com/milaan9/DataScience_Interview_Questions)**\n",
9 | ""
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "# Product Metrics ➞ 15 Questions"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "### 1. What would be good metrics of success for an advertising-driven consumer product? (Buzzfeed, YouTube, Google Search, etc.) A service-driven consumer product? (Uber, Flickr, Venmo, etc.)"
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {
29 | "ExecuteTime": {
30 | "end_time": "2021-09-21T13:31:28.708336Z",
31 | "start_time": "2021-09-21T13:31:28.699521Z"
32 | }
33 | },
34 | "source": [
35 | "Solution\n",
36 | "\n",
37 | " * advertising-driven: Page-views and daily actives, CTR, CPC (cost per click)\n",
38 | " * click-ads \n",
39 | " * display-ads \n",
40 | " * service-driven: number of purchases, conversion rate"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "### 2. What would be good metrics of success for a productivity tool? (Evernote, Asana, Google Docs, etc.) A MOOC? (edX, Coursera, Udacity, etc.)"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "Solution\n",
55 | "\n",
56 | " * Productivity tool: same as premium subscriptions\n",
57 | " * MOOC: same as premium subscriptions, completion rate"
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "### 3. What would be good metrics of success for an e-commerce product? (Etsy, Groupon, Birchbox, etc.) A subscription product? (Net ix, Birchbox, Hulu, etc.) Premium subscriptions? (OKCupid, LinkedIn, Spotify, etc.) "
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "Solution\n",
72 | "\n",
73 | " * e-commerce: number of purchases, conversion rate, Hourly, daily, weekly, monthly, quarterly, and annual sales, Cost of goods sold, Inventory levels, Site traffic, Unique visitors versus returning visitors, Customer service phone call count, Average resolution time\n",
74 | " * subscription\n",
75 | " * churn, CoCA, ARPU, MRR, LTV\n",
76 | " * premium subscriptions: \n",
77 | " * subscription rate"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "### 4. What would be good metrics of success for a consumer product that relies heavily on engagement and interaction? (Snapchat, Pinterest, Facebook, etc.) A messaging product? (GroupMe, Hangouts, Snapchat, etc.)"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "Solution\n",
92 | "\n",
93 | " * heavily on engagement and interaction: uses AU ratios, email summary by type, and push notification summary by type, resurrection ratio\n",
94 | " * messaging product: \n",
95 | " * daily, monthly active users"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "### 5. What would be good metrics of success for a product that offered in-app purchases? (Zynga, Angry Birds, other gaming apps)"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "Solution\n",
110 | "\n",
111 | " * Average Revenue Per Paid User\n",
112 | " * Average Revenue Per User"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "### 6. A certain metric is violating your expectations by going down or up more than you expect. How would you try to identify the cause of the change?"
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {},
125 | "source": [
126 | "Solution\n",
127 | "\n",
128 | " * breakdown the KPI’s into what consists them and find where the change is\n",
129 | " * then further breakdown that basic KPI by channel, user cluster, etc. and relate them with any campaigns, changes in user behaviors in that segment"
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "### 7. Growth for total number of tweets sent has been slow this month. What data would you look at to determine the cause of the problem?"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {},
142 | "source": [
143 | "Solution\n",
144 | "\n",
145 | " * Historical data, especially historical data at the same month\n",
146 | " * Outer data, such as economic data, political data, data about competitors"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "### 8. You’re a restaurant and are approached by Groupon to run a deal. What data would you ask from them in order to determine whether or not to do the deal?"
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "Solution\n",
161 | "\n",
162 | " * for similar restaurants (they should define similarity), average increase in revenue gain per coupon, average increase in customers per coupon"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "### 9. You are tasked with improving the efficiency of a subway system. Where would you start?"
170 | ]
171 | },
172 | {
173 | "cell_type": "markdown",
174 | "metadata": {},
175 | "source": [
176 | "Solution\n",
177 | "\n",
178 | " * define efficiency"
179 | ]
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "metadata": {},
184 | "source": [
185 | "### 10. Say you are working on Facebook News Feed. What would be some metrics that you think are important? How would you make the news each person gets more relevant?"
186 | ]
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "metadata": {},
191 | "source": [
192 | "Solution\n",
193 | "\n",
194 | " * rate for each action, duration users stay, CTR for sponsor feed posts\n",
195 | " * ref. News Feed Optimization\n",
196 | " * Affinity score: how close the content creator and the users are\n",
197 | " * Weight: weight for the edge type (comment, like, tag, etc.). Emphasis on features the company wants to promote\n",
198 | " * Time decay: the older the less important"
199 | ]
200 | },
201 | {
202 | "cell_type": "markdown",
203 | "metadata": {},
204 | "source": [
205 | "### 11. How would you measure the impact that sponsored stories on Facebook News Feed have on user engagement? How would you determine the optimum balance between sponsored stories and organic content on a user’s News Feed?"
206 | ]
207 | },
208 | {
209 | "cell_type": "markdown",
210 | "metadata": {},
211 | "source": [
212 | "Solution\n",
213 | "\n",
214 | " * AB test on different balance ratio and see "
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "### 12. You are on the data science team at Uber and you are asked to start thinking about surge pricing. What would be the objectives of such a product and how would you start looking into this?"
222 | ]
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "metadata": {},
227 | "source": [
228 | "Solution\n",
229 | "\n",
230 | " * there is a gradual step-function type scaling mechanism until that imbalance of requests-to-drivers is alleviated and then vice versa as too many drivers come online enticed by the surge pricing structure. \n",
231 | " * I would bet the algorithm is custom tailored and calibrated to each location as price elasticities almost certainly vary across different cities depending on a huge multitude of variables: income, distance/sprawl, traffic patterns, car ownership, etc. With the massive troves of user data that Uber probably has collected, they most likely have tweaked the algorithms for each city to adjust for these varying sensitivities to surge pricing. Throw in some machine learning and incredibly rich data and you've got yourself an incredible, constantly-evolving algorithm. "
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "### 13. Say that you are Netflix. How would you determine what original series you should invest in and create?"
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "metadata": {},
244 | "source": [
245 | "Solution\n",
246 | "\n",
247 | " * Netflix uses data to estimate the potential market size for an original series before giving it the go-ahead."
248 | ]
249 | },
250 | {
251 | "cell_type": "markdown",
252 | "metadata": {},
253 | "source": [
254 | "### 14. What kind of services would find churn (metric that tracks how many customers leave the service) helpful? How would you calculate churn?"
255 | ]
256 | },
257 | {
258 | "cell_type": "markdown",
259 | "metadata": {},
260 | "source": [
261 | "Solution\n",
262 | "\n",
263 | " * subscription based services"
264 | ]
265 | },
266 | {
267 | "cell_type": "markdown",
268 | "metadata": {},
269 | "source": [
270 | "### 15. Let’s say that you’re are scheduling content for a content provider on television. How would you determine the best times to schedule content?"
271 | ]
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "Solution\n",
278 | "\n",
279 | " * Based on similar product and the corresponding broadcast popularity"
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "execution_count": null,
285 | "metadata": {},
286 | "outputs": [],
287 | "source": []
288 | }
289 | ],
290 | "metadata": {
291 | "hide_input": false,
292 | "kernelspec": {
293 | "display_name": "Python 3",
294 | "language": "python",
295 | "name": "python3"
296 | },
297 | "language_info": {
298 | "codemirror_mode": {
299 | "name": "ipython",
300 | "version": 3
301 | },
302 | "file_extension": ".py",
303 | "mimetype": "text/x-python",
304 | "name": "python",
305 | "nbconvert_exporter": "python",
306 | "pygments_lexer": "ipython3",
307 | "version": "3.8.8"
308 | },
309 | "toc": {
310 | "base_numbering": 1,
311 | "nav_menu": {},
312 | "number_sections": true,
313 | "sideBar": true,
314 | "skip_h1_title": false,
315 | "title_cell": "Table of Contents",
316 | "title_sidebar": "Contents",
317 | "toc_cell": false,
318 | "toc_position": {},
319 | "toc_section_display": true,
320 | "toc_window_display": false
321 | },
322 | "varInspector": {
323 | "cols": {
324 | "lenName": 16,
325 | "lenType": 16,
326 | "lenVar": 40
327 | },
328 | "kernels_config": {
329 | "python": {
330 | "delete_cmd_postfix": "",
331 | "delete_cmd_prefix": "del ",
332 | "library": "var_list.py",
333 | "varRefreshCmd": "print(var_dic_list())"
334 | },
335 | "r": {
336 | "delete_cmd_postfix": ") ",
337 | "delete_cmd_prefix": "rm(",
338 | "library": "var_list.r",
339 | "varRefreshCmd": "cat(var_dic_list()) "
340 | }
341 | },
342 | "types_to_exclude": [
343 | "module",
344 | "function",
345 | "builtin_function_or_method",
346 | "instance",
347 | "_Feature"
348 | ],
349 | "window_display": false
350 | }
351 | },
352 | "nbformat": 4,
353 | "nbformat_minor": 2
354 | }
355 |
--------------------------------------------------------------------------------
/08_Communication.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "All the IPython Notebooks in **Data Science Interview Questions** lecture series by **[Dr. Milaan Parmar](https://www.linkedin.com/in/milaanparmar/)** are available @ **[GitHub](https://github.com/milaan9/DataScience_Interview_Questions)**\n",
9 | ""
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "# Communication ➞ 5 Questions"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "### 1. Explain to me a technical concept related to the role that you’re interviewing for."
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {
29 | "ExecuteTime": {
30 | "end_time": "2021-09-21T13:31:28.708336Z",
31 | "start_time": "2021-09-21T13:31:28.699521Z"
32 | }
33 | },
34 | "source": [
35 | "Solution\n",
36 | "\n",
37 | "- AB test, PCA, data science, machine learning, neural networks"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "### 2. Introduce me to something you’re passionate about."
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "Solution\n",
52 | "\n",
53 | "- Data science"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "### 3. How would you explain an A/B test to an engineer with no statistics background? A linear regression?"
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "Solution\n",
68 | "\n",
69 | "- A/B testing, or more broadly, multivariate testing, is the testing of different elements of a user's experience to determine which variation helps the business achieve its goal more effectively (i.e. increasing conversions, etc..) This can be copy on a web site, button colors, different user interfaces, different email subject lines, calls to action, offers, etc. "
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "### 4. How would you explain a confidence interval to an engineer with no statistics background? What does 95% confidence mean?"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "Solution\n",
84 | "\n",
85 | "- [link](https://www.quora.com/What-is-a-confidence-interval-in-laymans-terms)"
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "### 5. How would you explain to a group of senior executives why data is important?"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "Solution\n",
100 | "\n",
101 | "- Examples"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": null,
107 | "metadata": {},
108 | "outputs": [],
109 | "source": []
110 | }
111 | ],
112 | "metadata": {
113 | "hide_input": false,
114 | "kernelspec": {
115 | "display_name": "Python 3",
116 | "language": "python",
117 | "name": "python3"
118 | },
119 | "language_info": {
120 | "codemirror_mode": {
121 | "name": "ipython",
122 | "version": 3
123 | },
124 | "file_extension": ".py",
125 | "mimetype": "text/x-python",
126 | "name": "python",
127 | "nbconvert_exporter": "python",
128 | "pygments_lexer": "ipython3",
129 | "version": "3.8.8"
130 | },
131 | "toc": {
132 | "base_numbering": 1,
133 | "nav_menu": {},
134 | "number_sections": true,
135 | "sideBar": true,
136 | "skip_h1_title": false,
137 | "title_cell": "Table of Contents",
138 | "title_sidebar": "Contents",
139 | "toc_cell": false,
140 | "toc_position": {},
141 | "toc_section_display": true,
142 | "toc_window_display": false
143 | },
144 | "varInspector": {
145 | "cols": {
146 | "lenName": 16,
147 | "lenType": 16,
148 | "lenVar": 40
149 | },
150 | "kernels_config": {
151 | "python": {
152 | "delete_cmd_postfix": "",
153 | "delete_cmd_prefix": "del ",
154 | "library": "var_list.py",
155 | "varRefreshCmd": "print(var_dic_list())"
156 | },
157 | "r": {
158 | "delete_cmd_postfix": ") ",
159 | "delete_cmd_prefix": "rm(",
160 | "library": "var_list.r",
161 | "varRefreshCmd": "cat(var_dic_list()) "
162 | }
163 | },
164 | "types_to_exclude": [
165 | "module",
166 | "function",
167 | "builtin_function_or_method",
168 | "instance",
169 | "_Feature"
170 | ],
171 | "window_display": false
172 | }
173 | },
174 | "nbformat": 4,
175 | "nbformat_minor": 2
176 | }
177 |
--------------------------------------------------------------------------------
/09_Coding.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "All the IPython Notebooks in **Data Science Interview Questions** lecture series by **[Dr. Milaan Parmar](https://www.linkedin.com/in/milaanparmar/)** are available @ **[GitHub](https://github.com/milaan9/DataScience_Interview_Questions)**\n",
9 | ""
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "### 1. Write a function to calculate all possible assignment vectors of `2n` users, where `n` users are assigned to group 0 (control), and `n` users are assigned to group 1 (treatment).\n",
17 | "\n",
18 | "Solution"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 1,
24 | "metadata": {},
25 | "outputs": [
26 | {
27 | "name": "stdout",
28 | "output_type": "stream",
29 | "text": [
30 | "[[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1]]\n"
31 | ]
32 | }
33 | ],
34 | "source": [
35 | "def n_choose_k(n, k):\n",
36 | " \"\"\" function to choose k from n \"\"\"\n",
37 | " if k == 1:\n",
38 | " ans = []\n",
39 | " for i in range(n):\n",
40 | " tmp = [0] * n\n",
41 | " tmp[i] = 1\n",
42 | " ans.append(tmp)\n",
43 | " return ans\n",
44 | " \n",
45 | " if k == n:\n",
46 | " return [[1] * n]\n",
47 | " \n",
48 | " ans = []\n",
49 | " space = n - k + 1\n",
50 | " for i in range(space):\n",
51 | " assignment = [0] * (i + 1)\n",
52 | " assignment[i] = 1\n",
53 | " for c in n_choose_k(n - i - 1, k - 1):\n",
54 | " ans.append(assignment + c)\n",
55 | " return ans\n",
56 | "\n",
57 | "# test: choose 2 from 4\n",
58 | "print(n_choose_k(4, 2))"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": []
67 | }
68 | ],
69 | "metadata": {
70 | "hide_input": false,
71 | "kernelspec": {
72 | "display_name": "Python 3",
73 | "language": "python",
74 | "name": "python3"
75 | },
76 | "language_info": {
77 | "codemirror_mode": {
78 | "name": "ipython",
79 | "version": 3
80 | },
81 | "file_extension": ".py",
82 | "mimetype": "text/x-python",
83 | "name": "python",
84 | "nbconvert_exporter": "python",
85 | "pygments_lexer": "ipython3",
86 | "version": "3.8.8"
87 | },
88 | "toc": {
89 | "base_numbering": 1,
90 | "nav_menu": {},
91 | "number_sections": true,
92 | "sideBar": true,
93 | "skip_h1_title": false,
94 | "title_cell": "Table of Contents",
95 | "title_sidebar": "Contents",
96 | "toc_cell": false,
97 | "toc_position": {},
98 | "toc_section_display": true,
99 | "toc_window_display": false
100 | },
101 | "varInspector": {
102 | "cols": {
103 | "lenName": 16,
104 | "lenType": 16,
105 | "lenVar": 40
106 | },
107 | "kernels_config": {
108 | "python": {
109 | "delete_cmd_postfix": "",
110 | "delete_cmd_prefix": "del ",
111 | "library": "var_list.py",
112 | "varRefreshCmd": "print(var_dic_list())"
113 | },
114 | "r": {
115 | "delete_cmd_postfix": ") ",
116 | "delete_cmd_prefix": "rm(",
117 | "library": "var_list.r",
118 | "varRefreshCmd": "cat(var_dic_list()) "
119 | }
120 | },
121 | "types_to_exclude": [
122 | "module",
123 | "function",
124 | "builtin_function_or_method",
125 | "instance",
126 | "_Feature"
127 | ],
128 | "window_display": false
129 | }
130 | },
131 | "nbformat": 4,
132 | "nbformat_minor": 2
133 | }
134 |
--------------------------------------------------------------------------------
/DataScience_Interview_Questions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/milaan9/DataScience_Interview_Questions/b515c84b6b42243f45b3621ffe552abbe9219bc7/DataScience_Interview_Questions.pdf
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Milaan Parmar
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
22 | # Data_Science_Interview_Questions
23 |
24 | ## Introduction 👋
25 |
26 | Here are the answers to [120 Data Science Interview Questions](http://www.datasciencequestions.com/)
27 |
28 | Some of the answers above are modified from Kojin's original collection: [kojino/120-Data-Science-Interview-Questions](https://github.com/kojino/120-Data-Science-Interview-Questions)
29 |
30 | Another solution is from: [Nitish-McQueen](https://github.com/Nitish-McQueen): [Data Science Interview Questions](./DataScience_Interview_Questions.pdf)
31 |
32 | Quora has a good list of questions: [https://datascienceinterview.quora.com/Answers-1](https://datascienceinterview.quora.com/Answers-1)
33 |
34 | Feel free to send me a pull request if you find any mistakes or have better answers.
35 |
36 | ---
37 |
38 | ## Table of contents 📋
39 |
40 | | **No.** | **Name** |
41 | | ------- | -------- |
42 | | 01 | **[01_120_Python_Basics_Interview_Questions](https://github.com/milaan9/DataScience_Interview_Questions/blob/main/01_120_Python_Basics_Interview_Questions.ipynb)** |
43 | | 02 | **[02_Predictive_Modeling](https://github.com/milaan9/DataScience_Interview_Questions/blob/main/02_Predictive_Modeling.ipynb)** |
44 | | 03 | **[03_Programming](https://github.com/milaan9/DataScience_Interview_Questions/blob/main/03_Programming.ipynb)** |
45 | | 04 | **[04_Probability](https://github.com/milaan9/DataScience_Interview_Questions/blob/main/04_Probability.ipynb)** |
46 | | 05 | **[05_Statistical_Inference](https://github.com/milaan9/DataScience_Interview_Questions/blob/main/05_Statistical_Inference.ipynb)** |
47 | | 06 | **[06_Data_Analysis](https://github.com/milaan9/DataScience_Interview_Questions/blob/main/06_Data_Analysis.ipynb)** |
48 | | 07 | **[07_Product_Metrics](https://github.com/milaan9/DataScience_Interview_Questions/blob/main/07_Product_Metrics.ipynb)** |
49 | | 08 | **[08_Communication](https://github.com/milaan9/DataScience_Interview_Questions/blob/main/08_Communication.ipynb)** |
50 | | 09 | **[09_Coding](https://github.com/milaan9/DataScience_Interview_Questions/blob/main/09_Coding.ipynb)** |
51 | | 10 | **[10_Linkedin_Skill_Assessment_Python](https://github.com/milaan9/DataScience_Interview_Questions/blob/main/10_Linkedin_Skill_Assessment_Python.ipynb)** |
52 | | 11 | **[DataScience_Interview_Questions](https://github.com/milaan9/DataScience_Interview_Questions/blob/main/DataScience_Interview_Questions.pdf)** |
53 |
54 | These are online **read-only** versions. However, you can **`Run ▶`** all the code **online** by clicking here ➞
55 |
56 | ---
57 |
58 | ## Frequently asked questions ❔
59 |
60 | ### How can I thank you for writing and sharing this tutorial? 🌷
61 |
62 | You can ⭐ star and ⵖ fork this repository. Starring and forking is free for you, but it tells me and other people that it was helpful and you like this tutorial.
63 |
64 | Go [**`here`**](https://github.com/milaan9/DataScience_Interview_Questions) if you aren't here already and click ➞ **`✰ Star`** and **`ⵖ Fork`** button in the top right corner. You will be asked to create a GitHub account if you don't already have one.
65 |
66 | ---
67 |
68 | ### How can I read this tutorial without an Internet connection?
69 |
70 | 1. Go [**`here`**](https://github.com/milaan9/DataScience_Interview_Questions) and click the big green ➞ **`Code`** button in the top right of the page, then click ➞ [**`Download ZIP`**](https://github.com/milaan9/DataScience_Interview_Questions/archive/refs/heads/main.zip).
71 |
72 | 
73 |
74 | 3. Extract the ZIP and open it. Unfortunately I don't have any more specific instructions because how exactly this is done depends on which operating system you run.
75 |
76 | 4. Launch ipython notebook from the folder which contains the notebooks. Open each one of them
77 |
78 | **`Kernel ➞ Restart & Clear Output`**
79 |
80 | This will clear all the outputs and now you can understand each statement and learn interactively.
81 |
82 | If you have git and you know how to use it, you can also clone the repository instead of downloading a zip and extracting it. An advantage with doing it this way is that you don't need to download the whole tutorial again to get the latest version of it, all you need to do is to pull with git and run ipython notebook again.
83 |
84 | ---
85 |
86 | ## Authors ✍️
87 |
88 | I'm Dr. Milaan Parmar and I have written this tutorial. If you think you can add/correct/edit and enhance this tutorial you are most welcome🙏
89 |
90 | See [github's contributors page](https://github.com/milaan9/DataScience_Interview_Questions/graphs/contributors) for details.
91 |
92 | If you have trouble with this tutorial please tell me about it by [Create an issue on GitHub](https://github.com/milaan9/DataScience_Interview_Questions/issues/new). and I'll make this tutorial better. This is probably the best choice if you had trouble following the tutorial, and something in it should be explained better. You will be asked to create a GitHub account if you don't already have one.
93 |
94 | If you like this tutorial, please [give it a ⭐ star](https://github.com/milaan9/DataScience_Interview_Questions).
95 |
96 | ---
97 |
98 | ## Licence 📜
99 |
100 | You may use this tutorial freely at your own risk. See [LICENSE](./LICENSE).
101 |
102 |
--------------------------------------------------------------------------------
/img/dnld_rep.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/milaan9/DataScience_Interview_Questions/b515c84b6b42243f45b3621ffe552abbe9219bc7/img/dnld_rep.png
--------------------------------------------------------------------------------