# Data Science Preparation

**P.S. Ctrl+F to search for relevant keywords.**

### Preliminaries

If you are just beginning with ML & Data Science, a good first place to start will be
- [ ] [Andrew Ng Coursera ML course](https://www.coursera.org/learn/machine-learning). Finish at least the first few weeks.

If you have already done the Andrew Ng course, you might want to brush up on the concepts through these notes.
- [ ] [Notes on Andrew Ng Machine Learning](https://www.holehouse.org/mlclass/)

If you want to make a list of important interview topics, head over to this article.
- [ ] [Machine Learning Cheatsheet](https://medium.com/swlh/cheat-sheets-for-machine-learning-interview-topics-51c2bc2bab4f)

### Courses & Resources
- [ ] [Become a Data Scientist in 2020 with these 10 resources](https://towardsdatascience.com/top-10-resources-to-become-a-data-scientist-in-2020-99a315194701)
- [ ] [Applied Data Science with Python | Coursera](https://www.coursera.org/specializations/data-science-python?ranMID=40328&ranEAID=lVarvwc5BD0&ranSiteID=lVarvwc5BD0-_4L3mvw.I6oY9SNPHAtR2Q&siteID=lVarvwc5BD0-_4L3mvw.I6oY9SNPHAtR2Q&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=lVarvwc5BD0)
- [ ] [Minimal Pandas Subset for Data Scientists - Towards Data Science](https://towardsdatascience.com/minimal-pandas-subset-for-data-scientists-6355059629ae)
- [ ] [Python’s One Liner graph creation library with animations Hans Rosling Style](https://towardsdatascience.com/pythons-one-liner-graph-creation-library-with-animations-hans-rosling-style-f2cb50490396)
- [ ] [3 Awesome Visualization Techniques for every dataset](https://towardsdatascience.com/3-awesome-visualization-techniques-for-every-dataset-9737eecacbe8)
- [ ] [Inferential Statistics | Coursera](https://www.coursera.org/learn/inferential-statistics-intro?ranMID=40328&ranEAID=lVarvwc5BD0&ranSiteID=lVarvwc5BD0-ydEVG6k5kidzLtNqbbVQvQ&siteID=lVarvwc5BD0-ydEVG6k5kidzLtNqbbVQvQ&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=lVarvwc5BD0)
- [ ] [Advanced Machine Learning | Coursera](https://www.coursera.org/specializations/aml?ranMID=40328&ranEAID=lVarvwc5BD0&ranSiteID=lVarvwc5BD0-_1LkRNzPhJ43gzMHQzcbag&siteID=lVarvwc5BD0-_1LkRNzPhJ43gzMHQzcbag&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=lVarvwc5BD0)
- [ ] [Deep Learning | Coursera](https://www.coursera.org/specializations/deep-learning?ranMID=40328&ranEAID=lVarvwc5BD0&ranSiteID=lVarvwc5BD0-m3SBadPJeg1Z1rWVng39OQ&siteID=lVarvwc5BD0-m3SBadPJeg1Z1rWVng39OQ&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=lVarvwc5BD0)
- [ ] [Deep Neural Networks with PyTorch | Coursera](https://www.coursera.org/learn/deep-neural-networks-with-pytorch?ranMID=40328&ranEAID=lVarvwc5BD0&ranSiteID=lVarvwc5BD0-Kb0qPiTtTFPC3kMQZlnqpg&siteID=lVarvwc5BD0-Kb0qPiTtTFPC3kMQZlnqpg&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=lVarvwc5BD0)
- [ ] [Machine Learning - complete course notes](http://www.holehouse.org/mlclass/)
- [ ] [Data Science Interview Questions | Data Science Interview Questions and Answers with Tips](https://www.youtube.com/watch?v=7YuTmLvs1Dc)

### Data Science Practice Questions
If you are clueless about which topic to start from in data science, but have some basic idea about ML, then
simply give these questions a go. If you get a bunch of them wrong, you'll know where to start your preparation. :)

### SQL
Quickly go through the tutorial pages; you need not cram anything. Soon after, solve all the HackerRank questions (in sequence, without skipping). Refer back to the tutorials or look up the discussion forum when stuck. You will learn more effectively this way, and applying the various clauses will boost your recall.

- [ ] [SQL Tutorial Series](https://www.w3schools.com/sql/default.asp)
- [ ] [Hackerrank SQL Practice Questions](https://www.hackerrank.com/domains/sql)
- [ ] [Interview Questions - SQL Nomenclature, Theory, Databases](https://www.jigsawacademy.com/blogs/business-analytics/sql-joins-interview-questions/)
- [ ] [SQL Joins](https://learnsql.com/blog/sql-join-interview-questions-with-answers/)
- [ ] [Popular Interview Questions solved](https://github.com/Aafreen29/SQL-Interview-Prep-Question/blob/master/queries.sql)
- [ ] [Amazon Data Analyst SQL Interview Questions](https://leetcode.com/discuss/interview-question/606844/amazon-data-analyst-sql-interview-questions)

### Probability
- [ ] [Questions on Expectations in Probability (must-do, solutions well explained)](https://www.codechef.com/wiki/tutorial-expectation)
- [ ] [Brainstellar Interview Probability Puzzles (amazing resource for interview prep)](https://brainstellar.com/puzzles/probability/1)

### Statistics
<details>
<summary>Why divide by n-1 in sample standard deviation</summary>

- Let f(v) = sum((x_i - v)^2) / n. Setting f'(v) = 0, the minimum occurs at v = sum(x_i)/n = sample mean.
- Thus f(sample mean) <= f(population mean), since the minimum occurs at the sample mean.
- Thus sample std < population std (when using n in the denominator).
- But our goal was to estimate something close to the population std using only the sample data.
- So we bump up the sample std a bit by decreasing its denominator to n-1, bringing the sample std closer to the population std.

</details>
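A quick simulation makes the bias visible (a sketch assuming NumPy is available; the numbers are arbitrary): dividing by n systematically underestimates the population variance, while dividing by n-1 is right on average.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 0.0, 2.0, 5, 100_000

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(mu, sigma, size=n)
    biased.append(np.var(x, ddof=0))    # divide by n
    unbiased.append(np.var(x, ddof=1))  # divide by n-1 (Bessel's correction)

print("true variance       :", sigma**2)          # 4.0
print("mean of /n estimate :", np.mean(biased))   # ≈ 3.2, underestimates
print("mean of /(n-1)      :", np.mean(unbiased)) # ≈ 4.0
```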
<details>
<summary>Generative vs Discriminative models, Prior vs Posterior probability</summary>

- Generative models learn the joint distribution Pr(x, y) (e.g. Naive Bayes) and can generate new samples; discriminative models learn Pr(y | x) directly (e.g. logistic regression) and only draw decision boundaries.
- Prior: Pr(x), the assumed distribution for the parameter to be estimated, without accounting for the observed (sample) data.
- Posterior: Pr(x | observed data), accounting for the observed data.
- Likelihood: Pr(observed data | x).
- Pr(x | observed data) is proportional to Pr(observed data | x) * Pr(x), i.e. posterior ∝ likelihood * prior.

</details>
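Bayes' rule in code, on a made-up example (NumPy/SciPy assumed; the coin hypotheses, prior weights, and counts are all hypothetical): the posterior is just the normalized product of likelihood and prior.

```python
from scipy.stats import binom

# Hypothetical question: is a coin fair (p=0.5) or biased (p=0.8)?
priors = {"fair": 0.9, "biased": 0.1}          # Pr(x): belief before seeing data
heads, tosses = 14, 20                          # observed data

# Likelihood: Pr(observed data | x)
likelihood = {
    "fair":   binom.pmf(heads, tosses, 0.5),
    "biased": binom.pmf(heads, tosses, 0.8),
}

# posterior ∝ likelihood * prior, then normalize
unnorm = {h: likelihood[h] * priors[h] for h in priors}
posterior = {h: v / sum(unnorm.values()) for h, v in unnorm.items()}
print(posterior)   # roughly {'fair': 0.75, 'biased': 0.25} after seeing the data
```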
- [ ] [Variance, Standard Deviation, Covariance, Correlation](https://www.mygreatlearning.com/blog/covariance-vs-correlation/)
- [ ] [Probability vs Likelihood](https://www.youtube.com/watch?v=pYxNSUDSFH4)
- [ ] [Maximum Likelihood, clearly explained!!!](https://www.youtube.com/watch?v=XepXtl9YKwc)
- [ ] [Maximum Likelihood For the Normal Distribution, step-by-step!](https://www.youtube.com/watch?v=Dn6b9fCIUpM)
- [ ] [Naive Bayes](https://www.youtube.com/watch?v=O2L2Uv9pdDA)
- [ ] [Why Dividing By N Underestimates the Variance](https://www.youtube.com/watch?v=sHRBg6BhKjI)
- [ ] [The Central Limit Theorem](https://www.youtube.com/watch?v=YAlJCEDH2uY)
- [ ] [Gaussian Naive Bayes](https://www.youtube.com/watch?v=H3EjCKtlVog)
- [ ] [Covariance and Correlation Part 1: Covariance](https://www.youtube.com/watch?v=qtaqvPAeEJY)
- [ ] [Expectation Maximization: how it works](https://www.youtube.com/watch?v=iQoXFmbXRJA)
- [ ] [Bayesian Inference: An Easy Example](https://www.youtube.com/watch?v=I4dkEALQv34)

### Linear Algebra
- [ ] [Eigenvectors and eigenvalues | Essence of linear algebra, chapter 14](https://m.youtube.com/watch?feature=youtu.be&v=PFDu9oVAE-g)

### Distributions
- [ ] [Exponential and Laplace Distributions](https://www.youtube.com/watch?v=5ptp4naoYEo)
- Gamma
- Exponential
- Student's t

### Inferential Statistics
<details>
<summary>Notes on p-values, statistical significance</summary>

- p-values
  - 0 <= p-value <= 1
  - The closer the p-value is to 0, the more confident we are that the null hypothesis (that there is no difference between the two things) is false.
  - `Threshold for making the decision`: 0.05. This means that if there really is no difference between the two things and the same experiment is repeated many times, only 5% of those experiments would yield a wrong decision.
  - In essence, 5% of experiments, where the differences come purely from random chance, will still generate a p-value smaller than 0.05.
  - Thus, we should obtain large p-values if the two things being compared are identical.
  - Getting a small p-value even when there is no difference is known as a false positive.
  - If it is extremely important to be right when we say the two things are different, we use a smaller threshold, such as 0.1%.
  - A small p-value does not imply that the difference between the two things is large.
- Error types
  - `Type 1 error`: incorrectly rejecting the null hypothesis (false positive)
  - `Alpha`: Prob(type 1 error), a.k.a. the level of significance
  - `Type 2 error`: failing to reject the null hypothesis when you should have rejected it (false negative)
  - `Beta`: Prob(type 2 error)
  - `Power`: Prob(finding a difference when it truly exists) = 1 - beta
  - Having power > 80% for a study is good. It is calculated before the study is conducted, based on projections.
  - `P-value`: Prob(obtaining a result as extreme as the current one, assuming the null hypothesis is true)
  - Low p-value -> reject the null hypothesis; high p-value -> fail to reject the null hypothesis
  - If p-value < alpha, the study was statistically significant. Usually alpha = 0.05.

</details>
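A sketch of the false-positive idea (NumPy/SciPy assumed; group sizes and trial count are arbitrary): when the null hypothesis is actually true, roughly 5% of experiments still produce p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, false_positives, trials = 0.05, 0, 2000

for _ in range(trials):
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)        # same population: no real difference
    _, p = stats.ttest_ind(a, b)
    false_positives += p < alpha    # "significant" purely by chance

print(false_positives / trials)     # ≈ 0.05 = alpha
```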
<details>
<summary>Maximum Likelihood Notes</summary>

- The goal of maximum likelihood is to find the optimal way to fit a distribution to the data.
- Probability: Pr(x | mu, std): the area under a fixed distribution, for varying data x.
- Likelihood: L(mu, std | x): the y-axis value of the distribution curve (whose parameters can be varied) at a fixed data point x.

</details>
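A sketch of maximum likelihood for the normal distribution (NumPy/SciPy assumed; the true parameters 5 and 2 are made up): minimizing the negative log-likelihood numerically recovers the closed-form MLE, i.e. the sample mean and the n-denominator sample std.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=500)   # observed data

# Negative log-likelihood; optimize log(sigma) so that sigma stays positive
def nll(params):
    mu, log_sigma = params
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = optimize.minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma_hat)          # ≈ 5, 2
print(x.mean(), x.std(ddof=0))    # closed-form MLE: sample mean, n-denominator std
```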
- [ ] [Null Hypothesis, p-Value, Statistical Significance, Type 1 Error and Type 2 Error](https://www.youtube.com/watch?v=YSwmpAmLV2s)
- [ ] [Hypothesis Testing and The Null Hypothesis](https://www.youtube.com/watch?v=0oc49DyA3hU)
- [ ] [How to calculate p-values](https://www.youtube.com/watch?v=JQc3yx0-Q9E)
- [ ] [P Values, clearly explained](https://www.youtube.com/watch?v=5Z9OIYA8He8)
- [ ] [p-values: What they are and how to interpret them](https://www.youtube.com/watch?v=vemZtEM63GY)
- [ ] [Intro to Hypothesis Testing in Statistics - Hypothesis Testing Statistics Problems & Examples](https://www.youtube.com/watch?v=VK-rnA3-41c)
- [ ] [Idea behind hypothesis testing | Probability and Statistics](https://www.youtube.com/watch?v=dpGmVV0-4jc)
- [ ] [Examples of null and alternative hypotheses | AP Statistics](https://www.youtube.com/watch?v=_3_6wjycJdk)
- [ ] [Confidence Intervals](https://www.youtube.com/watch?v=TqOeMYtOc1w)
- [ ] [P-values and significance tests | AP Statistics](https://www.youtube.com/watch?v=KS6KEWaoOOE)
- [ ] [Feature selection — Correlation and P-value | by Vishal R | Towards Data Science](https://towardsdatascience.com/feature-selection-correlation-and-p-value-da8921bfb3cf)

### Statistical Tests
<details>
<summary>t-Test</summary>

- Compares 2 means. Works well when the sample size is small. We estimate the population std with the sample std.
- With small samples we are less confident that the distribution resembles a normal distribution. As the sample size increases, it approaches the normal distribution (at about n ≈ 30).
- t-value = signal/noise = (absolute difference between the two means) / (variability of the groups) = |x1 - x2| / sqrt(s1^2/n1 + s2^2/n2)
- Thus, increasing variance gives you more noise, while increasing the number of samples decreases the noise.
- Degrees of freedom (DOF) = n1 + n2 - 2
- If t-value > critical value (from the table), reject the null hypothesis (we found a statistically significant difference between the two means).
- Independent (unpaired) samples means the samples are taken from two separate populations. Paired samples means the samples are taken from the same population, and we compare two means (e.g. before vs after).
- In a two-tailed test, we are not sure in which direction the difference will be. With alpha = 0.05, the 0.05 is split into 0.025 on each tail; the remaining 0.95 sits in the middle. Run a one-tailed test only if you are sure about the directionality.
- (mu, sigma) are population statistics; (x_bar, s) are sample statistics.
- When comparing a sample mean with an already known mean: t-statistic = (x_bar - mu) / sqrt(s^2/n)

</details>
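A sketch comparing the hand-computed t-value with SciPy on made-up measurements; `equal_var=False` (Welch's t-test) uses the same sqrt(s1^2/n1 + s2^2/n2) noise term as above.

```python
import numpy as np
from scipy import stats

group1 = np.array([20.1, 22.3, 19.8, 21.5, 23.0, 20.7])
group2 = np.array([18.2, 19.9, 17.5, 18.8, 20.1, 19.4])

# t = |x1 - x2| / sqrt(s1^2/n1 + s2^2/n2): signal over noise
s1, s2 = group1.std(ddof=1), group2.std(ddof=1)
n1, n2 = len(group1), len(group2)
t_manual = abs(group1.mean() - group2.mean()) / np.sqrt(s1**2 / n1 + s2**2 / n2)

# Welch's t-test (does not assume equal variances)
t_scipy, p = stats.ttest_ind(group1, group2, equal_var=False)
print(t_manual, t_scipy, p)   # t_manual == |t_scipy|; small p -> significant difference
```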
<details>
<summary>Z-test</summary>

- The Z-test uses a normal distribution.
- (mu, sigma) are population statistics; (x_bar, s) are sample statistics.
- z-score = (x - mu) / sigma // number of std deviations a particular sample x is away from the population mean
- z-statistic = (x_bar - mu) / sqrt(sigma^2/n) // number of std deviations the sample mean is away from the population mean
- t-statistic = (x_bar - mu) / sqrt(s^2/n) // when the population std dev (sigma) is unavailable, we substitute the sample std dev (s)
- Use the z-statistic when the population std (sigma) is known and n >= 30; otherwise use the t-statistic.

</details>
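For instance (all numbers hypothetical), with a known population mean of 100, a known population std of 15, and a sample of 36 with mean 104:

```python
import numpy as np
from scipy import stats

mu, sigma, n, x_bar = 100, 15, 36, 104

z = (x_bar - mu) / (sigma / np.sqrt(n))           # z-statistic = 1.6
p_two_tailed = 2 * (1 - stats.norm.cdf(abs(z)))   # ≈ 0.11 -> fail to reject at alpha = 0.05
print(z, p_two_tailed)
```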
<details>
<summary>Z-test example</summary>

- [Z-score table](http://www.z-table.com/uploads/2/1/7/9/21795380/8573955.png?759)
- Question: find the z-critical score for a two-tailed test at alpha = 0.03
- This means the rejection area on each tail = 0.03/2 = 0.015
- So the cumulative area up to the critical point on the right = 1 - 0.015 = 0.985
- Now look for 0.985 in the body of the z-table: it sits at row 2.1, column 0.07
- So the z-critical score = 2.17

</details>
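The same lookup without a table (SciPy assumed): take the inverse CDF of the standard normal at 0.985.

```python
from scipy import stats

alpha = 0.03
z_critical = stats.norm.ppf(1 - alpha / 2)   # inverse CDF at 0.985
print(round(z_critical, 2))                  # 2.17
```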
<details>
<summary>Chi-squared test</summary>

- chi^2 = sum( (observed - expected)^2 / expected )
- The larger the chi^2 value, the more likely the variables are related.
- Tests the correlation relationship between two categorical attributes, A and B, where A has c distinct values and B has r.
- Contingency table: the c values of A are the columns and the r values of B are the rows.
- (A_i, B_j): the joint event that attribute A takes on value a_i and attribute B takes on value b_j.
- o_ij = observed frequency, e_ij = expected frequency (under independence).
- The test is based on a significance level, with (r-1) x (c-1) degrees of freedom.
- Slides link: https://imgur.com/a/U4uJhHc

</details>
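A sketch with SciPy on a made-up 2x3 contingency table; `chi2_contingency` returns the statistic, the p-value, the (r-1)(c-1) degrees of freedom, and the expected frequencies e_ij.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = attribute B, columns = attribute A
observed = np.array([[30, 10, 20],
                     [20, 25, 15]])

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)    # large chi2 / small p -> evidence that the variables are related
print(dof)        # (r-1)*(c-1) = 1*2 = 2
print(expected)   # e_ij under the independence assumption
```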
<details>
<summary>Statistical Tests notes</summary>

- ANOVA test: compares more than 2 means
- Chi-squared test: compares categorical variables
- Shapiro-Wilk test: tests whether a random sample comes from a normal distribution
- Kolmogorov-Smirnov goodness-of-fit test: compares data with a known distribution to check whether they have the same distribution

</details>
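All of these are one-liners in SciPy; a sketch on made-up samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b, c = rng.normal(0, 1, 40), rng.normal(0.2, 1, 40), rng.normal(0.4, 1, 40)

print(stats.f_oneway(a, b, c))   # ANOVA: compares more than 2 means
print(stats.shapiro(a))          # Shapiro-Wilk: is the sample normally distributed?
print(stats.kstest(a, "norm"))   # Kolmogorov-Smirnov vs a known (standard normal) distribution
```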
- [ ] [Student's t-test](https://www.youtube.com/watch?v=pTmLQvMM-1M)
- [ ] [Z-Statistics vs. T-Statistics](https://www.youtube.com/watch?v=DEkPZv5ppHI)
- [ ] [Hypothesis Testing Problems Z Test & T Statistics One & Two Tailed Tests 2](https://www.youtube.com/watch?v=zJ8e_wAWUzE)
- [ ] [Contingency table chi-square test | Probability and Statistics](https://www.youtube.com/watch?v=hpWdDmgsIRE)
- [ ] [6 ways to test for a Normal Distribution — which one to use? (Kolmogorov Smirnov test, Shapiro Wilk test)](https://towardsdatascience.com/6-ways-to-test-for-a-normal-distribution-which-one-to-use-9dcf47d8fa93)

### Linear Regression & Logistic Regression
- [ ] [Logistic Regression](https://www.youtube.com/watch?v=yIYKR4sgzI8)
- [ ] [R-squared or coefficient of determination | Regression | Probability and Statistics](https://www.youtube.com/watch?v=lng4ZgConCM)
- [ ] [Linear Regression vs Logistic Regression | Data Science Training | Edureka](https://www.youtube.com/watch?v=OCwZyYH14uw)
- [ ] [Regression and R-Squared (2.2)](https://www.youtube.com/watch?v=Q-TtIPF0fCU)
- [ ] [Linear Models Pt.1 - Linear Regression](https://www.youtube.com/watch?v=nk2CQITm_eo)
- [ ] [How To... Perform Simple Linear Regression by Hand](https://www.youtube.com/watch?v=GhrxgbQnEEU)
- [ ] [Missing Data Imputation using Regression | Kaggle](https://www.kaggle.com/shashankasubrahmanya/missing-data-imputation-using-regression)
- [ ] [Covariance and Correlation Part 2: Pearson's Correlation](https://www.youtube.com/watch?v=xZ_z8KWkhXE)
- [ ] [R-squared explained](https://www.youtube.com/watch?v=2AQKmw14mHM)
- [ ] [Why is logistic regression a linear classifier?](https://stats.stackexchange.com/questions/93569/why-is-logistic-regression-a-linear-classifier)

### Precision, Recall
<details>
<summary>Important Formulae</summary>

- `Sensitivity` = True Positive Rate = TP/(TP+FN) = how sensitive the model is; same as recall
- `Specificity` = 1 - False Positive Rate = 1 - FP/(FP+TN) = TN/(FP+TN)
- `'P'recision` = TP/(TP+FP) = TP / 'P'redicted positive = how rarely the model raises a false alarm
- `'R'ecall` = TP/(TP+FN) = TP / 'R'eal positive = of all the true cases, how many did we catch
- `F1-score` = 2 * Precision * Recall / (Precision + Recall) = harmonic mean of precision & recall

</details>
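These formulae in code (scikit-learn assumed; the labels below are made up), cross-checked against sklearn's own metrics:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                                 # TP / 'P'redicted positive
recall    = tp / (tp + fn)                                 # TP / 'R'eal positive (= sensitivity)
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)                               # 0.8 0.8 0.8 for these labels
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```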
- [ ] [ROC and AUC!](https://www.youtube.com/watch?v=4jRBRDbJemM)
- [ ] [How to Use ROC Curves and Precision-Recall Curves for Classification in Python](https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/)
- F1 score, specificity, sensitivity

### Gradient Descent
- [ ] [Stochastic Gradient Descent](https://www.youtube.com/watch?v=vMh0zPT0tLI)

### Decision Trees & Random Forests
<details>
<summary>Information Gain</summary>

- Information gain measures the reduction in uncertainty after splitting the dataset on a particular feature: the higher the information gain, the more useful that feature is for classification.
- IG = entropy before splitting - weighted average entropy after splitting
- Entropy = - sum_over_n( p_i * log2(p_i) )

</details>
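A minimal sketch of the entropy / information-gain computation (NumPy assumed; the split below is hypothetical):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical split of 10 samples on some feature
parent      = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
left_child  = [1, 1, 1, 1, 0]    # feature value = A
right_child = [1, 0, 0, 0, 0]    # feature value = B

weighted_child_entropy = (len(left_child) / len(parent)) * entropy(left_child) \
                       + (len(right_child) / len(parent)) * entropy(right_child)
info_gain = entropy(parent) - weighted_child_entropy
print(round(info_gain, 3))       # > 0: the split reduces uncertainty
```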
<details>
<summary>Gini Index</summary>

- The higher the Gini index, the more randomness. The attribute/feature with the lowest Gini index is preferred as the root node when building a decision tree.
  - 0: all elements belong to a single class (pure node)
  - towards 1: elements are randomly distributed across many classes
  - 0.5: elements are uniformly distributed over two classes
- GI(P) = 1 - sum_over_n(p_i^2), where
  - P = (p1, p2, ..., pn), and p_i is the probability of an object being classified into a particular class.

</details>
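A minimal Gini-impurity sketch (NumPy assumed), reproducing the 0 / 0.5 / towards-1 cases above:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini([1, 1, 1, 1]))   # 0.0  -> pure node
print(gini([1, 1, 0, 0]))   # 0.5  -> two classes, uniformly mixed
print(gini([0, 1, 2, 3]))   # 0.75 -> approaches 1 as elements spread over more classes
```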
- [ ] [Decision and Classification Trees](https://www.youtube.com/watch?v=_L39rN6gz7Y)
- [ ] [Regression Trees](https://www.youtube.com/watch?v=g9c66TUylZ4)
- [ ] [Gini Index, Information Gain](https://www.analyticssteps.com/blogs/what-gini-index-and-information-gain-decision-trees)
- [ ] [Decision Trees, Part 2 - Feature Selection and Missing Data](https://www.youtube.com/watch?v=wpNl-JwwplA)
- [ ] [How to Prune Regression Trees](https://www.youtube.com/watch?v=D0efHEJsfHo)
- [ ] [Random Forests Part 1 - Building, Using and Evaluating](https://www.youtube.com/watch?v=J4Wdy0Wc_xQ)
- [ ] [Python | Decision Tree Regression using sklearn - GeeksforGeeks](https://www.geeksforgeeks.org/python-decision-tree-regression-using-sklearn/?ref=rp)

### Loss functions
<details>
<summary>Cross entropy loss</summary>

- Cross entropy loss for class X = -p(X) * log q(X), where p(X) = prob(class X in the target) and q(X) = prob(class X in the prediction).
- E.g. labels: [cat, dog, panda], target: [1, 0, 0], prediction: [0.9, 0.05, 0.05]
- The total CE loss for multi-class classification is the sum of the CE loss over all classes.
- Binary CE loss = -p(X) * log q(X) - (1 - p(X)) * log(1 - q(X))
- Cross entropy loss works even for a target like [0.5, 0.1, 0.4], since we sum the CE loss over all classes.
- In multi-label classification the target can be [1, 0, 1] (not one-hot encoded). Given the prediction [0.6, 0.7, 0.4], the CE loss is evaluated as:
  - CE loss A = binary CE loss with p(X) = 1, q(X) = 0.6
  - CE loss B = binary CE loss with p(X) = 0, q(X) = 0.7
  - CE loss C = binary CE loss with p(X) = 1, q(X) = 0.4
  - Total CE loss = CE loss A + CE loss B + CE loss C

</details>
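The same computations in NumPy (a sketch; the targets and predictions are the made-up numbers from the notes above):

```python
import numpy as np

def binary_ce(p, q, eps=1e-12):
    """-p*log(q) - (1-p)*log(1-q), element-wise."""
    q = np.clip(q, eps, 1 - eps)
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

# Multi-class (softmax) case: one CE term per class, summed
target, pred = np.array([1, 0, 0]), np.array([0.9, 0.05, 0.05])
multiclass_ce = -np.sum(target * np.log(pred))            # = -log(0.9) ≈ 0.105

# Multi-label (sigmoid) case: one binary CE term per label, summed
target_ml, pred_ml = np.array([1, 0, 1]), np.array([0.6, 0.7, 0.4])
multilabel_ce = np.sum(binary_ce(target_ml, pred_ml))     # CE loss A + B + C
print(multiclass_ce, multilabel_ce)
```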
- [ ] [Why do we need Cross Entropy Loss? (Visualized)](https://www.youtube.com/watch?v=gIx974WtVb4)
- [ ] [Cross-entropy loss (Binary, Multi-Class, Multi-Label)](https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451)
- [ ] [Hinge loss for SVM](https://towardsdatascience.com/a-definitive-explanation-to-hinge-loss-for-support-vector-machines-ab6d8d3178f1)

### L1, L2 Regression
- [ ] [Ridge vs Lasso Regression, Visualized](https://www.youtube.com/watch?v=Xm2C_gTAl8c)
- [ ] [Regularization Part 1: Ridge (L2) Regression](https://www.youtube.com/watch?v=Q81RR3yKn30)
- [ ] [Regularization Part 2: Lasso (L1) Regression](https://www.youtube.com/watch?v=NGf0voTMlcs)
- [ ] [Regularization Part 3: Elastic Net Regression](https://www.youtube.com/watch?v=1dKRdX9bfIo)
- [ ] [regression - Why is the L2 regularization equivalent to Gaussian prior? - Cross Validated](https://stats.stackexchange.com/questions/163388/why-is-the-l2-regularization-equivalent-to-gaussian-prior)
- [ ] [regression - Why L1 norm for sparse models - Cross Validated](https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models)

### PCA, SVM, LDA
<details>
<summary>PCA</summary>

- Create a covariance matrix of the variables. Its eigenvalues and eigenvectors describe the full multi-dimensional dataset.
- The eigenvectors describe the directions of spread; the eigenvalues describe how important each direction is in explaining the spread.
- PCA sequentially determines the axes along which the data varies the most.
- All selected axes are eigenvectors of the symmetric covariance matrix, so they are mutually perpendicular.
- The data is then re-expressed using a subset of the most influential axes, by projecting the original points onto them. This is the dimensionality reduction.
- Singular Value Decomposition (SVD) is one way to find these vectors.

</details>
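A sketch of PCA via the covariance matrix's eigendecomposition (NumPy and scikit-learn assumed; the data is synthetic), cross-checked against sklearn's PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0, 0], [1, 1, 0], [0, 0, 0.1]])  # correlated data
Xc = X - X.mean(axis=0)

# Eigendecomposition of the (symmetric) covariance matrix -> orthogonal eigenvectors
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
top2 = eigvecs[:, order[:2]]              # the two most influential axes
projected = Xc @ top2                     # the dimensionality-reduced data

# sklearn finds the same spread along the principal axes (axis signs may differ)
print(np.round(eigvals[order][:2], 3))
print(np.round(PCA(n_components=2).fit(X).explained_variance_, 3))
```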
<details>
<summary>SVM</summary>

- The margin is the smallest distance between the decision boundary and a data point.
- Maximum margin classifiers place the decision boundary such that the margin is maximized. As a result, they are very sensitive to outliers.
- When we allow some misclassifications to accommodate outliers, we get a Soft Margin Classifier, a.k.a. Support Vector Classifier (SVC).
- The soft margin is determined through cross-validation. Support vectors are the observations on the edge of the soft margin.
- For 2D data the support vector classifier is a line; for 3D data it is a plane.
- Support Vector Machines (SVMs) move the data into a higher dimension (new dimensions obtained by applying transformations to the original dimensions).
- Then a support vector classifier is found that separates the higher-dimensional data into two groups.
- SVMs use kernels to systematically find SVCs in higher dimensions.
- Say 2D data is transformed to 3D. A polynomial kernel computes the 3D relationships between each pair of points, which are then used to find an SVC.
- The Radial Basis Function (RBF) kernel finds SVCs in infinite dimensions. It behaves like a weighted nearest-neighbour model (the closest observations have the most influence on the classification).
- Kernel functions do not need to actually transform the points to the higher dimension. They compute the pair-wise relationships between points as if they were in the higher dimension; this is known as the Kernel Trick.
- Polynomial kernel between two points a & b: (a*b + r)^d, where r and d are the coefficient and the degree of the polynomial, found using cross-validation.
- RBF kernel between two points a & b: exp(-r * (a-b)^2), where r scales the influence (in the weighted nearest-neighbour picture) and is determined using cross-validation.

</details>
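A small sketch (NumPy/scikit-learn assumed, synthetic data): the RBF kernel value for a pair of points, and an RBF-kernel SVC separating data that a linear SVC cannot.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not separable by any straight line in 2D
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

# RBF kernel value for a pair of points a, b: exp(-gamma * ||a - b||^2)
a, b, gamma = X[0], X[1], 1.0
print(np.exp(-gamma * np.sum((a - b) ** 2)))   # near 1 for close points, near 0 for distant ones

print(SVC(kernel="linear").fit(X, y).score(X, y))            # stuck well below 1.0
print(SVC(kernel="rbf", gamma=gamma).fit(X, y).score(X, y))  # close to 1.0
```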
- [ ] [PCA main ideas in only 5 minutes](https://www.youtube.com/watch?v=HMOI_lkzW08)
- [ ] [Visual Explanation of Principal Component Analysis, Covariance, SVD](https://www.youtube.com/watch?v=5HNr_j6LmPc)
- [ ] [Principal Component Analysis (PCA), Step-by-Step](https://www.youtube.com/watch?v=FgakZw6K1QQ)
- [ ] [Support Vector Machines](https://www.youtube.com/watch?v=efR1C6CvhmE)
- [ ] [Linear Discriminant Analysis (LDA) clearly explained.](https://www.youtube.com/watch?v=azXCzI57Yfc)

### Boosting
<details>
<summary>Adaboost</summary>

- Combines a lot of "weak learners" to make decisions.
- The weak learners are single-level decision trees (one root, two leaves), known as stumps.
- Each stump gets a weighted say in the final vote (as opposed to random forests, where each tree has an equal vote).
- The errors that the first stump makes influence how the second stump is made.
  - Thus, order matters (as opposed to random forests, where each tree is built independently of the others, in any order).
- First, all samples are given a weight (equal weights initially).
- Then the first stump is made using the feature that classifies best (the feature with the lowest Gini index).
- To decide the stump's weight in the final classification, we calculate:
  - total_error = sum(weights of the samples it classifies incorrectly)
  - amount_of_say = 0.5 * log( (1 - total_error) / total_error )
- When a stump classifies well (total_error near 0), amount_of_say is large and positive; a coin-flip stump (total_error = 0.5) gets amount_of_say = 0.
- Now modify the sample weights so that the next stump learns from the mistakes: we want to emphasize correctly classifying the samples that were misclassified earlier.
  - For misclassified samples: new_sample_weight = sample_weight * exp(amount_of_say), i.e. an increased weight
  - For correctly classified samples: new_sample_weight = sample_weight * exp(-amount_of_say), i.e. a decreased weight
- Then normalize the new_sample_weights.
- Then create a new collection by re-sampling records, with a greater probability of picking those that were wrongly classified earlier. This is where the normalized new_sample_weights are used. After re-sampling is done, assign equal weights to all samples and repeat the procedure to build the second stump.

</details>
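The weight-update arithmetic on a toy example (NumPy assumed; the 8 samples and which ones are misclassified are hypothetical):

```python
import numpy as np

# 8 samples with equal initial weights; the first stump misclassifies samples 2 and 5
n = 8
weights = np.full(n, 1 / n)
misclassified = np.array([False, False, True, False, False, True, False, False])

total_error = weights[misclassified].sum()                     # 0.25
amount_of_say = 0.5 * np.log((1 - total_error) / total_error)  # ≈ 0.55

# Emphasize the samples the stump got wrong, de-emphasize the rest
weights[misclassified] *= np.exp(amount_of_say)
weights[~misclassified] *= np.exp(-amount_of_say)
weights /= weights.sum()                                       # normalize
print(total_error, round(amount_of_say, 3), np.round(weights, 3))
```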
<details>
<summary>Gradient Boost</summary>

- Starts by making a single leaf instead of a stump. For regression, the leaf contains the average of the target variable as the initial prediction.
- Then build a tree (usually with 8 to 32 leaves). All trees are scaled equally (unlike AdaBoost, where trees are weighted during prediction).
- The successive trees are also based on the previous errors, as in AdaBoost.
- Using the initial prediction, calculate the distances from the actual target values, call them residuals, and store them.
- Now use the features to predict the residuals.
  - The average of the values that end up in the same leaf is used as the predicted value for that leaf
  - (this is true when the underlying loss function being minimized is the squared residual).
- Then:
  - new_prediction = initial_prediction + learning_rate * result_from_tree1
  - new_residual = target_value - new_prediction
- new_residual will be smaller than the old residual, so we are taking small steps towards learning to predict target_value accurately.
- Train a new tree on new_residual, add learning_rate * result_from_tree2 to new_prediction to update it, and repeat.

</details>
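A minimal gradient-boosting loop for regression (NumPy/scikit-learn assumed, synthetic data): fit each new tree to the current residuals and add a learning-rate-scaled step.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate, n_trees = 0.1, 100
prediction = np.full_like(y, y.mean())   # the initial single leaf: average target
trees = []

for _ in range(n_trees):
    residual = y - prediction                        # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_leaf_nodes=16).fit(X, residual)
    prediction += learning_rate * tree.predict(X)    # take a small, scaled step
    trees.append(tree)

print(np.mean((y - prediction) ** 2))    # training MSE shrinks as trees are added
```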
- [ ] [Gradient Boost, Learning Rate Shrinkage](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
- [ ] [Gradient Boost Part 1: Regression Main Ideas](https://www.youtube.com/watch?v=3CC4N4z3GJc)
- [ ] [XGBoost Part 1: Regression](https://www.youtube.com/watch?v=OtD8wVaFm6E)
- [ ] [AdaBoost](https://www.youtube.com/watch?v=LsK-xG1cLYA)

### Quantiles
- [ ] [Quantile-Quantile Plots (QQ plots)](https://www.youtube.com/watch?v=okjYjClSjOg)
- [ ] [Quantiles and Percentiles](https://www.youtube.com/watch?v=IFKQLDmRK0Y)

### Clustering
- [ ] [Hierarchical Clustering](https://www.youtube.com/watch?v=7xHsRkOdVwo)
- [ ] [K-means clustering](https://www.youtube.com/watch?v=4b5d3muPQmA)

### Neural Networks
<details>
<summary>CNN notes</summary>

- For data with a grid-like topology (1D audio, 2D images).
- Reduces the number of parameters in a NN through:
  - sparse interactions
  - parameter sharing
- CNNs create spatial features.
- An image passed through a CNN gives rise to a volume. A section of this volume taken through the depth represents features of the same part of the image.
- Each feature map in the same depth layer is generated by the same filter convolving the image (same kernel, shared parameters).
- Equivariant representation:
  - f(g(x)) = g(f(x))
- Types of layers:
  - Convolution layer: the image is convolved using kernels, each applied as a sliding window. Kernel depth = 3 for an RGB image, 1 for grey-scale.
  - Activation layer: applies a non-linearity (e.g. ReLU) element-wise.

Notes V.2
- Problems with plain NNs, and why CNNs?
  - The number of weights rapidly becomes unmanageable for large images. For a 224 x 224 pixel image with 3 color channels, there are around 150,000 weights per neuron in the first fully connected layer.
  - MLPs (multi-layer perceptrons) react differently to an input image and its shifted version: they are not translation invariant.
  - Spatial information is lost when the image is flattened for an MLP. Nodes that are close together are important because they help define the features of an image.
  - CNNs leverage the fact that nearby pixels are more strongly related than distant ones. The influence of nearby pixels is analyzed using filters.
- Filters
  - reduce the number of weights
  - when the location of a feature changes, it does not throw the neural network off
- The convolution layers extract features from the input; the fully connected (dense) layers use the convolution layers' output to generate the final prediction.
- Why do CNNs work efficiently?
  - Parameter sharing: a feature detector in a convolutional layer that is useful in one part of the image is likely useful in other parts too.
  - Sparsity of connections: in each layer, each output value depends only on a small number of inputs.

</details>
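A quick way to see the parameter savings (a sketch assuming PyTorch is available): a single dense layer on a flattened 224x224x3 image versus a 3x3 convolution with the same number of output channels.

```python
import torch.nn as nn

# A dense layer on a flattened 224x224 RGB image needs ~150k weights *per neuron*;
# a conv layer shares one small kernel across the whole image.
dense = nn.Linear(in_features=224 * 224 * 3, out_features=64)
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(dense))  # 9,633,856 = 150,528 weights per neuron * 64 neurons + 64 biases
print(n_params(conv))   # 1,792     = 3*3*3 weights per filter * 64 filters + 64 biases
```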
- [ ] [But what is a neural network? | Chapter 1, Deep learning](https://www.youtube.com/watch?v=aircAruvnKk)
- [ ] [Gradient descent, how neural networks learn | Chapter 2, Deep learning](https://www.youtube.com/watch?v=IHZwWFHWa-w)
- [ ] [What is backpropagation really doing? | Chapter 3, Deep learning](https://www.youtube.com/watch?v=Ilg3gGewQ5U)
- [ ] [Train-test splitting, Stratification](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/)
- [ ] [Regularization, Dropout, Early Stopping](https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/)
- [ ] [Convolution Neural Networks - EXPLAINED](https://www.youtube.com/watch?v=m8pOnJxOcqY)
- [ ] [k-fold Cross-Validation](https://machinelearningmastery.com/k-fold-cross-validation/)
- [ ] Exploding and vanishing gradients
- [ ] [Intro to CNN](https://towardsdatascience.com/simple-introduction-to-convolutional-neural-networks-cdf8d3077bac)

### Activation Function
- ReLU vs Leaky ReLU
- Sigmoid activation
- [ ] [Activation Functions in NN (Sigmoid, tanh, ReLU, Leaky ReLU)](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)
- [ ] [Softmax]()

### Time-series Analysis
- [ ] [Intro to time-series analysis and forecasting](https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775)

### Feature Transformation

- [ ] [correlation - In supervised learning, why is it bad to have correlated features? - Data Science Stack Exchange](https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features)
- [ ] [5.4 Feature Interaction | Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/interaction.html)
- [ ] [Feature Transformation for Machine Learning, a Beginners Guide | by Rebecca Vickery | vickdata | Medium](https://medium.com/vickdata/four-feature-types-and-how-to-transform-them-for-machine-learning-8693e1c24e80)
- [ ] [Feature Transformation. How to handle different feature types… | by Ali Masri | Towards Data Science](https://towardsdatascience.com/apache-spark-mllib-tutorial-7aba8a1dce6e)

### Python Pandas
- [ ] [Python Pandas Tutorial (Part 8): Grouping and Aggregating - Analyzing and Exploring Your Data](https://www.youtube.com/watch?v=txMdrV1Ut64)
- [ ] [Python Pandas Tutorial (Part 2): DataFrame and Series Basics - Selecting Rows and Columns](https://www.youtube.com/watch?v=zmdjNSmRXF4&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS&index=2)