# Data Science Preparation

**P.S. Ctrl+F to search for relevant keywords.**

### Preliminaries

If you are just beginning with ML & Data Science, a good first place to start will be
- [ ] [Andrew Ng Coursera ML course](https://www.coursera.org/learn/machine-learning). Finish at least the first few weeks.

If you have already done the Andrew Ng course, you might want to brush up on the concepts through these notes.
- [ ] [Notes on Andrew Ng Machine Learning](https://www.holehouse.org/mlclass/)

If you want to make a list of important interview topics, head over to this article.
- [ ] [Machine Learning Cheatsheet](https://medium.com/swlh/cheat-sheets-for-machine-learning-interview-topics-51c2bc2bab4f)

### Courses & Resources
- [ ] [Become a Data Scientist in 2020 with these 10 resources](https://towardsdatascience.com/top-10-resources-to-become-a-data-scientist-in-2020-99a315194701)
- [ ] [Applied Data Science with Python | Coursera](https://www.coursera.org/specializations/data-science-python?ranMID=40328&ranEAID=lVarvwc5BD0&ranSiteID=lVarvwc5BD0-_4L3mvw.I6oY9SNPHAtR2Q&siteID=lVarvwc5BD0-_4L3mvw.I6oY9SNPHAtR2Q&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=lVarvwc5BD0)
- [ ] [Minimal Pandas Subset for Data Scientists - Towards Data Science](https://towardsdatascience.com/minimal-pandas-subset-for-data-scientists-6355059629ae)
- [ ] [Python’s One Liner graph creation library with animations Hans Rosling Style](https://towardsdatascience.com/pythons-one-liner-graph-creation-library-with-animations-hans-rosling-style-f2cb50490396)
- [ ] [3 Awesome Visualization Techniques for every dataset](https://towardsdatascience.com/3-awesome-visualization-techniques-for-every-dataset-9737eecacbe8)
- [ ] [Inferential Statistics | Coursera](https://www.coursera.org/learn/inferential-statistics-intro?ranMID=40328&ranEAID=lVarvwc5BD0&ranSiteID=lVarvwc5BD0-ydEVG6k5kidzLtNqbbVQvQ&siteID=lVarvwc5BD0-ydEVG6k5kidzLtNqbbVQvQ&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=lVarvwc5BD0)
- [ ] [Advanced Machine Learning | Coursera](https://www.coursera.org/specializations/aml?ranMID=40328&ranEAID=lVarvwc5BD0&ranSiteID=lVarvwc5BD0-_1LkRNzPhJ43gzMHQzcbag&siteID=lVarvwc5BD0-_1LkRNzPhJ43gzMHQzcbag&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=lVarvwc5BD0)
- [ ] [Deep Learning | Coursera](https://www.coursera.org/specializations/deep-learning?ranMID=40328&ranEAID=lVarvwc5BD0&ranSiteID=lVarvwc5BD0-m3SBadPJeg1Z1rWVng39OQ&siteID=lVarvwc5BD0-m3SBadPJeg1Z1rWVng39OQ&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=lVarvwc5BD0)
- [ ] [Deep Neural Networks with PyTorch | Coursera](https://www.coursera.org/learn/deep-neural-networks-with-pytorch?ranMID=40328&ranEAID=lVarvwc5BD0&ranSiteID=lVarvwc5BD0-Kb0qPiTtTFPC3kMQZlnqpg&siteID=lVarvwc5BD0-Kb0qPiTtTFPC3kMQZlnqpg&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=lVarvwc5BD0)
- [ ] [Machine Learning - complete course notes](http://www.holehouse.org/mlclass/)
- [ ] [Data Science Interview Questions | Data Science Interview Questions and Answers with Tips](https://www.youtube.com/watch?v=7YuTmLvs1Dc)

### Data Science Practice Questions
If you are clueless about which topic to start from in data science, but have some basic idea about ML, then
simply give these questions a go. If you get a bunch of them wrong, you'll know where to start your preparation. :)

### SQL
Quickly go through the tutorial pages; you need not cram anything. Soon after, solve all the HackerRank questions (in sequence, without skipping). Refer back to the tutorials or look up the discussion forum when stuck. You will learn more effectively this way, and applying the various clauses will boost your recall.

- [ ] [SQL Tutorial Series](https://www.w3schools.com/sql/default.asp)
- [ ] [Hackerrank SQL Practice Questions](https://www.hackerrank.com/domains/sql)
- [ ] [Interview Questions - SQL Nomenclature, Theory, Databases](https://www.jigsawacademy.com/blogs/business-analytics/sql-joins-interview-questions/)
- [ ] [SQL Joins](https://learnsql.com/blog/sql-join-interview-questions-with-answers/)
- [ ] [Popular Interview Questions solved](https://github.com/Aafreen29/SQL-Interview-Prep-Question/blob/master/queries.sql)
- [ ] [Amazon Data Analyst SQL Interview Questions](https://leetcode.com/discuss/interview-question/606844/amazon-data-analyst-sql-interview-questions)

### Probability
- [ ] [Questions on Expectations in Probability (must-do, solutions well explained)](https://www.codechef.com/wiki/tutorial-expectation)
- [ ] [Brainstellar Interview Probability Puzzles (amazing resource for interview prep)](https://brainstellar.com/puzzles/probability/1)

### Statistics
<details>
<summary>Why divide by n-1 in sample standard deviation</summary>

- Let f(v) = sum((x_i - v)^2) / n. Setting f'(v) = 0, the minimum occurs at v = sum(x_i)/n = sample mean.
- Thus f(sample mean) <= f(population mean), since the minimum occurs at the sample mean.
- Thus sample std < population std (when using n in the denominator).
- But our goal was to estimate something close to the population std using only the sample data.
- So we bump up the sample std a bit by decreasing its denominator to n-1, bringing the sample std closer to the population std.

</details>
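A quick simulation makes the bias visible (a sketch assuming NumPy is available; the numbers are arbitrary): dividing by n systematically underestimates the population variance, while dividing by n-1 is right on average.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 0.0, 2.0, 5, 100_000

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(mu, sigma, size=n)
    biased.append(np.var(x, ddof=0))    # divide by n
    unbiased.append(np.var(x, ddof=1))  # divide by n-1 (Bessel's correction)

print("true variance       :", sigma**2)          # 4.0
print("mean of /n estimate :", np.mean(biased))   # ≈ 3.2, underestimates
print("mean of /(n-1)      :", np.mean(unbiased)) # ≈ 4.0
```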
<details>
<summary>Generative vs Discriminative models, Prior vs Posterior probability</summary>

- Generative models learn the joint distribution Pr(x, y) (e.g. Naive Bayes) and can generate new samples; discriminative models learn Pr(y | x) directly (e.g. logistic regression) and only draw decision boundaries.
- Prior: Pr(x), the assumed distribution for the parameter to be estimated, without accounting for the observed (sample) data.
- Posterior: Pr(x | observed data), accounting for the observed data.
- Likelihood: Pr(observed data | x).
- Pr(x | observed data) is proportional to Pr(observed data | x) * Pr(x), i.e. posterior ∝ likelihood * prior.

</details>
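Bayes' rule in code, on a made-up example (NumPy/SciPy assumed; the coin hypotheses, prior weights, and counts are all hypothetical): the posterior is just the normalized product of likelihood and prior.

```python
from scipy.stats import binom

# Hypothetical question: is a coin fair (p=0.5) or biased (p=0.8)?
priors = {"fair": 0.9, "biased": 0.1}          # Pr(x): belief before seeing data
heads, tosses = 14, 20                          # observed data

# Likelihood: Pr(observed data | x)
likelihood = {
    "fair":   binom.pmf(heads, tosses, 0.5),
    "biased": binom.pmf(heads, tosses, 0.8),
}

# posterior ∝ likelihood * prior, then normalize
unnorm = {h: likelihood[h] * priors[h] for h in priors}
posterior = {h: v / sum(unnorm.values()) for h, v in unnorm.items()}
print(posterior)   # roughly {'fair': 0.75, 'biased': 0.25} after seeing the data
```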
- [ ] [Variance, Standard Deviation, Covariance, Correlation](https://www.mygreatlearning.com/blog/covariance-vs-correlation/)
- [ ] [Probability vs Likelihood](https://www.youtube.com/watch?v=pYxNSUDSFH4)
- [ ] [Maximum Likelihood, clearly explained!!!](https://www.youtube.com/watch?v=XepXtl9YKwc)
- [ ] [Maximum Likelihood For the Normal Distribution, step-by-step!](https://www.youtube.com/watch?v=Dn6b9fCIUpM)
- [ ] [Naive Bayes](https://www.youtube.com/watch?v=O2L2Uv9pdDA)
- [ ] [Why Dividing By N Underestimates the Variance](https://www.youtube.com/watch?v=sHRBg6BhKjI)
- [ ] [The Central Limit Theorem](https://www.youtube.com/watch?v=YAlJCEDH2uY)
- [ ] [Gaussian Naive Bayes](https://www.youtube.com/watch?v=H3EjCKtlVog)
- [ ] [Covariance and Correlation Part 1: Covariance](https://www.youtube.com/watch?v=qtaqvPAeEJY)
- [ ] [Expectation Maximization: how it works](https://www.youtube.com/watch?v=iQoXFmbXRJA)
- [ ] [Bayesian Inference: An Easy Example](https://www.youtube.com/watch?v=I4dkEALQv34)

### Linear Algebra
- [ ] [Eigenvectors and eigenvalues | Essence of linear algebra, chapter 14](https://m.youtube.com/watch?feature=youtu.be&v=PFDu9oVAE-g)

### Distributions
- [ ] [Exponential and Laplace Distributions](https://www.youtube.com/watch?v=5ptp4naoYEo)
- Gamma
- Exponential
- Student's t

### Inferential Statistics
<details>
<summary>Notes on p-values, statistical significance</summary>

- p-values
  - 0 <= p-value <= 1
  - The closer the p-value is to 0, the more confident we are that the null hypothesis (that there is no difference between the two things) is false.
  - `Threshold for making the decision`: 0.05. This means that if there really is no difference between the two things and the same experiment is repeated many times, only 5% of those experiments would yield a wrong decision.
  - In essence, 5% of experiments, where the differences come purely from random chance, will still generate a p-value smaller than 0.05.
  - Thus, we should obtain large p-values if the two things being compared are identical.
  - Getting a small p-value even when there is no difference is known as a false positive.
  - If it is extremely important to be right when we say the two things are different, we use a smaller threshold, such as 0.1%.
  - A small p-value does not imply that the difference between the two things is large.
- Error types
  - `Type 1 error`: incorrectly rejecting the null hypothesis (false positive)
  - `Alpha`: Prob(type 1 error), a.k.a. the level of significance
  - `Type 2 error`: failing to reject the null hypothesis when you should have rejected it (false negative)
  - `Beta`: Prob(type 2 error)
  - `Power`: Prob(finding a difference when it truly exists) = 1 - beta
  - Having power > 80% for a study is good. It is calculated before the study is conducted, based on projections.
  - `P-value`: Prob(obtaining a result as extreme as the current one, assuming the null hypothesis is true)
  - Low p-value -> reject the null hypothesis; high p-value -> fail to reject the null hypothesis
  - If p-value < alpha, the study was statistically significant. Usually alpha = 0.05.

</details>
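A sketch of the false-positive idea (NumPy/SciPy assumed; group sizes and trial count are arbitrary): when the null hypothesis is actually true, roughly 5% of experiments still produce p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, false_positives, trials = 0.05, 0, 2000

for _ in range(trials):
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)        # same population: no real difference
    _, p = stats.ttest_ind(a, b)
    false_positives += p < alpha    # "significant" purely by chance

print(false_positives / trials)     # ≈ 0.05 = alpha
```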
<details>
<summary>Maximum Likelihood Notes</summary>

- The goal of maximum likelihood is to find the optimal way to fit a distribution to the data.
- Probability: Pr(x | mu, std): the area under a fixed distribution, for varying data x.
- Likelihood: L(mu, std | x): the y-axis value of the distribution curve (whose parameters can be varied) at a fixed data point x.

</details>
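A sketch of maximum likelihood for the normal distribution (NumPy/SciPy assumed; the true parameters 5 and 2 are made up): minimizing the negative log-likelihood numerically recovers the closed-form MLE, i.e. the sample mean and the n-denominator sample std.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=500)   # observed data

# Negative log-likelihood; optimize log(sigma) so that sigma stays positive
def nll(params):
    mu, log_sigma = params
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = optimize.minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma_hat)          # ≈ 5, 2
print(x.mean(), x.std(ddof=0))    # closed-form MLE: sample mean, n-denominator std
```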
- [ ] [Null Hypothesis, p-Value, Statistical Significance, Type 1 Error and Type 2 Error](https://www.youtube.com/watch?v=YSwmpAmLV2s)
- [ ] [Hypothesis Testing and The Null Hypothesis](https://www.youtube.com/watch?v=0oc49DyA3hU)
- [ ] [How to calculate p-values](https://www.youtube.com/watch?v=JQc3yx0-Q9E)
- [ ] [P Values, clearly explained](https://www.youtube.com/watch?v=5Z9OIYA8He8)
- [ ] [p-values: What they are and how to interpret them](https://www.youtube.com/watch?v=vemZtEM63GY)
- [ ] [Intro to Hypothesis Testing in Statistics - Hypothesis Testing Statistics Problems & Examples](https://www.youtube.com/watch?v=VK-rnA3-41c)
- [ ] [Idea behind hypothesis testing | Probability and Statistics](https://www.youtube.com/watch?v=dpGmVV0-4jc)
- [ ] [Examples of null and alternative hypotheses | AP Statistics](https://www.youtube.com/watch?v=_3_6wjycJdk)
- [ ] [Confidence Intervals](https://www.youtube.com/watch?v=TqOeMYtOc1w)
- [ ] [P-values and significance tests | AP Statistics](https://www.youtube.com/watch?v=KS6KEWaoOOE)
- [ ] [Feature selection — Correlation and P-value | by Vishal R | Towards Data Science](https://towardsdatascience.com/feature-selection-correlation-and-p-value-da8921bfb3cf)

### Statistical Tests
<details>
<summary>t-Test</summary>

- Compares 2 means. Works well when the sample size is small. We estimate the population std with the sample std.
- With small samples we are less confident that the distribution resembles a normal distribution. As the sample size increases, it approaches the normal distribution (at about n ≈ 30).
- t-value = signal/noise = (absolute difference between the two means) / (variability of the groups) = |x1 - x2| / sqrt(s1^2/n1 + s2^2/n2)
- Thus, increasing variance gives you more noise, while increasing the number of samples decreases the noise.
- Degrees of freedom (DOF) = n1 + n2 - 2
- If t-value > critical value (from the table), reject the null hypothesis (we found a statistically significant difference between the two means).
- Independent (unpaired) samples means the samples are taken from two separate populations. Paired samples means the samples are taken from the same population, and we compare two means (e.g. before vs after).
- In a two-tailed test, we are not sure in which direction the difference will be. With alpha = 0.05, the 0.05 is split into 0.025 on each tail; the remaining 0.95 sits in the middle. Run a one-tailed test only if you are sure about the directionality.
- (mu, sigma) are population statistics; (x_bar, s) are sample statistics.
- When comparing a sample mean with an already known mean: t-statistic = (x_bar - mu) / sqrt(s^2/n)

</details>
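A sketch comparing the hand-computed t-value with SciPy on made-up measurements; `equal_var=False` (Welch's t-test) uses the same sqrt(s1^2/n1 + s2^2/n2) noise term as above.

```python
import numpy as np
from scipy import stats

group1 = np.array([20.1, 22.3, 19.8, 21.5, 23.0, 20.7])
group2 = np.array([18.2, 19.9, 17.5, 18.8, 20.1, 19.4])

# t = |x1 - x2| / sqrt(s1^2/n1 + s2^2/n2): signal over noise
s1, s2 = group1.std(ddof=1), group2.std(ddof=1)
n1, n2 = len(group1), len(group2)
t_manual = abs(group1.mean() - group2.mean()) / np.sqrt(s1**2 / n1 + s2**2 / n2)

# Welch's t-test (does not assume equal variances)
t_scipy, p = stats.ttest_ind(group1, group2, equal_var=False)
print(t_manual, t_scipy, p)   # t_manual == |t_scipy|; small p -> significant difference
```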
<details>
<summary>Z-test</summary>

- The Z-test uses a normal distribution.
- (mu, sigma) are population statistics; (x_bar, s) are sample statistics.
- z-score = (x - mu) / sigma // number of std deviations a particular sample x is away from the population mean
- z-statistic = (x_bar - mu) / sqrt(sigma^2/n) // number of std deviations the sample mean is away from the population mean
- t-statistic = (x_bar - mu) / sqrt(s^2/n) // when the population std dev (sigma) is unavailable, we substitute the sample std dev (s)
- Use the z-statistic when the population std (sigma) is known and n >= 30; otherwise use the t-statistic.

</details>
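For instance (all numbers hypothetical), with a known population mean of 100, a known population std of 15, and a sample of 36 with mean 104:

```python
import numpy as np
from scipy import stats

mu, sigma, n, x_bar = 100, 15, 36, 104

z = (x_bar - mu) / (sigma / np.sqrt(n))           # z-statistic = 1.6
p_two_tailed = 2 * (1 - stats.norm.cdf(abs(z)))   # ≈ 0.11 -> fail to reject at alpha = 0.05
print(z, p_two_tailed)
```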
<details>
<summary>Z-test example</summary>

- [Z-score table](http://www.z-table.com/uploads/2/1/7/9/21795380/8573955.png?759)
- Question: find the z-critical score for a two-tailed test at alpha = 0.03
- This means the rejection area on each tail = 0.03/2 = 0.015
- So the cumulative area up to the critical point on the right = 1 - 0.015 = 0.985
- Now look for 0.985 in the body of the z-table: it sits at row 2.1, column 0.07
- So the z-critical score = 2.17

</details>
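The same lookup without a table (SciPy assumed): take the inverse CDF of the standard normal at 0.985.

```python
from scipy import stats

alpha = 0.03
z_critical = stats.norm.ppf(1 - alpha / 2)   # inverse CDF at 0.985
print(round(z_critical, 2))                  # 2.17
```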
<details>
<summary>Chi-squared test</summary>

- chi^2 = sum( (observed - expected)^2 / expected )
- The larger the chi^2 value, the more likely the variables are related.
- Tests the correlation relationship between two categorical attributes, A and B, where A has c distinct values and B has r.
- Contingency table: the c values of A are the columns and the r values of B are the rows.
- (A_i, B_j): the joint event that attribute A takes on value a_i and attribute B takes on value b_j.
- o_ij = observed frequency, e_ij = expected frequency (under independence).
- The test is based on a significance level, with (r-1) x (c-1) degrees of freedom.
- Slides link: https://imgur.com/a/U4uJhHc

</details>
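A sketch with SciPy on a made-up 2x3 contingency table; `chi2_contingency` returns the statistic, the p-value, the (r-1)(c-1) degrees of freedom, and the expected frequencies e_ij.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = attribute B, columns = attribute A
observed = np.array([[30, 10, 20],
                     [20, 25, 15]])

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)    # large chi2 / small p -> evidence that the variables are related
print(dof)        # (r-1)*(c-1) = 1*2 = 2
print(expected)   # e_ij under the independence assumption
```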
<details>
<summary>Statistical Tests notes</summary>

- ANOVA test: compares more than 2 means
- Chi-squared test: compares categorical variables
- Shapiro-Wilk test: tests whether a random sample comes from a normal distribution
- Kolmogorov-Smirnov goodness-of-fit test: compares data with a known distribution to check whether they have the same distribution

</details>
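All of these are one-liners in SciPy; a sketch on made-up samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b, c = rng.normal(0, 1, 40), rng.normal(0.2, 1, 40), rng.normal(0.4, 1, 40)

print(stats.f_oneway(a, b, c))   # ANOVA: compares more than 2 means
print(stats.shapiro(a))          # Shapiro-Wilk: is the sample normally distributed?
print(stats.kstest(a, "norm"))   # Kolmogorov-Smirnov vs a known (standard normal) distribution
```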
- [ ] [Student's t-test](https://www.youtube.com/watch?v=pTmLQvMM-1M)
- [ ] [Z-Statistics vs. T-Statistics](https://www.youtube.com/watch?v=DEkPZv5ppHI)
- [ ] [Hypothesis Testing Problems Z Test & T Statistics One & Two Tailed Tests 2](https://www.youtube.com/watch?v=zJ8e_wAWUzE)
- [ ] [Contingency table chi-square test | Probability and Statistics](https://www.youtube.com/watch?v=hpWdDmgsIRE)
- [ ] [6 ways to test for a Normal Distribution — which one to use? (Kolmogorov Smirnov test, Shapiro Wilk test)](https://towardsdatascience.com/6-ways-to-test-for-a-normal-distribution-which-one-to-use-9dcf47d8fa93)

### Linear Regression & Logistic Regression
- [ ] [Logistic Regression](https://www.youtube.com/watch?v=yIYKR4sgzI8)
- [ ] [R-squared or coefficient of determination | Regression | Probability and Statistics](https://www.youtube.com/watch?v=lng4ZgConCM)
- [ ] [Linear Regression vs Logistic Regression | Data Science Training | Edureka](https://www.youtube.com/watch?v=OCwZyYH14uw)
- [ ] [Regression and R-Squared (2.2)](https://www.youtube.com/watch?v=Q-TtIPF0fCU)
- [ ] [Linear Models Pt.1 - Linear Regression](https://www.youtube.com/watch?v=nk2CQITm_eo)
- [ ] [How To... Perform Simple Linear Regression by Hand](https://www.youtube.com/watch?v=GhrxgbQnEEU)
- [ ] [Missing Data Imputation using Regression | Kaggle](https://www.kaggle.com/shashankasubrahmanya/missing-data-imputation-using-regression)
- [ ] [Covariance and Correlation Part 2: Pearson's Correlation](https://www.youtube.com/watch?v=xZ_z8KWkhXE)
- [ ] [R-squared explained](https://www.youtube.com/watch?v=2AQKmw14mHM)
- [ ] [Why is logistic regression a linear classifier?](https://stats.stackexchange.com/questions/93569/why-is-logistic-regression-a-linear-classifier)

### Precision, Recall
<details>
<summary>Important Formulae</summary>

- `Sensitivity` = True Positive Rate = TP/(TP+FN) = how sensitive the model is; same as recall
- `Specificity` = 1 - False Positive Rate = 1 - FP/(FP+TN) = TN/(FP+TN)
- `'P'recision` = TP/(TP+FP) = TP / 'P'redicted positive = how rarely the model raises a false alarm
- `'R'ecall` = TP/(TP+FN) = TP / 'R'eal positive = of all the true cases, how many did we catch
- `F1-score` = 2 * Precision * Recall / (Precision + Recall) = harmonic mean of precision & recall

</details>
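These formulae in code (scikit-learn assumed; the labels below are made up), cross-checked against sklearn's own metrics:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                                 # TP / 'P'redicted positive
recall    = tp / (tp + fn)                                 # TP / 'R'eal positive (= sensitivity)
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)                               # 0.8 0.8 0.8 for these labels
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```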
- [ ] [ROC and AUC!](https://www.youtube.com/watch?v=4jRBRDbJemM)
- [ ] [How to Use ROC Curves and Precision-Recall Curves for Classification in Python](https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/)
- F1 score, specificity, sensitivity

### Gradient Descent
- [ ] [Stochastic Gradient Descent](https://www.youtube.com/watch?v=vMh0zPT0tLI)

### Decision Trees & Random Forests
<details>
<summary>Information Gain</summary>

- Information gain measures the reduction in uncertainty after splitting the dataset on a particular feature: the higher the information gain, the more useful that feature is for classification.
- IG = entropy before splitting - weighted average entropy after splitting
- Entropy = - sum_over_n( p_i * log2(p_i) )

</details>
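A minimal sketch of the entropy / information-gain computation (NumPy assumed; the split below is hypothetical):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical split of 10 samples on some feature
parent      = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
left_child  = [1, 1, 1, 1, 0]    # feature value = A
right_child = [1, 0, 0, 0, 0]    # feature value = B

weighted_child_entropy = (len(left_child) / len(parent)) * entropy(left_child) \
                       + (len(right_child) / len(parent)) * entropy(right_child)
info_gain = entropy(parent) - weighted_child_entropy
print(round(info_gain, 3))       # > 0: the split reduces uncertainty
```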
<details>
<summary>Gini Index</summary>

- The higher the Gini index, the more randomness. The attribute/feature with the lowest Gini index is preferred as the root node when building a decision tree.
  - 0: all elements belong to a single class (pure node)
  - towards 1: elements are randomly distributed across many classes
  - 0.5: elements are uniformly distributed over two classes
- GI(P) = 1 - sum_over_n(p_i^2), where
  - P = (p1, p2, ..., pn), and p_i is the probability of an object being classified into a particular class.

</details>
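A minimal Gini-impurity sketch (NumPy assumed), reproducing the 0 / 0.5 / towards-1 cases above:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini([1, 1, 1, 1]))   # 0.0  -> pure node
print(gini([1, 1, 0, 0]))   # 0.5  -> two classes, uniformly mixed
print(gini([0, 1, 2, 3]))   # 0.75 -> approaches 1 as elements spread over more classes
```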
- [ ] [Decision and Classification Trees](https://www.youtube.com/watch?v=_L39rN6gz7Y)
- [ ] [Regression Trees](https://www.youtube.com/watch?v=g9c66TUylZ4)
- [ ] [Gini Index, Information Gain](https://www.analyticssteps.com/blogs/what-gini-index-and-information-gain-decision-trees)
- [ ] [Decision Trees, Part 2 - Feature Selection and Missing Data](https://www.youtube.com/watch?v=wpNl-JwwplA)
- [ ] [How to Prune Regression Trees](https://www.youtube.com/watch?v=D0efHEJsfHo)
- [ ] [Random Forests Part 1 - Building, Using and Evaluating](https://www.youtube.com/watch?v=J4Wdy0Wc_xQ)
- [ ] [Python | Decision Tree Regression using sklearn - GeeksforGeeks](https://www.geeksforgeeks.org/python-decision-tree-regression-using-sklearn/?ref=rp)

### Loss functions
<details>
<summary>Cross entropy loss</summary>

- Cross entropy loss for class X = -p(X) * log q(X), where p(X) = prob(class X in the target) and q(X) = prob(class X in the prediction).
- E.g. labels: [cat, dog, panda], target: [1, 0, 0], prediction: [0.9, 0.05, 0.05]
- The total CE loss for multi-class classification is the sum of the CE loss over all classes.
- Binary CE loss = -p(X) * log q(X) - (1 - p(X)) * log(1 - q(X))
- Cross entropy loss works even for a target like [0.5, 0.1, 0.4], since we sum the CE loss over all classes.
- In multi-label classification the target can be [1, 0, 1] (not one-hot encoded). Given the prediction [0.6, 0.7, 0.4], the CE loss is evaluated as:
  - CE loss A = binary CE loss with p(X) = 1, q(X) = 0.6
  - CE loss B = binary CE loss with p(X) = 0, q(X) = 0.7
  - CE loss C = binary CE loss with p(X) = 1, q(X) = 0.4
  - Total CE loss = CE loss A + CE loss B + CE loss C

</details>
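The same computations in NumPy (a sketch; the targets and predictions are the made-up numbers from the notes above):

```python
import numpy as np

def binary_ce(p, q, eps=1e-12):
    """-p*log(q) - (1-p)*log(1-q), element-wise."""
    q = np.clip(q, eps, 1 - eps)
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

# Multi-class (softmax) case: one CE term per class, summed
target, pred = np.array([1, 0, 0]), np.array([0.9, 0.05, 0.05])
multiclass_ce = -np.sum(target * np.log(pred))            # = -log(0.9) ≈ 0.105

# Multi-label (sigmoid) case: one binary CE term per label, summed
target_ml, pred_ml = np.array([1, 0, 1]), np.array([0.6, 0.7, 0.4])
multilabel_ce = np.sum(binary_ce(target_ml, pred_ml))     # CE loss A + B + C
print(multiclass_ce, multilabel_ce)
```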
- [ ] [Why do we need Cross Entropy Loss? (Visualized)](https://www.youtube.com/watch?v=gIx974WtVb4)
- [ ] [Cross-entropy loss (Binary, Multi-Class, Multi-Label)](https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451)
- [ ] [Hinge loss for SVM](https://towardsdatascience.com/a-definitive-explanation-to-hinge-loss-for-support-vector-machines-ab6d8d3178f1)

### L1, L2 Regression
- [ ] [Ridge vs Lasso Regression, Visualized](https://www.youtube.com/watch?v=Xm2C_gTAl8c)
- [ ] [Regularization Part 1: Ridge (L2) Regression](https://www.youtube.com/watch?v=Q81RR3yKn30)
- [ ] [Regularization Part 2: Lasso (L1) Regression](https://www.youtube.com/watch?v=NGf0voTMlcs)
- [ ] [Regularization Part 3: Elastic Net Regression](https://www.youtube.com/watch?v=1dKRdX9bfIo)
- [ ] [regression - Why is the L2 regularization equivalent to Gaussian prior? - Cross Validated](https://stats.stackexchange.com/questions/163388/why-is-the-l2-regularization-equivalent-to-gaussian-prior)
- [ ] [regression - Why L1 norm for sparse models - Cross Validated](https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models)

### PCA, SVM, LDA
<details>
<summary>PCA</summary>

- Create a covariance matrix of the variables. Its eigenvalues and eigenvectors describe the full multi-dimensional dataset.
- The eigenvectors describe the directions of spread; the eigenvalues describe how important each direction is in explaining the spread.
- PCA sequentially determines the axes along which the data varies the most.
- All selected axes are eigenvectors of the symmetric covariance matrix, so they are mutually perpendicular.
- The data is then re-expressed using a subset of the most influential axes, by projecting the original points onto them. This is the dimensionality reduction.
- Singular Value Decomposition (SVD) is one way to find these vectors.

</details>
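A sketch of PCA via the covariance matrix's eigendecomposition (NumPy and scikit-learn assumed; the data is synthetic), cross-checked against sklearn's PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0, 0], [1, 1, 0], [0, 0, 0.1]])  # correlated data
Xc = X - X.mean(axis=0)

# Eigendecomposition of the (symmetric) covariance matrix -> orthogonal eigenvectors
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
top2 = eigvecs[:, order[:2]]              # the two most influential axes
projected = Xc @ top2                     # the dimensionality-reduced data

# sklearn finds the same spread along the principal axes (axis signs may differ)
print(np.round(eigvals[order][:2], 3))
print(np.round(PCA(n_components=2).fit(X).explained_variance_, 3))
```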
<details>
<summary>SVM</summary>

- The margin is the smallest distance between the decision boundary and a data point.
- Maximum margin classifiers place the decision boundary such that the margin is maximized. As a result, they are very sensitive to outliers.
- When we allow some misclassifications to accommodate outliers, we get a Soft Margin Classifier, a.k.a. Support Vector Classifier (SVC).
- The soft margin is determined through cross-validation. Support vectors are the observations on the edge of the soft margin.
- For 2D data the support vector classifier is a line; for 3D data it is a plane.
- Support Vector Machines (SVMs) move the data into a higher dimension (new dimensions obtained by applying transformations to the original dimensions).
- Then a support vector classifier is found that separates the higher-dimensional data into two groups.
- SVMs use kernels to systematically find SVCs in higher dimensions.
- Say 2D data is transformed to 3D. A polynomial kernel computes the 3D relationships between each pair of points, which are then used to find an SVC.
- The Radial Basis Function (RBF) kernel finds SVCs in infinite dimensions. It behaves like a weighted nearest-neighbour model (the closest observations have the most influence on the classification).
- Kernel functions do not need to actually transform the points to the higher dimension. They compute the pair-wise relationships between points as if they were in the higher dimension; this is known as the Kernel Trick.
- Polynomial kernel between two points a & b: (a*b + r)^d, where r and d are the coefficient and the degree of the polynomial, found using cross-validation.
- RBF kernel between two points a & b: exp(-r * (a-b)^2), where r scales the influence (in the weighted nearest-neighbour picture) and is determined using cross-validation.

</details>
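A small sketch (NumPy/scikit-learn assumed, synthetic data): the RBF kernel value for a pair of points, and an RBF-kernel SVC separating data that a linear SVC cannot.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not separable by any straight line in 2D
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

# RBF kernel value for a pair of points a, b: exp(-gamma * ||a - b||^2)
a, b, gamma = X[0], X[1], 1.0
print(np.exp(-gamma * np.sum((a - b) ** 2)))   # near 1 for close points, near 0 for distant ones

print(SVC(kernel="linear").fit(X, y).score(X, y))            # stuck well below 1.0
print(SVC(kernel="rbf", gamma=gamma).fit(X, y).score(X, y))  # close to 1.0
```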
- [ ] [PCA main ideas in only 5 minutes](https://www.youtube.com/watch?v=HMOI_lkzW08)
- [ ] [Visual Explanation of Principal Component Analysis, Covariance, SVD](https://www.youtube.com/watch?v=5HNr_j6LmPc)
- [ ] [Principal Component Analysis (PCA), Step-by-Step](https://www.youtube.com/watch?v=FgakZw6K1QQ)
- [ ] [Support Vector Machines](https://www.youtube.com/watch?v=efR1C6CvhmE)
- [ ] [Linear Discriminant Analysis (LDA) clearly explained.](https://www.youtube.com/watch?v=azXCzI57Yfc)

### Boosting
<details>
<summary>Adaboost</summary>

- Combines a lot of "weak learners" to make decisions.
- The weak learners are single-level decision trees (one root, two leaves), known as stumps.
- Each stump gets a weighted say in the final vote (as opposed to random forests, where each tree has an equal vote).
- The errors that the first stump makes influence how the second stump is made.
  - Thus, order matters (as opposed to random forests, where each tree is built independently of the others, in any order).
- First, all samples are given a weight (equal weights initially).
- Then the first stump is made using the feature that classifies best (the feature with the lowest Gini index).
- To decide the stump's weight in the final classification, we calculate:
  - total_error = sum(weights of the samples it classifies incorrectly)
  - amount_of_say = 0.5 * log( (1 - total_error) / total_error )
- When a stump classifies well (total_error near 0), amount_of_say is large and positive; a coin-flip stump (total_error = 0.5) gets amount_of_say = 0.
- Now modify the sample weights so that the next stump learns from the mistakes: we want to emphasize correctly classifying the samples that were misclassified earlier.
  - For misclassified samples: new_sample_weight = sample_weight * exp(amount_of_say), i.e. an increased weight
  - For correctly classified samples: new_sample_weight = sample_weight * exp(-amount_of_say), i.e. a decreased weight
- Then normalize the new_sample_weights.
- Then create a new collection by re-sampling records, with a greater probability of picking those that were wrongly classified earlier. This is where the normalized new_sample_weights are used. After re-sampling is done, assign equal weights to all samples and repeat the procedure to build the second stump.

</details>
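The weight-update arithmetic on a toy example (NumPy assumed; the 8 samples and which ones are misclassified are hypothetical):

```python
import numpy as np

# 8 samples with equal initial weights; the first stump misclassifies samples 2 and 5
n = 8
weights = np.full(n, 1 / n)
misclassified = np.array([False, False, True, False, False, True, False, False])

total_error = weights[misclassified].sum()                     # 0.25
amount_of_say = 0.5 * np.log((1 - total_error) / total_error)  # ≈ 0.55

# Emphasize the samples the stump got wrong, de-emphasize the rest
weights[misclassified] *= np.exp(amount_of_say)
weights[~misclassified] *= np.exp(-amount_of_say)
weights /= weights.sum()                                       # normalize
print(total_error, round(amount_of_say, 3), np.round(weights, 3))
```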
<details>
<summary>Gradient Boost</summary>

- Starts by making a single leaf instead of a stump. For regression, the leaf contains the average of the target variable as the initial prediction.
- Then build a tree (usually with 8 to 32 leaves). All trees are scaled equally (unlike AdaBoost, where trees are weighted during prediction).
- The successive trees are also based on the previous errors, as in AdaBoost.
- Using the initial prediction, calculate the distances from the actual target values, call them residuals, and store them.
- Now use the features to predict the residuals.
  - The average of the values that end up in the same leaf is used as the predicted value for that leaf
  - (this is true when the underlying loss function being minimized is the squared residual).
- Then:
  - new_prediction = initial_prediction + learning_rate * result_from_tree1
  - new_residual = target_value - new_prediction
- new_residual will be smaller than the old residual, so we are taking small steps towards learning to predict target_value accurately.
- Train a new tree on new_residual, add learning_rate * result_from_tree2 to new_prediction to update it, and repeat.

</details>
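A minimal gradient-boosting loop for regression (NumPy/scikit-learn assumed, synthetic data): fit each new tree to the current residuals and add a learning-rate-scaled step.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate, n_trees = 0.1, 100
prediction = np.full_like(y, y.mean())   # the initial single leaf: average target
trees = []

for _ in range(n_trees):
    residual = y - prediction                        # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_leaf_nodes=16).fit(X, residual)
    prediction += learning_rate * tree.predict(X)    # take a small, scaled step
    trees.append(tree)

print(np.mean((y - prediction) ** 2))    # training MSE shrinks as trees are added
```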
- [ ] [Gradient Boost, Learning Rate Shrinkage](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
- [ ] [Gradient Boost Part 1: Regression Main Ideas](https://www.youtube.com/watch?v=3CC4N4z3GJc)
- [ ] [XGBoost Part 1: Regression](https://www.youtube.com/watch?v=OtD8wVaFm6E)
- [ ] [AdaBoost](https://www.youtube.com/watch?v=LsK-xG1cLYA)

### Quantiles
- [ ] [Quantile-Quantile Plots (QQ plots)](https://www.youtube.com/watch?v=okjYjClSjOg)
- [ ] [Quantiles and Percentiles](https://www.youtube.com/watch?v=IFKQLDmRK0Y)

### Clustering
- [ ] [Hierarchical Clustering](https://www.youtube.com/watch?v=7xHsRkOdVwo)
- [ ] [K-means clustering](https://www.youtube.com/watch?v=4b5d3muPQmA)

### Neural Networks
<details>
<summary>CNN notes</summary>

- For data with a grid-like topology (1D audio, 2D images).
- Reduces the number of parameters in a NN through:
  - sparse interactions
  - parameter sharing
- CNNs create spatial features.
- An image passed through a CNN gives rise to a volume. A section of this volume taken through the depth represents features of the same part of the image.
- Each feature map in the same depth layer is generated by the same filter convolving the image (same kernel, shared parameters).
- Equivariant representation:
  - f(g(x)) = g(f(x))
- Types of layers:
  - Convolution layer: the image is convolved using kernels, each applied as a sliding window. Kernel depth = 3 for an RGB image, 1 for grey-scale.
  - Activation layer: applies a non-linearity (e.g. ReLU) element-wise.

Notes V.2
- Problems with plain NNs, and why CNNs?
  - The number of weights rapidly becomes unmanageable for large images. For a 224 x 224 pixel image with 3 color channels, there are around 150,000 weights per neuron in the first fully connected layer.
  - MLPs (multi-layer perceptrons) react differently to an input image and its shifted version: they are not translation invariant.
  - Spatial information is lost when the image is flattened for an MLP. Nodes that are close together are important because they help define the features of an image.
  - CNNs leverage the fact that nearby pixels are more strongly related than distant ones. The influence of nearby pixels is analyzed using filters.
- Filters
  - reduce the number of weights
  - when the location of a feature changes, it does not throw the neural network off
- The convolution layers extract features from the input; the fully connected (dense) layers use the convolution layers' output to generate the final prediction.
- Why do CNNs work efficiently?
  - Parameter sharing: a feature detector in a convolutional layer that is useful in one part of the image is likely useful in other parts too.
  - Sparsity of connections: in each layer, each output value depends only on a small number of inputs.

</details>
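A quick way to see the parameter savings (a sketch assuming PyTorch is available): a single dense layer on a flattened 224x224x3 image versus a 3x3 convolution with the same number of output channels.

```python
import torch.nn as nn

# A dense layer on a flattened 224x224 RGB image needs ~150k weights *per neuron*;
# a conv layer shares one small kernel across the whole image.
dense = nn.Linear(in_features=224 * 224 * 3, out_features=64)
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(dense))  # 9,633,856 = 150,528 weights per neuron * 64 neurons + 64 biases
print(n_params(conv))   # 1,792     = 3*3*3 weights per filter * 64 filters + 64 biases
```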
- [ ] [But what is a neural network? | Chapter 1, Deep learning](https://www.youtube.com/watch?v=aircAruvnKk)
- [ ] [Gradient descent, how neural networks learn | Chapter 2, Deep learning](https://www.youtube.com/watch?v=IHZwWFHWa-w)
- [ ] [What is backpropagation really doing? | Chapter 3, Deep learning](https://www.youtube.com/watch?v=Ilg3gGewQ5U)
- [ ] [Train-test splitting, Stratification](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/)
- [ ] [Regularization, Dropout, Early Stopping](https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/)
- [ ] [Convolution Neural Networks - EXPLAINED](https://www.youtube.com/watch?v=m8pOnJxOcqY)
- [ ] [k-fold Cross-Validation](https://machinelearningmastery.com/k-fold-cross-validation/)
- [ ] Exploding and vanishing gradients
- [ ] [Intro to CNN](https://towardsdatascience.com/simple-introduction-to-convolutional-neural-networks-cdf8d3077bac)

### Activation Function
- ReLU vs Leaky ReLU
- Sigmoid activation
- [ ] [Activation Functions in NN (Sigmoid, tanh, ReLU, Leaky ReLU)](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)
- [ ] [Softmax]()

### Time-series Analysis
- [ ] [Intro to time-series analysis and forecasting](https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775)

### Feature Transformation

- [ ] [correlation - In supervised learning, why is it bad to have correlated features? - Data Science Stack Exchange](https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features)
- [ ] [5.4 Feature Interaction | Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/interaction.html)
- [ ] [Feature Transformation for Machine Learning, a Beginners Guide | by Rebecca Vickery | vickdata | Medium](https://medium.com/vickdata/four-feature-types-and-how-to-transform-them-for-machine-learning-8693e1c24e80)
- [ ] [Feature Transformation. How to handle different feature types… | by Ali Masri | Towards Data Science](https://towardsdatascience.com/apache-spark-mllib-tutorial-7aba8a1dce6e)

### Python Pandas
- [ ] [Python Pandas Tutorial (Part 8): Grouping and Aggregating - Analyzing and Exploring Your Data](https://www.youtube.com/watch?v=txMdrV1Ut64)
- [ ] [Python Pandas Tutorial (Part 2): DataFrame and Series Basics - Selecting Rows and Columns](https://www.youtube.com/watch?v=zmdjNSmRXF4&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS&index=2)