├── README.md └── docs ├── Answers.md ├── Bayesian_statistics.md ├── Model_selection.md ├── MySQL ├── MySQL-Guidelines.md ├── MySQL-Quick-Tutorial.md ├── image.md └── schema.png ├── NLP.md ├── Performance_metrics.md ├── Preprocessing.md ├── Probabilistic_graphical_model.md ├── Supervised_learning.md └── Unsupervised-Learning.md /README.md: -------------------------------------------------------------------------------- 1 | # ML-Notes 2 | 3 | SOURCES: 4 | 5 | * Quora 6 | * Wikipedia 7 | * Cross Validated 8 | * Springboard 9 | 10 | 11 | ## TOPICS 12 | 13 | * [Model Selection](./docs/Model_selection.md) 14 | * [Supervised Learning](./docs/Supervised_learning.md) 15 | * [Performance Metrics](./docs/Performance_metrics.md) 16 | * [Bayesian Statistics](./docs/Bayesian_statistics.md) 17 | * [Probabilistic Graphical Models](./docs/Probabilistic_graphical_model.md) 18 | * [Good Answers from forums](./docs/Answers.md) 19 | * [Data Preprocessing](./docs/Preprocessing.md) 20 | * [Unsupervised Learning](./docs/Unsupervised-Learning.md) 21 | * [NLP](./docs/NLP.md) 22 | 23 | 24 | ## Discriminative and Generative models 25 | 26 | * [Generative vs. discriminative Stackoverflow](https://stats.stackexchange.com/questions/12421/generative-vs-discriminative) 27 | * [Andrew Ng Generative Learning Algorithms](https://www.youtube.com/watch?v=z5UQyCESW64) 28 | * [Generative vs Discriminative Good explanation](https://www.youtube.com/watch?v=OWJ8xVGRyFA) 29 | 30 | -------------------------------------------------------------------------------- /docs/Answers.md: -------------------------------------------------------------------------------- 1 | ## From Cross Validated 2 | 3 | [What is the difference between “likelihood” and “probability”?](https://stats.stackexchange.com/questions/2641/what-is-the-difference-between-likelihood-and-probability). 4 | [Is there a standard and accepted method for selecting the number of layers, and the number of nodes in each layer, in a feed-forward neural network? I'm interested in automated ways of building neural networks.](http://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw). 5 | [What does the hidden layer in a neural network compute?](http://stats.stackexchange.com/questions/63152/what-does-the-hidden-layer-in-a-neural-network-compute?rq=1). 6 | [What does O(log n) mean exactly?](http://stackoverflow.com/questions/2307283/what-does-olog-n-mean-exactly?noredirect=1&lq=1). 7 | [Bayesian and frequentist reasoning in plain English](http://stats.stackexchange.com/questions/22/bayesian-and-frequentist-reasoning-in-plain-english). 8 | [How to choose the number of hidden layers and nodes in a feedforward neural network?](http://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw). 9 | [Explaining to laypeople why bootstrapping works](http://stats.stackexchange.com/questions/26088/explaining-to-laypeople-why-bootstrapping-works). 10 | [Can someone help to explain the difference between independent and random?](http://stats.stackexchange.com/questions/231425/can-someone-help-to-explain-the-difference-between-independent-and-random?noredirect=1&lq=1). 11 | [When should I use lasso vs ridge?](http://stats.stackexchange.com/questions/866/when-should-i-use-lasso-vs-ridge). 12 | [Relationship between SVD and PCA. How to use SVD to perform PCA?](http://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca). 
13 | [How to reverse PCA and reconstruct original variables from several principal](http://stats.stackexchange.com/questions/229092/how-to-reverse-pca-and-reconstruct-original-variables-from-several-principal-com?rq=1). 14 | [Bagging, boosting and stacking in machine learning](http://stats.stackexchange.com/questions/18891/bagging-boosting-and-stacking-in-machine-learning). 15 | [In linear regression, when is it appropriate to use the log of an independent variable instead of the actual values?](http://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va). 16 | [How to interpret a QQ plot](http://stats.stackexchange.com/questions/101274/how-to-interpret-a-qq-plot). 17 | 18 | 19 | 20 | ## Other Sources 21 | [Complexity of Python Operations](https://www.ics.uci.edu/~pattis/ICS-33/lectures/complexitypython.txt). 22 | [Wiki - Complexity of Python Operations](https://wiki.python.org/moin/TimeComplexity). 23 | [Everything about R^2](http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit) . 24 | 25 | 26 | ## Sebastian Raschka 27 | [What-is-the-role-of-the-activation-function-in-a-neural-network](https://www.quora.com/What-is-the-role-of-the-activation-function-in-a-neural-network). 28 | [What's the difference between gradient descent and stochastic gradient descent?](https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent/answer/Sebastian-Raschka-1?srid=9yUC). 29 | [How do I select SVM kernels?](https://www.quora.com/How-do-I-select-SVM-kernels/answer/Sebastian-Raschka-1?srid=9yUC). 30 | [What is the best visual explanation for the back propagation algorithm for neural networks?](https://www.quora.com/What-is-the-best-visual-explanation-for-the-back-propagation-algorithm-for-neural-networks/answer/Sebastian-Raschka-1?srid=9yUC). 31 | [How do I debug an artificial neural network algorithm?](https://www.quora.com/How-do-I-debug-an-artificial-neural-network-algorithm/answer/Sebastian-Raschka-1?srid=9yUC). 32 | 33 | 34 | ## From Quora 35 | [When should we use logistic regression and Neural Network?](https://www.quora.com/When-should-we-use-logistic-regression-and-Neural-Network/answer/Sebastian-Raschka-1?srid=9yUC). 36 | [What are Kernels in Machine Learning and SVM?](https://www.quora.com/What-are-Kernels-in-Machine-Learning-and-SVM) 37 | [Supervised Learning Topic FAQ](https://www.quora.com/topic/Supervised-Learning/faq). 38 | [What are the advantages of logistic regression over decision trees?](https://www.quora.com/What-are-the-advantages-of-logistic-regression-over-decision-trees). 39 | 40 | -------------------------------------------------------------------------------- /docs/Bayesian_statistics.md: -------------------------------------------------------------------------------- 1 | ## Bayes’ Theorem 2 | 3 | [More reading: An Intuitive (and Short) Explanation of Bayes’ Theorem (BetterExplained)](https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/) 4 | 5 | Bayes’ Theorem gives you the posterior probability of an event given what is known as prior knowledge. 6 | 7 | Mathematically, it’s expressed as the true positive rate of a condition sample divided by the sum of the false positive rate of the population and the true positive rate of a condition. 
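In symbols, the posterior is P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|not A)P(not A)]. Below is a minimal Python sketch of that ratio; the function name and the example prevalence and test rates are invented purely for illustration.

```python
def posterior(prior, p_pos_given_condition, p_pos_given_no_condition):
    """Bayes' theorem for a binary condition and a binary test result."""
    true_pos = p_pos_given_condition * prior                # P(B|A) * P(A)
    false_pos = p_pos_given_no_condition * (1 - prior)      # P(B|not A) * P(not A)
    return true_pos / (true_pos + false_pos)

# Hypothetical disease: 1% prevalence, 90% sensitivity, 8% false positive rate.
print(posterior(prior=0.01,
                p_pos_given_condition=0.90,
                p_pos_given_no_condition=0.08))  # ~0.102: a positive test leaves only ~10% probability
```

The worked flu-test example that follows plugs its own numbers into exactly this ratio.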
Say you had a 60% chance of actually having the flu after a flu test, but out of people who had the flu, the test will be false 50% of the time, and the overall population only has a 5% chance of having the flu. Would you actually have a 60% chance of having the flu after having a positive test?

Bayes' Theorem says no. It says that your chance of actually having the flu is (0.6 * 0.05) (true positive rate of the condition sample) divided by [(0.6 * 0.05) + (0.5 * 0.95) (false positive rate of the population)] = 0.0594, i.e. about a 5.94% chance of having the flu.

Bayes' Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier. That's something important to consider when you're faced with machine learning interview questions.

### Maximum likelihood estimation

Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters.

For example, one may be interested in the heights of adult female penguins, but be unable to measure the height of every single penguin in a population due to cost or time constraints. Assuming that the heights are normally distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the heights of some sample of the overall population. MLE would accomplish this by taking the mean and variance as parameters and finding the particular parameter values that make the observed results the most probable given the model.

Suppose X1, X2, . . . , Xn have joint density denoted as

fθ(x1, x2, . . . , xn) = f(x1, x2, . . . , xn | θ)

Given observed values X1 = x1, X2 = x2, . . . , Xn = xn, the likelihood of θ is the function

lik(θ) = f(x1, x2, . . . , xn | θ)

If the distribution is discrete, f will be the frequency distribution function.
In words: lik(θ) = the probability of observing the given data, as a function of θ.

The maximum likelihood estimate (MLE) of θ is the value of θ that maximises lik(θ): it is the value that makes the observed data the "most probable".

Rather than maximising this product, which can be quite tedious, we often use the fact that the logarithm is an increasing function, so it is equivalent to maximise the log likelihood.

#### Discrete distribution, finite parameter space

Suppose one wishes to determine just how biased an unfair coin is. Call the probability of tossing a HEAD p. The goal then becomes to determine p.

Example:
Suppose the coin is tossed 80 times: i.e., the sample might be something like x1 = H, x2 = T, …, x80 = T, and the count of the number of HEADS "H" is observed.

The probability of tossing TAILS is 1 − p (so here p is θ above). Suppose the outcome is 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p = 1/3, one which gives HEADS with probability p = 1/2, and another which gives HEADS with probability p = 2/3. The coins have lost their labels, so which one it was is unknown. Using maximum likelihood estimation, the coin that has the largest likelihood can be found, given the data that were observed.
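A quick numeric check of that comparison, as a sketch using only the Python standard library; the counts (49 heads, 31 tails) and the three candidate values of p come from the example above:

```python
from math import comb

n, heads = 80, 49

def binom_likelihood(p, n=n, k=heads):
    """P(k heads in n tosses | heads probability p) under the binomial model."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

for p in (1/3, 1/2, 2/3):
    print(f"p = {p:.3f}  ->  likelihood = {binom_likelihood(p):.6f}")
# The p = 2/3 coin has the largest likelihood, so it is the maximum likelihood choice.
```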
By using the probability mass function of the binomial distribution with sample size equal to 80, number successes equal to 49 but different values of p (the "probability of success"), the likelihood function (defined below) takes one of three values: 47 | 48 | ![](https://wikimedia.org/api/rest_v1/media/math/render/svg/36bc1e5127816685c557ccd68d4f4081d0b7f9fa) 49 | 50 | 51 | 52 | -------------------------------------------------------------------------------- /docs/Model_selection.md: -------------------------------------------------------------------------------- 1 | ## Algorithms for hyperparameter optimization 2 | 3 | 4 | 5 | ### Grid search 6 | 7 | The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive searching through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set[2] or evaluation on a held-out validation set. 8 | 9 | Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search. 10 | 11 | For example, a typical soft-margin SVM classifier equipped with an RBF kernel has at least two hyperparameters that need to be tuned for good performance on unseen data: a regularization constant C and a kernel hyperparameter γ. Both parameters are continuous, so to perform grid search, one selects a finite set of "reasonable" values for each, say 12 | 13 | ![](https://wikimedia.org/api/rest_v1/media/math/render/svg/4124e15320f26a727f12f02d9bc61edc512878fd) 14 | Grid search then trains an SVM with each pair (C, γ) in the Cartesian product of these two sets and evaluates their performance on a held-out validation set (or by internal cross-validation on the training set, in which case multiple SVMs are trained per pair). Finally, the grid search algorithm outputs the settings that achieved the highest score in the validation procedure. 15 | 16 | Grid search suffers from the curse of dimensionality, but is often embarrassingly parallel because typically the hyperparameter settings it evaluates are independent of each other 17 | 18 | 19 | ### Random search 20 | 21 | Since grid searching is an exhaustive and therefore potentially expensive method, several alternatives have been proposed. In particular, a randomized search that simply samples parameter settings a fixed number of times has been found to be more effective in high-dimensional spaces than exhaustive search. This is because oftentimes, it turns out some hyperparameters do not significantly affect the loss. Therefore, having randomly dispersed data gives more "textured" data than an exhaustive search over parameters that ultimately do not affect the loss. 22 | 23 | 24 | 25 | ## Curse of dimensionality 26 | 27 | Let's say you have a straight line 100 yards long and you dropped a penny somewhere on it. It wouldn't be too hard to find. You walk along the line and it takes two minutes. 28 | 29 | Now let's say you have a square 100 yards on each side and you dropped a penny somewhere on it. It would be pretty hard, like searching across two football fields stuck together. It could take days. 30 | 31 | Now a cube 100 yards across. That's like searching a 30-story building the size of a football stadium. Ugh. 
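The same blow-up is easy to put in numbers: if a search needs, say, 100 steps per dimension, the number of cells to cover grows as 100^d. A small illustrative sketch (the step count is arbitrary):

```python
# Number of cells to search if each dimension is split into 100 steps.
steps_per_dim = 100
for d in range(1, 6):
    print(f"{d} dimension(s): {steps_per_dim ** d:,} cells")
# 1 -> 100;  2 -> 10,000;  3 -> 1,000,000;  ...  exponential in d
```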
The difficulty of searching through the space gets a *lot* harder as you have more dimensions. It is easy to miss this when the problem is only written down as a formula, because adding a dimension just looks like adding one more variable of the same "width"; in reality, the volume you have to cover grows exponentially with the number of dimensions. That is the curse of dimensionality, and it earned a name because it is unintuitive, important, and yet simple.

## Bias Variance Trade-off

More reading: [Bias-Variance Tradeoff (Wikipedia)](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff)

Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm you're using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.

Variance is error due to too much complexity in the learning algorithm you're using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You'll be carrying too much noise from your training data for your model to be very useful for your test data.

The bias-variance decomposition essentially decomposes the learning error from any algorithm into the bias, the variance, and a bit of irreducible error due to noise in the underlying dataset. Essentially, if you make the model more complex and add more variables, you'll lose bias but gain some variance; in order to get the optimally reduced amount of error, you'll have to trade off bias and variance. You don't want either high bias or high variance in your model.

![Graphical illustration of bias and variance. From Understanding the Bias-Variance Tradeoff, by Scott Fortmann-Roe.](http://www.kdnuggets.com/wp-content/uploads/bias-and-variance.jpg)

## Lasso

The lasso (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.

It penalizes the absolute size of the regression coefficients.

![Lasso eqn](https://wikimedia.org/api/rest_v1/media/math/render/svg/2904b78ec712617fdef0bd35e28442b9c1b35b03)

Here *t* is a prespecified free parameter that determines the amount of regularisation.

![Lasso vs Ridge](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/L1_and_L2_balls.svg/1600px-L1_and_L2_balls.svg.png)

## Ridge Regression

Motivation: too many predictors.

It is not unusual to see the number of input variables greatly exceed the number of observations, e.g. micro-array data analysis or environmental pollution studies.

With many predictors, fitting the full model without penalization will result in large prediction intervals, and the least-squares regression estimator may not uniquely exist.
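To see the practical difference between the two penalties, here is a minimal scikit-learn sketch (the synthetic data and the alpha values are arbitrary choices for illustration): the L1 penalty drives some coefficients exactly to zero, while the L2 penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 10)                                    # 10 features, only 3 informative
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + 0.5 * rng.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("lasso coefficients:", np.round(lasso.coef_, 2))    # noise features typically driven exactly to zero
print("ridge coefficients:", np.round(ridge.coef_, 2))    # all coefficients small but non-zero
```

The elastic net, described next, linearly combines these two penalties.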
## Elastic Net

Elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. The elastic net method overcomes the limitations of the LASSO (least absolute shrinkage and selection operator) method, which uses a penalty function based on

![Lasso Penalty](https://wikimedia.org/api/rest_v1/media/math/render/svg/5a188f4b162086fb06a4485f3336baefc22e18b3)

Use of this penalty function has several limitations. For example, in the "large p, small n" case (high-dimensional data with few examples), the LASSO selects at most n variables before it saturates. Also, if there is a group of highly correlated variables, then the LASSO tends to select one variable from the group and ignore the others. To overcome these limitations, the elastic net adds a quadratic part to the penalty, which when used alone is ridge regression (known also as Tikhonov regularization). The estimates from the elastic net method are defined by

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/48b3ad7bcf1954b906d16dde0c1d3b65ca8d45aa)
-------------------------------------------------------------------------------- /docs/MySQL/MySQL-Guidelines.md: --------------------------------------------------------------------------------

**VARCHAR(30)**

Characters with an expected max length of 30

**NOT NULL**

Must contain a value

**NULL**

Doesn't require a value

**CHAR(2)**

Contains exactly 2 characters

**DEFAULT "PA"**

Receives a default value of PA

**MEDIUMINT**

Value no greater than 8,388,607

**UNSIGNED**

Can't contain a negative value

**DATE**

Stores a date in the format YYYY-MM-DD

**ENUM('M', 'F')**

Can contain either an M or an F

**TIMESTAMP**

Stores a date and time in the format YYYY-MM-DD HH:MM:SS

**FLOAT**

A number with decimal places, with a value no bigger than 1.1E38 or smaller than -1.1E38

**INT**

Contains a number without decimals

**AUTO_INCREMENT**

Generates a number automatically that is one greater than the previous row

**PRIMARY KEY**

Unique ID that is assigned to this row of data

I. Uniquely identifies a row or record

II. Each Primary Key must be unique to the row

III. Must be given a value when the row is created, and that value cannot be NULL

IV. The original value cannot be changed, and it should be short

V. It's usually best to auto-increment the value of the key



## Atomic Data & Table Templating ##

As your database increases in size, you are going to want everything to be organized so that it can perform your queries quickly. If your tables are set up properly, your database will be able to crank through hundreds of thousands of bits of data in seconds.

**How do you know how best to set up your tables, though? Just follow some simple rules:**
1. Every table should focus on describing just one thing. For example, a Customer table would have name, age, location, and contact information. It shouldn't contain lists of anything, such as interests, job history, past addresses, or products purchased. After you decide what one thing your table will describe, decide what attributes you need to describe that thing; refer to the customer example above.

2. Write out all the ways to describe the thing, and if any of those attributes requires multiple inputs, pull them out and create a new table for them. For example, a list of past employers.

3. Once your table values have been broken down, we refer to these values as being atomic. Be careful not to break them down to a point at which the data is harder to work with. It might make sense to create a separate column for the house number, street name, apartment number, etc., but by doing so you may make more work for yourself. That decision is up to you.

4. Some additional rules to help you make your data atomic: don't have multiple columns with the same sort of information. For example, if you wanted to include an employment history, you shouldn't create job1, job2, and job3 columns; make a new table with that data instead.

5. Don't include multiple values in one cell. For example, you shouldn't create a cell named jobs and then give it the value: McDonalds, Radio Shack, Walmart.


**What does normalized mean?**

Normalized just means that the database is organized in a way that is considered standardized by professional SQL programmers, so if someone new needs to work with the tables, they'll be able to understand them easily. Another benefit of normalizing your tables is that your queries will run much quicker, and the chance that your database will become corrupted goes down.

**What are the rules for creating normalized tables?**

The tables and the variables defined in them must be atomic. Each row must have a Primary Key defined. Like your social security number identifies you, the Primary Key will identify your row.

You also want to eliminate using the same values repeatedly in your columns. For example, you wouldn't want a column named instructors in which you hand-typed their names each time. Instead, you should create an instructor table and link to its key.

Every variable in a table should directly relate to the primary key. For example, you could create tables for all of your customers' potential states, cities and zip codes, instead of including them in the main customer table, and then link them using foreign keys. Note: many people think this last rule is overkill and can be ignored.

No two columns should have a relationship in which, when one changes, another in the same table must also change. This is called a dependency. Note: this is another rule that is sometimes ignored.
------------ Numeric Types ------------

TINYINT: A number with a value no bigger than 127 or smaller than -128
SMALLINT: A number with a value no bigger than 32,767 or smaller than -32,768
MEDIUMINT: A number with a value no bigger than 8,388,607 or smaller than -8,388,608
INT: A number with a value no bigger than 2^31 - 1 or smaller than -2^31
BIGINT: A number with a value no bigger than 2^63 - 1 or smaller than -2^63
FLOAT: A number with decimal places, with a value no bigger than 1.1E38 or smaller than -1.1E38
DOUBLE: A number with decimal places, with a value no bigger than 1.7E308 or smaller than -1.7E308

------------ String Types ------------

CHAR: A character string with a fixed length
VARCHAR: A character string with a variable length
BLOB: Can contain up to 2^16 bytes of data
ENUM: A character string that has a limited number of total values, which you must define.
SET: A list of legal possible character strings. Unlike ENUM, a SET can contain multiple values, in comparison to the one legal value with ENUM.

------------ Date & Time Types ------------

DATE: A date value with the format (YYYY-MM-DD)
TIME: A time value with the format (HH:MM:SS)
DATETIME: A date and time value with the format (YYYY-MM-DD HH:MM:SS)
TIMESTAMP: A date and time value with the format (YYYYMMDDHHMMSS)
YEAR: A year value with the format (YYYY)


There are many math functions built into MySQL. Note that if a column or table name is a reserved word (such as RANGE), it has to be quoted.

You can find all reserved words here http://dev.mysql.com/doc/mysqld-version-reference/en/mysqld-version-reference-reservedwords-5-5.html

## The Built-in Numeric Functions ##

ABS(x) : Absolute Value : Returns the absolute value of the variable x.

ACOS(x), ASIN(x), ATAN(x), ATAN2(x,y), COS(x), COT(x), SIN(x), TAN(x) : Trigonometric Functions : They are used to relate the angles of a triangle to the lengths of the sides of a triangle.

AVG(column_name) : Average of Column : Returns the average of all values in a column. SELECT AVG(column_name) FROM table_name;

CEILING(x) : Returns the smallest integer not less than x.

COUNT(column_name) : Count : Returns the number of non-NULL values in the column. SELECT COUNT(column_name) FROM table_name;

DEGREES(x) : Returns the value of x, converted from radians to degrees.

EXP(x) : Returns e^x

FLOOR(x) : Returns the largest integer not greater than x

LOG(x) : Returns the natural logarithm of x

LOG10(x) : Returns the logarithm of x to the base 10

MAX(column_name) : Maximum Value : Returns the maximum value in the column. SELECT MAX(column_name) FROM table_name;

MIN(column_name) : Minimum : Returns the minimum value in the column.
SELECT MIN(column_name) FROM table_name; 174 | 175 | MOD(x, y) : Modulus : Returns the remainder of a division between x and y 176 | 177 | PI() : Returns the value of PI 178 | 179 | POWER(x, y) : Returns x ^ Y 180 | 181 | RADIANS(x) : Returns the value of x, converted from degrees to radians 182 | 183 | RAND() : Random Number : Returns a random number between the values of 0.0 and 1.0 184 | 185 | ROUND(x, d) : Returns the value of x, rounded to d decimal places 186 | 187 | SQRT(x) : Square Root : Returns the square root of x 188 | 189 | STD(column_name) : Standard Deviation : Returns the Standard Deviation of values in the column. SELECT STD(column_name) FROM table_name; 190 | 191 | SUM(column_name) : Summation : Returns the sum of values in the column. SELECT SUM(column_name) FROM table_name; 192 | 193 | TRUNCATE(x) : Returns the value of x, truncated to d decimal places 194 | -------------------------------------------------------------------------------- /docs/MySQL/MySQL-Quick-Tutorial.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | CREDITS: Derek Banas-https://www.youtube.com/watch?v=yPu6qV5byu4 6 | 7 | 8 | ## MySQL Tutorial A-Z 9 | 10 | 11 | GOAL : Create the following tables(students,scores,test,absences,classes), fill them random data and perform different operations. 12 | 13 | DESCRIBE students 14 | 15 | | Field | Type | Null | Key | Default | Extra | 16 | |--------------|:----------------------|:-----|:----|:-----------------|:-----------------------------| 17 | | first_name | varchar(30) | NO | | NULL | | 18 | | last_name | varchar(30) | NO | | NULL | | 19 | | email | varchar(60) | YES | | NULL | | 20 | | street | varchar(50) | NO | | NULL | | 21 | | city | varchar(40) | NO | | NULL | | 22 | | state | char(2) | NO | | PA | | 23 | | zip | mediumint(8) unsigned | NO | | NULL | | 24 | | phone | varchar(20) | NO | | NULL | | 25 | | birth_date | date | NO | | NULL | | 26 | | sex | enum('M','F') | NO | | NULL | | 27 | | date_entered | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP | 28 | | lunch_cost | float | YES | | NULL | | 29 | | student_id | int(10) unsigned | NO | PRI | NULL | auto_increment | 30 | 31 | 32 | DESCRIBE scores; 33 | 34 | | Field | Type | Null | Key | Default | Extra | 35 | |:-----------|:-----------------|:-----|:----|:--------|:------| 36 | | student_id | int(10) unsigned | NO | PRI | NULL | | 37 | | test_id | int(10) unsigned | NO | PRI | NULL | | 38 | | score | int(11) | NO | | NULL | | 39 | 40 | 41 | 42 | DESCRIBE test; 43 | 44 | | Field | Type | Null | Key | Default | Extra | 45 | |:---------|:-----------------|:-----|:----|:--------|:---------------| 46 | | date | date | NO | | NULL | | 47 | | type | enum('T','Q') | NO | | NULL | | 48 | | maxscore | int(11) | NO | | NULL | | 49 | | class_id | int(10) unsigned | NO | | NULL | | 50 | | test_id | int(10) unsigned | NO | PRI | NULL | auto_increment | 51 | 52 | 53 | 54 | DESCRIBE absences; 55 | 56 | | Field | Type | Null | Key | Default | Extra | 57 | |:-----------|:-----------------|:-----|:----|:--------|:------| 58 | | student_id | int(10) unsigned | NO | PRI | NULL | | 59 | | test_taken | char(1) | NO | | F | | 60 | | date | date | NO | PRI | NULL | | 61 | 62 | 63 | 64 | DESCRIBE classes; 65 | 66 | | Field | Type | Null | Key | Default | Extra | 67 | |:---------|:-----------------|:-----|:----|:--------|:---------------| 68 | | name | varchar(30) | NO | | NULL | | 69 | | class_id | int(10) unsigned | NO | PRI | NULL | auto_increment | 
70 | 71 | 72 | 73 | 74 | ## START 75 | 76 | 77 | **Logging in to MySQL** 78 | 79 | mysql -u root -p 80 | 81 | **Quit** 82 | 83 | exit 84 | 85 | **Display all databases** 86 | 87 | show databases; 88 | 89 | 90 | **Create a database** 91 | 92 | CREATE DATABASE test2 93 | 94 | **Make test2 the active database** 95 | 96 | USE test2 97 | 98 | **Show the currently selected database** 99 | 100 | SELECT DATABASE() 101 | 102 | **Delete the database** 103 | 104 | DROP DATABASE IF EXISTS test2 105 | 106 | 107 | **Add Table in Database** 108 | ``` 109 | CREATE TABLE student( 110 | first_name VARCHAR(30) NOT NULL, 111 | last_name VARCHAR(30) NOT NULL, 112 | email VARCHAR(60) NULL, 113 | street VARCHAR(50) NOT NULL, 114 | city VARCHAR(40) NOT NULL, 115 | state CHAR(2) NOT NULL DEFAULT "PA", 116 | zip MEDIUMINT UNSIGNED NOT NULL, 117 | phone VARCHAR(20) NOT NULL, 118 | birth_date DATE NOT NULL, 119 | sex ENUM('M', 'F') NOT NULL, 120 | date_entered TIMESTAMP, 121 | lunch_cost FLOAT NULL, 122 | student_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY 123 | ); 124 | ``` 125 | 126 | 127 | **Show the table set up** 128 | 129 | DESCRIBE student 130 | 131 | **Inserting Data into a Table** 132 | 133 | ``` 134 | INSERT INTO student VALUES('Harry', 'Truman', 'htruman@aol.com', 135 | '202 South St', 'Vancouver', 'WA', 98660, '792-223-9810', "1946-1-24", 136 | 'M', NOW(), 3.50, NULL); 137 | 138 | INSERT INTO student VALUES('Shelly', 'Johnson', 'sjohnson@aol.com', 139 | '9 Pond Rd', 'Sparks', 'NV', 89431, '792-223-6734', "1970-12-12", 140 | 'F', NOW(), 3.50, NULL); 141 | 142 | INSERT INTO student VALUES('Bobby', 'Briggs', 'bbriggs@aol.com', 143 | '14 12th St', 'San Diego', 'CA', 92101, '792-223-6178', "1967-5-24", 144 | 'M', NOW(), 3.50, NULL); 145 | 146 | INSERT INTO student VALUES('Donna', 'Hayward', 'dhayward@aol.com', 147 | '120 16th St', 'Davenport', 'IA', 52801, '792-223-2001', "1970-3-24", 148 | 'F', NOW(), 3.50, NULL); 149 | 150 | INSERT INTO student VALUES('Audrey', 'Horne', 'ahorne@aol.com', 151 | '342 19th St', 'Detroit', 'MI', 48222, '792-223-2001', "1965-2-1", 152 | 'F', NOW(), 3.50, NULL); 153 | 154 | INSERT INTO student VALUES('James', 'Hurley', 'jhurley@aol.com', 155 | '2578 Cliff St', 'Queens', 'NY', 11427, '792-223-1890', "1967-1-2", 156 | 'M', NOW(), 3.50, NULL); 157 | 158 | INSERT INTO student VALUES('Lucy', 'Moran', 'lmoran@aol.com', 159 | '178 Dover St', 'Hollywood', 'CA', 90078, '792-223-9678', "1954-11-27", 160 | 'F', NOW(), 3.50, NULL); 161 | 162 | INSERT INTO student VALUES('Tommy', 'Hill', 'thill@aol.com', 163 | '672 High Plains', 'Tucson', 'AZ', 85701, '792-223-1115', "1951-12-21", 164 | 'M', NOW(), 3.50, NULL); 165 | 166 | INSERT INTO student VALUES('Andy', 'Brennan', 'abrennan@aol.com', 167 | '281 4th St', 'Jacksonville', 'NC', 28540, '792-223-8902', "1960-12-27", 168 | 'M', NOW(), 3.50, NULL); 169 | ``` 170 | 171 | **Show all student Data** 172 | 173 | SELECT * from student 174 | 175 | 176 | **Create a Table for classes** 177 | 178 | ``` 179 | CREATE TABLE class( 180 | name VARCHAR(30) NOT NULL, 181 | class_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY); 182 | ``` 183 | 184 | **Show all Tables** 185 | 186 | SHOW tables 187 | 188 | **Insert all possible classes** 189 | 190 | ``` 191 | INSERT INTO class VALUES 192 | ('English', NULL), ('Speech', NULL), ('Literature', NULL), 193 | ('Algebra', NULL), ('Geometry', NULL), ('Trigonometry', NULL), 194 | ('Calculus', NULL), ('Earth Science', NULL), ('Biology', NULL), 195 | ('Chemistry', NULL), ('Physics', NULL), ('History', NULL), 196 | 
('Art', NULL), ('Gym', NULL); 197 | ``` 198 | 199 | **Create Tables test,score and absence** 200 | 201 | 202 | ![SCHEMA](schema.png?raw=true) 203 | 204 | 205 | ``` 206 | CREATE TABLE test( 207 | date DATE NOT NULL, 208 | type ENUM('T', 'Q') NOT NULL, 209 | class_id INT UNSIGNED NOT NULL, 210 | test_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY); 211 | ``` 212 | ``` 213 | CREATE TABLE score( 214 | student_id INT UNSIGNED NOT NULL, 215 | event_id INT UNSIGNED NOT NULL, 216 | score INT NOT NULL, 217 | PRIMARY KEY(event_id, student_id)); 218 | ``` 219 | 220 | ``` 221 | CREATE TABLE absence( 222 | student_id INT UNSIGNED NOT NULL, 223 | date DATE NOT NULL, 224 | PRIMARY KEY(student_id, date)); 225 | ``` 226 | 227 | 228 | 229 | We combined the event and student id to make sure we don't have 230 | duplicate scores and it makes it easier to change scores 231 | 232 | Since neither the event or the student ids are unique on their 233 | own we are able to make them unique by combining them. 234 | 235 | 236 | Again we combine 2 items that aren't unique to generate a 237 | unique key. 238 | 239 | **ADD a new column(max score column to test)** 240 | ALTER TABLE test ADD maxscore INT NOT NULL AFTER type; 241 | DESCRIBE test; 242 | 243 | **Change the name of a column(event_id in score to test_id)** 244 | 245 | ALTER TABLE score CHANGE event_id test_id 246 | INT UNSIGNED NOT NULL; 247 | DESCRIBE test; 248 | 249 | 250 | **Insert Tests** 251 | 252 | ``` 253 | INSERT INTO test VALUES 254 | ('2014-8-25', 'Q', 15, 1, NULL), 255 | ('2014-8-27', 'Q', 15, 1, NULL), 256 | ('2014-8-29', 'T', 30, 1, NULL), 257 | ('2014-8-29', 'T', 30, 2, NULL), 258 | ('2014-8-27', 'Q', 15, 4, NULL), 259 | ('2014-8-29', 'T', 30, 4, NULL); 260 | 261 | ``` 262 | 263 | SELECT * FROM test; 264 | 265 | 266 | 267 | 268 | **Enter student scores** 269 | 270 | ``` 271 | INSERT INTO score VALUES 272 | (1, 1, 15), 273 | (1, 2, 14), 274 | (1, 3, 28), 275 | (1, 4, 29), 276 | (1, 5, 15), 277 | (1, 6, 27), 278 | (2, 1, 15), 279 | (2, 2, 14), 280 | (2, 3, 26), 281 | (2, 4, 28), 282 | (2, 5, 14), 283 | (2, 6, 26), 284 | (3, 1, 14), 285 | (3, 2, 14), 286 | (3, 3, 26), 287 | (3, 4, 26), 288 | (3, 5, 13), 289 | (3, 6, 26), 290 | (4, 1, 15), 291 | (4, 2, 14), 292 | (4, 3, 27), 293 | (4, 4, 27), 294 | (4, 5, 15), 295 | (4, 6, 27), 296 | (5, 1, 14), 297 | (5, 2, 13), 298 | (5, 3, 26), 299 | (5, 4, 27), 300 | (5, 5, 13), 301 | (5, 6, 27), 302 | (6, 1, 13), 303 | (6, 2, 13), 304 | # Missed this day (6, 3, 24), 305 | (6, 4, 26), 306 | (6, 5, 13), 307 | (6, 6, 26), 308 | (7, 1, 13), 309 | (7, 2, 13), 310 | (7, 3, 25), 311 | (7, 4, 27), 312 | (7, 5, 13), 313 | # Missed this day (7, 6, 27), 314 | (8, 1, 14), 315 | # Missed this day (8, 2, 13), 316 | (8, 3, 26), 317 | (8, 4, 23), 318 | (8, 5, 12), 319 | (8, 6, 24), 320 | (9, 1, 15), 321 | (9, 2, 13), 322 | (9, 3, 28), 323 | (9, 4, 27), 324 | (9, 5, 14), 325 | (9, 6, 27), 326 | (10, 1, 15), 327 | (10, 2, 13), 328 | (10, 3, 26), 329 | (10, 4, 27), 330 | (10, 5, 12), 331 | (10, 6, 22); 332 | ``` 333 | 334 | **Fill absences Table** 335 | 336 | ``` 337 | INSERT INTO absence VALUES 338 | (6, '2014-08-29'), 339 | (7, '2014-08-29'), 340 | (8, '2014-08-27'); 341 | ``` 342 | 343 | Now we are done filling all the data. 
**Select specific columns from a table**

```
SELECT first_name, last_name
FROM student;
```

**Rename Tables**

```
RENAME TABLE
absence to absences,
class to classes,
score to scores,
student to students,
test to tests;
```


### USE WHERE

**Show every student from the state of Washington**

```
SELECT first_name, last_name, state
FROM students
WHERE state="WA";
```

**Show every student born in or after 1965**

```
SELECT *
FROM students
WHERE YEAR(birth_date) >= 1965;
```

a. You can compare values with =, >, <, >=, <=, !=

b. To get the month, day or year of a date use MONTH(), DAY(), or YEAR()

**Show every student born in February or living in California**

```
SELECT *
FROM students
WHERE MONTH(birth_date) = 2 OR state="CA";
```

a. AND, && : Returns a true value if both conditions are true

b. OR, || : Returns a true value if either condition is true

c. NOT, ! : Returns a true value if the operand is false

**Show every student born on or after the 12th of the month AND living in California or Nevada**

```
SELECT *
FROM students
WHERE DAY(birth_date) >= 12 && (state="CA" || state="NV");
```

**Return rows where a specific column (last_name) is empty**

```
SELECT *
FROM students
WHERE last_name IS NULL;
```

**Sort results by a specific column (last_name)**

```
SELECT *
FROM students
ORDER BY last_name;
```

Add ASC or DESC to specify the sort order.

## LIMIT

**Show the first 5 results**

```
SELECT *
FROM students
LIMIT 5;
```

**Skip the first 5 results and show the next 10**

```
SELECT *
FROM students
LIMIT 5, 10;
```

## CONCAT

**Concatenate first name and last name**

```
SELECT CONCAT(first_name, " ", last_name) AS 'Name',
CONCAT(city, ", ", state) AS 'Hometown'
FROM students;
```

a. CONCAT is used to combine results
b.
AS provides for a way to define the column name 448 | 449 | 450 | **Match any first name that starts with a D, or ends with a n** 451 | 452 | ``` 453 | SELECT last_name, first_name 454 | FROM students 455 | WHERE first_name LIKE 'D%' OR last_name LIKE '%n'; 456 | ``` 457 | 458 | **MATCH _ _ _ Y last names** 459 | 460 | ``` 461 | SELECT last_name, first_name 462 | FROM students 463 | WHERE first_name LIKE '___y'; 464 | ``` 465 | **Show all the categories of a column(state)** 466 | 467 | ``` 468 | SELECT DISTINCT state 469 | FROM students 470 | ORDER BY state; 471 | ``` 472 | 473 | **Show count of all the categories of a column(state)** 474 | ``` 475 | SELECT COUNT(DISTINCT state) 476 | FROM students; 477 | ``` 478 | 479 | **Show count matching a condition** 480 | ``` 481 | SELECT COUNT(*) 482 | FROM students 483 | WHERE sex='M'; 484 | ``` 485 | 486 | **Group results based on a category(sex/birth year)** 487 | ``` 488 | SELECT sex, COUNT(*) 489 | FROM students 490 | GROUP BY sex; 491 | ``` 492 | 493 | ```SELECT MONTH(birth_date) AS 'Month', COUNT(*) 494 | FROM students 495 | GROUP BY Month 496 | ORDER BY Month; 497 | ``` 498 | 499 | ``` SELECT state, COUNT(state) AS 'Amount' 500 | FROM students 501 | GROUP BY state 502 | HAVING Amount > 1; 503 | ``` 504 | a. HAVING allows you to narrow the results after the query is executed 505 | 506 | 507 | **Select based on a condition** 508 | 509 | SELECT student_id, test_id 510 | FROM scores 511 | WHERE student_id = 6; 512 | 513 | **Insert into table** 514 | 515 | INSERT INTO scores VALUES 516 | (6, 3, 24); 517 | 518 | **Delete based on a condition** 519 | 520 | DELETE FROM absences 521 | WHERE student_id = 6; 522 | 523 | **ADD COLUMN** 524 | 525 | ALTER TABLE absences 526 | ADD COLUMN test_taken CHAR(1) NOT NULL DEFAULT 'F' 527 | AFTER student_id; 528 | 529 | 530 | Use ALTER to add a column to a table. You can use AFTER 531 | or BEFORE to define the placement 532 | 533 | **Update a value based on a condition** 534 | 535 | UPDATE scores SET score=25 536 | WHERE student_id=4 AND test_id=3; 537 | 538 | **Use BETWEEN to find matches between a minimum and maximum** 539 | 540 | SELECT first_name, last_name, birth_date 541 | FROM students 542 | WHERE birth_date 543 | BETWEEN '1960-1-1' AND '1970-1-1'; 544 | 545 | **Use IN to narrow results based on a predefined list of options** 546 | 547 | SELECT first_name, last_name 548 | FROM students 549 | WHERE first_name IN ('Bobby', 'Lucy', 'Andy'); 550 | 551 | **JOIN- Get info from multiple sources:** 552 | 553 | 554 | SELECT student_id, date, score, maxscore 555 | FROM tests, scores 556 | WHERE date = '2014-08-25' 557 | AND tests.test_id = scores.test_id; 558 | 559 | 560 | 561 | a. To combine data from multiple tables you can perform a JOIN 562 | by matching up common data like we did here with the test ids 563 | 564 | b. You have to define the 2 tables to join after FROM 565 | 566 | c. 
You have to define the common data between the tables after WHERE 567 | 568 | SELECT scores.student_id, tests.date, scores.score, tests.maxscore 569 | FROM tests, scores 570 | WHERE date = '2014-08-25' 571 | AND tests.test_id = scores.test_id; 572 | 573 | 574 | **JOIN + GROUPBY** 575 | 576 | SELECT students.student_id, 577 | CONCAT(students.first_name, " ", students.last_name) AS Name, 578 | COUNT(absences.date) AS Absences 579 | FROM students, absences 580 | WHERE students.student_id = absences.student_id 581 | GROUP BY students.student_id; 582 | 583 | SELECT students.student_id, 584 | CONCAT(students.first_name, " ", students.last_name) AS Name, 585 | COUNT(absences.date) AS Absences 586 | FROM students LEFT JOIN absences 587 | ON students.student_id = absences.student_id 588 | GROUP BY students.student_id; 589 | 590 | If we need to include all information from the table listed 591 | first "FROM students", even if it doesn't exist in the table on 592 | the right "LEFT JOIN absences", we can use a LEFT JOIN. 593 | 594 | SELECT students.first_name, 595 | students.last_name, 596 | scores.test_id, 597 | scores.score 598 | FROM students 599 | INNER JOIN scores 600 | ON students.student_id=scores.student_id 601 | WHERE scores.score <= 15 602 | ORDER BY scores.test_id; 603 | 604 | a. An INNER JOIN gets all rows of data from both tables if there 605 | is a match between columns in both tables 606 | 607 | b. Here I'm getting all the data for all quizzes and matching that 608 | data up based on student ids 609 | 610 | -------------------------------------------------------------------------------- /docs/MySQL/image.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /docs/MySQL/schema.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/deepak-k-zefr/ML-Notes/313a9d6b6786e19328d11bffda4032614403e507/docs/MySQL/schema.png -------------------------------------------------------------------------------- /docs/NLP.md: -------------------------------------------------------------------------------- 1 | # NLP 2 | 3 | 4 | 5 | ### tf–idf 6 | tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. 7 | 8 | The tf-idf value increases proportionally to the 9 | * number of times a word appears in the document, 10 | * but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general 11 | 12 | ![](https://wikimedia.org/api/rest_v1/media/math/render/svg/10109d0e60cc9d50a1ea2f189bac0ac29a030a00) 13 | 14 | #### Term frequency 15 | The weight of a term that occurs in a document is simply proportional to the term frequency 16 | measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization: 17 | 18 | TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document). 
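A minimal sketch of this normalized term-frequency computation for an already-tokenized document (standard library only; the example sentence is made up):

```python
from collections import Counter

def term_frequencies(tokens):
    """Count of each term divided by the total number of terms in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

doc = "the cat sat on the mat".split()
print(term_frequencies(doc))   # e.g. 'the' -> 2/6 ≈ 0.33, 'cat' -> 1/6 ≈ 0.17
```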
19 | 20 | 21 | #### Inverse document frequency 22 | The specificity of a term can be quantified as an inverse function of the number of documents in which the word occurs 23 | 24 | IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However it is known that certain terms, such as "is", "of", and "that", may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scale up the rare ones, by computing the following: 25 | 26 | IDF(t) = log_e(Total number of documents / Number of documents with term t in it). 27 | 28 | #### EXAMPLE 29 | Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. 30 | 31 | 32 | 33 | 34 | ## Naive Bayes spam filtering 35 | 36 | Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members 37 | 38 | After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email contributes to the email's spam probability, or only the most interesting words. This contribution is called the posterior probability and is computed using Bayes' theorem. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as a spam. 39 | 40 | Bayesian email filters utilize Bayes' theorem. Bayes' theorem is used several times in the context of spam: 41 | 42 | ### Mathematical foundation 43 | 44 | * a first time, to compute the probability that the message is spam, knowing that a given word appears in this message; 45 | * a second time, to compute the probability that the message is spam, taking into consideration all of its words (or a relevant subset of them); 46 | * sometimes a third time, to deal with rare words. 47 | 48 | 49 | 50 | Let's suppose the suspected message contains the word "replica". Most people who are used to receiving e-mail know that this message is likely to be spam, more precisely a proposal to sell counterfeit copies of well-known brands of watches. The spam detection software, however, does not "know" such facts; all it can do is compute probabilities. 
The formula used by the software to determine that is derived from Bayes' theorem:

![Bayes Theorem](https://wikimedia.org/api/rest_v1/media/math/render/svg/dc8c39ec48e65c0ab10dabe343d4da9a9585a77b)

where:

* Pr(S|W) is the probability that a message is spam, knowing that the word "replica" is in it;
* Pr(S) is the overall probability that any given message is spam;
* Pr(W|S) is the probability that the word "replica" appears in spam messages;
* Pr(H) is the overall probability that any given message is not spam (is "ham");
* Pr(W|H) is the probability that the word "replica" appears in ham messages.

This is the probability that a message containing a given word is spam.




### Naive Bayes Text Classification

Input:
* a document `d`
* a fixed set of classes `C = {c1, c2, …, cJ}`
* a training set of m hand-labeled documents `(d1,c1), ..., (dm,cm)`

Output:
* a predicted class `c ∈ C`


Naive Bayes is a simple ("naïve") classification method based on Bayes' rule. It relies on a very simple representation of the document: a bag of words.

For a document `d` and a class `c`, Bayes' rule gives

`P(c|d) = P(d|c) * P(c) / P(d)`

The MAP ("maximum a posteriori") class is the most likely class:

`c_MAP = argmax over c ∈ C of P(c|d) = argmax over c ∈ C of P(d|c) * P(c)`

(the denominator P(d) is the same for every class, so it can be dropped when taking the argmax).

https://web.stanford.edu/class/cs124/lec/naivebayes.pdf


-------------------------------------------------------------------------------- /docs/Performance_metrics.md: --------------------------------------------------------------------------------

## ROC curve

More reading: [Receiver operating characteristic (Wikipedia)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)

The ROC curve is a graphical representation of the contrast between true positive rates and the false positive rate at various thresholds. It's often used as a proxy for the trade-off between the sensitivity of the model (true positives) vs the fall-out, i.e. the probability that it will trigger a false alarm (false positives).



### AUC

The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example, i.e. P(score(x+) > score(x−)).



### Sensitivity and Specificity

Sensitivity refers to the test's ability to correctly detect patients who do have the condition. In the example of a medical test used to identify a disease, the sensitivity of the test is the proportion of people who test positive for the disease among those who have the disease. Mathematically, this can be expressed as:

![Sensitivity](https://wikimedia.org/api/rest_v1/media/math/render/svg/fbad73213a4578685fefa43ec96ce53533057e11)

Specificity relates to the test's ability to correctly detect patients without a condition. Consider the example of a medical test for diagnosing a disease. The specificity of a test is the proportion of healthy patients, known not to have the disease, who will test negative for it.
Mathematically, this can also be written as: 22 | 23 | 24 | ![Specificity](https://wikimedia.org/api/rest_v1/media/math/render/svg/d7856a809dafad4fa9566eef65b37bedeaa53132) 25 | 26 | ![Image of High Sensitivity Low Specificity](https://upload.wikimedia.org/wikipedia/commons/e/e2/HighSensitivity_LowSpecificity_1401x1050.png) 27 | 28 | Lesser False Negatives and more False Positives. Detect more people with the disease. 29 | 30 | 31 | [Worked out example image](https://en.wikipedia.org/wiki/Template:SensSpecPPVNPV) 32 | 33 | 34 | ## Precision and Recall. 35 | 36 | More Reading: [precision and recall-Scikit Learn](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) 37 | 38 | 39 | [Precision and Recall](https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg)! 40 | 41 | Precision (P) is defined as the number of true positives (Tp) over the number of true positives plus the number of false positives (Fp). 42 | 43 | ![Precision =](https://wikimedia.org/api/rest_v1/media/math/render/svg/26106935459abe7c266f7b1ebfa2a824b334c807) 44 | 45 | Recall (R) is defined as the number of true positives (Tp}) over the number of true positives plus the number of false negatives (Fn). 46 | 47 | ![Recall =](https://wikimedia.org/api/rest_v1/media/math/render/svg/4c233366865312bc99c832d1475e152c5074891b) 48 | 49 | These quantities are also related to the (F1) score, which is defined as the harmonic mean of precision and recall. 50 | 51 | 52 | -------------------------------------------------------------------------------- /docs/Preprocessing.md: -------------------------------------------------------------------------------- 1 | ## Multicollinearity 2 | 3 | Multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. 4 | 5 | Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. 6 | 7 | That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others. 8 | 9 | ### Consequences of multicollinearity 10 | 11 | In the presence of multicollinearity, the estimate of one variable's impact on the dependent variable Y while controlling for the others tends to be less precise than if predictors were uncorrelated with one another. The usual interpretation of a regression coefficient is that it provides an estimate of the effect of a one unit change in an independent variable, X_1 holding the other variables constant. If X_1 is highly correlated with another independent variable, X_2, in the given data set, then we have a set of observations for which X_1 and X_2 have a particular linear stochastic relationship. We don't have a set of observations for which all changes in X_1 are independent of changes in X_2, so we have an imprecise estimate of the effect of independent changes in X_1. 12 | 13 | In some sense, the collinear variables contain the same information about the dependent variable. 
If nominally "different" measures actually quantify the same phenomenon then they are redundant. Alternatively, if the variables are accorded different names and perhaps employ different numeric measurement scales but are highly correlated with each other, then they suffer from redundancy. 14 | 15 | One of the features of multicollinearity is that the standard errors of the affected coefficients tend to be large. In that case, the test of the hypothesis that the coefficient is equal to zero may lead to a failure to reject a false null hypothesis of no effect of the explanator, a type II error. 16 | 17 | Another issue with multicollinearity is that small changes to the input data can lead to large changes in the model, even resulting in changes of sign of parameter estimates.[2] 18 | 19 | A principal danger of such data redundancy is that of overfitting in regression analysis models. The best regression models are those in which the predictor variables each correlate highly with the dependent (outcome) variable but correlate at most only minimally with each other. Such a model is often called "low noise" and will be statistically robust (that is, it will predict reliably across numerous samples of variable sets drawn from the same statistical population). 20 | 21 | So long as the underlying specification is correct, multicollinearity does not actually bias results; it just produces large standard errors in the related independent variables. More importantly, the usual use of regression is to take coefficients from the model and then apply them to other data. Since multicollinearity causes imprecise estimates of coefficient values, the resulting out-of-sample predictions will also be imprecise. And if the pattern of multicollinearity in the new data differs from that in the data that was fitted, such extrapolation may introduce large errors in the predictions 22 | 23 | 24 | ### Remedies for multicollinearity 25 | 26 | * Make sure you have not fallen into the dummy variable trap; including a dummy variable for every category (e.g., summer, autumn, winter, and spring) and including a constant term in the regression together guarantee perfect multicollinearity. 27 | 28 | * Try seeing what happens if you use independent subsets of your data for estimation and apply those estimates to the whole data set. Theoretically you should obtain somewhat higher variance from the smaller datasets used for estimation, but the expectation of the coefficient values should be the same. Naturally, the observed coefficient values will vary, but look at how much they vary. 29 | 30 | * Leave the model as is, despite multicollinearity. The presence of multicollinearity doesn't affect the efficacy of extrapolating the fitted model to new data provided that the predictor variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based. 31 | 32 | * Drop one of the variables. An explanatory variable may be dropped to produce a model with significant coefficients. However, you lose information (because you've dropped a variable). Omission of a relevant variable results in biased coefficient estimates for the remaining explanatory variables that are correlated with the dropped variable. 33 | 34 | * Obtain more data, if possible. This is the preferred solution. 
More data can produce more precise parameter estimates (with lower standard errors), as seen from the formula in variance inflation factor for the variance of the estimate of a regression coefficient in terms of the sample size and the degree of multicollinearity. 35 | 36 | * Mean-center the predictor variables. 37 | 38 | * Standardize your independent variables. This may help reduce a false flagging of a condition index above 30.It has also been suggested that using the Shapley value, a game theory tool, the model could account for the effects of multicollinearity. The Shapley value assigns a value for each predictor and assesses all possible combinations of importance. 39 | 40 | * Ridge regression or principal component regression or partial least squares regression can be used. 41 | 42 | * If the correlated explanators are different lagged values of the same underlying explanator, then a distributed lag technique can be used, imposing a general structure on the relative values of the coefficients to be estimated. 43 | -------------------------------------------------------------------------------- /docs/Probabilistic_graphical_model.md: -------------------------------------------------------------------------------- 1 | 2 | ## Probabilistic Graphical Models 3 | # Markov Process 4 | is a stochastic process that satisfies the Markov property. A process satisfies the Markov property if one can make predictions for the future of the process based solely on its present state just as well as one could knowing the process's full history, hence independently from such history; i.e., conditional on the present state of the system, its future and past states are independent. 5 | 6 | A state diagram for a simple example is shown in the figure on the right, using a directed graph to picture the state transitions. The states represent whether a hypothetical stock market is exhibiting a bull market, bear market, or stagnant market trend during a given week. According to the figure, a bull week is followed by another bull week 90% of the time, a bear week 7.5% of the time, and a stagnant week the other 2.5% of the time. Labelling the state space {1 = bull, 2 = bear, 3 = stagnant} 7 | 8 | ![](https://upload.wikimedia.org/wikipedia/commons/thumb/9/95/Finance_Markov_chain_example_state_space.svg/800px-Finance_Markov_chain_example_state_space.svg.png) 9 | 10 | the transition matrix for this example is: 11 | ![](https://wikimedia.org/api/rest_v1/media/math/render/svg/6cea2dc36a546e141ce2d072636dbf8a0005f235) 12 | 13 | 14 | ![](https://wikimedia.org/api/rest_v1/media/math/render/svg/df26d8a65d9997bd816356f0ebc532c46ea9a46c) 15 | -------------------------------------------------------------------------------- /docs/Supervised_learning.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## Linear Regression 4 | 5 | * **Linear relationship (Model is linear in parameters)** 6 | The linearity assumption can best be observed in scatter plots 7 | 8 | * **No or little multicollinearity** 9 | There is no perfect linear relationship between explanatory variables. Multicollinearity occurs when the independent variables are not independent from each other. 10 | * Correlation matrix – when computing the matrix of Pearson's Bivariate Correlation among all independent variables the correlation coefficients need to be smaller than 1. 11 | * VIF is a metric computed for every X variable that goes into a linear model. 
If the VIF of a variable is high, the information in that variable is largely already explained by the other X variables present in the model; in other words, that variable is redundant. So the lower the VIF (ideally < 2), the better.
12 |
13 | * **Normality of residuals**
14 |
15 | * **The mean of residuals is zero**
16 |
17 | * **No auto-correlation**
18 | (Autocorrelation occurs when the residuals are not independent from each other, i.e. when the value of y(x+1) is not independent of the value of y(x).) It is typically observed in time series data.
19 |
20 | * **Homoscedasticity of residuals or equal variance**
21 |
22 | More reading : http://r-statistics.co/Assumptions-of-Linear-Regression.html.
23 | More reading : http://www.statisticssolutions.com/assumptions-of-linear-regression/.
24 | More reading : http://www.nitiphong.com/paper_pdf/OLS_Assumptions.pdf.
25 |
26 | ## Gauss–Markov theorem
27 |
28 | In statistics, the Gauss–Markov theorem states that in a linear regression model in which the errors have expectation zero, are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator, provided it exists.
29 |
30 | Here "best" means giving the lowest variance of the estimate, as compared to other unbiased, linear estimators. The errors do not need to be normal, nor do they need to be independent and identically distributed (only uncorrelated with mean zero and homoscedastic with finite variance). The requirement that the estimator be unbiased cannot be dropped, since biased estimators exist with lower variance.
31 |
32 |
33 | ## Logistic Regression
34 | is a regression model where the dependent variable (DV) is categorical. Example (of a binary dependent variable): the output can take only two values, "0" and "1", which represent outcomes such as pass/fail, win/lose, alive/dead or healthy/sick.
35 |
36 | Logistic regression can be binomial, ordinal or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types, "0" and "1" (which may represent, for example, "dead" vs. "alive" or "win" vs. "loss"). Multinomial logistic regression deals with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C") that are not ordered. Ordinal logistic regression deals with dependent variables that are ordered.
37 |
38 | Logistic regression is considered a generalized linear model because the outcome always depends on the sum of the inputs and parameters. In other words, the output cannot depend on the product (or quotient, etc.) of its parameters: there is no interaction between the parameter weights, nothing like w_1*x_1 * w_2*x_2, which would make the model non-linear.
39 |
40 |
41 | ### Assumptions
42 | * **No outliers.** (Use z-scores, histograms, and k-means clustering to identify and remove outliers,
43 | and analyze residuals to identify outliers in the regression.)
44 | * **Independent errors.** (As in OLS, error terms are assumed to be uncorrelated.)
45 | * **No multicollinearity.** (Check the zero-order correlation matrix for high values, i.e. r > 0.7.)
46 |
47 | ### Definition of the logistic function
48 |
49 | The logistic function is useful because it can take any real input *t*, whereas the output always takes values between zero and one and hence is interpretable as a probability.
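As a quick illustration (a small sketch added to these notes, not from the sources; the weights and input below are made up), the logistic function in code, and how logistic regression turns a linear combination of inputs into a probability:

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real number t to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# Logistic regression applies the logistic function to a linear combination of
# the inputs: p(y = 1 | x) = sigmoid(w . x + b).
w, b = np.array([0.8, -0.4]), 0.1      # hypothetical fitted weights and intercept
x = np.array([2.0, 1.5])
print(sigmoid(np.dot(w, x) + b))       # a probability strictly between 0 and 1
```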
50 | 51 | ![formula](https://wikimedia.org/api/rest_v1/media/math/render/svg/5e648e1dd38ef843d57777cd34c67465bbca694f) 52 | 53 | 54 | The logistic function sigma (t) is defined as follows: 55 | 56 | ![fig](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/640px-Logistic-curve.svg.png) 57 | 58 | ![fig](https://wikimedia.org/api/rest_v1/media/math/render/svg/836d93163447344be4715ec00638c1cd829e376c) 59 | 60 | ![fig](https://wikimedia.org/api/rest_v1/media/math/render/svg/57fa62921bfe1721bca86f8db39f44f4c1094cd5) 61 | 62 | 63 | More Reading :https://onlinecourses.science.psu.edu/stat504/node/164 64 | 65 | Because logistic regression uses MLE rather than OLS, it avoids many 66 | of the typical assumptions tested in statistical analysis. 67 | * Does not assume normality of variables (both DV and IVs). 68 | * Does not assume linearity between DV and IVs. 69 | * Does not assume homoscedasticity. 70 | * Does not assume normal errors. 71 | * MLE allows more flexibility in the data and analysis because it has fewer restrictions. 72 | 73 | 74 | ## Generalized linear model 75 | GLM is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution 76 | 77 | 78 | ## Neural Networks 79 | 80 | There is no assumption on data, errors or targets. In theory a Neural Network can approximate any function and this is done without assumptions, it only depends on data and network configuration. 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | ## Ensemble methods 89 | 90 | The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator. 91 | 92 | Two families of ensemble methods are usually distinguished: 93 | 94 | In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced. 95 | 96 | Examples: Bagging methods, Forests of randomized trees, ... 97 | 98 | By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. 99 | 100 | ## Random forest 101 | 102 | ### Motivation 103 | 104 | Trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.[3]:587–588 This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model. 105 | 106 | 107 | ### Tree bagging 108 | 109 | The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples: 110 | 111 | For b = 1, ..., B: 112 | 113 | 1) Sample, with replacement, B training examples from X, Y; call these Xb, Yb. 114 | 2) Train a decision or regression tree fb on Xb, Yb. 
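A minimal sketch of the bagging loop just described (illustrative only; it assumes scikit-learn is available, uses a toy dataset, and draws each bootstrap sample with the same size as the training set):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

B = 25                                    # number of bootstrap samples / trees
trees = []
for b in range(B):
    # Sample, with replacement, a bootstrap set the same size as the training set.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Predict for an unseen sample x' by averaging the B individual tree predictions.
x_new = X[:1]
print(np.mean([tree.predict(x_new)[0] for tree in trees]))
```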
115 | After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x': 116 | 117 | ![](https://wikimedia.org/api/rest_v1/media/math/render/svg/b54befce12aefdb29442bfc71cb5ad452364e8d8) 118 | 119 | or by taking the majority vote in the case of decision trees. 120 | 121 | This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets. 122 | 123 | The number of samples/trees, B, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees B can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample. The training and test error tend to level off after some number of trees have been fit. 124 | 125 | 126 | ### From bagging to random forests 127 | 128 | The above procedure describes the original bagging algorithm for trees. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho. 129 | 130 | Typically, for a classification problem with p features, √p (rounded down) features are used in each split. For regression problems the inventors recommend p/3 (rounded down) with a minimum node size of 5 as the default. 131 | 132 | ![](https://databricks.com/wp-content/uploads/2015/01/Ensemble-example.png) 133 | 134 | 135 | 136 | 137 | Should inputs to random forests be normalized? 138 | 139 | Any algorithm based on recursive partitioning, such as decision trees, and regression trees does not require inputs (features) to be normalized, since it is invariant to monotonic transformations of the features (just think about how the splits are done at each node). Since random forests (as well as gbm) are just a collection of trees, there is no need to normalize. 140 | 141 | 142 | ## KNN vs k-means clustering 143 | 144 | More reading : [How is the k-nearest neighbor algorithm different from k-means clustering?](http://stats.stackexchange.com/questions/56500/what-are-the-main-differences-between-k-means-and-k-nearest-neighbours) 145 | 146 | K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. 
While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data into which you want to classify a new, unlabeled point (thus the "nearest neighbor" part). K-means clustering requires only a set of unlabeled points and a number of clusters k: the algorithm iteratively assigns points to groups and recomputes the mean (centroid) of each group until the assignments stabilize.
147 |
148 | The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn't, and is thus unsupervised learning.
149 |
150 | ## KNN
151 |
152 | The k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
153 |
154 | In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
155 |
156 | In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
157 |
158 |
159 | ### Dimension reduction
160 | For high-dimensional data (e.g., with number of dimensions more than 10) dimension reduction is usually performed prior to applying the k-NN algorithm in order to avoid the effects of the curse of dimensionality.
161 |
162 | The curse of dimensionality in the k-NN context basically means that Euclidean distance is unhelpful in high dimensions because all vectors are almost equidistant to the search query vector (imagine multiple points lying more or less on a circle with the query point at the center; the distance from the query to all data points in the search space is almost the same).
163 |
164 | Feature extraction and dimension reduction can be combined in one step using principal component analysis (PCA), linear discriminant analysis (LDA), or canonical correlation analysis (CCA) as a pre-processing step, followed by applying k-NN on the feature vectors in the reduced-dimension space. In machine learning this process is also called low-dimensional embedding.
165 |
166 |
167 |
168 | ## Generalized linear model
169 | A generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
170 |
171 | ### Intuition
172 | Ordinary linear regression predicts the expected value of a given unknown quantity (the response variable, a random variable) as a linear combination of a set of observed values (predictors). This implies that a constant change in a predictor leads to a constant change in the response variable (i.e. a linear-response model). This is appropriate when the response variable has a normal distribution (intuitively, when a response variable can vary essentially indefinitely in either direction with no fixed "zero value", or more generally for any quantity that only varies by a relatively small amount, e.g. human heights).
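A tiny numeric sketch (added here for illustration; the coefficients are made up) of what the link function does: with the identity link a unit change in x adds a constant amount to the expected response, while with a log link it multiplies the expected response by a constant factor:

```python
import numpy as np

beta0, beta1 = 3.0, 0.7                 # hypothetical coefficients
x = np.array([0.0, 1.0, 2.0, 3.0])
eta = beta0 + beta1 * x                 # the linear predictor

mu_identity = eta                       # identity link: ordinary linear regression
mu_log = np.exp(eta)                    # log link: E[y] = exp(eta), always positive

print(np.diff(mu_identity))             # constant additive steps: [0.7 0.7 0.7]
print(mu_log[1:] / mu_log[:-1])         # constant multiplicative steps: exp(0.7) each
```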
173 |
174 | However, these assumptions are inappropriate for some types of response variables. For example, in cases where the response variable is expected to be always positive and varying over a wide range, constant input changes lead to geometrically varying, rather than constantly varying, output changes. As an example, consider a model that predicts that a 10 degree temperature decrease leads to 1,000 fewer people visiting the beach. Such a model is unlikely to generalize well over both small beaches (e.g. those where the expected attendance is 50 at a particular temperature) and large beaches (e.g. those where the expected attendance is 10,000 at a low temperature). The problem is that, under a predicted drop of 1,000 visitors, a beach whose expected attendance was 50 at the higher temperature would now be predicted to have the impossible attendance value of −950.
175 |
176 | ### Examples
177 | When the response data, Y, are binary (taking on only the values 0 and 1), the distribution function is generally chosen to be the Bernoulli distribution, and the interpretation of μ_i is then the probability, p, of Y_i taking on the value one.
178 |
179 | There are several popular link functions for binomial data.
180 |
181 | #### Logit link function
182 | The most typical link function is the canonical logit link:
183 |
184 | ![formula](https://wikimedia.org/api/rest_v1/media/math/render/svg/8fafc094e76c824824b0b49467a84884525dad8e)
185 |
--------------------------------------------------------------------------------
/docs/Unsupervised-Learning.md:
--------------------------------------------------------------------------------
 1 | ## K-means
 2 |
 3 | More Reading: https://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means
 4 |
 5 | k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
 6 |
 7 |
 8 | K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
 9 | 1. The centroids of the K clusters, which can be used to label new data
10 | 2. Labels for the training data (each data point is assigned to a single cluster)
11 |
12 | Assumptions:
13 | 1. k-means assumes the variance of the distribution of each attribute (variable) is spherical;
14 | 2. All variables have the same variance;
15 |
16 | Example of an assumption failure:
17 |
18 | ![fig](https://i.stack.imgur.com/tXGTo.png)
19 |
20 |
21 |
22 |
23 | 3. The prior probability for all k clusters is the same, i.e., each cluster has roughly equal number of observations;
24 | 4. There are K clusters.
25 |
26 |
27 |
28 | ## K-means++
29 |
30 | is an algorithm for choosing the initial values (or "seeds") for the k-means clustering algorithm.
31 |
32 | The k-means problem is to find cluster centers that minimize the intra-class variance, i.e. the sum of squared distances from each data point being clustered to its cluster center (the center that is closest to it). Although finding an exact solution to the k-means problem for arbitrary input is NP-hard, the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is used widely and frequently finds reasonable solutions quickly.
33 |
34 | However, the k-means algorithm has at least two major theoretic shortcomings:
35 |
36 | First, it has been shown that the worst-case running time of the algorithm is super-polynomial in the input size.[5]
37 | Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering.
38 |
39 | The k-means++ algorithm addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive with the optimal k-means solution.
40 |
41 |
42 | #### Example of a sub-optimal clustering
43 |
44 | To illustrate the potential of the k-means algorithm to perform arbitrarily poorly with respect to the objective function of minimizing the sum of squared distances of cluster points to the centroid of their assigned clusters, consider the example of four points in R² that form an axis-aligned rectangle whose width is greater than its height.
45 |
46 | If k = 2 and the two initial cluster centers lie at the midpoints of the top and bottom line segments of the rectangle formed by the four data points, the k-means algorithm converges immediately, without moving these cluster centers. Consequently, the two bottom data points are clustered together and the two data points forming the top of the rectangle are clustered together: a suboptimal clustering, because the width of the rectangle is greater than its height.
47 |
48 | Now, consider stretching the rectangle horizontally to an arbitrary width. The standard k-means algorithm will continue to cluster the points suboptimally, and by increasing the horizontal distance between the two data points in each cluster, we can make the algorithm perform arbitrarily poorly with respect to the k-means objective function.
49 |
50 | #### Improved initialization algorithm
51 | The intuition behind this approach is that spreading out the k initial cluster centers is a good thing: the first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.
52 |
53 | The exact algorithm is as follows:
54 |
55 | 1. Choose one center uniformly at random from among the data points.
56 | 2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
57 | 3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
58 | 4. Repeat Steps 2 and 3 until k centers have been chosen.
59 | 5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
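A minimal sketch of this seeding procedure (illustrative only; in practice an implementation such as scikit-learn's KMeans already uses k-means++ initialization by default):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers using the k-means++ seeding steps listed above."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                   # step 1: first center at random
    while len(centers) < k:
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)    # step 2: squared distance D(x)^2
        probs = d2 / d2.sum()                             # step 3: weights proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)                              # steps 4-5: hand these to k-means

X = np.random.default_rng(1).normal(size=(100, 2))
print(kmeans_pp_init(X, k=3))
```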
60 | This seeding method yields considerable improvement in the final error of k-means. Although the initial selection in the algorithm takes extra time, the k-means part itself converges very quickly after this seeding, and thus the algorithm actually lowers the computation time. The authors tested their method with real and synthetic datasets and obtained typically 2-fold improvements in speed, and for certain datasets, close to 1000-fold improvements in error. In these simulations the new method almost always performed at least as well as vanilla k-means in both speed and error.
61 |
62 | The k-means++ algorithm guarantees an approximation ratio of O(log k) in expectation (over the randomness of the algorithm), where k is the number of clusters used. This is in contrast to vanilla k-means, which can generate clusterings that are arbitrarily worse than the optimum.
63 |
64 | ### Applications
65 | k-means clustering is rather easy to implement and apply even on large data sets, particularly when using heuristics such as Lloyd's algorithm. It has been successfully used in various topics, including market segmentation, computer vision, astronomy and agriculture.
66 | It often is used as a preprocessing step for other algorithms, for example to find a starting configuration.
67 |
68 |
69 | ### Why is Euclidean distance not a good metric in high dimensions?
70 | The notion of Euclidean distance, which works well in the two-dimensional and three-dimensional worlds studied by Euclid, has some properties in higher dimensions that are contrary to our (maybe just my) geometric intuition, which is also an extrapolation from two and three dimensions.
71 |
72 | Consider a 4×4 square with vertices at (±2, ±2). Draw four unit-radius circles centered at (±1, ±1). These "fill" the square, with each circle touching the sides of the square at two points and touching its two neighboring circles. For example, the circle centered at (1, 1) touches the sides of the square at (2, 1) and (1, 2), and its neighboring circles at (1, 0) and (0, 1). Next, draw a small circle centered at the origin that touches all four circles. Since the line segment whose endpoints are the centers of two osculating circles passes through the point of osculation, it is easily verified that the small circle has radius r_2 = √2 − 1 and that it touches the four larger circles at (±r_2/√2, ±r_2/√2). Note that the small circle is "completely surrounded" by the four larger circles and thus is also completely inside the square. Note also that the point (r_2, 0) lies on the small circle. Notice also that from the origin one cannot "see" the point (2, 0) on the edge of the square, because the line of sight passes through the point of osculation (1, 0) of the two circles centered at (1, 1) and (1, −1). Ditto for the lines of sight to the other points where the axes pass through the edges of the square.
73 |
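Before extending the construction to three and more dimensions, here is a quick numeric check (added as an illustration); the following paragraphs show that the central sphere has radius √d − 1 in the d-dimensional version of this construction:

```python
import math

# r_d = sqrt(d) - 1 is the radius of the central sphere in the d-dimensional
# version of this construction (derived in the paragraphs that follow).
for d in [2, 3, 4, 9, 10, 100]:
    r = math.sqrt(d) - 1
    if r > 2:
        note = "pokes outside the hypercube of side 4"
    elif r >= 1:
        note = "no longer 'small': at least as big as the packing spheres"
    else:
        note = "fits snugly between the unit spheres"
    print(f"d = {d:3d}   central-sphere radius = {r:.3f}   ({note})")
```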
74 | Next, consider a 4×4×4 cube with vertices at (±2, ±2, ±2). We fill it with 8 osculating unit-radius spheres centered at (±1, ±1, ±1), and then put a smaller osculating sphere centered at the origin. Note that the small sphere has radius r_3 = √3 − 1 < 1 and that the point (r_3, 0, 0) lies on its surface. But notice also that in three dimensions one can "see" the point (2, 0, 0) from the origin; there are no bigger spheres blocking the view as happens in two dimensions. These clear lines of sight from the origin to the points where the axes pass through the surface of the cube occur in all larger dimensions as well.
75 |
76 | Generalizing, we can consider an n-dimensional hypercube of side 4 and fill it with 2^n osculating unit-radius hyperspheres centered at (±1, ±1, …, ±1), and then put a "smaller" osculating sphere of radius
77 |
78 | r_n = √n − 1    (1)
79 | at the origin. The point (r_n, 0, 0, …, 0) lies on this "smaller" sphere. But notice from (1) that when n = 4, r_n = 1, and so the "smaller" sphere has unit radius and thus really does not deserve the soubriquet of "smaller" for n ≥ 4. Indeed, it would be better to call it the "larger sphere" or just the "central sphere". As noted in the last paragraph, there is a clear line of sight from the origin to the points where the axes pass through the surface of the hypercube. Worse yet, when n > 9 we have from (1) that r_n > 2, and thus the point (r_n, 0, 0, …, 0) on the central sphere lies outside the hypercube of side 4 even though it is "completely surrounded" by the unit-radius hyperspheres that "fill" the hypercube (in the sense of packing it). The central sphere "bulges" outside the hypercube in high-dimensional space. I find this very counter-intuitive, because my mental translations of the notion of Euclidean distance to higher dimensions, using the geometric intuition developed from the 2-space and 3-space I am familiar with, do not describe the reality of high-dimensional space.
80 |
81 | So one answer to "what is 'high dimensions'?" is n ≥ 9.
82 |
83 |
84 | ## GMM
85 |
86 | A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.
87 |
88 | More generally, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs.
89 |
90 | [PROS:](http://scikit-learn.org/stable/modules/mixture.html#pros)
91 | [CONS:](http://scikit-learn.org/stable/modules/mixture.html#cons)
92 |
93 | ### Estimation algorithm: expectation-maximization
94 | The main difficulty in learning Gaussian mixture models from unlabeled data is that one usually doesn't know which points came from which latent component (if one has access to this information, it gets very easy to fit a separate Gaussian distribution to each set of points). Expectation-maximization is a well-founded statistical algorithm that gets around this problem by an iterative process. First one assumes random components (randomly centered on data points, learned from k-means, or even just normally distributed around the origin) and computes for each point a probability of being generated by each component of the model. Then, one tweaks the parameters to maximize the likelihood of the data given those assignments. Repeating this process is guaranteed to always converge to a local optimum.
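A minimal sketch of fitting a Gaussian mixture with EM (assuming scikit-learn is available; the two-component toy data below is made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two Gaussian subpopulations; the labels are not given to the model.
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
               rng.normal(loc=4.0, scale=1.5, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                        # EM: alternate soft assignments and parameter updates

print(gmm.means_)                 # estimated component centers
print(gmm.predict_proba(X[:3]))   # per-point probability of each component (soft labels)
print(gmm.predict(X[:3]))         # hard cluster labels
```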
95 |
96 | ## DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
97 | captures the insight that clusters are dense groups of points. The idea is that if a particular point belongs to a cluster, it should be near lots of other points in that cluster.
98 |
99 | It works like this: first we choose two parameters, a positive number epsilon and a natural number minPoints. We then begin by picking an arbitrary point in our dataset. If there are more than minPoints points within a distance of epsilon from that point (including the original point itself), we consider all of them to be part of a "cluster". We then expand that cluster by checking all of the new points and seeing if they too have more than minPoints points within a distance of epsilon, growing the cluster recursively if so.
100 |
101 | Eventually, we run out of points to add to the cluster. We then pick a new arbitrary point and repeat the process. Now, it's entirely possible that a point we pick has fewer than minPoints points in its epsilon ball and is also not part of any other cluster. If that is the case, it's considered a "noise point" not belonging to any cluster.
102 |
--------------------------------------------------------------------------------