├── 00-Prerequisites ├── ComputerScience.md ├── LinearAlgebra.md ├── README.md ├── Statistics.md ├── TypesOfData.md └── numpyMatrixTutorial.ipynb ├── 01-Regression ├── 00-RegressionModelsComparison.md ├── 01-LinearRegression.md ├── 02-PolynomialRegression.md ├── 03-SupportVectorRegression.md ├── README.md └── img │ └── svr1.png ├── 02-Classification ├── 01-LogisticRegression.md ├── 02-knn.md ├── 03-SupportVectorMachines.md ├── 04-NaiveBayes.md ├── 05-DecisionTree.md ├── 06-RandomForest.md ├── 07-HiddenMarkovModels.md └── README.md ├── 03-Clustering ├── 01-K-meansClustering.md ├── 02-HierarchicalClustering.md ├── 03-GaussianMixtureModels.md ├── README.md └── dendrograms.png ├── 04-AssociationRuleLearning ├── 01-Apriori.md ├── 02-Eclat.md └── README.md ├── 05-ReinforcementLearning └── README.md ├── 06-NaturalLanguageProcessing └── README.md ├── 07-DeepLearning ├── README.md ├── activationImplementation.ipynb └── src │ └── img │ ├── hyperbolictanf.png │ ├── neuralnetworks.png │ ├── neuron.png │ ├── nn.png │ ├── rectifier.png │ ├── sigmoid.png │ └── threshold.png ├── 08-DimensionalityReduction ├── 01-PrincipalComponentAnalysis.md └── README.md ├── 09-RecommendationEngines └── README.md ├── 10-ModelSelectionAndBoosting └── README.md ├── 11-TimeSeries ├── 01-Introduction.md ├── 01-TimeSeriesInR.md ├── README.md └── StateSpaceModels.md ├── 12-ConstraintSatisfactionProblems └── README.md ├── 13-Appendix ├── 01-Programming │ ├── 01-R │ │ ├── 01-DplyrTutorial.md │ │ └── README.md │ ├── 02-Python │ │ ├── 01-Numpy.md │ │ ├── 02-MatPlotLib.md │ │ ├── 03-Pandas.md │ │ ├── 04-SciPy.md │ │ ├── 05-urllib.md │ │ └── README.md │ └── README.md ├── 02-ApplicationAreas │ └── 01-Introduction.md └── README.md ├── LICENSE └── README.md /00-Prerequisites/ComputerScience.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/00-Prerequisites/ComputerScience.md -------------------------------------------------------------------------------- /00-Prerequisites/LinearAlgebra.md: -------------------------------------------------------------------------------- 1 | # Linear Algebra 2 | 3 | ## Linear Equations 4 | 5 | ### Structure 6 | 7 | One of the most basic motivations towards making computers was solving a system linear equations given by 8 | ![lineq](http://mathurl.com/yacqpchy.png) 9 | 10 | where ![x1](http://mathurl.com/2ga69bb.png), ![x2](http://mathurl.com/2cbdldp.png), ... ![xn](http://mathurl.com/3ac5npr.png) are unknowns and ![aij](http://mathurl.com/nb755rs.png) 's and ![bi](http://mathurl.com/y94hxh5j.png) 's denote the constants, either complex or real. 11 | 12 | ### Definitions 13 | A system ![solset](http://mathurl.com/yc9hspko.png) is called a **solution** of the above system of linear equations if ![solset2](http://mathurl.com/yckmsc45.png) satisfies each of the eqations in our system of linear equations simultaneously and when ![b1=](http://mathurl.com/yd44a5r8.png) ![b2=](http://mathurl.com/y7l9ygqn.png) ![b3=](http://mathurl.com/y9ld77ug.png) ![ellipsis](http://mathurl.com/3dfnuhs.png) ![bn=](http://mathurl.com/y7fzg8ng.png) ![zero](http://mathurl.com/2vzzs3z.png), then the system of equations is called to be **homogeneous**. Such system of equations have one certain solution ![x1](http://mathurl.com/2ga69bb.png) = ![zero](http://mathurl.com/2vzzs3z.png), ![x2](http://mathurl.com/2cbdldp.png) = ![zero](http://mathurl.com/2vzzs3z.png), ... 
![xn](http://mathurl.com/3ac5npr.png) = ![zero](http://mathurl.com/2vzzs3z.png) called the **trivial solution**. A system of Linear equations is said to be **consistent** if it has at least one solution and **inconsistent** if it has no solutions. 14 | 15 | 16 | 17 | ## Vectors 18 | 19 | A vector is a numeric measurement with directions. In two dimensions, a vector ![v](http://mathurl.com/36zvquj.png) is of the form ![vector1](http://mathurl.com/y93w4hhl.png) and the magnitude of this vector is given by 20 | 21 | ![magV](http://mathurl.com/y8nt6wqb.png) 22 | 23 | ### Vector Multiplications 24 | #### Dot Product 25 | 26 | There are two ways to go about this: 27 | 28 | 1. ![dotProd1](http://mathurl.com/y9qc43b6.png) 29 | 30 | This is the summation of element wise multiplication of the two vectors. The notation ![atransb](http://mathurl.com/y7wgs22g.png) denotes that the vectors are column vectors and the result of the equation above would be a 1x1 vector which is a scalar quantity. 31 | 32 | 2. ![cosDotProd1](http://mathurl.com/ycpoyuxb.png) 33 | 34 | This notation is not very convenient for vector multiplication unless a the angle on the right hand side is known to us. Although, it is a much more common practice to use this equation for finding out the angle between two vectors using 35 | 36 | ![cosDotProd2](http://mathurl.com/yd35x774.png) 37 | 38 | #### Outer Product 39 | The outer product of two vectors results in a matrix and is given by the equation: 40 | 41 | If there are two column vectors `u1` and `v1` that are given by ![u1](http://mathurl.com/ybqz6wvv.png) and ![u1](http://mathurl.com/y7y5yr4e.png) respectively. Then their outer product is written as ![representation](http://mathurl.com/ycex27tr.png) 42 | 43 | which then results to be ![outProductResult](http://mathurl.com/y7kgwn2n.png). 44 | 45 | ## Matrices 46 | 47 | Matrices are two dimensional set of numbers. These are very efficient data types for quick computations. 48 | 49 | ### Matrix Multiplications 50 | 51 | 52 | #### Dot Product 53 | The product of two matrices A and B in given by the formulae 54 | 55 | ![matrixMultiplication](http://mathurl.com/ycnfztus.png) 56 | 57 | 58 | -------------------------------------------------------------------------------- /00-Prerequisites/README.md: -------------------------------------------------------------------------------- 1 | # Prerequisites 2 | 3 | 1. [Linear Algebra](./LinearAlgebra.md) 4 | 2. [Statistics](./Statistics.md) 5 | 3. [Types of Data](./TypesOfData.md) 6 | -------------------------------------------------------------------------------- /00-Prerequisites/Statistics.md: -------------------------------------------------------------------------------- 1 | # Statistics 2 | 3 | ## Basics 4 | 5 | **Dataset**: Any group of values retrieved through a common method/procedure of collection. 6 | 7 | **Weighted Mean**: Weighted mean of a group of numbers with weights given in percentages and scores given in the same range is given by: 8 | 9 | ![](http://mathurl.com/ybxsxl8j.png) 10 | 11 | > Note: Always question about how the weights and categories were collected and why and to what extent it holds importance. 12 | 13 | 14 | **Standard Deviation**: A quantity expressing by how much the members of a group differ from the mean value for the group. 
In more formal terms, it is roughly the typical distance between a data point and the mean of the dataset, and is given by the equation:
15 | 
16 | ![sd](http://mathurl.com/y76cxpqb.png)
17 | 
18 | **z-score**: The z-score is the distance of any given observation from the mean, measured in *standard deviation* units:
19 | 
20 | ![](http://mathurl.com/y966xvq9.png)
21 | 
22 | **Empirical Rule or Three-Sigma Rule**: Most of the data points fall within three standard deviations of the mean: about 68% lie within 1 *sd*, 95% within 2 *sd*, and 99.7% within 3 *sd*.
23 | 
24 | ![empirical rule](http://res.cloudinary.com/natural-log-zero/image/upload/v1522841733/Screen_Shot_2018-04-04_at_11.33.41_PM_nvlpum.png)
25 | 
26 | > Note: It only works for symmetrically distributed data
27 | 
28 | **Percentile Score**: The percentile score for any value **x** in the dataset is given by:
29 | 
30 | ![](http://mathurl.com/yd9sgdeq.png)
31 | 
32 | ## Probabilities
33 | 
34 | **Event**: An event is a set of outcomes, e.g. when rolling a die, the event A = (2,4,6) represents an even number appearing.
35 | 
36 | **Sample Space**: All possible outcomes of a particular random experiment constitute its sample space.
37 | 
38 | **Probability**: The number of favorable outcomes divided by the number of all possible outcomes in a given situation.
39 | 
40 | #### Types of Probabilities
41 | 
42 | 1. **Classical Probability**: The flipping of a coin is an example of classical probability because it is known that there are two sides to it and the likelihood of either one turning up is 50%. **Objective Probabilities** are based on calculations, and classical probability is one type of objective probability.
43 | 
44 | 2. **Empirical Probability**: These probabilities are based on previously observed data. For instance, the probability of *Messi* scoring more than ten goals in this season of FIFA is empirical because it is calculated from Messi's past record. This too is an example of **Objective Probability**, since it is also based on calculations.
45 | 
46 | 3. **Subjective Probability**: These probabilities are not based on mathematical calculations; people use their opinions and experiences, backed by some amount of relevant data, to assert their viewpoints.
47 | 
48 | **Addition Rule**: The addition rule ensures that an outcome is not counted twice when calculating the probability of either of two events occurring (where P(overlap) is the probability of both E1 and E2 occurring):
49 | 
50 | ![](http://mathurl.com/y9rte4y3.png)
51 | 
52 | **Conditional Probability**: The probability of occurrence of an event given that some other event has already occurred.
53 | 
54 | **Independent Events**: Two events are independent if their probabilities are completely unrelated. If we can show that the probability of the two occurring together equals the product of their individual probabilities, then the two events are independent.
55 | 
56 | **Random Variable**: The result of an experiment that has random outcomes is called a random variable. Random variables come in the following types:
57 | 
58 | 1. **Discrete RV**: The number of drinks a person will order at **Tank** is an example of a discrete random variable because it has to be a whole number.
59 | 2. **Continuous RV**: The waiting time in line before one can order at a **Burger King** is an example of a continuous random variable because there are no fixed values that the outcome must take; the possibilities are infinite and continuous.
60 | 3. **Binomial RV**: When an event has only two possible outcomes, the result is called a binomial random variable.
61 | 
62 | **Probability Density**: The curves that represent the distribution of probabilities are called probability density curves.
63 | 
64 | ## Sampling
65 | 
66 | There are a few conditions that need to be considered before sampling is done from various sources:
67 | 
68 | 1. **Size to Cost ratio**: The appropriate size of the sample, based upon the cost per data point in the sample
69 | 2. **Inherent Bias**: If any bias was knowingly/unknowingly introduced while creating the sample, it will need to be considered!
70 | 3. **Quality of Sample**
71 | 
72 | A **simple random sample** is the gold standard when collecting samples. This means that, at any given point during the sample selection process, every individual has the same probability of being chosen as any other individual.
73 | 
74 | Some alternative sampling methods are:
75 | 
76 | 1. **kth point**: The first data point is selected and then every ***kth*** data point after it is selected.
77 | 2. **Opportunity Sampling**: The first ***n*** values are selected from the total data.
78 | 3. **Stratified Sampling**: The whole population is broken into homogeneous groups (strata), and then a few samples are selected from each stratum.
79 | 4. **Cluster Sampling**: The population is divided into heterogeneous groups whose members have differing characteristics, and then a few samples are selected from each group.
80 | 
81 | ## Confidence Intervals
82 | 
83 | As the name suggests, a confidence interval is a range of values that, with a stated level of confidence, is expected to contain the quantity being estimated.
84 | 
85 | ## Hypothesis Testing
86 | 
87 | Hypothesis testing is the process of using sample data to decide whether a stated hypothesis should be rejected.
88 | 
89 | 
90 | ## Visualization Tips
91 | 
92 | 1. **Tables**: Useful for detailed recording and sharing of actual data
93 | 2. **Frequency Table**: Displays the frequency of each observation in the data set
94 | 3. **Dot Plots**: Useful when you want to convey information that is discrete and individual to each observation.
95 | 4. **Histograms**: Useful when you want to convey frequencies of grouped bins.
96 | 5. **Pie Charts**: Relative frequency distributions are best represented with pie charts. They are also useful for representing distributions among qualitative variables, where histograms wouldn't be a very good choice.
97 | 
--------------------------------------------------------------------------------
/00-Prerequisites/TypesOfData.md:
--------------------------------------------------------------------------------
1 | # Types of Data
2 | 
3 | ### Cross-Sectional Data
4 | 
5 | It consists of variables, either quantitative or qualitative, measured for many different observations (usually called 'cases') taken from some defined population. All the cases are recorded at a single point in time, or over a reasonably short period of time. The techniques commonly used for this kind of data are t-tests, analysis of variance, or regression, depending on the kind and number of variables in the data. It is noteworthy that **each observation of a given variable, or each set of variables, is independent of every other observation in the dataset**. This independence is a critical assumption when modelling cross-sectional data; a small illustration follows below.
6 | 
7 | > It is called cross-sectional data because we are measuring a cross-section of a defined population at a particular point in time or a very short period in time.
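As a small illustration of the point above, here is a minimal sketch that builds a toy cross-sectional dataset and applies one of the techniques mentioned (a two-sample t-test). The column names and values are invented for illustration, and the sketch assumes `pandas` and `scipy` are available.

```python
import pandas as pd
from scipy import stats

# Hypothetical cross-sectional sample: every row is a different case,
# and all cases are measured in the same survey wave, i.e. at one point in time.
survey = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5, 6],
    "group":   ["A", "A", "A", "B", "B", "B"],   # qualitative variable
    "score":   [71, 64, 69, 80, 77, 83],         # quantitative variable
})

# Because each case is assumed independent of every other case, a two-sample
# t-test (or ANOVA / regression for more variables) can be applied directly.
group_a = survey.loc[survey["group"] == "A", "score"]
group_b = survey.loc[survey["group"] == "B", "score"]
print(stats.ttest_ind(group_a, group_b))
```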
8 | 9 | ### Time Series Data 10 | 11 | If measurement on variables are taken over or through time. Every variable in a time series dataset is measured at equally spaced time intervals. Usually, the observations are not independent of each other in this case. Time series data can be classified into two types other thank the univariate and multivariate distinction and they are discussed in the [time series chapter](../11-TimeSeries/README.md). 12 | 13 | -------------------------------------------------------------------------------- /00-Prerequisites/numpyMatrixTutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Numpy Matrix Tutorial" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# Importing Numpy Package \n", 17 | "import numpy as np" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "Let's start with defining matrix. There are multiple ways of creating a matrix using numpy\n", 25 | "package. We will see how to define a same matrix using multiple methods." 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 55, 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "name": "stdout", 35 | "output_type": "stream", 36 | "text": [ 37 | "----This is the first way----\n", 38 | "Printing Matrix A\n", 39 | "[[0 1 2]\n", 40 | " [3 4 5]]\n", 41 | "Printing Matrix B\n", 42 | "[[10 11]\n", 43 | " [12 13]\n", 44 | " [14 15]]\n", 45 | "Shape of Matrix A: (2L, 3L)\n", 46 | "Shape of Matrix B: (3L, 2L)\n", 47 | "----This is the second way----\n", 48 | "Printing Matrix A\n", 49 | "[[1 2 3]\n", 50 | " [4 5 6]]\n", 51 | "Printing Matrix B\n", 52 | "[[10 11]\n", 53 | " [12 13]\n", 54 | " [14 15]]\n", 55 | "Shape of Matrix A: (2L, 3L)\n", 56 | "Shape of Matrix B: (3L, 2L)\n", 57 | "----This is the third way----\n", 58 | "Printing Matrix A\n", 59 | "[[1 2 3]\n", 60 | " [4 5 6]]\n", 61 | "Printing Matrix B\n", 62 | "[[10 11]\n", 63 | " [12 13]\n", 64 | " [14 15]]\n", 65 | "Shape of Matrix A: (2L, 3L)\n", 66 | "Shape of Matrix B: (3L, 2L)\n", 67 | "---------------------\n" 68 | ] 69 | } 70 | ], 71 | "source": [ 72 | "\"\"\"Defining two matricies of shape (2,3) Two Rows and Three Columns \n", 73 | "\n", 74 | "There are two ways of defining a matrix in python.\n", 75 | "\n", 76 | "This is the more popular way of defining a matrix by creating a 1D array and then \n", 77 | "converting it into a matrix.\n", 78 | "\"\"\"\n", 79 | "\n", 80 | "print \"----This is the first way----\"\n", 81 | "\n", 82 | "matA = np.arange(6).reshape(2,3)\n", 83 | "matB = np.arange(10,16).reshape(3,2)\n", 84 | "\n", 85 | "print \"Printing Matrix A\"\n", 86 | "print matA\n", 87 | "print \"Printing Matrix B\"\n", 88 | "print matB\n", 89 | "\n", 90 | "print \"Shape of Matrix A: \",matA.shape\n", 91 | "print \"Shape of Matrix B: \",matB.shape\n", 92 | "\n", 93 | "\"\"\"\n", 94 | "The other less popular way of defining a matrix is by calling \n", 95 | "builtin np.matrix function.\n", 96 | "\n", 97 | "There are two ways the function can be called.\n", 98 | "\n", 99 | "First way, by passing the string argument where rows are delimited by ;\n", 100 | "\"\"\"\n", 101 | "\n", 102 | "print \"----This is the second way----\"\n", 103 | "\n", 104 | "matA = np.matrix('1 2 3;4 5 6')\n", 105 | "matB = np.matrix('10 11;12 13;14 15')\n", 106 | "\n", 107 | "print 
\"Printing Matrix A\"\n", 108 | "print matA\n", 109 | "print \"Printing Matrix B\"\n", 110 | "print matB\n", 111 | "\n", 112 | "print \"Shape of Matrix A: \",matA.shape\n", 113 | "print \"Shape of Matrix B: \",matB.shape\n", 114 | "\n", 115 | "\n", 116 | "# Second way, by passing nested list of rows\n", 117 | "print \"----This is the third way----\"\n", 118 | "\n", 119 | "matA = np.matrix([[1,2,3],[4,5,6]])\n", 120 | "matB = np.matrix([[10,11],[12,13],[14,15]])\n", 121 | "\n", 122 | "print \"Printing Matrix A\"\n", 123 | "print matA\n", 124 | "print \"Printing Matrix B\"\n", 125 | "print matB\n", 126 | "\n", 127 | "print \"Shape of Matrix A: \",matA.shape\n", 128 | "print \"Shape of Matrix B: \",matB.shape\n", 129 | "\n", 130 | "print \"---------------------\"" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "After defining the matrix, manytime there are few information about the matrix you need \n", 138 | "to verify. For example:\n", 139 | " 1. Size of the matrix\n", 140 | " 2. Shape of the matrix\n", 141 | " 3. Dimensions of the matrix\n", 142 | " 4. Data type of the matrix formed\n", 143 | " 5. Total Bytes consumed by the matrix" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 97, 149 | "metadata": {}, 150 | "outputs": [ 151 | { 152 | "name": "stdout", 153 | "output_type": "stream", 154 | "text": [ 155 | "Matrix A: \n", 156 | "[[1 2 3]\n", 157 | " [4 5 6]]\n", 158 | "\n", 159 | "Size of the Matrix: 6\n", 160 | "\n", 161 | "Shape of the Matrix: (2L, 3L)\n", 162 | "\n", 163 | "Dimensions of the Matrix: 2\n", 164 | "\n", 165 | "Data Type of the Matrix: int32\n", 166 | "\n", 167 | "Total size of the Matrix (In Bytes): 24\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "# Let's start with how the matrix looks like\n", 173 | "\n", 174 | "print \"Matrix A: \"\n", 175 | "print matA\n", 176 | "\n", 177 | "print \"\\nSize of the Matrix: \", matA.size\n", 178 | "\n", 179 | "print \"\\nShape of the Matrix: \",matA.shape\n", 180 | "\n", 181 | "print \"\\nDimensions of the Matrix: \", matA.ndim\n", 182 | "\n", 183 | "print \"\\nData Type of the Matrix: \", matA.dtype\n", 184 | "\n", 185 | "print \"\\nTotal size of the Matrix (In Bytes): \", matA.nbytes" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "As we can create a matrix from a 1D array by reshaping it, we can create 1D array from the matrix." 
193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 98, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "Matrix A: \n", 205 | "[[1 2 3]\n", 206 | " [4 5 6]]\n", 207 | "\n", 208 | "1D Array from the matrix (By Attribute): \n", 209 | "[1 2 3 4 5 6]\n", 210 | "\n", 211 | "1D Array from the matrix (By Function): \n", 212 | "[1 2 3 4 5 6]\n", 213 | "[[1 2 3 4 5 6]]\n", 214 | "\n", 215 | "Showing the type of flat iterator object: \n", 216 | "\n", 217 | "Printing Elements: \n", 218 | "1\n", 219 | "2\n", 220 | "3\n", 221 | "4\n", 222 | "5\n", 223 | "6\n", 224 | "\n", 225 | "We can also use ravel() function to convert the matrix to a flattened array\n", 226 | "[[1 2 3 4 5 6]]\n" 227 | ] 228 | } 229 | ], 230 | "source": [ 231 | "print \"Matrix A: \"\n", 232 | "print matA\n", 233 | "\n", 234 | "# By Attribute\n", 235 | "print \"\\n1D Array from the matrix (By Attribute): \"\n", 236 | "print matA.A1\n", 237 | "\n", 238 | "#By Function\n", 239 | "print \"\\n1D Array from the matrix (By Function): \"\n", 240 | "print matA.getA1()\n", 241 | "print matA.flatten()\n", 242 | "\n", 243 | "\n", 244 | "# We can also create a flat iterator object from the matrix similar to 1D array.\n", 245 | "\n", 246 | "matAFlat = matA.flat\n", 247 | "print \"\\nShowing the type of flat iterator object: \",type(matAFlat)\n", 248 | "print \"\\nPrinting Elements: \"\n", 249 | "for i in matAFlat:\n", 250 | " print i\n", 251 | "\n", 252 | "print \"\\nWe can also use ravel() function to convert the matrix to a flattened array\"\n", 253 | "print matA.ravel()\n" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "Now let's explore how we can get basic details like:\n", 261 | " 1. Max/Min value of the matrix\n", 262 | " 2. Index of max/min value in the matrix\n", 263 | " 4. Sum of all values of the matrix\n", 264 | " 5. 
Sum of values across rows/columns" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 132, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "name": "stdout", 274 | "output_type": "stream", 275 | "text": [ 276 | "[[1 2 3]\n", 277 | " [4 5 6]]\n", 278 | "\n", 279 | "--Matrix level operations--\n", 280 | "\n", 281 | "Max Value of the matrix: 6\n", 282 | "\n", 283 | "Min Value of the matrix: 1\n", 284 | "\n", 285 | "Index of max value in the matrix: 5\n", 286 | "\n", 287 | "Index of min value in the matrix: 0\n", 288 | "\n", 289 | "Sum of all elements in the matrix: 21\n", 290 | "\n", 291 | "Standard Deviation of the matrix: 1.70782512766\n", 292 | "\n", 293 | "Mean of the matrix: 3.5\n", 294 | "\n", 295 | "Cumulative Sum along column and row: \n", 296 | "Along Row\n", 297 | "[[1 2 3]\n", 298 | " [5 7 9]]\n", 299 | "Along Column\n", 300 | "[[ 1 3 6]\n", 301 | " [ 4 9 15]]\n", 302 | "\n", 303 | "Cumulative Product along column and row: \n", 304 | "Along Row\n", 305 | "[[ 1 2 3]\n", 306 | " [ 4 10 18]]\n", 307 | "Along Column\n", 308 | "[[ 1 2 6]\n", 309 | " [ 4 20 120]]\n", 310 | "\n", 311 | "--Row/Column/Diagonal level operation--\n", 312 | "\n", 313 | "Max Value of the matrix along the column: [[4 5 6]]\n", 314 | "\n", 315 | "Min Value of the matrix along the row: [[1]\n", 316 | " [4]]\n", 317 | "\n", 318 | "Index of max value in the matrix along the column: [[1 1 1]]\n", 319 | "\n", 320 | "Index of min value in the matrix along the row: [[0]\n", 321 | " [0]]\n", 322 | "\n", 323 | "Print Columns Sum: [[5 7 9]]\n", 324 | "\n", 325 | "Rows Sum: \n", 326 | "[[ 6]\n", 327 | " [15]]\n" 328 | ] 329 | } 330 | ], 331 | "source": [ 332 | "print matA\n", 333 | "\n", 334 | "print \"\\n--Matrix level operations--\"\n", 335 | "print \"\\nMax Value of the matrix: \",matA.max()\n", 336 | "print \"\\nMin Value of the matrix: \",matA.min()\n", 337 | "print \"\\nIndex of max value in the matrix: \",matA.argmax()\n", 338 | "print \"\\nIndex of min value in the matrix: \",matA.argmin()\n", 339 | "print \"\\nSum of all elements in the matrix: \",matA.sum()\n", 340 | "print \"\\nStandard Deviation of the matrix: \",matA.std()\n", 341 | "print \"\\nMean of the matrix: \",matA.mean()\n", 342 | "print \"\\nCumulative Sum along column and row: \"\n", 343 | "print \"Along Row\"\n", 344 | "print matA.cumsum(axis = 0)\n", 345 | "print \"Along Column\"\n", 346 | "print matA.cumsum(axis = 1)\n", 347 | "print \"\\nCumulative Product along column and row: \"\n", 348 | "print \"Along Row\"\n", 349 | "print matA.cumprod(axis = 0)\n", 350 | "print \"Along Column\"\n", 351 | "print matA.cumprod(axis = 1)\n", 352 | "\n", 353 | "\n", 354 | "print \"\\n--Row/Column/Diagonal level operation--\"\n", 355 | "print \"\\nMax Value of the matrix along the column: \",matA.max(axis= 0)\n", 356 | "print \"\\nMin Value of the matrix along the row: \",matA.min(axis = 1)\n", 357 | "print \"\\nIndex of max value in the matrix along the column: \",matA.argmax(axis = 0)\n", 358 | "print \"\\nIndex of min value in the matrix along the row: \",matA.argmin(axis = 1)\n", 359 | "\n", 360 | "\n", 361 | "\"\"\"\n", 362 | "Columnwise and rowise sum of elements\n", 363 | "\n", 364 | "Axis = 0 for columns \n", 365 | "Axis = 1 for rows \n", 366 | "\n", 367 | "Note: Opposite of Pandas \n", 368 | "\"\"\"\n", 369 | "print \"\\nPrint Columns Sum: \",matA.sum(axis = 0)\n", 370 | "print \"\\nRows Sum: \"\n", 371 | "print matA.sum(axis = 1)" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": 
[ 378 | "In this section, we will explore how we can sort elements in the matrix. We can do this\n", 379 | "at matrix level as well as at rows/columns level" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 142, 385 | "metadata": { 386 | "scrolled": true 387 | }, 388 | "outputs": [ 389 | { 390 | "name": "stdout", 391 | "output_type": "stream", 392 | "text": [ 393 | "--Matrix Level Sorting--\n", 394 | "[[ 2 4 6]\n", 395 | " [12 43 56]]\n", 396 | "\n", 397 | "--Row/Column Level Sorting--\n", 398 | "\n", 399 | "Column Wise Sorting\n", 400 | "[[ 4 2 6]\n", 401 | " [43 12 56]]\n", 402 | "\n", 403 | "Row Wise Sorting\n", 404 | "[[ 2 4 6]\n", 405 | " [12 43 56]]\n" 406 | ] 407 | } 408 | ], 409 | "source": [ 410 | "matC = np.matrix([[4,2,6],[43,12,56]])\n", 411 | "\n", 412 | "print \"--Matrix Level Sorting--\"\n", 413 | "print np.sort(matC)\n", 414 | "\n", 415 | "print \"\\n--Row/Column Level Sorting--\"\n", 416 | "print \"\\nColumn Wise Sorting\"\n", 417 | "print np.sort(matC,axis = 0)\n", 418 | "print \"\\nRow Wise Sorting\"\n", 419 | "print np.sort(matC,axis = 1)\n", 420 | "\n", 421 | "\n", 422 | "\n", 423 | "# We can also get indexes of the sorted matrix both at matrix level as well as at \n", 424 | "# row/column level. We can use np.argsort() for this. \n", 425 | "\n", 426 | "\n", 427 | "# print np.argsort(matC,axis = 1)\n", 428 | "\n", 429 | "\n", 430 | "# We can get index of all non zero values in the matrix using nonzero function.\n", 431 | "\n", 432 | "# print matC.nonzero()" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": 70, 438 | "metadata": { 439 | "scrolled": false 440 | }, 441 | "outputs": [ 442 | { 443 | "name": "stdout", 444 | "output_type": "stream", 445 | "text": [ 446 | "Matrix Transpose\n", 447 | "[[1 4]\n", 448 | " [2 5]\n", 449 | " [3 6]]\n", 450 | "-------\n", 451 | "[[1 4]\n", 452 | " [2 5]\n", 453 | " [3 6]]\n", 454 | "-------\n", 455 | "[[1 4]\n", 456 | " [2 5]\n", 457 | " [3 6]]\n", 458 | "-------\n", 459 | "[[1 4]\n", 460 | " [2 5]\n", 461 | " [3 6]]\n", 462 | "\n", 463 | "Conjucate Transpose of a matrix\n", 464 | "\n", 465 | "Complex Matrix\n", 466 | "[[ 3.+2.j 3.+3.j]\n", 467 | " [ 2.+3.j 5.+6.j]]\n", 468 | "\n", 469 | "Conjucate Transpose of the complex matrix\n", 470 | "[[ 3.-2.j 2.-3.j]\n", 471 | " [ 3.-3.j 5.-6.j]]\n", 472 | "----\n", 473 | "[[ 3.-2.j 3.-3.j]\n", 474 | " [ 2.-3.j 5.-6.j]]\n" 475 | ] 476 | } 477 | ], 478 | "source": [ 479 | "# Matrix Transpose \n", 480 | "\n", 481 | "# Three Methods of Transposing a matrix in numpy\n", 482 | "\n", 483 | "print \"Matrix Transpose\"\n", 484 | "\n", 485 | "matATranspose = matA.T\n", 486 | "print matATranspose\n", 487 | "\n", 488 | "print \"-------\"\n", 489 | "\n", 490 | "matATranspose = matA.transpose()\n", 491 | "print matATranspose\n", 492 | "\n", 493 | "print \"-------\"\n", 494 | "\n", 495 | "matATranspose = np.transpose(matA)\n", 496 | "print matATranspose\n", 497 | "\n", 498 | "print \"-------\"\n", 499 | "matATranspose = matA.getT()\n", 500 | "print matATranspose\n", 501 | "\n", 502 | "print \"\"\n", 503 | "print \"Conjucate Transpose of a matrix\"\n", 504 | "\n", 505 | "\"\"\"\n", 506 | "Transpose of a matrix is obtained by rearranging columns into rows, or rows into columns. \n", 507 | "The complex conjugate of a matrix is obtained by replacing each element by its \n", 508 | "complex conjugate (i.e x+iy ⇛ x-iy or vice versa). 
The conjugate transpose is obtained \n", 509 | "by performing both operations on the matrix.\n", 510 | "\"\"\"\n", 511 | "\n", 512 | "matD = np.matrix([[3+2j,3+3j],[2+3j,5+6j]])\n", 513 | "print \"\\nComplex Matrix\"\n", 514 | "print matD\n", 515 | "print \"\\nConjucate Transpose of the complex matrix\"\n", 516 | "\n", 517 | "# By Attribute\n", 518 | "print matD.H\n", 519 | "print \"----\"\n", 520 | "\n", 521 | "# By calling function, this is complex conjuct element-wise operation\n", 522 | "print matD.conjugate()\n", 523 | "\n", 524 | "# By calling function, this is complex conjugate of all the elements\n", 525 | "print matA.conj()\n" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "Misc. Matrix Operations:\n", 533 | " 1. Cliping\n", 534 | " 2." 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": 118, 540 | "metadata": {}, 541 | "outputs": [ 542 | { 543 | "name": "stdout", 544 | "output_type": "stream", 545 | "text": [ 546 | "[[1 2 3]\n", 547 | " [4 5 6]]\n", 548 | "\n", 549 | "Cliping\n", 550 | "[[2 2 3]\n", 551 | " [3 3 3]]\n" 552 | ] 553 | } 554 | ], 555 | "source": [ 556 | "print matA\n", 557 | "\n", 558 | "print \"\\nCliping\"\n", 559 | "\n", 560 | "\"\"\"\n", 561 | "Given an interval, values outside the interval are clipped to the interval edges. \n", 562 | "For example, if an interval of [0, 1] is specified, values smaller than 0 become 0, \n", 563 | "and values larger than 1 become 1.\n", 564 | "\"\"\"\n", 565 | "print matA.clip(2,3)" 566 | ] 567 | }, 568 | { 569 | "cell_type": "markdown", 570 | "metadata": {}, 571 | "source": [ 572 | "Matrix operations: \n", 573 | " 1. Matrix Addition\n", 574 | " 2. Matrix Substraction\n", 575 | " 3. Matrix Multiplication\n", 576 | " 4. Matrix Division\n", 577 | "\n", 578 | "The way I approach this section is, first I will do (matrix,matrix) level operations then will do (matrix,scaler) level operations also known as Broadcasting. 
" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": 173, 584 | "metadata": {}, 585 | "outputs": [ 586 | { 587 | "name": "stdout", 588 | "output_type": "stream", 589 | "text": [ 590 | "Printing Matrix A and B\n", 591 | "[[1 2 3]\n", 592 | " [4 5 6]]\n", 593 | "-----\n", 594 | "[[10 11]\n", 595 | " [12 13]\n", 596 | " [14 15]]\n", 597 | "\n", 598 | "\n", 599 | "--Matrix Addition--\n", 600 | "\n", 601 | "(Matrix,Matrix) Level Operation\n", 602 | "Approach 1 (Using Unary Operation):\n", 603 | "[[11 14 17]\n", 604 | " [15 18 21]]\n", 605 | "Approach 2 (Using np.add()):\n", 606 | "[[11 14 17]\n", 607 | " [15 18 21]]\n", 608 | "\n", 609 | "(Matrix,Scaler) Level Operation\n", 610 | "Approach 1 (Using Unary Operation):\n", 611 | "[[101 102 103]\n", 612 | " [104 105 106]]\n", 613 | "Approach 2 (Using np.add()):\n", 614 | "[[101 102 103]\n", 615 | " [104 105 106]]\n", 616 | "\n", 617 | "\n", 618 | "--Matrix Subtraction--\n", 619 | "\n", 620 | "(Matrix,Matrix) Level Operation\n", 621 | "Approach 1 (Using Unary Operation):\n", 622 | "[[ -9 -10 -11]\n", 623 | " [ -7 -8 -9]]\n", 624 | "Approach 2 (Using np.subtract()):\n", 625 | "[[ -9 -10 -11]\n", 626 | " [ -7 -8 -9]]\n", 627 | "\n", 628 | "(Matrix,Scaler) Level Operation\n", 629 | "Approach 1 (Using Unary Operation):\n", 630 | "[[-99 -98 -97]\n", 631 | " [-96 -95 -94]]\n", 632 | "Approach 2 (Using np.subtract()):\n", 633 | "[[-99 -98 -97]\n", 634 | " [-96 -95 -94]]\n", 635 | "\n", 636 | "\n", 637 | "--Matrix Multiplication--\n", 638 | "\n", 639 | "(Matrix,Matrix) Level Operation\n", 640 | "Approach 1 (Using Unary Operation):\n", 641 | "[[ 76 82]\n", 642 | " [184 199]]\n", 643 | "Approach 2 (Using np.dot()):\n", 644 | "[[ 76 82]\n", 645 | " [184 199]]\n", 646 | "\n", 647 | "(Matrix,Scaler) Level Operation\n", 648 | "Approach 1 (Using Unary Operation):\n", 649 | "[[100 200 300]\n", 650 | " [400 500 600]]\n", 651 | "Approach 2 (Using np.dot()):\n", 652 | "[[100 200 300]\n", 653 | " [400 500 600]]\n", 654 | "\n", 655 | "\n", 656 | "--Matrix Subtraction--\n", 657 | "\n", 658 | "(Matrix,Matrix) Level Operation\n", 659 | "Approach 1 (Using Unary Operation):\n", 660 | "[[ -9 -10 -11]\n", 661 | " [ -7 -8 -9]]\n", 662 | "Approach 2 (Using np.subtract()):\n", 663 | "[[ -9 -10 -11]\n", 664 | " [ -7 -8 -9]]\n", 665 | "\n", 666 | "(Matrix,Scaler) Level Operation\n", 667 | "Approach 1 (Using Unary Operation):\n", 668 | "[[-99 -98 -97]\n", 669 | " [-96 -95 -94]]\n", 670 | "Approach 2 (Using np.subtract()):\n", 671 | "[[-99 -98 -97]\n", 672 | " [-96 -95 -94]]\n", 673 | "\n", 674 | "\n", 675 | "--Matrix Division--\n", 676 | "\n", 677 | "(Matrix,Matrix) Level Operation\n", 678 | "Approach 1 (Using Unary Operation):\n", 679 | "[[0 0 0]\n", 680 | " [0 0 0]]\n", 681 | "Approach 2 (Using np.divide()):\n", 682 | "[[0 0 0]\n", 683 | " [0 0 0]]\n", 684 | "\n", 685 | "(Matrix,Scaler) Level Operation\n", 686 | "Approach 1 (Using Unary Operation):\n", 687 | "[[0 0 0]\n", 688 | " [0 0 0]]\n", 689 | "Approach 2 (Using np.divide()):\n", 690 | "[[0 0 0]\n", 691 | " [0 0 0]]\n" 692 | ] 693 | } 694 | ], 695 | "source": [ 696 | "scalerValue = 100\n", 697 | "\n", 698 | "print \"Printing Matrix A and B\"\n", 699 | "print matA\n", 700 | "print \"-----\"\n", 701 | "print matB\n", 702 | "\n", 703 | "print \"\\n\\n--Matrix Addition--\"\n", 704 | "\n", 705 | "print \"\\n(Matrix,Matrix) Level Operation\"\n", 706 | "print \"Approach 1 (Using Unary Operation):\"\n", 707 | "assert matA.shape == matB.T.shape\n", 708 | "matC = matA+matB.T\n", 709 | "print matC\n", 710 | 
"\n", 711 | "print \"Approach 2 (Using np.add()):\"\n", 712 | "matC = np.add(matA,matB.T)\n", 713 | "print matC\n", 714 | "\n", 715 | "\n", 716 | "print \"\\n(Matrix,Scaler) Level Operation\"\n", 717 | "print \"Approach 1 (Using Unary Operation):\"\n", 718 | "matC = matA+scalerValue\n", 719 | "print matC\n", 720 | "\n", 721 | "print \"Approach 2 (Using np.add()):\"\n", 722 | "matC = np.add(matA,scalerValue)\n", 723 | "print matC\n", 724 | "\n", 725 | "\n", 726 | "print \"\\n\\n--Matrix Subtraction--\"\n", 727 | "\n", 728 | "print \"\\n(Matrix,Matrix) Level Operation\"\n", 729 | "print \"Approach 1 (Using Unary Operation):\"\n", 730 | "assert matA.shape == matB.T.shape\n", 731 | "matC = matA-matB.T\n", 732 | "print matC\n", 733 | "\n", 734 | "print \"Approach 2 (Using np.subtract()):\"\n", 735 | "matC = np.subtract(matA,matB.T)\n", 736 | "print matC\n", 737 | "\n", 738 | "\n", 739 | "print \"\\n(Matrix,Scaler) Level Operation\"\n", 740 | "print \"Approach 1 (Using Unary Operation):\"\n", 741 | "matC = matA-scalerValue\n", 742 | "print matC\n", 743 | "\n", 744 | "print \"Approach 2 (Using np.subtract()):\"\n", 745 | "matC = np.subtract(matA,scalerValue)\n", 746 | "print matC\n", 747 | "\n", 748 | "print \"\\n\\n--Matrix Multiplication--\"\n", 749 | "\n", 750 | "print \"\\n(Matrix,Matrix) Level Operation\"\n", 751 | "print \"Approach 1 (Using Unary Operation):\"\n", 752 | "assert matA.shape == matB.T.shape\n", 753 | "matC = matA*matB\n", 754 | "print matC\n", 755 | "\n", 756 | "print \"Approach 2 (Using np.dot()):\"\n", 757 | "matC = np.dot(matA,matB)\n", 758 | "print matC\n", 759 | "\n", 760 | "print \"\\n(Matrix,Scaler) Level Operation\"\n", 761 | "print \"Approach 1 (Using Unary Operation):\"\n", 762 | "matC = matA*scalerValue\n", 763 | "print matC\n", 764 | "\n", 765 | "print \"Approach 2 (Using np.dot()):\"\n", 766 | "matC = np.dot(matA,scalerValue)\n", 767 | "print matC\n", 768 | "\n", 769 | "print \"\\n\\n--Matrix Subtraction--\"\n", 770 | "\n", 771 | "print \"\\n(Matrix,Matrix) Level Operation\"\n", 772 | "print \"Approach 1 (Using Unary Operation):\"\n", 773 | "assert matA.shape == matB.T.shape\n", 774 | "matC = matA-matB.T\n", 775 | "print matC\n", 776 | "\n", 777 | "print \"Approach 2 (Using np.subtract()):\"\n", 778 | "matC = np.subtract(matA,matB.T)\n", 779 | "print matC\n", 780 | "\n", 781 | "\n", 782 | "print \"\\n(Matrix,Scaler) Level Operation\"\n", 783 | "print \"Approach 1 (Using Unary Operation):\"\n", 784 | "matC = matA-scalerValue\n", 785 | "print matC\n", 786 | "\n", 787 | "print \"Approach 2 (Using np.subtract()):\"\n", 788 | "matC = np.subtract(matA,scalerValue)\n", 789 | "print matC\n", 790 | "\n", 791 | "print \"\\n\\n--Matrix Division--\"\n", 792 | "\n", 793 | "print \"\\n(Matrix,Matrix) Level Operation\"\n", 794 | "print \"Approach 1 (Using Unary Operation):\"\n", 795 | "assert matA.shape == matB.T.shape\n", 796 | "matC = matA//matB.T\n", 797 | "print matC\n", 798 | "\n", 799 | "print \"Approach 2 (Using np.divide()):\"\n", 800 | "matC = np.divide(matA,matB.T)\n", 801 | "print matC\n", 802 | "\n", 803 | "print \"\\n(Matrix,Scaler) Level Operation\"\n", 804 | "print \"Approach 1 (Using Unary Operation):\"\n", 805 | "matC = matA//scalerValue\n", 806 | "print matC\n", 807 | "\n", 808 | "print \"Approach 2 (Using np.divide()):\"\n", 809 | "matC = np.divide(matA,scalerValue)\n", 810 | "print matC" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": null, 816 | "metadata": {}, 817 | "outputs": [], 818 | "source": [] 819 | } 820 | ], 821 | 
"metadata": { 822 | "kernelspec": { 823 | "display_name": "Python 2", 824 | "language": "python", 825 | "name": "python2" 826 | }, 827 | "language_info": { 828 | "codemirror_mode": { 829 | "name": "ipython", 830 | "version": 2 831 | }, 832 | "file_extension": ".py", 833 | "mimetype": "text/x-python", 834 | "name": "python", 835 | "nbconvert_exporter": "python", 836 | "pygments_lexer": "ipython2", 837 | "version": "2.7.13" 838 | } 839 | }, 840 | "nbformat": 4, 841 | "nbformat_minor": 2 842 | } 843 | -------------------------------------------------------------------------------- /01-Regression/00-RegressionModelsComparison.md: -------------------------------------------------------------------------------- 1 | # Performance Evaluation 2 | 3 | ## R-Squared 4 | 5 | There are a few terms that need to be understood before we can calculate the R-squared value for any regression model and are as follows: 6 | 7 | 1. **Sum of squared residuals** ![ssres](http://mathurl.com/ycattv9v.png) : It is exactly what the name suggests. A residual is the difference between actual values and the predicted values. If we add the squares of all residuals, that would give us the value of ![ssres](http://mathurl.com/ycattv9v.png), shown as follows in mathematical terms: 8 | ![](http://mathurl.com/yawhk57s.png) 9 | 10 | 2. **Total sum of squares** ![sstot](http://mathurl.com/y7hyn3w4.png): Instead of calculating the residuals, we calculate the difference between the average **y** values and the actual **y** values present in the dataset. We then add the squares of these values, represented mathematically as : 11 | ![](http://mathurl.com/yazstja5.png) 12 | 13 | R-sqaure is given by the following formula ![rsq](http://mathurl.com/ybf2xwp2.png) 14 | 15 | The idea is to minimize ![ssres](http://mathurl.com/ycattv9v.png) so as to keep the value of ![rsq](http://mathurl.com/3kwwdyh.png) as close as possible to 1. The calculation essentially gives us a numeric value as to how good is our regression line, as compared to the average value. 16 | 17 | ## Adjusted R-Squared 18 | 19 | The value of ![rsq](http://mathurl.com/3kwwdyh.png) is considered to be better as it gets closer to 1, but there's a catch to this statement. The ![rsq](http://mathurl.com/3kwwdyh.png) value can be artifically inflated by simply adding more variables. This is a problem because the complexity of the model would increase due to this and would result in overfitting. The formulae for Adjusted R-squared is mathematically given as: 20 | 21 | ![](http://mathurl.com/yclkhq5z.png) 22 | 23 | where **p** is the number of regressors and **n** is the sample size. Adjusted R-squared has a penalizing factor that reduces it's value when a non-significant variable is added to the model. 24 | 25 | > **p-value** based backward elimination can be useful in removing non-significant variables that we might have added in our model initially. (full process explained in **Linear Regression** section of the book. 26 | 27 | -------------------------------------------------------------------------------- /01-Regression/01-LinearRegression.md: -------------------------------------------------------------------------------- 1 | # Linear Regression 2 | 3 | The idea is to fit a straight line in the n-dimensional space that holds all our observational points. This would constitute forming an equation of the form **y = mx + c**. Because we have multiple variables, we might need to extend this **mx** to be **m1x1**, **m2x2** and so on. 
This extension results in the following mathematical representation of the relationship between the independent and dependent variables:
4 | 
5 | ![eq](http://mathurl.com/y8eahwj3.png)
6 | 
7 | where
8 | 
9 | 1. **y** = the dependent variable/outcome
10 | 2. **x1 to xn** are the independent variables
11 | 3. **b0 to bn** are the coefficients of the linear model
12 | 
13 | A linear regression model rests on the following assumptions:
14 | 
15 | 1. Linearity
16 | 2. Homoscedasticity
17 | 3. Multivariate normality
18 | 4. Independence of errors
19 | 5. Lack of multicollinearity
20 | 
21 | If these assumptions do not hold, then linear regression probably isn't the model for your problem.
22 | 
23 | 
24 | 
25 | ## Variable Selection
26 | 
27 | The dataset will, more often than not, contain columns that have no effect on the dependent variable. This is problematic because we don't want to add unnecessary noise to the model. Variables that do not affect the dependent variable (the outcome) will usually only decrease the performance of our model, and even when they don't, they add computational complexity. This can increase the cost of building the model, especially when we want to do it iteratively.
28 | 
29 | There are five ways of doing feature selection while building a model:
30 | 
31 | 1. **All-in**: If we know that all the variables in the dataset are useful and required for building the model, then we simply use every variable available to us.
32 | 2. **Backward Elimination**: The following steps are taken to conduct backward-elimination feature selection (a code sketch of this loop follows at the end of this section):
33 | 	1. We set a significance level (SL) for a feature to stay in the model
34 | 	2. Fit the model with all possible predictors
35 | 	3. Consider the predictor with the highest p-value. If P > SL, go to **Step 4**; otherwise go to **Step 6**
36 | 	4. Remove the variable
37 | 	5. Fit the model again, and move to **Step 3**
38 | 	6. Finish
39 | 3. **Forward Selection**: Although it may seem like a straightforward procedure, it is quite intricate in practice. The following steps are performed to build a linear regressor using forward selection:
40 | 	1. We set a significance level (SL) to enter the model
41 | 	2. Fit a regression model with each one of the independent variables and select the one with the lowest p-value
42 | 	3. Fit all possible regression models consisting of the variable(s) selected so far plus one additional variable
43 | 	4. Consider the model with the lowest p-value. If P < SL, go to **Step 3**; otherwise go to **Step 5**
44 | 	5. Finish and select the second-to-last model
45 | 4. **Bidirectional Elimination**: The algorithm works as follows:
46 | 	1. Select a significance level to enter and a significance level to stay in the model, viz. SLENTER and SLSTAY
47 | 	2. Perform the next step of Forward Selection, i.e. any new variable to be added must have p < SLENTER to enter the model
48 | 	3. Perform all the steps of Backward Elimination, i.e. all existing variables must have p < SLSTAY to stay in the model
49 | 	4. Stop when no new variables can be added and no existing variables can be removed
50 | 
51 | > The details on variable selection by **Score Comparison** are yet to be found.
52 | 
53 | The **lower the p-value** is, the **more important** a particular variable is for our model.
54 | 
55 | > The term **'stepwise regression'** is often used for 2, 3, and 4, but sometimes it refers only to 4, depending on context.
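As referenced in the Backward Elimination item above, here is a minimal sketch of that elimination loop. It is illustrative only: the significance level, toy data and variable names are assumptions, and `OLS` is imported here from `statsmodels.api` (older tutorials import it from `statsmodels.formula.api`, as noted in the CODE REFERENCES further below).

```python
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    """Repeatedly fit OLS and drop the predictor with the highest p-value above sl."""
    # statsmodels' OLS does not add an intercept by itself, so prepend a column of 1s
    X = sm.add_constant(X)
    while True:
        model = sm.OLS(y, X).fit()
        p_values = np.asarray(model.pvalues)
        worst = p_values.argmax()
        if p_values[worst] > sl:
            X = np.delete(X, worst, axis=1)   # remove that column and refit
        else:
            return model

# Hypothetical toy data: only columns 0 and 2 actually drive y
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 5)
y_toy = 3 * X_toy[:, 0] + 0.5 * X_toy[:, 2] + rng.normal(scale=0.1, size=100)
print(backward_elimination(X_toy, y_toy).summary())
```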
56 | 57 | **Dummy variables:** 58 | The variables that are created when categorical variables are encoded, are called dummy variables. 59 | 60 | We usually use one-hot encoding to do this, and it might seem like not including one last categorical dummy variable would cause a positive bias towards the rest of the equation but this is not the case. The coefficient for the last dummy variable is included in the b0 term of the equation. 61 | 62 | *Dummy Variable trap*: One can never have all the dummy variables and b0 in a model at the same time. We need to remove at least one dummy variable for each of the corresponding categorical variables because all of that will be modeled into b0. 63 | 64 | ## Measure of Accuracy 65 | 66 | ### Mean Squared Error 67 | The root mean squared error in a linear regression problem is given by the equation ![mse](http://mathurl.com/y9brzcnn.png) which is the sum of squared differences between the actual value ![actualValue](http://mathurl.com/kt496dt.png) and the predicted value ![yhat](http://mathurl.com/yc3fp4p7.png) for each of the rows in the dataset (index iterated over `i`). 68 | 69 | ## Intuition (Univariate Linear Regression) 70 | 71 | ### Minimizing the error term we have above 72 | We do so by going through the following steps: 73 | 74 | 1. We write the equation ![mse](http://mathurl.com/y9brzcnn.png) again but we replace ![yhat](http://mathurl.com/yc3fp4p7.png) with the equation of the line that we are to predict. Let's say ![predictionLine](http://mathurl.com/y94r3wvh.png) where we don't know the values of ![a](http://mathurl.com/25elof5.png) and ![b](http://mathurl.com/25js5ug.png) yet. 75 | 76 | With this, our updated equation or the error term becomes ![error](http://mathurl.com/yd5kfvsb.png). 77 | 78 | 2. We now need to minimize the error term ![E](http://mathurl.com/y82dzd23.png) with respect to ![a](http://mathurl.com/25elof5.png) and ![b](http://mathurl.com/25js5ug.png) both. For this we use calculus method of partial derivatives. 79 | 3. We calculate partial derivative of ![E](http://mathurl.com/y82dzd23.png) w.r.t. ![a](http://mathurl.com/25elof5.png) and that can be written as: 80 | 81 | ![partialDiffA](http://mathurl.com/yb8wutve.png) ![parDifA](http://mathurl.com/y8sa4nwz.png) `-1` 82 | 83 | Now we calculate partial derivate of the same equation w.r.t. ![b](http://mathurl.com/25js5ug.png) and that can be written and simplified as: 84 | 85 | ![parDifB](http://mathurl.com/yb8v6uar.png) ![parDifB2](http://mathurl.com/y98a7qgf.png) `-2` 86 | 87 | 4. In order to minimize we equate these equations to zero and solve the equations: 88 | 89 | By solving `Eq 1`, we get 90 | 91 | ![eq1zero](http://mathurl.com/y7r55sjx.png) `-3` 92 | 93 | By solving `Eq 2`, we get 94 | 95 | ![eq2zero](http://mathurl.com/yd2uztuy.png) `-4` 96 | 97 | 5. Now we have two equations `3` and `4`. We can use these to solve for ![a](http://mathurl.com/25elof5.png) and ![b](http://mathurl.com/25js5ug.png), upon doing so we get the following values: 98 | 99 | ![valA](http://mathurl.com/y8leyvd3.png) `-5` 100 | 101 | ![valB](http://mathurl.com/ycq57l2z.png) `-6` 102 | 103 | 6. Now that we have these equations, we can divide boh tops and bottoms by N, so that all our summation terms can be turned into means. For instance ![](http://mathurl.com/y8lfwpmw.png). 
We can divide the equations `5` and `6` with ![n2](http://mathurl.com/ycsnzgo2.png) to get the following results: 104 | 105 | ![ares](http://mathurl.com/yae3fw4d.png), 106 | 107 | ![bres](http://mathurl.com/ybnzy6jd.png) 108 | 109 | 110 | ## Intuition (Multivariate Linear Regression) 111 | 112 | ### Base equation 113 | 114 | A multivariate Linear Regression can be represented as 115 | 116 | ![multilinreg](http://mathurl.com/ybupzufq.png) 117 | 118 | where ![yhat](http://mathurl.com/oz8dctm.png) is the list of predictions, ![wt](http://mathurl.com/yckejlne.png) is the vector of weights for each variable, ![x](http://mathurl.com/4dpgym.png) is the set of parameters 119 | 120 | 121 | ## CODE REFERENCES 122 | 123 | The function used for backward elimination in such models is from the class `statsmodels.formula.api` and is titled OLS (used as sm.OLS). The important thing to remember about this function is the fact that it requires the users to add a column of 1s in the matrix of features (at the beginning of the matrix) explicitly, to be a partner for the **b0** coefficient. 124 | 125 | -------------------------------------------------------------------------------- /01-Regression/02-PolynomialRegression.md: -------------------------------------------------------------------------------- 1 | # Polynomial Linear Regression 2 | 3 | This regression allows us to regress over dependent variable(s) that has a polynomial relationship with the independent variables generally represented as 4 | 5 | ![polyRegEq](http://mathurl.com/yd39otyo.png) 6 | 7 | There are powers to the same variable that now jointly represent the relationship between dependent and independent variables. 8 | 9 | > One question that may be pertinent here is that why is it still called 'Linear' when it clearly displays a Polynomial relationship. The reason for that is, the word Linear corresponds to the relationship of the various coefficients and not the 'x' term itself. 10 | 11 | ## CODE REFERENCES 12 | 13 | ### Python 14 | We would often not bother ourselves with feature scaling because most libraries have in-built support for the same. For instance `python from sklearn.linear_model import LinearRegression` does feature scaling on its own. 15 | 16 | An example of creating a polynomial model in Python is 17 | 18 | 19 | ```python 20 | from sklearn.preprocessing import PolynomialFeatures 21 | poly_reg = PolynomialFeatures(degree = 2) 22 | 23 | # The fit_transform method used here first fits the poly_ref object to X 24 | # and then transforms it to X_poly 25 | X_poly = poly_reg.fit_transform(X) 26 | # This transformation leads to addition of column of 1 for b0 element and 27 | # the the polynomial terms (depending on our choice of 'degree' parameter 28 | # for the original variables of X. 29 | 30 | poly_reg1 = LinearRegression() 31 | poly_reg1.fit(X_poly, y) 32 | ``` 33 | 34 | -------------------------------------------------------------------------------- /01-Regression/03-SupportVectorRegression.md: -------------------------------------------------------------------------------- 1 | # Support Vector Regression 2 | 3 | This method is regression equivalent of classification using [Support Vector Machines](../02-Classification/03-SupportVectorMachines.md). 4 | 5 | ### Basic Principle 6 | 7 | Much like Support Vector Machines in general, the idea of SVR, is to find a plane(linear or otherwise) in a plane that allows us to make accurate predictions for future data. 
The regression is done in a way that all the currently available datapoints fit in an error width given by ![epsilon_small](http://mathurl.com/ybr3ffkc.png). This allows us to find a plane which fits the data best and then this can be used to make future predictions on more data. 8 | 9 | > Sometimes a soft error margin ![zeta](http://mathurl.com/yblyw4t5.png) may also be used to include additional points that lie outside the error margin ![epsilon_small](http://mathurl.com/ybr3ffkc.png). 10 | 11 | ![svr_general_principle](./img/svr1.png) 12 | 13 | ### Nonlinearity 14 | 15 | #### Nonlinearity by preprocessing 16 | 17 | One way to application of SVR to represent nonlinear distributions is to map the training instances ![xi](http://mathurl.com/2az2c7m.png) using a map ![mapnonlinear](http://mathurl.com/ycgepdup.png) into some feature space ![F](http://mathurl.com/2apnvu5.png). 18 | 19 | > This is obviously a computationaly expensive method and would require an understanding of the data and its relationship with the dependent variable before any actions can be performed since the mapping function needs to be figures out. 20 | 21 | #### Implicit mapping via kernels 22 | 23 | All the usual kernels that can be used with SVM classifiers, can be used in SVR as well. Code Reference [1] provides link to implementation and performance evaluations for various options available to be used as kernels. 24 | 25 | ### Features 26 | 27 | 1. The implicit bias of the RBF based SVR model allows it to deal efficiently with outliers 28 | 2. The penalty parameter makes it the best choice for problems where data is high risk and noisy. 29 | 30 | ## RESEARCH REFERENCES 31 | 32 | 1. **Support vector regression**; [Basak, Debasish, Srimanta Pal, and Dipak Chandra Patranabis](https://www.researchgate.net/profile/Mohamed_Mourad_Lafifi/post/Hi_could_anyone_tell_how_the_Epsilon-SVR_perform_the_regression_in_Support_Vector_Machines_SVM/attachment/59d6467c79197b80779a181a/AS:458289034076160@1486276028968/download/Review+Support+Vector+Regression.pdf); Neural Information Processing-Letters and Reviews 11.10 (2007): 203-224. 33 | 34 | **[SOLVED]** Review of various techniques, future scope in Support Vector Regression. 35 | 36 | 2. **A tutorial on support vector regression**; [Alex J. Smola and Bernhard Scḧolkopf](http://lasa.epfl.ch/teaching/lectures/ML_Phd/Notes/nu-SVM-SVR.pdf) 37 | 38 | **[SOLVED]** The basic principle of Support Vector Regression. 39 | 40 | 41 | ## CODE REFERENCES 42 | 43 | 1. **Numpy Based SVR**: [This article](http://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html) provides various tutorials for implementation of SVR using various kernel options. 44 | 45 | > The penalty parameter **C** is used to regulate the problem of overfitting in SVR models. It can be tuned using Grid search or any better method for reducing the possibilitiy of overfitting. 46 | 47 | > **Note** There is no implicit feature scaling in SVR Python, we need to do it explicity using 48 | 49 | -------------------------------------------------------------------------------- /01-Regression/README.md: -------------------------------------------------------------------------------- 1 | # Regression 2 | 3 | ## Introduction 4 | 5 | The idea of looking at a lot of data samples and trying to predict the dependent variable in a continuous numeric domain is called regression in statistical terms. 6 | 7 | ### Assumptions 8 | 9 | In order to perform regression on any dataset, it must satisfy the following assumptions: 10 | 11 | 1. 
**Normality**: The erros are assumed to be normally distributed 12 | 2. **Independent**: The errors must be independent of each other 13 | 3. **Mean and Variance**: They must have zero mean and constant variance (this property of having a **constant variance** is also called **homoscedasticity** 14 | 15 | These assumptions are usually verified using Q-Q Plots, S-W test etc. 16 | 17 | This chapter offers introduction to various kind of regressions and their use cases. 18 | 19 | 1. [Linear Regression](./01-LinearRegression.md) 20 | 2. [Polynomial Regression](./02-PolynomialRegression.md) 21 | 3. [Support Vector Regression](./03-SupportVectorRegression.md) 22 | 23 | ## ARTICLE REFERENCES 24 | 25 | 1. [Q-Q Plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot) 26 | -------------------------------------------------------------------------------- /01-Regression/img/svr1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/01-Regression/img/svr1.png -------------------------------------------------------------------------------- /02-Classification/01-LogisticRegression.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/02-Classification/01-LogisticRegression.md -------------------------------------------------------------------------------- /02-Classification/02-knn.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/02-Classification/02-knn.md -------------------------------------------------------------------------------- /02-Classification/03-SupportVectorMachines.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/02-Classification/03-SupportVectorMachines.md -------------------------------------------------------------------------------- /02-Classification/04-NaiveBayes.md: -------------------------------------------------------------------------------- 1 | # Naive Bayes 2 | 3 | ## Intuition 4 | 5 | The Naive Bayes classifier works on Bayes Theorem. It is a generative machine learning algorithm, that is, unlike Logistic Regression and quite a few others which try to separate the two classes in the n-dimensional space, it rather tries to focus on one of the classes at a time and tries to place the new data point in one of these class's areas to classify it. In simple words, it tries to build a model for each one of the classes and then when a new data point arrives, it checks that which model describes the features of the new data points most aptly. 6 | 7 | ### Bayes Theorem 8 | 9 | The Bayes' Theorem suggests that it is possible to calculate the probability of occurence of some event, based upon the previous knowledge of something else having occurred already. This idea generalizes the concept of **conditional probability**. 
The following formula allows us to calculate the conditional probability: 10 | 11 | ![bayes](http://mathurl.com/yd5wgr73.png) 12 | 13 | where: 14 | 15 | ![ph](http://mathurl.com/ydeqdel2.png) is the probability of hypthesis H being true 16 | 17 | ![pe](http://mathurl.com/y85hwz3k.png) is the probability of event E regardless of H 18 | 19 | ![peh](http://mathurl.com/ycu2debl.png) is the probability of event E provided that H is true 20 | 21 | ![phe](http://mathurl.com/y6wymudv.png) is the probability of hypothesis H being true provided that event E has occurred. 22 | 23 | ## NB-Classifier Algorithm 24 | 25 | This is supervised machine learning classifier because we'll be classifying things on the basis of previous information provided to us. The algorithm treats each feature in the input data to be independent from each other which could be a setback in some cases, and is the reason why its called **Naive Bayes**. The fundamental assumption, that all variables are independent, can be wrong and even then it's applied and gives decent result. 26 | 27 | ### Structure 28 | 29 | The equation would become something like this for finding the probability that a particular set of features **X** corresponds to the class, **Class 1**: 30 | 31 | ![bayesml](http://mathurl.com/y7c6qgbp.png) 32 | 33 | where the notations used are as follows: 34 | 35 | 1. ![priorprob](http://mathurl.com/y76nhwxz.png) is called **prior probability** 36 | 2. ![marginallikelihood](http://mathurl.com/ycd77fxh.png) is called **marginal likelihood** 37 | 3. ![likelihood](http://mathurl.com/y7l54x5t.png) is called **likelihood** 38 | 4. ![posteriorprob](http://mathurl.com/yaenwjzj.png) is called the **posterior probability** 39 | 40 | and are usually calculated in that order. 41 | 42 | ### Procedure 43 | 44 | 1. The procedure for calculating ![priorprob](http://mathurl.com/y76nhwxz.png) is pretty straight forward. We take the number of occurrences of **Class 1** and divide it by the total number of observations in our dataset. Therefore 45 | 46 | ![priorprobformula](http://mathurl.com/yckqgn5y.png) 47 | 48 | 2. The calculation of ![marginallikelihood](http://mathurl.com/ycd77fxh.png) is dependent on the choice made by the user as well. The following steps need to be performed for calculation of marginal likelihood: 49 | 1. We define an n-dimensional **proximity range** in the feature space with a radius/width/span of **r** 50 | 2. Then we are going to look at all the points that are enclosed in the range 51 | 3. These points will now be deemed similar to the new data point that needs to be classified. 52 | 4. The marginal likelihood ![marginallikelihood](http://mathurl.com/ycd77fxh.png) is therefore the probability of any new random point being added to fall into this range that we just selected given by 53 | 54 | ![marginallikelihoodformula](http://mathurl.com/y8s9mo57.png) 55 | 56 | 3. ![likelihood](http://mathurl.com/y7l54x5t.png) represents the likelihood of an observation that belongs to **Class 1** exhibiting the **Feature Set X**. 57 | 58 | 1. We define an n-dimensional **proximity range** in the feature space with a radius/width/span of **r** 59 | 2. Then we are going to look at all the points that are enclosed in the range 60 | 3. These points will now be deemed similar to the new data point that needs to be classified. 61 | 4. 
The probability that needs to be calculated is the probability that a randomly selected datapoint will be within the **proximity range**, provided that we know that it already belongs to **Class 1**. From here, this become pretty straight forward to calculate and is given by 62 | 63 | ![likelihoodformula](http://mathurl.com/yd4pdpnw.png) 64 | 65 | ### Types 66 | 67 | 1. **Multinomial NB**: This type of algorithm is usually used for counts like classification problems in NLP 68 | 2. **Gaussian NB**: Details to go here. 69 | 70 | 71 | ### Advantages 72 | 73 | 1. Simple to build and quite intuitive 74 | 2. Easily trained upon small datasets 75 | 3. Works fast once trained and even the training time is not too much 76 | 4. Not very sensitive to irrelevant features 77 | 5. Since it works on probabilities and has one underlying bias, it can often make correct classification in scenarios where SVM or K-NN aren't very useful because they're linear classifiers. 78 | 79 | ### Disadvantages 80 | 81 | 1. Assumption that every feature is independent, which is rarely the case. 82 | 83 | -------------------------------------------------------------------------------- /02-Classification/05-DecisionTree.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/02-Classification/05-DecisionTree.md -------------------------------------------------------------------------------- /02-Classification/06-RandomForest.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/02-Classification/06-RandomForest.md -------------------------------------------------------------------------------- /02-Classification/07-HiddenMarkovModels.md: -------------------------------------------------------------------------------- 1 | 05 -------------------------------------------------------------------------------- /02-Classification/README.md: -------------------------------------------------------------------------------- 1 | # Classification 2 | 3 | 1. [Logistic Regression](./01-LogisticRegression.md) 4 | 2. [K-Nearest Neighbours](./02-knn.md) 5 | 3. [Support Vector Machines](./03-SupportVectorMachines.md) 6 | 4. [Naive Bayes](./04-NaiveBayes.md) 7 | 5. [Decision Tree Classification](./05-DecisionTree.md) 8 | 6. [Random Forest Classification](./06-RandomForest.md) 9 | 10 | ## Types of Classification Algorithms 11 | 12 | 1. **Discriminative Learning Algorithms** 13 | 14 | These algorithms try to learn the probability of an end result **Y** for a given feature set **X**. These algorithms try to determine how **Y** is directly a function of **X**. Mathematically these are shown as 15 | 16 | ![pyx](http://mathurl.com/cqd6fro.png) 17 | 18 | Some of these algorithms try to learn a hypothesis that tries to predict the possible classes, mathematically represented as 19 | 20 | ![binarhypothesis](http://mathurl.com/ya46sgrc.png) 21 | 22 | 2. **Generative Learning Algorithms** 23 | 24 | This type of Algorithms try to learn, the probability of a given set of features **X** for a particular class **Y** (mathematically represented as ![pxy](http://mathurl.com/bwse6yv.png)) and also, the probability of occurrence of this class **Y** (the probability of occurrence of a given class is represented as ![py](http://mathurl.com/byg852g.png) and is called **class prior**. 
The most popular example of such algorithms is the [Naive Bayes Algorithm](./04-NaiveBayes.md). -------------------------------------------------------------------------------- /03-Clustering/01-K-meansClustering.md: -------------------------------------------------------------------------------- 1 | # K-means Clustering 2 | 3 | This is a classic clustering algorithm that relies on the concept of centroids and their Euclidean distances from the observed data points. The basic concept works on the following set of rules: 4 | 5 | 1. Assign a fixed number of centroids randomly in the parameter space (the number of centroids will define the number of clusters formed at the end of execution of the algorithm). These centroids need not be one of the points in the observation set, and can literally be random coordinates in the multi-dimensional space that we have. 6 | 2. Calculate the closest centroid from each data point in the observation set and assign the data point to that centroid's cluster. 7 | 3. Move the centroid to the 'center-of-mass' of the cluster that it has created with help of our data points from observation set. 8 | 4. Repeat **Step 2** and see if any points have changed their clusters, from the ones they were previously assigned. If the condition holds true then move to **Step 3** otherwise proceed to **Step 5**. 9 | 5. Finish 10 | 11 | Although, the algorithm might appear like its' construction of clusters based on the distances, this assumptions is untrue. The **k-means clustering algorithm works primarily on minimizing the intra-cluster variance** and that is the reason why metric of computation for accuracy of a k-means cluster is WCSS (within-cluster sum of squares). 12 | 13 | #### Objective Function for Soft K-means 14 | 15 | The objective function for K-means clustering is given by: 16 | 17 | ![objectiveFunction](http://mathurl.com/y8t3jlk3.png) 18 | 19 | This equation is the sum of squared distances weighted by the responsibilities. This means that if ![xn](http://mathurl.com/y72f5olt.png) is far away from cluster ![k](http://mathurl.com/2bhf5kb.png), that responsibility should be very low. This is an example of coordinate descent, which means that we are moving in the direction of a smaller *J* with respect to only one variable at a time. As we have already established, although with each iteration, we converge towards a smaller J, there is absolutely no guarantee that it will converge to a global minimum. 20 | 21 | > It is interesting to observe that the k-means clustering algorithm relies on Euclidean distances for formation of clusters and computation of intra-cluster variation. This is an implicit underlying bias of the algorithm and can be exploited for other kinds of correlations between the attributes by transforming them into Euclidean distances. [Click here](https://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric/81494#81494) for a more detailed explanation regarding this. 22 | 23 | **Drawbacks**: As it should be already obvious at this point, the selection of random points could cause serious problems because this randomness would let the algorithm to figure out different clusters than the ones that are actually present in the hyper-dimensional space. This sensitivity to initialization can be alleviated to some extent by the use of following techniques: 24 | 25 | 1. Run the algorithm multiple times and choose the centroids that give us the best cost 26 | 2. 
Soft K-means algorithm: This allows for each point to be assigned to more than one cluster at a time, allowing for a probability distribution over the centroids. 27 | 3. K-means++ algorithm: 28 | 29 | #### Soft K-means algorithm 30 | The Soft K-means algorithm works as follows: 31 | 32 | 1. initialize m1 ... mx as random points 33 | 2. Calculate cluster responsibilities using ![eq1](http://mathurl.com/ycg9zqtp.png) 34 | 3. Calculate new means using ![eq2](http://mathurl.com/ybjvnln7.png) 35 | 4. If converged, goto **Step 5**, else goto **Step 2** 36 | 5. Finish 37 | 38 | 39 | ## RESEARCH ARTICLES 40 | 41 | 1. **K-Means++: The advantages of careful seeding**; [David Arthur, Sergei Vassilvitskii](http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf) 42 | 43 | **[Solved]** The random selection of centroids would often let the k-means algorithm figure out different clusters than the ones that are actually present in the hyper-dimensional space. 44 | 45 | This algorithm selects only the first centroid at random and then picks the remaining centroids by using a probability distribution function over the data points. (**Section 2.2**) 46 | 47 | 2. **A comparative study between fuzzy clustering algorithm and Hard Clustering algorithm**; [Dibya Joti Bora, Dr. Anil Kumar Gupta](https://arxiv.org/ftp/arxiv/papers/1404/1404.6059.pdf) 48 | 49 | **[Solved]** Sensitivity to random start of K-means has been alleviated to some extent using fuzzy clustering. Any point does not fully belong to one cluster and there's a probability of over the asignment of any point to a cluaster. 50 | 51 | 52 | ## CODE REFERENCES 53 | 54 | Read the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) for more 55 | -------------------------------------------------------------------------------- /03-Clustering/02-HierarchicalClustering.md: -------------------------------------------------------------------------------- 1 | # Hierarchical Clustering 2 | 3 | The idea behind this clustering algorithm is similar to K-means clustering but is different in quite a few aspects. It works as follows: 4 | 5 | 1. Assign each data point in the dataset a single cluster of its own 6 | 2. Take 2 closest data points in the dataset and join them to make 1 cluster (resulting in *n-1* clusters) 7 | 3. Take two closest clusters and make them 1 cluster (resulting in n-2 clusters) 8 | 4. Repeat **Step 3** until there is only one cluster 9 | 5. Finish 10 | 11 | One might find the criterion for selection of closest clusters in **Step 3** to be ambiguous as a concept and unclear from the algorithm above. It is because, although it is based on Euclidean distances between clusters, there are multiple options within Euclidean distances when we are talking about more than two points, and we are discussing clusters here. There are four kinds of Euclidean distances that can be used for the selection of closest clusters and they are defined as follows: 12 | 13 | 1. Distance between closest points 14 | 2. Distance between farthest points 15 | 3. Mean of distances between all the points 16 | 4. Distance between centroids of the two clusters 17 | 18 | ### Dendrograms 19 | 20 | ![Dendrograms](dendrograms.png) 21 | 22 | A graph that allows us to remember what we did through the course of the entire algorithm. They typically represent the points and progression of clustering using HC algorithm. The height in a Dendrogram represents the distance between any two given points. 
In the above figure the points 10 and 2 are the closest to each other and form the first cluster reducing the size from 10 clusters in Step 1 to 9 clusters in Step 2. This keeps on progressing as we go up the dendrogram, and merge clusters on the way 23 | 24 | ### Dissimilarity Threshold 25 | The dissimilarity threshold is the horizontal bar that we set for obtaining clusters from a dendrogram. In the above example we set it to be 0.56 which means that any clusters with distance more than 0.56 should not be merged together to form any more clusters. This leaves us with 5 clusters containing points (3), (7, 6), (1, 4), (9, 5, 8) and (10, 2) respectively. 26 | 27 | Intuitively, the final number of clusters should always be equal to the number of lines passing by the Dissimilarity Threshold of the final dendrogram. 28 | 29 | Selecting the correct number of clusters is a decision that can be easily made on the basis of height of each of the vertical lines. The longest vertical lines usually go on to form the worst clusters. 30 | 31 | > It is important to note that the vertical lines that we consider for selecting the correct number of clusters, are not the original vertical lines but the lines that were formed after we extend all the horizontal lines as well. So the steps would be to extend all horizontal lines infinitely and then look for the longest vertical line possible. This distance would give us the optimal number of clusters. 32 | 33 | 34 | -------------------------------------------------------------------------------- /03-Clustering/03-GaussianMixtureModels.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/03-Clustering/03-GaussianMixtureModels.md -------------------------------------------------------------------------------- /03-Clustering/README.md: -------------------------------------------------------------------------------- 1 | # Clustering Algorithms 2 | 3 | In real world, data does not always come with labels and in such cases, we need to create our own clusters/labels in order to make some groupings in the data. This is called unsupervised learning. The following algorithms are used to do this: 4 | 5 | 1. [K-Means Clustering](./01-K-meansClustering.md) 6 | 2. [Hierarchical Clustering](./02-HierarchicalClustering.md) 7 | 3. [Gaussian Mixture Models](./03-GaussianMixtureModels.md) 8 | 9 | There are a few other terms that are commonly associated with unsupervised learning as application areas. 10 | ### Density Estimation 11 | The process of taking a sample of data and estimating the probability density function of a random variable is called density estimation. Once the distribution is learnt, one can generate samples that look like they are coming from the same distribution. 12 | 13 | ### Latent Variables 14 | Underlying cause in the form of latent or missing variables can be figured out using the unsupervised learning techniques. 15 | 16 | ### Dimensionality Reduction 17 | This topic is covered in depth, in a separate section of the book. Please refer to the index page for more details regarding the same. 
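To make the K-means and hierarchical clustering ideas in this chapter concrete, here is a minimal base-R sketch. The synthetic two-blob data, the choice of two centroids and the cut height of 3 are assumptions made only for illustration; `kmeans()`, `dist()`, `hclust()` and `cutree()` are the standard functions from the built-in `stats` package.

```r
# A minimal base-R sketch (synthetic data and parameter choices are assumptions)
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))   # two well-separated blobs in 2-D

# K-means: nstart reruns the algorithm from several random centroid initializations
# and keeps the solution with the lowest within-cluster sum of squares (WCSS)
km <- kmeans(x, centers = 2, nstart = 25)
km$tot.withinss                              # the WCSS used to judge the fit

# Agglomerative (hierarchical) clustering on Euclidean distances
hc <- hclust(dist(x), method = "complete")   # "complete" = distance between farthest points
plot(hc)                                     # the dendrogram
cutree(hc, h = 3)                            # cut at a dissimilarity threshold of 3
cutree(hc, k = 2)                            # or ask directly for two clusters
```

Setting `nstart` above 1 is the base-R counterpart of the "run the algorithm multiple times and keep the best cost" remedy mentioned in the K-means notes, and `cutree(hc, h = ...)` applies the dissimilarity threshold discussed in the dendrogram section.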
-------------------------------------------------------------------------------- /03-Clustering/dendrograms.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/03-Clustering/dendrograms.png -------------------------------------------------------------------------------- /04-AssociationRuleLearning/01-Apriori.md: -------------------------------------------------------------------------------- 1 | # Apriori 2 | 3 | The Apriori algorithm has three parts titled **support**, **confidence** and **lift**. 4 | 5 | **Support** 6 | 7 | The support for a movie M is defined by the following equation 8 | 9 | ![eq1](http://mathurl.com/y9g59dlu.png) 10 | 11 | **Confidence** 12 | 13 | The confidence is a metric calculated for any given pair of elements. We test rules using this method because it gives us an idea about what fraction of people who have watched M1, have also watched M2. In other words, how frequently do the two movies appear together in users' watchlist. 14 | 15 | ![eq2](http://mathurl.com/ybqbxvm2.png) 16 | 17 | **Lift** 18 | 19 | The lift is the confidence divided by support as shown in the following equation. It is the likelihood of a person liking M2, when they are chosen at random from another sample of the same distribution and have M1 in their watchlist. The higher the lift, the higher is the probability of a person who has watched M1 watching M2. 20 | 21 | ![eq3](http://mathurl.com/yasxs8o5.png) 22 | 23 | The algorithm then proceeds in the following steps: 24 | 25 | 1. Set a minimum support and confidence, mainly due to immense number of combinations 26 | 2. Take all subsets in the transactions having more support than the minimum support 27 | 3. Take all subsets in the transactions having more confidence than the minimum confidence 28 | 4. Sort the rules by decreasing order of lift 29 | 30 | > It must be kept in mind that the values of Support, Life and Confidence may seem mathematical in the equations above, but are experimental in nature. We choose a value for the parameters, run some the algorithm and then change the value of those parameters and run the algorithm again. We base these values on the empirical data, i.e. the set of rules obtained in this example. 31 | 32 | ## CODE REFERENCES 33 | 34 | * It is interesting to note that the Apriori function that is used in Python does not expect a data frame, instead, expects a list of lists that contain various different transactions. This is quite natural as well, because a horizontal or vertical data frame does not make much sense when it is about business transactions or movies watched. The reason for this is, people have often watched different number of movies, read different number of books etcetera. Applying a rectangular structure on such data and then performing any manual operations upon the same would often cause a lot of data to be missing for it to make any sense or structure at all. 35 | 36 | * R is a better language for this algorithm 37 | -------------------------------------------------------------------------------- /04-AssociationRuleLearning/02-Eclat.md: -------------------------------------------------------------------------------- 1 | # Eclat 2 | 3 | In the eclat model, we only have support. When we calculate the support, in an Eclat model, we are consider the prevalence of a set of items and not individual models. 
This makes sense because in case of Eclat models, since we only have support, the individual items is just the frequency of the items and nothing more than that. 4 | 5 | ![eq1](http://mathurl.com/y9g59dlu.png) 6 | 7 | The algorithm, as one would intuitively assume it to, works as follows: 8 | 9 | 1. Set a minimum support 10 | 2. Select all subsets of transactions having support more than the minimum support 11 | 3. Sort these subsets by decreasing order of support 12 | 13 | ## CODE REFERENCES 14 | 15 | * R is a better language for this algorithm as well 16 | -------------------------------------------------------------------------------- /04-AssociationRuleLearning/README.md: -------------------------------------------------------------------------------- 1 | # Association Rule Learning 2 | 3 | Association Rule Learning is a set of unsupervised machine learning algorithms that usually try to find 'not-so-apparent' rules within various sets of things. 4 | 5 | 1. [Apriori Algorithm](./01-Apriori.md) 6 | 2. [Eclat Algorithm](./02-Eclat.md) -------------------------------------------------------------------------------- /05-ReinforcementLearning/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/05-ReinforcementLearning/README.md -------------------------------------------------------------------------------- /06-NaturalLanguageProcessing/README.md: -------------------------------------------------------------------------------- 1 | # Natural Language Processing 2 | 3 | 4 | 5 | ## Models 6 | 7 | ### Bag of Words Model 8 | 9 | #### Preprocessing 10 | 11 | Usually this would involve the following steps: 12 | 13 | 1. Removal of special characters 14 | 2. Change case for all words to lowercase 15 | 3. Do stemming and remove the words that don't matter 16 | 4. Do tokenization by using the following procedure 17 | 18 | #### Procedure 19 | 20 | This is the most basic model for Natural Language Processing. It works as follows: 21 | 22 | 1. Select all distinct words and set them as columns of a matrix 23 | 2. Create as many rows as the number of observations 24 | 3. Each cell should now contain for each observation, the number of times the word appeared in it 25 | 4. Filter out other non-relevant words by setting a minimum frequency that a word must have across all observations in order for it to stay, 26 | 5. Train this matrix for the classes available in a similar fashion to classification. 27 | 28 | > * The matrix created in **Step 3** in the example above is called the **Term Document Matrix** 29 | 30 | > * The commonly used models for classification on NLP based problems are **Naive Bayes**, **Decision Tree Learning** & **Random Forest** classification but others maybe used as well if they fit the data well. 31 | 32 | #### Types of BoW Models 33 | 34 | 1. **Word Frequency Proportions** ![wf](http://mathurl.com/yabrpcku.png): The proportions of each word with respect to all the words in the document. It is usually given by the equation: 35 | 36 | ![wfequation](http://mathurl.com/yat29v9d.png) 37 | 38 | 2. **Raw Word count**: The raw word count for every word is taken as such. 39 | 3. **Binary**: if a word appears then 1, else 0 40 | 4. **TF-IDF**: Term Frequency - Inverse Document Frequency (takes into account the words that appear. 
41 | 42 | ## Use Cases 43 | 44 | ### Parts of Speech Tagging 45 | 46 | This is the segment of NLP that deals with tagging various elements of a sentence with the parts of speech group that they belong to. 47 | 48 | ### Name-Entity Recognition 49 | 50 | In the sentence `Albert Einstein is a genius`, the ability of an algorithm to decipher that **Abert Einstein** is a person, is called **NER**, and is another application of **NLP**. The results are provided in form parse tree. 51 | 52 | ### Latent Semantic Analysis 53 | 54 | There are often more advanced problems that researchers of **NLP** face, like **synonymy**(multiple words having the same meaning) and **polysemy**(one word having multiple meanings). A common technique used for this is the creation of **Latent Variables**. 55 | 56 | ### Article Spinning 57 | 58 | This is the art of changing certain aspects of a particular article that it appears to be a different one and avoids plagiarism. This algorithm works on the principal of Bayesian ML. A popular technique that we use for this is called **Trigram Model**. It models each word in the document as 59 | 60 | ![trigram](http://mathurl.com/ydevd32r.png) 61 | 62 | Then, we replace each word in the document with a certain probability, P. If we change literally every word then there will be high chance for the document to make no sense at all. Therefore it's essential that we only change some words and let others be. 63 | 64 | > Both LSA and Trigram Models are unsupervised algorithms(have no class labels) and tend to learn the structure of our data. 65 | 66 | ## Jargon 67 | 68 | ### TF-IDF: Term Frequency - Inverse Document Frequency 69 | 70 | Taking into account the words that appear most frequently in many documents and therefore neglecting words like is, an, the etcetera. 71 | 72 | ### Tokenization 73 | 74 | Split all the different sentences into different words, each word gets a column which would contain the frequency of appearance of that word. This would be a sparse matrix that we would then operate upon it at will. 75 | 76 | ### Stemming & Lemmatization 77 | 78 | The process of collecting only the roots of words so that even if the same words appear in different forms, we always have a steady output and our machine learning models learn to recognize them properly and as the same words. While stemming is the more basic/crude version of the above, Lemmatization is more sophisticated than that. While a **stemmer** might give you the word `theiv` after stemming the word `theives`, the **lemmatizer** will give you `theif` as the answer. 79 | 80 | ### Latent Variables 81 | 82 | The idea to combine words that often appear together using a probabilistic distribution over the terms. After this, these pairs of variables with very correlation will then be used to revamp the data accordingly, and hopefully the dimensions of this new data will be much lesser than the original one. Although, this would almost certainly help solve the Synonymy problem but it's not quite proven whether or not it helps solve the polysemy problem adn I think, ideally, it shouldn't. 83 | 84 | ### Corpus 85 | 86 | A corpus is a collection of texts of the same type. 87 | 88 | ### Sparsity 89 | 90 | The property that a matrix holding a lot of zeroes and very few values, is called sparsity. This is usually not a good thing and can be reduced, either by removing the least frequent words by using the `max_features` parameter of `CountVectorizer` method (in Python) or using **Dimensionality Reduction** techniques. 
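As a rough, hedged illustration of the Bag-of-Words ideas above, the base-R sketch below builds a small term-document matrix and derives the weighting variants described earlier (raw counts, binary, word-frequency proportions, TF-IDF). The three-document toy corpus is invented, and the `log(N / document frequency)` form of IDF is just one common convention; in practice one would typically reach for `CountVectorizer`/`TfidfVectorizer` in Python or the `tm` package in R.

```r
# Toy corpus (assumed purely for illustration)
docs <- c("the movie was great great fun",
          "the movie was boring",
          "what a great film")

# Tokenization after minimal preprocessing (lowercase, split on whitespace)
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Term-document matrix: one row per document, one column per distinct word
tdm <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
rownames(tdm) <- paste0("doc", seq_along(docs))

tdm                                   # raw word counts
tdm > 0                               # binary variant
tdm / rowSums(tdm)                    # word-frequency proportions

# One common TF-IDF convention (assumed): tf * log(N / document frequency)
idf   <- log(nrow(tdm) / colSums(tdm > 0))
tfidf <- sweep(tdm / rowSums(tdm), 2, idf, `*`)
round(tfidf, 3)

mean(tdm == 0)                        # fraction of zero cells, i.e. the sparsity
```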
91 | 92 | ## CODE REFERENCES 93 | 94 | [1] **CountVectorizer**: It can be used to all the text cleaning necessary for the Bag of Words model, instead of having do to it manually. Read [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for more. -------------------------------------------------------------------------------- /07-DeepLearning/README.md: -------------------------------------------------------------------------------- 1 | # Neural Networks 2 | 3 | Neural Networks are the basis of most commonly used, and hyped machine learning algorithms. Deep Learning, NLP and many other branches have stemmed from them and there's a great deal of content to be learnt about them. 4 | 5 | ## Key Terms 6 | 7 | ### Neurons 8 | 9 | It is the basic building block of any neural network. The connection between two neurons is called a Synapse and it is where the signal is being passed. In a real world neuron (as shown in the figure below) the receptors are called **dendrites** and the transmitter is called the **axon**. Usually there is many dendrites and just one axon per neuron which makes sense, because a neuron should be able to receive a lot of information but the output from it should be clear and unambiguous. 10 | 11 | ![neuron](./src/img/neuron.png) 12 | 13 | Every neuron is composed of two parts: 14 | 15 | 1. **Summation Function**: In a neural network, the summation operator sums all of a node's inputs to create a net input. It is mathematically represented as ![summationFunction](http://mathurl.com/ybwg6p3a.png) 16 | 2. **Activation Function**: When a set of inputs is added into one (by the summation function) and the final input that is generated, it is processed using an **activation function** in order to be passed out as output from the neuron. This is the second component of any neuron in the hidden or output layer. The following list contains the names, and brief introductions of the predominant Activation Functions, used in the industry: 17 | 18 | (i) **Threshold Function**: The threshold function is the most basic activation function in practice. It basically says that if the value provided by summation function is less than a given threshold then return **0**, else return **1**. ![thresholdFunction](./src/img/threshold.png) 19 | 20 | (ii) **Sigmoid Function**: This function is a bit more interesting than the one above. It is given by the equation 21 | 22 | ![sigmoidFunction](http://mathurl.com/y9ws95tn.png) 23 | 24 | and can be plotted as 25 | 26 | ![sigmoidFunctionPlot](./src/img/sigmoid.png) 27 | 28 | In the graph above, **x** is the sum of inputs. It is a smooth function and has a gradual progression hence making it useful in predicting probabilities. 29 | 30 | (iii) **Rectifier Function**: Another very popular activation function used for neural networks is the rectifier function. It returns **0** up to a certain threshold and then starts returning the original value obtained from the input function. It is shown in the image below 31 | 32 | ![rectifierFunction](./src/img/rectifier.png) 33 | 34 | (iv) **Hyperbolic Tangent Function**: This function is similar to the Sigmoid function but the major difference is that it goes from **-1** to **1**. 
It is given by the equation 35 | 36 | ![hyperbolicTangentFunction](http://mathurl.com/ybyg8nsg.png) 37 | 38 | The graph of this function is: 39 | 40 | ![hyperbolicTangentFunctionPlot](./src/img/hyperbolictanf.png) 41 | 42 | > There are many other kinds of activation functions and probably, in due time, we might add a separate section dedicated to cover each one of those in detail and their possible applications, but these four are the most popular ones and necessary to start with. 43 | 44 | ### Layers 45 | 46 | There are three kinds of layers in any neural network: 47 | 48 | 1. **Input Layer**: This is the layer that is responsible for receiving the input observations one at a time. It is imperative (or normalize, subject to use case) that the input values are standardized before they are passed into the neural network. 49 | 2. **Hidden Layer**: This layer is responsible for all the heavylifting and is the matrix that is configured through our various observations. 50 | 3. **Output Layer**: This is the neuron layer that gives whatever output is required, as per our definition. If it is a regression problem then the output will have only one neuron, while in case of a classification problem, we will have as many neurons in the output layer, as are the number of classes. 51 | 52 | You can see the structure of all available types of neural networks, [here](./src/img/neuralnetworks.png). 53 | 54 | Any hidden neuron, in the context of machine learning, takes its input from other neurons which are at the receiving end of data, i.e. sensors. The outputs from these **input layer neurons** serve as the input for the hidden layer neurons. 55 | 56 | ### Synapses 57 | 58 | Weights or Synapses are very important in neural networks as they are the controlling elements of the information that is being passed along. They basically control to what extent does a signal gets passed along? 59 | 60 | ## Working Procedure 61 | 62 | The way that Neural Networks work can be seen in the image below 63 | 64 | ![nn](./src/img/nn.png) 65 | 66 | This allows us to have a detailed and complete understanding of the Neural Network's individual units. 67 | 68 | 1. The yellow circles constitute the input layer, i.e. the layers that will be recieving the observations. 69 | 2. The yellow arrows show the trajectory of every observation in our dataset 70 | 3. **w1, w2 and w3** are the weights that we are trying to find. 71 | 4. 72 | 73 | > The weights are the only things that we intend to find/learn in a neural network 74 | 75 | ## RESEARCH REFERENCES 76 | 77 | 1. 
**Deep Sparse Rectifier neural networks**; [Xavier Glorot et al.](https://www.utc.fr/~bordesan/dokuwiki/_media/en/glorot10nipsworkshop.pdf) 78 | 79 | **[Solved]** The explanation for why the rectifier function is such an important asset for neural networks and often perform better over hyperbolic tangent activation functions.**(Activation Functions.Rectifier Function)** 80 | 81 | 82 | -------------------------------------------------------------------------------- /07-DeepLearning/src/img/hyperbolictanf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/07-DeepLearning/src/img/hyperbolictanf.png -------------------------------------------------------------------------------- /07-DeepLearning/src/img/neuralnetworks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/07-DeepLearning/src/img/neuralnetworks.png -------------------------------------------------------------------------------- /07-DeepLearning/src/img/neuron.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/07-DeepLearning/src/img/neuron.png -------------------------------------------------------------------------------- /07-DeepLearning/src/img/nn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/07-DeepLearning/src/img/nn.png -------------------------------------------------------------------------------- /07-DeepLearning/src/img/rectifier.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/07-DeepLearning/src/img/rectifier.png -------------------------------------------------------------------------------- /07-DeepLearning/src/img/sigmoid.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/07-DeepLearning/src/img/sigmoid.png -------------------------------------------------------------------------------- /07-DeepLearning/src/img/threshold.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/07-DeepLearning/src/img/threshold.png -------------------------------------------------------------------------------- /08-DimensionalityReduction/01-PrincipalComponentAnalysis.md: -------------------------------------------------------------------------------- 1 | # Principal Component Analysis 2 | 3 | Principal Component Analysis, essentially takes as input, the original dimensions, i.e. the m independent variables originally defined and then produces as output p new variables or dimensions. These are new extracted independent variables are the ones that explain the most variance of the dataset regardless of the dependent variables. 
The fact that we consider PCA as an unsupervised learning model, is the fact that it does not consider the dependent variable in the final model that it builds. 4 | 5 | ### Functioning 6 | 7 | The algorithm works on the basis of vector-matrix multiplication of the form ![pcaeq](http://mathurl.com/yd787etp.png) where ![q](http://mathurl.com/lzhndgq.png) is a matrix and ![x](http://mathurl.com/4dpgym.png) is a vector. 8 | 9 | > **scalar * vector** = another vector, same direction 10 | 11 | > **matrix * vector** = another vector, possibly different direction 12 | 13 | So we could say, that PCA rotates the original input vectors. 14 | 15 | ### Advantages 16 | 17 | There are various benefits to using PCA: 18 | 19 | 1. Decorrelates all the input data. 20 | 2. Data is ordered by information content after the transformation 21 | 3. Dimensionality Reduction (duh!) 22 | 23 | > Removing data does not always mean that the prediction ability will go down. In many cases the data is too noisy and dimensionality reduction proves to be an essential tool to smooth the data and generalize it. 24 | 25 | 26 | 27 | ## RESEARCH REFERENCES 28 | 29 | * [Principal Component Analysis and Factor Analysis](http://www.sciencedirect.com/science/article/pii/0169743987800849) 30 | 31 | **[Explains]** The difference between Principle Component Analysis and Factor Analysis. 32 | -------------------------------------------------------------------------------- /08-DimensionalityReduction/README.md: -------------------------------------------------------------------------------- 1 | # Dimensionality Reduction 2 | 3 | There are two ways to reduce the dimensions of any data set. Either you select some features and discard the others or you create some new features that allow you concisely represent the same amount of information as your existing features. 4 | 5 | |Feature Selection | Feature Extraction| 6 | |:----------------:|:-----------------:| 7 | |Backward Elimination| PCA| 8 | | Forward Selection| LDA | 9 | | Bidirectional Elimination| Kernel PCA | 10 | |Score Comparison|| 11 | 12 | The things that we will cover in this section are as follows: 13 | 14 | 1. [Principal Component Analysis](./01-PrincipalComponentAnalysis.md) 15 | 2. LDA -------------------------------------------------------------------------------- /09-RecommendationEngines/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/09-RecommendationEngines/README.md -------------------------------------------------------------------------------- /10-ModelSelectionAndBoosting/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ashishpatel26/ML-Notes-in-Markdown/0c7b02cec21ba2204295fd3f8dbafe9316610ee9/10-ModelSelectionAndBoosting/README.md -------------------------------------------------------------------------------- /11-TimeSeries/01-Introduction.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | How is time series different than any other data? It is because it contains a variable *Time*. It is an continuous ordered variable though. Therefore, we would need to anchor ourselves at some point, and we could call this anchor 0 (denoting the present), then we can represent the future as T+1, T+2... and the past as T-1, T-2 ... 
In order to build a model and have accurate predictions, we would usually want to have 50 or more degrees of freedom. 4 | > Note: Degrees of freedom are usually pieces of independent information. Calculating a mean or any other non-parameter and including it in the model would cause us to lose a degree of freedom. 5 | 6 | # Key Terms 7 | 8 | There are a few key terms that are commonly encountered when working with Time Series, these are inclusive of but not exclusively described in the following list: 9 | 10 | **Trend**: A time series is supposed to have an upward or downward trend if the plot shows a constant slope when plotted. It is noteworthy that for a pattern to be a trend, it needs to be consistent through the series. If we start getting upwards and downwards trend in the same plot, then it could be an indication of cyclical behavior. 11 | ![TS with an upward trend](https://www.dtreg.com/uploaded/pageimg/TsTrend_1.jpg) 12 | 13 | **Sesonality**: If a TS has a regular repeating pattern based on the calendar or clock, then we call it to have seasonality. Soemtimes there can be different amount of seasonal variation across years, and in such cases it is ideal to take the **log** of values in order to get a better idea about seasonality. It is noteworthy that Annual Data would rarely have a seasonal component. 14 | ![TS with a seasonal component](http://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/_images/image5.png) 15 | 16 | **Cyclical TS**: Cyclical patterns are irregular based on calendar or clock. In these cases there are trends in the Time Series that are of irregular lengths, like business data, there can be revenue gain or loss due to a bad product but how long the upward or downward trend would last, is unknown beforehand. Often, when a plot shows cyclical behavior, it would be a combination of big and small cycles. Also, these are very difficult to model because there is no specific pattern and we do not know what would happen next. 17 | ![TS with cyclical patterns](http://slideplayer.com/8134442/25/images/8/Components+of+Time+Series+Data.jpg) 18 | 19 | **Autocorrelation/Serial Correlation**: The dependence on the past in case of a univariate time series is called autocorrelation. 20 | 21 | # Types of Time Series 22 | 23 | The two major types of time series are as follows: 24 | 25 | > Most of the theory that has been developed was developed for stationary time series. 26 | 27 | 1. **Stationary**: Such time series have constant mean and reasonably constant variance through time. eg. annual rainfall 28 | 2. **Non-Stationary**: If we have a non-stationary time series then there are additional patterns in our series, like Trends, Seasonality or Cycles. These do not necessarily have a constant mean variance through time. 29 | 30 | a. **Univariate Time Series**: We usually use regression technique to model non-stationary time series and we also have to consider the time series aspects (like autocorrelation i.e. dependence on the past and differencing, i.e. the technique used in TS to try and get rid of autocorrelation). **If we have a Time Series with trend, seasonality and non-constant variance, then it is best to log the data which would then allow us to stabilize the variance, hence the name 'Variance Stablizing Transformation' This allows us to fit our data in additive models, making it a very useful technique.** 31 | 32 | b. **Multivariate Time Series**: Many variables measured through time, typically measured on the same time points. 
In some other cases, they may be measured on different time points but within an year. Such data would need to be analyzed using multiple regression models. We might also want to fit the time variable in the model (after transformations maybe) but this could allow us to determince whether there was any **trend** in the response and capture it. If you do not find any trend in the data, then the time variable can be chucked out but **modelling the data without checking for the time variable would be a horrible idea**. 33 | 34 | > More observations is often a good thing because it allows greater degrees of freedom. (Clarification needed, will read a few paper to better this concept) 35 | 36 | 3. **Panel Data/Logitudinal Data**: Panel data is a combination of cross-sectional data and time series. There can be two kinds of time logging for panel data, **implicit** and **explicit**. In econometrics and finance studies, the time is **implicitly** marked which in areas like medical studies **explicit** marking of time is done on Panel data. The idea is to take a cross-sectional measurements at some point in time and then repeat the measurements again and again to form a time series. 37 | 38 | 4. **Continuous Time Models**: In this kind of data, we have a Point(Time component) process and a Stochastic (Random) process. For instance, the number of people arriving at a particular restaurant is an example of this where the groups of people ae arriving at certain points in time(point process) and the size of the group is random (stochastic process). 39 | 40 | > NOTE: No matter what a time series data is about, on the axis (while modeling) we always use a sequence of numbers starting from 1, up to the total number of observations. This allows us to focus on the sequence of measurements because we already know the frequency the time interval between any two observations. 41 | 42 | # Features of Time Series 43 | 44 | A few key features to remember when modelling time series are as follows: 45 | 46 | 1. Past will affect the future but the future cannot affect the past 47 | 2. There will rarely be any independence between observations, temperature, economics and all other major fields have dependence on the past in case of Time Series data 48 | 3. The pattern is often exploited to predict and build better models 49 | 4. Stationary techniques can be used to model non-stationary time series, and also to remove trends, seasonality, and sometimes even cycles. 50 | 5. We can also try to decompose time series into various components, model each one of them separately and then forecast each of the components and then combine again. 51 | 6. we also use Moving Averages (also known as filtering or smoothing) because it tries to remove the seasonal component of time series. 52 | 7. We would often also need to transform the data before it can be used. The most common time we need to do transformations in Time Series is when we have an **increasing seasonal variation** because that represents a multiplicative model, which can be hard to explain. 53 | 54 | # Autocorrelation 55 | 56 | The general concept of dependence on the past is called autocorrelation in statistics. If we are going to use regression models for mdoeling non-stationary data, we need to ensure that they satisfy the [underlying assumptions necessary](./01-Regression/README.md) for regression models, i.e. normally distributed, zero mean, constant variance in the residual plot. 
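A small, hedged sketch of checking those residual assumptions in base R follows; the simulated trend-plus-noise series and its parameters are assumptions made only for illustration.

```r
# Assumed toy series: a linear trend plus noise, indexed 1..n as in these notes
set.seed(1)
n       <- 120
t_index <- 1:n
y       <- 0.5 * t_index + rnorm(n, sd = 3)

fit <- lm(y ~ t_index)            # regression with the time index as a predictor
res <- residuals(fit)

mean(res)                         # should be (numerically) zero
plot(t_index, res, type = "l")    # look for clustering (positive autocorrelation)
abline(h = 0, lty = 2)            # or oscillation (negative autocorrelation)
qqnorm(res); qqline(res)          # Q-Q plot for the normality assumption
```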
57 | 58 | ## Variance, Covariance and Correlation 59 | 60 | ### Variance 61 | The population variance denoted by ![population_variance](http://mathurl.com/yahkk9wb.png) is calculated using the sample variance. For any given sample Y, it's variance is calculated as follows: 62 | 63 | ![var_eq1](http://mathurl.com/ychzr2lm.png) 64 | 65 | > Note: Variance is a special case of covariance, it just means how the sample of observations varies with itself. 66 | 67 | ### Covariance 68 | 69 | It tells us how two varaibles vary with each other. Simply put, are the large values of X related to large values in Y or are the large values in X related to small values in Y? It is calculated as follows: 70 | 71 | ![covariance_equation](http://mathurl.com/y9edhbnx.png) 72 | 73 | ### Correlation 74 | 75 | Standardized covariance is called **correlation**, therefore it always lies between -1 and 1. It is represented as 76 | 77 | ![sd_eq1](http://mathurl.com/yatuzo44.png) 78 | 79 | where ![sd_x](http://mathurl.com/y9afaslq.png) represented the standard deviation for x and ![sd_y](http://mathurl.com/y8zvoe5j.png) represents the standard deviation for y. 80 | 81 | 82 | Autocorrelation takes place for univariate time series. 83 | 84 | ### Sample Autovariance 85 | 86 | Covariance is with respect to itself in this case, for the past values that have occurred in time and is represented as follows. 87 | 88 | ![ac_corr](http://mathurl.com/ycrptg55.png) 89 | 90 | ### Sample Autocovariance 91 | 92 | In a univariate time series, the covariance between two observations *k* time periods apart (lag *k*), is given by 93 | 94 | ![ac_cov](http://mathurl.com/yapbrtuj.png) 95 | 96 | ### Sample Autocorrelation 97 | 98 | Standardized autocovariance is equal to sample autocorrelation (*k* time periods apart) which is denoted by ![ac_cor](http://mathurl.com/yawwkcuf.png) and is given as follows: 99 | 100 | ![ac_cor](http://mathurl.com/ya862ny5.png) 101 | 102 | ## Detecting Autocorrelation 103 | 104 | There are two types of autocorrelation that we look for: 105 | 106 | 1. Positive Autocorrelation: Clustering in Residuals 107 | 2. Negative Autocorrelation: Oscillations in Residuals (this is rare) 108 | 109 | The following methods are used for detecting autocorrelation: 110 | 111 | ### Use the tests 112 | 113 | #### Durbin Watson Test (usually used in Econometrics) 114 | 115 | #### Plot of Current vs Lagged residuals 116 | 117 | #### Runs (Geary) Test 118 | 119 | #### Chi-Squared Test of Independence of Residuals 120 | 121 | ### Autocorrelation Function (ACF) 122 | 123 | This function does the tests for all possible lags and plots them at the same time as well. It is very time efficient for this reason. 124 | 125 | 126 | ## Time Series Regression 127 | 128 | There are certain key points, that need to be considered when we do, Time Series Regression. They are as follows: 129 | 130 | 1. Check if time ordered residuals are independent 131 | 2. White Noise: Time Series of independent residuals (i.e. independent and identically distributed (*iid*), normal with 0 mean and constant variance ![](http://mathurl.com/abetvkx.png)) It is given by ![white_noise](http://mathurl.com/y8sbreq5.png) 132 | 3. When doing regression models of nonstationary time series, check for autocorrelation in the residual series, because **a pattern in the residual series is equivalent to pattern in the data not captured by our model**. 
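The sample autocorrelation defined above is straightforward to compute directly and compare against base R's `acf()`. In the hedged sketch below, the AR(1) series produced by `arima.sim()` is an assumed example; `r_k()` simply transcribes the lag-k formula (autocovariance at lag k divided by the autocovariance at lag 0).

```r
# Sample autocorrelation at lag k, following the formulas above
r_k <- function(y, k) {
  n    <- length(y)
  ybar <- mean(y)
  sum((y[1:(n - k)] - ybar) * (y[(k + 1):n] - ybar)) / sum((y - ybar)^2)
}

set.seed(123)
y <- arima.sim(model = list(ar = 0.7), n = 200)   # assumed AR(1) example series

r_k(y, 1)                # strong lag-1 autocorrelation
r_k(y, 10)               # noticeably weaker at a longer lag
acf(y, lag.max = 20)     # base R computes and plots all the lags at once
```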
133 | 134 | # Seasonality 135 | 136 | ### Smoothing or Filtering 137 | 138 | > Often the **smoothing or filtering** is done **before the data is delivered** and this **can be bad thing** because we don't know how seasonality was removed and also because we can only predict deseasonal values. 139 | 140 | ### STL (Seasonal Trend Lowess Analysis 141 | 142 | We can also do an STL analysis which would give us the seasonality, trend and remainder plots for any time series. 143 | 144 | > NOTE: If you find that the range of the seasonality plot is more or less the same as the range of the remained plot, then there isn't much seasonality. The STL is designed to look for **seasonality so it will always give you a seasonality plot** and it'll be for the person reading to determine whether there is actual seasonlity or not. 145 | 146 | ### Moving Averages 147 | 148 | > Both can be used to remove seasonality 149 | > 150 | 151 | # Forecasting 152 | 153 | 1. Report to te same level of significacne as the original data 154 | 2. Useall dp when doing calculations for the predictions 155 | 3. **Only round at the end** 156 | 4. Use units where available 157 | 158 | ![](http://mathurl.com/yclyh8k8.png) -------------------------------------------------------------------------------- /11-TimeSeries/01-TimeSeriesInR.md: -------------------------------------------------------------------------------- 1 | # Time Series in R 2 | 3 | Time Series in R is handled in the form of **XTS**(extensible time series) data. Almost all data types can be converted into their XTS objects by using the `as.xts()` constructor. The data format the comes out as an XTS object is usually a matrix with the rownames as dates, forming a time series. 4 | 5 | This becomes particularly useful when we are looking for long term patterns and things that reoccur over a certain period of time. 6 | 7 | ### Intuition 8 | 9 | The idea is to make timeseries self-sustained objects that we can think of, and operate upon, as events in time. 10 | 11 | ### Subsetting 12 | 13 | The format for subsetting any XTS object is as follows: 14 | 15 | ```r 16 | xts_object['YYYYMMDDTHHMMSS'] 17 | 18 | xts_object["Thh:mm/Thh:mm"] 19 | ``` 20 | 21 | There are few rules that need to be followed for our code to be compliant with the subsetting standards of **XTS**: 22 | 23 | 1. The components should be from left to right. 24 | 2. The data and time should be separated with with **T** 25 | 3. There may or may not be individual hyphens between years and months etc. 26 | 4. Colons may or may not be used between the values in 'time' section 27 | 5. The ranges are to be represented by `/` and in case the full ranges are not specified, the last part of date/time before the `/` will be ranged over. 28 | 29 | Key points to remember while subsetting XTS objects are: 30 | 31 | * Just like all other matrix objects in R subsetting still works for all integer indexing as well, including the negative indices. 32 | * It also works with logical vectors, i.e. `xts\_object(index(xts_object) > "YYYYMMDD")` words as well. 33 | * When subsetting the **XTS** objects, there is also the option of using the parameter `which.i` within the square brackets that returns the indixes of elements that match the logical query. 34 | * Order is always preserved while subsetting, even when the numeric indices are provided in arbitrary order, which makes sense for time series. 35 | * Selecting even one column of the dataset would return a matrix and not a vector. 
36 | * In most cases, subsetting XTS objects **will be faster than base R** 37 | 38 | ```r 39 | # Example subsetting for a date range starting at YYYYMMDD 40 | xts_object["YYYY-MM-DD/"] 41 | ``` 42 | 43 | We often would need to do subsetting over periods of time and for such problems, it would be handy to have the ability to select certain periods of time at the beginnning or end of the series. Luckily enough, we have `first` and `last` functions in the XTS package that allows us to do this with ease. 44 | 45 | ```r 46 | # The query below would extract as many rows necessary for the last one week of observations from a dataset 47 | last(xts_object, "1 week") 48 | ``` 49 | 50 | > Notice the use of text format for retrieving the data relevant to our search from the time series. 51 | 52 | #### Mathematical Operations 53 | 54 | The basic math operators are supposed to work as normal as **XTS** object is an extended matrix object as per se. This allows a lot of flexibility as most functions would be dispatched to default matrix functions or similar ones, whenever possible. 55 | 56 | Although most matrix operations would work as normal, it is important to remember that all matrix operations will be carried out primarily upon the time dimension and by default **ONLY THE INTERSECTION OF TIMES WOULD BE PROVIDED IN THE RESULTS**. 57 | 58 | #### Merging two or more XTS objects 59 | 60 | There are three major types of merge operations that we can leverage when dealing with Time Series using XTS objects: 61 | 62 | 1. **Outer Join**: The default join method for any merge operation on XTS objects is the outer join. This can be called as `merge(x, y)` and would result in the union of two time series with missing values replaced as `NA`'s. 63 | 2. **Inner Join**: A call to `merge(x, y, join = "inner")` would result in an intersection of the two objects `x` and `y` keeping only the times that appear in both objects. 64 | 3. **Right & Left Joins**: We can also choose to keep elements from one of the two time series being joined by setting the `join` parameter to be `"left"` or `"right"` in the `merge` function. 65 | 66 | In addition to merging two time series objects, we can operate with other data types as well. For instance, passing a vector to the merge function as a `y` argument, would append a new column to the XTS object (with all normal recycling rules still applicable). 67 | 68 | ##### NA Handling 69 | 70 | We can use `fill` argument with any of these join functions to fill in any replacement for missing values. The most popular one being, `na.locf` which replaces missing arguments by last observed value in time for the particular attribute. 71 | 72 | There is a dedicated `na.locf` function for handling missing observations where `locf` stands for **last observation carried forward** and allows us to take the values from the previous observation in time series and save it to the next one. 73 | 74 | There are four other functions that are commonly used for handling NA values in time series: 75 | 76 | 1. `na.fill` would simply fill a replacement based on our arguments with a constant 77 | 2. `na.omit` and `na.trim` would skip the observations with NA's in them 78 | 3. `na.approx` does a linear interpolation for the missing values 79 | 80 | ##### Lagging and Leading 81 | 82 | Two common operations in time series manipulations are lagging and leading, in order to find seasonality or stationarity properties of any time series dataset. We can easily calculate these in R using `lag` function. 
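A brief sketch of `lag()` and `diff()` on a toy **XTS** object is shown below; the ten-day random-walk series is an assumption for illustration. Note that, for xts objects, a positive `k` in `lag()` looks backwards in time (the first value becomes `NA`), which differs from base R's `lag()` behaviour on plain `ts` objects.

```r
library(xts)

# Assumed toy series: a ten-day random walk
set.seed(7)
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 10)
x <- xts(cumsum(rnorm(10)), order.by = dates)

lag(x, k = 1)               # yesterday's value aligned with today's date (NA first)
diff(x, lag = 1)            # day-over-day change, a first step towards stationarity
diff(x, lag = 7)            # weekly differences can expose weekly seasonality

merge(x, lag(x, k = 1), diff(x, lag = 1))   # side-by-side comparison (outer join)
```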
83 | 84 | 85 | 86 | ### Exporting and Importing 87 | 88 | It's pretty straight forward as **XTS** objects are built over **zoo** objects, we can do things like 89 | 90 | ```r 91 | write.zoo(xts_object, sep = ",", file = "file_name") 92 | 93 | # as.yearmon serves as the formatting function here 94 | read.zoo(file_name, sep = ",", FUN = as.yearmon) 95 | ``` 96 | 97 | ### Key Terms 98 | 99 | * **Seasonality**: The idea that the series exhibits a particular behavior repeatedly in certain intervals is called seasonality. 100 | * **Stationarity**: The idea that the series remains within certain bounds instead of being continuously increasing or decreasing is called stationarity. 101 | 102 | >The `lag` and `diff` functions are to be used for searching seasonality and stationarity relations between elements. 103 | -------------------------------------------------------------------------------- /11-TimeSeries/README.md: -------------------------------------------------------------------------------- 1 | # Time Series 2 | 3 | Any data recorded at fixed interval of time over a certain period of time, is called time series. It is a stochastic process, i.e. the values are dependent on, and change with time. There are special techniques that are used for tackling such models. 4 | 5 | This chapter contains the following topics: 6 | 7 | 1. [Introduction](./01-Introduction.md) 8 | 9 | a. [Key Terms](./01-Introduction.md#key-terms) 10 | b. [Types of Time Series](./01-Introduction.md#types-of-time-series) 11 | c. [Features](./01-Introduction.md#features-of-time-series) 12 | d. [Autocorrelation](./01-Introduction.md#autocorrelation) 13 | 2. [Implementation in R](./01-TimeSeriesInR.md) 14 | 3. Implementation in Python 15 | -------------------------------------------------------------------------------- /11-TimeSeries/StateSpaceModels.md: -------------------------------------------------------------------------------- 1 | # State Space Models 2 | 3 | -------------------------------------------------------------------------------- /12-ConstraintSatisfactionProblems/README.md: -------------------------------------------------------------------------------- 1 | # Constraint Satisfaction Problems 2 | 3 | The problems of this category usually require the states and goal test to conform to a standard, structured and very simple representation. 4 | 5 | ## Definitions 6 | 7 | * A CSP is defined by 8 | * a set of variables ![x1_xn](http://mathurl.com/yabc3fxk.png) 9 | * a set of contraints given by ![c1_cn](http://mathurl.com/ycr6djl5.png). In this setup there can be three categories of constraints: 10 | * asd 11 | * each variable ![xi](http://mathurl.com/lo88zjm.png) has a non empty domain ![di](http://mathurl.com/yd47wdhh.png) of possible values. 12 | * A **state** of the problem is defined by **assignment** of values to some or all of the variables {![](http://mathurl.com/y9bjh8dq.png), ![](http://mathurl.com/y83hv9nc.png), ...}. 13 | * An assignment that does not violate any constraints is called a **consistent** or **legal** assignment. 14 | * A **complete** assignment is the one, in which all variables of a CSP are mentioned. 15 | * A **complete** and **consistent** assignment is called a solution for any given CSP. 16 | 17 | > **Note:** Although the above process is in reference to assignment problems, some CSVs could also require solutions to maximize an objective function. The underlying principles would still remain the same 18 | 19 | ## Structure 20 | 21 | 1. 
The simplest kind of CSPs are the ones with discrete and finite domains. Map coloring problems and 8-Queens problems are the popular ones of this kind. In case of a problem with domain size **d** and number of variables **n**, the total number of possible and complete assignments is ![](http://mathurl.com/y98mou73.png). 22 | 23 | > This goes on to show that in the worst case scenario, we will not be able to solve finite-domain CSPs in less than exponential time. 24 | 25 | 2. There can be problems where the domain for a variable. 26 | 27 | ## Solving Mechanism 28 | 29 | It is interesting to see that the a CSP can be given an **incremental formulation** as a standard search problem using the following intuitive procedure: 30 | 31 | 1. **Initial State**: the empty assignment {}, in which all variables are unassigned 32 | 2. **Successor Function**: A value can be assigned to any variable provided that it does not conflict with any previously assigned variables 33 | 3. **Goal Test**: The current assignment is complete 34 | 4. **Path Cost**: a constant cost (eg. 1) for every step 35 | 36 | Every solution must be a complete assignment, therefore appears appears at depth **n** if there are **n** variables. Due to this property **depth first search** algorithms are really popular for CSPs. 37 | 38 | It is also noteworthy that the path taken for reaching a particular solution is irrelevant to the final outcome. Therefore another way to tackle this problem could be to make sure that **every state** is a **complete assignment** which might or might not satisfy the constraints and then use local search methods to find solutions. 39 | 40 | 41 | -------------------------------------------------------------------------------- /13-Appendix/01-Programming/01-R/01-DplyrTutorial.md: -------------------------------------------------------------------------------- 1 | # Tutorial (dplyr) 2 | 3 | Dplyr is perhaps the most commonly used data wrangling tool in R. In this section, we will be exploring some of the commonly used functions and writing styles of **dplyr**. 4 | 5 | > The **dplyr** verbs do NOT change the original dataset, they only return modified copies for us to use. 6 | 7 | ## Manipulating Data 8 | 9 | ### Select 10 | 11 | This dplyr verb allows us to select certain columns from the dataset. The syntax is straight forward. 12 | 13 | ```r 14 | select(df, Column1, Column2) 15 | ``` 16 | 17 | There are few things that need to be taken care of during the usage of **select** verb. 18 | 19 | 1. No quotes required for column names 20 | 2. No **$** sign required in front of the column names 21 | 3. One can use numeric order of columns for select, i.e. `select(df, 2:5)` would select the column number **2** to **5** from the dataframe **df** 22 | 4. After selection, we can also select which columns to not choose from the dataset, by using a syntax like `select(df, 2:5, -(3:4))`. This command would choose the column **2** through **5** without the columns **3** through **4**. 23 | 24 | > All verbs are compatible with helper functions described at the end of this documnet. 25 | 26 | ### Mutate 27 | 28 | This dplyr verb allows us to add new variables to the dataset using the existing variables. 29 | 30 | ```r 31 | mutate(df, newCol = Col1 + Col2) 32 | ``` 33 | 34 | In the example above, a new column **newCol** would be created by adding the values already present in **Col1** and **Col2**. 35 | 36 | Again, no quotes or **$** signs are required in order to use mutate. 
Although, if there are spaces in the column name, such as a column with the name **Col 1** can be written with the help of wrapping the name in **`** instead of quotes. 37 | 38 | ### Filter 39 | 40 | This verb allows us to filter out rows from the dataset based on the conditions that we provide as arguments to the verb. The syntax for filter is as intuitive as the other verbs. 41 | 42 | ```r 43 | filter(df, Condition == 1) 44 | ``` 45 | 46 | where `df` is the tibble, followed by one or more logical tests after the **,**. 47 | 48 | ### Arrange 49 | 50 | This verb allows us to reorder observations in a data frame so that they're easier to observe and visualize. The syntax for arrange is quite simple. 51 | 52 | ```r 53 | arrange(df, ColumnToArrangeDataBy) 54 | ``` 55 | 56 | > Can be chained like most other verbs 57 | 58 | ### Group By 59 | 60 | This verb allows us to group observations in a data frame so that they're easier to observe and visualize. The syntax for arrange is quite simple. 61 | 62 | ```r 63 | group_by(df, ColumnToGroupDataBy) 64 | ``` 65 | 66 | > Can be chained like most other verbs 67 | 68 | ### Summarise 69 | 70 | This verb allows us to summarise the columns. The syntax is really similar to mutate but the primary difference is, that instead of working on each observation individually like mutate does, **summarise** works of a group of observations, or often, the whole column. 71 | 72 | ```r 73 | summarise(df, newSummaryColumn = summaryFunction(oldColumn)) 74 | ``` 75 | 76 | #### Helper Functions 77 | 78 | 1. `starts_with("X")`: every name that starts with **X** 79 | 2. `ends_with("X")`: every name that ends with **X** 80 | 3. `contains("X")`: every name that contains **X** 81 | 4. `matches("X")`: every name that matches **X**, where **X** can be a regular expression 82 | 5. `num_range("x", 1:5)`: the variables named **x01**, **x02**, **x03**, **x04** and **x05** 83 | 6. `one_of(x)`: every name that appears in **x**, where **x** is a character vector. 84 | 85 | ## Joining Data 86 | 87 | When we are using Dplyr to merge data from two tables, we usually use the dplyr functions for it. It allows us to have better control over our joins (than **merge**) and also allows us to interact with back end applications like Spark or MySQL easily. These functions are meant to be frontends for other storages and therefore, are better suited for such tasks. 88 | 89 | ### Keys 90 | 91 | The important to understand that by default joins are made upon keys. The key from the first table will be considered a **primary key** and the keys from the second table is considered to be a **foreign key**. The primary key needs to uniquely identify the values in the first table, while foreign key accepts duplicate values as well. If the primary key however, does not corresponding to unique observations in the first table, the results will become harder to understand. 92 | 93 | ### Left Join 94 | 95 | Joining with the left data frame as the primary data frame is called a left join. The syntax is pretty straight forward in this case, 96 | 97 | ```r 98 | left_join(df1, df2, by = "commonColumnName") 99 | ``` 100 | 101 | In case of left join if a row in the left (primary) data frame does not have a corresponding value in the secondary data frame, then dplyr will supply it with an **NA** but if a row is in the secondary data frame and does not have any corresponding value in the primary data frame, it will be ignored completely. 
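
To see the verbs and a join working end to end, here is a small sketch; the tables, columns and values are made up for illustration:

```r
library(dplyr)

# Hypothetical tables: one row per employee, one row per department
employees   <- tibble(emp_id  = 1:4,
                      dept_id = c(10, 10, 20, 30),
                      salary  = c(50, 55, 60, 65))
departments <- tibble(dept_id   = c(10, 20),
                      dept_name = c("Sales", "IT"))

# Left join keeps every employee; dept_id 30 has no match, so dept_name becomes NA
joined <- left_join(employees, departments, by = "dept_id")

# The verbs chain naturally, e.g. average salary per matched department
joined %>%
  filter(!is.na(dept_name)) %>%
  group_by(dept_name) %>%
  summarise(avg_salary = mean(salary))
```
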
102 | 103 | -------------------------------------------------------------------------------- /13-Appendix/01-Programming/01-R/README.md: -------------------------------------------------------------------------------- 1 | # R Programming 2 | 3 | 1. [Dplyr Tutorial](./01-DplyrTutorial.md) -------------------------------------------------------------------------------- /13-Appendix/01-Programming/02-Python/01-Numpy.md: -------------------------------------------------------------------------------- 1 | # Numpy 2 | 3 | Numpy is a library used specifically for advanced and faster data manipulation in Python. It allows us to effectively manage and manipulate our datasets with minimal programming. In this document, we will have a look at the most commonly used features of Numpy and how we can exploit them to optimize our Python programming. 4 | 5 | ## Import Data 6 | 7 | 1. **Read a csv**: A **csv** file containing only numerical data can be imported using `np.loadtxt('filepath.csv', delimiter = ",")`. The default delimiter is *whitespace*, so explicit definition is required. 8 | 2. **Headers**: If the first row contains a header, the `skiprows = 1` argument must be used to skip the first row. 9 | 3. **Selective column import**: If only a certain number of columns are required to be imported, then we can use the `usecols = [0,1,4]` argument with column indices to import only some of all the columns available. 10 | 4. **Import columns with different datatypes**: Although not recommended, Numpy has the ability to import dataframe-like structures which contain different datatypes in different columns. This can be done using the `np.genfromtxt()` function. Refer to the documentation for more. 11 | 12 | ## Arrays 13 | 14 | ### 2-D Arrays 15 | 16 | We can create 2-D Numpy arrays as `a = np.array([[1,2,3], [4,5,6]])` and this would lead to a 2-dimensional array with 2 rows and 3 columns. 17 | 18 | On this object, the attribute `shape` represents the dimensions. It can be used as `a.shape` to return `(2,3)`, meaning **2 rows** and **3 columns** exist in this 2D array. 19 | 20 | #### Iteration 21 | 22 | If iteration is required over every single element in a 2-D Array using a for loop, then the method `np.nditer` must be used in the following syntax: 23 | 24 | ``` 25 | for val in np.nditer(my_np_array): 26 | do_something(val) 27 | ``` 28 | > The iteration in the case above will happen in a row-wise fashion. The first row will be iterated over (all of its elements will be visited), then the second row, and so forth. 29 | 30 | 31 | 32 | 33 | 34 | ### Mathematics on Vectors 35 | 36 | #### Random Operations 37 | 38 | 1. In order to **generate a simple random number**, `np.random.rand()` can be used. This would result in a random number between **0** and **1**. 39 | 2. We can **set a seed** using the function `np.random.seed(seedValue)` which would then introduce reproducibility between our function calls. 40 | 3. **Random integers** from a select range of integers can be generated using `np.random.randint(start_integer, end_integer)` which in this case would result in random integers between **start_integer** and **end_integer - 1** (because the end of the range is not included in Python). 41 | 4. 42 | 43 | #### Dot Products 44 | 45 | The dot product can be defined for two vectors or matrices in the following ways: 46 | 47 | 1. ![dotProd1](http://mathurl.com/y9qc43b6.png) 48 | 49 | This is the summation of element-wise multiplication of the two vectors.
The notation ![atransb](http://mathurl.com/y7wgs22g.png) denotes that the vectors are column vectors and the result of the equation above would be a 1x1 vector which is a scalar quantity. 50 | 51 | This definition can be emulated in Python (using Numpy) in various ways: 52 | 53 | 1. Without using Numpy functions) 54 | 55 | ```python 56 | # Create the necessary variables 57 | dotProd = 0 58 | a = np.array([1,2,3]) 59 | b = np.array([2,3,4]) 60 | 61 | # Use a for-loop to calculate the dot product 62 | for e,f in zip(a,b): 63 | dotProd += e*f 64 | 65 | 66 | # The value of dot now becomes 20 as one would expect 67 | ``` 68 | 69 | 2. Using `np.sum` function 70 | 71 | ```python 72 | dotProd = np.sum(a*b) 73 | 74 | # The value of dotProd will be the same as the generic 75 | # code we wrote above because the a*b notation creates 76 | # a vector of products of individual elements and then 77 | # we just sum them to emulate the equation above 78 | ``` 79 | 80 | 3. Using the `sum` function over the `np.array` object instances 81 | 82 | ``` 83 | dotProd = (a*b).sum() 84 | 85 | # Notice the use of object's sum function instead of 86 | # using the sum function of the class as in Method 2 87 | ``` 88 | 89 | 4. Using the `np.dot` function 90 | 91 | ``` 92 | dotProd = np.dot(a,b) 93 | ``` 94 | 95 | 5. Using the `dot` function over the `np.array` object instances 96 | 97 | ``` 98 | dotProd = a.dot(b) 99 | 100 | # Notice the use of object's dot function instead of 101 | # using the dot function of the class as in Method 4 102 | ``` 103 | 104 | > `for` loops should be avoided whenever possible. The intrinsic functions of Numpy are magnitudes faster in operation. 105 | 106 | 2. ![cosDotProd1](http://mathurl.com/ycpoyuxb.png) 107 | 108 | This notation is not very convenient for vector multiplication unless a the angle on the right hand side is known to us. Although, it is a much more common practice to use this equation for finding out the angle between two vectors using 109 | 110 | ![cosDotProd2](http://mathurl.com/yd35x774.png) 111 | 112 | Let's use this equation to find the angle between the vectors above step by step: 113 | 114 | 1. Find the magnitudes of the vectors. 115 | 116 | ``` 117 | # We can do this in two ways: 118 | 119 | # Option 1: Without using the built-in Numpy function 120 | # for this task 121 | magA = np.sqrt(a.dot(a)) 122 | 123 | # Option 2: Using the Linear Algebra module of the 124 | # Numpy package to do this task 125 | magA = np.linalg.norm(a) 126 | 127 | # Using the equation in the starting of this chapter 128 | ``` 129 | 130 | 2. Calculate the cosine of the angle between the two vectors. We know from the equation shown above that the angle between the two vectors can easily be calculated if we have the magnitudes of the two vectors and their cross product. 131 | 132 | ``` 133 | costheta = a.dot(b) / ( np.linalg.norm(a) * np.linalg.norm(b) ) 134 | ``` 135 | 136 | 3. Once we have done this, the actual angle can easily be calculated by using the `np.arccos` function of the Numpy Library 137 | 138 | ``` 139 | theta = np.arccos(costheta) 140 | ``` 141 | 142 | The value that we obtain for `theta` from the operation above is in radians. 143 | 144 | #### Outer Products 145 | This function takes in two vectors `a` and `b` and then returns their outer product. 146 | ```python 147 | AOuter = np.outer(a, b) 148 | ``` 149 | 150 | 151 | ## Matrix 152 | 153 | A matrix is an inherent data type in Numpy but it can also be an array of arrays if we don't want it t be a matrix. 
The official NP documentation discourages the use of matrices and encourages users to use array of arrays notation instead. This requires all arrays to be of the same length however, obviously. it can be defined as `M = np.array([[1, 2], [3, 4]])` which would then define an array of array kind of matrix immediately. 154 | 155 | > To access a particular element, `[i, j]` notation may be used, just like data frames. 156 | 157 | #### Create a matrix 158 | We can create an empty matrix by using the `np.zeros((5, 5))` which would return a `5 by 5` matrix of zeros. Similarly `np.ones` can be used to create a matrix of ones. 159 | 160 | ### Mathematics on Matrices 161 | 162 | #### Multiplication of Matrices 163 | 164 | The definition of matrix is given in the prerequisites section of this book. 165 | 166 | 1. A simple operation like matrix multiplication can be easily done by using the `.dot()` function of Numpy stack. therefore for two matrices `A` and `B`, their multiplicative result would be given by `C = A.dot(B)`. This would result in the matrix multiplication of the two matrices A and B. 167 | 168 | 2. In order to do an element wise multiplication in matrices we can simply say `A*B` and this would result in each element of one matrix to be multiplied by the corresponding element in the other matrix. 169 | 170 | #### Other common mathematical operations on a Matrix 171 | 172 | ```python 173 | A = np.array([[1, 2], [3, 4]]) 174 | Ainv = np.linalg.inv(A) // Gives us the inverse of A 175 | Adet = np.linalg.det(A) // Gives us the determinant of A 176 | Adiag = np.diag(A) // Gives us the diagonal elements of A in a vector 177 | Atrace = np.trace(A) // Gives us the sum of diagonal elements of A 178 | ``` 179 | > If you pass a **2D** Array to `np.diag`, then it returns the diagonal elements, if you a pass a **1D** array however, it returns a 2D Array with all off diagonal elements as `0` and the elements of the array as diagonal elements. 180 | 181 | ### Solving a Linear System 182 | 183 | The problems in a linear system are often of the form ![linearSystem](http://mathurl.com/5uerks4.png). The solution for `x`, is easily given by ![linearSystem2](http://mathurl.com/6yysmlh.png). We are assuming that `A` is a square matrix and is invertible. The system has `D` equations and `D` unknowns to solve for. This can be simply done by using the equation above and the basic Numpy methods we have used thus far: 184 | 185 | ```python 186 | x = np.linalg.inv(A).dot(b) // Method 1 187 | x = np.linalg.solve(a, b) // Method 2 (Recommended) 188 | ``` 189 | 190 | ## Operator overloading in 'np' 191 | 192 | ### Mathematical Operators 193 | 194 | ``` 195 | a = [1, 2, 3] 196 | print(a+a) 197 | ``` 198 | would return `[1,2,3,1,2,3]` but if you perform the operation with numpy as follows: 199 | 200 | ``` 201 | a = np.array([1,2,3]) 202 | print(a+a) 203 | ``` 204 | would return the element wise sum of the array, i.e. `[2,4,6]` 205 | 206 | ### Boolean Operators 207 | 208 | In case of Boolean Operators over Numpy arrays, the preferred method of operation is using the Numpy function, `logical_and()`, `logical_or()` and `logical_not()`. These are Numpy array equivalents of `and`, `or` and `not` found in base Python. 209 | 210 | ## FAQs 211 | 212 | ### 1. What is the difference between a List and an NP Array? 213 | There are several differences between an NP Array and a Python List: 214 | 215 | 1. There is no **append** method on a NP Array while the method works well on Python Lists. 216 | 2. 
Lists can be added with a **+** operator. 217 | 3. If `L1 = [1, 2]` and `L2 = [3, 4]`, adding two lists would gives us the concatenation of those lists (`L3 = L1 + L2` would give us the value of `L3` as `[1, 2, 3, 4]`) but adding two Numpy Array would give us the element wise sum for the two Arrays. For example for a Numpy array `A = np.array([1, 2])`, doing `A + A` would give us the value of `A` to be `array([2, 4])`. 218 | 4. Numpy lists can be multiplied and added to elements, while the same is not possible with Python Lists. Doing `2 * L1` would repeat all elements in `L1` but doing `2 * A` would multiply each element of the Numpy Array with the constant. 219 | 5. Almost all mathematical operations are applied element-wise when you are working with Numpy arrays but won't do so with Lists. 220 | 6. It's almost always better to use NP arrays for doing mathematical operations and creating mathematical objects. -------------------------------------------------------------------------------- /13-Appendix/01-Programming/02-Python/02-MatPlotLib.md: -------------------------------------------------------------------------------- 1 | # MatPlotLib 2 | 3 | This is the visualization library available for plotting graphs in python. It can be imported in any python script as ` import matplotlib.pyplot as plt`. This allows us to use the shorthand notation `plt` rather than having to use the complete name for the plot function. 4 | 5 | ## Plots 6 | 7 | ### Line based plot 8 | 9 | ```python 10 | # For graphs with lines 11 | plt.plot(x, y) 12 | plt.show() 13 | ``` 14 | 15 | ### Scatter Plots 16 | 17 | ```python 18 | # For scatter plots 19 | plt.scatter(x, y) 20 | plt.show() 21 | ``` 22 | 23 | ### Histograms 24 | 25 | ```python 26 | plt.hist(x) 27 | plt.show() 28 | ``` 29 | 30 | ### Box Plots 31 | 32 | In the example mentioned below, we pass the columns to be plotted, into the `column` argument and the column that we want to compare boxplots across, into the `by` argument. 33 | 34 | ```python 35 | df.boxplot(column = 'column_name', by = "continent") 36 | ``` 37 | 38 | The lines that extend from a boxplot are called **whiskers**. They represent the maximums and minimums of our data, excluding the outliers. 39 | 40 | #### Customizations 41 | 42 | 1. **Bins**: The number of default `bins` for a histogram is 10, and it can be altered by passing a different value `plt.hist(x, bins =3)` 43 | 2. **Range**: Setting the minimum and maximum value for a histogram is done by using the `range` argument and passing it a tuple `plt.hist(x, range = (0, 10))` 44 | 3. **Normalization**: The data can be normalized before the histogram is plotted using `normed` argument as `plt.hist(x, normed = True)`. 45 | 4. **CDF**: A Cumulative Distribution Function can be calculated before plotting by using the Boolean argument `cumulative` in addition to `normed` while plotting `plt.hist(x, cumulative = True, normed = True)`. 46 | 47 | ### Plotting a DataFrame 48 | 49 | A pandas DataFrame can be plotted using the `plt.plot(pd_df)` function. This call would plot all the numeric values in the dataframe across the index. 50 | 51 | ### Customizations 52 | 53 | 1. **Labels**:The customizations for a plot `plt.plot(x,y)` for X and Y **labels** can be done with `plt.xlabel('X')` and `plt.ylabel('Y')`. 54 | 2. **Title**: A title can be added with `plt.title('Plot Title')`. 55 | 3. **Altering Axis**: The axis' can be changed by passing an one dimensional array to the `yticks` function as in `plt.yticks([0,2,4,6])`. 
This would force the Y axis to have these numbers of the intervals. Optionally, a second list of the same length can also be passed to `yticks` for custom labels, while still using the first row as the original axis numbers. It can be done as `plt.yticks([0,2,4,6],["Zero", "Two", "Four", "Six"])`. 56 | 4. **Adding Text**: Text can be added to any point in the plot using `plt.text(x, y, "Text")` syntax. 57 | 5. **Logarithms**: If the values are interfering with each other due to dominant behavior of a particular feature, then we can use `plt.yscale('log')` to neutralize the effect of that feature before showing the plot. 58 | 6. **Colors**: Colors can be added for features while plotting them using `pd_df["column_1"].plot(color = "r")`. 59 | 60 | ## Statistical Significance 61 | 62 | ### Types of Plots 63 | 64 | 1. **Bar Plots**: It's a good idea to use bar plots for discrete data counts 65 | 2. **Histograms**: Histograms are good for frequency analysis on continuous data columns 66 | 3. **Box Plots**: It's a good idea to use box plots to visualize all basic summary statistics for a given column. 67 | 4. **Scatter Plots**: They're used to observe visually, the relationships between two or more numeric columns. 68 | 69 | ## Exporting plots 70 | 71 | We can easily export plots using `plt.savefig('filename.jpg')` before calling `show()` and save the plots. -------------------------------------------------------------------------------- /13-Appendix/01-Programming/02-Python/03-Pandas.md: -------------------------------------------------------------------------------- 1 | # Pandas 2 | 3 | The high level data manipulation tool used by data scientists. It can be imported with the syntax `import pandas as pd`. 4 | 5 | ## Creating a Pandas Dataframe 6 | 7 | ### From Dictionaries 8 | 9 | There are multiple ways to create a Pandas dataframe, the commonly used one is to create it from a dictionary by setting a key for every column label and a the value to be a list of observations of that label for each column. Then simply calling `pd.DataFrame(dict)` would create the data frame. 10 | 11 | #### Assigning row labels 12 | 13 | This can be done by using the syntax `df.index = ['label1', 'label2', ... 'labeln']` for **n** observations that exist in the dataframe. 14 | 15 | ### From Lists 16 | 17 | If we are to create a DataFrame from two conforming lists that are defined as follows, we would need to use the `list` and `zip` functions. 18 | 19 | ```python 20 | labels = ['n1', 'n2'] 21 | x = [1, 2, 3] 22 | y = [4, 5, 6] 23 | ``` 24 | 25 | 1. We would first need to create a list of lists like `list_columns = [x, y]` 26 | 2. Then we would need create a list after element-wise zipping of the columns with their labels, as shown in `z = list(zip(labels, list_columns))` 27 | 3. This will now need to be converted into a dictionary which would have column names (`labels`) and columns (`z`) using `data = dict(z)` 28 | 4. After this, we use the method defined above for dictionaries, i.e. `pd.DataFrame(data)` to create a DataFrame. 29 | 30 | #### Broadcasting 31 | 32 | It is the concept of **recycling** from R, that is called **broadcasting** in Python. The idea is, that a particular value can be recycled and used to fill all the other observations, if unsuitable number of details have been provided. 33 | 34 | ```python 35 | x = [1, 2, 3] 36 | y = {'n': x, 'is_int': 'Yes'} 37 | z = pd.DataFrame(y) 38 | ``` 39 | would return a DataFrame with two columns, the column containing `Yes` thrice. 
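
A short sketch pulling together the creation methods described above; the values are invented for illustration:

```python
import pandas as pd

# From a dictionary: keys become column labels
data = {'country': ['Spain', 'India'], 'capital': ['Madrid', 'New Delhi']}
df = pd.DataFrame(data)
df.index = ['ES', 'IN']          # assign row labels

# From lists: zip the labels with the columns, then build a dict
labels = ['n1', 'n2']
x = [1, 2, 3]
y = [4, 5, 6]
df2 = pd.DataFrame(dict(zip(labels, [x, y])))

# Broadcasting: a single scalar is recycled to fill the whole column
df2['is_int'] = 'Yes'
```
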
40 | 41 | ## Reading and Importing Data 42 | 43 | ### From CSVs 44 | 45 | It is relatively straightforward to be reading data from CSVs. One can use `pd.read_csv('path_to_csv.csv')` in order to read from a file. 46 | 47 | 1. **No column labels**: If the data does not have column labels, `pd.read_csv('path_to_csv.csv', header = None)` will allow it to read data without it. 48 | 2. **External column names**: External column names can be added to the data frame using the names argument, `pd.read_csv('path_to_csv.csv', header = None, names = list_of_names)` 49 | 3. **Null value declaration**: If our data uses any other convention than `NaN` for declaring null values in it, we can explicitly define it, by setting the `na_values` attritbute to that character `pd.read_csv('path_to_csv.csv', na_values = ['-1'])` 50 | 51 | This can also be done if there are more than one kinds of `NaN` values present in the dataset using a list of values as shown in `pd.read_csv('path.csv', na_values = ['-1', '999'])` or using a dictionary as shown `pd.read_csv('path.csv', na_values = {col1: '-1', col2: '999'})` if there are separate `NaN` characters in separate columns. 52 | 4. **Assigning row labels**: In case the first column of the csv contains row labels for the data, then use `pd.read_csv('path_to_csv.csv', index_col=0)` for using the (row) labels for your dataframe. 53 | 5. **Date values**: If year, month and date are in separate columns and need to be converged into one column, it can be done using `parse_dates` argument of the `read_csv()` function, e.g. `pd.read_csv('path_to_csv.csv', parse_dates = [[1, 2, 3]])` where columns with indices would contain year, month and day data. 54 | 55 | Alternatively, we can parse the date-time values using `parse_dates = True` which would then convert all the dates that are ISO 8601 compatible (yyyy-mm-dd hh:mm:ss), into appropriate date structure. 56 | 57 | 6. **Handling comments**: If the file contains comments within the data, they can be distinguished using the delimiter passed to the `comment` argument as shown in `pd.read_csv('path_to_csv.csv', comment='#')` 58 | 7. **Delimiter**: The delimiter in while reading a csv to a Pandas DataFrame object can be set using `sep` argument 59 | 8. **Skipping rows**: Rows can be skipped while reading a csv file by using `skiprows` argument in combination with `header` argument. 60 | 9. **Skipping footer**: Rows at the end of the file can be skipped using the `skipfooter = n` argument. This would skip the last `n` rows of the file. 61 | > **NOTE**: The `skipfooter` argument doesn't work with the default C Engine so we need to specify the `engine = python` when setting this parameter. 62 | 63 | #### Chunkwise loading 64 | 65 | In case of large datasets, data can be loaded and processed in chunks. It can be done with the help of `for` loop as in `for chunk in pd.read_csv('path_to_csv.csv', chunksize = 1000)`. 66 | 67 | #### Globbing 68 | 69 | The process of looking for file names with specific patterns, and loading them is called **globbing**. 70 | 71 | ```python 72 | import glob 73 | 74 | pattern = '*.csv' 75 | csv_files = glob.glob(pattern) 76 | ``` 77 | 78 | The code above, would return a list of files names, called `csv_files`. Then we can loop over this list to load all data frames. *Concatenation* can be used for merging all the datasets into one single dataset if required. 
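
A sketch combining globbing, chunkwise loading and concatenation; the `sales_*.csv` pattern and the `amount` column are assumptions made purely for illustration:

```python
import glob
import pandas as pd

# Collect all files matching a (hypothetical) pattern
csv_files = glob.glob('sales_*.csv')

# Process each file in chunks so the full data never sits in memory at once
totals = []
for path in csv_files:
    for chunk in pd.read_csv(path, chunksize=1000):
        totals.append(chunk['amount'].sum())   # 'amount' is an assumed column name
grand_total = sum(totals)

# Alternatively, concatenate everything into a single DataFrame
combined = pd.concat((pd.read_csv(path) for path in csv_files),
                     ignore_index=True)
```
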
79 | 80 | ### From URLs 81 | 82 | Importing a csv from a web URL can be done with the **UrlLib** package as follows 83 | 84 | ```python 85 | from urllib.request import urlretrieve 86 | 87 | urlretrieve('http://onlinefilepath', 'local_file_path.csv') 88 | ``` 89 | 90 | And then proceed with `read_csv()` function as usual. 91 | 92 | ### From Excel 93 | 94 | 1. **Reading a file**: A simple read operation over an Excel Spreadsheet can be executed by using `x = pd.ExcelFile('filepath.xlsx')`. 95 | 2. **Listing sheets**: There can be multiple sheets involved in any particular file and they can be listed by using the `sheet_names` attribute as `print(x.sheet_names)` 96 | 3. **Reading a particular sheet**: It is done by passing the sheet name to the `parse()` method as shown in `df_sheet = x.parse('sheet1')` 97 | 4. **Custom Headers**: One can define custom headers while parsing from an excel sheet by using the `names()` argument. 98 | 99 | ### From HDF5 (Hierarchical Data Format version5) 100 | 101 | This is data format commonly used for storing large quantities of numerical data in Python. This is done using the following code segment 102 | 103 | ```python 104 | import h5py 105 | data = h5py.File("path_to_file.hdf5", "r") 106 | ``` 107 | 108 | You can explore the `data` object so obtained by using code similar to that required to explore a dictionary. 109 | 110 | ### From Pickles 111 | 112 | ```python 113 | with open('file_path.pkl', 'rb') as file: 114 | data = pickle.load(file) 115 | ``` 116 | 117 | ### From SQL 118 | 119 | 1. **Creating a Database engine** 120 | 121 | ```python 122 | from sqlalchemy import create_engine 123 | engine = create_engine("path to sql db connector") 124 | ``` 125 | 2. **List Tables**: This can be done using `engine.table_names()` method of **engine** object. 126 | 3. **Connecting to the engine**: This is done using the `connect()` method available with every sqlalchemy `engine` object. 127 | 4. **Querying**: There are two ways to query an SQL database and they are as follows: 128 | 1. The first one takes all of the above methods and works as follows: 129 | 130 | ```python 131 | con = engine.connect() 132 | results = con.execute("SELECT * FROM table_name") 133 | df = pd.DataFrame(results.fetchall()) 134 | df.columns = results.keys() 135 | con.close() 136 | ``` 137 | 138 | This syntax is very similar to the **PHP** syntax for this operation. We create a connection, then execute the query and get returned a results binary object. We then use `fetchall()` method to convert it to a flat structure and store it in a pandas dataframe. Finally, we add the column names to the dataframe that we created and close the connection. 139 | 140 | 2. The second method is much more concise and works just fine. It harnesses the power of Pandas library and works as follows: 141 | 142 | ```python 143 | df = pd.read_sql_query("SELECT * FROM table_name", engine) 144 | ``` 145 | 146 | This single line of code then executes the command and returns the results in form of a DataFrame. 147 | 148 | 5. **Fetch fewer rows**: Sometimes the SQL query that we execute might return humongous results, then we can use the `fetchmany()` function with the `size` argument over the `results` object in order to fetch a certain number of rows instead of all. 149 | 150 | 151 | 152 | ## Exporting Data 153 | 154 | 1. **csv**: The method `to_csv()` for every DataFrame object allows us to export it to any file that we desire. It works as `pd_df.to_csv('filename.csv')`. 155 | 2. 
**Excel**: The method `to_excel()` for every DataFrame object allows us to export it to an excel spreadsheet file that we desire. It works as `pd_df.to_csv('filename.xlsx')`. 156 | 3. **Numpy array**: Any Pandas DataFrame can be converted into a Numpy array object using `values` attribute of every pandas dataframe object. 157 | 158 | 159 | 160 | ## Selecting and Index Data 161 | 162 | ### Column Selection 163 | 164 | A column in a dataframe `df` may easily be selected using the syntax `df["columnName"]`. 165 | 166 | > **NOTE:** The returned object from the code above is **NOT** a dataframe but an object of the type series. This may lead to unexpected results and therefore this method is not recommended. 167 | 168 | There are two fixes for the problem mentioned above: 169 | 170 | 1. The fix for the problem mentioned above is to use double square brackets like `df[["columnName"]]` for selecting the column. This would return a dataframe instead of a **series** as was the case in the method above. 171 | 2. Use the `values` attribute for any **Series** object in order to retrieve the numerical values involved in form of a **Numpy** array. 172 | 173 | ### Slicing for Row Selection 174 | 175 | It is uncommon to use regular square brackets for row selection but it can be done using `df[1:5]` which would return the rows with indices 1 through 4 (because as always, the last index would not be included). 176 | 177 | If alternative slicing methods are required then it can be achieved as `pd_df[::3,:]` would select every third row of the DataFrame. 178 | 179 | ### Loc and iLoc 180 | 181 | These are the two most commonly used methods (of Pandas Data Frame objects) for selecting or subsetting data. The **loc** technique operates on **labels** and the **iloc** technique relies on **integer positions**. 182 | 183 | 1. **loc**: This method allows us to select certain rows based on labels as follows `df.loc[["row_label1", "row_label2"]]` would select the rows with these two labels. 184 | 185 | One trick for range of slicing is to use `df.loc[["row_label2", "row_label1":-1]]` for reverse slicing. It would select rows from `row_label1` to `row_label2` but in reverse order. 186 | 187 | > **Note:** The use of `[[ ]]` is still necessary for making sure that the returned object is indeed a Pandas DataFrame in order to avoid any inconsistencies. 188 | 189 | > **WARNING**: Unlike conventional slicing (with numbers) slicing with `loc` using `'column_name_1':'column_name_2'` would include `column_name_2` in the resulting object. This is different from the index based slicing as that ignores the last index. 190 | 191 | It can be further extended to include only specific columns using a comma, as in `df.loc[["row_label1", "row_label2"], ["column_label1", "column_label2"]]`. This query would only return the columns with labels **column_label1** and **column_label2**. 192 | 193 | 2. **iloc**: Everything remains the same except that indices are used instead of labels. 194 | 195 | ### Filtering 196 | 197 | 1. **any and all**: `any()` or `all()` methods are helpful in filtering the columns that have certain properties. They're usually used in combination with `isnull()` or `notnull()` methods. 198 | 2. **Drop na**: The `dropna()` method can be used on data frames to filter out rows with any or all na values based on the argument `how='any'` or `how='all'`. 199 | 200 | ### Iterations 201 | 202 | #### Columns 203 | 204 | A basic `for` loop would result in iteration over column names. 
For instance, 205 | 206 | ``` 207 | for i in pd_df: 208 | print(i) 209 | ``` 210 | 211 | would simply print the columns names that exist in the pandas dataframe `pd_df`. 212 | 213 | #### Rows 214 | 215 | The rows need to be iterated over using the method `iterrows` of the pandas dataframe object that we are trying to access. 216 | 217 | ``` 218 | for lab, row in pd_df.iterrows(): 219 | print(lab) 220 | print(row) 221 | ``` 222 | 223 | would then print, first the label, and then the contents of each row as a **Series** object. 224 | 225 | ## Manipulating Dataframes 226 | 227 | > NOT COMPLETE: would need to visit more websites and research materials to complete manipulatio 228 | 229 | 1. **Adding a new column** 230 | 231 | 1. **Single Value(loc)**: The operator `loc` can be used to add a new column to an existing dataframe. 232 | 233 | ```python 234 | pd_df.loc["new_column"] = 2 235 | ``` 236 | should create a new column names `new_column` in the `pd_df` dataframe with the value `2` on all rows. 237 | 2. **Mutation(apply)**: `pd_df["new_column"] = pd_df["old_column"].apply(len)` would add a new column with the length of values present in currently existing `old_column`. 238 | 239 | 2. **Tidying Data** 240 | 241 | 1. **Melting Dataframes**: If columns contain values instead of variables, then we would need to use the `melt()` function. It can be used as 242 | 243 | ```python 244 | pd.melt( frame = df, 245 | id_vars = 'identifier_column', 246 | value_vars = ['column1', 'column2'], 247 | var_name = 'new_key_column_name', 248 | value_name = 'new_value_column_name') 249 | ``` 250 | 251 | where `id_vars` is the column/set of columns that contain the ID, and `value_vars` are the columns that need to be merged. 252 | 253 | 2. **Pivoting Data**: It's the opposite process of melting data. It can be used to change a column into multiple columns as follows: 254 | 255 | ```python 256 | pd.pivot( frame = df, 257 | index = 'index_column', 258 | columns = 'key_column', 259 | values = 'value_column') 260 | ``` 261 | 262 | This would work just fine if we're dealing with perfect data, i.e. there are no duplicates. If there are duplicates though, then we would need to use the `pivot_table()` method in order to deal with them. It is done with one additional parameter, and with , as shown below 263 | 264 | ```python 265 | df.pivot_table( index = 'index_column', 266 | columns = 'key_column', 267 | values = 'value_column', 268 | aggfunction = np.mean) 269 | ``` 270 | 271 | where we are telling Pandas to mean the duplicate numeric values using the `aggfunction` attribute. 272 | 273 | > `reset_index()` method is used on the data frames that have been pivoted in order to flatten them2 274 | 275 | 3. **Concatenating Data**: A list of dataframes can be passed to the `pd.concat()` function for concatenating data. 276 | 277 | The `axis = 1` argument can be used for **column wise concatenation**. 278 | 279 | > **Note:** We need to reset the index labels by passing the `ignore_index=True` argument to the `pd.concat()` function so that there are no duplicate indices in order to avoid using `reset_index()` method later. 280 | 281 | 3. 
**Merging Data** 282 | 283 | We can use the Pandas equivalent of **join** to merge two dataframes as follows 284 | 285 | ```python 286 | pd.merge( left = left_df, right = right_df, 287 | on = None, 288 | left_on = 'left_df_column_name', 289 | right_on = 'right_df_column_name') 290 | ``` 291 | 292 | If the column name is same on both left and right dataframes, then only the `on` parameter can be specified in the function above and the other factors will be redundant. 293 | 294 | There are multiple kinds of merges that can happen in Pandas: 295 | 296 | 1. **One to One**: Both the keys take a value only 1 time on both sides 297 | 2. **Many to One/ One to Many**: This merge happens when there are more than one duplicates on either of the tables. In this case, the value from the other key will be duplicated to fill in the missing repition. 298 | 299 | 4. **Data Type Cleaning** 300 | 301 | We can observe the datatypes of various columns by viewing the `dtypes` attribute of the dataframe that we want to check these details for. 302 | 303 | 1. **Converting Data Types**: The data types can be converted using the `astype()` method of any column. 304 | 2. **Convert to categorical column**: We would often want to convert the column type to a categorical variable, we can pass the `'category'` argument to the `astype()` method of any column to convert it into a categorical column. 305 | 3. **Convert to Numeric**: If there is a column that should be of numeric type but is not, because of mistreated data, or erroneous characters in the data, we can use the `pd.to_numeric(df['column_name'])` function and pass it the additional argument `errors = 'coerce'` in order to convert all that erroneous data to `NaN` with ease. 306 | 4. **Drop NA**: If there are really few data points that have missing values in them, we can lose them with the `dropna()` method. 307 | 5. **Recode Missing Values**: We can customize the missing values using the `fillna('missing_value_placeholder')` method of every data frame object and the columns. 308 | 6. **String Cleaning**: The `re` library for regular expressions gives us a neat way to do string manipulations. We can formulate regular expression formalue like `x = re.compile('\d{3}-\d{3}-\d{4}')`. This would create a new regex object called `x` which has a method `match()`. We can pass any string to this `match()` method to match it with our regular expression and it returns a boolean `True` if the string matches. 309 | 7. **Duplicate Data**: There may be mulitple rows where redundant partial or complete row information is stored and these may be sorted out by using the `drop_duplicates()` methods of the data frame object. 310 | 311 | 5. **Vectorized Operations**: Whenever possible, it is recommended to use vectorized computations rather than going for custom solutions. 312 | 1. **Operating on Strings**: There are vectorized methods in the `str` attribute of every dataframe column that contains strings. These functions enable us to do quick vectorized transformations on the df. 313 | 2. **Map Method**: There are often times when the `str` or other vectorized attributes will not be present. In such cases the `map()` method can be used for mapping operation succinctly. 314 | 315 | 6. **Assigning Index**: We can designate a column, or any other Numpy array of the same length to be the index by assigning it to the `df.index` attribute. 316 | 317 | 1. 
**Index Name**: Index by default, won't have name associated with it, but one can assign a name to the index by assigning it to the attribute `df.index.name`. The similar operation can be carried for assigning an index name to the column names using the `df.columns.name` attribute. 318 | 2. **Using Tuples as Index**: Often we would need to set two or more columns as index (much like composite keys in SQL). This can be done using Tuples. They list of columns that we need to be set as the composite index of a dataframe can be passed to the `set_index(["composite_key_column_1", "composite_key_column_2"])` to achieve this. It is called the **MultiIndex**. 319 | 3. **Sorting Index**: If we are using a **Multiindex** as shown above, we can also use the `sort_index()` method to sort the index and display it in a more organized manner. 320 | 321 | > This allows for **fancy indexing**, i.e. calling `df.loc[(["index_1_low" : "index_1_high"], "index_2"), :]` would select all the columns for rows that belong in the range provided for `index_1` and all sub rows belonging to `index_2`. 322 | 323 | > The `slice()` function must be used for slicing both indexes. 324 | 4. **Stacking and Unstacking Index**: We might want to remove some of the indexes from the multi level indexes to be columns. To do this, we use the method `unstack()` with the `level="index_name_to_remove"`. This will give us a hierarchical data frame and this effect can be reversed using the `stack()` method in the same format. 325 | 5. **Swapping Index Levels**: The index levels can be swapped using the method `swaplevel(0, 1)` on any dataframe. This would essentially exchange the hierarchical arrangement of indices and running `sort_index()` right after it would do the rest. 326 | 327 | 7. **Aggregation/Reduction**: The `groupby()` method is the Python equivalent of R's `aggregate()` method. It allows us to create virtual groups within the dataframe. It is usually chained together with other aggregation functions like `sum()`, `count()`, `unique()` etc. to produce meaningful results. We can use a typical grouping operation as follows: 328 | 329 | ```python 330 | titanic.groupby(['pclass', 'sex'])['survived'].count() 331 | ``` 332 | 333 | There is also the option of finding out multiple aggregation details on the grouped dataframe: 334 | 1. **Multiple Aggregations**: We can use `titanic.groupby('pclass').agg(['mean', 'sum'])` to compute multiple aggregation values at once. 335 | 2. **Custom Aggregations**: We can pass custom functions as arguments to `agg()` method that would take `Series` objects as inputs and produce results from them. When used, they would receive as inputs multiple `Series` objects (one for each group) and would produce grouped results like other functions. 336 | 3. **Differnet Agg on Different Columns**: We can pass a `dictionary` object to `agg()` method, as an argument, which would contain column names as keys and corresponding aggregation functions to apply as values. This allows us to compute different statistics for the same grouping of objects, upon different columns. 337 | 338 | 8. **Transformation**: Transformation functions are used to transform one or more columns after they have been grouped and is usually chained after the `groupby()` method as `transform(transformation_function)`. This transformation method passes the Series to `transform_function()` which could be a user defined function or a builtin one, which then returns a transformed series of a conforming size. 339 | 9. 
**Grouping and Filtering**: We can use the dictionary object created by `groupby()` method to loop over and therefore filter only the rows of interest. 340 | 10. **Sorting**: We can sort the values in any column by using the `sort_values(ascending = False)` method available for columns of all dataframe objects. 341 | 11. **Matrix Operations**: Direct matrix operations will not work on Dataframes but `pandas.core.frame.DataFrame` object comes with an `as_matrix()` method available to each object for converting it readily into a Numpy 2D Array. This will only work for DataFrames with only numerical values though. There is a crucial difference between **Numpy 2D Arrays** and **Pandas' DataFrames** and that is, `X[0]` for a Numpy 2D Array returns the ***0th row*** while accessing a DataFrame ***X[0]*** would return the ***0th column*** of the DataFrame. 342 | 12. **Mathematical Operations**: There are various mathematical operations available for our use. 343 | 1. **pct_change()**: This method can used to detect percentage change over a particular column or aggregation values. 344 | 2. **add()**: This method can be used to add two Series with corresponding row indices as `a.add(b)`. This would add the series `a` and `b`. However, if there are non matching indices, i.e. an index in `a` does not have any corresponding index in `b`, then this could return an `NaN` value. We can change this by changing the default non existent value by passing the argument `fill_value` into the `add()` method. This method is chainable so more than one Series can be added in a single line. 345 | 346 | ## Exploring Data 347 | 348 | 1. **Dimensions**: The `shape` attribute of any DataFrame can be used to check the dimensions. 349 | 2. **Column Names**: The `columns` attribute of a DataFrame returns the names of all the columns. 350 | 3. **Indices**: The atrributes `columns` and `index` can be used to retrieve the columns' and rows' index of a DataFrame. 351 | 4. **Column Details**: Much like the `str` function in R, `info()` method can be used over any pandas DataFrame object in order to retrieve meaningful insight on columns. It returns the name of the column and the number of Non Null values preent in the data column. 352 | 5. **Statistical Summary**: Statistical summaries for pandas DataFrames can be quickly generated using the `describe()` method on any pandas DataFrame. 353 | 6. **Interquantile range (IQR)**: Quantile ranges are useful when exploring a dataset and it can easily be determined by using `quantile()` method of Pandas dataframes. For instance, `pd_df.quantile([0.25, 0.75])` would return two values for each column in the dataset and half the data for those columns would lie between those two values. 354 | 7. **Range**: The range can be calculated using the `min()` and `max()` methods on any DataFrame. 355 | 8. **Median**: The `median()` method can be used for finding out the median of any given dataset. 356 | 9. **Standard deviation**: The method `std()` can be used for finding out the standard deviation for any given column. 357 | 10. **Unique objects**: Unique categories in any categorical column can be found using the `unique()` method. 358 | 11. **Frequency Count**: The frequency of factors in a column containing factors by using the `value_counts()` method on that column. Optionally, we could specify the `dropna` argument to this method with a Boolean Value specifying whether or not to involve null values. 359 | 12. 
**Data Type**: We can explore the data type for any column that we want to, by having a look at the values of the attribute `dtypes` for each column in data frame. 360 | 13. **Index of Max**: The `idxmax()` and `idxmin()` methods allow us to find the row or column labels where the maximum or minimum values are located with the help of `axis = 'columns'` for the column labels, and these methods default to `min()` so we won't have to specify anything there. 361 | 14. **Indexes of Non NULL Values**: One can get the indices of non null values by using the `notnull()` method available for Series objects in Pandas. 362 | 363 | ## Time Series with Pandas 364 | 365 | 1. **Slicing on the basis of time**: Interesting selections can be done on date time indices, using selection operators like `pd_df.loc['2017-01-22':'2018-01-02']` to select the values for that 1 month. 366 | 367 | 2. **Reindexing**: Any time series can be reindexed using the index of another time series by using the syntax `time_s2.reindex(time_s1.index)` 368 | 3. **Forward Filling of Null Values**: Null values in a time series can be forward filled using `time_series.reindex(new_index, method = "ffill")` 369 | 4. **Resampling**: There are two kinds of sampling that can be done with time series in Pandas dataframes: 370 | 1. **Downlsampling**: Reduce datetimes to lower frequency, i.e. from daily to weekly etc. 371 | 2. **Upsampling**: Increasing the times to higher frequency, i.e. from daily to hourly etc. 372 | 373 | There are multiple parameters that we can pass to `resample()` method for allowing us to derive quick statistics from a time series. For instance, `pd_ts.resample('D').mean()` would allow us to have daily averages of all numeric columns of the time series dataframe. 374 | 375 | | Code | Meaning | 376 | | ---- | ------- | 377 | | `min`, `T` | Minute | 378 | | `H` | Hour | 379 | | `D` | Day | 380 | | `B` | Business Day | 381 | | `W` | Week | 382 | | `M` | Month | 383 | | `Q` | Quarter | 384 | | `Y` | Year | 385 | 386 | Numeric values can be used as prefixes to the parameters above in order to increase or decrease the sampling size. For instance, `2W` can be used for downsampling for 2 weeks in `pd_ts.resample('2W').mean()` 387 | 388 | > The `ffill()` and `bfill()` methods can be used for filling in, rolling values as in `pd_ts.resample('2H').ffill()`. 389 | 390 | **Interpolation**: It can be carried out as follows: 391 | 392 | ```python 393 | pd_ts.resample('A').first().interpolate('linear') 394 | 395 | ``` 396 | 397 | This would resample and fill in the gaps between any two data points with **NaN** values. Then the `interpolate()` method would interpolate the values. 398 | 399 | 5. **Datetime Methods**: These methods allow us to slice and select specific date time attributes from the data. 400 | 1. **Select hours**: For instance `pd_ts['date'].dt.hour` would extract and return the *hour* where **0** is 12AM and **23** is 12PM. 401 | 2. **Timezone**: We can **define** the timezone for a particular column by using `pd_ts['date'].dt.tz_localize('US/Central')`. On the other hand, easy **conversions** can be made, using the method `tz_convert()` that is specifically used for converting dates and times in one timezone to another. 402 | 3. 403 | 404 | ### Moving Averages 405 | 406 | We can calculate moving averages, that allows us to focus on long term trends instead of being stuck in short term fluctuations. 
We do so by using the `rolling()` method as shown in `time_series.rolling(window=24).mean()` which would compute **new values for each point** but it will still be hourly data. This would instead set the value of that point as an average of trailing 24 datapoints, hence making it smoother. 407 | 408 | ``` 409 | 410 | """ 411 | for key in keys['order']: 412 | key_weight = keys['weights'][key] # Weight of the key 413 | key_data = keys['data'][key] 414 | if key_data['type'] == "singular": 415 | print("Jacuzzi") 416 | else: 417 | for section in key_data['order']: 418 | section_weight = key_data['weights'][section] 419 | section_data = key_data['data'][section] 420 | 421 | for unit in section_data: 422 | user_df[key+'_'+section+'_'+unit] = user_data_raw[section_data[unit]['user']].notnull().astype(int) 423 | 424 | 425 | print(user_df.columns) 426 | """ 427 | ``` 428 | 429 | 430 | 431 | -------------------------------------------------------------------------------- /13-Appendix/01-Programming/02-Python/04-SciPy.md: -------------------------------------------------------------------------------- 1 | # SciPy 2 | 3 | ## Read and Import Data 4 | 5 | ### Matlab Files 6 | 7 | We can read files from matlab using the `scipy.io.loadmat()` function as shown below 8 | 9 | ```python 10 | import scipy.io 11 | 12 | matlab_data = scipy.io.loadmat('filepath.mat') 13 | ``` 14 | 15 | This would create a dictionary object called `matlab_data` that would contain key value pairs for all the different kinds of objects that have been imported from the matlab file. 16 | 17 | > It is interesting to note here that Matlab files are usually permanently stored work environments of Matlab so multiple objects should be expected while importing a single matlab file. -------------------------------------------------------------------------------- /13-Appendix/01-Programming/02-Python/05-urllib.md: -------------------------------------------------------------------------------- 1 | # Web Libraries 2 | 3 | ## Importing Data 4 | 5 | ### Reading a CSV File 6 | 7 | Importing a csv from a web URL can be done with the **UrlLib** package as follows 8 | 9 | ```python 10 | from urllib.request import urlretrieve 11 | 12 | urlretrieve('http://onlinefilepath', 'local_file_path.csv') 13 | ``` 14 | 15 | ### Reading a JSON file 16 | 17 | ```python 18 | import json 19 | 20 | with open("file_path.json") as json_file: 21 | json_data = json.load(json_file) 22 | ``` 23 | This would create a dictionary obejct `json_data` allowing us to loop over or manipulate the data retrieved. 24 | 25 | #### Retrieving from APIs 26 | 27 | ```python 28 | import requests 29 | r = requests.get('http://url') 30 | json_data = r.json() 31 | ``` 32 | 33 | ## Scraping the web 34 | 35 | ### Using the 'requests' package 36 | 37 | ```python 38 | import requests 39 | raw = requests.get('http://url') 40 | text = raw.text 41 | ``` 42 | 43 | We can scrap any webpage with one line of code using the requests package and then simply addressing the `text` attribute of any raw result produced by the `requests.get()` function would give us the actual HTML result of the page. 44 | 45 | ### Beautiful Soup 46 | 47 | We will use the `text` object created in the previous code segment and then beautify it with the library to improve navigability. 
-------------------------------------------------------------------------------- /13-Appendix/01-Programming/02-Python/05-urllib.md: --------------------------------------------------------------------------------

# Web Libraries

## Importing Data

### Reading a CSV File

Importing a CSV file from a web URL can be done with the **urllib** package as follows:

```python
from urllib.request import urlretrieve

urlretrieve('http://onlinefilepath', 'local_file_path.csv')
```

### Reading a JSON file

```python
import json

with open("file_path.json") as json_file:
    json_data = json.load(json_file)
```

This would create a dictionary object `json_data`, allowing us to loop over or manipulate the data retrieved.

#### Retrieving from APIs

```python
import requests
r = requests.get('http://url')
json_data = r.json()
```

## Scraping the web

### Using the 'requests' package

```python
import requests
raw = requests.get('http://url')
text = raw.text
```

We can scrape any webpage with one line of code using the requests package; simply accessing the `text` attribute of the raw result produced by the `requests.get()` function gives us the actual HTML of the page.

### Beautiful Soup

We will use the `text` object created in the previous code segment and then beautify it with the library to improve navigability.

```python
from bs4 import BeautifulSoup
soup = BeautifulSoup(text)
```

Now that we have extracted the HTML data into a `soup` object, we can use attributes like `title` and methods like `get_text()`, `find_all()` etc. on it to extract more meaningful information.
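For instance, a small sketch of those attributes and methods on the `soup` object built above (the parser choice and the tags searched for are only illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'html.parser')   # 'html.parser' is the parser bundled with Python

print(soup.title)              # the <title> tag of the page, if present
print(soup.get_text()[:200])   # the plain text of the page (first 200 characters here)

# find_all() returns a list of matching tags; here, every hyperlink on the page
for link in soup.find_all('a'):
    print(link.get('href'))
```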
-------------------------------------------------------------------------------- /13-Appendix/01-Programming/02-Python/README.md: --------------------------------------------------------------------------------

# Python Programming

## Libraries

The following important libraries have been explored in addition to the text below:

1. [Numpy](./01-Numpy.md)
2. [MatPlotLib](./02-MatPlotLib.md)
3. [Pandas](./03-Pandas.md)
4. [SciPy](./04-SciPy.md)
5. [Web Libraries](./05-urllib.md)

## Data Types

### Tuples

**Tuples** are immutable ordered objects that are created using **()**. For defining a new tuple, one could simply write `x = (1, 2, 3)`. This would create a new tuple with the values defined. In order to extract these values into separate variables, one could write `a, b, c = x` and the values would be mapped to the corresponding variables. Individual elements can also be accessed just like the elements of lists, i.e. `print(x[2])` would print **3**, as one would expect from a list.

### Dictionaries

#### Validity

These are data types that store data in the form of key-value pairs. The only thing worth noting here is that **keys** need to be immutable values, i.e. they cannot be lists or other dictionaries, because those are mutable by their very definition.

#### Check for existence

In order to quickly check whether or not a key exists in a particular dictionary, use the syntax `"key" in dict`, which would return `True` if the key is present in the dictionary object `dict`.

#### Deletion

The obvious function for most deletions, `del`, is used for **deleting** a **key-value pair** in any dictionary, with the syntax being `del(dict["key"])`.

### Strings

#### Manipulations in Strings

1. **Search for substring**: the `contains()` method of the `.str` accessor (available on Pandas string columns) is used for getting back a Boolean result for a string column, with the value *True* if the substring exists and *False* otherwise. For a plain Python string, the `in` operator does the same job.



## Manipulations

### Iteration

* **Iterable**: Any object that has an associated `iter()` method is called an iterable in Python terminology.
* **Iterator**: Any object that has an associated `next()` method that presents the next value when called. An iterator object can be created using the `iter()` method as follows

```python
x = 'Test'
it_obj = iter(x)

next(it_obj)
# Prints 'T'

next(it_obj)
# Prints 'e'
# ... so forth and so on
```

All remaining values of the iterator can be printed in a single call using the **\*** operator, as in `print(*it_obj)`; on a fresh iterator this would print `T e s t`, and calling `next()` afterwards would throw the end-of-iteration error (`StopIteration`).

#### Lists

We can iterate over lists using the `enumerate` function: `for i, val in enumerate(my_list)` returns the index and value in `i` and `val` respectively on every iteration of the loop. We can change the first index of the enumeration by using the `start` parameter of the `enumerate` function, as in `for i, val in enumerate(my_list, start = 10)`, which would start the indexing at 10 instead of 0.

#### Dictionaries

##### Using for loop

In the case of dictionaries, the method `items()` must be called in order to properly iterate over items. The syntax for this is as follows:

```python
for key, value in my_dict.items():
    do_something(key, value)
```

##### Enumeration

A common requirement with `for` loops is iterating with both an index and a value. For dictionaries this is done with `items()` as shown above; for lists, `for index, value in enumerate(my_list)` produces both indices and values.

### Zipping

Zipping objects using the `zip()` function creates a zip object, which can be used as follows:

```python
name = ["Apple", "Xiaomi", "LG"]
brand_id = [121, 289, 323]

z = zip(name, brand_id)
```
A **zip object** is an iterator of tuples, which can be turned into a list object using the **list** function. This list object would be a list of tuples containing pairs in the format *(name, brand_id)*.

Objects of **zip** type can be unzipped into two separate variables using the **zip()** function in combination with **\***, as shown in `name_new, brand_id_new = zip(*z)`.
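A short sketch of that round trip, using the same toy data as above:

```python
name = ["Apple", "Xiaomi", "LG"]
brand_id = [121, 289, 323]

pairs = list(zip(name, brand_id))   # [('Apple', 121), ('Xiaomi', 289), ('LG', 323)]
print(pairs)

# A zip object can only be consumed once, so we unzip from the materialised list here;
# zip(*...) effectively reverses the original zipping.
name_new, brand_id_new = zip(*pairs)
print(name_new)       # ('Apple', 'Xiaomi', 'LG')
print(brand_id_new)   # (121, 289, 323)
```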
### Comprehensions

#### Simple List Comprehensions

A list comprehension is used to build lists without using for loops and comprises three basic elements, listed below:

1. iterable
2. iterator variable
3. output expression

```python
new_list = [k + 2 for k in old_list]
```
The code above would iterate over all elements in `old_list` and add **2** to them to create the new list called `new_list`.

#### Conditional LCs

Conditional statements can also be added to LCs, as shown in `new_list = [k + 2 for k in old_list if k % 2 == 0]`. This would return values only for the numbers that are even.

> There is one obvious problem in this technique of conditional LC; the size of the list returned might not be the same as the size of the input list.

In order to fix this, an `if ... else` expression must be used instead. The only difference is that the conditional is now written before the `for` loop, as in `new_list = [k + 2 if k % 2 == 0 else 0 for k in old_list]`.

#### Dictionary Comprehensions

They are pretty much similar to the list comprehensions mentioned above, except that curly brackets are used instead of square brackets and each output is a `key: value` pair.
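A minimal sketch of the comprehension forms above, on a small made-up list:

```python
old_list = [1, 2, 3, 4, 5]

evens_plus_two = [k + 2 for k in old_list if k % 2 == 0]    # [4, 6] -- filtered, so shorter
padded = [k + 2 if k % 2 == 0 else 0 for k in old_list]     # [0, 4, 0, 6, 0] -- same length

squares = {k: k ** 2 for k in old_list}                     # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

print(evens_plus_two, padded, squares)
```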
### Generator

A generator object can be created by using parentheses instead of the usual square brackets, as in `(k + 2 for k in old_list)`. It returns an iterable object of the type generator.

> It works on the concept of **lazy evaluation**, therefore saving the memory required for evaluation until the value is actually needed.
>
> **Generator Functions** have been defined under the "Functions" heading below.

## Functions

New functions can be defined in Python using the following structure:

```python
def my_function(parameters):
    """ Docstring for the function """
    code_to_execute
    return result

my_function(arguments)
```

Here the name of the function is `my_function`; it takes in the parameters and returns a value. This value can then be stored for later use whenever necessary.

> The **Docstring** serves as documentation and should be written enclosed in triple double-quotes, helping future interpretation and modification of our functions.

### Functions with variable arguments

1. **Variable number of parameters**

	```python
	def my_function(*parameters):
	    """ Docstring for the function """
	    code_to_execute
	    return result

	my_function(argument1, argument2)
	```

	The function defined above would take as many inputs as provided and convert them into a tuple, therefore allowing you to add as many arguments as necessary in one go.

2. **Variable keyword parameters**

	```python
	def my_function(**parameters):
	    """ Docstring for the function """
	    for k, v in parameters.items():
	        do_something(k, v)
	    return result

	my_function(key1 = value1, key2 = value2)
	```

	The function above allows us to pass multiple arguments in the form of key-value pairs, therefore allowing us to access the values by the names that they came with, depending on the use case.

### Scope

The scope hierarchy is as follows:

1. Local Scope
2. Enclosing Scope(s) (in order)
3. Global Scope
4. Builtin Scope


If a variable in the global scope needs to be altered from within a function, it can be done using the keyword `global`. Therefore

```python
x = 5
def my_function(parameters):
    """ Docstring for the function """
    global x
    x = 10


my_function(arguments)
```

the function above would alter the global value of **x**, because with the `global` keyword we have declared that we want to modify the global **x** rather than create a local one.

### Lambda Functions

Functions can be generated on the fly using the keyword `lambda`; for instance, `multiply = lambda x, y: x * y` creates a function `multiply` that can then be called like a normal function to find the product of `x` and `y`.

### Generator Functions

These can be used to create functions that produce their values lazily at runtime, which makes rather complicated tasks easy to express.

```python
def my_function(parameters):
    """ A function that acts like a list being generated in realtime """
    for i in parameters:  # Assuming parameters is a list
        yield i + 1

x = my_function([1, 2, 4])
```

This results in a generator object being created and stored in the variable **x**. This **x** can then be looped over as shown in the code below, which means the function body is executed step by step, one `yield` at a time, on every iteration of the `for` loop that runs over `x`.

```python
for item in x:
    print(item)
```


### Unexpected Behaviors

#### Exception Handling

We can catch different types of exceptions that occur during runtime for a given function using the code below.

```python
def add_two(parameter):
    """ Adds 2 to any number provided """
    try:
        return parameter + 2
    except TypeError:
        print("parameter must be an integer or float")
```

#### Custom Errors

We can manually raise different types of errors from a given function using the code below.

```python
def add_two_to_int(parameter):
    """ Adds 2 to any whole number provided """
    if round(parameter) != parameter:
        raise ValueError("argument provided must be a whole number")
    try:
        return parameter + 2
    except TypeError:
        print("parameter must be an integer or float")
```

## Data Generation

There are multiple ways to generate data in Python and some of them are as follows:

1. **Controlled by Number of Points**: `np.linspace(0, 10, 10)` takes in three arguments: the starting point, the ending point, and the number of points to generate between the two (a small sketch follows below).
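A small sketch of that call, with arbitrary start, stop and count values:

```python
import numpy as np

x = np.linspace(0, 10, 10)                  # 10 evenly spaced points from 0 to 10, both endpoints included
print(x)                                    # [ 0.  1.111...  2.222...  ...  8.889...  10. ]

y = np.linspace(0, 10, 10, endpoint=False)  # excludes the stop value, giving a step of exactly 1.0
print(y)                                    # [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
```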
-------------------------------------------------------------------------------- /13-Appendix/01-Programming/README.md: --------------------------------------------------------------------------------

# Programming

1. R
	1. [Dplyr Tutorial](./01-R/01-DplyrTutorial.md)

-------------------------------------------------------------------------------- /13-Appendix/02-ApplicationAreas/01-Introduction.md: --------------------------------------------------------------------------------

# Applications in Financial Services

##### Traditional Sources of data

Traditionally, the following indicators have been used as data for finance:

1. Macro-economic indicators: GDP, turnover etc.
2. Performance of investments over time: how well did the stocks do?
3. Records of the behavior of clients: whether they paid on time or not

##### New sources of data

Data science takes this a step further, because new sources can now be involved in the decision-making process:

1. Unstructured text: tweets, natural language records etc.
2. Sequence data: how things change over time if sequence 1 is replaced with sequence 2.
3. Biometric markers: How
4. Network data: social relationships and insights into the records

## Decisions that need to be made

1. Acceptable level of risk
2. False positives and false negatives
3. Sources of your data: how reliable are these?
4. Importance of real-time analysis (because it would impact the range of algorithms that we can use)
5. Importance of interpretability

## Common ML Algorithms used with Financial Data

1. KNN
2. Regression
3. Decision Trees and RFs
4. ANNs
5. Deep Learning

## Potential Problems

1. **Flash crash**: A very large change in the value of a security within a very brief period of time is called a flash crash. A recent one was when the value of Ethereum went from $319 to 10 cents in 2017 [1](https://www.cnbc.com/2017/06/22/ethereum-price-crash-10-cents-gdax-exchange-after-multimillion-dollar-trade.html).
2. **Relevant data missing**: The models are certainly not going to work if they have not been provided with all the information that is required to predict a particular value.
3. Overfitting
4. **Black Box Models**: It is very difficult to explain to a non-technical person how models that give very impressive results actually work.
5. **Unintentionally using restricted information**: Attributes like race, gender etc. are often included in the application used for training, and many countries restrict their use for any financial purpose because of the discriminatory biases that might creep in with them into the algorithm.
6. **Too fast for humans**: Often the algorithms being used in these areas are too fast for humans to monitor.
7. **Feelings of Helplessness**: If and when things do go wrong, people may feel helpless because of the above reasons.

## Key application areas

1. **Algorithmic Trading**: There are multiple advantages to algorithmic trading: computers can look at a lot more data, much faster than humans can, and profitability increases as the two are combined. Commonly used techniques in this area are regression, moving averages, channel breakouts, arbitrage (if a security is listed on more than one exchange, buy and sell it at the same time on the various exchanges), neural networks and deep learning.
2. **Credit Card and Loans**: There is large-scale analysis and trend recognition done on this kind of data. Credit spending can now be predicted using widely varying kinds of data like

-------------------------------------------------------------------------------- /13-Appendix/README.md: --------------------------------------------------------------------------------

# Appendix

1. [Programming](./01-Programming)
	1. [R](./01-Programming/01-R)

-------------------------------------------------------------------------------- /LICENSE: --------------------------------------------------------------------------------

MIT License

Copyright (c) 2017 Purnesh Tripathi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

-------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------

# Machine Learning
My online notebook for Machine Learning notes

This book contains short notes for the key machine learning algorithms and topics.
In time, I will also update it to include references to other topics that I find interesting.

#### [Prerequisites](./00-Prerequisites/)


1. [Regression](./01-Regression/)
2. [Classification](./02-Classification/)
3. [Clustering](./03-Clustering/)
4. [Association Rule Learning](./04-AssociationRuleLearning/)
5. [Reinforcement Learning](./05-ReinforcementLearning/)
6. [Natural Language Processing](./06-NaturalLanguageProcessing/)
7. [Deep Learning](./07-DeepLearning/)
8. [Dimensionality Reduction](./08-DimensionalityReduction/)
9. [Recommendation Engines](./09-RecommendationEngines/)
10. [Model Selection and Boosting](./10-ModelSelectionAndBoosting/)
11. [Time Series](./11-TimeSeries/)
12. [Constraint Satisfaction Problems](./12-ConstraintSatisfactionProblems/)
13. [Appendix](./13-Appendix/)

--------------------------------------------------------------------------------