# Statistics in ML & DL Cheat Sheet

## Table of Contents
1. [Basics (Statistics From Scratch)](#basics-from-scratch)
2. [Advanced statistics](#advanced)
3. [Linear Algebra](#linear-algebra)
4. [Calculus](#calculus)
5. [Numerical Optimization](#numerical-optimization)
6. [Measurement Theory](#measurement-theory)

---

## Basics (From Scratch)

### Descriptive Statistics:
Descriptive statistics summarize the main characteristics of a dataset.

- **Mean:** The average of all data points: the sum of the values divided by the number of values.
- **Median:** The middle value in sorted data. If there's an even number of observations, it's the average of the two middle numbers.
- **Mode:** The value that appears most frequently in a data set. A dataset can have one mode, more than one mode, or no mode at all.
- **Variance:** Measures how spread out the values are around the mean. It's the average of the squared differences from the mean (the sample variance divides by n - 1 instead of n).
- **Standard Deviation:** The square root of the variance. It expresses the typical spread around the mean in the same units as the data.
- **Skewness:** Measures the asymmetry of the probability distribution of a random variable about its mean. Positive skewness indicates a longer tail extending toward larger values.
- **Kurtosis:** Describes how heavy the tails of a distribution are. Higher kurtosis means more of the variance comes from infrequent extreme deviations, as opposed to frequent modestly sized deviations.
- **Range:** The difference between the highest and lowest values in a dataset.
- **Interquartile Range (IQR):** The spread of the middle 50% of the data. It's the difference between the third quartile (Q3) and the first quartile (Q1).

### Probability Distributions:
Probability distributions describe how the values of a random variable are distributed.

- **Normal Distribution:** A symmetrical, bell-shaped curve where most observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions.
- **Binomial Distribution:** The number of successes in a fixed number of independent binary trials that share the same success probability, e.g., the number of heads in ten coin tosses.
- **Poisson Distribution:** The number of events occurring in a fixed interval of time or space, assuming events happen independently at a constant average rate.
- **Exponential Distribution:** Describes the time between events in a Poisson process. It's often used to model the time elapsed between events.
- **Uniform Distribution:** All outcomes in the support are equally likely.
- **Gamma Distribution:** A two-parameter family of continuous probability distributions. It generalizes the exponential distribution.
- **Beta Distribution:** A continuous distribution on the interval [0, 1], often used to model proportions and probabilities.

### Hypothesis Testing:
Hypothesis testing is a statistical method used to make decisions or inferences about population parameters based on a sample.

- **Null Hypothesis (H0):** The initial claim or status quo. It's a statement of no effect or no difference.
- **Alternative Hypothesis (H1 or Ha):** Represents what the researcher aims to show. It's a statement indicating the presence of an effect or difference.
- **Type I Error (α):** Rejecting the null hypothesis when it's actually true, commonly called a "false positive." Its probability is the significance level α.
- **Type II Error (β):** Failing to reject the null hypothesis when the alternative hypothesis is true, known as a "false negative." Its probability is β, and 1 - β is the power of the test.
- **Chi-Squared Test:** Used for testing relationships between categorical variables. It compares the observed frequencies to the frequencies we would expect if there were no relationship between the variables.
- **F-Test:** Compares the variances of two populations. It's used to test if two population variances are equal.
- **Central Limit Theorem:** States that the distribution of sample means approaches a normal distribution as the sample size gets larger, regardless of the shape of the population distribution, provided the population has finite variance.

### Correlation & Sampling:
Correlation measures the strength and direction of the relationship between two variables; sampling determines how observations are selected from a population.

- **Pearson Correlation Coefficient (r):** Measures the linear relationship between two variables. It ranges from -1 to 1, with -1 indicating a perfect negative linear relationship, 0 indicating no linear relationship, and 1 indicating a perfect positive linear relationship.
- **Spearman Rank Correlation:** A non-parametric measure of the strength and direction of the relationship between two variables, computed on ranks. It assesses how well the relationship can be described by a monotonic function.
- **Random Sampling:** A subset of individuals chosen from a larger set where each individual has an equal chance of being selected.
- **Stratified Sampling:** Divides the population into subgroups or strata, and random samples are taken from each stratum.

### Estimation & Experimental Design:
Estimation involves approximating a population parameter based on a sample from the population.

- **Point Estimation:** Provides a single value as an estimate of a parameter.
- **Interval Estimation:** Provides a range of values as an estimate of a parameter. This range is called a confidence interval.
- **Factorial Experiments:** Experiments that study the effects of multiple factors, and their interactions, simultaneously.
- **Cross-Over Design:** An experimental design in which each subject receives different treatments during different periods of the experiment.

---

## Advanced

### Inferential & Multivariate Analysis:
Inferential statistics allow you to infer population parameters from sample statistics.

- **Z-Score:** Represents how many standard deviations a data point is from the mean. It's used to compare scores from different distributions.
- **T-Test:** A hypothesis test used to compare the means of two groups. It assumes that the populations the samples come from are approximately normally distributed.
- **Cohen's d:** A measure of effect size that indicates the standardized difference between two means.
- **Effect Size:** Represents the magnitude of a relationship between two or more variables, independent of sample size.
- **Canonical Correlation:** Measures the association between two sets of variables. It finds linear combinations of each set that are maximally correlated with each other.
- **Discriminant Analysis:** A classification technique. It determines which variables discriminate between two or more naturally occurring groups.

### Bayesian Statistics & Time Series Analysis:
Bayesian statistics treats probability as a degree of belief about uncertain quantities and updates that belief as evidence arrives.

- **Bayes' Theorem:** Provides a way to update probabilities given new evidence. It relates the posterior belief to the prior belief and the likelihood of the evidence.
- **Prior Distribution:** Represents what we know about a variable before considering the current observed evidence.
- **Posterior Distribution:** Represents what we know about a variable after considering the current observed evidence.
- **Autoregression (AR):** A time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step.
- **Moving Average (MA):** A time series model that expresses the current value as a linear combination of the current and past forecast errors (residuals).
- **ARIMA (AutoRegressive Integrated Moving Average):** A class of models that combines AR and MA models with a differencing pre-processing step that makes the sequence stationary.

### Non-Parametric & Robust Statistics:
Non-parametric statistics don't assume that the data follows a particular distribution.

- **Mann-Whitney U Test:** A non-parametric test that compares two independent samples. It tests whether values from one group tend to be larger than values from the other, and is a common alternative to the two-sample t-test.
- **Wilcoxon Signed-Rank Test:** A non-parametric test that compares two paired groups.
- **Kruskal-Wallis H Test:** A rank-based non-parametric test that can be used to determine if there are statistically significant differences between two or more groups of an independent variable on a continuous or ordinal dependent variable.
- **Huber's M-Estimator:** A method used in robust regression that is less sensitive to outliers in data than the least squares method.

### Causal Inference & Survival Analysis:
Causal inference is the process of determining whether, and how strongly, one variable causes a change in another.

- **Randomized Controlled Trials (RCTs):** An experimental setup where participants are randomly assigned to an experimental group or a control group.
- **Propensity Score Matching:** A statistical matching technique that attempts to estimate the effect of a treatment by accounting for the covariates that predict receiving the treatment.
- **Kaplan-Meier Estimator:** A non-parametric statistic used to estimate the survival function from lifetime data.
- **Cox Proportional-Hazards Model:** A regression model commonly used in medical research for investigating the association between the survival time of patients and one or more predictor variables.

---

## Linear Algebra

### Basics:
Linear algebra is a branch of mathematics concerning linear equations, linear functions, and their representations through matrices and vector spaces.

- **Vector:** A quantity having direction and magnitude. It can be represented as an array of numbers.
- **Matrix:** A rectangular array of numbers, symbols, or expressions, arranged in rows and columns.
- **Transpose:** A matrix derived by interchanging rows and columns.
- **Determinant:** A scalar value derived from a square matrix. The matrix has an inverse if and only if its determinant is non-zero.
- **Inverse:** The matrix that, when multiplied by the original square matrix, yields the identity matrix.
- **Eigenvalues and Eigenvectors:** Scalar and vector values, respectively, that satisfy the equation `A*v = lambda*v` where A is a square matrix, v is a non-zero eigenvector, and lambda is the eigenvalue.
- **Dot Product:** A scalar product of two vectors, equal to the product of their magnitudes and the cosine of the angle between them.
- **Cross Product:** A vector product of two three-dimensional vectors, resulting in a vector perpendicular to both.
- **Orthogonal:** Two vectors are orthogonal if their dot product is zero, meaning they are at right angles to each other.
- **Orthonormal:** A set of vectors that are mutually orthogonal and each of unit length.
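
The vector and matrix definitions above can be checked numerically. The following is a minimal sketch in plain Python; the helper names (`dot`, `norm`, `cross`, `cosine`, `det2`, `matvec`) are illustrative and not part of this cheat sheet. Note that the cosine of the angle is the dot product divided by both magnitudes, and that `det2` only handles 2x2 matrices.

```python
import math

def dot(u, v):
    """Dot product: sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length (magnitude) of a vector."""
    return math.sqrt(dot(u, u))

def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    return dot(u, v) / (norm(u) * norm(v))

def cross(u, v):
    """Cross product of two 3-D vectors; the result is perpendicular to both."""
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]

def det2(m):
    """Determinant of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    return a * d - b * c

def matvec(m, v):
    """Matrix-vector product, one dot product per row."""
    return [dot(row, v) for row in m]

u = [1.0, 0.0, 0.0]
v = [0.0, 2.0, 0.0]
w = cross(u, v)

print(dot(u, v))            # 0.0, so u and v are orthogonal
print(w)                    # [0.0, 0.0, 2.0]
print(dot(w, u), dot(w, v)) # 0.0 0.0, so w is perpendicular to both

# A singular matrix has determinant zero and therefore no inverse.
print(det2([[1.0, 2.0], [2.0, 4.0]]))  # 0.0

# Eigenpair check: A*v = lambda*v with lambda = 3 for this diagonal matrix.
A = [[2.0, 0.0], [0.0, 3.0]]
e = [0.0, 1.0]
print(matvec(A, e))         # [0.0, 3.0]
```

For real workloads a numerical library would normally replace these hand-written helpers; the point here is only to make each definition concrete.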

### Matrix Factorizations:
Matrix factorization is the breaking down of one matrix into a product of multiple matrices.

- **Row Echelon Form:** A form of a matrix where the leading entry of each row is to the right of the leading entry of the previous row, and any all-zero rows are at the bottom. It's used to solve systems of linear equations.
- **Reduced Row Echelon Form:** A matrix in row echelon form where, in addition, every leading entry is 1 and is the only non-zero entry in its column. Every matrix has exactly one reduced row echelon form, from which the solutions of the system (if any) can be read off directly.
- **Gaussian Elimination:** A method to solve systems of linear equations by transforming the system to an upper triangular matrix. It's a sequence of row operations performed on the associated matrix of coefficients.
- **LU Decomposition:** Decomposing a matrix into a product of a lower triangular matrix (L) and an upper triangular matrix (U). It simplifies solving systems of linear equations and computing determinants and inverses.

---

## Calculus

### Basics:
Calculus is a branch of mathematics that studies continuous change, via derivatives and integrals.

- **Limit:** The value a function approaches as the input approaches a certain value. It's foundational to understanding the concepts of derivatives and integrals.
- **Derivative:** Measures how a function changes as its input changes. It represents the slope of the tangent line to the function at any point.
- **Integral:** Represents the signed area under a curve. It can be thought of as the "opposite" of differentiation.
- **Partial Derivative:** Derivative of a function with respect to one of its variables, treating the other variables as constants. It's used in multivariable calculus.
- **Gradient:** A vector containing all the partial derivatives of a function. It points in the direction of the greatest rate of increase of the function.
- **Chain Rule:** A formula to compute the derivative of a composite function. It's used when differentiating a function that has another function inside it.
- **Product Rule:** A formula to compute the derivative of the product of two functions.
- **Quotient Rule:** A formula to compute the derivative of the quotient of two functions.

### Techniques of Integration:
Integration techniques are methods used to find antiderivatives and integrals of functions.

- **Integration by Parts:** A method based on the product rule for differentiation. It's used to integrate products of functions.
- **Integration by Substitution:** Changing the variable of integration to simplify the integral. It's the counterpart of the chain rule for differentiation.
- **Partial Fraction Decomposition:** Expressing a rational function as a sum of simpler fractions to simplify the integral. It's used for integrating rational functions.

### Multivariable Calculus:
Multivariable calculus extends calculus to functions of more than one variable.

- **Double Integral:** An integral over a region in the plane. It gives the signed volume under the surface defined by a function of two variables.
- **Triple Integral:** An integral of a function of three variables over a region in space. Integrating the constant 1 gives the volume of the region; integrating a density gives the total mass.
- **Line Integral:** An integral over a curve in the plane or space. It's used to find the work done by a force field in moving a particle along a curve.
- **Surface Integral:** An integral over a surface in space. It's used to find the flux of a vector field across a surface.
- **Divergence:** A scalar measure of a vector field's source or sink at a given point. It represents the rate at which "density" exits a given region of space.
- **Curl:** A vector measure of the local rotation or circulation of a vector field at a given point.
- **Stokes' Theorem:** Relates the surface integral of the curl of a vector field to the line integral of the field around the boundary of the surface. It's a fundamental theorem in vector calculus.
- **Green's Theorem:** Relates a line integral around a simple closed curve C to a double integral over the plane region bounded by C. It's a special case of Stokes' theorem.
- **Divergence Theorem:** Relates the triple integral of the divergence of a vector field over a volume bounded by a closed surface S to the flux of the field through S. It's used to transform volume integrals to surface integrals.

---

## Numerical Optimization

### Basics:
Numerical optimization involves finding the best solution or approximation to a problem using numerical methods.

- **Gradient Descent:** An iterative optimization algorithm used to find a local minimum of a function. It works by taking steps proportional to the negative of the gradient of the function at the current point.
- **Newton's Method:** An iterative method used to find successively better approximations to the roots (or zeros) of a real-valued function. In optimization it's applied to the gradient to find stationary points, using second-derivative (Hessian) information.
- **Conjugate Gradient:** An algorithm for the numerical solution of linear systems whose matrix is symmetric and positive-definite. It's often used when the system is large and sparse.
- **Lagrange Multipliers:** A method for finding the local maxima and minima of a function subject to equality constraints.
- **Simplex Algorithm:** A popular algorithm for numerical solution of linear programming problems.

### Constraints & Regularization:
Constraints restrict the feasible solutions in optimization, while regularization adds penalties to prevent overfitting.

- **Bound Constraints:** Limit the values of the decision variables to lie in a specified range.
- **Linear Constraints:** Require linear functions of the decision variables to be less than, equal to, or greater than a constant.
- **Nonlinear Constraints:** Involve nonlinear functions of the decision variables.
- **L1 Regularization (Lasso):** Adds a penalty proportional to the sum of the absolute values of the coefficients. It can lead to some coefficients being exactly zero.
- **L2 Regularization (Ridge):** Adds a penalty proportional to the sum of the squared coefficients. It prevents overfitting by penalizing large coefficients.

---

## Measurement Theory

### Basics:
Measurement theory deals with the assignment of numbers to objects or events according to specific rules.

- **Nominal Scale:** A scale of measurement where numbers serve only as labels and do not have any quantitative significance.
- **Ordinal Scale:** A scale of measurement that allows for rank order (1st, 2nd, 3rd, etc.) by which data can be sorted, but differences between data are not meaningful.
- **Interval Scale:** A scale of measurement where the difference between two values is meaningful. It doesn't have a true zero point, so ratios are not meaningful.
- **Ratio Scale:** A scale of measurement that has a true zero point, so both differences and ratios are meaningful. It's the most informative scale.
- **Reliability:** The consistency or repeatability of a measure. If the same test is repeated under the same conditions, it should produce similar results.
- **Validity:** The extent to which a test measures what it claims to measure.

### Scaling & Transformation:
Scaling and transformation involve changing the scale or values of a variable.

- **Z-Score Scaling:** Rescales data to have a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution.
- **Min-Max Scaling:** Transforms data to fit within a specified range, usually [0,1].
- **Log Transformation:** Used to transform right-skewed, positive data so that it more closely approximates a normal distribution.
- **Box-Cox Transformation:** A family of power transformations for strictly positive data, used to stabilize variance and make the data more normal in distribution.

---
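
The transforms listed under Scaling & Transformation can be sketched with the Python standard library. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, `z_score_scale` uses the population standard deviation, both scalers assume the data are not all equal, and `box_cox` uses the standard one-parameter form with a fixed `lam` chosen by the caller rather than estimated from the data.

```python
import math
import statistics

def z_score_scale(xs):
    """Shift to mean 0 and scale to standard deviation 1 (population std)."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)  # assumes the values are not all identical
    return [(x - mu) / sigma for x in xs]

def min_max_scale(xs, lo=0.0, hi=1.0):
    """Map the smallest value to lo and the largest to hi, linearly."""
    x_min, x_max = min(xs), max(xs)  # assumes x_max > x_min
    return [lo + (x - x_min) * (hi - lo) / (x_max - x_min) for x in xs]

def log_transform(xs):
    """Natural log; only defined for strictly positive values."""
    return [math.log(x) for x in xs]

def box_cox(xs, lam):
    """One-parameter Box-Cox: (x**lam - 1) / lam, or log(x) when lam == 0."""
    if lam == 0:
        return log_transform(xs)
    return [(x ** lam - 1.0) / lam for x in xs]

data = [1.0, 2.0, 4.0, 8.0, 16.0]  # right-skewed, strictly positive

z = z_score_scale(data)
print(statistics.fmean(z), statistics.pstdev(z))  # approximately 0 and 1

scaled = min_max_scale(data)
print(scaled[0], scaled[-1])  # 0.0 1.0

# On this geometric sequence the log transform yields evenly spaced values.
logged = log_transform(data)
print([round(b - a, 6) for a, b in zip(logged, logged[1:])])
```

Z-score scaling changes only location and scale, so skewed input stays skewed; the log and Box-Cox transforms are the ones that change the shape.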