└── README.md


/README.md:
--------------------------------------------------------------------------------
  1 | # Statistics in ML & DL Cheat Sheet
  2 | 
  3 | ## Table of Contents
  4 | 1. [Basics (Statistics From Scratch)](#basics-from-scratch)
  5 | 2. [Advanced statistics](#advanced)
  6 | 3. [Linear Algebra](#linear-algebra)
  7 | 4. [Calculus](#calculus)
  8 | 5. [Numerical Optimization](#numerical-optimization)
  9 | 6. [Measurement Theory](#measurement-theory)
 10 | 
 11 | ---
 12 | 
 13 | ## Basics (From Scratch)
 14 | 
 15 | ### Descriptive Statistics:
 16 | Descriptive statistics summarize the main characteristics of a dataset, such as its center, spread, and shape.
 17 | 
 18 | - **Mean:** The average of all data points. It's the sum of all data points divided by the number of data points.
 19 | - **Median:** The middle value in sorted data. If there's an even number of observations, it's the average of the two middle numbers.
 20 | - **Mode:** The value that appears most frequently in a data set. A dataset can have one mode, more than one mode, or no mode at all.
 21 | - **Variance:** Measures how far the values in a set are spread from the mean. It's the average of the squared differences from the mean (dividing by n for a population, or by n - 1 for a sample estimate).
 22 | - **Standard Deviation:** The square root of the variance. It measures the typical distance of data points from the mean, in the same units as the data.
 23 | - **Skewness:** Measures the asymmetry of the probability distribution of a random variable about its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values.
 24 | - **Kurtosis:** Describes how heavy the tails of a distribution are relative to a normal distribution. Higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations.
 25 | - **Range:** The difference between the highest and lowest values in a dataset.
 26 | - **Interquartile Range (IQR):** The spread of the middle 50% of the data. It's the difference between the third quartile (Q3) and the first quartile (Q1).
 27 | 
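As a rough illustration of the measures listed above, the following sketch computes them with Python's standard `statistics` module. The sample `data` is made up for the example, and the quartile method (`"inclusive"`) is one assumption among several common conventions.

```python
import statistics

# Illustrative sample, not taken from the cheat sheet.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)                 # sum of values / number of values
median = statistics.median(data)             # average of the two middle values (even count)
mode = statistics.mode(data)                 # most frequent value
pop_variance = statistics.pvariance(data)    # divides by n
sample_variance = statistics.variance(data)  # divides by n - 1
pop_std = statistics.pstdev(data)            # square root of the population variance
value_range = max(data) - min(data)

# Quartile values depend on the interpolation convention chosen here.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode, pop_variance, sample_variance, pop_std, value_range, iqr)
```

Skewness and kurtosis are not in the standard library; third-party packages such as SciPy provide them.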
 28 | ### Probability Distributions:
 29 | Probability distributions describe how the values of a random variable are distributed.
 30 | 
 31 | - **Normal Distribution:** A symmetrical, bell-shaped curve where most observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions.
 32 | - **Binomial Distribution:** Represents the number of successes in a fixed number of independent binary trials that share the same success probability, e.g., the number of heads in 10 coin tosses.
 33 | - **Poisson Distribution:** Represents the number of events in a fixed interval of time or space, assuming events occur independently at a constant average rate.
 34 | - **Exponential Distribution:** Describes the time between events in a Poisson process. It's often used to model the time elapsed between events.
 35 | - **Uniform Distribution:** All outcomes in the distribution's range are equally likely.
 36 | - **Gamma Distribution:** A two-parameter family of continuous probability distributions. It generalizes the exponential distribution.
 37 | - **Beta Distribution:** A two-parameter family of continuous distributions defined on the interval from 0 to 1. It's often used to model proportions and probabilities.
 38 | 
 39 | ### Hypothesis Testing:
 40 | Hypothesis testing is a statistical method used to make decisions or inferences about population parameters based on a sample.
 41 | 
 42 | - **Null Hypothesis (H0):** The initial claim or status quo. It's a statement of no effect or no difference.
 43 | - **Alternative Hypothesis (H1 or Ha):** Represents what the researcher aims to prove. It's a statement indicating the presence of an effect or difference.
 44 | - **Type I Error:** Rejecting the null hypothesis when it's actually true, commonly referred to as a "false positive." Its probability is the significance level α.
 45 | - **Type II Error:** Not rejecting the null hypothesis when the alternative hypothesis is true, known as a "false negative." Its probability is β, and 1 - β is the power of the test.
 46 | - **Chi-Squared Test:** Used for testing relationships between categorical variables. It compares the observed frequencies to the frequencies we would expect if there were no relationship between the variables.
 47 | - **F-Test:** Compares the variances of two populations. It's used to test if two population variances are equal.
 48 | - **Central Limit Theorem:** States that the distribution of the mean of independent, identically distributed samples approaches a normal distribution as the sample size gets larger, regardless of the shape of the population distribution, provided the population variance is finite.
 49 | 
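The Central Limit Theorem from the list above can be checked empirically. The sketch below is a small simulation under assumed settings (an Exponential(1) population, samples of size 50, 2000 repetitions, a fixed seed); the helper `sample_means` is a hypothetical name introduced for the example.

```python
import random
import statistics


def sample_means(n, trials, rng):
    """Return `trials` sample means, each from n draws of an Exponential(rate=1) population."""
    means = []
    for _ in range(trials):
        draws = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(draws))
    return means


rng = random.Random(0)  # fixed seed so the run is repeatable
means = sample_means(n=50, trials=2000, rng=rng)

center = statistics.fmean(means)  # expected to be near the population mean, 1
spread = statistics.stdev(means)  # expected to be near 1 / sqrt(50), about 0.141

print(center, spread)
```

Although the exponential population is strongly right-skewed, a histogram of `means` would look approximately bell-shaped, with a spread that shrinks as `n` grows.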
 50 | ### Correlation & Sampling:
 51 | Correlation measures the strength and direction of a linear relationship between two variables.
 52 | 
 53 | - **Pearson Correlation Coefficient (r):** Measures the linear relationship between two variables. It ranges from -1 to 1, with -1 indicating a perfect negative linear relationship and 1 indicating a perfect positive linear relationship.
 54 | - **Spearman Rank Correlation:** A non-parametric measure of the strength and direction of the relationship between two variables, computed as the Pearson correlation of their ranks. It assesses how well the relationship can be described using a monotonic function.
 55 | - **Random Sampling:** A subset of individuals chosen from a larger set where each individual has an equal chance of being selected.
 56 | - **Stratified Sampling:** Divides the population into subgroups or strata, and random samples are taken from each stratum.
 57 | 
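To make the difference between the two correlation coefficients above concrete, here is a from-scratch sketch. The function names (`pearson`, `ranks`, `spearman`) and the sample data are assumptions made for the example; Spearman is computed as the Pearson correlation of average ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but not linear

print(pearson(x, y), spearman(x, y))
```

Because `y` increases monotonically with `x`, the Spearman coefficient is 1, while the Pearson coefficient is below 1 since the relationship is not linear.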
 58 | ### Estimation & Experimental Design:
 59 | Estimation involves approximating a population parameter based on a sample from the population.
 60 | 
 61 | - **Point Estimation:** Provides a single value as an estimate of a parameter.
 62 | - **Interval Estimation:** Provides a range of values as an estimate of a parameter. This range is called a confidence interval.
 63 | - **Factorial Experiments:** Experiments that study the effects of multiple factors simultaneously.
 64 | - **Cross-Over Design:** An experimental design in which subjects receive different treatments during the different periods of the experiment.
 65 | 
 66 | ---
 67 | 
 68 | ## Advanced
 69 | 
 70 | ### Inferential & Multivariate Analysis:
 71 | Inferential statistics allow you to infer or deduce population parameters from sample statistics.
 72 | 
 73 | - **Z-Score:** Represents how many standard deviations a data point is from the mean. It's used to compare scores from different distributions.
 74 | - **T-Test:** A hypothesis test used to compare the means of two groups. It assumes that the populations the samples come from are normally distributed.
 75 | - **Cohen's d:** A measure of effect size that indicates the standardized difference between two means.
 76 | - **Effect Size:** Represents the magnitude of a relationship between two or more variables. It describes the strength of the relationship between variables.
 77 | - **Canonical Correlation:** Measures the association between two sets of variables. It's used to find the relationship between two sets of variables if each set is used to define a new variable.
 78 | - **Discriminant Analysis:** A classification technique. It determines which variables discriminate between two or more naturally occurring groups.
 79 | 
 80 | ### Bayesian Statistics & Time Series Analysis:
 81 | Bayesian statistics treats probability as a degree of belief and updates that belief as new evidence arrives.
 82 | 
 83 | - **Bayes' Theorem:** Provides a way to revise existing predictions or theories (update probabilities) given new evidence. It relates current evidence to prior beliefs.
 84 | - **Prior Distribution:** Represents what we know about a variable before considering the current observed evidence.
 85 | - **Posterior Distribution:** Represents what we know about a variable after considering the current observed evidence.
 86 | - **Autoregression (AR):** A time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step.
 87 | - **Moving Average (MA):** A time series model that expresses the current observation as a linear combination of the current and past forecast errors (white-noise terms).
 88 | - **ARIMA (AutoRegressive Integrated Moving Average):** A class of models that combines AR and MA models as well as a differencing pre-processing step of the sequence to make the sequence stationary.
 89 | 
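Returning to Bayes' theorem from the list above, the prior-to-posterior update can be sketched for a binary hypothesis. The numbers (1% prior, 95% true-positive rate, 5% false-positive rate) are illustrative assumptions, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) for a binary hypothesis H given evidence E, via Bayes' theorem.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)  # P(E)
    return likelihood * prior / evidence


# One positive test result updates the prior belief.
after_one = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)

# The posterior becomes the prior for a second, independent positive result.
after_two = posterior(prior=after_one, likelihood=0.95, false_positive_rate=0.05)

print(after_one, after_two)
```

With a low prior, a single positive result still leaves the posterior well below 50%; each additional independent positive result raises it further.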
 90 | ### Non-Parametric & Robust Statistics:
 91 | Non-parametric statistics don't assume that the data follows a particular distribution.
 92 | 
 93 | - **Mann-Whitney U Test:** A non-parametric test for two independent samples. It tests whether the two samples come from the same distribution, i.e., whether values in one group tend to be larger than values in the other.
 94 | - **Wilcoxon Signed-Rank Test:** A non-parametric test that compares two paired groups.
 95 | - **Kruskal-Wallis H Test:** A rank-based non-parametric test that can be used to determine if there are statistically significant differences between two or more groups of an independent variable on a continuous or ordinal dependent variable.
 96 | - **Huber's M-Estimator:** A method used in robust regression that is less sensitive to outliers in data than the least squares method.
 97 | 
 98 | ### Causal Inference & Survival Analysis:
 99 | Causal inference is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect.
100 | 
101 | - **Randomized Controlled Trials (RCTs):** An experimental setup where participants are randomly assigned to an experimental group or a control group.
102 | - **Propensity Score Matching:** A statistical matching technique that attempts to estimate the effect of a treatment by accounting for the covariates that predict receiving the treatment.
103 | - **Kaplan-Meier Estimator:** A non-parametric statistic used to estimate the survival function from lifetime data.
104 | - **Cox Proportional-Hazards Model:** A regression model commonly used in medical research for investigating the association between the survival time of patients and one or more predictor variables.
105 | 
106 | ---
107 | 
108 | ## Linear Algebra
109 | 
110 | ### Basics:
111 | Linear algebra is a branch of mathematics concerning linear equations, linear functions, and their representations through matrices and vector spaces.
112 | 
113 | - **Vector:** A quantity having direction and magnitude. It can be represented as an array of numbers.
114 | - **Matrix:** A rectangular array of numbers, symbols, or expressions, arranged in rows and columns.
115 | - **Transpose:** A matrix derived by interchanging rows and columns.
116 | - **Determinant:** A scalar value computed from a square matrix. The matrix has an inverse if and only if its determinant is non-zero.
117 | - **Inverse:** The matrix that, when multiplied by the original square matrix, results in the identity matrix.
118 | - **Eigenvalues and Eigenvectors:** Scalar and vector values, respectively, that satisfy the equation `A*v = lambda*v` where A is a matrix, v is the eigenvector, and lambda is the eigenvalue.
119 | - **Dot Product:** A scalar product of two vectors, equal to the product of their magnitudes and the cosine of the angle between them.
120 | - **Cross Product:** A vector product of two three-dimensional vectors, resulting in a vector perpendicular to both.
121 | - **Orthogonal:** Two vectors are orthogonal if their dot product is zero, meaning they are at right angles to each other.
122 | - **Orthonormal:** A set of vectors that are both orthogonal and of unit length.
123 | 
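The vector operations above can be sketched in a few lines of plain Python. The helper names and the example vectors are assumptions for illustration; the cosine of the angle is recovered by dividing the dot product by the product of the two magnitudes.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (norm(u) * norm(v))


def cross(u, v):
    """Cross product of two 3-dimensional vectors."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


u = (1.0, 2.0, 2.0)
v = (2.0, 0.0, -1.0)
w = cross(u, v)

print(dot(u, v))             # zero, so u and v are orthogonal
print(w)                     # perpendicular to both u and v
print(dot(w, u), dot(w, v))  # both zero
```

Dividing each of `u`, `v`, and `w` by its `norm` would turn this orthogonal set into an orthonormal one.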
124 | ### Matrix Factorizations:
125 | Matrix factorization is the breaking down of one matrix into a product of multiple matrices.
126 | 
127 | - **Row Echelon Form:** A form of a matrix where all-zero rows are at the bottom and the leading entry of each non-zero row is to the right of the leading entry of the previous row. It's used to solve systems of linear equations.
128 | - **Reduced Row Echelon Form:** A matrix in row echelon form in which every leading entry is 1 and is the only non-zero entry in its column. Every matrix has exactly one reduced row echelon form, from which the solutions of the system can be read off directly.
129 | - **Gaussian Elimination:** A method to solve systems of linear equations by transforming the system to an upper triangular matrix. It's a sequence of operations performed on the associated matrix of coefficients.
130 | - **LU Decomposition:** Decomposing a matrix into a product of a lower triangular matrix (L) and an upper triangular matrix (U). It's used to simplify solving systems of linear equations and computing determinants and inverses.
131 | 
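A minimal sketch of LU decomposition follows, using the Doolittle variant (ones on the diagonal of L) without pivoting. This is a simplification chosen for brevity: it assumes every pivot is non-zero, whereas practical routines use partial pivoting. The function names are hypothetical.

```python
def lu_decompose(a):
    """Doolittle LU without pivoting: returns (L, U) such that A = L U.

    Assumes all pivots U[i][i] are non-zero.
    """
    n = len(a)
    lower = [[float(i == j) for j in range(n)] for i in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Fill row i of U.
        for j in range(i, n):
            upper[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j] for k in range(i))
        # Fill column i of L below the diagonal.
        for r in range(i + 1, n):
            partial = sum(lower[r][k] * upper[k][i] for k in range(i))
            lower[r][i] = (a[r][i] - partial) / upper[i][i]
    return lower, upper


def determinant(a):
    """det(A) as the product of U's diagonal (L has a unit diagonal)."""
    _, upper = lu_decompose(a)
    det = 1.0
    for i in range(len(a)):
        det *= upper[i][i]
    return det


A = [[4.0, 3.0], [6.0, 3.0]]
L, U = lu_decompose(A)
print(L)
print(U)
print(determinant(A))
```

The elimination steps that produce U are the same row operations Gaussian elimination performs; L records the multipliers used along the way.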
132 | ---
133 | 
134 | ## Calculus
135 | 
136 | ### Basics:
137 | Calculus is a branch of mathematics that studies continuous change, via derivatives and integrals.
138 | 
139 | - **Limit:** The value a function approaches as the input approaches a certain value. It's foundational to understanding the concepts of derivatives and integrals.
140 | - **Derivative:** Measures how a function changes as its input changes. It represents the slope of the tangent line to the function at any point.
141 | - **Integral:** Represents the area under a curve. It can be thought of as the "opposite" of differentiation.
142 | - **Partial Derivative:** Derivative of a function with respect to one of its variables, treating the other variables as constants. It's used in multivariable calculus.
143 | - **Gradient:** A vector containing all the partial derivatives of a function. It points in the direction of the greatest rate of increase of the function.
144 | - **Chain Rule:** A formula to compute the derivative of a composite function. It's used when differentiating a function that has another function inside it.
145 | - **Product Rule:** A formula to compute the derivative of the product of two functions.
146 | - **Quotient Rule:** A formula to compute the derivative of the quotient of two functions.
147 | 
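The partial derivative and gradient definitions above can be approximated numerically with central differences. The sketch below uses a hypothetical helper `numerical_gradient` and an example function chosen so the analytic gradient is easy to verify by hand.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` using central differences."""
    gradient = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h   # nudge only the i-th variable, holding the others constant
        backward[i] -= h
        gradient.append((f(forward) - f(backward)) / (2 * h))
    return gradient


def f(p):
    """f(x, y) = x^2 * y + 3y, with analytic gradient (2xy, x^2 + 3)."""
    x, y = p
    return x ** 2 * y + 3 * y


g = numerical_gradient(f, [2.0, 1.0])
print(g)  # close to the analytic gradient (4, 7) at (2, 1)
```

Each component is a partial derivative: only one input is varied at a time, which mirrors the definition in the list.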
148 | ### Techniques of Integration:
149 | Integration techniques are methods used to find antiderivatives and integrals of functions.
150 | 
151 | - **Integration by Parts:** A method based on the product rule for differentiation. It's used to integrate products of functions.
152 | - **Integration by Substitution:** Changing the variable of integration to simplify the integral. It's similar to the chain rule for differentiation.
153 | - **Partial Fraction Decomposition:** Expressing a rational function as a sum of simpler fractions to simplify the integral. It's used for integrating rational functions.
154 | 
155 | ### Multivariable Calculus:
156 | Multivariable calculus involves calculus of more than one variable.
157 | 
158 | - **Double Integral:** An integral over a region in the plane. For a non-negative function of two variables, it gives the volume under the surface defined by the function.
159 | - **Triple Integral:** An integral of a function of three variables over a region in space. Integrating the constant 1 gives the volume of the region; integrating a density gives its mass.
160 | - **Line Integral:** An integral over a curve in the plane or space. It's used to find the work done by a force field in moving a particle along a curve.
161 | - **Surface Integral:** An integral over a surface in space. It's used to find the flux of a vector field across a surface.
162 | - **Divergence:** A scalar measure of a vector field's source or sink at a given point. It represents the rate at which "density" exits a given region of space.
163 | - **Curl:** A vector measure of a vector field's rotation at a given point. It measures the rotation or circulation of a vector field.
164 | - **Stokes' Theorem:** Relates a surface integral of a vector field to a line integral around the boundary of the surface. It's a fundamental theorem in vector calculus.
165 | - **Green's Theorem:** Relates a line integral around a simple closed curve C to a double integral over the plane region bounded by C. It's a special case of Stokes' theorem.
166 | - **Divergence Theorem:** Relates a triple integral over a volume bounded by a closed surface S to a surface integral over S. It's used to transform volume integrals to surface integrals.
167 | 
168 | ---
169 | 
170 | ## Numerical Optimization
171 | 
172 | ### Basics:
173 | Numerical optimization involves finding the best solution or approximation to a problem using numerical methods.
174 | 
175 | - **Gradient Descent:** An iterative optimization algorithm used to find a local minimum of a differentiable function. It works by taking steps proportional to the negative of the gradient of the function at the current point.
176 | - **Newton's Method:** An iterative method used to find successively better approximations to the roots (or zeros) of a real-valued function. In optimization, it's applied to the gradient to find stationary points, using second-derivative information.
177 | - **Conjugate Gradient:** An algorithm for the numerical solution of systems of linear equations whose matrix is symmetric and positive-definite. It's often used when the system is large and sparse.
178 | - **Lagrange Multipliers:** A method for finding the local maxima and minima of a function subject to equality constraints.
179 | - **Simplex Algorithm:** A popular algorithm for numerical solution of linear programming problems.
180 | 
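The two iterative methods at the top of the list above can be sketched for one-dimensional problems. The step size, iteration counts, and test functions are assumptions picked for the example, not recommended defaults.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


def newton_root(f, df, start, steps=20):
    """Newton's method for a root of f, given its derivative df."""
    x = start
    for _ in range(steps):
        x = x - f(x) / df(x)
    return x


# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)

# Find the positive root of x^2 - 2, i.e. the square root of 2.
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, start=1.0)

print(minimum, root)
```

On this quadratic, each gradient descent step shrinks the distance to the minimum by a constant factor, while Newton's method roughly doubles the number of correct digits per step near the root.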
181 | ### Constraints & Regularization:
182 | Constraints restrict the feasible solutions in optimization, while regularization adds penalties to prevent overfitting.
183 | 
184 | - **Bound Constraints:** Limit the values of the decision variables to lie in a specified range.
185 | - **Linear Constraints:** Require linear functions of the decision variables to be less than, equal to, or greater than a constant.
186 | - **Nonlinear Constraints:** Involve nonlinear functions of the decision variables.
187 | - **L1 Regularization (Lasso):** Adds a penalty proportional to the sum of the absolute values of the coefficients. It can lead to some coefficients being exactly zero.
188 | - **L2 Regularization (Ridge):** Adds a penalty proportional to the sum of the squared coefficients. It prevents overfitting by penalizing large coefficients.
189 | 
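To illustrate why L1 regularization can zero out coefficients while L2 only shrinks them, the sketch below solves a simplified one-coefficient problem in closed form. The objectives in the comments, the strength parameter `lam`, and the function names are assumptions for the example.

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lam times the sum of absolute values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge penalty: lam times the sum of squares."""
    return lam * sum(w * w for w in weights)


def lasso_1d(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * |w| (soft thresholding)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_1d(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + 0.5 * lam * w**2 (proportional shrinkage)."""
    return z / (1 + lam)


small, large, lam = 0.3, 2.0, 0.5
print(lasso_1d(small, lam), lasso_1d(large, lam))  # small coefficient becomes exactly 0
print(ridge_1d(small, lam), ridge_1d(large, lam))  # both shrink, neither reaches 0
```

The unregularized solution here would be `w = z`; the L1 penalty subtracts a fixed amount and clips at zero, while the L2 penalty scales every coefficient by the same factor.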
190 | ---
191 | 
192 | ## Measurement Theory
193 | 
194 | ### Basics:
195 | Measurement theory deals with the assignment of numbers to objects or events according to specific rules.
196 | 
197 | - **Nominal Scale:** A scale of measurement where numbers serve only as labels and do not have any quantitative significance.
198 | - **Ordinal Scale:** A scale of measurement that allows for rank order (1st, 2nd, 3rd, etc.) by which data can be sorted, but differences between data are not meaningful.
199 | - **Interval Scale:** A scale of measurement where the difference between two values is meaningful, but there is no true zero point, so ratios are not meaningful (e.g., temperature in Celsius).
200 | - **Ratio Scale:** A scale of measurement that has a true zero point, so both differences and ratios are meaningful (e.g., mass). It's the most informative scale.
201 | - **Reliability:** The consistency or repeatability of a measure. If the same test is repeated under the same conditions, it should produce similar results.
202 | - **Validity:** The extent to which a test measures what it claims to measure. A measure can be reliable without being valid.
203 | 
204 | ### Scaling & Transformation:
205 | Scaling and transformation involve changing the scale or values of a variable.
206 | 
207 | - **Z-Score Scaling:** Rescales data to have a mean of 0 and a standard deviation of 1. It doesn't change the shape of the distribution.
208 | - **Min-Max Scaling:** Transforms data to fit within a specified range, usually [0,1].
209 | - **Log Transformation:** Used to transform positive, right-skewed data so that it more closely approximates normality.
210 | - **Box-Cox Transformation:** A family of power transformations for positive data, used to stabilize variance and make the data more normal in distribution.
211 | 
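The first two scaling methods above can be sketched with the standard library. The helper names and sample data are assumptions for illustration; the z-score version uses the population standard deviation, and both helpers assume the input is not constant.

```python
import statistics


def z_score_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values so the minimum becomes `low` and the maximum `high`."""
    smallest, largest = min(values), max(values)
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_score_scale(data)
m = min_max_scale(data)

print(z)
print(m)
```

Both transformations are linear, so they preserve the ordering and relative spacing of the values; only the location and scale change.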
212 | ---
213 | 
214 | 


--------------------------------------------------------------------------------