├── docs ├── Statistics For ML.pdf └── Statistics Resources.pdf ├── Naive Bayes ├── docs │ └── Naive Bayes.pdf └── README.md ├── GD Regressor ├── img │ ├── change_in_cost.gif │ ├── change_in_slope.gif │ ├── training_with_gd.gif │ └── change_in_intercept.gif └── README.md ├── Probability Distribution Functions ├── docs │ └── PDF.pdf └── README.md ├── Descriptive Statistics ├── docs │ └── Descriptive Statistics.pdf └── README.md ├── Analysis with Statistics ├── docs │ └── Analysis with Statistics.pdf └── README.md ├── .gitignore ├── create_new_folder.py ├── README.md ├── Linear Regression └── README.md ├── course_parser.py └── SGD Regressor └── notebook └── my-SGD.ipynb /docs/Statistics For ML.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/docs/Statistics For ML.pdf -------------------------------------------------------------------------------- /docs/Statistics Resources.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/docs/Statistics Resources.pdf -------------------------------------------------------------------------------- /Naive Bayes/docs/Naive Bayes.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/Naive Bayes/docs/Naive Bayes.pdf -------------------------------------------------------------------------------- /GD Regressor/img/change_in_cost.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/GD Regressor/img/change_in_cost.gif -------------------------------------------------------------------------------- /GD Regressor/img/change_in_slope.gif: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/GD Regressor/img/change_in_slope.gif -------------------------------------------------------------------------------- /GD Regressor/img/training_with_gd.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/GD Regressor/img/training_with_gd.gif -------------------------------------------------------------------------------- /GD Regressor/img/change_in_intercept.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/GD Regressor/img/change_in_intercept.gif -------------------------------------------------------------------------------- /Probability Distribution Functions/docs/PDF.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/Probability Distribution Functions/docs/PDF.pdf -------------------------------------------------------------------------------- /Descriptive Statistics/docs/Descriptive Statistics.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/Descriptive Statistics/docs/Descriptive Statistics.pdf -------------------------------------------------------------------------------- /Analysis with Statistics/docs/Analysis with Statistics.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/Analysis with Statistics/docs/Analysis with Statistics.pdf -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Ignore virtual environments 2 | .venv/ 3 | 4 | # Ignore environment variables 
5 | .env 6 | 7 | # Ignore files 8 | *__rough__*.* 9 | 10 | # Ignore directories 11 | .vscode/ 12 | .DS_Store 13 | __pycache__/ 14 | .ipynb_checkpoints/ 15 | __rough__/ 16 | 17 | raw/ 18 | -------------------------------------------------------------------------------- /Naive Bayes/README.md: -------------------------------------------------------------------------------- 1 | # Naive Bayes 2 | 3 | ## Table of Contents 4 | 5 | 0. [Resources](#resources) 6 | 7 | ## Resources 8 | 9 | - [CampusX Playlist](https://www.youtube.com/watch?v=Ty7knppVo9E&list=PLKnIA16_RmvZ67wQaHoBuzXaDAfPz-a6l) 10 | - [PDF](./docs/Naive%20Bayes.pdf) 11 | - [Online PDF](https://drive.google.com/file/d/1UqadGJVXFZEPD4YOUAZ2t15mJCpJYghS/view?usp=sharing) 12 | - [Session Notebook](https://colab.research.google.com/drive/1lbqkDb-3TQn4xKu3yUzMeS8tjgZLwd4k?usp=sharing) 13 | - [Kaggle Notebook](https://www.kaggle.com/campusx/sentiment-analysis-using-naive-bayes) 14 | 15 | ## Topics 16 | 17 | -------------------------------------------------------------------------------- /GD Regressor/README.md: -------------------------------------------------------------------------------- 1 | # GD Regressor 2 | 3 | ## Resources 4 | 5 | - [Video](https://youtu.be/ORyfPJypKuU) 6 | - [Session Notebook](https://github.com/campusx-official/100-days-of-machine-learning/tree/main/day51-gradient-descent) 7 | - [Gradient Descent Tool](https://developers.google.com/machine-learning/crash-course/fitter/graph) 8 | - My Notebooks for [Gradient Descent](./notebook) 9 | 10 | ### I created a Gradient Descent class from scratch and trained it on an artificial dataset created with the `sklearn.datasets.make_regression` function. 11 | 12 | ### I also created a class called `AnimateRegressor`, which is used to create animations like the ones below. 
13 | 14 | #### How the regression line gets fit on the data 15 | 16 | ![training_with_gd](./img/training_with_gd.gif) 17 | 18 | ### The graphs below show how the cost, slope and intercept change w.r.t. epochs 19 | 20 | ![change_in_cost](./img/change_in_cost.gif) 21 | ![change_in_slope](./img/change_in_slope.gif) 22 | ![change_in_intercept](./img/change_in_intercept.gif) 23 | -------------------------------------------------------------------------------- /create_new_folder.py: -------------------------------------------------------------------------------- 1 | from argparse import ArgumentParser 2 | from pathlib import Path 3 | 4 | readme_txt = """# {name} 5 | 6 | ## Table of Contents 7 | 8 | 0. [Resources](#resources) 9 | 10 | ## Resources 11 | 12 | - [Video]() 13 | - [PDF](./docs/) 14 | - [Online PDF]() 15 | - [Session Notebook]() 16 | 17 | ## Topics 18 | """ 19 | 20 | 21 | def create_folder_with_files(name): 22 | # Create the main folder 23 | folder_path = Path(name) 24 | 25 | try: 26 | folder_path.mkdir(parents=True) 27 | except FileExistsError as e: 28 | return print(e) 29 | 30 | # Create the README from the template 31 | readme_fp = folder_path / 'README.md' 32 | with open(readme_fp, 'w') as f: 33 | f.write(readme_txt.format(name=name)) 34 | 35 | # Create folders 36 | (folder_path / 'docs').mkdir(exist_ok=True) 37 | (folder_path / 'notebook').mkdir(exist_ok=True) 38 | 39 | print(f"Folder '{name}' with files and folders created successfully.") 40 | 41 | 42 | if __name__ == '__main__': 43 | parser = ArgumentParser( 44 | description='Create a folder with empty files and folders.' 
45 | ) 46 | parser.add_argument('-n', '--name', type=str, 47 | help='Name of the folder to create', required=True) 48 | args = parser.parse_args() 49 | 50 | create_folder_with_files(args.name) 51 | 52 | # DEMO 53 | # $ python3 create_new_folder.py -n "Naive Bayes" 54 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Learning from CampusX 2 | 3 | This repository contains all the notes and docs created by [**@arv-anshul**][arv-github] while learning Machine Learning concepts from [**CampusX**][campusx-yt]. 4 | I am learning from the CampusX [**YouTube Channel**][campusx-yt] and its paid course [**Data Science Mentorship Program**][campusx-website] (DSMP). 5 | 6 | ## Topics 7 | 8 | - Statistics 9 | - [Descriptive Statistics](./Descriptive%20Statistics/README.md) 10 | - [Analysis with Statistics](./Analysis%20with%20Statistics/README.md) 11 | - [Probability Distribution Functions](./Probability%20Distribution%20Functions/README.md) 12 | 13 | ## Resources 14 | 15 |

16 | 17 |

18 | 19 | - [CampusX Data Science Mentorship Program 2022-23](https://www.youtube.com/playlist?list=PLKnIA16_RmvbAlyx4_rdtR66B7EHX5k3z) 20 | - [Maths for Machine Learning](https://www.youtube.com/playlist?list=PLKnIA16_RmvbYFaaeLY28cWeqV-3vADST) 21 | - [Maths for ML and DL G-Drive](https://docs.google.com/spreadsheets/d/10spJMs0Zmv5cugfFjJVc4MudyOVjl_16Ef5z54oxqnM/edit#gid=241859416) 22 | - [Statistics For ML](./docs/Statistics%20For%20ML.pdf) 23 | - [Statistics Resources PDF](./docs/Statistics%20Resources.pdf) 24 | - [Statistics Resource G-Drive](https://docs.google.com/document/d/1GDKMZG5es9wkqk3ftiAXeUXKBc5fl0HlFIIKucPgRIs/edit) 25 | 26 | ## Acknowledgement 27 | 28 | 1. **Tutor:** [CampusX][campusx-yt] by [Nitish Sir](mailto:nitish.campusx@gmail.com) 29 | 2. **Github Repo Owner:** [Anshul Raj Verma][arv-github] 30 | 31 | 32 | 33 | [arv-github]: https://github.com/arv-anshul 34 | [campusx-yt]: https://youtube.com/@campusx-official 35 | [campusx-website]: https://learnwith.campusx.in 36 | -------------------------------------------------------------------------------- /Linear Regression/README.md: -------------------------------------------------------------------------------- 1 | # Linear Regression 2 | 3 | ## Resources 4 | 5 | - [Video](https://youtu.be/aEPoLeS6UMM) 6 | - [Session 49 - PDF](https://drive.google.com/file/d/18oSjN8aEztz_m-_CoKb5i_kGHvKccjdp/view?usp=share_link) 7 | - [Day48 Simple Linear Regression](https://github.com/campusx-official/100-days-of-machine-learning/tree/main/day48-simple-linear-regression) 8 | - [Day49 Regression Metrics](https://github.com/campusx-official/100-days-of-machine-learning/tree/main/day49-regression-metrics) 9 | - [Session 50 Notebook](https://colab.research.google.com/github/campusx-official/100-days-of-machine-learning/blob/main/day50-multiple-linear-regression/multiple_linear_regression.ipynb#scrollTo=NpAvnU-t3yV0) 10 | - [Session 50 Notebook - 
2](https://colab.research.google.com/github/campusx-official/100-days-of-machine-learning/blob/main/day50-multiple-linear-regression/code-from-scratch.ipynb#scrollTo=afc9a715) 11 | - [Session 50 - PDF](https://drive.google.com/file/d/1fYGa7wXCirq8Tvo2YqfHsQSlhs1DXXwo/view?usp=share_link) 12 | 13 | ## Topics 14 | 15 | **Practice topics [in Code](./notebook)** 16 | 17 | ### Simple Linear Regression 18 | 19 | Used to model the relationship between a target feature and a single input feature. 20 | 21 | > [!IMPORTANT] 22 | > 23 | > **For example,** suppose we have data on college students' CGPA (the input feature) and their salary in LPA after placement (the target). The Linear Regression model relates these two features by fitting a regression line through the points in such a way that **the sum of squared residuals/errors between the line and the points is least**. 24 | 25 | | $m = \text{Slope of Regression Line}$ | $b = \text{Intercept of Regression Line}$ | 26 | | ----------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------- | 27 | | $$m = \frac{\displaystyle\sum_{i=1}^{n} {(x_i - \bar{x}) (y_i - \bar{y})}}{\displaystyle\sum_{i=1}^{n} {(x_i - \bar{x})^2}}$$ | $$b = \bar{y} - m \cdot \bar{x}$$ | 28 | 29 | $$f(x) = m \cdot x + b$$ 30 | 31 | ### Multiple Linear Regression 32 | 33 | This method is used to model the relationship between multiple independent variables (features) and a dependent variable (response) using a linear equation. The general form of a multiple linear regression model with $(p)$ independent variables is: 34 | 35 | $$Y = \beta{_0} + \beta{_1}X_1 + \beta{_2}X_2 + \ldots + \beta{_p}X_p + \varepsilon$$ 36 | 37 | Where: 38 | 39 | - $(Y)$ is the dependent variable (response). 40 | - $(X_1, X_2, \ldots, X_p)$ are the independent variables (features). 
41 | - $(\beta{_0}, \beta{_1}, \beta{_2}, \ldots, \beta{_p})$ are the coefficients that represent the impact of each independent variable on the dependent variable. 42 | - $(\varepsilon)$ is the error term, representing the unexplained variation in the dependent variable. 43 | 44 | This equation can be expressed in matrix notation as follows: 45 | 46 | $$\mathbf{Y} = \mathbf{X} \beta + \mathbf{\varepsilon}$$ 47 | 48 | **Where:** 49 | 50 | - $(\mathbf{Y})$ is the vector of observed values of the dependent variable. 51 | - $(\mathbf{X})$ is the design matrix containing the observed values of the independent variables. 52 | - $(\beta)$ is the vector of coefficients. 53 | - $(\mathbf{\varepsilon})$ is the vector of error terms. 54 | 55 | Written out element by element, the model is: 56 | 57 | $$ 58 | \begin{bmatrix} 59 | y_1 \\ 60 | y_2 \\ 61 | \vdots \\ 62 | y_n 63 | \end{bmatrix} = \begin{bmatrix} 64 | 1 & x_{11} & x_{12} & \ldots & x_{1p} \\ 65 | 1 & x_{21} & x_{22} & \ldots & x_{2p} \\ 66 | \vdots & \vdots & \vdots & \ddots & \vdots \\ 67 | 1 & x_{n1} & x_{n2} & \ldots & x_{np} 68 | \end{bmatrix} \begin{bmatrix} 69 | \beta{_0} \\ 70 | \beta{_1} \\ 71 | \beta{_2} \\ 72 | \vdots \\ 73 | \beta{_p} 74 | \end{bmatrix} + \begin{bmatrix} 75 | \varepsilon{_1} \\ 76 | \varepsilon{_2} \\ 77 | \vdots \\ 78 | \varepsilon{_n} 79 | \end{bmatrix} 80 | $$ 81 | 82 | To estimate the coefficients $(\beta)$, the least squares method is commonly used. The goal is to minimize the sum of squared differences between the observed values $(\mathbf{Y})$ and the values predicted by the model $(\mathbf{X} \beta)$: 83 | 84 | $$\text{minimize } \lVert \mathbf{Y} - \mathbf{X} \beta \rVert^2$$ 85 | 86 | The least squares solution for $(\beta)$ is given by: 87 | 88 | $$\beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$$ 89 | 90 | **Where:** 91 | 92 | - $((\mathbf{X}^T \mathbf{X})^{-1})$ is the inverse of the matrix $(\mathbf{X}^T \mathbf{X})$. 
93 | - $(\mathbf{X}^T)$ is the transpose of the matrix $(\mathbf{X})$. 94 | - $(\mathbf{Y})$ is the vector of observed values of the dependent variable. 95 | 96 | This solution gives us the estimated coefficients $(\beta)$ that best fit the data in a least squares sense. 97 | 98 | In summary, multiple linear regression uses matrices to express the relationships between multiple independent variables and a dependent variable. The goal is to find the coefficients that minimize the sum of squared differences between the observed and predicted values. The least squares method provides a way to estimate these coefficients using matrix operations. 99 | -------------------------------------------------------------------------------- /Descriptive Statistics/README.md: -------------------------------------------------------------------------------- 1 | # Descriptive Statistics 2 | 3 | ## Table of Contents 4 | 5 | 0. [Resources](#resources) 6 | 1. [What is Statistics?](#what-is-statistics) 7 | 2. [Types of Statistics](#types-of-statistics) 8 | 3. [Population and Sample](#population-and-sample) 9 | 4. [Types of Data in Statistics](#types-of-data-in-statistics) 10 | 5. [Measure of Central Tendency](#measure-of-central-tendency) 11 | 6. [Measure of Dispersion](#measure-of-dispersion) 12 | 13 | ## Resources 14 | 15 | 1. [Descriptive Statistics](https://www.youtube.com/watch?v=Uv3Blie7F3g&list=PLKnIA16_RmvbYFaaeLY28cWeqV-3vADST&index=1) 16 | 2. [PDF](./docs/Descriptive%20Statistics.pdf) 17 | 18 | ## Topics 19 | 20 | ### What is Statistics? 21 | 22 | - Statistics is a branch of mathematics that involves collecting, analyzing, interpreting and presenting data. 23 | - It provides methods to understand and make sense of large amounts of data and to draw conclusions and make decisions based on the data. 24 | - It is used to conduct research studies, analyze market trends, evaluate the effectiveness of treatments and interventions, and make forecasts and predictions. 
25 | 26 | ### Types of Statistics 27 | 28 | 1. **Descriptive Statistics:** It is used to summarize the data using methods like _mean, median, mode, variance, standard deviation, etc._ It does not depend on population data. 29 | In simple words, it is the statistics used to summarize the data and draw insights from the sample data. 30 | 31 | 2. **Inferential Statistics:** It deals with making conclusions and predictions about a population based on a sample. It uses probability to estimate the predictions. 32 | In simple words, the statistics used for making predictions is known as Inferential Statistics. 33 | 34 | ### Population and Sample 35 | 36 | - **Population:** It is the entire group/data/set of observations that we want to make inferences about. 37 | 38 | **Example:** We want to calculate the average salary of Indian citizens. Here, all the 100 crore people (except children) of India are the population for this inference. 39 | 40 | - **Sample:** It is a random subset of the population which is used to make inferences about the population. 41 | 42 | **Example:** For the above population, a random sample of any size, e.g. 10,000 or 1,00,000 people, is the sample data used to calculate the average salary of Indian citizens. 43 | 44 | ### Types of Data in Statistics 45 | 46 | ```mermaid 47 | graph 48 | A[Types of Data \n in Statistics] 49 | A --> B(Categorical or \n Qualitative Data) 50 | A --> C(Numerical or \n Quantitative Data) 51 | C --> D(Discrete Data) 52 | C --> E(Continuous Data) 53 | B --> F(Nominal Data) 54 | B --> G(Ordinal Data) 55 | ``` 56 | 57 | ### Measure of Central Tendency 58 | 59 | It is used to measure the central value of a sample dataset. It summarizes the data by identifying a single value that is most representative of the dataset as a whole. 60 | 61 | 1. **Mean:** The mean is the sum of all values in the dataset divided by the number of values. 
62 | 63 | | Sample Mean | Population Mean | 64 | | :----------------------------------------: | :------------------------------------: | 65 | | $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$ | $$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$ | 66 | 67 | 2. **Median:** The median is the middle value in the dataset when the data is arranged in order. 68 | 69 | 3. **Mode:** The mode is the value that appears most frequently in the dataset. 70 | 71 | 4. **Weighted Mean:** The weighted mean is the sum of the products of each value and its weight, divided by the sum of the weights. It is used to calculate a mean when the values in the dataset have different importance or frequency. 72 | 73 | $$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i \cdot x_i}{\sum_{i=1}^{n} w_i}$$ 74 | 75 | 5. **Trimmed Mean:** It is calculated by removing a certain percentage of the smallest and largest values from the dataset and then taking the mean of the remaining values. The percentage of values removed is called the trimming percentage. 76 | 77 | ### Measure of Dispersion 78 | 79 | It describes the spread or variability of a dataset. It provides information about how the data is distributed around the central tendency (mean, median or mode) of the dataset. 80 | 81 | 1. **Range:** It is the difference between the maximum and minimum values in the dataset. It can be affected by outliers. 82 | 83 | 2. **Variance:** It measures the average squared distance of each data point from the mean. 84 | 85 | | Sample Variance | Population Variance | 86 | | :----------------------------------------------------: | :---------------------------------------------------: | 87 | | $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$ | $$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$ | 88 | 89 | 3. **Standard Deviation:** It is the square root of the variance, and it is useful in describing **the spread of a distribution**. 
90 | 91 | | Sample Standard Deviation | Population Standard Deviation | 92 | | :---------------------------------------------------------: | :--------------------------------------------------------: | 93 | | $$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$ | $$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$ | 94 | 95 | 4. **Coefficient of Variation (CV):** CV is the ratio of the standard deviation to the mean expressed as a percentage. It is used to compare the variability of datasets with different means. 96 | 97 | $$ \frac{\sigma}{\mu} \cdot 100 = \text{CV} \% $$ 98 | -------------------------------------------------------------------------------- /Analysis with Statistics/README.md: -------------------------------------------------------------------------------- 1 | # Analysis with Statistics 2 | 3 | ## Table of Contents 4 | 5 | 0. [Resources](#resources) 6 | 7 | 1. [Types of Analysis](#types-of-analysis) 8 | 9 | 2. [Univariate Analysis](#univariate-analysis) 10 | 11 | - [Categorical](#categorical) 12 | - [Numerical](#numerical) 13 | 14 | 3. [Bivariate Analysis](#bivariate-analysis) 15 | 16 | - [Categorical - Categorical](#categorical---categorical) 17 | - [Numerical - Numerical](#numerical---numerical) 18 | - [Categorical - Numerical](#categorical---numerical) 19 | 20 | 4. 
[Multivariate Analysis](#multivariate-analysis) 21 | 22 | ## Resources 23 | 24 | - [Video](https://www.youtube.com/watch?v=1ndVC500-EU&list=PLKnIA16_RmvbYFaaeLY28cWeqV-3vADST&index=2) 25 | - [PDF](./docs/Analysis%20with%20Statistics.pdf) 26 | - [Notebook](https://colab.research.google.com/drive/19YlpW_N7idyQQvmpgrZg8KNSvIjCPk-8?usp=sharing) 27 | 28 | ## Topics 29 | 30 | ### Types of Analysis 31 | 32 | ```mermaid 33 | graph LR 34 | A((Types of Analysis \n in Stats)) 35 | 36 | A --> B(Univariate Analysis) 37 | B --> B1(Each Categorical Feature) 38 | B --> B2(Each Numerical Feature) 39 | 40 | A --> C(Bivariate Analysis) 41 | C --> C1(Categorical - Categorical) 42 | C --> C2(Categorical - Numerical) 43 | C --> C3(Numerical - Numerical) 44 | 45 | A --> D(Multivariate Analysis) 46 | D --> D1{{Analysis using more than \n two features}} 47 | ``` 48 | 49 | ### Univariate Analysis 50 | 51 | #### Categorical 52 | 53 | 1. **Frequency Distribution Table** is a table that summarizes the number of times (or frequency) that each value occurs in the dataset. 54 | 55 | 2. **Relative frequency** is the proportion or percentage of a category in a dataset or sample. 56 | It is calculated by dividing the frequency of a category by the total number of observations in the dataset or sample. 57 | 58 | 3. **Cumulative frequency** is the running total of frequencies of a variable or category in a dataset or sample. It is calculated by adding up the frequencies of the current category and all previous categories in the dataset or sample. 59 | 60 | #### Numerical 61 | 62 | 1. **Frequency Distribution Table or Histogram** is built for numerical data using the binning method. It counts the number of data points that fall in each bin. 63 | Here, every bin works as a category of the data. 64 | 65 | ### Bivariate Analysis 66 | 67 | #### Categorical - Categorical 68 | 69 | 1. **Contingency Table or Cross-Tabulation** is used to summarize the relationship between two categorical variables. 
It shows the frequencies or relative frequencies of the observed values of two variables. 71 | 72 | #### Numerical - Numerical 73 | 74 | 1. **Scatter Plot** shows whether two numerical variables have a positive or negative relationship. 75 | 76 | 2. **Regression Plot** is a special scatter plot which also draws a fitted regression line. 77 | 78 | 3. **Jointplot** displays two plots at a time: a scatter plot and a histogram. 79 | 80 | #### Categorical - Numerical 81 | 82 | 1. **Quantiles** are used to divide the data into equal-sized groups. 83 | Quantiles are important measures of variability and can be used to understand the distribution of data, and to summarize and compare different datasets. They can also be used to identify outliers. 84 | 85 | There are several types of quantiles used in statistics such as **Quartiles, Deciles, Percentiles, Quintiles** but the most important one is **Percentile** because it divides the data into 100 equal parts. 86 | 87 | 2. **Percentile** is a measure that represents the percentage of the dataset that falls below a particular value. 88 | _For example,_ the 75th percentile is the value below which 75% of the observations in the dataset fall. 89 | 90 | 3. **Inter Quartile Range (IQR)** is the difference between the third quartile (Q3) and the first quartile (Q1) of a dataset. 91 | 92 | 4. **Five number summary** represents the Minimum, Q1, Q2 (Median), Q3 and Maximum. Where, for a box plot, the Minimum and Maximum are the whisker limits beyond which points are treated as outliers: 93 | $$ \text{Minimum} = \text{Q1} - (1.5 \ast \text{IQR}) $$ 94 | $$ \text{Maximum} = \text{Q3} + (1.5 \ast \text{IQR}) $$ 95 | 96 | The five number summary is generally visualized using a **Box Plot or Whisker Plot**. 97 | 98 | ![Box Plot](https://miro.medium.com/max/8000/1*0MPDTLn8KoLApoFvI0P2vQ.png) 99 | 100 | - **Benefits of a Box Plot:** 101 | 102 | - Easy way to see the **distribution of data**. 103 | - Tells about **skewness of data**. 
104 | - Can **identify outliers**. 105 | - **Compare 2 categories** of data. 106 | 107 | 5. 
**Covariance** describes the degree to which two variables are linearly related. It measures how much two variables change together, that is, whether the other variable tends to increase or decrease when one variable increases. 108 | A covariance of zero indicates that the variables are not linearly related. 109 | 110 | ![Covariance](https://www.k2analytics.co.in/wp-content/uploads/2020/05/Formula.png) 111 | 112 | - **Disadvantages of using Covariance** 113 | One limitation of covariance is that it does not tell us about the **strength of the relationship between two variables**, since the magnitude of **covariance is affected by the scale of the variables**. 114 | 115 | 6. **Correlation** measures the degree to which two variables are related and how they tend to change together. 116 | Correlation is often measured using a statistical tool called the correlation coefficient, which ranges from -1 to 1. A correlation coefficient of -1 indicates a perfect negative correlation, a correlation coefficient of 0 indicates no correlation, and a correlation coefficient of 1 indicates a perfect positive correlation. 117 | 118 | $$ \text{Correlation} = \frac{\text{Cov}(x, y)}{\sigma_x \cdot \sigma_y} $$ 119 | 120 | > **Note:** Correlation does not imply causation: if two variables are correlated, it does not mean that one variable is affected by the other. 121 | 122 | ### Multivariate Analysis 123 | 124 | 1. **3D Scatter Plot** 125 | 126 | 2. **Plot Parameters** are used to display the impact of another categorical or numerical variable in the plot. 127 | 128 | - Hue/Color Parameter 129 | - Size Parameter 130 | 131 | 3. **Facet Grids** 132 | 133 | 4. **Pairplot** 134 | 135 | 5. 
**Bubble Chart** 136 | -------------------------------------------------------------------------------- /Probability Distribution Functions/README.md: -------------------------------------------------------------------------------- 1 | # Probability Distribution Functions 2 | 3 | ## Table of Contents 4 | 5 | 0. [Resources](#resources) 6 | 7 | 1. [Random Variables](#random-variables) 8 | 9 | 2. [Types of Random Variables](#types-of-random-variables) 10 | 11 | 3. [Probability Distributions](#probability-distributions) 12 | 13 | 4. [What are Probability Distributions?](#what-are-probability-distributions) 14 | 15 | 5. [Problem with Distribution?](#problem-with-distribution) 16 | 17 | 6. [Solution: Probability Distribution Functions](#solution-probability-distribution-functions) 18 | 19 | 7. [Different types of Probability Distributions](#different-types-of-probability-distributions) 20 | 21 | 8. [Why are Probability Distributions important?](#why-are-probability-distributions-important) 22 | 23 | 9. [A note on Parameters of Probability Distribution Functions](#a-note-on-parameters-of-probability-distribution-functions) 24 | 25 | 10. [Probability Mass Function (PMF)](#probability-mass-function-pmf) 26 | 27 | 11. [Cumulative Distribution Function (CDF) of PMF](#cumulative-distribution-function-cdf-of-pmf) 28 | 29 | 12. [Probability Density Function (PDF)](#probability-density-function-pdf) 30 | 31 | 13. [Questions related to PDFs](#questions-related-to-pdfs) 32 | 33 | 14. [Density Estimation](#density-estimation) 34 | 35 | 15. [Types of Density Estimation](#types-of-density-estimation) 36 | 37 | 16. [Parametric Density Estimation](#parametric-density-estimation) 38 | 39 | 17. [Non-Parametric Density Estimation](#non-parametric-density-estimation) 40 | 41 | 18. [Kernel Density Estimate (KDE)](#kernel-density-estimate-kde) 42 | 43 | 19. 
[PDF, PMF and CDF](#pdf-pmf-and-cdf) 44 | 45 | ## Resources 46 | 47 | - [Video](https://www.youtube.com/watch?v=C_QAURbgBqY&list=PLKnIA16_RmvbYFaaeLY28cWeqV-3vADST&index=4) 48 | - [PDF](./docs/PDF.pdf) 49 | - [Online PDF](https://drive.google.com/file/d/1FQ65CTmMLK-PYZ6NT9txGcGmJHobtNYl/view) 50 | - [Session Notebook](https://colab.research.google.com/drive/1N_T0_w5vpT1k1Z4pSf4IMhAxYT1nRKLU?usp=sharing) 51 | 52 | ## Topics 53 | 54 | ### Random Variables 55 | 56 | A Random Variable is a set of possible values from a random experiment. 57 | 58 | ### Types of Random Variables 59 | 60 | ```mermaid 61 | graph 62 | A[Types of Random Variable] 63 | 64 | A --> B(Discrete \n Random Variable) 65 | A --> C(Continuous \n Random Variable) 66 | ``` 67 | 68 | ### Probability Distributions 69 | 70 | ### What are Probability Distributions? 71 | 72 | A probability distribution is a list of all of the possible outcomes of a random variable along with their corresponding probability values. 73 | 74 | ### Problem with Distribution? 75 | 76 | In many scenarios, the number of outcomes can be much larger and hence a table would be tedious to write down. Worse still, the number of possible outcomes could be infinite, in which case, good luck writing a table for that. 77 | 78 | > **Example:** Height of people, Rolling 10 dice together. 79 | 80 | ### Solution: Probability Distribution Functions 81 | 82 | A probability distribution function is a mathematical function that describes the **probability of obtaining different values of a random variable** in a particular probability distribution. 83 | 84 | ### Different types of Probability Distributions 85 | 86 | ![Image](https://miro.medium.com/v2/resize:fit:962/1*DmPUIjvecL7KllOamoFSDw.png) 87 | 88 | ![Image](https://tinyheero.github.io/assets/prob-distr/overview-prob-distr.png) 89 | 90 | ### Why are Probability Distributions important? 91 | 92 | - Gives an idea about the shape/distribution of the data. 
93 | - And if our data follows a well-known distribution then we automatically know a lot about the data. 94 | 95 | ### A note on Parameters of Probability Distribution Functions 96 | 97 | Parameters in probability distributions are numerical values that determine the shape, location, and scale of the distribution. 98 | Different probability distributions have different sets of parameters that determine their shape and characteristics, and understanding these parameters is essential in statistical analysis and inference. 99 | 100 | ### Probability Mass Function (PMF) 101 | 102 | Describes the probability distribution of a **discrete random variable**. 103 | 104 | The PMF assigns a probability to each value of the random variable. The probabilities assigned by the PMF must satisfy two conditions: 105 | 106 | 1. The probability assigned to each **value must be non-negative** (i.e., greater than or equal to zero). 107 | 2. The **sum** of the probabilities assigned to all possible values must **equal 1**. 108 | 109 | ### Cumulative Distribution Function (CDF) of PMF 110 | 111 | Describes the probability that a random variable X with a given probability distribution will be found at a value less than or equal to x. 112 | 113 | $$ F(x) = P(X \le x) $$ 114 | 115 | **Examples:** 116 | 117 | - [Bernoulli Distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) 118 | - [Binomial Distribution](https://en.wikipedia.org/wiki/Binomial_distribution) 119 | 120 | ### Probability Density Function (PDF) 121 | 122 | Describes the probability distribution of a continuous random variable. 123 | 124 | ### Questions related to PDFs 125 | 126 | 1. Why does the y-axis show Probability Density and not Probability? 127 | 128 | - Because a continuous random variable has infinitely many possible values on the x-axis, so you cannot assign a probability to each individual value. 129 | 130 | 2. What does the area under the graph represent in a PDF? 
131 |
132 | - The area under the graph over a range on the x-axis (e.g. 3.0 to 3.1) is the probability that the variable falls in that range, because the y-axis shows probability density.
133 |
134 | 3. How do you calculate probability from a PDF graph?
135 |
136 | - Take the area under the curve between two points on the x-axis. If the two points are very close together, that area is approximately the density multiplied by the width of the range, which gives an approximate probability near a point. The probability of a single exact point is zero.
137 |
138 | 4. Examples of PDF:
139 |
140 | - [Normal Distribution](https://en.wikipedia.org/wiki/Normal_distribution)
141 | - [Log Normal Distribution](https://en.wikipedia.org/wiki/Log-normal_distribution)
142 | - [Poisson Distribution](https://en.wikipedia.org/wiki/Poisson_distribution) (strictly a discrete distribution, so it is described by a PMF rather than a PDF)
143 |
144 | 5. How is the graph calculated?
145 |
146 | - Using [Density Estimation](#density-estimation)
147 |
148 | ### Density Estimation
149 |
150 | Density estimation is a statistical technique used to estimate the probability density function (PDF) of a random variable.
151 |
152 | It is particularly useful in areas such as machine learning, where it is often used to estimate the probability distribution of input data or to model the likelihood of certain events or outcomes.
153 |
154 | ### Types of Density Estimation
155 |
156 | ```mermaid
157 | graph
158 |
159 | A[Types of \n Density Estimation]
160 |
161 | A --> B(Parametric \n Density Estimation)
162 | A --> C(Non-Parametric \n Density Estimation)
163 | ```
164 |
165 | ### Parametric Density Estimation
166 |
167 | This method estimates the probability density by assuming that the random variable follows a specific distribution, such as the normal, exponential, log-normal, or Poisson distribution.
168 | The estimate then depends on the parameters of the assumed distribution (for a normal distribution, the mean and standard deviation), which are estimated from the sample.
169 |
170 | ### Non-Parametric Density Estimation
171 |
172 | Used when the distribution of the random variable is not clear or is not one of the famous distributions.
173 |
174 | This method estimates the probability density of a random variable without making any assumption about the underlying distribution.
This is typically done by creating a **kernel density estimate**.
175 |
176 | It has several **advantages over parametric density estimation**.
177 | One of the main advantages is that **it does not require the assumption of a specific distribution**, which allows for more flexible and accurate estimation in situations where the underlying distribution is unknown or complex.
178 | However, non-parametric density estimation **can be computationally intensive and may require more data to achieve accurate estimates** compared to parametric methods.
179 |
180 | ### Kernel Density Estimate (KDE)
181 |
182 | The KDE technique involves using a kernel function to smooth out the data and create a continuous estimate of the underlying density function.
183 |
184 | [**Watch the video for more clarity.**](https://www.youtube.com/watch?v=C_QAURbgBqY&list=PLKnIA16_RmvbYFaaeLY28cWeqV-3vADST&index=4)
185 |
186 | ### PDF, PMF and CDF
187 |
188 | ![Image](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*ktIttLCFRAqdUlLE180v9g.png)
189 |
190 | ![Image](https://pbs.twimg.com/media/Eme8kJJXUAMWJid.png)
191 |
--------------------------------------------------------------------------------
/course_parser.py:
--------------------------------------------------------------------------------
1 | from __future__ import annotations
2 |
3 | import json
4 | import time
5 | from dataclasses import dataclass, field
6 | from typing import TYPE_CHECKING, Iterable, Literal, Self
7 |
8 | from bs4 import BeautifulSoup, Tag
9 |
10 | if TYPE_CHECKING:
11 |     from pathlib import Path
12 |
13 |     import httpx
14 |
15 | COURSE_URL = "https://learnwith.campusx.in/s/courses/653f50d1e4b0d2eae855480a/take"
16 | BASE_RESOURCE_URL = "https://learnwith.campusx.in/s/courses/653f50d1e4b0d2eae855480a"
17 | BASE_HEADERS = {
18 |     "accept": "application/json, text/javascript, */*; q=0.01",
19 |     "referer": COURSE_URL,
20 |     "user-agent": (
21 |         "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_7) AppleWebKit/537.36 (KHTML, "
22 |         "like Gecko) Chrome/111.5.0.0 Safari/507.02"
23 |     ),
24 | }
25 | ResourceType = Literal[
26 |     "article",
27 |     "assessment",
28 |     "assignment",
29 |     "link",
30 |     "livetest",
31 |     "pdf",
32 |     "video",
33 | ]
34 |
35 |
36 | def fetch_sub_topic_resource(
37 |     client: httpx.Client,
38 |     sub_topic_id: str,
39 |     resource_type: ResourceType,
40 | ) -> bytes:
41 |     """Fetches the resource data for the given subtopic ID and resource type.
42 |
43 |     Args:
44 |         client: HTTPX client instance with cookies set.
45 |         sub_topic_id: ID of the subtopic to fetch.
46 |         resource_type: Type of resource to fetch.
47 |
48 |     Returns:
49 |         The data as bytes for the requested resource.
50 |
51 |     Raises:
52 |         ValueError: If client does not have cookies set.
53 |         httpx.HTTPStatusError: If the API responds with an error status.
54 |     """
55 |     if not client.cookies:
56 |         raise ValueError("Client does not have cookies.")
57 |     res = client.get(f"/{resource_type}s/{sub_topic_id}/get")
58 |     res.raise_for_status()
59 |     return res.content
60 |
61 |
62 | @dataclass(kw_only=True)
63 | class CourseTopic:
64 |     title: str
65 |     id: str
66 |     source: Tag = field(repr=False)
67 |
68 |     @staticmethod
69 |     def search(html_path: Path) -> Tag:
70 |         """
71 |         Finds the course items container in a saved HTML page.
72 |
73 |         Returns the 'div.courseItems' tag parsed from the file at html_path.
74 |         """
75 |         soup = BeautifulSoup(html_path.read_bytes(), "html.parser")
76 |         course_items_tag = soup.select_one("div.courseItems")
77 |         if course_items_tag:
78 |             return course_items_tag
79 |         raise ValueError("'div.courseItems' css selector not present in source.")
80 |
81 |     @classmethod
82 |     def parse(cls, source: Tag) -> Iterable[Self]:
83 |         """
84 |         Parses CourseTopic instances from a BeautifulSoup tag.
85 |
86 |         Yields CourseTopic instances parsed from the provided BeautifulSoup tag source.
87 |         """
88 |         yield from (
89 |             cls(
90 |                 title=tag["data-title"],
91 |                 id=tag["data-id"],
92 |                 source=tag,
93 |             )
94 |             for tag in source.find_all("div", {"data-type": "label"})
95 |         )
96 |
97 |
98 | @dataclass(kw_only=True)
99 | class CourseSubTopic:
100 |     id: str
101 |     topicId: str
102 |     title: str
103 |     type: ResourceType
104 |     source: Tag = field(repr=False)
105 |
106 |     @classmethod
107 |     def parse(cls, topic: CourseTopic) -> Iterable[Self]:
108 |         """
109 |         Parses CourseSubTopic instances from a CourseTopic BeautifulSoup tag.
110 |
111 |         Yields CourseSubTopic instances parsed from the provided CourseTopic
112 |         BeautifulSoup tag source.
113 |         """
114 |         yield from (
115 |             cls(
116 |                 id=tag["data-id"],
117 |                 topicId=topic.id,
118 |                 title=tag["data-title"],
119 |                 type=tag["data-type"],
120 |                 source=tag,
121 |             )
122 |             for tag in topic.source.find_all("div", {"data-type": True})
123 |         )
124 |
125 |     @classmethod
126 |     def parse_many(
127 |         cls,
128 |         topics: Iterable[CourseTopic],
129 |     ) -> Iterable[tuple[CourseTopic, Iterable[Self]]]:
130 |         for topic in topics:
131 |             yield topic, cls.parse(topic)
132 |
133 |     @classmethod
134 |     def find(
135 |         cls,
136 |         course_topics: Iterable[CourseTopic],
137 |         *,
138 |         id: str | None = None,
139 |         title: str | None = None,
140 |     ) -> Iterable[Self]:
141 |         """
142 |         Yields the CourseSubTopics of the first matching CourseTopic.
143 |
144 |         Searches the given CourseTopics for one whose id or title matches, then
145 |         parses and yields the subtopics associated with that topic.
146 |
147 |         Args:
148 |             course_topics: Iterable of CourseTopic instances to search through.
149 |             title: Optional title of the topic to match.
150 |             id: Optional id of the topic to match.
151 |
152 |         Returns:
153 |             Iterable of CourseSubTopic instances belonging to the matched topic.
154 |
155 |         Raises:
156 |             ValueError: If both title and id are None.
157 |             ValueError: If no matching topic is found.
158 |         """
159 |         if id is None and title is None:
160 |             raise ValueError("At least one of 'id' or 'title' must be provided.")
161 |
162 |         for topic in course_topics:
163 |             if topic.id == id or topic.title == title:
164 |                 yield from cls.parse(topic)
165 |                 break
166 |         else:
167 |             raise ValueError(f"No topic found matching id={id} or title={title}")
168 |
169 |
170 | @dataclass(kw_only=True)
171 | class CourseVideoResource:
172 |     id: str
173 |     topicId: str
174 |     title: str
175 |     totalTime: str
176 |     description: str = field(repr=False)
177 |     isDescriptionHtml: bool = field(repr=False)
178 |
179 |     @classmethod
180 |     def fetch(cls, client: httpx.Client, sub_topic: CourseSubTopic) -> Self:
181 |         if sub_topic.type != "video":
182 |             raise ValueError(f"sub_topic is not a video resource, got {sub_topic.type}")
183 |
184 |         response = fetch_sub_topic_resource(
185 |             client=client,
186 |             sub_topic_id=sub_topic.id,
187 |             resource_type="video",
188 |         )
189 |         try:
190 |             data = json.loads(response)
191 |             data = data["spayee:resource"]
192 |         except json.JSONDecodeError as e:
193 |             raise ValueError("Response could not be parsed as JSON.") from e
194 |         except KeyError as e:
195 |             raise ValueError("Bad response or missing required fields.") from e
196 |         return cls(
197 |             id=sub_topic.id,
198 |             topicId=sub_topic.topicId,
199 |             title=data["spayee:title"],
200 |             description=data["spayee:description"],
201 |             totalTime=data["spayee:totalTime"],
202 |             isDescriptionHtml=data["spayee:isDescriptionHtml"],
203 |         )
204 |
205 |
206 | @dataclass(kw_only=True)
207 | class CourseAssignmentResource:
208 |     id: str
209 |     topicId: str
210 |     title: str
211 |     assignmentLink: str = field(repr=False)
212 |
213 |     @classmethod
214 |     def fetch(cls, client: httpx.Client, sub_topic: CourseSubTopic) -> Self:
215 |         if sub_topic.type != "assignment":
216 |             raise ValueError(
217 |                 f"sub_topic is not an assignment resource, got {sub_topic.type}"
218 |             )
219 |
220 |         response = fetch_sub_topic_resource(
221 |             client=client,
222 |             sub_topic_id=sub_topic.id,
223 |             resource_type="assignment",
224 |         )
225 |
226 |         def parse_assignment_link(source: str | bytes) -> str:
227 |             soup = BeautifulSoup(source, "html.parser")
228 |             link_tag = soup.select_one("#instructions a")
229 |             if link_tag:
230 |                 return link_tag.get_attribute_list("href", "")[0]
231 |             raise ValueError("assignmentLink tag not found in source")
232 |
233 |         return cls(
234 |             id=sub_topic.id,
235 |             topicId=sub_topic.topicId,
236 |             title=sub_topic.title,
237 |             assignmentLink=parse_assignment_link(response),
238 |         )
239 |
240 |
241 | if __name__ == "__main__":
242 |     from pathlib import Path
243 |
244 |     import httpx
245 |     from rich import print
246 |
247 |     # campusx.html contains the html content of the website
248 |     course_topic_tag = CourseTopic.search(Path("campusx.html"))
249 |     course_topics = list(CourseTopic.parse(course_topic_tag))
250 |     print(course_topics[-10:])
251 |
252 |     sub_topics = list(CourseSubTopic.find(course_topics, id="olh5gfqpjt"))
253 |     print(list(sub_topics))
254 |
255 |     # Fill cookies from browser's network tab
256 |     cookies = {
257 |         "c_ujwt": "Your Token",
258 |         "SESSIONID": "Your current SESSION ID.",
259 |     }
260 |
261 |     results: list[CourseVideoResource] = []
262 |     with httpx.Client(
263 |         base_url=BASE_RESOURCE_URL, headers=BASE_HEADERS, cookies=cookies
264 |     ) as client:
265 |         for i, sub_topic in enumerate(sub_topics, 1):
266 |             if sub_topic.type != "video":
267 |                 print(f"subtopic id={sub_topic.id} is not a video resource.")
268 |                 continue
269 |             if i % 7 == 0:
270 |                 print("sleeping for 3 seconds...")
271 |                 time.sleep(3)
272 |             results.append(CourseVideoResource.fetch(client, sub_topic))
273 |     if not results:
274 |         raise ValueError("No video resources found.")
275 |     print(results)
276 |
--------------------------------------------------------------------------------
/SGD Regressor/notebook/my-SGD.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "code",
5 |    "execution_count": 28,
6 |    "id": "60e43fd7",
7 |    "metadata": {},
8 |    "outputs": [],
9 |    "source": [
10 |     "import numpy as np\n",
11 |     "import matplotlib.pyplot as plt\n",
12 |     "\n",
13 |     "from sklearn.linear_model import LinearRegression, SGDRegressor\n",
14 |     "from sklearn.metrics import r2_score\n",
15 |     "from sklearn.model_selection import train_test_split\n",
16 |     "from sklearn.datasets import make_regression"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "code",
21 |    "execution_count": 33,
22 |    "id": "a90f5266",
23 |    "metadata": {},
24 |    "outputs": [
25 |     {
26 |      "name": "stdout",
27 |      "output_type": "stream",
28 |      "text": [
29 |       "(100, 1) (100,)\n"
30 |      ]
31 |     },
32 |     {
33 |      "data": {
34 |       "image/png": "",
35 |       "text/plain": [
36 |        ""
37 |       ]
38 |      },
39 |      "metadata": {},
40 |      "output_type": "display_data"
41 |     }
42 |    ],
43 |    "source": [
44 |     "random_state = 2\n",
45 |     "X, y, *_ = make_regression(\n",
46 |     "    n_samples=100, n_features=1, n_informative=1, n_targets=1, noise=20, random_state=random_state\n",
47 |     ")\n",
48 |     "print(X.shape, y.shape)\n",
49 |     "\n",
50 |     "plt.scatter(X, y)\n",
51 |     "plt.show()"
52 |    ]
53 |   },
54 |   {
55 |    "cell_type": "code",
56 |    "execution_count": 34,
57 |    "id": "f6f4edcf",
58 |    "metadata": {},
59 |    "outputs": [
60 |     {
61 |      "data": {
62 |       "text/plain": [
63 |        "((80, 1), (20, 1))"
64 |       ]
65 |      },
66 |      "execution_count": 34,
67 |      "metadata": {},
68 |      "output_type": "execute_result"
69 |     }
70 |    ],
71 |    "source": [
72 |     "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)\n",
73 |     "\n",
74 |     "X_train.shape, X_test.shape"
75 |    ]
76 |   },
77 |   {
78 |    "cell_type": "markdown",
79 |    "id": "b692b6aa",
80 |    "metadata": {},
81 |    "source": [
82 |     "\n",
83 |     "---\n",
84 |     "\n",
85 |     "# LinearRegression"
86 |    ]
87 |   },
88 |   {
89 |    "cell_type": "code",
90 |    "execution_count": 35,
91 |    "id": "eb4d3f4e",
92 |    "metadata": {},
93 |    "outputs": [
94 |     {
95 |      "data": {
96 |       "text/html": [
97 |        "LinearRegression()"
98 |       ],
99 |       "text/plain": [
100 |        "LinearRegression()"
101 |       ]
102 |      },
103 |      "execution_count": 35,
104 |      "metadata": {},
105 |      "output_type": "execute_result"
106 |     }
107 |    ],
108 |    "source": [
109 |     "reg = LinearRegression()\n",
110 |     "reg.fit(X_train, y_train)"
111 |    ]
112 |   },
113 |   {
114 |    "cell_type": "code",
115 |    "execution_count": 36,
116 |    "id": "8af3b5af",
117 |    "metadata": {},
118 |    "outputs": [
119 |     {
120 |      "name": "stdout",
121 |      "output_type": "stream",
122 |      "text": [
123 |       "[57.25933739]\n",
124 |       "-0.2009464759960462\n"
125 |      ]
126 |     }
127 |    ],
128 |    "source": [
129 |     "print(reg.coef_)\n",
130 |     "print(reg.intercept_)"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "code",
135 |    "execution_count": 37,
136 |    "id": "036e002b",
137 |    "metadata": {},
138 |    "outputs": [],
139 |    "source": [
140 |     "def model_score(y_true: np.ndarray, y_pred: np.ndarray):\n",
141 |     "    print(\"R2 Score\", r2_score(y_true, y_pred))"
142 |    ]
143 |   },
144 |   {
145 |    "cell_type": "code",
146 |    "execution_count": 38,
147 |    "id": "33459e4e",
148 |    "metadata": {},
149 |    "outputs": [
150 |     {
151 |      "name": "stdout",
152 |      "output_type": "stream",
153 |      "text": [
154 |       "R2 Score 0.9095184055289018\n"
155 |      ]
156 |     }
157 |    ],
158 |    "source": [
159 |     "y_pred = reg.predict(X_test)\n",
160 |     "model_score(y_test, y_pred)"
161 |    ]
162 |   },
163 |   {
164 |    "cell_type": "markdown",
165 |    "id": "fb76d077",
166 |    "metadata": {},
167 |    "source": [
168 |     "\n",
169 |     "---\n",
170 |     "\n",
171 |     "# MySGDRegressor"
172 |    ]
173 |   },
174 |   {
175 |    "cell_type": "code",
176 |    "execution_count": 39,
177 |    "id": "ed75f3c1",
178 |    "metadata": {},
179 |    "outputs": [],
180 |    "source": [
181 |     "class MySGDRegressor:\n",
182 |     "    def __init__(self, learning_rate=0.01, epochs=100):\n",
183 |     "        self.learning_rate = learning_rate\n",
184 |     "        self.epochs = epochs\n",
185 |     "        self.coef_ = None\n",
186 |     "        self.intercept_ = None\n",
187 |     "\n",
188 |     "    def fit(self, X_train: np.ndarray, y_train: np.ndarray):\n",
189 |     "        self.intercept_ = 0\n",
190 |     "        self.coef_ = np.ones(X_train.shape[1])\n",
191 |     "\n",
192 |     "        # Time complexity: O(epochs X rows)\n",
193 |     "        for _ in range(self.epochs):  # Runs (epochs) times\n",
194 |     "            for _ in range(X_train.shape[0]):  # Runs (no. of rows in the dataset) times\n",
195 |     "                idx = np.random.randint(0, X_train.shape[0])\n",
196 |     "\n",
197 |     "                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_\n",
198 |     "                intercept_der = -2 * (y_train[idx] - y_hat)\n",
199 |     "                self.intercept_ -= self.learning_rate * intercept_der\n",
200 |     "\n",
201 |     "                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx])\n",
202 |     "                self.coef_ -= self.learning_rate * coef_der\n",
203 |     "\n",
204 |     "    def predict(self, X_test: np.ndarray):\n",
205 |     "        if self.coef_ is None or self.intercept_ is None:\n",
206 |     "            raise ValueError('First train the model with `fit()` method.')\n",
207 |     "        return np.dot(X_test, self.coef_) + self.intercept_"
208 |    ]
209 |   },
210 |   {
211 |    "cell_type": "code",
212 |    "execution_count": 50,
213 |    "id": "028e7219",
214 |    "metadata": {},
215 |    "outputs": [
216 |     {
217 |      "name": "stdout",
218 |      "output_type": "stream",
219 |      "text": [
220 |       "R2 Score 0.9171210388872366\n"
221 |      ]
222 |     }
223 |    ],
224 |    "source": [
225 |     "my_sgd = MySGDRegressor(learning_rate=0.01, epochs=40)\n",
226 |     "my_sgd.fit(X_train, y_train)\n",
227 |     "\n",
228 |     "y_pred = my_sgd.predict(X_test)\n",
229 |     "model_score(y_test, y_pred)"
230 |    ]
231 |   },
232 |   {
233 |    "cell_type": "markdown",
234 |    "id": "0efd64b0",
235 |    "metadata": {},
236 |    "source": [
237 |     "\n",
238 |     "---\n",
239 |     "\n",
240 |     "# Original SGDRegressor"
241 |    ]
242 |   },
243 |   {
244 |    "cell_type": "code",
245 |    "execution_count": 46,
246 |    "id": "457600ac",
247 |    "metadata": {},
248 |    "outputs": [
249 |     {
250 |      "name": "stdout",
251 |      "output_type": "stream",
252 |      "text": [
253 |       "R2 Score 0.9083595274789573\n"
254 |      ]
255 |     }
256 |    ],
257 |    "source": [
258 |     "org_sgd = SGDRegressor(max_iter=40, learning_rate=\"constant\", eta0=0.01)\n",
259 |     "org_sgd.fit(X_train, y_train)\n",
260 |     "\n",
261 |     "y_pred = org_sgd.predict(X_test)\n",
262 |     "model_score(y_test, y_pred)"
263 |    ]
264 |   }
265 |  ],
266 |  "metadata": {
267 |   "kernelspec": {
268 |    "display_name": "Python 3",
269 |    "language": "python",
270 |    "name": "python3"
271 |   },
272 |   "language_info": {
273 |    "codemirror_mode": {
274 |     "name": "ipython",
275 |     "version": 3
276 |    },
277 |    "file_extension": ".py",
278 |    "mimetype": "text/x-python",
279 |    "name": "python",
280 |    "nbconvert_exporter": "python",
281 |    "pygments_lexer": "ipython3",
282 |    "version": "3.11.0"
283 |   }
284 |  },
285 |  "nbformat": 4,
286 |  "nbformat_minor": 5
287 | }
288 |
--------------------------------------------------------------------------------
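The per-sample update used by `MySGDRegressor` in the notebook can be sanity-checked on noise-free data where the true slope and intercept are known. What follows is a minimal sketch under stated assumptions, not the notebook's own code: the function name `sgd_fit`, the seeds, and the synthetic slope 3.0 and intercept 5.0 are illustrative choices, and it draws indices from `np.random.default_rng` rather than `np.random.randint` so that repeated runs give the same result.

```python
import numpy as np


def sgd_fit(X, y, learning_rate=0.01, epochs=40, seed=0):
    """Fit y ~ X @ coef + intercept with one-sample-at-a-time SGD on squared error.

    Hypothetical helper for illustration; it mirrors the update rule in the notebook.
    """
    rng = np.random.default_rng(seed)  # seeded generator for repeatable runs
    n_rows, n_features = X.shape
    coef = np.ones(n_features)
    intercept = 0.0
    for _ in range(epochs):
        for _ in range(n_rows):
            idx = rng.integers(0, n_rows)  # pick one random row
            residual = y[idx] - (X[idx] @ coef + intercept)
            # Gradients of (y - y_hat)^2 with respect to intercept and coef
            intercept -= learning_rate * (-2.0 * residual)
            coef -= learning_rate * (-2.0 * residual * X[idx])
    return coef, intercept


# Noise-free synthetic data with an assumed slope of 3.0 and intercept of 5.0
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 5.0

coef, intercept = sgd_fit(X, y)
print(coef, intercept)
```

Because this data has no noise, the estimates should land very close to 3.0 and 5.0. On noisy data such as the notebook's `make_regression` sample, a constant learning rate tends to leave the parameters fluctuating around the least-squares solution rather than settling exactly on it.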