├── docs
│   ├── Statistics For ML.pdf
│   └── Statistics Resources.pdf
├── Naive Bayes
│   ├── docs
│   │   └── Naive Bayes.pdf
│   └── README.md
├── GD Regressor
│   ├── img
│   │   ├── change_in_cost.gif
│   │   ├── change_in_slope.gif
│   │   ├── training_with_gd.gif
│   │   └── change_in_intercept.gif
│   └── README.md
├── Probability Distribution Functions
│   ├── docs
│   │   └── PDF.pdf
│   └── README.md
├── Descriptive Statistics
│   ├── docs
│   │   └── Descriptive Statistics.pdf
│   └── README.md
├── Analysis with Statistics
│   ├── docs
│   │   └── Analysis with Statistics.pdf
│   └── README.md
├── .gitignore
├── create_new_folder.py
├── README.md
├── Linear Regression
│   └── README.md
├── course_parser.py
└── SGD Regressor
    └── notebook
        └── my-SGD.ipynb
/docs/Statistics For ML.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/docs/Statistics For ML.pdf
--------------------------------------------------------------------------------
/docs/Statistics Resources.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/docs/Statistics Resources.pdf
--------------------------------------------------------------------------------
/Naive Bayes/docs/Naive Bayes.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/Naive Bayes/docs/Naive Bayes.pdf
--------------------------------------------------------------------------------
/GD Regressor/img/change_in_cost.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/GD Regressor/img/change_in_cost.gif
--------------------------------------------------------------------------------
/GD Regressor/img/change_in_slope.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/GD Regressor/img/change_in_slope.gif
--------------------------------------------------------------------------------
/GD Regressor/img/training_with_gd.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/GD Regressor/img/training_with_gd.gif
--------------------------------------------------------------------------------
/GD Regressor/img/change_in_intercept.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/GD Regressor/img/change_in_intercept.gif
--------------------------------------------------------------------------------
/Probability Distribution Functions/docs/PDF.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/Probability Distribution Functions/docs/PDF.pdf
--------------------------------------------------------------------------------
/Descriptive Statistics/docs/Descriptive Statistics.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/Descriptive Statistics/docs/Descriptive Statistics.pdf
--------------------------------------------------------------------------------
/Analysis with Statistics/docs/Analysis with Statistics.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arv-anshul/campusx-learning/HEAD/Analysis with Statistics/docs/Analysis with Statistics.pdf
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Ignore virtual environments
2 | .venv/
3 |
4 | # Ignore environment variables
5 | .env
6 |
7 | # Ignore files
8 | *__rough__*.*
9 |
10 | # Ignore directories
11 | .vscode/
12 | .DS_Store
13 | __pycache__/
14 | .ipynb_checkpoints/
15 | __rough__/
16 |
17 | raw/
18 |
--------------------------------------------------------------------------------
/Naive Bayes/README.md:
--------------------------------------------------------------------------------
1 | # Naive Bayes
2 |
3 | ## Table of Contents
4 |
5 | 0. [Resources](#resources)
6 |
7 | ## Resources
8 |
9 | - [CampusX Playlist](https://www.youtube.com/watch?v=Ty7knppVo9E&list=PLKnIA16_RmvZ67wQaHoBuzXaDAfPz-a6l)
10 | - [PDF](./docs/Naive%20Bayes.pdf)
11 | - [Online PDF](https://drive.google.com/file/d/1UqadGJVXFZEPD4YOUAZ2t15mJCpJYghS/view?usp=sharing)
12 | - [Session Notebook](https://colab.research.google.com/drive/1lbqkDb-3TQn4xKu3yUzMeS8tjgZLwd4k?usp=sharing)
13 | - [Kaggle Notebook](https://www.kaggle.com/campusx/sentiment-analysis-using-naive-bayes)
14 |
15 | ## Topics
16 |
17 |
--------------------------------------------------------------------------------
/GD Regressor/README.md:
--------------------------------------------------------------------------------
1 | # GD Regressor
2 |
3 | ## Resources
4 |
5 | - [Video](https://youtu.be/ORyfPJypKuU)
6 | - [Session Notebook](https://github.com/campusx-official/100-days-of-machine-learning/tree/main/day51-gradient-descent)
7 | - [Gradient Descent Tool](https://developers.google.com/machine-learning/crash-course/fitter/graph)
8 | - My Notebooks for [Gradient Descent](./notebook)
9 |
10 | ### I created a Gradient Descent class from scratch and trained it on an artificial dataset generated with the `sklearn.datasets.make_regression` function.
11 |
12 | ### I also created a class called `AnimateRegressor`, which is used to create animations like the ones below.
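
The idea behind a from-scratch gradient descent regressor can be sketched as follows. This is only an illustration under stated assumptions, not the class from the notebooks: the function name `fit_gd`, the learning rate, the epoch count, and the synthetic data (a plain-Python stand-in for `make_regression`) are all hypothetical.

```python
import random


def fit_gd(xs, ys, lr=0.05, epochs=500):
    """Fit y = m*x + b with batch gradient descent on the mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    history = []  # (cost, slope, intercept) recorded at every epoch
    for _ in range(epochs):
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        cost = sum(e * e for e in errors) / n
        # Partial derivatives of the MSE with respect to m and b
        grad_m = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        m -= lr * grad_m
        b -= lr * grad_b
        history.append((cost, m, b))
    return m, b, history


# Hypothetical data: y = 3x + 1 plus a little Gaussian noise
random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(100)]
ys = [3.0 * x + 1.0 + random.gauss(0, 0.1) for x in xs]

m, b, history = fit_gd(xs, ys)
print(f"slope={m:.3f}, intercept={b:.3f}, final cost={history[-1][0]:.4f}")
```

The `history` list holds the per-epoch cost, slope, and intercept, which is the kind of data the animations in this README plot against epochs.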
13 |
14 | #### How the regression line gets fit on the data
15 |
16 | 
17 |
18 | ### The graphs below show how the cost, slope, and intercept change with respect to epochs
19 |
20 | 
21 | 
22 | 
23 |
--------------------------------------------------------------------------------
/create_new_folder.py:
--------------------------------------------------------------------------------
1 | from argparse import ArgumentParser
2 | from pathlib import Path
3 |
4 | readme_txt = """# {name}
5 |
6 | ## Table of Contents
7 |
8 | 0. [Resources](#resources)
9 |
10 | ## Resources
11 |
12 | - [Video]()
13 | - [PDF](./docs/)
14 | - [Online PDF]()
15 | - [Session Notebook]()
16 |
17 | ## Topics
18 | """
19 |
20 |
21 | def create_folder_with_files(name):
22 |     # Create the main folder
23 |     folder_path = Path(name)
24 |
25 |     try:
26 |         folder_path.mkdir(parents=True)
27 |     except FileExistsError as e:
28 |         return print(e)
29 |
30 |     # Create the README from the template
31 |     readme_fp = folder_path / 'README.md'
32 |     with open(readme_fp, 'w') as f:
33 |         f.write(readme_txt.format(name=name))
34 |
35 |     # Create folders
36 |     (folder_path / 'docs').mkdir(exist_ok=True)
37 |     (folder_path / 'notebook').mkdir(exist_ok=True)
38 |
39 |     print(f"Folder '{name}' with files and folders created successfully.")
40 |
41 |
42 | if __name__ == '__main__':
43 |     parser = ArgumentParser(
44 |         description='Create a folder with empty files and folders.'
45 |     )
46 |     parser.add_argument('-n', '--name', type=str,
47 |                         help='Name of the folder to create', required=True)
48 |     args = parser.parse_args()
49 |
50 |     create_folder_with_files(args.name)
51 |
52 | # DEMO
53 | # $ python3 create_new_folder.py -n "Naive Bayes"
54 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Learning from CampusX
2 |
3 | This repository contains all the notes and docs created by [**@arv-anshul**][arv-github] while learning Machine Learning concepts from [**CampusX**][campusx-yt].
4 | I am learning from the CampusX [**YouTube Channel**][campusx-yt] and its paid course, the [**Data Science Mentorship Program**][campusx-website] (DSMP).
5 |
6 | ## Topics
7 |
8 | - Statistics
9 |   - [Descriptive Statistics](./Descriptive%20Statistics/README.md)
10 |   - [Analysis with Statistics](./Analysis%20with%20Statistics/README.md)
11 |   - [Probability Distribution Functions](./Probability%20Distribution%20Functions/README.md)
12 |
13 | ## Resources
14 |
15 |
16 |
17 |
18 |
19 | - [CampusX Data Science Mentorship Program 2022-23](https://www.youtube.com/playlist?list=PLKnIA16_RmvbAlyx4_rdtR66B7EHX5k3z)
20 | - [Maths for Machine Learning](https://www.youtube.com/playlist?list=PLKnIA16_RmvbYFaaeLY28cWeqV-3vADST)
21 | - [Maths for ML and DL G-Drive](https://docs.google.com/spreadsheets/d/10spJMs0Zmv5cugfFjJVc4MudyOVjl_16Ef5z54oxqnM/edit#gid=241859416)
22 | - [Statistics For ML](./docs/Statistics%20For%20ML.pdf)
23 | - [Statistics Resources PDF](./docs/Statistics%20Resources.pdf)
24 | - [Statistics Resource G-Drive](https://docs.google.com/document/d/1GDKMZG5es9wkqk3ftiAXeUXKBc5fl0HlFIIKucPgRIs/edit)
25 |
26 | ## Acknowledgement
27 |
28 | 1. **Tutor:** [CampusX][campusx-yt] by [Nitish Sir](mailto:nitish.campusx@gmail.com)
29 | 2. **Github Repo Owner:** [Anshul Raj Verma][arv-github]
30 |
31 |
32 |
33 | [arv-github]: https://github.com/arv-anshul
34 | [campusx-yt]: https://youtube.com/@campusx-official
35 | [campusx-website]: https://learnwith.campusx.in
36 |
--------------------------------------------------------------------------------
/Linear Regression/README.md:
--------------------------------------------------------------------------------
1 | # Linear Regression
2 |
3 | ## Resources
4 |
5 | - [Video](https://youtu.be/aEPoLeS6UMM)
6 | - [Session 49 - PDF](https://drive.google.com/file/d/18oSjN8aEztz_m-_CoKb5i_kGHvKccjdp/view?usp=share_link)
7 | - [Day48 Simple Linear Regression](https://github.com/campusx-official/100-days-of-machine-learning/tree/main/day48-simple-linear-regression)
8 | - [Day49 Regression Metrics](https://github.com/campusx-official/100-days-of-machine-learning/tree/main/day49-regression-metrics)
9 | - [Session 50 Notebook](https://colab.research.google.com/github/campusx-official/100-days-of-machine-learning/blob/main/day50-multiple-linear-regression/multiple_linear_regression.ipynb#scrollTo=NpAvnU-t3yV0)
10 | - [Session 50 Notebook - 2](https://colab.research.google.com/github/campusx-official/100-days-of-machine-learning/blob/main/day50-multiple-linear-regression/code-from-scratch.ipynb#scrollTo=afc9a715)
11 | - [Session 50 - PDF](https://drive.google.com/file/d/1fYGa7wXCirq8Tvo2YqfHsQSlhs1DXXwo/view?usp=share_link)
12 |
13 | ## Topics
14 |
15 | **Practice topics [in Code](./notebook)**
16 |
17 | ### Simple Linear Regression
18 |
19 | Used to model the relationship between a target feature and a single input feature.
20 |
21 | > [!IMPORTANT]
22 | >
23 | > **For example,** suppose we have data on college students, with CGPA as the input feature and the salary package (in LPA) after placement as the target. The Linear Regression model relates these two features by fitting a regression line that passes as close as possible to all the points, so that **the sum of squared residuals/errors between the line and the points is minimized**.
24 |
25 | | $m = \text{Slope of Regression Line}$ | $b = \text{Intercept of Regression Line}$ |
26 | | ----------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------- |
27 | | $$m = \frac{\displaystyle\sum_{i=1}^{n} {(x_i - \bar{x}) (y_i - \bar{y})}}{\displaystyle\sum_{i=1}^{n} {(x_i - \bar{x})^2}}$$ | $$b = \bar{y} - m \cdot \bar{x}$$ |
28 |
29 | $$f(x) = m \cdot x + b$$
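
The slope and intercept formulas above can be checked with a few lines of plain Python. This is a minimal sketch; the function name and the CGPA/LPA values are made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    m = numerator / denominator
    b = y_bar - m * x_bar
    return m, b


# Hypothetical data: CGPA (input) and package in LPA (target)
cgpa = [6.0, 7.0, 8.0, 9.0]
lpa = [3.0, 4.0, 5.0, 6.0]

m, b = fit_simple_linear_regression(cgpa, lpa)
print(f"f(x) = {m:.2f} * x + {b:.2f}")
```

Because the made-up points lie exactly on a line, the fit recovers that line with zero residuals; real data would leave some error around the line.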
30 |
31 | ### Multiple Linear Regression
32 |
33 | This method is used to model the relationship between multiple independent variables (features) and a dependent variable (response) using a linear equation. The general form of a multiple linear regression model with $(p)$ independent variables is:
34 |
35 | $$Y = \beta{_0} + \beta{_1}X_1 + \beta{_2}X_2 + \ldots + \beta{_p}X_p + \varepsilon$$
36 |
37 | Where:
38 |
39 | - $(Y)$ is the dependent variable (response).
40 | - $(X_1, X_2, \ldots, X_p)$ are the independent variables (features).
41 | - $(\beta{_0}, \beta{_1}, \beta{_2}, \ldots, \beta{_p})$ are the coefficients that represent the impact of each independent variable on the dependent variable.
42 | - $(\varepsilon)$ is the error term, representing the unexplained variation in the dependent variable.
43 |
44 | This equation can be expressed in matrix notation as follows:
45 |
46 | $$\mathbf{Y} = \mathbf{X} \beta + \mathbf{\varepsilon}$$
47 |
48 | **Where:**
49 |
50 | - $(\mathbf{Y})$ is the vector of observed values of the dependent variable.
51 | - $(\mathbf{X})$ is the design matrix containing the observed values of the independent variables.
52 | - $(\beta)$ is the vector of coefficients.
53 | - $(\mathbf{\varepsilon})$ is the vector of error terms.
54 |
55 | Written out in full, the matrix form of the model is:
56 |
57 | $$
58 | \begin{bmatrix}
59 | y_1 \\
60 | y_2 \\
61 | \vdots \\
62 | y_n
63 | \end{bmatrix} = \begin{bmatrix}
64 | 1 & x_{11} & x_{12} & \ldots & x_{1p} \\
65 | 1 & x_{21} & x_{22} & \ldots & x_{2p} \\
66 | \vdots & \vdots & \vdots & \ddots & \vdots \\
67 | 1 & x_{n1} & x_{n2} & \ldots & x_{np}
68 | \end{bmatrix} \begin{bmatrix}
69 | \beta{_0} \\
70 | \beta{_1} \\
71 | \beta{_2} \\
72 | \vdots \\
73 | \beta{_p}
74 | \end{bmatrix} + \begin{bmatrix}
75 | \varepsilon{_1} \\
76 | \varepsilon{_2} \\
77 | \vdots \\
78 | \varepsilon{_n}
79 | \end{bmatrix}
80 | $$
81 |
82 | To estimate the coefficients $(\beta)$, the least squares method is commonly used. The goal is to minimize the sum of squared differences between the observed values $(\mathbf{Y})$ and the values predicted by the model $(\mathbf{X} \beta)$:
83 |
84 | $$\text{minimize} \ \lVert \mathbf{Y} - \mathbf{X} \beta \rVert^2$$
85 |
86 | The least squares solution for $(\beta)$ is given by:
87 |
88 | $$\beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$$
89 |
90 | **Where:**
91 |
92 | - $((\mathbf{X}^T \mathbf{X})^{-1})$ is the inverse of the matrix $(\mathbf{X}^T \mathbf{X})$.
93 | - $(\mathbf{X}^T)$ is the transpose of the matrix $(\mathbf{X})$.
94 | - $(\mathbf{Y})$ is the vector of observed values of the dependent variable.
95 |
96 | This solution gives us the estimated coefficients $(\beta)$ that best fit the data in a least squares sense.
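
As a rough illustration of the least squares solution, the sketch below builds a design matrix with a leading column of ones and solves the normal equations with NumPy. The data is hypothetical and noise-free, and `np.linalg.solve` is used in place of an explicit matrix inverse, which is a common numerical choice rather than something these notes prescribe.

```python
import numpy as np

# Hypothetical data: 5 observations, 2 features,
# generated from y = 1 + 2*x1 + 3*x2 with no noise.
features = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
y = 1.0 + 2.0 * features[:, 0] + 3.0 * features[:, 1]

# Design matrix X: a column of ones (for beta_0) followed by the features
X = np.column_stack([np.ones(len(features)), features])

# Solve (X^T X) beta = X^T y, which is equivalent to
# beta = (X^T X)^{-1} X^T y when X^T X is invertible
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)
```

The recovered coefficients should match the intercept and slopes used to generate `y`, up to floating-point error.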
97 |
98 | In summary, multiple linear regression uses matrices to express the relationships between multiple independent variables and a dependent variable. The goal is to find the coefficients that minimize the sum of squared differences between the observed and predicted values. The least squares method provides a way to estimate these coefficients using matrix operations.
99 |
--------------------------------------------------------------------------------
/Descriptive Statistics/README.md:
--------------------------------------------------------------------------------
1 | # Descriptive Statistics
2 |
3 | ## Table of Contents
4 |
5 | 0. [Resources](#resources)
6 | 1. [What is Statistics?](#what-is-statistics)
7 | 2. [Types of Statistics](#types-of-statistics)
8 | 3. [Population and Sample](#population-and-sample)
9 | 4. [Types of Data in Statistics](#types-of-data-in-statistics)
10 | 5. [Measure of Central Tendency](#measure-of-central-tendency)
11 | 6. [Measure of Dispersion](#measure-of-dispersion)
12 |
13 | ## Resources
14 |
15 | 1. [Descriptive Statistics](https://www.youtube.com/watch?v=Uv3Blie7F3g&list=PLKnIA16_RmvbYFaaeLY28cWeqV-3vADST&index=1)
16 | 2. [PDF](./docs/Descriptive%20Statistics.pdf)
17 |
18 | ## Topics
19 |
20 | ### What is Statistics?
21 |
22 | - Statistics is a branch of mathematics that involves collecting, analyzing, interpreting and presenting data.
23 | - It provides methods to understand and make sense of large amounts of data and to draw conclusions and make decisions based on the data.
24 | - It is used to conduct research studies, analyze market trends, evaluate the effectiveness of treatments and interventions, and make forecasts and predictions.
25 |
26 | ### Types of Statistics
27 |
28 | 1. **Descriptive Statistics:** It summarizes data using measures such as _mean, median, mode, variance, standard deviation, etc._ It does not depend on population data.
29 | In simple words, these are the statistics used to summarize the sample data and draw insights from it.
30 |
31 | 2. **Inferential Statistics:** It deals with drawing conclusions and making predictions about a population based on a sample. It uses probability to quantify the uncertainty of those predictions.
32 | In simple words, the statistics used for making predictions are known as Inferential Statistics.
33 |
34 | ### Population and Sample
35 |
36 | - **Population:** It is the entire group of individuals, items, or observations that we want to make inferences about.
37 |
38 | **Example:** We want to calculate the average salary of Indian citizens. Here, all the 100 crore people (except children) of India are the population for this inference.
39 |
40 | - **Sample:** It is a random subset of the population that is used to make inferences about the population.
41 |
42 | **Example:** For the population above, any randomly chosen group of people (e.g. 10,000 or 1,00,000 people) is a sample that can be used to estimate the average salary of Indian citizens.
43 |
44 | ### Types of Data in Statistics
45 |
46 | ```mermaid
47 | graph
48 | A[Types of Data \n in Statistics]
49 | A --> B(Categorical or \n Qualitative Data)
50 | A --> C(Numerical or \n Quantitative Data)
51 | C --> D(Discrete Data)
52 | C --> E(Continuous Data)
53 | B --> F(Nominal Data)
54 | B --> G(Ordinal Data)
55 | ```
56 |
57 | ### Measure of Central Tendency
58 |
59 | It is used to measure the central value of a dataset. It summarizes the data by identifying a single value that is most representative of the dataset as a whole.
60 |
61 | 1. **Mean:** The mean is the sum of all values in the dataset divided by the number of values.
62 |
63 | | Sample Mean | Population Mean |
64 | | :----------------------------------------: | :------------------------------------: |
65 | | $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$ | $$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$ |
66 |
67 | 2. **Median:** The median is the middle value in the dataset when the data is arranged in order.
68 |
69 | 3. **Mode:** The mode is the value that appears most frequently in the dataset.
70 |
71 | 4. **Weighted Mean:** The weighted mean is the sum of the products of each value and its weight, divided by the sum of the weights. It is used to calculate a mean when the values in the dataset have different importance or frequency.
72 |
73 | $$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i \cdot x_i}{\sum_{i=1}^{n} w_i}$$
74 |
75 | 5. **Trimmed Mean:** It is calculated by removing a certain percentage of the smallest and largest values from the dataset and then taking the mean of the remaining values. The percentage of values removed is called the trimming percentage.
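
The measures above can be sketched with Python's standard `statistics` module. The dataset is made up, and `weighted_mean` and `trimmed_mean` are hypothetical helpers written for this example (the trimming proportion is applied to each end of the sorted data).

```python
import statistics

# Hypothetical dataset with one large value
data = [2, 3, 3, 5, 7, 10, 40]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, proportion):
    """Drop `proportion` of the values from each end, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


print(mean, median, mode)
print(weighted_mean([80, 90], [1, 3]))
print(trimmed_mean(data, 0.15))
```

The large value pulls the mean well above the median, while the trimmed mean sits much closer to the median because the extremes are dropped first.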
76 |
77 | ### Measure of Dispersion
78 |
79 | It describes the spread or variability of a dataset. It provides information about how the data is distributed around the central tendency (mean, median or mode) of the dataset.
80 |
81 | 1. **Range:** It is the difference between the maximum and minimum values in the dataset. It can be affected by outliers.
82 |
83 | 2. **Variance:** It measures the average squared distance of each data point from the mean.
84 |
85 | | Sample Variance | Population Variance |
86 | | :----------------------------------------------------: | :---------------------------------------------------: |
87 | | $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$ | $$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$ |
88 |
89 | 3. **Standard Deviation:** It is the square root of the variance. It is expressed in the same units as the data and is useful in describing **the spread of a distribution**.
90 |
91 | | Sample Standard Deviation | Population Standard Deviation |
92 | | :---------------------------------------------------------: | :--------------------------------------------------------: |
93 | | $$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$ | $$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$ |
94 |
95 | 4. **Coefficient of Variation (CV):** CV is the ratio of the standard deviation to the mean expressed as a percentage. It is used to compare the variability of datasets that have different means or units.
96 |
97 | $$ \frac{\sigma}{\mu} \cdot 100 = \text{CV} \% $$
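
A short sketch of these dispersion measures using the standard `statistics` module, on a made-up dataset. `pvariance`/`pstdev` divide by $N$ (population formulas), while `variance`/`stdev` divide by $n-1$ (sample formulas).

```python
import statistics

# Hypothetical dataset
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)

# Population formulas (divide by N)
population_variance = statistics.pvariance(data)
population_std = statistics.pstdev(data)

# Sample formulas (divide by n - 1)
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

# Coefficient of variation as a percentage (population version)
cv_percent = population_std / statistics.mean(data) * 100

print(data_range, population_variance, population_std)
print(sample_variance, sample_std)
print(cv_percent)
```

The sample variance comes out slightly larger than the population variance for the same data because of the smaller divisor.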
98 |
--------------------------------------------------------------------------------
/Analysis with Statistics/README.md:
--------------------------------------------------------------------------------
1 | # Analysis with Statistics
2 |
3 | ## Table of Contents
4 |
5 | 0. [Resources](#resources)
6 |
7 | 1. [Types of Analysis](#types-of-analysis)
8 |
9 | 2. [Univariate Analysis](#univariate-analysis)
10 |
11 | - [Categorical](#categorical)
12 | - [Numerical](#numerical)
13 |
14 | 3. [Bivariate Analysis](#bivariate-analysis)
15 |
16 | - [Categorical - Categorical](#categorical---categorical)
17 | - [Numerical - Numerical](#numerical---numerical)
18 | - [Categorical - Numerical](#categorical---numerical)
19 |
20 | 4. [Multivariate Analysis](#multivariate-analysis)
21 |
22 | ## Resources
23 |
24 | - [Video](https://www.youtube.com/watch?v=1ndVC500-EU&list=PLKnIA16_RmvbYFaaeLY28cWeqV-3vADST&index=2)
25 | - [PDF](./docs/Analysis%20with%20Statistics.pdf)
26 | - [Notebook](https://colab.research.google.com/drive/19YlpW_N7idyQQvmpgrZg8KNSvIjCPk-8?usp=sharing)
27 |
28 | ## Topics
29 |
30 | ### Types of Analysis
31 |
32 | ```mermaid
33 | graph LR
34 | A((Types of Analysis \n in Stats))
35 |
36 | A --> B(Univariate Analysis)
37 | B --> B1(Each Categorical Feature)
38 | B --> B2(Each Numerical Feature)
39 |
40 | A --> C(Bivariate Analysis)
41 | C --> C1(Categorical - Categorical)
42 | C --> C2(Categorical - Numerical)
43 | C --> C3(Numerical - Numerical)
44 |
45 | A --> D(Multivariate Analysis)
46 | D --> D1{{Analysis using more than \n two features}}
47 | ```
48 |
49 | ### Univariate Analysis
50 |
51 | #### Categorical
52 |
53 | 1. **Frequency Distribution Table** is a table that summarizes the number of times (or frequency) that each value occurs in the dataset.
54 |
55 | 2. **Relative frequency** is the proportion or percentage of a category in a dataset or sample.
56 | It is calculated by dividing the frequency of a category by the total number of observations in the dataset or sample.
57 |
58 | 3. **Cumulative frequency** is the running total of frequencies of a variable or category in a dataset or sample. It is calculated by adding up the frequencies of the current category and all previous categories in the dataset or sample.
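
The three tables above can be sketched with `collections.Counter`. The category values are hypothetical, and the cumulative total is accumulated here in order of descending frequency, which is just one possible ordering.

```python
from collections import Counter

# Hypothetical categorical feature
colors = ["red", "blue", "red", "green", "blue", "red"]

# Frequency: how many times each category occurs
frequency = Counter(colors)
total = len(colors)

# Relative frequency: frequency divided by the number of observations
relative_frequency = {cat: count / total for cat, count in frequency.items()}

# Cumulative frequency: running total over the categories
cumulative_frequency = {}
running_total = 0
for cat, count in frequency.most_common():
    running_total += count
    cumulative_frequency[cat] = running_total

print(frequency)
print(relative_frequency)
print(cumulative_frequency)
```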
59 |
60 | #### Numerical
61 |
62 | 1. **Frequency Distribution Table or Histogram** is built for numerical data using the binning method. It counts the number of data points that fall in each bin.
63 | Here, every bin works as a category for the data that falls in it.
64 |
65 | ### Bivariate Analysis
66 |
67 | #### Categorical - Categorical
68 |
69 | 1. **Contingency Table or Cross-Tabulation** is used to summarize the relationship between two categorical variables.
70 | It shows the frequencies or relative frequencies of the observed values of two variables.
71 |
72 | #### Numerical - Numerical
73 |
74 | 1. **Scatter Plot** shows whether two numerical variables have a positive or negative relationship.
75 |
76 | 2. **Regression Plot** is a special scatter plot that also draws a fitted line.
77 |
78 | 3. **Jointplot** displays two kinds of plot at once: a scatter plot and histograms.
79 |
80 | #### Categorical - Numerical
81 |
82 | 1. **Quantiles** are used to divide the data into equal-sized groups.
83 | Quantiles are important measures of variability and can be used to understand the distribution of data and to summarize and compare different datasets. They can also be used to identify outliers.
84 |
85 | There are several types of quantiles used in statistics such as **Quartiles, Deciles, Percentiles, Quintiles** but the most important one is **Percentile** because it divides the data into 100 equal parts.
86 |
87 | 2. **Percentile** is a measure that represents the percentage of the dataset that falls below a particular value.
88 | _For example,_ the 75th percentile is the value below which 75% of the observations in the dataset fall.
89 |
90 | 3. **Inter Quartile Range (IQR)** is the difference between the third quartile (Q3) and the first quartile (Q1) of a dataset.
91 |
92 | 4. **Five number summary** represents the Minimum, Q1, Q2 (Median), Q3 and Maximum. In a box plot, the minimum and maximum are the whisker limits, and values beyond them are treated as outliers:
93 | $$ \text{Minimum} = \text{Q1} - (1.5 \ast \text{IQR}) $$
94 | $$ \text{Maximum} = \text{Q3} + (1.5 \ast \text{IQR}) $$
95 |
96 | The five number summary is generally visualized using a **Box Plot or Whisker Plot**.
97 |
98 | 
99 |
100 | - **Benefits of a Box Plot:**
101 |
102 | - Easy way to see the **distribution of data**.
103 | - Tells about **skewness of data**.
104 | - Can **identify outliers**.
105 | - **Compare 2 categories** of data.
106 |
107 | 5. **Covariance** describes the degree to which two variables are linearly related. It measures how much two variables change together, such that when one variable increases, does the other variable also increase, or does it decrease?
108 | A covariance of zero indicates that the variables are not linearly related.
109 |
110 | 
111 |
112 | - **Disadvantages of using Covariance**
113 | One limitation of covariance is that it does not tell us about the **strength of the relationship between two variables**, since the magnitude of **covariance is affected by the scale of the variables**.
114 |
115 | 6. **Correlation** measures the degree to which two variables are related and how they tend to change together.
116 | Correlation is often measured using a statistical tool called the correlation coefficient, which ranges from -1 to 1. A correlation coefficient of -1 indicates a perfect negative correlation, a correlation coefficient of 0 indicates no correlation, and a correlation coefficient of 1 indicates a perfect positive correlation.
117 |
118 | $$ \text{Correlation} = \frac{Cov(x, y)}{\sigma_x \cdot \sigma_y} $$
119 |
120 | > **Note:** Correlation does not imply causation. If two variables are correlated, it does not mean that one variable is affected by the other, or vice versa.
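
The quartile, IQR, covariance, and correlation definitions in this section can be sketched with the standard library. All data is made up, and `sample_covariance` and `correlation` are hypothetical helpers. Note that `statistics.quantiles` uses one particular quartile convention (its default "exclusive" method), so other tools may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical sample with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]

# With n=4, statistics.quantiles returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Box plot whisker limits; values outside them are flagged as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_bound or x > upper_bound]


def sample_covariance(xs, ys):
    """Average product of deviations from the means (n - 1 divisor)."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance scaled by both standard deviations, so it lies in [-1, 1]."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical paired data with a perfect positive linear relationship
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]

print(q1, q2, q3, iqr, outliers)
print(sample_covariance(hours, scores), correlation(hours, scores))
```

Rescaling `scores` (for example, multiplying every value by 10) would change the covariance but leave the correlation unchanged, which is the scale limitation of covariance described above.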
121 |
122 | ### Multivariate Analysis
123 |
124 | 1. **3D Scatter Plot**
125 |
126 | 2. **Plot Parameters** are used to display the impact of an additional categorical or numerical variable in the plot.
127 |
128 | - Hue/Color Parameter
129 | - Size Parameter
130 |
131 | 3. **Facet Grids**
132 |
133 | 4. **Pairplot**
134 |
135 | 5. **Bubble Chart**
136 |
--------------------------------------------------------------------------------
/Probability Distribution Functions/README.md:
--------------------------------------------------------------------------------
1 | # Probability Distribution Functions
2 |
3 | ## Table of Contents
4 |
5 | 0. [Resources](#resources)
6 |
7 | 1. [Random Variables](#random-variables)
8 |
9 | 2. [Types of Random Variables](#types-of-random-variables)
10 |
11 | 3. [Probability Distributions](#probability-distributions)
12 |
13 | 4. [What are Probability Distributions?](#what-are-probability-distributions)
14 |
15 | 5. [Problem with Distribution?](#problem-with-distribution)
16 |
17 | 6. [Solution: Probability Distribution Functions](#solution-probability-distribution-functions)
18 |
19 | 7. [Different types of Probability Distributions](#different-types-of-probability-distributions)
20 |
21 | 8. [Why are Probability Distributions important?](#why-are-probability-distributions-important)
22 |
23 | 9. [A note on Parameters of Probability Distribution Functions](#a-note-on-parameters-of-probability-distribution-functions)
24 |
25 | 10. [Probability Mass Function (PMF)](#probability-mass-function-pmf)
26 |
27 | 11. [Cumulative Distribution Function (CDF) of PMF](#cumulative-distribution-function-cdf-of-pmf)
28 |
29 | 12. [Probability Density Function (PDF)](#probability-density-function-pdf)
30 |
31 | 13. [Questions related to PDFs](#questions-related-to-pdfs)
32 |
33 | 14. [Density Estimation](#density-estimation)
34 |
35 | 15. [Types of Density Estimation](#types-of-density-estimation)
36 |
37 | 16. [Parametric Density Estimation](#parametric-density-estimation)
38 |
39 | 17. [Non-Parametric Density Estimation](#non-parametric-density-estimation)
40 |
41 | 18. [Kernel Density Estimate (KDE)](#kernel-density-estimate-kde)
42 |
43 | 19. [PDF, PMF and CDF](#pdf-pmf-and-cdf)
44 |
45 | ## Resources
46 |
47 | - [Video](https://www.youtube.com/watch?v=C_QAURbgBqY&list=PLKnIA16_RmvbYFaaeLY28cWeqV-3vADST&index=4)
48 | - [PDF](./docs/PDF.pdf)
49 | - [Online PDF](https://drive.google.com/file/d/1FQ65CTmMLK-PYZ6NT9txGcGmJHobtNYl/view)
50 | - [Session Notebook](https://colab.research.google.com/drive/1N_T0_w5vpT1k1Z4pSf4IMhAxYT1nRKLU?usp=sharing)
51 |
52 | ## Topics
53 |
54 | ### Random Variables
55 |
56 | A Random Variable is a set of possible values from a random experiment.
57 |
58 | ### Types of Random Variables
59 |
60 | ```mermaid
61 | graph
62 | A[Types of Random Variable]
63 |
64 | A --> B(Discrete \n Random Variable)
65 | A --> C(Continuous \n Random Variable)
66 | ```
67 |
68 | ### Probability Distributions
69 |
70 | ### What are Probability Distributions?
71 |
72 | A probability distribution is a list of all of the possible outcomes of a random variable along with their corresponding probability values.
73 |
74 | ### Problem with Distribution?
75 |
76 | In many scenarios, the number of outcomes can be much larger and hence a table would be tedious to write down. Worse still, the number of possible outcomes could be infinite, in which case, good luck writing a table for that.
77 |
78 | > **Example:** Height of people, Rolling 10 dice together.
79 |
80 | ### Solution: Probability Distribution Functions
81 |
82 | A probability distribution function is a mathematical function that describes the **probability of obtaining different values of a random variable** in a particular probability distribution.
83 |
84 | ### Different types of Probability Distributions
85 |
86 | 
87 |
88 | 
89 |
90 | ### Why are Probability Distributions important?
91 |
92 | - Gives an idea about the shape/distribution of the data.
93 | - And if our data follows a famous distribution then we automatically know a lot about the data.
94 |
95 | ### A note on Parameters of Probability Distribution Functions
96 |
97 | Parameters in probability distributions are numerical values that determine the shape, location, and scale of the distribution.
98 | Different probability distributions have different sets of parameters that determine their shape and characteristics, and understanding these parameters is essential in statistical analysis and inference.
99 |
100 | ### Probability Mass Function (PMF)
101 |
102 | Describes the probability distribution of a **discrete random variable**.
103 |
104 | A PMF assigns a probability to each value of the random variable. The probabilities assigned by the PMF must satisfy two conditions:
105 |
106 | 1. The probability assigned to each **value must be non-negative** (i.e., greater than or equal to zero).
107 | 2. The **sum** of the probabilities assigned to all possible values must **equal 1**.
108 |
109 | ### Cumulative Distribution Function (CDF) of PMF
110 |
111 | Describes the probability that a random variable X with a given probability distribution will be found at a value less than or equal to x.
112 |
113 | $$ F(x) = P(X \le x) $$
114 |
115 | **Examples:**
116 |
117 | - [Bernoulli Distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution)
118 | - [Binomial Distribution](https://en.wikipedia.org/wiki/Binomial_distribution)
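
As an illustration of a PMF and its CDF, the sketch below uses the sum of two fair dice (a hypothetical example, not taken from the session notebook). Exact fractions are used so that the two PMF conditions can be checked without rounding error.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely outcomes give each total
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# PMF: probability of each possible total of two dice
pmf = {total: Fraction(count, 36) for total, count in sorted(counts.items())}

# The two PMF conditions: non-negative values that sum to 1
assert all(p >= 0 for p in pmf.values())
assert sum(pmf.values()) == 1


def cdf(x):
    """F(x) = P(X <= x): add up the PMF for every value not exceeding x."""
    return sum(p for value, p in pmf.items() if value <= x)


print(pmf[7])   # probability of rolling a total of exactly 7
print(cdf(4))   # probability of rolling a total of 4 or less
print(cdf(12))  # every outcome is <= 12
```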
119 |
120 | ### Probability Density Function (PDF)
121 |
122 | Describes the probability distribution of a continuous random variable.
123 |
124 | ### Questions related to PDFs
125 |
126 | 1. Why does the y-axis show Probability Density and not Probability?
127 |
128 | - Because a continuous random variable has infinitely many possible values on the x-axis, you cannot assign a meaningful probability to each individual value (the probability of any single exact value is zero).
129 |
130 | 2. What does the area under the graph represent in a PDF?
131 |
132 | - The area under the graph represents the probability of a range (e.g. 3.0 to 3.1) on the x-axis, because the y-axis shows probability density.
133 |
134 | 3. How to calculate Probability from a PDF graph?
135 |
136 | - If you shrink the range between two points on the x-axis to a very small interval, the area over that interval gives the approximate probability around a point.
137 |
138 | 4. Examples of PDF:
139 |
140 | - [Normal Distribution](https://en.wikipedia.org/wiki/Normal_distribution)
141 | - [Log Normal Distribution](https://en.wikipedia.org/wiki/Log-normal_distribution)
142 | - [Poisson Distribution](https://en.wikipedia.org/wiki/Poisson_distribution) (note: this is a discrete distribution, so strictly it is described by a PMF rather than a PDF)
143 |
144 | 5. How is the graph calculated?
145 |
146 | - Using [Density Estimation](#density-estimation)
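
To make the "area under the graph" idea concrete, here is a small sketch that approximates the probability of the range 3.0 to 3.1 under a standard normal PDF with the trapezoidal rule. The choice of a standard normal distribution and the helper names are assumptions for illustration only.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def area_under_pdf(a, b, steps=10_000):
    """Approximate P(a <= X <= b) as the area under the PDF (trapezoidal rule)."""
    width = (b - a) / steps
    total = 0.5 * (normal_pdf(a) + normal_pdf(b))
    total += sum(normal_pdf(a + i * width) for i in range(1, steps))
    return total * width


def normal_cdf(x):
    """Exact CDF of the standard normal, used here only as a cross-check."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


print(area_under_pdf(3.0, 3.1))
print(normal_cdf(3.1) - normal_cdf(3.0))
```

The two printed values should agree closely, and the area over the whole x-axis is 1.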
147 |
148 | ### Density Estimation
149 |
150 | Density estimation is a statistical technique used to estimate the probability density function (PDF) of a random variable.
151 |
152 | It is particularly useful in areas such as machine learning, where it is often used to estimate the probability distribution of input data or to model the likelihood of certain events or outcomes.
153 |
154 | ### Types of Density Estimation
155 |
156 | ```mermaid
157 | graph
158 |
159 | A[Types of \n Density Estimation]
160 |
161 | A --> B(Parametric \n Density Estimation)
162 | A --> C(Non-Parametric \n Density Estimation)
163 | ```
164 |
165 | ### Parametric Density Estimation
166 |
167 | This method estimates the probability density by assuming that the random variable follows a specific distribution, such as the normal, exponential, log-normal, or Poisson distribution.
168 | The distribution's parameters (for example, the mean and standard deviation of a normal distribution) are then estimated from the sample data.
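
A minimal sketch of parametric density estimation, assuming the data is normally distributed: estimate the mean and standard deviation from a sample, then use them as the parameters of the fitted density. The synthetic sample (true mean 50, true standard deviation 10) and the variable names are assumptions made for this example.

```python
import random
import statistics

# Assumed synthetic data: 5000 draws from a normal distribution.
random.seed(0)
sample = [random.gauss(50, 10) for _ in range(5000)]

# Step 1: estimate the parameters of the assumed distribution from the sample.
mu_hat = statistics.fmean(sample)
sigma_hat = statistics.stdev(sample)

# Step 2: the fitted distribution gives a density for any x.
fitted = statistics.NormalDist(mu_hat, sigma_hat)
print(mu_hat, sigma_hat)
print(fitted.pdf(50))
```

If the normality assumption is wrong, the fitted curve can describe the data poorly, which is the motivation for non-parametric methods.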
169 |
170 | ### Non-Parametric Density Estimation
171 |
172 | It is used when the distribution of the random variable is not clear or is not one of the well-known distributions.
173 |
174 | This method estimates the probability density of a random variable without making any assumption about the underlying distribution. This is typically done by creating a **kernel density estimate**.
175 |
176 | It has several **advantages over parametric density estimation**.
177 | One of the main advantages is that **it does not require the assumption of a specific distribution**, which allows for more flexible and accurate estimation in situations where the underlying distribution is unknown or complex.
178 | However, non-parametric density estimation **can be computationally intensive and may require more data to achieve accurate estimates** compared to parametric methods.
179 |
180 | ### Kernel Density Estimate (KDE)
181 |
182 | The KDE technique involves using a kernel function to smooth out the data and create a continuous estimate of the underlying density function.
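
The idea can be sketched in a few lines, assuming a Gaussian kernel and a hand-picked bandwidth (both are illustrative choices, as are the sample values): a small bump is placed on every data point and the bumps are averaged.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, data, bandwidth):
    """Kernel density estimate at x: the average of kernels centred on each data point."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)


# Assumed toy data with two clusters; the bandwidth is chosen by hand.
data = [1.0, 1.2, 1.9, 2.1, 2.4, 5.0, 5.3]
bandwidth = 0.5

for x in (2.0, 3.7, 5.1):
    print(x, kde(x, data, bandwidth))
```

A smaller bandwidth follows individual points more closely, while a larger one produces a smoother curve.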
183 |
184 | [**Watch the video for more clarity.**](https://www.youtube.com/watch?v=C_QAURbgBqY&list=PLKnIA16_RmvbYFaaeLY28cWeqV-3vADST&index=4)
185 |
186 | ### PDF, PMF and CDF
187 |
188 | 
189 |
190 | 
191 |
--------------------------------------------------------------------------------
/course_parser.py:
--------------------------------------------------------------------------------
1 | from __future__ import annotations
2 |
3 | import json
4 | import time
5 | from dataclasses import dataclass, field
6 | from typing import TYPE_CHECKING, Iterable, Literal, Self
7 |
8 | from bs4 import BeautifulSoup, Tag
9 |
10 | if TYPE_CHECKING:
11 |     from pathlib import Path
12 |
13 |     import httpx
14 |
15 | COURSE_URL = "https://learnwith.campusx.in/s/courses/653f50d1e4b0d2eae855480a/take"
16 | BASE_RESOURCE_URL = "https://learnwith.campusx.in/s/courses/653f50d1e4b0d2eae855480a"
17 | BASE_HEADERS = {
18 |     "accept": "application/json, text/javascript, */*; q=0.01",
19 |     "referer": COURSE_URL,
20 |     "user-agent": (
21 |         "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_7) AppleWebKit/537.36 (KHTML, "
22 |         "like Gecko) Chrome/111.5.0.0 Safari/507.02"
23 |     ),
24 | }
25 | ResourceType = Literal[
26 |     "article",
27 |     "assessment",
28 |     "assignment",
29 |     "link",
30 |     "livetest",
31 |     "pdf",
32 |     "video",
33 | ]
34 |
35 |
36 | def fetch_sub_topic_resource(
37 |     client: httpx.Client,
38 |     sub_topic_id: str,
39 |     resource_type: ResourceType,
40 | ) -> bytes:
41 |     """Fetches the resource data for the given subtopic ID and resource type.
42 |
43 |     Args:
44 |         client: HTTPX client instance with cookies set.
45 |         sub_topic_id: ID of the subtopic to fetch.
46 |         resource_type: Type of resource to fetch.
47 |
48 |     Returns:
49 |         The data as bytes for the requested resource.
50 |
51 |     Raises:
52 |         ValueError: If client does not have cookies set.
53 |         HTTPError: If the API request fails.
54 |     """
55 |     if not client.cookies:
56 |         raise ValueError("Client does not have cookies.")
57 |     res = client.get(f"/{resource_type}s/{sub_topic_id}/get")
58 |     res.raise_for_status()
59 |     return res.content
60 |
61 |
62 | @dataclass(kw_only=True)
63 | class CourseTopic:
64 |     title: str
65 |     id: str
66 |     source: Tag = field(repr=False)
67 |
68 |     @staticmethod
69 |     def search(html_path: Path) -> Tag:
70 |         """
71 |         Finds the course items container in a saved HTML page.
72 |
73 |         Returns the 'div.courseItems' tag, or raises ValueError if it is absent.
74 |         """
75 |         soup = BeautifulSoup(html_path.read_bytes(), "html.parser")
76 |         course_items_tag = soup.select_one("div.courseItems")
77 |         if course_items_tag:
78 |             return course_items_tag
79 |         raise ValueError("'div.courseItems' css selector not present in source.")
80 |
81 |     @classmethod
82 |     def parse(cls, source: Tag) -> Iterable[Self]:
83 |         """
84 |         Parses CourseTopic instances from a BeautifulSoup tag.
85 |
86 |         Yields CourseTopic instances parsed from the provided BeautifulSoup tag source.
87 |         """
88 |         yield from (
89 |             cls(
90 |                 title=tag["data-title"],
91 |                 id=tag["data-id"],
92 |                 source=tag,
93 |             )
94 |             for tag in source.find_all("div", {"data-type": "label"})
95 |         )
96 |
97 |
98 | @dataclass(kw_only=True)
99 | class CourseSubTopic:
100 |     id: str
101 |     topicId: str
102 |     title: str
103 |     type: ResourceType
104 |     source: Tag = field(repr=False)
105 |
106 |     @classmethod
107 |     def parse(cls, topic: CourseTopic) -> Iterable[Self]:
108 |         """
109 |         Parses CourseSubTopic instances from a CourseTopic BeautifulSoup tag.
110 |
111 |         Yields CourseSubTopic instances parsed from the provided CourseTopic
112 |         BeautifulSoup tag source.
113 |         """
114 |         yield from (
115 |             cls(
116 |                 id=tag["data-id"],
117 |                 topicId=topic.id,
118 |                 title=tag["data-title"],
119 |                 type=tag["data-type"],
120 |                 source=tag,
121 |             )
122 |             for tag in topic.source.find_all("div", {"data-type": True})
123 |         )
124 |
125 |     @classmethod
126 |     def parse_many(
127 |         cls,
128 |         topics: Iterable[CourseTopic],
129 |     ) -> Iterable[tuple[CourseTopic, Iterable[Self]]]:
130 |         for topic in topics:
131 |             yield topic, cls.parse(topic)
132 |
133 |     @classmethod
134 |     def find(
135 |         cls,
136 |         course_topics: Iterable[CourseTopic],
137 |         *,
138 |         id: str | None = None,
139 |         title: str | None = None,
140 |     ) -> Iterable[Self]:
141 |         """
142 |         Yields the CourseSubTopics of the first CourseTopic matching id or title.
143 |
144 |         The given CourseTopics are searched in order; the subtopics of the first
145 |         topic whose id or title matches are parsed and yielded.
146 |
147 |         Args:
148 |             course_topics: Iterable of CourseTopic instances to search through.
149 |             title: Optional title of the topic to match.
150 |             id: Optional id of the topic to match.
151 |
152 |         Yields:
153 |             CourseSubTopic instances belonging to the matching topic.
154 |
155 |         Raises:
156 |             ValueError: If both title and id are None.
157 |             ValueError: If no matching topic is found.
158 |         """
159 |         if id is None and title is None:
160 |             raise ValueError("At least one of 'id' or 'title' must be provided.")
161 |
162 |         for topic in course_topics:
163 |             if topic.id == id or topic.title == title:
164 |                 yield from cls.parse(topic)
165 |                 break
166 |         else:
167 |             raise ValueError(f"No topic found matching id={id} or title={title}")
168 |
169 |
170 | @dataclass(kw_only=True)
171 | class CourseVideoResource:
172 |     id: str
173 |     topicId: str
174 |     title: str
175 |     totalTime: str
176 |     description: str = field(repr=False)
177 |     isDescriptionHtml: bool = field(repr=False)
178 |
179 |     @classmethod
180 |     def fetch(cls, client: httpx.Client, sub_topic: CourseSubTopic) -> Self:
181 |         if sub_topic.type != "video":
182 |             raise ValueError(f"sub_topic is not a video resource, got {sub_topic.type}")
183 |
184 |         response = fetch_sub_topic_resource(
185 |             client=client,
186 |             sub_topic_id=sub_topic.id,
187 |             resource_type="video",
188 |         )
189 |         try:
190 |             data = json.loads(response)
191 |             data = data["spayee:resource"]
192 |         except json.JSONDecodeError as e:
193 |             raise ValueError("Response could not be parsed as JSON.") from e
194 |         except KeyError as e:
195 |             raise ValueError("Bad response or missing required fields.") from e
196 |         return cls(
197 |             id=sub_topic.id,
198 |             topicId=sub_topic.topicId,
199 |             title=data["spayee:title"],
200 |             description=data["spayee:description"],
201 |             totalTime=data["spayee:totalTime"],
202 |             isDescriptionHtml=data["spayee:isDescriptionHtml"],
203 |         )
204 |
205 |
206 | @dataclass(kw_only=True)
207 | class CourseAssignmentResource:
208 |     id: str
209 |     topicId: str
210 |     title: str
211 |     assignmentLink: str = field(repr=False)
212 |
213 |     @classmethod
214 |     def fetch(cls, client: httpx.Client, sub_topic: CourseSubTopic) -> Self:
215 |         if sub_topic.type != "assignment":
216 |             raise ValueError(
217 |                 f"sub_topic is not an assignment resource, got {sub_topic.type}"
218 |             )
219 |
220 |         response = fetch_sub_topic_resource(
221 |             client=client,
222 |             sub_topic_id=sub_topic.id,
223 |             resource_type="assignment",
224 |         )
225 |
226 |         def parse_assignment_link(source: str | bytes) -> str:
227 |             soup = BeautifulSoup(source, "html.parser")
228 |             link_tag = soup.select_one("#instructions a")
229 |             if link_tag:
230 |                 return link_tag.get_attribute_list("href", "")[0]
231 |             raise ValueError("assignmentLink tag not found in source")
232 |
233 |         return cls(
234 |             id=sub_topic.id,
235 |             topicId=sub_topic.topicId,
236 |             title=sub_topic.title,
237 |             assignmentLink=parse_assignment_link(response),
238 |         )
239 |
240 |
241 | if __name__ == "__main__":
242 |     from pathlib import Path
243 |
244 |     import httpx
245 |     from rich import print
246 |
247 |     # campusx.html contains the html content of the website
248 |     course_topic_tag = CourseTopic.search(Path("campusx.html"))
249 |     course_topics = list(CourseTopic.parse(course_topic_tag))
250 |     print(course_topics[-10:])
251 |
252 |     sub_topics = list(CourseSubTopic.find(course_topics, id="olh5gfqpjt"))
253 |     print(sub_topics)
254 |
255 |     # Fill cookies from browser's network tab
256 |     cookies = {
257 |         "c_ujwt": "Your Token",
258 |         "SESSIONID": "Your current SESSION ID.",
259 |     }
260 |
261 |     results: list[CourseVideoResource] = []
262 |     with httpx.Client(
263 |         base_url=BASE_RESOURCE_URL, headers=BASE_HEADERS, cookies=cookies
264 |     ) as client:
265 |         for i, sub_topic in enumerate(sub_topics, 1):
266 |             if sub_topic.type != "video":
267 |                 print(f"subtopic id={sub_topic.id} is not a video resource.")
268 |                 continue
269 |             if i % 7 == 0:
270 |                 print("sleeping for 3 seconds...")
271 |                 time.sleep(3)
272 |             results.append(CourseVideoResource.fetch(client, sub_topic))
273 |     if not results:
274 |         raise ValueError("No video resources found.")
275 |     print(results)
276 |
--------------------------------------------------------------------------------
/SGD Regressor/notebook/my-SGD.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 28,
6 | "id": "60e43fd7",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import numpy as np\n",
11 | "import matplotlib.pyplot as plt\n",
12 | "\n",
13 | "from sklearn.linear_model import LinearRegression, SGDRegressor\n",
14 | "from sklearn.metrics import r2_score\n",
15 | "from sklearn.model_selection import train_test_split\n",
16 | "from sklearn.datasets import make_regression"
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 33,
22 | "id": "a90f5266",
23 | "metadata": {},
24 | "outputs": [
25 | {
26 | "name": "stdout",
27 | "output_type": "stream",
28 | "text": [
29 | "(100, 1) (100,)\n"
30 | ]
31 | },
32 | {
33 | "data": {
34 | "image/png": "",
35 | "text/plain": [
36 | ""
37 | ]
38 | },
39 | "metadata": {},
40 | "output_type": "display_data"
41 | }
42 | ],
43 | "source": [
44 | "random_state = 2\n",
45 | "X, y, *_ = make_regression(\n",
46 | " n_samples=100, n_features=1, n_informative=1, n_targets=1, noise=20, random_state=random_state\n",
47 | ")\n",
48 | "print(X.shape, y.shape)\n",
49 | "\n",
50 | "plt.scatter(X, y)\n",
51 | "plt.show()"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 34,
57 | "id": "f6f4edcf",
58 | "metadata": {},
59 | "outputs": [
60 | {
61 | "data": {
62 | "text/plain": [
63 | "((80, 1), (20, 1))"
64 | ]
65 | },
66 | "execution_count": 34,
67 | "metadata": {},
68 | "output_type": "execute_result"
69 | }
70 | ],
71 | "source": [
72 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)\n",
73 | "\n",
74 | "X_train.shape, X_test.shape"
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "id": "b692b6aa",
80 | "metadata": {},
81 | "source": [
82 | "\n",
83 | "---\n",
84 | "\n",
85 | "# LinearRegression"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 35,
91 | "id": "eb4d3f4e",
92 | "metadata": {},
93 | "outputs": [
94 | {
95 | "data": {
96 | "text/html": [
97 | "
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.