├── .gitignore ├── 01_Fundamentals ├── 16_regex.py ├── 1_fundamentals.py └── README.md ├── 02_Statistics ├── 2_descriptive-statistics.py └── README.md ├── 03_Programming ├── 15_csv-example.csv ├── 15_reading-csv.py ├── 1_python-basics.py ├── 21_install-pkgs.py ├── 4_r_basics.R └── README.md ├── 04_Machine-Learning ├── 01_Machine_Learning_Basics.ipynb ├── 04_Supervised_Machine_Learning.ipynb ├── 05_Supervised_Learning_Algorithms │ ├── 03. Support Vector Machine (SVM).ipynb │ ├── 04. Decision Trees.ipynb │ ├── 05. Random Forest.ipynb │ └── 06. Naive Bayes Classifier.ipynb ├── 22_perceptron.py ├── Algorithms │ ├── 01. Linear Regression.ipynb │ ├── 02. Logistic Regression.ipynb │ └── images │ │ ├── CF.png │ │ ├── CFG.gif │ │ ├── CH.png │ │ ├── DT.png │ │ ├── GD.jpg │ │ ├── GDD.jpg │ │ ├── K.png │ │ ├── LR.jpg │ │ ├── LR.png │ │ ├── NN.png │ │ └── SVM.jpg └── README.md ├── 05_Text-Mining-NLP └── README.md ├── 06_Data-Visualization ├── 1_data-exploration.R ├── 4_histogram-pie.R └── README.md ├── 07_Big-Data └── README.md ├── 08_Data-Ingestion └── README.md ├── 09_Data-Munging └── README.md ├── 10_Toolbox └── README.md ├── LICENCE.txt ├── README.md ├── poetry.lock └── pyproject.toml /.gitignore: -------------------------------------------------------------------------------- 1 | .vscode 2 | -------------------------------------------------------------------------------- /01_Fundamentals/16_regex.py: -------------------------------------------------------------------------------- 1 | # import re library 2 | import re 3 | 4 | # Text coming from Python module __re__ 5 | text = "This module provides regular expression matching operations similar to those found in Perl." 6 | 7 | # Substitution of "Perl" by "every languages" 8 | new_text = re.sub("Perl", "every languages", text) 9 | print(new_text) 10 | 11 | # Searching for capitals letters in the text 12 | new_text = re.findall("[A-Z]", text) 13 | print(new_text) 14 | 15 | # Test if a word is in the text or not 16 | new_text = re.match(".*regular.*", text) 17 | print(new_text) 18 | -------------------------------------------------------------------------------- /01_Fundamentals/1_fundamentals.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | # Generate a list of 4 list of 5 random numbers each 5 | list_of_lists = [] 6 | # Loop the action 4 times 7 | for i in range(4): 8 | # Generate a list of 5 numbers between 0 and 100 and add this list to [list_of_lists] 9 | list_of_lists.append(np.random.randint(low=0, high=100, size=5)) 10 | 11 | # Convert list_of_lists into numpy matrix 12 | matrix = np.matrix(list_of_lists) 13 | print("Here is your matrix:\n{}\n".format(matrix)) 14 | 15 | # Addition 16 | new_matrix = np.sum([matrix, 5]) 17 | print("Here is your matrix with addition +5:\n{}\n".format(new_matrix)) 18 | 19 | # Multiplication 20 | new_matrix = np.multiply(matrix, matrix) 21 | print("Here is your matrix multiplied by itself:\n{}\n".format(new_matrix)) 22 | 23 | # Transposition 24 | new_matrix = np.transpose(matrix) 25 | print("Here is your matrix transposed:\n{}\n".format(new_matrix)) 26 | -------------------------------------------------------------------------------- /01_Fundamentals/README.md: -------------------------------------------------------------------------------- 1 | # 1_ Fundamentals 2 | 3 | ## 1_ Matrices & Algebra fundamentals 4 | 5 | ### About 6 | 7 | In mathematics, a matrix is a __rectangular array of numbers, symbols, or expressions, arranged in rows and 
columns__. A matrix could be reduced as a submatrix of a matrix by deleting any collection of rows and/or columns. 8 | 9 | ![matrix-image](https://upload.wikimedia.org/wikipedia/commons/b/bb/Matrix.svg) 10 | 11 | ### Operations 12 | 13 | There are a number of basic operations that can be applied to modify matrices: 14 | 15 | * [Addition](https://en.wikipedia.org/wiki/Matrix_addition) 16 | * [Scalar Multiplication](https://en.wikipedia.org/wiki/Scalar_multiplication) 17 | * [Transposition](https://en.wikipedia.org/wiki/Transpose) 18 | * [Multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication) 19 | 20 | ## 2_ Hash function, binary tree, O(n) 21 | 22 | ### Hash function 23 | 24 | #### Definition 25 | 26 | A hash function is __any function that can be used to map data of arbitrary size to data of fixed size__. One use is a data structure called a hash table, widely used in computer software for rapid data lookup. Hash functions accelerate table or database lookup by detecting duplicated records in a large file. 27 | 28 | ![hash-image](https://upload.wikimedia.org/wikipedia/commons/5/58/Hash_table_4_1_1_0_0_1_0_LL.svg) 29 | 30 | ### Binary tree 31 | 32 | #### Definition 33 | 34 | In computer science, a binary tree is __a tree data structure in which each node has at most two children__, which are referred to as the left child and the right child. 35 | 36 | ![binary-tree-image](https://upload.wikimedia.org/wikipedia/commons/f/f7/Binary_tree.svg) 37 | 38 | ### O(n) 39 | 40 | #### Definition 41 | 42 | In computer science, big O notation is used to __classify algorithms according to how their running time or space requirements grow as the input size grows__. In analytic number theory, big O notation is often used to __express a bound on the difference between an arithmetical function and a better understood approximation__. 43 | 44 | ## 3_ Relational algebra, DB basics 45 | 46 | ### Definition 47 | 48 | Relational algebra is a family of algebras with a __well-founded semantics used for modelling the data stored in relational databases__, and defining queries on it. 49 | 50 | The main application of relational algebra is providing a theoretical foundation for __relational databases__, particularly query languages for such databases, chief among which is SQL. 51 | 52 | ### Natural join 53 | 54 | #### About 55 | 56 | In SQL language, a natural junction between two tables will be done if : 57 | 58 | * At least one column has the same name in both tables 59 | * Theses two columns have the same data type 60 | * CHAR (character) 61 | * INT (integer) 62 | * FLOAT (floating point numeric data) 63 | * VARCHAR (long character chain) 64 | 65 | #### mySQL request 66 | 67 | SELECT 68 | FROM 69 | NATURAL JOIN 70 | 71 | SELECT 72 | FROM , 73 | WHERE TABLE_1.ID = TABLE_2.ID 74 | 75 | ## 4_ Inner, Outer, Cross, theta-join 76 | 77 | ### Inner join 78 | 79 | The INNER JOIN keyword selects records that have matching values in both tables. 80 | 81 | #### Request 82 | 83 | SELECT column_name(s) 84 | FROM table1 85 | INNER JOIN table2 ON table1.column_name = table2.column_name; 86 | 87 | ![inner-join-image](https://www.w3schools.com/sql/img_innerjoin.gif) 88 | 89 | ### Outer join 90 | 91 | The FULL OUTER JOIN keyword return all records when there is a match in either left (table1) or right (table2) table records. 
92 | 93 | #### Request 94 | 95 | SELECT column_name(s) 96 | FROM table1 97 | FULL OUTER JOIN table2 ON table1.column_name = table2.column_name; 98 | 99 | ![outer-join-image](https://www.w3schools.com/sql/img_fulljoin.gif) 100 | 101 | ### Left join 102 | 103 | The LEFT JOIN keyword returns all records from the left table (table1), and the matched records from the right table (table2). The result is NULL from the right side, if there is no match. 104 | 105 | #### Request 106 | 107 | SELECT column_name(s) 108 | FROM table1 109 | LEFT JOIN table2 ON table1.column_name = table2.column_name; 110 | 111 | ![left-join-image](https://www.w3schools.com/sql/img_leftjoin.gif) 112 | 113 | ### Right join 114 | 115 | The RIGHT JOIN keyword returns all records from the right table (table2), and the matched records from the left table (table1). The result is NULL from the left side, when there is no match. 116 | 117 | #### Request 118 | 119 | SELECT column_name(s) 120 | FROM table1 121 | RIGHT JOIN table2 ON table1.column_name = table2.column_name; 122 | 123 | ![left-join-image](https://www.w3schools.com/sql/img_rightjoin.gif) 124 | 125 | ## 5_ CAP theorem 126 | 127 | It is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: 128 | 129 | * Every read receives the most recent write or an error. 130 | * Every request receives a (non-error) response – without guarantee that it contains the most recent write. 131 | * The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes. 132 | 133 | In other words, the CAP Theorem states that in the presence of a network partition, one has to choose between consistency and availability. Note that consistency as defined in the CAP Theorem is quite different from the consistency guaranteed in ACID database transactions. 134 | 135 | ## 6_ Tabular data 136 | 137 | Tabular data are __opposed to relational__ data, like SQL database. 138 | 139 | In tabular data, __everything is arranged in columns and rows__. Every row have the same number of column (except for missing value, which could be substituted by "N/A". 140 | 141 | The __first line__ of tabular data is most of the time a __header__, describing the content of each column. 142 | 143 | The most used format of tabular data in data science is __CSV___. Every column is surrounded by a character (a tabulation, a coma ..), delimiting this column from its two neighbours. 144 | 145 | ## 7_ Entropy 146 | 147 | Entropy is a __measure of uncertainty__. High entropy means the data has high variance and thus contains a lot of information and/or noise. 148 | 149 | For instance, __a constant function where f(x) = 4 for all x has no entropy and is easily predictable__, has little information, has no noise and can be succinctly represented . Similarly, f(x) = ~4 has some entropy while f(x) = random number is very high entropy due to noise. 150 | 151 | ## 8_ Data frames & series 152 | 153 | A data frame is used for storing data tables. It is a list of vectors of equal length. 154 | 155 | A series is a series of data points ordered. 156 | 157 | ## 9_ Sharding 158 | 159 | *Sharding* is __horizontal(row wise) database partitioning__ as opposed to __vertical(column wise) partitioning__ which is *Normalization* 160 | 161 | Why use Sharding? 162 | 163 | 1. Database systems with large data sets or high throughput applications can challenge the capacity of a single server. 164 | 2. 
Two methods to address the growth : Vertical Scaling and Horizontal Scaling 165 | 3. Vertical Scaling 166 | 167 | * Involves increasing the capacity of a single server 168 | * But due to technological and economical restrictions, a single machine may not be sufficient for the given workload. 169 | 170 | 4. Horizontal Scaling 171 | * Involves dividing the dataset and load over multiple servers, adding additional servers to increase capacity as required 172 | * While the overall speed or capacity of a single machine may not be high, each machine handles a subset of the overall workload, potentially providing better efficiency than a single high-speed high-capacity server. 173 | * Idea is to use concepts of Distributed systems to achieve scale 174 | * But it comes with same tradeoffs of increased complexity that comes hand in hand with distributed systems. 175 | * Many Database systems provide Horizontal scaling via Sharding the datasets. 176 | 177 | ## 10_ OLAP 178 | 179 | Online analytical processing, or OLAP, is an approach to answering multi-dimensional analytical (MDA) queries swiftly in computing. 180 | 181 | OLAP is part of the __broader category of business intelligence__, which also encompasses relational database, report writing and data mining. Typical applications of OLAP include ___business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications coming up, such as agriculture__. 182 | 183 | The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP). 184 | 185 | ## 11_ Multidimensional Data model 186 | 187 | ## 12_ ETL 188 | 189 | * Extract 190 | * extracting the data from the multiple heterogenous source system(s) 191 | * data validation to confirm whether the data pulled has the correct/expected values in a given domain 192 | 193 | * Transform 194 | * extracted data is fed into a pipeline which applies multiple functions on top of data 195 | * these functions intend to convert the data into the format which is accepted by the end system 196 | * involves cleaning the data to remove noise, anamolies and redudant data 197 | * Load 198 | * loads the transformed data into the end target 199 | 200 | ## 13_ Reporting vs BI vs Analytics 201 | 202 | ## 14_ JSON and XML 203 | 204 | ### JSON 205 | 206 | JSON is a language-independent data format. Example describing a person: 207 | 208 | { 209 | "firstName": "John", 210 | "lastName": "Smith", 211 | "isAlive": true, 212 | "age": 25, 213 | "address": { 214 | "streetAddress": "21 2nd Street", 215 | "city": "New York", 216 | "state": "NY", 217 | "postalCode": "10021-3100" 218 | }, 219 | "phoneNumbers": [ 220 | { 221 | "type": "home", 222 | "number": "212 555-1234" 223 | }, 224 | { 225 | "type": "office", 226 | "number": "646 555-4567" 227 | }, 228 | { 229 | "type": "mobile", 230 | "number": "123 456-7890" 231 | } 232 | ], 233 | "children": [], 234 | "spouse": null 235 | } 236 | 237 | ## XML 238 | 239 | Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. 
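In Python, an XML document like the plant-catalog sample below can be read with the standard library's `xml.etree.ElementTree` module. This is a minimal sketch only; the element names (`CATALOG`, `PLANT`, `COMMON`, `BOTANICAL`, `PRICE`) are assumed from the sample.

```
import xml.etree.ElementTree as ET

# A small, well-formed XML document (element names assumed from the sample below)
xml_data = """
<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <PRICE>$2.44</PRICE>
  </PLANT>
  <PLANT>
    <COMMON>Columbine</COMMON>
    <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <PRICE>$9.37</PRICE>
  </PLANT>
</CATALOG>
"""

# Parse the string into an element tree and iterate over the PLANT elements
root = ET.fromstring(xml_data)
for plant in root.findall("PLANT"):
    print(plant.find("COMMON").text, plant.find("PRICE").text)
```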
240 | 241 | 242 | 243 | Bloodroot 244 | Sanguinaria canadensis 245 | 4 246 | Mostly Shady 247 | $2.44 248 | 031599 249 | 250 | 251 | Columbine 252 | Aquilegia canadensis 253 | 3 254 | Mostly Shady 255 | $9.37 256 | 030699 257 | 258 | 259 | Marsh Marigold 260 | Caltha palustris 261 | 4 262 | Mostly Sunny 263 | $6.81 264 | 051799 265 | 266 | 267 | 268 | ## 15_ NoSQL 269 | 270 | noSQL is oppsed to relationnal databases (stand for __N__ot __O__nly __SQL__). Data are not structured and there's no notion of keys between tables. 271 | 272 | Any kind of data can be stored in a noSQL database (JSON, CSV, ...) whithout thinking about a complex relationnal scheme. 273 | 274 | __Commonly used noSQL stacks__: Cassandra, MongoDB, Redis, Oracle noSQL ... 275 | 276 | ## 16_ Regex 277 | 278 | ### About 279 | 280 | __Reg__ ular __ex__ pressions (__regex__) are commonly used in informatics. 281 | 282 | It can be used in a wide range of possibilities : 283 | 284 | * Text replacing 285 | * Extract information in a text (email, phone number, etc) 286 | * List files with the .txt extension .. 287 | 288 | is a good website for experimenting on Regex. Additionally, is another regex tester for Python, with a built-in regex visualizer. 289 | 290 | ### Utilisation 291 | 292 | To use them in [Python](https://docs.python.org/3/library/re.html), just import: 293 | 294 | import re 295 | 296 | ## 17_ Vendor landscape 297 | 298 | ## 18_ Env Setup 299 | 300 | ### What is Python Virtual Environment? 301 | 302 | A Python Virtual Environment is an isolated space where you can work on your Python projects, separately from your system-installed Python. This is one of the most important tools that most Python developers use. 303 | 304 | ### Why Use Virtual Environments? 305 | - Avoids dependency conflicts. 306 | - Allows working on multiple projects with different dependencies. 307 | - Keeps system Python clean and unmodified. 308 | 309 | ### Create a Virtual Environment 310 | Virtual environments are created by executing the `venv` module. 311 | 312 | On Linux: 313 | ``` 314 | python3 -m venv myenv 315 | ``` 316 | On Windows: 317 | ``` 318 | python -m venv myenv 319 | ``` 320 | This creates a folder named `myenv`, which contains the virtual environment. You can name the folder to anything you like. 321 | 322 | ### Activate the Virtual Environment 323 | On Linux: 324 | ``` 325 | source myenv/bin/activate 326 | ``` 327 | On Windows: 328 | - Command Prompt (cmd): 329 | ``` 330 | myenv\Scripts\activate 331 | ``` 332 | - PowerShell: 333 | ``` 334 | myenv\Scripts\Activate.ps1 335 | ``` 336 | If you get a security error, run this command first: 337 | ``` 338 | Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser 339 | ``` 340 | 341 | Now you can install required packages in this Virtual Environment using `pip`. 342 | 343 | ### Deactivate the Virtual Environment 344 | When you’re done, deactivate the virtual environment by running: 345 | ``` 346 | deactivate 347 | ``` 348 | ### Reactivating the Virtual Environment Later 349 | To use the environment again, navigate to the project folder and use the same command that is used to activate the virtual environment. 
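Putting the virtual-environment workflow together, a typical session on Linux might look like the following (the package names are only examples):

```
python3 -m venv myenv
source myenv/bin/activate
pip install numpy pandas          # install packages inside the environment
pip freeze > requirements.txt     # record the exact versions used
deactivate
```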
350 | -------------------------------------------------------------------------------- /02_Statistics/2_descriptive-statistics.py: -------------------------------------------------------------------------------- 1 | # Import 2 | import numpy as np 3 | 4 | # Create a dataset 5 | dataset = [12, 52, 45, 65, 78, 11, 12, 54, 56] 6 | 7 | # Apply mean/median functions 8 | dataset_mean = np.mean(dataset) 9 | dataset_median = np.median(dataset) 10 | 11 | # Print results 12 | print(f"Mean: {dataset_mean}, median: {dataset_median}") 13 | -------------------------------------------------------------------------------- /02_Statistics/README.md: -------------------------------------------------------------------------------- 1 | # 2_ Statistics 2 | 3 | [Statistics-101 for data noobs](https://medium.com/@debuggermalhotra/statistics-101-for-data-noobs-2e2a0e23a5dc) 4 | 5 | ## 1_ Pick a dataset 6 | 7 | ### Datasets repositories 8 | 9 | #### Generalists 10 | 11 | - [KAGGLE](https://www.kaggle.com/datasets) 12 | - [Google](https://toolbox.google.com/datasetsearch) 13 | 14 | #### Medical 15 | 16 | - [PMC](https://www.ncbi.nlm.nih.gov/pmc/) 17 | 18 | #### Other languages 19 | 20 | ##### French 21 | 22 | - [DATAGOUV](https://www.data.gouv.fr/fr/) 23 | 24 | ## 2_ Descriptive statistics 25 | 26 | ### Mean 27 | 28 | In probability and statistics, population mean and expected value are used synonymously to refer to one __measure of the central tendency either of a probability distribution or of the random variable__ characterized by that distribution. 29 | 30 | For a data set, the terms arithmetic mean, mathematical expectation, and sometimes average are used synonymously to refer to a central value of a discrete set of numbers: specifically, the __sum of the values divided by the number of values__. 31 | 32 | ![mean_formula](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd2f5fb530fc192e4db7a315777f5bbb5d462c90) 33 | 34 | ### Median 35 | 36 | The median is the value __separating the higher half of a data sample, a population, or a probability distribution, from the lower half__. In simple terms, it may be thought of as the "middle" value of a data set. 37 | 38 | ### Descriptive statistics in Python 39 | 40 | [Numpy](http://www.numpy.org/) is a python library widely used for statistical analysis. 41 | 42 | #### Installation 43 | 44 | pip3 install numpy 45 | 46 | #### Utilization 47 | 48 | import numpy as np 49 | 50 | #### Averages and variances using numpy 51 | 52 | | Code | Return | 53 | |--------------------------------------------------------------------------------------|-----------------------------------------| 54 | |`np.median(a, axis=None, out=None, overwrite_input=False, keepdims=False)` | Compute the median along the specified axis | 55 | |`np.mean(a, axis=None, dtype=None, out=None, keepdims=, *, where=)` | Compute the arithmetic mean along the specified axis | 56 | |`np.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=, *, where=)` | Compute the standard deviation along the specified axis. 
| 57 | |`np.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=, *, where=)` | Compute the variance along the specified axis | 58 | 59 | #### Code Example 60 | 61 | input 62 | 63 | ``` 64 | import numpy as np #import the numpy package 65 | a = np.array([1,2,3,4,5,6,7,8,9]) #Create a numpy array 66 | print ('median = ' , np.median(a) ) #Calculate the median of the array 67 | print ('mean = ' , np.mean (a)) #Calculate the mean of the array 68 | print ('standard deviation = ' , np.std(a) ) #Calculate the standarddeviation of the array 69 | print ('variance = ' , np.var (a) ) #Calculate the variance of the array 70 | ``` 71 | 72 | output 73 | 74 | ``` 75 | median = 5.0 76 | mean = 5.0 77 | standard deviation = 2.581988897471611 78 | variance = 6.666666666666667 79 | ``` 80 | 81 | you can found more [here](https://numpy.org/doc/stable/reference/routines.statistics.html) on how to apply the Descriptive statistics in Python using numpy package. 82 | 83 | ## 3_ Exploratory data analysis 84 | 85 | The step includes visualization and analysis of data. 86 | 87 | Raw data may possess improper distributions of data which may lead to issues moving forward. 88 | 89 | Again, during applications we must also know the distribution of data, for instance, the fact whether the data is linear or spirally distributed. 90 | 91 | [Guide to EDA in Python](https://towardsdatascience.com/data-preprocessing-and-interpreting-results-the-heart-of-machine-learning-part-1-eda-49ce99e36655) 92 | 93 | ##### Libraries in Python 94 | 95 | [Matplotlib](https://matplotlib.org/) 96 | 97 | Library used to plot graphs in Python 98 | 99 | __Installation__: 100 | 101 | pip3 install matplotlib 102 | 103 | __Utilization__: 104 | 105 | import matplotlib.pyplot as plt 106 | 107 | [Pandas](https://pandas.pydata.org/) 108 | 109 | Library used to large datasets in python 110 | 111 | __Installation__: 112 | 113 | pip3 install pandas 114 | 115 | __Utilization__: 116 | 117 | import pandas as pd 118 | 119 | [Seaborn](https://seaborn.pydata.org/) 120 | 121 | Yet another Graph Plotting Library in Python. 122 | 123 | __Installation__: 124 | 125 | pip3 install seaborn 126 | 127 | __Utilization__: 128 | 129 | import seaborn as sns 130 | 131 | #### PCA 132 | 133 | PCA stands for principle component analysis. 134 | 135 | We often require to shape of the data distribution as we have seen previously. We need to plot the data for the same. 136 | 137 | Data can be Multidimensional, that is, a dataset can have multiple features. 138 | 139 | We can plot only two dimensional data, so, for multidimensional data, we project the multidimensional distribution in two dimensions, preserving the principle components of the distribution, in order to get an idea of the actual distribution through the 2D plot. 140 | 141 | It is used for dimensionality reduction also. Often it is seen that several features do not significantly contribute any important insight to the data distribution. Such features creates complexity and increase dimensionality of the data. Such features are not considered which results in decrease of the dimensionality of the data. 
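As a quick illustration, the sketch below projects the 4-dimensional Iris dataset onto its first two principal components. It assumes scikit-learn is available (it is not part of the code in this repository section):

```
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load a small 4-feature dataset
X = load_iris().data

# Project onto the two principal components that capture the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                      # (150, 2) -> ready for a 2D scatter plot
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```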
142 | 143 | [Mathematical Explanation](https://medium.com/towards-artificial-intelligence/demystifying-principal-component-analysis-9f13f6f681e6) 144 | 145 | [Application in Python](https://towardsdatascience.com/data-preprocessing-and-interpreting-results-the-heart-of-machine-learning-part-2-pca-feature-92f8f6ec8c8) 146 | 147 | ## 4_ Histograms 148 | 149 | Histograms are representation of distribution of numerical data. The procedure consists of binnng the numeric values using range divisions i.e, the entire range in which the data varies is split into several fixed intervals. Count or frequency of occurences of the numbers in the range of the bins are represented. 150 | 151 | [Histograms](https://en.wikipedia.org/wiki/Histogram) 152 | 153 | ![plot](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/Example_histogram.png/220px-Example_histogram.png) 154 | 155 | In python, __Pandas__,__Matplotlib__,__Seaborn__ can be used to create Histograms. 156 | 157 | ## 5_ Percentiles & outliers 158 | 159 | ### Percentiles 160 | 161 | Percentiles are numberical measures in statistics, which represents how much or what percentage of data falls below a given number or instance in a numerical data distribution. 162 | 163 | For instance, if we say 70 percentile, it represents, 70% of the data in the ditribution are below the given numerical value. 164 | 165 | [Percentiles](https://en.wikipedia.org/wiki/Percentile) 166 | 167 | ### Outliers 168 | 169 | Outliers are data points(numerical) which have significant differences with other data points. They differ from majority of points in the distribution. Such points may cause the central measures of distribution, like mean, and median. So, they need to be detected and removed. 170 | 171 | [Outliers](https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm) 172 | 173 | __Box Plots__ can be used detect Outliers in the data. They can be created using __Seaborn__ library 174 | 175 | ![Image_Box_Plot](https://miro.medium.com/max/612/1*105IeKBRGtyPyMy3-WQ8hw.png) 176 | 177 | ## 6_ Probability theory 178 | 179 | __Probability__ is the likelihood of an event in a Random experiment. For instance, if a coin is tossed, the chance of getting a head is 50% so, probability is 0.5. 180 | 181 | __Sample Space__: It is the set of all possible outcomes of a Random Experiment. 182 | __Favourable Outcomes__: The set of outcomes we are looking for in a Random Experiment 183 | 184 | __Probability = (Number of Favourable Outcomes) / (Sample Space)__ 185 | 186 | __Probability theory__ is a branch of mathematics that is associated with the concept of probability. 187 | 188 | [Basics of Probability](https://towardsdatascience.com/basic-probability-theory-and-statistics-3105ab637213) 189 | 190 | ## 7_ Bayes theorem 191 | 192 | ### Conditional Probability 193 | 194 | It is the probability of one event occurring, given that another event has already occurred. So, it gives a sense of relationship between two events and the probabilities of the occurences of those events. 195 | 196 | It is given by: 197 | 198 | __P( A | B )__ : Probability of occurence of A, after B occured. 199 | 200 | The formula is given by: 201 | 202 | ![formula](https://wikimedia.org/api/rest_v1/media/math/render/svg/74cbddb93db29a62d522cd6ab266531ae295a0fb) 203 | 204 | So, P(A|B) is equal to Probablity of occurence of A and B, divided by Probability of occurence of B. 
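For example, with two fair dice, let B be the event "the sum is 8" and A the event "the first die shows 6". Then P(A and B) = 1/36 (only the roll (6, 2) satisfies both) and P(B) = 5/36, so P(A | B) = (1/36) / (5/36) = 1/5. The same calculation as a quick check in Python:

```
# Worked example: two fair dice, B = "sum is 8", A = "first die shows 6"
p_a_and_b = 1 / 36   # only (6, 2) satisfies both events
p_b = 5 / 36         # (2,6), (3,5), (4,4), (5,3), (6,2)

p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)   # 0.2
```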
205 | 206 | [Guide to Conditional Probability](https://en.wikipedia.org/wiki/Conditional_probability) 207 | 208 | ### Bayes Theorem 209 | 210 | Bayes theorem provides a way to calculate conditional probability. Bayes theorem is widely used in machine learning most in Bayesian Classifiers. 211 | 212 | According to Bayes theorem the probability of A, given that B has already occurred is given by Probability of A multiplied by the probability of B given A has already occurred divided by the probability of B. 213 | 214 | __P(A|B) = P(A).P(B|A) / P(B)__ 215 | 216 | [Guide to Bayes Theorem](https://machinelearningmastery.com/bayes-theorem-for-machine-learning/) 217 | 218 | ## 8_ Random variables 219 | 220 | Random variable are the numeric outcome of an experiment or random events. They are normally a set of values. 221 | 222 | There are two main types of Random Variables: 223 | 224 | __Discrete Random Variables__: Such variables take only a finite number of distinct values 225 | 226 | __Continous Random Variables__: Such variables can take an infinite number of possible values. 227 | 228 | ## 9_ Cumul Dist Fn (CDF) 229 | 230 | In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable __X__, or just distribution function of __X__, evaluated at __x__, is the probability that __X__ will take a value less than or equal to __x__. 231 | 232 | The cumulative distribution function of a real-valued random variable X is the function given by: 233 | 234 | ![CDF](https://wikimedia.org/api/rest_v1/media/math/render/svg/f81c05aba576a12b4e05ee3f4cba709dd16139c7) 235 | 236 | Resource: 237 | 238 | [Wikipedia](https://en.wikipedia.org/wiki/Cumulative_distribution_function) 239 | 240 | ## 10_ Continuous distributions 241 | 242 | A continuous distribution describes the probabilities of the possible values of a continuous random variable. A continuous random variable is a random variable with a set of possible values (known as the range) that is infinite and uncountable. 243 | 244 | ## 11_ Skewness 245 | 246 | Skewness is the measure of assymetry in the data distribution or a random variable distribution about its mean. 247 | 248 | Skewness can be positive, negative or zero. 249 | 250 | ![skewed image](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Negative_and_positive_skew_diagrams_%28English%29.svg/446px-Negative_and_positive_skew_diagrams_%28English%29.svg.png) 251 | 252 | __Negative skew__: Distribution Concentrated in the right, left tail is longer. 253 | 254 | __Positive skew__: Distribution Concentrated in the left, right tail is longer. 255 | 256 | Variation of central tendency measures are shown below. 257 | 258 | ![cet](https://upload.wikimedia.org/wikipedia/commons/thumb/c/cc/Relationship_between_mean_and_median_under_different_skewness.png/434px-Relationship_between_mean_and_median_under_different_skewness.png) 259 | 260 | Data Distribution are often Skewed which may cause trouble during processing the data. __Skewed Distribution can be converted to Symmetric Distribution, taking Log of the distribution__. 261 | 262 | ##### Skew Distribution 263 | 264 | ![Skew](https://miro.medium.com/max/379/1*PLSczKIQRc8ZtlvHED-6mQ.png) 265 | 266 | ##### Log of the Skew Distribution 267 | 268 | ![log](https://miro.medium.com/max/376/1*4GFayBYKIiqAcyI69wIFzA.png) 269 | 270 | [Guide to Skewness](https://en.wikipedia.org/wiki/Skewness) 271 | 272 | ## 12_ ANOVA 273 | 274 | ANOVA stands for __analysis of variance__. 
275 | 276 | It is used to compare among groups of data distributions. 277 | 278 | Often we are provided with huge data. They are too huge to work with. The total data is called the __Population__. 279 | 280 | In order to work with them, we pick random smaller groups of data. They are called __Samples__. 281 | 282 | ANOVA is used to compare the variance among these groups or samples. 283 | 284 | Variance of group is given by: 285 | 286 | ![var](https://miro.medium.com/max/446/1*yzAMFVIEFysMKwuT0YHrZw.png) 287 | 288 | The differences in the collected samples are observed using the differences between the means of the groups. We often use the __t-test__ to compare the means and also to check if the samples belong to the same population, 289 | 290 | Now, t-test can only be possible among two groups. But, often we get more groups or samples. 291 | 292 | If we try to use t-test for more than two groups we have to perform t-tests multiple times, once for each pair. This is where ANOVA is used. 293 | 294 | ANOVA has two components: 295 | 296 | __1.Variation within each group__ 297 | 298 | __2.Variation between groups__ 299 | 300 | It works on a ratio called the __F-Ratio__ 301 | 302 | It is given by: 303 | 304 | ![F-ratio](https://miro.medium.com/max/491/1*I5dSwtUICySQ5xvKmq6M8A.png) 305 | 306 | F ratio shows how much of the total variation comes from the variation between groups and how much comes from the variation within groups. If much of the variation comes from the variation between groups, it is more likely that the mean of groups are different. However, if most of the variation comes from the variation within groups, then we can conclude the elements in a group are different rather than entire groups. The larger the F ratio, the more likely that the groups have different means. 307 | 308 | Resources: 309 | 310 | [Defnition](https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php) 311 | 312 | [GUIDE 1](https://towardsdatascience.com/anova-analysis-of-variance-explained-b48fee6380af) 313 | 314 | [Details](https://medium.com/@StepUpAnalytics/anova-one-way-vs-two-way-6b3ff87d3a94) 315 | 316 | ## 13_ Prob Den Fn (PDF) 317 | 318 | It stands for probability density function. 319 | 320 | __In probability theory, a probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.__ 321 | 322 | The probability density function (PDF) P(x) of a continuous distribution is defined as the derivative of the (cumulative) distribution function D(x). 323 | 324 | It is given by the integral of the function over a given range. 325 | 326 | ![PDF](https://wikimedia.org/api/rest_v1/media/math/render/svg/45fd7691b5fbd323f64834d8e5b8d4f54c73a6f8) 327 | 328 | ## 14_ Central Limit theorem 329 | 330 | ## 15_ Monte Carlo method 331 | 332 | ## 16_ Hypothesis Testing 333 | 334 | ### Types of curves 335 | 336 | We need to know about two distribution curves first. 337 | 338 | Distribution curves reflect the probabilty of finding an instance or a sample of a population at a certain value of the distribution. 339 | 340 | __Normal Distribution__ 341 | 342 | ![normal distribution](https://sciences.usca.edu/biology/zelmer/305/norm/stanorm.jpg) 343 | 344 | The normal distribution represents how the data is distributed. 
In this case, most of the data samples in the distribution are scattered at and around the mean of the distribution. A few instances are scattered or present at the long tail ends of the distribution. 345 | 346 | Few points about Normal Distributions are: 347 | 348 | 1. The curve is always Bell-shaped. This is because most of the data is found around the mean, so the proababilty of finding a sample at the mean or central value is more. 349 | 350 | 2. The curve is symmetric 351 | 352 | 3. The area under the curve is always 1. This is because all the points of the distribution must be present under the curve 353 | 354 | 4. For Normal Distribution, Mean and Median lie on the same line in the distribution. 355 | 356 | __Standard Normal Distribution__ 357 | 358 | This type of distribution are normal distributions which following conditions. 359 | 360 | 1. Mean of the distribution is 0 361 | 362 | 2. The Standard Deviation of the distribution is equal to 1. 363 | 364 | The idea of Hypothesis Testing works completely on the data distributions. 365 | 366 | ### Hypothesis Testing 367 | 368 | Hypothesis testing is a statistical method that is used in making statistical decisions using experimental data. Hypothesis Testing is basically an assumption that we make about the population parameter. 369 | 370 | For example, say, we take the hypothesis that boys in a class are taller than girls. 371 | 372 | The above statement is just an assumption on the population of the class. 373 | 374 | __Hypothesis__ is just an assumptive proposal or statement made on the basis of observations made on a set of information or data. 375 | 376 | We initially propose two mutually exclusive statements based on the population of the sample data. 377 | 378 | The initial one is called __NULL HYPOTHESIS__. It is denoted by H0. 379 | 380 | The second one is called __ALTERNATE HYPOTHESIS__. It is denoted by H1 or Ha. It is used as a contrary to Null Hypothesis. 381 | 382 | Based on the instances of the population we accept or reject the NULL Hypothesis and correspondingly we reject or accept the ALTERNATE Hypothesis. 383 | 384 | #### Level of Significance 385 | 386 | It is the degree which we consider to decide whether to accept or reject the NULL hypothesis. When we consider a hypothesis on a population, it is not the case that 100% or all instances of the population abides the assumption, so we decide a __level of significance as a cutoff degree, i.e, if our level of significance is 5%, and (100-5)% = 95% of the data abides by the assumption, we accept the Hypothesis.__ 387 | 388 | __It is said with 95% confidence, the hypothesis is accepted__ 389 | 390 | ![curve](https://i.stack.imgur.com/d8iHd.png) 391 | 392 | The non-reject region is called __acceptance region or beta region__. The rejection regions are called __critical or alpha regions__. __alpha__ denotes the __level of significance__. 393 | 394 | If level of significance is 5%. the two alpha regions have (2.5+2.5)% of the population and the beta region has the 95%. 395 | 396 | The acceptance and rejection gives rise to two kinds of errors: 397 | 398 | __Type-I Error:__ NULL Hypothesis is true, but wrongly Rejected. 399 | 400 | __Type-II Error:__ NULL Hypothesis if false but is wrongly accepted. 
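As a concrete illustration, the sketch below runs a two-tailed one-sample t-test and compares the p-value against a 5% level of significance. It assumes SciPy is available and uses made-up height data; it is an illustration, not part of the repository's scripts:

```
import numpy as np
from scipy import stats

# Sample of heights (in feet); null hypothesis: the population mean is 6 ft
heights = np.array([5.8, 6.1, 5.9, 6.3, 6.0, 5.7, 6.2, 5.9])

result = stats.ttest_1samp(heights, popmean=6.0)  # two-tailed by default
alpha = 0.05

print(f"t = {result.statistic:.3f}, p-value = {result.pvalue:.3f}")
if result.pvalue < alpha:
    print("Reject the NULL hypothesis")
else:
    print("Fail to reject the NULL hypothesis")
```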
401 | 402 | ![hypothesis](https://microbenotes.com/wp-content/uploads/2020/07/Graphical-representation-of-type-1-and-type-2-errors.jpg) 403 | 404 | ### Tests for Hypothesis 405 | 406 | __One Tailed Test__: 407 | 408 | ![One-tailed](https://prwatech.in/blog/wp-content/uploads/2019/07/onetailtest.png) 409 | 410 | This is a test for Hypothesis, where the rejection region is only one side of the sampling distribution. The rejection region may be in right tail end or in the left tail end. 411 | 412 | The idea is if we say our level of significance is 5% and we consider a hypothesis "Hieght of Boys in a class is <=6 ft". We consider the hypothesis true if atmost 5% of our population are more than 6 feet. So, this will be one-tailed as the test condition only restricts one tail end, the end with hieght > 6ft. 413 | 414 | ![Two Tailed](https://i0.wp.com/www.real-statistics.com/wp-content/uploads/2012/11/two-tailed-significance-testing.png) 415 | 416 | In this case, the rejection region extends at both tail ends of the distribution. 417 | 418 | The idea is if we say our level of significance is 5% and we consider a hypothesis "Hieght of Boys in a class is !=6 ft". 419 | 420 | Here, we can accept the NULL hyposthesis iff atmost 5% of the population is less than or greater than 6 feet. So, it is evident that the crirtical region will be at both tail ends and the region is 5% / 2 = 2.5% at both ends of the distribution. 421 | 422 | ## 17_ p-Value 423 | 424 | Before we jump into P-values we need to look at another important topic in the context: Z-test. 425 | 426 | ### Z-test 427 | 428 | We need to know two terms: __Population and Sample.__ 429 | 430 | __Population__ describes the entire available data distributed. So, it refers to all records provided in the dataset. 431 | 432 | __Sample__ is said to be a group of data points randomly picked from a population or a given distribution. The size of the sample can be any number of data points, given by __sample size.__ 433 | 434 | __Z-test__ is simply used to determine if a given sample distribution belongs to a given population. 435 | 436 | Now,for Z-test we have to use __Standard Normal Form__ for the standardized comparison measures. 437 | 438 | ![std1](https://miro.medium.com/max/700/1*VYCN5b-Zubr4rrc9k37SAg.png) 439 | 440 | As we already have seen, standard normal form is a normal form with mean=0 and standard deviation=1. 441 | 442 | The __Standard Deviation__ is a measure of how much differently the points are distributed around the mean. 443 | 444 | ![std2](https://miro.medium.com/max/640/1*kzFQaZ08dTjlPq1zrcJXgg.png) 445 | 446 | It states that approximately 68% , 95% and 99.7% of the data lies within 1, 2 and 3 standard deviations of a normal distribution respectively. 447 | 448 | Now, to convert the normal distribution to standard normal distribution we need a standard score called Z-Score. 449 | It is given by: 450 | 451 | ![Z-score](https://miro.medium.com/max/125/1*X--kDNyurDEo2zKbSDDf-w.png) 452 | 453 | x = value that we want to standardize 454 | 455 | µ = mean of the distribution of x 456 | 457 | σ = standard deviation of the distribution of x 458 | 459 | We need to know another concept __Central Limit Theorem__. 
460 | 461 | ##### Central Limit Theorem 462 | 463 | _The theorem states that the mean of the sampling distribution of the sample means is equal to the population mean irrespective if the distribution of population where sample size is greater than 30._ 464 | 465 | And 466 | 467 | _The sampling distribution of sampling mean will also follow the normal distribution._ 468 | 469 | So, it states, if we pick several samples from a distribution with the size above 30, and pick the static sample means and use the sample means to create a distribution, the mean of the newly created sampling distribution is equal to the original population mean. 470 | 471 | According to the theorem, if we draw samples of size N, from a population with population mean μ and population standard deviation σ, the condition stands: 472 | 473 | ![std3](https://miro.medium.com/max/121/0*VPW964abYGyevE3h.png) 474 | 475 | i.e, mean of the distribution of sample means is equal to the sample means. 476 | 477 | The standard deviation of the sample means is give by: 478 | 479 | ![std4](https://miro.medium.com/max/220/0*EMx4C_A9Efsd6Ef6.png) 480 | 481 | The above term is also called standard error. 482 | 483 | We use the theory discussed above for Z-test. If the sample mean lies close to the population mean, we say that the sample belongs to the population and if it lies at a distance from the population mean, we say the sample is taken from a different population. 484 | 485 | To do this we use a formula and check if the z statistic is greater than or less than 1.96 (considering two tailed test, level of significance = 5%) 486 | 487 | ![los](https://miro.medium.com/max/424/0*C9XaCIUWoJaBSMeZ.gif) 488 | 489 | ![std5](https://miro.medium.com/max/137/1*DRiPmBtjK4wmidq9Ha440Q.png) 490 | 491 | The above formula gives Z-static 492 | 493 | z = z statistic 494 | 495 | X̄ = sample mean 496 | 497 | μ = population mean 498 | 499 | σ = population standard deviation 500 | 501 | n = sample size 502 | 503 | Now, as the Z-score is used to standardize the distribution, it gives us an idea how the data is distributed overall. 504 | 505 | ### P-values 506 | 507 | It is used to check if the results are statistically significant based on the significance level. 508 | 509 | Say, we perform an experiment and collect observations or data. Now, we make a hypothesis (NULL hypothesis) primary, and a second hypothesis, contradictory to the first one called the alternative hypothesis. 510 | 511 | Then we decide a level of significance which serve as a threshold for our null hypothesis. The P value actually gives the probability of the statement. Say, the p-value of our alternative hypothesis is 0.02, it means the probability of alternate hypothesis happenning is 2%. 512 | 513 | Now, the level of significance into play to decide if we can allow 2% or p-value of 0.02. It can be said as a level of endurance of the null hypothesis. If our level of significance is 5% using a two tailed test, we can allow 2.5% on both ends of the distribution, we accept the NULL hypothesis, as level of significance > p-value of alternate hypothesis. 514 | 515 | But if the p-value is greater than level of significance, we tell that the result is __statistically significant, and we reject NULL hypothesis.__ . 516 | 517 | Resources: 518 | 519 | 1. 520 | 521 | 2. 522 | 523 | 3. 524 | 525 | ## 18_ Chi2 test 526 | 527 | Chi2 test is extensively used in data science and machine learning problems for feature selection. 
528 | 529 | A chi-square test is used in statistics to test the independence of two events. So, it is used to check for independence of features used. Often dependent features are used which do not convey a lot of information but adds dimensionality to a feature space. 530 | 531 | It is one of the most common ways to examine relationships between two or more categorical variables. 532 | 533 | It involves calculating a number, called the chi-square statistic - χ2. Which follows a chi-square distribution. 534 | 535 | It is given as the summation of the difference of the expected values and observed value divided by the observed value. 536 | 537 | ![Chi2](https://miro.medium.com/max/266/1*S8rfFkmLhDbOz4RGNwuz6g.png) 538 | 539 | Resources: 540 | 541 | [Definitions](investopedia.com/terms/c/chi-square-statistic.asp) 542 | 543 | [Guide 1](https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223) 544 | 545 | [Guide 2](https://medium.com/swlh/what-is-chi-square-test-how-does-it-work-3b7f22c03b01) 546 | 547 | [Example of Operation](https://medium.com/@kuldeepnpatel/chi-square-test-of-independence-bafd14028250) 548 | 549 | ## 19_ Estimation 550 | 551 | ## 20_ Confid Int (CI) 552 | 553 | ## 21_ MLE 554 | 555 | ## 22_ Kernel Density estimate 556 | 557 | In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. 558 | 559 | Kernel Density estimate can be regarded as another way to represent the probability distribution. 560 | 561 | ![KDE1](https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/Kernel_density.svg/250px-Kernel_density.svg.png) 562 | 563 | It consists of choosing a kernel function. There are mostly three used. 564 | 565 | 1. Gaussian 566 | 567 | 2. Box 568 | 569 | 3. Tri 570 | 571 | The kernel function depicts the probability of finding a data point. So, it is highest at the centre and decreases as we move away from the point. 572 | 573 | We assign a kernel function over all the data points and finally calculate the density of the functions, to get the density estimate of the distibuted data points. It practically adds up the Kernel function values at a particular point on the axis. It is as shown below. 574 | 575 | ![KDE 2](https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Comparison_of_1D_histogram_and_KDE.png/500px-Comparison_of_1D_histogram_and_KDE.png) 576 | 577 | Now, the kernel function is given by: 578 | 579 | ![kde3](https://wikimedia.org/api/rest_v1/media/math/render/svg/f3b09505158fb06033aabf9b0116c8c07a68bf31) 580 | 581 | where K is the kernel — a non-negative function — and h > 0 is a smoothing parameter called the bandwidth. 582 | 583 | The 'h' or the bandwidth is the parameter, on which the curve varies. 584 | 585 | ![kde4](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Comparison_of_1D_bandwidth_selectors.png/220px-Comparison_of_1D_bandwidth_selectors.png) 586 | 587 | Kernel density estimate (KDE) with different bandwidths of a random sample of 100 points from a standard normal distribution. Grey: true density (standard normal). Red: KDE with h=0.05. Black: KDE with h=0.337. Green: KDE with h=2. 
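The sketch below builds a Gaussian KDE for 100 points drawn from a standard normal distribution and evaluates it on a grid, with the bandwidth controlled through `bw_method`. It assumes SciPy is available:

```
import numpy as np
from scipy.stats import gaussian_kde

# 100 samples from a standard normal distribution
rng = np.random.default_rng(0)
data = rng.normal(size=100)

# Fit Gaussian kernel density estimates with two different bandwidths
kde_narrow = gaussian_kde(data, bw_method=0.1)
kde_wide = gaussian_kde(data, bw_method=0.5)

# Evaluate the estimated densities on a grid of points
xs = np.linspace(-4, 4, 200)
print(kde_narrow(xs)[:5])  # spiky estimate (small bandwidth)
print(kde_wide(xs)[:5])    # smooth estimate (large bandwidth)
```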
588 | 589 | Resources: 590 | 591 | [Basics](https://www.youtube.com/watch?v=x5zLaWT5KPs) 592 | 593 | [Advanced](https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html) 594 | 595 | ## 23_ Regression 596 | 597 | Regression tasks deal with predicting the value of a __dependent variable__ from a set of __independent variables.__ 598 | 599 | Say, we want to predict the price of a car. So, it becomes a dependent variable say Y, and the features like engine capacity, top speed, class, and company become the independent variables, which helps to frame the equation to obtain the price. 600 | 601 | If there is one feature say x. If the dependent variable y is linearly dependent on x, then it can be given by __y=mx+c__, where the m is the coefficient of the independent in the equation, c is the intercept or bias. 602 | 603 | The image shows the types of regression 604 | 605 | ![types](https://miro.medium.com/max/2001/1*dSFn-uIYDhDfdaG5GXlB3A.png) 606 | 607 | [Guide to Regression](https://towardsdatascience.com/a-deep-dive-into-the-concept-of-regression-fb912d427a2e) 608 | 609 | ## 24_ Covariance 610 | 611 | ### Variance 612 | 613 | The variance is a measure of how dispersed or spread out the set is. If it is said that the variance is zero, it means all the elements in the dataset are same. If the variance is low, it means the data are slightly dissimilar. If the variance is very high, it means the data in the dataset are largely dissimilar. 614 | 615 | Mathematically, it is a measure of how far each value in the data set is from the mean. 616 | 617 | Variance (sigma^2) is given by summation of the square of distances of each point from the mean, divided by the number of points 618 | 619 | ![formula var](https://cdn.sciencebuddies.org/Files/474/9/DefVarEqn.jpg) 620 | 621 | ### Covariance 622 | 623 | Covariance gives us an idea about the degree of association between two considered random variables. Now, we know random variables create distributions. Distribution are a set of values or data points which the variable takes and we can easily represent as vectors in the vector space. 624 | 625 | For vectors covariance is defined as the dot product of two vectors. The value of covariance can vary from positive infinity to negative infinity. If the two distributions or vectors grow in the same direction the covariance is positive and vice versa. The Sign gives the direction of variation and the Magnitude gives the amount of variation. 626 | 627 | Covariance is given by: 628 | 629 | ![cov_form](https://cdn.corporatefinanceinstitute.com/assets/covariance1.png) 630 | 631 | where Xi and Yi denotes the i-th point of the two distributions and X-bar and Y-bar represent the mean values of both the distributions, and n represents the number of values or data points in the distribution. 632 | 633 | ## 25_ Correlation 634 | 635 | Covariance measures the total relation of the variables namely both direction and magnitude. Correlation is a scaled measure of covariance. It is dimensionless and independent of scale. It just shows the strength of variation for both the variables. 636 | 637 | Mathematically, if we represent the distribution using vectors, correlation is said to be the cosine angle between the vectors. The value of correlation varies from +1 to -1. +1 is said to be a strong positive correlation and -1 is said to be a strong negative correlation. 0 implies no correlation, or the two variables are independent of each other. 
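Numerically, both covariance and correlation can be computed with NumPy. A minimal sketch with made-up data:

```
import numpy as np

# Two made-up variables that tend to grow together
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

print(np.cov(x, y)[0, 1])       # covariance between x and y
print(np.corrcoef(x, y)[0, 1])  # correlation, scaled to the range [-1, 1]
```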
638 | 639 | Correlation is given by: 640 | 641 | ![corr](https://cdn.corporatefinanceinstitute.com/assets/covariance3.png) 642 | 643 | Where: 644 | 645 | ρ(X,Y) – the correlation between the variables X and Y 646 | 647 | Cov(X,Y) – the covariance between the variables X and Y 648 | 649 | σX – the standard deviation of the X-variable 650 | 651 | σY – the standard deviation of the Y-variable 652 | 653 | Standard deviation is given by square roo of variance. 654 | 655 | ## 26_ Pearson coeff 656 | 657 | ## 27_ Causation 658 | 659 | ## 28_ Least2-fit 660 | 661 | ## 29_ Euclidian Distance 662 | 663 | __Eucladian Distance is the most used and standard measure for the distance between two points.__ 664 | 665 | It is given as the square root of sum of squares of the difference between coordinates of two points. 666 | 667 | __The Euclidean distance between two points in Euclidean space is a number, the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and is occasionally called the Pythagorean distance.__ 668 | 669 | __In the Euclidean plane, let point p have Cartesian coordinates (p_{1},p_{2}) and let point q have coordinates (q_{1},q_{2}). Then the distance between p and q is given by:__ 670 | 671 | ![eucladian](https://wikimedia.org/api/rest_v1/media/math/render/svg/9c0157084fd89f5f3d462efeedc47d3d7aa0b773) 672 | -------------------------------------------------------------------------------- /03_Programming/15_csv-example.csv: -------------------------------------------------------------------------------- 1 | Leaf Green 2 | Lemon Yellow 3 | Cherry Red 4 | Snow White 5 | -------------------------------------------------------------------------------- /03_Programming/15_reading-csv.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | """ Example 1 """ 4 | 5 | # Import 6 | import pandas 7 | 8 | # Open file 9 | data = pandas.read_csv("15_csv-example.csv", delimiter="\t") 10 | # Print data 11 | for index, row in data.iterrows(): 12 | print(f"{row['Name']}'s color is {row['Color']}") 13 | 14 | """ Exemple 2 """ 15 | 16 | # Import 17 | import csv 18 | 19 | # Open file 20 | with open("15_csv-example.csv", encoding="utf-8") as csv_file: 21 | # Read file with csv library 22 | read_csv_file = csv.reader(csv_file, delimiter="\t") 23 | # Parse every row to print it 24 | for row in read_csv_file: 25 | print("'s color is ".join(row)) 26 | -------------------------------------------------------------------------------- /03_Programming/1_python-basics.py: -------------------------------------------------------------------------------- 1 | """ 1_ Python basics """ 2 | 3 | # Print something 4 | print("Hello, world") 5 | 6 | # Assign a variable 7 | a = 23 8 | b = "Hi guys, i'm a text variable" 9 | print(f"This is my variable: {b}") 10 | 11 | # Mathematics 12 | c = (a + 2) * (245 / 23) 13 | print(f"This is mathe-magic: {c}") 14 | -------------------------------------------------------------------------------- /03_Programming/21_install-pkgs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | 3 | # Import pip 4 | import pip 5 | 6 | 7 | # Build an installation function 8 | def install(package): 9 | pip.main(["install", package]) 10 | 11 | 12 | # Execution 13 | install("fooBAR") 14 | -------------------------------------------------------------------------------- /03_Programming/4_r_basics.R: 
-------------------------------------------------------------------------------- 1 | ## PACKAGES 2 | 3 | # To install the package "foobar" 4 | install.packages("foobar") 5 | # And load it 6 | library(foobar) 7 | # Package documentation 8 | ?foobar 9 | # or 10 | help("foobar") 11 | 12 | ## DATASET 13 | 14 | # Load in environment 15 | data(iris) 16 | # Import data already loaded in R into the variable "data" 17 | data <- iris 18 | # The same as 19 | data = iris 20 | # To read a CSV 21 | data <- read.csv('path/to/the/file', sep = ',') 22 | # description 23 | str(iris) 24 | # statistical summary 25 | summary(iris) 26 | # type of object 27 | class(iris) # data frame 28 | # names of variables (columns) 29 | names(iris) 30 | # num rows 31 | nrow(iris) 32 | # num columns 33 | ncol(iris) 34 | # dimension 35 | dim(iris) 36 | # Select the 2nd column 37 | data[,2] 38 | # And the 3rd row 39 | data[3,] 40 | # Mean of the 2nd column 41 | mean(data[,2]) 42 | # Histogram of the 3rd column 43 | hist(data[,3]) 44 | 45 | ## DATA WRANGLING 46 | 47 | # access variables of data frame 48 | iris$Sepal.Length 49 | iris$Species 50 | # or 51 | iris["Species"] 52 | # row subsetting w.r.t column 53 | iris[, "Petal.Width"] 54 | # column subsetting w.r.t row 55 | iris[1:10, ] 56 | -------------------------------------------------------------------------------- /03_Programming/README.md: -------------------------------------------------------------------------------- 1 | # 3_ Programming 2 | 3 | ## 1_ Python Basics 4 | 5 | ### About Python 6 | 7 | Python is a high-level programming langage. I can be used in a wide range of works. 8 | 9 | Commonly used in data-science, [Python](https://www.python.org/) has a huge set of libraries, helpful to quickly do something. 10 | 11 | Most of informatics systems already support Python, without installing anything. 12 | 13 | ### Execute a script 14 | 15 | * Download the .py file on your computer 16 | * Make it executable (_chmod +x file.py_ on Linux) 17 | * Open a terminal and go to the directory containing the python file 18 | * _python file.py_ to run with Python2 or _python3 file.py_ with Python3 19 | 20 | ## 2_ Working in excel 21 | 22 | ## 3_ R setup / R studio 23 | 24 | ### About R 25 | 26 | R is a programming language specialized in statistics and mathematical visualizations. 27 | 28 | I can be used withs manually created scripts launched in the terminal, or directly in the R console. 29 | 30 | ### Installation 31 | 32 | #### Linux 33 | 34 | sudo apt-get install r-base 35 | 36 | sudo apt-get install r-base-dev 37 | 38 | #### Windows 39 | 40 | Download the .exe setup available on [CRAN](https://cran.rstudio.com/bin/windows/base/) website. 41 | 42 | ### R-studio 43 | 44 | Rstudio is a graphical interface for R. It is available for free on [their website](https://www.rstudio.com/products/rstudio/download/). 45 | 46 | This interface is divided in 4 main areas : 47 | 48 | ![rstudio](https://owi.usgs.gov/R/training-curriculum/intro-curriculum/static/img/rstudio.png) 49 | 50 | * The top left is the script you are working on (highlight code you want to execute and press Ctrl + Enter) 51 | * The bottom left is the console to instant-execute some lines of codes 52 | * The top right is showing your environment (variables, history, ...) 53 | * The bottom right show figures you plotted, packages, help ... 
103 | 104 | ## 16_ Reading raw data 105 | 106 | ## 17_ Subsetting data 107 | 108 | ## 18_ Manipulate data frames 109 | 110 | ## 19_ Functions 111 | 112 | A function is helpful for avoiding repeating the same code. 113 | 114 | First, define the function: 115 | 116 | def my_function(number): 117 | """This function will multiply a number by 9""" 118 | number = number * 9 119 | return number 120 | 121 | ## 20_ Factor analysis 122 | 123 | ## 21_ Install PKGS 124 | 125 | Python has two widely used major versions: Python 2 and Python 3. 126 | 127 | ### Install pip 128 | 129 | Pip is the package manager for Python, so you can easily install most packages with a one-line command. To install pip, just open a terminal and run: 130 | 131 | # __python2__ 132 | sudo apt-get install python-pip 133 | # __python3__ 134 | sudo apt-get install python3-pip 135 | 136 | You can then install a library with [pip](https://pypi.python.org/pypi/pip?) from a terminal by running: 137 | 138 | # __python2__ 139 | sudo pip install [PCKG_NAME] 140 | # __python3__ 141 | sudo pip3 install [PCKG_NAME] 142 | 143 | You can also install a package from within a Python script (see 21_install-pkgs.py) 144 | -------------------------------------------------------------------------------- /04_Machine-Learning/04_Supervised_Machine_Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Supervised Machine Learning\n", 8 | "Supervised machine learning is a type of machine learning where the algorithm is trained on a labeled dataset, meaning the input data is paired with corresponding output labels.
The goal of supervised learning is to learn a mapping from input features to the corresponding output labels, allowing the algorithm to make predictions or decisions when presented with new, unseen data.\n", 9 | "\n", 10 | "In a supervised learning scenario:\n", 11 | "\n", 12 | "1. **Input Data (Features):** The algorithm is provided with a set of input data, also known as features. These features represent the characteristics or attributes of the data that are relevant to the learning task.\n", 13 | "\n", 14 | "2. **Output Labels (Target):** Each input data point is associated with a known output label, also called the target or response variable. The output labels represent the desired prediction or classification for each corresponding input.\n", 15 | "\n", 16 | "3. **Training Process:** During the training phase, the algorithm uses the labeled dataset to learn the relationship between the input features and the output labels. The goal is to create a model that generalizes well to new, unseen data.\n", 17 | "\n", 18 | "4. **Model Building:** The supervised learning algorithm builds a model based on the patterns and relationships identified in the training data. The model captures the underlying mapping from inputs to outputs.\n", 19 | "\n", 20 | "5. **Prediction:** Once the model is trained, it can be used to make predictions or classifications on new, unseen data. The model applies the learned patterns to new input features to generate predictions.\n", 21 | "\n", 22 | "### Types of Supervised Learning:\n", 23 | "\n", 24 | "1. **Classification:**\n", 25 | " - In classification tasks, the goal is to predict the categorical class labels of new instances.\n", 26 | " - Example: Spam or not spam, disease diagnosis (positive/negative).\n", 27 | "\n", 28 | "2. **Regression:**\n", 29 | " - In regression tasks, the goal is to predict a continuous numeric value.\n", 30 | " - Example: Predicting house prices, stock prices, or temperature.\n", 31 | "\n", 32 | "### Key Concepts:\n", 33 | "\n", 34 | "- **Training Data:** The labeled dataset used for training the model.\n", 35 | "- **Testing Data:** Unseen data used to evaluate the model's performance.\n", 36 | "- **Features:** The input variables or attributes of the data.\n", 37 | "- **Labels/Targets:** The desired output or prediction associated with each input.\n", 38 | "- **Model Evaluation:** Assessing how well the model performs on new, unseen data.\n", 39 | "\n", 40 | "### Applications:\n", 41 | "\n", 42 | "Supervised learning is widely used in various applications, including:\n", 43 | "\n", 44 | "- Image and speech recognition.\n", 45 | "- Natural language processing (NLP).\n", 46 | "- Fraud detection.\n", 47 | "- Autonomous vehicles.\n", 48 | "- Healthcare for diagnosis and prognosis.\n", 49 | "- Recommendation systems.\n", 50 | "\n", 51 | "Supervised learning is a foundational and widely applied approach in machine learning, providing the basis for many practical applications across different domains." 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "# Table Of Contents:\n", 59 | "Supervised learning is a category of machine learning where the algorithm is trained on a labeled dataset, meaning the input data is paired with corresponding output labels. Here is a list of key topics in supervised learning:\n", 60 | "\n", 61 | "1. 
**Introduction to Supervised Learning:**\n", 62 | " - Definition and basic concepts of supervised learning.\n", 63 | " - Distinction between supervised and unsupervised learning.\n", 64 | "\n", 65 | "2. **Types of Supervised Learning:**\n", 66 | " - Classification: Predicting the class label of new instances.\n", 67 | " - Regression: Predicting a continuous-valued output.\n", 68 | "\n", 69 | "3. **Training and Testing Sets:**\n", 70 | " - Division of data into training and testing sets.\n", 71 | " - The importance of evaluating model performance on unseen data.\n", 72 | "\n", 73 | "4. **Linear Regression:**\n", 74 | " - Simple linear regression.\n", 75 | " - Multiple linear regression.\n", 76 | " - Assessing model fit and interpreting coefficients.\n", 77 | "\n", 78 | "5. **Logistic Regression:**\n", 79 | " - Binary logistic regression.\n", 80 | " - Multinomial logistic regression.\n", 81 | " - Probability estimation and decision boundaries.\n", 82 | "\n", 83 | "6. **Decision Trees:**\n", 84 | " - Tree structure and decision-making process.\n", 85 | " - Entropy and information gain.\n", 86 | " - Pruning to avoid overfitting.\n", 87 | "\n", 88 | "7. **Ensemble Learning:**\n", 89 | " - Bagging and Random Forests.\n", 90 | " - Boosting algorithms like AdaBoost, Gradient Boosting.\n", 91 | "\n", 92 | "8. **Support Vector Machines (SVM):**\n", 93 | " - Linear SVM for classification.\n", 94 | " - Non-linear SVM using the kernel trick.\n", 95 | " - Margin and support vectors.\n", 96 | "\n", 97 | "9. **k-Nearest Neighbors (k-NN):**\n", 98 | " - Definition and intuition.\n", 99 | " - Distance metrics and choosing k.\n", 100 | " - Pros and cons of k-NN.\n", 101 | "\n", 102 | "10. **Naive Bayes:**\n", 103 | " - Bayes' theorem and conditional probability.\n", 104 | " - Naive Bayes for text classification.\n", 105 | " - Handling continuous features with Gaussian Naive Bayes.\n", 106 | "\n", 107 | "11. **Neural Networks:**\n", 108 | " - Basics of artificial neural networks (ANN).\n", 109 | " - Feedforward and backpropagation.\n", 110 | " - Activation functions and hidden layers.\n", 111 | "\n", 112 | "12. **Evaluation Metrics:**\n", 113 | " - Accuracy, precision, recall, F1-score.\n", 114 | " - Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC) for binary classification.\n", 115 | " - Mean Squared Error (MSE) for regression.\n", 116 | "\n", 117 | "13. **Feature Engineering:**\n", 118 | " - Handling missing data.\n", 119 | " - Encoding categorical variables.\n", 120 | " - Scaling and normalization.\n", 121 | "\n", 122 | "14. **Hyperparameter Tuning:**\n", 123 | " - Cross-validation for model selection.\n", 124 | " - Grid search and random search for hyperparameter optimization.\n", 125 | "\n", 126 | "15. **Handling Imbalanced Datasets:**\n", 127 | " - Techniques for dealing with imbalanced class distribution.\n", 128 | " - Resampling methods, such as oversampling and undersampling.\n", 129 | "\n", 130 | "16. **Transfer Learning:**\n", 131 | " - Leveraging pre-trained models for new tasks.\n", 132 | " - Fine-tuning and feature extraction.\n", 133 | "\n", 134 | "17. **Explainable AI (XAI):**\n", 135 | " - Interpretable models.\n", 136 | " - Model-agnostic interpretability techniques.\n", 137 | "\n", 138 | "18. **Challenges and Ethical Considerations:**\n", 139 | " - Bias in machine learning.\n", 140 | " - Fairness and transparency in model predictions.\n", 141 | "\n", 142 | "This list provides a comprehensive overview of the key topics in supervised learning. 
Each topic can be explored in greater detail, and practical implementation through coding exercises is essential for a deeper understanding of these concepts." 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "# Types Of Model:\n", 150 | "\n", 151 | "Machine learning models can be categorized into various types based on the nature of the learning task they are designed to perform. Here are some common types of machine learning models with examples:\n", 152 | "\n", 153 | "### 1.**Supervised Learning Models:**\n", 154 | " - **Classification Models:**\n", 155 | " - **Example:** Support Vector Machines (SVM), Logistic Regression, Decision Trees, Random Forest, k-Nearest Neighbors (k-NN).\n", 156 | " - **Example Application:** Spam email classification, image recognition.\n", 157 | "\n", 158 | " - **Regression Models:**\n", 159 | " - **Example:** Linear Regression, Ridge Regression, Lasso Regression, Decision Trees, Random Forest.\n", 160 | " - **Example Application:** House price prediction, stock price prediction.\n", 161 | "\n", 162 | "### **2. Unsupervised Learning Models:**\n", 163 | " - **Clustering Models:**\n", 164 | " - **Example:** K-Means, Hierarchical Clustering, DBSCAN.\n", 165 | " - **Example Application:** Customer segmentation, image segmentation.\n", 166 | "\n", 167 | " - **Dimensionality Reduction Models:**\n", 168 | " - **Example:** Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE).\n", 169 | " - **Example Application:** Visualization, feature extraction.\n", 170 | "\n", 171 | " - **Association Rule Learning Models:**\n", 172 | " - **Example:** Apriori, Eclat.\n", 173 | " - **Example Application:** Market basket analysis, recommendation systems.\n", 174 | "\n", 175 | "### **3. Semi-Supervised Learning Models:**\n", 176 | " - **Example:** Self-training, Multi-view learning.\n", 177 | "\n", 178 | "### **4. Reinforcement Learning Models:**\n", 179 | " - **Example:** Q-Learning, Deep Q Networks (DQN), Policy Gradient methods.\n", 180 | " - **Example Application:** Game playing (e.g., AlphaGo), robotic control, autonomous vehicles.\n", 181 | "\n", 182 | "### **5. Ensemble Models:**\n", 183 | " - **Bagging Models:**\n", 184 | " - **Example:** Random Forest, Bagged Decision Trees.\n", 185 | " - **Example Application:** Classification, regression.\n", 186 | "\n", 187 | " - **Boosting Models:**\n", 188 | " - **Example:** AdaBoost, Gradient Boosting, XGBoost, LightGBM.\n", 189 | " - **Example Application:** Classification, regression.\n", 190 | "\n", 191 | "### **6. Neural Network Models:**\n", 192 | " - **Feedforward Neural Networks:**\n", 193 | " - **Example:** Multi-layer Perceptron (MLP).\n", 194 | " - **Example Application:** Image recognition, natural language processing.\n", 195 | "\n", 196 | " - **Convolutional Neural Networks (CNN):**\n", 197 | " - **Example:** LeNet, AlexNet, ResNet.\n", 198 | " - **Example Application:** Image classification, object detection.\n", 199 | "\n", 200 | " - **Recurrent Neural Networks (RNN):**\n", 201 | " - **Example:** LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit).\n", 202 | " - **Example Application:** Natural language processing, time series analysis.\n", 203 | "\n", 204 | " - **Generative Adversarial Networks (GAN):**\n", 205 | " - **Example:** DCGAN (Deep Convolutional GAN), StyleGAN.\n", 206 | " - **Example Application:** Image generation, style transfer.\n", 207 | "\n", 208 | "### **7. 
Instance-Based Models:**\n", 209 | " - **Example:** k-Nearest Neighbors (k-NN).\n", 210 | " - **Example Application:** Classification, regression.\n", 211 | "\n", 212 | "### **8. Self-Supervised Learning Models:**\n", 213 | " - **Example:** Word2Vec, Contrastive Learning.\n", 214 | "\n", 215 | "These categories and examples provide an overview of the diverse range of machine learning models, each tailored to specific learning tasks and applications. The choice of a model depends on factors such as the nature of the data, the problem at hand, and the desired outcomes." 216 | ] 217 | } 218 | ], 219 | "metadata": { 220 | "kernelspec": { 221 | "display_name": "Python 3", 222 | "language": "python", 223 | "name": "python3" 224 | }, 225 | "language_info": { 226 | "name": "python", 227 | "version": "3.10.12" 228 | } 229 | }, 230 | "nbformat": 4, 231 | "nbformat_minor": 2 232 | } 233 | -------------------------------------------------------------------------------- /04_Machine-Learning/05_Supervised_Learning_Algorithms/05. Random Forest.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# **Random Forest**" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# What is Random forest?" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "Random Forest is a popular machine learning algorithm that belongs to the ensemble learning method. It involves constructing multiple decision trees during training and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.\n", 22 | "\n", 23 | "Here are some key points about Random Forest:\n", 24 | "\n", 25 | "1. **Ensemble Method**: Random Forest is an ensemble of Decision Trees, usually trained with the \"bagging\" method. The general idea of the bagging method is that a combination of learning models increases the overall result.\n", 26 | "\n", 27 | "2. **Randomness**: To ensure that the model does not overfit the data, randomness is introduced into the model learning process, which creates variation between the trees. This is done in two ways:\n", 28 | " - Each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.\n", 29 | " - When splitting a node during the construction of the tree, the split that is chosen is the best split among a random subset of the features.\n", 30 | "\n", 31 | "3. **Prediction**: For a classification problem, the output of the Random Forest model is the class selected by most trees (majority vote). For a regression problem, it could be the average of the output of each tree.\n", 32 | "\n", 33 | "Random Forests are a powerful and widely used machine learning algorithm that provide robustness and accuracy in many scenarios. They also handle overfitting well and can work with large datasets and high dimensional spaces." 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "# Real-Life Analogy of Random Forest..." 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "Imagine you're trying to decide what movie to watch tonight. You have several ways to make this decision:\n", 48 | "\n", 49 | "1. **Ask a friend**: You could ask a friend who knows your movie preferences well. 
This is like using a single decision tree. Your friend knows you well (the tree is well-fitted to the training data), but their recommendation might be overly influenced by the movies you've both watched recently (the tree is overfitting).\n", 50 | "\n", 51 | "2. **Ask a group of friends independently**: You could ask a group of friends independently, and watch the movie that the majority of them recommend. Each friend will make their recommendation based on their understanding of your movie preferences. Some friends may give more weight to your preference for action movies, while others may focus more on the director of the movie or the actors. This is like a Random Forest. Each friend forms a \"tree\" in the \"forest\", and the final decision is made based on the majority vote.\n", 52 | "\n", 53 | "In this analogy, each friend in the group is a decision tree, and the group of friends is the random forest. Each friend makes a decision based on a subset of your preferences (a subset of the total \"features\" available), and the final decision is a democratic one, based on the majority vote. This process helps to avoid the risk of overfitting (relying too much on one friend's opinion) and underfitting (not considering enough preferences)." 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "# Working of Random Forest Algorithm?" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "The Random Forest algorithm works in the following steps:\n", 68 | "\n", 69 | "1. **Bootstrap Dataset**: Random Forest starts by picking N random records from the dataset. This sampling is done with replacement, meaning the same row can be chosen multiple times. This sample will be used to build a tree.\n", 70 | "\n", 71 | "2. **Build Decision Trees**: For each sample, it then constructs a decision tree. But unlike a standard decision tree, each node is split using the best among a subset of predictors randomly chosen at that node. This introduces randomness into the model creation process and helps to prevent overfitting.\n", 72 | "\n", 73 | "3. **Repeat the Process**: Steps 1 and 2 are repeated to create a forest of decision trees.\n", 74 | "\n", 75 | "4. **Make Predictions**: For a new input, each tree in the forest gives its prediction. In a classification problem, the class that has the majority of votes becomes the model’s prediction. In a regression problem, the average of all the tree outputs is the final output of the model.\n", 76 | "\n", 77 | "The key to the success of Random Forest is that the model is not overly reliant on any individual decision tree. By averaging the results of a lot of different trees, it reduces the variance and provides a much more stable and robust prediction.\n", 78 | "\n", 79 | "Random Forests also have a built-in method of measuring variable importance. This is done by looking at how much the tree nodes that use a particular feature reduce impurity across all trees in the forest, and it is a useful tool for interpretability of the model." 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "## Bagging & Boosting?" 
87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "Random Forest is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.\n", 94 | "\n", 95 | "Random Forest uses the concept of **bagging** (Bootstrap Aggregating), not boosting. Here's how it works:\n", 96 | "\n", 97 | "1. **Bagging**: In bagging, multiple subsets of the original dataset are created using bootstrap sampling. Then, a decision tree is fitted on each of these subsets. The final prediction is made by averaging the predictions (regression) or taking a majority vote (classification) from all the decision trees. Bagging helps to reduce variance and overfitting.\n", 98 | "\n", 99 | "2. **Random Subspace Method**: In addition to bagging, Random Forest also uses a method called the random subspace method, where a subset of features is selected randomly to create a split at each node of the decision tree. This introduces further randomness into the model, which helps to reduce variance and overfitting.\n", 100 | "\n", 101 | "Boosting, on the other hand, is a different ensemble technique where models are trained sequentially, with each new model being trained to correct the errors made by the previous ones. Models are weighted based on their performance, and higher weight is given to the models that perform well. Boosting can reduce bias and variance, but it's not used in Random Forest. Examples of boosting algorithms include AdaBoost and Gradient Boosting." 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "## Steps Involved in Random Forest Algorithm?" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "Here are the steps involved in the Random Forest algorithm:\n", 116 | "\n", 117 | "1. **Select random samples from a given dataset**: This is done with replacement, meaning the same row can be chosen multiple times. This sample will be used to build a tree.\n", 118 | "\n", 119 | "2. **Construct a decision tree for each sample and get a prediction result from each decision tree**: Unlike a standard decision tree, each node in the tree is split using the best among a subset of predictors randomly chosen at that node. This introduces randomness into the model creation process and helps to prevent overfitting.\n", 120 | "\n", 121 | "3. **Perform a vote for each predicted result**: For a new input, each tree in the forest gives its prediction. In a classification problem, the class that has the majority of votes becomes the model’s prediction. In a regression problem, the average of all the tree outputs is the final output of the model.\n", 122 | "\n", 123 | "4. **Select the prediction result with the most votes as the final prediction**: For classification, the mode of all the predictions is returned. For regression, the mean of all the predictions is returned.\n", 124 | "\n", 125 | "The key to the success of Random Forest is that the model is not overly reliant on any individual decision tree. By averaging the results of a lot of different trees, it reduces the variance and provides a much more stable and robust prediction." 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "# Important Features of Random Forest?" 
133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "Random Forest has several important features that make it a popular choice for machine learning tasks:\n", 140 | "\n", 141 | "1. **Robustness to Overfitting**: Due to the randomness introduced in building the individual trees, Random Forests are less likely to overfit the training data compared to individual decision trees.\n", 142 | "\n", 143 | "2. **Handling of Large Datasets**: Random Forest can handle large datasets with high dimensionality. It can handle thousands of input variables and identify the most significant ones.\n", 144 | "\n", 145 | "3. **Versatility**: It can be used for both regression and classification tasks, and it can also handle multi-output problems.\n", 146 | "\n", 147 | "4. **Feature Importance**: Random Forests provide an importance score for each feature, allowing for feature selection and interpretability.\n", 148 | "\n", 149 | "5. **Out-of-Bag Error Estimation**: In Random Forest, about one-third of the data is not used to train each tree, and this data (called out-of-bag data) can be used to get an unbiased estimate of the model's performance.\n", 150 | "\n", 151 | "6. **Parallelizable**: The process of building trees is easily parallelizable as each tree is built independently of the others.\n", 152 | "\n", 153 | "7. **Missing Values Handling**: Random Forest can handle missing values. When the dataset has missing values, the Random Forest algorithm will learn the best impute value for the missing values based on the reduction in the impurity.\n", 154 | "\n", 155 | "8. **Non-Parametric**: Random Forest is a non-parametric method, which means that it makes no assumptions about the functional form of the transformation from inputs to output. This is an advantage for datasets where the relationship between inputs and output is complex and non-linear.\n", 156 | "\n", 157 | "Remember, while Random Forest has these advantages, it also has some disadvantages like being a black box model with limited interpretability compared to a single decision tree, and being slower to train and predict than simpler models like linear models." 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "# Difference Between Decision Tree and Random Forest?" 
165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "Here's a comparison between Decision Trees and Random Forests in a tabular format:\n", 172 | "\n", 173 | "| Feature | Decision Tree | Random Forest |\n", 174 | "| --- | --- | --- |\n", 175 | "| **Basic** | Single tree | Ensemble of multiple trees |\n", 176 | "| **Overfitting** | Prone to overfitting | Less prone due to averaging of multiple trees |\n", 177 | "| **Performance** | Lower performance on complex datasets | Higher performance due to ensemble method |\n", 178 | "| **Training Speed** | Faster | Slower due to building multiple trees |\n", 179 | "| **Prediction Speed** | Faster | Slower due to aggregating results from multiple trees |\n", 180 | "| **Interpretability** | High (easy to visualize and understand) | Lower (hard to visualize many trees) |\n", 181 | "| **Feature Selection** | Uses all features for splitting a node | Randomly selects a subset of features for splitting a node |\n", 182 | "| **Handling Unseen Data** | Less effective | More effective due to averaging |\n", 183 | "| **Variance** | High variance | Low variance due to averaging |\n", 184 | "\n", 185 | "Remember, the choice between a Decision Tree and Random Forest often depends on the specific problem and the computational resources available. Random Forests generally perform better, but they require more computational resources and are less interpretable." 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "# Important Hyperparameters in Random Forest?" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "Random Forest has several important hyperparameters that control its behavior:\n", 200 | "\n", 201 | "1. **n_estimators**: This is the number of trees you want to build before taking the maximum voting or averages of predictions. Higher values make the predictions stronger and more stable, but also slow down the computation.\n", 202 | "\n", 203 | "2. **max_features**: These are the maximum number of features Random Forest is allowed to try in individual tree. There are multiple options available such as \"auto\", \"sqrt\", \"log2\", or an integer. Typically, sqrt(number of features) is a good starting point.\n", 204 | "\n", 205 | "3. **max_depth**: This is the maximum number of levels in each decision tree. You can set it to an integer or leave it as None for unlimited depth. This can be used to control overfitting.\n", 206 | "\n", 207 | "4. **min_samples_split**: This is the minimum number of data points placed in a node before the node is split. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.\n", 208 | "\n", 209 | "5. **min_samples_leaf**: This is the minimum number of data points allowed in a leaf node. Higher values reduce overfitting.\n", 210 | "\n", 211 | "6. **bootstrap**: This is a boolean value indicating whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.\n", 212 | "\n", 213 | "7. **oob_score**: Also a boolean, it indicates whether to use out-of-bag samples to estimate the generalization accuracy.\n", 214 | "\n", 215 | "8. **n_jobs**: This indicates the number of jobs to run in parallel for both fit and predict. If set to -1, then the number of jobs is set to the number of cores.\n", 216 | "\n", 217 | "9. 
**random_state**: This controls the randomness of the bootstrapping of the samples used when building trees. If the random state is fixed, the model output will be deterministic.\n", 218 | "\n", 219 | "10. **class_weight**: This parameter allows you to specify weights for the classes. This is useful if your classes are imbalanced.\n", 220 | "\n", 221 | "Remember, tuning these hyperparameters can significantly improve the performance of the model, but it can also lead to overfitting if not done carefully. It's usually a good idea to use some form of cross-validation, such as GridSearchCV or RandomizedSearchCV, to find the optimal values for these hyperparameters." 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "# Coding in Python – Random Forest?" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "Here's a basic example of how to use the Random Forest algorithm for a classification problem in Python using the sklearn library. We'll use the iris dataset, which is a multi-class classification problem, built into sklearn for this example.\n", 236 | "\n" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 1, 242 | "metadata": {}, 243 | "outputs": [ 244 | { 245 | "name": "stdout", 246 | "output_type": "stream", 247 | "text": [ 248 | "Accuracy: 1.0\n" 249 | ] 250 | } 251 | ], 252 | "source": [ 253 | "from sklearn.ensemble import RandomForestClassifier\n", 254 | "from sklearn.datasets import load_iris\n", 255 | "from sklearn.model_selection import train_test_split\n", 256 | "from sklearn.metrics import accuracy_score\n", 257 | "\n", 258 | "# Load the iris dataset\n", 259 | "iris = load_iris()\n", 260 | "X = iris.data\n", 261 | "y = iris.target\n", 262 | "\n", 263 | "# Split the data into train and test sets\n", 264 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n", 265 | "\n", 266 | "# Create the model with 100 trees\n", 267 | "model = RandomForestClassifier(n_estimators=100, \n", 268 | " bootstrap = True,\n", 269 | " max_features = 'sqrt')\n", 270 | "\n", 271 | "# Fit on training data\n", 272 | "model.fit(X_train, y_train)\n", 273 | "\n", 274 | "# Predict the test set\n", 275 | "predictions = model.predict(X_test)\n", 276 | "\n", 277 | "# Calculate the accuracy score\n", 278 | "accuracy = accuracy_score(y_test, predictions)\n", 279 | "print(\"Accuracy: \", accuracy)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "\n", 287 | "\n", 288 | "This code first loads the iris dataset, then splits it into a training set and a test set. A Random Forest model is created with 100 trees and fitted on the training data. The model is then used to predict the classes of the test set, and the accuracy of these predictions is printed out." 289 | ] 290 | }, 291 | { 292 | "cell_type": "markdown", 293 | "metadata": {}, 294 | "source": [ 295 | "# Coding in R – Random Forest?" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": {}, 301 | "source": [ 302 | "Here's a basic example of how to use the Random Forest algorithm for a classification problem in R using the randomForest package. 
We'll use the iris dataset, which is a multi-class classification problem, built into R for this example.\n", 303 | "\n" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 1, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "name": "stderr", 313 | "output_type": "stream", 314 | "text": [ 315 | "Installing package into ‘/home/blackheart/R/x86_64-pc-linux-gnu-library/4.1’\n", 316 | "(as ‘lib’ is unspecified)\n", 317 | "\n", 318 | "randomForest 4.7-1.1\n", 319 | "\n", 320 | "Type rfNews() to see new features/changes/bug fixes.\n", 321 | "\n" 322 | ] 323 | }, 324 | { 325 | "name": "stdout", 326 | "output_type": "stream", 327 | "text": [ 328 | "\n", 329 | "Call:\n", 330 | " randomForest(formula = Species ~ ., data = iris, ntree = 100, importance = TRUE) \n", 331 | " Type of random forest: classification\n", 332 | " Number of trees: 100\n", 333 | "No. of variables tried at each split: 2\n", 334 | "\n", 335 | " OOB estimate of error rate: 4.67%\n", 336 | "Confusion matrix:\n", 337 | " setosa versicolor virginica class.error\n", 338 | "setosa 50 0 0 0.00\n", 339 | "versicolor 0 47 3 0.06\n", 340 | "virginica 0 4 46 0.08\n" 341 | ] 342 | }, 343 | { 344 | "data": { 345 | "text/html": [ 346 | "\n", 347 | "\n", 348 | "\n", 349 | "\t\n", 350 | "\n", 351 | "\n", 352 | "\t\n", 353 | "\t\n", 354 | "\t\n", 355 | "\t\n", 356 | "\n", 357 | "
A matrix: 4 × 5 of type dbl (variable importance: setosa, versicolor, virginica, MeanDecreaseAccuracy, MeanDecreaseGini — the values are shown in the text/markdown and text/plain renderings below)
\n" 358 | ], 359 | "text/latex": [ 360 | "A matrix: 4 × 5 of type dbl\n", 361 | "\\begin{tabular}{r|lllll}\n", 362 | " & setosa & versicolor & virginica & MeanDecreaseAccuracy & MeanDecreaseGini\\\\\n", 363 | "\\hline\n", 364 | "\tSepal.Length & 2.313169 & 1.9048166 & 3.275294 & 4.299426 & 9.541714\\\\\n", 365 | "\tSepal.Width & 1.931173 & -0.6401191 & 2.173373 & 1.834349 & 2.033635\\\\\n", 366 | "\tPetal.Length & 10.176809 & 16.8581929 & 13.539843 & 16.298141 & 46.032251\\\\\n", 367 | "\tPetal.Width & 10.139444 & 13.1397106 & 14.113327 & 14.885764 & 41.657799\\\\\n", 368 | "\\end{tabular}\n" 369 | ], 370 | "text/markdown": [ 371 | "\n", 372 | "A matrix: 4 × 5 of type dbl\n", 373 | "\n", 374 | "| | setosa | versicolor | virginica | MeanDecreaseAccuracy | MeanDecreaseGini |\n", 375 | "|---|---|---|---|---|---|\n", 376 | "| Sepal.Length | 2.313169 | 1.9048166 | 3.275294 | 4.299426 | 9.541714 |\n", 377 | "| Sepal.Width | 1.931173 | -0.6401191 | 2.173373 | 1.834349 | 2.033635 |\n", 378 | "| Petal.Length | 10.176809 | 16.8581929 | 13.539843 | 16.298141 | 46.032251 |\n", 379 | "| Petal.Width | 10.139444 | 13.1397106 | 14.113327 | 14.885764 | 41.657799 |\n", 380 | "\n" 381 | ], 382 | "text/plain": [ 383 | " setosa versicolor virginica MeanDecreaseAccuracy\n", 384 | "Sepal.Length 2.313169 1.9048166 3.275294 4.299426 \n", 385 | "Sepal.Width 1.931173 -0.6401191 2.173373 1.834349 \n", 386 | "Petal.Length 10.176809 16.8581929 13.539843 16.298141 \n", 387 | "Petal.Width 10.139444 13.1397106 14.113327 14.885764 \n", 388 | " MeanDecreaseGini\n", 389 | "Sepal.Length 9.541714 \n", 390 | "Sepal.Width 2.033635 \n", 391 | "Petal.Length 46.032251 \n", 392 | "Petal.Width 41.657799 " 393 | ] 394 | }, 395 | "metadata": {}, 396 | "output_type": "display_data" 397 | }, 398 | { 399 | "name": "stdout", 400 | "output_type": "stream", 401 | "text": [ 402 | "[1] \"Accuracy: 1\"\n" 403 | ] 404 | } 405 | ], 406 | "source": [ 407 | "# Install and load the randomForest package\n", 408 | "install.packages(\"randomForest\")\n", 409 | "library(randomForest)\n", 410 | "\n", 411 | "# Load the iris dataset\n", 412 | "data(iris)\n", 413 | "\n", 414 | "# Create a random forest model\n", 415 | "set.seed(42) # for reproducibility\n", 416 | "iris.rf <- randomForest(Species ~ ., data=iris, ntree=100, importance=TRUE)\n", 417 | "\n", 418 | "# Print the model summary\n", 419 | "print(iris.rf)\n", 420 | "\n", 421 | "# Get importance of each feature\n", 422 | "importance(iris.rf)\n", 423 | "\n", 424 | "# Predict using the model\n", 425 | "iris.pred <- predict(iris.rf, iris)\n", 426 | "\n", 427 | "# Check the accuracy\n", 428 | "accuracy <- sum(iris.pred == iris$Species) / nrow(iris)\n", 429 | "print(paste(\"Accuracy: \", accuracy))" 430 | ] 431 | }, 432 | { 433 | "cell_type": "markdown", 434 | "metadata": {}, 435 | "source": [ 436 | "\n", 437 | "\n", 438 | "This code first loads the iris dataset, then a Random Forest model is created with 100 trees and fitted on the entire dataset. The model summary and feature importance are printed out. The model is then used to predict the classes of the same dataset (this is just for demonstration, in practice you should split your data into training and testing sets), and the accuracy of these predictions is printed out." 
439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "# **Thank You!**" 446 | ] 447 | } 448 | ], 449 | "metadata": { 450 | "kernelspec": { 451 | "display_name": "Python 3", 452 | "language": "python", 453 | "name": "python3" 454 | }, 455 | "language_info": { 456 | "codemirror_mode": { 457 | "name": "ipython", 458 | "version": 3 459 | }, 460 | "file_extension": ".py", 461 | "mimetype": "text/x-python", 462 | "name": "python", 463 | "nbconvert_exporter": "python", 464 | "pygments_lexer": "ipython3", 465 | "version": "3.10.12" 466 | } 467 | }, 468 | "nbformat": 4, 469 | "nbformat_minor": 2 470 | } 471 | -------------------------------------------------------------------------------- /04_Machine-Learning/05_Supervised_Learning_Algorithms/06. Naive Bayes Classifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Naive Bayes Classifier" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Title: Understanding the Naive Bayes Classifier\n", 15 | "\n", 16 | "---\n", 17 | "\n", 18 | "#### **Introduction:**\n", 19 | "\n", 20 | "In the realm of machine learning and data science, the Naive Bayes classifier holds a pivotal role due to its simplicity, efficiency, and surprising power, especially in the field of text analysis. Named after the famous mathematician Thomas Bayes, the Naive Bayes classifier is a probabilistic machine learning model used for classification tasks.\n", 21 | "\n", 22 | "The Naive Bayes classifier is based on Bayes' theorem, which is an equation describing the relationship of conditional probabilities of statistical quantities. In Bayesian classification, we're interested in finding the probability of a label given some observed features, which we can express as P(L | features). Bayes' theorem tells us how to express this in terms of quantities we can compute more directly.\n", 23 | "\n", 24 | "A key assumption made by the Naive Bayes classifier, which is actually a simplification, is that the features are conditionally independent given the class. This means that the presence of a particular feature in a class is unrelated to the presence of any other feature in the same class. This is the 'naive' part of the 'Naive Bayes' - it's a naive assumption because it's not often encountered in real-world data, yet the classifier can perform surprisingly well even when this assumption doesn't hold.\n", 25 | "\n", 26 | "Naive Bayes classifiers are highly scalable and well-suited to high-dimensional datasets. They are often used for text classification, spam filtering, sentiment analysis, and recommendation systems, among other applications.\n", 27 | "\n", 28 | "In this article, we will delve deeper into the workings of the Naive Bayes classifier, explore its mathematical foundation, discuss its strengths and weaknesses, and see it in action through Python code examples.\n", 29 | "\n", 30 | "---\n" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "# What is Naive Bayes Classifier?" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "The Naive Bayes Classifier is a type of probabilistic machine learning model used for classification tasks. 
The classifier is based on applying Bayes' theorem with strong (naive) independence assumptions between the features.\n", 45 | "\n", 46 | "In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. A Naive Bayes classifier considers each of these features (red, round, 3 inches in diameter) to contribute independently to the probability that the fruit is an apple, regardless of any possible correlations between color, roundness, and diameter.\n", 47 | "\n", 48 | "Naive Bayes classifiers are highly scalable and are known to outperform even highly sophisticated classification methods. They are particularly well suited for high-dimensional data sets and are commonly used in text categorization (spam or not spam), sentiment analysis, and document classification. Despite their naive design and oversimplified assumptions, Naive Bayes classifiers often perform very well in many complex real-world situations.\n", 49 | "\n" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "# What Is the Naive Bayes Algorithm?" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "The Naive Bayes algorithm is a classification technique based on applying Bayes' theorem with a strong assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.\n", 64 | "\n", 65 | "Here's a step-by-step breakdown of how the Naive Bayes algorithm works:\n", 66 | "\n", 67 | "1. **Convert the data set into a frequency table**\n", 68 | "2. **Create a likelihood table by finding the probabilities of given observations**\n", 69 | "3. **Now, use Bayes theorem to calculate the posterior probability.**\n", 70 | "\n", 71 | "The core idea is that the predictors contribute independently to the probability of the class. This independent contribution is where the term 'naive' comes from.\n", 72 | "\n", 73 | "The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods. It is widely used in text analytics, spam filtering, recommendation systems, and more.\n", 74 | "\n", 75 | "Despite its simplicity, the Naive Bayes algorithm often performs well and is widely used because it often outputs a classification model that classifies correctly even when its assumption of the independence of features is violated." 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "# Realife Example of Naive Bayes Algorithm\n" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "Here are a few real-life examples of where the Naive Bayes algorithm is commonly used:\n", 90 | "\n", 91 | "1. **Email Spam Filtering**: Naive Bayes is a popular algorithm for email spam filtering. It classifies emails as 'spam' or 'not spam' by examining the frequency of certain words and phrases. For example, emails containing the words 'lottery', 'win', or 'claim' might be classified as spam.\n", 92 | "\n", 93 | "2. 
**Sentiment Analysis**: Naive Bayes is often used in sentiment analysis to determine whether a given piece of text (like a product review or a tweet) is positive, negative, or neutral. It does this by looking at the words used in the text and their associated sentiments.\n", 94 | "\n", 95 | "3. **Document Categorization**: Naive Bayes can be used to categorize documents into different categories based on their content. For example, news articles might be categorized into 'sports', 'politics', 'entertainment', etc.\n", 96 | "\n", 97 | "4. **Healthcare**: In healthcare, Naive Bayes can be used to predict the likelihood of a patient having a particular disease based on their symptoms.\n", 98 | "\n", 99 | "5. **Recommendation Systems**: Naive Bayes can be used in recommendation systems to predict a user's interests and recommend products or services. For example, if a user often watches action movies, the system might recommend other action movies for them to watch.\n", 100 | "\n", 101 | "Remember, the 'naive' assumption of Naive Bayes (that all features are independent given the class) is often violated in real-world data, yet the algorithm often performs surprisingly well." 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "## Bayes' Theorem" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "Bayes' theorem is a fundamental principle in the field of statistics and probability. It describes the relationship of conditional probabilities of statistical quantities. In other words, it gives us a way to update our previous beliefs based on new evidence.\n", 116 | "\n", 117 | "The theorem is named after Thomas Bayes, who first provided an equation that allows new evidence to update beliefs in his \"An Essay towards solving a Problem in the Doctrine of Chances\" (1763). It's articulated as:\n", 118 | "\n", 119 | "P(A|B) = [P(B|A) * P(A)] / P(B)\n", 120 | "\n", 121 | "Where:\n", 122 | "\n", 123 | "- P(A|B) is the posterior probability of A given B. It's what we are trying to estimate.\n", 124 | "- P(B|A) is the likelihood, the probability of observing B given A.\n", 125 | "- P(A) is the prior probability of A. It's our belief about A before observing B.\n", 126 | "- P(B) is the marginal likelihood of B.\n", 127 | "\n", 128 | "In the context of a classification problem, we can understand it as follows:\n", 129 | "\n", 130 | "- A is the event that a given data point belongs to a certain class.\n", 131 | "- B is the observed features of the data point.\n", 132 | "- P(A|B) is then the probability that the data point belongs to that class given its features.\n", 133 | "\n", 134 | "Bayes' theorem thus provides a way to calculate the probability of a data point belonging to a certain class based on its features, which is the fundamental idea behind the Naive Bayes classifier." 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "# What are the steps involved in training a Naive Bayes classifier?" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "Training a Naive Bayes classifier involves several steps:\n", 149 | "\n", 150 | "1. **Data Preprocessing**: The first step in training a Naive Bayes classifier is to preprocess the data. This may involve cleaning the data, handling missing values, encoding categorical variables, and normalizing numerical variables.\n", 151 | "\n", 152 | "2. 
**Feature Extraction**: Depending on the type of data, you might need to extract features that the classifier can use. For example, if you're classifying text documents, you might need to convert the documents into a bag-of-words representation or a TF-IDF representation.\n", 153 | "\n", 154 | "3. **Train-Test Split**: Split your dataset into a training set and a test set. The training set is used to train the model, and the test set is used to evaluate its performance.\n", 155 | "\n", 156 | "4. **Model Training**: Train the Naive Bayes classifier on the training data. This involves calculating the prior probabilities (the probabilities of each class) and the likelihoods (the probabilities of each feature given each class).\n", 157 | "\n", 158 | "5. **Prediction**: Use the trained model to make predictions on unseen data. For each data point, the model calculates the posterior probability of each class given the features of the data point, and assigns the data point to the class with the highest posterior probability.\n", 159 | "\n", 160 | "6. **Evaluation**: Evaluate the performance of the model on the test set. Common evaluation metrics for classification tasks include accuracy, precision, recall, and the F1 score.\n", 161 | "\n", 162 | "7. **Model Tuning**: Depending on the performance of the model, you might need to go back and adjust the model's parameters, choose different features, or preprocess the data in a different way. This is an iterative process.\n", 163 | "\n", 164 | "Remember, the 'naive' assumption of Naive Bayes (that all features are independent given the class) is often violated in real-world data, yet the algorithm often performs surprisingly well." 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "# How Do Naive Bayes Algorithms Work?" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "Naive Bayes algorithms are based on Bayes' theorem, which is a way of finding a probability when we know certain other probabilities. The 'naive' part comes from the assumption that each input (feature) is independent of the others.\n", 179 | "\n", 180 | "Let's consider a simple example related to weather conditions. Suppose we are trying to predict whether a person will go for a walk based on the weather conditions. We have the following data:\n", 181 | "\n", 182 | "- 60% of the days are sunny.\n", 183 | "- The person goes for a walk on 70% of the sunny days.\n", 184 | "- The person goes for a walk on 40% of the days.\n", 185 | "\n", 186 | "We want to find out the probability that it is sunny given that the person went for a walk. This is written as P(Sunny|Walk).\n", 187 | "\n", 188 | "Using Bayes' theorem:\n", 189 | "\n", 190 | "P(Sunny|Walk) = [P(Walk|Sunny) * P(Sunny)] / P(Walk)\n", 191 | "\n", 192 | "We can substitute the known values into this equation:\n", 193 | "\n", 194 | "P(Sunny|Walk) = [(0.7) * (0.6)] / (0.4) = 1.05\n", 195 | "\n", 196 | "However, a probability cannot be greater than 1. This discrepancy arises because the actual probability of the person going for a walk (P(Walk)) is not independent of the weather conditions. 
The correct value of P(Walk) should be the total probability of the person going for a walk under all weather conditions, which is calculated as follows:\n", 197 | "\n", 198 | "P(Walk) = P(Walk and Sunny) + P(Walk and not Sunny)\n", 199 | " = P(Walk|Sunny) * P(Sunny) + P(Walk|not Sunny) * P(not Sunny)\n", 200 | " = (0.7 * 0.6) + (0.3 * 0.4) = 0.42 + 0.12 = 0.54\n", 201 | "\n", 202 | "Substituting the correct value of P(Walk) into Bayes' theorem gives:\n", 203 | "\n", 204 | "P(Sunny|Walk) = [(0.7) * (0.6)] / (0.54) = 0.777...\n", 205 | "\n", 206 | "So, the updated probability that it is sunny given that the person went for a walk is approximately 0.78, or 78%.\n", 207 | "\n", 208 | "This example demonstrates how Naive Bayes updates our beliefs based on evidence (in this case, the fact that the person went for a walk). However, it also shows why the 'naive' assumption can lead to errors if the input features are not truly independent." 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "# Python Code Implementation" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "Here's an example of how you might use the Naive Bayes algorithm to classify emails as spam or not spam. This example uses the `sklearn` library in Python.\n", 223 | "\n" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 2, 229 | "metadata": { 230 | "vscode": { 231 | "languageId": "r" 232 | } 233 | }, 234 | "outputs": [ 235 | { 236 | "name": "stdout", 237 | "output_type": "stream", 238 | "text": [ 239 | " precision recall f1-score support\n", 240 | "\n", 241 | " not spam 0.00 0.00 0.00 0.0\n", 242 | " spam 0.00 0.00 0.00 1.0\n", 243 | "\n", 244 | " accuracy 0.00 1.0\n", 245 | " macro avg 0.00 0.00 0.00 1.0\n", 246 | "weighted avg 0.00 0.00 0.00 1.0\n", 247 | "\n" 248 | ] 249 | } 250 | ], 251 | "source": [ 252 | "from sklearn.model_selection import train_test_split\n", 253 | "from sklearn.feature_extraction.text import CountVectorizer\n", 254 | "from sklearn.naive_bayes import MultinomialNB\n", 255 | "from sklearn import metrics\n", 256 | "from warnings import filterwarnings\n", 257 | "\n", 258 | "filterwarnings('ignore')\n", 259 | "\n", 260 | "# Sample data\n", 261 | "emails = ['Hey, can we meet tomorrow?', 'Upto 20% discount, exclusive offer just for you', \n", 262 | " 'Are you available tomorrow?', 'Win a lottery of $1 Million']\n", 263 | "labels = ['not spam', 'spam', 'not spam', 'spam']\n", 264 | "\n", 265 | "# Convert emails to word count vectors\n", 266 | "vectorizer = CountVectorizer()\n", 267 | "email_vec = vectorizer.fit_transform(emails)\n", 268 | "\n", 269 | "# Split data into training and test sets\n", 270 | "X_train, X_test, y_train, y_test = train_test_split(email_vec, labels, test_size=0.2, random_state=1)\n", 271 | "\n", 272 | "# Train a Naive Bayes classifier\n", 273 | "nb = MultinomialNB()\n", 274 | "nb.fit(X_train, y_train)\n", 275 | "\n", 276 | "# Make predictions on the test set\n", 277 | "predictions = nb.predict(X_test)\n", 278 | "\n", 279 | "# Evaluate the model\n", 280 | "print(metrics.classification_report(y_test, predictions))" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "\n", 288 | "\n", 289 | "In this example, we first convert the emails into word count vectors using `CountVectorizer`. 
This transforms the text such that each email becomes a vector in a high-dimensional space, where each dimension corresponds to a unique word in all the emails.\n", 290 | "\n", 291 | "We then split the data into a training set and a test set using `train_test_split`.\n", 292 | "\n", 293 | "Next, we create a `MultinomialNB` object, which is a Naive Bayes classifier for multinomial models. We train this classifier on the training data using the `fit` method.\n", 294 | "\n", 295 | "Finally, we use the trained classifier to make predictions on the test set, and we evaluate the performance of the classifier using `classification_report`, which prints precision, recall, F1-score, and support for each class." 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": {}, 301 | "source": [ 302 | "# R Code Implementation" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "Here's an example of how you might use the Naive Bayes classifier in R using the `e1071` package. This example uses the built-in `iris` dataset.\n", 310 | "\n" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 1, 316 | "metadata": { 317 | "vscode": { 318 | "languageId": "r" 319 | } 320 | }, 321 | "outputs": [ 322 | { 323 | "name": "stderr", 324 | "output_type": "stream", 325 | "text": [ 326 | "Installing package into ‘/home/blackheart/R/x86_64-pc-linux-gnu-library/4.1’\n", 327 | "(as ‘lib’ is unspecified)\n", 328 | "\n" 329 | ] 330 | }, 331 | { 332 | "name": "stdout", 333 | "output_type": "stream", 334 | "text": [ 335 | " true\n", 336 | "pred setosa versicolor virginica\n", 337 | " setosa 14 0 0\n", 338 | " versicolor 0 18 0\n", 339 | " virginica 0 0 13\n" 340 | ] 341 | } 342 | ], 343 | "source": [ 344 | "# Load the necessary library\n", 345 | "install.packages(\"e1071\")\n", 346 | "library(e1071)\n", 347 | "\n", 348 | "# Load the iris dataset\n", 349 | "data(iris)\n", 350 | "\n", 351 | "# Split the data into training and test sets\n", 352 | "set.seed(123)\n", 353 | "train_indices <- sample(1:nrow(iris), nrow(iris)*0.7)\n", 354 | "train_data <- iris[train_indices, ]\n", 355 | "test_data <- iris[-train_indices, ]\n", 356 | "\n", 357 | "# Train a Naive Bayes classifier\n", 358 | "model <- naiveBayes(Species ~ ., data = train_data)\n", 359 | "\n", 360 | "# Make predictions on the test set\n", 361 | "predictions <- predict(model, test_data)\n", 362 | "\n", 363 | "# Print out the confusion matrix to see how well the model did\n", 364 | "print(table(pred = predictions, true = test_data$Species))" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": {}, 370 | "source": [ 371 | "\n", 372 | "\n", 373 | "In this example, we first load the `e1071` library, which provides the `naiveBayes` function. We then load the `iris` dataset and split it into a training set and a test set.\n", 374 | "\n", 375 | "Next, we train a Naive Bayes classifier on the training data using the `naiveBayes` function. The formula `Species ~ .` tells the function to use `Species` as the dependent variable and all other variables in the data frame as independent variables.\n", 376 | "\n", 377 | "Finally, we use the trained classifier to make predictions on the test set, and we print out a confusion matrix to see how well the model did. The confusion matrix shows the number of correct and incorrect predictions made by the classifier, broken down by each class." 
378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "# What are the different types of Naive Bayes classifiers?" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "There are several types of Naive Bayes classifiers, each suited to a different type of input data:\n", 392 | "\n", 393 | "1. **Gaussian Naive Bayes**: This is the most common type. It assumes that the data for each class is distributed according to a Gaussian (normal) distribution. It's often used when the features are continuous.\n", 394 | "\n", 395 | "2. **Multinomial Naive Bayes**: This is used when the data are discrete counts, such as the number of times a particular word appears in a document. It's often used in text classification problems.\n", 396 | "\n", 397 | "3. **Bernoulli Naive Bayes**: This is used when the features are binary (0/1). It's also often used in text classification, where the features might be whether or not a particular word appears in a document.\n", 398 | "\n", 399 | "4. **Complement Naive Bayes**: This is a variation of Multinomial Naive Bayes that is particularly suited for imbalanced data sets. Instead of modeling the data with respect to each class, it models the data with respect to all classes that are not in the current class.\n", 400 | "\n", 401 | "5. **Categorical Naive Bayes**: This is used for categorical data. Each feature is assumed to be a categorical variable.\n", 402 | "\n", 403 | "Each of these types makes a different assumption about the distribution of the data, and the best one to use depends on the specific problem and data set." 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "# What are the advantages and disadvantages of using the Naive Bayes classifier?" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": {}, 416 | "source": [ 417 | "Advantages of Naive Bayes Classifier:\n", 418 | "\n", 419 | "1. **Efficiency**: Naive Bayes requires a small amount of training data to estimate the necessary parameters. It's fast and easy to predict the class of the test dataset.\n", 420 | "\n", 421 | "2. **Easy to implement**: Naive Bayes is simple to understand and easy to implement. It's a good choice as a baseline model to compare with more complex models.\n", 422 | "\n", 423 | "3. **Works well with high dimensions**: Naive Bayes performs well when dealing with many input variables. It's often used for text classification where the number of input variables (words) can be large.\n", 424 | "\n", 425 | "4. **Handles continuous and discrete data**: Naive Bayes can handle both continuous and discrete data. Different types of Naive Bayes classifiers can be used depending on the distribution of the data (Gaussian, Multinomial, Bernoulli).\n", 426 | "\n", 427 | "Disadvantages of Naive Bayes Classifier:\n", 428 | "\n", 429 | "1. **Assumption of independent predictors**: Naive Bayes assumes that all features are independent. In real life, it's almost impossible that we get a set of predictors which are completely independent.\n", 430 | "\n", 431 | "2. **Zero Frequency**: If the category of any categorical variable is not observed in training data set, then the model will assign a zero probability to that category and then a prediction cannot be made. This is often known as “Zero Frequency”. To solve this, we can use the smoothing technique. 
One of the simplest smoothing techniques is called Laplace estimation.\n", 432 | "\n", 433 | "3. **Bad estimator**: While Naive Bayes is a good classifier, it is known to be a bad estimator. So the probability outputs from `predict_proba` are not to be taken too seriously.\n", 434 | "\n", 435 | "4. **Data scarcity**: For data with a categorical variable, the estimation of probabilities can be a problem if a category has not been observed in the training data set." 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "metadata": {}, 441 | "source": [ 442 | "# What are some common applications of the Naive Bayes classifier in real-world scenarios?" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "Naive Bayes classifiers have a wide range of applications due to their simplicity, efficiency, and relatively high accuracy. Here are some common real-world applications:\n", 450 | "\n", 451 | "1. **Email Spam Filtering**: Naive Bayes is one of the most popular algorithms for spam filtering. It classifies emails as 'spam' or 'not spam' by examining the frequency of certain words and phrases.\n", 452 | "\n", 453 | "2. **Sentiment Analysis**: Naive Bayes is often used in sentiment analysis to determine whether a given piece of text (like a product review or a tweet) is positive, negative, or neutral. It does this by looking at the words used in the text and their associated sentiments.\n", 454 | "\n", 455 | "3. **Document Categorization**: Naive Bayes can be used to categorize documents into different categories based on their content. For example, news articles might be categorized into 'sports', 'politics', 'entertainment', etc.\n", 456 | "\n", 457 | "4. **Healthcare**: In healthcare, Naive Bayes can be used to predict the likelihood of a patient having a particular disease based on their symptoms.\n", 458 | "\n", 459 | "5. **Recommendation Systems**: Naive Bayes can be used in recommendation systems to predict a user's interests and recommend products or services. For example, if a user often watches action movies, the system might recommend other action movies for them to watch.\n", 460 | "\n", 461 | "6. **Text Classification**: Naive Bayes is widely used in text classification, where the data are typically represented as word vector counts (although tf-idf vectors are also commonly used in text classification).\n", 462 | "\n", 463 | "7. **Fraud Detection**: In finance, Naive Bayes is used for credit scoring, predicting loan defaults, and fraud detection in transactions." 464 | ] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "# Tips to Improve the Power of the NB Model" 471 | ] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "metadata": {}, 476 | "source": [ 477 | "Here are some tips to improve the performance of a Naive Bayes model:\n", 478 | "\n", 479 | "1. **Feature Selection**: Naive Bayes assumes that all features are independent. If some features are dependent on each other, the prediction might be incorrect. So, it's important to select only the relevant features.\n", 480 | "\n", 481 | "2. **Avoid Zero Frequency**: If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero. To solve this, we can use a smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.\n", 482 | "\n", 483 | "3. 
**Tune the Model**: Use grid search or random search to find the optimal parameters for the Naive Bayes model. For example, in the case of the Gaussian Naive Bayes, you can adjust the `var_smoothing` parameter.\n", 484 | "\n", 485 | "4. **Preprocess Your Data**: Techniques such as removing outliers, filling missing values, and scaling can help improve the performance of a Naive Bayes model.\n", 486 | "\n", 487 | "5. **Use the Right Variant of Naive Bayes**: Different variants of Naive Bayes (like Gaussian, Multinomial, Bernoulli) are suitable for different types of data. Choose the one that's most appropriate for your data.\n", 488 | "\n", 489 | "6. **Ensemble Methods**: Combining the predictions of multiple different models can often result in better performance than any single model. You could consider using a Naive Bayes model as part of an ensemble.\n", 490 | "\n", 491 | "7. **Update Your Model**: Naive Bayes allows for incremental learning. As new data comes in, you can update your model's probabilities without having to retrain it from scratch.\n", 492 | "\n", 493 | "Remember, while Naive Bayes is a powerful tool, it's not always the best choice for every problem. Depending on the complexity and nature of your data, other models may yield better results." 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "# Reference\n", 501 | "\n", 502 | "1. [analyticsvidya.com](https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/)\n" 503 | ] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "metadata": {}, 508 | "source": [ 509 | "# **Thank You!**" 510 | ] 511 | } 512 | ], 513 | "metadata": { 514 | "kernelspec": { 515 | "display_name": "R", 516 | "language": "R", 517 | "name": "ir" 518 | }, 519 | "language_info": { 520 | "codemirror_mode": "r", 521 | "file_extension": ".r", 522 | "mimetype": "text/x-r-source", 523 | "name": "R", 524 | "pygments_lexer": "r", 525 | "version": "4.1.2" 526 | } 527 | }, 528 | "nbformat": 4, 529 | "nbformat_minor": 2 530 | } 531 | -------------------------------------------------------------------------------- /04_Machine-Learning/22_perceptron.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding: utf8 3 | 4 | 5 | """ IMPORTS """ 6 | import numpy as np 7 | import matplotlib.pyplot as plt 8 | 9 | 10 | class Perceptron(object): 11 | 12 | def __init__(self): 13 | 14 | self.l_rate = 0.01 15 | self.n_epoch = 5 16 | self.bias = 0 17 | self.input_data = [ 18 | [0.2, 0.7], 19 | [15.25, 14.37], 20 | [0.02, 0.68], 21 | [14.55, 16.36], 22 | [0.55, 0.36], 23 | [0.45, 0.16], 24 | [0.45, 0.26], 25 | [11.54, 17.226], 26 | [12.58, 17.36], 27 | [13.95, 15.26], 28 | ] 29 | self.expected = [0, 1, 0, 1, 0, 0, 0, 1, 1, 1] 30 | 31 | # ~ x = [i[0] for i in self.input_data] 32 | # ~ y = [i[1] for i in self.input_data] 33 | # ~ plt.scatter(x, y) 34 | # ~ plt.show() 35 | 36 | def predict(self, row, weights): 37 | 38 | # Activation threshold function 39 | activation = 0 40 | for i in range(len(row) - 1): 41 | activation += weights[i] * row[i] 42 | activation += self.bias 43 | # Return class 44 | if activation >= 0.0: 45 | return 1.0 46 | else: 47 | return 0.0 48 | 49 | # Estimate Perceptron weights using stochastic gradient descent 50 | def train_weights(self): 51 | 52 | # Let's initiate weights with small values 53 | weights = list(np.random.uniform(low=0, high=0.1, size=2)) 54 | global_errors = [] 55 | for epoch in range(self.n_epoch): 56 | # Keep track 
of errors for plotting 57 | epoch_errors = 0.0 58 | for row, expected in zip(self.input_data, self.expected): 59 | prediction = self.predict(row, weights) 60 | error = expected - prediction 61 | # We keep a track of global errors for plotting 62 | global_errors.append(abs(error)) 63 | epoch_errors += abs(error) 64 | # If predicted and expected are different, update weights and bias 65 | if expected != prediction: 66 | self.bias = self.bias + self.l_rate * error 67 | for i in range(len(row) - 1): 68 | weights[i] = weights[i] + self.l_rate * error * row[i] 69 | print(f"epoch: {epoch}, lrate: {self.l_rate}, errors: {epoch_errors}") 70 | 71 | plt.plot(global_errors) 72 | plt.ylim(-1, 2) 73 | plt.show() 74 | 75 | return weights 76 | 77 | 78 | if __name__ == "__main__": 79 | # Calculate weights 80 | 81 | neuron = Perceptron() 82 | 83 | weights = neuron.train_weights() 84 | 85 | print(neuron.predict([8.23, 9.45], weights)) 86 | print(neuron.predict([0.23, 1.45], weights)) 87 | -------------------------------------------------------------------------------- /04_Machine-Learning/Algorithms/images/CF.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MrMimic/data-scientist-roadmap/6e07cbd7a793d1bb010bbd3d34a0045a1ee03cdd/04_Machine-Learning/Algorithms/images/CF.png -------------------------------------------------------------------------------- /04_Machine-Learning/Algorithms/images/CFG.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MrMimic/data-scientist-roadmap/6e07cbd7a793d1bb010bbd3d34a0045a1ee03cdd/04_Machine-Learning/Algorithms/images/CFG.gif -------------------------------------------------------------------------------- /04_Machine-Learning/Algorithms/images/CH.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MrMimic/data-scientist-roadmap/6e07cbd7a793d1bb010bbd3d34a0045a1ee03cdd/04_Machine-Learning/Algorithms/images/CH.png -------------------------------------------------------------------------------- /04_Machine-Learning/Algorithms/images/DT.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MrMimic/data-scientist-roadmap/6e07cbd7a793d1bb010bbd3d34a0045a1ee03cdd/04_Machine-Learning/Algorithms/images/DT.png -------------------------------------------------------------------------------- /04_Machine-Learning/Algorithms/images/GD.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MrMimic/data-scientist-roadmap/6e07cbd7a793d1bb010bbd3d34a0045a1ee03cdd/04_Machine-Learning/Algorithms/images/GD.jpg -------------------------------------------------------------------------------- /04_Machine-Learning/Algorithms/images/GDD.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MrMimic/data-scientist-roadmap/6e07cbd7a793d1bb010bbd3d34a0045a1ee03cdd/04_Machine-Learning/Algorithms/images/GDD.jpg -------------------------------------------------------------------------------- /04_Machine-Learning/Algorithms/images/K.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MrMimic/data-scientist-roadmap/6e07cbd7a793d1bb010bbd3d34a0045a1ee03cdd/04_Machine-Learning/Algorithms/images/K.png 
-------------------------------------------------------------------------------- /04_Machine-Learning/Algorithms/images/LR.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MrMimic/data-scientist-roadmap/6e07cbd7a793d1bb010bbd3d34a0045a1ee03cdd/04_Machine-Learning/Algorithms/images/LR.jpg -------------------------------------------------------------------------------- /04_Machine-Learning/Algorithms/images/LR.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MrMimic/data-scientist-roadmap/6e07cbd7a793d1bb010bbd3d34a0045a1ee03cdd/04_Machine-Learning/Algorithms/images/LR.png -------------------------------------------------------------------------------- /04_Machine-Learning/Algorithms/images/NN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MrMimic/data-scientist-roadmap/6e07cbd7a793d1bb010bbd3d34a0045a1ee03cdd/04_Machine-Learning/Algorithms/images/NN.png -------------------------------------------------------------------------------- /04_Machine-Learning/Algorithms/images/SVM.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MrMimic/data-scientist-roadmap/6e07cbd7a793d1bb010bbd3d34a0045a1ee03cdd/04_Machine-Learning/Algorithms/images/SVM.jpg -------------------------------------------------------------------------------- /04_Machine-Learning/README.md: -------------------------------------------------------------------------------- 1 | # 4_ Machine learning 2 | 3 | ## 1_ What is ML ? 4 | 5 | ### Definition 6 | 7 | Machine Learning is part of the Artificial Intelligences study. It concerns the conception, devloppement and implementation of sophisticated methods, allowing a machine to achieve really hard tasks, nearly impossible to solve with classic algorithms. 8 | 9 | Machine learning mostly consists of three algorithms: 10 | 11 | ![ml](https://miro.medium.com/max/561/0*qlvUmkmkeefqe_Mk) 12 | 13 | ### Utilisation examples 14 | 15 | * Computer vision 16 | * Search engines 17 | * Financial analysis 18 | * Documents classification 19 | * Music generation 20 | * Robotics ... 21 | 22 | ## 2_ Numerical var 23 | 24 | Variables which can take continous integer or real values. They can take infinite values. 25 | 26 | These types of variables are mostly used for features which involves measurements. For example, hieghts of all students in a class. 27 | 28 | ## 3_ Categorical var 29 | 30 | Variables that take finite discrete values. They take a fixed set of values, in order to classify a data item. 31 | 32 | They act like assigned labels. For example: Labelling the students of a class according to gender: 'Male' and 'Female' 33 | 34 | ## 4_ Supervised learning 35 | 36 | Supervised learning is the machine learning task of inferring a function from __labeled training data__. 37 | 38 | The training data consist of a __set of training examples__. 39 | 40 | In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). 41 | 42 | A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. 43 | 44 | In other words: 45 | 46 | Supervised Learning learns from a set of labeled examples. 
From the instances and the labels, supervised learning models try to find the correlation among the features, used to describe an instance, and learn how each feature contributes to the label corresponding to an instance. On receiving an unseen instance, the goal of supervised learning is to label the instance based on its feature correctly. 47 | 48 | __An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances__. 49 | 50 | ## 5_ Unsupervised learning 51 | 52 | Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure __from "unlabeled" data__ (a classification or categorization is not included in the observations). 53 | 54 | Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm—which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning. 55 | 56 | Unsupervised learning deals with data instances only. This approach tries to group data and form clusters based on the similarity of features. If two instances have similar features and placed in close proximity in feature space, there are high chances the two instances will belong to the same cluster. On getting an unseen instance, the algorithm will try to find, to which cluster the instance should belong based on its feature. 57 | 58 | Resource: 59 | 60 | [Guide to unsupervised learning](https://towardsdatascience.com/a-dive-into-unsupervised-learning-bf1d6b5f02a7) 61 | 62 | ## 6_ Concepts, inputs and attributes 63 | 64 | A machine learning problem takes in the features of a dataset as input. 65 | 66 | For supervised learning, the model trains on the data and then it is ready to perform. So, for supervised learning, apart from the features we also need to input the corresponding labels of the data points to let the model train on them. 67 | 68 | For unsupervised learning, the models simply perform by just citing complex relations among data items and grouping them accordingly. So, unsupervised learning do not need a labelled dataset. The input is only the feature section of the dataset. 69 | 70 | ## 7_ Training and test data 71 | 72 | If we train a supervised machine learning model using a dataset, the model captures the dependencies of that particular data set very deeply. So, the model will always perform well on the data and it won't be proper measure of how well the model performs. 73 | 74 | To know how well the model performs, we must train and test the model on different datasets. The dataset we train the model on is called Training set, and the dataset we test the model on is called the test set. 75 | 76 | We normally split the provided dataset to create the training and test set. The ratio of splitting is majorly: 3:7 or 2:8 depending on the data, larger being the trining data. 77 | 78 | #### sklearn.model_selection.train_test_split is used for splitting the data 79 | 80 | Syntax: 81 | 82 | from sklearn.model_selection import train_test_split 83 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) 84 | 85 | [Sklearn docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) 86 | 87 | ## 8_ Classifiers 88 | 89 | Classification is the most important and most common machine learning problem. Classification problems can be both suprvised and unsupervised problems. 
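Putting the previous sections together, a typical supervised classification workflow in Python looks like the minimal sketch below (scikit-learn and its bundled iris dataset are assumed; the choice of model and every parameter value is illustrative only, not part of the original roadmap):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Load a small labelled dataset (features X, labels y)
    X, y = load_iris(return_X_y=True)

    # Hold out a test set to measure generalization, as described in section 7
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train the classifier on the training set only
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    # Label unseen instances and compare with the true labels
    print(accuracy_score(y_test, clf.predict(X_test)))

The same fit/predict pattern applies to all the classifiers discussed in this section and the following ones, which makes it easy to swap one model for another and compare the results.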
90 | 91 | The classification problems involve labelling data points to belong to a particular class based on the feature set corresponding to the particluar data point. 92 | 93 | Classification tasks can be performed using both machine learning and deep learning techniques. 94 | 95 | Machine learning classification techniques involve: Logistic Regressions, SVMs, and Classification trees. The models used to perform the classification are called classifiers. 96 | 97 | ## 9_ Prediction 98 | 99 | The output generated by a machine learning models for a particuolar problem is called its prediction. 100 | 101 | There are majorly two kinds of predictions corresponding to two types of problen: 102 | 103 | 1. Classification 104 | 105 | 2. Regression 106 | 107 | In classiication, the prediction is mostly a class or label, to which a data points belong 108 | 109 | In regression, the prediction is a number, a continous a numeric value, because regression problems deal with predicting the value. For example, predicting the price of a house. 110 | 111 | ## 10_ Lift 112 | 113 | ## 11_ Overfitting 114 | 115 | Often we train our model so much or make our model so complex that our model fits too tghtly with the training data. 116 | 117 | The training data often contains outliers or represents misleading patterns in the data. Fitting the training data with such irregularities to deeply cause the model to lose its generalization. The model performs very well on the training set but not so good on the test set. 118 | 119 | ![overfitting](https://hackernoon.com/hn-images/1*xWfbNW3arf39wxk4ZkI2Mw.png) 120 | 121 | As we can see on training further a point the training error decreases and testing error increases. 122 | 123 | A hypothesis h1 is said to overfit iff there exists another hypothesis h where h gives more error than h1 on training data and less error than h1 on the test data 124 | 125 | ## 12_ Bias & variance 126 | 127 | Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. Model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data. 128 | 129 | Variance is the variability of model prediction for a given data point or a value which tells us spread of our data. Model with high variance pays a lot of attention to training data and does not generalize on the data which it hasn’t seen before. As a result, such models perform very well on training data but has high error rates on test data. 130 | 131 | Basically High variance causes overfitting and high bias causes underfitting. We want our model to have low bias and low variance to perform perfectly. We need to avoid a model with higher variance and high bias 132 | 133 | ![bias&variance](https://community.alteryx.com/t5/image/serverpage/image-id/52874iE986B6E19F3248CF?v=1.0) 134 | 135 | We can see that for Low bias and Low Variance our model predicts all the data points correctly. Again in the last image having high bias and high variance the model predicts no data point correctly. 136 | 137 | ![B&v2](https://adolfoeliazat.com/wp-content/uploads/2020/07/Bias-Variance-tradeoff-in-Machine-Learning.png) 138 | 139 | We can see from the graph that rge Error increases when the complex is either too complex or the model is too simple. The bias increases with simpler model and Variance increases with complex models. 
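This behaviour is easy to reproduce empirically. The following minimal sketch (synthetic data, scikit-learn assumed; the polynomial degrees are illustrative) fits models of increasing complexity and compares training and test error:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # Noisy samples of a sine curve (synthetic data, purely illustrative)
    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Degree 1 underfits (high bias), a very high degree overfits (high variance)
    for degree in (1, 4, 15):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        train_err = mean_squared_error(y_train, model.predict(X_train))
        test_err = mean_squared_error(y_test, model.predict(X_test))
        print(f"degree={degree} train MSE={train_err:.3f} test MSE={test_err:.3f}")

Typically the low-degree model shows high error on both sets (underfitting, high bias), while the high-degree model drives the training error down but pushes the test error up (overfitting, high variance).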
140 | 141 | This is one of the most important tradeoffs in machine learning 142 | 143 | ## 13_ Tree and classification 144 | 145 | We have previously talked about classificaion. We have seen the most used methods are Logistic Regression, SVMs and decision trees. Now, if the decision boundary is linear the methods like logistic regression and SVM serves best, but its a complete scenerio when the decision boundary is non linear, this is where decision tree is used. 146 | 147 | ![tree](https://www.researchgate.net/profile/Zena_Hira/publication/279274803/figure/fig4/AS:324752402075653@1454438414424/Linear-versus-nonlinear-classification-problems.png) 148 | 149 | The first image shows linear decision boundary and second image shows non linear decision boundary. 150 | 151 | Ih the cases, for non linear boundaries, the decision trees condition based approach work very well for classification problems. The algorithm creates conditions on features to drive and reach a decision, so is independent of functions. 152 | 153 | ![tree2](https://databricks.com/wp-content/uploads/2014/09/decision-tree-example.png) 154 | 155 | Decision tree approach for classification 156 | 157 | ## 14_ Classification rate 158 | 159 | ## 15_ Decision tree 160 | 161 | Decision Trees are some of the most used machine learning algorithms. They are used for both classification and Regression. They can be used for both linear and non-linear data, but they are mostly used for non-linear data. Decision Trees as the name suggests works on a set of decisions derived from the data and its behavior. It does not use a linear classifier or regressor, so its performance is independent of the linear nature of the data. 162 | 163 | One of the other most important reasons to use tree models is that they are very easy to interpret. 164 | 165 | Decision Trees can be used for both classification and regression. The methodologies are a bit different, though principles are the same. The decision trees use the CART algorithm (Classification and Regression Trees) 166 | 167 | Resource: 168 | 169 | [Guide to Decision Tree](https://towardsdatascience.com/a-dive-into-decision-trees-a128923c9298) 170 | 171 | ## 16_ Boosting 172 | 173 | #### Ensemble Learning 174 | 175 | It is the method used to enhance the performance of the Machine learning models by combining several number of models or weak learners. They provide improved efficiency. 176 | 177 | There are two types of ensemble learning: 178 | 179 | __1. Parallel ensemble learning or bagging method__ 180 | 181 | __2. Sequential ensemble learning or boosting method__ 182 | 183 | In parallel method or bagging technique, several weak classifiers are created in parallel. The training datasets are created randomly on a bootstrapping basis from the original dataset. The datasets used for the training and creation phases are weak classifiers. Later during predictions, the reults from all the classifiers are bagged together to provide the final results. 184 | 185 | ![bag](https://miro.medium.com/max/850/1*_pfQ7Xf-BAwfQXtaBbNTEg.png) 186 | 187 | Ex: Random Forests 188 | 189 | In sequential learning or boosting weak learners are created one after another and the data sample set are weighted in such a manner that during creation, the next learner focuses on the samples that were wrongly predicted by the previous classifier. So, at each step, the classifier improves and learns from its previous mistakes or misclassifications. 
190 | 191 | ![boosting](https://www.kdnuggets.com/wp-content/uploads/Budzik-fig2-ensemble-learning.jpg) 192 | 193 | There are mostly three types of boosting algorithm: 194 | 195 | __1. Adaboost__ 196 | 197 | __2. Gradient Boosting__ 198 | 199 | __3. XGBoost__ 200 | 201 | __Adaboost__ algorithm works in the exact way describe. It creates a weak learner, also known as stumps, they are not full grown trees, but contain a single node based on which the classification is done. The misclassifications are observed and they are weighted more than the correctly classified ones while training the next weak learner. 202 | 203 | __sklearn.ensemble.AdaBoostClassifier__ is used for the application of the classifier on real data in python. 204 | 205 | ![adaboost](https://ars.els-cdn.com/content/image/3-s2.0-B9780128177365000090-f09-18-9780128177365.jpg) 206 | 207 | Reources: 208 | 209 | [Understanding](https://blog.paperspace.com/adaboost-optimizer/#:~:text=AdaBoost%20is%20an%20ensemble%20learning,turn%20them%20into%20strong%20ones.) 210 | 211 | __Gradient Boosting__ algorithm starts with a node giving 0.5 as output for both classification and regression. It serves as the first stump or weak learner. We then observe the Errors in predictions. Now, we create other learners or decision trees to actually predict the errors based on the conditions. The errors are called Residuals. Our final output is: 212 | 213 | __0.5 (Provided by the first learner) + The error provided by the second tree or learner.__ 214 | 215 | Now, if we use this method, it learns the predictions too tightly, and loses generalization. In order to avoid that gradient boosting uses a learning parameter _alpha_. 216 | 217 | So, the final results after two learners is obtained as: 218 | 219 | __0.5 (Provided by the first learner) + _alpha_ X (The error provided by the second tree or learner.)__ 220 | 221 | We can see that using the added portion we take a small leap towards the correct results. We continue adding learners until the point we are very close to the actual value given by the training set. 222 | 223 | Overall the equation becomes: 224 | 225 | __0.5 (Provided by the first learner) + _alpha_ X (The error provided by the second tree or learner.)+ _alpha_ X (The error provided by the third tree or learner.)+.............__ 226 | 227 | __sklearn.ensemble.GradientBoostingClassifier__ used to apply gradient boosting in python 228 | 229 | ![GBM](https://www.elasticfeed.com/wp-content/uploads/09cc1168a39db0c0d6ea1c66d27ecfd3.jpg) 230 | 231 | Resource: 232 | 233 | [Guide](https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d) 234 | 235 | ## 17_ Naïves Bayes classifiers 236 | 237 | The Naive Bayes classifiers are a collection of classification algorithms based on __Bayes’ Theorem.__ 238 | 239 | Bayes theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. It is given by: 240 | 241 | ![bayes](https://wikimedia.org/api/rest_v1/media/math/render/svg/87c061fe1c7430a5201eef3fa50f9d00eac78810) 242 | 243 | Where P(A|B) is the probabaility of occurrence of A knowing B already occurred and P(B|A) is the probability of occurrence of B knowing A occurred. 244 | 245 | [Scikit-learn Guide](https://github.com/abr-98/data-scientist-roadmap/edit/master/04_Machine-Learning/README.md) 246 | 247 | There are mostly two types of Naive Bayes: 248 | 249 | __1. Gaussian Naive Bayes__ 250 | 251 | __2. 
Multinomial Naive Bayes.__ 252 | 253 | #### Multinomial Naive Bayes 254 | 255 | The method is used mostly for document classification. For example, classifying an article as sports article or say film magazine. It is also used for differentiating actual mails from spam mails. It uses the frequency of words used in different magazine to make a decision. 256 | 257 | For example, the word "Dear" and "friends" are used a lot in actual mails and "offer" and "money" are used a lot in "Spam" mails. It calculates the prorbability of the occurrence of the words in case of actual mails and spam mails using the training examples. So, the probability of occurrence of "money" is much higher in case of spam mails and so on. 258 | 259 | Now, we calculate the probability of a mail being a spam mail using the occurrence of words in it. 260 | 261 | #### Gaussian Naive Bayes 262 | 263 | When the predictors take up a continuous value and are not discrete, we assume that these values are sampled from a gaussian distribution. 264 | 265 | ![gnb](https://miro.medium.com/max/422/1*AYsUOvPkgxe3j1tEj2lQbg.gif) 266 | 267 | It links guassian distribution and Bayes theorem. 268 | 269 | Resources: 270 | 271 | [GUIDE](https://youtu.be/H3EjCKtlVog) 272 | 273 | ## 18_ K-Nearest neighbor 274 | 275 | K-nearest neighbour algorithm is the most basic and still essential algorithm. It is a memory based approach and not a model based one. 276 | 277 | KNN is used in both supervised and unsupervised learning. It simply locates the data points across the feature space and used distance as a similarity metrics. 278 | 279 | Lesser the distance between two data points, more similar the points are. 280 | 281 | In K-NN classification algorithm, the point to classify is plotted on the feature space and classified as the class of its nearest K-neighbours. K is the user parameter. It gives the measure of how many points we should consider while deciding the label of the point concerned. If K is more than 1 we consider the label that is in majority. 282 | 283 | If the dataset is very large, we can use a large k. The large k is less effected by noise and generates smooth boundaries. For small dataset, a small k must be used. A small k helps to notice the variation in boundaries better. 284 | 285 | ![knn](https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/46117/versions/4/screenshot.jpg) 286 | 287 | Resource: 288 | 289 | [GUIDE](https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761) 290 | 291 | ## 19_ Logistic regression 292 | 293 | Regression is one of the most important concepts used in machine learning. 294 | 295 | [Guide to regression](https://towardsdatascience.com/a-deep-dive-into-the-concept-of-regression-fb912d427a2e) 296 | 297 | Logistic Regression is the most used classification algorithm for linearly seperable datapoints. Logistic Regression is used when the dependent variable is categorical. 298 | 299 | It uses the linear regression equation: 300 | 301 | __Y= w1x1+w2x2+w3x3……..wkxk__ 302 | 303 | in a modified format: 304 | 305 | __Y= 1/ 1+e^-(w1x1+w2x2+w3x3……..wkxk)__ 306 | 307 | This modification ensures the value always stays between 0 and 1. Thus, making it feasible to be used for classification. 308 | 309 | The above equation is called __Sigmoid__ function. The function looks like: 310 | 311 | ![Logreg](https://miro.medium.com/max/700/1*HXCBO-Wx5XhuY_OwMl0Phw.png) 312 | 313 | The loss fucnction used is called logloss or binary cross-entropy. 
314 | 315 | __Loss= —Y_actual. log(h(x)) —(1 — Y_actual.log(1 — h(x)))__ 316 | 317 | If Y_actual=1, the first part gives the error, else the second part. 318 | 319 | ![loss](https://miro.medium.com/max/700/1*GZiV3ph20z0N9QSwQTHKqg.png) 320 | 321 | Logistic Regression is used for multiclass classification also. It uses softmax regresssion or One-vs-all logistic regression. 322 | 323 | [Guide to logistic Regression](https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc) 324 | 325 | __sklearn.linear_model.LogisticRegression__ is used to apply logistic Regression in python. 326 | 327 | ## 20_ Ranking 328 | 329 | ## 21_ Linear regression 330 | 331 | Regression tasks deal with predicting the value of a dependent variable from a set of independent variables i.e, the provided features. Say, we want to predict the price of a car. So, it becomes a dependent variable say Y, and the features like engine capacity, top speed, class, and company become the independent variables, which helps to frame the equation to obtain the price. 332 | 333 | Now, if there is one feature say x. If the dependent variable y is linearly dependent on x, then it can be given by y=mx+c, where the m is the coefficient of the feature in the equation, c is the intercept or bias. Both M and C are the model parameters. 334 | 335 | We use a loss function or cost function called Mean Square error of (MSE). It is given by the square of the difference between the actual and the predicted value of the dependent variable. 336 | 337 | __MSE=1/2m * (Y_actual — Y_pred)²__ 338 | 339 | If we observe the function we will see its a parabola, i.e, the function is convex in nature. This convex function is the principle used in Gradient Descent to obtain the value of the model parameters 340 | 341 | ![loss](https://miro.medium.com/max/2238/1*Xgk6XI4kEcSmDaEAxqB1CA.png) 342 | 343 | The image shows the loss function. 344 | 345 | To get the correct estimate of the model parameters we use the method of __Gradient Descent__ 346 | 347 | [Guide to Gradient Descent](https://towardsdatascience.com/an-introduction-to-gradient-descent-and-backpropagation-81648bdb19b2) 348 | 349 | [Guide to linear Regression](https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86) 350 | 351 | __sklearn.linear_model.LinearRegression__ is used to apply linear regression in python 352 | 353 | ## 22_ Perceptron 354 | 355 | The perceptron has been the first model described in the 50ies. 356 | 357 | This is a __binary classifier__, ie it can't separate more than 2 groups, and thoses groups have to be __linearly separable__. 358 | 359 | The perceptron __works like a biological neuron__. It calculate an activation value, and if this value if positive, it returns 1, 0 otherwise. 360 | 361 | ## 23_ Hierarchical clustering 362 | 363 | The hierarchical algorithms are so-called because they create tree-like structures to create clusters. These algorithms also use a distance-based approach for cluster creation. 364 | 365 | The most popular algorithms are: 366 | 367 | __Agglomerative Hierarchical clustering__ 368 | 369 | __Divisive Hierarchical clustering__ 370 | 371 | __Agglomerative Hierarchical clustering__: In this type of hierarchical clustering, each point initially starts as a cluster, and slowly the nearest or similar most clusters merge to create one cluster. 372 | 373 | __Divisive Hierarchical Clustering__: The type of hierarchical clustering is just the opposite of Agglomerative clustering. 
In this type, all the points start as one large cluster and slowly the clusters get divided into smaller clusters based on how large the distance or less similarity is between the two clusters. We keep on dividing the clusters until all the points become individual clusters. 374 | 375 | For agglomerative clustering, we keep on merging the clusters which are nearest or have a high similarity score to one cluster. So, if we define a cut-off or threshold score for the merging we will get multiple clusters instead of a single one. For instance, if we say the threshold similarity metrics score is 0.5, it means the algorithm will stop merging the clusters if no two clusters are found with a similarity score less than 0.5, and the number of clusters present at that step will give the final number of clusters that need to be created to the clusters. 376 | 377 | Similarly, for divisive clustering, we divide the clusters based on the least similarity scores. So, if we define a score of 0.5, it will stop dividing or splitting if the similarity score between two clusters is less than or equal to 0.5. We will be left with a number of clusters and it won’t reduce to every point of the distribution. 378 | 379 | The process is as shown below: 380 | 381 | ![HC](https://miro.medium.com/max/1000/1*4GRJvFaRdapnF3K4yH97DA.png) 382 | 383 | One of the most used methods for the measuring distance and applying cutoff is the dendrogram method. 384 | 385 | The dendogram for above clustering is: 386 | 387 | ![Dend](https://miro.medium.com/max/700/1*3TV7NtpSSFoqeX-p9wr1xw.png) 388 | 389 | [Guide](https://towardsdatascience.com/understanding-the-concept-of-hierarchical-clustering-technique-c6e8243758ec) 390 | 391 | ## 24_ K-means clustering 392 | 393 | The algorithm initially creates K clusters randomly using N data points and finds the mean of all the point values in a cluster for each cluster. So, for each cluster we find a central point or centroid calculating the mean of the values of the cluster. Then the algorithm calculates the sum of squared error (SSE) for each cluster. SSE is used to measure the quality of clusters. If a cluster has large distances between the points and the center, then the SSE will be high and if we check the interpretation it allows only points in the close vicinity to create clusters. 394 | 395 | The algorithm works on the principle that the points lying close to a center of a cluster should be in that cluster. So, if a point x is closer to the center of cluster A than cluster B, then x will belong to cluster A. Thus a point enters a cluster and as even a single point moves from one cluster to another, the centroid changes and so does the SSE. We keep doing this until the SSE decreases and the centroid does not change anymore. After a certain number of shifts, the optimal clusters are found and the shifting stops as the centroids don’t change any more. 396 | 397 | The initial number of clusters ‘K’ is a user parameter. 398 | 399 | The image shows the method 400 | 401 | ![Kmeans](https://miro.medium.com/max/1000/1*lZdpqQxhcGyqztp_mvXi4w.png) 402 | 403 | We have seen that for this type of clustering technique we need a user-defined parameter ‘K’ which defines the number of clusters that need to be created. Now, this is a very important parameter. To, find this parameter a number of methods are used. The most important and used method is the elbow method. 404 | For smaller datasets, k=(N/2)^(1/2) or the square root of half of the number of points in the distribution. 
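A minimal sketch of the elbow method with scikit-learn is shown below (the synthetic dataset and every parameter value are purely illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic data with 4 "true" clusters (illustrative only)
    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

    # Elbow method: inspect the SSE (exposed as inertia_) for several values of K
    for k in range(1, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        print(k, km.inertia_)

The SSE keeps decreasing as K grows, but the decrease flattens out once K passes the true number of clusters; that bend in the curve is the "elbow" used to choose K.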
405 | 406 | [Guide](https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1) 407 | 408 | ## 25_ Neural networks 409 | 410 | Neural Networks are a set of interconnected layers of artificial neurons or nodes. They are frameworks that are modeled keeping in mind, the structure and working of the human brain. They are meant for predictive modeling and applications where they can be trained via a dataset. They are based on self-learning algorithms and predict based on conclusions and complex relations derived from their training sets of information. 411 | 412 | A typical Neural Network has a number of layers. The First Layer is called the Input Layer and The Last layer is called the Output Layer. The layers between the Input and Output layers are called Hidden Layers. It basically functions like a Black Box for prediction and classification. All the layers are interconnected and consist of numerous artificial neurons called Nodes. 413 | 414 | [Guide to nueral Networks](https://medium.com/ai-in-plain-english/neural-networks-overview-e6ea484a474e) 415 | 416 | Neural networks are too complex to work on Gradient Descent algorithms, so it works on the principles of Backproapagations and Optimizers. 417 | 418 | [Guide to Backpropagation](https://towardsdatascience.com/an-introduction-to-gradient-descent-and-backpropagation-81648bdb19b2) 419 | 420 | [Guide to optimizers](https://towardsdatascience.com/introduction-to-gradient-descent-weight-initiation-and-optimizers-ee9ae212723f) 421 | 422 | ## 26_ Sentiment analysis 423 | 424 | Text Classification and sentiment analysis is a very common machine learning problem and is used in a lot of activities like product predictions, movie recommendations, and several others. 425 | 426 | Text classification problems like sentimental analysis can be achieved in a number of ways using a number of algorithms. These are majorly divided into two main categories: 427 | 428 | A bag of Word model: In this case, all the sentences in our dataset are tokenized to form a bag of words that denotes our vocabulary. Now each individual sentence or sample in our dataset is represented by that bag of words vector. This vector is called the feature vector. For example, ‘It is a sunny day’, and ‘The Sun rises in east’ are two sentences. The bag of words would be all the words in both the sentences uniquely. 429 | 430 | The second method is based on a time series approach: Here each word is represented by an Individual vector. So, a sentence is represented as a vector of vectors. 431 | 432 | [Guide to sentimental analysis](https://towardsdatascience.com/a-guide-to-text-classification-and-sentiment-analysis-2ab021796317) 433 | 434 | ## 27_ Collaborative filtering 435 | 436 | We all have used services like Netflix, Amazon, and Youtube. These services use very sophisticated systems to recommend the best items to their users to make their experiences great. 437 | 438 | Recommenders mostly have 3 components mainly, out of which, one of the main component is Candidate generation. This method is responsible for generating smaller subsets of candidates to recommend to a user, given a huge pool of thousands of items. 439 | 440 | Types of Candidate Generation Systems: 441 | 442 | __Content-based filtering System__ 443 | 444 | __Collaborative filtering System__ 445 | 446 | __Content-based filtering system__: Content-Based recommender system tries to guess the features or behavior of a user given the item’s features, he/she reacts positively to. 
447 | 448 | __Collaborative filtering System__: Collaborative does not need the features of the items to be given. Every user and item is described by a feature vector or embedding. 449 | 450 | It creates embedding for both users and items on its own. It embeds both users and items in the same embedding space. 451 | 452 | It considers other users’ reactions while recommending a particular user. It notes which items a particular user likes and also the items that the users with behavior and likings like him/her likes, to recommend items to that user. 453 | 454 | It collects user feedbacks on different items and uses them for recommendations. 455 | 456 | [Guide to collaborative filtering](https://towardsdatascience.com/introduction-to-recommender-systems-1-971bd274f421) 457 | 458 | ## 28_ Tagging 459 | 460 | ## 29_ Support Vector Machine 461 | 462 | Support vector machines are used for both Classification and Regressions. 463 | 464 | SVM uses a margin around its classifier or regressor. The margin provides an extra robustness and accuracy to the model and its performance. 465 | 466 | ![SVM](https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/SVM_margin.png/300px-SVM_margin.png) 467 | 468 | The above image describes a SVM classifier. The Red line is the actual classifier and the dotted lines show the boundary. The points that lie on the boundary actually decide the Margins. They support the classifier margins, so they are called __Support Vectors__. 469 | 470 | The distance between the classifier and the nearest points is called __Marginal Distance__. 471 | 472 | There can be several classifiers possible but we choose the one with the maximum marginal distance. So, the marginal distance and the support vectors help to choose the best classifier. 473 | 474 | [Official Documentation from Sklearn](https://scikit-learn.org/stable/modules/svm.html) 475 | 476 | [Guide to SVM](https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47) 477 | 478 | ## 30_Reinforcement Learning 479 | 480 | “Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward.” 481 | 482 | To play a game, we need to make multiple choices and predictions during the course of the game to achieve success, so they can be called a multiple decision processes. This is where we need a type of algorithm called reinforcement learning algorithms. The class of algorithm is based on decision-making chains which let such algorithms to support multiple decision processes. 483 | 484 | The reinforcement algorithm can be used to reach a goal state from a starting state making decisions accordingly. 485 | 486 | The reinforcement learning involves an agent which learns on its own. If it makes a correct or good move that takes it towards the goal, it is positively rewarded, else not. This way the agent learns. 487 | 488 | ![reinforced](https://miro.medium.com/max/539/0*4d9KHTzW6xrWTBld) 489 | 490 | The above image shows reinforcement learning setup. 491 | 492 | [WIKI](https://en.wikipedia.org/wiki/Reinforcement_learning#:~:text=Reinforcement%20learning%20(RL)%20is%20an,supervised%20learning%20and%20unsupervised%20learning.) 
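As a concrete, self-contained illustration of this reward-driven loop, here is a toy Q-learning sketch (Q-learning is one standard reinforcement learning algorithm; the corridor environment, the state/action encoding and every parameter value below are invented purely for the example). The agent must learn to walk right along a short corridor to reach the goal state:

    import numpy as np

    # Toy corridor: 6 states in a row, start in state 0, goal in state 5
    # Actions: 0 = step left, 1 = step right
    n_states, n_actions, goal = 6, 2, 5
    Q = np.zeros((n_states, n_actions))          # action-value table
    alpha, gamma, epsilon = 0.1, 0.9, 0.2        # learning rate, discount, exploration rate

    rng = np.random.default_rng(0)
    for episode in range(500):
        state = 0
        while state != goal:
            # Epsilon-greedy choice; ties are broken randomly so the agent explores early on
            if rng.random() < epsilon or Q[state, 0] == Q[state, 1]:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state = max(0, state - 1) if action == 0 else min(goal, state + 1)
            reward = 1.0 if next_state == goal else 0.0   # positive reward only at the goal
            # Q-learning update: move the estimate toward reward + discounted best future value
            Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
            state = next_state

    # The learned policy should be "step right" (action 1) in every state before the goal
    print(np.argmax(Q, axis=1))

Good moves (those that eventually lead to the goal) accumulate higher Q-values, so over many episodes the agent's greedy policy converges toward always stepping right, exactly the positively rewarded behaviour described above.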
493 | -------------------------------------------------------------------------------- /05_Text-Mining-NLP/README.md: -------------------------------------------------------------------------------- 1 | # 5_ Text Mining 2 | 3 | Text mining is the process of deriving high-quality information from text. 4 | 5 | ## 1_ Corpus 6 | 7 | A Corpus is a large and structured set of texts. 8 | 9 | ## 2_ Named Entity Recognition 10 | 11 | Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities in text into predefined categories. 12 | 13 | ## 3_ Text Analysis 14 | 15 | Text Analysis is the automated process of understanding and sorting unstructured text, making it easier to manage. 16 | 17 | ## 4_ UIMA 18 | 19 | Unstructured Information Management applications (UIMA) are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. 20 | 21 | ## 5_ Term Document matrix 22 | 23 | A term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. 24 | 25 | ## 6_ Term frequency and Weight 26 | 27 | Term frequency is the number of times a term appears in a document. The weight of a term often reflects its importance to a document. 28 | 29 | ## 7_ Support Vector Machines (SVM) 30 | 31 | Support Vector Machines (SVM) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. 32 | 33 | ## 8_ Association rules 34 | 35 | Association rules analysis is a technique to uncover how items are associated to each other. 36 | 37 | ## 9_ Market based analysis 38 | 39 | Market-based analysis is a method of analyzing market data to help strategize on a company's future market trends. 40 | 41 | ## 10_ Feature extraction 42 | 43 | Feature extraction starts from an initial set of measured data and builds derived values intended to be informative and non-redundant. 44 | 45 | ## 11_ Using mahout 46 | 47 | Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms. 48 | 49 | ## 12_ Using Weka 50 | 51 | Weka is a collection of machine learning algorithms for data mining tasks. 52 | 53 | ## 13_ Using NLTK 54 | 55 | NLTK, or Natural Language Toolkit, is a platform used for building Python programs to work with human language data. 56 | 57 | ## 14_ Classify text 58 | 59 | Text classification is the process of assigning tags or categories to text according to its content. 60 | 61 | ## 15_ Vocabulary mapping 62 | 63 | Vocabulary mapping is a process to equate vocabularies from different systems or vocabularies within the same system. 64 | -------------------------------------------------------------------------------- /06_Data-Visualization/1_data-exploration.R: -------------------------------------------------------------------------------- 1 | ##################### 2 | # To execute line by line in Rstudio, select it (hightlight) 3 | # Press Ctrl+Enter 4 | 5 | # Iris is an array of values examples coming with R. 
6 | data <- iris 7 | # This is equal to : 8 | data = iris 9 | # To print it: Sepal.Length Sepal.Width Petal.Length Petal.Width Species 10 | show(data) 11 | 12 | # Histogram 13 | column = data[,1] 14 | hist(column) 15 | # Change parameters : 16 | hist(column, main = "Main title", xlab = "SEPAL LENGTH", ylab = "FREQUENCY", col = 'red', breaks = 10) 17 | hist(column, main = "Main title", xlab = "SEPAL LENGTH", ylab = "FREQUENCY", col = 'red', breaks = 15) 18 | 19 | # Box plot 20 | boxplot(column, main = "Main title", ylab = "SEPAL LENGTH", col = 'red') 21 | 22 | # Line chart, not very useful here, indeed 23 | X = data[,1] 24 | Y = data[,3] 25 | plot(x = X, y = Y, main = "Main title", xlab = "SEPAL LENGTH", ylab = "PETAL LENGTH", col = 'red') 26 | -------------------------------------------------------------------------------- /06_Data-Visualization/4_histogram-pie.R: -------------------------------------------------------------------------------- 1 | # Two plot on the same window 2 | par(mfrow = c(1,2)) 3 | # Histogram 4 | data <- iris 5 | hist(data[,2], main = "histogram about sepal width", xlab = "sepal width", ylab = "Frequency") 6 | # Pie chart 7 | classes <- summary(data[,5]) 8 | pie(classes, main = "Iris species") 9 | -------------------------------------------------------------------------------- /06_Data-Visualization/README.md: -------------------------------------------------------------------------------- 1 | # 6_ Data Visualization 2 | 3 | Open .R scripts in Rstudio for line-by-line execution. 4 | 5 | See [10_ Toolbox/3_ R, Rstudio, Rattle](https://github.com/MrMimic/data-scientist-roadmap/tree/master/10_Toolbox#3_-r-rstudio-rattle) for installation. 6 | 7 | ## 1_ Data exploration in R 8 | 9 | In mathematics, the graph of a function f is the collection of all ordered pairs (x, f(x)). If the function input x is a scalar, the graph is a two-dimensional graph, and for a continuous function is a curve. If the function input x is an ordered pair (x1, x2) of real numbers, the graph is the collection of all ordered triples (x1, x2, f(x1, x2)), and for a continuous function is a surface. 10 | 11 | ## 2_ Uni, bi and multivariate viz 12 | 13 | ### Univariate 14 | 15 | The term is commonly used in statistics to distinguish a distribution of one variable from a distribution of several variables, although it can be applied in other ways as well. For example, univariate data are composed of a single scalar component. In time series analysis, the term is applied with a whole time series as the object referred to: thus a univariate time series refers to the set of values over time of a single quantity. 16 | 17 | ### Bivariate 18 | 19 | Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis.[1] It involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them. 20 | 21 | ### Multivariate 22 | 23 | Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest. 24 | 25 | ## 3_ ggplot2 26 | 27 | ### About 28 | 29 | ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. 
It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics. 30 | 31 | [http://ggplot2.org/](http://ggplot2.org/) 32 | 33 | ### Documentation 34 | 35 | ### Examples 36 | 37 | [http://r4stats.com/examples/graphics-ggplot2/](http://r4stats.com/examples/graphics-ggplot2/) 38 | 39 | ## 4_ Histogram and pie (Uni) 40 | 41 | ### About 42 | 43 | Histograms and pie are 2 types of graphes used to visualize frequencies. 44 | 45 | Histogram is showing the distribution of these frequencies over classes, and pie the relative proportion of this frequencies in a 100% circle. 46 | 47 | ## 5_ Tree & tree map 48 | 49 | ### About 50 | 51 | [Treemaps](https://en.wikipedia.org/wiki/Treemapping) display hierarchical (tree-structured) data as a set of nested rectangles. 52 | Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. 53 | A leaf node’s rectangle has an area proportional to a specified dimension of the data. 54 | Often the leaf nodes are colored to show a separate dimension of the data. 55 | 56 | ### When to use it ? 57 | 58 | - Less than 10 branches. 59 | - Positive values. 60 | - Space for visualisation is limited. 61 | 62 | ### Example 63 | 64 | ![treemap-example](https://jingwen-z.github.io/images/20181030-treemap.png) 65 | 66 | This treemap describes volume for each product universe with corresponding surface. Liquid products are more sold than others. 67 | If you want to explore more, we can go into products “liquid” and find which shelves are prefered by clients. 68 | 69 | ### More information 70 | 71 | [Matplotlib Series 5: Treemap](https://jingwen-z.github.io/data-viz-with-matplotlib-series5-treemap/) 72 | 73 | ## 6_ Scatter plot 74 | 75 | ### About 76 | 77 | A [scatter plot](https://en.wikipedia.org/wiki/Scatter_plot) (also called a scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. 78 | 79 | ### When to use it ? 80 | 81 | Scatter plots are used when you want to show the relationship between two variables. 82 | Scatter plots are sometimes called correlation plots because they show how two variables are correlated. 83 | 84 | ### Example 85 | 86 | ![scatter-plot-example](https://jingwen-z.github.io/images/20181025-pos-scatter-plot.png) 87 | 88 | This plot describes the positive relation between store’s surface and its turnover(k euros), which is reasonable: for stores, the larger it is, more clients it can accept, more turnover it will generate. 89 | 90 | ### More information 91 | 92 | [Matplotlib Series 4: Scatter plot](https://jingwen-z.github.io/data-viz-with-matplotlib-series4-scatter-plot/) 93 | 94 | ## 7_ Line chart 95 | 96 | ### About 97 | 98 | A [line chart](https://en.wikipedia.org/wiki/Line_chart) or line graph is a type of chart which displays information as a series of data points called ‘markers’ connected by straight line segments. A line chart is often used to visualize a trend in data over intervals of time – a time series – thus the line is often drawn chronologically. 99 | 100 | ### When to use it ? 101 | 102 | - Track changes over time. 103 | - X-axis displays continuous variables. 104 | - Y-axis displays measurement. 
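For instance, a minimal matplotlib sketch of such a time-based line chart (the monthly figures below are synthetic and purely illustrative; the README's own examples come from the Matplotlib Series posts linked in each section):

    import matplotlib.pyplot as plt

    # Synthetic monthly turnover (k euros) for an ice-cream shop, illustrative only
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
    turnover = [8, 9, 12, 18, 26, 35, 41, 39, 25, 15, 10, 8]

    plt.plot(months, turnover, marker="o")
    plt.xlabel("Month")                 # ordered/continuous variable on the x-axis
    plt.ylabel("Turnover (k euros)")    # measurement on the y-axis
    plt.title("Ice-cream sales over one year")
    plt.show()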
105 | 106 | ### Example 107 | 108 | ![line-chart-example](https://jingwen-z.github.io/images/20180916-line-chart.png) 109 | 110 | Suppose that the plot above describes the turnover (k euros) of ice-cream sales over one year. 111 | According to the plot, we can clearly see that sales peak in summer, then fall from autumn to winter, which makes sense. 112 | 113 | ### More information 114 | 115 | [Matplotlib Series 2: Line chart](https://jingwen-z.github.io/data-viz-with-matplotlib-series2-line-chart/) 116 | 117 | ## 8_ Spatial charts 118 | 119 | ## 9_ Survey plot 120 | 121 | ## 10_ Timeline 122 | 123 | ## 11_ Decision tree 124 | 125 | ## 12_ D3.js 126 | 127 | ### About 128 | 129 | This is a JavaScript library that lets you easily create a huge number of different figures. 130 | 131 | 132 | 133 | D3.js is a JavaScript library for manipulating documents based on data. 134 | D3 helps you bring data to life using HTML, SVG, and CSS. 135 | D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation. 136 | 137 | ### Examples 138 | 139 | There are many examples of charts built with D3.js on [D3's Github](https://github.com/d3/d3/wiki/Gallery). 140 | 141 | ## 13_ InfoVis 142 | 143 | ## 14_ IBM ManyEyes 144 | 145 | ## 15_ Tableau 146 | 147 | ## 16_ Venn diagram 148 | 149 | ### About 150 | 151 | A [Venn diagram](https://en.wikipedia.org/wiki/Venn_diagram) (also called primary diagram, set diagram or logic diagram) is a diagram that shows all possible logical relations between a finite collection of different sets. 152 | 153 | ### When to use it ? 154 | 155 | Show logical relations between different groups (intersection, difference, union). 156 | 157 | ### Example 158 | 159 | ![venn-diagram-example](https://jingwen-z.github.io/images/20181106-venn2.png) 160 | 161 | This kind of Venn diagram is commonly used in retail analysis. 162 | Suppose that we want to study the popularity of cheese and red wine, and that 2500 clients answered our questionnaire. 163 | According to the diagram above, we find that among the 2500 clients, 900 clients (36%) prefer cheese, 1200 clients (48%) prefer red wine, and 400 clients (16%) favor both products. 164 | 165 | ### More information 166 | 167 | [Matplotlib Series 6: Venn diagram](https://jingwen-z.github.io/data-viz-with-matplotlib-series6-venn-diagram/) 168 | 169 | ## 17_ Area chart 170 | 171 | ### About 172 | 173 | An [area chart](https://en.wikipedia.org/wiki/Area_chart) or area graph displays quantitative data graphically. 174 | It is based on the line chart. The area between the axis and the line is commonly emphasized with colors, textures and hatchings. 175 | 176 | ### When to use it ? 177 | 178 | Show or compare a quantitative progression over time. 179 | 180 | ### Example 181 | 182 | ![area-chart-example](https://jingwen-z.github.io/images/20181114-stacked-area-chart.png) 183 | 184 | This stacked area chart displays the changes in each account’s amount over time, as well as each account’s contribution to the total amount (in terms of value).
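### Code sketch

A stacked area chart like the one above can be drawn with matplotlib's `stackplot`. The account values below are invented for the example, and matplotlib is assumed to be installed (it is not declared in this repository's dependencies).

```python
# Stacked area chart: each layer is one account, the total height is the overall amount.
# All amounts below are made up for the illustration.
import matplotlib.pyplot as plt

months = list(range(1, 7))
checking = [5, 6, 6, 7, 8, 9]        # k euros per month
savings = [10, 10, 11, 12, 12, 13]
investments = [2, 3, 3, 4, 5, 5]

plt.stackplot(months, checking, savings, investments,
              labels=["Checking", "Savings", "Investments"])
plt.legend(loc="upper left")
plt.title("Amount per account over time (illustrative data)")
plt.xlabel("Month")
plt.ylabel("Amount (k euros)")
plt.show()
```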
185 | 186 | ### More information 187 | 188 | [Matplotlib Series 7: Area chart](https://jingwen-z.github.io/data-viz-with-matplotlib-series7-area-chart/) 189 | 190 | ## 18_ Radar chart 191 | 192 | ### About 193 | 194 | The [radar chart](https://en.wikipedia.org/wiki/Radar_chart) is a chart and/or plot that consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. The data length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. A line is drawn connecting the data values for each spoke. This gives the plot a star-like appearance and the origin of one of the popular names for this plot. 195 | 196 | ### When to use it ? 197 | 198 | - Comparing two or more items or groups on various features or characteristics. 199 | - Examining the relative values for a single data point. 200 | - Displaying less than ten factors on one radar chart. 201 | 202 | ### Example 203 | 204 | ![radar-chart-example](https://jingwen-z.github.io/images/20181121-multi-radar-chart.png) 205 | 206 | This radar chart displays the preferences of 2 clients out of 4. 207 | Client c1 favors chicken and bread, and doesn’t like cheese that much. 208 | Client c2, on the other hand, prefers cheese to the other 4 products and doesn’t like beer. 209 | We could interview these 2 clients to understand the weaknesses of the products they do not prefer. 210 | 211 | ### More information 212 | 213 | [Matplotlib Series 8: Radar chart](https://jingwen-z.github.io/data-viz-with-matplotlib-series8-radar-chart/) 214 | 215 | ## 19_ Word cloud 216 | 217 | ### About 218 | 219 | A [word cloud](https://en.wikipedia.org/wiki/Tag_cloud) (tag cloud, or weighted list in visual design) is a novelty visual representation of text data. Tags are usually single words, and the importance of each tag is shown with font size or color. This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence. 220 | 221 | ### When to use it ? 222 | 223 | - Depicting keyword metadata (tags) on websites. 224 | - Delighting and providing an emotional connection. 225 | 226 | ### Example 227 | 228 | ![word-cloud-example](https://jingwen-z.github.io/images/20181127-basic-word-cloud.png) 229 | 230 | According to this word cloud, we can see at a glance that data science employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science. It can be used for business analysis, and has been called “The Sexiest Job of the 21st Century”. 231 | 232 | ### More information 233 | 234 | [Matplotlib Series 9: Word cloud](https://jingwen-z.github.io/data-viz-with-matplotlib-series9-word-cloud/) 235 | -------------------------------------------------------------------------------- /07_Big-Data/README.md: -------------------------------------------------------------------------------- 1 | # 7_ Big Data 2 | 3 | Big Data refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations. It involves the storage, processing, and analysis of data that is too complex or large for traditional data processing tools. 4 | 5 | ## 1_ Map Reduce fundamentals 6 | 7 | MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
It consists of a Map procedure that performs filtering and sorting, and a Reduce method that performs a summary operation. 8 | 9 | ## 2_ Hadoop Ecosystem 10 | 11 | The Hadoop Ecosystem refers to the various components of the Apache Hadoop software library, as well as the accessories and tools provided by the Apache Software Foundation for cloud computing and big data processing. These include the Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce. 12 | 13 | ## 3_ HDFS 14 | 15 | HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. It creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. 16 | 17 | ## 4_ Data replications Principles 18 | 19 | Data replication is the process of storing data in more than one site or node to improve the availability of data. It is a key factor in improving the reliability, speed, and accessibility of data in distributed systems. 20 | 21 | ## 5_ Setup Hadoop 22 | 23 | This section covers the steps and requirements for setting up a Hadoop environment. This includes installing the Hadoop software, configuring the system, and setting up the necessary environments for data processing. 24 | 25 | ## 6_ Name & data nodes 26 | 27 | In HDFS, the NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. DataNodes are responsible for serving read and write requests from the file system's clients, as well as block creation, deletion, and replication upon instruction from the NameNode. 28 | 29 | ## 7_ Job & task tracker 30 | 31 | JobTracker and TaskTracker are two essential services or daemons provided by Hadoop for submitting and tracking MapReduce jobs. JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster. TaskTracker is a node in the cluster that accepts tasks from the JobTracker and reports back the status of the task. 32 | 33 | ## 8_ M/R/SAS programming 34 | 35 | This section covers programming with MapReduce and SAS (Statistical Analysis System). MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster, while SAS is a software suite developed for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics. 36 | 37 | ## 9_ Sqoop: Loading data in HDFS 38 | 39 | Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. 40 | 41 | ## 10_ Flume, Scribe 42 | 43 | Flume and Scribe are services for efficiently collecting, aggregating, and moving large amounts of log data. They are used for continuous data/log streaming and are suitable for data collection. 44 | 45 | ## 11_ SQL with Pig 46 | 47 | Pig is a high-level platform for creating MapReduce programs used with Hadoop. It is designed to process any kind of data (structured or unstructured) and it provides a high-level language known as Pig Latin, which is SQL-like and easy to learn. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Flink. 
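Pig Latin scripts are ultimately compiled into MapReduce (or Tez/Flink) jobs. To make the underlying map/shuffle/reduce idea concrete, here is a toy, single-machine Python word count; it is purely conceptual and is not how Pig or Hadoop are actually invoked.

```python
# Toy illustration of the MapReduce model on a single machine:
# map emits (key, value) pairs, shuffle groups them by key, reduce aggregates each group.
from collections import defaultdict

documents = ["pig runs on hadoop", "hive runs on hadoop too"]

# Map phase: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate (here, sum) the values of each group.
word_counts = {key: sum(values) for key, values in groups.items()}
print(word_counts)  # {'pig': 1, 'runs': 2, 'on': 2, 'hadoop': 2, 'hive': 1, 'too': 1}
```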
48 | 49 | ## 12_ DWH with Hive 50 | 51 | Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. 52 | 53 | ## 13_ Scribe, Chukwa for Weblog 54 | 55 | Scribe and Chukwa are tools used for collecting, aggregating, and analyzing weblogs. Scribe is a server for aggregating log data streamed in real time from many servers. Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. 56 | 57 | ## 14_ Using Mahout 58 | 59 | Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms. Mahout supports mainly three use cases: collaborative filtering, clustering and classification. 60 | 61 | ## 15_ Zookeeper Avro 62 | 63 | ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Avro is a data serialization system that provides rich data structures and a compact, fast, binary data format. 64 | 65 | ## 16_ Lambda Architecture 66 | 67 | Lambda Architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. It provides a robust system that is fault-tolerant against hardware failures and human mistakes. 68 | 69 | ## 17_ Storm: Hadoop Realtime 70 | 71 | Storm is a free and open source distributed realtime computation system. It makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. It's simple, can be used with any programming language, and is a lot of fun to use! 72 | 73 | ## 18_ Rhadoop, RHIPE 74 | 75 | Rhadoop and RHIPE are R packages that provide a set of tools for data analysis with Hadoop. 76 | 77 | ## 19_ RMR 78 | 79 | RMR (Rhipe MapReduce) is a package that provides Hadoop MapReduce functionality in R. 80 | 81 | ## 20_ NoSQL Databases (MongoDB, Neo4j) 82 | 83 | NoSQL databases are non-tabular, and store data differently than relational tables. MongoDB is a source-available cross-platform document-oriented database program. Neo4j is a graph database management system. 84 | 85 | ## 21_ Distributed Databases and Systems (Cassandra) 86 | 87 | Distributed databases and systems are databases in which storage devices are not all attached to a common processor. Cassandra is a free and open-source, distributed, wide column store, NoSQL database management system. 88 | -------------------------------------------------------------------------------- /08_Data-Ingestion/README.md: -------------------------------------------------------------------------------- 1 | # 8_ Data Ingestion 2 | 3 | Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. 4 | 5 | ## 1_ Summary of data formats 6 | 7 | This section provides an overview of various data formats like CSV, JSON, XML, etc., and their characteristics. 8 | 9 | ## 2_ Data discovery 10 | 11 | Data discovery is the process of collecting data from different sources by performing exploratory data analysis. 12 | 13 | ## 3_ Data sources & Acquisition 14 | 15 | This section discusses various data sources and the methods of acquiring data from these sources. 
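As a minimal illustration of acquiring data from two common kinds of sources, the standard-library Python sketch below reads a local CSV file and fetches a JSON document over HTTP. The file name and URL are placeholders for the example and would be replaced by real sources.

```python
# Acquire data from two typical sources: a local CSV file and a JSON payload over HTTP.
# Both the file path and the URL are placeholders for illustration purposes.
import csv
import json
from urllib.request import urlopen

# Flat-file source: read rows from a local CSV into a list of dictionaries.
with open("data.csv", newline="") as f:
    rows = list(csv.DictReader(f))
print(f"Read {len(rows)} rows from CSV")

# Web source: fetch and parse a JSON document from an HTTP endpoint.
with urlopen("https://example.com/api/data.json") as response:
    payload = json.load(response)
print(f"Fetched a JSON payload of type {type(payload).__name__}")
```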
16 | 17 | ## 4_ Data integration 18 | 19 | Data integration involves combining data from different sources and providing users with a unified view of the data. 20 | 21 | ## 5_ Data fusion 22 | 23 | Data fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source. 24 | 25 | ## 6_ Transformation & enrichment 26 | 27 | Transformation involves converting the data from one format or structure into another. Enrichment refers to enhancing data with relevant information that could make the data more useful. 28 | 29 | ## 7_ Data survey 30 | 31 | Data survey involves collecting data by asking people questions and recording their answers. 32 | 33 | ## 8_ Google OpenRefine 34 | 35 | Google OpenRefine is a tool for working with messy data, cleaning it, transforming it from one format into another, and extending it with web services and external data. 36 | 37 | ## 9_ How much data ? 38 | 39 | This section discusses the considerations and strategies for determining the amount of data needed for specific purposes. 40 | 41 | ## 10_ Using ETL 42 | 43 | ETL stands for Extract, Transform, Load. It's a process that extracts data from source systems, transforms the information into a consistent data type, then loads the data into a single repository. 44 | -------------------------------------------------------------------------------- /09_Data-Munging/README.md: -------------------------------------------------------------------------------- 1 | # 9_ Data Munging 2 | 3 | Data Munging, also known as data wrangling, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes. 4 | 5 | ## 1_ Dimensionality and Numerical reduction 6 | 7 | Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. Numerical reduction is the process of reducing the amount of data by using techniques like binning, sampling, etc. 8 | 9 | ## 2_ Normalization 10 | 11 | Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. 12 | 13 | ## 3_ Data scrubbing 14 | 15 | Data scrubbing, also known as data cleansing, is the process of cleaning data by correcting or removing errors in datasets. 16 | 17 | ## 4_ Handling missing Values 18 | 19 | Handling missing values involves using techniques to either replace missing entries or fill in the gaps using statistical analysis. 20 | 21 | ## 5_ Unbiased estimators 22 | 23 | An unbiased estimator is a statistic whose mean (expectation) is equal to the parameter it is estimating. 24 | 25 | ## 6_ Binning Sparse Values 26 | 27 | Binning sparse values involves grouping values into bins to handle sparse data or to reduce noise. 28 | 29 | ## 7_ Feature extraction 30 | 31 | Feature extraction starts from an initial set of measured data and builds derived values intended to be informative and non-redundant. 32 | 33 | ## 8_ Denoising 34 | 35 | Denoising is the process of removing noise from a signal. 36 | 37 | ## 9_ Sampling 38 | 39 | Sampling is the process of selecting a subset of individuals from a statistical population to estimate characteristics of the whole population.
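Below is a minimal Python sketch of simple random sampling, using only the standard library; the population values are invented for the example.

```python
# Simple random sampling: draw a subset without replacement and compare its mean
# to the population mean. The population values are generated for the illustration.
import random
import statistics

random.seed(42)  # make the example reproducible
population = [random.gauss(50, 10) for _ in range(10_000)]

sample = random.sample(population, k=200)  # 200 units drawn without replacement

print("Population mean:", round(statistics.mean(population), 2))
print("Sample mean:    ", round(statistics.mean(sample), 2))
```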
40 | 41 | ## 10_ Stratified sampling 42 | 43 | Stratified sampling is a method of sampling from a population which can be partitioned into subpopulations. 44 | 45 | ## 11_ PCA 46 | 47 | Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. 48 | -------------------------------------------------------------------------------- /10_Toolbox/README.md: -------------------------------------------------------------------------------- 1 | # 10_ Toolbox 2 | 3 | ## 1_ MS Excel with Analysis toolpack 4 | 5 | Microsoft Excel is a spreadsheet program included in the Microsoft Office suite of applications. The Analysis ToolPak is an Excel add-in program that provides data analysis tools for financial, statistical and engineering data analysis. 6 | 7 | ## 2_ Java, Python 8 | 9 | Java and Python are high-level programming languages. Java is a general-purpose programming language that is class-based, object-oriented, and designed to have as few implementation dependencies as possible. Python is an interpreted, high-level, general-purpose programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. 10 | 11 | ## 3_ R, Rstudio, Rattle 12 | 13 | R is a programming language and free software environment for statistical computing and graphics. RStudio is an integrated development environment for R, a programming language for statistical computing and graphics. Rattle is a graphical data mining application built upon the open source statistical language R. 14 | 15 | ## 4_ Weka, Knime, RapidMiner 16 | 17 | Weka, Knime, and RapidMiner are data mining tools. Weka is a collection of machine learning algorithms for data mining tasks. KNIME is a free and open-source data analytics, reporting and integration platform. RapidMiner is a data science software platform developed by the company of the same name. 18 | 19 | ## 5_ Hadoop dist of choice 20 | 21 | Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. 22 | 23 | ## 6_ Spark, Storm 24 | 25 | Spark and Storm are big data processing tools. Apache Spark is an open-source distributed general-purpose cluster-computing framework. Apache Storm is a free and open source distributed realtime computation system. 26 | 27 | ## 7_ Flume, Scribe, Chukwa 28 | 29 | Flume, Scribe, and Chukwa are tools for managing large amounts of log data. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Scribe is a server for aggregating log data streamed in real time from many servers. Apache Chukwa is an open source data collection system for monitoring large distributed systems. 30 | 31 | ## 8_ Nutch, Talend, Scraperwiki 32 | 33 | Nutch, Talend, and Scraperwiki are data scraping tools. Apache Nutch is a highly extensible and scalable open source web crawler software project. Talend is an open source software integration platform/vendor. ScraperWiki is a web-based platform for collaboratively building programs to process and analyze data. 
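To give a feel for the basic scraping task that these platforms industrialize, here is a small Python sketch using the third-party `requests` and `beautifulsoup4` packages; neither is declared in this repository's dependencies, and the URL is a placeholder.

```python
# Minimal scraping sketch: download a page and list the link targets it contains.
# Requires the third-party packages `requests` and `beautifulsoup4`; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/")
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]

print(f"Found {len(links)} links")
for href in links[:10]:  # show at most the first ten
    print(href)
```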
34 | 35 | ## 9_ Webscraper, Flume, Sqoop 36 | 37 | Webscraper, Flume, and Sqoop are tools for data ingestion. Web Scraper is a tool for extracting information from websites. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. 38 | 39 | ## 10_ tm, RWeka, NLTK 40 | 41 | tm, RWeka, and NLTK are text mining tools. tm is a text mining framework for R. RWeka provides an R interface to Weka. NLTK is a leading platform for building Python programs to work with human language data. 42 | 43 | ## 11_ RHIPE 44 | 45 | RHIPE is an R library that provides a way to use Hadoop's map-reduce functionality with R. 46 | 47 | ## 12_ D3.js, ggplot2, Shiny 48 | 49 | D3.js, ggplot2, and Shiny are visualization tools. D3.js is a JavaScript library for producing dynamic, interactive data visualizations in web browsers. ggplot2 is a data visualization package for the statistical programming language R. Shiny is an R package that makes it easy to build interactive web apps straight from R. 50 | 51 | ## 13_ IBM Languageware 52 | 53 | IBM LanguageWare is a technology that helps you to understand, analyze and interpret the content of your text. 54 | 55 | ## 14_ Cassandra, MongoDB 56 | 57 | Cassandra and MongoDB are NoSQL databases. Apache Cassandra is a free and open-source, distributed, wide column store, NoSQL database management system. MongoDB is a source-available cross-platform document-oriented database program. 58 | 59 | ## 15_ Microsoft Azure, AWS, Google Cloud 60 | 61 | Microsoft Azure, AWS, and Google Cloud are cloud computing services. They provide a range of cloud services, including those for computing, analytics, storage and networking. Users can pick and choose from these services to develop and scale new applications, or run existing applications, in the public cloud. 62 | 63 | ## 16_ Microsoft Cognitive API 64 | 65 | Microsoft Cognitive Services (formerly Project Oxford) are a set of APIs, SDKs and services available to developers to make their applications more intelligent, engaging and discoverable. 66 | 67 | ## 17_ Tensorflow 68 | 69 | 70 | 71 | TensorFlow is an open source software library for numerical computation using data flow graphs. 72 | 73 | Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. 74 | 75 | The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. 76 | 77 | TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well. 78 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # data-scientist-roadmap 2 | 3 | I just found this data science skills roadmap, drawn by [Swami Chandrasekaran](http://nirvacana.com/thoughts/becoming-a-data-scientist/) on his cool blog.
4 | 5 | **** 6 | 7 | ![roadmap-picture](http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png) 8 | 9 | **** 10 | 11 | Jobs linked to __data science__ are becoming __more and more popular__. A __bunch of tutorials__ could easily complete this roadmap, helping whoever wants to __start learning stuff about data science__. 12 | 13 | For the moment, a lot is __got on wikipedia or generated by LLMs__ (except for codes, always handmade). Any help's thus welcome! 14 | 15 | ## Run the examples 16 | 17 | Install Poetry 18 | 19 | ```bash 20 | curl -sSL https://install.python-poetry.org | python3 - 21 | ``` 22 | 23 | Install dependencies 24 | 25 | ```bash 26 | poetry install 27 | ``` 28 | 29 | 30 | ## Rules 31 | 32 | * __Feel free to fork this repository and pull requests__. 33 | * Always comment your code. 34 | * Please respect topology for filenames. 35 | * There's one README for each directory. 36 | * Also, could be great to share useful links or resources in README files. 37 | -------------------------------------------------------------------------------- /poetry.lock: -------------------------------------------------------------------------------- 1 | # This file is automatically @generated by Poetry 1.8.3 and should not be changed by hand. 2 | 3 | [[package]] 4 | name = "black" 5 | version = "24.8.0" 6 | description = "The uncompromising code formatter." 7 | optional = false 8 | python-versions = ">=3.8" 9 | files = [ 10 | {file = "black-24.8.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:09cdeb74d494ec023ded657f7092ba518e8cf78fa8386155e4a03fdcc44679e6"}, 11 | {file = "black-24.8.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:81c6742da39f33b08e791da38410f32e27d632260e599df7245cccee2064afeb"}, 12 | {file = "black-24.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:707a1ca89221bc8a1a64fb5e15ef39cd755633daa672a9db7498d1c19de66a42"}, 13 | {file = "black-24.8.0-cp310-cp310-win_amd64.whl", hash = "sha256:d6417535d99c37cee4091a2f24eb2b6d5ec42b144d50f1f2e436d9fe1916fe1a"}, 14 | {file = "black-24.8.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:fb6e2c0b86bbd43dee042e48059c9ad7830abd5c94b0bc518c0eeec57c3eddc1"}, 15 | {file = "black-24.8.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:837fd281f1908d0076844bc2b801ad2d369c78c45cf800cad7b61686051041af"}, 16 | {file = "black-24.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:62e8730977f0b77998029da7971fa896ceefa2c4c4933fcd593fa599ecbf97a4"}, 17 | {file = "black-24.8.0-cp311-cp311-win_amd64.whl", hash = "sha256:72901b4913cbac8972ad911dc4098d5753704d1f3c56e44ae8dce99eecb0e3af"}, 18 | {file = "black-24.8.0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:7c046c1d1eeb7aea9335da62472481d3bbf3fd986e093cffd35f4385c94ae368"}, 19 | {file = "black-24.8.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:649f6d84ccbae73ab767e206772cc2d7a393a001070a4c814a546afd0d423aed"}, 20 | {file = "black-24.8.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:2b59b250fdba5f9a9cd9d0ece6e6d993d91ce877d121d161e4698af3eb9c1018"}, 21 | {file = "black-24.8.0-cp312-cp312-win_amd64.whl", hash = "sha256:6e55d30d44bed36593c3163b9bc63bf58b3b30e4611e4d88a0c3c239930ed5b2"}, 22 | {file = "black-24.8.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:505289f17ceda596658ae81b61ebbe2d9b25aa78067035184ed0a9d855d18afd"}, 23 | {file = "black-24.8.0-cp38-cp38-macosx_11_0_arm64.whl", hash = 
"sha256:b19c9ad992c7883ad84c9b22aaa73562a16b819c1d8db7a1a1a49fb7ec13c7d2"}, 24 | {file = "black-24.8.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1f13f7f386f86f8121d76599114bb8c17b69d962137fc70efe56137727c7047e"}, 25 | {file = "black-24.8.0-cp38-cp38-win_amd64.whl", hash = "sha256:f490dbd59680d809ca31efdae20e634f3fae27fba3ce0ba3208333b713bc3920"}, 26 | {file = "black-24.8.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:eab4dd44ce80dea27dc69db40dab62d4ca96112f87996bca68cd75639aeb2e4c"}, 27 | {file = "black-24.8.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:3c4285573d4897a7610054af5a890bde7c65cb466040c5f0c8b732812d7f0e5e"}, 28 | {file = "black-24.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:9e84e33b37be070ba135176c123ae52a51f82306def9f7d063ee302ecab2cf47"}, 29 | {file = "black-24.8.0-cp39-cp39-win_amd64.whl", hash = "sha256:73bbf84ed136e45d451a260c6b73ed674652f90a2b3211d6a35e78054563a9bb"}, 30 | {file = "black-24.8.0-py3-none-any.whl", hash = "sha256:972085c618ee94f402da1af548a4f218c754ea7e5dc70acb168bfaca4c2542ed"}, 31 | {file = "black-24.8.0.tar.gz", hash = "sha256:2500945420b6784c38b9ee885af039f5e7471ef284ab03fa35ecdde4688cd83f"}, 32 | ] 33 | 34 | [package.dependencies] 35 | click = ">=8.0.0" 36 | mypy-extensions = ">=0.4.3" 37 | packaging = ">=22.0" 38 | pathspec = ">=0.9.0" 39 | platformdirs = ">=2" 40 | tomli = {version = ">=1.1.0", markers = "python_version < \"3.11\""} 41 | typing-extensions = {version = ">=4.0.1", markers = "python_version < \"3.11\""} 42 | 43 | [package.extras] 44 | colorama = ["colorama (>=0.4.3)"] 45 | d = ["aiohttp (>=3.7.4)", "aiohttp (>=3.7.4,!=3.9.0)"] 46 | jupyter = ["ipython (>=7.8.0)", "tokenize-rt (>=3.2.0)"] 47 | uvloop = ["uvloop (>=0.15.2)"] 48 | 49 | [[package]] 50 | name = "click" 51 | version = "8.1.7" 52 | description = "Composable command line interface toolkit" 53 | optional = false 54 | python-versions = ">=3.7" 55 | files = [ 56 | {file = "click-8.1.7-py3-none-any.whl", hash = "sha256:ae74fb96c20a0277a1d615f1e4d73c8414f5a98db8b799a7931d1582f3390c28"}, 57 | {file = "click-8.1.7.tar.gz", hash = "sha256:ca9853ad459e787e2192211578cc907e7594e294c7ccc834310722b41b9ca6de"}, 58 | ] 59 | 60 | [package.dependencies] 61 | colorama = {version = "*", markers = "platform_system == \"Windows\""} 62 | 63 | [[package]] 64 | name = "colorama" 65 | version = "0.4.6" 66 | description = "Cross-platform colored terminal text." 67 | optional = false 68 | python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,>=2.7" 69 | files = [ 70 | {file = "colorama-0.4.6-py2.py3-none-any.whl", hash = "sha256:4f1d9991f5acc0ca119f9d443620b77f9d6b33703e51011c16baf57afb285fc6"}, 71 | {file = "colorama-0.4.6.tar.gz", hash = "sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44"}, 72 | ] 73 | 74 | [[package]] 75 | name = "mypy-extensions" 76 | version = "1.0.0" 77 | description = "Type system extensions for programs checked with the mypy type checker." 
78 | optional = false 79 | python-versions = ">=3.5" 80 | files = [ 81 | {file = "mypy_extensions-1.0.0-py3-none-any.whl", hash = "sha256:4392f6c0eb8a5668a69e23d168ffa70f0be9ccfd32b5cc2d26a34ae5b844552d"}, 82 | {file = "mypy_extensions-1.0.0.tar.gz", hash = "sha256:75dbf8955dc00442a438fc4d0666508a9a97b6bd41aa2f0ffe9d2f2725af0782"}, 83 | ] 84 | 85 | [[package]] 86 | name = "packaging" 87 | version = "24.1" 88 | description = "Core utilities for Python packages" 89 | optional = false 90 | python-versions = ">=3.8" 91 | files = [ 92 | {file = "packaging-24.1-py3-none-any.whl", hash = "sha256:5b8f2217dbdbd2f7f384c41c628544e6d52f2d0f53c6d0c3ea61aa5d1d7ff124"}, 93 | {file = "packaging-24.1.tar.gz", hash = "sha256:026ed72c8ed3fcce5bf8950572258698927fd1dbda10a5e981cdf0ac37f4f002"}, 94 | ] 95 | 96 | [[package]] 97 | name = "pathspec" 98 | version = "0.12.1" 99 | description = "Utility library for gitignore style pattern matching of file paths." 100 | optional = false 101 | python-versions = ">=3.8" 102 | files = [ 103 | {file = "pathspec-0.12.1-py3-none-any.whl", hash = "sha256:a0d503e138a4c123b27490a4f7beda6a01c6f288df0e4a8b79c7eb0dc7b4cc08"}, 104 | {file = "pathspec-0.12.1.tar.gz", hash = "sha256:a482d51503a1ab33b1c67a6c3813a26953dbdc71c31dacaef9a838c4e29f5712"}, 105 | ] 106 | 107 | [[package]] 108 | name = "platformdirs" 109 | version = "4.2.2" 110 | description = "A small Python package for determining appropriate platform-specific dirs, e.g. a `user data dir`." 111 | optional = false 112 | python-versions = ">=3.8" 113 | files = [ 114 | {file = "platformdirs-4.2.2-py3-none-any.whl", hash = "sha256:2d7a1657e36a80ea911db832a8a6ece5ee53d8de21edd5cc5879af6530b1bfee"}, 115 | {file = "platformdirs-4.2.2.tar.gz", hash = "sha256:38b7b51f512eed9e84a22788b4bce1de17c0adb134d6becb09836e37d8654cd3"}, 116 | ] 117 | 118 | [package.extras] 119 | docs = ["furo (>=2023.9.10)", "proselint (>=0.13)", "sphinx (>=7.2.6)", "sphinx-autodoc-typehints (>=1.25.2)"] 120 | test = ["appdirs (==1.4.4)", "covdefaults (>=2.3)", "pytest (>=7.4.3)", "pytest-cov (>=4.1)", "pytest-mock (>=3.12)"] 121 | type = ["mypy (>=1.8)"] 122 | 123 | [[package]] 124 | name = "tomli" 125 | version = "2.0.1" 126 | description = "A lil' TOML parser" 127 | optional = false 128 | python-versions = ">=3.7" 129 | files = [ 130 | {file = "tomli-2.0.1-py3-none-any.whl", hash = "sha256:939de3e7a6161af0c887ef91b7d41a53e7c5a1ca976325f429cb46ea9bc30ecc"}, 131 | {file = "tomli-2.0.1.tar.gz", hash = "sha256:de526c12914f0c550d15924c62d72abc48d6fe7364aa87328337a31007fe8a4f"}, 132 | ] 133 | 134 | [[package]] 135 | name = "typing-extensions" 136 | version = "4.12.2" 137 | description = "Backported and Experimental Type Hints for Python 3.8+" 138 | optional = false 139 | python-versions = ">=3.8" 140 | files = [ 141 | {file = "typing_extensions-4.12.2-py3-none-any.whl", hash = "sha256:04e5ca0351e0f3f85c6853954072df659d0d13fac324d0072316b67d7794700d"}, 142 | {file = "typing_extensions-4.12.2.tar.gz", hash = "sha256:1a7ead55c7e559dd4dee8856e3a88b41225abfe1ce8df57b7c13915fe121ffb8"}, 143 | ] 144 | 145 | [metadata] 146 | lock-version = "2.0" 147 | python-versions = "^3.10" 148 | content-hash = "8045e7ae99a3bc9b7b6be08bda4b9f44915b13db7e97c971e6af2ab1d8a93008" 149 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | package-mode = false 3 | description = "Some tutorial coming with the data science roadmap." 
4 | authors = ["Emeric "] 5 | readme = "README.md" 6 | 7 | [tool.poetry.dependencies] 8 | python = "^3.10" 9 | 10 | 11 | [tool.poetry.group.dev.dependencies] 12 | black = "^24.8.0" 13 | 14 | [build-system] 15 | requires = ["poetry-core"] 16 | build-backend = "poetry.core.masonry.api" 17 | --------------------------------------------------------------------------------