└── README.md

/README.md:
--------------------------------------------------------------------------------
# **Machine Learning with Python**

This repository contains Data Science projects in the Python programming language that I completed for self-learning and demonstration purposes.

All the projects are done in Jupyter Notebooks (Notebook Server 5.6.0). The server runs on Python 3.7.0.

===============================================================================

## **Libraries required**

The following libraries are required to run the projects.

- Python 3.6+
- NumPy (for linear algebra)
- Pandas (for data preprocessing)
- Scikit-learn (for machine learning)
- Matplotlib (for data visualization)
- Seaborn (for statistical data visualization)
- SciPy (for scientific computing)
- Statsmodels (for statistical computation)

===============================================================================

The project descriptions are given in this readme document. The projects are divided into the categories listed below:

## Contents

- ### Supervised Learning : Regression Projects

  * [Simple Linear Regression Project](https://github.com/pb111/Simple-Linear-Regression-Project/blob/master/SLRProject.ipynb): A Simple Linear Regression model of the linear relationship between Sales and Advertising in a dataset for a dietary weight control product.

  * [Multiple Linear Regression Project](https://github.com/pb111/Multiple-Linear-Regression-Project/blob/master/Multiple%20Linear%20Regression%20using%20Scikit-Learn.ipynb): In this project, I build a Multiple Linear Regression model to estimate the relative CPU performance in a computer hardware dataset. I discuss the linear regression assumptions and various tools to evaluate model performance.

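
As a rough illustration of the kind of model the regression projects build, the following minimal sketch fits a simple linear regression with Scikit-learn. The data is synthetic and generated only for this example (it is not the dataset used in the project), and the names `advertising` and `sales` are placeholders chosen to echo the project description.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (an assumption for illustration, not the project's dataset):
# sales depend linearly on advertising spend, plus some Gaussian noise.
rng = np.random.default_rng(42)
advertising = rng.uniform(10, 100, size=50).reshape(-1, 1)  # 2-D, as scikit-learn expects
sales = 3.0 * advertising.ravel() + 20.0 + rng.normal(0, 5, size=50)

# Fit sales = slope * advertising + intercept by ordinary least squares.
model = LinearRegression()
model.fit(advertising, sales)

slope = model.coef_[0]
intercept = model.intercept_

# R^2 on the training data: the share of variance explained by the fitted line.
r2 = r2_score(sales, model.predict(advertising))

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.3f}")
```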

===============================================================================

- ### Supervised Learning : Classification Projects

  * [Logistic Regression Project](https://github.com/pb111/Logistic-Regression-in-Python-Project/blob/master/Logistic%20Regression%20with%20Python%20and%20Scikit-Learn.ipynb): In this project, I train a binary Logistic Regression classifier to predict whether or not it will rain tomorrow in Australia. I have used the **Rain in Australia** dataset from the Kaggle website. I have demonstrated feature engineering techniques along with **Recursive Feature Elimination with Cross-validation**, **k-fold Cross Validation** and **GridSearch CV** in this project.

  * [Support Vector Machines Project](https://github.com/pb111/Support-Vector-Machines-Project/blob/master/Support%20Vector%20Machines%20with%20Python%20and%20Scikit-Learn.ipynb): In this project, I build a Support Vector Machines classifier to classify a Pulsar star. I have used the **Predicting a Pulsar Star** dataset from the Kaggle website. I have discussed the **kernel trick** and used the **Stratified Cross-Validation** technique along with **GridSearch CV** in this project.

  * [k Nearest Neighbours Project](https://github.com/pb111/k-Nearest-Neighbours-Project/blob/master/k%20Nearest%20Neighbours%20with%20Python%20and%20Scikit-Learn.ipynb): k Nearest Neighbours is one of the simplest machine learning algorithms. In this project, I build a kNN classifier to classify patients suffering from Breast Cancer. I have used the **Breast Cancer Wisconsin (Original) Data Set** from the UCI Machine Learning Repository.

  * [Naive Bayes Classification Project](https://github.com/pb111/Naive-Bayes-Classification-Project/blob/master/Na%C3%AFve%20Bayes%20Classification%20with%20Python%20and%20Scikit-Learn.ipynb): In this project, I build a Naïve Bayes Classifier to classify a person's salary.
    I implement Naive Bayes Classification with Python and Scikit-Learn to predict whether a person makes over 50K a year. I have used the **Adult Data Set** from the UCI Machine Learning Repository website.

  * [Decision Tree Classification Project](https://github.com/pb111/Decision-Tree-Classification-Project/blob/master/Decision-Tree%20Classification%20with%20Python%20and%20Scikit-Learn.ipynb): Classification and Regression Trees or **CART** are very popular machine learning algorithms. In this project, I build two Decision Tree Classifier models, one with criterion **gini** and one with criterion **entropy**, to predict the safety of a car. I have used the **Car Evaluation Data Set** from the UCI Machine Learning Repository website.

  * [Random Forest Classification Project](https://github.com/pb111/Random-Forest-Classifier-Project/blob/master/Random%20Forest%20Classification%20with%20Python%20and%20Scikit-Learn.ipynb): In this project, I build two Random Forest Classifier models (with 10 and 100 decision trees) to predict the safety of a car. In this project, the accuracy increases with the number of decision trees. I have also demonstrated the feature selection process using the Random Forest model. I have used the **Car Evaluation Data Set** from the UCI Machine Learning Repository website.

  * [XGBoost Classification Project](https://github.com/pb111/XGBoost-Classification-Project/blob/master/XGBoost%20with%20Python%20and%20Scikit-Learn.ipynb): **XGBoost** is an acronym for **Extreme Gradient Boosting**. In this project, I implement XGBoost with Python and Scikit-Learn to classify customers from two different channels as Horeca (Hotel/Restaurant/Café) customers or Retail channel (nominal) customers. I have used the **Wholesale customers data set** from the UCI Machine Learning Repository.

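
The comparison of the **gini** and **entropy** criteria described in the Decision Tree project can be sketched roughly as follows. This is a minimal example under stated assumptions: it uses the Iris dataset bundled with Scikit-learn as a stand-in (not the Car Evaluation Data Set), and the `max_depth`, `test_size` and `random_state` values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (an assumption): Iris ships with scikit-learn, so no download is needed.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing, keeping class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Train one tree per split criterion and record its accuracy on the held-out data.
accuracies = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    clf.fit(X_train, y_train)
    accuracies[criterion] = accuracy_score(y_test, clf.predict(X_test))

for criterion, accuracy in accuracies.items():
    print(f"{criterion}: test accuracy = {accuracy:.3f}")
```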

===============================================================================

- ### Unsupervised Learning Projects

  * [K Means Clustering Project](https://github.com/pb111/K-Means-Clustering-Project/blob/master/K-Means%20Clustering%20with%20Python%20and%20Scikit-Learn.ipynb): K-Means clustering is used to find intrinsic groups within an unlabelled dataset and draw inferences from them. In this project, I implement K-Means clustering with Python and Scikit-Learn. I have used the **Facebook Live Sellers in Thailand** dataset from the UCI Machine Learning Repository.

===============================================================================

- ### Recommender Systems Project

  - [Recommender Systems with Python](https://github.com/pb111/Recommender-Systems-with-Python/blob/master/README.md): Recommender Systems are one of the most popular and widely used applications of data science. In this project, I build a Recommender System with Python. I discuss various types of recommender systems, including **Content-based** and **Collaborative filtering** recommender systems. I also discuss **matrix factorization** and how to evaluate recommender systems.

===============================================================================

- ### Statistical Analysis Projects

  - [Descriptive Statistics Project](https://github.com/pb111/Descriptive-Statistics-Project/blob/master/Descriptive%20Statistics%20with%20Python.ipynb): **Descriptive Statistics** is the subject matter of this project. It gives us the basic summary measures of a dataset. The summary measures include measures of central tendency (mean, median and mode), measures of variability (variance, standard deviation, minimum/maximum values and IQR (Interquartile Range)) and measures of shape (skewness and kurtosis).

  - [Inferential Statistics Project](https://github.com/pb111/Inferential-Statistics-Project/blob/master/README.md): **Inferential Statistics** is the process of drawing inferences about a population from sample data. In this project, I have discussed various inferential statistical concepts and their practical applications. I have discussed the Central Limit Theorem, t-test, ANOVA, Chi-square goodness of fit test and Correlation analysis.

  - [Hypothesis Testing Project](https://github.com/pb111/Hypothesis-Testing-Project/blob/master/README.md): **Hypothesis testing** is a statistical tool to test an assumption regarding a population parameter. This project is dedicated to hypothesis testing. In this project, I have discussed hypothesis testing, p-value, significance level, types of errors in hypothesis testing, and one-tailed and two-tailed tests.

===============================================================================

- ### Data Cleaning and Preprocessing Projects

  - [Data Cleaning with Python and Pandas](https://github.com/pb111/Data-Cleaning-with-Python-NumPy-and-Pandas/blob/master/Data%20Cleaning%20with%20Python%20and%20Pandas.ipynb): In this project, I discuss the principles of tidy data and the signs of untidy data. I discuss EDA and present ways to deal with outliers and with missing and negative numerical values. I discuss how to check for missing values with the **assert** statement. I present how to reshape data using the pandas melt() function.

  - [Data Preprocessing Project - Dealing with missing numerical values](https://github.com/pb111/Data-Preprocessing-Project-Dealing-with-Missing-Numerical-Values/blob/master/Data%20Preprocessing%20Project%20-%20Dealing%20with%20Missing%20Numerical%20Values.ipynb): This project describes various techniques to deal with missing numerical values.
    I have discussed how to drop missing values and how to fill missing values with a test statistic or an imputer. I discuss how to check for missing values with the **assert** statement.

  - [Data Preprocessing Project - Dealing with text and categorical data](https://github.com/pb111/Data-Preprocessing-Project-Dealing-with-Text-and-Categorical-Data-/blob/master/Data%20Preprocessing%20Project%20-%20Dealing%20with%20Text%20and%20Categorical%20data.ipynb): In this project, I discuss various Scikit-learn classes to deal with text and categorical data. The classes are LabelEncoder, OneHotEncoder, LabelBinarizer, DictVectorizer, CountVectorizer, TfidfVectorizer and TfidfTransformer. I also discuss **tokenization** and **vectorization**.

  - [Data Preprocessing Project - Feature Scaling](https://github.com/pb111/Data-Preprocessing-Project-Feature-Scaling/blob/master/Data%20Preprocessing%20Project%20-%20Feature%20Scaling.ipynb): **Feature Scaling** is the process used to standardize the range of independent variables so that they can be mapped onto the same scale. In this project, I have discussed useful estimators related to Feature Scaling. The estimators are MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler, Normalizer and Binarizer, along with the scale function.

  - [Data Preprocessing Project - Imbalanced Classes Problem](https://github.com/pb111/Data-Preprocessing-Project-Imbalanced-Classes-Problem/blob/master/Data%20Preprocessing%20Project%20-%20Imbalanced%20Classes%20Problem.ipynb): **Imbalanced classes** is a major problem in machine learning. In this project, I discuss the imbalanced classes problem and the approaches to deal with it. I have used the **Credit Card Fraud Detection** dataset, downloaded from the Kaggle website.

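
Several of the cleaning and preprocessing steps named above (filling missing values, checking for them with `assert`, reshaping with `melt()`, and feature scaling) can be sketched together in a few lines. This is only an illustrative example: the small DataFrame and its column names (`id`, `height`, `weight`) are hypothetical, and filling with the column median is just one of the possible strategies.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data with one missing value in each numeric column.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "height": [150.0, np.nan, 170.0, 180.0],
    "weight": [50.0, 60.0, np.nan, 80.0],
})
numeric_cols = ["height", "weight"]

# Fill each missing value with the median of its column.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# The assert statement raises AssertionError if any missing value remains.
assert df.notnull().all().all()

# Reshape from wide to long format: one row per (id, measure) pair.
long_df = pd.melt(
    df, id_vars="id", value_vars=numeric_cols,
    var_name="measure", value_name="value",
)

# MinMaxScaler maps each column onto [0, 1];
# StandardScaler gives each column zero mean and unit variance.
minmax = MinMaxScaler().fit_transform(df[numeric_cols])
standard = StandardScaler().fit_transform(df[numeric_cols])

print(long_df)
print(minmax)
print(standard)
```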

===============================================================================

- ### Data Analysis Projects

  - [Exploratory Data Analysis with Python](https://github.com/pb111/Exploratory-Data-Analysis-with-Python-Project/blob/master/Exploratory%20Data%20Analysis%20with%20Python.ipynb): This project is all about Exploratory Data Analysis. In this project, I explore the **Absenteeism at work dataset**. I discuss useful univariate and multivariate techniques to explore this dataset.

  - [Data Analysis with Pandas](https://github.com/pb111/Data-Analysis-with-Pandas/blob/master/Data%20Analysis%20with%20Pandas.ipynb): **Pandas** is an open source library for data analysis in Python. In this project, I explore Pandas and its important data analysis tools. I have used the **BlackFriday** dataset downloaded from the Kaggle website.

  - [Data Analysis with NumPy](https://github.com/pb111/Data-Analysis-with-NumPy/blob/master/Data%20Analysis%20with%20NumPy.ipynb): **NumPy** is the fundamental Python library for scientific computing. In this project, I explore NumPy and its various data analysis tools.

  - [Time Series Analysis with Python](https://github.com/pb111/Time-series-analysis-with-Python/blob/master/Time%20Series%20Analysis%20in%20Python.ipynb): A time series is a series of data points recorded at successive points in time. In this project, I implement a **Seasonal ARIMA time series model** in Python to predict occupancy rates of car parks in the **Parking Birmingham** Data Set.

===============================================================================

- ### Data Visualization Projects

  - [Data Visualization with Matplotlib](https://github.com/pb111/Data-Visualization-with-Matplotlib-Project/blob/master/Data%20Visualization%20with%20Matplotlib.ipynb): **Matplotlib** is the basic data visualization library of Python. In this project, I describe Matplotlib, its object hierarchy, its interfaces, different plot types and various customization techniques.

  - [Data Visualization with Seaborn](https://github.com/pb111/Data-Visualization-with-Seaborn): **Seaborn** is a Python data visualization library based on Matplotlib. In this project, I explore Seaborn. I give an overview of the Seaborn API and its functionality, and show how to set Seaborn aesthetic parameters and colour palettes. I discuss different distributions, various plot types and multi-plot grids with Seaborn.

--------------------------------------------------------------------------------