├── .gitignore ├── LICENSE.txt ├── MANIFEST.in ├── README.md ├── README.rst ├── dist ├── visualize_ML-0.1.2.tar.gz └── visualize_ML-0.2.2.tar.gz ├── images ├── explore1.png └── relation1.png ├── setug.cfg ├── setup.py ├── visualize_ML.egg-info ├── PKG-INFO ├── SOURCES.txt ├── dependency_links.txt ├── requires.txt └── top_level.txt └── visualize_ML ├── .~lock.people.csv# ├── __init__.py ├── explore.py └── relation.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.csv 2 | *.pyc 3 | *.md 4 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Ayush Singh 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/MANIFEST.in -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # visualize_ML 2 | 3 | visualize_ML is a Python package made to visualize some of the steps involved while dealing with a Machine Learning problem. It is built on libraries like matplotlib for visualization and sklearn and scipy for statistical computations. 
4 | 5 | [![PyPI version](https://badge.fury.io/py/visualize_ML.svg)](https://badge.fury.io/py/visualize_ML) 6 | ### Table of contents: 7 | * [Requirements](https://github.com/ayush1997/visualize_ML/#requirement) 8 | * [Install](https://github.com/ayush1997/visualize_ML/#install) 9 | * [Let's code](https://github.com/ayush1997/visualize_ML/#lets-code) 10 | * [explore module](https://github.com/ayush1997/visualize_ML/#-explore-module) 11 | * [relation module](https://github.com/ayush1997/visualize_ML/#-relation-module) 12 | * [Contribute](https://github.com/ayush1997/visualize_ML/#contribute) 13 | * [Tasks To Do](https://github.com/ayush1997/visualize_ML/#tasks-to-do) 14 | * [Licence](https://github.com/ayush1997/visualize_ML/#licence) 15 | * [Copyright](https://github.com/ayush1997/visualize_ML/#copyright) 16 | 17 | 18 | ## Requirement 19 | 20 | * Python 2.x or Python 3.x 21 | 22 | ## Install 23 | Install the dependencies needed for matplotlib 24 | 25 | sudo apt-get build-dep python-matplotlib 26 | 27 | Install it using pip 28 | 29 | pip install visualize_ML 30 | 31 | 32 | 33 | 34 | ## Let's Code 35 | 36 | While dealing with a Machine Learning problem, some of the initial steps involved are data exploration and analysis, followed by feature selection. Below are the modules for these tasks. 37 | 38 | ### 1) Data Exploration 39 | At this stage, we explore variables one by one using **Uni-variate Analysis**, which depends on whether the variable type is categorical or continuous. To deal with this we have the **explore** module. 40 | 41 | ## >>> explore module 42 | visualize_ML.explore.plot(data_input,categorical_name=[],drop=[],PLOT_COLUMNS_SIZE=4,bin_size=20, 43 | bar_width=0.2,wspace=0.5,hspace=0.8) 44 | **Continuous Variables** : In case of continuous variables it plots a *Histogram* for every variable and gives descriptive statistics for them. 45 | 46 | **Categorical Variables** : In case of categorical variables with 2 or more classes it plots a *Bar chart* for every variable and gives descriptive statistics for them. 47 | 48 | Parameters | Type | Description 49 | -------------------- | -------------|------------------------------------------------------------------------ 50 | data_input | Dataframe | This is the input Dataframe with all data. (Right now the input can only be a dataframe.) 51 | categorical_name| list (default=[ ])| Names of all categorical variable columns with more than 2 classes, to distinguish them from the continuous variables. An empty list implies that there are no categorical features with more than 2 classes. 52 | drop | list (default=[ ]) |Names of columns to be dropped. 53 | PLOT_COLUMNS_SIZE| int (default=4)|Number of plot columns in the display window (i.e. plots per row). The number of rows is adjusted accordingly. 54 | bin_size |int (default=20) | Number of bins for the histograms plotted for the continuous variables. 55 | wspace | float32 (default = 0.5) |Horizontal padding between subplots in the display window. 56 | hspace | float32 (default = 0.8) |Vertical padding between subplots in the display window. 
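All of the arguments after `data_input` are optional. As an illustrative sketch only (reusing the same Titanic columns as the snippet below, with arbitrary example values rather than recommended settings), the layout arguments from the signature above can also be passed explicitly:

```python
# Illustrative sketch: same Titanic data as the snippet below, with the
# optional layout arguments set explicitly.
import pandas as pd
from visualize_ML import explore

df = pd.read_csv("dataset/train.csv")
explore.plot(df,
             categorical_name=["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],
             drop=["PassengerId","Name"],
             PLOT_COLUMNS_SIZE=3,   # plots per row
             bin_size=15,           # histogram bins for continuous variables
             bar_width=0.3,         # bar width for the categorical bar charts
             wspace=0.5,            # horizontal padding between subplots
             hspace=0.8)            # vertical padding between subplots
```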
57 | 58 | 59 | **Code Snippet** 60 | ```python 61 | # The data set is taken from the famous Titanic data (Kaggle) 62 | 63 | import pandas as pd 64 | from visualize_ML import explore 65 | df = pd.read_csv("dataset/train.csv") 66 | explore.plot(df,["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"]) 67 | ``` 68 | ![Alt text](https://github.com/ayush1997/visualize_ML/blob/master/images/explore1.png?raw=true "Optional Title") 69 | 70 | see the [dataset](https://www.kaggle.com/c/titanic/data) 71 | 72 | **Note:** While plotting, all rows with **NaN** values and columns with **Character** values are removed (except when the values are True and False); only numeric data is plotted. 73 | 74 | ### 2) Feature Selection 75 | This is one of the more challenging tasks in an ML problem. Here we have to do **Bi-variate Analysis** to find out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level. 76 | 77 | The **relation** module helps in visualizing the analysis done on various combinations of variables and seeing the relation between them. 78 | 79 | ## >>> relation module 80 | visualize_ML.relation.plot(data_input,target_name="",categorical_name=[],drop=[],bin_size=10) 81 | 82 | **Continuous vs Continuous variables:** To do the Bi-variate analysis *scatter plots* are made, as their pattern indicates the relationship between the variables. 83 | To indicate the strength of the relationship between them we use Correlation. 84 | 85 | The graph displays the correlation coefficient along with other information. 86 | 87 | Correlation = Covariance(X,Y) / SQRT( Var(X)*Var(Y)) 88 | 89 | * -1: perfect negative linear correlation 90 | * +1: perfect positive linear correlation 91 | * 0: no correlation 92 | 93 | **Categorical vs Categorical variables**: *Stacked Column Charts* are made to visualize the relation. The **Chi square test** is used to derive the statistical significance of the relationship between the variables. It returns the *probability* for the computed chi-square statistic with the corresponding degrees of freedom. For more information on the Chi square test see [this](http://www.stat.yale.edu/Courses/1997-98/101/chisq.htm) 94 | 95 | Probability of 0: It indicates that both categorical variables are dependent. 96 | 97 | Probability of 1: It shows that both variables are independent. 98 | 99 | The graph displays the *p_value* along with other information. If it is less than **0.05**, it indicates that the variables are dependent. 100 | 101 | **Categorical vs Continuous variables:** To explore the relation between categorical and continuous variables, box plots are drawn at each level of the categorical variables. If the number of levels is small, this will not show the statistical significance. 102 | The **ANOVA test** is used to derive the statistical significance of the relationship between the variables. 103 | 104 | The graph displays the *p_value* along with other information. If it is less than **0.05**, it indicates that the variables are dependent. 105 | 106 | For more information on the ANOVA test see [this](https://onlinecourses.science.psu.edu/stat200/book/export/html/66) 107 | 108 | Parameters | Type | Description 109 | -------------------- | -------------|-------------------------------------------------------------------- 110 | data_input | Dataframe | This is the input Dataframe with all data. (Right now the input can only be a dataframe.) 111 | target_name | String | The name of the target column. 
112 | categorical_name| list (default=[ ])| Names of all categorical variable columns with more than 2 classes, to distinguish them from the continuous variables. An empty list implies that there are no categorical features with more than 2 classes. 113 | drop | list (default=[ ]) |Names of columns to be dropped. 114 | PLOT_COLUMNS_SIZE| int (default=4)|Number of plot columns in the display window (i.e. plots per row). The number of rows is adjusted accordingly. 115 | bin_size |int (default="auto") | Number of bins for the histogram displayed in the categorical vs categorical case. 116 | wspace | float32 (default = 0.5) |Horizontal padding between subplots in the display window. 117 | hspace | float32 (default = 0.8) |Vertical padding between subplots in the display window. 118 | 119 | **Code Snippet** 120 | ```python 121 | # The data set is taken from the famous Titanic data (Kaggle) 122 | import pandas as pd 123 | from visualize_ML import relation 124 | df = pd.read_csv("dataset/train.csv") 125 | relation.plot(df,"Survived",["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"],bin_size=10) 126 | 127 | ``` 128 | 129 | ![Alt text](https://github.com/ayush1997/visualize_ML/blob/master/images/relation1.png?raw=true "Optional Title") 130 | 131 | see the [dataset](https://www.kaggle.com/c/titanic/data) 132 | 133 | **Note:** While plotting, all rows with **NaN** values and columns with **Non numeric** values are removed; only numeric data is plotted. Only a categorical target variable with string values is allowed. 134 | 135 | ## Contribute 136 | If you want to contribute and add a new feature, feel free to send a Pull request [here](https://github.com/ayush1997/visualize_ML) 137 | 138 | This project is still under development, so to report any bugs or request new features, head over to the Issues page 139 | 140 | ## Tasks To Do 141 | - [ ] Make input compatible with other formats like Numpy. 142 | - [ ] Visualize best fit lines and decision boundaries for various models to make the **Parameter Tuning** task easy. 143 | 144 | and many others! 145 | 146 | ## Licence 147 | Licensed under [The MIT License (MIT)](https://github.com/ayush1997/visualize_ML/blob/master/LICENSE.txt). 148 | 149 | ## Copyright 150 | ayush1997(c) 2016 151 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | visualize\_ML 2 | ============= 3 | 4 | visualize\_ML is a Python package made to visualize some of the steps involved while dealing with a Machine Learning problem. It is built on libraries like matplotlib for visualization and sklearn and scipy for statistical computations. 5 | 6 | Table of content: 7 | ~~~~~~~~~~~~~~~~~ 8 | 9 | - Requirements 10 | - Install 11 | - Let’s code 12 | 13 | - explore module 14 | - relation module 15 | 16 | - contribute 17 | - Licence 18 | - Copyright 19 | 20 | Let’s Code 21 | ---------- 22 | 23 | When we start dealing with a Machine Learning problem, some of the 24 | initial steps involved are data exploration and analysis, followed by 25 | feature selection. Below are the modules for these tasks. 26 | 27 | 1) Data Exploration 28 | ~~~~~~~~~~~~~~~~~~~ 29 | 30 | At this stage, we explore variables one by one using **Uni-variate 31 | Analysis**, which depends on whether the variable type is categorical or 32 | continuous. To deal with this we have the **explore** module. 
33 | 34 | >>>explore module 35 | ~~~~~~~~~~~~~~~~~~ 36 | 37 | :: 38 | 39 | visualize_ML.explore.plot(data_input,categorical_name=[],drop=[],PLOT_COLUMNS_SIZE=4,bin_size=20, 40 | bar_width=0.2,wspace=0.5,hspace=0.8) 41 | 42 | **Continuous Variables** : In case of continous variables it plots the 43 | *Histogram* for every variable and gives descriptive statistics for 44 | them. 45 | 46 | **Categorical Variables** : In case on categorical variables with 2 or 47 | more classes it plots the *Bar chart* for every variable and gives 48 | descriptive statistics for them. 49 | 50 | +---------------------+-----------------+---------------------------------------+ 51 | | Parameters | Type | Description | 52 | +=====================+=================+=======================================+ 53 | | data\_input | Dataframe | This is the input Dataframe with all | 54 | | | | data.(Right now the input can be only | 55 | | | | be a dataframe input.) | 56 | +---------------------+-----------------+---------------------------------------+ 57 | | categorical\_name | list (default=[ | Names of all categorical variable | 58 | | | ]) | columns with more than 2 classes, to | 59 | | | | distinguish them with the continuous | 60 | | | | variablesEmply list implies that | 61 | | | | there are no categorical features | 62 | | | | with more than 2 classes. | 63 | +---------------------+-----------------+---------------------------------------+ 64 | | drop | list default=[ | Names of columns to be dropped. | 65 | | | ] | | 66 | +---------------------+-----------------+---------------------------------------+ 67 | | PLOT\_COLUMNS\_SIZE | int (default=4) | Number of plots to display vertically | 68 | | | | in the display window.The row size is | 69 | | | | adjusted accordingly. | 70 | +---------------------+-----------------+---------------------------------------+ 71 | | bin\_size | int | Number of bins for the histogram | 72 | | | (default=“auto” | displayed in the categorical vs | 73 | | | ) | categorical category. | 74 | +---------------------+-----------------+---------------------------------------+ 75 | | wspace | float32 | Horizontal padding between subplot on | 76 | | | (default = 0.5) | the display window. | 77 | +---------------------+-----------------+---------------------------------------+ 78 | | hspace | float32 | Vertical padding between subplot on | 79 | | | (default = 0.8) | the display window. | 80 | +---------------------+-----------------+---------------------------------------+ 81 | 82 | **Code Snippet** 83 | 84 | .. code :: python 85 | 86 | /* The data set is taken from famous Titanic data(Kaggle)*/ 87 | 88 | import pandas as pd 89 | from visualize_ML import explore 90 | df = pd.read_csv("dataset/train.csv") 91 | 92 | explore.plot(df,["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"]) 93 | 94 | .. figure:: /images/explore1.png?raw=true 95 | :alt: Optional Title 96 | 97 | Graph made using explore module using matplotlib. 98 | 99 | see the [dataset](https://www.kaggle.com/c/titanic/data) 100 | 101 | **Note:** While plotting all the rows with **NaN** values and columns 102 | with **Character** values are removed(except if values are True and False ) only numeric data is plotted. 103 | 104 | 2) Feature Selection 105 | ~~~~~~~~~~~~~~~~~~~~ 106 | 107 | This is one of the challenging task to deal with for a ML task.Here we 108 | have to do **Bi-variate Analysis** to find out the relationship between 109 | two variables. 
Here, we look for association and disassociation between 110 | variables at a pre-defined 111 | 112 | 113 | **relation** module helps in visualizing the analysis done on various 114 | combination of variables and see relation between them. 115 | 116 | >>>relation module 117 | ~~~~~~~~~~~~~~~~~~~ 118 | 119 | :: 120 | 121 | visualize_ML.relation.plot(data_input,target_name="",categorical_name=[],drop=[],bin_size=10) 122 | 123 | **Continuous vs Continuous variables:** To do the Bi-variate analysis 124 | *scatter plots* are made as their pattern indicates the relationship 125 | between variables. To indicates the strength of relationship amongst 126 | them we use Correlation between them. 127 | 128 | The graph displays the correlation coefficient along with other 129 | information. 130 | 131 | :: 132 | 133 | Correlation = Covariance(X,Y) / SQRT( Var(X)*Var(Y)) 134 | 135 | - -1: perfect negative linear correlation 136 | - +1:perfect positive linear correlation and 137 | - 0: No correlation 138 | 139 | **Categorical vs Categorical variables**: *Stacked Column Charts* are 140 | made to visualize the relation.\ **Chi square test** is used to derive 141 | the statistical significance of relationship between the variables. It 142 | returns *probability* for the computed chi-square distribution with the 143 | degree of freedom. For more information on Chi Test see `this`_ 144 | 145 | Probability of 0: It indicates that both categorical variable are 146 | dependent 147 | 148 | Probability of 1: It shows that both variables are independent. 149 | 150 | The graph displays the *p\_value* along with other information. If it is 151 | leass than **0.05** it states that the variables are dependent. 152 | 153 | **Categorical vs Continuous variables:** To explore the relation between 154 | categorical and continuous variables,box plots re drawn at each level of 155 | categorical variables. If levels are small in number, it will not show 156 | the statistical significance. **ANOVA test** is used to derive the 157 | statistical significance of relationship between the variables. 158 | 159 | The graph displays the *p\_value* along with other information. If it is 160 | leass than **0.05** it states that the variables are dependent. 161 | 162 | For more information on ANOVA test see 163 | `this `__ 164 | 165 | +----------------+-----------+-------------------------------------------------+ 166 | | Parameters | Type | Description | 167 | +================+===========+=================================================+ 168 | | data\_input | Dataframe | This is the input Dataframe with all | 169 | | | | data.(Right now the input can be only be a | 170 | | | | dataframe input.) | 171 | +----------------+-----------+-------------------------------------------------+ 172 | | target\_name | String | The name of the target column. | 173 | +----------------+-----------+-------------------------------------------------+ 174 | | categorical\_n | list | Names of all categorical variable columns with | 175 | | ame | (default= | more than 2 classes, to distinguish them with | 176 | | | [ | the continuous variablesEmply list implies that | 177 | | | ]) | there are no categorical features with more | 178 | | | | than 2 classes. | 179 | +----------------+-----------+-------------------------------------------------+ 180 | | drop | list | Names of columns to be dropped. 
| 181 | | | default=[ | | 182 | | | ] | | 183 | +----------------+-----------+-------------------------------------------------+ 184 | | PLOT\_COLUMNS\ | int | Number of plots to display vertically in the | 185 | | _SIZE | (default= | display window.The row size is adjusted | 186 | | | 4) | accordingly. | 187 | +----------------+-----------+-------------------------------------------------+ 188 | | bin\_size | int | Number of bins for the histogram displayed in | 189 | | | (default= | the categorical vs categorical category. | 190 | | | “auto”) | | 191 | +----------------+-----------+-------------------------------------------------+ 192 | | wspace | float32 | Horizontal padding between subplot on the | 193 | | | (default | display window. | 194 | | | = 0.5) | | 195 | +----------------+-----------+-------------------------------------------------+ 196 | | hspace | float32 | Vertical padding between subplot on the display | 197 | | | (default | window. | 198 | | | = 0.8) | | 199 | +----------------+-----------+-------------------------------------------------+ 200 | 201 | **Code Snippet** 202 | 203 | .. code :: python 204 | 205 | /* The data set is taken from famous Titanic data(Kaggle)*/ 206 | import pandas as pd 207 | from visualize_ML import relation 208 | df = pd.read_csv("dataset/train.csv") 209 | 210 | relation.plot(df,"Survived",["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"],bin_size=10) 211 | 212 | .. figure:: /images/relation1.png?raw=true 213 | :alt: Optional Title 214 | 215 | Graph made using relation module using matplotlib. 216 | 217 | see the [dataset](https://www.kaggle.com/c/titanic/data) 218 | 219 | **Note:** While plotting all the rows with **NaN** values and columns 220 | with **Non numeric** values are removed only numeric data is 221 | plotted.Only categorical taget variable with string values are allowed. 222 | 223 | Contribute 224 | ---------- 225 | 226 | If you want to contribute and add new feature feel free to send Pull 227 | request `here`_ 228 | 229 | This project is still under development so to report any bugs or request new features, head over to the Issues page 230 | 231 | Licence 232 | ------- 233 | Licensed under `The MIT License (MIT)`_. 234 | 235 | Copyright 236 | --------- 237 | ayush1997(c) 2016 238 | 239 | .. _here: https://github.com/ayush1997/visualize_ML 240 | .. 
_The MIT License (MIT): https://github.com/ayush1997/visualize_ML/blob/master/LICENSE.txt 241 | -------------------------------------------------------------------------------- /dist/visualize_ML-0.1.2.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/dist/visualize_ML-0.1.2.tar.gz -------------------------------------------------------------------------------- /dist/visualize_ML-0.2.2.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/dist/visualize_ML-0.2.2.tar.gz -------------------------------------------------------------------------------- /images/explore1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/images/explore1.png -------------------------------------------------------------------------------- /images/relation1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/images/relation1.png -------------------------------------------------------------------------------- /setug.cfg: -------------------------------------------------------------------------------- 1 | [bdist_wheel] 2 | universal=1 3 | [metadata] 4 | description-file = README.rst 5 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | from codecs import open 3 | from os import path 4 | 5 | here = path.abspath(path.dirname(__file__)) 6 | 7 | # Get the long description from the README file 8 | with open(path.join(here, 'README.rst'), encoding='utf-8') as f: 9 | long_description = f.read() 10 | 11 | setup( 12 | name='visualize_ML', 13 | 14 | version='0.2.2', 15 | 16 | description='To visualize various processes involved in dealing with a Machine Learning problem.', 17 | long_description=long_description, 18 | 19 | # The project's main homepage. 20 | url='https://github.com/ayush1997/visualize_ML', 21 | 22 | 23 | author='ayush1997', 24 | author_email='ayushkumarsingh97@gmail.com', 25 | 26 | 27 | license='MIT', 28 | 29 | 30 | classifiers=[ 31 | 32 | 'Development Status :: 3 - Alpha', 33 | 34 | # Indicate who your project is intended for 35 | 'Intended Audience :: Science/Research', 36 | 'Intended Audience :: Developers', 37 | 'Topic :: Software Development :: Build Tools', 38 | 39 | # Pick your license as you wish (should match "license" above) 40 | 'License :: OSI Approved :: MIT License', 41 | 42 | # Specify the Python versions you support here. In particular, ensure 43 | # that you indicate whether you support Python 2, Python 3 or both. 
44 | 'Programming Language :: Python :: 2', 45 | 'Programming Language :: Python :: 2.6', 46 | 'Programming Language :: Python :: 2.7', 47 | 'Programming Language :: Python :: 3', 48 | 'Programming Language :: Python :: 3.3', 49 | 'Programming Language :: Python :: 3.4', 50 | 'Programming Language :: Python :: 3.5', 51 | ], 52 | 53 | keywords='visualization MachineLearning DataScience', 54 | 55 | packages=['visualize_ML'], 56 | 57 | 58 | install_requires=["scikit-learn","pandas","numpy","matplotlib"], 59 | 60 | 61 | 62 | ) 63 | -------------------------------------------------------------------------------- /visualize_ML.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 1.1 2 | Name: visualize-ML 3 | Version: 0.2.2 4 | Summary: To visualize various processes involved in dealing with a Machine Learning problem. 5 | Home-page: https://github.com/ayush1997/visualize_ML 6 | Author: ayush1997 7 | Author-email: ayushkumarsingh97@gmail.com 8 | License: MIT 9 | Description: visualize\_ML 10 | ============= 11 | 12 | visualize\_ML is a python package made to visualize some of the steps involved while dealing with a Machine Learning problem. It is build on libraries like matplotlib for visualization and sklearn,scipy for statistical computations. 13 | 14 | Table of content: 15 | ~~~~~~~~~~~~~~~~~ 16 | 17 | - Requirements 18 | - Install 19 | - Let’s code 20 | 21 | - explore module 22 | - relation module 23 | 24 | - contribute 25 | - Licence 26 | - Copyright 27 | 28 | Let’s Code 29 | ---------- 30 | 31 | When we start dealing with a Machine Learning problem some of the 32 | initial steps involved are data exploration,analysis followed by feature 33 | selection.Below are the modules for these tasks. 34 | 35 | 1) Data Exploration 36 | ~~~~~~~~~~~~~~~~~~~ 37 | 38 | At this stage, we explore variables one by one using **Uni-variate 39 | Analysis** which depends on whether the variable type is categorical or 40 | continuous .To deal with this we have the **explore** module. 41 | 42 | >>>explore module 43 | ~~~~~~~~~~~~~~~~~~ 44 | 45 | :: 46 | 47 | visualize_ML.explore.plot(data_input,categorical_name=[],drop=[],PLOT_COLUMNS_SIZE=4,bin_size=20, 48 | bar_width=0.2,wspace=0.5,hspace=0.8) 49 | 50 | **Continuous Variables** : In case of continous variables it plots the 51 | *Histogram* for every variable and gives descriptive statistics for 52 | them. 53 | 54 | **Categorical Variables** : In case on categorical variables with 2 or 55 | more classes it plots the *Bar chart* for every variable and gives 56 | descriptive statistics for them. 57 | 58 | +---------------------+-----------------+---------------------------------------+ 59 | | Parameters | Type | Description | 60 | +=====================+=================+=======================================+ 61 | | data\_input | Dataframe | This is the input Dataframe with all | 62 | | | | data.(Right now the input can be only | 63 | | | | be a dataframe input.) | 64 | +---------------------+-----------------+---------------------------------------+ 65 | | categorical\_name | list (default=[ | Names of all categorical variable | 66 | | | ]) | columns with more than 2 classes, to | 67 | | | | distinguish them with the continuous | 68 | | | | variablesEmply list implies that | 69 | | | | there are no categorical features | 70 | | | | with more than 2 classes. 
| 71 | +---------------------+-----------------+---------------------------------------+ 72 | | drop | list default=[ | Names of columns to be dropped. | 73 | | | ] | | 74 | +---------------------+-----------------+---------------------------------------+ 75 | | PLOT\_COLUMNS\_SIZE | int (default=4) | Number of plots to display vertically | 76 | | | | in the display window.The row size is | 77 | | | | adjusted accordingly. | 78 | +---------------------+-----------------+---------------------------------------+ 79 | | bin\_size | int | Number of bins for the histogram | 80 | | | (default=“auto” | displayed in the categorical vs | 81 | | | ) | categorical category. | 82 | +---------------------+-----------------+---------------------------------------+ 83 | | wspace | float32 | Horizontal padding between subplot on | 84 | | | (default = 0.5) | the display window. | 85 | +---------------------+-----------------+---------------------------------------+ 86 | | hspace | float32 | Vertical padding between subplot on | 87 | | | (default = 0.8) | the display window. | 88 | +---------------------+-----------------+---------------------------------------+ 89 | 90 | **Code Snippet** 91 | 92 | .. code :: python 93 | 94 | /* The data set is taken from famous Titanic data(Kaggle)*/ 95 | 96 | import pandas as pd 97 | from visualize_ML import explore 98 | df = pd.read_csv("dataset/train.csv") 99 | 100 | explore.plot(df,["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"]) 101 | 102 | .. figure:: /images/explore1.png?raw=true 103 | :alt: Optional Title 104 | 105 | Graph made using explore module using matplotlib. 106 | 107 | see the [dataset](https://www.kaggle.com/c/titanic/data) 108 | 109 | **Note:** While plotting all the rows with **NaN** values and columns 110 | with **Character** values are removed(except if values are True and False ) only numeric data is plotted. 111 | 112 | 2) Feature Selection 113 | ~~~~~~~~~~~~~~~~~~~~ 114 | 115 | This is one of the challenging task to deal with for a ML task.Here we 116 | have to do **Bi-variate Analysis** to find out the relationship between 117 | two variables. Here, we look for association and disassociation between 118 | variables at a pre-defined 119 | 120 | 121 | **relation** module helps in visualizing the analysis done on various 122 | combination of variables and see relation between them. 123 | 124 | >>>relation module 125 | ~~~~~~~~~~~~~~~~~~~ 126 | 127 | :: 128 | 129 | visualize_ML.relation.plot(df,"Sex",["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"],bin_size=10) 130 | 131 | **Continuous vs Continuous variables:** To do the Bi-variate analysis 132 | *scatter plots* are made as their pattern indicates the relationship 133 | between variables. To indicates the strength of relationship amongst 134 | them we use Correlation between them. 135 | 136 | The graph displays the correlation coefficient along with other 137 | information. 138 | 139 | :: 140 | 141 | Correlation = Covariance(X,Y) / SQRT( Var(X)*Var(Y)) 142 | 143 | - -1: perfect negative linear correlation 144 | - +1:perfect positive linear correlation and 145 | - 0: No correlation 146 | 147 | **Categorical vs Categorical variables**: *Stacked Column Charts* are 148 | made to visualize the relation.\ **Chi square test** is used to derive 149 | the statistical significance of relationship between the variables. It 150 | returns *probability* for the computed chi-square distribution with the 151 | degree of freedom. 
For more information on Chi Test see `this`_ 152 | 153 | Probability of 0: It indicates that both categorical variable are 154 | dependent 155 | 156 | Probability of 1: It shows that both variables are independent. 157 | 158 | The graph displays the *p\_value* along with other information. If it is 159 | leass than **0.05** it states that the variables are dependent. 160 | 161 | **Categorical vs Continuous variables:** To explore the relation between 162 | categorical and continuous variables,box plots re drawn at each level of 163 | categorical variables. If levels are small in number, it will not show 164 | the statistical significance. **ANOVA test** is used to derive the 165 | statistical significance of relationship between the variables. 166 | 167 | The graph displays the *p\_value* along with other information. If it is 168 | leass than **0.05** it states that the variables are dependent. 169 | 170 | For more information on ANOVA test see 171 | `this `__ 172 | 173 | +----------------+-----------+-------------------------------------------------+ 174 | | Parameters | Type | Description | 175 | +================+===========+=================================================+ 176 | | data\_input | Dataframe | This is the input Dataframe with all | 177 | | | | data.(Right now the input can be only be a | 178 | | | | dataframe input.) | 179 | +----------------+-----------+-------------------------------------------------+ 180 | | target\_name | String | The name of the target column. | 181 | +----------------+-----------+-------------------------------------------------+ 182 | | categorical\_n | list | Names of all categorical variable columns with | 183 | | ame | (default= | more than 2 classes, to distinguish them with | 184 | | | [ | the continuous variablesEmply list implies that | 185 | | | ]) | there are no categorical features with more | 186 | | | | than 2 classes. | 187 | +----------------+-----------+-------------------------------------------------+ 188 | | drop | list | Names of columns to be dropped. | 189 | | | default=[ | | 190 | | | ] | | 191 | +----------------+-----------+-------------------------------------------------+ 192 | | PLOT\_COLUMNS\ | int | Number of plots to display vertically in the | 193 | | _SIZE | (default= | display window.The row size is adjusted | 194 | | | 4) | accordingly. | 195 | +----------------+-----------+-------------------------------------------------+ 196 | | bin\_size | int | Number of bins for the histogram displayed in | 197 | | | (default= | the categorical vs categorical category. | 198 | | | “auto”) | | 199 | +----------------+-----------+-------------------------------------------------+ 200 | | wspace | float32 | Horizontal padding between subplot on the | 201 | | | (default | display window. | 202 | | | = 0.5) | | 203 | +----------------+-----------+-------------------------------------------------+ 204 | | hspace | float32 | Vertical padding between subplot on the display | 205 | | | (default | window. | 206 | | | = 0.8) | | 207 | +----------------+-----------+-------------------------------------------------+ 208 | 209 | **Code Snippet** 210 | 211 | .. code :: python 212 | 213 | /* The data set is taken from famous Titanic data(Kaggle)*/ 214 | import pandas as pd 215 | from visualize_ML import relation 216 | df = pd.read_csv("dataset/train.csv") 217 | 218 | relation.plot(df,"Survived",["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"],bin_size=10) 219 | 220 | .. 
figure:: /images/relation1.png?raw=true 221 | :alt: Optional Title 222 | 223 | Graph made using relation module using matplotlib. 224 | 225 | see the [dataset](https://www.kaggle.com/c/titanic/data) 226 | 227 | **Note:** While plotting all the rows with **NaN** values and columns 228 | with **Non numeric** values are removed only numeric data is 229 | plotted.Only categorical taget variable with string values are allowed. 230 | 231 | Contribute 232 | ---------- 233 | 234 | If you want to contribute and add new feature feel free to send Pull 235 | request `here`_ 236 | 237 | This project is still under development so to report any bugs or request new features, head over to the Issues page 238 | 239 | Licence 240 | ------- 241 | Licensed under `The MIT License (MIT)`_. 242 | 243 | Copyright 244 | --------- 245 | ayush1997(c) 2016 246 | 247 | .. _here: https://github.com/ayush1997/visualize_ML 248 | .. _The MIT License (MIT): https://github.com/ayush1997/visualize_ML/blob/master/LICENSE.txt 249 | 250 | Keywords: visualization MachineLearning DataScience 251 | Platform: UNKNOWN 252 | Classifier: Development Status :: 3 - Alpha 253 | Classifier: Intended Audience :: Science/Research 254 | Classifier: Intended Audience :: Developers 255 | Classifier: Topic :: Software Development :: Build Tools 256 | Classifier: License :: OSI Approved :: MIT License 257 | Classifier: Programming Language :: Python :: 2 258 | Classifier: Programming Language :: Python :: 2.6 259 | Classifier: Programming Language :: Python :: 2.7 260 | Classifier: Programming Language :: Python :: 3 261 | Classifier: Programming Language :: Python :: 3.3 262 | Classifier: Programming Language :: Python :: 3.4 263 | Classifier: Programming Language :: Python :: 3.5 264 | -------------------------------------------------------------------------------- /visualize_ML.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | MANIFEST.in 2 | README.rst 3 | setup.py 4 | visualize_ML/__init__.py 5 | visualize_ML/explore.py 6 | visualize_ML/relation.py 7 | visualize_ML.egg-info/PKG-INFO 8 | visualize_ML.egg-info/SOURCES.txt 9 | visualize_ML.egg-info/dependency_links.txt 10 | visualize_ML.egg-info/requires.txt 11 | visualize_ML.egg-info/top_level.txt -------------------------------------------------------------------------------- /visualize_ML.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /visualize_ML.egg-info/requires.txt: -------------------------------------------------------------------------------- 1 | scikit-learn 2 | pandas 3 | numpy 4 | matplotlib 5 | -------------------------------------------------------------------------------- /visualize_ML.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | visualize_ML 2 | -------------------------------------------------------------------------------- /visualize_ML/.~lock.people.csv#: -------------------------------------------------------------------------------- 1 | ,ayush,ayush-Lenovo-U41-70,04.08.2016 18:17,file:///home/ayush/.config/libreoffice/4; -------------------------------------------------------------------------------- /visualize_ML/__init__.py: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/visualize_ML/__init__.py -------------------------------------------------------------------------------- /visualize_ML/explore.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from math import ceil 4 | import matplotlib.pyplot as plt 5 | plt.style.use('ggplot') 6 | 7 | fig = plt.figure() 8 | 9 | PLOT_COLUMNS_SIZE = 4 10 | COUNTER = 1 11 | def dataframe_to_numpy(df): 12 | return np.array(df) 13 | 14 | #Return the category dictionary,categorical variables list and continuous list for every colum in dataframe. 15 | def get_category(df,categorical_name,columns_name): 16 | cat_dict = {} 17 | categorical = [] 18 | continous = [] 19 | for col in columns_name: 20 | if len(df[col].unique())<=2: 21 | cat_dict[col] = "categorical" 22 | categorical.append(col) 23 | elif col in categorical_name: 24 | cat_dict[col] = "categorical" 25 | categorical.append(col) 26 | else: 27 | cat_dict[col] = "continous" 28 | continous.append(col) 29 | 30 | return cat_dict,categorical,continous 31 | 32 | #Return True if the categorical_name are present in the orignal dataframe columns. 33 | def is_present(columns_name,categorical_name): 34 | ls = [i for i in categorical_name if i not in columns_name] 35 | if len(ls)==0: 36 | return True 37 | else: 38 | raise ValueError(i+" is not present as a column in the data,Please check the name") 39 | 40 | #function removes any column with string values which cannt be plotted 41 | def clean_str_list(df,lst): 42 | rem=[] 43 | for i in lst: 44 | 45 | res = any(isinstance(n,str) for n in df[i]) 46 | if res == True: 47 | rem.append(i) 48 | 49 | for j in rem: 50 | lst.remove(j) 51 | 52 | return lst 53 | 54 | 55 | #Univariate analysis for continuous variables is done using histograms and graph summary. 56 | def univariate_analysis_continous(cont_list,df,sub,COUNTER,bin_size,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE): 57 | 58 | clean_cont_list = clean_str_list(df,cont_list) 59 | for col in cont_list: 60 | summary = df[col].dropna().describe() 61 | count = summary[0] 62 | mean = summary[1] 63 | std = summary[2] 64 | count_50 = summary[5] 65 | count_75 = summary[6] 66 | 67 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 68 | plt.title("mean: "+str(np.float32(mean))+" std: "+str(np.float32(std)),fontsize=12) 69 | x = np.array(df[col].dropna()) 70 | plt.xlabel(col+"\n count "+str(count)+"\n50%: "+str(count_50)+" 75%: "+str(count_75), fontsize=12) 71 | plt.ylabel("Frequency", fontsize=12) 72 | plt.hist(x,bins=bin_size) 73 | print (col+" plotted....") 74 | COUNTER +=1 75 | 76 | return plt,COUNTER 77 | 78 | 79 | #Returns the frequecy table for a class 80 | def get_catg_info(df,col): 81 | return df[col].value_counts() 82 | 83 | 84 | #Univariate analysis for categotical variables is done using histograms and graph summary. 
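#Note: for non-numeric columns pandas describe() returns only 4 fields (count/unique/top/freq), which is why the length of the summary is checked below before reading mean and std.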
85 | def univariate_analysis_categorical(catg_list,df,sub_len,COUNTER,bar_width,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE): 86 | # clean_catg_list = clean_str_list(df,catg_list) 87 | 88 | for col in catg_list: 89 | 90 | summary = df[col].dropna().describe() 91 | 92 | # if len(summary)!=5: 93 | # raise ValueError(col+"has string values please Label Encode them") 94 | if len(summary)!= 4: 95 | count = summary[0] 96 | mean = summary[1] 97 | std = summary[2] 98 | count_50 = summary[5] 99 | count_75 = summary[6] 100 | plt.title("mean "+str(np.float32(mean))+" std "+str(np.float32(std)),fontsize=12) 101 | plt.xlabel(col+"\n count "+str(count)+"\n50%: "+str(count_50)+" 75%: "+str(count_75), fontsize=12) 102 | 103 | else: 104 | count = summary[0] 105 | plt.xlabel(col+"\n count "+str(count), fontsize=12) 106 | 107 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 108 | 109 | x = df.dropna()[col].unique() 110 | 111 | y = get_catg_info(df.dropna(),col) 112 | y = np.float32([y[i] for i in x]) 113 | 114 | labels = y/y.sum() * 100 115 | 116 | plt.ylabel("Frequency", fontsize=12) 117 | plt.bar(x,y,width=bar_width) 118 | 119 | for x,y, label in zip(x,y, np.around(np.float32(labels), decimals=2)): 120 | plt.text(x + bar_width/2,y + 5, label, ha='center', va='bottom',rotation=90) 121 | print (col+" plotted....") 122 | COUNTER +=1 123 | 124 | return plt,COUNTER 125 | 126 | #returns the total number of subplots to be made. 127 | def total_subplots(df,lst): 128 | clean_df = df.dropna() 129 | total = [len(clean_str_list(clean_df,i)) for i in lst] 130 | 131 | return sum(total) 132 | 133 | #This function returns new categotical list after removing drop values if in case they are written in both drop and categorical_name list. 134 | def remove_drop_from_catglist(drop,categorical_name): 135 | for col in drop: 136 | if col in categorical_name: 137 | categorical_name.remove(col) 138 | return categorical_name 139 | def plot(data_input,categorical_name=[],drop=[],PLOT_COLUMNS_SIZE = 4,bin_size=20,bar_width=0.2,wspace=0.5,hspace=0.8): 140 | 141 | """ 142 | This is the main function to give Bivariate analysis between the target variable and the input features. 143 | 144 | Parameters 145 | ----------- 146 | data_input : Dataframe 147 | This is the input Dataframe with all data. 148 | 149 | categorical_name : list 150 | Names of all categorical variable columns with more than 2 classes, to distinguish with the continuous variables. 151 | 152 | drop : list 153 | Names of columns to be dropped. 154 | 155 | PLOT_COLUMNS_SIZE : int; default =4 156 | Number of plots to display vertically in the display window.The row size is adjusted accordingly. 157 | 158 | bin_size : int ;default="auto" 159 | Number of bins for the histogram displayed in the categorical vs categorical category. 160 | 161 | wspace : float32 ;default = 0.5 162 | Horizontal padding between subplot on the display window. 163 | 164 | hspace : float32 ;default = 0.8 165 | Vertical padding between subplot on the display window. 166 | 167 | ----------- 168 | 169 | """ 170 | if type(data_input).__name__ == "DataFrame" : 171 | 172 | # Column names 173 | columns_name = data_input.columns.values 174 | 175 | #To drop user specified columns. 
176 | if is_present(columns_name,drop): 177 | data_input = data_input.drop(drop,axis=1) 178 | columns_name = data_input.columns.values 179 | categorical_name = remove_drop_from_catglist(drop,categorical_name) 180 | else: 181 | raise ValueError("Couldn't find it in the input Dataframe!") 182 | 183 | 184 | #Checks if the categorical_name are present in the orignal dataframe columns. 185 | categorical_is_present = is_present(columns_name,categorical_name) 186 | if categorical_is_present: 187 | category_dict,catg_list,cont_list = get_category(data_input,categorical_name,columns_name) 188 | 189 | #Subplot(Total number of graphs) 190 | 191 | total = total_subplots(data_input,[catg_list,cont_list]) 192 | 193 | if total < PLOT_COLUMNS_SIZE: 194 | total = PLOT_COLUMNS_SIZE 195 | PLOT_ROW_SIZE = ceil(float(total)/PLOT_COLUMNS_SIZE) 196 | 197 | 198 | plot,count = univariate_analysis_continous(cont_list,data_input,total,COUNTER,bin_size,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE) 199 | plot,count = univariate_analysis_categorical(catg_list,data_input,total,count,bar_width,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE) 200 | 201 | fig.subplots_adjust(bottom=0.08,left = 0.05,right=0.97,top=0.93,wspace = wspace,hspace = hspace) 202 | plot.show() 203 | 204 | else: 205 | raise ValueError("The input doesn't seems to be Dataframe") 206 | -------------------------------------------------------------------------------- /visualize_ML/relation.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from numpy import corrcoef 4 | import matplotlib.pyplot as plt 5 | from sklearn.feature_selection import chi2 6 | from sklearn.feature_selection import f_classif 7 | from math import * 8 | plt.style.use('ggplot') 9 | 10 | fig = plt.figure() 11 | COUNTER = 1 12 | 13 | #Return the category dictionary,categorical variables list and continuous list for every column in dataframe. 14 | #The categories are assigned as "target(type)_feature(type)" 15 | def get_category(df,target_name,categorical_name,columns_name): 16 | cat_dict = {} 17 | fin_cat_dict = {} 18 | catg_catg = [] 19 | cont_cont = [] 20 | catg_cont = [] 21 | cont_catg = [] 22 | for col in columns_name: 23 | if len(df[col].unique())<=2: 24 | cat_dict[col] = "categorical" 25 | elif col in categorical_name: 26 | cat_dict[col] = "categorical" 27 | else: 28 | cat_dict[col] = "continous" 29 | 30 | for col in cat_dict: 31 | if cat_dict[col]=="categorical" and cat_dict[target_name]=="categorical": 32 | fin_cat_dict[col] = "catg_catg" 33 | catg_catg.append(col) 34 | elif cat_dict[col]=="continous" and cat_dict[target_name]=="continous": 35 | fin_cat_dict[col] = "cont_cont" 36 | cont_cont.append(col) 37 | elif cat_dict[col]=="continous" and cat_dict[target_name]=="categorical": 38 | fin_cat_dict[col] = "catg_cont" 39 | catg_cont.append(col) 40 | else: 41 | fin_cat_dict[col] = "cont_catg" 42 | cont_catg.append(col) 43 | return fin_cat_dict,catg_catg,cont_cont,catg_cont,cont_catg 44 | 45 | #Return True if the categorical_name are present in the orignal dataframe columns. 46 | def is_present(columns_name,categorical_name): 47 | ls = [i for i in categorical_name if i not in columns_name] 48 | if len(ls)==0: 49 | return True 50 | else: 51 | raise ValueError(str(ls)+" is not present as a column in the data,Please check the name") 52 | 53 | #Function returns list of columns with non-numeric data. 
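#(In practice it removes from lst any column that contains string values and returns the remaining, plottable columns.)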
54 | def clean_str_list(df,lst): 55 | rem=[] 56 | for i in lst: 57 | 58 | res = any(isinstance(n,str) for n in df[i]) 59 | if res == True: 60 | rem.append(i) 61 | 62 | for j in rem: 63 | lst.remove(j) 64 | 65 | return lst 66 | 67 | #Returns the Pearson Correlation Coefficient for the continous data columns. 68 | def pearson_correlation_cont_cont(x,y): 69 | 70 | return corrcoef(x,y) 71 | 72 | 73 | # This function is for the bivariate analysis between two continous varibale.Plots scatter plots and shows the coeff for the data. 74 | def bivariate_analysis_cont_cont(cont_cont_list,df,target_name,sub_len,COUNTER,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE): 75 | 76 | clean_cont_cont_list = clean_str_list(df,cont_cont_list) 77 | 78 | if len(clean_str_list(df,[target_name])) == 0 and len(cont_cont_list)>0: 79 | raise ValueError("You seem to have a target variable with string values.") 80 | clean_df = df.dropna() 81 | for col in clean_cont_cont_list: 82 | summary = clean_df[col].describe() 83 | count = summary[0] 84 | mean = summary[1] 85 | std = summary[2] 86 | 87 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 88 | plt.title("mean "+str(np.float32(mean))+" std "+str(np.float32(std)),fontsize=10) 89 | 90 | x = clean_df[col] 91 | y = np.float32(clean_df[target_name]) 92 | corr = pearson_correlation_cont_cont(x,y) 93 | 94 | plt.xlabel(col+"\n count "+str(count)+"\n Corr: "+str(np.float32(corr[0][1])), fontsize=10) 95 | plt.ylabel(target_name, fontsize=10) 96 | plt.scatter(x,y) 97 | 98 | print (col+" vs "+target_name+" plotted....") 99 | COUNTER +=1 100 | 101 | return plt,COUNTER 102 | 103 | 104 | #Chi test is used to see association between catgorical vs categorical variables. 105 | #Lower Pvalue are significant they should be < 0.05 106 | #chi value = X^2 = summation [(observed-expected)^2/expected] 107 | # The distribution of the statistic X2 is chi-square with (r-1)(c-1) degrees of freedom, where r represents the number of rows in the two-way table and c represents the number of columns. The distribution is denoted (df), where df is the number of degrees of freedom. 
108 | #pvalue = p(df>=x^2) 109 | 110 | def evaluate_chi(x,y): 111 | chi,p_val = chi2(x,y) 112 | return chi,p_val 113 | def bivariate_analysis_catg_catg(catg_catg_list,df,target_name,sub_len,COUNTER,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,bin_size="auto"): 114 | 115 | clean_catg_catg_list = clean_str_list(df,catg_catg_list) 116 | 117 | clean_df = df.dropna() 118 | 119 | target_classes =df[target_name].unique() 120 | label = [str(i) for i in target_classes] 121 | 122 | c = 0 123 | for col in clean_catg_catg_list: 124 | summary = clean_df[col].describe() 125 | binwidth = 0.7 126 | 127 | if bin_size == 'auto': 128 | bins_size =np.arange(min(clean_df[col].tolist()), max(clean_df[col].tolist()) + binwidth, binwidth) 129 | else: 130 | bins_size = bin_size 131 | 132 | count = summary[0] 133 | mean = summary[1] 134 | std = summary[2] 135 | 136 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 137 | plt.title("mean "+str(np.float32(mean))+" std "+str(np.float32(std)),fontsize=10) 138 | 139 | x = [np.array(clean_df[clean_df[target_name]==i][col]) for i in target_classes] 140 | y = clean_df[target_name] 141 | 142 | chi,p_val = evaluate_chi(np.array(clean_df[col]).reshape(-1,1),y) 143 | 144 | plt.xlabel(col+"\n chi: "+str(np.float32(chi[0]))+" / p_val: "+str(p_val[0]), fontsize=10) 145 | plt.ylabel("Frequency", fontsize=10) 146 | plt.hist(x,bins=bins_size,stacked=True,label = label) 147 | plt.legend(prop={'size': 10}) 148 | 149 | print (col+" vs "+target_name+" plotted....") 150 | 151 | COUNTER +=1 152 | c+=1 153 | 154 | return plt,COUNTER 155 | 156 | # Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences among group means and their associated procedures (such as "variation" among and between groups) 157 | # In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. ANOVAs are useful for comparing (testing) three or more means (groups or variables) for statistical significance. 158 | # A one-way ANOVA is used to compare the means of more than two independent groups. A one-way ANOVA comparing just two groups will give you the same results as the independent t test. 159 | def evaluate_anova(x,y): 160 | F_value,pvalue = f_classif(x,y) 161 | return F_value,pvalue 162 | 163 | # In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. 
164 | # Quartile: In descriptive statistics, the quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data 165 | def bivariate_analysis_cont_catg(cont_catg_list,df,target_name,sub_len,COUNTER,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE): 166 | 167 | clean_cont_catg_list = clean_str_list(df,cont_catg_list) 168 | 169 | if len(clean_str_list(df,[target_name])) == 0 and len(cont_catg_list)>0: 170 | raise ValueError("You seem to have a target variable with string values.") 171 | clean_df = df.dropna() 172 | 173 | for col in clean_cont_catg_list: 174 | 175 | col_classes =clean_df[col].unique() 176 | 177 | summary = clean_df[col].describe() 178 | count = summary[0] 179 | mean = summary[1] 180 | std = summary[2] 181 | 182 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 183 | plt.title("mean "+str(np.float32(mean))+" std "+str(np.float32(std)),fontsize=10) 184 | 185 | x = [np.array(clean_df[clean_df[col]==i][target_name]) for i in col_classes] 186 | y = np.float32(clean_df[target_name]) 187 | 188 | f_value,p_val = evaluate_anova(np.array(clean_df[col]).reshape(-1,1),y) 189 | 190 | plt.xlabel(col+"\n f_value: "+str(np.float32(f_value[0]))+" / p_val: "+str(p_val[0]), fontsize=10) 191 | plt.ylabel(target_name, fontsize=10) 192 | plt.boxplot(x) 193 | 194 | print (col+" vs "+target_name+" plotted....") 195 | 196 | COUNTER +=1 197 | 198 | return plt,COUNTER 199 | 200 | 201 | # This function is for the bivariate analysis between categorical vs continuous varibale.Plots box plots. 202 | def bivariate_analysis_catg_cont(catg_cont_list,df,target_name,sub_len,COUNTER,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE): 203 | 204 | # No need to remove string varible as they are handled by chi2 function of sklearn. 205 | # clean_catg_cont_list = clean_str_list(df,catg_cont_list) 206 | clean_catg_cont_list = catg_cont_list 207 | clean_df = df.dropna() 208 | 209 | for col in clean_catg_cont_list: 210 | 211 | col_classes =df[target_name].unique() 212 | 213 | summary = clean_df[col].describe() 214 | count = summary[0] 215 | mean = summary[1] 216 | std = summary[2] 217 | 218 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 219 | plt.title("mean "+str(np.float32(mean))+" std "+str(np.float32(std)),fontsize=10) 220 | 221 | x = [np.array(clean_df[clean_df[target_name]==i][col]) for i in col_classes] 222 | y = clean_df[target_name] 223 | 224 | f_value,p_val = evaluate_anova(np.array(clean_df[col]).reshape(-1,1),y) 225 | 226 | plt.xlabel(target_name+"\n f_value: "+str(np.float32(f_value[0]))+" / p_val: "+str(p_val[0]), fontsize=10) 227 | plt.ylabel(col, fontsize=10) 228 | plt.boxplot(x) 229 | 230 | print (col+" vs "+target_name+" plotted....") 231 | 232 | COUNTER +=1 233 | 234 | return plt,COUNTER 235 | 236 | #returns the total number of subplots to be made. 237 | def total_subplots(df,lst): 238 | clean_df = df.dropna() 239 | total = [len(clean_str_list(clean_df,i)) for i in lst] 240 | return sum(total) 241 | 242 | # This function returns new categotical list after removing drop values if in case they are written in both drop and categorical_name list. 
243 | def remove_drop_from_catglist(drop,categorical_name): 244 | for col in drop: 245 | if col in categorical_name: 246 | categorical_name.remove(col) 247 | return categorical_name 248 | 249 | def plot(data_input,target_name="",categorical_name=[],drop=[],PLOT_COLUMNS_SIZE = 4,bin_size="auto",wspace=0.5,hspace=0.8): 250 | """ 251 | This is the main function to give Bivariate analysis between the target variable and the input features. 252 | 253 | Parameters 254 | ----------- 255 | data_input : Dataframe 256 | This is the input Dataframe with all data. 257 | 258 | target_name : String 259 | The name of the target column. 260 | 261 | categorical_name : list 262 | Names of all categorical variable columns with more than 2 classes, to distinguish with the continuous variables. 263 | 264 | drop : list 265 | Names of columns to be dropped. 266 | 267 | PLOT_COLUMNS_SIZE : int 268 | Number of plots to display vertically in the display window.The row size is adjusted accordingly. 269 | 270 | bin_size : int ;default="auto" 271 | Number of bins for the histogram displayed in the categorical vs categorical category. 272 | 273 | wspace : int ;default = 0.5 274 | Horizontal padding between subplot on the display window. 275 | 276 | hspace : int ;default = 0.5 277 | Vertical padding between subplot on the display window. 278 | 279 | ----------- 280 | 281 | """ 282 | 283 | if type(data_input).__name__ == "DataFrame" : 284 | 285 | # Column names 286 | columns_name = data_input.columns.values 287 | 288 | #To drop user specified columns. 289 | if is_present(columns_name,drop): 290 | data_input = data_input.drop(drop,axis=1) 291 | columns_name = data_input.columns.values 292 | categorical_name = remove_drop_from_catglist(drop,categorical_name) 293 | 294 | else: 295 | raise ValueError("Couldn't find it in the input Dataframe!") 296 | 297 | if target_name == "": 298 | raise ValueError("Please mention a target variable") 299 | 300 | #Checks if the categorical_name are present in the orignal dataframe columns. 301 | categorical_is_present = is_present(columns_name,categorical_name) 302 | target_is_present = is_present(columns_name,[target_name]) 303 | if categorical_is_present: 304 | fin_cat_dict,catg_catg_list,cont_cont_list,catg_cont_list,cont_catg_list = get_category(data_input,target_name,categorical_name,columns_name) 305 | 306 | #Subplot(Total number of graphs) 307 | total = total_subplots(data_input,[cont_cont_list,catg_catg_list,catg_cont_list,cont_catg_list]) 308 | if total < PLOT_COLUMNS_SIZE: 309 | total = PLOT_COLUMNS_SIZE 310 | 311 | PLOT_ROW_SIZE = ceil(float(total)/PLOT_COLUMNS_SIZE) 312 | 313 | #Call various functions 314 | plot,count = bivariate_analysis_cont_cont(cont_cont_list,data_input,target_name,total,COUNTER,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE) 315 | plot,count = bivariate_analysis_catg_catg(catg_catg_list,data_input,target_name,total,count,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,bin_size=bin_size) 316 | plot,count = bivariate_analysis_cont_catg(cont_catg_list,data_input,target_name,total,count,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE) 317 | plot,count = bivariate_analysis_catg_cont(catg_cont_list,data_input,target_name,total,count,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE) 318 | 319 | fig.subplots_adjust(bottom=0.08,left = 0.05,right=0.97,top=0.93,wspace = wspace,hspace = hspace) 320 | plot.show() 321 | 322 | else: 323 | raise ValueError("Make sure input data is a Dataframe.") 324 | --------------------------------------------------------------------------------