├── .gitignore ├── LICENSE.txt ├── MANIFEST.in ├── README.md ├── README.rst ├── dist ├── visualize_ML-0.1.2.tar.gz └── visualize_ML-0.2.2.tar.gz ├── images ├── explore1.png └── relation1.png ├── setug.cfg ├── setup.py ├── visualize_ML.egg-info ├── PKG-INFO ├── SOURCES.txt ├── dependency_links.txt ├── requires.txt └── top_level.txt └── visualize_ML ├── .~lock.people.csv# ├── __init__.py ├── explore.py └── relation.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.csv 2 | *.pyc 3 | *.md 4 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Ayush Singh 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/MANIFEST.in -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # visualize_ML 2 | 3 | visualize_ML is a Python package made to visualize some of the steps involved while dealing with a Machine Learning problem. It is built on libraries like matplotlib for visualization and sklearn and scipy for statistical computations. 
4 | 5 | [![PyPI version](https://badge.fury.io/py/visualize_ML.svg)](https://badge.fury.io/py/visualize_ML) 6 | ### Table of contents: 7 | * [Requirements](https://github.com/ayush1997/visualize_ML/#requirement) 8 | * [Install](https://github.com/ayush1997/visualize_ML/#install) 9 | * [Let's code](https://github.com/ayush1997/visualize_ML/#lets-code) 10 | * [explore module](https://github.com/ayush1997/visualize_ML/#-explore-module) 11 | * [relation module](https://github.com/ayush1997/visualize_ML/#-relation-module) 12 | * [Contribute](https://github.com/ayush1997/visualize_ML/#contribute) 13 | * [Tasks To Do](https://github.com/ayush1997/visualize_ML/#tasks-to-do) 14 | * [Licence](https://github.com/ayush1997/visualize_ML/#licence) 15 | * [Copyright](https://github.com/ayush1997/visualize_ML/#copyright) 16 | 17 | 18 | ## Requirement 19 | 20 | * Python 2.x or Python 3.x 21 | 22 | ## Install 23 | Install the dependencies needed for matplotlib 24 | 25 | sudo apt-get build-dep python-matplotlib 26 | 27 | Install it using pip 28 | 29 | pip install visualize_ML 30 | 31 | 32 | 33 | 34 | ## Let's Code 35 | 36 | While dealing with a Machine Learning problem, some of the initial steps involved are data exploration and analysis, followed by feature selection. Below are the modules for these tasks. 37 | 38 | ### 1) Data Exploration 39 | At this stage, we explore variables one by one using **Uni-variate Analysis**, which depends on whether the variable type is categorical or continuous. To deal with this we have the **explore** module. 40 | 41 | ## >>> explore module 42 | visualize_ML.explore.plot(data_input,categorical_name=[],drop=[],PLOT_COLUMNS_SIZE=4,bin_size=20, 43 | bar_width=0.2,wspace=0.5,hspace=0.8) 44 | **Continuous Variables** : In case of continuous variables it plots a *Histogram* for every variable and gives descriptive statistics for them. 45 | 46 | **Categorical Variables** : In case of categorical variables with 2 or more classes it plots a *Bar chart* for every variable and gives descriptive statistics for them. 47 | 48 | Parameters | Type | Description 49 | -------------------- | -------------|------------------------------------------------------------------------ 50 | data_input | Dataframe | This is the input Dataframe with all data. (Right now the input can only be a dataframe.) 51 | categorical_name| list (default=[ ])| Names of all categorical variable columns with more than 2 classes, to distinguish them from the continuous variables. An empty list implies that there are no categorical features with more than 2 classes. 52 | drop | list (default=[ ]) |Names of columns to be dropped. 53 | PLOT_COLUMNS_SIZE| int (default=4)|Number of plot columns in the display window (i.e. plots per row). The number of rows is adjusted accordingly. 54 | bin_size |int (default=20) | Number of bins for the histograms plotted for the continuous variables. 55 | wspace | float32 (default = 0.5) |Horizontal padding between subplots in the display window. 56 | hspace | float32 (default = 0.8) |Vertical padding between subplots in the display window. 
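All of the arguments after `data_input` are optional. As an illustrative sketch only (reusing the same Titanic columns as the snippet below, with arbitrary example values rather than recommended settings), the layout arguments from the signature above can also be passed explicitly:

```python
# Illustrative sketch: same Titanic data as the snippet below, with the
# optional layout arguments set explicitly.
import pandas as pd
from visualize_ML import explore

df = pd.read_csv("dataset/train.csv")
explore.plot(df,
             categorical_name=["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],
             drop=["PassengerId","Name"],
             PLOT_COLUMNS_SIZE=3,   # plots per row
             bin_size=15,           # histogram bins for continuous variables
             bar_width=0.3,         # bar width for the categorical bar charts
             wspace=0.5,            # horizontal padding between subplots
             hspace=0.8)            # vertical padding between subplots
```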
57 | 58 | 59 | **Code Snippet** 60 | ```python 61 | # The data set is taken from the famous Titanic data (Kaggle) 62 | 63 | import pandas as pd 64 | from visualize_ML import explore 65 | df = pd.read_csv("dataset/train.csv") 66 | explore.plot(df,["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"]) 67 | ``` 68 | ![Alt text](https://github.com/ayush1997/visualize_ML/blob/master/images/explore1.png?raw=true "Optional Title") 69 | 70 | see the [dataset](https://www.kaggle.com/c/titanic/data) 71 | 72 | **Note:** While plotting, all rows with **NaN** values and columns with **Character** values are removed (except when the values are True and False); only numeric data is plotted. 73 | 74 | ### 2) Feature Selection 75 | This is one of the more challenging tasks in an ML problem. Here we have to do **Bi-variate Analysis** to find out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level. 76 | 77 | The **relation** module helps in visualizing the analysis done on various combinations of variables and seeing the relation between them. 78 | 79 | ## >>> relation module 80 | visualize_ML.relation.plot(data_input,target_name="",categorical_name=[],drop=[],bin_size=10) 81 | 82 | **Continuous vs Continuous variables:** To do the Bi-variate analysis *scatter plots* are made, as their pattern indicates the relationship between the variables. 83 | To indicate the strength of the relationship between them we use Correlation. 84 | 85 | The graph displays the correlation coefficient along with other information. 86 | 87 | Correlation = Covariance(X,Y) / SQRT( Var(X)*Var(Y)) 88 | 89 | * -1: perfect negative linear correlation 90 | * +1: perfect positive linear correlation 91 | * 0: no correlation 92 | 93 | **Categorical vs Categorical variables**: *Stacked Column Charts* are made to visualize the relation. The **Chi square test** is used to derive the statistical significance of the relationship between the variables. It returns the *probability* for the computed chi-square statistic with the corresponding degrees of freedom. For more information on the Chi square test see [this](http://www.stat.yale.edu/Courses/1997-98/101/chisq.htm) 94 | 95 | Probability of 0: It indicates that both categorical variables are dependent. 96 | 97 | Probability of 1: It shows that both variables are independent. 98 | 99 | The graph displays the *p_value* along with other information. If it is less than **0.05**, it indicates that the variables are dependent. 100 | 101 | **Categorical vs Continuous variables:** To explore the relation between categorical and continuous variables, box plots are drawn at each level of the categorical variables. If the number of levels is small, this will not show the statistical significance. 102 | The **ANOVA test** is used to derive the statistical significance of the relationship between the variables. 103 | 104 | The graph displays the *p_value* along with other information. If it is less than **0.05**, it indicates that the variables are dependent. 105 | 106 | For more information on the ANOVA test see [this](https://onlinecourses.science.psu.edu/stat200/book/export/html/66) 107 | 108 | Parameters | Type | Description 109 | -------------------- | -------------|-------------------------------------------------------------------- 110 | data_input | Dataframe | This is the input Dataframe with all data. (Right now the input can only be a dataframe.) 111 | target_name | String | The name of the target column. 
112 | categorical_name| list (default=[ ])| Names of all categorical variable columns with more than 2 classes, to distinguish them from the continuous variables. An empty list implies that there are no categorical features with more than 2 classes. 113 | drop | list (default=[ ]) |Names of columns to be dropped. 114 | PLOT_COLUMNS_SIZE| int (default=4)|Number of plot columns in the display window (i.e. plots per row). The number of rows is adjusted accordingly. 115 | bin_size |int (default="auto") | Number of bins for the histogram displayed in the categorical vs categorical case. 116 | wspace | float32 (default = 0.5) |Horizontal padding between subplots in the display window. 117 | hspace | float32 (default = 0.8) |Vertical padding between subplots in the display window. 118 | 119 | **Code Snippet** 120 | ```python 121 | # The data set is taken from the famous Titanic data (Kaggle) 122 | import pandas as pd 123 | from visualize_ML import relation 124 | df = pd.read_csv("dataset/train.csv") 125 | relation.plot(df,"Survived",["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"],bin_size=10) 126 | 127 | ``` 128 | 129 | ![Alt text](https://github.com/ayush1997/visualize_ML/blob/master/images/relation1.png?raw=true "Optional Title") 130 | 131 | see the [dataset](https://www.kaggle.com/c/titanic/data) 132 | 133 | **Note:** While plotting, all rows with **NaN** values and columns with **Non numeric** values are removed; only numeric data is plotted. Only a categorical target variable with string values is allowed. 134 | 135 | ## Contribute 136 | If you want to contribute and add a new feature, feel free to send a Pull request [here](https://github.com/ayush1997/visualize_ML) 137 | 138 | This project is still under development, so to report any bugs or request new features, head over to the Issues page 139 | 140 | ## Tasks To Do 141 | - [ ] Make input compatible with other formats like Numpy. 142 | - [ ] Visualize best fit lines and decision boundaries for various models to make the **Parameter Tuning** task easy. 143 | 144 | and many others! 145 | 146 | ## Licence 147 | Licensed under [The MIT License (MIT)](https://github.com/ayush1997/visualize_ML/blob/master/LICENSE.txt). 148 | 149 | ## Copyright 150 | ayush1997(c) 2016 151 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | visualize\_ML 2 | ============= 3 | 4 | visualize\_ML is a Python package made to visualize some of the steps involved while dealing with a Machine Learning problem. It is built on libraries like matplotlib for visualization and sklearn and scipy for statistical computations. 5 | 6 | Table of content: 7 | ~~~~~~~~~~~~~~~~~ 8 | 9 | - Requirements 10 | - Install 11 | - Let’s code 12 | 13 | - explore module 14 | - relation module 15 | 16 | - contribute 17 | - Licence 18 | - Copyright 19 | 20 | Let’s Code 21 | ---------- 22 | 23 | When we start dealing with a Machine Learning problem, some of the 24 | initial steps involved are data exploration and analysis, followed by 25 | feature selection. Below are the modules for these tasks. 26 | 27 | 1) Data Exploration 28 | ~~~~~~~~~~~~~~~~~~~ 29 | 30 | At this stage, we explore variables one by one using **Uni-variate 31 | Analysis**, which depends on whether the variable type is categorical or 32 | continuous. To deal with this we have the **explore** module. 
33 | 34 | >>>explore module 35 | ~~~~~~~~~~~~~~~~~~ 36 | 37 | :: 38 | 39 | visualize_ML.explore.plot(data_input,categorical_name=[],drop=[],PLOT_COLUMNS_SIZE=4,bin_size=20, 40 | bar_width=0.2,wspace=0.5,hspace=0.8) 41 | 42 | **Continuous Variables** : In case of continous variables it plots the 43 | *Histogram* for every variable and gives descriptive statistics for 44 | them. 45 | 46 | **Categorical Variables** : In case on categorical variables with 2 or 47 | more classes it plots the *Bar chart* for every variable and gives 48 | descriptive statistics for them. 49 | 50 | +---------------------+-----------------+---------------------------------------+ 51 | | Parameters | Type | Description | 52 | +=====================+=================+=======================================+ 53 | | data\_input | Dataframe | This is the input Dataframe with all | 54 | | | | data.(Right now the input can be only | 55 | | | | be a dataframe input.) | 56 | +---------------------+-----------------+---------------------------------------+ 57 | | categorical\_name | list (default=[ | Names of all categorical variable | 58 | | | ]) | columns with more than 2 classes, to | 59 | | | | distinguish them with the continuous | 60 | | | | variablesEmply list implies that | 61 | | | | there are no categorical features | 62 | | | | with more than 2 classes. | 63 | +---------------------+-----------------+---------------------------------------+ 64 | | drop | list default=[ | Names of columns to be dropped. | 65 | | | ] | | 66 | +---------------------+-----------------+---------------------------------------+ 67 | | PLOT\_COLUMNS\_SIZE | int (default=4) | Number of plots to display vertically | 68 | | | | in the display window.The row size is | 69 | | | | adjusted accordingly. | 70 | +---------------------+-----------------+---------------------------------------+ 71 | | bin\_size | int | Number of bins for the histogram | 72 | | | (default=“auto” | displayed in the categorical vs | 73 | | | ) | categorical category. | 74 | +---------------------+-----------------+---------------------------------------+ 75 | | wspace | float32 | Horizontal padding between subplot on | 76 | | | (default = 0.5) | the display window. | 77 | +---------------------+-----------------+---------------------------------------+ 78 | | hspace | float32 | Vertical padding between subplot on | 79 | | | (default = 0.8) | the display window. | 80 | +---------------------+-----------------+---------------------------------------+ 81 | 82 | **Code Snippet** 83 | 84 | .. code :: python 85 | 86 | /* The data set is taken from famous Titanic data(Kaggle)*/ 87 | 88 | import pandas as pd 89 | from visualize_ML import explore 90 | df = pd.read_csv("dataset/train.csv") 91 | 92 | explore.plot(df,["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"]) 93 | 94 | .. figure:: /images/explore1.png?raw=true 95 | :alt: Optional Title 96 | 97 | Graph made using explore module using matplotlib. 98 | 99 | see the [dataset](https://www.kaggle.com/c/titanic/data) 100 | 101 | **Note:** While plotting all the rows with **NaN** values and columns 102 | with **Character** values are removed(except if values are True and False ) only numeric data is plotted. 103 | 104 | 2) Feature Selection 105 | ~~~~~~~~~~~~~~~~~~~~ 106 | 107 | This is one of the challenging task to deal with for a ML task.Here we 108 | have to do **Bi-variate Analysis** to find out the relationship between 109 | two variables. 
Here, we look for association and disassociation between 110 | variables at a pre-defined 111 | 112 | 113 | **relation** module helps in visualizing the analysis done on various 114 | combination of variables and see relation between them. 115 | 116 | >>>relation module 117 | ~~~~~~~~~~~~~~~~~~~ 118 | 119 | :: 120 | 121 | visualize_ML.relation.plot(data_input,target_name="",categorical_name=[],drop=[],bin_size=10) 122 | 123 | **Continuous vs Continuous variables:** To do the Bi-variate analysis 124 | *scatter plots* are made as their pattern indicates the relationship 125 | between variables. To indicates the strength of relationship amongst 126 | them we use Correlation between them. 127 | 128 | The graph displays the correlation coefficient along with other 129 | information. 130 | 131 | :: 132 | 133 | Correlation = Covariance(X,Y) / SQRT( Var(X)*Var(Y)) 134 | 135 | - -1: perfect negative linear correlation 136 | - +1:perfect positive linear correlation and 137 | - 0: No correlation 138 | 139 | **Categorical vs Categorical variables**: *Stacked Column Charts* are 140 | made to visualize the relation.\ **Chi square test** is used to derive 141 | the statistical significance of relationship between the variables. It 142 | returns *probability* for the computed chi-square distribution with the 143 | degree of freedom. For more information on Chi Test see `this`_ 144 | 145 | Probability of 0: It indicates that both categorical variable are 146 | dependent 147 | 148 | Probability of 1: It shows that both variables are independent. 149 | 150 | The graph displays the *p\_value* along with other information. If it is 151 | leass than **0.05** it states that the variables are dependent. 152 | 153 | **Categorical vs Continuous variables:** To explore the relation between 154 | categorical and continuous variables,box plots re drawn at each level of 155 | categorical variables. If levels are small in number, it will not show 156 | the statistical significance. **ANOVA test** is used to derive the 157 | statistical significance of relationship between the variables. 158 | 159 | The graph displays the *p\_value* along with other information. If it is 160 | leass than **0.05** it states that the variables are dependent. 161 | 162 | For more information on ANOVA test see 163 | `this `__ 164 | 165 | +----------------+-----------+-------------------------------------------------+ 166 | | Parameters | Type | Description | 167 | +================+===========+=================================================+ 168 | | data\_input | Dataframe | This is the input Dataframe with all | 169 | | | | data.(Right now the input can be only be a | 170 | | | | dataframe input.) | 171 | +----------------+-----------+-------------------------------------------------+ 172 | | target\_name | String | The name of the target column. | 173 | +----------------+-----------+-------------------------------------------------+ 174 | | categorical\_n | list | Names of all categorical variable columns with | 175 | | ame | (default= | more than 2 classes, to distinguish them with | 176 | | | [ | the continuous variablesEmply list implies that | 177 | | | ]) | there are no categorical features with more | 178 | | | | than 2 classes. | 179 | +----------------+-----------+-------------------------------------------------+ 180 | | drop | list | Names of columns to be dropped. 
| 181 | | | default=[ | | 182 | | | ] | | 183 | +----------------+-----------+-------------------------------------------------+ 184 | | PLOT\_COLUMNS\ | int | Number of plots to display vertically in the | 185 | | _SIZE | (default= | display window.The row size is adjusted | 186 | | | 4) | accordingly. | 187 | +----------------+-----------+-------------------------------------------------+ 188 | | bin\_size | int | Number of bins for the histogram displayed in | 189 | | | (default= | the categorical vs categorical category. | 190 | | | “auto”) | | 191 | +----------------+-----------+-------------------------------------------------+ 192 | | wspace | float32 | Horizontal padding between subplot on the | 193 | | | (default | display window. | 194 | | | = 0.5) | | 195 | +----------------+-----------+-------------------------------------------------+ 196 | | hspace | float32 | Vertical padding between subplot on the display | 197 | | | (default | window. | 198 | | | = 0.8) | | 199 | +----------------+-----------+-------------------------------------------------+ 200 | 201 | **Code Snippet** 202 | 203 | .. code :: python 204 | 205 | /* The data set is taken from famous Titanic data(Kaggle)*/ 206 | import pandas as pd 207 | from visualize_ML import relation 208 | df = pd.read_csv("dataset/train.csv") 209 | 210 | relation.plot(df,"Survived",["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"],bin_size=10) 211 | 212 | .. figure:: /images/relation1.png?raw=true 213 | :alt: Optional Title 214 | 215 | Graph made using relation module using matplotlib. 216 | 217 | see the [dataset](https://www.kaggle.com/c/titanic/data) 218 | 219 | **Note:** While plotting all the rows with **NaN** values and columns 220 | with **Non numeric** values are removed only numeric data is 221 | plotted.Only categorical taget variable with string values are allowed. 222 | 223 | Contribute 224 | ---------- 225 | 226 | If you want to contribute and add new feature feel free to send Pull 227 | request `here`_ 228 | 229 | This project is still under development so to report any bugs or request new features, head over to the Issues page 230 | 231 | Licence 232 | ------- 233 | Licensed under `The MIT License (MIT)`_. 234 | 235 | Copyright 236 | --------- 237 | ayush1997(c) 2016 238 | 239 | .. _here: https://github.com/ayush1997/visualize_ML 240 | .. 
_The MIT License (MIT): https://github.com/ayush1997/visualize_ML/blob/master/LICENSE.txt 241 | -------------------------------------------------------------------------------- /dist/visualize_ML-0.1.2.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/dist/visualize_ML-0.1.2.tar.gz -------------------------------------------------------------------------------- /dist/visualize_ML-0.2.2.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/dist/visualize_ML-0.2.2.tar.gz -------------------------------------------------------------------------------- /images/explore1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/images/explore1.png -------------------------------------------------------------------------------- /images/relation1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/images/relation1.png -------------------------------------------------------------------------------- /setug.cfg: -------------------------------------------------------------------------------- 1 | [bdist_wheel] 2 | universal=1 3 | [metadata] 4 | description-file = README.rst 5 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | from codecs import open 3 | from os import path 4 | 5 | here = path.abspath(path.dirname(__file__)) 6 | 7 | # Get the long description from the README file 8 | with open(path.join(here, 'README.rst'), encoding='utf-8') as f: 9 | long_description = f.read() 10 | 11 | setup( 12 | name='visualize_ML', 13 | 14 | version='0.2.2', 15 | 16 | description='To visualize various processes involved in dealing with a Machine Learning problem.', 17 | long_description=long_description, 18 | 19 | # The project's main homepage. 20 | url='https://github.com/ayush1997/visualize_ML', 21 | 22 | 23 | author='ayush1997', 24 | author_email='ayushkumarsingh97@gmail.com', 25 | 26 | 27 | license='MIT', 28 | 29 | 30 | classifiers=[ 31 | 32 | 'Development Status :: 3 - Alpha', 33 | 34 | # Indicate who your project is intended for 35 | 'Intended Audience :: Science/Research', 36 | 'Intended Audience :: Developers', 37 | 'Topic :: Software Development :: Build Tools', 38 | 39 | # Pick your license as you wish (should match "license" above) 40 | 'License :: OSI Approved :: MIT License', 41 | 42 | # Specify the Python versions you support here. In particular, ensure 43 | # that you indicate whether you support Python 2, Python 3 or both. 
44 | 'Programming Language :: Python :: 2', 45 | 'Programming Language :: Python :: 2.6', 46 | 'Programming Language :: Python :: 2.7', 47 | 'Programming Language :: Python :: 3', 48 | 'Programming Language :: Python :: 3.3', 49 | 'Programming Language :: Python :: 3.4', 50 | 'Programming Language :: Python :: 3.5', 51 | ], 52 | 53 | keywords='visualization MachineLearning DataScience', 54 | 55 | packages=['visualize_ML'], 56 | 57 | 58 | install_requires=["scikit-learn","pandas","numpy","matplotlib"], 59 | 60 | 61 | 62 | ) 63 | -------------------------------------------------------------------------------- /visualize_ML.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 1.1 2 | Name: visualize-ML 3 | Version: 0.2.2 4 | Summary: To visualize various processes involved in dealing with a Machine Learning problem. 5 | Home-page: https://github.com/ayush1997/visualize_ML 6 | Author: ayush1997 7 | Author-email: ayushkumarsingh97@gmail.com 8 | License: MIT 9 | Description: visualize\_ML 10 | ============= 11 | 12 | visualize\_ML is a python package made to visualize some of the steps involved while dealing with a Machine Learning problem. It is build on libraries like matplotlib for visualization and sklearn,scipy for statistical computations. 13 | 14 | Table of content: 15 | ~~~~~~~~~~~~~~~~~ 16 | 17 | - Requirements 18 | - Install 19 | - Let’s code 20 | 21 | - explore module 22 | - relation module 23 | 24 | - contribute 25 | - Licence 26 | - Copyright 27 | 28 | Let’s Code 29 | ---------- 30 | 31 | When we start dealing with a Machine Learning problem some of the 32 | initial steps involved are data exploration,analysis followed by feature 33 | selection.Below are the modules for these tasks. 34 | 35 | 1) Data Exploration 36 | ~~~~~~~~~~~~~~~~~~~ 37 | 38 | At this stage, we explore variables one by one using **Uni-variate 39 | Analysis** which depends on whether the variable type is categorical or 40 | continuous .To deal with this we have the **explore** module. 41 | 42 | >>>explore module 43 | ~~~~~~~~~~~~~~~~~~ 44 | 45 | :: 46 | 47 | visualize_ML.explore.plot(data_input,categorical_name=[],drop=[],PLOT_COLUMNS_SIZE=4,bin_size=20, 48 | bar_width=0.2,wspace=0.5,hspace=0.8) 49 | 50 | **Continuous Variables** : In case of continous variables it plots the 51 | *Histogram* for every variable and gives descriptive statistics for 52 | them. 53 | 54 | **Categorical Variables** : In case on categorical variables with 2 or 55 | more classes it plots the *Bar chart* for every variable and gives 56 | descriptive statistics for them. 57 | 58 | +---------------------+-----------------+---------------------------------------+ 59 | | Parameters | Type | Description | 60 | +=====================+=================+=======================================+ 61 | | data\_input | Dataframe | This is the input Dataframe with all | 62 | | | | data.(Right now the input can be only | 63 | | | | be a dataframe input.) | 64 | +---------------------+-----------------+---------------------------------------+ 65 | | categorical\_name | list (default=[ | Names of all categorical variable | 66 | | | ]) | columns with more than 2 classes, to | 67 | | | | distinguish them with the continuous | 68 | | | | variablesEmply list implies that | 69 | | | | there are no categorical features | 70 | | | | with more than 2 classes. 
| 71 | +---------------------+-----------------+---------------------------------------+ 72 | | drop | list default=[ | Names of columns to be dropped. | 73 | | | ] | | 74 | +---------------------+-----------------+---------------------------------------+ 75 | | PLOT\_COLUMNS\_SIZE | int (default=4) | Number of plots to display vertically | 76 | | | | in the display window.The row size is | 77 | | | | adjusted accordingly. | 78 | +---------------------+-----------------+---------------------------------------+ 79 | | bin\_size | int | Number of bins for the histogram | 80 | | | (default=“auto” | displayed in the categorical vs | 81 | | | ) | categorical category. | 82 | +---------------------+-----------------+---------------------------------------+ 83 | | wspace | float32 | Horizontal padding between subplot on | 84 | | | (default = 0.5) | the display window. | 85 | +---------------------+-----------------+---------------------------------------+ 86 | | hspace | float32 | Vertical padding between subplot on | 87 | | | (default = 0.8) | the display window. | 88 | +---------------------+-----------------+---------------------------------------+ 89 | 90 | **Code Snippet** 91 | 92 | .. code :: python 93 | 94 | /* The data set is taken from famous Titanic data(Kaggle)*/ 95 | 96 | import pandas as pd 97 | from visualize_ML import explore 98 | df = pd.read_csv("dataset/train.csv") 99 | 100 | explore.plot(df,["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"]) 101 | 102 | .. figure:: /images/explore1.png?raw=true 103 | :alt: Optional Title 104 | 105 | Graph made using explore module using matplotlib. 106 | 107 | see the [dataset](https://www.kaggle.com/c/titanic/data) 108 | 109 | **Note:** While plotting all the rows with **NaN** values and columns 110 | with **Character** values are removed(except if values are True and False ) only numeric data is plotted. 111 | 112 | 2) Feature Selection 113 | ~~~~~~~~~~~~~~~~~~~~ 114 | 115 | This is one of the challenging task to deal with for a ML task.Here we 116 | have to do **Bi-variate Analysis** to find out the relationship between 117 | two variables. Here, we look for association and disassociation between 118 | variables at a pre-defined 119 | 120 | 121 | **relation** module helps in visualizing the analysis done on various 122 | combination of variables and see relation between them. 123 | 124 | >>>relation module 125 | ~~~~~~~~~~~~~~~~~~~ 126 | 127 | :: 128 | 129 | visualize_ML.relation.plot(df,"Sex",["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"],bin_size=10) 130 | 131 | **Continuous vs Continuous variables:** To do the Bi-variate analysis 132 | *scatter plots* are made as their pattern indicates the relationship 133 | between variables. To indicates the strength of relationship amongst 134 | them we use Correlation between them. 135 | 136 | The graph displays the correlation coefficient along with other 137 | information. 138 | 139 | :: 140 | 141 | Correlation = Covariance(X,Y) / SQRT( Var(X)*Var(Y)) 142 | 143 | - -1: perfect negative linear correlation 144 | - +1:perfect positive linear correlation and 145 | - 0: No correlation 146 | 147 | **Categorical vs Categorical variables**: *Stacked Column Charts* are 148 | made to visualize the relation.\ **Chi square test** is used to derive 149 | the statistical significance of relationship between the variables. It 150 | returns *probability* for the computed chi-square distribution with the 151 | degree of freedom. 
For more information on Chi Test see `this`_ 152 | 153 | Probability of 0: It indicates that both categorical variable are 154 | dependent 155 | 156 | Probability of 1: It shows that both variables are independent. 157 | 158 | The graph displays the *p\_value* along with other information. If it is 159 | leass than **0.05** it states that the variables are dependent. 160 | 161 | **Categorical vs Continuous variables:** To explore the relation between 162 | categorical and continuous variables,box plots re drawn at each level of 163 | categorical variables. If levels are small in number, it will not show 164 | the statistical significance. **ANOVA test** is used to derive the 165 | statistical significance of relationship between the variables. 166 | 167 | The graph displays the *p\_value* along with other information. If it is 168 | leass than **0.05** it states that the variables are dependent. 169 | 170 | For more information on ANOVA test see 171 | `this `__ 172 | 173 | +----------------+-----------+-------------------------------------------------+ 174 | | Parameters | Type | Description | 175 | +================+===========+=================================================+ 176 | | data\_input | Dataframe | This is the input Dataframe with all | 177 | | | | data.(Right now the input can be only be a | 178 | | | | dataframe input.) | 179 | +----------------+-----------+-------------------------------------------------+ 180 | | target\_name | String | The name of the target column. | 181 | +----------------+-----------+-------------------------------------------------+ 182 | | categorical\_n | list | Names of all categorical variable columns with | 183 | | ame | (default= | more than 2 classes, to distinguish them with | 184 | | | [ | the continuous variablesEmply list implies that | 185 | | | ]) | there are no categorical features with more | 186 | | | | than 2 classes. | 187 | +----------------+-----------+-------------------------------------------------+ 188 | | drop | list | Names of columns to be dropped. | 189 | | | default=[ | | 190 | | | ] | | 191 | +----------------+-----------+-------------------------------------------------+ 192 | | PLOT\_COLUMNS\ | int | Number of plots to display vertically in the | 193 | | _SIZE | (default= | display window.The row size is adjusted | 194 | | | 4) | accordingly. | 195 | +----------------+-----------+-------------------------------------------------+ 196 | | bin\_size | int | Number of bins for the histogram displayed in | 197 | | | (default= | the categorical vs categorical category. | 198 | | | “auto”) | | 199 | +----------------+-----------+-------------------------------------------------+ 200 | | wspace | float32 | Horizontal padding between subplot on the | 201 | | | (default | display window. | 202 | | | = 0.5) | | 203 | +----------------+-----------+-------------------------------------------------+ 204 | | hspace | float32 | Vertical padding between subplot on the display | 205 | | | (default | window. | 206 | | | = 0.8) | | 207 | +----------------+-----------+-------------------------------------------------+ 208 | 209 | **Code Snippet** 210 | 211 | .. code :: python 212 | 213 | /* The data set is taken from famous Titanic data(Kaggle)*/ 214 | import pandas as pd 215 | from visualize_ML import relation 216 | df = pd.read_csv("dataset/train.csv") 217 | 218 | relation.plot(df,"Survived",["Survived","Pclass","Sex","SibSp","Ticket","Embarked"],drop=["PassengerId","Name"],bin_size=10) 219 | 220 | .. 
figure:: /images/relation1.png?raw=true 221 | :alt: Optional Title 222 | 223 | Graph made using relation module using matplotlib. 224 | 225 | see the [dataset](https://www.kaggle.com/c/titanic/data) 226 | 227 | **Note:** While plotting all the rows with **NaN** values and columns 228 | with **Non numeric** values are removed only numeric data is 229 | plotted.Only categorical taget variable with string values are allowed. 230 | 231 | Contribute 232 | ---------- 233 | 234 | If you want to contribute and add new feature feel free to send Pull 235 | request `here`_ 236 | 237 | This project is still under development so to report any bugs or request new features, head over to the Issues page 238 | 239 | Licence 240 | ------- 241 | Licensed under `The MIT License (MIT)`_. 242 | 243 | Copyright 244 | --------- 245 | ayush1997(c) 2016 246 | 247 | .. _here: https://github.com/ayush1997/visualize_ML 248 | .. _The MIT License (MIT): https://github.com/ayush1997/visualize_ML/blob/master/LICENSE.txt 249 | 250 | Keywords: visualization MachineLearning DataScience 251 | Platform: UNKNOWN 252 | Classifier: Development Status :: 3 - Alpha 253 | Classifier: Intended Audience :: Science/Research 254 | Classifier: Intended Audience :: Developers 255 | Classifier: Topic :: Software Development :: Build Tools 256 | Classifier: License :: OSI Approved :: MIT License 257 | Classifier: Programming Language :: Python :: 2 258 | Classifier: Programming Language :: Python :: 2.6 259 | Classifier: Programming Language :: Python :: 2.7 260 | Classifier: Programming Language :: Python :: 3 261 | Classifier: Programming Language :: Python :: 3.3 262 | Classifier: Programming Language :: Python :: 3.4 263 | Classifier: Programming Language :: Python :: 3.5 264 | -------------------------------------------------------------------------------- /visualize_ML.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | MANIFEST.in 2 | README.rst 3 | setup.py 4 | visualize_ML/__init__.py 5 | visualize_ML/explore.py 6 | visualize_ML/relation.py 7 | visualize_ML.egg-info/PKG-INFO 8 | visualize_ML.egg-info/SOURCES.txt 9 | visualize_ML.egg-info/dependency_links.txt 10 | visualize_ML.egg-info/requires.txt 11 | visualize_ML.egg-info/top_level.txt -------------------------------------------------------------------------------- /visualize_ML.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /visualize_ML.egg-info/requires.txt: -------------------------------------------------------------------------------- 1 | scikit-learn 2 | pandas 3 | numpy 4 | matplotlib 5 | -------------------------------------------------------------------------------- /visualize_ML.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | visualize_ML 2 | -------------------------------------------------------------------------------- /visualize_ML/.~lock.people.csv#: -------------------------------------------------------------------------------- 1 | ,ayush,ayush-Lenovo-U41-70,04.08.2016 18:17,file:///home/ayush/.config/libreoffice/4; -------------------------------------------------------------------------------- /visualize_ML/__init__.py: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ayush1997/visualize_ML/7332f41a08c6c6488920cae346af5cce8b1088d6/visualize_ML/__init__.py -------------------------------------------------------------------------------- /visualize_ML/explore.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from math import ceil 4 | import matplotlib.pyplot as plt 5 | plt.style.use('ggplot') 6 | 7 | fig = plt.figure() 8 | 9 | PLOT_COLUMNS_SIZE = 4 10 | COUNTER = 1 11 | def dataframe_to_numpy(df): 12 | return np.array(df) 13 | 14 | #Return the category dictionary,categorical variables list and continuous list for every colum in dataframe. 15 | def get_category(df,categorical_name,columns_name): 16 | cat_dict = {} 17 | categorical = [] 18 | continous = [] 19 | for col in columns_name: 20 | if len(df[col].unique())<=2: 21 | cat_dict[col] = "categorical" 22 | categorical.append(col) 23 | elif col in categorical_name: 24 | cat_dict[col] = "categorical" 25 | categorical.append(col) 26 | else: 27 | cat_dict[col] = "continous" 28 | continous.append(col) 29 | 30 | return cat_dict,categorical,continous 31 | 32 | #Return True if the categorical_name are present in the orignal dataframe columns. 33 | def is_present(columns_name,categorical_name): 34 | ls = [i for i in categorical_name if i not in columns_name] 35 | if len(ls)==0: 36 | return True 37 | else: 38 | raise ValueError(i+" is not present as a column in the data,Please check the name") 39 | 40 | #function removes any column with string values which cannt be plotted 41 | def clean_str_list(df,lst): 42 | rem=[] 43 | for i in lst: 44 | 45 | res = any(isinstance(n,str) for n in df[i]) 46 | if res == True: 47 | rem.append(i) 48 | 49 | for j in rem: 50 | lst.remove(j) 51 | 52 | return lst 53 | 54 | 55 | #Univariate analysis for continuous variables is done using histograms and graph summary. 56 | def univariate_analysis_continous(cont_list,df,sub,COUNTER,bin_size,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE): 57 | 58 | clean_cont_list = clean_str_list(df,cont_list) 59 | for col in cont_list: 60 | summary = df[col].dropna().describe() 61 | count = summary[0] 62 | mean = summary[1] 63 | std = summary[2] 64 | count_50 = summary[5] 65 | count_75 = summary[6] 66 | 67 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 68 | plt.title("mean: "+str(np.float32(mean))+" std: "+str(np.float32(std)),fontsize=12) 69 | x = np.array(df[col].dropna()) 70 | plt.xlabel(col+"\n count "+str(count)+"\n50%: "+str(count_50)+" 75%: "+str(count_75), fontsize=12) 71 | plt.ylabel("Frequency", fontsize=12) 72 | plt.hist(x,bins=bin_size) 73 | print (col+" plotted....") 74 | COUNTER +=1 75 | 76 | return plt,COUNTER 77 | 78 | 79 | #Returns the frequecy table for a class 80 | def get_catg_info(df,col): 81 | return df[col].value_counts() 82 | 83 | 84 | #Univariate analysis for categotical variables is done using histograms and graph summary. 
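#Note: for non-numeric columns pandas describe() returns only 4 fields (count/unique/top/freq), which is why the length of the summary is checked below before reading mean and std.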
85 | def univariate_analysis_categorical(catg_list,df,sub_len,COUNTER,bar_width,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE): 86 | # clean_catg_list = clean_str_list(df,catg_list) 87 | 88 | for col in catg_list: 89 | 90 | summary = df[col].dropna().describe() 91 | 92 | # if len(summary)!=5: 93 | # raise ValueError(col+"has string values please Label Encode them") 94 | if len(summary)!= 4: 95 | count = summary[0] 96 | mean = summary[1] 97 | std = summary[2] 98 | count_50 = summary[5] 99 | count_75 = summary[6] 100 | plt.title("mean "+str(np.float32(mean))+" std "+str(np.float32(std)),fontsize=12) 101 | plt.xlabel(col+"\n count "+str(count)+"\n50%: "+str(count_50)+" 75%: "+str(count_75), fontsize=12) 102 | 103 | else: 104 | count = summary[0] 105 | plt.xlabel(col+"\n count "+str(count), fontsize=12) 106 | 107 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 108 | 109 | x = df.dropna()[col].unique() 110 | 111 | y = get_catg_info(df.dropna(),col) 112 | y = np.float32([y[i] for i in x]) 113 | 114 | labels = y/y.sum() * 100 115 | 116 | plt.ylabel("Frequency", fontsize=12) 117 | plt.bar(x,y,width=bar_width) 118 | 119 | for x,y, label in zip(x,y, np.around(np.float32(labels), decimals=2)): 120 | plt.text(x + bar_width/2,y + 5, label, ha='center', va='bottom',rotation=90) 121 | print (col+" plotted....") 122 | COUNTER +=1 123 | 124 | return plt,COUNTER 125 | 126 | #returns the total number of subplots to be made. 127 | def total_subplots(df,lst): 128 | clean_df = df.dropna() 129 | total = [len(clean_str_list(clean_df,i)) for i in lst] 130 | 131 | return sum(total) 132 | 133 | #This function returns new categotical list after removing drop values if in case they are written in both drop and categorical_name list. 134 | def remove_drop_from_catglist(drop,categorical_name): 135 | for col in drop: 136 | if col in categorical_name: 137 | categorical_name.remove(col) 138 | return categorical_name 139 | def plot(data_input,categorical_name=[],drop=[],PLOT_COLUMNS_SIZE = 4,bin_size=20,bar_width=0.2,wspace=0.5,hspace=0.8): 140 | 141 | """ 142 | This is the main function to give Bivariate analysis between the target variable and the input features. 143 | 144 | Parameters 145 | ----------- 146 | data_input : Dataframe 147 | This is the input Dataframe with all data. 148 | 149 | categorical_name : list 150 | Names of all categorical variable columns with more than 2 classes, to distinguish with the continuous variables. 151 | 152 | drop : list 153 | Names of columns to be dropped. 154 | 155 | PLOT_COLUMNS_SIZE : int; default =4 156 | Number of plots to display vertically in the display window.The row size is adjusted accordingly. 157 | 158 | bin_size : int ;default="auto" 159 | Number of bins for the histogram displayed in the categorical vs categorical category. 160 | 161 | wspace : float32 ;default = 0.5 162 | Horizontal padding between subplot on the display window. 163 | 164 | hspace : float32 ;default = 0.8 165 | Vertical padding between subplot on the display window. 166 | 167 | ----------- 168 | 169 | """ 170 | if type(data_input).__name__ == "DataFrame" : 171 | 172 | # Column names 173 | columns_name = data_input.columns.values 174 | 175 | #To drop user specified columns. 
176 | if is_present(columns_name,drop): 177 | data_input = data_input.drop(drop,axis=1) 178 | columns_name = data_input.columns.values 179 | categorical_name = remove_drop_from_catglist(drop,categorical_name) 180 | else: 181 | raise ValueError("Couldn't find it in the input Dataframe!") 182 | 183 | 184 | #Checks if the categorical_name are present in the orignal dataframe columns. 185 | categorical_is_present = is_present(columns_name,categorical_name) 186 | if categorical_is_present: 187 | category_dict,catg_list,cont_list = get_category(data_input,categorical_name,columns_name) 188 | 189 | #Subplot(Total number of graphs) 190 | 191 | total = total_subplots(data_input,[catg_list,cont_list]) 192 | 193 | if total < PLOT_COLUMNS_SIZE: 194 | total = PLOT_COLUMNS_SIZE 195 | PLOT_ROW_SIZE = ceil(float(total)/PLOT_COLUMNS_SIZE) 196 | 197 | 198 | plot,count = univariate_analysis_continous(cont_list,data_input,total,COUNTER,bin_size,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE) 199 | plot,count = univariate_analysis_categorical(catg_list,data_input,total,count,bar_width,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE) 200 | 201 | fig.subplots_adjust(bottom=0.08,left = 0.05,right=0.97,top=0.93,wspace = wspace,hspace = hspace) 202 | plot.show() 203 | 204 | else: 205 | raise ValueError("The input doesn't seems to be Dataframe") 206 | -------------------------------------------------------------------------------- /visualize_ML/relation.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from numpy import corrcoef 4 | import matplotlib.pyplot as plt 5 | from sklearn.feature_selection import chi2 6 | from sklearn.feature_selection import f_classif 7 | from math import * 8 | plt.style.use('ggplot') 9 | 10 | fig = plt.figure() 11 | COUNTER = 1 12 | 13 | #Return the category dictionary,categorical variables list and continuous list for every column in dataframe. 14 | #The categories are assigned as "target(type)_feature(type)" 15 | def get_category(df,target_name,categorical_name,columns_name): 16 | cat_dict = {} 17 | fin_cat_dict = {} 18 | catg_catg = [] 19 | cont_cont = [] 20 | catg_cont = [] 21 | cont_catg = [] 22 | for col in columns_name: 23 | if len(df[col].unique())<=2: 24 | cat_dict[col] = "categorical" 25 | elif col in categorical_name: 26 | cat_dict[col] = "categorical" 27 | else: 28 | cat_dict[col] = "continous" 29 | 30 | for col in cat_dict: 31 | if cat_dict[col]=="categorical" and cat_dict[target_name]=="categorical": 32 | fin_cat_dict[col] = "catg_catg" 33 | catg_catg.append(col) 34 | elif cat_dict[col]=="continous" and cat_dict[target_name]=="continous": 35 | fin_cat_dict[col] = "cont_cont" 36 | cont_cont.append(col) 37 | elif cat_dict[col]=="continous" and cat_dict[target_name]=="categorical": 38 | fin_cat_dict[col] = "catg_cont" 39 | catg_cont.append(col) 40 | else: 41 | fin_cat_dict[col] = "cont_catg" 42 | cont_catg.append(col) 43 | return fin_cat_dict,catg_catg,cont_cont,catg_cont,cont_catg 44 | 45 | #Return True if the categorical_name are present in the orignal dataframe columns. 46 | def is_present(columns_name,categorical_name): 47 | ls = [i for i in categorical_name if i not in columns_name] 48 | if len(ls)==0: 49 | return True 50 | else: 51 | raise ValueError(str(ls)+" is not present as a column in the data,Please check the name") 52 | 53 | #Function returns list of columns with non-numeric data. 
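#(In practice it removes from lst any column that contains string values and returns the remaining, plottable columns.)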
54 | def clean_str_list(df,lst): 55 | rem=[] 56 | for i in lst: 57 | 58 | res = any(isinstance(n,str) for n in df[i]) 59 | if res == True: 60 | rem.append(i) 61 | 62 | for j in rem: 63 | lst.remove(j) 64 | 65 | return lst 66 | 67 | #Returns the Pearson Correlation Coefficient for the continous data columns. 68 | def pearson_correlation_cont_cont(x,y): 69 | 70 | return corrcoef(x,y) 71 | 72 | 73 | # This function is for the bivariate analysis between two continous varibale.Plots scatter plots and shows the coeff for the data. 74 | def bivariate_analysis_cont_cont(cont_cont_list,df,target_name,sub_len,COUNTER,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE): 75 | 76 | clean_cont_cont_list = clean_str_list(df,cont_cont_list) 77 | 78 | if len(clean_str_list(df,[target_name])) == 0 and len(cont_cont_list)>0: 79 | raise ValueError("You seem to have a target variable with string values.") 80 | clean_df = df.dropna() 81 | for col in clean_cont_cont_list: 82 | summary = clean_df[col].describe() 83 | count = summary[0] 84 | mean = summary[1] 85 | std = summary[2] 86 | 87 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 88 | plt.title("mean "+str(np.float32(mean))+" std "+str(np.float32(std)),fontsize=10) 89 | 90 | x = clean_df[col] 91 | y = np.float32(clean_df[target_name]) 92 | corr = pearson_correlation_cont_cont(x,y) 93 | 94 | plt.xlabel(col+"\n count "+str(count)+"\n Corr: "+str(np.float32(corr[0][1])), fontsize=10) 95 | plt.ylabel(target_name, fontsize=10) 96 | plt.scatter(x,y) 97 | 98 | print (col+" vs "+target_name+" plotted....") 99 | COUNTER +=1 100 | 101 | return plt,COUNTER 102 | 103 | 104 | #Chi test is used to see association between catgorical vs categorical variables. 105 | #Lower Pvalue are significant they should be < 0.05 106 | #chi value = X^2 = summation [(observed-expected)^2/expected] 107 | # The distribution of the statistic X2 is chi-square with (r-1)(c-1) degrees of freedom, where r represents the number of rows in the two-way table and c represents the number of columns. The distribution is denoted (df), where df is the number of degrees of freedom. 
108 | #pvalue = p(df>=x^2) 109 | 110 | def evaluate_chi(x,y): 111 | chi,p_val = chi2(x,y) 112 | return chi,p_val 113 | def bivariate_analysis_catg_catg(catg_catg_list,df,target_name,sub_len,COUNTER,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,bin_size="auto"): 114 | 115 | clean_catg_catg_list = clean_str_list(df,catg_catg_list) 116 | 117 | clean_df = df.dropna() 118 | 119 | target_classes =df[target_name].unique() 120 | label = [str(i) for i in target_classes] 121 | 122 | c = 0 123 | for col in clean_catg_catg_list: 124 | summary = clean_df[col].describe() 125 | binwidth = 0.7 126 | 127 | if bin_size == 'auto': 128 | bins_size =np.arange(min(clean_df[col].tolist()), max(clean_df[col].tolist()) + binwidth, binwidth) 129 | else: 130 | bins_size = bin_size 131 | 132 | count = summary[0] 133 | mean = summary[1] 134 | std = summary[2] 135 | 136 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 137 | plt.title("mean "+str(np.float32(mean))+" std "+str(np.float32(std)),fontsize=10) 138 | 139 | x = [np.array(clean_df[clean_df[target_name]==i][col]) for i in target_classes] 140 | y = clean_df[target_name] 141 | 142 | chi,p_val = evaluate_chi(np.array(clean_df[col]).reshape(-1,1),y) 143 | 144 | plt.xlabel(col+"\n chi: "+str(np.float32(chi[0]))+" / p_val: "+str(p_val[0]), fontsize=10) 145 | plt.ylabel("Frequency", fontsize=10) 146 | plt.hist(x,bins=bins_size,stacked=True,label = label) 147 | plt.legend(prop={'size': 10}) 148 | 149 | print (col+" vs "+target_name+" plotted....") 150 | 151 | COUNTER +=1 152 | c+=1 153 | 154 | return plt,COUNTER 155 | 156 | # Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences among group means and their associated procedures (such as "variation" among and between groups) 157 | # In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. ANOVAs are useful for comparing (testing) three or more means (groups or variables) for statistical significance. 158 | # A one-way ANOVA is used to compare the means of more than two independent groups. A one-way ANOVA comparing just two groups will give you the same results as the independent t test. 159 | def evaluate_anova(x,y): 160 | F_value,pvalue = f_classif(x,y) 161 | return F_value,pvalue 162 | 163 | # In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. 
164 | # Quartile: In descriptive statistics, the quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data 165 | def bivariate_analysis_cont_catg(cont_catg_list,df,target_name,sub_len,COUNTER,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE): 166 | 167 | clean_cont_catg_list = clean_str_list(df,cont_catg_list) 168 | 169 | if len(clean_str_list(df,[target_name])) == 0 and len(cont_catg_list)>0: 170 | raise ValueError("You seem to have a target variable with string values.") 171 | clean_df = df.dropna() 172 | 173 | for col in clean_cont_catg_list: 174 | 175 | col_classes =clean_df[col].unique() 176 | 177 | summary = clean_df[col].describe() 178 | count = summary[0] 179 | mean = summary[1] 180 | std = summary[2] 181 | 182 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 183 | plt.title("mean "+str(np.float32(mean))+" std "+str(np.float32(std)),fontsize=10) 184 | 185 | x = [np.array(clean_df[clean_df[col]==i][target_name]) for i in col_classes] 186 | y = np.float32(clean_df[target_name]) 187 | 188 | f_value,p_val = evaluate_anova(np.array(clean_df[col]).reshape(-1,1),y) 189 | 190 | plt.xlabel(col+"\n f_value: "+str(np.float32(f_value[0]))+" / p_val: "+str(p_val[0]), fontsize=10) 191 | plt.ylabel(target_name, fontsize=10) 192 | plt.boxplot(x) 193 | 194 | print (col+" vs "+target_name+" plotted....") 195 | 196 | COUNTER +=1 197 | 198 | return plt,COUNTER 199 | 200 | 201 | # This function is for the bivariate analysis between categorical vs continuous varibale.Plots box plots. 202 | def bivariate_analysis_catg_cont(catg_cont_list,df,target_name,sub_len,COUNTER,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE): 203 | 204 | # No need to remove string varible as they are handled by chi2 function of sklearn. 205 | # clean_catg_cont_list = clean_str_list(df,catg_cont_list) 206 | clean_catg_cont_list = catg_cont_list 207 | clean_df = df.dropna() 208 | 209 | for col in clean_catg_cont_list: 210 | 211 | col_classes =df[target_name].unique() 212 | 213 | summary = clean_df[col].describe() 214 | count = summary[0] 215 | mean = summary[1] 216 | std = summary[2] 217 | 218 | plt.subplot(PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,COUNTER) 219 | plt.title("mean "+str(np.float32(mean))+" std "+str(np.float32(std)),fontsize=10) 220 | 221 | x = [np.array(clean_df[clean_df[target_name]==i][col]) for i in col_classes] 222 | y = clean_df[target_name] 223 | 224 | f_value,p_val = evaluate_anova(np.array(clean_df[col]).reshape(-1,1),y) 225 | 226 | plt.xlabel(target_name+"\n f_value: "+str(np.float32(f_value[0]))+" / p_val: "+str(p_val[0]), fontsize=10) 227 | plt.ylabel(col, fontsize=10) 228 | plt.boxplot(x) 229 | 230 | print (col+" vs "+target_name+" plotted....") 231 | 232 | COUNTER +=1 233 | 234 | return plt,COUNTER 235 | 236 | #returns the total number of subplots to be made. 237 | def total_subplots(df,lst): 238 | clean_df = df.dropna() 239 | total = [len(clean_str_list(clean_df,i)) for i in lst] 240 | return sum(total) 241 | 242 | # This function returns new categotical list after removing drop values if in case they are written in both drop and categorical_name list. 
243 | def remove_drop_from_catglist(drop,categorical_name): 244 | for col in drop: 245 | if col in categorical_name: 246 | categorical_name.remove(col) 247 | return categorical_name 248 | 249 | def plot(data_input,target_name="",categorical_name=[],drop=[],PLOT_COLUMNS_SIZE = 4,bin_size="auto",wspace=0.5,hspace=0.8): 250 | """ 251 | This is the main function to give Bivariate analysis between the target variable and the input features. 252 | 253 | Parameters 254 | ----------- 255 | data_input : Dataframe 256 | This is the input Dataframe with all data. 257 | 258 | target_name : String 259 | The name of the target column. 260 | 261 | categorical_name : list 262 | Names of all categorical variable columns with more than 2 classes, to distinguish with the continuous variables. 263 | 264 | drop : list 265 | Names of columns to be dropped. 266 | 267 | PLOT_COLUMNS_SIZE : int 268 | Number of plots to display vertically in the display window.The row size is adjusted accordingly. 269 | 270 | bin_size : int ;default="auto" 271 | Number of bins for the histogram displayed in the categorical vs categorical category. 272 | 273 | wspace : int ;default = 0.5 274 | Horizontal padding between subplot on the display window. 275 | 276 | hspace : int ;default = 0.5 277 | Vertical padding between subplot on the display window. 278 | 279 | ----------- 280 | 281 | """ 282 | 283 | if type(data_input).__name__ == "DataFrame" : 284 | 285 | # Column names 286 | columns_name = data_input.columns.values 287 | 288 | #To drop user specified columns. 289 | if is_present(columns_name,drop): 290 | data_input = data_input.drop(drop,axis=1) 291 | columns_name = data_input.columns.values 292 | categorical_name = remove_drop_from_catglist(drop,categorical_name) 293 | 294 | else: 295 | raise ValueError("Couldn't find it in the input Dataframe!") 296 | 297 | if target_name == "": 298 | raise ValueError("Please mention a target variable") 299 | 300 | #Checks if the categorical_name are present in the orignal dataframe columns. 301 | categorical_is_present = is_present(columns_name,categorical_name) 302 | target_is_present = is_present(columns_name,[target_name]) 303 | if categorical_is_present: 304 | fin_cat_dict,catg_catg_list,cont_cont_list,catg_cont_list,cont_catg_list = get_category(data_input,target_name,categorical_name,columns_name) 305 | 306 | #Subplot(Total number of graphs) 307 | total = total_subplots(data_input,[cont_cont_list,catg_catg_list,catg_cont_list,cont_catg_list]) 308 | if total < PLOT_COLUMNS_SIZE: 309 | total = PLOT_COLUMNS_SIZE 310 | 311 | PLOT_ROW_SIZE = ceil(float(total)/PLOT_COLUMNS_SIZE) 312 | 313 | #Call various functions 314 | plot,count = bivariate_analysis_cont_cont(cont_cont_list,data_input,target_name,total,COUNTER,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE) 315 | plot,count = bivariate_analysis_catg_catg(catg_catg_list,data_input,target_name,total,count,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE,bin_size=bin_size) 316 | plot,count = bivariate_analysis_cont_catg(cont_catg_list,data_input,target_name,total,count,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE) 317 | plot,count = bivariate_analysis_catg_cont(catg_cont_list,data_input,target_name,total,count,PLOT_ROW_SIZE,PLOT_COLUMNS_SIZE) 318 | 319 | fig.subplots_adjust(bottom=0.08,left = 0.05,right=0.97,top=0.93,wspace = wspace,hspace = hspace) 320 | plot.show() 321 | 322 | else: 323 | raise ValueError("Make sure input data is a Dataframe.") 324 | --------------------------------------------------------------------------------