├── .DS_Store
├── .gitignore
├── Image
    ├── Screen Shot 2020-05-16 at 4.45.48 pm.png
    ├── Screen Shot 2020-06-11 at 8.32.55 pm.png
    └── eau de parfum.png
├── LICENSE
├── README.md
├── Script
    ├── .DS_Store
    ├── Graph
    │   ├── .DS_Store
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-36.pyc
    │   │   └── plot.cpython-36.pyc
    │   └── plot.py
    ├── Report
    │   ├── .DS_Store
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-36.pyc
    │   │   └── model.cpython-36.pyc
    │   └── model.py
    ├── __init__.py
    └── main.py
├── requirements.txt
└── setup.py


/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/.DS_Store


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | .DS_Store
3 | 


--------------------------------------------------------------------------------
/Image/Screen Shot 2020-05-16 at 4.45.48 pm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Image/Screen Shot 2020-05-16 at 4.45.48 pm.png


--------------------------------------------------------------------------------
/Image/Screen Shot 2020-06-11 at 8.32.55 pm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Image/Screen Shot 2020-06-11 at 8.32.55 pm.png


--------------------------------------------------------------------------------
/Image/eau de parfum.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Image/eau de parfum.png


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2020 Kian Wee
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ![](https://raw.githubusercontent.com/kianweelee/Edator/master/Image/eau%20de%20parfum.png)
 2 | # Edator
 3 | 
 4 | [![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
 5 | [![CodeFactor](https://www.codefactor.io/repository/github/kianweelee/edator/badge)](https://www.codefactor.io/repository/github/kianweelee/edator)
 6 | [![GitHub license](https://img.shields.io/github/license/kianweelee/Edator.svg)](https://github.com/kianweelee/Edator/blob/master/LICENSE)
 7 | ![](https://img.shields.io/github/issues-raw/kianweelee/Edator)
 8 | [![](https://img.shields.io/github/v/release/kianweelee/edator)](https://github.com/kianweelee/Edator/releases)
 9 | ![](https://img.shields.io/github/last-commit/kianweelee/edator)
10 | [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](https://github.com/kianweelee/Edator/pulls)
11 | [![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/kianweelee/edator/issues)
12 | 
13 | This is a Python package that performs exploratory data analysis for users. It takes in a CSV or Excel file and generates three outputs: a text report containing a descriptive summary, a series of plots and a cleaned CSV file.
14 |  
15 | ## Set up
16 | ### Dependencies 
17 | - Python 3.8.x
18 | - matplotlib==3.1.2
19 | - numpy==1.18.1
20 | - pandas==1.0.0
21 | - PySimpleGUI==4.19.0
22 | - scikit-learn==0.22.1
23 | - scipy==1.4.1
24 | - seaborn==0.10.0
25 | - statsmodels==0.11.1
26 | - more-itertools==8.3.0
27 | 
28 | ### How to set up? (**Important!**)
29 | 1. You can clone or download my package.
30 | 2. Using terminal, move to the directory. 
31 |    - Example for Mac OS users: 
32 |    ```bash
33 |    $ cd Downloads/Edator
34 |    ```
35 | 3. Install the required packages using:
36 |    ```bash
37 |    $ pip install -r requirements.txt
38 |    ```
39 | 4. After that, change directory into the Script folder using:
40 |    ```bash
41 |    $ cd Script
42 |    ```
43 | 5. Now, execute the main.py file by:
44 |    ```bash
45 |    $ python main.py
46 |    ```
47 | 6. You should see the following:
48 | 
49 | ![](https://github.com/kianweelee/Edator/blob/master/Image/Screen%20Shot%202020-06-11%20at%208.32.55%20pm.png)
50 | 
51 | 7. Choose the format of the file (csv or xls), the path to the file and the paths to export the plots, the report and the cleaned csv file to.
52 | 
53 | 8. Done!
54 | 
55 | ## The concept behind Edator
56 | 
57 | ### Dealing with NaN values and zeros
58 | I only remove the affected rows when the percentage of NaN values within a column is **less than 5%**. This applies to both numerical and categorical columns. Otherwise, I replace the NaN values with the median for numerical columns and with the mode for categorical columns.
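A minimal sketch of the rule above, written with plain pandas rather than the imputer used in `main.py`. The function name `handle_missing`, the `threshold` parameter and the demo columns are hypothetical, and treating a column with exactly 5% missing values as "impute" is an assumption.

```python
import numpy as np
import pandas as pd


def handle_missing(df, threshold=5.0):
    """Drop rows for sparsely missing columns, impute the rest (illustrative)."""
    nan_pct = df.isna().mean() * 100  # % of NaN per column, computed once up front

    # Columns with some, but less than `threshold` percent, missing values:
    # drop the affected rows.
    sparse = [col for col, pct in nan_pct.items() if 0 < pct < threshold]
    df = df.dropna(subset=sparse).copy()

    # Columns at or above the threshold: median for numbers, mode otherwise.
    for col, pct in nan_pct.items():
        if pct >= threshold:
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df


# Hypothetical data: "age" is 2% missing, "income" and "city" are 10% missing.
n = 100
demo = pd.DataFrame({
    "age": [np.nan, np.nan] + [30.0] * (n - 2),
    "income": [np.nan] * 10 + [50.0] * (n - 10),
    "city": [None] * 10 + ["Perth"] * (n - 10),
})
cleaned = handle_missing(demo)
```

In this sketch the two rows with a missing `age` are dropped, while the remaining gaps in `income` and `city` are filled in place.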
59 | 
60 | Dealing with zeros is much harder as it is challenging to differentiate between a zero that is meaningful (has a purpose and should not be removed) and a zero that serves no purpose and can potentially add more noise to the dataset. Hence, I decided to leave zeros untouched and instead inform the user about the percentage of zeros in each column.
61 | 
62 | ### Processing outliers
63 | I use Z-score to detect outliers. If a Z-score is 0, it indicates that the data point’s score is identical to the mean score. A Z-score of 1.0 would indicate a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.
64 | 
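65 | In most cases, an absolute Z-score threshold of 3 is used to filter out outliers: any value with a Z-score above 3 or below -3 is treated as an outlier. I have used this approach for all of my analysis, and a row is removed if any of its numerical values is an outlier.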
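The Z-score filter can be sketched as follows. This version computes the scores directly with pandas instead of calling `scipy.stats.zscore`; the helper name `drop_outliers` and the demo data are hypothetical.

```python
import pandas as pd


def drop_outliers(df, columns, threshold=3.0):
    """Keep rows whose |Z-score| is below `threshold` in every listed column."""
    values = df[columns]
    # Population standard deviation (ddof=0), the default of scipy.stats.zscore.
    z = (values - values.mean()) / values.std(ddof=0)
    return df[(z.abs() < threshold).all(axis=1)]


# Hypothetical data: 20 ordinary readings plus one extreme value.
demo = pd.DataFrame({"height": [170.0] * 10 + [172.0] * 10 + [400.0]})
filtered = drop_outliers(demo, ["height"])
```

Here the reading of 400 lies more than three standard deviations from the mean, so its row is removed and the other 20 rows are kept.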
66 | 
67 | ### Correlation
68 | For correlation, I included:
69 | 1. Pearson and Spearman correlation for numerical-numerical variables.
70 | 2. One Way ANOVA for numerical-categorical variables
71 | 3. Chi-Square test for categorical-categorical variables
72 | 
73 | Using itertools.combinations, I identify every possible pair among the numerical variables and among the categorical variables, and using itertools.product, every numerical-categorical pair. I then apply the correlation test that matches each pair type, as listed above.
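As a rough illustration of the pairing step, the sketch below enumerates the pairs and runs one test per pair type. It swaps in `scipy.stats.pearsonr` and `scipy.stats.f_oneway` for the `linregress` and statsmodels calls used in `model.py`, and the demo data and the `p_values` dictionary are hypothetical.

```python
import itertools

import pandas as pd
from scipy import stats

# Hypothetical data, for illustration only.
demo = pd.DataFrame({
    "height": [150, 160, 170, 180, 190, 200],
    "weight": [50, 58, 69, 77, 92, 99],
    "sex": ["f", "m", "f", "m", "f", "m"],
    "smoker": ["y", "y", "n", "n", "y", "n"],
})
numerical = ["height", "weight"]
categorical = ["sex", "smoker"]

# Enumerate every pair once per pair type.
num_pairs = list(itertools.combinations(numerical, 2))
cat_pairs = list(itertools.combinations(categorical, 2))
num_cat_pairs = list(itertools.product(numerical, categorical))

p_values = {}

# Numerical-numerical: Pearson and Spearman correlation.
for a, b in num_pairs:
    p_values[("pearson", a, b)] = stats.pearsonr(demo[a], demo[b])[1]
    p_values[("spearman", a, b)] = stats.spearmanr(demo[a], demo[b])[1]

# Numerical-categorical: one-way ANOVA across the groups of the category.
for num, cat in num_cat_pairs:
    groups = [group[num].to_numpy() for _, group in demo.groupby(cat)]
    p_values[("anova", num, cat)] = stats.f_oneway(*groups)[1]

# Categorical-categorical: chi-square test on the contingency table.
for a, b in cat_pairs:
    table = pd.crosstab(demo[a], demo[b])
    p_values[("chi2", a, b)] = stats.chi2_contingency(table)[1]
```

With two numerical and two categorical columns this produces one numerical pair, one categorical pair and four numerical-categorical pairs.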
74 | 
75 | ### Plots
76 | For plots, I created:
77 | 1. Scatterplot for numerical variables
78 | 2. Countplot for categorical variables
79 | 3. Boxplot for numerical-categorical variables
80 | 
81 | Similar to correlation, I used itertools.combinations to create every possible plot. I have also added the hue feature to each scatterplot, but only when the categorical variable has fewer than 5 unique values. For example, if hue = "fruits", I should see at most 4 types of fruits.
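The hue selection can be sketched without drawing anything. The helper `scatterplot_jobs`, its `max_levels` parameter and the demo columns are hypothetical, and falling back to plain scatterplots when no column qualifies is an assumption.

```python
import itertools

import pandas as pd


def scatterplot_jobs(df, numerical, categorical, max_levels=5):
    """List (x, y, hue) triples; hue is None when no categorical column qualifies."""
    # Only columns with fewer than `max_levels` unique values are used as hue.
    hues = [col for col in categorical if df[col].nunique() < max_levels]
    pairs = list(itertools.combinations(numerical, 2))
    if not hues:
        return [(x, y, None) for x, y in pairs]
    return [(x, y, hue) for (x, y), hue in itertools.product(pairs, hues)]


# Hypothetical data: "fruits" has 4 unique values, "batch" has 6.
demo = pd.DataFrame({
    "price": [1, 2, 3, 4, 5, 6],
    "weight": [10, 12, 9, 14, 11, 13],
    "sugar": [5, 7, 6, 8, 5, 9],
    "fruits": ["apple", "pear", "kiwi", "plum", "apple", "pear"],
    "batch": ["a", "b", "c", "d", "e", "f"],
})
jobs = scatterplot_jobs(demo, ["price", "weight", "sugar"], ["fruits", "batch"])
```

Each triple could then be passed to `seaborn.scatterplot(data=demo, x=x, y=y, hue=hue)`. In this example only `fruits` qualifies as a hue, so three plots are planned.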
82 | 
83 | ### Upcoming changes for version 0.3
84 | 1. Take in more input file formats beyond CSV and Excel
85 | 2. Based on user feedback, I will increase the variety of plots beyond scatterplots, countplots and boxplots.
86 | 3. Report generated will be in HTML format. 
87 | 


--------------------------------------------------------------------------------
/Script/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/.DS_Store


--------------------------------------------------------------------------------
/Script/Graph/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Graph/.DS_Store


--------------------------------------------------------------------------------
/Script/Graph/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Graph/__init__.py


--------------------------------------------------------------------------------
/Script/Graph/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Graph/__pycache__/__init__.cpython-36.pyc


--------------------------------------------------------------------------------
/Script/Graph/__pycache__/plot.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Graph/__pycache__/plot.cpython-36.pyc


--------------------------------------------------------------------------------
/Script/Graph/plot.py:
--------------------------------------------------------------------------------
 1 | import seaborn as sns
 2 | import itertools
 3 | from sklearn.preprocessing import LabelEncoder
 4 | import matplotlib.pyplot as plt
 5 | 
 6 | # Create a function for plots
 7 |     ## Look into making count plots for categorical data
 8 |     ## Look into making scatterplots for numerical data and boxplots for numerical-categorical data
 9 |     ## Export plots into plot path
10 | def run(data,categorical_variable,numerical_variable,plot_path):
11 | 
12 |     ## Use categorical variables with fewer than 5 unique values as hue
13 |     hue_lst = []
14 |     for x in categorical_variable:
15 |         if len(set(data[x])) < 5: # fewer than 5 unique values, so it is usable as a hue attribute
16 |             hue_lst.append(x)
17 |     ## Creating possible combinations among a list of numerical variables
18 |     num_var_combination = list(itertools.combinations(numerical_variable, 2))
19 |     ## Creating possible combinations among a list of categorical variables
20 |     cat_var_combination = list(itertools.combinations(categorical_variable, 2))
21 |     ## Creating possible combinations among a list of numerical and categorical variables
22 |     catnum_combination = list(itertools.product(numerical_variable, categorical_variable))
23 | 
24 |     ## Using scatterplot for numerical-numerical variables, coloured by each hue candidate when there is one
25 |     if hue_lst:
26 |         num_var_hue_combination = list(itertools.product(num_var_combination, hue_lst))
27 |         for i in num_var_hue_combination:
28 |             var1 = i[0][0]
29 |             var2 = i[0][1]
30 |             hue1 = i[1]
31 |             plot1 = sns.scatterplot(data = data, x = var1, y = var2, hue = hue1)
32 |             fig1 = plot1.get_figure()
33 |             fig1.savefig(plot_path + "/{} vs {} by {} scatterplot.png".format(var1,var2, hue1))
34 |             fig1.clf()
35 |     else:
36 |         for l in num_var_combination:
37 |             var1 = l[0]
38 |             var2 = l[1]
39 |             plot1 = sns.scatterplot(data = data, x = var1, y = var2)
40 |             fig1 = plot1.get_figure()
41 |             fig1.savefig(plot_path + "/{} vs {} scatterplot.png".format(var1,var2))
42 |             fig1.clf()
43 | 
44 | 
45 |     ## Using countplot for categorical data
46 |     for j in categorical_variable:
47 |         plot2 = sns.countplot(data = data, x = j)
48 |         fig2 = plot2.get_figure()
49 |         fig2.savefig(plot_path + "/{}_countplot.png".format(j))
50 |         fig2.clf()
51 | 
52 |     ## Using boxplot for numerical + Categorical data
53 |     for k in catnum_combination:
54 |         num1 = k[0]
55 |         cat1 = k[1]
56 |         plot3 = sns.boxplot(data = data, x = cat1, y = num1)
57 |         fig3 = plot3.get_figure()
58 |         fig3.savefig(plot_path + "/{}_{}_boxplot.png".format(num1,cat1))
59 |         fig3.clf()
60 | 
61 |     ## Creating heatmap to show correlation
62 |     data = data.copy() # label-encode a copy so the caller's DataFrame is not modified
63 |     for cat in categorical_variable:
64 |         data[cat] = LabelEncoder().fit_transform(data[cat])
65 |     plt.figure(figsize=(15,10))
66 |     corrMatrix = data.corr()
67 |     plot4 = sns.heatmap(corrMatrix, annot=True)
68 |     fig4 = plot4.get_figure()
69 |     fig4.savefig(plot_path + "/heatplot.png")
70 |     fig4.clf()


--------------------------------------------------------------------------------
/Script/Report/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Report/.DS_Store


--------------------------------------------------------------------------------
/Script/Report/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Report/__init__.py


--------------------------------------------------------------------------------
/Script/Report/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Report/__pycache__/__init__.cpython-36.pyc


--------------------------------------------------------------------------------
/Script/Report/__pycache__/model.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Report/__pycache__/model.cpython-36.pyc


--------------------------------------------------------------------------------
/Script/Report/model.py:
--------------------------------------------------------------------------------
 1 | from scipy import stats
 2 | from scipy.stats import linregress
 3 | from scipy.stats import chi2_contingency
 4 | import statsmodels.api as sm
 5 | from statsmodels.formula.api import ols
 6 | import pandas as pd
 7 | 
 8 | 
 9 | # Creating a function that provide an overview of the data
10 |     ## To include a comment for each section
11 |     ## To print the first 5 lines of data
12 |     ## To print the shape of data in words
13 |     ## To print the dtypes of each column
14 |     ## To print the number of null values in each columns.
15 |         ## Also to include anything like "no values", "unknown"...
16 |         ## If data isnumeric()
17 |     ## To print the summary of data (i.e. count,mean,min,max)
18 | def overview(df, numerical_variable, report):
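19 |     '''
20 |     Write an overview of the data to the report and return the coerced data.
21 | 
22 |     Parameters
23 |     ----------
24 |     df : DataFrame
25 |         Imported dataframe from csv_path.
26 |     numerical_variable : list of str
27 |         Columns to coerce to numeric; non-numeric entries become NaN.
28 |     report : file object
29 |         Open text file that the overview is written to.
30 | 
31 |     '''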
32 |     data_head = df.head()
33 |     data_shape = df.shape
34 |     data_type = df.dtypes
35 |     df = (df.drop(numerical_variable, axis=1).join(df[numerical_variable].apply(pd.to_numeric, errors='coerce'))) # Converts any non-numeric values in a numerical column into NaN
36 |     null_values = df.isnull().sum()
37 |     zero_prop = ((df[df == 0].count(axis=0)/len(df.index)).round(2)* 100)
38 |     data_summary = df.describe()
39 |     report.write("______Exploratory data analysis summary by Edator______\n\n\n\nThe first 5 rows of the data are:\n\n{}\n\n\nThere are a total of {} rows and {} columns.\n\n\nThe data type for each column is:\n\n{}\n\n\nNumber of NaN values for each column:\n\n{}\n\n\n% of zeros in each column:\n\n{}\n\n\nThe summary of data:\n\n{}"
40 |                  .format(data_head, data_shape[0], data_shape[1], data_type, null_values, zero_prop, data_summary))
41 |     return df
42 | 
43 | 
44 | # Creating report for correlation
45 | def run(num_var_combination, catnum_combination, cat_var_combination, report,data):
46 | ## For numeric variables
47 | # Pearson correlation (Numerical)
48 |     report.write("\n\n\n__________Correlation Summary (Pearson)__________")
49 |     for i in num_var_combination:
50 |         var1 = i[0]
51 |         var2 = i[1]
52 |         pearson_data = linregress(data[var1], data[var2])
53 |         pearson_r2, pearson_pvalue = ((pearson_data[2]**2), pearson_data[3])
54 |         report.write("\n\nThe Pearson R_Square and Pearson P-values between {} and {} are {} and {} respectively."
55 |                  .format(var1, var2, pearson_r2, pearson_pvalue))
56 | 
57 |     # Spearman correlation (Ordinal)
58 |     report.write("\n\n\n\n__________Correlation Summary (Spearman)__________")
59 |     for q in num_var_combination:
60 |         var1 = q[0]
61 |         var2 = q[1]
62 |         spearman_data = stats.spearmanr(data[var1], data[var2])
63 |         spearman_r2, spearman_pvalue = ((spearman_data[0]**2), spearman_data[1])
64 |         report.write("\n\nThe Spearman R_Square and Spearman P-values between {} and {} are {} and {} respectively."
65 |                  .format(var1, var2, spearman_r2, spearman_pvalue))
66 | 
67 |     ## For numeric-categorical variables
68 |     # ONE WAY ANOVA (Cat-num variables)
69 |     report.write("\n\n\n\n__________Correlation Summary (One Way ANOVA)__________")
70 |     for j in catnum_combination:
71 |         var1 = j[0]
72 |         var2 = j[1]
73 |         lm = ols('{} ~ {}'.format(var1,var2), data = data).fit()
74 |         table = sm.stats.anova_lm(lm)
75 |         one_way_anova_pvalue = table.loc[var2,'PR(>F)']
76 |         report.write("\n\nThe One Way ANOVA P-value between {} and {} is {}."
77 |                  .format(var1, var2, one_way_anova_pvalue))
78 |         
79 |     ## For categorical-categorical variables
80 |     # Chi-Sq test
81 |     report.write("\n\n\n\n__________Correlation Summary (Chi Square Test)__________")
82 |     for k in cat_var_combination:
83 |         cat1 = k[0]
84 |         cat2 = k[1]
85 |         chi_sq = pd.crosstab(data[cat1], data[cat2])
86 |         chi_sq_result = chi2_contingency(chi_sq)
87 |         report.write("\n\nThe Chi-Square P-value between {} and {} is {}."
88 |                  .format(cat1, cat2, chi_sq_result[1]))
89 | 
90 |     report.close()
91 |  
92 |     


--------------------------------------------------------------------------------
/Script/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/__init__.py


--------------------------------------------------------------------------------
/Script/main.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | # -*- coding: utf-8 -*-
  3 | """
  4 | Created on Sat Apr 11 11:07:29 2020
  5 | 
  6 | @author: kianweelee
  7 | """
  8 | 
  9 | # Importing the required packages
 10 | import os.path
 11 | import pandas as pd
 12 | import matplotlib.pyplot as plt
 13 | import seaborn as sns
 14 | import PySimpleGUI as sg
 15 | from sklearn.impute import SimpleImputer
 16 | import numpy as np
 17 | from scipy import stats
 18 | import statsmodels.api as sm
 19 | from statsmodels.formula.api import ols
 20 | from scipy.stats import linregress
 21 | from scipy.stats import chi2_contingency
 22 | import itertools
 23 | from sklearn.preprocessing import LabelEncoder
 24 | from Report import model
 25 | from Graph import plot
 26 | 
 27 | def main():
 28 |     # Create a path to access csv file and also define a path to deposit EDA plots and report
 29 |         ## Required to create a GUI using pySimpleGUI
 30 | 
 31 |     layout = [
 32 |     [sg.Frame(layout=[
 33 |     [sg.Radio('CSV', "file_format", default=True, size=(10,1)), sg.Radio('XLS', "file_format")]], title='File Format',title_color='red', relief=sg.RELIEF_SUNKEN)],
 34 |     [sg.Text('Please input the folder path')],
 35 |     [sg.Text('File path:', size=(18, 1)), sg.Input(), sg.FileBrowse()],
 36 |     [sg.Text('Export plots to:', size=(18, 1)), sg.Input(), sg.FolderBrowse()],
 37 |     [sg.Text('Export report to:', size=(18, 1)), sg.Input(), sg.FolderBrowse()],
 38 |     [sg.Text('Export cleaned csv to:', size=(18, 1)), sg.Input(), sg.FolderBrowse()],
 39 |     [sg.Submit(), sg.Cancel()]]
 40 | 
 41 |     window = sg.Window('Edator', layout)
 42 | 
 43 |     event, values = window.read()
 44 |     window.close()
 45 |     if event in (None, 'Cancel'): # Stop here if the user cancelled or closed the window
 46 |         return
 47 |     csv_option, xls_option, file_path, plot_path, report_path, clean_csv_path = (values[i] for i in range(6))
 48 | 
 49 |     # Creating a txt file in the report_path
 50 |     filename = os.path.join(report_path, "report" + ".txt")
 51 | 
 52 |     # Assigning the imported file to a variable called 'data'
 53 |     if csv_option:
 54 |         data = pd.read_csv(file_path)
 55 |     else:
 56 |         data = pd.read_excel(file_path, index_col = None)
 57 | 
 58 |     # Create a function to separate out numerical and categorical data
 59 |         ## Using this function to ensure that all non-numerical in a numerical column
 60 |         ## and non-categorical in a categorical column is annotated
 61 |     def cat_variable(df):
 62 |         return list(df.select_dtypes(include = ['category', 'object']))
 63 | 
 64 |     def num_variable(df):
 65 |         return list(df.select_dtypes(exclude = ['category', 'object']))
 66 | 
 67 |     categorical_variable = cat_variable(data)
 68 |     numerical_variable = num_variable(data)
 69 | 
 70 |     # Assigning variable filename to report and enable writing mode
 71 |     report = open(filename, "w")
 72 | 
 73 |     # Execute overview function in model module
 74 |     data = model.overview(data, numerical_variable, report)
 75 | 
 76 | 
 77 |     # Create a function to decide whether to drop all NA values or replace them
 78 |     ## Drop it if NAN count < 5 %
 79 |     nan_prop = (data.isna().mean().round(2)*100) # Show % of NaN values per column
 80 | 
 81 |     def drop_na():
 82 |         return [i for i, v in nan_prop.items() if v < 5 and v > 0]
 83 | 
 84 |     cols_to_drop = drop_na()
 85 | 
 86 |     data = data.dropna(subset = cols_to_drop)
 87 | 
 88 | 
 89 |     ## Using Imputer to fill NaN values
 90 |     ## Counting the proportion of NaN
 91 | 
 92 |     def fill_na():
 93 |         return [i for i, v in nan_prop.items() if v >= 5] # include exactly 5% so no NaN is left behind
 94 | 
 95 |     cols_to_fill = fill_na()
 96 | 
 97 |     cat_var_tofill = []
 98 |     num_var_tofill = []
 99 | 
100 |     for var in cols_to_fill:
101 |         if var in categorical_variable:
102 |             cat_var_tofill.append(var)
103 |         else:
104 |             num_var_tofill.append(var)
105 | 
106 |     imp_cat = SimpleImputer(missing_values = np.nan, strategy='most_frequent')
107 |     try:
108 |         data[cat_var_tofill] = imp_cat.fit_transform(data[cat_var_tofill])
109 |     except ValueError:
110 |         pass
111 | 
112 |     imp_num = SimpleImputer(missing_values = np.nan, strategy='median')
113 |     try:
114 |         data[num_var_tofill] = imp_num.fit_transform(data[num_var_tofill])
115 |     except ValueError:
116 |         pass
117 | 
118 |     # Create a function to process outlier data
119 |     def outlier():
120 |         z = np.abs(stats.zscore(data[numerical_variable]))
121 |         z_data = data[(z < 3).all(axis=1)] # Remove any outliers with Z-score > 3 or < -3
122 |         return z_data
123 | 
124 |     data = outlier()
125 | 
126 |     # Create a function to compute correlation
127 |     ## Pearson and Spearman correlation for numerical-numerical data
128 |     ## One-Way ANOVA for numerical-categorical data
129 |     ## Chi-Square test for categorical-categorical data
130 | 
131 |     ## Creating possible combinations among a list of numerical variables
132 |     num_var_combination = list(itertools.combinations(numerical_variable, 2))
133 | 
134 |     ## Creating possible combinations among a list of categorical variables
135 |     cat_var_combination = list(itertools.combinations(categorical_variable, 2))
136 | 
137 |     ## Creating possible combinations among a list of numerical and categorical variables
138 |     catnum_combination = list(itertools.product(numerical_variable, categorical_variable))
139 | 
140 |     ## Running the report now
141 |     model.run(num_var_combination,catnum_combination,cat_var_combination,report,data)
142 | 
143 |     # Create an output file that shows cleaned data
144 |     data2 = data.copy()
145 |     data2.to_csv(r'{}/cleaned_csv.csv'.format(clean_csv_path), index = False)
146 | 
147 |     # Running plot.py from Graph package
148 |     plot.run(data, categorical_variable,numerical_variable,plot_path)
149 | 
150 | 
151 | 
152 | # Running program
153 | if __name__ == "__main__":
154 |     main()


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | matplotlib>=3.1.2
 2 | numpy>=1.18.1
 3 | pandas>=1.0.0
 4 | PySimpleGUI>=4.19.0
 5 | scikit-learn>=0.22.1
 6 | scipy>=1.4.1
 7 | seaborn>=0.10.0
 8 | statsmodels>=0.11.1
 9 | more-itertools>=8.3.0
10 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | from setuptools import setup, find_packages
 2 | 
 3 | setup(
 4 |     name='Edator',
 5 |     version='0.1',
 6 |     packages=find_packages(),
 7 |     license='MIT',
 8 |     description='A python package that runs exploratory data analysis for users',
 9 |     long_description=open('README.md').read(),
10 |     install_requires=['pandas', 'matplotlib', 'seaborn', 'PySimpleGUI', 'scikit-learn', 'numpy', 'scipy', 'statsmodels'],
11 |     url='https://github.com/kianweelee/Edator',
12 |     download_url= 'https://github.com/kianweelee/Edator/archive/0.1.tar.gz',
13 |     author='Lee Kian Wee',
14 |     author_email='leekianwee@outlook.com'
15 | )


--------------------------------------------------------------------------------