├── .DS_Store
├── .gitignore
├── Image
    ├── Screen Shot 2020-05-16 at 4.45.48 pm.png
    ├── Screen Shot 2020-06-11 at 8.32.55 pm.png
    └── eau de parfum.png
├── LICENSE
├── README.md
├── Script
    ├── .DS_Store
    ├── Graph
    │   ├── .DS_Store
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-36.pyc
    │   │   └── plot.cpython-36.pyc
    │   └── plot.py
    ├── Report
    │   ├── .DS_Store
    │   ├── __init__.py
    │   ├── __pycache__
    │   │   ├── __init__.cpython-36.pyc
    │   │   └── model.cpython-36.pyc
    │   └── model.py
    ├── __init__.py
    └── main.py
├── requirements.txt
└── setup.py


/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/.DS_Store

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

.DS_Store

--------------------------------------------------------------------------------
/Image/Screen Shot 2020-05-16 at 4.45.48 pm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Image/Screen Shot 2020-05-16 at 4.45.48 pm.png

--------------------------------------------------------------------------------
/Image/Screen Shot 2020-06-11 at 8.32.55 pm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Image/Screen Shot 2020-06-11 at 8.32.55 pm.png

--------------------------------------------------------------------------------
/Image/eau de parfum.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Image/eau de parfum.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Kian Wee

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![](https://raw.githubusercontent.com/kianweelee/Edator/master/Image/eau%20de%20parfum.png)
# Edator

[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
[![CodeFactor](https://www.codefactor.io/repository/github/kianweelee/edator/badge)](https://www.codefactor.io/repository/github/kianweelee/edator)
[![GitHub license](https://img.shields.io/github/license/Naereen/StrapDown.js.svg)](https://github.com/Naereen/StrapDown.js/blob/master/LICENSE)
![](https://img.shields.io/bitbucket/issues-raw/kianweelee/Edator)
[![](https://img.shields.io/github/v/release/kianweelee/edator)](https://github.com/kianweelee/Edator/releases)
![](https://img.shields.io/github/last-commit/kianweelee/edator)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](https://github.com/kianweelee/Edator/pulls)
[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/kianweelee/edator/issues)

This is a Python package that performs exploratory data analysis for users. It takes in a CSV or XLS file and generates 3 outputs: a text report containing a descriptive summary, a series of plots and a cleaned CSV file.

## Set up
### Dependencies
- Python 3.8.x
- matplotlib==3.1.2
- numpy==1.18.1
- pandas==1.0.0
- PySimpleGUI==4.19.0
- scikit-learn==0.22.1
- scipy==1.4.1
- seaborn==0.10.0
- statsmodels==0.11.1
- more-itertools==8.3.0

### How to set up? (**Important!**)
1. Clone or download this package.
2. In a terminal, move into the package directory.
   - Example for Mac OS users:
   ```bash
   $ cd Downloads/Edator
   ```
3. Install the required packages using:
   ```bash
   $ pip install -r requirements.txt
   ```
4. After that, change directory into the Script folder using:
   ```bash
   $ cd Script
   ```
5. Now, execute the main.py file with:
   ```bash
   $ python main.py
   ```
6. You should see the following:

![](https://github.com/kianweelee/Edator/blob/master/Image/Screen%20Shot%202020-06-11%20at%208.32.55%20pm.png)

7. Choose the format of the file (csv or xls), the path to the file, and the folders to export the plots, the report and the cleaned csv file to.

8. Done!

## The concept behind Edator

### Dealing with NaN values and zeros
I remove the affected rows only when the percentage of NaN values within a column is **less than 5%**. This applies to both numerical and categorical columns. When 5% or more of a column is NaN, I fill the missing values instead: numerical columns are filled with the median and categorical columns with the mode.

Dealing with zeros is much harder, as it is challenging to differentiate between a zero that is meaningful (has a purpose and should not be removed) and a zero that serves no purpose and can potentially add noise to the dataset. Hence, I decided to only inform the user about the percentage of zeros in each column.

### Processing outliers
I use the Z-score to detect outliers. A Z-score of 0 indicates that the data point’s value is identical to the mean. A Z-score of 1.0 indicates a value that is one standard deviation from the mean. Z-scores may be positive or negative: a positive value means the data point is above the mean and a negative value means it is below the mean.

In most cases, a threshold of 3 or -3 is used to filter out outliers, and I use this threshold for all of my analysis: any row with a numerical value whose Z-score is above 3 or below -3 is removed.
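
As a rough illustration of the Z-score filter described above, here is a minimal sketch that uses only the Python standard library. Edator itself relies on `scipy.stats.zscore`; the helper names `z_scores` and `drop_outliers` and the sample `readings` are hypothetical and are not part of the package.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the Z-score of each value: (value - mean) / population std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (ddof=0)
    if sigma == 0:
        # All values are identical, so every Z-score is 0
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def drop_outliers(values, threshold=3.0):
    """Keep only the values whose absolute Z-score is below the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) < threshold]


# Hypothetical column: nineteen ordinary readings and one extreme value
readings = [10.0] * 19 + [100.0]
cleaned = drop_outliers(readings)
print(len(readings), len(cleaned))
```

In this made-up column the extreme reading sits more than 3 standard deviations above the mean, so it is the only value dropped. Note that a single extreme value in a very small sample can never reach a Z-score of 3, so the threshold is only meaningful with enough rows.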

### Correlation
For correlation, I included:
1. Pearson and Spearman correlation for numerical-numerical variables
2. One-way ANOVA for numerical-categorical variables
3. Chi-square test for categorical-categorical variables

Using itertools.combinations, I identify every possible pair among the numerical variables and among the categorical variables, and using itertools.product I pair every numerical variable with every categorical variable. I then apply the correlation test that matches each pair, based on the criteria above.

### Plots
For plots, I created:
1. Scatterplot for numerical-numerical variables
2. Countplot for categorical variables
3. Boxplot for numerical-categorical variables

Similar to correlation, I used itertools.combinations to create every possible plot. I have also added the hue feature to each scatterplot, but only for categorical variables with fewer than 5 unique values. For example, if hue = "fruits", I should see at most 4 types of fruits.

### Upcoming changes for version 0.3
1. Accept more input file formats beyond CSV and Excel.
2. After gathering user input, I will increase the variety of plots beyond scatterplots, barplots and boxplots.
3. The generated report will be in HTML format.
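
The pairing step described under Correlation and Plots can be sketched as follows. This is only an illustration with the standard library: the column names and the `plan` list are hypothetical, and the actual tests are run by the functions in `Script/Report/model.py`.

```python
from itertools import combinations, product

# Hypothetical column names, split by type
numerical = ["height", "weight", "age"]
categorical = ["gender", "city"]

# Every unordered pair within each group
num_pairs = list(combinations(numerical, 2))
cat_pairs = list(combinations(categorical, 2))
# Every numerical variable paired with every categorical variable
num_cat_pairs = list(product(numerical, categorical))

# Decide which test applies to which pair
plan = []
for pair in num_pairs:
    plan.append(("Pearson", pair))
    plan.append(("Spearman", pair))
for pair in num_cat_pairs:
    plan.append(("One-way ANOVA", pair))
for pair in cat_pairs:
    plan.append(("Chi-square", pair))

for test_name, (first, second) in plan:
    print("{}: {} vs {}".format(test_name, first, second))
```

The number of pairs grows quadratically with the number of columns, so a wide dataset produces a long report and many plots.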

--------------------------------------------------------------------------------
/Script/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/.DS_Store

--------------------------------------------------------------------------------
/Script/Graph/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Graph/.DS_Store

--------------------------------------------------------------------------------
/Script/Graph/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Graph/__init__.py

--------------------------------------------------------------------------------
/Script/Graph/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Graph/__pycache__/__init__.cpython-36.pyc

--------------------------------------------------------------------------------
/Script/Graph/__pycache__/plot.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Graph/__pycache__/plot.cpython-36.pyc

--------------------------------------------------------------------------------
/Script/Graph/plot.py:
--------------------------------------------------------------------------------
import itertools

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

# Create a function for plots
## Look into making count plots for categorical data
## Look into making scatterplots and barplot for numerical data
## Export plots into plot path
def run(data, categorical_variable, numerical_variable, plot_path):

    ## Set categorical variables with < 5 unique values as hue
    hue_lst = []
    for x in categorical_variable:
        if len(set(data[x])) < 5:  # fewer than 5 unique values, as documented in the README
            hue_lst.append(x)
    ## Creating possible combinations among a list of numerical variables
    num_var_combination = list(itertools.combinations(numerical_variable, 2))
    ## Creating possible combinations among a list of numerical and categorical variables
    catnum_combination = list(itertools.product(numerical_variable, categorical_variable))

    ## Using scatterplot for numerical-numerical variables
    ## Check hue_lst rather than categorical_variable, so that scatterplots are
    ## still produced when no categorical variable qualifies as hue
    if hue_lst:
        for (var1, var2), hue1 in itertools.product(num_var_combination, hue_lst):
            plot1 = sns.scatterplot(data=data, x=var1, y=var2, hue=hue1)
            fig1 = plot1.get_figure()
            fig1.savefig(plot_path + "/{} vs {} by {} scatterplot.png".format(var1, var2, hue1))
            fig1.clf()
    else:
        for var1, var2 in num_var_combination:
            plot1 = sns.scatterplot(data=data, x=var1, y=var2)
            fig1 = plot1.get_figure()
            fig1.savefig(plot_path + "/{} vs {} scatterplot.png".format(var1, var2))
            fig1.clf()

    ## Using countplot for categorical data
    for j in categorical_variable:
        plot2 = sns.countplot(data=data, x=j)
        fig2 = plot2.get_figure()
        fig2.savefig(plot_path + "/{}_countplot.png".format(j))
        fig2.clf()

    ## Using boxplot for numerical + categorical data
    ## Note: the output file is named "barplot" although the chart is a boxplot
    for num1, cat1 in catnum_combination:
        plot3 = sns.boxplot(data=data, x=cat1, y=num1)
        fig3 = plot3.get_figure()
        fig3.savefig(plot_path + "/{}_{}_barplot.png".format(num1, cat1))
        fig3.clf()

    ## Creating heatmap to show correlation
    ## Label-encode a copy so that the caller's dataframe is left unchanged
    encoded = data.copy()
    le = LabelEncoder()
    for cat in categorical_variable:
        encoded[cat] = le.fit_transform(encoded[cat])
    plt.figure(figsize=(15, 10))
    corrMatrix = encoded.corr()
    plot4 = sns.heatmap(corrMatrix, annot=True)
    fig4 = plot4.get_figure()
    fig4.savefig(plot_path + "/heatplot.png")
    fig4.clf()

--------------------------------------------------------------------------------
/Script/Report/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Report/.DS_Store

--------------------------------------------------------------------------------
/Script/Report/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Report/__init__.py

--------------------------------------------------------------------------------
/Script/Report/__pycache__/__init__.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Report/__pycache__/__init__.cpython-36.pyc

--------------------------------------------------------------------------------
/Script/Report/__pycache__/model.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/Report/__pycache__/model.cpython-36.pyc
--------------------------------------------------------------------------------
/Script/Report/model.py:
--------------------------------------------------------------------------------
from scipy import stats
from scipy.stats import linregress
from scipy.stats import chi2_contingency
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd


# Creating a function that provides an overview of the data
## To include a comment for each section
## To print the first 5 lines of data
## To print the shape of data in words
## To print the dtypes of each column
## To print the number of null values in each column.
## Also to include anything like "no values", "unknown"...
## If data isnumeric()
## To print the summary of data (i.e. count,mean,min,max)
def overview(df, numerical_variable, report):
    '''
    Write an overview of the data to the report.

    Parameters
    ----------
    df : DataFrame
        Imported dataframe from csv_path.
    numerical_variable : list
        Names of the numerical columns in df.
    report : file object
        Open text file that the overview is written to.

    Returns
    -------
    DataFrame
        df with non-numeric values in numerical columns converted to NaN.

    '''
    data_head = df.head()
    data_shape = df.shape
    data_type = df.dtypes
    # Converts any non-numeric values in a numerical column into NaN
    df = (df.drop(numerical_variable, axis=1)
            .join(df[numerical_variable].apply(pd.to_numeric, errors='coerce')))
    null_values = df.isnull().sum()
    zero_prop = ((df[df == 0].count(axis=0)/len(df.index)).round(2) * 100)
    data_summary = df.describe()
    report.write(
        "______Exploratory data analysis summary by Edator______\n\n\n\n"
        "The first 5 rows of the data are:\n\n{}\n\n\n"
        "There are a total of {} rows and {} columns.\n\n\n"
        "The data type for each column is:\n\n{}\n\n\n"
        "Number of NaN values for each column:\n\n{}\n\n\n"
        "% of zeros in each column:\n\n{}\n\n\n"
        "The summary of data:\n\n{}"
        .format(data_head, data_shape[0], data_shape[1], data_type, null_values, zero_prop, data_summary))
    return df


# Creating report for correlation
def run(num_var_combination, catnum_combination, cat_var_combination, report, data):
    ## For numeric variables
    # Pearson correlation (Numerical)
    report.write("\n\n\n__________Correlation Summary (Pearson)__________")
    for var1, var2 in num_var_combination:
        pearson_data = linregress(data[var1], data[var2])
        pearson_r2, pearson_pvalue = (pearson_data.rvalue**2, pearson_data.pvalue)
        report.write("\n\nThe Pearson R_Square and Pearson P-values between {} and {} are {} and {} respectively."
                     .format(var1, var2, pearson_r2, pearson_pvalue))

    # Spearman correlation (Ordinal)
    report.write("\n\n\n\n__________Correlation Summary (Spearman)__________")
    for var1, var2 in num_var_combination:
        spearman_data = stats.spearmanr(data[var1], data[var2])
        spearman_r2, spearman_pvalue = ((spearman_data[0]**2), spearman_data[1])
        report.write("\n\nThe Spearman R_Square and Spearman P-values between {} and {} are {} and {} respectively."
                     .format(var1, var2, spearman_r2, spearman_pvalue))

    ## For numeric-categorical variables
    # ONE WAY ANOVA (Cat-num variables)
    report.write("\n\n\n\n__________Correlation Summary (One Way ANOVA)__________")
    for var1, var2 in catnum_combination:
        # var1 is the numerical response, var2 the categorical factor
        lm = ols('{} ~ {}'.format(var1, var2), data=data).fit()
        table = sm.stats.anova_lm(lm)
        one_way_anova_pvalue = table.loc[var2, 'PR(>F)']
        report.write("\n\nThe One Way ANOVA P-value between {} and {} is {}."
                     .format(var1, var2, one_way_anova_pvalue))

    ## For categorical-categorical variables
    # Chi-Sq test
    report.write("\n\n\n\n__________Correlation Summary (Chi Square Test)__________")
    for cat1, cat2 in cat_var_combination:
        chi_sq = pd.crosstab(data[cat1], data[cat2])
        chi_sq_result = chi2_contingency(chi_sq)
        report.write("\n\nThe Chi-Square P-value between {} and {} is {}."
                     .format(cat1, cat2, chi_sq_result[1]))

    report.close()

--------------------------------------------------------------------------------
/Script/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kianweelee/Edator/4746f577a1ffda7a8013fac37792913b8cfce7fb/Script/__init__.py

--------------------------------------------------------------------------------
/Script/main.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat Apr 11 11:07:29 2020

@author: kianweelee
"""

# Importing the required packages
import itertools
import os.path

import numpy as np
import pandas as pd
import PySimpleGUI as sg
from scipy import stats
from sklearn.impute import SimpleImputer

from Report import model
from Graph import plot

def main():
    # Create a path to access csv file and also define a path to deposit EDA plots and report
    ## Required to create a GUI using pySimpleGUI

    layout = [
        [sg.Frame(layout=[
            [sg.Radio('CSV', "file_format", default=True, size=(10,1)), sg.Radio('XLS', "file_format")]],
            title='File Format', title_color='red', relief=sg.RELIEF_SUNKEN)],
        [sg.Text('Please input the folder path')],
        [sg.Text('File path:', size=(18, 1)), sg.Input(), sg.FileBrowse()],
        [sg.Text('Export plots to:', size=(18, 1)), sg.Input(), sg.FolderBrowse()],
        [sg.Text('Export report to:', size=(18, 1)), sg.Input(), sg.FolderBrowse()],
        [sg.Text('Export cleaned csv to:', size=(18, 1)), sg.Input(), sg.FolderBrowse()],
        [sg.Submit(), sg.Cancel()]]

    window = sg.Window('Edator', layout)

    event, values = window.read()
    window.close()

    # Stop if the window was closed or cancelled instead of submitted
    if event in (None, 'Cancel'):
        return

    csv_option, xls_option, file_path, plot_path, report_path, clean_csv_path = values[0], values[1], values[2], values[3], values[4], values[5]

    # Creating a txt file in the report_path
    filename = os.path.join(report_path, "report" + ".txt")

    # Assigning the input file to a variable called 'data'
    if csv_option:
        data = pd.read_csv(file_path)
    else:
        data = pd.read_excel(file_path, index_col = None)

    # Create a function to separate out numerical and categorical data
    ## Using this function to ensure that all non-numerical in a numerical column
    ## and non-categorical in a categorical column is annotated
    def cat_variable(df):
        return list(df.select_dtypes(include = ['category', 'object']))

    def num_variable(df):
        return list(df.select_dtypes(exclude = ['category', 'object']))

    categorical_variable = cat_variable(data)
    numerical_variable = num_variable(data)

    # Assigning variable filename to report and enable writing mode
    report = open(filename, "w")

    # Execute overview function in model module
    data = model.overview(data, numerical_variable, report)


    # Create a function to decide whether to drop all NA values or replace them
    ## Drop it if NAN count < 5 %
    ## Not rounded, so that very small NaN proportions are not treated as 0 %
    nan_prop = data.isna().mean() * 100 # Show % of NaN values per column

    def drop_na():
        return [i for i, v in nan_prop.items() if 0 < v < 5]

    cols_to_drop = drop_na()

    data = data.dropna(subset = cols_to_drop)


    ## Using Imputer to fill NaN values
    ## Fill columns with a NaN proportion of 5 % or more, so that no column is left unhandled

    def fill_na():
        return [i for i, v in nan_prop.items() if v >= 5]

    cols_to_fill = fill_na()

    cat_var_tofill = []
    num_var_tofill = []

    for var in cols_to_fill:
        if var in categorical_variable:
            cat_var_tofill.append(var)
        else:
            num_var_tofill.append(var)

    # Categorical columns are filled with the mode
    if cat_var_tofill:
        imp_cat = SimpleImputer(missing_values = np.nan, strategy='most_frequent')
        data[cat_var_tofill] = imp_cat.fit_transform(data[cat_var_tofill])

    # Numerical columns are filled with the median
    if num_var_tofill:
        imp_num = SimpleImputer(missing_values = np.nan, strategy='median')
        data[num_var_tofill] = imp_num.fit_transform(data[num_var_tofill])

    # Create a function to process outlier data
    def outlier():
        z = np.abs(stats.zscore(data[numerical_variable]))
        z_data = data[(z < 3).all(axis=1)] # Remove any outliers with Z-score > 3 or < -3
        return z_data

    data = outlier()

    # Create a function to compute correlation
    ## Pearson and Spearman correlation for numerical-numerical data
    ## One-Way ANOVA for numerical-categorical data
    ## Chi-Square test for categorical-categorical data

    ## Creating possible combinations among a list of numerical variables
    num_var_combination = list(itertools.combinations(numerical_variable, 2))

    ## Creating possible combinations among a list of categorical variables
    cat_var_combination = list(itertools.combinations(categorical_variable, 2))

    ## Creating possible combinations among a list of numerical and categorical variables
    catnum_combination = list(itertools.product(numerical_variable, categorical_variable))

    ## Running the report now
    model.run(num_var_combination, catnum_combination, cat_var_combination, report, data)

    # Create an output file that shows cleaned data
    data.to_csv(os.path.join(clean_csv_path, "cleaned_csv.csv"), index = False)

    # Running plot.py from Graph package
    plot.run(data, categorical_variable, numerical_variable, plot_path)



# Running program
if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
matplotlib>=3.1.2
numpy>=1.18.1
pandas>=1.0.0
PySimpleGUI>=4.19.0
scikit-learn>=0.22.1
scipy>=1.4.1
seaborn>=0.10.0
statsmodels>=0.11.1
more-itertools>=8.3.0

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup, find_packages

setup(
    name='Edator',
    version='0.1',
    packages=find_packages(),
    license='MIT',
    description='A python package that runs exploratory data analysis for users',
    # The repository ships README.md, not README.txt
    long_description=open('README.md').read(),
    # Standard-library modules (os, itertools) are not installable and are left out;
    # sklearn is published on PyPI as scikit-learn
    install_requires=['pandas', 'matplotlib', 'seaborn', 'PySimpleGUI', 'scikit-learn', 'numpy', 'scipy', 'statsmodels'],
    url='https://github.com/kianweelee/Edator',
    download_url='https://github.com/kianweelee/Edator/archive/0.1.tar.gz',
    author='Lee Kian Wee',
    author_email='leekianwee@outlook.com'
)

--------------------------------------------------------------------------------