├── README.md
├── increasing-the-participation-rate-in-standardized-tests
│   ├── README.md
│   ├── images
│   │   ├── 123
│   │   ├── sat.png
│   │   └── sat_act_df.png
│   └── notebooks
│       ├── 123
│       └── project-1-marco-tavora.ipynb
└── west-nile-virus
    ├── README.md
    ├── data
    │   └── 123
    ├── documents
    │   ├── 123
    │   └── noaa_weather_qclcd_documentation.pdf
    ├── images
    │   ├── 123
    │   ├── corr1.png
    │   ├── corr2.png
    │   ├── corr3.png
    │   └── moggie2.png
    └── notebooks
        ├── 123
        └── eda-west-nile-virus-project.ipynb

/README.md:
--------------------------------------------------------------------------------
1 | ## Exploratory Data Analysis
2 | 
3 | 
4 | ![image title](https://img.shields.io/badge/python-v3.6-green.svg) ![image title](https://img.shields.io/badge/ntlk-v3.2.5-yellow.svg) ![Image title](https://img.shields.io/badge/sklearn-0.19.1-orange.svg) ![Image title](https://img.shields.io/badge/BeautifulSoup-4.6.0-blue.svg) ![Image title](https://img.shields.io/badge/pandas-0.22.0-red.svg) ![Image title](https://img.shields.io/badge/numpy-1.14.2-green.svg) ![Image title](https://img.shields.io/badge/matplotlib-v2.1.2-orange.svg) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
5 | 
6 |
7 | 
8 | 
9 | 
10 | ### Index
11 | 
12 | * [increasing-the-participation-rate-in-standardized-tests](#increasing-the-participation-rate-in-standardized-tests)
13 | * [west-nile-virus](#west-nile-virus)
14 | 
15 | 
16 | ### increasing-the-participation-rate-in-standardized-tests
17 | 
18 | The problem we need to solve is how to make actionable suggestions that help the College Board increase the participation rates in its exams. To do so, we perform an exploratory data analysis (EDA) to find appropriate metrics that the College Board can then adjust.
19 | 
20 | 

21 | 
22 | 

23 | 
24 | 
25 | ### west-nile-virus
26 | 
27 | 
28 | 
29 | From the Kaggle website:
30 | 
31 | > West Nile virus is most commonly spread to humans through infected mosquitos. Around 20% of people who become infected with the virus develop symptoms ranging from a persistent fever, to serious neurological illnesses that can result in death.
32 | 
33 | In this notebook I will perform a detailed EDA of this Kaggle dataset.
34 | 
35 | 

36 | 
38 | 

39 | 
--------------------------------------------------------------------------------
/increasing-the-participation-rate-in-standardized-tests/README.md:
--------------------------------------------------------------------------------
1 | ## Statistical Analysis of Participation in Standardized Tests [[view code]](http://nbviewer.jupyter.org/github/marcotav/exploratory-data-analysis/blob/master/increasing-the-participation-rate-in-standardized-tests/notebooks/project-1-marco-tavora.ipynb)
2 | ![image title](https://img.shields.io/badge/work-in%20progress-blue.svg) ![image title](https://img.shields.io/badge/statsmodels-v0.8.0-blue.svg) ![Image title](https://img.shields.io/badge/seaborn-v0.8.1-yellow.svg) ![Image title](https://img.shields.io/badge/pandas-0.22.0-red.svg) ![Image title](https://img.shields.io/badge/numpy-1.14.2-green.svg) ![Image title](https://img.shields.io/badge/matplotlib-v2.1.2-orange.svg)
3 | 
4 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/exploratory-data-analysis/blob/master/increasing-the-participation-rate-in-standardized-tests/notebooks/project-1-marco-tavora.ipynb) or by clicking on the [view code] link above.**
5 | 
6 | 
7 | 
8 | 
9 | 

10 | 
11 | 

12 |

13 | Overview •
14 | Problem Statement •
15 | Brief introduction to the data •
16 | EDA Steps •
17 | Descriptive and Inferential Statistics
18 |

19 | 
20 | 
21 | 
22 | ## Overview
23 | Suppose that the College Board, a nonprofit organization responsible for administering the SAT (Scholastic Aptitude Test), seeks to increase the rate of high-school graduates who participate in its exams. This project's aim is to make recommendations about which measures the College Board might take in order to achieve that.
24 | 
25 | 
26 | ## Problem Statement
27 | The problem we need to solve is how to make actionable suggestions that help the College Board increase the participation rates in its exams. To do so, we perform an exploratory data analysis (EDA) to find appropriate metrics that the College Board can then adjust. During the EDA we must, among other things:
28 | - Find relevant patterns in the data
29 | - Search for possible relations between subsets of the data (for example, are scores and participation rates correlated? If so, how?)
30 | - Test hypotheses about the data using statistical inference methods
31 | - Identify possible biases in the data and, if possible, suggest corrections
32 | 
33 | 
34 | ## Brief introduction to the data
35 | 
36 | The data comes from the SAT and the ACT (which stands for American College Testing and is administered by a different organization, ACT, Inc.) exams from around the United States in 2017.
37 | 
38 | The data contains:
39 | 
40 | - Average SAT and ACT scores by state (scores for each section of each exam)
41 | - Participation rates for both exams by state.
42 | 
43 | Both the SAT and the ACT are standardized tests for college admissions; they are similar in content but differ in structure. A few relevant differences are:
44 | - The ACT has a Science test and the SAT does not
45 | - The SAT has a Math section in which the student is not allowed to use a calculator
46 | - The SAT's College Board joins Reading and Writing into one score, "Evidence-Based Reading and Writing" (EBRW), whereas the ACT keeps those tests separate.
47 | 
48 | 
49 | ## EDA Steps
50 | 
51 | ### Importing basic modules
52 | 
53 | We first need to import several Python libraries:
54 | - `Pandas`, for data manipulation and analysis
55 | - `SciPy`, a Python-based ecosystem of software for mathematics, science, and engineering
56 | - `NumPy`, a library providing multidimensional array objects and a collection of routines for processing arrays
57 | - `Statsmodels`, a package that lets users explore data, estimate statistical models, and perform statistical tests; it complements `SciPy`
58 | - `Matplotlib`, a plotting library built around Python and NumPy
59 | - `Seaborn`, which is complementary to Matplotlib and specifically targets statistical data visualization
60 | - `Pylab`, which is embedded inside Matplotlib and provides a Matlab-like experience; it imports portions of Matplotlib and NumPy.
61 | 
62 | These descriptions are taken from the libraries' documentation.
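A minimal import cell along these lines would do it (a sketch: the aliases are the conventional ones, and the notebook's actual import cell may differ):

```
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
```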
63 | 
64 | ### Loading the data and performing basic operations
65 | 
66 | To read our data (which comes as `csv` files) into a `DataFrame` structure, we use the method `read_csv()` and pass in each file name as a string:
67 | 
68 | ```
69 | sat = pd.read_csv('sat.csv')
70 | act = pd.read_csv('act.csv')
71 | ```
72 | Note that the first column of each table seems to be identical to the DataFrame index. We can quickly confirm that using an `assert` statement. When Python encounters an `assert` statement, it evaluates the expression and, if the expression is false, raises an `AssertionError` (this [link](https://www.tutorialspoint.com/index.htm) contains more details).
73 | 
74 | ```
75 | assert (sat.index.tolist() == sat['Unnamed: 0'].tolist())
76 | assert (act.index.tolist() == act['Unnamed: 0'].tolist())
77 | ```
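Since `'Unnamed: 0'` merely duplicates the index, a natural wrangling step is to drop it (a sketch; the notebook's actual cleaning steps are in the linked code):

```
sat = sat.drop(columns=['Unnamed: 0'])
act = act.drop(columns=['Unnamed: 0'])
```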
78 | After some data wrangling and feature engineering, the SAT (top) and ACT (bottom) `DataFrames` become:
79 | 
80 | *(images: `images/sat.png` and `images/sat_act_df.png` in this repository)*
81 | 
82 | 
83 | 
84 | 
85 | 
86 | 
87 | 
88 | The SAT table displays three averages for each state:
89 | - The first column is the state
90 | - The second column is the average participation rate of students in that state
91 | - The third and fourth columns are the average scores in the Math and Evidence-Based Reading and Writing tests (the name EBRW is explained above).
92 | 
93 | The ACT table displays the following averages for each state:
94 | - The first column is the state
95 | - The second column is the average participation rate of students in that state
96 | - The third, fourth, fifth and sixth columns are the scores in the English, Math, Reading and Science tests
97 | 
98 | We can look for problems with the data, for example by:
99 | 
100 | - Using `info()`
101 | - Using `describe()`
102 | - Looking at the last rows and/or last columns, which frequently contain aggregate values
103 | - Looking for null values
104 | - Looking for outliers
105 | 
106 | The third item was taken care of. There were no null values, but there are outliers, as we shall see when we plot the data. I will convert the 'Participation' columns into `floats` using a function that strips the % sign while keeping the scale between 0 and 100. The `.replace()` method is called with `regex=True` because the pattern `'%'` is passed as a string.
107 | 
108 | ```
109 | def perc_into_float(df, col):
110 |     return df[col].replace('%', '', regex=True).astype('float')
111 | df_sat['Participation'] = perc_into_float(df_sat, 'Participation')
112 | df_act['Participation'] = perc_into_float(df_act, 'Participation')
113 | ```
114 | 
115 | I will now create a dictionary for each column, mapping the state to its respective value for that column, using the function:
116 | ```
117 | def dict_all(df, cols, n):
118 |     return [df.set_index('State').to_dict()[cols[i]] for i in range(1, len(cols))][n]
119 | ```
120 | The dictionaries are:
121 | ```
122 | dsat_part = dict_all(df_sat, df_sat.columns.tolist(), 0)
123 | dsat_EBRW = dict_all(df_sat, df_sat.columns.tolist(), 1)
124 | dsat_math = dict_all(df_sat, df_sat.columns.tolist(), 2)
125 | ```
126 | Now I create one dictionary where each key is a column name and each value is an iterable (a list or a Pandas Series) of all the values in that column. The following function accomplishes that:
127 | 
128 | ```
129 | def dict_col(df):
130 |     return {col: df[col].tolist() for col in df.columns}
131 | ```
132 | Next, I merge the dataframes on the state column and rename the columns to distinguish the SAT columns from the ACT columns:
133 | 
134 | ```
135 | df_total = pd.merge(df_sat, df_act, on='State')
136 | df_total.columns = ['State', 'Participation_SAT (%)', 'EBRW_SAT', 'Math_SAT', 'Participation_ACT (%)', 'English_ACT', 'Math_ACT', 'Reading_ACT', 'Science_ACT']
137 | ```
138 | I now write a function from scratch (using a list comprehension rather than explicit loops) to compute the standard deviation, and use it on each numeric column in both data sets:
139 | 
140 | ```
141 | def stdev(X):
142 |     n = len(X)
143 |     return ((1.0/n) * np.sum([(x - np.mean(X))**2 for x in X]))**0.5
144 | cols = df_total.columns[1:].tolist()
145 | sd = [round(stdev([df_total[col].tolist() for col in cols][i]), 3) for i in range(0, len(cols))]
146 | ```
147 | 
148 | Note that Pandas' `std` uses n-1 (rather than n) as the denominator by default; setting `ddof=0` reproduces the population standard deviation computed above.
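As a quick sanity check (a sketch, not taken from the original notebook), the two computations can be compared directly:

```
# population standard deviation (ddof=0) should match stdev() above;
# the default df_total[cols].std() uses ddof=1 and will differ slightly
print(df_total[cols].std(ddof=0).round(3))
```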
149 | I will now turn the list `sd` into a new observation in the dataset: I first set `State` as the index and then concatenate the new row, renaming it.
150 | 
151 | ```
152 | df_total_new = df_total.copy()
153 | df_total_new = df_total_new.set_index('State')
154 | df2 = pd.DataFrame([[34.929, 45.217, 84.073, 31.824, 2.33, 1.962, 2.047, 3.151]], columns=df_total_new.columns)
155 | df_total_new = pd.concat([df2, df_total_new])
156 | df_total_new = df_total_new.rename(index={df_total_new.index[0]: 'sd'})
157 | ```
158 | Sorting the dataframe by the values in a numeric column, e.g. sorting observations in descending order of SAT participation rate:
159 | 
160 | ```
161 | df_total_new = df_total.copy()
162 | df_total_new = df_total_new.set_index('State').sort_values("Participation_SAT (%)", ascending=False)
163 | df_total_new = pd.concat([df2, df_total_new])
164 | df_total_new = df_total_new.rename(index={df_total_new.index[0]: 'sd'})
165 | ```
166 | 
167 | I will now use a boolean filter to display only observations with an SAT participation rate above a certain threshold (here, 50%):
168 | 
169 | ```
170 | df_total_new = df_total_new[df_total_new['Participation_SAT (%)'] > 50]
171 | ```
172 | 
173 | 
174 | ## Descriptive and Inferential Statistics
175 | 
176 | ### Confidence Interval
177 | 
178 | In inferential statistics the true parameter is usually not known, and all one can do is determine a confidence interval, i.e. a range of likely values for it. A confidence interval is centered at a point estimate and extends on each side by the standard error multiplied by a suitable multiplier.
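For instance, a 95% t-based confidence interval for the mean SAT participation rate could be computed along these lines (a sketch using `scipy.stats`; the notebook's own approach may differ):

```
import scipy.stats as stats

x = df_total['Participation_SAT (%)']
# point estimate +/- multiplier * standard error
ci_low, ci_high = stats.t.interval(0.95, len(x) - 1, loc=x.mean(), scale=stats.sem(x))
print(round(ci_low, 2), round(ci_high, 2))
```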
179 | 
180 | ## To be continued
181 | 
182 | The full analysis is contained in the [notebook](http://nbviewer.jupyter.org/github/marcotav/increasing-the-participation-rate-in-standardized-tests/blob/master/project-1-marco-tavora.ipynb).
--------------------------------------------------------------------------------
/increasing-the-participation-rate-in-standardized-tests/images/123:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/increasing-the-participation-rate-in-standardized-tests/images/sat.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/exploratory-data-analysis/bc9ba25cd3ce7a83e012d2fd869a9b306caba65f/increasing-the-participation-rate-in-standardized-tests/images/sat.png
--------------------------------------------------------------------------------
/increasing-the-participation-rate-in-standardized-tests/images/sat_act_df.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/exploratory-data-analysis/bc9ba25cd3ce7a83e012d2fd869a9b306caba65f/increasing-the-participation-rate-in-standardized-tests/images/sat_act_df.png
--------------------------------------------------------------------------------
/increasing-the-participation-rate-in-standardized-tests/notebooks/123:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/west-nile-virus/README.md:
--------------------------------------------------------------------------------
1 | ## Exploratory Data Analysis of the West Nile Virus problem dataset [[view code]](http://nbviewer.jupyter.org/github/marcotav/exploratory-data-analysis/blob/master/west-nile-virus/notebooks/eda-west-nile-virus-project.ipynb)
2 | ![image title](https://img.shields.io/badge/statsmodels-v0.8.0-blue.svg) ![Image title](https://img.shields.io/badge/sklearn-0.19.1-orange.svg) ![Image title](https://img.shields.io/badge/seaborn-v0.8.1-yellow.svg) ![Image title](https://img.shields.io/badge/pandas-0.22.0-red.svg) ![Image title](https://img.shields.io/badge/numpy-1.14.2-green.svg) ![Image title](https://img.shields.io/badge/matplotlib-v2.1.2-orange.svg)
3 | 
4 | 
5 | **The code is available [here](http://nbviewer.jupyter.org/github/marcotav/exploratory-data-analysis/blob/master/west-nile-virus/notebooks/eda-west-nile-virus-project.ipynb) or by clicking on the [view code] link above.**
6 | 
7 | 

8 | 
9 | 

10 |

11 | Introduction •
12 | Importing libraries and datasets •
13 | Data Dictionary •
14 | Function to perform some data munging •
15 | Correlations and feature engineering
16 | 

17 | 
18 | 
19 | ## Introduction
20 | 
21 | From the [Kaggle](https://www.kaggle.com/c/predict-west-nile-virus) website:
22 | 
23 | > West Nile virus is most commonly spread to humans through infected mosquitos. Around 20% of people who become infected with the virus develop symptoms ranging from a persistent fever, to serious neurological illnesses that can result in death.
24 | 
25 | In this notebook I will perform a detailed EDA of this Kaggle dataset.
26 | 
27 | 
28 | ### Importing libraries and datasets
29 | 
30 | We first import the necessary libraries and load the datasets from [Kaggle](https://www.kaggle.com/c/predict-west-nile-virus). The libraries are:
31 | ```
32 | import pandas as pd
33 | import numpy as np
34 | import matplotlib.pyplot as plt
35 | import seaborn as sns
36 | ```
37 | 
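Loading the competition files could then look like this (a sketch; the file names are the ones listed in the Data Dictionary below):

```
train   = pd.read_csv('train.csv')
test    = pd.read_csv('test.csv')
spray   = pd.read_csv('spray.csv')
weather = pd.read_csv('weather.csv')
```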
38 | ### Data Dictionary
39 | 
40 | The data description corresponding to each of the `csv` files follows:
41 | 
42 | #### Files `train.csv` and `test.csv`
43 | 
44 | - `Id`: the id of the record
45 | - `Date`: date that the WNV test is performed
46 | - `Address`: approximate address of the location of the trap; this is sent to the GeoCoder
47 | - `Species`: the species of mosquitos
48 | - `Block`: block number of address
49 | - `Street`: street name
50 | - `Trap`: Id of the trap
51 | - `AddressNumberAndStreet`: approximate address returned from GeoCoder
52 | - `Latitude, Longitude`: Latitude and Longitude returned from GeoCoder
53 | - `AddressAccuracy`: accuracy returned from GeoCoder
54 | - `NumMosquitos`: number of mosquitoes caught in this trap
55 | - `WnvPresent`: whether West Nile Virus was present in these mosquitos. 1 means WNV is present, and 0 means not present.
56 | 
57 | 
58 | #### File `spray.csv`
59 | 
60 | - `Date, Time`: the date and time of the spray
61 | - `Latitude, Longitude`: the Latitude and Longitude of the spray
62 | 
63 | 
64 | #### File `weather.csv`
65 | - Column descriptions can be found in this [pdf](https://github.com/marcotav/eda-west-nile-virus/blob/master/noaa_weather_qclcd_documentation.pdf)
66 | 
67 | 
68 | 
69 | ### Function to perform some data munging
70 | 
71 | I first defined a function to perform some of the simplest steps of the EDA:
72 | 
73 | ```
74 | def eda(df):
75 |     print("1) Are there missing values?")
76 |     if df.isnull().any().any():
77 |         print("Yes | Percentage of missing values in each column:\n", df.isnull().sum()/df.shape[0], '\n')
78 |     else:
79 |         print('No\n')
80 |     print("2) Which are the data types:\n")
81 |     print(df.dtypes, '\n')
82 |     print("3) Dataframe shape:", df.shape)
83 |     print("4) Unique values per column:")
84 |     for col in df.columns.tolist():
85 |         print(col, ":", df[col].nunique())
86 |     print("5) Removing duplicates")
87 |     print('Initial shape:', df.shape)
88 |     # most frequent duplicated rows, for inspection
89 |     print(df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'count'}).sort_values('count', ascending=False).head())
90 |     df.drop_duplicates(inplace=True)
91 |     print('Shape after removing duplicates:', df.shape)
92 |     return
93 | ```
94 | 
95 | Looking at the `train` `DataFrame` we find that:
96 | - Address features are redundant and some of them can be removed
97 | - `NumMosquitos` and `WnvPresent` are not in the test set.
98 | - I will remove `NumMosquitos`, since the number of mosquitos is less relevant than whether West Nile Virus was present in those mosquitos.
99 | - There are many duplicates, which were removed using the function `eda()`
100 | - Only `Species` can be transformed into dummies; the others have too many unique values.
101 | - Using `value_counts` we find that the `WnvPresent` column is highly unbalanced, with ~95% zeros.
102 | - Other steps (sketched in the snippet after this list) include:
103 |     - Creating dummies from `Species`
104 |     - Breaking up the date columns
105 |     - Applying similar changes to the test `DataFrame`
106 | 
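A sketch of those steps (column and file names as in the data dictionary; the exact transformations live in the notebook) might be:

```
eda(train)  # report missing values, dtypes, shape, and drop duplicates

# class balance: ~95% of the WnvPresent entries are 0
print(train['WnvPresent'].value_counts(normalize=True))

# dummies for the only low-cardinality categorical column
train = pd.get_dummies(train, columns=['Species'])

# break the Date column into year/month/day features
train['Date'] = pd.to_datetime(train['Date'])
train['Year'] = train['Date'].dt.year
train['Month'] = train['Date'].dt.month
train['Day'] = train['Date'].dt.day
```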
107 | Now we look at the `spray` data and perform similar steps. We see that there are several `NaNs`, but the percentage is low. We can either drop the `NaNs` or remove the `Time` column altogether. The second option seems to make more sense, since time does not look like a relevant variable.
108 | 
109 | Looking at the `weather` `DataFrame`, the `Water1` column has just one value, namely `M`, which means "missing". We remove this column.
110 | 
111 | There are two values of `Station`, namely 1 and 2. From Kaggle's description of the weather data:
112 | 
113 | > Hot and dry conditions are more favorable for West Nile virus than cold and wet.
114 | 
115 | > We provide you with the dataset from NOAA of the weather conditions of 2007 to 2014, during the months of the tests.
116 | 
117 | > Station 1: CHICAGO O'HARE INTERNATIONAL AIRPORT Lat: 41.995 Lon: -87.933 Elev: 662 ft. above sea level
118 | 
119 | > Station 2: CHICAGO MIDWAY INTL ARPT Lat: 41.786 Lon: -87.752 Elev: 612 ft. above sea level
120 | 
121 | Each date has two records, one for `Station=1` and one for `Station=2`. However, most missing values are in the latter, which we will drop.
122 | 
123 | The `for` loop below searches each column for data that cannot be converted to numbers:
124 | ```
125 | for station in [1, 2]:
126 |     print('Station', station, '\n')
127 |     weather_station = weather[weather['Station'] == station]
128 |     for col in weather_station[cols_to_keep]:
129 |         for x in sorted(weather_station[col].unique()):
130 |             try:
131 |                 x = float(x)
132 |             except ValueError:
133 |                 print(col, '| Non-convertibles, their frequency and their station:',\
134 |                       (x, weather_station[weather_station[col] == x][col].count()))
135 | ```
136 | Indeed, as stated above, most missing values are in station 2, so we will drop the rows with `Station=2`. The strings 'T' and 'M' stand for trace and missing data, respectively; traces are defined to be smaller than 0.05. The following cells take care of that:
137 | 
138 | ```
139 | cols_with_M = ['WetBulb', 'StnPressure', 'SeaLevel']
140 | for col in cols_with_M:
141 |     weather[col] = weather[col].str.strip()
142 |     weather[col] = weather[col].str.replace('M', '0.0').astype(float)
143 | cols_with_T = ['SnowFall', 'PrecipTotal']
144 | for col in cols_with_T:
145 |     weather[col] = weather[col].str.replace(' T', '0.05').astype(float)
146 | for col in cols_to_keep:
147 |     weather[col] = weather[col].astype(float)
148 | ```
149 | 
150 | We also notice that there are many zeros in the data, in particular in the columns
151 | 
152 | ```
153 | cols_zeros = ['Heat', 'Cool', 'SnowFall']
154 | ```
155 | 
156 | which contain a substantial quantity of zeros. We will drop these columns.
157 | 
158 | If a `CodeSum` entry contains letters, it indicates some significant weather event, so we can dummify the column.
159 | 
160 | Let us use a regex: the pattern `'^\w'` matches entries whose first character is a word character, where `\w` means any alphanumeric character or an underscore.
161 | 
162 | ```
163 | weather['CodeSum'] = weather['CodeSum'].str.strip()  # strip surrounding whitespace
164 | weather.loc[weather['CodeSum'].str.contains('^\w'), 'CodeSum'] = '1'  # any weather event -> 1
165 | weather.loc[weather['CodeSum'] != '1', 'CodeSum'] = '0'  # everything else -> 0
166 | weather['CodeSum'] = weather['CodeSum'].astype(int)
167 | ```
168 | 
169 | ## Correlations and feature engineering
170 | 
171 | Three correlation heatmaps follow below:
172 | 
173 | 
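Heatmaps like these (saved as `corr1.png`–`corr3.png` under `images/`) can be produced along the following lines — a sketch; the notebook's plotting calls and chosen feature subsets may differ:

```
import matplotlib.pyplot as plt
import seaborn as sns

corr = weather.corr()  # pairwise correlations of the numeric weather features
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='coolwarm', annot=False)
plt.tight_layout()
plt.show()
```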

181 | 
182 | 

183 | 
184 | 
185 | 

186 | 
187 | 

188 | 
189 | 
190 | 

191 | 
192 | 

193 | 
194 | 
195 | ## To be continued.
196 | 
--------------------------------------------------------------------------------
/west-nile-virus/data/123:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/west-nile-virus/documents/123:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/west-nile-virus/documents/noaa_weather_qclcd_documentation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/exploratory-data-analysis/bc9ba25cd3ce7a83e012d2fd869a9b306caba65f/west-nile-virus/documents/noaa_weather_qclcd_documentation.pdf
--------------------------------------------------------------------------------
/west-nile-virus/images/123:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/west-nile-virus/images/corr1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/exploratory-data-analysis/bc9ba25cd3ce7a83e012d2fd869a9b306caba65f/west-nile-virus/images/corr1.png
--------------------------------------------------------------------------------
/west-nile-virus/images/corr2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/exploratory-data-analysis/bc9ba25cd3ce7a83e012d2fd869a9b306caba65f/west-nile-virus/images/corr2.png
--------------------------------------------------------------------------------
/west-nile-virus/images/corr3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/exploratory-data-analysis/bc9ba25cd3ce7a83e012d2fd869a9b306caba65f/west-nile-virus/images/corr3.png
--------------------------------------------------------------------------------
/west-nile-virus/images/moggie2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marcotav/exploratory-data-analysis/bc9ba25cd3ce7a83e012d2fd869a9b306caba65f/west-nile-virus/images/moggie2.png
--------------------------------------------------------------------------------
/west-nile-virus/notebooks/123:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------