├── .gitignore
├── EDA-cheat-sheet.md
├── LICENSE
├── README.md
├── data
│   ├── aquastat
│   │   ├── aquastat.csv.gzip
│   │   └── world.json
│   └── redcard
│       └── redcard.csv.gz
├── notebooks
│   ├── 0-Intro
│   │   ├── 0-Introduction-to-Exploratory-Data-Analysis.ipynb
│   │   ├── 0-Introduction-to-Jupyter-Notebooks.ipynb
│   │   ├── figures
│   │   │   ├── branches.jpg
│   │   │   └── crisp.png
│   │   └── html-versions
│   │       ├── 0-Introduction-to-Exploratory-Data-Analysis.html
│   │       └── 0-Introduction-to-Jupyter-Notebooks.html
│   ├── 1-RedCard-EDA
│   │   ├── 1-Redcard-Dataset.ipynb
│   │   ├── 2-Redcard-Players.ipynb
│   │   ├── 3-Redcard-Dyads.ipynb
│   │   ├── 4-Redcard-final-joins.ipynb
│   │   ├── figures
│   │   │   ├── covariates.png
│   │   │   ├── models.png
│   │   │   └── results.png
│   │   └── html-versions
│   │       ├── 1-Redcard-Dataset.html
│   │       ├── 2-Redcard-Players.html
│   │       ├── 3-Redcard-Dyads.html
│   │       └── 4-Redcard-final-joins.html
│   └── 2-Aquastat-EDA
│       ├── 1-Aquastat-Introduction.ipynb
│       ├── 2-Aquastat-Missing-Data.ipynb
│       ├── 3-Aquastat-Univariate.ipynb
│       ├── 4-Aquastat-TargetxVariable.ipynb
│       ├── 5-Aquastat-Multivariate.ipynb
│       ├── figures
│       │   ├── branches.jpg
│       │   └── fao.jpg
│       └── html-versions
│           ├── 1-Aquastat-Introduction.html
│           ├── 2-Aquastat-Missing-Data.html
│           ├── 3-Aquastat-Univariate.html
│           ├── 4-Aquastat-TargetxVariable.html
│           ├── 5-Aquastat-Multivariate.html
│           └── pivottablejs.html
├── scripts
│   └── aqua_helper.py
└── setup
    ├── environment.yml
    ├── pivottablejs.html
    ├── requirements.txt
    ├── test-my-environment.html
    └── test-my-environment.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | .ipynb_checkpoints/
--------------------------------------------------------------------------------
/EDA-cheat-sheet.md:
--------------------------------------------------------------------------------
1 | # EDA Cheat Sheet(s)
2 | 
3 | ## Why we EDA
4 | Sometimes the consumer of your analysis won't understand why you need the time for EDA and will want results NOW! Here are some of the reasons you can give to convince them it's a good use of time for everyone involved.
5 | 
6 | **Reasons for the analyst**
7 | * Identify patterns and develop hypotheses.
8 | * Test technical assumptions.
9 | * Inform model selection and feature engineering.
10 | * Build an intuition for the data.
11 | 
12 | **Reasons for consumer of analysis**
13 | * Ensures delivery of technically-sound results.
14 | * Ensures right question is being asked.
15 | * Tests business assumptions.
16 | * Provides context necessary for maximum applicability and value of results.
17 | * Leads to insights that would otherwise not be found.
18 | 
19 | ## Things to keep in mind
20 | * You're never done with EDA. With every analytical result, you want to return to EDA, make sure the result makes sense, and test other questions that come up because of it.
21 | * Stay open-minded. You're supposed to be challenging your assumptions and those of the stakeholder you're performing the analysis for.
22 | * Repeat EDA for every new problem. Just because you've done EDA on a dataset before doesn't mean you shouldn't do it again for the next problem. You need to look at the data through the lens of the problem at hand, and you will likely have different areas of investigation.
23 | 
24 | ## The plan
25 | Exploratory data analysis consists of the following major tasks, which we present linearly here because each task doesn't make much sense to do without the ones prior to it. However, in reality, you are going to constantly jump around from step to step.
You may want to do all the steps for a subset of the variables first, or you might jump back because you learned something and need to have another look.
26 | 
27 | 1. Form hypotheses/develop investigation themes to explore
28 | 2. Wrangle data
29 | 3. Assess quality of data
30 | 4. Profile data
31 | 5. Explore each individual variable in the dataset
32 | 6. Assess the relationship between each variable and the target
33 | 7. Assess interactions between variables
34 | 8. Explore data across many dimensions
35 | 
36 | Throughout the entire analysis you want to:
37 | * Capture a list of hypotheses and questions that come up for further exploration.
38 | * Record things to watch out for/be aware of in future analyses.
39 | * Show intermediate results to colleagues to get a fresh perspective, feedback, and domain knowledge. Don't do EDA in a bubble! Get feedback throughout, especially from people removed from the problem and/or with relevant domain knowledge.
40 | * Position visuals and results together. EDA relies on your natural pattern recognition abilities, so maximize what you'll find by putting visualizations and results in close proximity.
41 | 
42 | ## Wrangling
43 | 
44 | ### Basic things to do
45 | * Make your data [tidy](https://tomaugspurger.github.io/modern-5-tidy.html).
46 | 1. Each variable forms a column
47 | 2. Each observation forms a row
48 | 3. Each type of observational unit forms a table
49 | * Transform data: sometimes you will need to transform your data to be able to extract information from it. This step will usually occur after some of the other steps of EDA unless domain knowledge can inform these choices beforehand.
50 | * Log: when data is highly skewed (versus normally distributed like a bell curve), it sometimes has a log-normal distribution, and taking the log of each data point will normalize it.
51 | * Binning of continuous variables: binning continuous variables and then analyzing the groups of observations created can allow for easier pattern identification, especially with non-linear relationships.
52 | * Simplifying of categories: you really don't want more than 8-10 categories within a single data field. Try to aggregate to higher-level categories when it makes sense.
53 | 
54 | ### Helpful packages
55 | * [`pandas`](http://pandas.pydata.org)
56 | 
57 | ## Data quality assessment and profiling
58 | Before trying to understand what information is in the data, make sure you understand what the data represents and what's missing.
59 | 
60 | ### Basic things to do
61 | * Categorical: count, count distinct, assess unique values
62 | * Numerical: count, min, max
63 | * Spot-check random samples and samples that you are familiar with
64 | * Slice and dice
65 | 
66 | ### Questions to consider
67 | * What data isn’t there?
68 | * Are there systematic reasons for missing data?
69 | * Are there fields that are always missing at the same time?
70 | * Is there information in what data is missing?
71 | * Is the data that is there right?
72 | * Are there frequent values that are default values?
73 | * Are there fields that represent the same information?
74 | * What timestamp should you use?
75 | * Are there numerical values reported as strings?
76 | * Are there special values?
77 | * Is the data being generated the way you think?
78 | * Are there variables that are numerical but really should be categorical?
79 | * Is data consistent across different operating systems, device types, platforms, and countries?
80 | * Are there any direct relationships between fields (e.g. a value of x always implies a specific value of y)?
81 | * What are the units of measurement? Are they consistent?
82 | * Is data consistent across the population and time? (time series)
83 | * Are there obvious changes in reported data around the time of important events that affect data generation (e.g. a version release)? (panel data)
84 | 
85 | ### Helpful packages
86 | * [`missingno`](https://github.com/ResidentMario/missingno)
87 | * [`pivottablejs`](https://github.com/nicolaskruchten/jupyter_pivottablejs)
88 | * [`pandas_profiling`](https://github.com/JosPolfliet/pandas-profiling)
89 | 
90 | ### Example backlog
91 | * Assess the prevalence of missing data across all data fields, assess whether its missingness is random or systematic, and identify patterns when such data is missing
92 | * Identify any default values that imply missing data for a given field
93 | * Determine sampling strategy for quality assessment and initial EDA
94 | * For datetime data types, ensure consistent formatting and granularity of data, and perform sanity checks on all dates present in the data
95 | * In cases where multiple fields capture the same or similar information, understand the relationships between them and assess the most effective field to use
96 | * Assess data type of each field
97 | * For discrete value types, ensure data formats are consistent
98 | * For discrete value types, assess number of distinct values and percent unique, and do a sanity check on types of answers
99 | * For continuous data types, assess descriptive statistics and perform a sanity check on values
100 | * Understand relationships between timestamps and assess which to use in analysis
101 | * Slice data by device type, operating system, and software version and ensure consistency in data across slices
102 | * For device or app data, identify version release dates and assess data for any changes in format or value around those dates
103 | 
104 | ## Exploration
105 | 
106 | After quality assessment and profiling, exploratory data analysis can be divided into 4 main types of tasks:
107 | 1. Exploration of each individual variable
108 | 2. Assessment of the relationship between each variable and the target variable
109 | 3. Assessment of the interaction between variables
110 | 4. Exploration of data across many dimensions
111 | 
112 | ### 1. Exploring each individual variable
113 | 
114 | #### Basics to do
115 | 
116 | Quantify:
117 | * *Location*: mean, median, mode, interquartile mean
118 | * *Spread*: standard deviation, variance, range, interquartile range
119 | * *Shape*: skewness, kurtosis
120 | 
121 | For time series, plot summary statistics over time.
122 | 
123 | For panel data:
124 | * Plot cross-sectional summary statistics over time
125 | * Plot time-series statistics across the population
126 | 
127 | #### Questions to consider
128 | * What does each field in the data look like?
129 | * Is the distribution skewed? Bimodal?
130 | * Are there outliers? Are they feasible?
131 | * Are there discontinuities?
132 | * Are the typical assumptions seen in modeling valid?
133 | * Gaussian
134 | * Identically and independently distributed
135 | * Have one mode
136 | * Can be negative
137 | * Generating processes are stationary and isotropic (time series)
138 | * Independence between subjects (panel data)
139 | 
140 | ### 2. Exploring the relationship between each variable and the target
141 | How does each field interact with the target?
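As a quick numeric first pass, you can rank every field by its correlation with the target. A minimal sketch, assuming a tidy, all-numeric `pandas` DataFrame with a column named `target` (the frame and column names below are hypothetical stand-ins for your own data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a tidy dataset: one row per observation.
np.random.seed(0)
df = pd.DataFrame({"x1": np.random.normal(size=200),
                   "x2": np.random.normal(size=200)})
df["target"] = 2 * df["x1"] + np.random.normal(size=200)

# Pearson correlation of every field with the target.
corr = df.corr()["target"].drop("target")

# Rank by absolute strength; the sign of each value gives the direction.
print(corr.reindex(corr.abs().sort_values(ascending=False).index))
```

Correlation only captures linear relationships, so treat this as a screening step and confirm anything interesting visually.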
142 | 
143 | Assess each relationship’s:
144 | * Linearity
145 | * Direction
146 | * Rough size
147 | * Strength
148 | 
149 | Methods:
150 | * [Bivariate visualizations](#Bivariate)
151 | * Calculate correlation
152 | 
153 | 
154 | ### 3. Assessing interactions between variables
155 | How do the variables interact with each other?
156 | 
157 | #### Basic things to do
158 | * [Bivariate visualizations](#Bivariate) for all combinations
159 | * Correlation matrices
160 | * Compare summary statistics of variable x for different categories of y
161 | 
162 | ### 4. Exploring data across many dimensions
163 | Are there patterns across many of the variables?
164 | 
165 | #### Basic things to do
166 | * Categorical:
167 | * Parallel coordinates
168 | * Continuous:
169 | * Principal component analysis
170 | * Clustering
171 | 
172 | ### Helpful packages
173 | * `ipywidgets`: making function variables interactive for visualizations and calculations
174 | * `mpld3`: interactive visualizations
175 | 
176 | ### Example backlog
177 | * Generate list of questions and hypotheses to be considered during EDA
178 | * Create univariate plots for all fields
179 | * Create bivariate plots for each combination of fields to assess correlation and other relationships
180 | * Plot summary statistics over time for time series data
181 | * Plot distribution of x for different categories of y
182 | * Plot mean/median/min/max/count/distinct count of x over time for different categories of y
183 | * Capture list of hypotheses and questions that come up during EDA
184 | * Record things to watch out for/be aware of in future analyses
185 | * Distill and present findings
186 | 
187 | ## Visualization guide
188 | Here are the types of visualizations and the python packages we find most useful for data exploration.
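For example, the univariate and bivariate staples can each be produced in a line or two. A minimal sketch, assuming a recent `seaborn` (0.11+ for `histplot`) and its bundled `tips` example dataset standing in for your own data:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# `tips` is a small example dataset that ships with seaborn.
tips = sns.load_dataset("tips")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Univariate: distribution of a continuous variable.
sns.histplot(tips["total_bill"], kde=True, ax=axes[0])

# Bivariate (continuous x continuous): scatter plot.
sns.scatterplot(x="total_bill", y="tip", data=tips, ax=axes[1])

plt.tight_layout()
plt.show()
```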
189 | 
190 | ### Univariate
191 | * Categorical:
192 | * [Bar plot](http://seaborn.pydata.org/generated/seaborn.barplot.html?highlight=bar%20plot#seaborn.barplot)
193 | * Continuous:
194 | * [Histograms](http://seaborn.pydata.org/tutorial/distributions.html#histograms)
195 | * [Kernel density estimation plot](http://seaborn.pydata.org/tutorial/distributions.html#kernel-density-estimaton)
196 | * [Box plots](http://seaborn.pydata.org/generated/seaborn.boxplot.html?highlight=boxplot#seaborn.boxplot)
197 | 
198 | ### Bivariate
199 | * Categorical x categorical
200 | * [Heat map of contingency table](http://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap)
201 | * [Multiple bar plots](http://seaborn.pydata.org/tutorial/categorical.html?highlight=bar%20plot#bar-plots)
202 | * Categorical x continuous
203 | * [Box plots](http://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot) of continuous for each category
204 | * [Violin plots](http://seaborn.pydata.org/examples/simple_violinplots.html) of continuous distribution for each category
205 | * Overlaid [histograms](http://seaborn.pydata.org/tutorial/distributions.html#histograms) (if 3 or fewer categories)
206 | * Continuous x continuous
207 | * [Scatter plots](http://seaborn.pydata.org/examples/marginal_ticks.html?highlight=scatter)
208 | * [Hexbin plots](http://seaborn.pydata.org/tutorial/distributions.html#hexbin-plots)
209 | * [Joint kernel density estimation plots](http://seaborn.pydata.org/tutorial/distributions.html#kernel-density-estimation)
210 | * [Correlation matrix heatmap](http://seaborn.pydata.org/examples/network_correlations.html?highlight=correlation)
211 | 
212 | ### Multivariate
213 | * [Pairwise bivariate figures/scatter matrix](http://seaborn.pydata.org/tutorial/distributions.html#visualizing-pairwise-relationships-in-a-dataset)
214 | 
215 | ### Time series
216 | * Line plots
217 | * Any bivariate plot with time or time period as the x-axis.
218 | 
219 | ### Panel data
220 | * [Heat map](http://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap) with rows denoting observations and columns denoting time periods.
221 | * [Multiple line plots](http://seaborn.pydata.org/tutorial/categorical.html?highlight=panel%20data#drawing-multi-panel-categorical-plots)
222 | * [Strip plot](http://seaborn.pydata.org/tutorial/categorical.html?highlight=panel%20data#categorical-scatterplots) where time is on the x-axis, each entity has a constant y-value, and a point is plotted every time an event is observed for that entity.
223 | 
224 | ### Geospatial
225 | * [Choropleths](https://folium.readthedocs.io/en/latest/quickstart.html#choropleth-maps): regions colored according to their data value (see the sketch below).
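A minimal choropleth sketch with `folium`, assuming a reasonably recent version (0.7+, where the `folium.Choropleth` class exists); the GeoJSON path, join key, and values below are hypothetical placeholders:

```python
import folium
import pandas as pd

# Hypothetical inputs: region boundaries as GeoJSON plus one value per region.
geo_path = "world.json"  # placeholder path to a GeoJSON file
values = pd.DataFrame({"id": ["USA", "CAN", "MEX"],
                       "value": [10.0, 7.5, 3.2]})

m = folium.Map(location=[20, 0], zoom_start=2)
folium.Choropleth(
    geo_data=geo_path,        # region shapes
    data=values,
    columns=["id", "value"],  # join key and the value to color by
    key_on="feature.id",      # GeoJSON property matched against the join key
    fill_color="YlGn",
    legend_name="value",
).add_to(m)
m.save("choropleth.html")  # open the saved map in a browser
```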
226 | 
227 | ### Helpful packages
228 | * [`matplotlib`](https://matplotlib.org/): basic plotting
229 | * [`seaborn`](http://seaborn.pydata.org/): prettier versions of some `matplotlib` figures
230 | * [`mpld3`](http://mpld3.github.io/): interactive plotting
231 | * [`folium`](https://folium.readthedocs.io/en/latest/): geospatial plotting
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017 Chloe Mawer
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # EDA Tutorial
2 | 
3 | 
4 | This repo holds the contents developed for the tutorial, *Exploratory Data Analysis in Python*, presented at PyCon 2017 on May 17, 2017.
5 | 
6 | We suggest setting up your environment and testing it (as detailed below) and then following along with the video of the tutorial found [here](https://www.youtube.com/watch?v=W5WE9Db2RLU).
7 | 
8 | As there was limited time for instruction, we also recommend pausing throughout and practicing some of the methods discussed as you go.
9 | 
10 | We welcome any PRs with other demonstrations of how you would perform EDA on the provided datasets.
11 | 
12 | ## The datasets
13 | 
14 | * [Redcard Dataset](https://osf.io/47tnc/)
15 | * [Aquastat Dataset](http://www.fao.org/nr/water/aquastat/main/index.stm)
16 | 
17 | ## Before the tutorial
18 | 
19 | ### Microsoft Azure option
20 | If you don't want to deal with setting up your environment or have any problems with the below instructions, you can work through the tutorial in Microsoft Azure Notebooks by creating an account and cloning the tutorial library found [here](https://notebooks.azure.com/chloe/libraries/pycon-2017-eda-tutorial) (all of this is free, forever).
21 | 
22 | ### Github option
23 | 
24 | #### 1. Clone this repo
25 | Clone this repository locally on your laptop.
26 | 1. Go to the green **Clone or download** button at the top of the repository page and copy the https link.
27 | 2. From the command line, run the command:
28 | 
29 | ```bash
30 | git clone https://github.com/cmawer/pycon-2017-eda-tutorial.git
31 | ```
32 | 
33 | #### 2.
Set up your python environment
34 | 
35 | ##### Install conda or miniconda
36 | We recommend using conda for managing your python environments. Specifically, we like miniconda, which is the most lightweight installation. You can install miniconda [here](https://conda.io/miniconda.html). However, the full [anaconda](https://www.continuum.io/downloads) is good for beginners as it comes with many packages already installed.
37 | 
38 | ##### Create your environment
39 | 
40 | Once installed, you can create the environment necessary for running this tutorial by running the following command from the command line in the `setup/` directory of this repository:
41 | 
42 | ```bash
43 | conda update conda
44 | ```
45 | 
46 | then:
47 | 
48 | ```bash
49 | conda env create -f environment.yml
50 | ```
51 | 
52 | This command will create a new environment named `eda3`.
53 | 
54 | ##### Activate your environment
55 | To activate the environment, you can run this command from any directory:
56 | 
57 | `source activate eda3` (Mac/Linux)
58 | 
59 | `activate eda3` (Windows)
60 | 
61 | ##### Non-conda users
62 | 
63 | 
64 | If you are experienced in python and do not use conda, the `requirements.txt` file is also available in the `setup/` directory for pip installation. This is our environment frozen as-is on a Mac. If using Windows or Linux, you may need to remove some of the version requirements.
65 | 
66 | #### 3. Enable `ipywidgets`
67 | 
68 | We will be using widgets to create interactive visualizations. They will have been installed during your environment setup, but you still need to run the following from the command line:
69 | 
70 | ```bash
71 | jupyter nbextension enable --py --sys-prefix widgetsnbextension
72 | ```
73 | 
74 | #### 4. Test your python environment
75 | 
76 | Now that your environment is set up, let's check that it works.
77 | 
78 | 1. Go to the `setup/` directory from the command line and start a Jupyter notebook instance:
79 | 
80 | ```bash
81 | jupyter notebook
82 | ```
83 | 
84 | A lot of text should appear -- you need to leave this terminal running for your Jupyter instance to work.
85 | 
86 | 2. Assuming this worked, open up the notebook titled `test-my-environment.ipynb`.
87 | 
88 | 3. Once the notebook is open, go to the `Cell` menu and select `Run All`.
89 | 
90 | 4. Check that every cell in the notebook ran (i.e., did not produce an error as output). `test-my-environment.html` shows what the notebook should look like after running.
91 | 
92 | 
93 | 
--------------------------------------------------------------------------------
/data/aquastat/aquastat.csv.gzip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmawer/pycon-2017-eda-tutorial/91c3a131950e30b1832e35cfb0cda450ab082b2c/data/aquastat/aquastat.csv.gzip
--------------------------------------------------------------------------------
/data/redcard/redcard.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmawer/pycon-2017-eda-tutorial/91c3a131950e30b1832e35cfb0cda450ab082b2c/data/redcard/redcard.csv.gz
--------------------------------------------------------------------------------
/notebooks/0-Intro/0-Introduction-to-Exploratory-Data-Analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "toc": "true"
7 | },
8 | "source": [
9 | "# Table of Contents\n",
10 | "

1  What is exploratory data analysis?
2  Why we EDA
3  Things to keep in mind
4  The game plan
4.1  1. Brainstorm areas of investigation
4.2  2. Wrangle the data
4.3  3. Assess data quality and profile
4.4  4. Explore each individual variable in the dataset
4.5  5. Assess the relationship between each variable and the target
4.6  6. Assess interactions between the variables
4.7  7. Explore data across many dimensions
5  Our objectives for this tutorial
" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "\"SVDS\"" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "# What is exploratory data analysis? " 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": { 30 | "ExecuteTime": { 31 | "end_time": "2017-05-14T21:28:21.398615Z", 32 | "start_time": "2017-05-14T21:28:21.393864Z" 33 | }, 34 | "collapsed": true, 35 | "run_control": { 36 | "frozen": false, 37 | "read_only": false 38 | } 39 | }, 40 | "source": [ 41 | "\"Crisp-DM\"" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "# Why we EDA\n", 49 | "Sometimes the consumer of your analysis won't understand why you need the time for EDA and will want results NOW! Here are some of the reasons you can give to convince them it's a good use of time for everyone involved. \n", 50 | "\n", 51 | "**Reasons for the analyst**\n", 52 | "* Identify patterns and develop hypotheses.\n", 53 | "* Test technical assumptions.\n", 54 | "* Inform model selection and feature engineering.\n", 55 | "* Build an intuition for the data.\n", 56 | "\n", 57 | "**Reasons for consumer of analysis**\n", 58 | "* Ensures delivery of technically-sound results.\n", 59 | "* Ensures right question is being asked.\n", 60 | "* Tests business assumptions.\n", 61 | "* Provides context necessary for maximum applicability and value of results.\n", 62 | "* Leads to insights that would otherwise not be found.\n", 63 | "\n", 64 | "# Things to keep in mind \n", 65 | "* You're never done with EDA. With every analytical result, you want to return to EDA, make sure the result makes sense, test other questions that come up because of it. \n", 66 | "* Stay open-minded. You're supposed to be challenging your assumptions and those of the stakeholder who you're performing the analysis for. \n", 67 | "* Repeat EDA for every new problem. Just because you've done EDA on a dataset before doesn't mean you shouldn't do it again for the next problem. You need to look at the data through the lense of the problem at hand and you will likely have different areas of investigation.\n" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "# The game plan\n", 75 | "\"Crisp-DM\"\n", 76 | "\n", 77 | "Exploratory data analysis consists of the following major tasks, which we present linearly here because each task doesn't make much sense to do without the ones prior to it. However, in reality, you are going to constantly jump around from step to step. You may want to do all the steps for a subset of the variables first. Or often, an observation will bring up a question you want to investigate and you'll branch off and explore to answer that question before returning down the main path of exhaustive EDA.\n", 78 | " \n", 79 | "2. Form hypotheses/develop investigation themes to explore \n", 80 | "3. Wrangle data \n", 81 | "3. Assess data quality and profile \n", 82 | "5. Explore each individual variable in the dataset \n", 83 | "6. Assess the relationship between each variable and the target \n", 84 | "7. Assess interactions between variables \n", 85 | "8. Explore data across many dimensions \n", 86 | "\n", 87 | "Throughout the entire analysis you want to:\n", 88 | "* Capture a list of hypotheses and questions that come up for further exploration.\n", 89 | "* Record things to watch out for/ be aware of in future analyses. 
\n", 90 | "* Show intermediate results to colleagues to get a fresh perspective, feedback, domain knowledge. Don't do EDA in a bubble! Get feedback throughout especially from people removed from the problem and/or with relevant domain knowledge. \n", 91 | "* Position visuals and results together. EDA relies on your natural pattern recognition abilities so maximize what you'll find by putting visualizations and results in close proximity. \n" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "## 1. Brainstorm areas of investigation\n", 99 | "Yes, you're exploring, but that doesn't mean it's a free for all.\n", 100 | "\n", 101 | "* What do you need to understand the question you're trying to answer? \n", 102 | "* List before diving in and update throughout the analysis" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "## 2. Wrangle the data\n", 110 | "\n", 111 | "* Make your data [tidy](https://tomaugspurger.github.io/modern-5-tidy.html).\n", 112 | " 1. Each variable forms a column\n", 113 | " 2. Each observation forms a row\n", 114 | " 3. Each type of observational unit forms a table\n", 115 | "* Transform data\n", 116 | " * Log \n", 117 | " * Binning\n", 118 | " * Aggegration into higher level categories " 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "## 3. Assess data quality and profile\n", 126 | "* What data isn’t there? \n", 127 | "* Is the data that is there right? \n", 128 | "* Is the data being generated in the way you think?" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "## 4. Explore each individual variable in the dataset \n", 136 | "* What does each field in the data look like? \n", 137 | "* How can each variable be described by a few key values? \n", 138 | "* Are the assumptions often made in modeling valid? " 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "## 5. Assess the relationship between each variable and the target \n", 146 | "\n", 147 | "How does each variable interact with the target variable? \n", 148 | "\n", 149 | "Assess each relationship’s:\n", 150 | "* Linearity \n", 151 | "* Direction \n", 152 | "* Rough size \n", 153 | "* Strength" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "## 6. Assess interactions between the variables\n", 161 | "* How do the variables interact with each other? \n", 162 | "* What is the linearity, direction, rough size, and strength of the relationships between pairs of variables? " 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## 7. Explore data across many dimensions\n", 170 | "Are there patterns across many of the variables?" 
171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "# Our objectives for this tutorial\n", 178 | "\n", 179 | "Our objectives for this tutorial are to help you: \n", 180 | "* Develop the EDA mindset \n", 181 | " * Questions to consider while exploring \n", 182 | " * Things to look out for \n", 183 | "* Learn basic methods for effective EDA \n", 184 | " * Slicing and dicing \n", 185 | " * Calculating summary statistics\n", 186 | " * Basic plotting \n", 187 | " * Basic mapping \n", 188 | " * Using widgets for interactive exploration\n", 189 | "\n", 190 | "The actual exploration you do in this tutorial is *yours*. We have no answers or set of conclusions we think you should come to about the datasets. Our goal is simply to aid in making your exploration as effective as possible. \n" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "

© 2017 Silicon Valley Data Science LLC

" 198 | ] 199 | } 200 | ], 201 | "metadata": { 202 | "kernelspec": { 203 | "display_name": "Python 3", 204 | "language": "python", 205 | "name": "python3" 206 | }, 207 | "language_info": { 208 | "codemirror_mode": { 209 | "name": "ipython", 210 | "version": 3 211 | }, 212 | "file_extension": ".py", 213 | "mimetype": "text/x-python", 214 | "name": "python", 215 | "nbconvert_exporter": "python", 216 | "pygments_lexer": "ipython3", 217 | "version": "3.6.1" 218 | }, 219 | "nav_menu": {}, 220 | "toc": { 221 | "navigate_menu": true, 222 | "number_sections": false, 223 | "sideBar": true, 224 | "threshold": 6, 225 | "toc_cell": true, 226 | "toc_section_display": "block", 227 | "toc_window_display": false 228 | } 229 | }, 230 | "nbformat": 4, 231 | "nbformat_minor": 2 232 | } 233 | -------------------------------------------------------------------------------- /notebooks/0-Intro/0-Introduction-to-Jupyter-Notebooks.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat_minor": 1, "cells": [{"execution_count": 2, "cell_type": "code", "source": "%matplotlib inline\n%config InlineBackend.figure_format='retina' ", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 3, "cell_type": "code", "source": "# Add this to python2 code to make life easier\nfrom __future__ import absolute_import, division, print_function", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 4, "cell_type": "code", "source": "from itertools import combinations\nimport string\n\nfrom IPython.display import IFrame, HTML, YouTubeVideo\nimport matplotlib as mpl\nfrom matplotlib import pyplot as plt\nfrom matplotlib.pyplot import GridSpec\nimport seaborn as sns\n\n\nimport numpy as np\n# don't do:\n# from numpy import *", "outputs": [], "metadata": {"collapsed": false}}, {"execution_count": 5, "cell_type": "code", "source": "import pandas as pd\nimport os, sys\nimport warnings\n\nsns.set();\nplt.rcParams['figure.figsize'] = (12, 8)\nsns.set_style(\"darkgrid\")\nsns.set_context(\"poster\", font_scale=1.3)\n\nwarnings.filterwarnings('ignore')", "outputs": [], "metadata": {"collapsed": true}}, {"source": "# Keyboard shortcuts\n\nFor help, `ESC` + `h`", "cell_type": "markdown", "metadata": {}}, {"execution_count": 17, "cell_type": "code", "source": "# in select mode, shift j/k (to select multiple cells at once)\n# split cell with ctrl shift -\n# merge with shift M", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 18, "cell_type": "code", "source": "first = 1", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 19, "cell_type": "code", "source": "second = 2", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 21, "cell_type": "code", "source": "third = 3", "outputs": [], "metadata": {"collapsed": true}}, {"source": "# Different heading levels\n\nWith text and $\\LaTeX$ support.", "cell_type": "markdown", "metadata": {}}, {"source": "You can also get monospaced fonts by indenting 4 spaces:\n\n mkdir toc\n cd toc", "cell_type": "markdown", "metadata": {}}, {"source": "Wrap with triple-backticks and language:\n\n```bash\nmkdir toc\ncd toc\nwget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh\n```", "cell_type": "markdown", "metadata": {}}, {"execution_count": 22, "cell_type": "code", "source": "import numpy as np", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": null, "cell_type": "code", "source": "np.random.choice()", "outputs": [], "metadata": {"collapsed": 
false}}, {"source": "```SQL\nSELECT first_name,\n last_name,\n year_of_birth\nFROM presidents\nWHERE year_of_birth > 1800;\n```", "cell_type": "markdown", "metadata": {"collapsed": true}}, {"execution_count": 24, "cell_type": "code", "source": "mylist = !ls ", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 27, "cell_type": "code", "source": "[x.split('_')[-1] for x in mylist]", "outputs": [{"execution_count": 27, "output_type": "execute_result", "data": {"text/plain": "['410', '410', '431']"}, "metadata": {}}], "metadata": {"collapsed": false}}, {"execution_count": 29, "cell_type": "code", "source": "%%bash\npwd \nfor i in *.ipynb\ndo\n echo $i | awk -F . '{print $1}'\ndone\n\necho\necho \"break\"\necho\n\nfor i in *.ipynb\ndo\n echo $i | awk -F - '{print $2}'\ndone\n", "outputs": [{"output_type": "stream", "name": "stdout", "text": "/home/nbuser\n*\n\nbreak\n\n\n"}], "metadata": {"collapsed": false}}, {"source": "# Tab; shift-tab; shift-tab-tab; shift-tab-tab-tab-tab; and more!", "cell_type": "markdown", "metadata": {}}, {"execution_count": 10, "cell_type": "code", "source": "def silly_function(xval):\n \"\"\"Takes a value and returns the absolute value.\"\"\"\n xval_sq = xval ** 2.0\n 1 + 4\n xval_abs = np.sqrt(xval_sq)\n return xval_abs", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 11, "cell_type": "code", "source": "silly_function?", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 12, "cell_type": "code", "source": "silly_function??", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": null, "cell_type": "code", "source": "silly_function()", "outputs": [], "metadata": {"collapsed": false}}, {"execution_count": null, "cell_type": "code", "source": "import numpy as np", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 9, "cell_type": "code", "source": "# \nnp.linspace??", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 8, "cell_type": "code", "source": "# \nnp.linspace?", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 40, "cell_type": "code", "source": "ex_dict = {}", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 41, "cell_type": "code", "source": "# Indent/dedent/comment\nfor _ in range(5):\n ex_dict[\"one\"] = 1\n ex_dict[\"two\"] = 2\n ex_dict[\"three\"] = 3\n ex_dict[\"four\"] = 4", "outputs": [], "metadata": {"collapsed": false}}, {"execution_count": 42, "cell_type": "code", "source": "ex_dict", "outputs": [{"execution_count": 42, "output_type": "execute_result", "data": {"text/plain": "{'four': 4, 'one': 1, 'three': 3, 'two': 2}"}, "metadata": {}}], "metadata": {"collapsed": false}}, {"source": "## Multicursor magic", "cell_type": "markdown", "metadata": {"collapsed": true}}, {"execution_count": 44, "cell_type": "code", "source": "ex_dict[\"one_better_name\"] = 1.\nex_dict[\"two_better_name\"] = 2. 
\nex_dict[\"three_better_name\"] = 3.\nex_dict[\"four_better_name\"] = 4.", "outputs": [], "metadata": {"collapsed": false}}, {"execution_count": null, "cell_type": "code", "source": "", "outputs": [], "metadata": {"collapsed": true}}], "nbformat": 4, "metadata": {"kernelspec": {"display_name": "Python 3.6", "name": "python36", "language": "python"}, "widgets": {"state": {"1f2b33bdbf62410e92177b1ae2e22d0e": {"views": [{"cell_index": 9}]}}, "version": "1.2.0"}, "language_info": {"mimetype": "text/x-python", "nbconvert_exporter": "python", "version": "3.6.0", "name": "python", "file_extension": ".py", "pygments_lexer": "ipython3", "codemirror_mode": {"version": 3, "name": "ipython"}}, "anaconda-cloud": {}}}
--------------------------------------------------------------------------------
/notebooks/0-Intro/figures/branches.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmawer/pycon-2017-eda-tutorial/91c3a131950e30b1832e35cfb0cda450ab082b2c/notebooks/0-Intro/figures/branches.jpg
--------------------------------------------------------------------------------
/notebooks/0-Intro/figures/crisp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmawer/pycon-2017-eda-tutorial/91c3a131950e30b1832e35cfb0cda450ab082b2c/notebooks/0-Intro/figures/crisp.png
--------------------------------------------------------------------------------
/notebooks/1-RedCard-EDA/3-Redcard-Dyads.ipynb:
--------------------------------------------------------------------------------
1 | {"nbformat_minor": 1, "cells": [{"source": "# Redcard Exploratory Data Analysis\n\nThis dataset is taken from a fantastic paper that looks at how the analytical choices made by different data science teams, each working on the same dataset to answer the same research question, affect the final outcome.\n\n[Many analysts, one dataset: Making transparent how variations in analytical choices affect results](https://osf.io/gvm2z/)\n\nThe data can be found [here](https://osf.io/47tnc/).\n\n", "cell_type": "markdown", "metadata": {}}, {"source": "## The Task\n\nDo an Exploratory Data Analysis on the redcard dataset, 
keeping in mind the following question: **Are soccer referees more likely to give red cards to dark-skin-toned players than light-skin-toned players?**\n\n- Before plotting/joining/doing something, have a question or hypothesis that you want to investigate\n- Draw a plot of what you want to see on paper to sketch the idea\n- Write it down, then make a plan on how to get there\n- How do you know you aren't fooling yourself?\n- What else can I check if this is actually true?\n- What evidence could there be that it's wrong?\n", "cell_type": "markdown", "metadata": {"collapsed": true}}, {"execution_count": 1, "cell_type": "code", "source": "%matplotlib inline\n%config InlineBackend.figure_format='retina'\n\nfrom __future__ import absolute_import, division, print_function\nimport matplotlib as mpl\nfrom matplotlib import pyplot as plt\nfrom matplotlib.pyplot import GridSpec\nimport seaborn as sns\nimport numpy as np\nimport pandas as pd\nimport os, sys\nfrom tqdm import tqdm\nimport warnings\nwarnings.filterwarnings('ignore')\nsns.set_context(\"poster\", font_scale=1.3)\n\nimport missingno as msno\nimport pandas_profiling\n\nfrom sklearn.datasets import make_blobs\nimport time", "outputs": [], "metadata": {"collapsed": true}}, {"source": "## About the Data\n\n> The dataset is available as a list with 146,028 dyads of players and referees and includes details from players, details from referees and details regarding the interactions of player-referees. A summary of the variables of interest can be seen below. A detailed description of all variables included can be seen in the README file on the project website. \n\n> From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2,053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3,147) that these players played under in their professional career (see Figure 1). 
We created a dataset of player\u2013referee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.\n\n> -- https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit\n\n\n| Variable Name: | Variable Description: | \n| -- | -- | \n| playerShort | short player ID | \n| player | player name | \n| club | player club | \n| leagueCountry | country of player club (England, Germany, France, and Spain) | \n| height | player height (in cm) | \n| weight | player weight (in kg) | \n| position | player position | \n| games | number of games in the player-referee dyad | \n| goals | number of goals in the player-referee dyad | \n| yellowCards | number of yellow cards player received from the referee | \n| yellowReds | number of yellow-red cards player received from the referee | \n| redCards | number of red cards player received from the referee | \n| photoID | ID of player photo (if available) | \n| rater1 | skin rating of photo by rater 1 | \n| rater2 | skin rating of photo by rater 2 | \n| refNum | unique referee ID number (referee name removed for anonymizing purposes) | \n| refCountry | unique referee country ID number | \n| meanIAT | mean implicit bias score (using the race IAT) for referee country | \n| nIAT | sample size for race IAT in that particular country | \n| seIAT | standard error for mean estimate of race IAT | \n| meanExp | mean explicit bias score (using a racial thermometer task) for referee country | \n| nExp | sample size for explicit bias in that particular country | \n| seExp | standard error for mean estimate of explicit bias measure | \n\n", "cell_type": "markdown", "metadata": {}}, {"execution_count": 2, "cell_type": "code", "source": "def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):\n save_subgroup_filename = \"\".join([prefix, subgroup_name, \".csv.gz\"])\n dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')\n test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')\n # Test that we recover what we send in\n if dataframe.equals(test_df):\n print(\"Test-passed: we recover the equivalent subgroup dataframe.\")\n else:\n print(\"Warning -- equivalence test!!! 
Double-check.\")", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 3, "cell_type": "code", "source": "def load_subgroup(filename, index_col=[0]):\n return pd.read_csv(filename, compression='gzip', index_col=index_col)", "outputs": [], "metadata": {"collapsed": true}}, {"source": "# Tidy Dyads and Starting Joins", "cell_type": "markdown", "metadata": {}}, {"execution_count": 4, "cell_type": "code", "source": "clean_players = load_subgroup(\"cleaned_players.csv.gz\")\nplayers = load_subgroup(\"raw_players.csv.gz\", )\ncountries = load_subgroup(\"raw_countries.csv.gz\")\nreferees = load_subgroup(\"raw_referees.csv.gz\")\nagg_dyads = pd.read_csv(\"raw_dyads.csv.gz\", compression='gzip', index_col=[0, 1])", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 5, "cell_type": "code", "source": "agg_dyads.head(10)", "outputs": [{"execution_count": 5, "output_type": "execute_result", "data": {"text/plain": " yellowCards yellowReds victories ties games \\\nrefNum playerShort \n1 lucas-wilchez 0 0 0 0 1 \n2 john-utaka 1 0 0 0 1 \n3 abdon-prats 1 0 0 1 1 \n pablo-mari 0 0 1 0 1 \n ruben-pena 0 0 1 0 1 \n4 aaron-hughes 0 0 0 0 1 \n aleksandar-kolarov 0 0 1 0 1 \n alexander-tettey 0 0 0 0 1 \n anders-lindegaard 0 0 0 1 1 \n andreas-beck 0 0 1 0 1 \n\n defeats goals redCards \nrefNum playerShort \n1 lucas-wilchez 1 0 0 \n2 john-utaka 1 0 0 \n3 abdon-prats 0 0 0 \n pablo-mari 0 0 0 \n ruben-pena 0 0 0 \n4 aaron-hughes 1 0 0 \n aleksandar-kolarov 0 0 0 \n alexander-tettey 1 0 0 \n anders-lindegaard 0 0 0 \n andreas-beck 0 0 0 ", "text/html": "
\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
<\/th>\n <\/th>\n yellowCards<\/th>\n yellowReds<\/th>\n victories<\/th>\n ties<\/th>\n games<\/th>\n defeats<\/th>\n goals<\/th>\n redCards<\/th>\n <\/tr>\n
refNum<\/th>\n playerShort<\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/tr>\n <\/thead>\n
1<\/th>\n lucas-wilchez<\/th>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
2<\/th>\n john-utaka<\/th>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
3<\/th>\n abdon-prats<\/th>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
pablo-mari<\/th>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
ruben-pena<\/th>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
4<\/th>\n aaron-hughes<\/th>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
aleksandar-kolarov<\/th>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
alexander-tettey<\/th>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
anders-lindegaard<\/th>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
andreas-beck<\/th>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n <\/tbody>\n<\/table>\n<\/div>"}, "metadata": {}}], "metadata": {"collapsed": false}}, {"execution_count": 6, "cell_type": "code", "source": "# Test if the number of games is equal to the victories + ties + defeats in the dataset", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 7, "cell_type": "code", "source": "all(agg_dyads['games'] == agg_dyads.victories + agg_dyads.ties + agg_dyads.defeats)", "outputs": [{"execution_count": 7, "output_type": "execute_result", "data": {"text/plain": "True"}, "metadata": {}}], "metadata": {"collapsed": false}}, {"execution_count": 8, "cell_type": "code", "source": "# Sanity check passes", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 9, "cell_type": "code", "source": "len(agg_dyads.reset_index().set_index('playerShort'))", "outputs": [{"execution_count": 9, "output_type": "execute_result", "data": {"text/plain": "146028"}, "metadata": {}}], "metadata": {"collapsed": false}}, {"execution_count": 10, "cell_type": "code", "source": "agg_dyads['totalRedCards'] = agg_dyads['yellowReds'] + agg_dyads['redCards']\nagg_dyads.rename(columns={'redCards': 'strictRedCards'}, inplace=True)", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 11, "cell_type": "code", "source": "agg_dyads.head()", "outputs": [{"execution_count": 11, "output_type": "execute_result", "data": {"text/plain": " yellowCards yellowReds victories ties games \\\nrefNum playerShort \n1 lucas-wilchez 0 0 0 0 1 \n2 john-utaka 1 0 0 0 1 \n3 abdon-prats 1 0 0 1 1 \n pablo-mari 0 0 1 0 1 \n ruben-pena 0 0 1 0 1 \n\n defeats goals strictRedCards totalRedCards \nrefNum playerShort \n1 lucas-wilchez 1 0 0 0 \n2 john-utaka 1 0 0 0 \n3 abdon-prats 0 0 0 0 \n pablo-mari 0 0 0 0 \n ruben-pena 0 0 0 0 ", "text/html": "
\n\n \n \n \n \n \n \n \n \n \n
<\/th>\n <\/th>\n yellowCards<\/th>\n yellowReds<\/th>\n victories<\/th>\n ties<\/th>\n games<\/th>\n defeats<\/th>\n goals<\/th>\n strictRedCards<\/th>\n totalRedCards<\/th>\n <\/tr>\n
refNum<\/th>\n playerShort<\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/tr>\n <\/thead>\n
1<\/th>\n lucas-wilchez<\/th>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
2<\/th>\n john-utaka<\/th>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
3<\/th>\n abdon-prats<\/th>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
pablo-mari<\/th>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
ruben-pena<\/th>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n <\/tbody>\n<\/table>\n<\/div>"}, "metadata": {}}], "metadata": {"collapsed": false}}, {"source": "## Remove records that come from players who don't have a skintone rating\n\nThere are a couple of ways to do this -- set operations and joins are two ways demonstrated below: ", "cell_type": "markdown", "metadata": {}}, {"execution_count": 12, "cell_type": "code", "source": "clean_players.head()", "outputs": [{"execution_count": 12, "output_type": "execute_result", "data": {"text/plain": " height weight skintone position_agg weightclass \\\nplayerShort \naaron-hughes 182.0 71.0 0.125 Defense low_weight \naaron-hunt 183.0 73.0 0.125 Forward low_weight \naaron-lennon 165.0 63.0 0.250 Midfield vlow_weight \naaron-ramsey 178.0 76.0 0.000 Midfield mid_weight \nabdelhamid-el-kaoutari 180.0 73.0 0.250 Defense low_weight \n\n heightclass skintoneclass age_years \nplayerShort \naaron-hughes mid_height [0, 0.125] 33.149897 \naaron-hunt mid_height [0, 0.125] 26.327173 \naaron-lennon vlow_height (0.125, 0.25] 25.713895 \naaron-ramsey low_height [0, 0.125] 22.017796 \nabdelhamid-el-kaoutari low_height (0.125, 0.25] 22.795346 ", "text/html": "
\n\n \n \n \n \n \n \n \n \n \n
<\/th>\n height<\/th>\n weight<\/th>\n skintone<\/th>\n position_agg<\/th>\n weightclass<\/th>\n heightclass<\/th>\n skintoneclass<\/th>\n age_years<\/th>\n <\/tr>\n
playerShort<\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/tr>\n <\/thead>\n
aaron-hughes<\/th>\n 182.0<\/td>\n 71.0<\/td>\n 0.125<\/td>\n Defense<\/td>\n low_weight<\/td>\n mid_height<\/td>\n [0, 0.125]<\/td>\n 33.149897<\/td>\n <\/tr>\n
aaron-hunt<\/th>\n 183.0<\/td>\n 73.0<\/td>\n 0.125<\/td>\n Forward<\/td>\n low_weight<\/td>\n mid_height<\/td>\n [0, 0.125]<\/td>\n 26.327173<\/td>\n <\/tr>\n
aaron-lennon<\/th>\n 165.0<\/td>\n 63.0<\/td>\n 0.250<\/td>\n Midfield<\/td>\n vlow_weight<\/td>\n vlow_height<\/td>\n (0.125, 0.25]<\/td>\n 25.713895<\/td>\n <\/tr>\n
aaron-ramsey<\/th>\n 178.0<\/td>\n 76.0<\/td>\n 0.000<\/td>\n Midfield<\/td>\n mid_weight<\/td>\n low_height<\/td>\n [0, 0.125]<\/td>\n 22.017796<\/td>\n <\/tr>\n
abdelhamid-el-kaoutari<\/th>\n 180.0<\/td>\n 73.0<\/td>\n 0.250<\/td>\n Defense<\/td>\n low_weight<\/td>\n low_height<\/td>\n (0.125, 0.25]<\/td>\n 22.795346<\/td>\n <\/tr>\n <\/tbody>\n<\/table>\n<\/div>"}, "metadata": {}}], "metadata": {"collapsed": false}}, {"execution_count": 13, "cell_type": "code", "source": "agg_dyads.head()", "outputs": [{"execution_count": 13, "output_type": "execute_result", "data": {"text/plain": " yellowCards yellowReds victories ties games \\\nrefNum playerShort \n1 lucas-wilchez 0 0 0 0 1 \n2 john-utaka 1 0 0 0 1 \n3 abdon-prats 1 0 0 1 1 \n pablo-mari 0 0 1 0 1 \n ruben-pena 0 0 1 0 1 \n\n defeats goals strictRedCards totalRedCards \nrefNum playerShort \n1 lucas-wilchez 1 0 0 0 \n2 john-utaka 1 0 0 0 \n3 abdon-prats 0 0 0 0 \n pablo-mari 0 0 0 0 \n ruben-pena 0 0 0 0 ", "text/html": "
\n\n \n \n \n \n \n \n \n \n \n
<\/th>\n <\/th>\n yellowCards<\/th>\n yellowReds<\/th>\n victories<\/th>\n ties<\/th>\n games<\/th>\n defeats<\/th>\n goals<\/th>\n strictRedCards<\/th>\n totalRedCards<\/th>\n <\/tr>\n
refNum<\/th>\n playerShort<\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/tr>\n <\/thead>\n
1<\/th>\n lucas-wilchez<\/th>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
2<\/th>\n john-utaka<\/th>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
3<\/th>\n abdon-prats<\/th>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
pablo-mari<\/th>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
ruben-pena<\/th>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n <\/tbody>\n<\/table>\n<\/div>"}, "metadata": {}}], "metadata": {"collapsed": false}}, {"execution_count": 14, "cell_type": "code", "source": "agg_dyads.reset_index().head()", "outputs": [{"execution_count": 14, "output_type": "execute_result", "data": {"text/plain": " refNum playerShort yellowCards yellowReds victories ties games \\\n0 1 lucas-wilchez 0 0 0 0 1 \n1 2 john-utaka 1 0 0 0 1 \n2 3 abdon-prats 1 0 0 1 1 \n3 3 pablo-mari 0 0 1 0 1 \n4 3 ruben-pena 0 0 1 0 1 \n\n defeats goals strictRedCards totalRedCards \n0 1 0 0 0 \n1 1 0 0 0 \n2 0 0 0 0 \n3 0 0 0 0 \n4 0 0 0 0 ", "text/html": "
\n\n \n \n \n \n \n \n \n \n
<\/th>\n refNum<\/th>\n playerShort<\/th>\n yellowCards<\/th>\n yellowReds<\/th>\n victories<\/th>\n ties<\/th>\n games<\/th>\n defeats<\/th>\n goals<\/th>\n strictRedCards<\/th>\n totalRedCards<\/th>\n <\/tr>\n <\/thead>\n
0<\/th>\n 1<\/td>\n lucas-wilchez<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
1<\/th>\n 2<\/td>\n john-utaka<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
2<\/th>\n 3<\/td>\n abdon-prats<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
3<\/th>\n 3<\/td>\n pablo-mari<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
4<\/th>\n 3<\/td>\n ruben-pena<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n <\/tbody>\n<\/table>\n<\/div>"}, "metadata": {}}], "metadata": {"collapsed": false}}, {"execution_count": 15, "cell_type": "code", "source": "agg_dyads.reset_index().set_index('playerShort').head()", "outputs": [{"execution_count": 15, "output_type": "execute_result", "data": {"text/plain": " refNum yellowCards yellowReds victories ties games \\\nplayerShort \nlucas-wilchez 1 0 0 0 0 1 \njohn-utaka 2 1 0 0 0 1 \nabdon-prats 3 1 0 0 1 1 \npablo-mari 3 0 0 1 0 1 \nruben-pena 3 0 0 1 0 1 \n\n defeats goals strictRedCards totalRedCards \nplayerShort \nlucas-wilchez 1 0 0 0 \njohn-utaka 1 0 0 0 \nabdon-prats 0 0 0 0 \npablo-mari 0 0 0 0 \nruben-pena 0 0 0 0 ", "text/html": "
\n\n \n \n \n \n \n \n \n \n \n
<\/th>\n refNum<\/th>\n yellowCards<\/th>\n yellowReds<\/th>\n victories<\/th>\n ties<\/th>\n games<\/th>\n defeats<\/th>\n goals<\/th>\n strictRedCards<\/th>\n totalRedCards<\/th>\n <\/tr>\n
playerShort<\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/th>\n <\/tr>\n <\/thead>\n
lucas-wilchez<\/th>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
john-utaka<\/th>\n 2<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
abdon-prats<\/th>\n 3<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
pablo-mari<\/th>\n 3<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n
ruben-pena<\/th>\n 3<\/td>\n 0<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 1<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n 0<\/td>\n <\/tr>\n <\/tbody>\n<\/table>\n<\/div>"}, "metadata": {}}], "metadata": {"collapsed": false}}, {"execution_count": 16, "cell_type": "code", "source": "player_dyad = (clean_players.merge(agg_dyads.reset_index().set_index('playerShort'),\n left_index=True,\n right_index=True))", "outputs": [], "metadata": {"collapsed": true}}, {"execution_count": 17, "cell_type": "code", "source": "player_dyad.head()", "outputs": [{"execution_count": 17, "output_type": "execute_result", "data": {"text/plain": " height weight skintone position_agg weightclass heightclass \\\nplayerShort \naaron-hughes 182.0 71.0 0.125 Defense low_weight mid_height \naaron-hughes 182.0 71.0 0.125 Defense low_weight mid_height \naaron-hughes 182.0 71.0 0.125 Defense low_weight mid_height \naaron-hughes 182.0 71.0 0.125 Defense low_weight mid_height \naaron-hughes 182.0 71.0 0.125 Defense low_weight mid_height \n\n skintoneclass age_years refNum yellowCards yellowReds \\\nplayerShort \naaron-hughes [0, 0.125] 33.149897 4 0 0 \naaron-hughes [0, 0.125] 33.149897 66 0 0 \naaron-hughes [0, 0.125] 33.149897 77 0 0 \naaron-hughes [0, 0.125] 33.149897 163 0 0 \naaron-hughes [0, 0.125] 33.149897 194 2 0 \n\n victories ties games defeats goals strictRedCards \\\nplayerShort \naaron-hughes 0 0 1 1 0 0 \naaron-hughes 1 0 1 0 0 0 \naaron-hughes 13 8 26 5 0 0 \naaron-hughes 1 1 2 0 0 0 \naaron-hughes 3 5 16 8 0 0 \n\n totalRedCards \nplayerShort \naaron-hughes 0 \naaron-hughes 0 \naaron-hughes 0 \naaron-hughes 0 \naaron-hughes 0 ", "text/html": "
{"execution_count": 18, "cell_type": "code", "source": "clean_dyads = (agg_dyads.reset_index()[agg_dyads.reset_index()\n                                       .playerShort\n                                       .isin(set(clean_players.index))\n                                       ]).set_index(['refNum', 'playerShort'])", "outputs": [], "metadata": {"collapsed": true}},
{"execution_count": 19, "cell_type": "code", "source": "clean_dyads.head()", "outputs": [{"execution_count": 19, "output_type": "execute_result", "data": {"text/plain": "                           yellowCards  yellowReds  victories  ties  games  \\nrefNum playerShort\n1      lucas-wilchez                 0           0          0     0      1\n2      john-utaka                    1           0          0     0      1\n4      aaron-hughes                  0           0          0     0      1\n       aleksandar-kolarov            0           0          1     0      1\n       alexander-tettey              0           0          0     0      1\n\n                           defeats  goals  strictRedCards  totalRedCards  \nrefNum playerShort\n1      lucas-wilchez             1      0               0              0\n2      john-utaka                1      0               0              0\n4      aaron-hughes              1      0               0              0\n       aleksandar-kolarov        0      0               0              0\n       alexander-tettey          1      0               0              0"}, "metadata": {}}], "metadata": {"collapsed": false}},
{"execution_count": 20, "cell_type": "code", "source": "clean_dyads.shape, agg_dyads.shape, player_dyad.shape", "outputs": [{"execution_count": 20, "output_type": "execute_result", "data": {"text/plain": "((124621, 9), (146028, 9), (124621, 18))"}, "metadata": {}}], "metadata": {"collapsed": false}},
{"source": "## Disaggregate\n\nThe dyads are currently aggregated metrics summarizing all of the times a particular referee-player pair were matched. To handle the data properly, we have to disaggregate it into a tidy/long format, where each game is a row.", "cell_type": "markdown", "metadata": {}},
{"execution_count": 21, "cell_type": "code", "source": "# inspired by https://github.com/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb\nj = 0\nout = [0 for _ in range(sum(clean_dyads['games']))]\n\nfor index, row in clean_dyads.reset_index().iterrows():\n    n = row['games']\n    d = row['totalRedCards']\n    ref = row['refNum']\n    player = row['playerShort']\n    # expand the aggregate row into n single-game rows; the first d of them\n    # are the games in which this dyad produced a red card\n    for k in range(n):\n        redcard = 1 if (d - k) > 0 else 0\n        out[j] = [ref, player, redcard]\n        j += 1\n\ntidy_dyads = pd.DataFrame(out, columns=['refNum', 'playerShort', 'redcard']).set_index(['refNum', 'playerShort'])", "outputs": [], "metadata": {"collapsed": true}},
{"execution_count": 22, "cell_type": "code", "source": "# 3092\ntidy_dyads.redcard.sum()", "outputs": [{"execution_count": 22, "output_type": "execute_result", "data": {"text/plain": "3092"}, "metadata": {}}], "metadata": {"collapsed": false}},
{"execution_count": 23, "cell_type": "code", "source": "# Notice this is longer than before -- one row per game now, not per dyad\nclean_dyads.games.sum()", "outputs": [{"execution_count": 23, "output_type": "execute_result", "data": {"text/plain": "373067"}, "metadata": {}}], "metadata": {"collapsed": false}},
{"execution_count": 24, "cell_type": "code", "source": "tidy_dyads.shape", "outputs": [{"execution_count": 24, "output_type": "execute_result", "data": {"text/plain": "(373067, 1)"}, "metadata": {}}], "metadata": {"collapsed": false}},
{"execution_count": 25, "cell_type": "code", "source": "# Ok, this is a bit crazy... tear it apart and figure out what each piece is doing if it's not clear\nclean_referees = (referees.reset_index()[referees.reset_index()\n                                         .refNum.isin(tidy_dyads.reset_index().refNum\n                                                                .unique())\n                                         ]).set_index('refNum')", "outputs": [], "metadata": {"collapsed": true}},
{"execution_count": 26, "cell_type": "code", "source": "clean_referees.shape, referees.shape", "outputs": [{"execution_count": 26, "output_type": "execute_result", "data": {"text/plain": "((2978, 1), (3147, 1))"}, "metadata": {}}], "metadata": {"collapsed": false}},
{"execution_count": 27, "cell_type": "code", "source": "clean_countries = (countries.reset_index()[countries.reset_index()\n                                           .refCountry\n                                           .isin(clean_referees.refCountry\n                                                               .unique())\n                                           ].set_index('refCountry'))", "outputs": [], "metadata": {"collapsed": true}},
{"execution_count": 28, "cell_type": "code", "source": "clean_countries.shape, countries.shape", "outputs": [{"execution_count": 28, "output_type": "execute_result", "data": {"text/plain": "((160, 7), (161, 7))"}, "metadata": {}}], "metadata": {"collapsed": false}},
{"execution_count": 29, "cell_type": "code", "source": "tidy_dyads.head()", "outputs": [{"execution_count": 29, "output_type": "execute_result", "data": {"text/plain": "                           redcard\nrefNum playerShort\n1      lucas-wilchez             0\n2      john-utaka                0\n4      aaron-hughes              0\n       aleksandar-kolarov        0\n       alexander-tettey          0"}, "metadata": {}}], "metadata": {"collapsed": false}},
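{"source": "Tearing the `clean_referees` expression a few cells up apart, as its comment suggests (intermediate names introduced here for illustration):", "cell_type": "markdown", "metadata": {}},
{"execution_count": null, "cell_type": "code", "source": "# The clean_referees filter, one step at a time (a sketch):\nrefs_flat = referees.reset_index()                    # refNum becomes a column\nused_refs = tidy_dyads.reset_index().refNum.unique()  # referees that appear in at least one dyad\nclean_referees_alt = refs_flat[refs_flat.refNum.isin(used_refs)].set_index('refNum')\nclean_referees_alt.shape  # expected to match clean_referees.shape == (2978, 1)", "outputs": [], "metadata": {"collapsed": true}},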
{"execution_count": 30, "cell_type": "code", "source": "tidy_dyads.to_csv(\"cleaned_dyads.csv.gz\", compression='gzip')", "outputs": [], "metadata": {"collapsed": true}},
{"execution_count": 31, "cell_type": "code", "source": "tidy_dyads.shape", "outputs": [{"execution_count": 31, "output_type": "execute_result", "data": {"text/plain": "(373067, 1)"}, "metadata": {}}], "metadata": {"collapsed": false}},
{"execution_count": null, "cell_type": "code", "source": "", "outputs": [], "metadata": {"collapsed": true}}], "nbformat": 4, "metadata": {"kernelspec": {"display_name": "Python 3", "name": "python3", "language": "python"}, "language_info": {"mimetype": "text/x-python", "nbconvert_exporter": "python", "version": "3.5.1", "name": "python", "file_extension": ".py", "pygments_lexer": "ipython3", "codemirror_mode": {"version": 3, "name": "ipython"}}, "anaconda-cloud": {}}}
--------------------------------------------------------------------------------
/notebooks/1-RedCard-EDA/figures/covariates.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmawer/pycon-2017-eda-tutorial/91c3a131950e30b1832e35cfb0cda450ab082b2c/notebooks/1-RedCard-EDA/figures/covariates.png
--------------------------------------------------------------------------------
/notebooks/1-RedCard-EDA/figures/models.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmawer/pycon-2017-eda-tutorial/91c3a131950e30b1832e35cfb0cda450ab082b2c/notebooks/1-RedCard-EDA/figures/models.png
--------------------------------------------------------------------------------
/notebooks/1-RedCard-EDA/figures/results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmawer/pycon-2017-eda-tutorial/91c3a131950e30b1832e35cfb0cda450ab082b2c/notebooks/1-RedCard-EDA/figures/results.png
--------------------------------------------------------------------------------
/notebooks/2-Aquastat-EDA/figures/branches.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmawer/pycon-2017-eda-tutorial/91c3a131950e30b1832e35cfb0cda450ab082b2c/notebooks/2-Aquastat-EDA/figures/branches.jpg
--------------------------------------------------------------------------------
/notebooks/2-Aquastat-EDA/figures/fao.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmawer/pycon-2017-eda-tutorial/91c3a131950e30b1832e35cfb0cda450ab082b2c/notebooks/2-Aquastat-EDA/figures/fao.jpg
--------------------------------------------------------------------------------
/notebooks/2-Aquastat-EDA/html-versions/pivottablejs.html:
--------------------------------------------------------------------------------
[263-line HTML export of a PivotTable.js page; the markup was stripped in this dump and only the page title "PivotTable.js" survives]
--------------------------------------------------------------------------------
/scripts/aqua_helper.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import folium
3 | from matplotlib import pyplot as plt
4 | import matplotlib as mpl
5 | import numpy as np
6 | import ipywidgets as widgets  # needed by map_over_time and hist_over_var below
7 | def time_slice(df, time_period):
8 |     # Only take data for time period of interest
9 |     df = df[df.time_period == time_period]
10 |
11 |     # Pivot table
12 |     df = df.pivot(index='country', columns='variable', values='value')
13 |
14 |     df.columns.name = time_period
15 |
16 |     return df
17 |
18 | def country_slice(df, country):
19 |     # Only take data for country of interest
20 |     df = df[df.country == country]
21 |
22 |     # Pivot table
23 |     df = df.pivot(index='variable', columns='time_period', values='value')
24 |
25 |     df.index.name = country
26 |     return df
27 |
28 | def time_series(df, country, variable):
29 |     # Only take data for country/variable combo
30 |     series = df[(df.country == country) & (df.variable == variable)]
31 |
32 |     # Drop years with no data
33 |     series = series.dropna()[['year_measured', 'value']]
34 |
35 |     # Change years to int and set as index
36 |     series.year_measured = series.year_measured.astype(int)
37 |     series.set_index('year_measured', inplace=True)
38 |     series.columns = [variable]
39 |     return series
40 |
41 | simple_regions = {
42 |     'World | Asia': 'Asia',
43 |     'Americas | Central America and Caribbean | Central America': 'North America',
44 |     'Americas | Central America and Caribbean | Greater Antilles': 'North America',
45 |     'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
46 |     'Americas | Northern America | Northern America': 'North America',
47 |     'Americas | Northern America | Mexico': 'North America',
48 |     'Americas | Southern America | Guyana': 'South America',
49 |     'Americas | Southern America | Andean': 'South America',
50 |     'Americas | Southern America | Brazil': 'South America',
51 |     'Americas | Southern America | Southern America': 'South America',
52 |     'World | Africa': 'Africa',
53 |     'World | Europe': 'Europe',
54 |     'World | Oceania': 'Oceania'
55 | }
56 |
57 | def subregion(data, region):
58 |     return data[data.region == region]
59 |
60 | def variable_slice(df, variable):
61 |     df = df[df.variable==variable]
62 |     df = df.pivot(index='country', columns='time_period', values='value')
63 |     return df
64 |
65 |
66 | def plot_map(df, variable, time_period=None, log=False,
67 |              legend_name=None, threshold_scale=None,
68 |              geo=r'../../data/aquastat/world.json'):
69 |
70 |     if time_period:
71 |         df = time_slice(df, time_period).reset_index()
72 |     else:
73 |         df = df.reset_index()
74 |
75 |     if log:
76 |         df[variable] = df[variable].apply(np.log)
77 |
78 |     fmap = folium.Map(location=[34, -45], zoom_start=2,
79 |                       width=1200, height=600)
80 |     fmap.choropleth(geo_path=geo,
81 |                     data=df,
82 |                     columns=['country', variable],
83 |                     key_on='feature.properties.name', reset=True,
84 |                     fill_color='PuBuGn', fill_opacity=0.7, line_opacity=0.2,
85 |                     legend_name=legend_name if legend_name else variable,
86 |                     threshold_scale=threshold_scale)
87 |     return fmap
88 |
89 |
90 | def map_over_time(df, variable, time_periods, log=False,
91 |                   threshold_scale=None, legend_name=None,
92 |                   geo=r'../../data/aquastat/world.json'):
93 |
94 |     time_slider = widgets.SelectionSlider(options=time_periods.tolist(),
95 |                                           value=time_periods[0],
96 |                                           description='Time period:',
97 |                                           disabled=False,
98 |                                           button_style='')
99 |     widgets.interact(plot_map, df=widgets.fixed(df),
100 |                      variable=widgets.fixed(variable),
101 |                      time_period=time_slider, log=widgets.fixed(log),
102 |                      legend_name=widgets.fixed(legend_name),
103 |                      threshold_scale=widgets.fixed(threshold_scale),
104 |                      geo=widgets.fixed(geo));
105 |
106 |
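# Usage sketch for the slicing and mapping helpers above (assumes the
# long-format AQUASTAT frame used in the notebooks, with columns
# country / variable / value / time_period; the variable name 'total_pop'
# and the period '2013-2017' are hypothetical examples):
#
#     recent = time_slice(data, '2013-2017')   # countries x variables
#     ghana = country_slice(data, 'Ghana')     # variables x time periods
#     fmap = plot_map(data, 'total_pop', time_period='2013-2017',
#                     legend_name='Total population')
#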
107 | def plot_hist(df, variable, bins=None, xlabel=None, by=None,
108 |               ylabel=None, title=None, logx=False, ax=None):
109 |     if not bins:
110 |         bins = 20
111 |
112 |     if not ax:
113 |         fig, ax = plt.subplots(figsize=(12, 8))
114 |     if logx:
115 |         if df[variable].min() <= 0:
116 |             print('Warning: data <=0 exists, data transformed by %0.2g before plotting' % (- df[variable].min() + 1))
117 |             df[variable] = df[variable] - df[variable].min() + 1
118 |         bins = np.logspace(np.log10(df[variable].min()),
119 |                            np.log10(df[variable].max()), bins)
120 |         ax.set_xscale("log")
121 |
122 |     if by:
123 |         if pd.api.types.is_categorical_dtype(df[by]):
124 |             cats = df[by].unique().categories.tolist()
125 |         else:
126 |             cats = df[by].unique().tolist()
127 |
128 |         for cat in cats:
129 |             to_plot = df[df[by] == cat][variable].dropna()
130 |             ax.hist(to_plot, bins=bins);
131 |     else:
132 |         ax.hist(df[variable].dropna().values, bins=bins);
133 |
134 |     if xlabel:
135 |         ax.set_xlabel(xlabel);
136 |     if ylabel:
137 |         ax.set_ylabel(ylabel);
138 |     if title:
139 |         ax.set_title(title);
140 |
141 |     return ax
142 |
143 | def conditional_bar(series, bar_colors=None, color_labels=None, figsize=(13,24),
144 |                     xlabel=None, by=None, ylabel=None, title=None):
145 |     fig, ax = plt.subplots(figsize=figsize)
146 |     if not bar_colors:
147 |         bar_colors = mpl.rcParams['axes.prop_cycle'].by_key()['color'][0]
148 |     plt.barh(range(len(series)), series.values, color=bar_colors)
149 |     plt.xlabel('' if not xlabel else xlabel);
150 |     plt.ylabel('' if not ylabel else ylabel)
151 |     plt.yticks(range(len(series)), series.index.tolist())
152 |     plt.title('' if not title else title);
153 |     plt.ylim([-1, len(series)]);
154 |     if color_labels:
155 |         for col, lab in color_labels.items():
156 |             plt.plot([], linestyle='', marker='s', c=col, label=lab);
157 |         lines, labels = ax.get_legend_handles_labels();
158 |         ax.legend(lines[-len(color_labels.keys()):], labels[-len(color_labels.keys()):], loc='upper right');
159 |     plt.close()
160 |     return fig
161 |
162 |
163 | def plot_scatter(df, x, y, xlabel=None, ylabel=None, title=None,
164 |                  logx=False, logy=False, by=None, ax=None):
165 |     if not ax:
166 |         fig, ax = plt.subplots(figsize=(12, 10))
167 |
168 |     colors = mpl.rcParams['axes.prop_cycle'].by_key()['color']
169 |     if by:
170 |         groups = df.groupby(by)
171 |         for j, (name, group) in enumerate(groups):
172 |             ax.scatter(group[x], group[y], color=colors[j], label=name)
173 |         ax.legend()
174 |     else:
175 |         ax.scatter(df[x], df[y], color=colors[0])
176 |     if logx:
177 |         ax.set_xscale('log')
178 |     if logy:
179 |         ax.set_yscale('log')
180 |
181 |     ax.set_xlabel(xlabel if xlabel else x);
182 |     ax.set_ylabel(ylabel if ylabel else y);
183 |     if title:
184 |         ax.set_title(title);
185 |
186 | def two_hist(df, variable, bins=50,
187 |              ylabel='Number of countries', title=None):
188 |
189 |     fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18,8))
190 |     ax1 = plot_hist(df, variable, bins=bins,
191 |                     xlabel=variable, ylabel=ylabel,
192 |                     ax=ax1, title=variable if not title else title)
193 |     ax2 = plot_hist(df, variable, bins=bins,
194 |                     xlabel='Log of '+ variable, ylabel=ylabel,
195 |                     logx=True, ax=ax2,
196 |                     title='Log of '+ variable if not title else title)
197 |     plt.close()
198 |     return fig
199 |
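# Usage sketch for the histogram helpers (assumes the same hypothetical
# `recent` frame and 'total_pop' variable as in the earlier sketch):
#
#     plot_hist(recent, 'total_pop', bins=30, logx=True,
#               xlabel='Total population', ylabel='Number of countries')
#     fig = two_hist(recent, 'total_pop')  # linear and log scale side by side
#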
200 | def hist_over_var(df, variables, bins=50, first_choice=None,
201 |                   ylabel='Number of countries', title=None):
202 |     if not first_choice:
203 |         first_choice = variables[0]
204 |     variable_slider = widgets.Dropdown(options=variables.tolist(),
205 |                                        value=first_choice,
206 |                                        description='Variable:',
207 |                                        disabled=False,
208 |                                        button_style='')
209 |     widgets.interact(two_hist, df=widgets.fixed(df),
210 |                      variable=variable_slider, ylabel=widgets.fixed(ylabel),
211 |                      title=widgets.fixed(title), bins=widgets.fixed(bins));
212 |
--------------------------------------------------------------------------------
/setup/environment.yml:
--------------------------------------------------------------------------------
1 | name: eda3
2 | channels:
3 |   - conda-forge
4 |   - defaults
5 | dependencies:
6 |   - ipython-sql
7 |   - ipywidgets
8 |   - jupyter_contrib_core
9 |   - jupyter_contrib_nbextensions
10 |   - jupyter_highlight_selected_word
11 |   - jupyter_latex_envs
12 |   - jupyter_nbextensions_configurator
13 |   - mpld3
14 |   - prettytable
15 |   - anaconda-client
16 |   - ipykernel
17 |   - ipython
18 |   - ipython_genutils
19 |   - jupyter_client
20 |   - jupyter_console
21 |   - jupyter_core
22 |   - matplotlib
23 |   - nbconvert
24 |   - notebook
25 |   - numpy
26 |   - openssl
27 |   - pandas
28 |   - pip
29 |   - pivottablejs
30 |   - python=3.6
31 |   - qgrid
32 |   - requests
33 |   - scikit-learn
34 |   - scipy
35 |   - seaborn
36 |   - setuptools
37 |   - statsmodels
38 |   - tqdm
39 |   - widgetsnbextension
40 |   - pip:
41 |     - folium
42 |     - ipython-genutils
43 |     - jupyter-client
44 |     - jupyter-console
45 |     - jupyter-contrib-core
46 |     - jupyter-contrib-nbextensions
47 |     - jupyter-core
48 |     - jupyter-highlight-selected-word
49 |     - jupyter-latex-envs
50 |     - jupyter-nbextensions-configurator
51 |     - missingno
52 |     - pandas-profiling
53 |
54 |
--------------------------------------------------------------------------------
/setup/pivottablejs.html:
--------------------------------------------------------------------------------
[214-line HTML export of a PivotTable.js page; the markup was stripped in this dump and only the page title "PivotTable.js" survives]
--------------------------------------------------------------------------------
/setup/requirements.txt:
--------------------------------------------------------------------------------
1 | folium==0.7.0
2 | ipydatawidgets==4.0.0
3 | ipykernel==5.1.0
4 | ipyleaflet==0.9.1
5 | ipympl==0.2.1
6 | ipyscales==0.3.0
7 | ipython==7.2.0
8 | ipython-genutils==0.2.0
9 | ipyvolume==0.5.1
10 | ipywebrtc==0.4.3
11 | ipywidgets==7.4.2
12 | itsdangerous==1.1.0
13 | jedi==0.13.1
14 | jmespath==0.9.3
15 | jsonschema==3.0.0a3
16 | jupyter==1.0.0
17 | jupyter-client==5.2.3
18 | jupyter-console==6.0.0
19 | jupyter-contrib-core==0.3.3
20 | jupyter-contrib-nbextensions==0.5.0
21 | jupyter-core==4.4.0
22 | jupyter-highlight-selected-word==0.2.0
23 | jupyter-latex-envs==1.4.4
24 | jupyter-nbextensions-configurator==0.4.0
25 | jupyterlab==0.35.4
26 | jupyterlab-server==0.2.0
27 | matplotlib==2.2.3
28 | missingno==0.4.1
29 | nbconvert==5.4.1
30 | nbformat==4.4.0
31 | notebook==5.7.8
32 | numpy==1.15.1
33 | palettable==3.1.1
34 | pandas==0.23.4
35 | pandas-profiling==1.4.1
36 | pivottablejs==0.9.0
37 | qgrid==1.1.1
38 | scipy==1.1.0
39 | seaborn==0.9.0
40 | statsmodels==0.9.0
41 | widgetsnbextension==3.4.2
--------------------------------------------------------------------------------