├── .gitignore
├── EDA-cheat-sheet.md
├── LICENSE
├── README.md
├── data
│   ├── aquastat
│   │   ├── aquastat.csv.gzip
│   │   └── world.json
│   └── redcard
│       └── redcard.csv.gz
├── notebooks
│   ├── 0-Intro
│   │   ├── 0-Introduction-to-Exploratory-Data-Analysis.ipynb
│   │   ├── 0-Introduction-to-Jupyter-Notebooks.ipynb
│   │   ├── figures
│   │   │   ├── branches.jpg
│   │   │   └── crisp.png
│   │   └── html-versions
│   │       ├── 0-Introduction-to-Exploratory-Data-Analysis.html
│   │       └── 0-Introduction-to-Jupyter-Notebooks.html
│   ├── 1-RedCard-EDA
│   │   ├── 1-Redcard-Dataset.ipynb
│   │   ├── 2-Redcard-Players.ipynb
│   │   ├── 3-Redcard-Dyads.ipynb
│   │   ├── 4-Redcard-final-joins.ipynb
│   │   ├── figures
│   │   │   ├── covariates.png
│   │   │   ├── models.png
│   │   │   └── results.png
│   │   └── html-versions
│   │       ├── 1-Redcard-Dataset.html
│   │       ├── 2-Redcard-Players.html
│   │       ├── 3-Redcard-Dyads.html
│   │       └── 4-Redcard-final-joins.html
│   └── 2-Aquastat-EDA
│       ├── 1-Aquastat-Introduction.ipynb
│       ├── 2-Aquastat-Missing-Data.ipynb
│       ├── 3-Aquastat-Univariate.ipynb
│       ├── 4-Aquastat-TargetxVariable.ipynb
│       ├── 5-Aquastat-Multivariate.ipynb
│       ├── figures
│       │   ├── branches.jpg
│       │   └── fao.jpg
│       └── html-versions
│           ├── 1-Aquastat-Introduction.html
│           ├── 2-Aquastat-Missing-Data.html
│           ├── 3-Aquastat-Univariate.html
│           ├── 4-Aquastat-TargetxVariable.html
│           ├── 5-Aquastat-Multivariate.html
│           └── pivottablejs.html
├── scripts
│   └── aqua_helper.py
└── setup
    ├── environment.yml
    ├── pivottablejs.html
    ├── requirements.txt
    ├── test-my-environment.html
    └── test-my-environment.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | .ipynb_checkpoints/
3 |
--------------------------------------------------------------------------------
/EDA-cheat-sheet.md:
--------------------------------------------------------------------------------
1 | # EDA Cheat Sheet(s)
2 |
3 | ## Why we EDA
4 | Sometimes the consumer of your analysis won't understand why you need the time for EDA and will want results NOW! Here are some of the reasons you can give to convince them it's a good use of time for everyone involved.
5 |
6 | **Reasons for the analyst**
7 | * Identify patterns and develop hypotheses.
8 | * Test technical assumptions.
9 | * Inform model selection and feature engineering.
10 | * Build an intuition for the data.
11 |
12 | **Reasons for consumer of analysis**
13 | * Ensures delivery of technically-sound results.
14 | * Ensures right question is being asked.
15 | * Tests business assumptions.
16 | * Provides context necessary for maximum applicability and value of results.
17 | * Leads to insights that would otherwise not be found.
18 |
19 | ## Things to keep in mind
20 | * You're never done with EDA. With every analytical result, you want to return to EDA, make sure the result makes sense, test other questions that come up because of it.
21 | * Stay open-minded. You're supposed to be challenging your assumptions and those of the stakeholder who you're performing the analysis for.
22 | * Repeat EDA for every new problem. Just because you've done EDA on a dataset before doesn't mean you shouldn't do it again for the next problem. You need to look at the data through the lens of the problem at hand, and you will likely have different areas of investigation.
23 |
24 | ## The plan
25 | Exploratory data analysis consists of the following major tasks, which we present linearly here because each task builds on the ones before it. In reality, though, you are going to constantly jump from step to step. You may want to do all the steps for a subset of the variables first, or you may jump back to an earlier step because something you learned warrants another look.
26 |
27 | 1. Form hypotheses/develop investigation themes to explore
28 | 2. Wrangle data
29 | 3. Assess quality of data
30 | 4. Profile data
31 | 5. Explore each individual variable in the dataset
32 | 6. Assess the relationship between each variable and the target
33 | 7. Assess interactions between variables
34 | 8. Explore data across many dimensions
35 |
36 | Throughout the entire analysis you want to:
37 | * Capture a list of hypotheses and questions that come up for further exploration.
38 | * Record things to watch out for/ be aware of in future analyses.
39 | * Show intermediate results to colleagues to get a fresh perspective, feedback, and domain knowledge. Don't do EDA in a bubble! Get feedback throughout, especially from people removed from the problem and/or with relevant domain knowledge.
40 | * Position visuals and results together. EDA relies on your natural pattern recognition abilities so maximize what you'll find by putting visualizations and results in close proximity.
41 |
42 | ## Wrangling
43 |
44 | ### Basic things to do
45 | * Make your data [tidy](https://tomaugspurger.github.io/modern-5-tidy.html).
46 | 1. Each variable forms a column
47 | 2. Each observation forms a row
48 | 3. Each type of observational unit forms a table
49 | * Transform data: sometimes you will need to transform your data to be able to extract information from it. This step will usually occur after some of the other steps of EDA unless domain knowledge can inform these choices beforehand.
50 | * Log: when data is highly skewed (versus normally distributed like a bell curve), sometimes it has a log-normal distribution and taking the log of each data point will normalize it.
51 | * Binning of continuous variables: binning continuous variables and then analyzing the resulting groups of observations can make patterns easier to identify, especially in non-linear relationships.
52 | * Simplifying of categories: you really don't want more than 8-10 categories within a single data field. Try to aggregate to higher-level categories when it makes sense.
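
A minimal pandas sketch of these three transforms; the column names, cut points, and category mapping are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data illustrating the three transforms above
df = pd.DataFrame({
    "income": [2000, 3500, 4100, 52000, 890000],
    "age": [23, 35, 41, 52, 67],
    "browser": ["Chrome", "Chromium", "Firefox", "Safari", "Opera"],
})

# Log transform: compress a right-skewed, log-normal-ish variable
df["log_income"] = np.log(df["income"])

# Binning: cut a continuous variable into labeled groups
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                         labels=["young", "middle", "senior"])

# Simplifying categories: collapse related values into higher-level ones
family = {"Chrome": "Chromium-based", "Chromium": "Chromium-based",
          "Opera": "Chromium-based", "Firefox": "Firefox", "Safari": "Safari"}
df["browser_family"] = df["browser"].map(family)
```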
53 |
54 | ### Helpful packages
55 | * [`pandas`](http://pandas.pydata.org)
56 |
57 | ## Data quality assessment and profiling
58 | Before trying to understand what information is in the data, make sure you understand what the data represents and what's missing.
59 |
60 | ### Basic things to do
61 | * Categorical: count, count distinct, assess unique values
62 | * Numerical: count, min, max
63 | * Spot-check random samples and samples that you are familiar with
64 | * Slice and dice
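
These basics map directly onto pandas one-liners; the table here is a small hypothetical example with typical quality problems baked in:

```python
import numpy as np
import pandas as pd

# Hypothetical table with a missing category and a missing numeric value
df = pd.DataFrame({
    "country": ["USA", "USA", "Mexico", "Canada", None],
    "gdp": [18.6, 18.6, 1.1, np.nan, 1.5],
})

# Categorical: count, count distinct, and eyeball the unique values
n_country = df["country"].count()       # non-null count
n_distinct = df["country"].nunique()    # distinct non-null values
uniques = df["country"].unique()        # the values themselves, incl. None

# Numerical: count, min, max (describe() gives these and more)
gdp_stats = df["gdp"].describe()

# Spot-check a random sample
sample = df.sample(2, random_state=0)

# Slice and dice: the same statistics per slice
by_country = df.groupby("country")["gdp"].agg(["count", "min", "max"])
```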
65 |
66 | ### Questions to consider
67 | * What data isn’t there?
68 | * Are there systematic reasons for missing data?
69 | * Are there fields that are always missing at the same time?
70 | * Is there information in what data is missing?
71 | * Is the data that is there right?
72 | * Are there frequent values that are default values?
73 | * Are there fields that represent the same information?
74 | * What timestamp should you use?
75 | * Are there numerical values reported as strings?
76 | * Are there special values?
77 | * Is the data being generated the way you think?
78 | * Are there variables that are numerical but really should be categorical?
79 | * Is data consistent across different operating systems, device types, platforms, and countries?
80 | * Are there any direct relationships between fields (e.g. a value of x always implies a specific value of y)?
81 | * What are the units of measurement? Are they consistent?
82 | * Is data consistent across the population and time? (time series)
83 | * Are there obvious changes in reported data around the time of important events that affect data generation (e.g. version release)? (panel data)
84 |
85 | ### Helpful packages
86 | * [`missingno`](https://github.com/ResidentMario/missingno)
87 | * [`pivottablejs`](https://github.com/nicolaskruchten/jupyter_pivottablejs)
88 | * [`pandas_profiling`](https://github.com/JosPolfliet/pandas-profiling)
89 |
90 | ### Example backlog
91 | * Assess the prevalence of missing data across all data fields, assess whether its missingness is random or systematic, and identify patterns when such data is missing
92 | * Identify any default values that imply missing data for a given field
93 | * Determine sampling strategy for quality assessment and initial EDA
94 | * For datetime data types, ensure consistent formatting and granularity of data, and perform sanity checks on all dates present in the data.
95 | * In cases where multiple fields capture the same or similar information, understand the relationships between them and assess the most effective field to use
96 | * Assess data type of each field
97 | * For discrete value types, ensure data formats are consistent
98 | * For discrete value types, assess number of distinct values and percent unique and do sanity check on types of answers
99 | * For continuous data types, assess descriptive statistics and perform sanity check on values
100 | * Understand relationships between timestamps and assess which to use in analysis
101 | * Slice data by device type, operating system, software version and ensure consistency in data across slices
102 | * For device or app data, identify version release dates and assess data for any changes in format or value around those dates
103 |
104 | ## Exploration
105 |
106 | After quality assessment and profiling, exploratory data analysis can be divided into 4 main types of tasks:
107 | 1. Exploration of each individual variable
108 | 2. Assessment of the relationship between each variable and the target variable
109 | 3. Assessment of the interaction between variables
110 | 4. Exploration of data across many dimensions
111 |
112 | ### 1. Exploring each individual variable
113 |
114 | #### Basics to do
115 |
116 | Quantify:
117 | * *Location*: mean, median, mode, interquartile mean
118 | * *Spread*: standard deviation, variance, range, interquartile range
119 | * *Shape*: skewness, kurtosis
120 |
121 | For time series, plot summary statistics over time.
122 |
123 | For panel data:
124 | * Plot cross-sectional summary statistics over time
125 | * Plot time-series statistics across the population
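
A pandas sketch of these quantities on a hypothetical variable (here the interquartile mean is computed as the mean of the values between Q1 and Q3); the time-series portion shows the same idea via resampling:

```python
import pandas as pd

x = pd.Series([1, 2, 2, 3, 3, 3, 4, 40])  # hypothetical variable with one outlier

# Location
q1, q3 = x.quantile(0.25), x.quantile(0.75)
location = {
    "mean": x.mean(),
    "median": x.median(),
    "mode": x.mode()[0],
    "interquartile_mean": x[(x >= q1) & (x <= q3)].mean(),
}

# Spread
spread = {"std": x.std(), "var": x.var(),
          "range": x.max() - x.min(), "iqr": q3 - q1}

# Shape
shape = {"skewness": x.skew(), "kurtosis": x.kurt()}

# Time series: the same summary statistics, resampled over time
ts = pd.Series(range(12),
               index=pd.date_range("2017-01-01", periods=12, freq="MS"))
quarterly = ts.resample("QS").agg(["mean", "std"])
```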
126 |
127 | #### Questions to consider
128 | * What does each field in the data look like?
129 | * Is the distribution skewed? Bimodal?
130 | * Are there outliers? Are they feasible?
131 | * Are there discontinuities?
132 | * Are the typical assumptions seen in modeling valid?
133 | * Gaussian
134 | * Independently and identically distributed (i.i.d.)
135 | * Have one mode
136 | * Can be negative
137 | * Generating processes are stationary and isotropic (time series)
138 | * Independence between subjects (panel data)
139 |
140 | ### 2. Exploring the relationship between each variable and the target
141 | How does each field interact with the target?
142 |
143 | Assess each relationship’s:
144 | * Linearity
145 | * Direction
146 | * Rough size
147 | * Strength
148 |
149 | Methods:
150 | * [Bivariate visualizations](#Bivariate)
151 | * Calculate correlation
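
A sketch on synthetic data: Pearson correlation captures linear association (its sign gives the direction, its magnitude the strength), while Spearman is rank-based and also picks up monotonic non-linear relationships:

```python
import numpy as np
import pandas as pd

# Hypothetical feature with a roughly linear effect on the target
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
target = pd.Series(2 * x + rng.normal(scale=0.5, size=200))

pearson = x.corr(target)                      # linear association
spearman = x.corr(target, method="spearman")  # monotonic association
```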
152 |
153 |
154 | ### 3. Assessing interactions between variables
155 | How do the variables interact with each other?
156 |
157 | #### Basic things to do
158 | * [Bivariate visualizations](#Bivariate) for all combinations
159 | * Correlation matrices
160 | * Compare summary statistics of variable x for different categories of y
161 |
162 | ### 4. Exploring data across many dimensions
163 | Are there patterns across many of the variables?
164 |
165 | #### Basic things to do
166 | * Categorical:
167 | * Parallel coordinates
168 | * Continuous
169 | * Principal component analysis
170 | * Clustering
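
scikit-learn is the usual tool for both PCA and clustering, but PCA is small enough to sketch with numpy alone; the data below is synthetic, with three observed variables driven by one latent dimension:

```python
import numpy as np

rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[1.0, 0.8, -0.5]]) + 0.05 * rng.normal(size=(200, 3))

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)   # fraction of variance per component

# Scores on the first principal component
pc1 = Xc @ Vt[0]
```

If one component explains most of the variance, the variables share a strong common pattern worth investigating.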
171 |
172 | ### Helpful packages
173 | * `ipywidgets`: making function variables interactive for visualizations and calculations
174 | * `mpld3`: interactive visualizations
175 |
176 | ### Example backlog
177 | * Generate list of questions and hypotheses to be considered during EDA
178 | * Create univariate plots for all fields
179 | * Create bivariate plots for each combination of fields to assess correlation and other relationships
180 | * Plot summary statistics over time for time series data
181 | * Plot distribution of x for different categories y
182 | * Plot mean/median/min/max/count/distinct count of x over time for different categories of y
183 | * Capture list of hypotheses and questions that come up during EDA
184 | * Record things to watch out for/ be aware of in future analyses
185 | * Distill and present findings
186 |
187 | ## Visualization guide
188 | Here are the types of visualizations and the python packages we find most useful for data exploration.
189 |
190 | ### Univariate
191 | * Categorical:
192 | * [Bar plot](http://seaborn.pydata.org/generated/seaborn.barplot.html?highlight=bar%20plot#seaborn.barplot)
193 | * Continuous:
194 | * [Histograms](http://seaborn.pydata.org/tutorial/distributions.html#histograms)
195 |     * [Kernel density estimation plot](http://seaborn.pydata.org/tutorial/distributions.html#kernel-density-estimation)
196 | * [Box plots](http://seaborn.pydata.org/generated/seaborn.boxplot.html?highlight=boxplot#seaborn.boxplot)
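
The links above use seaborn; a minimal matplotlib equivalent of the continuous univariate plots (on synthetic, right-skewed data) looks like:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
values = rng.lognormal(size=500)  # hypothetical right-skewed variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=30)         # histogram of the raw values
ax1.set_title("Histogram")
ax2.boxplot(values, vert=False)   # box plot of the same variable
ax2.set_title("Box plot")
fig.tight_layout()
```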
197 |
198 | ### Bivariate
199 | * Categorical x categorical
200 | * [Heat map of contingency table](http://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap)
201 | * [Multiple bar plots](http://seaborn.pydata.org/tutorial/categorical.html?highlight=bar%20plot#bar-plots)
202 | * Categorical x continuous
203 | * [Box plots](http://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot) of continuous for each category
204 | * [Violin plots](http://seaborn.pydata.org/examples/simple_violinplots.html) of continuous distribution for each category
205 |     * Overlaid [histograms](http://seaborn.pydata.org/tutorial/distributions.html#histograms) (if 3 or fewer categories)
206 | * Continuous x continuous
207 | * [Scatter plots](http://seaborn.pydata.org/examples/marginal_ticks.html?highlight=scatter)
208 |     * [Hexbin plots](http://seaborn.pydata.org/tutorial/distributions.html#hexbin-plots)
209 | * [Joint kernel density estimation plots](http://seaborn.pydata.org/tutorial/distributions.html#kernel-density-estimation)
210 | * [Correlation matrix heatmap](http://seaborn.pydata.org/examples/network_correlations.html?highlight=correlation)
211 |
212 | ### Multivariate
213 | * [Pairwise bivariate figures/ scatter matrix](http://seaborn.pydata.org/tutorial/distributions.html#visualizing-pairwise-relationships-in-a-dataset)
214 |
215 | ### Time series
216 | * Line plots
217 | * Any bivariate plot with time or time period as the x-axis.
218 |
219 | ### Panel data
220 | * [Heat map](http://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap) with rows denoting observations and columns denoting time periods.
221 | * [Multiple line plots](http://seaborn.pydata.org/tutorial/categorical.html?highlight=panel%20data#drawing-multi-panel-categorical-plots)
222 | * [Strip plot](http://seaborn.pydata.org/tutorial/categorical.html?highlight=panel%20data#categorical-scatterplots) where time is on the x-axis, each entity has a constant y-value and a point is plotted every time an event is observed for that entity.
223 |
224 | ### Geospatial
225 | * [Choropleths](https://folium.readthedocs.io/en/latest/quickstart.html#choropleth-maps): regions colored according to their data value.
226 |
227 | ### Helpful packages
228 | * [`matplotlib`](https://matplotlib.org/): basic plotting
229 | * [`seaborn`](http://seaborn.pydata.org/): prettier versions of some `matplotlib` figures
230 | * [`mpld3`](http://mpld3.github.io/): interactive plotting
231 | * [`folium`](https://folium.readthedocs.io/en/latest/): geospatial plotting
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Chloe Mawer
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # EDA Tutorial
2 |
3 |
4 | This repo holds the contents developed for the tutorial, *Exploratory Data Analysis in Python*, presented at PyCon 2017 on May 17, 2017.
5 |
6 | We suggest setting up your environment and testing it (as detailed below) and then following along with the video of the tutorial, found [here](https://www.youtube.com/watch?v=W5WE9Db2RLU).
7 |
8 | As there was limited time for instruction, we also recommend pausing throughout and practicing some of the methods discussed as you go.
9 |
10 | We welcome any PRs with other demonstrations of how you would perform EDA on the provided datasets.
11 |
12 | ## The datasets
13 |
14 | * [Redcard Dataset](https://osf.io/47tnc/)
15 | * [Aquastat Dataset](http://www.fao.org/nr/water/aquastat/main/index.stm)
16 |
17 | ## Before the tutorial
18 |
19 | ### Microsoft Azure option
20 | If you don't want to deal with setting up your environment or have any problems with the below instructions, you can work through the tutorial through Microsoft Azure Notebooks by creating an account and cloning the tutorial library found [here](https://notebooks.azure.com/chloe/libraries/pycon-2017-eda-tutorial) (all of this is for free, forever).
21 |
22 | ### Github option
23 |
24 | #### 1. Clone this repo
25 | Clone this repository locally on your laptop.
26 | 1. Go to the green **Clone or download** button at the top of the repository page and copy the https link.
27 | 2. From the command line run the command:
28 |
29 | ```bash
30 | git clone https://github.com/cmawer/pycon-2017-eda-tutorial.git
31 | ```
32 |
33 | #### 2. Set up your python environment
34 |
35 | ##### Install conda or miniconda
36 | We recommend using conda for managing your python environments. Specifically, we like miniconda, which is the most lightweight installation. You can install miniconda [here](https://conda.io/miniconda.html). However, the full [anaconda](https://www.continuum.io/downloads) is good for beginners as it comes with many packages already installed.
37 |
38 | ##### Create your environment
39 |
40 | Once installed, you can create the environment necessary for running this tutorial by running the following command from the command line in the `setup/` directory of this repository:
41 |
42 | ```bash
43 | conda update conda
44 | ```
45 |
46 | then:
47 |
48 | ```bash
49 | conda env create -f environment.yml
50 | ```
51 |
52 | This command will create a new environment named `eda3`.
53 |
54 | ##### Activate your environment
55 | To activate the environment you can run this command from any directory:
56 |
57 | `source activate eda3` (Mac/Linux)
58 |
59 | `activate eda3` (Windows)
60 |
61 | ##### Non-conda users
62 |
63 |
64 | If you are experienced in python and do not use conda, the `requirements.txt` file is also available in the `setup/` directory for pip installation. It captures our environment frozen as-is on a Mac, so if you are using Windows or Linux, you may need to relax some of the version requirements.
65 |
66 | #### 3. Enable `ipywidgets`
67 |
68 | We will be using widgets to create interactive visualizations. They were installed during your environment setup, but you still need to enable them by running the following from the command line:
69 |
70 | ```bash
71 | jupyter nbextension enable --py --sys-prefix widgetsnbextension
72 | ```
73 |
74 | #### 4. Test your python environment
75 |
76 | Now that your environment is set up, let's check that it works.
77 |
78 | 1. Go to the `setup/` directory from the command line and start a Jupyter notebook instance:
79 |
80 | ```bash
81 | jupyter notebook
82 | ```
83 |
84 | A lot of text should appear; you need to leave this terminal running for your Jupyter instance to work.
85 |
86 | 2. Assuming this worked, open the notebook titled `test-my-environment.ipynb`.
87 |
88 | 3. Once the notebook is open, go to the `Cell` menu and select `Run All`.
89 |
90 | 4. Check that every cell in the notebook ran (i.e., did not produce an error as output). `test-my-environment.html` shows what the notebook should look like after running.
91 |
92 |
93 |
--------------------------------------------------------------------------------
/data/aquastat/aquastat.csv.gzip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmawer/pycon-2017-eda-tutorial/91c3a131950e30b1832e35cfb0cda450ab082b2c/data/aquastat/aquastat.csv.gzip
--------------------------------------------------------------------------------
/data/redcard/redcard.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cmawer/pycon-2017-eda-tutorial/91c3a131950e30b1832e35cfb0cda450ab082b2c/data/redcard/redcard.csv.gz
--------------------------------------------------------------------------------
/notebooks/0-Intro/0-Introduction-to-Exploratory-Data-Analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "toc": "true"
7 | },
8 | "source": [
9 | "# Table of Contents\n",
10 | ""
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | ""
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "# What is exploratory data analysis? "
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {
30 | "ExecuteTime": {
31 | "end_time": "2017-05-14T21:28:21.398615Z",
32 | "start_time": "2017-05-14T21:28:21.393864Z"
33 | },
34 | "collapsed": true,
35 | "run_control": {
36 | "frozen": false,
37 | "read_only": false
38 | }
39 | },
40 | "source": [
41 | ""
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "# Why we EDA\n",
49 | "Sometimes the consumer of your analysis won't understand why you need the time for EDA and will want results NOW! Here are some of the reasons you can give to convince them it's a good use of time for everyone involved. \n",
50 | "\n",
51 | "**Reasons for the analyst**\n",
52 | "* Identify patterns and develop hypotheses.\n",
53 | "* Test technical assumptions.\n",
54 | "* Inform model selection and feature engineering.\n",
55 | "* Build an intuition for the data.\n",
56 | "\n",
57 | "**Reasons for consumer of analysis**\n",
58 | "* Ensures delivery of technically-sound results.\n",
59 | "* Ensures right question is being asked.\n",
60 | "* Tests business assumptions.\n",
61 | "* Provides context necessary for maximum applicability and value of results.\n",
62 | "* Leads to insights that would otherwise not be found.\n",
63 | "\n",
64 | "# Things to keep in mind \n",
65 | "* You're never done with EDA. With every analytical result, you want to return to EDA, make sure the result makes sense, test other questions that come up because of it. \n",
66 | "* Stay open-minded. You're supposed to be challenging your assumptions and those of the stakeholder who you're performing the analysis for. \n",
67 | "* Repeat EDA for every new problem. Just because you've done EDA on a dataset before doesn't mean you shouldn't do it again for the next problem. You need to look at the data through the lens of the problem at hand, and you will likely have different areas of investigation.\n"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "# The game plan\n",
75 | "\n",
76 | "\n",
77 | "Exploratory data analysis consists of the following major tasks, which we present linearly here because each task doesn't make much sense to do without the ones prior to it. However, in reality, you are going to constantly jump around from step to step. You may want to do all the steps for a subset of the variables first. Or often, an observation will bring up a question you want to investigate and you'll branch off and explore to answer that question before returning down the main path of exhaustive EDA.\n",
78 | " \n",
79 | "1. Form hypotheses/develop investigation themes to explore \n",
80 | "2. Wrangle data \n",
81 | "3. Assess data quality and profile \n",
82 | "4. Explore each individual variable in the dataset \n",
83 | "5. Assess the relationship between each variable and the target \n",
84 | "6. Assess interactions between variables \n",
85 | "7. Explore data across many dimensions \n",
86 | "\n",
87 | "Throughout the entire analysis you want to:\n",
88 | "* Capture a list of hypotheses and questions that come up for further exploration.\n",
89 | "* Record things to watch out for/ be aware of in future analyses. \n",
90 | "* Show intermediate results to colleagues to get a fresh perspective, feedback, and domain knowledge. Don't do EDA in a bubble! Get feedback throughout, especially from people removed from the problem and/or with relevant domain knowledge. \n",
91 | "* Position visuals and results together. EDA relies on your natural pattern recognition abilities so maximize what you'll find by putting visualizations and results in close proximity. \n"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "## 1. Brainstorm areas of investigation\n",
99 | "Yes, you're exploring, but that doesn't mean it's a free for all.\n",
100 | "\n",
101 | "* What do you need to understand the question you're trying to answer? \n",
102 | "* List before diving in and update throughout the analysis"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "## 2. Wrangle the data\n",
110 | "\n",
111 | "* Make your data [tidy](https://tomaugspurger.github.io/modern-5-tidy.html).\n",
112 | " 1. Each variable forms a column\n",
113 | " 2. Each observation forms a row\n",
114 | " 3. Each type of observational unit forms a table\n",
115 | "* Transform data\n",
116 | " * Log \n",
117 | " * Binning\n",
118 | " * Aggregation into higher-level categories "
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "## 3. Assess data quality and profile\n",
126 | "* What data isn't there? \n",
127 | "* Is the data that is there right? \n",
128 | "* Is the data being generated in the way you think?"
129 | ]
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
135 | "## 4. Explore each individual variable in the dataset \n",
136 | "* What does each field in the data look like? \n",
137 | "* How can each variable be described by a few key values? \n",
138 | "* Are the assumptions often made in modeling valid? "
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {},
144 | "source": [
145 | "## 5. Assess the relationship between each variable and the target \n",
146 | "\n",
147 | "How does each variable interact with the target variable? \n",
148 | "\n",
149 | "Assess each relationship's:\n",
150 | "* Linearity \n",
151 | "* Direction \n",
152 | "* Rough size \n",
153 | "* Strength"
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "## 6. Assess interactions between the variables\n",
161 | "* How do the variables interact with each other? \n",
162 | "* What is the linearity, direction, rough size, and strength of the relationships between pairs of variables? "
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "## 7. Explore data across many dimensions\n",
170 | "Are there patterns across many of the variables?"
171 | ]
172 | },
173 | {
174 | "cell_type": "markdown",
175 | "metadata": {},
176 | "source": [
177 | "# Our objectives for this tutorial\n",
178 | "\n",
179 | "Our objectives for this tutorial are to help you: \n",
180 | "* Develop the EDA mindset \n",
181 | " * Questions to consider while exploring \n",
182 | " * Things to look out for \n",
183 | "* Learn basic methods for effective EDA \n",
184 | " * Slicing and dicing \n",
185 | " * Calculating summary statistics\n",
186 | " * Basic plotting \n",
187 | " * Basic mapping \n",
188 | " * Using widgets for interactive exploration\n",
189 | "\n",
190 | "The actual exploration you do in this tutorial is *yours*. We have no answers or set of conclusions we think you should come to about the datasets. Our goal is simply to aid in making your exploration as effective as possible. \n"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "