├── .gitignore ├── README.md ├── binder ├── download_geodata.py └── environment.yml ├── examples ├── data │ ├── processed │ │ └── placeholder │ └── raw │ │ └── placeholder └── notebooks │ ├── census-data-downloader.ipynb │ ├── censusdata-example-internet-access.ipynb │ └── censusdata-example.ipynb ├── exercises ├── data │ ├── final │ │ └── placeholder │ ├── interim │ │ ├── placeholder │ │ ├── state_data-01-May-19.dta │ │ └── working_data-01-May-19.dta │ ├── processed │ │ └── placeholder │ └── raw │ │ ├── acs_data.csv.gz │ │ └── acs_data.dta.gz └── notebooks │ ├── 00_DigitalDivide_Data_Prep.ipynb │ ├── 01_DigitalDivide_Analysis.ipynb │ ├── solutions-Analysis.ipynb │ ├── solutions-Data_Prep.ipynb │ └── tools.py ├── presentation ├── presentation.ipynb └── static │ ├── ipums.gif │ └── style.css └── static ├── github-download.gif ├── math.png └── nooice.gif /.gitignore: -------------------------------------------------------------------------------- 1 | ## jupyter and python 2 | .ipynb_checkpoints/ 3 | __pycache__/ 4 | .DS_Store 5 | 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![badge](https://img.shields.io/badge/-Exercises-579ACA.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFkAAABZCAMAAABi1XidAAAB8lBMVEX///9XmsrmZYH1olJXmsr1olJXmsrmZYH1olJXmsr1olJXmsrmZYH1olL1olJXmsr1olJXmsrmZYH1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olJXmsrmZYH1olL1olL0nFf1olJXmsrmZYH1olJXmsq8dZb1olJXmsrmZYH1olJXmspXmspXmsr1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olLeaIVXmsrmZYH1olL1olL1olJXmsrmZYH1olLna31Xmsr1olJXmsr1olJXmsrmZYH1olLqoVr1olJXmsr1olJXmsrmZYH1olL1olKkfaPobXvviGabgadXmsqThKuofKHmZ4Dobnr1olJXmsr1olJXmspXmsr1olJXmsrfZ4TuhWn1olL1olJXmsqBi7X1olJXmspZmslbmMhbmsdemsVfl8ZgmsNim8Jpk8F0m7R4m7F5nLB6jbh7jbiDirOEibOGnKaMhq+PnaCVg6qWg6qegKaff6WhnpKofKGtnomxeZy3noG6dZi+n3vCcpPDcpPGn3bLb4/Mb47UbIrVa4rYoGjdaIbeaIXhoWHmZYHobXvpcHjqdHXreHLroVrsfG/uhGnuh2bwj2Hxk17yl1vzmljzm1j0nlX1olL3AJXWAAAAbXRSTlMAEBAQHx8gICAuLjAwMDw9PUBAQEpQUFBXV1hgYGBkcHBwcXl8gICAgoiIkJCQlJicnJ2goKCmqK+wsLC4usDAwMjP0NDQ1NbW3Nzg4ODi5+3v8PDw8/T09PX29vb39/f5+fr7+/z8/Pz9/v7+zczCxgAABC5JREFUeAHN1ul3k0UUBvCb1CTVpmpaitAGSLSpSuKCLWpbTKNJFGlcSMAFF63iUmRccNG6gLbuxkXU66JAUef/9LSpmXnyLr3T5AO/rzl5zj137p136BISy44fKJXuGN/d19PUfYeO67Znqtf2KH33Id1psXoFdW30sPZ1sMvs2D060AHqws4FHeJojLZqnw53cmfvg+XR8mC0OEjuxrXEkX5ydeVJLVIlV0e10PXk5k7dYeHu7Cj1j+49uKg7uLU61tGLw1lq27ugQYlclHC4bgv7VQ+TAyj5Zc/UjsPvs1sd5cWryWObtvWT2EPa4rtnWW3JkpjggEpbOsPr7F7EyNewtpBIslA7p43HCsnwooXTEc3UmPmCNn5lrqTJxy6nRmcavGZVt/3Da2pD5NHvsOHJCrdc1G2r3DITpU7yic7w/7Rxnjc0kt5GC4djiv2Sz3Fb2iEZg41/ddsFDoyuYrIkmFehz0HR2thPgQqMyQYb2OtB0WxsZ3BeG3+wpRb1vzl2UYBog8FfGhttFKjtAclnZYrRo9ryG9uG/FZQU4AEg8ZE9LjGMzTmqKXPLnlWVnIlQQTvxJf8ip7VgjZjyVPrjw1te5otM7RmP7xm+sK2Gv9I8Gi++BRbEkR9EBw8zRUcKxwp73xkaLiqQb+kGduJTNHG72zcW9LoJgqQxpP3/Tj//c3yB0tqzaml05/+orHLksVO+95kX7/7qgJvnjlrfr2Ggsyx0eoy9uPzN5SPd86aXggOsEKW2Prz7du3VID3/tzs/sSRs2w7ovVHKtjrX2pd7ZMlTxAYfBAL9jiDwfLkq55Tm7ifhMlTGPyCAs7RFRhn47JnlcB9RM5T97ASuZXIcVNuUDIndpDbdsfrqsOppeXl5Y+XVKdjFCTh+zGaVuj0d9zy05PPK3QzBamxdwtTCrzyg/2Rvf2EstUjordGwa/kx9mSJLr8mLLtCW8HHGJc2R5hS219IiF6PnTusOqcMl57gm0Z8kanKMAQg0qSyuZfn7zItsbGyO9QlnxY0eCuD1XL2ys/MsrQhltE7Ug0uFOzufJFE2PxBo/YAx8XPPdDwWN0MrDRYIZF0mSMKCNHgaIVFoBbNoLJ7tEQDKxGF0kcLQimojCZopv0OkNOyWCCg9XMVAi7ARJzQdM2QUh0gmBozjc3Skg6dSBRqDGYSUOu66Zg+I2fNZs/M3/f/Grl/XnyF1Gw3VKCez0PN5IUfFLqvgUN4C0qNqYs5YhPL+aVZYDE4IpUk57oSFnJm4FyCqqOE0jhY2SMyLFoo56zyo6becOS5UVDdj7Vih0zp+tcMhwRpBeLyqtIjlJKAIZSbI8SGSF3k0pA3mR5t
HuwPFoa7N7reoq2bqCsAk1HqCu5uvI1n6JuRXI+S1Mco54YmYTwcn6Aeic+kssXi8XpXC4V3t7/ADuTNKaQJdScAAAAAElFTkSuQmCC)](https://mybinder.org/v2/gh/Chekos/analyzing-census-data/master?urlpath=lab%2Ftree%2Fexercises%2Fnotebooks%2F00_DigitalDivide_Data_Prep.ipynb) 2 | [![badge](https://img.shields.io/badge/-Slides-E66581.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFkAAABZCAMAAABi1XidAAAB8lBMVEX///9XmsrmZYH1olJXmsr1olJXmsrmZYH1olJXmsr1olJXmsrmZYH1olL1olJXmsr1olJXmsrmZYH1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olJXmsrmZYH1olL1olL0nFf1olJXmsrmZYH1olJXmsq8dZb1olJXmsrmZYH1olJXmspXmspXmsr1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olLeaIVXmsrmZYH1olL1olL1olJXmsrmZYH1olLna31Xmsr1olJXmsr1olJXmsrmZYH1olLqoVr1olJXmsr1olJXmsrmZYH1olL1olKkfaPobXvviGabgadXmsqThKuofKHmZ4Dobnr1olJXmsr1olJXmspXmsr1olJXmsrfZ4TuhWn1olL1olJXmsqBi7X1olJXmspZmslbmMhbmsdemsVfl8ZgmsNim8Jpk8F0m7R4m7F5nLB6jbh7jbiDirOEibOGnKaMhq+PnaCVg6qWg6qegKaff6WhnpKofKGtnomxeZy3noG6dZi+n3vCcpPDcpPGn3bLb4/Mb47UbIrVa4rYoGjdaIbeaIXhoWHmZYHobXvpcHjqdHXreHLroVrsfG/uhGnuh2bwj2Hxk17yl1vzmljzm1j0nlX1olL3AJXWAAAAbXRSTlMAEBAQHx8gICAuLjAwMDw9PUBAQEpQUFBXV1hgYGBkcHBwcXl8gICAgoiIkJCQlJicnJ2goKCmqK+wsLC4usDAwMjP0NDQ1NbW3Nzg4ODi5+3v8PDw8/T09PX29vb39/f5+fr7+/z8/Pz9/v7+zczCxgAABC5JREFUeAHN1ul3k0UUBvCb1CTVpmpaitAGSLSpSuKCLWpbTKNJFGlcSMAFF63iUmRccNG6gLbuxkXU66JAUef/9LSpmXnyLr3T5AO/rzl5zj137p136BISy44fKJXuGN/d19PUfYeO67Znqtf2KH33Id1psXoFdW30sPZ1sMvs2D060AHqws4FHeJojLZqnw53cmfvg+XR8mC0OEjuxrXEkX5ydeVJLVIlV0e10PXk5k7dYeHu7Cj1j+49uKg7uLU61tGLw1lq27ugQYlclHC4bgv7VQ+TAyj5Zc/UjsPvs1sd5cWryWObtvWT2EPa4rtnWW3JkpjggEpbOsPr7F7EyNewtpBIslA7p43HCsnwooXTEc3UmPmCNn5lrqTJxy6nRmcavGZVt/3Da2pD5NHvsOHJCrdc1G2r3DITpU7yic7w/7Rxnjc0kt5GC4djiv2Sz3Fb2iEZg41/ddsFDoyuYrIkmFehz0HR2thPgQqMyQYb2OtB0WxsZ3BeG3+wpRb1vzl2UYBog8FfGhttFKjtAclnZYrRo9ryG9uG/FZQU4AEg8ZE9LjGMzTmqKXPLnlWVnIlQQTvxJf8ip7VgjZjyVPrjw1te5otM7RmP7xm+sK2Gv9I8Gi++BRbEkR9EBw8zRUcKxwp73xkaLiqQb+kGduJTNHG72zcW9LoJgqQxpP3/Tj//c3yB0tqzaml05/+orHLksVO+95kX7/7qgJvnjlrfr2Ggsyx0eoy9uPzN5SPd86aXggOsEKW2Prz7du3VID3/tzs/sSRs2w7ovVHKtjrX2pd7ZMlTxAYfBAL9jiDwfLkq55Tm7ifhMlTGPyCAs7RFRhn47JnlcB9RM5T97ASuZXIcVNuUDIndpDbdsfrqsOppeXl5Y+XVKdjFCTh+zGaVuj0d9zy05PPK3QzBamxdwtTCrzyg/2Rvf2EstUjordGwa/kx9mSJLr8mLLtCW8HHGJc2R5hS219IiF6PnTusOqcMl57gm0Z8kanKMAQg0qSyuZfn7zItsbGyO9QlnxY0eCuD1XL2ys/MsrQhltE7Ug0uFOzufJFE2PxBo/YAx8XPPdDwWN0MrDRYIZF0mSMKCNHgaIVFoBbNoLJ7tEQDKxGF0kcLQimojCZopv0OkNOyWCCg9XMVAi7ARJzQdM2QUh0gmBozjc3Skg6dSBRqDGYSUOu66Zg+I2fNZs/M3/f/Grl/XnyF1Gw3VKCez0PN5IUfFLqvgUN4C0qNqYs5YhPL+aVZYDE4IpUk57oSFnJm4FyCqqOE0jhY2SMyLFoo56zyo6becOS5UVDdj7Vih0zp+tcMhwRpBeLyqtIjlJKAIZSbI8SGSF3k0pA3mR5tHuwPFoa7N7reoq2bqCsAk1HqCu5uvI1n6JuRXI+S1Mco54YmYTwcn6Aeic+kssXi8XpXC4V3t7/ADuTNKaQJdScAAAAAElFTkSuQmCC)](https://mybinder.org/v2/gh/chekos/analyzing-census-data/master?filepath=presentation%2Fpresentation.ipynb) 3 | 4 | # Analyzing Census Data with Pandas 5 | ## PyCon 2019 6 | *** 7 | 8 | Materials for my Analyzing Census Data with Pandas workshop for PyCon 2019. 9 | 10 | # The tutorial 11 | This tutorial is meant to be followed using mybinder.org but if you choose to download the materials and follow along these are the instructions. 12 | 13 | # Getting the materials 14 | The easiest way to get a copy of this repository is to clone it if you know git 15 | ```bash 16 | git clone https://github.com/chekos/analyzing-census-data.git 17 | ``` 18 | 19 | But you can also download it straight from GitHub: 20 | 21 | ![GitHub Download](static/github-download.gif) 22 | 23 | # Setting up your environment 24 | Only 2 packages are essential for this workshop: 25 | 1. 
Pandas 26 | 2. Jupyter (notebooks or lab) 27 | 28 | You can either `pip` install them: 29 | ```bash 30 | pip install pandas jupyterlab 31 | ``` 32 | or use `conda` to install them: 33 | ```bash 34 | conda install -c conda-forge pandas jupyterlab 35 | ``` 36 | 37 | Once you have the materials and the `python` packages necessary, head over to the **exercises** directory and launch Jupyter Lab: 38 | ```bash 39 | cd analyzing-census-data 40 | cd exercises 41 | jupyter lab 42 | ``` 43 | 44 | # The structure 45 | This tutorial will guide you through a typical data analysis project utilizing Census data acquired from [IPUMS](https://usa.ipums.org/). It's split into 2 notebooks: 46 | 1. [Data Preparation](exercises/notebooks/00_DigitalDivide_Data_Prep.ipynb) 47 | 2. [Data Analysis](exercises/notebooks/01_DigitalDivide_Analysis.ipynb) 48 | 49 | In the first notebook you will: 50 | 1. Work with compressed data with pandas. 51 | 2. Retrieve high-level descriptive analytics of your data. 52 | 3. Drop columns. 53 | 4. Slice data (boolean indexing). 54 | 5. Work with categorical data. 55 | 6. Work with weighted data. 56 | 7. Use python's `pathlib` library, making your code more reproducible across platforms. 57 | 8. Develop a reproducible data prep workflow for future projects. 58 | 59 | On top of that, in the second notebook you will: 60 | 1. Aggregate data. 61 | 2. Learn about `.groupby()` 62 | 3. Learn about cross-sections `.xs()` 63 | 4. Learn about `pivot_table`s and `crosstab`s 64 | 5. Develop a reproducible data analysis workflow for future projects. 65 | 66 | -------------------------------------------------------------------------------- /binder/download_geodata.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | import requests 3 | import io 4 | from zipfile import ZipFile 5 | 6 | RAW_DATA_PATH = Path("exercises/data/raw/counties/") 7 | # download the 2018 TIGER/Line county shapefile and extract it into the raw data folder 8 | url = "https://www2.census.gov/geo/tiger/TIGER2018/COUNTY/tl_2018_us_county.zip" 9 | site = requests.get(url) 10 | 11 | z = ZipFile(io.BytesIO(site.content)) 12 | z.extractall(RAW_DATA_PATH) -------------------------------------------------------------------------------- /binder/environment.yml: -------------------------------------------------------------------------------- 1 | name: analyzing-census-data 2 | channels: 3 |   - conda-forge 4 | dependencies: 5 |   - pandas==0.24.* 6 |   - rise==5.5.* 7 |   - pip 8 |   - pip: 9 |     - cenpy 10 |     - census 11 |     - us 12 |     - census-data-downloader 13 |     - censusdata 14 |     - pypums -------------------------------------------------------------------------------- /examples/data/processed/placeholder: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/examples/data/processed/placeholder -------------------------------------------------------------------------------- /examples/data/raw/placeholder: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/examples/data/raw/placeholder -------------------------------------------------------------------------------- /examples/notebooks/census-data-downloader.ipynb: -------------------------------------------------------------------------------- 1 | { 2 |  "cells": [ 3 |   { 4 |    "cell_type": "markdown", 5 |    "metadata": {}, 6 |    "source": [ 7 |     "# Los Angeles Times datadesk's 
`census-data-downloader` \n", 8 |     "\n", 9 |     "Repo: https://github.com/datadesk/census-data-downloader \n", 10 |     "\n", 11 |     "Needs an API KEY." 12 |    ] 13 |   }, 14 |   { 15 |    "cell_type": "code", 16 |    "execution_count": 1, 17 |    "metadata": {}, 18 |    "outputs": [], 19 |    "source": [ 20 |     "import pandas as pd\n", 21 |     "from pathlib import Path\n", 22 |     "from tools import tree" 23 |    ] 24 |   }, 25 |   { 26 |    "cell_type": "code", 27 |    "execution_count": 3, 28 |    "metadata": {}, 29 |    "outputs": [], 30 |    "source": [ 31 |     "RAW_DATA_PATH = Path(\"../data/raw/\")\n", 32 |     "PROCESSED_DATA_PATH = Path(\"../data/processed/\")" 33 |    ] 34 |   }, 35 |   { 36 |    "cell_type": "markdown", 37 |    "metadata": {}, 38 |    "source": [ 39 |     "`census-data-downloader` will look for an environmental variable with your Census API Key.\n", 40 |     "\n", 41 |     "```\n", 42 |     "%env CENSUS_API_KEY=\n", 43 |     "```" 44 |    ] 45 |   }, 46 |   { 47 |    "cell_type": "code", 48 |    "execution_count": 24, 49 |    "metadata": {}, 50 |    "outputs": [], 51 |    "source": [ 52 |     "!censusdatadownloader --year 2017 --data-dir ../data/ medianage counties" 53 |    ] 54 |   }, 55 |   { 56 |    "cell_type": "markdown", 57 |    "metadata": {}, 58 |    "source": [ 59 |     "It will create 2 datafiles:\n", 60 |     "1. A raw data file in your `data-dir`/raw\n", 61 |     "2. A human-readable data file in your `data-dir`/processed" 62 |    ] 63 |   }, 64 |   { 65 |    "cell_type": "code", 66 |    "execution_count": 4, 67 |    "metadata": {}, 68 |    "outputs": [ 69 |     { 70 |      "name": "stdout", 71 |      "output_type": "stream", 72 |      "text": [ 73 |       "+ ../data/raw\n", 74 |       "    + .DS_Store\n", 75 |       "    + acs_data.csv.gz\n", 76 |       "    + acs_data.dta.gz\n", 77 |       "    + cps_data.dta.gz\n" 78 |      ] 79 |     } 80 |    ], 81 |    "source": [ 82 |     "tree(RAW_DATA_PATH)" 83 |    ] 84 |   }, 85 |   { 86 |    "cell_type": "code", 87 |    "execution_count": 15, 88 |    "metadata": {}, 89 |    "outputs": [], 90 |    "source": [ 91 |     "raw_data = pd.read_csv(RAW_DATA_PATH / \"acs5_2017_medianage_states.csv\")\n", 92 |     "processed_data = pd.read_csv(PROCESSED_DATA_PATH / \"acs5_2017_medianage_states.csv\")" 93 |    ] 94 |   }, 95 |   { 96 |    "cell_type": "code", 97 |    "execution_count": 16, 98 |    "metadata": {}, 99 |    "outputs": [ 100 |     { 101 |      "data": { 102 |       "text/html": [
\n", 104 | "\n", 117 | "\n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | "
B01002_001EB01002_002EB01002_003Enamestate
040.138.141.8Puerto Rico72
138.737.240.1Alabama1
233.933.434.4Alaska2
337.235.938.5Arizona4
437.936.539.3Arkansas5
\n", 171 | "
" 172 | ], 173 | "text/plain": [ 174 | " B01002_001E B01002_002E B01002_003E name state\n", 175 | "0 40.1 38.1 41.8 Puerto Rico 72\n", 176 | "1 38.7 37.2 40.1 Alabama 1\n", 177 | "2 33.9 33.4 34.4 Alaska 2\n", 178 | "3 37.2 35.9 38.5 Arizona 4\n", 179 | "4 37.9 36.5 39.3 Arkansas 5" 180 | ] 181 | }, 182 | "execution_count": 16, 183 | "metadata": {}, 184 | "output_type": "execute_result" 185 | } 186 | ], 187 | "source": [ 188 | "raw_data.head()" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 17, 194 | "metadata": {}, 195 | "outputs": [ 196 | { 197 | "data": { 198 | "text/html": [ 199 | "
\n", 200 | "\n", 213 | "\n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | "
geoidnamemedianmalefemalestate
072Puerto Rico40.138.141.872
11Alabama38.737.240.11
22Alaska33.933.434.42
34Arizona37.235.938.54
45Arkansas37.936.539.35
\n", 273 | "
" 274 | ], 275 | "text/plain": [ 276 | " geoid name median male female state\n", 277 | "0 72 Puerto Rico 40.1 38.1 41.8 72\n", 278 | "1 1 Alabama 38.7 37.2 40.1 1\n", 279 | "2 2 Alaska 33.9 33.4 34.4 2\n", 280 | "3 4 Arizona 37.2 35.9 38.5 4\n", 281 | "4 5 Arkansas 37.9 36.5 39.3 5" 282 | ] 283 | }, 284 | "execution_count": 17, 285 | "metadata": {}, 286 | "output_type": "execute_result" 287 | } 288 | ], 289 | "source": [ 290 | "processed_data.head()" 291 | ] 292 | } 293 | ], 294 | "metadata": { 295 | "kernelspec": { 296 | "display_name": "Python 3", 297 | "language": "python", 298 | "name": "python3" 299 | }, 300 | "language_info": { 301 | "codemirror_mode": { 302 | "name": "ipython", 303 | "version": 3 304 | }, 305 | "file_extension": ".py", 306 | "mimetype": "text/x-python", 307 | "name": "python", 308 | "nbconvert_exporter": "python", 309 | "pygments_lexer": "ipython3", 310 | "version": "3.7.3" 311 | } 312 | }, 313 | "nbformat": 4, 314 | "nbformat_minor": 2 315 | } 316 | -------------------------------------------------------------------------------- /exercises/data/final/placeholder: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/exercises/data/final/placeholder -------------------------------------------------------------------------------- /exercises/data/interim/placeholder: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/exercises/data/interim/placeholder -------------------------------------------------------------------------------- /exercises/data/interim/state_data-01-May-19.dta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/exercises/data/interim/state_data-01-May-19.dta -------------------------------------------------------------------------------- /exercises/data/interim/working_data-01-May-19.dta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/exercises/data/interim/working_data-01-May-19.dta -------------------------------------------------------------------------------- /exercises/data/processed/placeholder: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/exercises/data/processed/placeholder -------------------------------------------------------------------------------- /exercises/data/raw/acs_data.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/exercises/data/raw/acs_data.csv.gz -------------------------------------------------------------------------------- /exercises/data/raw/acs_data.dta.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/exercises/data/raw/acs_data.dta.gz -------------------------------------------------------------------------------- /exercises/notebooks/00_DigitalDivide_Data_Prep.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 |  "cells": [ 3 |   { 4 |    "cell_type": "markdown", 5 |    "metadata": {}, 6 |    "source": [ 7 |     "# Project 1: Digital Divide\n", 8 |     "### Data Prep\n", 9 |     "\n", 10 |     "#### Based on PPIC's Just the Facts report [\"California's Digital Divide\"](https://www.ppic.org/publication/californias-digital-divide/)" 11 |    ] 12 |   }, 13 |   { 14 |    "cell_type": "markdown", 15 |    "metadata": {}, 16 |    "source": [ 17 |     "## Research Question(s):\n", 18 |     "1. What share of households in X state have access to high-speed internet? \n", 19 |     "2. Does this number vary across demographic groups? (in this case, race/ethnicity)" 20 |    ] 21 |   }, 22 |   { 23 |    "cell_type": "markdown", 24 |    "metadata": {}, 25 |    "source": [ 26 |     "## Goal:\n", 27 |     "* explore the data files (`acs_data.csv.gz` and `acs_data.dta.gz`) and create a _working dataset_ from them." 28 |    ] 29 |   }, 30 |   { 31 |    "cell_type": "markdown", 32 |    "metadata": {}, 33 |    "source": [ 34 |     "## Context:\n", 35 |     "We obtained American Community Survey (ACS) data from [IPUMS](https://usa.ipums.org/usa/).
\n", 36 | "It contains basic demographics:\n", 37 | " - age\n", 38 | " - gender\n", 39 | " - race/ethnicity\n", 40 | "\n", 41 | "and geographic indicators:\n", 42 | " - state\n", 43 | " - county" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "***" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "#### Step 1: Set up your working environment.\n", 58 | "\n", 59 | "Import all necessary libraries and create `Path`s to your data directories. This ensures reproducibility across file systems (windows uses `\\` instead of `/`)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "We need \n", 67 | "1. `pandas` to work with the data.\n", 68 | "2. `pathlib`, and more specifically its `Path` object, to work with paths. This will ensure our code works in both Windows (which uses `\\` in its file paths) and MacOS/Linux (which uses `/`).\n", 69 | "3. `datetime` - tip: There are version control systems for data but tagging your data files with the date is not a bad first step if you're getting started." 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 1, 75 | "metadata": {}, 76 | "outputs": [ 77 | { 78 | "name": "stdout", 79 | "output_type": "stream", 80 | "text": [ 81 | "27-Apr-19\n" 82 | ] 83 | } 84 | ], 85 | "source": [ 86 | "# setting up working environment\n", 87 | "import _____ as pd\n", 88 | "from _____ import Path\n", 89 | "from datetime import datetime as dt\n", 90 | "today = __.today().strftime(\"%d-%b-%y\")\n", 91 | "\n", 92 | "print(today)" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "_note: even if you are on windows you can type the path forward slashes_ `/` _below_" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 2, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "# data folder and paths\n", 109 | "RAW_DATA_PATH = ____(\"../data/raw/\")\n", 110 | "XXXX_XXXXX_XXXX = ____(\"../data/interim/\")\n", 111 | "YYYY_YYYYY_YYYY = ____(\"../data/processed/\")\n", 112 | "ZZZZ_ZZZZZ_ZZZZ = ____(\"../data/final/\")" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "**NOTE:** I've included a `tools.py` script with the function `tree` which displays a directory's tree (obtained from [RealPython's tutorial on the `pathlib` module](https://realpython.com/python-pathlib/)).\n", 120 | "\n", 121 | " from our tools script import tree so we can use it." 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "tree(________)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "***" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "#### Step 2: Load and explore the data\n", 152 | "\n", 153 | "With `pandas` loading data is as easy as `.read_csv(PATH_TO_CSV_FILE)` and that works most of the time. `Pandas` `read_csv()` is so powerful it'll even read compressed files without any other parameter specification. 
Try the following:\n", 154 |     "\n", 155 |     "```python\n", 156 |     "data = pd.read_csv(RAW_DATA_PATH / 'acs_data.csv.gz')\n", 157 |     "data.head()\n", 158 |     "```\n", 159 |     "_*make sure you change_ `RAW_DATA_PATH` _to match whatever variable name you chose for it earlier._" 160 |    ] 161 |   }, 162 |   { 163 |    "cell_type": "code", 164 |    "execution_count": null, 165 |    "metadata": {}, 166 |    "outputs": [], 167 |    "source": [ 168 |     "\n" 169 |    ] 170 |   }, 171 |   { 172 |    "cell_type": "markdown", 173 |    "metadata": {}, 174 |    "source": [ 175 |     "***\n", 176 |     "IPUMS offers a few data formats which can be more useful [[docs]](https://usa.ipums.org/usa-action/faq#ques12):\n", 177 |     "> In addition to the ASCII data file, the system creates a statistical package syntax file to accompany each extract. The syntax file is designed to read in the ASCII data while applying appropriate variable and value labels. SPSS, SAS, and Stata are supported. You must download the syntax file with the extract or you will be unable to read the data. The syntax file requires minor editing to identify the location of the data file on your local computer." 178 |    ] 179 |   }, 180 |   { 181 |    "cell_type": "markdown", 182 |    "metadata": {}, 183 |    "source": [ 184 |     "In this case, we'll be using a **Stata** file (`.dta`). The main reason is that `.dta` files can store *value labels*, which `pandas` can read and use to convert columns to `Categorical` columns in our pandas DataFrame. This 1) saves memory, and 2) is good practice because certain social sciences really, _really_, ***really*** love Stata, so their interesting datasets are likely `.dta` files. " 185 |    ] 186 |   }, 187 |   { 188 |    "cell_type": "markdown", 189 |    "metadata": {}, 190 |    "source": [ 191 |     "However, `pandas` cannot read compressed `.dta` files directly the way it can `.csv` files. IPUMS uses the *gzip* compression format, and `python` includes a `gzip` module in its standard library.\n", 192 |     "\n", 193 |     "**Import** gzip and try the following:\n", 194 |     "```python\n", 195 |     "with gzip.open(RAW_DATA_PATH / 'acs_data.dta.gz') as file:\n", 196 |     "    data = pd.read_stata(file)\n", 197 |     "```\n", 198 |     "\n", 199 |     "and then display the first five rows of your `data` DataFrame." 200 |    ] 201 |   }, 202 |   { 203 |    "cell_type": "code", 204 |    "execution_count": null, 205 |    "metadata": {}, 206 |    "outputs": [], 207 |    "source": [ 208 |     "# import gzip and load data\n", 209 |     "\n" 210 |    ] 211 |   }, 212 |   { 213 |    "cell_type": "code", 214 |    "execution_count": null, 215 |    "metadata": {}, 216 |    "outputs": [], 217 |    "source": [ 218 |     "# display first 5 rows\n" 219 |    ] 220 |   }, 221 |   { 222 |    "cell_type": "markdown", 223 |    "metadata": {}, 224 |    "source": [ 225 |     "***" 226 |    ] 227 |   }, 228 |   { 229 |    "cell_type": "markdown", 230 |    "metadata": {}, 231 |    "source": [ 232 |     "#### Step 3: Familiarize yourself with the dataset\n", 233 |     "\n", 234 |     "We've already seen `.head()` - the `pandas` method that will display the first 5 rows of your DataFrame. This gives you an idea of what your data looks like. However, there is a lot more `.info()` you can get out of your dataframe. You can also just ask the data to `.describe()` itself..."
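If you'd like to see these methods in action before filling in the blanks below, here is a minimal sketch on a toy frame (the toy data is invented; your real columns come from the IPUMS extract):

```python
import pandas as pd

# toy stand-in for the IPUMS extract -- values are made up
toy = pd.DataFrame({
    "statefip": ["ohio", "ohio", "california"],
    "age": [9, 48, 30],
})

toy.head()        # first 5 rows
toy.info()        # column dtypes, non-null counts, memory usage
toy.describe()    # summary statistics for the numeric columns
print(toy.shape)  # (rows, columns) -- .shape is an attribute, no parentheses
```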
235 |    ] 236 |   }, 237 |   { 238 |    "cell_type": "code", 239 |    "execution_count": null, 240 |    "metadata": {}, 241 |    "outputs": [], 242 |    "source": [ 243 |     "# find out more info about your dataframe\n", 244 |     "data.____()" 245 |    ] 246 |   }, 247 |   { 248 |    "cell_type": "code", 249 |    "execution_count": null, 250 |    "metadata": {}, 251 |    "outputs": [], 252 |    "source": [ 253 |     "# describing your data\n", 254 |     "data.____()" 255 |    ] 256 |   }, 257 |   { 258 |    "cell_type": "markdown", 259 |    "metadata": {}, 260 |    "source": [ 261 |     "Check out the `shape` of your data with its `.shape` attribute. Notice the lack of parentheses." 262 |    ] 263 |   }, 264 |   { 265 |    "cell_type": "code", 266 |    "execution_count": null, 267 |    "metadata": {}, 268 |    "outputs": [], 269 |    "source": [ 270 |     "data._____" 271 |    ] 272 |   }, 273 |   { 274 |    "cell_type": "markdown", 275 |    "metadata": {}, 276 |    "source": [ 277 |     "***" 278 |    ] 279 |   }, 280 |   { 281 |    "cell_type": "markdown", 282 |    "metadata": {}, 283 |    "source": [ 284 |     "#### Step 4: Trim your data\n", 285 |     "\n", 286 |     "Right now you're working with your **masterfile** - a dataset containing everything you _could_ need for your analysis. You don't really want to modify this dataset because you might be using it for other analyses. For example, we're going to be analyzing access to high-speed internet in a state of your choosing, but next week you might want to run the same analysis on another state or maybe just on a specific county. To make sure you can **reuse** your data and code later, let's create an _analytical file_ or a _working dataset_: a dataset that contains only the data needed for **this** specific analysis at hand." 287 |    ] 288 |   }, 289 |   { 290 |    "cell_type": "markdown", 291 |    "metadata": {}, 292 |    "source": [ 293 |     "First, we are only interested in finding the _\"Digital Divide\"_ of one state right now. The **masterfile** contains data for all 50 states and the District of Columbia. \n", 294 |     "\n", 295 |     "What you want to do is find all the rows where the `statefip` matches your state's name. This is called boolean indexing.\n", 296 |     "\n", 297 |     "Try the following:\n", 298 |     "```python\n", 299 |     "data['statefip'] == 'ohio'\n", 300 |     "```\n", 301 |     "_Note: you can change 'ohio' to any of the other states or 'district of columbia' for DC._" 302 |    ] 303 |   }, 304 |   { 305 |    "cell_type": "code", 306 |    "execution_count": null, 307 |    "metadata": {}, 308 |    "outputs": [], 309 |    "source": [ 310 |     "# try boolean indexing\n", 311 |     "___['______'] == '_______'" 312 |    ] 313 |   }, 314 |   { 315 |    "cell_type": "markdown", 316 |    "metadata": {}, 317 |    "source": [ 318 |     "This is going to return a `pandas.Series` of booleans (Trues and Falses), which you can then use to filter out any unnecessary rows.\n", 319 |     "\n", 320 |     "It's good practice to save these as a variable early in your code (if you know them beforehand) or right before you use them, in case you use these conditionals in more than one place. 
This is going to save you time if you decide to change the value you're comparing, swapping `'ohio'` for `'california'`, for example.\n", 321 |     "\n", 322 |     "```python\n", 323 |     "mask_state = (data['statefip'] == 'ohio')\n", 324 |     "data[mask_state].head()\n", 325 |     "```" 326 |    ] 327 |   }, 328 |   { 329 |    "cell_type": "code", 330 |    "execution_count": null, 331 |    "metadata": {}, 332 |    "outputs": [], 333 |    "source": [ 334 |     "# try it yourself\n", 335 |     "mask_state = (________________________ == _______)\n", 336 |     "data[mask_state].____()" 337 |    ] 338 |   }, 339 |   { 340 |    "cell_type": "markdown", 341 |    "metadata": {}, 342 |    "source": [ 343 |     "let's save it to another variable with a more useful name:\n", 344 |     "\n", 345 |     "```python\n", 346 |     "state_data = data[mask_state].copy()\n", 347 |     "```" 348 |    ] 349 |   }, 350 |   { 351 |    "cell_type": "markdown", 352 |    "metadata": {}, 353 |    "source": [ 354 |     "You have to use `.copy()` to create actual copies of the data. If you ran\n", 355 |     "```python\n", 356 |     "state_data = data[mask_state]\n", 357 |     "```\n", 358 |     "`state_data` would be a _view_ of the `data` dataframe. This can have unintended consequences down the road if you modify your dataframes. A lot of the time you'd get just a warning and your code will run just as intended - but why take risks, right?" 359 |    ] 360 |   }, 361 |   { 362 |    "cell_type": "code", 363 |    "execution_count": null, 364 |    "metadata": {}, 365 |    "outputs": [], 366 |    "source": [ 367 |     "# save your data to state_data\n", 368 |     "state_data = __________.copy()" 369 |    ] 370 |   }, 371 |   { 372 |    "cell_type": "markdown", 373 |    "metadata": {}, 374 |    "source": [ 375 |     "Now, let's see what `.columns` we have in our dataframe. You can find these the same way you found the `.shape` of it earlier." 376 |    ] 377 |   }, 378 |   { 379 |    "cell_type": "code", 380 |    "execution_count": null, 381 |    "metadata": {}, 382 |    "outputs": [], 383 |    "source": [ 384 |     "state_data._____" 385 |    ] 386 |   }, 387 |   { 388 |    "cell_type": "markdown", 389 |    "metadata": {}, 390 |    "source": [ 391 |     "Are there any columns that you are **confident** you don't need? If you are not 90% sure you won't need a variable, don't drop it. \n", 392 |     "\n", 393 |     "Dropping columns is as easy as calling `.drop()` on your dataframe.\n", 394 |     "\n", 395 |     "```python\n", 396 |     "state_data.drop(columns = ['list', 'of', 'columns', 'to', 'drop'])\n", 397 |     "```" 398 |    ] 399 |   }, 400 |   { 401 |    "cell_type": "code", 402 |    "execution_count": null, 403 |    "metadata": {}, 404 |    "outputs": [], 405 |    "source": [] 406 |   }, 407 |   { 408 |    "cell_type": "markdown", 409 |    "metadata": {}, 410 |    "source": [ 411 |     "If there are variables you _think_ you won't need but you're not very sure that's the case, you should explore them. \n", 412 |     "\n", 413 |     "A `pandas` dataframe's columns are `pandas.Series`, and they have methods and attributes just like dataframes." 414 |    ] 415 |   }, 416 |   { 417 |    "cell_type": "markdown", 418 |    "metadata": {}, 419 |    "source": [ 420 |     "Let's explore the variable `gq`, which stands for `Group Quarters`. From the IPUMS [docs](https://usa.ipums.org/usa-action/variables/GQ#description_section):\n", 421 |     ">Group quarters are largely institutions and other group living arrangements, such as rooming houses and military barracks." 422 |    ] 423 |   }, 424 |   { 425 |    "cell_type": "markdown", 426 |    "metadata": {}, 427 |    "source": [ 428 |     "Let's see what `.unique()` values the `state_data['gq']` series has..."
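In case you get stuck, the pattern looks like this (a sketch on an invented series; the real `gq` labels come from the IPUMS extract):

```python
import pandas as pd

gq = pd.Series([
    "households under 1970 definition",
    "households under 1970 definition",
    "additional households under 1990 definition",
])

print(gq.unique())        # array of the distinct values
print(gq.value_counts())  # observations per value, most frequent first
```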
429 |    ] 430 |   }, 431 |   { 432 |    "cell_type": "code", 433 |    "execution_count": null, 434 |    "metadata": {}, 435 |    "outputs": [], 436 |    "source": [] 437 |   }, 438 |   { 439 |    "cell_type": "markdown", 440 |    "metadata": {}, 441 |    "source": [ 442 |     "We can also check `.value_counts()`, which would give us a better idea of how useful this column might be. For example, if a column has 2 values but 99% of the observations have one value and 1% have the other - you could drop the column altogether since it might not add a lot of value to your analysis. \n", 443 |     "\n", 444 |     "Some variables have 100% of their rows with the same value... \*cough\* \*cough\* `state_data['year']`..." 445 |    ] 446 |   }, 447 |   { 448 |    "cell_type": "code", 449 |    "execution_count": null, 450 |    "metadata": {}, 451 |    "outputs": [], 452 |    "source": [] 453 |   }, 454 |   { 455 |    "cell_type": "markdown", 456 |    "metadata": {}, 457 |    "source": [ 458 |     "From IPUMS [docs](https://usa.ipums.org/usa-action/variables/GQ#comparability_section):\n", 459 |     ">There are three slightly different definitions of group quarters in the IPUMS. For the period 1940-1970 (excluding the 1940 100% dataset), group quarters are housing units with five or more individuals unrelated to the householder. Before 1940 and in 1980-1990, units with 10 or more individuals unrelated to the householder are considered group quarters. **In the 2000 census, 2010 census, the ACS and the PRCS, no threshold was applied; for a household to be considered group quarters, it had to be on a list of group quarters that is continuously maintained by the Census Bureau. In earlier years, a similar list was used, with the unrelated-persons rule imposed as a safeguard.**" 460 |    ] 461 |   }, 462 |   { 463 |    "cell_type": "markdown", 464 |    "metadata": {}, 465 |    "source": [ 466 |     "Because of this, and the fact that most of our observations fall into the 1970 and 1990 definitions, we'll stick to those 2 for our analysis." 467 |    ] 468 |   }, 469 |   { 470 |    "cell_type": "markdown", 471 |    "metadata": {}, 472 |    "source": [ 473 |     "Let's create another _mask_ to filter out households that don't fit our definition.\n", 474 |     "\n", 475 |     "For multiple conditions we use the `&` and `|` operators (**and** and **or**, respectively)" 476 |    ] 477 |   }, 478 |   { 479 |    "cell_type": "code", 480 |    "execution_count": null, 481 |    "metadata": {}, 482 |    "outputs": [], 483 |    "source": [ 484 |     "mask_household = ( CONDITION ONE ) | ( CONDITION TWO )" 485 |    ] 486 |   }, 487 |   { 488 |    "cell_type": "markdown", 489 |    "metadata": {}, 490 |    "source": [ 491 |     "**note**: another benefit of having categorical variables is that, **if** they are ordered, you can use the `<`, `>` operators for conditions as well.\n", 492 |     "```python\n", 493 |     "mask_household = (state_data['gq'] <= 'additional households under 1990 definition')\n", 494 |     "```" 495 |    ] 496 |   }, 497 |   { 498 |    "cell_type": "markdown", 499 |    "metadata": {}, 500 |    "source": [ 501 |     "_note: since you are overwriting_ `state_data` _you don't need to use_ `.copy()` _but it doesn't hurt, and if you're a beginner at_ `pandas` _it's good practice for when you actually need to use_ `.copy()`." 502 |    ] 503 |   }, 504 |   { 505 |    "cell_type": "code", 506 |    "execution_count": null, 507 |    "metadata": {}, 508 |    "outputs": [], 509 |    "source": [ 510 |     "state_data = state_data[mask_household].______()" 511 |    ] 512 |   }, 513 |   { 514 |    "cell_type": "markdown", 515 |    "metadata": {}, 516 |    "source": [ 517 |     "At this point you're really close to a `working_data` dataset. You have:\n", 518 |     "1. 
Kept one state's information and dropped the rest.\n", 519 |     "2. Kept only those _households_ you're interested in and dropped the rest." 520 |    ] 521 |   }, 522 |   { 523 |    "cell_type": "markdown", 524 |    "metadata": {}, 525 |    "source": [ 526 |     "Our research question 1 is: \"What share of households in X state have access to high-speed internet?\"\n", 527 |     "\n", 528 |     "Mathematically, \n", 529 |     "$$ \\frac{households\\ with\\ high\\ speed\\ internet}{households\\ in\\ state}$$\n", 530 |     "\n", 531 |     "Your `state_data` dataset contains all you need to find the answer. " 532 |    ] 533 |   }, 534 |   { 535 |    "cell_type": "markdown", 536 |    "metadata": {}, 537 |    "source": [ 538 |     "***" 539 |    ] 540 |   }, 541 |   { 542 |    "cell_type": "markdown", 543 |    "metadata": {}, 544 |    "source": [ 545 |     "#### Step 5: Save your data\n", 546 |     "\n", 547 |     "Now that you have trimmed your **masterfile** into a `working_data` dataset, you should save it. \n", 548 |     "\n", 549 |     "We've been working with a `.dta` file and it'd be best if we keep it that way. \n", 550 |     "\n", 551 |     "Try the following:\n", 552 |     "```python\n", 553 |     "state_data.to_stata(INTERIM_DATA_PATH / f'state_data-{today}.dta', write_index = False)\n", 554 |     "```" 555 |    ] 556 |   }, 557 |   { 558 |    "cell_type": "markdown", 559 |    "metadata": {}, 560 |    "source": [ 561 |     "A few things:\n", 562 |     "1. We're using `f-strings` to tag our datafile with today's date.\n", 563 |     "2. You're turning off the `write_index` flag so you don't add an 'index' column to your `.dta` file. In this dataset, our index isn't meaningful. In other analyses you might have a meaningful index and won't want to turn this flag off." 564 |    ] 565 |   }, 566 |   { 567 |    "cell_type": "code", 568 |    "execution_count": null, 569 |    "metadata": {}, 570 |    "outputs": [], 571 |    "source": [] 572 |   }, 573 |   { 574 |    "cell_type": "markdown", 575 |    "metadata": {}, 576 |    "source": [ 577 |     "***" 578 |    ] 579 |   }, 580 |   { 581 |    "cell_type": "markdown", 582 |    "metadata": {}, 583 |    "source": [ 584 |     "#### Step 6: Bonus\n", 585 |     "What if we changed our research question a little bit, from
_\"What share of households in X state have access to high-speed internet?_
to
_\"What share of households **with school-age children** in X state have access to high-speed internet?\"_" 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "This would be an interesting statistic to policy-makers, especially if we find discrepancies across demographic groups (research question 2).\n", 593 | "\n", 594 | "The challenge here is that the **unit of observation** in our `state_data` file is a (weighted) person and we want to _filter_ out those **households** without any school-age children in them. This might sound a little complicated at first but it just requires modifying our previous workflow just a little.\n", 595 | "\n", 596 | "We need to do a few things:\n", 597 | "1. Define what we mean by school-age children.\n", 598 | "2. Create a _mask_ to grab all households where these children are.\n", 599 | "3. Create a list of unduplicated household identifiers (`'serial'`) \n", 600 | "4. Use that list to drop unwanted observations." 601 | ] 602 | }, 603 | { 604 | "cell_type": "markdown", 605 | "metadata": {}, 606 | "source": [ 607 | "#### Step 6.1: School-age children\n", 608 | "\n", 609 | "Most people would agree school age (Elementary through High School) is 6 - 17 year olds. Some people are interested in K-12 (5 - 17 or 18). Some people wouldn't include 18 year olds. Whatever measure you choose you must be able to defend why you are choosing it. \n", 610 | "\n", 611 | "For this analysis, I'll suggest we use 5 - 18 year olds (K-12) but you can choose whatever age range you want. Maybe high-school kids 14-18? That'd be interesting, you probably need access to high-speed internet at home a lot more in high school than you do in kindergarden. " 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": {}, 618 | "outputs": [], 619 | "source": [ 620 | "mask_children = (state_data['age'] >= ___) & (___________________ <= )" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | " What data type is state_data['age'] again? \n", 628 | "
\n", 629 | " Categorical. This means that you even though its values _look_ like numbers, they're actually _value labels_ aka strings.\n", 630 | "
\n", 631 | "
" 632 | ] 633 | }, 634 | { 635 | "cell_type": "markdown", 636 | "metadata": {}, 637 | "source": [ 638 | "Now that we have our _mask_, we can use it to create a list of households with children in them.\n", 639 | "\n", 640 | "Earlier we applied a mask to a dataframe and saved it to another variable. Here, we'll go a step further and grab just a column of that _filtered out_ dataframe.\n", 641 | "\n", 642 | "Try it yourself first.\n", 643 | "\n", 644 | "*Hint: How did we grab and explore a single column of a dataframe earlier?*" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "metadata": {}, 651 | "outputs": [], 652 | "source": [ 653 | "households_with_children = _________________________________________" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": null, 659 | "metadata": {}, 660 | "outputs": [], 661 | "source": [ 662 | "households_with_children.head()" 663 | ] 664 | }, 665 | { 666 | "cell_type": "markdown", 667 | "metadata": {}, 668 | "source": [ 669 | "How do you think we can `.drop_duplicates()`?" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": null, 675 | "metadata": {}, 676 | "outputs": [], 677 | "source": [] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "Once you have your unduplicated list of households with children all you have to do is to check if a `serial` value from our `state_data` dataset `.isin()` our `households_with_children` series." 684 | ] 685 | }, 686 | { 687 | "cell_type": "code", 688 | "execution_count": null, 689 | "metadata": {}, 690 | "outputs": [], 691 | "source": [ 692 | "\n" 693 | ] 694 | }, 695 | { 696 | "cell_type": "markdown", 697 | "metadata": {}, 698 | "source": [ 699 | "Let's save that as our `working_data` dataset and save that to memory." 
700 |    ] 701 |   }, 702 |   { 703 |    "cell_type": "code", 704 |    "execution_count": null, 705 |    "metadata": {}, 706 |    "outputs": [], 707 |    "source": [ 708 |     "working_data = _____________________" 709 |    ] 710 |   }, 711 |   { 712 |    "cell_type": "markdown", 713 |    "metadata": {}, 714 |    "source": [ 715 |     "```python\n", 716 |     "working_data.to_stata(INTERIM_DATA_PATH / f'working_data-{today}.dta', write_index = False)\n", 717 |     "```" 718 |    ] 719 |   }, 720 |   { 721 |    "cell_type": "code", 722 |    "execution_count": null, 723 |    "metadata": {}, 724 |    "outputs": [], 725 |    "source": [] 726 |   } 727 |  ], 728 |  "metadata": { 729 |   "kernelspec": { 730 |    "display_name": "Python 3", 731 |    "language": "python", 732 |    "name": "python3" 733 |   }, 734 |   "language_info": { 735 |    "codemirror_mode": { 736 |     "name": "ipython", 737 |     "version": 3 738 |    }, 739 |    "file_extension": ".py", 740 |    "mimetype": "text/x-python", 741 |    "name": "python", 742 |    "nbconvert_exporter": "python", 743 |    "pygments_lexer": "ipython3", 744 |    "version": "3.7.3" 745 |   } 746 |  }, 747 |  "nbformat": 4, 748 |  "nbformat_minor": 2 749 | } 750 | -------------------------------------------------------------------------------- /exercises/notebooks/01_DigitalDivide_Analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 |  "cells": [ 3 |   { 4 |    "cell_type": "markdown", 5 |    "metadata": {}, 6 |    "source": [ 7 |     "# Project 1: Digital Divide\n", 8 |     "### Data Analysis\n", 9 |     "\n", 10 |     "#### Based on PPIC's Just the Facts report [\"California's Digital Divide\"](https://www.ppic.org/publication/californias-digital-divide/)" 11 |    ] 12 |   }, 13 |   { 14 |    "cell_type": "markdown", 15 |    "metadata": {}, 16 |    "source": [ 17 |     "## Research Question(s):\n", 18 |     "1. What share of households with school-age children in X state have access to high-speed internet? \n", 19 |     "2. Does this number vary across demographic groups? (in this case, race/ethnicity)" 20 |    ] 21 |   }, 22 |   { 23 |    "cell_type": "markdown", 24 |    "metadata": {}, 25 |    "source": [ 26 |     "## Goal:\n", 27 |     "* Use our `working_data` dataset (created in the [Data Prep notebook](00_DigitalDivide_Data_Prep.ipynb)) to answer our research questions." 28 |    ] 29 |   }, 30 |   { 31 |    "cell_type": "markdown", 32 |    "metadata": {}, 33 |    "source": [ 34 |     "## Context:\n", 35 |     "* Write yourself a description of the context: Include a description of the data (_data set contains X state's data for YYYY year_)" 36 |    ] 37 |   }, 38 |   { 39 |    "cell_type": "markdown", 40 |    "metadata": {}, 41 |    "source": [ 42 |     "***" 43 |    ] 44 |   }, 45 |   { 46 |    "cell_type": "markdown", 47 |    "metadata": {}, 48 |    "source": [ 49 |     "#### Step 1: Set up your working environment.\n", 50 |     "\n", 51 |     "Import all necessary libraries and create `Path`s to your data directories. This ensures reproducibility across file systems (Windows uses `\\` instead of `/`)." 52 |    ] 53 |   }, 54 |   { 55 |    "cell_type": "markdown", 56 |    "metadata": {}, 57 |    "source": [ 58 |     "We need \n", 59 |     "1. `pandas` to work with the data.\n", 60 |     "2. `pathlib`, and more specifically its `Path` object, to work with paths. This will ensure our code works in both Windows (which uses `\\` in its file paths) and MacOS/Linux (which uses `/`) - see the sketch after this list.\n", 61 |     "3. `datetime` - tip: There are version control systems for data, but tagging your data files with the date is not a bad first step if you're getting started.\n", 62 |     "4. `tree` - to display a directory's tree."
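A quick illustration of why `Path` objects buy you that portability (the file name matches the one saved by the Data Prep notebook):

```python
from pathlib import Path

INTERIM_DATA_PATH = Path("../data/interim/")

# the / operator joins path pieces and renders them correctly on every OS
stata_file = INTERIM_DATA_PATH / "working_data-01-May-19.dta"
print(stata_file)  # uses \ on Windows and / on MacOS/Linux
```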
63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 1, 68 | "metadata": {}, 69 | "outputs": [ 70 | { 71 | "name": "stdout", 72 | "output_type": "stream", 73 | "text": [ 74 | "27-Apr-19\n" 75 | ] 76 | } 77 | ], 78 | "source": [ 79 | "# setting up working environment\n", 80 | "import _____ as pd\n", 81 | "from _____ import Path\n", 82 | "from tools import _________\n", 83 | "from ______ import _______ as dt\n", 84 | "today = dt.______()._______(\"%_-%_-%_\")\n", 85 | "\n", 86 | "print(today)" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 2, 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "# data folder and paths\n", 96 | "RAW_DATA_PATH = ____(\"../data/raw/\")\n", 97 | "XXXX_XXXXX_XXXX = ____(\"../data/interim/\")\n", 98 | "YYYY_YYYYY_YYYY = ____(\"../data/processed/\")\n", 99 | "ZZZZ_ZZZZZ_ZZZZ = ____(\"../data/final/\")" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "tree(INTERIM_DATA_PATH)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 4, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "data = pd.read_stata(INTERIM_DATA_PATH / f'working_data-{today}.dta')" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 5, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "(44816, 14)" 129 | ] 130 | }, 131 | "execution_count": 5, 132 | "metadata": {}, 133 | "output_type": "execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "data._______" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 6, 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "data": { 147 | "text/html": [ 148 | "
\n", 149 | "\n", 162 | "\n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | "
yearserialhhwtstateicpcountyfipcinethhcihispeedpernumperwtrelatesexageracehispan
0201795366257ohio0yes, with a subscription to an internet serviceyes (cable modem, fiber optic or dsl service)158head/householderfemale48whitenot hispanic
1201795366257ohio0yes, with a subscription to an internet serviceyes (cable modem, fiber optic or dsl service)262childmale20whitenot hispanic
2201795366257ohio0yes, with a subscription to an internet serviceyes (cable modem, fiber optic or dsl service)378childfemale9whitenot hispanic
32017953668140ohio61yes, with a subscription to an internet serviceyes (cable modem, fiber optic or dsl service)1140head/householdermale28black/african american/negronot hispanic
42017953668140ohio61yes, with a subscription to an internet serviceyes (cable modem, fiber optic or dsl service)2192siblingfemale16black/african american/negronot hispanic
\n", 270 | "
" 271 | ], 272 | "text/plain": [ 273 | " year serial hhwt stateicp countyfip \\\n", 274 | "0 2017 953662 57 ohio 0 \n", 275 | "1 2017 953662 57 ohio 0 \n", 276 | "2 2017 953662 57 ohio 0 \n", 277 | "3 2017 953668 140 ohio 61 \n", 278 | "4 2017 953668 140 ohio 61 \n", 279 | "\n", 280 | " cinethh \\\n", 281 | "0 yes, with a subscription to an internet service \n", 282 | "1 yes, with a subscription to an internet service \n", 283 | "2 yes, with a subscription to an internet service \n", 284 | "3 yes, with a subscription to an internet service \n", 285 | "4 yes, with a subscription to an internet service \n", 286 | "\n", 287 | " cihispeed pernum perwt \\\n", 288 | "0 yes (cable modem, fiber optic or dsl service) 1 58 \n", 289 | "1 yes (cable modem, fiber optic or dsl service) 2 62 \n", 290 | "2 yes (cable modem, fiber optic or dsl service) 3 78 \n", 291 | "3 yes (cable modem, fiber optic or dsl service) 1 140 \n", 292 | "4 yes (cable modem, fiber optic or dsl service) 2 192 \n", 293 | "\n", 294 | " relate sex age race hispan \n", 295 | "0 head/householder female 48 white not hispanic \n", 296 | "1 child male 20 white not hispanic \n", 297 | "2 child female 9 white not hispanic \n", 298 | "3 head/householder male 28 black/african american/negro not hispanic \n", 299 | "4 sibling female 16 black/african american/negro not hispanic " 300 | ] 301 | }, 302 | "execution_count": 6, 303 | "metadata": {}, 304 | "output_type": "execute_result" 305 | } 306 | ], 307 | "source": [ 308 | "data._______()" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 7, 314 | "metadata": {}, 315 | "outputs": [ 316 | { 317 | "name": "stdout", 318 | "output_type": "stream", 319 | "text": [ 320 | "\n", 321 | "Int64Index: 44816 entries, 0 to 44815\n", 322 | "Data columns (total 14 columns):\n", 323 | "year 44816 non-null category\n", 324 | "serial 44816 non-null int32\n", 325 | "hhwt 44816 non-null int16\n", 326 | "stateicp 44816 non-null category\n", 327 | "countyfip 44816 non-null int16\n", 328 | "cinethh 44816 non-null category\n", 329 | "cihispeed 44816 non-null category\n", 330 | "pernum 44816 non-null int8\n", 331 | "perwt 44816 non-null int16\n", 332 | "relate 44816 non-null category\n", 333 | "sex 44816 non-null category\n", 334 | "age 44816 non-null category\n", 335 | "race 44816 non-null category\n", 336 | "hispan 44816 non-null category\n", 337 | "dtypes: category(9), int16(3), int32(1), int8(1)\n", 338 | "memory usage: 1.2 MB\n" 339 | ] 340 | } 341 | ], 342 | "source": [ 343 | "data._____()" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "Our **unit of observation** is still a (weighted) person but we're interested in **household-level** data. \n", 351 | "\n", 352 | "From IPUMS docs:\n", 353 | ">HHWT indicates how many households in the U.S. population are represented by a given household in an IPUMS sample.

\n", 354 | ">It is generally a good idea to use HHWT when conducting a household-level analysis of any IPUMS sample. The use of HHWT is optional when analyzing one of the \"flat\" or unweighted IPUMS samples. Flat IPUMS samples include the 1% samples from 1850-1930, all samples from 1960, 1970, and 1980, the 1% unweighted samples from 1990 and 2000, the 10% 2010 sample, and any of the full count 100% census datasets. HHWT must be used to obtain nationally representative statistics for household-level analyses of any sample other than those.

\n", 355 | ">**Users should also be sure to select one person (e.g., PERNUM = 1) to represent the entire household.**" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "***" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "#### Step 2: Drop all observations were `pernum` doesn't equal 1" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": null, 375 | "metadata": {}, 376 | "outputs": [], 377 | "source": [ 378 | "mask_pernum = (________ _= 1)" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": null, 384 | "metadata": {}, 385 | "outputs": [], 386 | "source": [ 387 | "data[mask_pernum].shape" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "metadata": {}, 393 | "source": [ 394 | "Save your data to an appropriately named variable." 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "metadata": {}, 401 | "outputs": [], 402 | "source": [ 403 | "state_households = ____[_________]" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "***" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": {}, 416 | "source": [ 417 | "#### Step 3: Familiarize yourself with your variables of interest" 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "From IPUMS [docs](https://usa.ipums.org/usa-action/variables/CINETHH#description_section):\n", 425 | "\n", 426 | ">CINETHH reports whether any member of the household accesses the Internet. Here, \"access\" refers to whether or not someone in the household uses or connects to the Internet, regardless of whether or not they pay for the service." 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "metadata": {}, 433 | "outputs": [], 434 | "source": [ 435 | "# find the value_counts for your cinethh series\n" 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "metadata": {}, 441 | "source": [ 442 | "From IPUMS [docs](https://usa.ipums.org/usa-action/variables/CIHISPEED#description_section):\n", 443 | ">CIHISPEED reports whether the respondent or any member of their household subscribed to the Internet using broadband (high speed) Internet service such as cable, fiber optic, or DSL service.

\n", 444 | ">User Note: The ACS 2016 introduced changes to the questions regarding computer use and Internet access. See the comparability section and questionnaire text for more information. Additional information provided by the Census Bureau regarding these question alterations are available in the report: ACS Content Test Shows Need to Update Terminology" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [ 453 | "# find the value_counts for your cihispeed series\n" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "_quick tip_ `.value_counts()` _has a_ `normalize` _parameter:_" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 13, 466 | "metadata": {}, 467 | "outputs": [ 468 | { 469 | "data": { 470 | "text/plain": [ 471 | "\u001b[0;31mSignature:\u001b[0m\n", 472 | "\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSeries\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue_counts\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n", 473 | "\u001b[0;34m\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", 474 | "\u001b[0;34m\u001b[0m \u001b[0mnormalize\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", 475 | "\u001b[0;34m\u001b[0m \u001b[0msort\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", 476 | "\u001b[0;34m\u001b[0m \u001b[0mascending\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", 477 | "\u001b[0;34m\u001b[0m \u001b[0mbins\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", 478 | "\u001b[0;34m\u001b[0m \u001b[0mdropna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", 479 | "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 480 | "\u001b[0;31mDocstring:\u001b[0m\n", 481 | "Return a Series containing counts of unique values.\n", 482 | "\n", 483 | "The resulting object will be in descending order so that the\n", 484 | "first element is the most frequently-occurring element.\n", 485 | "Excludes NA values by default.\n", 486 | "\n", 487 | "Parameters\n", 488 | "----------\n", 489 | "normalize : boolean, default False\n", 490 | " If True then the object returned will contain the relative\n", 491 | " frequencies of the unique values.\n", 492 | "sort : boolean, default True\n", 493 | " Sort by values.\n", 494 | "ascending : boolean, default False\n", 495 | " Sort in ascending order.\n", 496 | "bins : integer, optional\n", 497 | " Rather than count values, group them into half-open bins,\n", 498 | " a convenience for ``pd.cut``, only works with numeric data.\n", 499 | "dropna : boolean, default True\n", 500 | " Don't include counts of NaN.\n", 501 | "\n", 502 | "Returns\n", 503 | "-------\n", 504 | "counts : Series\n", 505 | "\n", 506 | "See Also\n", 507 | "--------\n", 508 | "Series.count: Number of non-NA elements in a Series.\n", 509 | "DataFrame.count: Number of non-NA elements in a DataFrame.\n", 510 | "\n", 511 | "Examples\n", 512 | "--------\n", 513 | ">>> index = pd.Index([3, 1, 2, 3, 4, np.nan])\n", 514 | ">>> index.value_counts()\n", 515 | "3.0 2\n", 516 | "4.0 1\n", 517 | "2.0 1\n", 518 | "1.0 1\n", 519 | "dtype: int64\n", 520 | "\n", 521 | "With `normalize` set to `True`, returns 
the relative frequency by\n", 522 | "dividing all values by the sum of values.\n", 523 | "\n", 524 | ">>> s = pd.Series([3, 1, 2, 3, 4, np.nan])\n", 525 | ">>> s.value_counts(normalize=True)\n", 526 | "3.0 0.4\n", 527 | "4.0 0.2\n", 528 | "2.0 0.2\n", 529 | "1.0 0.2\n", 530 | "dtype: float64\n", 531 | "\n", 532 | "**bins**\n", 533 | "\n", 534 | "Bins can be useful for going from a continuous variable to a\n", 535 | "categorical variable; instead of counting unique\n", 536 | "apparitions of values, divide the index in the specified\n", 537 | "number of half-open bins.\n", 538 | "\n", 539 | ">>> s.value_counts(bins=3)\n", 540 | "(2.0, 3.0] 2\n", 541 | "(0.996, 2.0] 2\n", 542 | "(3.0, 4.0] 1\n", 543 | "dtype: int64\n", 544 | "\n", 545 | "**dropna**\n", 546 | "\n", 547 | "With `dropna` set to `False` we can also see NaN index values.\n", 548 | "\n", 549 | ">>> s.value_counts(dropna=False)\n", 550 | "3.0 2\n", 551 | "NaN 1\n", 552 | "4.0 1\n", 553 | "2.0 1\n", 554 | "1.0 1\n", 555 | "dtype: int64\n", 556 | "\u001b[0;31mFile:\u001b[0m /anaconda3/envs/pycon/lib/python3.7/site-packages/pandas/core/base.py\n", 557 | "\u001b[0;31mType:\u001b[0m function\n" 558 | ] 559 | }, 560 | "metadata": {}, 561 | "output_type": "display_data" 562 | } 563 | ], 564 | "source": [ 565 | "pd.Series.value_counts?" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": null, 571 | "metadata": {}, 572 | "outputs": [], 573 | "source": [ 574 | "# try it on your cinethh series\n", 575 | "\n" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": {}, 582 | "outputs": [], 583 | "source": [ 584 | "# on cihispeed \n", 585 | "\n" 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "***" 593 | ] 594 | }, 595 | { 596 | "cell_type": "markdown", 597 | "metadata": {}, 598 | "source": [ 599 | "This would be the end of our analysis if we weren't working with **weighted** data. **Weighted** data means each of our observations represent more than one person or household.\n", 600 | "\n", 601 | "`perwt` = \"Person's weight\"\n", 602 | "\n", 603 | "`hhwt` = \"Household's weight\"" 604 | ] 605 | }, 606 | { 607 | "cell_type": "markdown", 608 | "metadata": {}, 609 | "source": [ 610 | "`.value_counts(normalize=True)` counts the number of **observations** for each of a series' values and then divides it by the total count. If each of our observations was 1 person/household, we would have the answer already. " 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "metadata": {}, 616 | "source": [ 617 | "What we need to do is **aggregate**." 618 | ] 619 | }, 620 | { 621 | "cell_type": "markdown", 622 | "metadata": {}, 623 | "source": [ 624 | "***" 625 | ] 626 | }, 627 | { 628 | "cell_type": "markdown", 629 | "metadata": {}, 630 | "source": [ 631 | "#### Step 4: Grouping and aggregating data\n", 632 | "\n", 633 | "The mechanics are kind of the same: \n", 634 | "1. Count the number of observations each that match each of the values in a series.\n", 635 | "2. Add up **not the number of observations** but the weight of each observation.\n", 636 | "3. Divide by the total." 
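]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_A minimal sketch of those three steps, assuming the_ `state_households` _dataframe and the_ `cihispeed`_/_`hhwt` _columns we use below:_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 1. and 2. group by each cihispeed value and add up the weight ('hhwt')\n",
"# of the observations in each group, not the number of rows\n",
"weighted_counts = state_households.groupby(\"cihispeed\")[\"hhwt\"].sum()\n",
"\n",
"# 3. divide by the total weight to turn the counts into shares\n",
"weighted_counts / state_households[\"hhwt\"].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare these weighted shares to your unweighted `.value_counts(normalize=True)` results from earlier. Let's build this up step by step."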
637 | ]
638 | },
639 | {
640 | "cell_type": "markdown",
641 | "metadata": {},
642 | "source": [
643 | "#### Step 4.1: Group your data by their corresponding values"
644 | ]
645 | },
646 | {
647 | "cell_type": "code",
648 | "execution_count": 17,
649 | "metadata": {},
650 | "outputs": [
651 | {
652 | "data": {
653 | "text/plain": [
654 | ""
655 | ]
656 | },
657 | "execution_count": 17,
658 | "metadata": {},
659 | "output_type": "execute_result"
660 | }
661 | ],
662 | "source": [
663 | "state_households.groupby(\"_________\")"
664 | ]
665 | },
666 | {
667 | "cell_type": "markdown",
668 | "metadata": {},
669 | "source": [
670 | "From the [docs](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html):\n",
671 | "\n",
672 | ">A groupby operation involves some combination of splitting the\n",
673 | "object, __applying a function__, and combining the results. This can be\n",
674 | "used to group large amounts of data and compute operations on these\n",
675 | "groups."
676 | ]
677 | },
678 | {
679 | "cell_type": "markdown",
680 | "metadata": {},
681 | "source": [
682 | "We're missing the **applying a function** part of it.\n",
683 | "\n",
684 | "Try the following:\n",
685 | "```python\n",
686 | "state_households.groupby(\"countyfip\").sum()\n",
687 | "```\n",
688 | "\n",
689 | "You can pass _almost_ any function to this. \n",
690 | "\n",
691 | "Try `.mean()`, `.max()`, `.min()`, `.std()`."
692 | ]
693 | },
694 | {
695 | "cell_type": "code",
696 | "execution_count": null,
697 | "metadata": {},
698 | "outputs": [],
699 | "source": []
700 | },
701 | {
702 | "cell_type": "markdown",
703 | "metadata": {},
704 | "source": [
705 | "You can select columns just like you would any other regular dataframe."
706 | ]
707 | },
708 | {
709 | "cell_type": "code",
710 | "execution_count": null,
711 | "metadata": {},
712 | "outputs": [],
713 | "source": [
714 | "state_households.groupby(\"________\")['hhwt']._____()"
715 | ]
716 | },
717 | {
718 | "cell_type": "markdown",
719 | "metadata": {},
720 | "source": [
721 | "***"
722 | ]
723 | },
724 | {
725 | "cell_type": "code",
726 | "execution_count": null,
727 | "metadata": {},
728 | "outputs": [],
729 | "source": [
730 | "n_households = state_households.groupby(\"cihispeed\")['hhwt'].sum()[2]\n",
731 | "_state = state_households['statefip'].unique()[0]\n",
732 | "print(f\"\"\"\n",
733 | "We can now see that {n_households:,} households in {_state} have access to high-speed internet. But, out of how many?\n",
734 | "\n",
735 | "To make this easier to follow, let's save our results to a variable:\n",
736 | "\"\"\")"
737 | ]
738 | },
739 | {
740 | "cell_type": "code",
741 | "execution_count": null,
742 | "metadata": {},
743 | "outputs": [],
744 | "source": [
745 | "households_with_highspeed_access = ____________._____(\"_____\")[\"____\"].___()\n",
746 | "\n",
747 | "households_with_highspeed_access"
748 | ]
749 | },
750 | {
751 | "cell_type": "markdown",
752 | "metadata": {},
753 | "source": [
754 | "This looks like any regular `pandas.Series`. How do we find the total `.sum()` of a series' elements?"
755 | ]
756 | },
757 | {
758 | "cell_type": "markdown",
759 | "metadata": {},
760 | "source": [
761 | "![math](../../static/math.png)"
762 | ]
763 | },
764 | {
765 | "cell_type": "code",
766 | "execution_count": null,
767 | "metadata": {},
768 | "outputs": [],
769 | "source": []
770 | },
771 | {
772 | "cell_type": "markdown",
773 | "metadata": {},
774 | "source": [
775 | "That's our denominator! 
\n", 776 | "\n", 777 | "![nice](../../static/nooice.gif)" 778 | ] 779 | }, 780 | { 781 | "cell_type": "markdown", 782 | "metadata": {}, 783 | "source": [ 784 | "***" 785 | ] 786 | }, 787 | { 788 | "cell_type": "markdown", 789 | "metadata": {}, 790 | "source": [ 791 | "When you _apply_ and operation to a `pandas.Series` it _maps_ to each of its elements.\n", 792 | "\n", 793 | "Try the following:\n", 794 | "```python\n", 795 | "households_with_highspeed_access * 1_000_000\n", 796 | "```\n", 797 | "\n", 798 | "```python\n", 799 | "households_with_highspeed_access + 1_000_000\n", 800 | "```\n", 801 | "\n", 802 | "```python\n", 803 | "households_with_highspeed_access / 1_000_000\n", 804 | "```" 805 | ] 806 | }, 807 | { 808 | "cell_type": "code", 809 | "execution_count": null, 810 | "metadata": {}, 811 | "outputs": [], 812 | "source": [] 813 | }, 814 | { 815 | "cell_type": "code", 816 | "execution_count": null, 817 | "metadata": {}, 818 | "outputs": [], 819 | "source": [] 820 | }, 821 | { 822 | "cell_type": "code", 823 | "execution_count": null, 824 | "metadata": {}, 825 | "outputs": [], 826 | "source": [] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": {}, 831 | "source": [ 832 | "Now that you know the denominator of our equation (how many households total in X state), how would you find each of the 3 values in your `households_with_highspeed_access` share of the total?" 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": null, 838 | "metadata": {}, 839 | "outputs": [], 840 | "source": [] 841 | }, 842 | { 843 | "cell_type": "markdown", 844 | "metadata": {}, 845 | "source": [ 846 | "***\n", 847 | "***" 848 | ] 849 | }, 850 | { 851 | "cell_type": "markdown", 852 | "metadata": {}, 853 | "source": [ 854 | "### Part 2 of analysis: Creating derived variables" 855 | ] 856 | }, 857 | { 858 | "cell_type": "markdown", 859 | "metadata": {}, 860 | "source": [ 861 | "Now that you have answered **Research Question 1**, we can move on to Q2: \n", 862 | ">_Does this number vary across demographic groups? 
(in this case race/ethnicity)._"
863 | ]
864 | },
865 | {
866 | "cell_type": "markdown",
867 | "metadata": {},
868 | "source": [
869 | "pandas' `.groupby()` method can take a list of columns to group by. \n",
870 | "\n",
871 | "Try the following:\n",
872 | "```python\n",
873 | "state_households.groupby(['race', 'cihispeed'])[['hhwt']].sum()\n",
874 | "```\n",
875 | "\n",
876 | "_Notice that I'm passing_ `[['hhwt']]` _(a 1-element list) and not just_ `['hhwt']`_. Try both yourself and let's discuss the difference._"
877 | ]
878 | },
879 | {
880 | "cell_type": "code",
881 | "execution_count": null,
882 | "metadata": {},
883 | "outputs": [],
884 | "source": []
885 | },
886 | {
887 | "cell_type": "code",
888 | "execution_count": null,
889 | "metadata": {},
890 | "outputs": [],
891 | "source": []
892 | },
893 | {
894 | "cell_type": "markdown",
895 | "metadata": {},
896 | "source": [
897 | "***"
898 | ]
899 | },
900 | {
901 | "cell_type": "markdown",
902 | "metadata": {},
903 | "source": [
904 | "#### Step 1: Define your groups\n",
905 | "\n"
906 | ]
907 | },
908 | {
909 | "cell_type": "markdown",
910 | "metadata": {},
911 | "source": [
912 | "Pandas' `.loc` indexer serves not only to slice dataframes but also to assign new values to certain slices of dataframes.\n",
913 | "\n",
914 | "For example,\n",
915 | "```python\n",
916 | "mask_madeup_data = (data['column_1'] == 'no answer')\n",
917 | "data.loc[mask_madeup_data, 'new_column'] = 'this row did not answer'\n",
918 | "```"
919 | ]
920 | },
921 | {
922 | "cell_type": "markdown",
923 | "metadata": {},
924 | "source": [
925 | "The code above grabs all the rows that satisfy the condition and then looks at `'new_column'`; if it doesn't exist, pandas will create it for you and assign the value `'this row did not answer'` to all the rows that match the condition. The rest will be filled with null values (NaNs). A runnable toy version is sketched below."
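]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_A runnable toy version of that pattern (the dataframe and its values here are made up):_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# made-up survey answers\n",
"toy = pd.DataFrame({'column_1': ['yes', 'no answer', 'no', 'no answer']})\n",
"\n",
"# True for every row that satisfies the condition\n",
"mask_madeup_data = (toy['column_1'] == 'no answer')\n",
"\n",
"# 'new_column' doesn't exist yet, so pandas creates it;\n",
"# rows outside the mask are left as NaN\n",
"toy.loc[mask_madeup_data, 'new_column'] = 'this row did not answer'\n",
"\n",
"toy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use the exact same pattern to build our race/ethnicity groups."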
926 | ]
927 | },
928 | {
929 | "cell_type": "markdown",
930 | "metadata": {},
931 | "source": [
932 | "###### Let's create our masks"
933 | ]
934 | },
935 | {
936 | "cell_type": "code",
937 | "execution_count": null,
938 | "metadata": {},
939 | "outputs": [],
940 | "source": [
941 | "mask_latino = \n"
942 | ]
943 | },
944 | {
945 | "cell_type": "code",
946 | "execution_count": null,
947 | "metadata": {},
948 | "outputs": [],
949 | "source": [
950 | "mask_white = \n"
951 | ]
952 | },
953 | {
954 | "cell_type": "code",
955 | "execution_count": null,
956 | "metadata": {},
957 | "outputs": [],
958 | "source": [
959 | "mask_black = \n"
960 | ]
961 | },
962 | {
963 | "cell_type": "code",
964 | "execution_count": null,
965 | "metadata": {},
966 | "outputs": [],
967 | "source": [
968 | "mask_______ = \n"
969 | ]
970 | },
971 | {
972 | "cell_type": "code",
973 | "execution_count": null,
974 | "metadata": {},
975 | "outputs": [],
976 | "source": [
977 | "mask_______ = \n"
978 | ]
979 | },
980 | {
981 | "cell_type": "code",
982 | "execution_count": null,
983 | "metadata": {},
984 | "outputs": [],
985 | "source": [
986 | "mask_______ =\n"
987 | ]
988 | },
989 | {
990 | "cell_type": "markdown",
991 | "metadata": {},
992 | "source": [
993 | "Assign the values to a new column `'racen'` for Race/Ethnicity"
994 | ]
995 | },
996 | {
997 | "cell_type": "code",
998 | "execution_count": null,
999 | "metadata": {},
1000 | "outputs": [],
1001 | "source": [
1002 | "state_households.loc[mask_latino, 'racen'] = 'Latino'\n",
1003 | "state_households.loc[mask_white, 'racen'] = 'White'\n",
1004 | "state_households.loc[mask_black, 'racen'] = 'Black/African-American'\n",
1005 | "state_households.loc[mask_______, 'racen'] = '_______'\n",
1006 | "state_households.loc[mask_______, 'racen'] = '_______'\n",
1007 | "state_households.loc[mask_______, 'racen'] = '_______'\n"
1008 | ]
1009 | },
1010 | {
1011 | "cell_type": "markdown",
1012 | "metadata": {},
1013 | "source": [
1014 | "Checking your results.\n",
1015 | "\n",
1016 | "Under your new logic, all `race` values should fit into `racen` values so there should not be any null values, right?"
1017 | ]
1018 | },
1019 | {
1020 | "cell_type": "markdown",
1021 | "metadata": {},
1022 | "source": [
1023 | "Pandas `.isna()` returns a series of either True or False for each value of a series depending on whether or not it is Null. \n",
1024 | "\n",
1025 | "AND\n",
1026 | "\n",
1027 | "in python, True = 1 and False = 0. \n",
1028 | "\n",
1029 | "What do you think would happen if you ask for the `.sum()` total of a `pandas.Series` of booleans?"
1030 | ]
1031 | },
1032 | {
1033 | "cell_type": "code",
1034 | "execution_count": null,
1035 | "metadata": {},
1036 | "outputs": [],
1037 | "source": []
1038 | },
1039 | {
1040 | "cell_type": "markdown",
1041 | "metadata": {},
1042 | "source": [
1043 | "***"
1044 | ]
1045 | },
1046 | {
1047 | "cell_type": "markdown",
1048 | "metadata": {},
1049 | "source": [
1050 | "##### Multiple ways of grouping data"
1051 | ]
1052 | },
1053 | {
1054 | "cell_type": "markdown",
1055 | "metadata": {},
1056 | "source": [
1057 | "Now that you have derived a working variable for race/ethnicity you can aggregate your data to answer **RQ2**. In pandas, there are many ways to do this; some of them are:\n",
1058 | "1. `.groupby()` like we've done so far.\n",
1059 | "2. `.pivot_table()`\n",
1060 | "3. `pd.crosstab()` <- this one is a top-level `pandas` function, not a DataFrame method. More on it in the sketch below."
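]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_As promised, a quick sketch of_ `pd.crosstab` _on made-up data (hypothetical column names; for the exercises we'll stick to_ `.groupby()` _and_ `.pivot_table()`_):_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],\n",
"                    'answer': ['yes', 'no', 'yes', 'yes'],\n",
"                    'weight': [10, 20, 30, 40]})\n",
"\n",
"# note we call pd.crosstab on pandas itself and hand it the columns\n",
"pd.crosstab(index=toy['group'], columns=toy['answer'],\n",
"            values=toy['weight'], aggfunc='sum')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On to our actual data, starting with `.groupby()`."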
1061 | ]
1062 | },
1063 | {
1064 | "cell_type": "markdown",
1065 | "metadata": {},
1066 | "source": [
1067 | "##### GroupBy"
1068 | ]
1069 | },
1070 | {
1071 | "cell_type": "code",
1072 | "execution_count": null,
1073 | "metadata": {},
1074 | "outputs": [],
1075 | "source": [
1076 | "state_households.groupby(['racen', '______'])[['______']]._____()"
1077 | ]
1078 | },
1079 | {
1080 | "cell_type": "markdown",
1081 | "metadata": {},
1082 | "source": [
1083 | "Let's save that to an appropriately named variable since we'll be using it later."
1084 | ]
1085 | },
1086 | {
1087 | "cell_type": "code",
1088 | "execution_count": null,
1089 | "metadata": {},
1090 | "outputs": [],
1091 | "source": [
1092 | "cihispeed_by_racen = state_households.groupby(['racen', '______'])[['______']]._____()"
1093 | ]
1094 | },
1095 | {
1096 | "cell_type": "markdown",
1097 | "metadata": {},
1098 | "source": [
1099 | "Now, this grouped dataframe has the total number of households in each of these racen-cihispeed groups. \n",
1100 | "\n",
1101 | "We need the share of cihispeed values by racen group. \n",
1102 | "\n",
1103 | "In our equation,\n",
1104 | "\n",
1105 | "$$ \\frac{households\\ with\\ high\\ speed\\ internet}{total\\ households\\ in\\ racen\\ group}$$\n",
1106 | "\n",
1107 | "We need to find the denominator."
1108 | ]
1109 | },
1110 | {
1111 | "cell_type": "code",
1112 | "execution_count": null,
1113 | "metadata": {},
1114 | "outputs": [],
1115 | "source": [
1116 | "# find the denominator\n"
1117 | ]
1118 | },
1119 | {
1120 | "cell_type": "code",
1121 | "execution_count": null,
1122 | "metadata": {},
1123 | "outputs": [],
1124 | "source": [
1125 | "# divide your racen-cihispeed by denominator\n"
1126 | ]
1127 | },
1128 | {
1129 | "cell_type": "code",
1130 | "execution_count": null,
1131 | "metadata": {},
1132 | "outputs": [],
1133 | "source": [
1134 | "# save to appropriately named variable\n",
1135 | "shares_cihispeed_by_racen = "
1136 | ]
1137 | },
1138 | {
1139 | "cell_type": "markdown",
1140 | "metadata": {},
1141 | "source": [
1142 | "This is a multi-level index dataframe and there are a few ways to slice it. Let's try 3:\n",
1143 | "1. a classic `.loc` slice\n",
1144 | "2. a cross-section (`.xs()`)\n",
1145 | "3. 
the `.reset_index()` method" 1146 | ] 1147 | }, 1148 | { 1149 | "cell_type": "markdown", 1150 | "metadata": {}, 1151 | "source": [ 1152 | "**Classic `.loc`**" 1153 | ] 1154 | }, 1155 | { 1156 | "cell_type": "code", 1157 | "execution_count": null, 1158 | "metadata": {}, 1159 | "outputs": [], 1160 | "source": [ 1161 | "shares_cihispeed_by_racen.loc[INDEX_SLICER, COLUMNS]" 1162 | ] 1163 | }, 1164 | { 1165 | "cell_type": "markdown", 1166 | "metadata": {}, 1167 | "source": [ 1168 | "**Cross-section**" 1169 | ] 1170 | }, 1171 | { 1172 | "cell_type": "code", 1173 | "execution_count": 59, 1174 | "metadata": {}, 1175 | "outputs": [ 1176 | { 1177 | "data": { 1178 | "text/plain": [ 1179 | "\u001b[0;31mSignature:\u001b[0m \u001b[0mshares_cihispeed_by_racen\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mxs\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdrop_level\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 1180 | "\u001b[0;31mDocstring:\u001b[0m\n", 1181 | "Return cross-section from the Series/DataFrame.\n", 1182 | "\n", 1183 | "This method takes a `key` argument to select data at a particular\n", 1184 | "level of a MultiIndex.\n", 1185 | "\n", 1186 | "Parameters\n", 1187 | "----------\n", 1188 | "key : label or tuple of label\n", 1189 | " Label contained in the index, or partially in a MultiIndex.\n", 1190 | "axis : {0 or 'index', 1 or 'columns'}, default 0\n", 1191 | " Axis to retrieve cross-section on.\n", 1192 | "level : object, defaults to first n levels (n=1 or len(key))\n", 1193 | " In case of a key partially contained in a MultiIndex, indicate\n", 1194 | " which levels are used. Levels can be referred by label or position.\n", 1195 | "drop_level : bool, default True\n", 1196 | " If False, returns object with same levels as self.\n", 1197 | "\n", 1198 | "Returns\n", 1199 | "-------\n", 1200 | "Series or DataFrame\n", 1201 | " Cross-section from the original Series or DataFrame\n", 1202 | " corresponding to the selected index levels.\n", 1203 | "\n", 1204 | "See Also\n", 1205 | "--------\n", 1206 | "DataFrame.loc : Access a group of rows and columns\n", 1207 | " by label(s) or a boolean array.\n", 1208 | "DataFrame.iloc : Purely integer-location based indexing\n", 1209 | " for selection by position.\n", 1210 | "\n", 1211 | "Notes\n", 1212 | "-----\n", 1213 | "`xs` can not be used to set values.\n", 1214 | "\n", 1215 | "MultiIndex Slicers is a generic way to get/set values on\n", 1216 | "any level or levels.\n", 1217 | "It is a superset of `xs` functionality, see\n", 1218 | ":ref:`MultiIndex Slicers `.\n", 1219 | "\n", 1220 | "Examples\n", 1221 | "--------\n", 1222 | ">>> d = {'num_legs': [4, 4, 2, 2],\n", 1223 | "... 'num_wings': [0, 0, 2, 2],\n", 1224 | "... 'class': ['mammal', 'mammal', 'mammal', 'bird'],\n", 1225 | "... 'animal': ['cat', 'dog', 'bat', 'penguin'],\n", 1226 | "... 
'locomotion': ['walks', 'walks', 'flies', 'walks']}\n", 1227 | ">>> df = pd.DataFrame(data=d)\n", 1228 | ">>> df = df.set_index(['class', 'animal', 'locomotion'])\n", 1229 | ">>> df\n", 1230 | " num_legs num_wings\n", 1231 | "class animal locomotion\n", 1232 | "mammal cat walks 4 0\n", 1233 | " dog walks 4 0\n", 1234 | " bat flies 2 2\n", 1235 | "bird penguin walks 2 2\n", 1236 | "\n", 1237 | "Get values at specified index\n", 1238 | "\n", 1239 | ">>> df.xs('mammal')\n", 1240 | " num_legs num_wings\n", 1241 | "animal locomotion\n", 1242 | "cat walks 4 0\n", 1243 | "dog walks 4 0\n", 1244 | "bat flies 2 2\n", 1245 | "\n", 1246 | "Get values at several indexes\n", 1247 | "\n", 1248 | ">>> df.xs(('mammal', 'dog'))\n", 1249 | " num_legs num_wings\n", 1250 | "locomotion\n", 1251 | "walks 4 0\n", 1252 | "\n", 1253 | "Get values at specified index and level\n", 1254 | "\n", 1255 | ">>> df.xs('cat', level=1)\n", 1256 | " num_legs num_wings\n", 1257 | "class locomotion\n", 1258 | "mammal walks 4 0\n", 1259 | "\n", 1260 | "Get values at several indexes and levels\n", 1261 | "\n", 1262 | ">>> df.xs(('bird', 'walks'),\n", 1263 | "... level=[0, 'locomotion'])\n", 1264 | " num_legs num_wings\n", 1265 | "animal\n", 1266 | "penguin 2 2\n", 1267 | "\n", 1268 | "Get values at specified column and axis\n", 1269 | "\n", 1270 | ">>> df.xs('num_wings', axis=1)\n", 1271 | "class animal locomotion\n", 1272 | "mammal cat walks 0\n", 1273 | " dog walks 0\n", 1274 | " bat flies 2\n", 1275 | "bird penguin walks 2\n", 1276 | "Name: num_wings, dtype: int64\n", 1277 | "\u001b[0;31mFile:\u001b[0m /anaconda3/envs/pycon/lib/python3.7/site-packages/pandas/core/generic.py\n", 1278 | "\u001b[0;31mType:\u001b[0m method\n" 1279 | ] 1280 | }, 1281 | "metadata": {}, 1282 | "output_type": "display_data" 1283 | } 1284 | ], 1285 | "source": [ 1286 | "shares_cihispeed_by_racen.xs?" 1287 | ] 1288 | }, 1289 | { 1290 | "cell_type": "code", 1291 | "execution_count": null, 1292 | "metadata": {}, 1293 | "outputs": [], 1294 | "source": [ 1295 | "shares_cihispeed_by_racen.xs(key = '________', level = _)" 1296 | ] 1297 | }, 1298 | { 1299 | "cell_type": "markdown", 1300 | "metadata": {}, 1301 | "source": [ 1302 | "**`.reset_index()`**" 1303 | ] 1304 | }, 1305 | { 1306 | "cell_type": "markdown", 1307 | "metadata": {}, 1308 | "source": [ 1309 | "Another way to slice a multi-level index dataframe is to make it a not-multi-level index dataframe. To do that you need to _reset_ its index. After that, we can slice it how we've been slicing our dataframes previously." 
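]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_A toy example of the_ `.reset_index()` _approach (made-up data again, not our census variables):_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'group': ['a', 'a', 'b'],\n",
"                    'answer': ['yes', 'no', 'yes'],\n",
"                    'weight': [10, 20, 30]}).groupby(['group', 'answer']).sum()\n",
"\n",
"# both index levels become regular columns again\n",
"flat = toy.reset_index()\n",
"\n",
"# so a plain boolean mask works, just like before\n",
"flat[flat['answer'] == 'yes']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now try the same moves on `shares_cihispeed_by_racen`:"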
1310 | ]
1311 | },
1312 | {
1313 | "cell_type": "code",
1314 | "execution_count": null,
1315 | "metadata": {},
1316 | "outputs": [],
1317 | "source": [
1318 | "__________ = ____________._________()"
1319 | ]
1320 | },
1321 | {
1322 | "cell_type": "code",
1323 | "execution_count": null,
1324 | "metadata": {},
1325 | "outputs": [],
1326 | "source": [
1327 | "__________"
1328 | ]
1329 | },
1330 | {
1331 | "cell_type": "code",
1332 | "execution_count": null,
1333 | "metadata": {},
1334 | "outputs": [],
1335 | "source": [
1336 | "mask_yes_cihispeed = (_____________ == '___________')\n",
1337 | "_______[mask_yes_cihispeed]"
1338 | ]
1339 | },
1340 | {
1341 | "cell_type": "markdown",
1342 | "metadata": {},
1343 | "source": [
1344 | "***"
1345 | ]
1346 | },
1347 | {
1348 | "cell_type": "markdown",
1349 | "metadata": {},
1350 | "source": [
1351 | "##### Pivot Tables"
1352 | ]
1353 | },
1354 | {
1355 | "cell_type": "markdown",
1356 | "metadata": {},
1357 | "source": [
1358 | "The second way of aggregating our data is `.pivot_table()`.\n",
1359 | "\n",
1360 | "If you've worked with Excel, you might already be familiar with what a pivot table is.\n",
1361 | "\n",
1362 | "From [Wikipedia](https://en.wikipedia.org/wiki/Pivot_table):\n",
1363 | ">A pivot table is a table of statistics that summarizes the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way."
1364 | ]
1365 | },
1366 | {
1367 | "cell_type": "code",
1368 | "execution_count": 66,
1369 | "metadata": {},
1370 | "outputs": [
1371 | {
1372 | "data": {
1373 | "text/plain": [
1374 | "\u001b[0;31mSignature:\u001b[0m\n",
1375 | "\u001b[0mstate_households\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpivot_table\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
1376 | "\u001b[0;34m\u001b[0m \u001b[0mvalues\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
1377 | "\u001b[0;34m\u001b[0m \u001b[0mindex\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
1378 | "\u001b[0;34m\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
1379 | "\u001b[0;34m\u001b[0m \u001b[0maggfunc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'mean'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
1380 | "\u001b[0;34m\u001b[0m \u001b[0mfill_value\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
1381 | "\u001b[0;34m\u001b[0m \u001b[0mmargins\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
1382 | "\u001b[0;34m\u001b[0m \u001b[0mdropna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
1383 | "\u001b[0;34m\u001b[0m \u001b[0mmargins_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'All'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
1384 | "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
1385 | "\u001b[0;31mDocstring:\u001b[0m\n",
1386 | "Create a spreadsheet-style pivot table as a DataFrame. 
The levels in\n", 1387 | "the pivot table will be stored in MultiIndex objects (hierarchical\n", 1388 | "indexes) on the index and columns of the result DataFrame.\n", 1389 | "\n", 1390 | "Parameters\n", 1391 | "----------\n", 1392 | "values : column to aggregate, optional\n", 1393 | "index : column, Grouper, array, or list of the previous\n", 1394 | " If an array is passed, it must be the same length as the data. The\n", 1395 | " list can contain any of the other types (except list).\n", 1396 | " Keys to group by on the pivot table index. If an array is passed,\n", 1397 | " it is being used as the same manner as column values.\n", 1398 | "columns : column, Grouper, array, or list of the previous\n", 1399 | " If an array is passed, it must be the same length as the data. The\n", 1400 | " list can contain any of the other types (except list).\n", 1401 | " Keys to group by on the pivot table column. If an array is passed,\n", 1402 | " it is being used as the same manner as column values.\n", 1403 | "aggfunc : function, list of functions, dict, default numpy.mean\n", 1404 | " If list of functions passed, the resulting pivot table will have\n", 1405 | " hierarchical columns whose top level are the function names\n", 1406 | " (inferred from the function objects themselves)\n", 1407 | " If dict is passed, the key is column to aggregate and value\n", 1408 | " is function or list of functions\n", 1409 | "fill_value : scalar, default None\n", 1410 | " Value to replace missing values with\n", 1411 | "margins : boolean, default False\n", 1412 | " Add all row / columns (e.g. for subtotal / grand totals)\n", 1413 | "dropna : boolean, default True\n", 1414 | " Do not include columns whose entries are all NaN\n", 1415 | "margins_name : string, default 'All'\n", 1416 | " Name of the row / column that will contain the totals\n", 1417 | " when margins is True.\n", 1418 | "\n", 1419 | "Returns\n", 1420 | "-------\n", 1421 | "table : DataFrame\n", 1422 | "\n", 1423 | "See Also\n", 1424 | "--------\n", 1425 | "DataFrame.pivot : Pivot without aggregation that can handle\n", 1426 | " non-numeric data.\n", 1427 | "\n", 1428 | "Examples\n", 1429 | "--------\n", 1430 | ">>> df = pd.DataFrame({\"A\": [\"foo\", \"foo\", \"foo\", \"foo\", \"foo\",\n", 1431 | "... \"bar\", \"bar\", \"bar\", \"bar\"],\n", 1432 | "... \"B\": [\"one\", \"one\", \"one\", \"two\", \"two\",\n", 1433 | "... \"one\", \"one\", \"two\", \"two\"],\n", 1434 | "... \"C\": [\"small\", \"large\", \"large\", \"small\",\n", 1435 | "... \"small\", \"large\", \"small\", \"small\",\n", 1436 | "... \"large\"],\n", 1437 | "... \"D\": [1, 2, 2, 3, 3, 4, 5, 6, 7],\n", 1438 | "... \"E\": [2, 4, 5, 5, 6, 6, 8, 9, 9]})\n", 1439 | ">>> df\n", 1440 | " A B C D E\n", 1441 | "0 foo one small 1 2\n", 1442 | "1 foo one large 2 4\n", 1443 | "2 foo one large 2 5\n", 1444 | "3 foo two small 3 5\n", 1445 | "4 foo two small 3 6\n", 1446 | "5 bar one large 4 6\n", 1447 | "6 bar one small 5 8\n", 1448 | "7 bar two small 6 9\n", 1449 | "8 bar two large 7 9\n", 1450 | "\n", 1451 | "This first example aggregates values by taking the sum.\n", 1452 | "\n", 1453 | ">>> table = pivot_table(df, values='D', index=['A', 'B'],\n", 1454 | "... 
columns=['C'], aggfunc=np.sum)\n",
1455 | ">>> table\n",
1456 | "C large small\n",
1457 | "A B\n",
1458 | "bar one 4 5\n",
1459 | " two 7 6\n",
1460 | "foo one 4 1\n",
1461 | " two NaN 6\n",
1462 | "\n",
1463 | "We can also fill missing values using the `fill_value` parameter.\n",
1464 | "\n",
1465 | ">>> table = pivot_table(df, values='D', index=['A', 'B'],\n",
1466 | "... columns=['C'], aggfunc=np.sum, fill_value=0)\n",
1467 | ">>> table\n",
1468 | "C large small\n",
1469 | "A B\n",
1470 | "bar one 4 5\n",
1471 | " two 7 6\n",
1472 | "foo one 4 1\n",
1473 | " two 0 6\n",
1474 | "\n",
1475 | "The next example aggregates by taking the mean across multiple columns.\n",
1476 | "\n",
1477 | ">>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'],\n",
1478 | "... aggfunc={'D': np.mean,\n",
1479 | "... 'E': np.mean})\n",
1480 | ">>> table\n",
1481 | " D E\n",
1482 | " mean mean\n",
1483 | "A C\n",
1484 | "bar large 5.500000 7.500000\n",
1485 | " small 5.500000 8.500000\n",
1486 | "foo large 2.000000 4.500000\n",
1487 | " small 2.333333 4.333333\n",
1488 | "\n",
1489 | "We can also calculate multiple types of aggregations for any given\n",
1490 | "value column.\n",
1491 | "\n",
1492 | ">>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'],\n",
1493 | "... aggfunc={'D': np.mean,\n",
1494 | "... 'E': [min, max, np.mean]})\n",
1495 | ">>> table\n",
1496 | " D E\n",
1497 | " mean max mean min\n",
1498 | "A C\n",
1499 | "bar large 5.500000 9 7.500000 6\n",
1500 | " small 5.500000 9 8.500000 8\n",
1501 | "foo large 2.000000 5 4.500000 4\n",
1502 | " small 2.333333 6 4.333333 2\n",
1503 | "\u001b[0;31mFile:\u001b[0m /anaconda3/envs/pycon/lib/python3.7/site-packages/pandas/core/frame.py\n",
1504 | "\u001b[0;31mType:\u001b[0m method\n"
1505 | ]
1506 | },
1507 | "metadata": {},
1508 | "output_type": "display_data"
1509 | }
1510 | ],
1511 | "source": [
1512 | "state_households.pivot_table?"
1513 | ]
1514 | },
1515 | {
1516 | "cell_type": "markdown",
1517 | "metadata": {},
1518 | "source": [
1519 | "What we need are four things:\n",
1520 | "1. What variable will become our `index`?\n",
1521 | "2. What variable will become our `columns`?\n",
1522 | "3. What variable will become our `values`?\n",
1523 | "4. How will we aggregate our values?\n",
1524 | "\n",
1525 | "Pandas is going to grab each unique value in the variables you choose and use those as rows in your `.index` or separate columns in your `.columns`. The `values` variable should be _quantitative_ in this case (but it doesn't have to be, necessarily). `.pivot_table` will by default find the `mean` of your `values` variable for each cell in your new table. In this case we don't care about the `mean`; we want to `sum` up the total number of households."
1526 | ]
1527 | },
1528 | {
1529 | "cell_type": "markdown",
1530 | "metadata": {},
1531 | "source": [
1532 | "Try the following:\n",
1533 | "\n",
1534 | "```python\n",
1535 | "state_households.pivot_table(\n",
1536 | " index = '______',\n",
1537 | " columns = '______', \n",
1538 | " values = 'hhwt',\n",
1539 | " aggfunc = '___',\n",
1540 | " margins = True,\n",
1541 | ")\n",
1542 | "```"
1543 | ]
1544 | },
1545 | {
1546 | "cell_type": "code",
1547 | "execution_count": null,
1548 | "metadata": {},
1549 | "outputs": [],
1550 | "source": []
1551 | },
1552 | {
1553 | "cell_type": "markdown",
1554 | "metadata": {},
1555 | "source": [
1556 | "Save it to an appropriately named variable."
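]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Quick aside before you do: here's what_ `margins=True` _adds, again on made-up data:_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],\n",
"                    'answer': ['yes', 'no', 'yes', 'no'],\n",
"                    'weight': [10, 20, 30, 40]})\n",
"\n",
"# margins=True appends an 'All' row and column holding the totals,\n",
"# which is handy as the denominator when computing shares\n",
"toy.pivot_table(index='group', columns='answer',\n",
"                values='weight', aggfunc='sum', margins=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Back to our data: fill in the blanks below."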
1557 | ] 1558 | }, 1559 | { 1560 | "cell_type": "code", 1561 | "execution_count": null, 1562 | "metadata": {}, 1563 | "outputs": [], 1564 | "source": [ 1565 | "households_pivot_table = state_households.pivot_table(\n", 1566 | " index = '_____',\n", 1567 | " columns = '______',\n", 1568 | " ______ = '______',\n", 1569 | " ______ = '____',\n", 1570 | " _______ = True,\n", 1571 | ")" 1572 | ] 1573 | }, 1574 | { 1575 | "cell_type": "markdown", 1576 | "metadata": {}, 1577 | "source": [ 1578 | "What do you think the next step should be?" 1579 | ] 1580 | }, 1581 | { 1582 | "cell_type": "code", 1583 | "execution_count": null, 1584 | "metadata": {}, 1585 | "outputs": [], 1586 | "source": [] 1587 | } 1588 | ], 1589 | "metadata": { 1590 | "kernelspec": { 1591 | "display_name": "Python 3", 1592 | "language": "python", 1593 | "name": "python3" 1594 | }, 1595 | "language_info": { 1596 | "codemirror_mode": { 1597 | "name": "ipython", 1598 | "version": 3 1599 | }, 1600 | "file_extension": ".py", 1601 | "mimetype": "text/x-python", 1602 | "name": "python", 1603 | "nbconvert_exporter": "python", 1604 | "pygments_lexer": "ipython3", 1605 | "version": "3.7.3" 1606 | } 1607 | }, 1608 | "nbformat": 4, 1609 | "nbformat_minor": 2 1610 | } 1611 | -------------------------------------------------------------------------------- /exercises/notebooks/tools.py: -------------------------------------------------------------------------------- 1 | # useful tools for our project 2 | 3 | def tree(directory): 4 | print(f'+ {directory}') 5 | for path in sorted(directory.rglob('*')): 6 | depth = len(path.relative_to(directory).parts) 7 | spacer = ' ' * depth 8 | print(f'{spacer}+ {path.name}') -------------------------------------------------------------------------------- /presentation/presentation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "slide" 8 | } 9 | }, 10 | "source": [ 11 | "# Analyzing Census Data with Pandas\n", 12 | "\n", 13 | "### Sergio Sánchez Zavala" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": { 19 | "slideshow": { 20 | "slide_type": "slide" 21 | } 22 | }, 23 | "source": [ 24 | "# Who am I?\n", 25 | "\n", 26 | "My name is Sergio Sánchez and I'm a research associate at PPIC (Public Policy Institute of California) in the Higher Ed Center. The work I do there covers developmental education reform in California Community Colleges, economic mobility, and some immigration stuff. " 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": { 32 | "slideshow": { 33 | "slide_type": "slide" 34 | } 35 | }, 36 | "source": [ 37 | "# Who am I? (part 2)\n", 38 | "\n", 39 | "I'm very interested in data visualization. I'm a facilitator in the newly formed [Data Visualization Society](https://www.datavisualizationsociety.com/the-team). 
My newest project is [@tacosdedatos](https://twitter.com/tacosdedatos) - [tacosdedatos.com](https://tacosdedatos.com/) where I hope to build a place to learn data analysis and data visualization best practices, techniques, and knowledge in Spanish.\n",
40 | "\n"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {
46 | "slideshow": {
47 | "slide_type": "slide"
48 | }
49 | },
50 | "source": [
51 | "# Housekeeping\n",
52 | "\n",
53 | "Materials are on GitHub at https://github.com/chekos/analyzing-census-data\n",
54 | "```bash\n",
55 | "git clone https://github.com/chekos/analyzing-census-data\n",
56 | "cd analyzing-census-data\n",
57 | "```\n",
58 | "\n",
59 | "You only need jupyter and pandas to follow along on your personal computer."
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {
65 | "slideshow": {
66 | "slide_type": "slide"
67 | }
68 | },
69 | "source": [
70 | "We will be using Jupyter Lab but you can follow along in a Jupyter Notebook if you're more comfortable that way."
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {
76 | "slideshow": {
77 | "slide_type": "slide"
78 | }
79 | },
80 | "source": [
81 | "# MyBinder.org \n",
82 | "We'll be using [mybinder.org](https://mybinder.org/) to go through this tutorial.\n",
83 | ">Binder allows you to create custom computing environments that can be shared and used by many remote users. It is powered by BinderHub, which is an open-source tool that deploys the Binder service in the cloud. One-such deployment lives here, at mybinder.org, and is free to use. For more information about the mybinder.org deployment and the team that runs it, see About mybinder.org."
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {
89 | "slideshow": {
90 | "slide_type": "slide"
91 | }
92 | },
93 | "source": [
94 | "# Census Data"
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {
100 | "slideshow": {
101 | "slide_type": "slide"
102 | }
103 | },
104 | "source": [
105 | "The US Census Bureau conducts more than 130 surveys every year. They have household surveys with data on education, health, employment, migration and many more topics. \n",
106 | "\n",
107 | "https://www.census.gov/programs-surveys/are-you-in-a-survey/survey-list/household-survey-list.html"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {
113 | "slideshow": {
114 | "slide_type": "slide"
115 | }
116 | },
117 | "source": [
118 | "They also have business surveys on retail, wholesale, imports/exports, entrepreneurship, and [public libraries](https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey) among many, many other things.\n",
119 | "\n",
120 | "https://www.census.gov/programs-surveys/are-you-in-a-survey/survey-list/business-survey-list.html"
121 | ]
122 | },
123 | {
124 | "cell_type": "markdown",
125 | "metadata": {
126 | "slideshow": {
127 | "slide_type": "slide"
128 | }
129 | },
130 | "source": [
131 | "One of the most popular household surveys is the American Community Survey (ACS), which we will be using for our analysis today.\n",
132 | "\n",
133 | ">The American Community Survey (ACS) helps local officials, community leaders, and businesses understand the changes taking place in their communities. It is the premier source for detailed population and housing information about our nation."
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {
139 | "slideshow": {
140 | "slide_type": "slide"
141 | }
142 | },
143 | "source": [
144 | "# Where to get it?\n",
145 | "\n",
146 | "The Census website provides **a lot** of ways to access their data.\n",
147 | "\n",
148 | "[**AmericanFactFinder**](https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml)\n",
149 | " - American FactFinder provides access to data about the United States, Puerto Rico and the Island Areas. The data in American FactFinder come from several censuses and surveys. \n",
150 | " "
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {
156 | "slideshow": {
157 | "slide_type": "slide"
158 | }
159 | },
160 | "source": [
161 | "# Where to get it?\n",
162 | "\n",
163 | "**Pre-computed Tables** \n",
164 | "\n",
165 | "They also provide pre-computed tables for popular topics like educational attainment or median incomes at various geographic levels (region, metropolitan area, state, county, etc.)\n",
166 | "\n",
167 | "https://www.census.gov/data/tables.html"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {
173 | "slideshow": {
174 | "slide_type": "slide"
175 | }
176 | },
177 | "source": [
178 | "# Where to get it?\n",
179 | "\n",
180 | "**IPUMS**\n",
181 | "\n",
182 | ">IPUMS provides census and survey data from around the world integrated across time and space. IPUMS integration and documentation makes it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community context. Data and services available free of charge."
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {
188 | "slideshow": {
189 | "slide_type": "slide"
190 | }
191 | },
192 | "source": [
193 | "IPUMS stands for **Integrated Public Use Microdata Series**\n",
194 | "![ipums](static/ipums.gif)"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "metadata": {
200 | "slideshow": {
201 | "slide_type": "slide"
202 | }
203 | },
204 | "source": [
205 | "# How do I get it using `python`?\n",
206 | "There are a few python packages on pypi.org related to Census data. Here are four notable ones:"
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {
212 | "slideshow": {
213 | "slide_type": "slide"
214 | }
215 | },
216 | "source": [
217 | "`census` - [pypi](https://pypi.org/project/census/)\n",
218 | "\n",
219 | "> A simple wrapper for the United States Census Bureau’s API.\n",
220 | "Provides access to ACS, SF1, and SF3 data sets.\n",
221 | "\n",
222 | "```python\n",
223 | "from census import Census\n",
224 | "from us import states\n",
225 | "\n",
226 | "c = Census(\"MY_API_KEY\")\n",
227 | "c.acs5.get(('NAME', 'B25034_010E'),\n",
228 | " {'for': 'state:{}'.format(states.MD.fips)})\n",
229 | "```"
230 | ]
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {
235 | "slideshow": {
236 | "slide_type": "slide"
237 | }
238 | },
239 | "source": [
240 | "`cenpy` - [pypi](https://pypi.org/project/cenpy/)\n",
241 | "\n",
242 | ">An interface to explore and query the US Census API and return Pandas Dataframes. 
Ideally, this package is intended for exploratory data analysis and draws inspiration from sqlalchemy-like interfaces and acs.R.\n", 243 | "\n", 244 | "The docs include an [intro notebook](https://nbviewer.jupyter.org/github/ljwolf/cenpy/blob/master/demo.ipynb)" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": { 250 | "slideshow": { 251 | "slide_type": "slide" 252 | } 253 | }, 254 | "source": [ 255 | "`census-data-downloader` - [GitHub](https://github.com/datadesk/census-data-downloader) but also pip installable\n", 256 | "\n", 257 | "census-data-downloader is a Command Line Interface developed by the Los Angeles Times to download Census data and reformat it for humans.\n", 258 | "\n", 259 | "```bash\n", 260 | "export CENSUS_API_KEY=''\n", 261 | "censusdatadownloader --year 2010 medianage states\n", 262 | "```" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": { 268 | "slideshow": { 269 | "slide_type": "slide" 270 | } 271 | }, 272 | "source": [ 273 | "`censusdata` - [pypi](https://pypi.org/project/censusdata/)\n", 274 | "\n", 275 | ">This package handles the details of interacting with the Census API for you, so that you can focus on working with the data. It provides a class for representing Census geographies. It also provides functions for gaining further information about specific variables and tables and for searching for variables." 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": { 281 | "slideshow": { 282 | "slide_type": "slide" 283 | } 284 | }, 285 | "source": [ 286 | "# Let's analyze some census data!" 287 | ] 288 | } 289 | ], 290 | "metadata": { 291 | "kernelspec": { 292 | "display_name": "Python 3", 293 | "language": "python", 294 | "name": "python3" 295 | }, 296 | "language_info": { 297 | "codemirror_mode": { 298 | "name": "ipython", 299 | "version": 3 300 | }, 301 | "file_extension": ".py", 302 | "mimetype": "text/x-python", 303 | "name": "python", 304 | "nbconvert_exporter": "python", 305 | "pygments_lexer": "ipython3", 306 | "version": "3.7.3" 307 | }, 308 | "livereveal": { 309 | "autolaunch": true, 310 | "footer": "
/chekos · @ChekosWH · tacosdedatos
", 311 | "header": "", 312 | "theme": "solarized", 313 | "transition": "slide" 314 | } 315 | }, 316 | "nbformat": 4, 317 | "nbformat_minor": 2 318 | } 319 | -------------------------------------------------------------------------------- /presentation/static/ipums.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/presentation/static/ipums.gif -------------------------------------------------------------------------------- /presentation/static/style.css: -------------------------------------------------------------------------------- 1 | #exit_b { 2 | opacity: 0; 3 | } 4 | 5 | #help_b { 6 | opacity: 0; 7 | } 8 | 9 | .rendered_html code { 10 | font-size: 80%; 11 | } 12 | 13 | .reveal blockquote { 14 | width: 90%; 15 | } -------------------------------------------------------------------------------- /static/github-download.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/static/github-download.gif -------------------------------------------------------------------------------- /static/math.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/static/math.png -------------------------------------------------------------------------------- /static/nooice.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chekos/analyzing-census-data/4073f47c6cc884a0b3531f3bfc4205070b1a2766/static/nooice.gif --------------------------------------------------------------------------------