├── Census ├── County_Rural_Lookup.csv ├── County_Rural_Lookup.xlsx ├── Original co-est2019-annres.xlsx └── co-est2019-annres.csv ├── Data_Wrangling.Rmd ├── Data_Wrangling_data_table.Rmd ├── Data_Wrangling_pandas.Rmd ├── EADA ├── InstLevel.sav ├── InstLevel.xlsx ├── InstlevelDataDoc2019.doc ├── Schools.sav ├── Schools.xlsx ├── SchoolsDoc2019.doc ├── instlevel.sas7bdat └── schools.sas7bdat ├── IPEDS ├── STATA_RV_942020-417.csv ├── STATA_RV_942020-417.do ├── STATA_RV_942020-614.csv ├── STATA_RV_942020-614.do ├── STATA_RV_942020-662.csv ├── STATA_RV_942020-662.do ├── cdsfile_all_STATA_RV_942020-310.dta ├── cdsfile_all_STATA_RV_942020-417.dta ├── cdsfile_all_STATA_RV_942020-614.dta └── cdsfile_all_STATA_RV_942020-662.dta ├── NYT ├── README_masksurvey.txt ├── mask-use-by-county.csv └── us-counties_cases.csv ├── Pandas Example Walkthrough.ipynb ├── README.md ├── example_walkthrough.R ├── example_walkthrough_data_table.R ├── example_walkthrough_tidyverse.R └── foot_traffic_panel.Rdata /Census/County_Rural_Lookup.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/Census/County_Rural_Lookup.csv -------------------------------------------------------------------------------- /Census/County_Rural_Lookup.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/Census/County_Rural_Lookup.xlsx -------------------------------------------------------------------------------- /Census/Original co-est2019-annres.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/Census/Original co-est2019-annres.xlsx -------------------------------------------------------------------------------- /Census/co-est2019-annres.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/Census/co-est2019-annres.csv -------------------------------------------------------------------------------- /Data_Wrangling.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Wrangling in the Tidyverse" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | 18 | ```{r setup, include=FALSE} 19 | knitr::opts_chunk$set(echo = FALSE) 20 | library(tidyverse) 21 | library(DT) 22 | library(purrr) 23 | library(readxl) 24 | ``` 25 | 26 | ## Data Wrangling 27 | 28 | ```{r, results = 'asis'} 29 | cat(" 30 | ") 36 | ``` 37 | 38 | Welcome to the Data Wrangling Workshop! 39 | 40 | - The goal of data wrangling 41 | - How to think about data wrangling 42 | - Technical tips for data wrangling in R using the **tidyverse** package (which, importantly, contains the **dplyr** and **tidyr** packages inside) 43 | - A walkthrough example 44 | 45 | ## Limitations 46 | 47 | - I will assume you already have some familiarity with R in general 48 | - We only have so much time! I won't be going into *great* detail on the use of all the technical commands, but by the end of this you will know what's out there and generally how it's used 49 | - *As with any computer skill, a teacher's comparative advantage is in letting you know what's out there. The* **real learning** *comes from practice and Googling. So take what you see here today, find yourself a project, and do it! It will be awful but you will learn an astounding amount by the end* 50 | 51 | ## Tidyverse notes 52 | 53 | - The **tidyverse** functions often return "`tibble`s" instead of `data.frame`s - these are very similar to `data.frame`s but look nicer when you print them, and can accept `list()` columns, as well as some other neat stuff 54 | - Also, throughout this talk I'll be using the pipe (`%>%`), which simply means "take whatever's on the left and make it the first argument of the thing on the right" 55 | - Very handy for chaining together operations and making code more readable. 56 | 57 | ## The pipe 58 | 59 | `scales::percent(mean(mtcars$am, na.rm = TRUE), accuracy = .1)` can be rewritten 60 | 61 | ```{r, eval = FALSE, echo = TRUE} 62 | mtcars %>% 63 | pull(am) %>% 64 | mean(na.rm = TRUE) %>% 65 | scales::percent(accuracy = .1) 66 | ``` 67 | 68 | - Like a conveyer belt! Nice and easy. Note that future versions of R will switch to the use of `|>` for the pipe 69 | - `pull()` is a **dplyr** function that says "give me back this one variable instead of a data set" but in a pipe-friendly way, so `mtcars %>% pull(am)` is the same as `mtcars$am` or `mtcars[['am']]` 70 | 71 | ## Data Wrangling 72 | 73 | What is data wrangling? 74 | 75 | - You have data 76 | - It's not ready for you to run your model 77 | - You want to get it ready to run your model 78 | - Ta-da! 79 | 80 | ## The Core of Data Wrangling 81 | 82 | - Always **look directly at your data so you know what it looks like** 83 | - Always **think about what you want your data to look like when you're done** 84 | - Think about **how you can take information from where it is and put it where you want it to be** 85 | - After every step, **look directly at your data again to make sure it's doing what you think it's doing** 86 | 87 | I help a lot of people with their problems with data wrangling. Their issues are almost always *not doing one of these four things*, much more so than having trouble coding or anything like that 88 | 89 | ## The Core of Data Wrangling 90 | 91 | - How can you "look at your data"? 92 | - Literally is one way - click on the data set, or do `View()` to look at it 93 | - Summary statistics tables: `sumtable()` or `vtable(lush = TRUE)` in **vtable** for example 94 | - Checking what values it takes: `table()` or `summary()` on individual variables 95 | - Look for: What values are there, what the observations look like, presence of missing or unusable data, how the data is structured 96 | 97 | ## The Stages of Data Wrangling 98 | 99 | - From records to data 100 | - From data to tidy data 101 | - From tidy data to data for your analysis 102 | 103 | # From Records to Data 104 | 105 | ## From Records to Data 106 | 107 | Not something we'll be focusing on today! But any time the data isn't in a workable format, like a spreadsheet or database, someone's got to get it there! 108 | 109 | - "Google Trends has information on the popularity of our marketing terms, go get it!" 110 | - "Here's a 600-page unformatted PDF of our sales records for the past three years. Turn it into a database." 111 | - "Here are scans of the 15,000 handwritten doctor's notes at the hospital over the past year" 112 | - "Here's access to the website. The records are in there somewhere." 113 | - "Go do a survey" 114 | 115 | ## From Records to Data: Tips! 116 | 117 | - Do as little by hand as possible. It's a lot of work and you *will* make mistakes 118 | - *Look at the data* a lot! 119 | - Check for changes in formatting - it's common for things like "this enormous PDF of our tables" or "eight hundred text files with the different responses/orders" to change formatting halfway through 120 | - When working with something like a PDF or a bunch of text files, think "how can I tell a computer to spot where the actual data is?" 121 | - If push comes to shove, or if the data set is small enough, you can do by-hand data entry. Be very careful! 122 | 123 | ## Reading Files 124 | 125 | One common thing you run across is data split into multiple files. How can we read these in and compile them? 126 | 127 | - `list.files()` produces a vector of filenames (tip: `full.names = TRUE` gives full filepaths) 128 | - Use `map()` from **purrr** to iterate over that vector and read in the data. This gives a list of `tibble`s (`data.frame`s) read in 129 | - Create your own function to process each, use `map` with that too (if you want some processing before you combine) 130 | - Combine the results with `bind_rows()`! 131 | 132 | ## Reading Files 133 | 134 | For example, imagine you have 200 monthly sales reports in Excel files. You just want to pull cell C2 (total sales) and cell B43 (employee of the month) and combine them together. 135 | 136 | ```{r, echo = TRUE, eval = FALSE} 137 | # For reading Excel 138 | library(readxl) 139 | # For map 140 | library(purrr) 141 | 142 | # Get the list of 200 reports 143 | filelist <- list.files(path = '../Monthly_reports/', pattern = 'sales', full.names = TRUE) 144 | ``` 145 | 146 | ## Reading Files 147 | 148 | We can simplify by making a little function that processes each of the reports as it's read. Then, use `map()` with `read_excel()` and then our function, then bind it together! 149 | 150 | How do I get `df[1,3]`, etc.? Because I look straight at the files and check where the data I want is, so I can pull it and put it where I want it! 151 | 152 | ```{r, echo = TRUE, eval = FALSE} 153 | process_file <- function(df) { 154 | sales <- df[1,3] 155 | employee <- df[42,2] 156 | return(tibble(sales = sales, employee = employee)) 157 | } 158 | 159 | compiled_data <- filelist %>% 160 | map(read_excel) %>% 161 | map(process_file) %>% 162 | bind_rows() 163 | ``` 164 | 165 | # From Data to Tidy Data 166 | 167 | ## From Data to Tidy Data 168 | 169 | - **Data** is any time you have your records stored in some structured format 170 | - But there are many such structures! They could be across a bunch of different tables, or perhaps a spreadsheet with different variables stored randomly in different areas, or one table per observation 171 | - These structures can be great for *looking up values*. That's why they are often used in business or other settings where you say "I wonder what the value of X is for person/day/etc. N" 172 | - They're rarely good for *doing analysis* (calculating statistics, fitting models, making visualizations) 173 | - For that, we will aim to get ourselves *tidy data* (see [this walkthrough](https://tidyr.tidyverse.org/articles/tidy-data.html) ) 174 | 175 | ## Tidy Data 176 | 177 | In tidy data: 178 | 179 | 1. Each variable forms a column 180 | 1. Each observation forms a row 181 | 1. Each type of observational unit forms a table 182 | 183 | ```{r} 184 | df <- data.frame(Country = c('Argentina','Belize','China'), TradeImbalance = c(-10, 35.33, 5613.32), PopulationM = c(45.3, .4, 1441.5)) 185 | datatable(df) 186 | ``` 187 | 188 | ## Tidy Data 189 | 190 | The variables in tidy data come in two types: 191 | 192 | 1. *Identifying Variables*/*Keys* are the columns you'd look at to locate a particular observation. 193 | 1. *Measures*/*Values* are the actual data. 194 | 195 | Which are they in this data? 196 | 197 | ```{r} 198 | df <- data.frame(Person = c('Chidi','Chidi','Eleanor','Eleanor'), Year = c(2017, 2018, 2017, 2018), Points = c(14321,83325, 6351, 63245), ShrimpConsumption = c(0,13, 238, 172)) 199 | datatable(df) 200 | ``` 201 | ## Tidy Data 202 | 203 | - *Person* and *Year* are our identifying variables. The combination of person and year *uniquely identifies* a row in the data. Our "observation level" is person and year. There's only one row with Person == "Chidi" and Year == 2018 204 | - *Points* and *ShrimpConsumption* are our measures. They are the things we have measured for each of our observations 205 | - Notice how there's one row per observation (combination of Person and Year), and one column per variable 206 | - Also this table contains only variables that are at the Person-Year observation level. Variables at a different level (perhaps things that vary between Person but don't change over Year) would go in a different table, although this last one is less important 207 | 208 | ## Tidying Non-Tidy Data 209 | 210 | - So what might data look like when it's *not* like this, and how can we get it this way? 211 | - Here's one common example, a *count table* (not tidy!) where each column is a *value*, not a *variable* 212 | 213 | ```{r} 214 | data("relig_income") 215 | datatable(relig_income) 216 | ``` 217 | 218 | ## Tidying Non-tidy Data 219 | 220 | - Here's another, where the "chart position" variable is split across 52 columns, one for each week 221 | 222 | ```{r} 223 | data("billboard") 224 | datatable(billboard) 225 | ``` 226 | 227 | 228 | 229 | ## Tidying Non-Tidy Data 230 | 231 | - The first big tool in our tidying toolbox is the *pivot* 232 | - A pivot takes a single row with K columns and turns it into K rows with 1 column, using the identifying variables/keys to keep things lined up. 233 | - This can also be referred to as going from "wide" data to "long" data 234 | - Long to wide is also an option 235 | - In every statistics package, pivot functions are notoriously fiddly. Always read the help file, and do trial-and-error! Make sure it worked as intended. 236 | 237 | ## Tidying Non-Tidy Data 238 | 239 | Check our steps! 240 | 241 | - We looked at the data 242 | - Think about how we want the data to look - one row per (keys) artist, track, and week, and a column for the chart position of that artist/track in that week, and the date entered for that artist/track (value) 243 | - How can we carry information from where it is to where we want it to be? With a pivot! 244 | - And afterwards we'll look at the result (and, likely, go back and fix our pivot code - the person who gets a pivot right the first try is a mysterious genius) 245 | 246 | ## Pivot 247 | 248 | - In the **tidyverse** we have the functions `pivot_longer()` and `pivot_wider()`. Here we want wide-to-long so we use `pivot_longer()` 249 | - This asks for: 250 | - `data` (the data set you're working with, also the first argument so we can pipe to it) 251 | - `cols` (the columns to pivot) - it will assume anything not named here are the keys 252 | - `names_to` (the name of the variable to store which column a given row came from, here "week") 253 | - `values_to` (the name of the vairable to store the value in) 254 | - Many other options (see `help(pivot_longer)`) 255 | 256 | ## Pivot 257 | 258 | ```{r, echo = TRUE, eval = FALSE} 259 | billboard %>% 260 | pivot_longer(cols = starts_with('wk'), # tidyselect functions help us pick columns based on name patterns 261 | names_to = 'week', 262 | names_prefix = 'wk', # Remove the "wk" at the start of the column names 263 | values_to = 'chart_position', 264 | values_drop_na = TRUE) # Drop any key combination with a missing value 265 | 266 | ``` 267 | 268 | ```{r} 269 | pivot_longer(billboard, 270 | cols = starts_with('wk'), # tidyselect functions help us pick columns based on name patterns 271 | names_to = 'week', 272 | names_prefix = 'wk', # Remove the "wk" at the start of the column names 273 | values_to = 'chart_position', 274 | values_drop_na = TRUE) %>% 275 | datatable() 276 | 277 | ``` 278 | 279 | ## Variables Stored as Rows 280 | 281 | - Here we have tax form data where each variable is a row, but we have multiple tables For this one we can use `pivot_wider()`, and then combine multiple individuals with `bind_rows()` 282 | 283 | ```{r} 284 | taxdata <- data.frame(TaxFormRow = c('Person','Income','Deductible','AGI'), Value = c('James Acaster',112341, 24000, 88341)) 285 | taxdata2 <- data.frame(TaxFormRow = c('Person','Income','Deductible','AGI'), Value = c('Eddie Izzard',325122, 16000,325122 - 16000)) 286 | datatable(taxdata) 287 | ``` 288 | 289 | ## Variables Stored as Rows 290 | 291 | - `pivot_wider()` needs: 292 | - `data` (first argument, the data we're working with) 293 | - `id_cols` (the columns that give us the key - what should it be here?) 294 | - `names_from` (the column containing what will be the new variable names) 295 | - `values_from` (the column containing the new values) 296 | - Many others! See `help(pivot_wider)` 297 | 298 | ## Variables Stored as Rows 299 | 300 | ```{r, echo = TRUE} 301 | taxdata %>% 302 | pivot_wider(names_from = 'TaxFormRow', 303 | values_from = 'Value') 304 | ``` 305 | 306 | (note that the variables are all stored as character variables not numbers - that's because the "person" row is a character, which forced the rest to be too. we'll go through how to fix that later) 307 | 308 | ## Variables Stored as Rows 309 | 310 | We can use `bind_rows()` to stack data sets with the same variables together, handy for compiling data from different sources 311 | 312 | ```{r} 313 | taxdata %>% 314 | pivot_wider(names_from = 'TaxFormRow', 315 | values_from = 'Value') %>% 316 | bind_rows(taxdata2 %>% 317 | pivot_wider(names_from = 'TaxFormRow', 318 | values_from = 'Value')) 319 | ``` 320 | 321 | ## Merging Data 322 | 323 | - Commonly, you will need to link two datasets together based on some shared keys 324 | - For example, if one dataset has the variables "Person", "Year", and "Income" and the other has "Person" and "Birthplace" 325 | 326 | ```{r} 327 | person_year_data <- data.frame(Person = c('Ramesh','Ramesh','Whitney', 'Whitney','David','David'), Year = c(2014, 2015, 2014, 2015,2014,2015), Income = c(81314,82155,131292,141262,102452,105133)) 328 | person_data <- data.frame(Person = c('Ramesh','Whitney'), Birthplace = c('Crawley','Washington D.C.')) 329 | datatable(person_year_data) 330 | ``` 331 | 332 | ## Merging Data 333 | 334 | That was `person_year_data`. And now for `person_data`: 335 | 336 | ```{r} 337 | datatable(person_data) 338 | ``` 339 | 340 | ## Merging Data 341 | 342 | - The **dplyr** `join` family of functions will do this (see `help(join)`). The different varieties just determine what to do with rows you *don't* find a match for. `left_join()` keeps non-matching rows from the first dataset but not the second, `right_join()` from the second not the first, `full_join()` from both, `inner_join()` from neither, and `anti_join()` JUST keeps non-matches 343 | 344 | ## Merging Data 345 | 346 | ```{r, echo = TRUE} 347 | person_year_data %>% 348 | left_join(person_data, by = 'Person') 349 | ``` 350 | 351 | ```{r, echo = TRUE} 352 | person_year_data %>% 353 | right_join(person_data, by = 'Person') 354 | ``` 355 | 356 | ## Merging Data 357 | 358 | - Things work great if the list of variables in `by` is the exact observation level in *at least one* of the two data sets 359 | - But if there are multiple observations per combination of `by` variables in *both*, that's a problem! It will create all the potential matches, which may not be what you want: 360 | 361 | ```{r, echo = TRUE} 362 | a <- tibble(Name = c('A','A','B','C'), Year = c(2014, 2015, 2014, 2014), Value = 1:4) 363 | b <- tibble(Name = c('A','A','B','C','C'), Characteristic = c('Up','Down','Up','Left','Right')) 364 | a %>% left_join(b, by = 'Name') 365 | 366 | ``` 367 | 368 | ## Merging Data 369 | 370 | - This is why it's *super important* to always know the observation level of your data. You can check it by seeing if there are any duplicate rows among what you *think* are your key variables: if we think that `Person` is a key for data set `a`, then `a %>% select(Person) %>% duplicated() %>% max()` will return `TRUE`, showing us we're wrong 371 | - At that point you can figure out how you want to proceed - drop observations so it's the observation level in one? Accept the multi-match? Pick only one of the multi-matches? 372 | 373 | ## Merging Data: Other Packages 374 | 375 | - Or you can use `safe_join()` in the **pmdplyr** package, which will check for you that you're doing the kind of merge you think you're doing. 376 | - **pmdplyr** also contains the `inexact_join()` family of functions which can help join data sets that don't line up exactly, like if you want to match on time, but on the *most recent* match, not an exact match. The **fuzzyjoin** package has similar functions for matching inexactly for text variables 377 | 378 | # From Tidy Data to Your Analysis 379 | 380 | ## From Tidy Data to Your Analysis 381 | 382 | - Okay! We now have, hopefully, a nice tidy data set with one column per variable, one row per observation, we know what the observation level is! 383 | - That doesn't mean our data is ready to go! We likely have plenty of cleaning and manipulation to go before we are ready for analysis 384 | - We will be doing this mostly with **dplyr** 385 | 386 | ## dplyr 387 | 388 | - **dplyr** uses a *small set of "verbs"* to very flexibly do all kinds of data cleaning and manipulation 389 | - The primary verbs are: `filter(), select()`, `arrange()`, `mutate()`, `group_by()`, and `summarize()`. 390 | - Other important functions in **dplyr**: `pull()` (which we covered), `case_when()` 391 | - See the [dplyr cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) 392 | 393 | ## filter() 394 | 395 | - `filter()` limits the data to the observations that fulfill a certain *logical condition*. It *picks rows*. 396 | - For example, `Income > 100000` is `TRUE` for everyone with income above 100000, and `FALSE` otherwise. `filter(data, Income > 100000)` would return just the rows of `data` that have `Income > 100000` 397 | 398 | ```{r, echo = TRUE} 399 | person_year_data %>% 400 | left_join(person_data, by = 'Person') %>% 401 | filter(Income > 100000) 402 | ``` 403 | 404 | ## Logical Conditions 405 | 406 | - A lot of programming in general is based on writing logical conditions that check whether something is true 407 | - In R, if the condition is true, it returns `TRUE`, which turns into 1 if you do a calculation with it. If false, it returns `FALSE`, which turns into 0. (tip: `ifelse()` is rarely what you want, and `ifelse(condition, TRUE, FALSE)` is redundant) 408 | 409 | ## Logical Conditions Tips 410 | 411 | Handy tools for constructing logical conditions: 412 | 413 | `a > b`, `a >= b`, `a < b`, `a <= b`, `a == b`, or `a != b` to compare two numbers and check if `a` is above, above-or-equal, below, below-or-equal, equal (note `==` to check equality, not `=`), or not equal 414 | 415 | `a %in% c(b, c, d, e, f)` checks whether `a` is any of the values `b, c, d, e,` or `f`. Works for text too! 416 | 417 | ## Logical Conditions Tips 418 | 419 | Whatever your condition is (`condition`), just put a `!` ("not") in front to reverse `TRUE`/`FALSE`. `2 + 2 == 4` is `TRUE`, but `!(2 + 2 == 4)` is `FALSE` 420 | 421 | Chain multiple conditions together! `&` is "and", `|` is "or". Be careful with parentheses if combining them! In `filter` specifically, you can use `,` instead of `&`. 422 | 423 | ## select() 424 | 425 | - `select()` gives you back just a subset of the columns. It *picks columns* 426 | - It can do this by name or by column number 427 | - Use `-` to *not* pick certain columns 428 | 429 | If our data has the columns "Person", "Year", and "Income", then all of these do the same thing: 430 | 431 | ```{r, echo = TRUE} 432 | no_income <- person_year_data %>% select(Person, Year) 433 | no_income <- person_year_data %>% select(1:2) 434 | no_income <- person_year_data %>% select(-Income) 435 | print(no_income) 436 | ``` 437 | 438 | ## arrange() 439 | 440 | - `arrange()` sorts the data. That's it! Give it the column names and it will sort the data by those columns. 441 | - It's often a good idea to sort your data before saving it (or looking at it) as it makes it easier to navigate 442 | - There are also some data manipulation tricks that rely on the position of the data 443 | 444 | ```{r} 445 | person_year_data %>% 446 | arrange(Person, Year) 447 | ``` 448 | 449 | ## mutate() 450 | 451 | - `mutate()` *assigns columns/variables*, i.e. you can create variables with it (note also its sibling `transmute()` which does the same thing and then drops any variables you don't explicitly specify in the function) 452 | - You can assign multiple variables in the same `mutate()` call, separated by commas (`,`) 453 | 454 | ```{r, echo = TRUE} 455 | person_year_data %>% 456 | mutate(NextYear = Year + 1, 457 | Above100k = Income > 100000) 458 | ``` 459 | 460 | ## case_when() 461 | 462 | - A function that comes in handy a lot when using mutate to *create* a categorical variable is `case_when()`, which is sort of like `ifelse()` except it can cleanly handle way more than one condition 463 | - Provide `case_when()` with a series of `if ~ then` conditions, separated by commas, and it will go through the `if`s one by one for each observation until it finds a fitting one. 464 | - As soon as it finds one, it stops looking, so you can assume anyone that satisfied an earlier condition doesn't count any more. Also, you can have the last `if` be `TRUE` to give a value for anyone who hasn't been caught yet 465 | 466 | ## case_when() 467 | 468 | ```{r, echo = TRUE} 469 | person_year_data %>% 470 | mutate(IncomeBracket = case_when( 471 | Income <= 50000 ~ 'Under 50k', 472 | Income > 50000 & Income <= 100000 ~ '50-100k', 473 | Income > 100000 & Income < 120000 ~ '100-120k', 474 | TRUE ~ 'Above 120k' 475 | )) 476 | ``` 477 | 478 | ## case_when() 479 | 480 | - Note that the `then` doesn't have to be a value, it can be a calculation, for example 481 | 482 | ```{r, eval = FALSE, echo = TRUE} 483 | Inflation_Adjusted_Income = case_when(Year == 2014 ~ Income*1.001, Year == 2015 ~ Income) 484 | ``` 485 | 486 | - And you can use `case_when()` to change the values of just *some* of the observations. 487 | 488 | ```{r, eval = FALSE, echo = TRUE} 489 | mutate(Income = case_when(Person == 'David' ~ Income*1.34, TRUE ~ Income)) 490 | ``` 491 | 492 | - Note: if assigning some observations to be `NA`, you must use the type-appropriate `NA`. `NA_character_`, `NA_real_`, etc. 493 | 494 | 495 | ## group_by() 496 | 497 | - `group_by()` turns the dataset into a *grouped* data set, splitting each combination of the grouping variables 498 | - Calculations like `mutate()` or (up next) `summarize()` or (if you want to get fancy) `group_map()` then process the data separately by each group 499 | 500 | ```{r, echo = TRUE} 501 | person_year_data %>% group_by(Person) %>% 502 | mutate(Income_Relative_to_Mean = Income - mean(Income)) 503 | ``` 504 | 505 | ## group_by() 506 | 507 | - It will maintain this grouping structure until you re-`group_by()` it, or `ungroup()` it, or `summarize()` it (which removes one of the grouping variables) 508 | - How is this useful in preparing data? 509 | - Remember, we want to *look at where information is* and *think about how we can get it where we need it to be* 510 | - `group_by()` helps us move information *from one row to another in a key variable* - otherwise a difficult move! 511 | - It can also let us *change the observation level* with `summarize()` 512 | - Tip: `n()` gives the number of rows in the group - handy! and `row_number()` gives the row number within its group of that observation 513 | 514 | ## summarize() 515 | 516 | - `summarize()` *changes the observation level* to a broader level 517 | - It returns only *one row per group* (or one row total if the data is ungrouped) 518 | - So now your keys are whatever you gave to `group_by()` 519 | 520 | ```{r, echo = TRUE} 521 | person_year_data %>% 522 | group_by(Person) %>% 523 | summarize(Mean_Income = mean(Income), 524 | Years_Tracked = n()) 525 | ``` 526 | 527 | # Variable Types 528 | 529 | ## Manipulating Variables 530 | 531 | - Those are the base **dplyr** verbs we need to think about 532 | - They can be combined to do all sorts of things! 533 | - But important in using them is thinking about what kinds of variable manipulations we're doing 534 | - That will feed into our `mutate()`s and our `summarizes()` 535 | - A lot of data cleaning is making an already-tidy variable usable! 536 | 537 | ## Variable Types 538 | 539 | Common variable types: 540 | 541 | - Numeric 542 | - Character/string 543 | - Factor 544 | - Date 545 | 546 | ## Variable Types 547 | 548 | - You can check the types of your variables by printing a `tibble()`, or `is.` and then the type, or doing str(data) 549 | - You can generally convert between types using `as.` and then the type 550 | 551 | ```{r, echo = TRUE} 552 | taxdata %>% 553 | pivot_wider(names_from = 'TaxFormRow', 554 | values_from = 'Value') %>% 555 | mutate(Person = as.factor(Person), 556 | Income = as.numeric(Income), 557 | Deductible = as.numeric(Deductible), 558 | AGI = as.numeric(AGI)) 559 | ``` 560 | 561 | ## Numeric Notes 562 | 563 | - Numeric data actually comes in multiple formats based on the level of acceptable precision: `integer`, `double`, and so on 564 | - Often you won't have to worry about this - R will just make the data whatever numeric type makes sense at the time 565 | - But a common problem is that reading in very big integers (like ID numbers) will sometimes create `double`s that are stored in scientific notation - lumping multiple groups together! Avoid this with options like `col_types` in your data-reading function 566 | 567 | ## Character/string 568 | 569 | - Specified with `''` or `""` 570 | - Use `paste0()` to stick stuff together! `paste0('h','ello', sep = '_')` is ''h_ello'` 571 | - Messy data often defaults to character. For example, a "1,000,000" in your Excel sheet might not be parsed as `1000000` but instead as a literal "1,000,000" with commas 572 | - Lots of details on working with these - back to them in a moment 573 | 574 | ## Factors 575 | 576 | - Factors are for categorical data - you're in one category or another 577 | - The `factor()` function lets you specify these `labels`, and also specify the `levels` they go in - factors can be ordered! 578 | 579 | ```{r, echo = TRUE} 580 | tibble(Income = c('50k-100k','Less than 50k', '50k-100k', '100k+', '100k+')) %>% 581 | mutate(Income = factor(Income, levels = c('Less than 50k','50k-100k','100k+'))) %>% 582 | arrange(Income) 583 | ``` 584 | ## Dates 585 | 586 | - Dates are the scourge of data cleaners everywhere. They're just plain hard to work with! 587 | - There are Date variables, Datetime variables, both of multiple different formats... eugh! 588 | - I won't go into detail here, but I strongly recommend using the **lubridate** package whenever working with dates. See the [cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/lubridate.pdf) 589 | 590 | ## Characters/strings 591 | 592 | - Back to strings! 593 | - Even if your data isn't textual, working with strings is a very common aspect of preparing data for analysis 594 | - Some are straightforward, for example using `mutate()` and `case_when()` to fix typos/misspellings in the data 595 | - But other common tasks in data cleaning include: getting substrings, splitting strings, cleaning strings, and detecting patterns in strings 596 | - For this we will be using the **stringr** package in **tidyverse**, see the [cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf) 597 | 598 | ## Getting Substrings 599 | 600 | - When working with things like nested IDs (for example, NAICS codes are six digits, but the first two and first four digits have their own meaning), you will commonly want to pick just a certain range of characters 601 | - `str_sub(string, start, end)` will do this. `str_sub('hello', 2, 4)` is `'ell'` 602 | - Note negative values read from end of string. `str_sub('hello', -1)` is `'o'` 603 | 604 | ## Getting Substrings 605 | 606 | - For example, geographic Census Block Group indicators are 13 digits, the first two of which are the state FIPS code 607 | 608 | ```{r, echo = TRUE} 609 | tibble(cbg = c(0152371824231, 1031562977281)) %>% 610 | mutate(cbg = as.character(cbg)) %>% # Make it a string to work with 611 | mutate(state_fips = case_when( 612 | nchar(cbg) == 12 ~ str_sub(cbg, 1, 1), # Leading zeroes! 613 | nchar(cbg) == 13 ~ str_sub(cbg, 1, 2) 614 | )) 615 | ``` 616 | 617 | ## Strings 618 | 619 | - **Lots** of data will try to stick multiple pieces of information in a single cell, so you need to split it out! 620 | - Generically, `str_split()` will do this. `str_split('a,b', ',')[[1]]` is `c('a','b')` 621 | - Often in already-tidy data, you want `separate()` from **tidyr**. Make sure you list enough new `into` columns to get everything! 622 | 623 | ```{r, echo = TRUE} 624 | tibble(category = c('Sales,Marketing','H&R,Marketing')) %>% 625 | separate(category, into = c('Category1', 'Category2'), ',') 626 | ``` 627 | 628 | ## Cleaning Strings 629 | 630 | - Strings sometimes come with unwelcome extras! Garbage or extra whitespace at the beginning or end, or badly-used characters 631 | - `str_trim()` removes beginning/end whitespace, `str_squish()` removes additional whitespace from the middle too. `str_trim(' hi hello ')` is `'hi hello'`. 632 | - `str_replace_all()` is often handy for eliminating (or fixing) unwanted characters 633 | 634 | ```{r, echo = TRUE} 635 | tibble(number = c('1,000', '2,003,124')) %>% 636 | mutate(number = number %>% str_replace_all(',', '') %>% as.numeric()) 637 | ``` 638 | 639 | ## Detecting Patterns in Strings 640 | 641 | - Often we want to do something a bit more complex. Unfortunately, this requires we dip our toes into the bottomless well that is *regular expressions* 642 | - Regular expressions are ways of describing patterns in strings so that the computer can recognize them. Technically this is what we did with `str_replace_all(',','')` - `','` is a regular expression saying "look for a comma" 643 | - There are a *lot* of options here. See the [guide](https://stringr.tidyverse.org/articles/regular-expressions.html) 644 | - Common: `[0-9]` to look for a digit, `[a-zA-Z]` for letters, `*` to repeat until you see the next thing... hard to condense here. Read the guide. 645 | 646 | ## Detecting Patterns in Strings 647 | 648 | - For example, some companies are publicly listed and we want to indicate that but not keep the ticker. `separate()` won't do it here, not easily! 649 | - On the next page we'll use the regular expression `'\$[A-Z].*\$'` 650 | - `'\$[A-Z].*\$'` says "look for a (" (note the `\\` to treat the usually-special ( character as an actual character), then "Look for a capital letter `[A-Z]`", then "keep looking for capital letters `.*`", then "look for a )" 651 | 652 | ## Detecting Patterns in Strings 653 | 654 | ```{r, echo = TRUE} 655 | tibble(name = c('Amazon (AMZN) Holdings','Cargill Corp. (cool place!)')) %>% 656 | mutate(publicly_listed = str_detect(name, '\$[A-Z].*\$'), 657 | name = str_replace_all(name, '\$[A-Z].*\$', '')) 658 | ``` 659 | 660 | # Using Data Structure 661 | 662 | ## Using Data Structure 663 | 664 | - One of the core steps of data wrangling we discussed is thinking about how to get information from where it is now to where you want it 665 | - A tough thing about tidy data is that it can be a little tricky to move data *into different rows than it currently is* 666 | - This is often necessary when `summarize()`ing, or when doing things like "calculate growth from an initial value" 667 | - But we can solve this with the use of *arrange()* along with other-row-referencing functions like `first()`, `last()`, and `lag()` 668 | 669 | ## Using Data Structure 670 | 671 | - `first()` and `last()` refer to the first and last row, naturally 672 | 673 | ```{r, echo = TRUE} 674 | stockdata <- tibble(ticker = c('AMZN','AMZN', 'AMZN', 'WMT', 'WMT','WMT'), 675 | date = as.Date(rep(c('2020-03-04','2020-03-05','2020-03-06'), 2)), 676 | stock_price = c(103,103.4,107,85.2, 86.3, 85.6)) 677 | stockdata %>% 678 | arrange(ticker, date) %>% 679 | group_by(ticker) %>% 680 | mutate(price_growth_since_march_4 = stock_price/first(stock_price) - 1) 681 | ``` 682 | 683 | ## Using Data Structure 684 | 685 | - `lag()` looks to the row a certain number above/below this one, based on the `n` argument 686 | - Careful! Despite the name, `dplyr::lag()` doesn't care about *time* structure, it only cares about *data* structure. If you want daily growth but the row above is last year, too bad! 687 | 688 | ## Using Data Structure 689 | 690 | ```{r, echo = TRUE} 691 | stockdata %>% 692 | arrange(ticker, date) %>% 693 | group_by(ticker) %>% 694 | mutate(daily_price_growth = stock_price/lag(stock_price, 1) - 1) 695 | ``` 696 | 697 | 698 | ## Trickier Stuff 699 | 700 | - Sometimes the kind of data you want to move from one row to another is more complex! 701 | - You can use `first()/last()` to get stuff that might not normally be first or last with things like `arrange(ticker, -(date == as.Date('2020-03-05')))` 702 | - For even more complex stuff, I often find it useful to use `case_when()` to create a new variable that only picks data from the rows you want, then a `group_by()` and `mutate()` to spread the data from those rows across the other rows in the group 703 | 704 | ## Trickier Stuff 705 | 706 | ```{r, echo = TRUE} 707 | tibble(person = c('Adam','James','Diego','Beth','Francis','Qian','Ryan','Selma'), 708 | school_grade = c(6,7,7,8,6,7,8,8), 709 | subject = c('Math','Math','English','Science','English','Science','Math','PE'), 710 | test_score = c(80,84,67,87,55,75,85,70)) %>% 711 | mutate(Math_Scores = case_when(subject == 'Math' ~ test_score, 712 | TRUE ~ NA_real_)) %>% 713 | group_by(school_grade) %>% 714 | mutate(Math_Average_In_This_Grade = mean(Math_Scores, na.rm = TRUE)) %>% 715 | select(-Math_Scores) 716 | 717 | ``` 718 | 719 | # Automation 720 | 721 | ## Automation 722 | 723 | - Data cleaning is often very repetitive 724 | - You shouldn't let it be! 725 | - Not just to save yourself work and tedium, but also because standardizing your process so you only have to write the code *once* both reduces errors and means that if you have to change something you only have to change it once 726 | - So let's automate! Three ways we'll do it here: `across()`, writing functions, and **purrr** 727 | 728 | ## across() 729 | 730 | - If you have a lot of variables, cleaning them all can be a pain. Who wants to write out the same thing a million times, say to convert all those read-in-as-text variables to numeric? 731 | - Old versions of **dplyr** used "scoped" variants like `mutate_at()` or `mutate_if()`. As of **dplyr 1.0.0**, these have been deprecated in favor of `across()` 732 | - `across()` lets you use all the variable-selection tricks available in `select()`, like `starts_with()` or `a:z` or `1:5`, but then lets you apply functions to each of them in `mutate()` or `summarize()` 733 | - similarly `rowwise()` and `c_across()` lets you do stuff like "add up a bunch of columns" 734 | 735 | ## across() 736 | 737 | - `starts_with('price_growth')` is the same here as `4:5` or `c(price_growth_since_march_4, price_growth_daily)` 738 | 739 | ```{r, echo = TRUE} 740 | stockgrowth <- stockdata %>% 741 | arrange(ticker, date) %>% 742 | group_by(ticker) %>% 743 | mutate(price_growth_since_march_4 = stock_price/first(stock_price) - 1, 744 | price_growth_daily = stock_price/lag(stock_price, 1) - 1) 745 | stockgrowth %>% 746 | mutate(across(starts_with('price_growth'), function(x) x*10000)) # Convert to basis points 747 | ``` 748 | 749 | ## across() 750 | 751 | - That version replaced the original values, but you can have it create new ones with `.names` 752 | - Also, you can use a `list()` of functions instead of just one to do multiple calculations at the same time 753 | 754 | ## across() 755 | 756 | ```{r, echo = TRUE} 757 | stockgrowth %>% 758 | mutate(across(starts_with('price_growth'), 759 | list(bps = function(x) x*10000, 760 | pct = function(x) x*100), 761 | .names = "{.col}_{.fn}")) %>% 762 | select(ticker, starts_with('price_growth_daily')) %>% datatable() 763 | ``` 764 | 765 | ## across() 766 | 767 | - Another common issue is wanting to apply the same transformation to all variables of the same type 768 | - For example, converting all characters to factors, or converting a bunch of dollar values to pounds 769 | - Use `where(is.type)` for this 770 | 771 | ## across() 772 | 773 | ```{r, echo = TRUE} 774 | stockdata %>% 775 | mutate(across(where(is.numeric), list(stock_price_pounds = function(x) x/1.36))) 776 | ``` 777 | 778 | ## rowwise() and c_across() 779 | 780 | - A lot of business data especially might record values in a bunch of categories, each category in its own column, but not report the total 781 | - This is annoying! Fix with `rowwise()` and `c_across()` 782 | 783 | ## rowwise() and c_across() 784 | 785 | ```{r, echo = TRUE} 786 | tibble(year = c(1994, 1995, 1996), sales = c(104, 106, 109), marketing = c(100, 200, 174), rnd = c(423,123,111)) %>% 787 | rowwise() %>% 788 | mutate(total_spending = sum(c_across(sales:rnd))) %>% 789 | mutate(across(sales:rnd, function(x) x/total_spending, .names = '{.col}_pct')) 790 | ``` 791 | 792 | ## Writing Functions 793 | 794 | - We've already done a bit of function-writing here, in the file read-in and with `across()` 795 | - Generally, **if you're going to do the same thing more than once, you're probably better off writing a function** 796 | - Reduces errors, saves time, makes code reusable later! 797 | 798 | ```{r, echo = TRUE, eval = FALSE} 799 | function_name <- function(argument1 = default1, argument2 = default2, etc.) { 800 | some code 801 | result <- more code 802 | return(result) 803 | # (or just do result by itself - the last object printed will be automatically returned if there's no return()) 804 | } 805 | ``` 806 | 807 | ## Function-writing tips 808 | 809 | - Make sure to think about what kind of values your function accepts and make sure that what it returns is consistent so you know what you're getting 810 | - This is a really deep topic to cover in two slides, and mostly I just want to poke you and encourage you to do it. At least, if you find yourself doing something a bunch of times in a row, just take the code, stick it inside a `function()` wrapper, and instead use a bunch of calls to that function in a row 811 | - More information [here](https://www.r-bloggers.com/2019/07/writing-functions-in-r-example-one/). 812 | 813 | ## Unnamed Functions 814 | 815 | - There are other ways to do functions in R: *unnamed functions* 816 | - Notice how in the `across()` examples I didn't have to do `bps <- function(x) x*10000`, I just did `function(x) x*10000`? That's an "unnamed function" 817 | - If your function is very small like this and you're only going to use it once, it's great for that! 818 | - In R 4.1, you will be able to just do `\(x)` instead of `function(x)` 819 | 820 | ## purrr 821 | 822 | - One good way to apply functions iteratively (yours or not) is with the `map()` functions in **purrr** 823 | - We already did this to read in files, but it applies much more broadly! `map()` usually generates a `list()`, `map_dbl()` a numeric vector, `map_chr()` a character vector, `map_df()` a `tibble()`... 824 | - It iterates through a `list`, `data.frame/tibble` (which are technically `list`s, or `vector`, and then applies a function to each of the elements 825 | 826 | ```{r, echo = TRUE} 827 | person_year_data %>% 828 | map_chr(class) 829 | ``` 830 | ## purrr 831 | 832 | - Obviously handy for processing many files, as in our reading-in-files example 833 | - Or looping more generally for diagnostic or wrangling purposes. Perhaps you have a `summary_profile()` function you've made, and want to check each state's data to see if its data looks right. You could do 834 | 835 | ```{r, echo = TRUE, eval = FALSE} 836 | data %>% pull(state) %>% unique() %>% map(summary_profile) 837 | ``` 838 | 839 | - You can use it generally in place of a `for()` loop 840 | - See the [purrr cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf) 841 | 842 | # Finishing Up, and an Example! 843 | 844 | ## Some Final Notes 845 | 846 | - We can't possibly cover everything. So one last note, about saving your data! 847 | - What to do when you're done and want to save your processed data? 848 | - Saving data in R format: `save()` saves many objects, which are all put back in the environment with `load()`. Often preferable is `saveRDS()` which saves a single `data.frame()` in compressed format, loadable with `df <- readRDS()` 849 | - Saving data for sharing: `write_csv()` makes a CSV. Yay! 850 | 851 | ## Some Final Notes 852 | 853 | - Also, please, please, *please* **DOCUMENT YOUR DATA** 854 | - At the very least, keep a spreadsheet/\code{tibble} with a set of descriptions for each of your variables 855 | - Also look into the **sjlabelled** or **haven** packages to add variable labels directly to the data set itself 856 | - Once you have your variables labelled, `vtable()` in **vtable** can generate a documentation file for sharing 857 | 858 | ## A Walkthrough 859 | 860 | - Let's clean some data! -------------------------------------------------------------------------------- /Data_Wrangling_data_table.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Wrangling Faster and Bigger with data.table" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | 18 | ```{r setup, include=FALSE} 19 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 20 | library(tidyverse) 21 | library(data.table) 22 | library(DT) 23 | library(purrr) 24 | library(readxl) 25 | ``` 26 | 27 | ## Data Wrangling 28 | 29 | ```{r, results = 'asis'} 30 | cat(" 31 | ") 37 | ``` 38 | 39 | Welcome to the Data Wrangling Workshop! 40 | 41 | - The goal of data wrangling 42 | - How to think about data wrangling 43 | - Technical tips for data wrangling in R using the **data.table** package 44 | - A walkthrough example 45 | 46 | ## Limitations 47 | 48 | - If you already attended the **tidyverse** version of this workshop, there will be some overlap in content, although of course the technical details will be different. 49 | - We only have so much time! I won't be going into *great* detail on the use of all the technical commands, but by the end of this you will know what's out there and generally how it's used 50 | - *As with any computer skill, a teacher's comparative advantage is in letting you know what's out there. The* **real learning** *comes from practice and Googling. So take what you see here today, find yourself a project, and do it! It will be awful but you will learn an astounding amount by the end* 51 | 52 | ## data.table notes 53 | 54 | - `data.table` is a package for working with data 55 | - It is *extremely fast*. Its functions are much faster than comparable **tidyverse** functions, and for many purposes it is faster than **pandas** in Python as well. Julia outperforms it sometimes, but then you have to learn Julia 56 | - It's also great at handling big data. It's basically your technically-best option for working with data in-memory. Once you work from a database (i.e. SQL) you're in to something different 57 | 58 | ## Tidyverse notes 59 | 60 | - Throughout this talk I'll be using the pipe (`%>%`), which simply means "take whatever's on the left and make it the first argument of the thing on the right" 61 | - Very handy for chaining together operations and making code more readable. 62 | - This is more of a **tidyverse** tool but I like it for **data.table** too. It can be loaded from `library(magrittr)` or loading the **tidyverse** or one of many packages that come with the pipe. 63 | 64 | ## The pipe 65 | 66 | `scales::percent(mean(mtcars$am, na.rm = TRUE),` `accuracy = .1)` can be rewritten 67 | 68 | ```{r, eval = FALSE, echo = TRUE} 69 | mtcars %>% 70 | `[[`('am') %>% 71 | mean(na.rm = TRUE) %>% 72 | scales::percent(accuracy = .1) 73 | ``` 74 | 75 | - Like a conveyer belt! Nice and easy. Note that R 4.1 will switch to the use of `|>` for the pipe which won't require a package load. Also, you can chain `data.table()` operations by just using a bunch of `[]`s (we'll get to it) 76 | - `[[` is the "[[ function" - i.e. this is doing `mtcars[['am']]`, equivalent to `mtcars$am` 77 | 78 | 79 | ## Data Wrangling 80 | 81 | What is data wrangling? 82 | 83 | - You have data 84 | - It's not ready for you to run your model 85 | - You want to get it ready to run your model 86 | - Ta-da! 87 | 88 | ## The Core of Data Wrangling 89 | 90 | - Always **look directly at your data so you know what it looks like** 91 | - Always **think about what you want your data to look like when you're done** 92 | - Think about **how you can take information from where it is and put it where you want it to be** 93 | - After every step, **look directly at your data again to make sure it's doing what you think it's doing** 94 | 95 | I help a lot of people with their problems with data wrangling. Their issues are almost always *not doing one of these four things*, much more so than having trouble coding or anything like that 96 | 97 | ## The Core of Data Wrangling 98 | 99 | - How can you "look at your data"? 100 | - Literally is one way - click on the data set, or do `View()` to look at it 101 | - Summary statistics tables: `sumtable()` or `vtable(lush = TRUE)` in **vtable** for example 102 | - Checking what values it takes: `table()` or `summary()` on individual variables 103 | - Look for: What values are there, what the observations look like, presence of missing or unusable data, how the data is structured 104 | 105 | ## The Stages of Data Wrangling 106 | 107 | - From records to data 108 | - From data to tidy data 109 | - From tidy data to data for your analysis 110 | 111 | # From Records to Data 112 | 113 | ## From Records to Data 114 | 115 | Not something we'll be focusing on today! But any time the data isn't in a workable format, like a spreadsheet or database, someone's got to get it there! 116 | 117 | - "Google Trends has information on the popularity of our marketing terms, go get it!" 118 | - "Here's a 600-page unformatted PDF of our sales records for the past three years. Turn it into a database." 119 | - "Here are scans of the 15,000 handwritten doctor's notes at the hospital over the past year" 120 | - "Here's access to the website. The records are in there somewhere." 121 | - "Go do a survey" 122 | 123 | ## From Records to Data: Tips! 124 | 125 | - Do as little by hand as possible. It's a lot of work and you *will* make mistakes 126 | - *Look at the data* a lot! 127 | - Check for changes in formatting - it's common for things like "this enormous PDF of our tables" or "eight hundred text files with the different responses/orders" to change formatting halfway through 128 | - When working with something like a PDF or a bunch of text files, think "how can I tell a computer to spot where the actual data is?" 129 | - If push comes to shove, or if the data set is small enough, you can do by-hand data entry. Be very careful! 130 | 131 | ## Reading Files 132 | 133 | One common thing you run across is data split into multiple files. How can we read these in and compile them? 134 | 135 | - `list.files()` produces a vector of filenames (tip: `full.names = TRUE` gives full filepaths) 136 | - Use `map()` from **purrr** to iterate over that vector and read in the data. This gives a list of `tibble`s (`data.frame`s) read in 137 | - Create your own function to process each, use `map` with that too (if you want some processing before you combine) 138 | - Turn each to a `data.table` (if it isn't already, say from `fread()` with `map(as.data.table)`) 139 | - Combine the results with `rbindlist()`! 140 | 141 | ## Reading Files 142 | 143 | For example, imagine you have 200 monthly sales reports in Excel files. You just want to pull cell C2 (total sales) and cell B43 (employee of the month) and combine them together. 144 | 145 | ```{r, echo = TRUE, eval = FALSE} 146 | # For reading Excel 147 | library(readxl) 148 | # For map 149 | library(purrr) 150 | 151 | # Get the list of 200 reports 152 | filelist <- list.files(path = '../Monthly_reports/', pattern = 'sales', full.names = TRUE) 153 | ``` 154 | 155 | ## Reading Files 156 | 157 | We can simplify by making a little function that processes each of the reports as it's read. Then, use `map()` with `read_excel()` and then our function, then bind it together! 158 | 159 | How do I get `df[1,3]`, etc.? Because I look straight at the files and check where the data I want is, so I can pull it and put it where I want it! 160 | 161 | ```{r, echo = TRUE, eval = FALSE} 162 | process_file <- function(df) { 163 | sales <- df[1,3] 164 | employee <- df[42,2] 165 | return(data.table(sales = sales, employee = employee)) 166 | } 167 | 168 | compiled_data <- filelist %>% 169 | map(read_excel) %>% 170 | map(process_file) %>% 171 | rbindlist() 172 | ``` 173 | 174 | # From Data to Tidy Data 175 | 176 | ## From Data to Tidy Data 177 | 178 | - **Data** is any time you have your records stored in some structured format 179 | - But there are many such structures! They could be across a bunch of different tables, or perhaps a spreadsheet with different variables stored randomly in different areas, or one table per observation 180 | - These structures can be great for *looking up values*. That's why they are often used in business or other settings where you say "I wonder what the value of X is for person/day/etc. N" 181 | - They're rarely good for *doing analysis* (calculating statistics, fitting models, making visualizations) 182 | - For that, we will aim to get ourselves *tidy data* (see [this walkthrough](https://tidyr.tidyverse.org/articles/tidy-data.html) ) 183 | 184 | ## Tidy Data 185 | 186 | In tidy data: 187 | 188 | 1. Each variable forms a column 189 | 1. Each observation forms a row 190 | 1. Each type of observational unit forms a table 191 | 192 | ```{r} 193 | df <- data.table(Country = c('Argentina','Belize','China'), TradeImbalance = c(-10, 35.33, 5613.32), PopulationM = c(45.3, .4, 1441.5)) 194 | datatable(df) 195 | ``` 196 | 197 | ## Tidy Data 198 | 199 | The variables in tidy data come in two types: 200 | 201 | 1. *Identifying Variables*/*Keys* are the columns you'd look at to locate a particular observation. 202 | 1. *Measures*/*Values* are the actual data. 203 | 204 | Which are they in this data? 205 | 206 | ```{r} 207 | df <- data.table(Person = c('Chidi','Chidi','Eleanor','Eleanor'), Year = c(2017, 2018, 2017, 2018), Points = c(14321,83325, 6351, 63245), ShrimpConsumption = c(0,13, 238, 172)) 208 | datatable(df) 209 | ``` 210 | ## Tidy Data 211 | 212 | - *Person* and *Year* are our identifying variables. The combination of person and year *uniquely identifies* a row in the data. Our "observation level" is person and year. There's only one row with Person == "Chidi" and Year == 2018 213 | - *Points* and *ShrimpConsumption* are our measures. They are the things we have measured for each of our observations 214 | - Notice how there's one row per observation (combination of Person and Year), and one column per variable 215 | - Also this table contains only variables that are at the Person-Year observation level. Variables at a different level (perhaps things that vary between Person but don't change over Year) would go in a different table, although this last one is less important 216 | 217 | ## Tidying Non-Tidy Data 218 | 219 | - So what might data look like when it's *not* like this, and how can we get it this way? 220 | - Here's one common example, a *count table* (not tidy!) where each column is a *value*, not a *variable* 221 | 222 | ```{r} 223 | data("relig_income") 224 | datatable(relig_income) 225 | ``` 226 | 227 | ## Tidying Non-tidy Data 228 | 229 | - Here's another, where the "chart position" variable is split across 52 columns, one for each week 230 | 231 | ```{r} 232 | data("billboard") 233 | datatable(billboard) 234 | billboard <- as.data.table(billboard) 235 | ``` 236 | 237 | 238 | 239 | ## Tidying Non-Tidy Data 240 | 241 | - The first big tool in our tidying toolbox is the *pivot*, which in **data.table** is the function pair `melt()` and `dcast()` 242 | - A pivot takes a single row with K columns and turns it into K rows with 1 column, using the identifying variables/keys to keep things lined up. 243 | - This can also be referred to as going from "wide" data to "long" data (`melt`) 244 | - Long to wide is also an option (`dcast`) 245 | - In every statistics package, pivot functions are notoriously fiddly. Always read the help file, and do trial-and-error! Make sure it worked as intended. 246 | 247 | ## Tidying Non-Tidy Data 248 | 249 | Check our steps! 250 | 251 | - We looked at the data 252 | - Think about how we want the data to look - one row per (keys) artist, track, and week, and a column for the chart position of that artist/track in that week, and the date entered for that artist/track (value) 253 | - How can we carry information from where it is to where we want it to be? With a pivot! 254 | - And afterwards we'll look at the result (and, likely, go back and fix our pivot code - the person who gets a pivot right the first try is a mysterious genius) 255 | 256 | ## Pivot 257 | 258 | - Here we want wide-to-long so we use `melt()` 259 | - This asks for: 260 | - `data` (the data set you're working with, also the first argument so we can pipe to it) 261 | - `id.vars` (a vector of identifying/key columns, either numbers for position or character for names) 262 | - `measure.vars` (the columns to pivot) - by default everything not in `id.vars` 263 | - `variable.name` (the name of the variable to store which column a given row came from, here "week") 264 | - `value.name` (the variable to store the value in) 265 | - Many other options (see `help(melt)`) 266 | 267 | ## Pivot 268 | 269 | ```{r, echo = TRUE, eval = FALSE} 270 | billboard %>% 271 | melt(measure.vars = patterns('^wk'), # patterns helps us pick columns based on name patterns 272 | variable.name = 'week', 273 | value.name = 'chart_position') 274 | 275 | ``` 276 | 277 | ```{r} 278 | billboard %>% 279 | melt(measure.vars = patterns('^wk'), # patterns helps us pick columns based on name patterns 280 | variable.name = 'week', 281 | value.name = 'chart_position') %>% 282 | datatable() 283 | 284 | ``` 285 | 286 | ## Variables Stored as Rows 287 | 288 | - Here we have tax form data where each variable is a row, but we have multiple tables For this one we can use `pivot_wider()`, and then combine multiple individuals with `bind_rows()` 289 | 290 | ```{r} 291 | taxdata <- data.table(TaxFormRow = c('Person','Income','Deductible','AGI'), Value = c('James Acaster',112341, 24000, 88341)) 292 | taxdata2 <- data.table(TaxFormRow = c('Person','Income','Deductible','AGI'), Value = c('Eddie Izzard',325122, 16000,325122 - 16000)) 293 | datatable(taxdata) 294 | ``` 295 | 296 | ## Variables Stored as Rows 297 | 298 | - `dcast()` needs: 299 | - `data` (first argument, the data we're working with) 300 | - `formula` (this tells us the observation level of the wide data and how it expands to the long data) 301 | - Many others! See `help(dcast)` 302 | - Here, the new observation level doesn't have a key (`.`), but the old one is `TaxFormRow` 303 | 304 | ## Variables Stored as Rows 305 | 306 | ```{r, echo = TRUE} 307 | taxdata %>% 308 | dcast(. ~ TaxFormRow) 309 | ``` 310 | 311 | (note that the variables are all stored as character variables not numbers - that's because the "person" row is a character, which forced the rest to be too. we'll go through how to fix that later) 312 | 313 | ## Variables Stored as Rows 314 | 315 | We can use `rbind()` to stack data sets with the same variables together, handy for compiling data from different sources (`rbindlist()` binds a `list` of `data.table`s) 316 | 317 | ```{r} 318 | taxdata %>% 319 | dcast(. ~ TaxFormRow) %>% 320 | rbind(taxdata2 %>% 321 | dcast(. ~ TaxFormRow)) 322 | ``` 323 | 324 | ## Merging Data 325 | 326 | - Commonly, you will need to link two datasets together based on some shared keys 327 | - For example, if one dataset has the variables "Person", "Year", and "Income" and the other has "Person" and "Birthplace" 328 | 329 | ```{r} 330 | person_year_data <- data.table(Person = c('Ramesh','Ramesh','Whitney', 'Whitney','David','David'), Year = c(2014, 2015, 2014, 2015,2014,2014), Income = c(81314,82155,131292,141262,102452,105133)) 331 | person_data <- data.table(Person = c('Ramesh','Whitney'), Birthplace = c('Crawley','Washington D.C.')) 332 | datatable(person_year_data) 333 | ``` 334 | 335 | ## Merging Data 336 | 337 | That was `person_year_data`. And now for `person_data`: 338 | 339 | ```{r} 340 | datatable(person_data) 341 | ``` 342 | 343 | ## Merging Data 344 | 345 | - The **data.table** `merge` function will do this (see `help(merge)`, making sure you get the **data.table** one instead of the base-R one). 346 | - The `by` option will specify the columns to merge on. The `all.x` and `all.y` options will specify whether to keep rows from the first and second data sets, respectively, that don't find a match 347 | 348 | ## Merging Data 349 | 350 | ```{r, echo = TRUE} 351 | person_year_data %>% 352 | merge(person_data, by = 'Person', all.x = TRUE) 353 | ``` 354 | 355 | ```{r, echo = TRUE} 356 | person_year_data %>% 357 | merge(person_data, by = 'Person', all.y = TRUE) 358 | ``` 359 | 360 | ## Merging Data 361 | 362 | - Things work great if the list of variables in `by` is the exact observation level in *at least one* of the two data sets 363 | - But if there are multiple observations per combination of `by` variables in *both*, that's a problem! It will create all the potential matches, which may not be what you want: 364 | 365 | ```{r, echo = TRUE} 366 | a <- data.table(Name = c('A','A','B','C'), Year = c(2014, 2015, 2014, 2014), Value = 1:4) 367 | b <- data.table(Name = c('A','A','B','C','C'), Characteristic = c('Up','Down','Up','Left','Right')) 368 | a %>% merge(b, by = 'Name') 369 | 370 | ``` 371 | 372 | ## Merging Data 373 | 374 | - This is why it's *super important* to always know the observation level of your data. You can check it by seeing if there are any duplicate rows among what you *think* are your key variables: if we think that `Person` is a key for data set `a`, then `a[, .(Person)] %>% duplicated() %>% max()` will return `TRUE`, showing us we're wrong 375 | - At that point you can figure out how you want to proceed - drop observations so it's the observation level in one? Accept the multi-match? Pick only one of the multi-matches? 376 | 377 | ## Merging data 378 | 379 | - Another way to merge `data.table`s is to use the special `data.table` syntax `DT1[DT2, on = .(keys)]`. 380 | - This approach is a little harder to work through syntax-wise, but it is (a little) faster 381 | - And lets you do neat tricks by matching in *non-exact* ways 382 | 383 | # From Tidy Data to Your Analysis 384 | 385 | ## From Tidy Data to Your Analysis 386 | 387 | - Okay! We now have, hopefully, a nice tidy data set with one column per variable, one row per observation, we know what the observation level is! 388 | - That doesn't mean our data is ready to go! We likely have plenty of cleaning and manipulation to go before we are ready for analysis 389 | 390 | ## Working with data.tables 391 | 392 | - `data.table()` syntax is extremely simple 393 | - `DT[filter, variable operations, grouping]` aka `DT[i, j, by]` 394 | - We can use this to do just about anything we like! 395 | - Thankfully, there are also some "wrapper" functions for this that can automate some operations, and other helper functions like `fcase()` 396 | - See [this data.table cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/datatable.pdf) or [this one](https://www.infoworld.com/article/3575086/the-ultimate-r-datatable-cheat-sheet.html) 397 | 398 | ## In-Place Manipulation 399 | 400 | - Normally in R, to change something, you must reassign it, which takes up space in memory `a <- a + 1` 401 | - Many `data.table` operations can be done directly at the existing memory location. **Blazing fast.** 402 | - For variable manipulation, you can do in-place with the walrus operator `:=` 403 | - Other in-place functions begin with `set` like `setnames()`, `setorder()`, `set()`, etc. 404 | - Note that these generally have versions like `setorderv()` that take column-name vectors instead of direct column names 405 | 406 | ## data.table oddities 407 | 408 | - Some things about `data.table`s clash with typical R practice. In-place manipulation is one of them 409 | - These changes "stick" even inside a function, and generally shouldn't be combined with pipes 410 | - And if you copy a `data.table`, say by `dt2 <- dt1`, the changes to `dt2` will also happen to `dt1`, unless you do `dt2 <- copy(dt1)` instead for a "deep copy" 411 | - Also note that putting a `data.table` by itself on a line won't print it - you need to add `[]` after 412 | - By the way feel free to chain multiple `[]`s together for chained operations! 413 | 414 | ## Filtering 415 | 416 | - The first argument limits the data to the observations that fulfill a certain *logical condition*. It *picks rows*. 417 | - For example, `Income > 100000` is `TRUE` for everyone with income above 100000, and `FALSE` otherwise. `data[Income > 100000]` would return just the rows of `data` that have `Income > 100000` 418 | 419 | ```{r, echo = TRUE} 420 | merge(person_year_data, person_data, by = 'Person')[Income > 10000] 421 | ``` 422 | 423 | ## Logical Conditions 424 | 425 | - A lot of programming in general is based on writing logical conditions that check whether something is true 426 | - In R, if the condition is true, it returns `TRUE`, which turns into 1 if you do a calculation with it. If false, it returns `FALSE`, which turns into 0. (tip: `ifelse()` is rarely what you want, and `ifelse(condition, TRUE, FALSE)` is redundant) 427 | - Also, if `ifelse` *is* what you want, it's not what you want! **data.table**'s `fifelse` is the same but faster 428 | 429 | ## Logical Conditions Tips 430 | 431 | Handy tools for constructing logical conditions: 432 | 433 | `a > b`, `a >= b`, `a < b`, `a <= b`, `a == b`, or `a != b` to compare two numbers and check if `a` is above, above-or-equal, below, below-or-equal, equal (note `==` to check equality, not `=`), or not equal 434 | 435 | `a %in% c(b, c, d, e, f)` checks whether `a` is any of the values `b, c, d, e,` or `f`. Works for text too! 436 | 437 | ## Logical Conditions Tips 438 | 439 | Whatever your condition is (`condition`), just put a `!` ("not") in front to reverse `TRUE`/`FALSE`. `2 + 2 == 4` is `TRUE`, but `!(2 + 2 == 4)` is `FALSE` 440 | 441 | Chain multiple conditions together! `&` is "and", `|` is "or". Be careful with parentheses if combining them! 442 | 443 | ## Column operations 444 | 445 | - Column operations in `data.table`s go in the second argument, i.e. the `j` in `DT[i,j,by]` 446 | - There are two main ways to do them! In-place and not-in-place. 447 | 448 | ## Column operations 449 | 450 | - Not-in-place operations are done by providing a list of variables, and, if you want to reassign them, what they will be equal to. All variables not mentioned in the list (or in `by`) are dropped 451 | - The `.()` function in **data.table** is a shortcut for `list()` 452 | - To store it, save over the `data.table` 453 | 454 | ```{r, echo = TRUE, eval = FALSE} 455 | mtcars <- as.data.table(mtcars) 456 | # Select just these columns 457 | mtcars[, .(mpg, hp)] 458 | mtcars[, c(1,4)] 459 | # Select just those columns and also add the ratio variable 460 | mtcars[, .(mpg, hp, ratio = mpg/hp)] 461 | ``` 462 | 463 | ## Column Operations 464 | 465 | - When to use a list (`.()`) and when to use `c()`? 466 | - Generally, use a list if you want to refer to columns by name directly ("unquoted") 467 | - And use `c()` to refer to column numbers, or to pass in names as strings (handy in programming!) 468 | - Sometimes if using a string variable in `j`, you'll also need the option `with = FALSE` 469 | 470 | ```{r, echo = TRUE, eval = FALSE} 471 | varnames <- c('mpg','hp') 472 | mtcars[, varnames, with = FALSE] 473 | mtcars[, c('mpg','hp')] 474 | ``` 475 | 476 | ## Column Operations 477 | 478 | - Note that these sorts of column operations are really about *what to calculate*. It just so happens that often we want to assign that calculation to a column. But sometimes we don't! 479 | - For example, instead of `mean(mtcars$hp)` we can do `mtcars[, mean(hp)]` 480 | - We can also pull a variable out of a data.table entirely with `mtcars[, hp]` (this gives back a numeric vector, not a one-column `data.table`! For a one-column `data.table` we'd do `mtcars[, .(hp)]`) 481 | - Mix n match! Guess what `mtcars[am == 1, mean(hp)]` does... you can start to see the appeal! 482 | 483 | ## In-Place Column Operations 484 | 485 | - Most of the time when it comes to creating variables you'll probably do in-place operations. This will just add/replace columns. Non-mentioned columns stay intact 486 | 487 | ```{r, echo = TRUE, eval = FALSE} 488 | # Create ratio variable 489 | mtcars[, ratio := mpg/hp] 490 | # Create two variables at once 491 | mtcars[, `:=`(ratio = mpg/hp, hp_square = hp^2)] 492 | # Drop a single variable by setting it to NULL 493 | mtcars[, am := NULL] 494 | ``` 495 | 496 | ## fcase() 497 | 498 | - A function that comes in handy a lot when using mutate to *create* a categorical variable is `fcase()`, which is sort of like `ifelse()` except it can cleanly handle way more than one condition 499 | - Provide `fcase()` with a series of `if, then` conditions, separated by commas, and it will go through the `if`s one by one for each observation until it finds a fitting one. 500 | - As soon as it finds one, it stops looking, so you can assume anyone that satisfied an earlier condition doesn't count any more. 501 | - Also note the `default` option for what to do with observations that are `FALSE` for everything else 502 | 503 | ## fcase() 504 | 505 | ```{r, echo = TRUE} 506 | person_year_data[, .(Income = Income, 507 | IncomeBracket = fcase( 508 | Income <= 50000, 'Under 50k', 509 | Income > 50000 & Income <= 100000, '50-100k', 510 | Income > 100000 & Income < 120000, '100-120k', 511 | default = 'Above 120k'))] 512 | ``` 513 | 514 | ## fcase() 515 | 516 | - Note that the `then` doesn't have to be a value, it can be a calculation, for example 517 | 518 | ```{r, eval = FALSE, echo = TRUE} 519 | person_year_data[, .(Income = Income, 520 | Year = Year, 521 | Inflation_Adjusted_Income = fcase( 522 | Year == 2014, Income*1.001, 523 | Year == 2015, Income))] 524 | ``` 525 | 526 | ## Changing Some Observations 527 | 528 | - A related problem to `fcase()` is when you want to make an adjustment to just *some* of the observations. Say we realized that David's income was reported in pounds, not dollars, so we need to adjust it. 529 | - For this, just apply both a filter and an in-place update. We could do 530 | 531 | ```{r, eval = FALSE, echo = TRUE} 532 | person_year_data[Person == 'David', Income := Income*1.34] 533 | ``` 534 | 535 | - If you attended the **tidyverse** version of this, you may recall this was a huge pain in the **tidyverse**. Easy here! 536 | 537 | 538 | ## Grouping 539 | 540 | - The third argument of `data.table` (`DT[i,j,by]`) performs the `j` operations by group, splitting each combination of the grouping variables 541 | - Can just give the variable (or condition!) by itself, or list multiples with `by = .(a, b)` or `by = c('a','b')` 542 | 543 | ```{r, echo = TRUE} 544 | person_year_data[, .(Income = Income, 545 | Income_Relative_to_Mean = Income - mean(Income)), 546 | by = Person] 547 | ``` 548 | 549 | ## Keys 550 | 551 | - You can, as mentioned, tell `data.table` to do a calculation by group by giving those groups to `by` 552 | - `data.table`s also have explicit *keys* - when the keys are set, the `data.table` is pre-sorted by those keys 553 | - Any grouping, merging, etc., by those keys becomes *insanely faster* 554 | - So if you're going to use the same grouping multiple times, *set the key!!* 555 | 556 | ```{r, echo = TRUE, eval = FALSE} 557 | setkey(person_year_data, Person) 558 | # Perform a by operation and set the key at the same time 559 | person_year_data[, Income_Relative_to_Mean := Income - mean(Income), keyby = Person] 560 | ``` 561 | 562 | ## Grouping 563 | 564 | - How is grouping useful in preparing data? 565 | - Remember, we want to *look at where information is* and *think about how we can get it where we need it to be* 566 | - Grouping helps us move information *from one row to another in a key variable* - otherwise a difficult move! 567 | - It can also let us *change the observation level* depending on our use of `j` 568 | - Tip: `.N` gives the number of rows in the group - handy! and `seq_len(.N)` gives the row number within its group of that observation 569 | 570 | ## Changing Observation Level 571 | 572 | - Using `:=` with `by` maintains the original observation level. But `.()` with `by` *changes the observation level* to the level implied by the functions you give it! 573 | - So give it a function returning one row per group and you get *one row per group* - your new observation level! 574 | 575 | ```{r, echo = TRUE} 576 | person_year_data[, .(Mean_Income = mean(Income), 577 | Years_Tracked = .N), 578 | by = Person] 579 | ``` 580 | 581 | 582 | ## Sorting data.tables 583 | 584 | - It's often a good idea to sort your data before saving it (or looking at it) as it makes it easier to navigate 585 | - There are also some data manipulation tricks that rely on the position of the data 586 | - You *could* sort by just passing an `order()` to the `i` argument, but more often you'll just do in-place `setorder()` 587 | 588 | ```{r, echo = TRUE} 589 | setorder(person_year_data, Person, Year) 590 | person_year_data[] 591 | ``` 592 | 593 | 594 | 595 | # Variable Types 596 | 597 | ## Manipulating Variables 598 | 599 | - Those are the base **data.table** actions we need to think about 600 | - They can be combined to do all sorts of things! 601 | - But important in using column operations 602 | - A lot of data cleaning is making an already-tidy variable usable! 603 | 604 | ## Variable Types 605 | 606 | Common variable types: 607 | 608 | - Numeric 609 | - Character/string 610 | - Factor 611 | - Date 612 | 613 | ## Variable Types 614 | 615 | - You can check the types of your variables with `is.` and then the type, or `class()`, or `vtable::vtable(data)` or doing `str(data)` 616 | - You can generally convert between types using `as.` and then the type 617 | 618 | ```{r, echo = TRUE} 619 | widetax <- dcast(taxdata, . ~ TaxFormRow) 620 | widetax[, `:=`(Person = as.factor(Person), 621 | Income = as.numeric(Income), 622 | Deductible = as.numeric(Deductible), 623 | AGI = as.numeric(AGI))] 624 | sapply(widetax, class) 625 | ``` 626 | 627 | ## Numeric Notes 628 | 629 | - Numeric data actually comes in multiple formats based on the level of acceptable precision: `integer`, `double`, and so on 630 | - Often you won't have to worry about this - R will just make the data whatever numeric type makes sense at the time 631 | - But a common problem is that reading in very big integers (like ID numbers) will sometimes create `double`s that are stored in scientific notation - lumping multiple groups together! Avoid this with options like `colClasses` in your data-reading function 632 | 633 | ## Character/string 634 | 635 | - Specified with `''` or `""` 636 | - Use `paste0()` to stick stuff together! `paste0('h','ello', sep = '_')` is "h_ello" 637 | - Messy data often defaults to character. For example, a "1,000,000" in your Excel sheet might not be parsed as `1000000` but instead as a literal "1,000,000" with commas 638 | - Lots of details on working with these - back to them in a moment 639 | 640 | ## Factors 641 | 642 | - Factors are for categorical data - you're in one category or another 643 | - The `factor()` function lets you specify these `labels`, and also specify the `levels` they go in - factors can be ordered! 644 | 645 | ```{r, echo = TRUE} 646 | incdata <- data.table(Income = c('50k-100k','Less than 50k', '50k-100k', '100k+', '100k+'))[, 647 | .(Income = factor(Income, levels = c('Less than 50k','50k-100k','100k+')))] 648 | setorder(incdata, Income) 649 | incdata 650 | ``` 651 | ## Dates 652 | 653 | - Dates are the scourge of data cleaners everywhere. They're just plain hard to work with! 654 | - There are Date variables, Datetime variables, both of multiple different formats... eugh! 655 | - I won't go into detail here, but I strongly recommend using the **lubridate** package whenever working with dates. See the [cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/lubridate.pdf) 656 | 657 | ## Characters/strings 658 | 659 | - Back to strings! 660 | - Even if your data isn't textual, working with strings is a very common aspect of preparing data for analysis 661 | - Some are straightforward, for example using `DT[condition, var := fix]` to fix typos/misspellings in the data 662 | - But other common tasks in data cleaning include: getting substrings, splitting strings, cleaning strings, and detecting patterns in strings 663 | - For this we will be using the **stringr** package, see the [cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf) 664 | 665 | ## Getting Substrings 666 | 667 | - When working with things like nested IDs (for example, NAICS codes are six digits, but the first two and first four digits have their own meaning), you will commonly want to pick just a certain range of characters 668 | - `str_sub(string, start, end)` will do this. `str_sub('hello', 2, 4)` is `'ell'` 669 | - Note negative values read from end of string. `str_sub('hello', -1)` is `'o'` 670 | 671 | ## Getting Substrings 672 | 673 | - For example, geographic Census Block Group indicators are 13 digits, the first two of which are the state FIPS code 674 | 675 | ```{r, echo = TRUE} 676 | cbgdata <- data.table(cbg = c(0152371824231, 1031562977281)) 677 | cbgdata[, cbg := as.character(cbg)] # Make it a string to work with 678 | cbgdata[, state_fips := fcase( 679 | nchar(cbg) == 12, str_sub(cbg, 1, 1), # Leading zeroes! 680 | nchar(cbg) == 13, str_sub(cbg, 1, 2) 681 | )] 682 | cbgdata[] 683 | ``` 684 | 685 | ## Strings 686 | 687 | - **Lots** of data will try to stick multiple pieces of information in a single cell, so you need to split it out! 688 | - Generically, `str_split()` will do this. `str_split('a,b', ',')[[1]]` is `c('a','b')` 689 | - Conveniently, double-assignment basically works like you want it to in `data.table`s! 690 | 691 | ```{r, echo = TRUE} 692 | deptdata <- data.table(category = c('Sales,Marketing','H&R,Marketing')) 693 | deptdata[, c('Category1', 'Category2') := str_split(category, ',')] 694 | deptdata[] 695 | ``` 696 | 697 | ## Cleaning Strings 698 | 699 | - Strings sometimes come with unwelcome extras! Garbage or extra whitespace at the beginning or end, or badly-used characters 700 | - `str_trim()` removes beginning/end whitespace, `str_squish()` removes additional whitespace from the middle too. `str_trim(' hi hello ')` is `'hi hello'`. 701 | - `str_replace_all()` is often handy for eliminating (or fixing) unwanted characters 702 | 703 | ```{r, echo = TRUE} 704 | numdata <- data.table(number = c('1,000', '2,003,124')) 705 | numdata[, number := str_replace_all(number, ',', '') %>% as.numeric()] 706 | numdata[] 707 | ``` 708 | 709 | ## Detecting Patterns in Strings 710 | 711 | - Often we want to do something a bit more complex. Unfortunately, this requires we dip our toes into the bottomless well that is *regular expressions* 712 | - Regular expressions are ways of describing patterns in strings so that the computer can recognize them. Technically this is what we did with `str_replace_all(',','')` - `','` is a regular expression saying "look for a comma" 713 | - There are a *lot* of options here. See the [guide](https://stringr.tidyverse.org/articles/regular-expressions.html) 714 | - Common: `[0-9]` to look for a digit, `[a-zA-Z]` for letters, `*` to repeat until you see the next thing... hard to condense here. Read the guide. 715 | 716 | ## Detecting Patterns in Strings 717 | 718 | - For example, some companies are publicly listed and we want to indicate that but not kepe the ticeker `separate()` won't do it here, not easily! 719 | - On the next page we'll use the regular expression `'\$[A-Z].*\$'` 720 | - `'\$[A-Z].*\$'` says "look for a (" (note the `\\` to treat the usually-special ( character as an actual character), then "Look for a capital letter `[A-Z]`", then "keep looking for capital letters `.*`", then "look for a )" 721 | 722 | ## Detecting Patterns in Strings 723 | 724 | - For detecting patterns, `str_detect()` from **stringr** is fine, but a touch faster is **data.table**'s `%like%` operator or `like()` function, the latter having options for ignoring case and regular-expression syntax 725 | 726 | ```{r, echo = TRUE} 727 | companydata <- data.table(name = c('Amazon (AMZN) Holdings','Cargill Corp. (cool place!)')) 728 | companydata[, `:=`(publicly_listed = name %like% '\$[A-Z].*\$', 729 | name = str_replace_all(name, '\$[A-Z].*\$', ''))] 730 | companydata[] 731 | ``` 732 | 733 | # Using Data Structure 734 | 735 | ## Using Data Structure 736 | 737 | - One of the core steps of data wrangling we discussed is thinking about how to get information from where it is now to where you want it 738 | - A tough thing about tidy data is that it can be a little tricky to move data *into different rows than it currently is* 739 | - This is often necessary when changing observation level, or when doing things like "calculate growth from an initial value" 740 | - But we can solve this with the use of `setorder()` along with other-row-referencing functions like `first()`, `last()`, and `shift()` 741 | 742 | ## Using Data Structure 743 | 744 | - `first()` and `last()` refer to the first and last row, naturally 745 | 746 | ```{r, echo = TRUE} 747 | stockdata <- data.table(ticker = c('AMZN','AMZN', 'AMZN', 'WMT', 'WMT','WMT'), 748 | date = as.Date(rep(c('2020-03-04','2020-03-05','2020-03-06'), 2)), 749 | stock_price = c(103,103.4,107,85.2, 86.3, 85.6)) 750 | setorder(stockdata, ticker, date) 751 | stockdata[, price_growth_since_march_4 := stock_price/first(stock_price) - 1, by = ticker] 752 | stockdata[] 753 | ``` 754 | 755 | ## Using Data Structure 756 | 757 | - `shift()` looks to the row a certain number above/below this one, based on the `n` argument 758 | - Careful! `shift()` doesn't care about *time* structure, it only cares about *data* structure. If you want daily growth but the row above is last year, you'll get the wrong result! 759 | 760 | ## Using Data Structure 761 | 762 | ```{r, echo = TRUE} 763 | setorder(stockdata, ticker, date) 764 | stockdata[, price_growth_daily := stock_price/lag(stock_price, 1) - 1, by = ticker] 765 | stockdata 766 | ``` 767 | 768 | 769 | ## Trickier Stuff 770 | 771 | - Sometimes the kind of data you want to move from one row to another is more complex! 772 | - You can use `first()/last()` to get stuff that might not normally be first or last with things like `stockdata[, targetdate := date ==` `as.Date('2020-03-05')]` and `setorder(stockdata, ticker, -(targetdate))` 773 | - For even more complex stuff, I often find it useful to use `DT[condition, operation]` to create a new variable that only picks data from the rows you want, then a grouped in-place column operation to spread the data from those rows across the other rows in the group 774 | 775 | ## Trickier Stuff 776 | 777 | ```{r, echo = TRUE} 778 | testdata <- data.table(person = c('Adam','James','Diego','Beth','Francis','Qian','Ryan','Selma'), 779 | school_grade = c(6,7,7,8,6,7,8,8), 780 | subject = c('Math','Math','English','Science','English','Science','Math','PE'), 781 | test_score = c(80,84,67,87,55,75,85,70)) 782 | testdata[subject == 'Math', Math_Scores := test_score] 783 | testdata[, Math_Average_In_This_Grade := mean(Math_Scores, na.rm = TRUE), by = school_grade] 784 | testdata[, Math_Scores := NULL] 785 | testdata[] 786 | ``` 787 | 788 | # Automation 789 | 790 | ## Automation 791 | 792 | - Data cleaning is often very repetitive 793 | - You shouldn't let it be! 794 | - Not just to save yourself work and tedium, but also because standardizing your process so you only have to write the code *once* both reduces errors and means that if you have to change something you only have to change it once 795 | - So let's automate! Three ways we'll do it here: `.SD`, writing functions, and **purrr** 796 | 797 | ## .SD 798 | 799 | - If you have a lot of variables, cleaning them all can be a pain. Who wants to write out the same thing a million times, say to convert all those read-in-as-text variables to numeric? 800 | - `.SD` refers to the entire set of data being analyzed other than any variables in `by` (i.e. the whole thing, or the current group if grouped) 801 | 802 | ## .SD 803 | 804 | - You can pass this to `lapply()` to apply a function to every variable! (plenty of fancier applications too but we'll stick here for now) 805 | - Or just a subset: you can specify only some columns to be in `.SD` with the `.SDcols` argument (which takes `patterns()`) 806 | - It really does work like the dataset! `.SD[1]` gives the first row of all columns, etc. 807 | 808 | 809 | ## .SD 810 | 811 | - Let's apply the same function to every column. First just to get some summary stats: 812 | 813 | ```{r, echo = FALSE, eval = TRUE} 814 | data(mtcars) 815 | mtcars <- as.data.table(mtcars) 816 | ``` 817 | 818 | ```{r, echo = TRUE} 819 | mtcars[, lapply(.SD, mean)] 820 | ``` 821 | 822 | ## .SD 823 | 824 | - And then to apply a function to each of them, changing the original data: 825 | 826 | ```{r, echo = TRUE, eval = FALSE} 827 | mtcars[, lapply(.SD, function(x) x + 1)] 828 | ``` 829 | 830 | ```{r, echo = FALSE, eval = TRUE} 831 | mtcars[, lapply(.SD, function(x) x + 1)] %>% datatable() 832 | ``` 833 | 834 | ## .SDcols 835 | 836 | - We can be a little choosier by just doing specific columns (despite doing multiple of them) 837 | - `patterns('price_growth')` on the next slide the same here as `4:5` or `c(price_growth_since_march_4, price_growth_daily)` or `price_growth_since_march_4:price_growth_daily` 838 | - The naming column is only to not overwrite old names. Could overwrite with just `4:5` there, or avoid the walrus 839 | 840 | ## .SDcols 841 | 842 | ```{r, echo = TRUE} 843 | stockgrowth <- stockdata[, .(ticker, date, price_growth_since_march_4, price_growth_daily)] 844 | stockgrowth[, c('bps_march4', 'bps_daily') := lapply(.SD, function(x) x*10000), 845 | .SDcols = patterns('price_growth')] 846 | stockgrowth 847 | ``` 848 | 849 | ## .SDcols 850 | 851 | - That version replaced the original values, but you can have it create new ones with `.names` 852 | - Also, you can use a `c()` of `lapply()`s instead of just one to do multiple calculations at the same time 853 | 854 | ## .SDcols 855 | 856 | ```{r, echo = TRUE} 857 | stockgrowth[, c('bps_march4', 'bps_daily', 858 | 'pct_march4', 'pct_daily') := 859 | c(lapply(.SD, function(x) x*10000), 860 | lapply(.SD, function(x) x*100)), .SDcols = 4:5] 861 | stockgrowth[] 862 | ``` 863 | 864 | ## .SDcols 865 | 866 | - Another common issue is wanting to apply the same transformation to all variables of the same type 867 | - For example, converting all characters to factors, or converting a bunch of dollar values to pounds 868 | - Use `sapply(DT,is.type))` for this 869 | 870 | ## .SDcols 871 | 872 | ```{r, echo = TRUE} 873 | justprice <- stockdata[, .(ticker, date, stock_price)] 874 | numeric_col_names <- names(justprice)[sapply(justprice,is.numeric)] 875 | newnames <- paste0(numeric_col_names, '_pounds') 876 | justprice[, (newnames) := lapply(.SD, function(x) x/1.36), .SDcols = numeric_col_names] 877 | justprice[] 878 | ``` 879 | 880 | ## Rowwise Operations 881 | 882 | - A lot of business data especially might record values in a bunch of categories, each category in its own column, but not report the total 883 | - This is annoying! But `.SD` helps here 884 | - Especially if you just want to sum, then it's easy with `rowSums()` 885 | 886 | ## Summing over Columns 887 | 888 | ```{r, echo = TRUE} 889 | deptdata <- data.table(year = c(1994, 1995, 1996), sales = c(104, 106, 109), marketing = c(100, 200, 174), rnd = c(423,123,111)) 890 | deptdata[, total_spending := rowSums(.SD), .SDcols = sales:rnd] 891 | deptdata[] 892 | ``` 893 | 894 | ## Writing Functions 895 | 896 | - We've already done a bit of function-writing here, in the file read-in and with `lapply()` 897 | - Generally, **if you're going to do the same thing more than once, you're probably better off writing a function** 898 | - Reduces errors, saves time, makes code reusable later! 899 | 900 | ```{r, echo = TRUE, eval = FALSE} 901 | function_name <- function(argument1 = default1, argument2 = default2, etc.) { 902 | some code 903 | result <- more code 904 | return(result) 905 | # (or just do result by itself - the last object printed will be automatically returned if there's no return()) 906 | } 907 | ``` 908 | 909 | ## Function-writing tips 910 | 911 | - Make sure to think about what kind of values your function accepts and make sure that what it returns is consistent so you know what you're getting 912 | - This is a really deep topic to cover in two slides, and mostly I just want to poke you and encourage you to do it. At least, if you find yourself doing something a bunch of times in a row, just take the code, stick it inside a `function()` wrapper, and instead use a bunch of calls to that function in a row 913 | - More information [here](https://www.r-bloggers.com/2019/07/writing-functions-in-r-example-one/). 914 | 915 | ## Unnamed Functions 916 | 917 | - There are other ways to do functions in R: *unnamed functions* 918 | - Notice how in the `lapply()` examples I didn't have to do `bps <- function(x) x*10000`, I just did `function(x) x*10000`? That's an "unnamed function" 919 | - If your function is very small like this and you're only going to use it once, it's great for that! 920 | - In R 4.1, you will be able to just do `\(x)` instead of `function(x)` 921 | 922 | ## purrr 923 | 924 | - One good way to apply functions iteratively (yours or not) is with the `map()` functions in **purrr** 925 | - We already did this to read in files, but it applies much more broadly! `map()` usually generates a `list()`, `map_dbl()` a numeric vector, `map_chr()` a character vector, `map_df()` a `tibble()`... 926 | 927 | ## purrr 928 | 929 | - It iterates through a `list`, `data.frame/tibble/data.table` (which are technically `list`s, or `vector`, and then applies a function to each of the elements 930 | 931 | ```{r, echo = TRUE} 932 | person_year_data %>% 933 | map_chr(class) 934 | ``` 935 | ## purrr 936 | 937 | - Obviously handy for processing many files, as in our reading-in-files example 938 | - Or looping more generally for diagnostic or wrangling purposes. Perhaps you have a `summary_profile()` function you've made, and want to check each state's data to see if its data looks right. You could do 939 | 940 | ```{r, echo = TRUE, eval = FALSE} 941 | data[, state] %>% unique() %>% map(summary_profile) 942 | ``` 943 | 944 | - You can use it generally in place of a `for()` loop 945 | - See the [purrr cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf) 946 | 947 | # Finishing Up, and an Example! 948 | 949 | ## Some Final Notes 950 | 951 | - We can't possibly cover everything. So one last note, about saving your data! 952 | - What to do when you're done and want to save your processed data? 953 | - Saving data in R format: `save()` saves many objects, which are all put back in the environment with `load()`. Often preferable is `saveRDS()` which saves a single `data.frame()` in compressed format, loadable with `df <- readRDS()` 954 | - Saving data for sharing: `fwrite()` makes a CSV. Yay! 955 | 956 | ## Some Final Notes 957 | 958 | - Also, please, please, *please* **DOCUMENT YOUR DATA** 959 | - At the very least, keep a spreadsheet/\code{data.table} with a set of descriptions for each of your variables 960 | - Also look into the **sjlabelled** or **haven** packages to add variable labels directly to the data set itself 961 | - Once you have your variables labelled, `vtable()` in **vtable** can generate a documentation file for sharing 962 | 963 | ## A Walkthrough 964 | 965 | - Let's clean some data! -------------------------------------------------------------------------------- /Data_Wrangling_pandas.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Wrangling in Pandas" 3 | author: "Nick Huntington-Klein w/Andrew Hornstra" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | 18 | ```{r setup, include=FALSE} 19 | knitr::opts_chunk$set(echo = FALSE) 20 | library(tidyverse) 21 | library(DT) 22 | library(purrr) 23 | library(readxl) 24 | library(reticulate) 25 | ``` 26 | 27 | ## Data Wrangling 28 | 29 | ```{r, results = 'asis'} 30 | cat(" 31 | ") 37 | ``` 38 | 39 | ```{python, echo = FALSE} 40 | import pandas as pd 41 | ``` 42 | 43 | Welcome to the Data Wrangling Workshop! 44 | 45 | - The goal of data wrangling 46 | - How to think about data wrangling 47 | - Technical tips for data wrangling in Python using the **pandas** package 48 | - A walkthrough example 49 | 50 | ## Limitations 51 | 52 | - I will assume you already have some familiarity with Python in general 53 | - We only have so much time! I won't be going into *great* detail on the use of all the technical commands, but by the end of this you will know what's out there and generally how it's used 54 | - Shorthand: `pd` is from `import pandas as pd`, and `df` will be shorthand for our `DataFrame` object 55 | - *As with any computer skill, a teacher's comparative advantage is in letting you know what's out there. The* **real learning** *comes from practice and Googling. So take what you see here today, find yourself a project, and do it! It will be awful but you will learn an astounding amount by the end* 56 | 57 | 58 | ## Data Wrangling 59 | 60 | What is data wrangling? 61 | 62 | - You have data 63 | - It's not ready for you to run your model 64 | - You want to get it ready to run your model 65 | - Ta-da! 66 | 67 | ## The Core of Data Wrangling 68 | 69 | - Always **look directly at your data so you know what it looks like** 70 | - Always **think about what you want your data to look like when you're done** 71 | - Think about **how you can take information from where it is and put it where you want it to be** 72 | - After every step, **look directly at your data again to make sure it's doing what you think it's doing** 73 | 74 | I help a lot of people with their problems with data wrangling. Their issues are almost always *not doing one of these four things*, much more so than having trouble coding or anything like that 75 | 76 | ## The Core of Data Wrangling 77 | 78 | - How can you "look at your data"? 79 | - Literally is one way - `print` the `DataFrame` to have it print out 80 | - Summary statistics tables: `df.describe(include = 'all')` 81 | - Checking what values it takes: `pd.unique()` on individual variables 82 | - Look for: What values are there, what the observations look like, presence of missing or unusable data, how the data is structured 83 | 84 | ## The Stages of Data Wrangling 85 | 86 | - From records to data 87 | - From data to tidy data 88 | - From tidy data to data for your analysis 89 | 90 | # From Records to Data 91 | 92 | ## From Records to Data 93 | 94 | Not something we'll be focusing on today! But any time the data isn't in a workable format, like a spreadsheet or database, someone's got to get it there! 95 | 96 | - "Google Trends has information on the popularity of our marketing terms, go get it!" 97 | - "Here's a 600-page unformatted PDF of our sales records for the past three years. Turn it into a database." 98 | - "Here are scans of the 15,000 handwritten doctor's notes at the hospital over the past year" 99 | - "Here's access to the website. The records are in there somewhere." 100 | - "Go do a survey" 101 | 102 | ## From Records to Data: Tips! 103 | 104 | - Do as little by hand as possible. It's a lot of work and you *will* make mistakes 105 | - *Look at the data* a lot! 106 | - Check for changes in formatting - it's common for things like "this enormous PDF of our tables" or "eight hundred text files with the different responses/orders" to change formatting halfway through 107 | - When working with something like a PDF or a bunch of text files, think "how can I tell a computer to spot where the actual data is?" 108 | - If push comes to shove, or if the data set is small enough, you can do by-hand data entry. Be very careful! 109 | 110 | ## Reading Files 111 | 112 | One common thing you run across is data split into multiple files. How can we read these in and compile them? 113 | 114 | - `grob` from the **grob** pakcage produces a vector of filenames 115 | - Use a `for` loop to iterate over that vector and read in the data, as well as any processing 116 | - Combine the results with `df.append()`! 117 | 118 | ## Reading Files 119 | 120 | For example, imagine you have 200 monthly sales reports in Excel files. You just want to pull cell C2 (total sales) and cell B43 (employee of the month) and combine them together. 121 | 122 | ```{python, echo = TRUE, eval = FALSE} 123 | import glob 124 | import os 125 | 126 | # Get relative filepaths 127 | partial_paths = glob.glob("../Monthly_reports/*sales*") 128 | # turn them into absolute filepaths 129 | file_list = [os.path.abspath(file) for file in partial_paths] 130 | ``` 131 | 132 | ## Reading Files 133 | 134 | We can simplify by making a little function that processes each of the reports as it's read. Then, use` with `pd.read_excel()` and then our function, then appendit together! 135 | 136 | How do I get `df[1,3]`, etc.? Because I look straight at the files and check where the data I want is, so I can pull it and put it where I want it! 137 | 138 | ## Reading Files 139 | 140 | ```{python, echo = TRUE, eval = FALSE} 141 | # Initialize place for data to go 142 | df = pd.DataFrame(columns=["sales", "employee"]) 143 | for file in file_list: 144 | report = pd.read_excel(file) 145 | sales = report.iloc[1, 3] 146 | employee = report.iloc[42, 1] 147 | df = df.append( 148 | pd.DataFrame( 149 | { 150 | "sales": sales, 151 | "employee": employee 152 | }, index=[0] 153 | ) 154 | ) 155 | ``` 156 | 157 | # From Data to Tidy Data 158 | 159 | ## From Data to Tidy Data 160 | 161 | - **Data** is any time you have your records stored in some structured format 162 | - But there are many such structures! They could be across a bunch of different tables, or perhaps a spreadsheet with different variables stored randomly in different areas, or one table per observation 163 | - These structures can be great for *looking up values*. That's why they are often used in business or other settings where you say "I wonder what the value of X is for person/day/etc. N" 164 | - They're rarely good for *doing analysis* (calculating statistics, fitting models, making visualizations) 165 | - For that, we will aim to get ourselves *tidy data* (see [this walkthrough](https://tidyr.tidyverse.org/articles/tidy-data.html) ) 166 | 167 | ## Tidy Data 168 | 169 | In tidy data: 170 | 171 | 1. Each variable forms a column 172 | 1. Each observation forms a row 173 | 1. Each type of observational unit forms a table 174 | 175 | ```{r} 176 | df <- data.frame(Country = c('Argentina','Belize','China'), TradeImbalance = c(-10, 35.33, 5613.32), PopulationM = c(45.3, .4, 1441.5)) 177 | datatable(df) 178 | ``` 179 | 180 | ## Tidy Data 181 | 182 | The variables in tidy data come in two types: 183 | 184 | 1. *Identifying Variables*/*Keys* are the columns you'd look at to locate a particular observation. 185 | 1. *Measures*/*Values* are the actual data. 186 | 187 | Which are they in this data? 188 | 189 | ```{r} 190 | df <- data.frame(Person = c('Chidi','Chidi','Eleanor','Eleanor'), Year = c(2017, 2018, 2017, 2018), Points = c(14321,83325, 6351, 63245), ShrimpConsumption = c(0,13, 238, 172)) 191 | datatable(df) 192 | ``` 193 | ## Tidy Data 194 | 195 | - *Person* and *Year* are our identifying variables. The combination of person and year *uniquely identifies* a row in the data. Our "observation level" is person and year. There's only one row with Person == "Chidi" and Year == 2018 196 | - *Points* and *ShrimpConsumption* are our measures. They are the things we have measured for each of our observations 197 | - Notice how there's one row per observation (combination of Person and Year), and one column per variable 198 | - Also this table contains only variables that are at the Person-Year observation level. Variables at a different level (perhaps things that vary between Person but don't change over Year) would go in a different table, although this last one is less important 199 | 200 | ## Tidying Non-Tidy Data 201 | 202 | - So what might data look like when it's *not* like this, and how can we get it this way? 203 | - Here's one common example, a *count table* (not tidy!) where each column is a *value*, not a *variable* 204 | 205 | ```{r} 206 | data("relig_income") 207 | datatable(relig_income) 208 | ``` 209 | 210 | ## Tidying Non-tidy Data 211 | 212 | - Here's another, where the "chart position" variable is split across 52 columns, one for each week 213 | 214 | ```{r} 215 | data("billboard") 216 | datatable(billboard) 217 | ``` 218 | 219 | 220 | 221 | ## Tidying Non-Tidy Data 222 | 223 | - The first big tool in our tidying toolbox is the *pivot* 224 | - A pivot takes a single row with K columns and turns it into K rows with 1 column, using the identifying variables/keys to keep things lined up. 225 | - This can also be referred to as going from "wide" data to "long" data 226 | - Long to wide is also an option 227 | - In every statistics package, pivot functions are notoriously fiddly. Always read the help file, and do trial-and-error! Make sure it worked as intended. 228 | 229 | ## Tidying Non-Tidy Data 230 | 231 | Check our steps! 232 | 233 | - We looked at the data 234 | - Think about how we want the data to look - one row per (keys) artist, track, and week, and a column for the chart position of that artist/track in that week, and the date entered for that artist/track (value) 235 | - How can we carry information from where it is to where we want it to be? With a pivot! 236 | - And afterwards we'll look at the result (and, likely, go back and fix our pivot code - the person who gets a pivot right the first try is a mysterious genius) 237 | 238 | ## Pivot 239 | 240 | - In **pandas** we have the functions `pd.wide_to_long()` and `pd.long_to_wide()` (there is also the more-powerful `pd.melt()` and `pd.pivot_table()` but these may be trickier to use). Here we want wide-to-long so we use `pd.wide_to_long()` 241 | - This asks for: 242 | - `df` (the data set you're working with) 243 | - `stubnames` (the columns to pivot) - a string (or vector) with the characters that start the cols to pivot 244 | - `i` (the existing ID variables) 245 | - `j` (the name of the new ID variable) 246 | 247 | ## Pivot 248 | 249 | 250 | ```{python, echo = TRUE, eval = FALSE} 251 | pd.wide_to_long( 252 | billboard, 253 | "wk", 254 | i=["artist", "track", "date.entered"], 255 | j="week" 256 | ).rename( 257 | {"wk": "chart_position"}, 258 | axis=1 259 | ).dropna() 260 | ``` 261 | 262 | ```{r, echo = FALSE} 263 | billboard2 <- billboard %>% 264 | mutate(across(, as.character)) 265 | ``` 266 | 267 | ```{python, echo = FALSE} 268 | billboard = r.billboard2 269 | pd.wide_to_long( 270 | billboard, 271 | "wk", 272 | i=["artist", "track", "date.entered"], 273 | j="week" 274 | ).rename( 275 | {"wk": "chart_position"}, 276 | axis=1 277 | ).dropna() 278 | ``` 279 | 280 | 281 | ## Variables Stored as Rows 282 | 283 | - Here we have tax form data where each variable is a row, but we have multiple tables For this one we can use `pivot_wider()`, and then combine multiple individuals with `bind_rows()` 284 | 285 | ```{python, echo = FALSE} 286 | tax_data = pd.DataFrame( 287 | { 288 | "index": 289 | [ 290 | 0, 0, 0, 0 291 | ], 292 | "Value": 293 | [ 294 | "Person", "Income", "Deductible", "AGI" 295 | ], 296 | "TaxFormRow": 297 | [ 298 | "James Acaster", 112341, 24000, 88341 299 | ], 300 | } 301 | ) 302 | 303 | tax_data2 = pd.DataFrame( 304 | { 305 | "index": 306 | [ 307 | 1, 1, 1, 1 308 | ], 309 | "Value": 310 | [ 311 | "Person", "Income", "Deductible", "AGI" 312 | ], 313 | "TaxFormRow": 314 | [ 315 | 'Eddie Izzard',325122, 16000, 325122 - 16000 316 | ], 317 | } 318 | ) 319 | print(tax_data) 320 | ``` 321 | 322 | ## Variables Stored as Rows 323 | 324 | - `pivot()` is a `DataFrame` method that needs: 325 | - `index` (the columns that give us the key - what should it be here?) 326 | - `columns` (the column containing what will be the new variable names) 327 | - `values` (the column containing the new values) 328 | 329 | ## Variables Stored as Rows 330 | 331 | ```{python, echo = TRUE} 332 | tax_data.pivot( 333 | index="index", 334 | columns="Value", 335 | values="TaxFormRow" 336 | ) 337 | ``` 338 | 339 | 340 | ## Variables Stored as Rows 341 | 342 | We can use `.append()` to stack data sets with the same variables together, handy for compiling data from different sources 343 | 344 | ```{python, echo = TRUE} 345 | tax_data.pivot( 346 | index="index", 347 | columns="Value", 348 | values="TaxFormRow" 349 | ).append(tax_data2.pivot( 350 | index="index", 351 | columns="Value", 352 | values="TaxFormRow" 353 | )) 354 | ``` 355 | 356 | ## Merging Data 357 | 358 | - Commonly, you will need to link two datasets together based on some shared keys 359 | - For example, if one dataset has the variables "Person", "Year", and "Income" and the other has "Person" and "Birthplace" 360 | 361 | ```{python, echo = FALSE} 362 | person_year_data = pd.DataFrame( 363 | { 364 | "Person": 365 | [ 366 | "Ramesh", "Ramesh", "Whitney", "Whitney", "David", "David" 367 | ], 368 | "Year": 369 | [ 370 | 2014, 2015, 2014, 2015, 2014, 2015 371 | ], 372 | "Income": 373 | [ 374 | 81314, 82155, 131292, 141262, 102452, 105133 375 | ] 376 | } 377 | ) 378 | person_data = pd.DataFrame( 379 | { 380 | "Person": 381 | [ 382 | "Ramesh", 383 | "Whitney" 384 | ], 385 | "Birthplace": 386 | [ 387 | "Crawley", 388 | "Washington D.C." 389 | ] 390 | } 391 | ) 392 | print(person_year_data) 393 | ``` 394 | 395 | ## Merging Data 396 | 397 | That was `person_year_data`. And now for `person_data`: 398 | 399 | ```{python} 400 | print(person_data) 401 | ``` 402 | 403 | ## Merging Data 404 | 405 | - The `.merge()` method will do this. The different `how` options varieties just determine what to do with rows you *don't* find a match for. `'left'` keeps non-matching rows from the first dataset but not the second, `'right'` from the second not the first, `'outer'` from both, `'inner'` from neither 406 | - Can deal with mismatched named on either side by using `'left_on'` etc. instead of `'on'` 407 | 408 | ## Merging Data 409 | 410 | ```{python, echo = TRUE} 411 | person_year_data.merge( 412 | person_data, 413 | how='left', 414 | on='Person' 415 | ) 416 | ``` 417 | 418 | ```{python, echo = TRUE} 419 | person_year_data.merge( 420 | person_data, 421 | how='right', 422 | on='Person' 423 | ) 424 | ``` 425 | 426 | ## Merging Data 427 | 428 | - Things work great if the list of variables in `by` is the exact observation level in *at least one* of the two data sets 429 | - But if there are multiple observations per combination of `by` variables in *both*, that's a problem! It will create all the potential matches, which may not be what you want: 430 | 431 | ```{python, echo = TRUE} 432 | a = pd.DataFrame({"name": ["A", "A", "B", "C"], 433 | "Year": [2014, 2015, 2014, 2014], "Value": range(1, 5) }) 434 | b = pd.DataFrame({"name": ["A", "A", "B", "C", "C"], 435 | "Characteristic": ["Up", "Down", "Up", "Left", "Right"]}) 436 | a.merge(b, how='left', on="name") 437 | ``` 438 | 439 | ## Merging Data 440 | 441 | - This is why it's *super important* to always know the observation level of your data. You can check it by seeing if there are any duplicate rows among what you *think* are your key variables: if we think that `Person` is a key for data set `a`, then `a.duplicated(["Person"]).max()` will return `True`, showing us we're wrong 442 | - At that point you can figure out how you want to proceed - drop observations so it's the observation level in one? Accept the multi-match? Pick only one of the multi-matches? 443 | 444 | # From Tidy Data to Your Analysis 445 | 446 | ## From Tidy Data to Your Analysis 447 | 448 | - Okay! We now have, hopefully, a nice tidy data set with one column per variable, one row per observation, we know what the observation level is! 449 | - That doesn't mean our data is ready to go! We likely have plenty of cleaning and manipulation to go before we are ready for analysis 450 | 451 | ## Filtering 452 | 453 | - Filtering limits the data to the observations that fulfill a certain *logical condition*. It *picks rows*. 454 | - For example, `Income > 100000` is `True` for everyone with income above 100000, and `False` otherwise. So filtering on `Income > 100000` should give you every row with income above 100000. 455 | - Two main ways in pandas: `.query()` and `.loc[]` 456 | 457 | ```{python, echo = TRUE} 458 | full_person_merge = person_year_data.merge(person_data, how='left', on='Person') 459 | full_person_merge.query("Income > 100000") 460 | full_person_merge.loc[full_person_merge["Income"] > 100000] 461 | ``` 462 | 463 | ## Logical Conditions 464 | 465 | - A lot of programming in general is based on writing logical conditions that check whether something is true 466 | - In Python, if the condition is true, it returns `True`, which turns into 1 if you do a calculation with it. If false, it returns `False`, which turns into 0. 467 | 468 | ## Logical Conditions Tips 469 | 470 | Handy tools for constructing logical conditions: 471 | 472 | `a > b`, `a >= b`, `a < b`, `a <= b`, `a == b`, or `a != b` to compare two numbers and check if `a` is above, above-or-equal, below, below-or-equal, equal (note `==` to check equality, not `=`), or not equal 473 | 474 | `a in c(b, c, d, e, f)` checks whether `a` is any of the values `b, c, d, e,` or `f`. Works for text too! 475 | 476 | ## Logical Conditions Tips 477 | 478 | - Whatever your condition is (`condition`), just put a `not` in front to reverse `True`/`False`. `2 + 2 == 4` is `True`, but `not (2 + 2 == 4)` is `False` 479 | - Chain multiple conditions together! Use `and` and `or`. Be careful with parentheses if combining them! 480 | 481 | ## Selecting columns 482 | 483 | - Indexing and `.drop()` give you back just a subset of the columns. They *pick columns* 484 | - It can do this by name (with a vector of column names) or by column number (with `.iloc[]`) 485 | - Use `.drop()` to *not* pick certain columns 486 | 487 | If our data has the columns "Person", "Year", and "Income", then all of these do the same thing: 488 | 489 | ```{pandas, echo = TRUE} 490 | no_income = person_year_data[["Person", "Year"]] 491 | # a few ways to do this, but this is the most readable 492 | no_income = person_year_data.drop("Income", axis=1) 493 | no_income = person_year_data.iloc[0:1] 494 | print(no_income) 495 | ``` 496 | 497 | 498 | ## .sort_values() 499 | 500 | - `.sort_values()` sorts the data. That's it! Give it the column names and it will sort the data by those columns. 501 | - It's often a good idea to sort your data before saving it (or looking at it) as it makes it easier to navigate 502 | - There are also some data manipulation tricks that rely on the position of the data 503 | 504 | ```{python, echo = TRUE} 505 | person_year_data.sort_values(["Person","Year"]) 506 | ``` 507 | 508 | ## Assigning Variables 509 | 510 | - We can *assign columns/variables* by declaring their column names 511 | 512 | ```{python, echo = TRUE} 513 | person_year_data["NextYear"] = person_year_data["Year"] + 1 514 | person_year_data["Above100k"] = person_year_data["Income"] > 100000 515 | print(person_year_data) 516 | ``` 517 | 518 | ## Case assignment 519 | 520 | - A common need is in *creating* a categorical variable 521 | - Use `.loc[]` to determine which rows to update, and then assign them 522 | - This is known as a boolean mask 523 | - (here we will also use `between` to help with our `.loc[]`) 524 | 525 | ## Case assignment 526 | 527 | ```{python, echo = TRUE} 528 | person_year_data["IncomeBracket"] = "Under 50k" 529 | person_year_data.loc[person_year_data["Income"].between( 530 | 50001, 100000 531 | ), "IncomeBracket"] = "50-100k" 532 | person_year_data.loc[person_year_data["Income"].between( 533 | 100001, 120000 534 | ), "IncomeBracket"] = "100-120k" 535 | person_year_data.loc[person_year_data["Income"] 536 | > 120000, "IncomeBracket"] = "Above 120k" 537 | ``` 538 | 539 | ## Case assignment 540 | 541 | ```{python, echo = FALSE} 542 | print(person_year_data) 543 | ``` 544 | 545 | ## Case assignment 546 | 547 | - Note that the assignment doesn't have to be a value, it can be a calculation, for example 548 | 549 | ```{python, eval = FALSE, echo = TRUE} 550 | person_year_data["Inflation_Adjusted_Income"] = person_year_data["Income"] 551 | person_year_data.loc[person_year_data["Year"] == 552 | 2014, "Inflation_Adjusted_Income"] *= 1.001 553 | ``` 554 | 555 | - Note in that last step we are using boolean masking to change the value of just *some* of the observations, also handy 556 | 557 | ## .groupby() 558 | 559 | - `.groupby()` turns the dataset into a *grouped* data set, splitting each combination of the grouping variables 560 | - Calculations like `.transform()` then process the data separately by each group 561 | 562 | ```{python, echo = TRUE} 563 | person_year_data["Income_Relative_to_Mean"] = (person_year_data["Income"] 564 | - person_year_data.groupby("Person")["Income"].transform("mean")) 565 | ``` 566 | 567 | ## .groupby() 568 | 569 | - How is this useful in preparing data? 570 | - Remember, we want to *look at where information is* and *think about how we can get it where we need it to be* 571 | - `.groupby()` helps us move information *from one row to another in a key variable* - otherwise a difficult move! 572 | - It can also let us *change the observation level* with `.agg()` 573 | - Tip: `"count"` gives the number of rows in the group - handy! 574 | 575 | ## .agg() 576 | 577 | - `.agg()` *changes the observation level* to a broader level 578 | - It returns only *one row per group* (or one row total if the data is ungrouped) 579 | - So now your keys are whatever you gave to `.groupby()` 580 | 581 | ```{python, echo = TRUE} 582 | person_year_data.groupby( 583 | "Person" 584 | ).agg( 585 | {"Income": "mean", "Person": "count"} 586 | ).rename({"Person": "YearsTracked"}, axis=1) 587 | ``` 588 | 589 | # Variable Types 590 | 591 | ## Manipulating Variables 592 | 593 | - Those are the base data manipulation approaches we need to think about 594 | - They can be combined to do all sorts of things! 595 | - But important in using them is thinking about what kinds of variable manipulations we're doing 596 | - That will feed into our variable assignments and our `.agg()`s 597 | - A lot of data cleaning is making an already-tidy variable usable! 598 | 599 | ## Variable Types 600 | 601 | Common variable types: 602 | 603 | - Numeric (many types!) 604 | - Character/string 605 | - Categorical 606 | - Date 607 | 608 | ## Variable Types 609 | 610 | - You can check the types of your variables with `.dtypes` 611 | - You can generally convert between types using `.astype` 612 | 613 | ```{python, echo = TRUE, eval = FALSE} 614 | tax_data.pivot( 615 | index="index", 616 | columns="Value", 617 | values="TaxFormRow" 618 | ).astype( 619 | { 620 | "AGI": "float64", 621 | "Deductible": "float64", 622 | "Income": "float64", 623 | "Person": "category" 624 | } 625 | ).reset_index(drop=True) 626 | ``` 627 | 628 | ## Numeric Notes 629 | 630 | - Numeric data actually comes in multiple formats based on the level of acceptable precision: `float`, `int`, and so on 631 | - You can generally convert between types with functions like `int()` 632 | - But a common problem is that reading in very big integers (like ID numbers) will sometimes create `double`s that are stored in scientific notation - lumping multiple groups together! Avoid this with options like `col_types` in your data-reading function 633 | 634 | ## Character/string 635 | 636 | - Specified with `""`, and `''` is also OK, especially if you need a `"` in the string 637 | - Use `+` to stick stuff together, or `.join()` to paste together a vector! `"h"+"ello"` is `"hello"`, `"_".join(["h","ello"])` is `"h_ello"` 638 | - Messy data often defaults to character. For example, a "1,000,000" in your Excel sheet might not be parsed as `1000000` but instead as a literal "1,000,000" with commas 639 | - Lots of details on working with these - back to them in a moment 640 | 641 | ## Categorical/factor variables 642 | 643 | - Categorical variables are for when you're in one category or another 644 | - The `Categorical()` function lets you specify these - and they can be ordered! 645 | 646 | ## Categorical/factor variables 647 | 648 | ```{python, echo = TRUE} 649 | unsorted = pd.Categorical( 650 | pd.Series( 651 | [ 652 | "50k-100k", "Less than 50k", "50k-100k", "100k+", "100k+" 653 | ] 654 | ), 655 | categories=[ 656 | "Less than 50k", "50k-100k", "100k+" 657 | ], 658 | ordered=True 659 | ) 660 | unsorted.sort_values() 661 | ``` 662 | 663 | ## Dates 664 | 665 | - Dates are the scourge of data cleaners everywhere. They're just plain hard to work with! 666 | - Thankfully **pandas** does at least consolidate variable types into `datetime` 667 | - I won't go into detail here, but there is a good guide [here](https://towardsdatascience.com/working-with-datetime-in-pandas-dataframe-663f7af6c587). Even then it's tricky - dates never want to do what you want them to! 668 | 669 | ## Characters/strings 670 | 671 | - Back to strings! 672 | - Even if your data isn't textual, working with strings is a very common aspect of preparing data for analysis 673 | - Some are straightforward, for example using boolean masks to fix typos/misspellings in the data 674 | - But other common tasks in data cleaning include: getting substrings, splitting strings, cleaning strings, and detecting patterns in strings 675 | 676 | ## Getting Substrings 677 | 678 | - When working with things like nested IDs (for example, NAICS codes are six digits, but the first two and first four digits have their own meaning), you will commonly want to pick just a certain range of characters 679 | - You can index the characters of the string like it's an array 680 | - `string[start:end]` will do this. `"hello"[1:3]` is `'ell'` 681 | - Note negative values read from end of string. `"hello"[-1]` is `'o'` 682 | 683 | ## Getting Substrings 684 | 685 | - For example, geographic Census Block Group indicators are 13 digits, the first two of which are the state FIPS code 686 | 687 | ```{python, echo = TRUE} 688 | cbg = pd.DataFrame({"cbg":[152371824231, 1031562977281]},dtype=str) 689 | cbg["state_fips"] = cbg["cbg"].apply(lambda x: x[0:2] if len(x) == 13 else x[0:1]) 690 | cbg 691 | ``` 692 | 693 | ## Strings 694 | 695 | - **Lots** of data will try to stick multiple pieces of information in a single cell, so you need to split it out! 696 | - Generically, `str.split()` will do this. `"a,b".split(",")` is `["a","b"]` 697 | - Often in already-tidy data, you want `.str.split()`. Make sure to rename as appropriate! 698 | 699 | ```{python, echo = TRUE} 700 | category = pd.DataFrame({"category": ["Sales,Marketing", "H&R,Marketing"]}) 701 | category["category"].str.split(",", expand=True).rename({0: "Category1",1: "Category2"},axis=1) 702 | ``` 703 | 704 | ## Cleaning Strings 705 | 706 | - Strings sometimes come with unwelcome extras! Garbage or extra whitespace at the beginning or end, or badly-used characters 707 | - `.strip()` removes beginning/end whitespace, `" hi hello ".strip()` is `"hi hello"`. See also `.rstrip()` and `lstrip()` for one-sided versions 708 | - `.str.replace()` is often handy for eliminating (or fixing) unwanted characters 709 | 710 | ```{python, echo = TRUE} 711 | number = pd.DataFrame({"number": ["1,000", "2,003,124"]}) 712 | number["number"].str.replace(",", "").astype(int) 713 | ``` 714 | 715 | ## Detecting Patterns in Strings 716 | 717 | - Often we want to do something a bit more complex. Unfortunately, this requires we dip our toes into the bottomless well that is *regular expressions* 718 | - Regular expressions are ways of describing patterns in strings so that the computer can recognize them. Technically this is what we did with `.str.replace(",","")` - `","` is a regular expression saying "look for a comma" 719 | - There are a *lot* of options here. See the [guide](https://stringr.tidyverse.org/articles/regular-expressions.html) 720 | - Common: `[0-9]` to look for a digit, `[a-zA-Z]` for letters, `*` to repeat until you see the next thing... hard to condense here. Read the guide. 721 | 722 | ## Detecting Patterns in Strings 723 | 724 | - For example, some companies are publicly listed and we want to indicate that but not keep the ticker. `separate()` won't do it here, not easily! 725 | - On the next page we'll use the regular expression `'\$[A-Z].*\$'` 726 | - `'\$[A-Z].*\$'` says "look for a (" (note the `\\` to treat the usually-special ( character as an actual character), then "Look for a capital letter `[A-Z]`", then "keep looking for capital letters `.*`", then "look for a )" 727 | 728 | ## Detecting Patterns in Strings 729 | 730 | ```{python, echo = TRUE} 731 | companies = pd.DataFrame({"name":["Amazon (AMZN) Holdings", "Cargill Corp. (cool place!)"]}) 732 | companies["publicly_listed"] = companies["name"].str.contains("\$[A-Z].*\$") 733 | companies["name"] = companies["name"].str.replace("\$[A-Z].*\$", "") 734 | print(companies) 735 | ``` 736 | 737 | 738 | # Using Data Structure 739 | 740 | ## Using Data Structure 741 | 742 | - One of the core steps of data wrangling we discussed is thinking about how to get information from where it is now to where you want it 743 | - A tough thing about tidy data is that it can be a little tricky to move data *into different rows than it currently is* 744 | - This is often necessary when `.agg()`ing, or when doing things like "calculate growth from an initial value" 745 | - But we can solve this with the use of *sort_values()* along with other-row-referencing functions like indexing, perhaps combined with `.head()` 746 | 747 | ## Using Data Structure 748 | 749 | ```{python, echo = TRUE} 750 | stock_data = pd.DataFrame({"ticker": ["AMZN", "AMZN", "AMZN", "WMT", "WMT", "WMT"], 751 | "date": [ "2020-03-04", "2020-03-05", "2020-03-06", "2020-03-04","2020-03-05", "2020-03-06"], 752 | "stock_price": [103, 103.4, 107, 85.2, 86.3, 85.6]}) 753 | stock_data["date"] = pd.to_datetime(stock_data["date"]) 754 | ``` 755 | 756 | ## Using Data Structure 757 | 758 | - `.head()` and `.tail()` refer to the first and last rows, naturally 759 | 760 | ```{python, echo = TRUE} 761 | stock_data["price_growth_since_march_4"] = stock_data.sort_values(["ticker", "date"]).groupby( 762 | "ticker")["stock_price"].apply(lambda x: x/x.head(1).values[0] - 1) 763 | print(stock_data) 764 | ``` 765 | 766 | ## Using Data Structure 767 | 768 | - `shift()` looks to the row a certain number above/below this one, based on the `n` argument 769 | - Careful! `shift()` doesn't care about *time* structure, it only cares about *data* structure. If you want daily growth but the row above is last year, too bad! 770 | 771 | ## Using Data Structure 772 | 773 | ```{python, echo = TRUE} 774 | stock_data["daily_price_growth"] = (stock_data["stock_price"]/stock_data.sort_values(["ticker", "date"]).groupby( 775 | "ticker")["stock_price"].shift(1) - 1) 776 | stock_data 777 | ``` 778 | 779 | 780 | ## Trickier Stuff 781 | 782 | - Sometimes the kind of data you want to move from one row to another is more complex! 783 | - You can get stuff that might not normally be first or last by filtering on the values you want before `.transform()`ing 784 | 785 | ## Trickier Stuff 786 | 787 | ```{python, echo = FALSE} 788 | grades = pd.DataFrame( 789 | { 790 | "person": 791 | [ 792 | "Adam", "James", "Diego", "Beth", "Francis", "Qian", 793 | "Ryan", "Selma" 794 | ], 795 | "school_grade": 796 | [ 797 | 6, 7, 7, 8, 6, 7, 8, 8 798 | ], 799 | "subject": 800 | [ 801 | "Math", "Math", "English", "Science", "English", 802 | "Science", "Math", "PE" 803 | ], 804 | "test_score": 805 | [ 806 | 80, 84, 67, 87, 55, 75, 85, 70 807 | ] 808 | } 809 | ) 810 | print(grades) 811 | ``` 812 | 813 | ## Trickier Stuff 814 | 815 | ```{python, echo = TRUE} 816 | grades["math_scores"] = grades.loc[ 817 | grades["subject"] == "Math" 818 | ].groupby( 819 | ["school_grade"] 820 | )["test_score"].transform("mean") 821 | grades["Math_Average_In_This_Grade"] = grades.groupby( 822 | "school_grade" 823 | )["math_scores"].transform("max") 824 | grades.drop("math_scores", axis=1) 825 | ``` 826 | 827 | ## Trickier Stuff 828 | 829 | ```{python, echo = TRUE} 830 | print(grades) 831 | ``` 832 | 833 | # Automation 834 | 835 | ## Automation 836 | 837 | - Data cleaning is often very repetitive 838 | - You shouldn't let it be! 839 | - Not just to save yourself work and tedium, but also because standardizing your process so you only have to write the code *once* both reduces errors and means that if you have to change something you only have to change it once 840 | - So let's automate! Two ways we'll do it here: for loops across columns, for loops more generally, and writing functions 841 | 842 | ## for loops across columns 843 | 844 | - If you have a lot of variables, cleaning them all can be a pain. Who wants to write out the same thing a million times, say to convert all those read-in-as-text variables to numeric? 845 | - Variable selectors like `.startswith()` helps you apply a given function to a lot of the right columns at once in addition to regular ways like `1:5` 846 | 847 | ## startswith 848 | 849 | ```{python, echo = TRUE} 850 | stock_data["price_growth_since_march_4"] = stock_data.sort_values(["ticker", "date"]).groupby("ticker")["stock_price"].apply( 851 | lambda x: x/x.head(1).values[0] - 1) 852 | 853 | stock_data["price_growth_daily"] = (stock_data["stock_price"] / stock_data.sort_values(["ticker", "date"]).groupby( 854 | "ticker")["stock_price"].shift(1) - 1) 855 | ``` 856 | 857 | ## startswith 858 | 859 | - `.startswith("price_growth")` is the same here as `4:5` or `["price_growth_since_march_4", "price_growth_daily"]` 860 | 861 | 862 | ```{python, echo = TRUE} 863 | growth_cols = [col for col in stock_data.columns if col.startswith("price_growth")] 864 | stock_growth = stock_data.copy() 865 | stock_growth[growth_cols] *= 10000 866 | 867 | print(stock_growth) 868 | ``` 869 | 870 | ## Multiple functions at once 871 | 872 | ```{python, echo = TRUE} 873 | # Undo what we just did 874 | stock_growth[growth_cols] /= 10000 875 | for col in growth_cols: 876 | stock_growth[col+"_pct"] = stock_growth[col] * 100 877 | stock_growth[col+"_bps"] = stock_growth[col] * 10000 878 | print(stock_growth) 879 | ``` 880 | 881 | ```{python, echo = FALSE} 882 | stock_data = stock_data[["ticker", "date", "stock_price"]] 883 | stock_data["stock_price_pounds"] = stock_data.loc[:, 884 | (stock_data.dtypes == 885 | "float64") 886 | ]/1.36 887 | ``` 888 | 889 | 890 | ## Writing Functions 891 | 892 | - Generally, **if you're going to do the same thing more than once, you're probably better off writing a function** 893 | - Reduces errors, saves time, makes code reusable later! 894 | 895 | ```{python, echo = TRUE, eval = FALSE} 896 | def function_name(argument1: list = None, 897 | argument2: set = set()) -> set: 898 | """This function has type hints AND a doc string. What 899 | a life of luxury this is.""" 900 | # do some stuff 901 | some_value = 100*argument1 902 | another_value = argument2/set(some_value) 903 | return another_value 904 | # alternatively, without saving another_value 905 | # return argument2/some_value 906 | ``` 907 | 908 | ## Function-writing tips 909 | 910 | - Make sure to think about what kind of values your function accepts and make sure that what it returns is consistent so you know what you're getting 911 | - This is a really deep topic to cover in two slides, and mostly I just want to poke you and encourage you to do it. At least, if you find yourself doing something a bunch of times in a row, just take the code, stick it inside a `def` wrapper, and instead use a bunch of calls to that function in a row 912 | 913 | # Finishing Up, and an Example! 914 | 915 | ## Some Final Notes 916 | 917 | - We can't possibly cover everything. So one last note, about saving your data! 918 | - What to do when you're done and want to save your processed data? 919 | - There are a bunch of formats, one of which is parquet with `to_parquet()` 920 | - Saving data for sharing: `to_csv()` makes a CSV. Yay! 921 | 922 | ## Some Final Notes 923 | 924 | - Also, please, please, *please* **DOCUMENT YOUR DATA** 925 | - At the very least, keep a spreadsheet/\code{DataFrame}\dictionary with a set of descriptions for each of your variables 926 | 927 | ## A Walkthrough 928 | 929 | - Let's clean some data! -------------------------------------------------------------------------------- /EADA/InstLevel.sav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/EADA/InstLevel.sav -------------------------------------------------------------------------------- /EADA/InstLevel.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/EADA/InstLevel.xlsx -------------------------------------------------------------------------------- /EADA/Schools.sav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/EADA/Schools.sav -------------------------------------------------------------------------------- /EADA/Schools.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/EADA/Schools.xlsx -------------------------------------------------------------------------------- /EADA/SchoolsDoc2019.doc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/EADA/SchoolsDoc2019.doc -------------------------------------------------------------------------------- /EADA/instlevel.sas7bdat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/EADA/instlevel.sas7bdat -------------------------------------------------------------------------------- /EADA/schools.sas7bdat: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/EADA/schools.sas7bdat -------------------------------------------------------------------------------- /IPEDS/STATA_RV_942020-614.do: -------------------------------------------------------------------------------- 1 | * Created: 9/4/2020 3:30:48 PM 2 | * Modify the path below to point to your data file. 3 | * The specified subdirectory was not created on 4 | * your computer. You will need to do this. 5 | * 6 | * This read program must be ran against the specified 7 | * data file. This file is specified in the program 8 | * and must be saved separately. 9 | * 10 | * This program does not provide tab or summaries for all 11 | * variables. 12 | * 13 | * There may be missing data for some institutions due 14 | * to the merge used to create this file. 15 | * 16 | * This program does not include reserved values in its 17 | * calculations for missing values. 18 | * 19 | * You may need to adjust your memory settings depending 20 | * upon the number of variables and records. 21 | * 22 | * The save command may need to be modified per user 23 | * requirements. 24 | * 25 | * For long lists of value labels, the titles may be 26 | * shortened per program requirements. 27 | * 28 | label drop _all 29 | insheet using "STATA_RV_942020-614.csv", clear 30 | label data STATA_RV_942020_614 31 | label variable unitid "UNITID" 32 | label variable instnm "Institution Name" 33 | label variable year "Survey year 2018" 34 | label variable enrtot "Total enrollment" 35 | label variable fte "Full-time equivalent fall enrollment" 36 | label variable efug "Undergraduate enrollment" 37 | label variable efgrad "Graduate enrollment" 38 | label variable pctenran "Percent of total enrollment that are American Indian or Alaska Native" 39 | label variable pctenras "Percent of total enrollment that are Asian" 40 | label variable pctenrbk "Percent of total enrollment that are Black or African American" 41 | label variable pctenrhs "Percent of total enrollment that are Hispanic/Latino" 42 | label variable pctenrnh "Percent of total enrollment that are Native Hawaiian or Other Pacific Islander" 43 | label variable pctenrwh "Percent of total enrollment that are White" 44 | label variable pctenr2m "Percent of total enrollment that are two or more races" 45 | label variable pctenrun "Percent of total enrollment that are Race/ethnicity unknown" 46 | label variable pctenrnr "Percent of total enrollment that are Nonresident Alien" 47 | label variable pctenrap "Percent of total enrollment that are Asian/Native Hawaiian/Pacific Islander" 48 | label variable pctenrw "Percent of total enrollment that are women" 49 | label variable dvef13 "Percent of undergraduate enrollment under 18" 50 | label variable dvef14 "Percent of undergraduate enrollment 18-24" 51 | label variable dvef15 "Percent of undergraduate enrollment, 25-64" 52 | label variable dvef16 "Percent of undergraduate enrollment over 65" 53 | label variable pctdeexc "Percent of students enrolled exclusively in distance education courses" 54 | label variable pctdesom "Percent of students enrolled in some but not all distance education courses" 55 | label variable pctdenon "Percent of students not enrolled in any distance education courses" 56 | label variable rminsttp "Percent of first-time undergraduates - in-state" 57 | label variable rmousttp "Percent of first-time undergraduates - out-of-state" 58 | label variable rmfrgncp "Percent of first-time undergraduates - foreign countries" 59 | label variable f1tufeft "Revenues from tuition and fees per FTE (GASB)" 60 | label variable f1stapft "Revenues from state appropriations per FTE (GASB)" 61 | label variable f1lcapft "Revenues from local appropriations per FTE (GASB)" 62 | label variable f1gvgcft "Revenues from government grants and contracts per FTE (GASB)" 63 | label variable f1pggcft "Revenues from private gifts, grants, and contracts per FTE (GASB)" 64 | label variable f1invrft "Revenues from investment return per FTE (GASB)" 65 | label variable f1otrvft "Other core revenues per FTE (GASB)" 66 | label variable f1endmft "Endowment assets (year end) per FTE enrollment (GASB)" 67 | label variable f2endmft "Endowment assets (year end) per FTE enrollment (FASB)" 68 | label variable npist2 "Average net price-students awarded grant or scholarship aid, 2017-18" 69 | label variable npgrn2 "Average net price-students awarded grant or scholarship aid, 2017-18" 70 | 71 | 72 | 73 | summarize enrtot 74 | summarize fte 75 | summarize efug 76 | summarize efgrad 77 | summarize pctenran 78 | summarize pctenras 79 | summarize pctenrbk 80 | summarize pctenrhs 81 | summarize pctenrnh 82 | summarize pctenrwh 83 | summarize pctenr2m 84 | summarize pctenrun 85 | summarize pctenrnr 86 | summarize pctenrap 87 | summarize pctenrw 88 | summarize dvef13 89 | summarize dvef14 90 | summarize dvef15 91 | summarize dvef16 92 | summarize pctdeexc 93 | summarize pctdesom 94 | summarize pctdenon 95 | summarize rminsttp 96 | summarize rmousttp 97 | summarize rmfrgncp 98 | summarize f1tufeft 99 | summarize f1stapft 100 | summarize f1lcapft 101 | summarize f1gvgcft 102 | summarize f1pggcft 103 | summarize f1invrft 104 | summarize f1otrvft 105 | summarize f1endmft 106 | summarize f2endmft 107 | summarize npist2 108 | summarize npgrn2 109 | 110 | 111 | save cdsfile_all_STATA_RV_942020-614.dta -------------------------------------------------------------------------------- /IPEDS/STATA_RV_942020-662.do: -------------------------------------------------------------------------------- 1 | * Created: 9/4/2020 6:16:49 PM 2 | * Modify the path below to point to your data file. 3 | * The specified subdirectory was not created on 4 | * your computer. You will need to do this. 5 | * 6 | * This read program must be ran against the specified 7 | * data file. This file is specified in the program 8 | * and must be saved separately. 9 | * 10 | * This program does not provide tab or summaries for all 11 | * variables. 12 | * 13 | * There may be missing data for some institutions due 14 | * to the merge used to create this file. 15 | * 16 | * This program does not include reserved values in its 17 | * calculations for missing values. 18 | * 19 | * You may need to adjust your memory settings depending 20 | * upon the number of variables and records. 21 | * 22 | * The save command may need to be modified per user 23 | * requirements. 24 | * 25 | * For long lists of value labels, the titles may be 26 | * shortened per program requirements. 27 | * 28 | label drop _all 29 | insheet using "STATA_RV_942020-662.csv", clear 30 | label data STATA_RV_942020_662 31 | label variable unitid "UNITID" 32 | label variable instnm "Institution Name" 33 | label variable year "Survey year 2018" 34 | label variable enrtot "Total enrollment" 35 | label variable fte "Full-time equivalent fall enrollment" 36 | label variable efug "Undergraduate enrollment" 37 | label variable efgrad "Graduate enrollment" 38 | label variable pctenran "Percent of total enrollment that are American Indian or Alaska Native" 39 | label variable pctenras "Percent of total enrollment that are Asian" 40 | label variable pctenrbk "Percent of total enrollment that are Black or African American" 41 | label variable pctenrhs "Percent of total enrollment that are Hispanic/Latino" 42 | label variable pctenrnh "Percent of total enrollment that are Native Hawaiian or Other Pacific Islander" 43 | label variable pctenrwh "Percent of total enrollment that are White" 44 | label variable pctenr2m "Percent of total enrollment that are two or more races" 45 | label variable pctenrun "Percent of total enrollment that are Race/ethnicity unknown" 46 | label variable pctenrnr "Percent of total enrollment that are Nonresident Alien" 47 | label variable pctenrap "Percent of total enrollment that are Asian/Native Hawaiian/Pacific Islander" 48 | label variable pctenrw "Percent of total enrollment that are women" 49 | label variable dvef13 "Percent of undergraduate enrollment under 18" 50 | label variable dvef14 "Percent of undergraduate enrollment 18-24" 51 | label variable dvef15 "Percent of undergraduate enrollment, 25-64" 52 | label variable dvef16 "Percent of undergraduate enrollment over 65" 53 | label variable pctdeexc "Percent of students enrolled exclusively in distance education courses" 54 | label variable pctdesom "Percent of students enrolled in some but not all distance education courses" 55 | label variable pctdenon "Percent of students not enrolled in any distance education courses" 56 | label variable rminsttp "Percent of first-time undergraduates - in-state" 57 | label variable rmousttp "Percent of first-time undergraduates - out-of-state" 58 | label variable rmfrgncp "Percent of first-time undergraduates - foreign countries" 59 | label variable f1tufeft "Revenues from tuition and fees per FTE (GASB)" 60 | label variable f1stapft "Revenues from state appropriations per FTE (GASB)" 61 | label variable f1lcapft "Revenues from local appropriations per FTE (GASB)" 62 | label variable f1gvgcft "Revenues from government grants and contracts per FTE (GASB)" 63 | label variable f1pggcft "Revenues from private gifts, grants, and contracts per FTE (GASB)" 64 | label variable f1invrft "Revenues from investment return per FTE (GASB)" 65 | label variable f1otrvft "Other core revenues per FTE (GASB)" 66 | label variable f1endmft "Endowment assets (year end) per FTE enrollment (GASB)" 67 | label variable f2endmft "Endowment assets (year end) per FTE enrollment (FASB)" 68 | label variable npist2 "Average net price-students awarded grant or scholarship aid, 2017-18" 69 | label variable npgrn2 "Average net price-students awarded grant or scholarship aid, 2017-18" 70 | label variable f2d01 "Tuition and fees - Total" 71 | label variable f2d16 "Total revenues and investment return - Total" 72 | 73 | 74 | 75 | summarize enrtot 76 | summarize fte 77 | summarize efug 78 | summarize efgrad 79 | summarize pctenran 80 | summarize pctenras 81 | summarize pctenrbk 82 | summarize pctenrhs 83 | summarize pctenrnh 84 | summarize pctenrwh 85 | summarize pctenr2m 86 | summarize pctenrun 87 | summarize pctenrnr 88 | summarize pctenrap 89 | summarize pctenrw 90 | summarize dvef13 91 | summarize dvef14 92 | summarize dvef15 93 | summarize dvef16 94 | summarize pctdeexc 95 | summarize pctdesom 96 | summarize pctdenon 97 | summarize rminsttp 98 | summarize rmousttp 99 | summarize rmfrgncp 100 | summarize f1tufeft 101 | summarize f1stapft 102 | summarize f1lcapft 103 | summarize f1gvgcft 104 | summarize f1pggcft 105 | summarize f1invrft 106 | summarize f1otrvft 107 | summarize f1endmft 108 | summarize f2endmft 109 | summarize npist2 110 | summarize npgrn2 111 | summarize f2d01 112 | summarize f2d16 113 | 114 | 115 | save cdsfile_all_STATA_RV_942020-662.dta -------------------------------------------------------------------------------- /IPEDS/cdsfile_all_STATA_RV_942020-310.dta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/IPEDS/cdsfile_all_STATA_RV_942020-310.dta -------------------------------------------------------------------------------- /IPEDS/cdsfile_all_STATA_RV_942020-417.dta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/IPEDS/cdsfile_all_STATA_RV_942020-417.dta -------------------------------------------------------------------------------- /IPEDS/cdsfile_all_STATA_RV_942020-614.dta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/IPEDS/cdsfile_all_STATA_RV_942020-614.dta -------------------------------------------------------------------------------- /IPEDS/cdsfile_all_STATA_RV_942020-662.dta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/IPEDS/cdsfile_all_STATA_RV_942020-662.dta -------------------------------------------------------------------------------- /NYT/README_masksurvey.txt: -------------------------------------------------------------------------------- 1 | # Mask-Wearing Survey Data 2 | 3 | The New York Times is releasing estimates of [mask usage](https://www.nytimes.com/interactive/2020/07/17/upshot/coronavirus-face-mask-map.html) by county in the United States. 4 | 5 | This data comes from a large number of interviews conducted online by the global data and survey firm Dynata at the request of The New York Times. The firm asked a question about mask use to obtain 250,000 survey responses between July 2 and July 14, enough data to provide estimates more detailed than the state level. (Several states have imposed new mask requirements since the completion of these interviews.) 6 | 7 | Specifically, each participant was asked: _How often do you wear a mask in public when you expect to be within six feet of another person?_ 8 | 9 | This survey was conducted a single time, and at this point we have no plans to update the data or conduct the survey again. 10 | 11 | ## Data 12 | 13 | Data on the estimated prevalence of mask-wearing in counties in the United States can be found in the **[mask-use-by-county.csv](mask-use-by-county.csv)** file. ([Raw CSV](https://raw.githubusercontent.com/nytimes/covid-19-data/master/mask-use/mask-use-by-county.csv)) 14 | 15 | ``` 16 | COUNTYFP,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS 17 | 01001,0.053,0.074,0.134,0.295,0.444 18 | 01003,0.083,0.059,0.098,0.323,0.436 19 | 01005,0.067,0.121,0.12,0.201,0.491 20 | ``` 21 | 22 | The fields have the following definitions: 23 | 24 | **COUNTYFP**: The county FIPS code. 25 | **NEVER**: The estimated share of people in this county who would say **never** in response to the question “How often do you wear a mask in public when you expect to be within six feet of another person?” 26 | **RARELY**: The estimated share of people in this county who would say **rarely** 27 | **SOMETIMES**: The estimated share of people in this county who would say **sometimes** 28 | **FREQUENTLY**: The estimated share of people in this county who would say **frequently** 29 | **ALWAYS**: The estimated share of people in this county who would say **always** 30 | 31 | ## Methodology 32 | 33 | To transform raw survey responses into county-level estimates, the survey data was weighted by age and gender, and survey respondents’ locations were approximated from their ZIP codes. Then estimates of mask-wearing were made for each census tract by taking a weighted average of the 200 nearest responses, with closer responses getting more weight in the average. These tract-level estimates were then rolled up to the county level according to each tract’s total population. 34 | 35 | By rolling the estimates up to counties, it reduces a lot of the random noise that is seen at the tract level. In addition, the shapes in the map are constructed from census tracts that have been merged together — this helps in displaying a detailed map, but is less useful than county-level in analyzing the data. 36 | 37 | ## License and Attribution 38 | 39 | This data is licensed under the same terms as our Coronavirus Data in the United States data. In general, we are making this data publicly available for broad, noncommercial public use including by medical and public health researchers, policymakers, analysts and local news media. 40 | 41 | If you use this data, you must attribute it to “The New York Times and Dynata” in any publication. If you would like a more expanded description of the data, you could say “Estimates from The New York Times, based on roughly 250,000 interviews conducted by Dynata from July 2 to July 14.” 42 | 43 | If you use it in an online presentation, we would appreciate it if you would link to our graphic discussing these results [https://www.nytimes.com/interactive/2020/07/17/upshot/coronavirus-face-mask-map.html](https://www.nytimes.com/interactive/2020/07/17/upshot/coronavirus-face-mask-map.html). 44 | 45 | If you use this data, please let us know at covid-data@nytimes.com. 46 | 47 | See our [LICENSE](https://github.com/nytimes/covid-19-data/blob/master/LICENSE) for the full terms of use for this data. 48 | 49 | ## Contact Us 50 | 51 | If you have questions about the data or licensing conditions, please contact us at: 52 | 53 | covid-data@nytimes.com 54 | 55 | ## Contributors 56 | 57 | Josh Katz, Margot Sanger-Katz and Kevin Quealy. 58 | -------------------------------------------------------------------------------- /Pandas Example Walkthrough.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 25, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "data": { 10 | "text/html": [ 11 | "

\n", 12 | "\n", 25 | "\n", 26 | " \n", 27 | " \n", 28 | " \n", 29 | " \n", 30 | " \n", 31 | " \n", 32 | " \n", 33 | " \n", 34 | " \n", 35 | " \n", 36 | " \n", 37 | " \n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | "

	unitid	pctdeexc	tuition_share
0	100654	2.0	NaN
1	100663	26.0	0.160065
2	100706	7.0	0.286900
3	100724	10.0	0.245470
4	100751	10.0	0.362953
...	...	...	...
3731	494597	NaN	NaN
3732	494603	NaN	NaN
3733	494630	NaN	NaN
3734	494685	NaN	NaN
3735	494737	NaN	NaN

\n", 103 | "

3736 rows × 3 columns

\n", 104 | "

" 105 | ], 106 | "text/plain": [ 107 | " unitid pctdeexc tuition_share\n", 108 | "0 100654 2.0 NaN\n", 109 | "1 100663 26.0 0.160065\n", 110 | "2 100706 7.0 0.286900\n", 111 | "3 100724 10.0 0.245470\n", 112 | "4 100751 10.0 0.362953\n", 113 | "... ... ... ...\n", 114 | "3731 494597 NaN NaN\n", 115 | "3732 494603 NaN NaN\n", 116 | "3733 494630 NaN NaN\n", 117 | "3734 494685 NaN NaN\n", 118 | "3735 494737 NaN NaN\n", 119 | "\n", 120 | "[3736 rows x 3 columns]" 121 | ] 122 | }, 123 | "execution_count": 25, 124 | "metadata": {}, 125 | "output_type": "execute_result" 126 | } 127 | ], 128 | "source": [ 129 | "# GOAL:\n", 130 | "# Do a simplified version of the data-cleaning necessary for my most recent project.\n", 131 | "# Create data with one observation *per college per day*\n", 132 | "# incorporating data from:\n", 133 | "# IPEDS 2018: IPEDS/cdsfile_all_STATA_RV_942020-662.dta -> get percent in distance ed, and tuition share (f1tufeft/(f1stapft+f1lcapft+f1gvgcft+f1pggcft+f1invrft+f1otrvft+f1endmft))\n", 134 | "# IPEDS 2019: IPEDS/cdsfile_all_STATA_RV_942020-417.dta -> get whether the college is private\n", 135 | "# EADA: EADA/InstLevel.xlsx -> get whether hte college is Division I in sports (ClassificationCode is 1 through 3)\n", 136 | "# NY Times: NYT/us-counties_cases.csv -> Get by-county information on COVID cases on July 31\n", 137 | "# Census: Census/co-est2019-annres.csv -> Get by-county population in 2019\n", 138 | "# politicaldata house_116 data -> Get 2018 congressoinal DW-Nominate scores\n", 139 | "# foot_traffic_panel.Rdata -> Pre-prepared college-day file of foot traffic visits to nearby locations\n", 140 | "\n", 141 | "import pandas as pd\n", 142 | "import numpy as np\n", 143 | "\n", 144 | "# IPEDS 2018\n", 145 | "ipeds2018 = pd.read_stata(\"IPEDS/cdsfile_all_STATA_RV_942020-662.dta\")\n", 146 | "ipeds2018['tuition_share'] = (ipeds2018['f1tufeft']/(\n", 147 | " ipeds2018['f1stapft']+ipeds2018['f1lcapft']+ipeds2018['f1gvgcft']+\n", 148 | " ipeds2018['f1pggcft']+ipeds2018['f1invrft']+ipeds2018['f1otrvft']+ipeds2018['f1endmft']))\n", 149 | "\n", 150 | "keep = ['unitid', 'pctdeexc', 'tuition_share']\n", 151 | "\n", 152 | "ipeds2018 = ipeds2018[keep]\n", 153 | "\n", 154 | "ipeds2018\n", 155 | "\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 46, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "data": { 165 | "text/html": [ 166 | "

\n", 167 | "\n", 180 | "\n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | "

	unitid	fips
0	100654	1089
1	100663	1073
2	100706	1089
3	100724	1101
4	100751	1125
...	...	...
3731	494597	12057
3732	494603	48439
3733	494630	48029
3734	494685	29183
3735	494737	36047

\n", 246 | "

3736 rows × 2 columns

\n", 247 | "

" 248 | ], 249 | "text/plain": [ 250 | " unitid fips\n", 251 | "0 100654 1089\n", 252 | "1 100663 1073\n", 253 | "2 100706 1089\n", 254 | "3 100724 1101\n", 255 | "4 100751 1125\n", 256 | "... ... ...\n", 257 | "3731 494597 12057\n", 258 | "3732 494603 48439\n", 259 | "3733 494630 48029\n", 260 | "3734 494685 29183\n", 261 | "3735 494737 36047\n", 262 | "\n", 263 | "[3736 rows x 2 columns]" 264 | ] 265 | }, 266 | "execution_count": 46, 267 | "metadata": {}, 268 | "output_type": "execute_result" 269 | } 270 | ], 271 | "source": [ 272 | "# IPEDS 2019\n", 273 | "ipeds2019 = pd.read_stata(\"IPEDS/cdsfile_all_STATA_RV_942020-417.dta\")\n", 274 | "\n", 275 | "ipeds2019['Private'] = ipeds2019['sector'].apply(lambda x: x in ['Private not-for-profit, 4-year or above',\n", 276 | " 'Private not-for-profit, 2-year',\n", 277 | " 'Private not-for-profit, less-than 2-year'])\n", 278 | "\n", 279 | "ipeds2019 = ipeds2019[['unitid','Private']]\n", 280 | "\n", 281 | "ipeds_linker = pd.read_stata(\"IPEDS/cdsfile_all_STATA_RV_942020-417.dta\", convert_categoricals = False)\n", 282 | "ipeds_linker = ipeds_linker[['unitid', 'countycd']].rename({'countycd':'fips'}, axis = 1)\n", 283 | "\n", 284 | "ipeds_linker" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 20, 290 | "metadata": {}, 291 | "outputs": [ 292 | { 293 | "data": { 294 | "text/html": [ 295 | "

\n", 296 | "\n", 309 | "\n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | "

	unitid	DivisionOne
0	100654	True
1	100663	True
2	100706	False
3	100724	True
4	100751	True
...	...	...
2069	489201	False
2070	489937	False
2071	490805	False
2072	492069	False
2073	800001	False

\n", 375 | "

2074 rows × 2 columns

\n", 376 | "

" 377 | ], 378 | "text/plain": [ 379 | " unitid DivisionOne\n", 380 | "0 100654 True\n", 381 | "1 100663 True\n", 382 | "2 100706 False\n", 383 | "3 100724 True\n", 384 | "4 100751 True\n", 385 | "... ... ...\n", 386 | "2069 489201 False\n", 387 | "2070 489937 False\n", 388 | "2071 490805 False\n", 389 | "2072 492069 False\n", 390 | "2073 800001 False\n", 391 | "\n", 392 | "[2074 rows x 2 columns]" 393 | ] 394 | }, 395 | "execution_count": 20, 396 | "metadata": {}, 397 | "output_type": "execute_result" 398 | } 399 | ], 400 | "source": [ 401 | "eada = pd.read_csv('EADA/InstLevel.csv')\n", 402 | "\n", 403 | "eada['DivisionOne'] = eada['ClassificationCode'] < 4\n", 404 | "\n", 405 | "eada = eada[['unitid','DivisionOne']]\n", 406 | "\n", 407 | "eada" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 79, 413 | "metadata": {}, 414 | "outputs": [ 415 | { 416 | "data": { 417 | "text/html": [ 418 | "

\n", 419 | "\n", 432 | "\n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | "

	County	fips	cases
385994	Autauga County, Alabama	1001	1015
385995	Baldwin County, Alabama	1003	3101
385996	Barbour County, Alabama	1005	598
385997	Bibb County, Alabama	1007	363
385998	Blount County, Alabama	1009	767
...	...	...	...
389206	Sweetwater County, Wyoming	56037	240
389207	Teton County, Wyoming	56039	335
389208	Uinta County, Wyoming	56041	254
389209	Washakie County, Wyoming	56043	47
389210	Weston County, Wyoming	56045	5

\n", 510 | "

3188 rows × 3 columns

\n", 511 | "

" 512 | ], 513 | "text/plain": [ 514 | " County fips cases\n", 515 | "385994 Autauga County, Alabama 1001 1015\n", 516 | "385995 Baldwin County, Alabama 1003 3101\n", 517 | "385996 Barbour County, Alabama 1005 598\n", 518 | "385997 Bibb County, Alabama 1007 363\n", 519 | "385998 Blount County, Alabama 1009 767\n", 520 | "... ... ... ...\n", 521 | "389206 Sweetwater County, Wyoming 56037 240\n", 522 | "389207 Teton County, Wyoming 56039 335\n", 523 | "389208 Uinta County, Wyoming 56041 254\n", 524 | "389209 Washakie County, Wyoming 56043 47\n", 525 | "389210 Weston County, Wyoming 56045 5\n", 526 | "\n", 527 | "[3188 rows x 3 columns]" 528 | ] 529 | }, 530 | "execution_count": 79, 531 | "metadata": {}, 532 | "output_type": "execute_result" 533 | } 534 | ], 535 | "source": [ 536 | "nyt = pd.read_csv('NYT/us-counties_cases.csv')\n", 537 | "\n", 538 | "nyt = nyt.loc[(nyt['date'] == '2020-07-31')]\n", 539 | "\n", 540 | "nyt['County'] = nyt['county'] + ' County, ' + nyt['state']\n", 541 | "\n", 542 | "nyt = nyt[['County','fips','cases']]\n", 543 | "nyt = nyt.dropna()\n", 544 | "\n", 545 | "nyt['fips'] = nyt['fips'].astype(int)\n", 546 | "\n", 547 | "nyt" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 75, 553 | "metadata": {}, 554 | "outputs": [ 555 | { 556 | "data": { 557 | "text/html": [ 558 | "

\n", 559 | "\n", 572 | "\n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | "

	County	Population
0	Autauga County, Alabama	55869
1	Baldwin County, Alabama	223234
2	Barbour County, Alabama	24686
3	Bibb County, Alabama	22394
4	Blount County, Alabama	57826
...	...	...
3137	Sweetwater County, Wyoming	42343
3138	Teton County, Wyoming	23464
3139	Uinta County, Wyoming	20226
3140	Washakie County, Wyoming	7805
3141	Weston County, Wyoming	6927

\n", 638 | "

3142 rows × 2 columns

\n", 639 | "

" 640 | ], 641 | "text/plain": [ 642 | " County Population\n", 643 | "0 Autauga County, Alabama 55869\n", 644 | "1 Baldwin County, Alabama 223234\n", 645 | "2 Barbour County, Alabama 24686\n", 646 | "3 Bibb County, Alabama 22394\n", 647 | "4 Blount County, Alabama 57826\n", 648 | "... ... ...\n", 649 | "3137 Sweetwater County, Wyoming 42343\n", 650 | "3138 Teton County, Wyoming 23464\n", 651 | "3139 Uinta County, Wyoming 20226\n", 652 | "3140 Washakie County, Wyoming 7805\n", 653 | "3141 Weston County, Wyoming 6927\n", 654 | "\n", 655 | "[3142 rows x 2 columns]" 656 | ] 657 | }, 658 | "execution_count": 75, 659 | "metadata": {}, 660 | "output_type": "execute_result" 661 | } 662 | ], 663 | "source": [ 664 | "census = pd.read_csv('Census/county_simple.csv')\n", 665 | "\n", 666 | "census['2019'] = census['2019'].str.replace(',','').astype(int)\n", 667 | "\n", 668 | "census = census[['County','2019']].rename({'2019':'Population'}, axis =1 )\n", 669 | "\n", 670 | "census['County'] = census['County'].apply(lambda x: x[1:])\n", 671 | "\n", 672 | "census" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": 86, 678 | "metadata": {}, 679 | "outputs": [ 680 | { 681 | "data": { 682 | "text/html": [ 683 | "

\n", 684 | "\n", 697 | "\n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | "

	unitid	pctdeexc	tuition_share	Private	DivisionOne	fips	County	cases	Population
3	NaN	NaN	NaN	NaN	NaN	NaN	Barbour County, Alabama	NaN	24686
4	NaN	NaN	NaN	NaN	NaN	NaN	Bibb County, Alabama	NaN	22394
5	NaN	NaN	NaN	NaN	NaN	NaN	Blount County, Alabama	NaN	57826
6	NaN	NaN	NaN	NaN	NaN	NaN	Bullock County, Alabama	NaN	10101
7	NaN	NaN	NaN	NaN	NaN	NaN	Butler County, Alabama	NaN	19448
9	NaN	NaN	NaN	NaN	NaN	NaN	Chambers County, Alabama	NaN	33254
10	NaN	NaN	NaN	NaN	NaN	NaN	Cherokee County, Alabama	NaN	26196
11	NaN	NaN	NaN	NaN	NaN	NaN	Chilton County, Alabama	NaN	44428
12	NaN	NaN	NaN	NaN	NaN	NaN	Choctaw County, Alabama	NaN	12589
13	NaN	NaN	NaN	NaN	NaN	NaN	Clarke County, Alabama	NaN	23622
14	NaN	NaN	NaN	NaN	NaN	NaN	Clay County, Alabama	NaN	13235
15	NaN	NaN	NaN	NaN	NaN	NaN	Cleburne County, Alabama	NaN	14910
19	NaN	NaN	NaN	NaN	NaN	NaN	Coosa County, Alabama	NaN	10663
21	NaN	NaN	NaN	NaN	NaN	NaN	Crenshaw County, Alabama	NaN	13772
26	NaN	NaN	NaN	NaN	NaN	NaN	DeKalb County, Alabama	NaN	71513
28	NaN	NaN	NaN	NaN	NaN	NaN	Escambia County, Alabama	NaN	36633
30	NaN	NaN	NaN	NaN	NaN	NaN	Fayette County, Alabama	NaN	16302
31	NaN	NaN	NaN	NaN	NaN	NaN	Franklin County, Alabama	NaN	31362
32	NaN	NaN	NaN	NaN	NaN	NaN	Geneva County, Alabama	NaN	26271

\n", 943 | "

" 944 | ], 945 | "text/plain": [ 946 | " unitid pctdeexc tuition_share Private DivisionOne fips \\\n", 947 | "3 NaN NaN NaN NaN NaN NaN \n", 948 | "4 NaN NaN NaN NaN NaN NaN \n", 949 | "5 NaN NaN NaN NaN NaN NaN \n", 950 | "6 NaN NaN NaN NaN NaN NaN \n", 951 | "7 NaN NaN NaN NaN NaN NaN \n", 952 | "9 NaN NaN NaN NaN NaN NaN \n", 953 | "10 NaN NaN NaN NaN NaN NaN \n", 954 | "11 NaN NaN NaN NaN NaN NaN \n", 955 | "12 NaN NaN NaN NaN NaN NaN \n", 956 | "13 NaN NaN NaN NaN NaN NaN \n", 957 | "14 NaN NaN NaN NaN NaN NaN \n", 958 | "15 NaN NaN NaN NaN NaN NaN \n", 959 | "19 NaN NaN NaN NaN NaN NaN \n", 960 | "21 NaN NaN NaN NaN NaN NaN \n", 961 | "26 NaN NaN NaN NaN NaN NaN \n", 962 | "28 NaN NaN NaN NaN NaN NaN \n", 963 | "30 NaN NaN NaN NaN NaN NaN \n", 964 | "31 NaN NaN NaN NaN NaN NaN \n", 965 | "32 NaN NaN NaN NaN NaN NaN \n", 966 | "\n", 967 | " County cases Population \n", 968 | "3 Barbour County, Alabama NaN 24686 \n", 969 | "4 Bibb County, Alabama NaN 22394 \n", 970 | "5 Blount County, Alabama NaN 57826 \n", 971 | "6 Bullock County, Alabama NaN 10101 \n", 972 | "7 Butler County, Alabama NaN 19448 \n", 973 | "9 Chambers County, Alabama NaN 33254 \n", 974 | "10 Cherokee County, Alabama NaN 26196 \n", 975 | "11 Chilton County, Alabama NaN 44428 \n", 976 | "12 Choctaw County, Alabama NaN 12589 \n", 977 | "13 Clarke County, Alabama NaN 23622 \n", 978 | "14 Clay County, Alabama NaN 13235 \n", 979 | "15 Cleburne County, Alabama NaN 14910 \n", 980 | "19 Coosa County, Alabama NaN 10663 \n", 981 | "21 Crenshaw County, Alabama NaN 13772 \n", 982 | "26 DeKalb County, Alabama NaN 71513 \n", 983 | "28 Escambia County, Alabama NaN 36633 \n", 984 | "30 Fayette County, Alabama NaN 16302 \n", 985 | "31 Franklin County, Alabama NaN 31362 \n", 986 | "32 Geneva County, Alabama NaN 26271 " 987 | ] 988 | }, 989 | "execution_count": 86, 990 | "metadata": {}, 991 | "output_type": "execute_result" 992 | } 993 | ], 994 | "source": [ 995 | "fulldata = ipeds2018.merge(ipeds2019, on = 'unitid', how = 'outer')\n", 996 | "\n", 997 | "fulldata = fulldata.merge(eada, on = 'unitid', how = 'left')\n", 998 | "fulldata.loc[fulldata['DivisionOne'].apply(lambda x: np.isnan(x)), 'DivisionOne'] = False\n", 999 | "\n", 1000 | "fulldata = fulldata.merge(ipeds_linker, on = 'unitid', how = 'outer')\n", 1001 | "\n", 1002 | "fulldata = fulldata.merge(nyt, on = 'fips', how = 'left')\n", 1003 | "\n", 1004 | "fulldata = fulldata.merge(census, on = 'County', how = 'right')\n", 1005 | "fulldata = fulldata.loc[fulldata['unitid'].apply(lambda x: np.isnan(x))]\n", 1006 | "\n", 1007 | "fulldata[1:20]" 1008 | ] 1009 | }, 1010 | { 1011 | "cell_type": "code", 1012 | "execution_count": 27, 1013 | "metadata": {}, 1014 | "outputs": [ 1015 | { 1016 | "data": { 1017 | "text/plain": [ 1018 | "True" 1019 | ] 1020 | }, 1021 | "execution_count": 27, 1022 | "metadata": {}, 1023 | "output_type": "execute_result" 1024 | } 1025 | ], 1026 | "source": [ 1027 | "np.isnan(np.nan)" 1028 | ] 1029 | }, 1030 | { 1031 | "cell_type": "code", 1032 | "execution_count": 76, 1033 | "metadata": {}, 1034 | "outputs": [ 1035 | { 1036 | "data": { 1037 | "text/html": [ 1038 | "

\n", 1039 | "\n", 1052 | "\n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | " \n", 1065 | " \n", 1066 | " \n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | "

	date	county	state	fips	cases	deaths
0	2020-01-21	Snohomish	Washington	53061.0	1	0
1	2020-01-22	Snohomish	Washington	53061.0	1	0
2	2020-01-23	Snohomish	Washington	53061.0	1	0
3	2020-01-24	Cook	Illinois	17031.0	1	0
4	2020-01-24	Snohomish	Washington	53061.0	1	0
...	...	...	...	...	...	...
498891	2020-09-03	Sweetwater	Wyoming	56037.0	304	2
498892	2020-09-03	Teton	Wyoming	56039.0	435	1
498893	2020-09-03	Uinta	Wyoming	56041.0	305	2
498894	2020-09-03	Washakie	Wyoming	56043.0	108	6
498895	2020-09-03	Weston	Wyoming	56045.0	20	0

\n", 1166 | "

498896 rows × 6 columns

\n", 1167 | "

" 1168 | ], 1169 | "text/plain": [ 1170 | " date county state fips cases deaths\n", 1171 | "0 2020-01-21 Snohomish Washington 53061.0 1 0\n", 1172 | "1 2020-01-22 Snohomish Washington 53061.0 1 0\n", 1173 | "2 2020-01-23 Snohomish Washington 53061.0 1 0\n", 1174 | "3 2020-01-24 Cook Illinois 17031.0 1 0\n", 1175 | "4 2020-01-24 Snohomish Washington 53061.0 1 0\n", 1176 | "... ... ... ... ... ... ...\n", 1177 | "498891 2020-09-03 Sweetwater Wyoming 56037.0 304 2\n", 1178 | "498892 2020-09-03 Teton Wyoming 56039.0 435 1\n", 1179 | "498893 2020-09-03 Uinta Wyoming 56041.0 305 2\n", 1180 | "498894 2020-09-03 Washakie Wyoming 56043.0 108 6\n", 1181 | "498895 2020-09-03 Weston Wyoming 56045.0 20 0\n", 1182 | "\n", 1183 | "[498896 rows x 6 columns]" 1184 | ] 1185 | }, 1186 | "execution_count": 76, 1187 | "metadata": {}, 1188 | "output_type": "execute_result" 1189 | } 1190 | ], 1191 | "source": [ 1192 | "nyt = pd.read_csv('NYT/us-counties_cases.csv')\n", 1193 | "\n", 1194 | "nyt" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "code", 1199 | "execution_count": null, 1200 | "metadata": {}, 1201 | "outputs": [], 1202 | "source": [] 1203 | } 1204 | ], 1205 | "metadata": { 1206 | "kernelspec": { 1207 | "display_name": "Python 3", 1208 | "language": "python", 1209 | "name": "python3" 1210 | }, 1211 | "language_info": { 1212 | "codemirror_mode": { 1213 | "name": "ipython", 1214 | "version": 3 1215 | }, 1216 | "file_extension": ".py", 1217 | "mimetype": "text/x-python", 1218 | "name": "python", 1219 | "nbconvert_exporter": "python", 1220 | "pygments_lexer": "ipython3", 1221 | "version": "3.8.5" 1222 | } 1223 | }, 1224 | "nbformat": 4, 1225 | "nbformat_minor": 4 1226 | } 1227 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DataWranglingWorkshopFiles 2 | These are the files for the walk-through example at the end of the data wrangling workshop(s) 3 | 4 | You can see the slides and videos for these workshops at the links below. Also, the raw RMD files to make the slides are also in this repo. 5 | 6 | ## Tidyverse version: 7 | 8 | [Slides](https://rpubs.com/NickCHK/data_wrangling_tidyverse) 9 | 10 | [Video](https://www.youtube.com/watch?v=CnY5Y5ANnjE) 11 | 12 | ## data.table version: 13 | 14 | [Slides](https://rpubs.com/NickCHK/data_wrangling_data_table) 15 | 16 | [Video](https://www.youtube.com/watch?v=EdPKcy1WKD0) 17 | 18 | ## pandas version: 19 | 20 | [Slides](https://rpubs.com/NickCHK/data_wrangling_pandas) 21 | 22 | [Video](https://www.youtube.com/watch?v=6rDqwji7eMc) 23 | -------------------------------------------------------------------------------- /example_walkthrough.R: -------------------------------------------------------------------------------- 1 | # GOAL: 2 | # Do a simplified version of the data-cleaning necessary for my most recent project. 3 | # Create data with one observation *per college per day* 4 | # incorporating data from: 5 | # IPEDS 2018: IPEDS/cdsfile_all_STATA_RV_942020-662.dta -> get percent in distance ed, and tuition share (f1tufeft/(f1stapft+f1lcapft+f1gvgcft+f1pggcft+f1invrft+f1otrvft+f1endmft)) 6 | # IPEDS 2019: IPEDS/cdsfile_all_STATA_RV_942020-417.dta -> get whether the college is private 7 | # EADA: EADA/InstLevel.xlsx -> get whether hte college is Division I in sports (ClassificationCode is 1 through 3) 8 | # NY Times: NYT/us-counties_cases.csv -> Get by-county information on COVID cases on July 31 9 | # Census: Census/co-est2019-annres.csv -> Get by-county population in 2019 10 | # politicaldata house_116 data -> Get 2018 congressoinal DW-Nominate scores 11 | # foot_traffic_panel.Rdata -> Pre-prepared college-day file of foot traffic visits to nearby locations -------------------------------------------------------------------------------- /example_walkthrough_data_table.R: -------------------------------------------------------------------------------- 1 | # GOAL: 2 | # Do a simplified version of the data-cleaning necessary for my most recent project. 3 | # Create data with one observation *per college per day* 4 | # incorporating data from: 5 | # IPEDS 2018: IPEDS/cdsfile_all_STATA_RV_942020-662.dta -> get percent in distance ed, and tuition share (f1tufeft/(f1stapft+f1lcapft+f1gvgcft+f1pggcft+f1invrft+f1otrvft+f1endmft)) 6 | # IPEDS 2019: IPEDS/cdsfile_all_STATA_RV_942020-417.dta -> get whether the college is private 7 | # EADA: EADA/InstLevel.xlsx -> get whether hte college is Division I in sports (ClassificationCode is 1 through 3) 8 | # NY Times: NYT/us-counties_cases.csv -> Get by-county information on COVID cases on July 31 9 | # Census: Census/co-est2019-annres.csv -> Get by-county population in 2019 10 | # politicaldata house_116 data -> Get 2018 congressoinal DW-Nominate scores 11 | # foot_traffic_panel.Rdata -> Pre-prepared college-day file of foot traffic visits to nearby locations 12 | 13 | library(haven) 14 | library(data.table) 15 | library(readxl) 16 | library(lubridate) 17 | 18 | ipeds2018 <- read_stata('IPEDS/cdsfile_all_STATA_RV_942020-662.dta') %>% 19 | as.data.table() 20 | 21 | ipeds2018 <- ipeds2018[, .(unitid, instnm, pctdesom)] 22 | 23 | ipeds2019 <- read_stata('IPEDS/cdsfile_all_STATA_RV_942020-417.dta') %>% 24 | as.data.table() 25 | ipeds2019 <- ipeds2019[, .(unitid, instnm, countycd, private = sector %in% c(2,5,8))] 26 | setnames(ipeds2019, 'countycd', 'fips') 27 | ipeds2019[, fips := as.integer(fips)] 28 | 29 | eada <- read_excel('EADA/Schools.xlsx') %>% 30 | as.data.table() 31 | 32 | table(eada$ClassificationCode, eada$classification_name) 33 | classnames <- eada[, classification_name] %>% 34 | unique() 35 | classnames <- classnames[!str_detect(classnames, 'Division II')] 36 | classnames <- classnames[str_detect(classnames, 'Division I')] 37 | 38 | eada[, div_1 := classification_name %in% classnames] 39 | setnames(eada, 'institution_name', 'instnm') 40 | eada <- eada[, .(unitid, instnm, div_1)] 41 | eada[, unitid := as.numeric(unitid)] 42 | 43 | nyt <- fread('NYT/us-counties_cases.csv')[date == ymd('2020-07-31')] 44 | nyt <- nyt[!is.na(fips)] 45 | 46 | census <- fread('Census/co-est2019-annres.csv', skip = 1) 47 | census <- census[, .(V1, V13)] 48 | setnames(census, c('V1','V13'), c('countyname', 'pop2019')) 49 | census <- census[2:nrow(census)] 50 | census[, pop2019 := str_replace_all(pop2019, ',', '')] 51 | census[, pop2019 := as.numeric(pop2019)] 52 | census[str_sub(countyname, 1, 1) == '.', countyname := str_sub(countyname, 2)] 53 | 54 | census[, county := str_sub(countyname, 1, str_locate(countyname, ',')[,1]-1)] 55 | census[, county := str_replace_all(county, 'County','')] 56 | census[, county := str_trim(county)] 57 | 58 | census[, state := str_sub(countyname, str_locate(countyname, ',')[,1]+1)] 59 | census[, state := str_trim(state)] 60 | census[, countyname := NULL] 61 | 62 | fulldata <- merge(ipeds2018, ipeds2019, by = c('unitid', 'instnm'), all = TRUE) %>% 63 | merge(eada, by = c('unitid', 'instnm'), all = TRUE) %>% 64 | merge(nyt, by = 'fips', all = TRUE) %>% 65 | merge(census, by = c('county', 'state'), all = TRUE) 66 | fulldata[is.na(private)] %>% nrow() 67 | fulldata[is.na(pctdesom)] %>% nrow() 68 | -------------------------------------------------------------------------------- /example_walkthrough_tidyverse.R: -------------------------------------------------------------------------------- 1 | # GOAL: 2 | # Do a simplified version of the data-cleaning necessary for my most recent project. 3 | # Create data with one observation *per college per day* 4 | # incorporating data from: 5 | # IPEDS 2018: IPEDS/cdsfile_all_STATA_RV_942020-662.dta -> get percent in distance ed, and tuition share (f1tufeft/(f1stapft+f1lcapft+f1gvgcft+f1pggcft+f1invrft+f1otrvft+f1endmft)) 6 | # IPEDS 2019: IPEDS/cdsfile_all_STATA_RV_942020-417.dta -> get whether the college is private 7 | # EADA: EADA/InstLevel.xlsx -> get whether hte college is Division I in sports (ClassificationCode is 1 through 3) 8 | # NY Times: NYT/us-counties_cases.csv -> Get by-county information on COVID cases on July 31 9 | # Census: Census/co-est2019-annres.csv -> Get by-county population in 2019 10 | # politicaldata house_116 data -> Get 2018 congressoinal DW-Nominate scores 11 | # foot_traffic_panel.Rdata -> Pre-prepared college-day file of foot traffic visits to nearby locations 12 | 13 | 14 | # IPEDS -> college-level data (unitid) 15 | # EADA -> college-level data (unitid?) 16 | # NY Times -> county / day data 17 | # Census -> county (textual county-level data, and skip first row) 18 | # politicaldata -> congressional district (link to ipeds with congressional district) 19 | 20 | # One observation per college 21 | 22 | # NEED: college -> county link to link up IPEDS and EADA with NY Times nad Census 23 | # NEED: colllege -> congressional district to linkup IPEDS/EADA with congressional distict 24 | 25 | library(tidyverse) 26 | library(haven) 27 | library(politicaldata) 28 | library(readxl) 29 | library(lubridate) 30 | library(tidylog) 31 | 32 | ipeds2018 <- read_dta('IPEDS/cdsfile_all_STATA_RV_942020-662.dta') %>% 33 | mutate(tuition_share = f1tufeft/(f1stapft+f1lcapft+f1gvgcft+f1pggcft+f1invrft+f1otrvft+f1endmft), 34 | distance_share = pctdeexc + pctdesom) %>% 35 | select(unitid, tuition_share, distance_share) 36 | 37 | ipeds2019 <- read_dta('IPEDS/cdsfile_all_STATA_RV_942020-417.dta') %>% 38 | mutate(private = sector == 2) %>% 39 | select(unitid, cngdstcd, fips, countycd) 40 | 41 | eada <- read_excel('EADA/InstLevel.xlsx') %>% 42 | mutate(DivisionI = ClassificationCode %in% 1:3) %>% 43 | select(unitid, DivisionI) 44 | 45 | NYT <- read_csv('NYT/us-counties_cases.csv') %>% 46 | filter(date == ymd('2020-07-31')) %>% 47 | mutate(CountyandState = paste0(county, ' ', 48 | case_when( 49 | state == 'Alaska' ~ ', ', 50 | state == 'Louisiana' ~ 'Parish, ', 51 | TRUE ~ 'County, ' 52 | ), 53 | state)) 54 | 55 | Census <- read_csv('Census/co-est2019-annres.csv', skip = 1) %>% 56 | mutate(County = str_sub(County, 2)) %>% 57 | rename(CountyandState = County) %>% 58 | select(CountyandState, `2019`) %>% 59 | rename(Population = `2019`) 60 | 61 | # Check if unitid is key 62 | check_dupes <- function(data, vars) { 63 | data %>% 64 | select(vars) %>% 65 | duplicated() %>% 66 | max() 67 | } 68 | check_dupes(ipeds2018, 'unitid') 69 | check_dupes(ipeds2019, 'unitid') 70 | check_dupes(eada, 'unitid') 71 | check_dupes(NYT, 'CountyandState') 72 | check_dupes(Census, 'CountyandState') 73 | 74 | # Check the countyandstate values 75 | NYT %>% filter(!(CountyandState %in% Census$CountyandState)) %>% pull(CountyandState) %>% table() 76 | 77 | # FIX THE COUNTYANDSTATE THING 78 | 79 | # Merge! 80 | 81 | our_data <- ipeds2018 %>% 82 | inner_join(ipeds2019) %>% 83 | left_join(eada) %>% 84 | left_join(NYT %>% rename(countycd = fips) %>% mutate(countycd = as.numeric(countycd))) %>% 85 | left_join(Census) 86 | 87 | 88 | data("house_116") 89 | 90 | 91 | # str_pad example 92 | 93 | str_pad() 94 | 95 | '2020/01/01/sales.csv' 96 | 97 | yr <- 2020 98 | mon <- 1 99 | da <- 1 100 | 101 | str_pad(mon, 2, 'left', '0') 102 | 103 | paste0(yr, '/', 104 | str_pad(mon, 2, 'left', '0'), '/', 105 | str_pad(da, 2, 'left', '0'), '/sales.csv') 106 | -------------------------------------------------------------------------------- /foot_traffic_panel.Rdata: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataWranglingWorkshopFiles/de6b25f34ea721a742f2e00df0e60ada0ce72f2a/foot_traffic_panel.Rdata --------------------------------------------------------------------------------