├── 00-Introduction.Rmd ├── 01-Visualize-Data.Rmd ├── 02-Transform-Data.Rmd ├── 03-Tidy-Data.Rmd ├── 04-Import-Data.Rmd ├── 05-Data-Types.Rmd ├── 06-Iteration.Rmd ├── 07-Models.Rmd ├── 08-List-Columns.Rmd ├── LICENSE.md ├── README.md ├── nimbus.csv └── pdfs ├── 00-Introduction.pdf ├── 00-Reintroduction.pdf ├── 000-A-Preclass-loop.pdf ├── 01-Visualize-Data.pdf ├── 02-Transform-Data.pdf ├── 03-Tidy-Data.pdf ├── 04-Import-Data.pdf ├── 05-Data-Types.pdf ├── 06-Iteration.pdf ├── 07-Models.pdf └── 08-List-Columns.pdf /00-Introduction.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "R Notebook" 3 | output: html_notebook 4 | --- 5 | 6 | This is an R Notebook. You can use it to take notes and run code. For example, you can write your name on the line below. Try it: 7 | 8 | 9 | 10 | ```{r} 11 | # You can write code in chunks that look like this. 12 | # This chunk uses some code from base R to plot a data set. 13 | # To run the code click the Green play button to the right. Try it! 14 | plot(cars) 15 | ``` 16 | 17 | Good job! The results of a code chunk will appear beneath the chunk. You can click the x above the results to make them go away, but let's not do that. 18 | 19 | You can open a new R Notebook by going to **File > New File > R Notebook**. 20 | 21 | # Adding chunks 22 | 23 | To add a new chunk, press *Cmd+Option+I* (*Ctrl+Alt+I* on Windows), or click the *Insert* button at the top of this document, then select *R*. 24 | Try making a code chunk below: 25 | 26 | 27 | 28 | Good job! For now, you should place all of your R code inside of code chunks. 29 | 30 | ```{r} 31 | # You can click the downward facing arrow to the left of the play button to run 32 | # every chunk above the current code chunk. This is useful if the code in your 33 | # chunk depends on the code in previous chunks. For example, if you use an 34 | # object or data set made in a previous chunk. 
35 | ``` 36 | 37 | # HTML version 38 | 39 | When you save the notebook, an HTML file containing the code and output will be saved alongside it. This makes a nice, polished report of your work to share. 40 | 41 | Click the *Preview* button at the top of this document or press *Cmd+Shift+K* (*Ctrl+Shift+K* on Windows) to preview the HTML file. Try clicking *Preview* now. 42 | 43 | # Packages 44 | 45 | You can immediately run any function from base R within a notebook. But if you'd like to run a function that comes in an R package, you will need to first load the package in the notebook. 46 | 47 | Do you remember how to load the core tidyverse packages? Load them in the chunk below: 48 | 49 | ```{r} 50 | # I've already installed the tidyverse package(s) on this server 51 | # So all you need to do is load it 52 | 53 | ``` 54 | 55 | Good job! You'll need to reload your packages every time you begin a new notebook. 56 | 57 | -------------------------------------------------------------------------------- /01-Visualize-Data.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Visualize Data" 3 | output: html_notebook 4 | --- 5 | 6 | ```{r setup} 7 | library(tidyverse) 8 | ``` 9 | 10 | ```{r} 11 | mpg 12 | ``` 13 | 14 | 15 | ## Your Turn 1 16 | 17 | Run the code on the slide to make a graph. Pay strict attention to spelling, capitalization, and parentheses! 18 | 19 | ```{r} 20 | 21 | ``` 22 | 23 | ## Your Turn 2 24 | 25 | Add `color`, `size`, `alpha`, and `shape` aesthetics to your graph. Experiment. 26 | 27 | ```{r} 28 | ggplot(data = mpg) + 29 | geom_point(mapping = aes(x = displ, y = hwy)) 30 | ``` 31 | 32 | ## Your Turn 3 33 | 34 | Replace this scatterplot with one that draws boxplots. Use the cheatsheet. Try your best guess. 35 | 36 | ```{r} 37 | ggplot(mpg) + geom_point(aes(class, hwy)) 38 | ``` 39 | 40 | ## Your Turn 4 41 | 42 | Make a histogram of the `hwy` variable from `mpg`. 
43 | 44 | ```{r} 45 | 46 | ``` 47 | 48 | ## Your Turn 5 49 | 50 | Make a density plot of `hwy` colored by `class`. 51 | 52 | ```{r} 53 | 54 | ``` 55 | 56 | ## Your Turn 6 57 | 58 | Make a bar chart of `hwy` colored by `class`. 59 | 60 | ```{r} 61 | 62 | ``` 63 | 64 | ## Your Turn 7 65 | 66 | Predict what this code will do. Then run it. 67 | 68 | ```{r} 69 | ggplot(mpg) + 70 | geom_point(aes(displ, hwy)) + 71 | geom_smooth(aes(displ, hwy)) 72 | ``` 73 | 74 | ## Your Turn 8 75 | 76 | What does `getwd()` return? 77 | 78 | ```{r} 79 | 80 | ``` 81 | 82 | ## Your Turn 9 83 | 84 | Save the last plot and then locate it in the files pane. 85 | 86 | ```{r} 87 | 88 | ``` 89 | 90 | *** 91 | 92 | # Take aways 93 | 94 | You can use this code template to make thousands of graphs with **ggplot2**. 95 | 96 | ```{r eval = FALSE} 97 | ggplot(data = ) + 98 | (mapping = aes()) 99 | ``` -------------------------------------------------------------------------------- /02-Transform-Data.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Transform Data" 3 | output: html_notebook 4 | --- 5 | 6 | ```{r setup} 7 | library(tidyverse) 8 | library(babynames) 9 | library(nycflights13) 10 | 11 | # Toy datasets to use 12 | 13 | pollution <- tribble( 14 | ~city, ~size, ~amount, 15 | "New York", "large", 23, 16 | "New York", "small", 14, 17 | "London", "large", 22, 18 | "London", "small", 16, 19 | "Beijing", "large", 121, 20 | "Beijing", "small", 56 21 | ) 22 | 23 | band <- tribble( 24 | ~name, ~band, 25 | "Mick", "Stones", 26 | "John", "Beatles", 27 | "Paul", "Beatles" 28 | ) 29 | 30 | instrument <- tribble( 31 | ~name, ~plays, 32 | "John", "guitar", 33 | "Paul", "bass", 34 | "Keith", "guitar" 35 | ) 36 | 37 | instrument2 <- tribble( 38 | ~artist, ~plays, 39 | "John", "guitar", 40 | "Paul", "bass", 41 | "Keith", "guitar" 42 | ) 43 | ``` 44 | 45 | ## babynames 46 | 47 | ```{r} 48 | babynames 49 | ``` 50 | 51 | 52 | ## Your Turn 1 53 | 54 | Alter the 
code to select just the `n` column: 55 | 56 | ```{r} 57 | select(babynames, name, prop) 58 | ``` 59 | 60 | ## Quiz 61 | 62 | Which of these is NOT a way to select the `name` and `n` columns together? 63 | 64 | ```{r} 65 | select(babynames, -c(year, sex, prop)) 66 | select(babynames, name:n) 67 | select(babynames, starts_with("n")) 68 | select(babynames, ends_with("n")) 69 | ``` 70 | 71 | ## Your Turn 2 72 | 73 | Show: 74 | 75 | * All of the names where prop is greater than or equal to 0.08 76 | * All of the children named "Sea" 77 | * All of the names that have a missing value for `n` 78 | 79 | ```{r} 80 | filter(babynames, is.na(n)) 81 | 82 | ``` 83 | 84 | ## Your Turn 3 85 | 86 | Use Boolean operators to alter the code below to return only the rows that contain: 87 | 88 | * Girls named Sea 89 | * Names that were used by exactly 5 or 6 children in 1880 90 | * Names that are one of Acura, Lexus, or Yugo 91 | 92 | ```{r} 93 | filter(babynames, name == "Sea" | name == "Anemone") 94 | ``` 95 | 96 | ## Your Turn 4 97 | 98 | Arrange babynames by `n`. Add `prop` as a second (tie breaking) variable to arrange on. Can you tell what the smallest value of `n` is? 99 | 100 | ```{r} 101 | 102 | ``` 103 | 104 | ## Your Turn 5 105 | 106 | Use `desc()` to find the names with the highest prop. 107 | Then, use `desc()` to find the names with the highest n. 108 | 109 | ```{r} 110 | 111 | ``` 112 | 113 | ## Your Turn 6 114 | 115 | Use `%>%` to write a sequence of functions that: 116 | 117 | 1. Filter babynames to just the girls that were born in 2015 118 | 2. Select the `name` and `n` columns 119 | 3. Arrange the results so that the most popular names are near the top. 120 | 121 | ```{r} 122 | 123 | ``` 124 | 125 | ## Exam 126 | 127 | 1. Trim `babynames` to just the rows that contain your `name` and your `sex` 128 | 2. Trim the result to just the columns that will appear in your graph (not strictly necessary, but useful practice) 129 | 3. 
Plot the results as a line graph with `year` on the x axis and `prop` on the y axis 130 | 131 | ```{r} 132 | 133 | ``` 134 | 135 | ## Your Turn 7 136 | 137 | Use `summarise()` to compute three statistics about the data: 138 | 139 | 1. The first (minimum) year in the dataset 140 | 2. The last (maximum) year in the dataset 141 | 3. The total number of children represented in the data 142 | 143 | ```{r} 144 | 145 | ``` 146 | 147 | ## Your Turn 8 148 | 149 | Extract the rows where `name == "Khaleesi"`. Then use `summarise()` and summary functions to find: 150 | 151 | 1. The total number of children named Khaleesi 152 | 2. The first year Khaleesi appeared in the data 153 | 154 | ```{r} 155 | 156 | ``` 157 | 158 | ## Your Turn 9 159 | 160 | Use `group_by()`, `summarise()`, and `arrange()` to display the ten most popular names. Compute popularity as the total number of children of a single gender given a name. 161 | 162 | ```{r} 163 | 164 | ``` 165 | 166 | ## Your Turn 10 167 | 168 | Use grouping to calculate and then plot the number of children born each year over time. 169 | 170 | ```{r} 171 | 172 | ``` 173 | 174 | ## Your Turn 11 175 | 176 | Use `min_rank()` and `mutate()` to rank each row in `babynames` from largest `n` to lowest `n`. 177 | 178 | ```{r} 179 | 180 | ``` 181 | 182 | ## Your Turn 12 183 | 184 | Compute each name's rank _within its year and sex_. 185 | Then compute the median rank _for each combination of name and sex_, and arrange the results from highest median rank to lowest. 186 | 187 | ```{r} 188 | 189 | ``` 190 | 191 | ## Your Turn 13 192 | 193 | Which airlines had the largest arrival delays? Complete the code below. 194 | 195 | 1. Join `airlines` to `flights` 196 | 2. Compute and order the average arrival delays by airline. Display full names, no codes. 
197 | 198 | ```{r} 199 | flights %>% 200 | drop_na(arr_delay) %>% 201 | %>% 202 | group_by( ) %>% 203 | %>% 204 | arrange( ) 205 | ``` 206 | 207 | ## Your Turn 14 208 | 209 | How many airports in `airports` are serviced by flights originating in New York (i.e. flights in our dataset?) Notice that the column to join on is named `faa` in the **airports** data set and `dest` in the **flights** data set. 210 | 211 | 212 | ```{r} 213 | __________ %>% 214 | _________(_________, by = ___________) %>% 215 | distinct(faa) 216 | ``` 217 | 218 | 219 | 220 | *** 221 | 222 | # Take aways 223 | 224 | * Extract variables with `select()` 225 | * Extract cases with `filter()` 226 | * Arrange cases, with `arrange()` 227 | 228 | * Make tables of summaries with `summarise()` 229 | * Make new variables, with `mutate()` 230 | * Do groupwise operations with `group_by()` 231 | 232 | * Connect operations with `%>%` 233 | 234 | * Use `left_join()`, `right_join()`, `full_join()`, or `inner_join()` to join datasets 235 | * Use `semi_join()` or `anti_join()` to filter datasets against each other 236 | 237 | 238 | -------------------------------------------------------------------------------- /03-Tidy-Data.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Tidy Data" 3 | output: html_notebook 4 | --- 5 | 6 | ```{r setup} 7 | library(tidyverse) 8 | library(babynames) 9 | 10 | # Toy data 11 | cases <- tribble( 12 | ~Country, ~"2011", ~"2012", ~"2013", 13 | "FR", 7000, 6900, 7000, 14 | "DE", 5800, 6000, 6200, 15 | "US", 15000, 14000, 13000 16 | ) 17 | 18 | pollution <- tribble( 19 | ~city, ~size, ~amount, 20 | "New York", "large", 23, 21 | "New York", "small", 14, 22 | "London", "large", 22, 23 | "London", "small", 16, 24 | "Beijing", "large", 121, 25 | "Beijing", "small", 121 26 | ) 27 | 28 | x <- tribble( 29 | ~x1, ~x2, 30 | "A", 1, 31 | "B", NA, 32 | "C", NA, 33 | "D", 3, 34 | "E", NA 35 | ) 36 | 37 | # To avoid a distracting detail during 
class 38 | names(who) <- stringr::str_replace(names(who), "newrel", "new_rel") 39 | ``` 40 | 41 | ## Your Turn 1 42 | 43 | On a sheet of paper, draw how the cases data set would look if it had the same values grouped into three columns: **country**, **year**, **n** 44 | 45 | ## Your Turn 2 46 | 47 | Use `gather()` to reorganize `table4a` into three columns: **country**, **year**, and **cases**. 48 | 49 | ```{r} 50 | 51 | ``` 52 | 53 | ## Your Turn 3 54 | 55 | On a sheet of paper, draw how this data set would look if it had the same values grouped into three columns: **city**, **large**, **small** 56 | 57 | ## Your Turn 4 58 | 59 | Use `spread()` to reorganize `table2` into four columns: **country**, **year**, **cases**, and **population**. 60 | 61 | ```{r} 62 | 63 | ``` 64 | 65 | ## Your Turn 5 66 | 67 | Gather the 5th through 60th columns of `who` into a key column: value column pair named **codes** and **n**. Then select just the `country`, `year`, `codes` and `n` variables. 68 | 69 | ```{r} 70 | 71 | ``` 72 | 73 | ## Your Turn 6 74 | 75 | Separate the `sexage` column into **sex** and **age** columns. 76 | 77 | ```{r} 78 | 79 | ``` 80 | 81 | ## Your Turn 7 82 | 83 | Reshape the layout of this data. Calculate the percent of male (or female) children by year. Then plot the percent over time. 84 | 85 | ```{r} 86 | babynames %>% 87 | group_by(year, sex) %>% 88 | summarise(n = sum(n)) 89 | ``` 90 | 91 | *** 92 | 93 | # Take Aways 94 | 95 | Data comes in many formats but R prefers just one: _tidy data_. 96 | 97 | A data set is tidy if and only if: 98 | 99 | 1. Every variable is in its own column 100 | 2. Every observation is in its own row 101 | 3. Every value is in its own cell (which follows from the above) 102 | 103 | What counts as a variable or an observation may depend on your immediate goal. 
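As a small illustration of the three rules above, here is a sketch that tidies the toy `cases` table from the setup chunk. It is repeated here so the chunk runs on its own; the object name `tidy_cases` is made up for this example. The call uses `gather()`, the same function the exercises use, and selects every column except `Country`.

```{r}
library(tibble)
library(tidyr)

# Same toy data as the `cases` tribble in the setup chunk
cases <- tribble(
  ~Country, ~"2011", ~"2012", ~"2013",
  "FR", 7000, 6900, 7000,
  "DE", 5800, 6000, 6200,
  "US", 15000, 14000, 13000
)

# Stack the three year columns into a key column (year) and a value column (n).
# `-Country` means "gather every column except Country".
tidy_cases <- gather(cases, key = "year", value = "n", -Country)
tidy_cases
```

After the call, each row holds one country-year observation, and `Country`, `year`, and `n` each sit in their own column.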
104 | 105 | -------------------------------------------------------------------------------- /04-Import-Data.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Import Data (Bonus)" 3 | output: html_notebook 4 | --- 5 | 6 | ```{r setup} 7 | library(tidyverse) 8 | ``` 9 | 10 | This module will teach you an efficient way to import data stored in a flat file format. Flat file formats are one of the most common formats for saving data because flat files can be read by a large variety of data-related software. Flat file formats include Comma Separated Values (csv's), tab delimited files, fixed width files and more. 11 | 12 | ## readr 13 | 14 | The **readr** R package contains simple, consistent functions for importing data saved as flat file documents. readr functions offer an alternative to base R functions that read in flat files. Compared to base R's `read.table()` and its derivatives, readr functions have several advantages. readr functions: 15 | 16 | * are ~10 times faster 17 | * return user-friendly tibbles 18 | * have more intuitive defaults. No rownames, no `stringsAsFactors = TRUE`. 19 | 20 | readr supplies several related functions, each designed to read in a specific flat file format. 21 | 22 | Function | Reads 23 | -------------- | -------------------------- 24 | `read_csv()` | Comma separated values 25 | `read_csv2()` | Semicolon separated values 26 | `read_delim()` | General delimited files 27 | `read_fwf()` | Fixed width files 28 | `read_log()` | Apache log files 29 | `read_table()` | Space separated files 30 | `read_tsv()` | Tab delimited values 31 | 32 | Here, we will focus on the `read_csv()` function, but the other functions work in a similar way. In most cases, you can use the syntax and arguments of `read_csv()` when using the other functions listed above. 33 | 34 | readr is a core member of the tidyverse. It is loaded every time you call `library(tidyverse)`. 
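To see how closely the read functions mirror each other, here is a minimal sketch. It assumes two tiny made-up files, written to a temporary location purely for illustration; the file contents and the object names `csv_file`, `tsv_file`, `from_csv`, and `from_tsv` are not part of the course materials.

```{r}
library(readr)

# Hypothetical two-row files, created only for this example
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("city,amount", "London,22", "Beijing,121"), csv_file)
writeLines(c("city\tamount", "London\t22", "Beijing\t121"), tsv_file)

# The calls differ only in the function name, which encodes the delimiter
from_csv <- read_csv(csv_file)
from_tsv <- read_tsv(tsv_file)

from_csv
from_tsv
```

Both calls return a tibble with a character `city` column and a numeric `amount` column.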
35 | 36 | ## Sample data 37 | 38 | A sample data set to import is saved alongside this notebook. The data set, saved as `nimbus.csv`, contains atmospheric ozone measurements of the southern hemisphere collected by NASA's NIMBUS-7 satellite in October 1985. The data set is of historical interest because it displays evidence of the hole in the ozone layer collected shortly after the hole was first reported. 39 | 40 | ## Importing Data 41 | 42 | To import `nimbus.csv`, use the readr function that reads `.csv` files, `read_csv()`. Set the first argument of `read_csv()` to a character string: the filepath from your _working directory_ to the `nimbus.csv` file. 43 | 44 | So, if `nimbus.csv` is saved in your working directory, you can read it in with the command: 45 | 46 | ```{r eval = FALSE} 47 | nimbus <- read_csv("nimbus.csv") 48 | ``` 49 | 50 | Note: you can determine the location of your working directory by running `getwd()`. You can change the location of your working directory by going to **Session > Set Working Directory** in the RStudio IDE menu bar. 51 | 52 | Notice that the code above saves the output to an object named nimbus. You must save the output of `read_csv()` to an object if you wish to use it later. If you do not save the output, `read_csv()` will merely print the contents of the data set at the command line. 53 | 54 | ## Your Turn 1 55 | 56 | Find nimbus.csv on your server or computer. Then read it into an object. Then view the results. 57 | 58 | ```{r} 59 | 60 | ``` 61 | 62 | ## Tibbles 63 | 64 | `read_csv()` reads the data into a **tibble**, which is a special class of data frame. Since a tibble is a sub-class of data frame, R will in most cases treat tibbles in exactly the same way that R treats data frames. 65 | 66 | There is, however, one notable exception. When R prints a data frame in the console window, R attempts to display the entire contents of the data frame. 
This has a negative effect: unless the data frame is very small you are left viewing the end of the data frame (or if the data frame is very large, the middle of the data frame since R stops displaying new rows after a maximum number is reached). 67 | 68 | R will display tibbles in a much more sensible way. If the **tibble** package is loaded, R will display the first ten rows of the tibble and as many columns as will fit in your console window. This display ensures that the name of each column is visible in the display. 69 | 70 | (Note that R Notebooks automatically display both tibbles and data frames as interactive tables, which hides this difference.) 71 | 72 | tibble is a core member of the tidyverse. It is loaded every time you call `library(tidyverse)`. The tibble package includes the functions `tibble()` and `tribble()` for making tibbles from scratch, as well as `as_tibble()` and `as.data.frame()` for converting back and forth between tibbles and data frames. 73 | 74 | In almost every case, you can ignore whether or not you are working with tibbles or data frames. 75 | 76 | ## Parsing NA's 77 | 78 | If you examine `nimbus` closely, you will notice that the initial values in the `ozone` column are `.`. Can you guess what `.` stands for? The compilers of the nimbus data set used `.` to denote a missing value. In other words, they used `.` in the same way that R uses the `NA` value. 79 | 80 | If you'd like R to treat these `.` values as missing values (and you should) you will need to convert them to `NA`s. One way to do this is to ask `read_csv()` to parse `.` values as `NA` values when it reads in the data. To do this add the argument `na = "."` to `read_csv()`: 81 | 82 | ```{r eval = FALSE} 83 | nimbus <- read_csv("nimbus.csv", na = ".") 84 | ``` 85 | 86 | You can set `na` to a single character string or a vector of character strings. `read_csv()` will transform every value listed in the `na` argument to an `NA` when it reads in the data. 
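Here is a short sketch of the vector form of `na`. It assumes a small made-up file that marks missing values two different ways (`.` and `-999`); the file, its contents, and the object name `readings` are hypothetical and only serve the example.

```{r}
library(readr)

# Hypothetical file that uses both "." and "-999" for missing values
f <- tempfile(fileext = ".csv")
writeLines(c("site,ozone", "a,.", "b,-999", "c,310"), f)

# Every string listed in `na` is read in as NA
readings <- read_csv(f, na = c(".", "-999"))
readings
```

Because both placeholder strings are converted to `NA`, the one remaining value lets `read_csv()` parse `ozone` as a number in this toy file.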
87 | 88 | ## Parsing data types 89 | 90 | If you run the code above and examine the results, you may now notice a new concern about the `ozone` column. The column has been parsed as character strings instead of numbers. 91 | 92 | When you use `read_csv()`, `read_csv()` tries to match each column of input to one of the basic data types in R. `read_csv()` generally does a good job, but here the initial presence of the character strings `.` caused `read_csv()` to misidentify the contents of the `ozone` column. You can now correct this with R's `as.numeric()` function, or you can read the data in again, this time instructing `read_csv()` to parse the column as numbers. 93 | 94 | To do this, add the argument `col_types` to `read_csv()` and set it equal to a list. Add a named element to the list for each column you would like to manually parse. The name of the element should match the name of the column you wish to parse. So for example, if we wish to parse the `ozone` column into a specific data type, we would begin by inserting the argument: 95 | 96 | ```{r eval = FALSE} 97 | nimbus <- read_csv("nimbus.csv", na = ".", 98 | col_types = list(ozone = #something) 99 | ``` 100 | 101 | To complete the code, set `ozone` equal to one of the functions below. Each function instructs `read_csv()` to parse `ozone` as a specific type of data. 
102 | 103 | Type function | Data Type 104 | ----------------- | ----------------------------------------- 105 | `col_character()` | character 106 | `col_date()` | Date 107 | `col_datetime()` | POSIXct (date-time) 108 | `col_double()` | double (numeric) 109 | `col_factor()` | factor 110 | `col_guess()` | let readr guess (default) 111 | `col_integer()` | integer 112 | `col_logical()` | logical 113 | `col_number()` | numbers mixed with non-number characters 114 | `col_numeric()` | double or integer 115 | `col_skip()` | do not read this column 116 | `col_time()` | time 117 | 118 | In our case, we would use the `col_double()` function to ensure that `ozone` is read as a double (that is, numeric) column. 119 | 120 | ## The hole in the ozone layer 121 | 122 | Now that we have our data, we can use it to plot a picture of the hole in the ozone layer. Note that the "hole" in the ozone layer appears as the dark regions around the south pole. The actual "hole" in the data is a smaller area centered on the south pole, where the satellite did not take measurements. 123 | 124 | ```{r} 125 | nimbus <- read_csv("nimbus.csv", na = ".", 126 | col_types = list(ozone = col_double())) 127 | 128 | library(viridis) 129 | world <- map_data(map = "world") 130 | nimbus %>% 131 | ggplot() + 132 | geom_point(aes(longitude, latitude, color = ozone)) + 133 | geom_path(aes(long, lat, group = group), data = world) + 134 | coord_map("ortho", orientation=c(-90, 0, 0)) + 135 | scale_color_viridis(option = "A") 136 | ``` 137 | 138 | ## Writing data 139 | 140 | readr also contains functions for saving data. These functions parallel the read functions and each save a data frame or tibble in a specific file format. 
141 | 142 | Function | Writes 143 | ------------------- | ---------------------------------------- 144 | `write_csv()` | Comma separated values 145 | `write_excel_csv()` | CSV that you plan to open in Excel 146 | `write_delim()` | General delimited files 147 | `write_file()` | A single string, written as is 148 | `write_lines()` | A vector of strings, one string per line 149 | `write_tsv()` | Tab delimited values 150 | 151 | To use a write function, first give it the name of the data frame to save, then give it a filepath from your working directory to the location where you would like to save the file. This filepath should end in the name of the new file. So we can save the clean `nimbus` data set as a csv in our working directory with: 152 | 153 | ```{r eval = FALSE} 154 | write_csv(nimbus, "nimbus-clean.csv") 155 | ``` 156 | 157 | *** 158 | 159 | # Take Aways 160 | 161 | The readr package provides efficient functions for reading and saving common flat file data formats. 162 | 163 | Consider these packages for other types of data: 164 | 165 | Package | Reads 166 | -------- | ----- 167 | haven | SPSS, Stata, and SAS files 168 | readxl | excel files (.xls, .xlsx) 169 | jsonlite | json 170 | xml2 | xml 171 | httr | web API's 172 | rvest | web pages (web scraping) 173 | DBI | databases 174 | sparklyr | data loaded into spark 175 | -------------------------------------------------------------------------------- /05-Data-Types.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Types" 3 | output: html_notebook 4 | --- 5 | 6 | ```{r setup} 7 | library(tidyverse) 8 | library(babynames) 9 | library(nycflights13) 10 | library(stringr) 11 | library(forcats) 12 | library(lubridate) 13 | library(hms) 14 | ``` 15 | 16 | ## Your Turn 1 17 | 18 | Use `flights` to create `delayed`, a variable that indicates whether a flight was delayed (`arr_delay > 0`). 
19 | 20 | Then, remove all rows that contain an NA in `delayed`. 21 | 22 | Finally, create a summary table that shows: 23 | 24 | 1. How many flights were delayed 25 | 2. What proportion of flights were delayed 26 | 27 | ```{r} 28 | 29 | ``` 30 | 31 | 32 | ## Your Turn 2 33 | 34 | In your group, fill in the blanks to: 35 | 36 | 1. Isolate the last letter of every name and create a logical variable that displays whether the last letter is one of "a", "e", "i", "o", "u", or "y". 37 | 2. Use a weighted mean to calculate the proportion of children whose name ends in a vowel (by `year` and `sex`) 38 | 3. and then display the results as a line plot. 39 | 40 | ```{r} 41 | babynames %>% 42 | _______(last = _________, 43 | vowel = __________) %>% 44 | group_by(__________) %>% 45 | _________(p_vowel = weighted.mean(vowel, n)) %>% 46 | _________ + 47 | __________ 48 | ``` 49 | 50 | ## Your Turn 3 51 | 52 | Repeat the previous exercise, some of whose code is below, to make a sensible graph of average TV consumption by marital status. 53 | 54 | ```{r} 55 | gss_cat %>% 56 | drop_na(________) %>% 57 | group_by(________) %>% 58 | summarise(_________________) %>% 59 | ggplot() + 60 | geom_point(mapping = aes(x = _______, y = _________________________)) 61 | ``` 62 | 63 | ## Your Turn 4 64 | 65 | Do you think liberals or conservatives watch more TV? 66 | Compute average TV hours by party ID and then plot the results. 67 | 68 | ```{r} 69 | 70 | ``` 71 | 72 | ## Your Turn 5 73 | 74 | What is the best time of day to fly? 75 | 76 | Use the `hour` and `minute` variables in `flights` to compute the time of day for each flight as an hms. Then use a smooth line to plot the relationship between time of day and `arr_delay`. 77 | 78 | ```{r} 79 | 80 | ``` 81 | 82 | ## Your Turn 6 83 | 84 | Fill in the blanks to: 85 | 86 | Extract the day of the week of each flight (as a full name) from `time_hour`. 87 | 88 | Calculate the average `arr_delay` by day of the week. 
89 | 90 | Plot the results as a column chart (bar chart) with `geom_col()`. 91 | 92 | ```{r} 93 | flights %>% 94 | mutate(weekday = _______________________________) %>% 95 | __________________ %>% 96 | drop_na(arr_delay) %>% 97 | summarise(avg_delay = _______________) %>% 98 | ggplot() + 99 | ___________(mapping = aes(x = weekday, y = avg_delay)) 100 | ``` 101 | 102 | *** 103 | 104 | # Take Aways 105 | 106 | Dplyr gives you three _general_ functions for manipulating data: `mutate()`, `summarise()`, and `group_by()`. Augment these with functions from the packages below, which focus on specific types of data. 107 | 108 | Package | Data Type 109 | --------- | -------- 110 | stringr | strings 111 | forcats | factors 112 | hms | times 113 | lubridate | dates and times 114 | 115 | -------------------------------------------------------------------------------- /06-Iteration.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Iteration" 3 | output: html_notebook 4 | --- 5 | 6 | ```{r setup} 7 | library(tidyverse) 8 | 9 | # Toy data 10 | set.seed(1000) 11 | exams <- list( 12 | student1 = round(runif(10, 50, 100)), 13 | student2 = round(runif(10, 50, 100)), 14 | student3 = round(runif(10, 50, 100)), 15 | student4 = round(runif(10, 50, 100)), 16 | student5 = round(runif(10, 50, 100)) 17 | ) 18 | 19 | extra_credit <- list(0, 0, 10, 10, 15) 20 | ``` 21 | 22 | ## Your Turn 1 23 | 24 | Here is a list: 25 | 26 | ```{r} 27 | a_list <- list(num = c(8, 9), 28 | log = TRUE, 29 | cha = c("a", "b", "c")) 30 | ``` 31 | 32 | Here are two subsetting commands. Do they return the same values? Run the code chunk above, _and then_ run the code chunks below to confirm 33 | 34 | ```{r} 35 | a_list["num"] 36 | ``` 37 | 38 | ```{r} 39 | a_list$num 40 | ``` 41 | 42 | ## Your Turn 2 43 | 44 | What will each of these return? Run the code chunks to confirm. 
45 | 46 | ```{r} 47 | vec <- c(-2, -1, 0, 1, 2) 48 | abs(vec) 49 | ``` 50 | 51 | ```{r} 52 | lst <- list(-2, -1, 0, 1, 2) 53 | abs(lst) 54 | ``` 55 | 56 | ## Your Turn 3 57 | 58 | Run the code in the chunks. What does it return? 59 | 60 | ```{r} 61 | list(student1 = mean(exams$student1), 62 | student2 = mean(exams$student2), 63 | student3 = mean(exams$student3), 64 | student4 = mean(exams$student4), 65 | student5 = mean(exams$student5)) 66 | ``` 67 | 68 | ```{r} 69 | map(exams, mean) 70 | ``` 71 | 72 | ## Your Turn 4 73 | 74 | Calculate the variance (`var()`) of each student’s exam grades. 75 | 76 | ```{r} 77 | 78 | ``` 79 | 80 | ## Your Turn 5 81 | 82 | Calculate the max grade (`max()`) for each student. Return the result as a vector. 83 | 84 | ```{r} 85 | 86 | ``` 87 | 88 | ## Your Turn 6 89 | 90 | Write a function that counts the best exam twice and then takes the average. Use it to grade all of the students. 91 | 92 | 1. Write code that solves the problem for a real object 93 | 2. Wrap the code in `function(){}` to save it 94 | 3. Add the name of the real object as the function argument 95 | 96 | ```{r} 97 | 98 | ``` 99 | 100 | ### Your Turn 7 101 | 102 | Compute a final grade for each student, where the final grade is the average test score plus any `extra_credit` assigned to the student. Return the results as a double (i.e. numeric) vector. 103 | 104 | ```{r} 105 | 106 | ``` 107 | 108 | 109 | *** 110 | 111 | # Take Aways 112 | 113 | Lists are a useful way to organize data, but you need to arrange manually for functions to iterate over the elements of a list. 114 | 115 | You can do this with the `map()` family of functions in the purrr package. 116 | 117 | To write a function, 118 | 119 | 1. Write code that solves the problem for a real object 120 | 2. Wrap the code in `function(){}` to save it 121 | 3. 
Add the name of the real object as the function argument 122 | 123 | This sequence will help prevent bugs in your code (and reduce the time you spend correcting bugs). -------------------------------------------------------------------------------- /07-Models.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Models" 3 | output: html_notebook 4 | --- 5 | 6 | ```{r setup, message=FALSE} 7 | library(tidyverse) 8 | library(modelr) 9 | library(broom) 10 | 11 | wages <- heights %>% filter(income > 0) 12 | ``` 13 | 14 | ## Your Turn 1 15 | 16 | Fit the model on the slide and then examine the output. What does it look like? 17 | 18 | ```{r} 19 | 20 | ``` 21 | 22 | ## Your Turn 2 23 | 24 | Use a pipe to model `log(income)` against `height`. Then use broom and dplyr functions to extract: 25 | 26 | 1. The **coefficient estimates** and their related statistics 27 | 2. The **adj.r.squared** and **p.value** for the overall model 28 | 29 | ```{r} 30 | 31 | ``` 32 | 33 | ## Your Turn 3 34 | 35 | Model `log(income)` against `education` _and_ `height`. Do the coefficients change? 36 | 37 | ```{r} 38 | 39 | ``` 40 | 41 | ## Your Turn 4 42 | 43 | Model `log(income)` against `education` and `height` and `sex`. Can you interpret the coefficients? 44 | 45 | ```{r} 46 | 47 | ``` 48 | 49 | ## Your Turn 5 50 | 51 | Use a broom function and ggplot2 to make a line graph of `height` vs `.fitted` for our heights model, `mod_h`. 52 | 53 | _Bonus: Overlay the plot on the original data points._ 54 | 55 | ```{r} 56 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 57 | 58 | mod_h %>% 59 | ________(data = wages) %>% 60 | ________ + 61 | ________ 62 | ``` 63 | 64 | ## Your Turn 6 65 | 66 | Repeat the process to make a line graph of `height` vs `.fitted` colored by `sex` for model mod_ehs. Are the results interpretable? Add `+ facet_wrap(~education)` to the end of your code. What happens? 
67 | 68 | ```{r} 69 | mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) 70 | 71 | ______ %>% 72 | augment(data = wages) %>% 73 | ggplot(mapping = aes(x = height, y = .fitted, __________)) + 74 | geom_line() 75 | ``` 76 | 77 | ## Your Turn 7 78 | 79 | Use one of `spread_predictions()` or `gather_predictions()` to make a line graph of `height` vs `pred` colored by `model` for each of mod_h, mod_eh, and mod_ehs. Are the results interpretable? 80 | 81 | Add `+ facet_grid(sex ~ education)` to the end of your code. What happens? 82 | 83 | ```{r warning = FALSE, message = FALSE} 84 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 85 | mod_eh <- wages %>% lm(log(income) ~ education + height, data = .) 86 | mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) 87 | 88 | _____ %>% 89 | _____________________ %>% 90 | filter(education >= 12) %>% 91 | ggplot(mapping = aes(x = height, y = _____, color = _____)) + 92 | geom_line() 93 | ``` 94 | 95 | *** 96 | 97 | # Take Aways 98 | 99 | * Use `glance()`, `tidy()`, and `augment()` from the **broom** package to return model values in a data frame. 100 | 101 | * Use `add_predictions()` or `gather_predictions()` or `spread_predictions()` from the **modelr** package to visualize predictions. 102 | 103 | * Use `add_residuals()` or `gather_residuals()` or `spread_residuals()` from the **modelr** package to visualize residuals. 104 | 105 | 106 | -------------------------------------------------------------------------------- /08-List-Columns.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "List-Columns" 3 | output: html_notebook 4 | --- 5 | 6 | ```{r setup, message=FALSE} 7 | library(tidyverse) 8 | library(broom) 9 | library(gapminder) 10 | library(stringr) 11 | 12 | knitr::opts_chunk$set(warning = FALSE, message = FALSE) 13 | ``` 14 | 15 | ## Your Turn 1 16 | 17 | How has life expectancy changed since 1952? 
18 | 19 | Using `gapminder`, make a line plot of **lifeExp** vs. **year** grouped by **country**. Set alpha to 0.2 to see the results better. 20 | 21 | ```{r} 22 | 23 | ``` 24 | 25 | ## Your Turn 2 26 | 27 | 1. Group `gapminder` by `country` and `continent` then fit a model and collect the residuals for *each country.* 28 | 29 | 2. Plot the residuals vs `year` as a line graph, grouped by `country`, with `alpha = 0.2`. 30 | 31 | 3. Add the following to your plot to facet by `continent`: 32 | 33 | + facet_wrap(~continent) 34 | 35 | ```{r warning = FALSE} 36 | 37 | ``` 38 | 39 | ## Your Turn 3 40 | 41 | Complete the code to filter `residuals`, which is the data set that you made in Your Turn 2, against `bad_fits`, which is the data set that contains just the countries that have an r-squared < 0.25. 42 | 43 | Use the result to plot a line graph of `year` vs. `.resid` colored by `country` for each country that had an r-squared < 0.25. 44 | 45 | ```{r warning = FALSE} 46 | bad_fits <- gapminder %>% 47 | group_by(_______) %>% 48 | do(lm(lifeExp ~ year, data = ____) %>% glance()) %>% 49 | filter(___________) 50 | 51 | residuals <- gapminder %>% 52 | group_by(_______) %>% 53 | do(________________________________________) 54 | 55 | residuals %>% 56 | ___________(bad_fits) %>% 57 | ggplot() + 58 | _________________________________________ 59 | ``` 60 | 61 | ## Your Turn 4 62 | 63 | Create your own copy of `master` and then add one more list column: 64 | 65 | * **output** which contains the output of `augment()` for each model 66 | 67 | ```{r warning = FALSE} 68 | fit_model <- function(df) lm(lifeExp ~ year, data = df) 69 | get_rsq <- function(mod) glance(mod)$r.squared 70 | _______________________________________________ # write a function that applies augment to a model 71 | 72 | master <- gapminder %>% 73 | ___________ %>% # group gapminder 74 | ___________ %>% # nest gapminder 75 | mutate(model = ________________________, # add a model column 76 | r.squared = 
_____________________, # add an r.squared column 77 | output = ________________________) # add an output column 78 | master 79 | ``` 80 | 81 | ## Your Turn 5 82 | 83 | Use master to recreate our plot of the residuals vs `year` for the six countries with an r squared less than 0.25. 84 | 85 | ```{r} 86 | 87 | ``` 88 | 89 | *** 90 | 91 | # Take Aways 92 | 93 | * A two way table is an organizational device that you can manipulate. 94 | * The table structure will maintain correspondence between the table contents during your manipulations. 95 | * Data frames can store more than values. They can store list columns, which can contain _any_ type of R object. 96 | * You can manipulate list columns with `mutate()` and the `map()` family functions. 97 | * Create list columns with `nest()` 98 | * Unnest list columns with `unnest()` to create plots 99 | * Dplyr's `do()` function can be a useful tool when combined with broom functions and `group_by()` -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | The materials in this repository are licensed under Creative Commons Attribution Share Alike 4.0 (CC BY SA) 2 | 3 | https://creativecommons.org/licenses/by-sa/4.0/legalcode 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Thank you for enrolling in Master the Tidyverse. 2 | 3 | Please bring a laptop to class that has the following installed: 4 | 5 | 1. A recent version of R (~3.4.1), which is available for free at 6 | 2. A recent version of RStudio IDE (~1.0.153), available for free at 7 | 3. 
The R packages we will use, which you can install by connecting to the internet, opening R, and running: 8 | 9 | install.packages(c("babynames", "formatR", "gapminder", "hexbin", "mgcv", "maps", "mapproj","nycflights13", "tidyverse", "viridis")) 10 | 11 | 4. The class materials, which can be downloaded at 12 | 13 | 14 | And don't forget your power cord! -------------------------------------------------------------------------------- /pdfs/00-Introduction.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstudio/master-the-tidyverse/2a7faa8add2a9fb335db47965d18888d2d9e718f/pdfs/00-Introduction.pdf -------------------------------------------------------------------------------- /pdfs/00-Reintroduction.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstudio/master-the-tidyverse/2a7faa8add2a9fb335db47965d18888d2d9e718f/pdfs/00-Reintroduction.pdf -------------------------------------------------------------------------------- /pdfs/000-A-Preclass-loop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstudio/master-the-tidyverse/2a7faa8add2a9fb335db47965d18888d2d9e718f/pdfs/000-A-Preclass-loop.pdf -------------------------------------------------------------------------------- /pdfs/01-Visualize-Data.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstudio/master-the-tidyverse/2a7faa8add2a9fb335db47965d18888d2d9e718f/pdfs/01-Visualize-Data.pdf -------------------------------------------------------------------------------- /pdfs/02-Transform-Data.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstudio/master-the-tidyverse/2a7faa8add2a9fb335db47965d18888d2d9e718f/pdfs/02-Transform-Data.pdf 
-------------------------------------------------------------------------------- /pdfs/03-Tidy-Data.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstudio/master-the-tidyverse/2a7faa8add2a9fb335db47965d18888d2d9e718f/pdfs/03-Tidy-Data.pdf -------------------------------------------------------------------------------- /pdfs/04-Import-Data.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstudio/master-the-tidyverse/2a7faa8add2a9fb335db47965d18888d2d9e718f/pdfs/04-Import-Data.pdf -------------------------------------------------------------------------------- /pdfs/05-Data-Types.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstudio/master-the-tidyverse/2a7faa8add2a9fb335db47965d18888d2d9e718f/pdfs/05-Data-Types.pdf -------------------------------------------------------------------------------- /pdfs/06-Iteration.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstudio/master-the-tidyverse/2a7faa8add2a9fb335db47965d18888d2d9e718f/pdfs/06-Iteration.pdf -------------------------------------------------------------------------------- /pdfs/07-Models.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstudio/master-the-tidyverse/2a7faa8add2a9fb335db47965d18888d2d9e718f/pdfs/07-Models.pdf -------------------------------------------------------------------------------- /pdfs/08-List-Columns.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rstudio/master-the-tidyverse/2a7faa8add2a9fb335db47965d18888d2d9e718f/pdfs/08-List-Columns.pdf --------------------------------------------------------------------------------