├── .gitignore ├── README.md ├── .DS_Store ├── ch14.Rmd ├── Ch6.Rmd ├── R4DS-Solutions.Rproj ├── Ch4.Rmd ├── Ch2.Rmd ├── Ch7.Rmd ├── ch21.Rmd ├── Ch12.Rmd ├── Ch8.Rmd ├── ch16.Rmd ├── ch20.Rmd ├── ch13.Rmd ├── ch18.Rmd ├── ch19.Rmd ├── ch15.Rmd ├── ch22.Rmd ├── Ch10.Rmd ├── ch17.Rmd ├── Ch9.Rmd ├── ch11.Rmd ├── Ch5.Rmd ├── Ch1.Rmd └── Ch3.Rmd /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # R4DS-Solutions 2 | Solutions and notes on R 4 Data Science 3 | -------------------------------------------------------------------------------- /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cimentadaj/R4DS-Solutions/HEAD/.DS_Store -------------------------------------------------------------------------------- /ch14.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch14" 3 | output: 4 | html_document: 5 | df_print: paged 6 | --- 7 | 8 | No exercises -------------------------------------------------------------------------------- /Ch6.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch6" 3 | output: 4 | html_document: 5 | df_print: paged 6 | --- 7 | 8 | No exercises here -------------------------------------------------------------------------------- /R4DS-Solutions.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: knitr 13 | LaTeX: XeLaTeX 14 | -------------------------------------------------------------------------------- /Ch4.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch4" 3 | output: html_notebook 4 | --- 5 | 6 | # 6.3 Practice 7 | 8 | Go to the RStudio Tips twitter account, https://twitter.com/rstudiotips and find one tip that looks interesting. Practice using it! 9 | 10 | What other common mistakes will RStudio diagnostics report? Read https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics to find out. -------------------------------------------------------------------------------- /Ch2.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch2" 3 | output: html_notebook 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | library(tidyverse) 9 | ``` 10 | 11 | ## 4.4 Practice 12 | 13 | Why does this code not work? 14 | ```{r} 15 | 16 | my_variable <- 10 17 | my_varıable 18 | #> Error in eval(expr, envir, enclos): object 'my_varıable' not found 19 | ``` 20 | Look carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.) 21 | 22 | the i in the second my_variable is not an i. 
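A quick way to prove it to yourself (purely illustrative):

```{r}
# The "ı" in my_varıable is a dotless Turkish ı (U+0131), not an ASCII "i"
utf8ToInt("ı") # 305
utf8ToInt("i") # 105
```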
23 | 24 | Tweak each of the following R commands so that they run correctly: 25 | 26 | ```{r} 27 | library(tidyverse) 28 | 29 | # ggplot(dota = mpg) + 30 | # geom_point(mapping = aes(x = displ, y = hwy)) 31 | # 32 | # fliter(mpg, cyl = 8) 33 | # filter(diamond, carat > 3) 34 | 35 | ###### 36 | 37 | ggplot(data = mpg) + 38 | geom_point(mapping = aes(x = displ, y = hwy)) 39 | 40 | filter(mpg, cyl == 8) 41 | filter(diamonds, carat > 3) 42 | ``` 43 | 44 | Press Alt + Shift + K. What happens? How can you get to the same place using the menus? 45 | 46 | - Go to Tools -> Keyboard Shortcuts Help -------------------------------------------------------------------------------- /Ch7.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch7" 3 | output: 4 | html_document: 5 | df_print: paged 6 | --- 7 | 8 | ## 10.5 Exercises 9 | 10 | How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame) 11 | 12 | Visually you can tell by the printing method and how it shows the class of each variable underneath each variable name. Or just get the `class()` of the object and see whether it has `tbl_df` and `tbl`. 13 | 14 | Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration? 15 | 16 | ```{r} 17 | df <- data.frame(abc = 1, xyz = "a") 18 | df$x # on a tibble this will throw a warning. Partial matching is not allowed 19 | df[, "xyz"] # on a tibble this will still be a data frame 20 | df[, c("abc", "xyz")] # This will be the same result in a tibble 21 | ``` 22 | 23 | The frustration is that in some situations a data frame will return a different thing, as in the last two lines of code. A tibble will return the same thing, providing consistency. 24 | 25 | If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble? 26 | 27 | ```{r} 28 | var <- "mpg" 29 | as_tibble(mtcars)[[var]] 30 | 31 | # or 32 | 33 | as_tibble(mtcars)[var] 34 | ``` 35 | 36 | 37 | Practice referring to non-syntactic names in the following data frame by: 38 | 39 | ```{r} 40 | annoying <- tibble( 41 | `1` = 1:10, 42 | `2` = `1` * 2 + rnorm(length(`1`)) 43 | ) 44 | ``` 45 | 46 | Extracting the variable called 1. 47 | 48 | ```{r} 49 | annoying$`1` 50 | ``` 51 | 52 | 53 | Plotting a scatterplot of 1 vs 2. 54 | 55 | ```{r} 56 | ggplot(annoying, aes(`1`, `2`)) + 57 | geom_point() 58 | ``` 59 | 60 | Creating a new column called 3 which is 2 divided by 1. 61 | 62 | ```{r} 63 | annoying <- 64 | annoying %>% 65 | mutate(`3` = `2` / `1`) 66 | ``` 67 | 68 | 69 | Renaming the columns to one, two and three. 70 | 71 | ```{r} 72 | annoying %>% 73 | rename(one = `1`, 74 | two = `2`, 75 | three = `3`) 76 | ``` 77 | 78 | What does `tibble::enframe()` do? When might you use it? 79 | 80 | It turns named vectors or lists into two-column data frames. 81 | 82 | It's different from `as_tibble()` for lists because it creates a stacked (long) data frame rather than a wide one. It all depends on your data. 83 | ```{r} 84 | lst <- list(female = 1, male = 2) 85 | 86 | as_tibble(lst) 87 | ``` 88 | rather than 89 | 90 | ```{r} 91 | enframe(lst) %>% unnest() 92 | ``` 93 | 94 | 95 | What option controls how many additional column names are printed at the footer of a tibble? 96 | 97 | `options(tibble.max_extra_cols)` controls how many additional column names are printed in the footer; alternatively, `options(tibble.width = Inf)` prints all columns so that no footer is needed.
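A quick sketch of the two options (the cut-off of 5 is just for illustration):

```{r}
# Print only 5 extra column names in the footer
options(tibble.max_extra_cols = 5)
as_tibble(mtcars)

# Or print every column, so no footer is needed
options(tibble.width = Inf)
as_tibble(mtcars)
```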
-------------------------------------------------------------------------------- /ch21.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch21" 3 | output: 4 | pdf_document: default 5 | word_document: default 6 | html_document: 7 | df_print: paged 8 | --- 9 | 10 | ## Exercises 27.2.1 11 | Create a new notebook using File > New File > R Notebook. Read the instructions. Practice running the chunks. Verify that you can modify the code, re-run it, and see modified output. 12 | 13 | -- 14 | 15 | Create a new R Markdown document with File > New File > R Markdown… Knit it by clicking the appropriate button. Knit it by using the appropriate keyboard short cut. Verify that you can modify the input and see the output update. 16 | 17 | -- 18 | 19 | Compare and contrast the R notebook and R markdown files you created above. How are the outputs similar? How are they different? How are the inputs similar? How are they different? What happens if you copy the YAML header from one to the other? 20 | 21 | The `YAML` is the same but the `output` is set to to `html_notebook`. To switch between the two docs you only have to change the format in `output`. As for the document, `notebooks` offer to show the code as drop down chunks, a feature which Rmarkdowns don't have. 22 | 23 | Create one new R Markdown document for each of the three built-in formats: HTML, PDF and Word. Knit each of the three documents. How does the output differ? How does the input differ? (You may need to install LaTeX in order to build the PDF output — RStudio will prompt you if this is necessary.) 24 | 25 | -- 26 | 27 | ## Exercises 27.3.1 28 | 29 | Practice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold. 30 | 31 | ``` 32 | --- 33 | title: "My CV" 34 | output: html_document 35 | --- 36 | 37 | ## Jorge Cimentada 38 | ## cimentadaj@gmail.com 39 | 40 | # Education 41 | - PhD in **Quantitative Sociology** 42 | - Master's in Demography and Sociology 43 | - B.A in Business 44 | 45 | 46 | # Employment 47 | - Data Scient at RECSM 48 | - Data Analyst at UPF 49 | - Junior Analyst at Newlink 50 | 51 | Skills: R, Python, Git, SQL, LaTeX and Stats 52 | 53 | ``` 54 | 55 | 56 | Using the R Markdown quick reference, figure out how to: 57 | 58 | Add a footnote. 59 | 60 | `Here's a footnote [^1]` 61 | 62 | And add footnote below: 63 | 64 | `[^1]: Here's a footnote` 65 | 66 | Add a horizontal rule. 67 | 68 | Add `--------` 69 | 70 | Add a block quote. 71 | 72 | Something like `> This is a quote` 73 | 74 | Copy and paste the contents of diamond-sizes.Rmd from https://github.com/hadley/r4ds/tree/master/rmarkdown in to a local R markdown document. Check that you can run it, then add text after the frequency polygon that describes its most striking features. 75 | 76 | -- 77 | 78 | ## Exercises 27.4.7 79 | 80 | Add a section that explores how diamond sizes vary by cut, colour, and clarity. Assume you’re writing a report for someone who doesn’t know R, and instead of setting echo = FALSE on each chunk, set a global option. 81 | 82 | Download diamond-sizes.Rmd from https://github.com/hadley/r4ds/tree/master/rmarkdown. Add a section that describes the largest 20 diamonds, including a table that displays their most important attributes. 
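One possible sketch for that section; which attributes are "most important" is a judgment call, so the selection below is just an assumption:

```{r}
library(tidyverse)

# The 20 largest diamonds by carat, with the usual pricing attributes
diamonds %>%
  arrange(desc(carat)) %>%
  slice(1:20) %>%
  select(carat, price, cut, color, clarity) %>%
  knitr::kable(caption = "The 20 largest diamonds")
```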
83 | 84 | Modify diamonds-sizes.Rmd to use comma() to produce nicely formatted output. Also include the percentage of diamonds that are larger than 2.5 carats. 85 | 86 | For the answers to the three previous exercises, see answer `26.3` [here](https://jrnold.github.io/r4ds-exercise-solutions/r-markdown.html) 87 | 88 | Set up a network of chunks where d depends on c and b, and both b and c depend on a. Have each chunk print lubridate::now(), set cache = TRUE, then verify your understanding of caching. -------------------------------------------------------------------------------- /Ch12.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch12" 3 | output: 4 | html_document: 5 | df_print: paged 6 | --- 7 | 8 | ```{r} 9 | library(tidyverse) 10 | ``` 11 | 12 | ## Exercise 15.3.1 13 | Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot? 14 | 15 | Well, we should recode the levels so that all non-income categories are at the end, and flip the plot with `coord_flip` so that the labels can be read. 16 | 17 | ```{r} 18 | gss_cat %>% 19 | ggplot(aes(rincome)) + 20 | geom_bar() 21 | 22 | gss_cat %>% 23 | mutate(rincome = 24 | fct_relevel(rincome, 25 | c("No answer", "Don't know", "Refused", "Not applicable"))) %>% 26 | ggplot(aes(rincome)) + 27 | geom_bar() + 28 | coord_flip() 29 | ``` 30 | 31 | 32 | What is the most common relig in this survey? What's the most common partyid? 33 | 34 | ```{r} 35 | gss_cat %>% 36 | count(relig) %>% 37 | arrange(-n) 38 | ``` 39 | ```{r} 40 | gss_cat %>% 41 | count(partyid) %>% 42 | arrange(-n) 43 | ``` 44 | 45 | 46 | Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation? 47 | 48 | ```{r} 49 | gss_cat %>% 50 | count(relig, denom) %>% 51 | filter(denom == "No denomination") 52 | ``` 53 | 54 | ## Exercises 15.4.1 55 | There are some suspiciously high numbers in tvhours. Is the mean a good summary? 56 | 57 | ```{r} 58 | gss_cat %>% 59 | ggplot(aes(tvhours)) + 60 | geom_histogram() + 61 | geom_vline(xintercept = mean(gss_cat$tvhours, na.rm = TRUE), colour = "red") + 62 | geom_vline(xintercept = median(gss_cat$tvhours, na.rm = TRUE), colour = "blue") 63 | ``` 64 | 65 | Nope, there's a reasonable difference between the mean and the median. 66 | 67 | For each factor in gss_cat identify whether the order of the levels is arbitrary or principled. 68 | 69 | ```{r} 70 | fct_gss <- gss_cat[sapply(gss_cat, is.factor)] 71 | 72 | lapply(fct_gss, levels) 73 | ``` 74 | 75 | For all variables except rincome the levels are arbitrary. `rincome` is the only one which has a principled order. 76 | 77 | Why did moving 'Not applicable' to the front of the levels move it to the bottom of the plot? 78 | 79 | ```{r} 80 | gss_cat %>% 81 | mutate(rincome = rincome %>% fct_relevel("Not applicable")) %>% 82 | ggplot(aes(rincome)) + 83 | geom_bar() 84 | ``` 85 | 86 | The previous plot moves it to the beginning, but if we flip the coordinates it looks like it's at the end. 87 | 88 | ```{r} 89 | gss_cat %>% 90 | mutate(rincome = rincome %>% fct_relevel("Not applicable")) %>% 91 | ggplot(aes(rincome)) + 92 | geom_bar() + 93 | coord_flip() 94 | ``` 95 | 96 | But that's an illusion created by flipping the coordinates. The plot is in the same order. 97 | 98 | ## Exercises 15.5.1 99 | How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
100 | 101 | ```{r} 102 | 103 | all_levels <- levels(gss_cat$partyid) 104 | 105 | gss_cat %>% 106 | mutate(partyid = fct_collapse(partyid, 107 | Democrat = c('Not str democrat', 'Strong democrat'), 108 | Republican = c('Strong republican', 'Not str republican'), 109 | Independent = c("Ind,near rep", "Independent", "Ind,near dem"), 110 | Others = c("No answer", "Don't know", "Other party") 111 | )) %>% 112 | count(year, partyid) %>% 113 | group_by(year) %>% 114 | mutate(perc = n / sum(n)) %>% 115 | ggplot(aes(year, perc, group = partyid, colour = partyid)) + 116 | geom_line() + 117 | theme_bw() 118 | ``` 119 | 120 | It looks like Independents are growing whereas both other groups are shrinking, with Republicans shrinking much faster. 121 | 122 | How could you collapse rincome into a small set of categories? 123 | 124 | A very quick but perhaps not so advisable way (because you might lose important information and substantive meaning) is to use `fct_lump`. 125 | 126 | ```{r} 127 | gss_cat %>% 128 | mutate(rincome = fct_lump(rincome, n = 6)) %>% 129 | count(rincome) 130 | ``` 131 | or 132 | 133 | ```{r} 134 | gss_cat %>% 135 | mutate(rincome = 136 | fct_collapse( 137 | rincome, 138 | `Unknown` = c("No answer", "Don't know", "Refused", "Not applicable"), 139 | `Lt $5000` = c("Lt $1000", str_c("$", c("1000", "3000", "4000"), 140 | " to ", c("2999", "3999", "4999"))), 141 | `$5000 to 10000` = str_c("$", c("5000", "6000", "7000", "8000"), 142 | " to ", c("5999", "6999", "7999", "9999")) 143 | )) %>% 144 | ggplot(aes(x = rincome)) + 145 | geom_bar() + 146 | coord_flip() 147 | ``` 148 | 149 | -------------------------------------------------------------------------------- /Ch8.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch8" 3 | output: 4 | pdf_document: default 5 | html_document: 6 | df_print: paged 7 | --- 8 | 9 | ```{r} 10 | library(tidyverse) 11 | ``` 12 | 13 | 14 | ## Exercises 11.3.5 15 | 16 | What function would you use to read a file where fields are separated with "|"? 17 | 18 | You use `read_delim` and specify "|" in the `delim` argument. 19 | 20 | Apart from `file`, `skip` and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common? 21 | 22 | All arguments! But that's logical because they both use `read_delim` as the function doing the work. Both functions just call `read_delim` with a set of predefined options for the `csv` and `tsv` formats using the `tokenize_*` functions. The `tokenize_*` functions simply return a list with the characteristics of each format. 23 | 24 | What are the most important arguments to `read_fwf()`? 25 | 26 | The most important argument is `col_positions` because that's how you determine the width at which each column is separated. You can determine the width with the `fwf_*` helper functions. 27 | 28 | What arguments do you need to specify to read the following text into a data frame? 29 | 30 | ```{r} 31 | read_csv("x,y\n1,'a,b'", quote = "'") 32 | read_delim("x,y\n1,'a,b'", delim = ",", quote = "'") 33 | ``` 34 | 35 | Identify what is wrong with each of the following inline CSV files. What happens when you run the code? 36 | 37 | ```{r} 38 | read_csv("a,b\n1,2,3\n4,5,6") # more values per row than column names 39 | read_csv("a,b\n1,2,3\n4,5,6", skip = 1, col_names = letters[1:3]) # fixed 40 | 41 | read_csv("a,b,c\n1,2\n1,2,3,4") # there are 3 column names, but the first row has only two values and the second has four 42 | 43 | read_csv("a,b\n\"1") # the second row is actually: ", 1.
but it uses \" so it's a literal " and a comma is missing 44 | read_csv('a,b\n\",1', quote = "'") # it should be something like this I think 45 | 46 | read_csv("a,b\n1,2\na,b") # nothing wrong with this one, though both columns become character because of the "a" and "b" values in the last row 47 | read_csv("a,b\n1,2\na,b", n_max = 1) # this is the correct format. 48 | 49 | read_csv("a;b\n1;3") # this is ;-delimited 50 | read_csv2("a;b\n1;3") 51 | ``` 52 | 53 | 54 | # Exercise 11.3.5 55 | 56 | What are the most important arguments to `locale()`? 57 | 58 | This is a bit tricky because they're all important. `encoding` is usually one that brings about a lot of problems if you're working with international data. However, all others are also important except for `asciify`, which is hardly used. 59 | 60 | What happens if you try and set `decimal_mark` and `grouping_mark` to the same character? What happens to the default value of `grouping_mark` when you set `decimal_mark` to ","? And vice versa? 61 | 62 | If you set both to the same character it throws an error. Why? It makes sense. How can you distinguish cents from thousands here: `123,456,789`? Is this 123 million or 123 thousand? 63 | 64 | ```{r} 65 | locale(decimal_mark = ".") 66 | locale(decimal_mark = ",") 67 | ``` 68 | 69 | `.` and `,` are the defaults. If you change one, the other gets assigned. 70 | 71 | I didn't discuss the `date_format` and `time_format` options to `locale()`. What do they do? Construct an example that shows when they might be useful? 72 | 73 | They set the default date/time formats, which are based on the `ISO8601` format of yyyy-mm-dd hh-mm-ss. You can override that default by specifying the `locale` argument with new defaults. 74 | 75 | For date: 76 | ```{r} 77 | read_csv("a\n05-02-00") 78 | read_csv("a\n05-02-00", locale = locale(date_format = "%d-%m-%y")) 79 | ``` 80 | 81 | For time: 82 | ```{r} 83 | read_csv("a\n02-00-08 am") 84 | read_csv("a\n02-00-08 am", locale = locale(time_format = "%M-%S-%I %p")) 85 | ``` 86 | 87 | 88 | If you live outside the U.S., create a new `locale` object that encapsulates the type of files you read most often. 89 | 90 | ```{r} 91 | locale(date_names = "es", 92 | date_format = "%Y/%m/%d", 93 | time_format = "%H/%M/%S", 94 | grouping_mark = ".") 95 | ``` 96 | 97 | What's the difference between `read_csv()` and `read_csv2()`? They perform exactly the same operation, but the first reads comma-delimited files and the second one reads semicolon-delimited files. 98 | 99 | What are the most common encodings in Europe? and in Asia? 100 | 101 | No internet right now!
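From memory (worth double-checking): in Europe the most common legacy encodings are Latin-1 (ISO-8859-1) and Windows-1252; in Asia, GB18030/GBK (Chinese), Shift-JIS (Japanese), Big5 (Taiwan/Hong Kong) and EUC-KR (Korean); UTF-8 is increasingly the default everywhere. `readr::guess_encoding()` can help diagnose a file:

```{r}
# guess_encoding() returns the most likely encodings with a confidence score
guess_encoding(charToRaw("El niño está aquí"))

# The winner can then be passed to locale(), e.g.
# read_csv("file.csv", locale = locale(encoding = "Latin1"))
```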
102 | 103 | Generate the correct format string to parse each of the following dates and times: 104 | 105 | ```{r} 106 | d1 <- "January 1, 2010" 107 | parse_date(d1, format = "%B %d, %Y") 108 | 109 | d2 <- "2015-Mar-07" 110 | parse_date(d2, "%Y-%b-%d") 111 | 112 | d3 <- "06-Jun-2017" 113 | parse_date(d3, "%d-%b-%Y") 114 | 115 | d4 <- c("August 19 (2015)", "July 1 (2015)") 116 | parse_date(d4, "%B %d (%Y)") 117 | 118 | d5 <- "12/30/14" 119 | parse_date(d5, "%m/%d/%y") 120 | 121 | t1 <- "1705" 122 | parse_time(t1, "%H%M") 123 | 124 | t2 <- "11:15:10.12 PM" 125 | parse_time(t2, "%I:%M:%OS %p") 126 | ``` 127 | 128 | -------------------------------------------------------------------------------- /ch16.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch15" 3 | output: 4 | html_document: 5 | df_print: paged 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | library(tidyverse) 10 | library(purrr) 11 | ``` 12 | 13 | ## Exercises 20.3.5 14 | Describe the difference between is.finite(x) and !is.infinite(x). 15 | 16 | There is no difference! `is.finite()` tests whether a value is **not** `-Inf` or `Inf` and `is.infinite` tests whether it is `-Inf` or `Inf`. Logically, `!` negates the previous. 17 | 18 | However... `is.infinite` considers `NA` and `NaN` to be non-infinite, so `!is.infinite()` returns `TRUE` for them while `is.finite()` returns `FALSE`! See the answer to ex 19.3.1 [here](https://jrnold.github.io/r4ds-exercise-solutions/vectors.html). 19 | 20 | Read the source code for dplyr::near() (Hint: to see the source code, drop the ()). How does it work? 21 | 22 | It calculates the absolute difference between the two numbers and compares it to a tolerance level: if the difference is below the tolerance, the numbers are treated as equal. 23 | 24 | A logical vector can take 3 possible values. How many possible values can an integer vector take? How many possible values can a double take? Use google to do some research. 25 | 26 | Taken from: https://jrnold.github.io/r4ds-exercise-solutions/vectors.html 27 | 28 | The help for .Machine describes some of this: 29 | 30 | As all current implementations of R use 32-bit integers and uses IEC 60559 floating-point (double precision) arithmetic, 31 | 32 | Brainstorm at least four functions that allow you to convert a double to an integer. How do they differ? Be precise. 33 | 34 | ```{r} 35 | x <- seq(-10, 10, by = 0.5) 36 | as.integer(x) # turns to a literal integer 37 | floor(x) # rounds down 38 | `%/%`(x, 1) # it extracts the integer part of x 39 | 40 | ``` 41 | A 4th is `trunc()`, which truncates towards zero (and `ceiling()`, which rounds up, would make a 5th). 42 | 43 | What functions from the readr package allow you to turn a string into logical, integer, and double vector? 44 | 45 | ```{r} 46 | x <- c("TRUE", "FALSE") 47 | parse_logical(x) 48 | x <- c("1", "2") 49 | parse_integer(x) 50 | x <- c("1.1", "1.2") 51 | parse_double(x) 52 | ``` 53 | 54 | ## Exercises 20.4.6 55 | What does mean(is.na(x)) tell you about a vector x? What about sum(!is.finite(x))? 56 | 57 | `mean(is.na(x))` gives the proportion of missing values in that vector. 58 | `sum(!is.finite(x))` gives the total number of non-finite values (`Inf`, `-Inf`, `NA` and `NaN`) in `x`. 59 | 60 | Carefully read the documentation of is.vector(). What does it actually test for? Why does is.atomic() not agree with the definition of atomic vectors above? 61 | 62 | `is.atomic()` tests for objects that are "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw". `is.vector()` on the other hand tests for any of the `atomic` modes but must have no attributes other than names.
63 | 64 | A factor, for example, shows how they contradict. 65 | 66 | ```{r} 67 | x <- factor() 68 | is.vector(x) 69 | is.atomic(x) 70 | ``` 71 | 72 | 73 | Compare and contrast setNames() with purrr::set_names(). 74 | 75 | They do the same thing, but `purrr::set_names` has stricter argument checking and a handy option to rename variables based on a function. 76 | 77 | Create functions that take a vector as input and returns: 78 | 79 | The last value. Should you use [ or [[? 80 | 81 | `[[` because it wants the specific value. `x` might have other attributes such as names and it is asking only for one value. 82 | ```{r} 83 | returner <- function(x) { 84 | x[[length(x)]] 85 | } 86 | 87 | returner(1:10) 88 | ``` 89 | 90 | 91 | The elements at even numbered positions. 92 | 93 | ```{r} 94 | even_num <- function(x) { 95 | x[seq(2, length(x), by = 2)] 96 | } 97 | even_num(1:10) 98 | ``` 99 | 100 | 101 | Every element except the last value. 102 | 103 | ```{r} 104 | except_last <- function(x) { 105 | x[-length(x)] 106 | } 107 | 108 | except_last(1:10) 109 | ``` 110 | 111 | 112 | Only even numbers (and no missing values). 113 | 114 | ```{r} 115 | even_numbers <- function(x) { 116 | x[!is.na(x) & x %% 2 == 0] 117 | } 118 | 119 | even_numbers(sample(c(NA, 1:100), 99)) 120 | ``` 121 | 122 | 123 | Why is x[-which(x > 0)] not the same as x[x <= 0]? 124 | 125 | `x[-which(x > 0)]` subsets the values of `x` which are at `0` or below 126 | `x[x <= 0]` subsets the values of `x` which are at `0` or below. 127 | 128 | They are the same. 129 | 130 | However, `which` will ignore `NA`s and leave them as is, while `<=` will turn any value that cannot be compared, like `NaN`, into `NA`! 131 | 132 | ```{r} 133 | x <- c(-5:5, Inf, -Inf, NaN, NA) 134 | 135 | x[-which(x > 0)] 136 | x[x <= 0] 137 | ``` 138 | Notice the difference? 139 | 140 | 141 | What happens when you subset with a positive integer that's bigger than the length of the vector? What happens when you subset with a name that doesn't exist? 142 | 143 | ```{r} 144 | x <- 1:10 145 | x[11] 146 | ``` 147 | Returns an `NA` 148 | 149 | ```{r} 150 | x <- c("a" = 1) 151 | x["b"] 152 | ``` 153 | Both the name and the element are `NA`. 154 | 155 | ## Exercises 20.5.4 156 | Draw the following lists as nested sets: 157 | 158 | `list(a, b, list(c, d), list(e, f))` 159 | `list(list(list(list(list(list(a))))))` 160 | 161 | C'mon!! No paper and pen. 162 | 163 | What happens if you subset a tibble as if you're subsetting a list? What are the key differences between a list and a tibble? 164 | 165 | ```{r} 166 | x <- as_tibble(mtcars) 167 | x[1] # gives back a tibble (same as lists) 168 | x[[1]] # gives back a vector (same as lists) 169 | # In fact, data frames and tibbles are lists!! 170 | ``` 171 | 172 | ## Exercises 20.7.4 173 | 174 | What does hms::hms(3600) return? How does it print? What primitive type is the augmented vector built on top of? What attributes does it use? 175 | 176 | ```{r} 177 | x <- hms::hms(3600) 178 | ``` 179 | Because sixty seconds multiplied by sixty minutes is an hour (3600 seconds), it returns `1` hour. 180 | 181 | ```{r} 182 | class(x) # augmented classes 183 | typeof(x) # a double; in fact, it is 3600 seconds if we use as.numeric 184 | ``` 185 | 186 | It uses two classes and a `units` attribute. It prints, probably via `cat`, in hours, but it is in fact represented as seconds behind the scenes. The printing in hours is just a nice format. 187 | 188 | Try and make a tibble that has columns with different lengths. What happens?
189 | 190 | The length of all columns must be the same or 1 (a scalar); otherwise it throws an error. 191 | ```{r} 192 | tibble(a = 1:2, b = 1:3) 193 | ``` 194 | 195 | Based on the definition above, is it ok to have a list as a column of a tibble? 196 | 197 | Yes, as long as it has the same length as the other columns. But beware, because to 'open' that column all slots need to have the same class. -------------------------------------------------------------------------------- /ch20.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch18" 3 | output: 4 | pdf_document: default 5 | word_document: default 6 | html_document: 7 | df_print: paged 8 | --- 9 | 10 | ```{r} 11 | library(tidyverse) 12 | library(modelr) 13 | library(gapminder) 14 | ``` 15 | 16 | 17 | ## Exercises 25.2.5 18 | A linear trend seems to be slightly too simple for the overall trend. Can you do better with a quadratic polynomial? How can you interpret the coefficients of the quadratic? (Hint you might want to transform year so that it has mean zero.) 19 | 20 | ```{r} 21 | library(splines) 22 | 23 | by_country <- 24 | gapminder %>% 25 | group_by(country, continent) %>% 26 | nest() 27 | 28 | country_model <- function(df) { 29 | lm(lifeExp ~ ns(year, 2), data = df) 30 | } 31 | 32 | by_country <- 33 | by_country %>% 34 | mutate(model = map(data, country_model)) 35 | 36 | by_country <- 37 | by_country %>% 38 | mutate(resids = map2(data, model, add_residuals)) 39 | 40 | resids <- unnest(by_country, resids) 41 | 42 | resids %>% 43 | ggplot(aes(year, resid)) + 44 | geom_line(aes(group = country), alpha = 1/3) + 45 | geom_smooth(se = FALSE) 46 | ``` 47 | Even though there are still many spikes, the trend line lines up with zero much more than before! There was an improvement. 48 | 49 | ```{r} 50 | glance <- 51 | by_country %>% 52 | mutate(glance = map(model, broom::glance)) %>% 53 | unnest(glance, .drop = TRUE) 54 | 55 | glance %>% 56 | ggplot(aes(continent, r.squared)) + 57 | geom_jitter() 58 | ``` 59 | And the R2 seemed to increase substantially in Africa. Quite an improvement from just adding a squared term! 60 | 61 | Explore other methods for visualising the distribution of R2 per continent. You might want to try the ggbeeswarm package, which provides similar methods for avoiding overlaps as jitter, but uses deterministic methods. 62 | 63 | An interesting approach is to look not at the number of R squares but at their distribution across continents. We could do that with `freq_poly`. 64 | 65 | ```{r} 66 | ggplot(glance, aes(r.squared, colour = continent)) + 67 | geom_freqpoly(binwidth = 0.1) + 68 | theme_bw() 69 | ``` 70 | Well, not only did Africa improve (only 1 country remaining close to an R2 of 0) but the overall distribution is along the same lines as Europe and the Americas. Because the number of countries within each distribution is very different, another approach is to look at the distribution using `geom_density` to harmonize the count scale. 71 | 72 | Using `ggbeeswarm`, let's try what the book suggests. 73 | 74 | ```{r} 75 | ggplot(glance, aes(continent, r.squared)) + 76 | ggbeeswarm::geom_beeswarm() 77 | ``` 78 | Oh, that's neat. We shouldn't read any patterns into the horizontal arrangement (remember that these are numbers inside one category!) but it's much cleaner to look at. In the previous `geom_point` graph it is not that easy to see the countries in Africa between `0.60` and `0.90`. So they're not at all that close to `1` as we thought from the `freq_poly` graph.
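A sketch of the `geom_density` idea mentioned above:

```{r}
# Densities are normalised within each continent, so groups with
# few countries are comparable to groups with many
ggplot(glance, aes(r.squared, colour = continent)) +
  geom_density() +
  theme_bw()
```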
79 | 80 | To create the last plot (showing the data for the countries with the worst model fits), we needed two steps: we created a data frame with one row per country and then semi-joined it to the original dataset. It’s possible avoid this join if we use unnest() instead of unnest(.drop = TRUE). How? 81 | 82 | ```{r} 83 | by_country %>% 84 | mutate(glance = map(model, broom::glance)) %>% 85 | unnest(glance) %>% 86 | filter(r.squared < 0.25) %>% 87 | unnest(data, .drop = TRUE) %>% 88 | ggplot(aes(year, lifeExp, colour = country)) + 89 | geom_line() 90 | ``` 91 | Yes, this is the worst country we saw from Africa! That the R2 is very close to zero. 92 | 93 | ## Exercises 25.4.5 94 | 95 | List all the functions that you can think of that take a atomic vector and return a list. 96 | 97 | - `strsplit` 98 | - `stringr::` usually return a list! 99 | 100 | Brainstorm useful summary functions that, like quantile(), return multiple values. 101 | 102 | - `IQR` 103 | - `quantile` 104 | - `confint` 105 | - `range` 106 | - `fivenum` (didn't know about this one!) 107 | 108 | What’s missing in the following data frame? How does quantile() return that missing piece? Why isn’t that helpful here? 109 | 110 | ```{r} 111 | mtcars %>% 112 | group_by(cyl) %>% 113 | summarise(q = list(quantile(mpg))) %>% 114 | unnest() 115 | #> # A tibble: 15 × 2 116 | #> cyl q 117 | #> 118 | #> 1 4 21.4 119 | #> 2 4 22.8 120 | #> 3 4 26.0 121 | #> 4 4 30.4 122 | #> 5 4 33.9 123 | #> 6 6 17.8 124 | #> # ... with 9 more rows 125 | ``` 126 | `quantile` returns a vector of length `n` containing the percentile at which to cut off the distribution. The name of that percentile is set as the name of the number. For example.. 127 | 128 | ```{r} 129 | x <- 1:10 130 | quantile(x) 131 | ``` 132 | 133 | But in terms of the list, this isn't helpful! A solution is to turn that into a data frame with `tibble::enframe()`. 134 | 135 | ```{r} 136 | mtcars %>% 137 | group_by(cyl) %>% 138 | summarise(q = list(enframe(quantile(mpg)))) %>% 139 | unnest() 140 | ``` 141 | 142 | There it is. 143 | 144 | What does this code do? Why might might it be useful? 145 | 146 | ```{r} 147 | mtcars %>% 148 | group_by(cyl) %>% 149 | summarise_each(funs(list)) 150 | ``` 151 | If you want to iterate over several lists at the same time, for example. 152 | 153 | ## Exercises 25.5.3 154 | 155 | Why might the lengths() function be useful for creating atomic vector columns from list-columns? 156 | 157 | It's usefull to check that both list columns have the same number of rows and can be `unnest`ed. For example. 158 | 159 | ```{r} 160 | new_df <- 161 | by_country %>% 162 | mutate(glance = map(model, broom::glance)) %>% 163 | select(country, data, resids) %>% 164 | add_row(country = "random", 165 | data = list(tibble(a = "hey")), 166 | resids = list(tibble(b = c("hey", "ho")))) 167 | 168 | new_df 169 | ``` 170 | 171 | I've added a wrong row that should crash the unnesting. 172 | 173 | ```{r} 174 | new_df %>% 175 | unnest() 176 | ``` 177 | 178 | We can fix it by only subsetting the rows which have the same length between both lists. 179 | 180 | ```{r} 181 | new_df %>% 182 | mutate(first_len = map_dbl(data, nrow), 183 | second_len = map_dbl(resids, nrow)) %>% 184 | filter(first_len == second_len) %>% 185 | unnest() 186 | ``` 187 | 188 | List the most common types of vector found in a data frame. What makes lists different? 189 | 190 | All atomic vector. Lists can hold anything inside! They con host data frames, S3 objects, whatever you're interested in. 
However, the data frame won't be able to handle it as a traditional column. For that you need to subset whatever you're interested from the object with `map` and the resul needs to be one of the atomic vectors. 191 | 192 | -------------------------------------------------------------------------------- /ch13.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch13" 3 | output: 4 | html_document: 5 | df_print: paged 6 | --- 7 | 8 | ```{r} 9 | library(tidyverse) 10 | library(lubridate) 11 | ``` 12 | 13 | 14 | ## Exercises 16.2.4 15 | What happens if you parse a string that contains invalid dates? 16 | 17 | ```{r} 18 | ymd(c("2010-10-10", "bananas")) 19 | ``` 20 | It retuns the parsed object as a Date object (so it parses it) but returns the failed string as NA. 21 | 22 | What does the tzone argument to today() do? Why is it important? 23 | 24 | ```{r} 25 | 26 | # It gives the date of the specific time zone. If you're in China, the date might be different than from the U.S at a given moment. 27 | 28 | # Check OlsonNames() 29 | # for al TZ names 30 | tz <- str_subset(OlsonNames(), "Madrid") 31 | today(tzone = tz) 32 | ``` 33 | 34 | 35 | Use the appropriate lubridate function to parse each of the following dates: 36 | 37 | ```{r} 38 | 39 | d1 <- "January 1, 2010" 40 | 41 | mdy(d1) 42 | 43 | d2 <- "2015-Mar-07" 44 | 45 | ymd(d2) 46 | 47 | d3 <- "06-Jun-2017" 48 | 49 | dmy(d3) 50 | 51 | d4 <- c("August 19 (2015)", "July 1 (2015)") 52 | 53 | mdy(d4) 54 | 55 | d5 <- "12/30/14" # Dec 30, 2014 56 | 57 | mdy(d5) 58 | 59 | ``` 60 | 61 | ## Exercises 16.3.4 62 | How does the distribution of flight times within a day change over the course of the year? 63 | 64 | ```{r} 65 | flights_dt %>% 66 | mutate(time = hour(dep_time) * 100 + minute(dep_time), 67 | mon = as.factor(month 68 | (dep_time))) %>% 69 | ggplot(aes(x = time, y = ..density.., group = mon, color = mon)) + 70 | geom_freqpoly(binwidth = 100) 71 | ``` 72 | 73 | Taken from: https://jrnold.github.io/r4ds-exercise-solutions/dates-and-times.html#time-spans 74 | 75 | Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings. 76 | 77 | ```{r} 78 | flights_dt %>% 79 | select(dep_time, sched_dep_time, dep_delay) %>% 80 | mutate(other_dep_time = (dep_time - sched_dep_time) / 60) %>% 81 | filter(dep_delay != other_dep_time) %>% 82 | View() 83 | ``` 84 | There seems to be a discrepancy for flights that flew from one day to another. 85 | 86 | Compare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.) 87 | 88 | ```{r} 89 | flights_dt %>% 90 | mutate(air_time2 = as.numeric(arr_time - dep_time), 91 | diff = air_time2 - air_time) %>% 92 | select(dep_time, arr_time, air_time, air_time2, diff) 93 | ``` 94 | 95 | 96 | How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why? 97 | 98 | ```{r} 99 | flights_dt %>% 100 | group_by(minute = minute(dep_time)) %>% 101 | summarize(avg_delay = mean(dep_delay)) %>% 102 | ggplot(aes(minute, avg_delay)) + 103 | geom_line() 104 | 105 | flights_dt %>% 106 | group_by(minute = minute(sched_dep_time)) %>% 107 | summarize(avg_delay = mean(dep_delay)) %>% 108 | ggplot(aes(minute, avg_delay)) + 109 | geom_line() 110 | ``` 111 | 112 | `dep_time` because it shows the real pattern of delays. The scheduled departure time shows no pattern over the day. 
113 | 114 | On what day of the week should you leave if you want to minimise the chance of a delay? 115 | 116 | ```{r} 117 | flights_dt %>% 118 | group_by(day = wday(dep_time, label = TRUE, week_start = 1)) %>% 119 | summarize(avg_delay = mean(dep_delay)) %>% 120 | ggplot(aes(day, avg_delay)) + 121 | geom_line(aes(group = 1)) + 122 | geom_point() 123 | ``` 124 | 125 | Saturday! 126 | 127 | What makes the distribution of diamonds$carat and flights$sched_dep_time similar? 128 | 129 | ```{r} 130 | ggplot(diamonds, aes(carat %% 1 * 100)) + 131 | geom_histogram(bins = 100) 132 | ``` 133 | 134 | ```{r} 135 | ggplot(flights_dt, aes(minute(sched_dep_time))) + 136 | geom_histogram(bins = 100) 137 | ``` 138 | 139 | Taken from: https://jrnold.github.io/r4ds-exercise-solutions/dates-and-times.html#time-spans 140 | Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed. 141 | 142 | 143 | Here we calculate the percentage of flights that left early every minute. 144 | 145 | ```{r} 146 | flights_dt %>% 147 | transmute(early = dep_delay < 0, 148 | minute = minute(sched_dep_time)) %>% 149 | group_by(minute) %>% 150 | summarise(early = mean(early)) %>% 151 | ggplot(aes(x = minute, y = early)) + 152 | geom_point() 153 | ``` 154 | 155 | We can check it out by groups of minutes 156 | 157 | ```{r} 158 | flights_dt %>% 159 | transmute(early = dep_delay < 0, 160 | minute = minute(sched_dep_time), 161 | group_minute = cut(minute, 6)) %>% 162 | group_by(group_minute) %>% 163 | summarise(early = mean(early)) %>% 164 | ggplot(aes(x = group_minute, y = early)) + 165 | geom_point() 166 | ``` 167 | Flights that were suppose to leave in the first ten minutes had greater chances of leaving early than flights which were supposed to leave in the last 20 minutes of the hour. 168 | 169 | 170 | There's a problem with flights that flew over night, let's fix it. 171 | ```{r} 172 | flights_dt <- 173 | flights_dt %>% 174 | mutate( 175 | overnight = arr_time < dep_time, 176 | arr_time = arr_time + days(overnight * 1), # I don't know what * 1 is for 177 | sched_arr_time = sched_arr_time + days(overnight * 1) 178 | ) 179 | ``` 180 | 181 | 182 | ## Exercises 16.4.5 183 | Why is there months() but no dmonths()? 184 | 185 | Because durations are expressed in seconds and different months have different seconds. For example, February has less days depending on the year. It would be different if you specify the year and month. 186 | 187 | Explain days(overnight * 1) to someone who has just started learning R. How does it work? 188 | 189 | ```{r} 190 | overnight <- c(TRUE, FALSE, FALSE, TRUE, NA) 191 | 192 | days(overnight * 1) 193 | ``` 194 | 195 | Imagine you have the date 1st of March 2012. You'd like to know the date 100 days in the future. The first date is `dmy(1032012)`. `days(100)` creates a date object of `100` days that you can add to the first date, like `dmy(1032012) + days(100)`. 196 | 197 | `overnight` is a vector or logicals, so it will create either 1 day objects or 0 days objects (nothing). I don't know what the 1 is for. 198 | 199 | Create a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the current year. 
200 | 201 | ```{r} 202 | ymd(20150101) + months(1:12) 203 | ``` 204 | for the second 205 | 206 | ```{r} 207 | floor_date(today(), unit = "year") + months(1:12) 208 | ``` 209 | 210 | Write a function that given your birthday (as a date), returns how old you are in years. 211 | 212 | ```{r} 213 | my_bd <- function(birthday) { 214 | as.numeric(as.duration(today() - birthday), "years") %/% 1 215 | } 216 | 217 | birthday <- ymd(19910301) 218 | my_bd(birthday) 219 | ``` 220 | 221 | 222 | Why can’t (today() %--% (today() + years(1)) / months(1) work? 223 | 224 | ```{r} 225 | today() %--% (today() + years(1)) / months(1) 226 | ``` 227 | It works :/ 228 | 229 | -------------------------------------------------------------------------------- /ch18.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch18" 3 | output: 4 | pdf_document: default 5 | word_document: default 6 | html_document: 7 | df_print: paged 8 | --- 9 | 10 | ```{r} 11 | library(tidyverse) 12 | library(modelr) 13 | 14 | options(na.action = na.warn) 15 | ``` 16 | ## Exercises 23.2.1 17 | 18 | One downside of the linear model is that it is sensitive to unusual values because the distance incorporates a squared term. Fit a linear model to the simulated data below, and visualise the results. Rerun a few times to generate different simulated datasets. What do you notice about the model? 19 | 20 | ```{r} 21 | 22 | set.seed(2131) 23 | create_random <- function() { 24 | tibble( 25 | x = rep(1:10, each = 3), 26 | y = x * 1.5 + 6 + rt(length(x), df = 2) 27 | ) 28 | } 29 | 30 | setNames(1:6, paste0("dt", 1:4)) %>% 31 | map(~ create_random()) %>% 32 | enframe() %>% 33 | unnest() %>% 34 | ggplot(aes(x, y)) + 35 | geom_point() + 36 | geom_smooth(method = "lm", colour = "black", se = FALSE) + 37 | facet_wrap(~ name, scales = "free_y") 38 | ``` 39 | 40 | The slope changes quite a bit depending on the outliers. A better approach could be to used something like a root median squared error. 41 | 42 | One way to make linear models more robust is to use a different distance measure. For example, instead of root-mean-squared distance, you could use mean-absolute distance: 43 | 44 | ```{r} 45 | model1 <- function(a, data) { 46 | a[1] + data$x * a[2] 47 | } 48 | 49 | measure_distance <- function(mod, data) { 50 | diff <- data$y - model1(mod, data) 51 | mean(abs(diff)) 52 | } 53 | ``` 54 | 55 | Use optim() to fit this model to the simulated data above and compare it to the linear model. 56 | 57 | ```{r} 58 | create_random <- function() { 59 | tibble( 60 | x = rep(1:10, each = 3), 61 | y = x * 1.5 + 6 + rt(length(x), df = 2) 62 | ) 63 | } 64 | 65 | model_data <- 66 | setNames(1:4, paste0("dt", 1:4)) %>% 67 | map(~ create_random()) 68 | 69 | final_data <- 70 | model_data %>% 71 | map(~ optim(c(0, 0), measure_distance, data = .x)$par) %>% 72 | enframe() %>% 73 | unnest() %>% 74 | mutate(type = rep(c("int", "slope"), nrow(.) / 2)) %>% 75 | spread(type, value) %>% 76 | right_join(model_data %>% enframe() %>% unnest()) 77 | 78 | final_data %>% 79 | ggplot(aes(x, y)) + 80 | geom_point() + 81 | geom_abline(aes(intercept = int, slope = slope)) + 82 | facet_wrap(~ name, scales = "free_y") 83 | ``` 84 | The slope is much less affected by the outliers as it's straight in most cases. 85 | 86 | One challenge with performing numerical optimisation is that it’s only guaranteed to find one local optima. What’s the problem with optimising a three parameter model like this? 
87 | 88 | ```{r} 89 | model1 <- function(a, data) { 90 | a[1] + data$x * a[2] + a[3] 91 | } 92 | 93 | measure_distance <- function(mod, data) { 94 | diff <- data$y - model1(mod, data) 95 | mean(abs(diff)) 96 | } 97 | 98 | optim(c(0, 0), measure_distance, data = model_data$dt1)$par 99 | ``` 100 | That you get one (joint) slope, when ideally we'd want two slopes for each term. 101 | 102 | 103 | ## Exercises 23.3.3 104 | 105 | Instead of using lm() to fit a straight line, you can use loess() to fit a smooth curve. Repeat the process of model fitting, grid generation, predictions, and visualisation on sim1 using loess() instead of lm(). How does the result compare to geom_smooth()? 106 | 107 | ```{r} 108 | mod1 <- loess(y ~ x, data = sim1) 109 | 110 | sim1 %>% 111 | add_predictions(mod1) %>% 112 | ggplot(aes(x, y)) + 113 | geom_point() + 114 | geom_line(aes(y = pred), colour = "red") 115 | 116 | sim1 %>% 117 | ggplot(aes(x, y)) + 118 | geom_point() + 119 | geom_smooth() 120 | 121 | ``` 122 | 123 | 124 | Same thing! Cool.. didn't know that. 125 | 126 | add_predictions() is paired with gather_predictions() and spread_predictions(). How do these three functions differ? 127 | 128 | `gather_predictions` works for `gather`ing several models into a tidy data. 129 | 130 | ```{r} 131 | new_data <- tibble( 132 | y = rnorm(100), 133 | x = y + rnorm(100, mean = 5), 134 | z = y * runif(100, max = 100) 135 | ) 136 | 137 | mod1 <- lm(y ~ x, data = new_data) 138 | mod2 <- lm(y ~ z, data = new_data) 139 | 140 | final_data <- 141 | new_data %>% 142 | gather_predictions(mod1, mod2) 143 | 144 | 145 | final_data %>% 146 | ggplot(aes(pred, colour = model)) + 147 | geom_freqpoly() 148 | ``` 149 | 150 | `spread_predictions` does the same but adds the predictions as columns rather than as tidy dataset. 151 | 152 | ```{r} 153 | new_data %>% 154 | spread_predictions(mod1, mod2) 155 | ``` 156 | 157 | 158 | 159 | What does geom_ref_line() do? What package does it come from? Why is displaying a reference line in plots showing residuals useful and important? 160 | 161 | `geom_ref_line` is a nice addition to `ggplot2` although it comes from `modelr`. It's purpose is just adding a reference line in a plot. It's very practical for analyzing residuals because that way you can figure out if many points are above/below a certain point, and whether the models is worse/better at being overly pessimistic or overly positive. 162 | 163 | Why might you want to look at a frequency polygon of absolute residuals? What are the pros and cons compared to looking at the raw residuals? 164 | 165 | Looking at absolute residuals would work really well to distinguish the magnitude of bad or good predictions. Moreover, it server well to identiy strong outliers. On the other hand, the con side is that you don't know whether that strong prediction is either positive or negative. That's why it's better to look at raw residuals for that different question. 166 | 167 | ## Exercises 23.4.5 168 | 169 | What happens if you repeat the analysis of sim2 using a model without an intercept. What 170 | happens to the model equation? What happens to the predictions? 171 | 172 | ```{r} 173 | mod1 <- lm(y ~ x, data = sim2) 174 | mod2 <- lm(y ~ x + -1, data = sim2) 175 | 176 | model_matrix(sim2, y ~ x) 177 | model_matrix(sim2, y ~ x + -1) 178 | ``` 179 | 180 | Nothing happens because the reference group was previously the intercept but it is now an extra category. Consequently, predictions predictions and residuals will be the same. 
181 | 182 | ```{r} 183 | grid <- 184 | sim2 %>% 185 | data_grid(x) %>% 186 | add_predictions(mod2) 187 | 188 | ggplot(sim2, aes(x)) + 189 | geom_point(aes(y = y)) + 190 | geom_point(data = grid, aes(y = pred), colour = "red", size = 4) 191 | 192 | ``` 193 | 194 | Use model_matrix() to explore the equations generated for the models I fit to sim3 and sim4. Why is * a good shorthand for interaction? 195 | 196 | ```{r} 197 | mod1 <- lm(y ~ x1 + x2, data = sim3) 198 | mod2 <- lm(y ~ x1 * x2, data = sim3) 199 | 200 | mod1 <- lm(y ~ x1 + x2, data = sim4) 201 | mod2 <- lm(y ~ x1 * x2, data = sim4) 202 | 203 | model_matrix(sim3, y ~ x1 + x2) 204 | model_matrix(sim3, y ~ x1 * x2) 205 | ``` 206 | 207 | 208 | ```{r} 209 | model_matrix(sim4, y ~ x1 + x2) 210 | model_matrix(sim4, y ~ x1 * x2) 211 | ``` 212 | 213 | The `*` operator in a formula expands it to include both the main effect and the interaction between the two variables. 214 | 215 | Using the basic principles, convert the formulas in the following two models into functions. (Hint: start by converting the categorical variable into 0-1 variables.) 216 | 217 | ```{r} 218 | mod1 <- lm(y ~ x1 + x2, data = sim3) 219 | mod2 <- lm(y ~ x1 * x2, data = sim3) 220 | ``` 221 | 222 | Taken from [here](https://jrnold.github.io/r4ds-exercise-solutions/model-basics.html#formulas-and-model-families) as I didn't understand the question. 223 | 224 | ```{r} 225 | model_matrix_mod1 <- function(.data) { 226 | mutate(.data, 227 | `x2b` = as.numeric(x2 == "b"), 228 | `x2c` = as.numeric(x2 == "c"), 229 | `x2d` = as.numeric(x2 == "d"), 230 | `x1:x2b` = x1 * x2b, 231 | `x1:x2c` = x1 * x2c, 232 | `x1:x2d` = x1 * x2d) %>% 233 | select(x1, x2b, x2c, x2d, `x1:x2b`, `x1:x2c`, `x1:x2d`) 234 | } 235 | 236 | model_matrix_mod2 <- function(.data) { 237 | mutate(.data, `x1:x2` = x1 * x2) %>% 238 | select(x1, x2, `x1:x2`) 239 | } 240 | 241 | model_matrix_mod1(sim3) 242 | ``` 243 | ```{r} 244 | model_matrix_mod2(sim4) 245 | ``` 246 | 247 | 248 | For sim4, which of mod1 and mod2 is better? I think mod2 does a slightly better job at removing patterns, but it’s pretty subtle. Can you come up with a plot to support my claim? 249 | 250 | ```{r} 251 | mod1 <- lm(y ~ x1 + x2, data = sim4) 252 | mod2 <- lm(y ~ x1 * x2, data = sim4) 253 | 254 | resid_res <- 255 | sim4 %>% 256 | gather_residuals(mod1, mod2) 257 | 258 | resid_res %>% 259 | ggplot(aes(x = resid, colour = model)) + 260 | geom_freqpoly(binwidth = 0.5) 261 | ``` 262 | 263 | The distribution shows that the right-most side of the residuals are much smoother for the second model and closer to the central distribution. 264 | -------------------------------------------------------------------------------- /ch19.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch18" 3 | output: 4 | pdf_document: default 5 | word_document: default 6 | html_document: 7 | df_print: paged 8 | --- 9 | 10 | ```{r} 11 | library(tidyverse) 12 | library(modelr) 13 | 14 | options(na.action = na.warn) 15 | 16 | library(lubridate) 17 | library(nycflights13) 18 | ``` 19 | 20 | 21 | ## Exercises 24.2.3 22 | 23 | In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent? 24 | 25 | Those represents the categories of `carat` which is in fact an integer variable. But because we logged the initial variables we also get results different from integers. 26 | 27 | If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat? 
28 | 29 | That the price of a diamond is completely dependent on the carat size but only when the relationship is in a multiplicative or linear fashion. A 1% increase in carat is associated with a 1% increase in price. 30 | 31 | Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are the particularly bad or good, or do you think these are pricing errors? 32 | 33 | ```{r} 34 | diamonds2 <- 35 | diamonds %>% 36 | mutate(lprice = log2(price), 37 | lcarat = log2(carat)) 38 | 39 | mod1 <- lm(lprice ~ lcarat + color + clarity + cut, data = diamonds2) 40 | 41 | bottom <- 42 | diamonds2 %>% 43 | add_residuals(mod1) %>% 44 | arrange(resid) %>% 45 | slice(1:10) 46 | 47 | top <- 48 | diamonds2 %>% 49 | add_residuals(mod1) %>% 50 | arrange(-resid) %>% 51 | slice(1:10) 52 | 53 | bind_rows(bottom, top) %>% 54 | select(price, carat, resid) 55 | ``` 56 | 57 | Nothing seems off. 58 | 59 | Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond? 60 | 61 | ```{r} 62 | diamonds2 %>% 63 | add_predictions(mod1) %>% 64 | mutate(pred = 2 ^ pred) %>% 65 | select(price, pred) %>% 66 | mutate(se = predict(mod1, se.fit = TRUE)$se.fit, 67 | low_ci = pred - se * 2, 68 | upper_ci = pred + se * 2, 69 | correct = if_else(price >= low_ci & price <= upper_ci, TRUE, FALSE)) %>% 70 | summarize(prop_correct = mean(correct)) 71 | ``` 72 | It doesn't look like **very** good model at predicting because 0% of the predictions were close to the actual price. This is based on the 95% interval. 73 | 74 | We could do it separately and check the magnitude of the residuals. 75 | 76 | ```{r} 77 | diamonds2 %>% 78 | add_residuals(mod1) %>% 79 | mutate(resid = 2 ^ abs(resid)) %>% 80 | ggplot(aes(resid)) + 81 | geom_histogram() 82 | ``` 83 | 84 | Yet despite it doesn't do a good job at making accurate predictions, the model is not terribly bad as most predictions are close to the actual values. 85 | 86 | ## Exercises 24.3.5 87 | 88 | Use your Google sleuthing skills to brainstorm why there were fewer than expected flights on Jan 20, May 26, and Sep 1. (Hint: they all have the same explanation.) How would these days generalise to another year? 89 | 90 | * Jan 21 is Martin Luther King Jr. Day 91 | * May 26 is Trinity Sunday 92 | * Sep 2 is labot day 93 | 94 | All of these dates are holidays in the US, or the day preceding. 95 | 96 | How would they generalize to another year? Well, the holidays are there but they might end up in another day of the week. Let's check it out. 97 | 98 | ```{r} 99 | holiday <- c("0121", "0526", "0902") 100 | years <- 2013:2015 101 | map(years, ~ wday(ymd(paste0(.x, holiday, sep = "")), label = TRUE)) 102 | ``` 103 | It looks like they will be different for every year, suggesting that there's fluctuations in the number of flights per day. 104 | 105 | What do the three days with high positive residuals represent? How would these days generalise to another year? 106 | 107 | ```{r} 108 | daily %>% 109 | top_n(3, resid) 110 | #> # A tibble: 3 × 5 111 | #> date n wday resid term 112 | #> 113 | #> 1 2013-11-30 857 Sat 112.4 fall 114 | #> 2 2013-12-01 987 Sun 95.5 fall 115 | #> 3 2013-12-28 814 Sat 69.4 fall 116 | ``` 117 | It means that for Saturdays and Sundays the model underpredicts the number of flights (assuming these residuals are not absolute figures). 
However, these specific week days for another year might be different, so it's better we make sure whether this high imprecision is a weekend effect or a date effect. Once we know, we can adjust our model for that type of seasonal effect. 118 | 119 | Create a new variable that splits the wday variable into terms, but only for Saturdays, i.e. it should have Thurs, Fri, but Sat-summer, Sat-spring, Sat-fall. How does this model compare with the model with every combination of wday and term? 120 | 121 | ```{r} 122 | ## All previous code from the book 123 | daily <- 124 | flights %>% 125 | mutate(date = make_date(year, month, day)) %>% 126 | group_by(date) %>% 127 | summarize(n = n()) %>% 128 | mutate(wday = wday(date, label = TRUE)) 129 | 130 | mod <- lm(n ~ wday, data = daily) 131 | 132 | daily <- add_residuals(daily, mod) 133 | 134 | term <- function(date) { 135 | cut(date, 136 | breaks = ymd(20130101, 20130605, 20130825, 20140101), 137 | labels = c("spring", "summer", "fall") 138 | ) 139 | } 140 | 141 | daily <- 142 | daily %>% 143 | mutate(term = term(date)) 144 | 145 | ### 146 | 147 | new_daily <- 148 | daily %>% 149 | mutate(wday = as.character(wday), 150 | term_sat = ifelse(wday == "Sat", paste0(wday, "-", term), wday)) 151 | 152 | mod1 <- MASS::rlm(n ~ term_sat, data = new_daily) 153 | 154 | new_daily %>% 155 | add_residuals(mod1) %>% 156 | ggplot(aes(date, resid)) + 157 | geom_line() 158 | ``` 159 | 160 | It's pretty much the same. Both the Jan-March underprediction and the outliers from summer and winter are present. See [here](https://jrnold.github.io/r4ds-exercise-solutions/model-building.html) for a more detailed explanation. 161 | 162 | 163 | Create a new wday variable that combines the day of week, term (for Saturdays), and public holidays. What do the residuals of that model look like? 164 | 165 | ```{r} 166 | daily_holidays <- 167 | new_daily %>% 168 | mutate(holidays = case_when(date %in% ymd(c(20130101, # new years 169 | 20130121, # mlk 170 | 20130218, # presidents 171 | 20130527, # memorial 172 | 20130704, # independence 173 | 20130902, # labor 174 | 20131028, # columbus 175 | 20131111, # veterans 176 | 20131128, # thanksgiving 177 | 20131225)) ~ "holiday", 178 | TRUE ~ "None")) %>% 179 | unite(new_term, term_sat, holidays) 180 | 181 | mod2 <- lm(n ~ new_term, data = daily_holidays) 182 | 183 | daily_holidays %>% 184 | add_residuals(mod2) %>% 185 | ggplot(aes(date, resid)) + 186 | geom_line() 187 | ``` 188 | No luck! Holidays and days of the week don't seem to explain much of the unexplained variation. 189 | 190 | What happens if you fit a day of week effect that varies by month (i.e. n ~ wday * month)? Why is this not very helpful? 191 | 192 | ```{r} 193 | mod2 <- lm(n ~ wday * month(date), data = daily_holidays) 194 | 195 | daily_holidays %>% 196 | add_residuals(mod2) %>% 197 | ggplot(aes(date, resid)) + 198 | geom_line() 199 | ``` 200 | 201 | The outliers become much more extreme! This is the case because the interaction term leaves fewer observations in each cell, making the predictions more uncertain. 202 | 203 | 204 | What would you expect the model n ~ wday + ns(date, 5) to look like? Knowing what you know about the data, why would you expect it to be not particularly effective? 205 | 206 | Well, it could model the overall trend in this year but it would not be particularly effective in generalizing to other years if the effect we're missing is not year-specific but something else like seasonal effects that change over the years.
Moreover, it would not capture the strong outliers that are tied to specific days, like the holidays above.

We hypothesised that people leaving on Sundays are more likely to be business travellers who need to be somewhere on Monday. Explore that hypothesis by seeing how it breaks down based on distance and time: if it’s true, you’d expect to see more Sunday evening flights to places that are far away.

One way in: summarize mean `distance` by day of week and departure hour (e.g. `flights %>% group_by(wday = wday(time_hour, label = TRUE), hour) %>% summarize(dist = mean(distance))`) — if the hypothesis holds, Sunday evenings should show longer average distances than other evenings.

It’s a little frustrating that Sunday and Saturday are on separate ends of the plot. Write a small function to set the levels of the factor so that the week starts on Monday.

```{r}
week_relevel <- function(x) {
  fct_relevel(x, "Sun", after = 7)
}


daily %>%
  mutate(wday = week_relevel(wday)) %>%
  ggplot(aes(wday, n)) +
  geom_boxplot()
```

-------------------------------------------------------------------------------- /ch15.Rmd: --------------------------------------------------------------------------------

---
title: "Ch15"
output:
  pdf_document: default
  html_document:
    df_print: paged
---

```{r}
library(lubridate)
library(tidyverse)
```


19.2.1 Practice
Why is TRUE not a parameter to rescale01()? What would happen if x contained a single missing value, and na.rm was FALSE?

```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = FALSE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(-1, 0, 5, 20, NA))
```
Everything becomes NA! With a single missing value and `na.rm = FALSE`, `range()` returns `c(NA, NA)`, so every rescaled value is NA. `TRUE` isn't a parameter because there is no situation where you'd want the all-NA behaviour — hard-coding `na.rm = TRUE` (or exposing it as an argument) is the sane default.

In the second variant of rescale01(), infinite values are left unchanged. Rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1.

```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  # assign the rescaled values (the infinities survive the division unchanged)
  x <- (x - rng[1]) / (rng[2] - rng[1])
  x[x == Inf] <- 1
  x[x == -Inf] <- 0
  x
}

rescale01(c(Inf, -Inf, 0:5, NA))
```


Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?

```{r}
x <- 1:10
mean(is.na(x))

x / sum(x, na.rm = TRUE)

sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
```

Implementation
```{r}
# proportion of missing values
prop_miss <- function(x) {
  mean(is.na(x))
}

# rescale x to proportions of the total (it is not a mean)
prop_total <- function(x) {
  x / sum(x, na.rm = TRUE)
}

# coefficient of variation: sd relative to the mean (not the variance)
coef_var <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# Follow http://nicercode.github.io/intro/writing-functions.html to write your
# own functions to compute the variance and skew of a numeric vector
# (a sketch follows at the end of this section).


```

Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.

```{r}
both_na <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(is.na(x) & is.na(y))
}

both_na(c(1, 2, NA, 2, NA), c(1, 2, 3, 4, NA))
```

(`which(is.na(x) & is.na(y))` would return the positions themselves; the exercise asks for the *number* of positions, hence `sum()`.)


What do the following functions do? Why are they useful even though they are so short?

```{r}
# Checks whether the path is a directory.
is_directory <- function(x) file.info(x)$isdir

# Checks whether a file is readable (mode 4 = test for read permission).
is_readable <- function(x) file.access(x, 4) == 0
```
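Picking up the TODO in the implementation chunk above — a sketch of hand-rolled variance and skewness. The names `my_variance`/`my_skewness` are mine, and the skewness formula is one common variant (third central moment over the cubed standard deviation):

```{r}
my_variance <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  n <- length(x)
  sum((x - mean(x))^2) / (n - 1)
}

my_skewness <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  n <- length(x)
  m3 <- sum((x - mean(x))^3) / n           # third central moment
  s <- sqrt(sum((x - mean(x))^2) / (n - 1))
  m3 / s^3
}

my_variance(1:10)            # matches var(1:10)
my_skewness(c(1, 2, 5, 10))  # right-skewed, so positive
```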
Read the complete lyrics to “Little Bunny Foo Foo”. There’s a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.

## Exercises 19.3.1
Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.

```{r}
f1 <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}
f2 <- function(x) {
  if (length(x) <= 1) return(NULL)
  x[-length(x)]
}
f3 <- function(x, y) {
  rep(y, length.out = length(x))
}
```

```{r}

# It tests whether each string starts with the given prefix
has_prefix <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}

has_prefix(c("hey_ho", "another_pre"), "hey")

# It removes the last element of x, but only if the length is greater than 1;
# otherwise it returns NULL
remove_last <- function(x) {
  if (length(x) <= 1) return(NULL)
  x[-length(x)]
}

remove_last(c(1, 2, 3))

# It recycles y to the length of x
convert_length <- function(x, y) {
  rep(y, length.out = length(x))
}

convert_length(1:10, 1:3)
```

Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.

-

Compare and contrast rnorm() and MASS::mvrnorm(). How could you make them more consistent?

They are inconsistent in their argument names: `rnorm()` uses `n`, `mean`, `sd`, while `mvrnorm()` uses `n`, `mu`, `Sigma`. Since the multivariate function generalizes the univariate one, you could make them consistent by settling on one naming scheme (say `mu`/`sigma` in both).

Make a case for why norm_r(), norm_d() etc would be better than rnorm(), dnorm(). Make a case for the opposite.

Taken from: https://jrnold.github.io/r4ds-exercise-solutions/functions.html#when-should-you-write-a-function

If named norm_r and norm_d, it groups the family of functions related to the normal distribution. If named rnorm and dnorm, functions are grouped into families by the action they perform. r* functions always sample from distributions: rnorm, rbinom, runif, rexp. d* functions calculate the probability density or mass of a distribution: dnorm, dbinom, dunif, dexp.

## Exercises 19.4.4

What’s the difference between if and ifelse()? Carefully read the help and construct three examples that illustrate the key differences.

`ifelse()` tests the condition in a vectorized way: it takes a logical vector and returns a result of the same length. `if` tests a single `TRUE`/`FALSE` value and is a control-flow construct whose branches can contain anything.
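Three quick illustrations (a sketch; the `Date` gotcha is a well-known `ifelse()` wart):

```{r}
x <- c(-2, 0, 3)

# ifelse() is vectorized: one result per element
ifelse(x > 0, "pos", "non-pos")

# if() wants a single TRUE/FALSE; a length > 1 condition is an error in recent R
if (x[3] > 0) "pos" else "non-pos"

# ifelse() strips attributes such as the Date class
d <- Sys.Date()
ifelse(TRUE, d, d)  # a bare number, not a Date
```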
Write a greeting function that says “good morning”, “good afternoon”, or “good evening”, depending on the time of day. (Hint: use a time argument that defaults to lubridate::now(). That will make it easier to test your function.)

```{r}
# Note: naming the argument `now` with a default of `now()` triggers a
# "promise already under evaluation" error when the default is used,
# so the argument is called `time` here; the hour boundaries are also
# made non-overlapping.
greeter <- function(time = lubridate::now()) {
  h <- hour(time)
  if (h < 12) {
    print("Good morning")
  } else if (h < 18) {
    print("Good afternoon")
  } else {
    print("Good evening")
  }
}

greeter()
```


Implement a fizzbuzz function. It takes a single number as input. If the number is divisible by three, it returns “fizz”. If it’s divisible by five it returns “buzz”. If it’s divisible by three and five, it returns “fizzbuzz”. Otherwise, it returns the number. Make sure you first write working code before you create the function.

```{r}
x <- 9 # scratch value used while writing the working code first

fizzbuzz <- function(x) {
  by_three <- x %% 3 == 0
  by_five <- x %% 5 == 0

  if (by_three && by_five) {
    return("fizzbuzz")
  } else if (by_three) {
    return("fizz")
  } else if (by_five) {
    return("buzz")
  } else {
    return(x)
  }
}

sapply(1:20, fizzbuzz)
```


How could you use cut() to simplify this set of nested if-else statements?

```{r}
temp <- 27

# if() is not vectorized, so the chain only handles one temperature at a time
# (a length > 1 condition is an error in recent R)
if (temp <= 0) {
  "freezing"
} else if (temp <= 10) {
  "cold"
} else if (temp <= 20) {
  "cool"
} else if (temp <= 30) {
  "warm"
} else {
  "hot"
}

# cut() recodes a whole vector at once
temps <- c(27, 30)
cut(temps, breaks = seq(-10, 40, 10),
    labels = c("freezing", "cold", "cool", "warm", "hot"))
```

How would you change the call to cut() if I’d used < instead of <=? What is the other chief advantage of cut() for this problem? (Hint: what happens if you have many values in temp?)


First, `cut()` is vectorized, so a single call recodes any number of temperatures; second, it controls which side of each interval is closed via `right`:
```{r}
cut(temps, breaks = seq(-10, 40, 10),
    right = FALSE,
    labels = c("freezing", "cold", "cool", "warm", "hot"))
```


What happens if you use switch() with numeric values?

```{r, eval = FALSE}
x = 2
switch(x, 1 = "No", 2 = "Yes")
switch(x, `1` = "No", `2` = "Yes")
```

With a numeric first argument, `switch()` selects by *position* and ignores the names, so the second call returns "Yes" for `x = 2`. (The first call doesn't even parse: `1` is not a valid argument name without backticks.)


What does this switch() call do? What happens if x is “e”?

```{r}
x <- "d"
switch(x,
  a = ,
  b = "ab",
  c = ,
  d = "cd"
)
```
When a slot is empty, switch() falls through to the next slot that has a value. If `x` is "e" there is no matching name and no default, so switch() returns `NULL` invisibly.

```{r}
x <- "a"
switch(x,
  a = ,
  z = ,
  b = "ab",
  c = ,
  d = "cd"
)
```

## Exercises 19.5.5
What does commas(letters, collapse = "-") do? Why?

```{r}
commas <- function(..., collapse = ",") {
  str_c(..., collapse = collapse)
}

commas(letters, collapse = "-")
```
You get them `-` separated! (Note that with the book's original definition — `commas <- function(...) str_c(..., collapse = ", ")` — this call errors instead: `collapse = "-"` is swept into `...`, so `str_c()` receives two `collapse` arguments. Exposing `collapse` explicitly, as above, is the fix.)

It’d be nice if you could supply multiple characters to the pad argument, e.g. rule("Title", pad = "-+"). Why doesn’t this currently work? How could you fix it?

```{r}
rule <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  pad_char <- nchar(pad)
  cat(title, " ", stringr::str_dup(pad, width / pad_char), "\n", sep = "")
}

rule("my title", pad = "-+")
```

It doesn't work in the original because the pad string is duplicated `width` times regardless of its own length, making the rule twice too long for a two-character pad; dividing by `nchar(pad)` fixes that (up to a truncated remainder when the division isn't whole).

What does the trim argument to mean() do? When might you use it?

```{r}
mean(c(99, 1:10))
mean(c(99, 1:10), trim = 0.1)
```
In this case, it trims 10% of the observations from each end of the sorted vector before averaging — handy when you want a mean that is robust to outliers like the 99 here.

The default value for the method argument to cor() is c("pearson", "kendall", "spearman"). What does that mean? What value is used by default?

The first value is used by default, and you can specify any of the three; this is the `match.arg()` idiom, sketched below.
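A sketch of the idiom `cor()` uses internally (`pick_method` is a made-up name):

```{r}
pick_method <- function(method = c("pearson", "kendall", "spearman")) {
  # match.arg() takes the first option as the default and validates
  # (with partial matching) anything the caller supplies
  method <- match.arg(method)
  method
}

pick_method()          # "pearson"
pick_method("spear")   # partial matching: "spearman"
```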
-------------------------------------------------------------------------------- /ch22.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ch22" 3 | output: 4 | pdf_document: default 5 | word_document: default 6 | html_document: 7 | df_print: paged 8 | --- 9 | 10 | ```{r} 11 | library(tidyverse) 12 | library(modelr) 13 | ``` 14 | 15 | ## Exercises 28.2.1 16 | Create one plot on the fuel economy data with customised title, subtitle, caption, x, y, and colour labels. 17 | 18 | ```{r} 19 | mpg %>% 20 | group_by(year, class) %>% 21 | summarize(max_hwy = max(hwy)) %>% 22 | ungroup() %>% 23 | mutate(year = as.character(year)) %>% 24 | ggplot(aes(year, max_hwy, color = class, group = class)) + 25 | geom_point() + 26 | geom_line() + 27 | labs( 28 | x = "Year of manufacture", 29 | y = "Maximum highway mileage", 30 | color = "Class of car", 31 | title = paste("High way mileage rougly increases for", 32 | "every class of car"), 33 | subtitle = paste("Subcompact and compact seem to have", 34 | "decreased") 35 | ) + 36 | theme_bw() 37 | 38 | 39 | ggplot(mpg, aes(year, displ, group = manufacturer, color = manufacturer)) + 40 | geom_line() 41 | ``` 42 | 43 | 44 | The geom_smooth() is somewhat misleading because the hwy for large engines is skewed upwards due to the inclusion of lightweight sports cars with big engines. Use your modelling tools to fit and display a better model. 45 | 46 | An easy way out is to control the `span` to a more linear trend. 47 | 48 | ```{r} 49 | ggplot(mpg, aes(displ, hwy)) + 50 | geom_point(aes(color = class)) + 51 | geom_smooth(se = FALSE, span = 1.8) + 52 | labs( 53 | x = "Engine displacement (L)", 54 | y = "Highway fuel economy (mpg)", 55 | color = "Car type", 56 | title = paste( 57 | "Fuel efficiency generally decreases with", 58 | "engine size" 59 | ), 60 | subtitle = paste( 61 | "Two seaters (sports cars) are an exception", 62 | "because of their light weight" 63 | ), 64 | caption = "Data from fueleconomy.gov" 65 | ) 66 | 67 | ``` 68 | Or just set the trend to be linear. 69 | 70 | ```{r} 71 | ggplot(mpg, aes(displ, hwy)) + 72 | geom_point(aes(color = class)) + 73 | geom_smooth(se = FALSE, method = "lm") + 74 | labs( 75 | x = "Engine displacement (L)", 76 | y = "Highway fuel economy (mpg)", 77 | color = "Car type", 78 | title = paste( 79 | "Fuel efficiency generally decreases with", 80 | "engine size" 81 | ), 82 | subtitle = paste( 83 | "Two seaters (sports cars) are an exception", 84 | "because of their light weight" 85 | ), 86 | caption = "Data from fueleconomy.gov" 87 | ) 88 | ``` 89 | or just a better model, literally. 90 | 91 | ```{r} 92 | mod1 <- lm(hwy ~ class, data = mpg) 93 | 94 | mpg %>% 95 | add_residuals(mod1) %>% 96 | ggplot(aes(displ, resid)) + 97 | geom_point() + 98 | geom_smooth(se = FALSE, method = "lm") + 99 | labs( 100 | x = "Engine displacement (L)", 101 | y = "Highway fuel economy (mpg) adjusted \n for the class of the car", 102 | color = "Car type", 103 | title = paste( 104 | "Fuel efficiency generally decreases with", 105 | "engine size" 106 | ), 107 | subtitle = paste( 108 | "Two seaters (sports cars) are an exception", 109 | "because of their light weight" 110 | ), 111 | caption = "Data from fueleconomy.gov" 112 | ) + 113 | theme_bw() 114 | 115 | ``` 116 | 117 | 118 | Take an exploratory graphic that you’ve created in the last month, and add informative titles to make it easier for others to understand. 
119 | 120 | -- 121 | 122 | ## Exercises 28.3.1 123 | Use geom_text() with infinite positions to place text at the four corners of the plot. 124 | 125 | ```{r} 126 | label_text <- tibble( 127 | displ = c(-Inf, -Inf, Inf, Inf), 128 | hwy = c(Inf, -Inf, Inf, -Inf), 129 | hjust = c("left", "left", "right", "right"), 130 | vjust = c("top", "bottom", "top", "bottom"), 131 | text = c("This is top left", 132 | "This is bottom left", 133 | "This is top right", 134 | "This is bottom right") 135 | ) 136 | 137 | ggplot(mpg, aes(displ, hwy)) + 138 | geom_point() + 139 | geom_text(aes(hjust = hjust, vjust = vjust, label = text), data = label_text) 140 | ``` 141 | 142 | 143 | Read the documentation for annotate(). How can you use it to add a text label to a plot without having to create a tibble? 144 | 145 | ```{r} 146 | ggplot(mpg, aes(displ, hwy)) + 147 | geom_point() + 148 | annotate(geom = "text", label = "annotation", x = 6, y = 30) 149 | ``` 150 | 151 | 152 | How do labels with geom_text() interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: think about the underlying data.) 153 | 154 | ```{r} 155 | best_in_class <- 156 | mpg %>% 157 | filter(manufacturer %in% c("audi", "toyota")) %>% 158 | group_by(manufacturer) %>% 159 | filter(row_number(class) == 1) 160 | 161 | mpg %>% 162 | filter(manufacturer %in% c("audi", "toyota")) %>% 163 | ggplot(aes(displ, hwy)) + 164 | geom_point() + 165 | geom_text(aes(label = model), data = best_in_class) + 166 | facet_wrap(~ manufacturer) 167 | ``` 168 | What if you wanted it to appear in one facet and not the other? Think about the data. Let's say we wanted to remove `a4` then we would have to either remove that label or remove the row altogether! For example 169 | 170 | ```{r} 171 | best_in_class <- 172 | mpg %>% 173 | filter(manufacturer %in% c("audi", "toyota")) %>% 174 | group_by(manufacturer) %>% 175 | filter(row_number(class) == 1) %>% 176 | .[2, ] 177 | 178 | mpg %>% 179 | filter(manufacturer %in% c("audi", "toyota")) %>% 180 | ggplot(aes(displ, hwy)) + 181 | geom_point() + 182 | geom_text(aes(label = model), data = best_in_class) + 183 | facet_wrap(~ manufacturer) 184 | 185 | 186 | ``` 187 | 188 | See [here](https://jrnold.github.io/r4ds-exercise-solutions/graphics-for-communication.html) for some cooler examples. 189 | 190 | What arguments to geom_label() control the appearance of the background box? 191 | 192 | `label.padding`, `label.r` and `label.size`. 193 | 194 | What are the four arguments to arrow()? How do they work? Create a series of plots that demonstrate the most important options. 195 | 196 | 197 | ```{r} 198 | mpg %>% 199 | ggplot(aes(displ, hwy)) + 200 | geom_point() + 201 | geom_segment(x = 3, xend = 4, y = 30, yend = 40, 202 | arrow = arrow()) 203 | ``` 204 | 205 | you can specify different angles and options. 
For example 206 | 207 | ```{r} 208 | mpg %>% 209 | ggplot(aes(displ, hwy)) + 210 | geom_point() + 211 | geom_segment(x = 3, xend = 4, y = 30, yend = 40, 212 | arrow = arrow(type = "closed")) 213 | ``` 214 | 215 | ```{r} 216 | mpg %>% 217 | ggplot(aes(displ, hwy)) + 218 | geom_point() + 219 | geom_segment(x = 3, xend = 4, y = 30, yend = 40, 220 | arrow = arrow(angle = 160)) 221 | ``` 222 | 223 | ```{r} 224 | mpg %>% 225 | ggplot(aes(displ, hwy)) + 226 | geom_point() + 227 | geom_segment(x = 3, xend = 4, y = 30, yend = 40, 228 | arrow = arrow(ends = "first")) 229 | ``` 230 | 231 | ## Exercises 28.4.4 232 | 233 | Why doesn’t the following code override the default scale? 234 | 235 | ```{r} 236 | df <- tibble( 237 | x = rnorm(10000), 238 | y = rnorm(10000) 239 | ) 240 | 241 | ggplot(df, aes(x, y)) + 242 | geom_hex() + 243 | scale_colour_gradient2(low = "white", high = "red") + 244 | coord_fixed() 245 | ``` 246 | 247 | I think it's because there's not `color` aesthetic. Instead, `geom_hex` uses a `fill` aesthetic. 248 | 249 | ```{r} 250 | df %>% 251 | ggplot(aes(x, y)) + 252 | geom_hex() + 253 | scale_fill_gradient(low = "white", high = "red") + 254 | coord_fixed() 255 | ``` 256 | 257 | 258 | What is the first argument to every scale? How does it compare to labs()? 259 | 260 | All scales control very similar behavior. For example, the name of the aesthetic, the name of th legend, the labels, breaks, minor breaks of the aesthetic and so on. They are all common to all scales! `labs` has similar arguments such as naming the titles of aesthetics such as `color`, `fill` or any oher aesthetic specified. But it also controls high-level properties of the plot such as `title` and `substitle`. 261 | 262 | 263 | Change the display of the presidential terms by: 264 | 265 | ```{r} 266 | presidential <- 267 | presidential %>% 268 | mutate(id = row_number()) 269 | 270 | p <- 271 | ggplot(presidential, aes(start, id, color = party)) + 272 | geom_point() + 273 | geom_segment(aes(xend = end, yend = id)) 274 | p 275 | ``` 276 | 277 | 278 | Combining the two variants shown above. 279 | 280 | ```{r} 281 | p2 <- 282 | p + 283 | labs(title = "Presidential terms", 284 | x = "Years", 285 | y = "Presidents", 286 | color = "Political party") + 287 | scale_color_manual(labels = c("Dem", "Rep"), 288 | values = c(Democratic = "blue", 289 | Republican = "red")) 290 | p2 291 | ``` 292 | 293 | 294 | Improving the display of the y axis. 295 | Labelling each term with the name of the president. 296 | 297 | ```{r} 298 | p3 <- 299 | p2 + 300 | scale_y_continuous(breaks = presidential$id, 301 | labels = presidential$name) 302 | 303 | p3 304 | ``` 305 | 306 | Adding informative plot labels. 307 | 308 | Done above. 309 | 310 | Placing breaks every 4 years (this is trickier than it seems!). 311 | 312 | ```{r} 313 | p4 <- 314 | p3 + 315 | scale_x_date(date_breaks = "4 years", 316 | date_labels = "%y'") 317 | 318 | p4 319 | ``` 320 | 321 | 322 | Use override.aes to make the legend on the following plot easier to see. 
```{r}
ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(colour = cut), alpha = 1/20) +
  guides(
    colour = guide_legend(override.aes = list(alpha = 1))
  )
```

`override.aes` has to be passed to `guide_legend()` for the aesthetic in question; handing it directly to `guides()` silently does nothing.
-------------------------------------------------------------------------------- /Ch10.Rmd: --------------------------------------------------------------------------------

---
title: "Ch10"
output:
  html_document:
    df_print: paged
---

```{r}
library(tidyverse)
library(nycflights13)
```

Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?

You would need to combine `airports` with `flights`, because the `airports` dataset has the coordinates (`lat`, `lon`) of each airport. You match the `faa` variable in `airports` to `origin` and `dest` in `flights` — one join per end of the route, since a single join on both keys at once would demand that origin and destination be the same airport:

```{r}
flights %>%
  left_join(airports, by = c("origin" = "faa")) %>%
  left_join(airports, by = c("dest" = "faa"), suffix = c("_origin", "_dest"))
```

I forgot to draw the relationship between weather and airports. What is the relationship and how should it appear in the diagram?

Similarly to the previous exercise, you can match them with `faa` in `airports` and `origin` in `weather`.

weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?

You could also connect it with flights through `dest` and have the weather of every US airport that appears as a destination in the `flights` dataset.

We know that some days of the year are "special", and fewer people than usual fly on them. How might you represent that data as a data frame? What would be the primary keys of that table? How would it connect to the existing tables?

You could have a separate table of US holidays, one row per holiday with `year`, `month` and `day` (the date columns being its primary key). With that you can match it to each flight in the `flights` dataset and subsequently to the `weather` dataset.

## Exercises 13.3.1

Add a surrogate key to flights.

```{r}
flights %>%
  mutate(id = row_number(year)) %>%
  select(id, everything())
```

Identify the keys in the following datasets

Lahman::Batting,

```{r}
as_tibble(Lahman::Batting)
```

It's the combination of `playerID`, `yearID` and `stint`: players appear in many seasons, and can have several stints within one season, so `playerID` alone is not unique.

babynames::babynames

```{r}
as_tibble(babynames::babynames)
```

The combination of `year`, `sex` and `name` — the same name can appear for both sexes in the same year, so `year` + `name` alone is not unique.

nasaweather::atmos

```{r}
as_tibble(nasaweather::atmos)
```

Here it's most likely `lat`, `long`, `year` and `month`, which locate a specific place in a month/year.

fueleconomy::vehicles

```{r}
as_tibble(fueleconomy::vehicles)
```
`id` is the simple key.

ggplot2::diamonds

```{r}
as_tibble(ggplot2::diamonds)
```

There is no primary key: no combination of variables is guaranteed to identify a row, since two physically distinct diamonds can share every recorded measurement. If you need one, add a surrogate key with `row_number()`.

Draw a diagram illustrating the connections between the Batting, Master, and Salaries tables in the Lahman package. Draw another diagram that shows the relationship between Master, Managers, AwardsManagers.
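No diagram here, but a sketch of the keys those diagrams would encode (assuming the classic Lahman naming, where the people table is `Master` — newer releases call it `People`):

```{r}
# All of these tables point at Master through playerID:
#   Batting(playerID, yearID, stint)                -> Master(playerID)
#   Salaries(yearID, teamID, lgID, playerID)        -> Master(playerID)
#   Managers(playerID, yearID, teamID, inseason)    -> Master(playerID)
#   AwardsManagers(playerID, awardID, yearID, lgID) -> Master(playerID)

# e.g. check the foreign key is intact: batting records with no matching person
Lahman::Batting %>%
  anti_join(Lahman::Master, by = "playerID") %>%
  nrow()
```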
86 | 87 | How would you characterise the relationship between the Batting, Pitching, and Fielding tables? 88 | 89 | It's actualy *very* straight forward: all three tables have the same `playerID` and `yearID` and each table has the information that the other doesn't have, so they complement each other. I think it is one-to-one relationships but that needs to have inspected further. 90 | 91 | ## Exercises 13.4.6 92 | 93 | Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States: 94 | 95 | 96 | ```{r} 97 | flights %>% 98 | mutate(tot_delay = arr_delay + dep_delay) %>% 99 | group_by(dest) %>% 100 | summarize(avg_delay = mean(tot_delay, na.rm = TRUE)) %>% 101 | left_join(select(airports, faa, lon, lat), c("dest" = "faa")) %>% 102 | ggplot(aes(lon, lat, colour = avg_delay)) + 103 | borders("state") + 104 | geom_point(size = 2, alpha = 0.8) + 105 | xlim(c(-130, -65)) + 106 | ylim(c(20, 50)) + 107 | coord_quickmap() + 108 | viridis::scale_color_viridis() 109 | ``` 110 | 111 | 112 | Add the location of the origin and destination (i.e. the lat and lon) to flights. 113 | 114 | ```{r} 115 | flights %>% 116 | left_join(select(airports, faa, lat, lon), by = c("origin" = "faa")) %>% 117 | rename(lat_origin = lat, 118 | lon_origin = lon) %>% 119 | left_join(select(airports, faa, lat, lon), by = c("dest" = "faa")) %>% 120 | rename(lat_dest = lat, 121 | lon_dest = lon) %>% 122 | select(origin, dest, matches("lat|lon")) 123 | ``` 124 | 125 | (However, this could've been done with `gather` and avoid the double `left_join`) 126 | 127 | Is there a relationship between the age of a plane and its delays? 128 | 129 | ```{r} 130 | flights %>% 131 | mutate(tot_delay = arr_delay + dep_delay) %>% 132 | group_by(tailnum) %>% 133 | summarize(avg_delay = mean(tot_delay, na.rm = TRUE)) %>% 134 | left_join(select(planes, tailnum, year), by = "tailnum") %>% 135 | mutate(year = 2013 - year) %>% 136 | ggplot(aes(avg_delay, year)) + 137 | geom_point() + 138 | geom_smooth() 139 | ``` 140 | 141 | From a very preliminary view, there doesn't seem to be, although the some more older planes have very short delays and some younger planes have very high delays. This pattern however could be due to other things such as the origin/destionation. 142 | 143 | What weather conditions make it more likely to see a delay? 144 | 145 | ```{r} 146 | avg_del <- 147 | flights %>% 148 | mutate(tot_delay = arr_delay + dep_delay) %>% 149 | group_by(month, day) %>% 150 | summarize(avg_delay = mean(tot_delay, na.rm = TRUE)) 151 | 152 | avg_weather <- 153 | weather %>% 154 | group_by(month, day) %>% 155 | select(-hour) %>% 156 | summarize_at(vars(temp, humid, wind_speed, precip), mean, na.rm = TRUE) 157 | 158 | avg_del %>% 159 | left_join(avg_weather) %>% 160 | ungroup() %>% 161 | mutate(avg_delay = cut_width(avg_delay, 35)) %>% 162 | gather(weather, metrics, -(month:avg_delay)) %>% 163 | ggplot(aes(avg_delay, metrics)) + 164 | geom_boxplot() + 165 | facet_wrap(~ weather, scales = "free_y") 166 | ``` 167 | 168 | Humidity seems to be the one more related to delays, although this is in isolation. A more thorough approach would be to create a grid of all possible combinations of weather conditions and match them up with the delay and then compare the combination of weather conditions with the highest delays. 169 | 170 | What happened on June 13 2013? 
Display the spatial pattern of delays, and then use Google to cross-reference with the weather.

Here it is for June 13 2013:
```{r}
flights %>%
  mutate(tot_delay = arr_delay + dep_delay) %>%
  group_by(month, day, dest) %>%
  summarize(avg_delay = mean(tot_delay, na.rm = TRUE)) %>%
  filter(month == 6, day == 13) %>%
  left_join(select(airports, faa, lat, lon), by = c("dest" = "faa")) %>%
  ggplot(aes(lon, lat, colour = avg_delay)) +
  borders("state") +
  geom_point(size = 2, alpha = 0.8) +
  xlim(c(-130, -65)) +
  ylim(c(20, 50)) +
  coord_quickmap() +
  viridis::scale_color_viridis()
```

I didn't know if that was a big increase in delays so I cross-checked randomly with the next month:

```{r}
flights %>%
  mutate(tot_delay = arr_delay + dep_delay) %>%
  group_by(month, day, dest) %>%
  summarize(avg_delay = mean(tot_delay, na.rm = TRUE)) %>%
  filter(month == 7, day == 13) %>%
  left_join(select(airports, faa, lat, lon), by = c("dest" = "faa")) %>%
  ggplot(aes(lon, lat, colour = avg_delay)) +
  borders("state") +
  geom_point(size = 2, alpha = 0.8) +
  xlim(c(-130, -65)) +
  ylim(c(20, 50)) +
  coord_quickmap() +
  viridis::scale_color_viridis()
```

You can see big differences on the east coast, where the June average delay was much higher. Apparently there was severe weather on the east coast that day. See [here](https://www.washingtonpost.com/news/capital-weather-gang/wp/2013/06/14/june-13-2013-severe-weather-hypestorm-or-the-real-deal/?utm_term=.ec80520fb97c)

## Exercises 13.5.1

What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

```{r}
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(carrier, sort = TRUE)
```

AA and MQ don't seem to report tail numbers.

Filter flights to only show flights with planes that have flown at least 100 flights.

```{r}
flights %>%
  semi_join(count(flights, tailnum) %>% filter(n >= 100))
```


Combine fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.

```{r}
ten_common <-
  fueleconomy::common %>%
  ungroup() %>%
  arrange(-n) %>%
  top_n(10, n)

fueleconomy::vehicles %>%
  semi_join(ten_common)
```


Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

```{r}
fn <-
  flights %>%
  group_by(month, day) %>%
  summarize(day_delay = sum(arr_delay + dep_delay, na.rm = TRUE)) %>%
  mutate(twoday_delay = day_delay + lag(day_delay)) %>%
  arrange(-twoday_delay)

wea <-
  weather %>%
  group_by(month, day) %>%
  select(-hour) %>%
  summarize_at(vars(temp, humid, wind_speed, precip), mean, na.rm = TRUE)

fn %>%
  left_join(wea) %>%
  arrange(twoday_delay)
```

Very superficially, it seems as though precipitation was higher around the most-delayed days, and temperature slightly higher — though that is just from eyeballing the top/bottom 10 rows. Graphical inspection should yield more interesting patterns.
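For instance, a quick look at the joined data (a sketch reusing the `fn` and `wea` objects above):

```{r}
fn %>%
  left_join(wea, by = c("month", "day")) %>%
  ggplot(aes(precip, twoday_delay)) +
  geom_point() +
  geom_smooth()
```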
What does anti_join(flights, airports, by = c("dest" = "faa")) tell you? What does anti_join(airports, flights, by = c("faa" = "dest")) tell you?

```{r}
anti_join(flights, airports, by = c("dest" = "faa"))
# The flights whose destination is not present in the `airports` dataset.


anti_join(airports, flights, by = c("faa" = "dest"))
# The airports that never appear as a destination in the `flights` dataset.
```


You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.

```{r}
# The original attempt -- count() followed by filter(n() > 1) -- always comes
# back empty, because after counting, each tailnum-carrier pair is its own
# group of size one. Count distinct carriers per plane instead:
flights %>%
  filter(!is.na(tailnum)) %>%
  distinct(tailnum, carrier) %>%
  count(tailnum) %>%
  filter(n > 1)
```

A handful of planes show up under two carriers, so the hypothesis is rejected.

-------------------------------------------------------------------------------- /ch17.Rmd: --------------------------------------------------------------------------------

---
title: "Ch17"
output:
  pdf_document: default
  word_document: default
  html_document:
    df_print: paged
---

```{r}
library(tidyverse)
library(nycflights13)
```

## Exercises 21.2.1

Write for loops to:

```{r}
looper <- function(the_type, object, fun) {
  empty_vec = vector(the_type, length(object))

  for (every_column in seq_along(object)) {
    empty_vec[every_column] <- fun(object[[every_column]])
  }

  empty_vec
}
```


Compute the mean of every column in mtcars.

```{r}
looper("numeric", mtcars, mean)
```


Determine the type of each column in nycflights13::flights.

```{r}
# typeof() always returns a single string; class() can return several
# (e.g. flights$time_hour is c("POSIXct", "POSIXt")), which wouldn't fit
# into one slot of the output vector.
looper("character", flights, typeof)
```


Compute the number of unique values in each column of iris.

```{r}
looper("numeric", iris, function(x) length(unique(x)))
```

Generate 10 random normals for each of u = -10, 0, 10, and 100.

```{r}
ten_draws <- function(x) rnorm(10, mean = x)

map(c(-10, 0, 10, 100), ten_draws)
```


Think about the output, sequence, and body before you start writing the loop.

Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

```{r}
out <- ""
for (x in letters) {
  out <- stringr::str_c(out, x)
}

# Can be replaced with (collapsing letters itself -- out is
# already the finished string)

str_c(letters, collapse = "")

# ---

x <- sample(100)
sd <- 0
for (i in seq_along(x)) {
  sd <- sd + (x[i] - mean(x)) ^ 2
}
sd <- sqrt(sd / (length(x) - 1))

# can be replaced with

sd(x)

# ---

x <- runif(100)
out <- vector("numeric", length(x))
out[1] <- x[1]

for (i in 2:length(x)) {
  out[i] <- out[i - 1] + x[i]
}

# can be replaced with

cumsum(x)
```

Combine your function writing and for loop skills:

Write a for loop that prints() the lyrics to the children’s song "Alice the camel".

```{r}
song <-
  "Alice the camel has five humps.
Alice the camel has five humps.
Alice the camel has five humps.
So go, Alice, go.
"

split_song <- str_split(song, "\\.")[[1]]

for (i in split_song) {
  cat(i)
}
```
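A sketch that generalizes the verse over the number of humps (the function name is mine; the closing "Now Alice is a horse" line is from the usual version of the song):

```{r}
alice <- function(humps = 5) {
  hump_words <- c("no", "one", "two", "three", "four", "five")  # handles 0-5
  for (i in humps:0) {
    unit <- if (i == 1) "hump" else "humps"
    verse <- str_c("Alice the camel has ", hump_words[i + 1], " ", unit, ".")
    cat(rep(verse, 3), sep = "\n")
    if (i > 0) cat("So go, Alice, go.\n\n") else cat("Now Alice is a horse.\n")
  }
}

alice(5)
```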
Convert the nursery rhyme "ten in the bed" to a function. Generalise it to any number of people in any sleeping structure.

```{r}

nurs <- function(x) {

  empty_list <- vector("list", x)

  empty_list[[1]] <- '\n There was 1 in the bed \n And the little one said, "Alone at last!"'


  for (i in seq_len(x)[-1]) {
    the_str <-
      paste0('\n There were ', i,' in the bed \n And the little one said, \n "Roll over! Roll over!" \n So they all rolled over and one fell out \n')

    empty_list[[i]] <- the_str
  }

  for (i in rev(empty_list)) cat(i, sep = "\n")
}

nurs(10)
```


Convert the song "99 bottles of beer on the wall" to a function. Generalise to any number of any vessel containing any liquid on any surface.


```{r}
beers <- function(x, drink, where) {

  for (i in c(rev(seq_len(x)), 0)) {
    if (i == 0) {
      cat(glue::glue("No more bottles of {drink} on the {where}, no more bottles of {drink}.\n Go to the store and buy some more, {x} bottles of {drink} on the {where}."))
    } else {
      cat(glue::glue("{i} {if (i == 1) 'bottle' else 'bottles'} of {drink} on the {where}, {i} {if (i == 1) 'bottle' else 'bottles'} of {drink}.\n Take one down and pass it around, {if (i == 1) 'no more' else i - 1} bottles of {drink} on the {where}. \n \n \n"))
    }
  }
}

beers(10, 'beer', 'wall')
beers(10, 'wine', 'floor')
```


It’s common to see for loops that don’t preallocate the output and instead increase the length of a vector at each step:

```{r, results = 'asis'}
x <- rerun(10000, sample(sample(10, 1)))

microbenchmark::microbenchmark(
  grow = {
    output <- vector("integer", 0)
    for (i in seq_along(x)) {
      output <- c(output, length(x[[i]]))
    }
  }
  ,
  preallocate = {
    # allocate the full length up front (the original benchmark started from
    # a length-0 vector here too, so it was also growing, just via `[<-`)
    output <- vector("integer", length(x))
    for (i in seq_along(x)) {
      output[i] <- length(x[[i]])
    }
  })
```

How does this affect performance? Design and execute an experiment.

Growing the vector with `c()` is many times slower than preallocating!

## Exercises 21.3.5

Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, files <- dir("data/", pattern = "\\.csv$", full.names = TRUE), and now want to read each one with read_csv(). Write the for loop that will load them into a single data frame.

```{r, eval = FALSE}
all_csv <- c("one.csv", "two.csv")

all_dfs <- vector("list", length(all_csv))

# loop over positions: indexing an unnamed vector with the file-name strings
# themselves (as in `for (i in all_csv)`) would fail
for (i in seq_along(all_csv)) {
  all_dfs[[i]] <- read_csv(all_csv[[i]])
}

bind_rows(all_dfs)
```


What happens if you use for (nm in names(x)) and x has no names? What if only some of the elements are named? What if the names are not unique?

```{r}

no_names <- 1:5
some_names <- c("one" = 1, 2, "three" = 3)
repeated_names <- c("one" = 1, "one" = 2, "three" = 3)

for (nm in names(no_names)) print(identity(nm)) # nothing happens!
219 | for (nm in names(some_names)) print(identity(nm)) # the empty name get's filled with a "" 220 | for (nm in names(repeated_names)) print(identity(nm)) # everything get's printed out 221 | ``` 222 | 223 | 224 | Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, show_mean(iris) would print: 225 | 226 | ```{r, eval = FALSE} 227 | show_mean(iris) 228 | #> Sepal.Length: 5.84 229 | #> Sepal.Width: 3.06 230 | #> Petal.Length: 3.76 231 | #> Petal.Width: 1.20 232 | 233 | show_means <- function(x) { 234 | 235 | the_class <- vector("logical", length(x)) 236 | 237 | for (i in seq_along(x)) the_class[[i]] <- is.numeric(x[[i]]) 238 | 239 | x <- x[the_class] 240 | 241 | for (i in seq_along(x)) { 242 | cat(paste0(names(x)[i], ": ", round(mean(x[[i]]), 2)), fill = TRUE) 243 | } 244 | } 245 | 246 | show_means(iris) 247 | show_means(mtcars) 248 | ``` 249 | 250 | (Extra challenge: what function did I use to make sure that the numbers lined up nicely, even though the variable names had different lengths?) 251 | 252 | `cat` with `fill` set to `TRUE` 253 | 254 | What does this code do? How does it work? 255 | 256 | ```{r, eval = FALSE} 257 | 258 | trans <- list( 259 | disp = function(x) x * 0.0163871, 260 | am = function(x) { 261 | factor(x, labels = c("auto", "manual")) 262 | } 263 | ) 264 | for (var in names(trans)) { 265 | mtcars[[var]] <- trans[[var]](mtcars[[var]]) 266 | } 267 | ``` 268 | 269 | It converts `disp` and `am` by multiplying and then into a factor respectively. This is simply iterating over a list with functions, and applying in that same order to both variables. 270 | 271 | ## Exercises 21.4.1 272 | 273 | Read the documentation for apply(). In the 2d case, what two for loops does it generalise? 274 | 275 | It allows to loop over rows or columns and apply a function to that specific vector. 276 | 277 | Adapt col_summary() so that it only applies to numeric columns You might want to start with an is_numeric() function that returns a logical vector that has a TRUE corresponding to each numeric column. 278 | 279 | ```{r} 280 | col_summary <- function(x) { 281 | 282 | the_numeric <- vector("logical", length(x)) 283 | for (i in seq_along(x)) the_numeric[[i]] <- is.numeric(x[[i]]) 284 | 285 | x <- x[the_numeric] 286 | 287 | the_mean <- vector("numeric", length(x)) 288 | for (i in seq_along(x)) the_mean[[i]] <- round(mean(x[[i]]), 2) 289 | 290 | the_mean 291 | } 292 | 293 | col_summary(iris) 294 | ``` 295 | 296 | ## Exercises 21.5.3 297 | Write code that uses one of the map functions to: 298 | 299 | Compute the mean of every column in mtcars. 300 | 301 | ```{r} 302 | map_dbl(mtcars, mean) 303 | ``` 304 | 305 | 306 | Determine the type of each column in nycflights13::flights. 307 | 308 | ```{r} 309 | map(flights, class) 310 | ``` 311 | 312 | 313 | Compute the number of unique values in each column of iris. 314 | 315 | ```{r} 316 | map(iris, ~ length(unique(.))) 317 | ``` 318 | 319 | Generate 10 random normals for each of u = -10, 0, 10, and 100. 320 | 321 | ```{r} 322 | map(c(-10, 0, 10, 100), rnorm, n = 10) 323 | ``` 324 | 325 | 326 | How can you create a single vector that for each column in a data frame indicates whether or not it’s a factor? 327 | 328 | ```{r} 329 | map_lgl(iris, is.factor) 330 | ``` 331 | 332 | 333 | What happens when you use the map functions on vectors that aren’t lists? What does map(1:5, runif) do? Why? 334 | 335 | ```{r} 336 | map(1:5, runif) 337 | ``` 338 | It iterates the same way. 
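Concretely, that call just builds this list by hand — each element of `1:5` becomes the first argument (`n`) of `runif()`:

```{r}
list(runif(1), runif(2), runif(3), runif(4), runif(5))
```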
`map()` iterates over the *elements* of whatever vector it is given: for a data frame the elements are its columns, and for an atomic vector like `1:5` each element is passed to the function in turn — always returning a list.

What does map(-2:2, rnorm, n = 5) do? Why? What does map_dbl(-2:2, rnorm, n = 5) do? Why?

`map(-2:2, rnorm, n = 5)` runs `rnorm` with mean `-2`, then `-1`, ..., `2`, producing a vector of 5 numbers for each call, and returns a list of those vectors. Conversely, `map_dbl(-2:2, rnorm, n = 5)` insists that every result be a *single* double — since each call returns five numbers, it throws an error.

Rewrite map(x, function(df) lm(mpg ~ wt, data = df)) to eliminate the anonymous function.

```{r, eval = F}
map(x, function(df) lm(mpg ~ wt, data = df))

map(x, ~ lm(mpg ~ wt, data = .))
```

## Exercises 21.9.3

Implement your own version of every() using a for loop. Compare it with purrr::every(). What does purrr’s version do that your version doesn’t?

```{r}
my_every <- function(.x, .f, ...) {
  the_test <- vector("logical", length(.x))
  for (i in seq_along(.x)) the_test[[i]] <- .f(.x[[i]], ...)
  all(the_test, na.rm = TRUE)
}

my_every(mtcars, is.numeric)
every(mtcars, is.numeric)
```
`every()` also accepts one-sided formulas as functions — `.p` in its docs is simply the name of the predicate argument, which is converted with `as_mapper()`. It does not, however, validate the predicate's output very strictly:

```{r}
every(1:10, ~ rep(TRUE, 9)) # should throw an error?
```

Create an enhanced col_sum() that applies a summary function to every numeric column in a data frame.

```{r}
col_sum <- function(x, f, ...) {
  x <- keep(x, is.numeric)
  map_dbl(x, f, ...)
}

col_sum(mtcars, mean, trim = 0.8)
col_sum(iris, median)
```



A possible base R equivalent of col_sum() is:

```{r}
col_sum3 <- function(df, f) {
  is_num <- sapply(df, is.numeric)
  df_num <- df[, is_num]

  sapply(df_num, f)
}
```

But it has a number of bugs as illustrated with the following inputs:

```{r, eval = FALSE}
df <- tibble(
  x = 1:3,
  y = 3:1,
  z = c("a", "b", "c")
)
# OK
col_sum3(df, mean)
# Has problems: don't always return numeric vector
col_sum3(df[1:2], mean)
col_sum3(df[1], mean)
col_sum3(df[0], mean)
```

What causes the bugs?

`sapply()` is the culprit: its return type depends on its input. With at least one column it returns a logical vector, but on a zero-column data frame (`df[0]`) it returns an empty *list*, so the subsetting and the final `sapply(df_num, f)` stop returning a consistent named numeric vector.
-------------------------------------------------------------------------------- /Ch9.Rmd: --------------------------------------------------------------------------------

---
title: "Ch9"
output:
  html_document:
    df_print: paged
---

```{r}
library(tidyverse)
```


# Exercise 12.2.1

Using prose describe how each of variables and observations are organized in the tables.

Table 1:
```{r}
table1
```

Each column is its own variable: `year` contains only years and `country` only countries. By contrast, if we had a dataset with `country` plus separate `males` and `females` columns, that would be untidy — at least in principle — because those two columns are really values of a single `gender` variable.
This is a bit tricky because nothing says that you can't use the gender columns that way. But for easy use in R it's usually better to work with `tidy` data. But who says that `cases` and `populations` are not the same thing? We could argue that they should be in the same column as it is now. 23 | 24 | ```{r} 25 | table2 26 | ``` 27 | 28 | In principle, this is also not a `tidy` dataset, although that's debatable. If we had something like different sicknesses in the type column then this would be tidy. But `type` in this case should be different columns because they measure different things. This dataset is organized in a way that years are nested within countries and then each type is instead within years. 29 | 30 | ```{r} 31 | table3 32 | ``` 33 | This is clearly non-tidy because we can't work with values such as the `rate` column. If this column would be the result of the operation, then this would be a tidy dataset. 34 | 35 | ```{r} 36 | table4a 37 | ``` 38 | This is clearly not a tidy dataset because years, which are the same thing, are in different columns. This is the same case as the gender example I outlined above. These two columns should be *one* variable and the values should be a separate column. This dataset is only for the type `cases` 39 | 40 | ```{r} 41 | table4b 42 | ``` 43 | 44 | Same as above, but this is for `population`. 45 | 46 | Compute the `rate` for `table2`, and `table4a` and `table4b`. 47 | 48 | For table 2 49 | ```{r} 50 | fyear_cases <- 51 | table2 %>% 52 | filter(year == 1999, 53 | type == "cases") 54 | 55 | fyear_pop <- 56 | table2 %>% 57 | filter(year == 1999, 58 | type == "population") 59 | 60 | rate_99 <- 61 | bind_cols(fyear_cases, fyear_pop) %>% 62 | mutate(rate = count/count1) %>% 63 | select(-ends_with("1"), -type) 64 | 65 | 66 | fyear_cases <- 67 | table2 %>% 68 | filter(year == 2000, 69 | type == "cases") 70 | 71 | fyear_pop <- 72 | table2 %>% 73 | filter(year == 2000, 74 | type == "population") 75 | 76 | rate_00 <- 77 | bind_cols(fyear_cases, fyear_pop) %>% 78 | mutate(rate = count/count1) %>% 79 | select(-ends_with("1"), -type) 80 | 81 | bind_rows(rate_99, rate_00) 82 | ``` 83 | 84 | For table 4a and 4b 85 | 86 | ```{r} 87 | table4a %>% 88 | mutate(rate_99 = `1999` / table4b$`1999`, 89 | rate_00 = `2000` / table4b$`2000`) 90 | ``` 91 | 92 | In a way, it's more intuitive to work with `table2` because we use the filtering techniques to understand the operations. But with `table4a` is more succient. However, if we wanted to turn the `table4a/4b` result to a tidier version, it would be more tedious (withut `gather` and such functions.) 93 | 94 | Recreate the plot showing change in cases over time using `table2` instead of `table1`. What do you need to do first? 95 | 96 | ```{r} 97 | table2 %>% 98 | filter(type == "cases") %>% 99 | ggplot(aes(year, count, group = country, colour = country)) + 100 | geom_point() + 101 | geom_line() 102 | ``` 103 | 104 | ## 12.3.3 Exercises 105 | 106 | Why are gather() and spread() not perfectly symmetrical? 107 | 108 | ```{r} 109 | stocks <- tibble( 110 | year = c(2015, 2015, 2016, 2016), 111 | half = c( 1, 2, 1, 2), 112 | return = c(1.88, 0.59, 0.92, 0.17) 113 | ) 114 | stocks %>% 115 | spread(year, return) %>% 116 | gather("year", "return", `2015`:`2016`) 117 | ``` 118 | Both spread() and gather() have a convert argument. What does it do? 119 | 120 | Because the `key` variable is actually the column names, and is thus moved as character column. 
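A sketch of what `convert` buys you, reusing the `stocks` example above:

```{r}
stocks %>%
  spread(year, return) %>%
  gather("year", "return", `2015`:`2016`, convert = TRUE) %>%
  pull(year) %>%
  class()  # "integer" rather than "character"
```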
It would be unwise for gather to treat column names as numerics, logicals, or something else automatically. However, you can ask for it with `convert = TRUE`, which will try to convert the `key` column to its natural class (as in the sketch above).

Why does this code fail?

```{r}
table4a %>%
  gather(1999, 2000, key = "year", value = "cases")
```

Because `gather` can't find the column names: you can't refer to columns named with numbers in R without quoting them with backticks.

```{r}
table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
```


Why does spreading this tibble fail? How could you add a new column to fix the problem?

```{r}
people <- tribble(
  ~name,             ~key,    ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)
```

Because Phillip Woods has two values of age: spreading would have to put two ages into a single cell for him, which doesn't make sense, so `spread` complains about duplicate identifiers. We need to add a unique id column that marks his second age as a distinct observation.

```{r}
people %>%
  mutate(unique_id = c(1, 2, 2, 3, 3)) %>%
  select(unique_id, everything()) %>%
  spread(key, value)
```

Tidy the simple tibble below. Do you need to spread or gather it? What are the variables?

```{r}
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)
```

The main objective of analysis here is whether pregnant or not (because males cannot be pregnant), so I would go for `gather`ing the gender columns rather than spreading the pregnant column.

```{r}
preg %>%
  gather(gender, values, -pregnant)

# the other way around:
preg %>%
  gather(gender, values, -pregnant) %>%
  spread(pregnant, values)
```

## 12.4.3 Exercises

What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.

```{r}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"), fill = "warn")

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
  separate(x, c("one", "two", "three"))
```

It's simple: the elements of x split into different numbers of pieces, but we specify a fixed number of columns. `fill` handles rows with *too few* pieces and has three values: `"warn"` (the default) fills with NA from the right and emits a warning; `"right"` does the same silently; `"left"` fills with NA from the left, so the pieces land in the last columns. Here I specify a fourth column so the three-piece rows need filling:

```{r}
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three", "four"))

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three", "four"), fill = "right")

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three", "four"), fill = "left")
```
I've deleted the fourth column again to see how `extra` works. `extra`, on the other hand, deals with rows that have *too many* pieces, either dropping or merging them: `warn` drops the extra piece and emits a warning message.
208 | `drop` does the same thing but without a warning and `merge` merges the extra character to it's closest end. No aparent option to `merge` with the first column rather than the last. 209 | 210 | ```{r} 211 | tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 212 | separate(x, c("one", "two", "three"), extra = "warn") 213 | 214 | tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 215 | separate(x, c("one", "two", "three"), extra = "drop") 216 | 217 | tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 218 | separate(x, c("one", "two", "three"), extra = "merge") 219 | ``` 220 | 221 | 222 | Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE? 223 | 224 | Because `unite` and `separate` receive columns and create new ones, `remove` allows you to remove the original columns that you unite/separate on. You might want to leave them as they are if you're checking whether the transformation was done correctly. 225 | 226 | Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite? 227 | 228 | Because you can separate differently. Examples below: 229 | 230 | ```{r} 231 | df_sep <- data.frame(x = c(NA, "a-b", "a-d", "b-c", "d-e")) 232 | df_extract <- data.frame(x = c(NA, "ap.b", "aa/d", "b.c", "d-ee")) 233 | 234 | # This is easy with separate 235 | df_sep %>% separate(x, c("new", "old"), sep = 1) 236 | df_sep %>% separate(x, c("new", "old"), sep = "-") 237 | 238 | # Here we can define 2 or more groups to separate the more complex string 239 | df_extract %>% extract(x, c("new", "old"), regex = "(.*)[:punct:](.*)") 240 | ``` 241 | However, I don't understand it completely because I think I could do the same as above with `separate` by just providing a regular expression. 242 | 243 | 244 | ## Exercise 12.5.1 245 | Compare and contrast the fill arguments to spread() and complete(). 246 | 247 | The `fill` argument in `spread()` will replace ALL missing values regardless of columns with the same value. The `fill` argument of `complete()` accepts a list where each slot is the missing value for each column. So missing values per column are customizable to any chosen missing. 248 | 249 | What does the direction argument to fill() do? 250 | 251 | If we have this dataset 252 | 253 | ```{r} 254 | treatment <- tribble( 255 | ~ person, ~ treatment, ~response, 256 | "Derrick Whitmore", 1, 7, 257 | NA, 2, 10, 258 | NA, 3, 9, 259 | "Katherine Burke", 1, 4 260 | ) 261 | ``` 262 | 263 | We have two missing values in column `person`. We can carry over the value `Katherine` to replace the missing values or we could take `Derrick` to replace the missing values. `.direction` does exactly that by specifying either `down` or `up`. 264 | 265 | Ex 1. 266 | ```{r} 267 | fill(treatment, person, .direction = "up") 268 | ``` 269 | 270 | Ex 2. 271 | ```{r} 272 | fill(treatment, person, .direction = "down") 273 | ``` 274 | 275 | ## Exercises 12.6.1 276 | 277 | In this case study I set na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero? 278 | 279 | A proper analysis would not exclude the missing values because that's information! It is the presence of an absence. So for our purposes it is reasonable, but for appropriate descriptive statistics it is important to report the number of missing values. 
How many implicit missing values are there? We can use `complete()` on the `gather`ed dataset and see which rows it adds:

```{r}
first <-
  who %>%
  gather(
    new_sp_m014:newrel_f65,
    key = "key",
    value = "cases"
  )

second <-
  first %>% complete(country, year, key)

# The rows of `second` with no match in `first` are the implicit missings.
# (Note the direction: anti_join(first, second) is empty *by construction*,
# since complete() only ever adds rows.)
second %>%
  anti_join(first, by = c("country", "year", "key"))
```

So there are implicit missing values after all: not every country reports in every year, so `complete()` adds country–year–key combinations that have no row in the raw data. And the difference between an `NA` and a `0` is that 0 means zero cases were reported for that cell, but `NA` could mean that there were (say) 20 cases that simply weren't reported.

What happens if you neglect the mutate() step? (`mutate(key = stringr::str_replace(key, "newrel", "new_rel"))`)

Well, if we have `new_sp` and `newrel` and we separate on `_`, we would get a column mixing `new` and `newrel` while the other column would only have `sp`. If we replace `newrel` with `new_rel`, the same pattern is consistent across the whole column.

I claimed that iso2 and iso3 were redundant with country. Confirm this claim.

```{r}
who %>%
  count(country, iso2, iso3) %>%
  count(country, name = "nn") %>%
  filter(nn > 1)
```
If a country were paired with more than one iso2/iso3 combination, it would show up here with nn > 1 — nothing does, so they are redundant.

For each country, year, and sex compute the total number of cases of TB. Make an informative visualisation of the data.

```{r}
who1 <-
  who %>%
  gather(
    new_sp_m014:newrel_f65,
    key = "key",
    value = "cases",
    na.rm = TRUE
  ) %>%
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, c("sex", "age"), sep = 1)
```

```{r}
who1 %>%
  group_by(country, year, sex) %>%
  summarize(n = sum(cases)) %>%
  ggplot(aes(year, n, group = country)) +
  geom_line(alpha = 2/4) +
  facet_wrap(~ sex)
```


-------------------------------------------------------------------------------- /ch11.Rmd: --------------------------------------------------------------------------------

---
title: "Ch11"
output:
  html_document:
    df_print: paged
---

```{r}
library(tidyverse)
# stringr now belongs to the tidyverse core
```


## Exercises 14.2.5

In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

`paste` and `paste0` do the same thing, but `paste0` uses `sep = ""` while `paste` defaults to `sep = " "`.

`str_c` is the equivalent `stringr` function.

```{r}
str_c(c("a", "b"), collapse = ", ")
```

```{r}
str_c(c("a", "b"), NA)
# In `str_c`, anything pasted with an NA becomes NA

paste0(c("a", "b"), NA)
# But in paste0, NA gets converted to the string "NA" and pasted together.
# To mimic that behaviour in str_c, replace the NA with a string first:
str_c(c("a", "b"), str_replace_na(NA))
```

In your own words, describe the difference between the sep and collapse arguments to str_c().
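A two-line illustration before the prose answer:

```{r}
str_c(c("a", "b"), c("x", "y"), sep = "-")                  # "a-x" "b-y"
str_c(c("a", "b"), c("x", "y"), sep = "-", collapse = ", ") # one string: "a-x, b-y"
```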
Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

```{r}
uneven <- "one"
even <- "thre"

# With an even number of characters there is no single middle;
# this picks the left one of the two middle characters
str_sub(even, str_length(even) / 2, str_length(even) / 2)

# With an odd number of characters, round up to land on the true middle
str_sub(uneven, ceiling(str_length(uneven) / 2), ceiling(str_length(uneven) / 2))
```
For an even number of characters you have to make a choice: take one of the two middle characters (as above) or return both.

What does str_wrap() do? When might you want to use it?

```{r}
str_wrap(
  "Hey, so this is one paragraph
  I'm interested in writing but I
  think it might be too long. I just
  want to make sure this is in the right format",
  width = 60, indent = 2, exdent = 1
) %>% cat()
```

This might be interesting to output messages while running scripts or in packages.

What does str_trim() do? What’s the opposite of str_trim()?

`str_trim()` removes whitespace from the start and end of a string (`side = "both"` by default). Its opposite is `str_pad()`, which pads a string out to a fixed width.
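A quick demonstration of both (`str_squish()`, which also collapses interior whitespace, is a close relative):

```{r}
str_trim("  hey  ")                       # "hey": whitespace stripped from both sides
str_trim("  hey  ", side = "left")        # "hey  "
str_pad("hey", width = 7, side = "both")  # "  hey  ": the opposite operation
```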
Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

```{r}
str_paster <- function(x, collapse = ", ") {
  n <- length(x)
  if (n <= 1) return(str_c(x, collapse = ""))       # "" for length 0, the element itself for 1
  if (n == 2) return(str_c(x, collapse = " and "))  # no comma with only two elements
  str_c(str_c(x[-n], collapse = collapse), ", and ", x[n])
}

tr <- letters[1:3]
str_paster(tr)  # "a, b, and c"

tr <- letters[1:2]
str_paster(tr)  # "a and b"

tr <- letters[1]
str_paster(tr)  # "a"

tr <- letters[0]
str_paster(tr)  # ""
```
It always returns a character, even if the vector is empty.

## 14.3.1.1 Exercises

Explain why each of these strings don’t match a \: "\", "\\", "\\\".

"\" doesn't even parse: inside a string, `\` starts an escape sequence, so a single literal backslash must be written "\\".
"\\" parses to the regular expression \, which is incomplete: within a regex, `\` itself starts an escape, so on its own it matches nothing.
"\\\" doesn't parse, for the same reason as the first case. The regex that matches a literal backslash is \\, which has to be written as the string "\\\\".

How would you match the sequence "'\?

str_view("\"'\\", "\"'\\\\")

What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

It matches strings like ".a.b.c": every `\.` matches a literal dot, and every `.` matches any character except a newline.

```{r}
str_view(".a.b.c", "\\..\\..\\..")
```

## Exercises 14.3.2.1

How would you match the literal string "$^$"?

By escaping both metacharacters: the regexp `\$\^\$`, written as the string `"\\$\\^\\$"`.

Given the corpus of common words in stringr::words, create regular expressions that find all words that:

Start with "y".
```{r}
str_bring <- function(string, pattern) {
  string[str_detect(string, pattern)]
}

str_bring(words, "^y")
```

End with "x"

```{r}
str_bring(words, "x$")
```

Are exactly three letters long. (Don’t cheat by using str_length()!)

```{r}
str_bring(words, "^.{3}$")
```

Have seven letters or more.

```{r}
str_bring(words, "^.{7,}$")
```

Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.

## Exercises 14.3.3.1

Create regular expressions to find all words that:

Start with a vowel.

```{r}
str_bring(words, "^[aeiou]")
```

That only contain consonants. (Hint: think about matching “not”-vowels.)

```{r}
str_bring(words, "^[^aeiou]+$")
```

End with ed, but not with eed.

```{r}
str_bring(words, "[^e]ed$")
```

End with ing or ise.

```{r}
str_bring(words, "i(ng|se)$")
```

Empirically verify the rule "i before e except after c".

```{r}
# Look for violations: "cie" (i before e after c) or "ei" not after c
str_bring(words, "cie|[^c]ei")
```
Any match here is an exception to the rule (words like "society", with its "cie", break it), so the rule is far from absolute.

Is "q" always followed by a "u"?

```{r}
str_bring(words, "q[^u]")
```

Yes!

Write a regular expression that matches a word if it’s probably written in British English, not American English.

A bit hard. The closest is: "ou|ise$|ae|oe|yse$"

```{r}
str_bring(words, "ou|ise$|ae|oe|yse$")
```

But see: https://jrnold.github.io/r4ds-exercise-solutions/strings.html

Create a regular expression that will match telephone numbers as commonly written in your country.

```{r}
x <- c("34697382009", "18093438932", "18098462020")
str_bring(x, "^34.{9}$")
```
or

```{r}
x <- c("123-456-7890", "1235-2351")
str_bring(x, "\\d{3}-\\d{3}-\\d{4}")
```

## Exercises 14.3.4.1

Describe the equivalents of ?, +, * in {m,n} form.

? is {0,1}
+ is {1,}
* is {0,}

Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

^.*$

Matches any string, including the empty string.

"\\{.+\\}"

Matches any string containing curly braces with at least one character between them, e.g. "{a}".

\d{4}-\d{2}-\d{2}

Matches numbers in the format dddd-dd-dd, e.g. a date such as 2014-01-01.

"\\\\{4}"

It matches four backslashes.

```{r}
str_bring("\\\\\\\\", "\\\\{4}")
```

Create regular expressions to find all words that:

Start with three consonants.

```{r}
str_bring(words, "^[^aeiou]{3}")
```

Have three or more vowels in a row.

```{r}
str_bring(words, "[aeiou]{3,}")
```


Have two or more vowel-consonant pairs in a row.

```{r}
str_bring(words, "([aeiou][^aeiou]){2,}")
```

Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.


## Exercises 14.3.5.1
Describe, in words, what these expressions will match:

(.)\1\1

The same character repeated three times in a row.

```{r}
str_bring(c("aaa", "aaba"), "(.)\\1\\1")
```

"(.)(.)\\2\\1"

Two characters followed by the same two characters in reverse order

```{r}
str_bring(c("abba", "aabb"), "(.)(.)\\2\\1")
```


(..)\1

A pair of characters repeated twice in a row

```{r}
str_bring(c("abab", "abba"), "(..)\\1")
```


"(.).\\1.\\1"

A character appearing three times, with one character in between each repetition, e.g. abaca

```{r}
str_bring(c("abaca", "aabb"), "(.).\\1.\\1")
```


"(.)(.)(.).*\\3\\2\\1"

Three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order.
```{r}
str_bring(c("abc312131cba", "aaabbbccc"), "(.)(.)(.).*\\3\\2\\1")
```

Construct regular expressions to match words that:

Start and end with the same character.

```{r}
str_bring(words, "^(.).*\\1$")
```

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

```{r}
str_bring(words, "(..).*\\1")
```


Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)


```{r}
str_bring(words, "([a-z]).*\\1.*\\1")
```


## Exercises 14.4.2

For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

Find all words that start or end with x.

```{r}
str_bring(words, "^x|x$")
```

or

```{r}
start_r <- str_detect(words, "^x")
end_r <- str_detect(words, "x$")

words[start_r | end_r]
```


Find all words that start with a vowel and end with a consonant.

```{r}
str_bring(words, "^[aeiou].*[^aeiou]$")
```

or

```{r}
start_r <- str_detect(words, "^[aeiou]")
end_r <- str_detect(words, "[^aeiou]$")

words[start_r & end_r]
```


Are there any words that contain at least one of each different vowel?

```{r}
vowels <-
  str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") &
  str_detect(words, "o") & str_detect(words, "u")

words[vowels]
```


What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)

```{r}
vowels <- str_count(words, "[aeiou]")

words[which.max(vowels)]
```

```{r}
words[which.max(vowels / str_length(words))]
```

## Exercises 14.4.3.1

In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.

```{r}
colors <- c(
  "red", "orange", "yellow", "green", "blue", "purple"
)

color_match <- str_c(str_c("\\b", colors, "\\b"), collapse = "|")

sentences[str_count(sentences, color_match) > 1]
```


From the Harvard sentences data, extract:

The first word from each sentence.

```{r}
str_extract(sentences, "^[a-zA-Z]+")
```

All words ending in ing.

```{r}
end_ing <- str_extract(sentences, "\\b[a-zA-Z]+ing\\b")
end_ing[!is.na(end_ing)]
```


All plurals.

```{r}
unique(unlist(str_extract_all(sentences, "\\b[a-zA-Z]{3,}s\\b"))) %>%
  head()
```

## Exercises 14.4.4.1

Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
```{r}
numbers <- c(
  "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"
)

number_regexp <- str_c("(", str_c(numbers, collapse = "|"), ")")

regexp <- str_c(number_regexp, " ([^ ]+)")
all_match <- str_match(sentences, regexp)

all_match[complete.cases(all_match), ] %>% head()
```

Find all contractions. Separate out the pieces before and after the apostrophe.

```{r}
contract_re <- "([a-zA-Z]+)'([a-zA-Z]+)"
contract <- sentences[str_detect(sentences, contract_re)]

str_match(contract, contract_re)
```

## Exercises 14.4.5.1
Replace all forward slashes in a string with backslashes.

```{r}
str_replace_all(c("hey this is a /", "and another / in the pic"),
                "/", "\\\\")
```


Implement a simple version of str_to_lower() using replace_all().

```{r}
my_str_lower <- function(x) {
  lower_let <- letters
  names(lower_let) <- LETTERS

  str_replace_all(x, lower_let)
}

identical(my_str_lower(sentences), str_to_lower(sentences))
```


Switch the first and last letters in words. Which of those strings are still words?

```{r}
swapped <- str_replace_all(words, "^([a-z])(.*)([a-z])$", c("\\3\\2\\1"))
head(swapped)
```
To see which of the results are still words, intersect them with the original corpus: `intersect(swapped, words)`.

## Exercises 14.4.6.1
Split up a string like "apples, pears, and bananas" into individual components.

```{r}
str_split("apples, pears, and bananas", boundary("word"))[[1]]
```


Why is it better to split up by boundary("word") than " "?

Because it takes care of punctuation such as commas and periods.

What does splitting with an empty string ("") do? Experiment, and then read the documentation.

```{r}
str_split("apples, pears, and bananas", "")[[1]]
```
It splits the string into individual characters.


## Exercises 14.5.1
How would you find all strings containing \ with regex() vs. with fixed()?

```{r}
str_ing <- c("contains \\", "and \\ another", "ad")

str_subset(str_ing, regex("\\\\"))
str_subset(str_ing, fixed("\\"))  # fixed() ignores regular expression syntax
                                  # and matches the raw bytes directly
```


What are the five most common words in sentences?
```{r}
unlist(str_split(sentences, boundary("word"))) %>%
  str_to_lower() %>%
  tibble() %>%
  set_names("words") %>%
  count(words) %>%
  arrange(desc(n)) %>%
  head()
```

## Exercises 14.7.1
Find the stringi functions that:

Count the number of words.

stri_count_words()

Find duplicated strings.

stri_duplicated()

Generate random text.

the stri_rand_* functions (e.g. stri_rand_strings(), stri_rand_lipsum(), stri_rand_shuffle())

How do you control the language that stri_sort() uses for sorting?

With the `locale` argument, which is passed through `...` to the collator options (`stri_opts_collator()`).
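For instance (a small sketch along the lines of the stringi documentation; it assumes those locales are available on your system):

```{r}
library(stringi)

string <- c("hladny", "chladny")
stri_sort(string, locale = "pl_PL")  # Polish: "ch" sorts as "c" then "h"
stri_sort(string, locale = "sk_SK")  # Slovak: "ch" is its own letter, sorted after "h"
```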
--------------------------------------------------------------------------------
/Ch5.Rmd:
--------------------------------------------------------------------------------

---
title: "Ch5"
output:
  html_document:
    df_print: paged
---

```{r, include = F}
knitr::opts_chunk$set(echo = TRUE)
library(GGally)
library(nycflights13)
library(arm)
library(tidyverse)
```

# 7.3.4 Exercises

Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

```{r}
diamonds %>%
  gather(key = dist, vals, x, y, z) %>%
  ggplot(aes(vals, colour = dist)) +
  geom_freqpoly(bins = 100)
```

One thing that is pretty obvious, but perhaps hard to grasp at first, is that the distributions of x and y are pretty much the same. In fact, the same graph with `bins = 30` won't show you the x distribution at all because it overlaps perfectly. The correlation between the two is `cor(diamonds$x, diamonds$y)`.

If we round each measurement to the nearest mm, `mean(with(diamonds, round(x, 0) == round(y, 0)))` of the x–y pairs are identical. So the length (x) moves essentially one-to-one with the width (y).

```{r}
diamonds %>%
  filter(y < 30) %>%
  select(x, y, z) %>%
  ggpairs()
```

Yet the relationship of x and y with z is almost flat, as expected. That is, after excluding 2 diamonds which had unreasonable values.

Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

```{r}
## TODO: Fix the Y and X axis to be able to specify the cutting point in the distribution.
source("http://peterhaschke.com/Code/multiplot.R")

graph <- map(seq(50, 1000, 100),
             ~ ggplot(diamonds, aes(x = price)) +
               geom_histogram(bins = .x) +
               labs(x = NULL, y = NULL) +
               scale_x_continuous(labels = NULL) +
               scale_y_continuous(labels = NULL))

multiplot(plotlist = graph)
```
The distribution decreases, as expected, but there's a gap in it showing that most prices fall either above or below a certain threshold.

How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

```{r}
diamonds %>%
  filter(carat %in% c(0.99, 1)) %>%
  count(carat)
```

It could be that 0.99 is just a typo repeated 23 times. A more plausible explanation is that sellers round up: a 0.99 carat diamond is worth noticeably less than a 1 carat one, so cutters aim for (or report) the round number.

Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

```{r}
diamonds %>%
  ggplot(aes(y)) +
  geom_histogram() +
  coord_cartesian(ylim = c(0, 50))
# coord_cartesian() only zooms: the bins are computed on the full data.
# Note how xlim() below, in contrast, deletes the observations outside the limits.

diamonds %>%
  ggplot(aes(y)) +
  geom_histogram() +
  xlim(c(0, 60)) +
  coord_cartesian(ylim = c(0, 50))

# Also note how xlim and ylim inside coord_cartesian don't exclude the data
diamonds %>%
  ggplot(aes(y)) +
  geom_histogram(bins = 30) +
  coord_cartesian(xlim = c(2, 60), ylim = c(0, 50))

```

# 7.4.1 Exercises

What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

```{r}
diamonds %>%
  ggplot(aes(price)) +
  geom_histogram(bins = 1000)
```
In a histogram, `NA`s are simply dropped (with a warning), since there is no bin to put them in; the gaps you see above are empty bins, not missing values.

In a bar chart of a categorical variable, `NA` is treated as a category of its own and gets its own bar. Here `cyl` is numeric, so the `NA` is dropped with a warning instead. The difference is that `NA` can be a meaningful level of a discrete variable, but it has no position on a continuous axis.

```{r}
mtcars[1, 2] <- NA

mtcars %>%
  ggplot(aes(cyl)) +
  geom_bar()
```

What does na.rm = TRUE do in mean() and sum()?

It removes the `NA`s from the calculations.

7.5.1.1 Exercises

Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

```{r}
fl <-
  flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  )

fl %>%
  ggplot(aes(sched_dep_time, ..density.., colour = cancelled)) +
  geom_freqpoly(binwidth = 1/2)

fl %>%
  ggplot(aes(sched_dep_time, colour = cancelled)) +
  geom_density()

fl %>%
  ggplot(aes(cancelled, sched_dep_time)) +
  geom_boxplot()
```

What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

```{r}
display(lm(price ~ ., diamonds), detail = T)
# In a dirty way, carat

# Let's confirm the variation in carat for cut.

diamonds %>%
  ggplot(aes(cut, carat)) +
  geom_boxplot()

# It looks like cut and carat are weakly negatively correlated, with the Fair
# diamonds having the greatest carat.

diamonds %>%
  ggplot(aes(carat, colour = cut)) +
  geom_density(position = "dodge")

# It does look like the Fair diamonds have the highest average carat, but only by a little.

diamonds %>%
  group_by(cut) %>%
  summarise(cor(carat, price))
```

It does look like carat and price are highly correlated between, as well as within, the quality levels of the diamonds.

Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?

```{r}
library(ggstance)

diamonds %>%
  ggplot(aes(cut, carat)) +
  geom_boxplot() +
  coord_flip()

diamonds %>%
  ggplot(aes(carat, cut)) +
  geom_boxploth()
```

It's exactly the same plot but less verbose with `geom_boxploth()`. Note that because `geom_boxploth()` is already flipped, the variable order changes as well: the continuous variable goes on the x axis and the categorical one on the y axis.
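Worth noting: assuming a reasonably recent ggplot2 (3.3.0 or later), you no longer need ggstance for this. Geoms auto-detect their orientation when the discrete variable is mapped to y:

```{r}
diamonds %>%
  ggplot(aes(carat, cut)) +
  geom_boxplot()  # horizontal, with no extra package
```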
One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?

```{r}
library(lvplot)

p <- ggplot(diamonds, aes(cut, price, colour = ..LV..))
p + geom_lv()

p <- ggplot(diamonds, aes(cut, carat, fill = ..LV..))
p + geom_lv()
```

This plot is useful for getting a more detailed description of the tails of a distribution. It works because each `lv` box has both height and width. So, for example, we can see that the upper tail for `Fair` has more values than the upper tail for `Ideal`. Similarly, the distribution of `Ideal` thins out, both in carat and in the number of outliers, as it moves towards the upper tail. That information is very difficult to get visually from a boxplot.

Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?

```{r}
diamonds %>%
  ggplot(aes(cut, price)) +
  geom_violin()

diamonds %>%
  ggplot(aes(price)) +
  geom_histogram() +
  facet_wrap(~ cut, scale = "free_y", nrow = 1)

diamonds %>%
  ggplot(aes(price)) +
  geom_freqpoly(aes(colour = cut))
```

The violin plot is extremely useful, at least to me, for comparing distributions. Histograms are trickier to compare, although they can help when the y axis is allowed to vary across plots. Freqpolys are misleading here because the frequency of each category greatly influences the visual display. In both plots we'd have to adjust for it, by freeing the y axis (histogram) or by plotting `..density..` on the y axis (freqpoly).

If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

It provides `geom_quasirandom()`, which offsets points with several selectable methods ("quasirandom", "pseudorandom", "tukey", ...) so that they trace the silhouette of the distribution, and `geom_beeswarm()`, which arranges the points deterministically, offsetting each one just enough to avoid overlap.

# 7.5.2.1 Exercises

How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?

By calculating percentages and also showing the n.

```{r}
diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(color, cut, fill = perc)) +
  geom_tile()
```

Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

One thing that makes it extremely difficult to read is that it's hard to see differences in dep_delay because the highest values drive the whole colour palette upwards. Also, many destinations have missing values in some months. Two fixes suggest themselves: exclude destinations with missing values for now, and summarise, standardize or rescale dep_delay so that we can spot differences.
```{r}
library(viridis)
library(forcats)
flights %>%
  ggplot(aes(x = month, y = dest, fill = dep_delay)) +
  geom_tile()

flights %>%
  mutate(tot_delay = dep_delay + arr_delay) %>%
  filter(tot_delay > 0) %>%
  group_by(dest, month) %>%
  summarize(dep_del_dev = mean(tot_delay, na.rm = T)) %>%
  filter(n() == 12) %>%
  ungroup() %>%
  ggplot(aes(x = factor(month), y = fct_reorder(dest, dep_del_dev), fill = dep_del_dev)) +
  geom_tile() +
  scale_fill_viridis()
```

Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?

```{r}
diamonds %>%
  count(color, cut) %>%
  ggplot(aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))
```

Because cut is ordered, which gives the y axis a natural, scatterplot-like reading. Also, it's better to have the labels we need to read constantly (and which are a bit lengthy) on the y axis.

# 5.3.3 Two continuous variables

Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?

```{r}
ggplot(data = diamonds,
       mapping = aes(x = price,
                     colour = cut_width(carat, 0.3))) +
  geom_freqpoly()
```

cut_number makes N groups, each with the same number of observations; there's no contextual or substantive reason behind the resulting cut points.

cut_width is more substantive: you make the groups based on the metric of the variable itself.

In fact, there's a tension between the two. The idea is that you reach an equilibrium: categories that have substantive meaning but also have reasonable sample sizes.

Here we get 10 groups with the same sample size.
```{r}
ggplot(data = diamonds, aes(x = cut_number(price, 10), y = carat)) +
  geom_boxplot()
```

The relationship seems a bit exponential, with little difference between the smaller groups.

But with 10 groups based on substantive metrics:
```{r}
ggplot(data = diamonds, aes(x = cut_width(price, 2000), y = carat)) +
  geom_boxplot()
```

There are bigger differences at the bottom groups. But that's because the bottom groups now contain many more observations, which captures the variability between groups (within each of these groups the variance is small; that's why in the previous plot we see very little difference among the small groups).

Visualize the distribution of carat, partitioned by price.

```{r}
ggplot(diamonds, aes(carat, y = ..density.., colour = cut_width(price, 2000))) +
  geom_freqpoly()
```


How does the price distribution of very large diamonds compare to small diamonds. Is it as you expect, or does it surprise you?

```{r}
diamonds %>%
  filter(between(carat, 0, 2.5)) %>%
  mutate(carat = cut_width(carat, 1)) %>%
  ggplot(aes(price)) +
  geom_histogram() +
  facet_wrap(~ carat)
```

I'm a bit surprised by the distribution of price for bigger diamonds. I was expecting very little variance! It seems as though big diamonds can cost anything between 5000 and 18000, whereas small ones have very little variance.
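One way to put a number on that impression (a quick sketch using the same carat bins):

```{r}
diamonds %>%
  filter(between(carat, 0, 2.5)) %>%
  mutate(carat = cut_width(carat, 1)) %>%
  group_by(carat) %>%
  summarise(sd_price = sd(price), iqr_price = IQR(price), n = n())
```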
Combine two of the techniques you've learned to visualize the combined distribution of cut, carat and price.

```{r}
diamonds %>%
  filter(between(carat, 0, 2.5)) %>%
  mutate(carat = cut_width(carat, 1)) %>%
  ggplot(aes(cut, price)) +
  geom_boxplot() +
  scale_y_log10() +
  facet_wrap(~ carat)
```

We get the differences in price we saw before across carat groups, but we also see that within each carat group the price differences between cuts are small. It seems to be that the driving force behind price is carat rather than cut.

Why is a scatterplot a better display than a binned plot for this case?

```{r}
ggplot(diamonds, aes(x, y)) +
  geom_point() +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
```

Because binned plots categorize continuous measures and might distort some relationships, like the linear one here.

```{r}
ggplot(diamonds, aes(x, y)) +
  geom_hex()
```

Another possibility would be to bin both continuous variables, as stated before, but that would lump the outliers into groups, eliminating any trace of anomalous or extreme values.

--------------------------------------------------------------------------------
/Ch1.Rmd:
--------------------------------------------------------------------------------

---
title: "R-4DS"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```

## 3.2.4 Exercises

Run ggplot(data = mpg) what do you see?
```{r}
ggplot(data = mpg)
```

An empty grey canvas: the data is registered, but no layers are drawn yet.

How many rows are in mtcars? How many columns?
```{r}
dim(mpg)
```
(The question says mtcars, but the dataset used throughout is mpg: 234 rows and 11 columns.)

What does the drv variable describe? Read the help for ?mpg to find out.

The car's drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.

Make a scatterplot of hwy vs cyl.
```{r}
ggplot(mpg) + geom_point(aes(hwy, cyl))
```
What happens if you make a scatterplot of class vs drv. Why is the plot not useful?
```{r}
ggplot(mpg) + geom_point(aes(class, drv))
```

Because both variables are categorical, so the points pile up on a small grid and you can't see how many observations sit behind each point.

## 3.3.1 Exercises

What’s gone wrong with this code? Why are the points not blue?
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```

Because `color = "blue"` sits inside `aes()`, it is treated as a variable with the constant value "blue", not as a literal colour. Set it outside `aes()`:

```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```

Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

### Categorical
- model
- cyl
- manufacturer
- trans
- drv
- fl
- class

### Continuous
- displ
- year
- cty
- hwy

(The type abbreviation printed under each column name — <chr>, <int>, <dbl> — shows this when you run mpg.)

Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, colour = cty))
# ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
# The line above creates an error: a continuous variable can't be mapped to shape
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, size = cty))
```

For a continuous variable, colour becomes a gradient and size a graduated scale; shape refuses continuous variables outright.

What happens if you map the same variable to multiple aesthetics?
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, colour = cty, size = cty))
```

It works: the legends are merged, and both aesthetics redundantly encode the same information.

What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

Stroke controls the width of the border of certain shapes. Only the shapes that have borders can be altered by stroke.

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))
```

ggplot turns displ < 5 into a boolean (or dummy) variable on the fly and maps TRUE/FALSE to the colour scale.

## 3.5.1 Exercises

What happens if you facet on a continuous variable?

```{r}
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cty)
```

It plots it anyway, creating one facet per unique value — which quickly becomes unreadable.

What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl)
```

It means that there are combinations of drv and cyl with no data points.

What plots does the following code make? What does . do?

```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
```

The dot controls whether the facetting is done row- or column-wise. For example `facet_grid(drv ~ .)` will use drv as rows, while `facet_grid(. ~ drv)` will use it as columns. `facet_grid(~ drv)` does the same as the column-wise facetting, but `facet_grid(drv ~)` won't work because a formula needs something after the `~`.

Take the first faceted plot in this section:

```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
```

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

I think facetting is better when you want to pay attention to particular facets alone (naturally), while using the colour aesthetic is better for discriminating which points are located where. Colour gives a global overview of the relationship, while facetting is better for studying within-group patterns. For example, fitting separate trendlines for different groups is better done with faceting rather than all together.

Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol variables?

`nrow` controls the number of rows for the total number of facets, whereas `ncol` controls the number of columns. Other options control further aspects of the layout: for example, `scales` controls whether each panel gets its own axes (`scales = "free"`), `labeller` changes the names of the facets, and `strip.position` places the facet labels. Read `?facet_wrap` for more options. A small illustration of `scales` follows below.
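A minimal sketch, letting each panel keep its own y-axis range:

```{r}
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~ class, scales = "free_y")
```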
**BONUS**

How do you change the names of the facets? Very easily

```{r}
# the values 4, 5, 6 and 8 on the left are the old facet names (the cyl levels)
new_names <- as_labeller(c(`4` = "name0", `5` = "other_name", `6` = "name1", `8` = "name2"))

ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  facet_wrap(~ cyl, labeller = new_names)
```

Yay!
**BONUS**

`facet_grid` doesn't have options for the number of rows or columns because it computes the grid automatically: the dimensions are the number of distinct values of the row variable times the number of distinct values of the column variable in the formula.


When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Because otherwise the graph grows too tall to read comfortably — screens are wider than they are tall. Too many levels are hard to read even as columns, as this graph shows:
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(~ model)
```

## 3.6.1 Exercises

What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
```{r}
# Line chart
mpg %>%
  group_by(year) %>%
  summarise(m = mean(cty)) %>%
  ggplot(aes(year, m)) +
  geom_line()

# Boxplot
ggplot(mpg, aes(class, hwy)) +
  geom_boxplot()

# Histogram
ggplot(mpg, aes(displ)) +
  geom_histogram(bins = 60)

# Area chart
huron <- data.frame(year = 1875:1972, level = as.vector(LakeHuron))
ggplot(huron, aes(year, level)) +
  geom_area()
```

Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

```{r}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)
```

What does show.legend = FALSE do? What happens if you remove it?
Why do you think I used it earlier in the chapter?

It removes the legend. It gives a cleaner plot when it's clear which variable the grouping is done on.

What does the se argument to geom_smooth() do?

Setting `se = FALSE` removes the confidence bands around the smoothed lines.

Will these two graphs look different? Why/why not?
```{r, echo = F}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
```

They'll be exactly the same.

Recreate the R code necessary to generate the following graphs.

```{r}
# 1st.
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(se = F)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(aes(group = drv), se = F)

# 2nd.
ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_smooth(se = F) +
  geom_point()

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(se = F)

# 3rd.
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(aes(linetype = drv), se = F)

# You can do this one by choosing a shape which has a border and simply colour
# the border with `colour` and the insides with `fill` (which is matched to drv).
# Then make the whole point a bit bigger with size
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(fill = drv), shape = 21, stroke = 2, colour = "white", size = 3)
```

## 3.7.1 Exercises

What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

```{r, echo = F}
# Previous plot
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )
```

`stat_summary` is associated with `geom_pointrange`.

```{r}
ggplot(diamonds) +
  geom_pointrange(aes(cut, depth),
                  stat = "summary",
                  fun.ymin = min,
                  fun.ymax = max,
                  fun.y = median)
```

What does geom_col() do? How is it different to geom_bar()?

`geom_col` leaves the data as it is. `geom_bar()` computes two variables (count and prop) and then plots the count on the y axis. With `geom_col` you can plot the values of any x variable against any y variable.

```{r}
# For example, plotting exactly x to y values.
aggregate.data.frame(diamonds$price, list(diamonds$cut), mean, na.rm = T) %>%
  print(.) %>%
  ggplot(aes(Group.1, x)) +
  geom_col()
```

Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

What variables does stat_smooth() compute? What parameters control its behaviour?

`stat_smooth()` computes `y`, the predicted value of y for each x value, together with the standard error of that prediction and the upper and lower bound of the point prediction. It can fit different models such as `lm`, `glm` and `loess`, among others. See method in `?stat_smooth`; the fit is controlled with the method argument.

You can see the values by wrapping any plot that has geom_smooth() with ggplot_build().

In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

Not sure about this one, but the code below illustrates the behaviour.

```{r}
# Each cut is treated as a separate group that sums to 1.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop..))

# If you calculate it manually, it doesn't matter
m <- ggplot(data = diamonds)
m + geom_bar(aes(cut, ..count../sum(..count..)))

diamonds %>%
  count(cut) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(cut, prop)) + geom_bar(stat = "identity") # or geom_col()

ggplot(diamonds, aes(cut)) + geom_bar(aes(y = ..count../sum(..count..)))

# By specifying group = 1, you treat all the cut groups as one group,
# and thus the proportions are calculated over the whole data rather than within each cut.
ggplot(diamonds, aes(cut)) + geom_bar(aes(y = ..prop.., group = 1))
```

## 3.8.1 Exercises

What is the problem with this plot? How could you improve it?
```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()
```
Although the two variables are continuous, they are recorded as whole numbers, so many observations fall on exactly the same spot and overlap. We can fix it by adding jitter.

```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_jitter()
```

What parameters to geom_jitter() control the amount of jittering?
`width` and `height`

```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_jitter(width = 5, height = 10)
```

Compare and contrast geom_jitter() with geom_count().
```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_jitter()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_count()
```
`geom_count()` is another variant of `geom_point()` that scales the size of each dot by the number of observations at that coordinate. Contrasting it with `geom_jitter()` can help in understanding the data.

What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

The default is a dodge position (`position_dodge2()` in current ggplot2), which is why the boxes for each `drv` are placed side by side within each class:

```{r}
ggplot(data = mpg, mapping = aes(x = class, y = displ)) +
  geom_boxplot(aes(colour = drv))
```

## 3.9.1 Exercises

Turn a stacked bar chart into a pie chart using coord_polar().

```{r}
ggplot(mpg, aes(factor(1), fill = factor(cyl))) +
  geom_bar(width = 1) +
  coord_polar(theta = 'y')
```

What does labs() do? Read the documentation.
`labs()` allows you to control all the labels in the plot. For example:
```{r}
ggplot(mpg, aes(cyl, fill = as.factor(cyl))) +
  geom_bar() +
  labs(title = "Hey, this is a title",
       subtitle = "This is the subtitle",
       x = "This is the X axis",
       y = "This is the Y axis",
       fill = "This is the fill",
       caption = "This is a caption")
```

What’s the difference between coord_quickmap() and coord_map()?
```{r}
nz <- map_data("nz")

nzmap <- ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

nzmap + coord_map()
nzmap + coord_quickmap()
```

`coord_map()` projects the (spherical) Earth onto the flat plane with a real map projection, so it does not preserve straight lines and takes more computation. `coord_quickmap()` is a quick approximation that simply sets a sensible aspect ratio while keeping lines straight; it is faster and works well for smaller areas.

What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

```{r}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()
```

There is a positive correlation between the two. `coord_fixed()` makes sure there are no visual discrepancies and
> ensures that the ranges of axes are equal to the specified ratio by adjusting the plot aspect ratio - Documentation of `coord_fixed()`.
Finally, `geom_abline()` draws a reference line, by default with intercept 0 and slope 1 (the line y = x), which makes it easy to see that highway mileage is consistently higher than city mileage.

--------------------------------------------------------------------------------
/Ch3.Rmd:
--------------------------------------------------------------------------------

---
title: "Ch3"
output:
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(nycflights13)
library(tidyverse)
library(car)
```


## 5.2.4 Exercises

Find all flights that:

Had an arrival delay of two or more hours

```{r}
filter(flights, arr_delay >= 120)
```

Flew to Houston (IAH or HOU)

```{r}
filter(flights, dest %in% c("IAH", "HOU"))
```

Were operated by United, American, or Delta

```{r}
filter(flights, carrier %in% c("AA", "DL", "UA"))
```

Departed in summer (July, August, and September)
```{r}
filter(flights, month %in% 7:9)
```

Arrived more than two hours late, but didn’t leave late

```{r}
filter(flights, arr_delay > 120, dep_delay <= 0)
```

Were delayed by at least an hour, but made up over 30 minutes in flight

```{r}
# For example, if dep_delay is 10 minutes late then arr_delay should be
# 10 mins late. 10 - 10 = 0, so no time was made up in the air.

# If dep_delay is 10 minutes late but arr_delay is -20 minutes early, then
# arr_delay SHOULD'VE been 10 but instead is -20 (a 30 minute catch up), so
# 10 - (-20) = 30.

filter(flights, dep_delay >= 60, (dep_delay - arr_delay > 30))
```

Departed between midnight and 6am (inclusive)
```{r}
filter(flights, dep_time >= 2400 | dep_time <= 600)
```

Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

```{r}
# between() keeps values inside a closed range; negating it reproduces
# the midnight-to-6am filter from above
filter(flights, !between(dep_time, 601, 2359))
```

How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
```{r}
sum(is.na(flights$dep_time))

map_dbl(flights, ~ sum(is.na(.x)))
```

Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

Because anything raised to the power 0 equals 1, whatever the missing value stands for.
Because `NA | TRUE` is `TRUE` no matter what the `NA` stands for.
Because `FALSE & NA` is `FALSE` no matter what the `NA` stands for; in `NA & NA` nothing can be decided, so the result stays `NA`.

The general rule: whenever the result is fully determined regardless of the value hiding behind the `NA`, R returns that result; otherwise it returns `NA`. (`NA * 0` is tricky because the hidden value could be `Inf`, and `Inf * 0` is `NaN` rather than 0.)

## 5.3.1 Exercises

How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
```{r}
df <- tibble(x = c(5, 2, NA),
             y = c(2, NA, 2))

arrange(df, desc(is.na(x)))
arrange(df, -(is.na(x)))
```
We're basically saying: sort by `is.na(x)` in descending order, so the rows that are `TRUE` for being `NA` come first.

Sort flights to find the most delayed flights. Find the flights that left earliest.
```{r}
# Most delayed
arrange(flights, desc(dep_delay))
# Left earliest
arrange(flights, dep_delay)
```

Sort flights to find the fastest flights.
```{r}
arrange(flights, air_time)
```

Which flights travelled the longest? Which travelled the shortest?
```{r}
# Shortest (by air time)
flights %>%
  arrange(air_time) %>%
  select(carrier, flight, air_time)

# Longest (by air time)
flights %>%
  arrange(-air_time) %>%
  select(carrier, flight, air_time)
```

## 5.4.1 Exercises

Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

```{r}
vars <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, starts_with("dep"), starts_with("arr"))
select(flights, one_of(vars))
select_(flights, .dots = vars)
select_(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, matches("dep"), matches("arr"), -matches("sched"), -carrier)
select(flights, contains("dep"), contains("arr"), -contains("sched"), -carrier)
select(flights, matches("^dep|^arr"))
select(flights, matches("time$|delay$"), -contains("sched"), -contains("air"))
select(flights, matches("^dep|arr_delay|time$"))
```

What happens if you include the name of a variable multiple times in a select() call?

```{r}
select(flights, dep_time, dep_time)
```

Nothing, it just returns it once.

What does the one_of() function do? Why might it be helpful in conjunction with this vector?
```{r}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
```

It selects any variable that matches one of the names in a character vector. It's helpful because `select()` otherwise expects bare variable names without `" "` quotes; `one_of()` lets you pass the names as character strings instead.

Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

By default, each select helper ignores the case of the variable names.

```{r}
select(flights, contains("TIME"))
```

With this command you can treat each name as case-sensitive:
```{r}
select(flights, contains("TIME", ignore.case = F))
```

## 5.5.2 Exercises

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

```{r}
hours2mins <- function(x) {
  x %/% 100 * 60 + x %% 100
}

# with integer division
mutate(flights,
       dep_time = hours2mins(dep_time),
       sched_dep_time = hours2mins(sched_dep_time))

# with rounding operations
mutate(flights,
       dep_time = 60 * floor(dep_time/100) + (dep_time - floor(dep_time/100) * 100),
       sched_dep_time = 60 * floor(sched_dep_time/100) + (sched_dep_time - floor(sched_dep_time/100) * 100))
```
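A quick sanity check of the conversion on a few known clock times (illustrative only):

```{r}
hours2mins <- function(x) {
  x %/% 100 * 60 + x %% 100
}

hours2mins(c(1, 100, 600, 2359))  # 1, 60, 360 and 1439 minutes after midnight
```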
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

```{r}
flights %>%
  mutate(dep_time = (dep_time %/% 100) * 60 + (dep_time %% 100),
         sched_dep_time = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100),
         arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100),
         sched_arr_time = (sched_arr_time %/% 100) * 60 + (sched_arr_time %% 100)) %>%
  transmute((arr_time - dep_time) %% (60*24) - air_time)
```

You might expect the difference to be zero, but it isn't: arr_time and dep_time are local clock times (in different time zones for many destinations), and the elapsed time also includes taxiing, so it doesn't equal air_time.

Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

```{r}
hours2mins <- function(x) {
  x %/% 100 * 60 + x %% 100
}


select(flights, contains("dep")) %>%
  mutate(dep_time_two = hours2mins(dep_time) - hours2mins(sched_dep_time))


# these two numbers don’t match because we aren’t accounting for flights
# where the departure time is the next day from the scheduled departure time.

select(flights, contains("dep")) %>%
  mutate(dep_time_two = hours2mins(dep_time) - hours2mins(sched_dep_time)) %>%
  filter(dep_delay != dep_time_two) %>%
  mutate(dep_time_two = hours2mins(dep_time) - hours2mins(sched_dep_time - 2400))

# There it is fixed!
```

Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

```{r}
flights %>%
  filter(min_rank(-(dep_delay)) %in% 1:10)

flights %>%
  top_n(10, dep_delay)
```

What does 1:3 + 1:10 return? Why?
It will return:
```{r}
x <- c(2, 4, 6, 5, 7, 9, 8, 10, 12, 11)

p <- 1:3 + 1:10

p == x
```

Because `1:3` is recycled to the length of `1:10` (with a warning, since 10 is not a multiple of 3).

What trigonometric functions does R provide?

`?Trig`

## 5.6.7 Exercises

Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

```{r}
delay_char <-
  flights %>%
  group_by(flight) %>%
  summarise(n = n(),
            fifteen_early = mean(arr_delay == -15, na.rm = T),
            fifteen_late = mean(arr_delay == 15, na.rm = T),
            ten_always = mean(arr_delay == 10, na.rm = T),
            thirty_early = mean(arr_delay == -30, na.rm = T),
            thirty_late = mean(arr_delay == 30, na.rm = T),
            percentage_on_time = mean(arr_delay == 0, na.rm = T),
            twohours = mean(arr_delay > 120, na.rm = T)) %>%
  map_if(is_double, round, 2) %>%
  as_tibble()
```

A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.

```{r}
delay_char %>%
  filter(fifteen_early == 0.5, fifteen_late == 0.5)
```

A flight is always 10 minutes late.

```{r}
delay_char %>%
  filter(ten_always == 1)
```

A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.

```{r}
delay_char %>%
  filter(thirty_early == 0.5 & thirty_late == 0.5)
```

99% of the time a flight is on time. 1% of the time it’s 2 hours late.

```{r}
delay_char %>%
  filter(percentage_on_time == 0.99 & twohours == 0.01)
```

Which is more important: arrival delay or departure delay?
It depends on who's asking, but for passengers arrival delay is what ultimately matters; departure delay is mostly useful as an early predictor of it.

Come up with another approach that will give you the same output as:
```{r}
not_cancelled <-
  flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>%
  count(dest)

# and

not_cancelled %>%
  count(tailnum, wt = distance)
# (without using count()).

#######################

not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())

# and

not_cancelled %>%
  group_by(tailnum) %>%
  tally(wt = distance)
# or
not_cancelled %>%
  group_by(tailnum) %>%
  summarize(n = sum(distance))
```

Our definition of cancelled flights `(is.na(dep_delay) | is.na(arr_delay) )` is slightly suboptimal. Why? Which is the most important column?

Because if a flight didn't leave, it was cancelled: the condition `is.na(dep_delay)` is enough on its own, which makes `dep_delay` the most important column. (A flight can also depart and never arrive — diverted, say — and the `arr_delay` condition lumps those in with cancellations.)

Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

```{r}
flights %>%
  group_by(year, month, day) %>%  # group by the full date, not just the day of the month
  summarise(cancelled = mean(is.na(dep_delay)),
            mean_dep = mean(dep_delay, na.rm = T),
            mean_arr = mean(arr_delay, na.rm = T)) %>%
  ggplot(aes(y = cancelled)) +
  geom_point(aes(x = mean_dep), colour = "red") +
  geom_point(aes(x = mean_arr), colour = "blue") +
  labs(x = "Avg delay per day", y = "Cancelled flights p day")
```

It looks like there is a positive relationship: the higher the average delay of the day, the higher the proportion of cancelled flights that day.

Which carrier has the worst delays?

```{r}
flights %>%
  group_by(carrier) %>%
  summarise(dep_max = max(dep_delay, na.rm = T),
            arr_max = max(arr_delay, na.rm = T)) %>%
  arrange(desc(dep_max, arr_max)) %>%
  filter(1:n() == 1)
```

Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

It might be possible. For example, if we took the average departure delay for each carrier and then computed, for each airport, the deviation from the overall carrier mean, perhaps we could find something out. If the overall mean is, let's say, 25 mins, and each deviation is ± 1 or 2, then either the airline is bad or every airport it flies from is bad. It might be more likely that the airline is bad, but we can't be 100% sure. On the other hand, if the overall carrier mean is high and the deviations are all lower except for 1 or 2 airports, then it is probably an airport effect.

```{r}
flights %>%
  summarise(n_car = n_distinct(carrier),
            n_air = n_distinct(dest),
            n_or = n_distinct(origin))

flights %>%
  group_by(carrier) %>%
  mutate(avg_carrier = mean(dep_delay, na.rm = T)) %>%
  group_by(carrier, origin) %>%
  mutate(origin_mean = mean(dep_delay, na.rm = T),
         deviations = origin_mean - avg_carrier) %>%
  summarise(deviations = mean(deviations), mean = mean(avg_carrier)) %>%
  ggplot(aes(origin, deviations)) + geom_col() + facet_wrap(~ carrier)
```

Teasing out the effect is not straightforward, but we can make some informed guesses.
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

```{r}
flights %>%
  group_by(year, month, day) %>%  # group by the full date, not just the day of the month
  summarise(cancelled = mean(is.na(dep_delay)),
            mean_dep = mean(dep_delay, na.rm = T),
            mean_arr = mean(arr_delay, na.rm = T)) %>%
  ggplot(aes(y = cancelled)) +
  geom_point(aes(x = mean_dep), colour = "red") +
  geom_point(aes(x = mean_arr), colour = "blue") +
  labs(x = "Avg delay per day", y = "Proportion of cancelled flights per day")
```

It looks like there is a positive relationship: the higher the average delay on a given day, the higher the proportion of flights cancelled that day.

Which carrier has the worst delays?

```{r}
flights %>%
  group_by(carrier) %>%
  summarise(dep_max = max(dep_delay, na.rm = T),
            arr_max = max(arr_delay, na.rm = T)) %>%
  arrange(desc(dep_max), desc(arr_max)) %>%  # desc() takes a single vector
  filter(1:n() == 1)
```

Keep in mind that the maximum is driven by a single extreme flight; ranking carriers by their mean delay may paint a different picture.

Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

It might be possible. For example, if we take the average departure delay for each carrier and then compute how much each airport's mean deviates from that carrier mean, perhaps we can find something out. If the overall carrier mean is, say, 25 minutes and every airport deviates by only ±1 or 2 minutes, then either the airline is bad or every single airport it flies from is bad; the former is more likely, but we can't be 100% sure. On the other hand, if the overall carrier mean is high and the deviations are all low except for one or two airports, then it is probably an airport effect.

```{r}
flights %>%
  summarise(n_car = n_distinct(carrier),
            n_air = n_distinct(dest),
            n_or = n_distinct(origin))

flights %>%
  group_by(carrier) %>%
  mutate(avg_carrier = mean(dep_delay, na.rm = T)) %>%
  group_by(carrier, origin) %>%
  mutate(origin_mean = mean(dep_delay, na.rm = T),
         deviations = origin_mean - avg_carrier) %>%
  summarise(deviations = mean(deviations), mean = mean(avg_carrier)) %>%
  ggplot(aes(origin, deviations)) +
  geom_col() +
  facet_wrap(~ carrier)
```

Teasing out the effect is not straightforward, but we can make some informed guesses. For example, whenever there are substantial deviations, they tend to be higher at EWR than at the other airports, which points at the airport. On the other hand, some airlines, such as 9E and MQ, look particularly bad, and since the same pattern is not found in the vast majority of other airlines, that would suggest an airline issue rather than an airport issue.

```{r}
flights %>%
  group_by(carrier, dest) %>%
  summarise(mean_departure = mean(dep_delay, na.rm = T),
            mean_arrival = mean(arr_delay, na.rm = T))
```

For each plane, count the number of flights before the first delay of greater than 1 hour.

```{r}
flights %>%
  group_by(tailnum) %>%
  arrange(time_hour) %>%
  mutate(before_first = !cumany(arr_delay > 60)) %>%  # TRUE until the first delay > 60 min
  filter(before_first) %>%
  tally(sort = TRUE)
```

or

```{r}
flights %>%
  group_by(tailnum) %>%
  arrange(time_hour) %>%
  mutate(cum = arr_delay > 60,
         cum_any = cumsum(cum)) %>%
  filter(cum_any < 1) %>%
  tally(sort = TRUE)
```

What does the sort argument to count() do? When might you use it?

`sort = TRUE` orders the output by the count, in descending order. It is useful whenever you mainly care about the most common cases.

```{r}
flights %>%
  count(flight, sort = T)
```

## 5.7.1 Exercises

Refer back to the table of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

In short: summary functions (`mean()`, `sum()`, `min()`, ...), ranking functions (`min_rank()`, `row_number()`, ...) and offsets (`lag()`, `lead()`) all start operating within each group rather than over the whole data frame, while arithmetic operators are unaffected because they work element-wise.

Which plane (tailnum) has the worst on-time record?

```{r}
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(tailnum) %>%
  summarise(prop_time = sum(arr_delay <= 30) / n(),
            mean_arr = mean(arr_delay, na.rm = T),
            fl = n()) %>%
  arrange(prop_time)  # lowest proportion of roughly-on-time arrivals first
```

The planes at the top never arrive within 30 minutes of schedule: they are always late. Note, though, that many of them flew only a handful of times, so `fl` is worth checking before drawing conclusions.

What time of day should you fly if you want to avoid delays as much as possible?

```{r}
flights %>%
  group_by(hour) %>%
  filter(!is.na(dep_delay)) %>%
  summarise(delay = mean(dep_delay > 0, na.rm = T)) %>%
  ggplot(aes(hour, delay, fill = delay)) +
  geom_col()

# or

flights %>%
  group_by(hour) %>%
  summarize(m = mean(dep_delay, na.rm = TRUE),
            sd = sd(dep_delay, na.rm = TRUE),
            low_ci = m - 2 * sd,
            high_ci = m + 2 * sd,
            n = n()) %>%
  ggplot(aes(hour, m, ymin = low_ci, ymax = high_ci)) +
  geom_pointrange()
```

The worst time to fly is the early evening, partly because delays accumulate over the day and partly because that is when the most flights depart. To avoid delays, fly as early in the morning as possible.

For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

```{r}
flights %>%
  group_by(dest) %>%
  filter(!is.na(dep_delay)) %>%
  summarise(tot_mins = sum(dep_delay[dep_delay > 0]))

flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(tailnum, dest) %>%
  summarise(m = mean(dep_delay > 0), n = n()) %>%
  arrange(desc(m))
```
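The second chunk above ranks planes by how often they leave late, which answers the spirit but not the letter of the question. A sketch closer to the literal reading (assuming "flight" means a flight number; `prop_delay` is my own naming) would compute each flight's share of its destination's total delay:

```{r}
# Sketch: each flight number's share of the total positive departure delay
# accumulated at its destination. After summarise() the result is still
# grouped by dest, so sum(delay) is computed per destination.
flights %>%
  filter(!is.na(dep_delay), dep_delay > 0) %>%
  group_by(dest, flight) %>%
  summarise(delay = sum(dep_delay)) %>%
  mutate(prop_delay = delay / sum(delay)) %>%
  arrange(dest, desc(prop_delay))
```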
Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.

```{r}
flights %>%
  mutate(new_sched_dep_time = lubridate::make_datetime(year, month, day, hour, minute)) %>%
  arrange(new_sched_dep_time) %>%
  mutate(prev_time = lag(dep_delay)) %>%
  # filter(between(dep_delay, 0, 300), between(prev_time, 0, 300)) %>% # play with this one
  select(origin, new_sched_dep_time, dep_delay, prev_time) %>%
  ggplot(aes(dep_delay, prev_time)) +
  geom_point(alpha = 1/10) +
  geom_smooth()

# or

flights %>%
  select(year, month, day, hour, dest, dep_delay) %>%
  group_by(dest) %>%
  mutate(lag_delay = lag(dep_delay)) %>%
  arrange(dest) %>%
  filter(!is.na(lag_delay)) %>%
  summarize(cor = cor(dep_delay, lag_delay, use = "complete.obs"),
            n = n()) %>%
  arrange(desc(cor)) %>%
  filter(row_number(desc(cor)) %in% 1:10)
```

Although there is a lot of noise, you can see a roughly linear relationship in the scatterplot. The lagged correlation is also high for several destinations, so the dependence between consecutive flights is concentrated in specific airports.

Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

```{r}
# (1) the five fastest flights to each destination
flights %>%
  group_by(dest) %>%
  arrange(air_time) %>%
  slice(1:5) %>%
  select(tailnum, sched_dep_time, sched_arr_time, air_time) %>%
  arrange(air_time)

# (2) the flight to each destination that took the longest relative to the
# fastest flight on the same route
flights %>%
  group_by(dest) %>%
  mutate(shortest = air_time - min(air_time, na.rm = T)) %>%
  top_n(1, shortest) %>%
  arrange(-shortest) %>%
  select(tailnum, sched_dep_time, sched_arr_time, shortest)
```

Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

```{r}
flights %>%
  group_by(dest) %>%
  filter(n_distinct(carrier) >= 2) %>%  # "at least two" carriers
  group_by(carrier) %>%
  summarise(n = n_distinct(dest)) %>%
  arrange(-n)
```
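Counting destinations rewards carriers with large networks rather than good service. As an alternative sketch (my own extension, not from the book), the same multi-carrier destinations can be used to rank carriers by mean arrival delay, which is more of a like-for-like comparison:

```{r}
# Sketch: among destinations served by 2+ carriers, rank carriers by
# average arrival delay (lower is better).
flights %>%
  group_by(dest) %>%
  filter(n_distinct(carrier) >= 2) %>%
  group_by(carrier) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE),
            n = n()) %>%
  arrange(mean_arr_delay)
```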