├── .gitignore ├── comments.md ├── tibbles_solutions.rmd ├── visualize_soutions.Rmd └── transform_solutions.Rmd /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | -------------------------------------------------------------------------------- /comments.md: -------------------------------------------------------------------------------- 1 | # Comments 2 | - The pie chart question seems like a bad idea. We could not find a clear, elegant answer, and it ends up exposing the workings of `geom_bar` in a way that might be unwise. 3 | 4 | - Cumulative logical operators probably should not be covered so early. They are a bit difficult to explain, relative to how often they are used. In addition, it seems unwise to introduce row-order-dependent operators before `arrange`. 5 | 6 | - Modular arithmetic should be introduced later than it is. It is an important topic in general programming, but it is not a common operation in data science. 
`ceiling`, `floor`, `trunc`, `round` and `signif` can achieve most or all of the same effects, and can be applied to floating point numbers. 7 | 8 | - Converting timestamps to times is probably a bit involved for this point in the book, unless you want to give a feel for messy data. 9 | 10 | - It is not clear what "delay" means in chapter 4. 11 | - The modular arithmetic material can probably go. 12 | -------------------------------------------------------------------------------- /tibbles_solutions.rmd: -------------------------------------------------------------------------------- 1 | # `Tibble` Exercises 2 | 3 | ```{r} 4 | library(tibble) 5 | ``` 6 | 7 | ## Exercise 1 8 | 9 | ```{r} 10 | # load the data 11 | data(mtcars) 12 | 13 | is(mtcars) # the type is data.frame 14 | 15 | as_tibble(mtcars) # no pipe needed, so only tibble has to be loaded 16 | ``` 17 | 18 | ## Exercise 2 19 | 20 | ## Exercise 2.1 21 | 22 | ```{r} 23 | library(ggplot2) 24 | 25 | annoying <- tibble( 26 | `1` = 1:10, 27 | `2` = `1` * 2 + rnorm(length(`1`)) 28 | ) 29 | 30 | ggplot(data = annoying) + 31 | geom_point(aes(x = `1`, y = `2`)) 32 | ``` 33 | 34 | ## Exercise 2.2 35 | 36 | ```{r} 37 | annoying$`3` <- annoying$`2` / annoying$`1` 38 | annoying 39 | ``` 40 | 41 | ## Exercise 2.4 42 | 43 | This has to be done before the renaming in Exercise 2.3. 44 | 45 | ```{r} 46 | annoying$`1` 47 | annoying 48 | ``` 49 | 50 | ## Exercise 2.3 51 | 52 | ```{r} 53 | colnames(annoying) <- c("one", "two", "three") 54 | annoying 55 | ``` 56 | 57 | ## Exercise 3 58 | 59 | `enframe()` converts a named vector into a two-column tibble of names and values. We might use it before we use `mutate`. 
60 | 61 | ```{r} 62 | v <- 1:10 63 | names(v) <- letters[v] 64 | 65 | d <- enframe(v) 66 | d 67 | ``` 68 | 69 | ## Exercise 4 70 | 71 | In the example below, `tibble.max_extra_cols` is set to `100`; it defines how many additional column names are printed. 72 | 73 | ```{r} 74 | options(tibble.max_extra_cols = 100) 75 | ``` 76 | -------------------------------------------------------------------------------- /visualize_soutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "visualize_soutions" 3 | output: html_document 4 | --- 5 | ```{r setup, include=FALSE} 6 | knitr::opts_chunk$set(echo = TRUE) 7 | library(ggplot2) 8 | ``` 9 | 10 | 11 | # 3.2.1 12 | ### Run `ggplot(data = mpg)` what do you see? 13 | ```{r} 14 | ggplot(data = mpg) 15 | ``` 16 | 17 | A ggplot with no aesthetics just shows a grey square, since it produces a background with no graph on it. 18 | 19 | ### What does the `drv` variable describe? Read the help for `?mpg` to find out. 20 | ```{r} 21 | ?mpg 22 | ``` 23 | The variable `drv` says which wheels [drive](https://en.wikipedia.org/wiki/Drive_wheel) the vehicle. 24 | Typing `?` followed by a name at the command prompt gets you help on functions and other objects. 25 | 26 | ### Make a scatterplot of `hwy` vs `cyl`. 27 | ```{r} 28 | ggplot(mpg, aes(x=cyl, y=hwy)) + geom_point() 29 | ``` 30 | 31 | `ggplot(mpg, aes(x=cyl, y=hwy))` sets up the plot: the data that it is based on, and what the axes represent. `geom_point` says that we want to draw a scatter plot on these axes. 32 | 33 | ### What happens if you make a scatterplot of `class` vs `drv`. Why is the plot not useful? 34 | ```{r} 35 | ggplot(mpg, aes(x=class, y=drv)) + geom_point() 36 | ``` 37 | 38 | Because both variables are categorical, the points overlap, so we only see whether there were any observations with each combination of values. Using `geom_jitter`, or setting (rather than mapping) a low `alpha` such as `alpha = 0.01`, are possible ways to remedy this. 
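A minimal sketch of ways to reduce the overplotting in the `class` vs `drv` plot, assuming jittering, counting, and transparency are acceptable remedies. The object names (`base`, `p_jitter`, `p_count`, `p_alpha`) and the jitter and alpha amounts are illustrative choices, not values from the exercise.

```r
library(ggplot2)

# Base plot: two categorical variables, so many rows share one position.
base <- ggplot(mpg, aes(x = class, y = drv))

# Option 1: add a little random noise so coincident points separate.
# The width/height values are arbitrary illustrative amounts.
p_jitter <- base + geom_jitter(width = 0.2, height = 0.2)

# Option 2: draw one point per combination, sized by the number of rows.
p_count <- base + geom_count()

# Option 3: set (not map) a low alpha so busier positions look darker.
p_alpha <- base + geom_point(alpha = 0.1)

p_jitter
p_count
p_alpha
```

Of the three, `geom_count()` is the only deterministic one that also shows how many rows sit at each position, which is arguably the most useful reading for two discrete variables.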
39 | 40 | # 3.3.1 41 | ### What's gone wrong with this code? Why are the points not blue? 42 | The points are not blue because `"blue"` inside `aes()` is interpreted as a one-element vector (`c("blue")`) to map to an aesthetic, just like `hwy` or `displ`. To set a colour manually, the `color` argument should be placed outside `aes()`. 43 | ```{r} 44 | ggplot(data = mpg) + 45 | geom_point(mapping = aes(x = displ, y = hwy), color = "blue") 46 | ``` 47 | 48 | ### Which variables in `mpg` are categorical? Which variables are continuous? 49 | `mpg` is a tibble, so if you just run `mpg`, you can see the types of the variables under the column headers (as long as you have `dplyr` loaded). 50 | 51 | ### Map a continuous variable to `color`, `size`, and `shape`. How do these aesthetics behave differently for categorical vs. continuous variables? 52 | - `shape` can't take a continuous variable, because shapes aren't ordered. 53 | - `size` maps the variable to the area of the mark; `scale_radius` can be used to map to the radius instead. 54 | - `colour` maps the variable to a gradient of blues, from dark to light. Other mappings can be achieved with `scale_color_continuous`. 55 | 56 | ### What happens if you map the same variable across multiple aesthetics? What happens if you map different variables across multiple aesthetics? 57 | - You can represent a variable with multiple aesthetics with no trouble. For instance, using both `shape` and `colour` for one discrete variable means that your plot will still be readable in black and white. 58 | - If you try to use the same aesthetic multiple times, ggplot will take your first answer, with a warning. 59 | 60 | ### What does the `stroke` aesthetic do? What shapes does it work with? 61 | `stroke` controls the width of the border, for shapes that have one. 
62 | ```{r} 63 | ggplot(data = mpg) + geom_point(aes(x = cty, y = hwy, stroke = displ), shape = 21) 64 | ``` 65 | 66 | ### What happens if you set an aesthetic to something other than a variable name, like `displ < 5`? 67 | Aesthetic mappings are treated as expressions to be evaluated in the context of the `data` argument, so the expression is evaluated and the result is plotted. 68 | ```{r} 69 | ggplot(data = mpg) + geom_point(aes(x = cty, y = hwy, colour = displ < 5)) 70 | ``` 71 | 72 | # 3.5.1 73 | ### What happens if you facet on a continuous variable? 74 | You'll get one row or column for each unique value of the variable. This can be very slow for variables that take a lot of values. 75 | 76 | ### What do the empty cells in plot with `facet_grid(drv ~ cyl)` mean? 77 | Empty cells in `facet_grid` imply that there were no rows with that combination of values in the original dataset. This is just like in a discrete vs discrete scatter plot, where an empty position implies that that combination of values didn't occur in the original data set. 78 | 79 | ### What plots does the following code make? What does `.` do? 80 | `.` is just a placeholder so that we can have a facet in only one dimension. 81 | This is necessary because sometimes one-sided formulae can cause problems. 82 | 83 | ### What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset? 84 | It is difficult to resolve more than a dozen or so discrete colours, but we can have a larger number of facets than that. On the other hand, facets can be harder to read at a glance, or if the cells being compared aren't lined up in the required dimension. So in a situation like this, colours are probably better, but if we had more classes, or wanted to use colour for a different variable, facets would come into their own. 85 | 86 | ### Read `?facet_wrap`. What does `nrow` do? What does `ncol` do? 
What other options control the layout of the individual panels? Why doesn't `facet_grid()` have `nrow` and `ncol` variables? 87 | In `facet_wrap`, `nrow` and `ncol` control the numbers of rows and columns, but in `facet_grid` these are implied by the faceting variables. `dir` also controls the placement of the individual panels, and so isn't an argument of `facet_grid`. 88 | 89 | ### When using `facet_grid()` you should usually put the variable with more unique levels in the columns. Why? 90 | Most screens are wider than they are tall. 91 | 92 | # 3.6.1 93 | ### What geom would you use to draw a line chart? A boxplot? A histogram? An area chart? 94 | The usual geoms are `geom_line`, `geom_boxplot`, `geom_histogram` and `geom_area`. Assuming that "line chart" means a chart like that produced by `geom_line` (as opposed to vertical lines like a zero-width bar chart), then the closest analogy using `geom_area` is something like this: 95 | ```{r} 96 | ggplot(mtcars, aes(x=qsec, y=mpg)) + 97 | geom_area(position = 'identity', alpha=0, colour='black') + coord_cartesian(xlim=c(min(mtcars$qsec)*1.1,max(mtcars$qsec)*0.9), ylim=c(2, max(mtcars$mpg))) 98 | ``` 99 | 100 | ### Run this code in your head and predict what the output will look like. 101 | This plot shows that four wheel drive cars generally have somewhat worse highway fuel consumption than two wheel drives, and that higher engine displacement is generally associated with worse fuel consumption. However, there are several possible confounding variables: both four wheel drive and large displacement are generally associated with large mass and body size, and four wheel drives often have more frontal area than very similar two wheel drives. This means that this plot doesn't tell us the causal effect of driven wheels and displacement on fuel consumption. 102 | 103 | ### What does the `se` argument to `geom_smooth()` do? 104 | `se` specifies whether to add a translucent band around the line showing the confidence interval. 105 | 106 | ### Will these two graphs look different? Why/why not? 107 | They look the same. 
It doesn't matter whether `data` and `mapping` are specified in the initial `ggplot()` or in the `geom`. 108 | 109 | ### Recreate the R code necessary to generate the following graphs. 110 | ```{r} 111 | base1 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, group=drv)) 112 | base1 + geom_point() + geom_smooth(se = FALSE) 113 | base1 + geom_smooth(se = FALSE) + geom_point() 114 | base2 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) 115 | base2 + geom_point(aes(colour = drv)) + geom_smooth(aes(colour = drv), se=FALSE) 116 | base2 + geom_point(aes(colour = drv)) + geom_smooth(se=FALSE) 117 | base2 + geom_point(aes(colour = drv)) + geom_smooth(aes(linetype=drv), se=FALSE) 118 | base2 + geom_point(size = 4, colour = "white") + geom_point(aes(colour = drv)) 119 | ``` 120 | 121 | # 3.7.1 122 | ### In our proportion barchart, we need to set group = 1. Why? In other words, why is this graph not useful? 123 | `..prop..` finds proportions of the groups in the data. If we don't specify that we want all the data to be regarded as one group, then `geom_bar` treats each cut as a separate group, and if we find the proportion of "Premium" diamonds that are "Premium", the answer is obviously 1. 124 | 125 | ### How do you find out the default stat associated with a geom? 126 | We can see the default stat for `geom_point` by calling `?geom_point` to see the help page, or just `geom_point` to see the function definition. 127 | 128 | # 3.8.1 129 | ### What is the problem with this plot? How could you improve it? 130 | Because mpg figures are rounded, both `cty` and `hwy` are relatively small integers. 131 | This means that some of the points overlap and hide each other. 132 | Options for dealing with this include using `position_jitter`, making the points transparent, and adding a line. 
133 | ```{r} 134 | ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point(position = 'jitter', alpha=0.5) + geom_smooth(method='lm') 135 | ``` 136 | 137 | ### Compare and contrast `geom_jitter()` with `geom_count()`. 138 | `geom_jitter` randomly moves the points to stop them overlapping. `geom_count` deterministically counts the points at a given location and maps the count to the size of a single point. The determinism of `geom_count` makes it useful in discrete situations, but it does not work when the points are overlapping but not in exactly the same place. 139 | 140 | ### What's the default position adjustment for `geom_boxplot()`? Create a visualisation of the `mpg` dataset that demonstrates it. 141 | The default adjustment is `position_dodge` (`position_dodge2` in recent ggplot2 releases). This means that boxes sharing an x value are placed side by side rather than drawn on top of each other. 142 | ```{r} 143 | base <- ggplot(data = mpg, mapping = aes(x=drv, y=cty, fill=as.factor(cyl))) 144 | base + geom_boxplot() # looks right 145 | base + geom_boxplot(position='dodge') # the same 146 | base + geom_boxplot(position='identity') # unreadable 147 | base + geom_boxplot(position='jitter') # looks terrible, and unreadable 148 | ``` 149 | 150 | # 3.9.1 151 | ### Turn a stacked bar chart into a pie chart using `coord_polar()`. 152 | We could not find an elegant solution to this one; the closest we got is: 153 | ```{r} 154 | # produces a stacked bar chart 155 | ggplot(mpg, aes(x = 1, fill=factor(drv))) + 156 | geom_bar(width=1, stat='count') 157 | 158 | # produces the equivalent pie chart 159 | ggplot(mpg, aes(x = 1, fill=factor(drv))) + 160 | geom_bar(width=1, stat='count') + 161 | coord_polar(theta='y') 162 | ``` 163 | This is not a good approach: 164 | - Forcing a bar chart with one column is untidy. 165 | - Using `theta='y'` to force the polar plot to use the implicit `y` variable created by `geom_bar`'s `stat_count` is confusing and hacky. 166 | 167 | ### What does `labs()` do? Read the documentation. 168 | `labs()` changes the plot title and the labels on legends and axes. 
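As a small illustration of `labs()`, the sketch below relabels the axes, the legend, and the title of a scatterplot of `mpg`. The label strings and the object name `p` are invented for this example; only the `labs()` arguments themselves (`title`, `x`, `y`, and the name of a mapped aesthetic such as `colour`) are standard.

```r
library(ggplot2)

# Scatterplot with the colour aesthetic mapped, so there is a legend to label.
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  labs(
    title  = "Fuel economy by engine size",    # plot title (illustrative text)
    x      = "Engine displacement (litres)",   # x-axis label
    y      = "Highway miles per gallon",       # y-axis label
    colour = "Drive train"                     # legend title for the colour scale
  )

p
```

Naming the legend through the aesthetic (`colour = ...`) keeps all labelling in one call, rather than spreading it over `xlab()`, `ylab()`, and `ggtitle()`.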
169 | 170 | ### What's the difference between `coord_quickmap()` and `coord_map()`? 171 | `coord_quickmap` uses an approximation to the Mercator projection, whereas `coord_map` can use a variety of projections from the `mapproj` package. 172 | This means that `coord_quickmap` runs faster and doesn't require additional packages, but isn't as accurate, and won't work well far from the equator. 173 | 174 | ### What does the plot below tell you about the relationship between city and highway mpg? Why is `coord_fixed()` important? What does `geom_abline()` do? 175 | `geom_abline` is used to plot lines defined by intercept (a) and slope (b) parameters. Used with no arguments, like here, it will plot a line with slope 1 and intercept 0, so passing through the origin at 45 degrees. `coord_fixed` is important because x and y have the same units, so we want to maintain the slope of the line. We can then see that city mileage is worse than highway, but more importantly that this is better explained by a constant offset than a multiplicative factor. -------------------------------------------------------------------------------- /transform_solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "transform_solutions" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | library(dplyr) 9 | library(nycflights13) 10 | library(ggplot2) 11 | ``` 12 | # Look at flights data 13 | ```{r} 14 | flights 15 | ``` 16 | 17 | # 4.2.4 Exercises 18 | ### 1. Find all flights that 19 | #### 1.1. Had an arrival delay of two or more hours. 20 | ```{r} 21 | filter(flights, arr_delay>=120) 22 | ``` 23 | 24 | #### 1.2. Flew to Houston (`IAH` or `HOU`) 25 | ```{r} 26 | filter(flights, dest == 'IAH' | dest == 'HOU') 27 | filter(flights, dest %in% c('IAH', 'HOU')) 28 | ``` 29 | 30 | #### 1.3. 
Were operated by United, American, or Delta 31 | ```{r} 32 | filter(flights, carrier == 'UA' | carrier == 'AA' | carrier == 'DL') 33 | filter(flights, carrier %in% c('UA', 'AA', 'DL')) 34 | ``` 35 | 36 | #### 1.4. Departed in summer (July, August, and September) 37 | ```{r} 38 | filter(flights, month >= 7 & month <= 9) 39 | filter(flights, month %in% c(7, 8, 9)) 40 | ``` 41 | 42 | #### 1.5. Arrived more than two hours late, but didn't leave late 43 | ```{r} 44 | filter(flights, arr_delay > 120, dep_delay <= 0) 45 | ``` 46 | 47 | #### 1.6. Were delayed by at least an hour, but made up over 30 minutes in flight 48 | ```{r} 49 | filter(flights, dep_delay >= 60, dep_delay-arr_delay > 30) 50 | ``` 51 | 52 | #### 1.7. Departed between midnight and 6am (inclusive) 53 | ```{r} 54 | filter(flights, dep_time <=600 | dep_time == 2400) 55 | ``` 56 | 57 | ### 2. Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges? 58 | `between()` is a shorter, faster way of testing two inequalities at once: it tests if its first argument is greater than or equal to its second, and less than or equal to its third. 59 | ```{r} 60 | filter(flights, between(month, 7, 9)) 61 | filter(flights, !between(dep_time, 601, 2359)) 62 | ``` 63 | 64 | ### 3. How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent? 65 | ```{r} 66 | summary(flights) 67 | ``` 68 | 69 | 8255 flights have a missing `dep_time`, 8255 have a missing `dep_delay`, 8713 have a missing `arr_time`, 9430 have a missing `arr_delay`, and 9430 have a missing `air_time`. We can speculate that these are flights that failed to depart or arrive, since a flight that departs normally but is then rerouted will probably have a normally recorded departure but no similar record for its arrival. However, these could also just be lost data about perfectly normal flights. 70 | 71 | ### 4. 
Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!) 72 | `NA ^ 0` evaluates to 1 because anything to the power of 0 is 1, so although we didn't know the original value, we know it's being taken to the zeroth power. 73 | 74 | With `NA | TRUE`, since the `|` operator returns `TRUE` if either of the terms is true, the whole expression returns true because the right half returns true. This is easier to see in an expression like `NA | 5<10` (since 5 is indeed less than 10). 75 | 76 | For the next example, we know that `&` returns TRUE only when both terms are true. So, for example, `TRUE & TRUE` evaluates to `TRUE`. In `FALSE & NA`, one of the terms is false, so the expression evaluates to `FALSE`, as does something like `FALSE & TRUE`. 77 | 78 | `NA * 0` remains `NA`, arguably because the `NA` could represent `Inf`, and `Inf * 0` is `NaN` (Not a Number) rather than 0, so the result cannot be determined. However, I suspect that these results are dictated as much by what answer is natural, quick and sensible in C as by mathematical edge cases. 79 | 80 | 81 | # 4.3.1 Exercises 82 | ### 1. How could you use `arrange()` to sort all missing values to the start? (Hint: use `is.na()`). 83 | ```{r} 84 | df <- tibble(x = c(5, 2, NA)) 85 | arrange(df, desc(is.na(x))) 86 | arrange(df, -(is.na(x))) 87 | 88 | ``` 89 | 90 | ### 2. Sort flights to find the most delayed flights. Find the flights that left earliest. 91 | ```{r} 92 | arrange(flights, desc(dep_delay)) 93 | arrange(flights, dep_delay) 94 | 95 | ``` 96 | 97 | ### 3. Sort flights to find the fastest flights. 98 | ```{r} 99 | # Note - this is a bit tricky since the time stamps are just encoded as integers 100 | # so if a flight left at midnight (i.e. dep_time=2400) and arrived at 00:54 (arr_time=54), 101 | # it's hard to just do arr_time - dep_time to get the travel time (you get back -2346, which doesn't make sense). 
102 | # Taking absolute values doesn't help either. 103 | # A workaround solution is just to add 2400 if the travel time is ever negative. 104 | # A better solution is to properly encode the times as timestamps 105 | # note: we use the `mutate` function and the pipe character `%>%`, which haven't been introduced yet 106 | 107 | flights %>% mutate(travel_time = ifelse((arr_time - dep_time < 0), 108 | 2400+(arr_time - dep_time), 109 | arr_time - dep_time)) %>% 110 | arrange(travel_time) %>% select(arr_time, dep_time, travel_time) 111 | 112 | # for demonstration purposes, the naive solution is 113 | arrange(flights, (arr_time - dep_time)) 114 | 115 | ``` 116 | 117 | ### 4. Which flights travelled the longest? Which travelled the shortest? 118 | ```{r} 119 | # note: the `%>% select(1:5, distance)` is just so we can see the distance column, 120 | # which otherwise gets pushed off the console screen 121 | arrange(flights, desc(distance)) %>% select(1:5, distance) 122 | arrange(flights, distance) %>% select(1:5, distance) 123 | 124 | ``` 125 | 126 | # 4.4.1 Exercises 127 | ### 1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights. 
128 | ```{r} 129 | # standard ways (note: select_() is deprecated in recent dplyr versions) 130 | select(flights, dep_time, dep_delay, arr_time, arr_delay) 131 | select(flights, c(dep_time, dep_delay, arr_time, arr_delay)) 132 | flights %>% select(dep_time, dep_delay, arr_time, arr_delay) 133 | flights %>% select_("dep_time", "dep_delay", "arr_time", "arr_delay") 134 | flights %>% select_(.dots=c("dep_time", "dep_delay", "arr_time", "arr_delay")) 135 | 136 | # fancier ways 137 | flights %>% select(dep_time:arr_delay, -c(contains("sched"))) 138 | flights %>% select(ends_with("time"), ends_with("delay")) %>% select(-c(starts_with("sched"), starts_with("air"))) 139 | flights %>% select(contains("dep"), contains("arr"), -contains("sched"), -carrier) 140 | flights %>% select(matches("^dep|arr_delay|^arr_time$")) 141 | flights %>% select(matches("^dep|^arr")) 142 | flights %>% select(matches("^dep|^arr.*time$|delay$")) 143 | flights %>% select(matches("^dep|^arr_time$|delay$")) 144 | 145 | head(flights) 146 | ``` 147 | 148 | ### 2. What happens if you include the name of a variable multiple times in a `select()` call? 149 | ```{r} 150 | flights %>% select(dep_delay, dep_delay, dep_delay) 151 | ``` 152 | Nothing special happens; you just get the variable once. 153 | 154 | ### 3. What does the `one_of()` function do? Why might it be helpful in conjunction with this vector? 155 | It selects all the variables named in a character vector, which is useful when the names are stored in a variable. 156 | ```{r} 157 | vars <- c("year", "month", "day", "dep_delay", "arr_delay") 158 | flights %>% select(one_of(vars)) 159 | ``` 160 | 161 | ### 4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default? 162 | ```{r} 163 | select(flights, contains("TIME")) 164 | select(flights, contains("TIME", ignore.case = FALSE)) 165 | ``` 166 | By default, the select helpers are insensitive to case. This can be changed by setting `ignore.case=FALSE`. 167 | 168 | 169 | # 4.5.2 Exercises 170 | ### 1. 
Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight. 171 | ```{r} 172 | # with integer division 173 | mutate(flights, 174 | dep_time = (dep_time %/% 100) * 60 + (dep_time %% 100), 175 | sched_dep_time = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100)) 176 | 177 | # with rounding operations 178 | mutate(flights, 179 | dep_time = 60 * floor(dep_time/100) + (dep_time - floor(dep_time/100) * 100), 180 | sched_dep_time = 60 * floor(sched_dep_time/100) + (sched_dep_time - floor(sched_dep_time/100) * 100)) 181 | 182 | ``` 183 | 184 | ### 2. Compare `air_time` with `arr_time - dep_time`. What do you expect to see? What do you see? What do you need to do to fix it? 185 | - Firstly, we notice that if `arr_time` is in clock format, but `dep_time` is in minutes-after-midnight format, as per the previous question, we get the wrong answer. Obviously converting `arr_time` to minutes-after-midnight solves this problem. 186 | - Second, we find that some of the results of `arr_time - dep_time` are large negative numbers. This occurs when a flight sets off before midnight but arrives after it. We can deal with this by using modular arithmetic again (and assuming that no flights take off before midnight and land after midnight the day after.) 187 | - Finally, we find that `arr_time - dep_time` can vary significantly from `air_time`. 188 | ```{r} 189 | flights %>% 190 | mutate(dep_time = (dep_time %/% 100) * 60 + (dep_time %% 100), 191 | sched_dep_time = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100), 192 | arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100), 193 | sched_arr_time = (sched_arr_time %/% 100) * 60 + (sched_arr_time %% 100)) %>% 194 | transmute((arr_time - dep_time) %% (60*24) - air_time) 195 | 196 | ``` 197 | 198 | ### 3. Compare `dep_time`, `sched_dep_time`, and `dep_delay`. 
How would you expect those three numbers to be related? 199 | We would expect to find that `sched_dep_time + dep_delay == dep_time`. We find that in the vast majority of cases (99.99%), this is true. 200 | ```{r} 201 | flights %>% 202 | mutate(dep_time = (dep_time %/% 100) * 60 + (dep_time %% 100), 203 | sched_dep_time = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100), 204 | arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100), 205 | sched_arr_time = (sched_arr_time %/% 100) * 60 + (sched_arr_time %% 100)) %>% 206 | transmute(near((sched_dep_time + dep_delay) %% (60*24), dep_time, tol=1)) 207 | 208 | ``` 209 | 210 | ### 4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank(). 211 | There aren't actually any ties in the top 10 most delayed flights for departure and arrival, but if there had been a tie for 10th place, then `min_rank` could have produced more than 10 results. It is still the most honest method here, though, since it is better to produce a result that highlights a corner case like a tie than a result that hides it. 212 | ```{r} 213 | filter(flights, min_rank(desc(dep_delay))<=10) 214 | flights %>% top_n(n = 10, wt = dep_delay) 215 | ``` 216 | 217 | ### 5. What does `1:3 + 1:10` return? Why? 218 | `1:3 + 1:10` produces a length 10 vector and a warning message. This is because the shorter vector is repeated out to the length of the longer one. Because 10 doesn't divide exactly by 3, the vectors do not line up properly and we get a warning (not an error). This automatic vector extension is most commonly useful when one of the vectors is of length 1. 219 | 220 | ### 6. What trigonometric functions does R provide? 221 | ```{r} 222 | ?Trig 223 | ``` 224 | Using `?Trig`, we can find a list of trigonometric functions provided by base R. Examples include `cos(x)`, `acos(x)`, `cospi(x)`. 225 | 226 | 227 | # 4.6.7 Exercises 228 | ### 1. 
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios: 229 | - A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time. 230 | - A flight is always 10 minutes late. 231 | - A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time. 232 | - 99% of the time a flight is on time. 1% of the time it’s 2 hours late. 233 | - Which is more important: arrival delay or departure delay? 234 | 235 | We're not clear what this question means. Assuming we are interested in *arrival* delays, we can calculate the following summary variables for all flights, using `arr_delay` (already in minutes) rather than subtracting the clock-format times, and ignoring missing values: 236 | 237 | ```{r} 238 | str(flights) 239 | head(flights) 240 | flight_delay_summary <- group_by(flights, flight) %>% summarise(num_flights = n(), 241 | percentage_on_time = sum(arr_delay == 0, na.rm = TRUE)/num_flights, 242 | percentage_early = sum(arr_delay < 0, na.rm = TRUE)/num_flights, 243 | percentage_15_mins_early = sum(arr_delay == -15, na.rm = TRUE)/num_flights, 244 | percentage_late = sum(arr_delay > 0, na.rm = TRUE)/num_flights, 245 | percentage_15_mins_late = sum(arr_delay == 15, na.rm = TRUE)/num_flights, 246 | percentage_2_hours_late = sum(arr_delay == 120, na.rm = TRUE)/num_flights) 247 | flight_delay_summary 248 | ``` 249 | 250 | Using this, we can then answer the preceding questions, e.g. a flight that is 15 minutes early 50% of the time, and 15 minutes late 50% of the time can be found using: 251 | 252 | ```{r} 253 | flight_delay_summary %>% filter(percentage_15_mins_early == 0.5 & percentage_15_mins_late == 0.5) 254 | 255 | ``` 256 | 257 | As for whether arrival delay or departure delay is more important: from the individual perspective this may be a matter of personal taste, and from the business perspective we would need data on associated costs of both types of delay (monetary, customer satisfaction hits, etc.) to reason about relative importance. 258 | 259 | ### 2. 
Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`). 260 | ```{r} 261 | not_cancelled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay)) 262 | 263 | not_cancelled %>% 264 | group_by(dest) %>% 265 | tally() 266 | 267 | not_cancelled %>% 268 | group_by(tailnum) %>% 269 | summarise(n = sum(distance)) 270 | ``` 271 | 272 | Using `group_by` and `summarise` instead of `count` is more verbose, but it can be clearer, especially in more complex situations. 273 | 274 | ### 3. Our definition of cancelled flights (`!is.na(dep_delay) & !is.na(arr_delay)`) is slightly suboptimal. Why? Which is the most important column? 275 | There are no flights which arrived but did not depart, so we can just use `!is.na(dep_delay)`. 276 | ```{r} 277 | flights %>% 278 | group_by(departed = !is.na(dep_delay), arrived = !is.na(arr_delay)) %>% 279 | summarise(n=n()) 280 | 281 | ``` 282 | 283 | ### 4. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? 284 | ```{r} 285 | flights %>% 286 | mutate(dep_date = lubridate::make_datetime(year, month, day)) %>% 287 | group_by(dep_date) %>% 288 | summarise(cancelled = sum(is.na(dep_delay)), 289 | n = n(), 290 | mean_dep_delay = mean(dep_delay,na.rm=TRUE), 291 | mean_arr_delay = mean(arr_delay,na.rm=TRUE)) %>% 292 | ggplot(aes(x= cancelled/n)) + 293 | geom_point(aes(y=mean_dep_delay), colour='blue', alpha=0.5) + 294 | geom_point(aes(y=mean_arr_delay), colour='red', alpha=0.5) + 295 | ylab('mean delay (minutes)') 296 | ``` 297 | 298 | We can see that on most days, there is not a strong relationship between cancellations and delay, but if one is unusually high, then the other probably is, too. 299 | 300 | ### 5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? 
(Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`) 301 | There are 16 carriers, 3 origin airports, and 105 destination airports in this dataset. For many destination airports, there are only one or two carriers that fly there, so it is difficult to tell how much of the delay is due to the carrier, and how much is due to the airport (busy destination airports can force planes to loiter longer before there is a free landing slot). We also can't necessarily tell how much of the delay is due to the route, versus the airport itself. This makes attributing the cause of in-flight delays difficult. 302 | ```{r} 303 | flights %>% 304 | filter(arr_delay > 0) %>% 305 | group_by(carrier) %>% 306 | summarise(average_arr_delay = mean(arr_delay, na.rm=TRUE)) %>% 307 | arrange(desc(average_arr_delay)) 308 | 309 | flights %>% 310 | summarise(n_distinct(carrier), 311 | n_distinct(origin), 312 | n_distinct(dest)) 313 | 314 | ``` 315 | 316 | ### 6. For each plane, count the number of flights before the first delay of greater than 1 hour. 317 | ```{r} 318 | flights %>% 319 | mutate(dep_date = lubridate::make_datetime(year, month, day)) %>% 320 | group_by(tailnum) %>% 321 | arrange(dep_date) %>% 322 | filter(!cumany(arr_delay>60)) %>% 323 | tally(sort = TRUE) 324 | ``` 325 | 326 | ### 7. What does the sort argument to count() do? When might you use it? 327 | The `sort` argument to `count()` sorts by descending order of `n`. This is useful because often the most common group is the most important. 328 | 329 | 330 | # 4.7.1 Exercises 331 | ### 1. Refer back to the table of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping. 332 | It is not clear which table this refers to. 333 | 334 | ### 2. Which plane (tailnum) has the worst on-time record? 
335 | ```{r} 336 | flights %>% 337 | group_by(tailnum) %>% 338 | summarise(prop_on_time = sum(arr_delay <= 30 & !is.na(arr_delay))/n(), 339 | mean_arr_delay = mean(arr_delay, na.rm=TRUE), 340 | flights = n()) %>% 341 | arrange(prop_on_time, desc(mean_arr_delay)) 342 | 343 | flights %>% 344 | group_by(tailnum) %>% 345 | filter(all(is.na(arr_delay))) %>% 346 | tally(sort=TRUE) 347 | ``` 348 | 349 | Many of the planes have never arrived on time, and 7 have never arrived at all. These are planes for which we do not have much data, so there's no clear answer to which plane is worst unless we limit ourselves to some arbitrary threshold of number of recorded flights. 350 | 351 | ### 3. What time of day should you fly if you want to avoid delays as much as possible? 352 | ```{r} 353 | flights %>% 354 | ggplot(aes(x=factor(hour), fill=arr_delay>5 | is.na(arr_delay))) + geom_bar() 355 | ``` 356 | 357 | We can see that the highest probability of delay as a proportion of total flights is in the late evening. We could hypothesize that this is due to accumulated knock-on delays during the day, the difficulties of flying at night, or these flights being typically longer distance. 358 | 359 | ### 4. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using `lag()` explore how the delay of a flight is related to the delay of the immediately preceding flight. 360 | ```{r} 361 | flights %>% 362 | mutate(new_sched_dep_time = lubridate::make_datetime(year, month, day, hour, minute)) %>% 363 | group_by(origin) %>% 364 | arrange(new_sched_dep_time) %>% 365 | mutate(prev_flight_dep_delay = lag(dep_delay)) %>% 366 | ggplot(aes(x=prev_flight_dep_delay, y= dep_delay)) + geom_point() 367 | 368 | 369 | ``` 370 | 371 | ### 5. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). 
Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air? 372 | ```{r} 373 | flights %>% 374 | mutate(new_sched_dep_time = lubridate::make_datetime(year, month, day, hour, minute)) %>% 375 | group_by(origin) %>% 376 | arrange(new_sched_dep_time) %>% 377 | mutate(prev_flight_dep_delay = lag(dep_delay)) %>% 378 | lm(dep_delay ~ prev_flight_dep_delay, data = .) %>% summary() 379 | ``` 380 | 381 | Returning to the previous question on consecutive delays, we find that there is a weak correlation between a flight's departure delay and that of the preceding flight, but that due to the number of rows, we can be highly confident of a predictive relationship. 382 | 383 | ### 6. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers. 384 | ```{r} 385 | flights %>% 386 | group_by(dest) %>% 387 | filter(n_distinct(carrier)>=2) %>% 388 | group_by(carrier) %>% 389 | summarise(possible_transfers = n_distinct(dest)) %>% 390 | arrange(desc(possible_transfers)) 391 | ``` 392 | 393 | --------------------------------------------------------------------------------