├── .gitignore
├── ggplot2.Rproj
├── ggplot2_solutions_chapter10.Rmd
├── ggplot2_solutions_chapter11.Rmd
├── ggplot2_solutions_chapter12.Rmd
├── ggplot2_solutions_chapter2.Rmd
├── ggplot2_solutions_chapter3.Rmd
├── ggplot2_solutions_chapter4.Rmd
├── ggplot2_solutions_chapter5.Rmd
├── ggplot2_solutions_chapter6.Rmd
├── ggplot2_solutions_chapter7.Rmd
├── ggplot2_solutions_chapter8.Rmd
├── ggplot2_solutions_chapter9.Rmd
└── random practice notes.R


/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 | 


--------------------------------------------------------------------------------
/ggplot2.Rproj:
--------------------------------------------------------------------------------
 1 | Version: 1.0
 2 | 
 3 | RestoreWorkspace: Default
 4 | SaveWorkspace: Default
 5 | AlwaysSaveHistory: Default
 6 | 
 7 | EnableCodeIndexing: Yes
 8 | UseSpacesForTab: Yes
 9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 | 
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 | 


--------------------------------------------------------------------------------
/ggplot2_solutions_chapter10.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "ggplot2_solutions_chapter10"
  3 | author: "Nade Kang"
  4 | date: "July 13, 2018"
  5 | output: html_document
  6 | ---
  7 | 
  8 | # Solution Manual for ggplot2 Elegant Graphics for Data Analysis by Hadley Wickham
  9 | # ggplot2 Chapter 10 Data Transformation
 10 | 
 11 | ## *Exercise 10.2.3*
 12 | ### Load Packages
 13 | ```{r setup, results='hide'}
 14 | library(tidyverse)
 15 | library(ggthemes)
 16 | ```
 17 | 
 18 | **_Question 1_** Practice your filtering skills by:
 19 | - Finding all the diamonds with equal x and y dimensions.
 20 | ```{r e.10.2.3.1_1}
 21 | diamonds_xey <- filter(diamonds, x == y)
 22 | head(diamonds_xey)
 23 | ```
 24 | 
 25 | - A depth between 55 and 70.
 26 | Here, depth refers to the `depth` column (total depth percentage), not the z dimension.
 27 | Assuming "between" is inclusive:
 28 | ```{r e.10.2.3.1_2}
 29 | diamonds_depth5570 <- diamonds %>% filter(depth >= 55, depth <= 70)
 30 | head(diamonds_depth5570)
 31 | ```
 32 | 
 33 | Filtering on z instead returns no rows, because z is a length in mm that never reaches 55.
 34 | 
 35 | - A carat smaller than the median carat.
 36 | ```{r e.10.2.3.1_3}
 37 | dia_cara_lessmedian <- filter(diamonds, carat < median(diamonds$carat))
 38 | head(dia_cara_lessmedian)
 39 | ```
 40 | 
 41 | - Cost more than $10,000 per carat
 42 | ```{r e.10.2.3.1_4}
 43 | more10000pc <- filter(diamonds, price/carat > 10000)
 44 | head(more10000pc)
 45 | ```
 46 | 
 47 | 
 48 | - Are of good or better quality
 49 | Good or better means excluding Fair:
 50 | ```{r e.10.2.3.1_5}
 51 | besseralsgut <- filter(diamonds, cut != "Fair")
 52 | head(besseralsgut)
 53 | ```
 54 | 
 55 | 
 56 | **_Question 2_** Fill in the question marks in this table:
 57 | 
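 58 | Expression   TRUE   FALSE   NA
 59 | x            x
 60 | !x                  x
 61 | is.na(x)                    x
 62 | !is.na(x)    x      x
 63 | is.na(x)|x   x              x
 64 | is.na(x)|!x         x       x
 65 | 
 66 | Each column is a possible value of x, and a mark shows where the expression evaluates to TRUE; for example, `!is.na(x)` is TRUE when x is TRUE or FALSE, but not when x is NA.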
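
The table can be checked directly in R. This is a minimal sketch; the vector `x` and the name `truth_table` are illustrative and do not come from the book:

```r
# One element per column of the table: x is TRUE, FALSE, or NA
x <- c(TRUE, FALSE, NA)

# Evaluate each expression from the table on all three values of x
truth_table <- rbind(
  "x"           = x,
  "!x"          = !x,
  "is.na(x)"    = is.na(x),
  "!is.na(x)"   = !is.na(x),
  "is.na(x)|x"  = is.na(x) | x,
  "is.na(x)|!x" = is.na(x) | !x
)
colnames(truth_table) <- c("TRUE", "FALSE", "NA")
truth_table
```

A cell that prints TRUE corresponds to a mark in the table. filter() keeps only rows where the expression is TRUE, so rows evaluating to FALSE or NA are both dropped.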
 67 | 
 68 | **_Question 3_** Repeat the analysis of outlying values with z. Compared to x and y, how
 69 | would you characterise the relationship of x and z, or y and z?
 70 | 
 71 | We make a first plot to take a look at the data:
 72 | ```{r}
 73 | ggplot(diamonds, aes(x, z)) +
 74 |   geom_jitter()
 75 | ```
 76 | 
 77 | Then we know to extract x > 0 and z > 0, with z < 10:
 78 | ```{r e.10.2.3.3_plot1}
 79 | diamonds_zok <- filter(diamonds, x > 0, z > 0, z < 10)
 80 | ggplot(diamonds_zok, aes(x, z)) +
 81 |   geom_bin2d() +
 82 |   geom_abline(slope = 0.6, color = "white", size = 1, alpha = 0.5)
 83 | ```
 84 | 
 85 | Despite the outliers, there seems to be a positive relationship between x and z.
 86 | 
 87 | Now let's take a look at y and z:
 88 | ```{r}
 89 | ggplot(diamonds, aes(y, z)) +
 90 |   geom_jitter()
 91 | ```
 92 | 
 93 | We need to extract 0 < z < 10 and 0 < y < 15:
 94 | ```{r e.10.2.3.3_plot2}
 95 | diamonds_yz <- filter(diamonds, z > 0 & z < 10, y > 0 & y < 15)
 96 | ggplot(diamonds_yz, aes(y, z)) +
 97 |   geom_bin2d() +
 98 |   geom_abline(slope = 0.65, color = "white", size = 1, alpha = 0.5)
 99 | ```
100 | 
101 | 
102 | **_Question 4_** Install the ggplot2movies package and look at the movies that have a
103 | missing budget. How are they different from the movies with a budget?
104 | (Hint: try a frequency polygon plus colour = is.na(budget).)
105 | 
106 | ```{r}
107 | library(ggplot2movies)
108 | movies <- ggplot2movies::movies
109 | movies
110 | movies2 <- movies %>% 
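111 |   mutate(hasbudget = ifelse(is.na(budget), "Missing Budget", "Has Budget"))
112 | movies2
113 | ggplot(movies2, aes(year)) +
114 |   geom_freqpoly(aes(color = hasbudget), binwidth = 1)
115 | ```
116 | 
117 | Missing budgets are stored as NA, not 0, so the groups are defined with is.na(budget). Plotting another variable (here year) by group shows how the two sets of movies differ.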
118 | 
119 | **_Question 5_** What is NA & FALSE and NA | TRUE? Why? Why doesn’t NA * 0 equal zero?
120 | What number times zero does not equal 0? What do you expect NA ^ 0 to
121 | equal? Why?
122 | ```{r}
123 | NA & FALSE
124 | # FALSE
125 | NA | TRUE
126 | # TRUE
127 | ```
128 | This is because R's logical operators follow three-valued (Kleene) logic: the result is known whenever it does not depend on the missing value. See:
129 | https://en.wikipedia.org/wiki/Three-valued_logic
130 | 
131 | ```{r}
132 | NA * 0
133 | # NA
134 | NA ^ 0
135 | # 1
136 | ```
137 | 
138 | NA * 0 is not 0 because the missing value could be Inf or -Inf, and Inf * 0 is NaN, so the result is unknown.
139 | NA ^ 0 is 1 because every number raised to the power 0 is 1, including Inf.
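
A short base R check of these two cases, offered as an illustration rather than as part of the book's solution:

```r
# If the missing value happened to be infinite, the product is undefined
Inf * 0    # NaN
-Inf * 0   # NaN

# So R cannot know the answer when the value is missing
NA * 0     # NA

# Every numeric value raised to the power 0 gives 1, so NA ^ 0 is known
c(2, -3, 0, Inf, NaN) ^ 0   # all 1
NA ^ 0                      # 1
```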
140 | 
141 | ## *Exercises 10.3.2*
142 | **_Question 1_** Practice your variable creation skills by creating the following new variables:
143 | - The approximate volume of the diamond (using x, y, and z).
144 | 
145 | Assuming the volume of a diamond is approximately 2/3 of a cuboid with the same dimensions:
146 | ```{r e.10.3.2.1_1}
147 | diamonds_q1 <- diamonds %>% 
148 |   mutate(vol = x * y * z *(2/3))
149 | head(select(diamonds_q1, carat, cut, x:z, vol))
150 | ```
151 | 
152 | 
153 | - The approximate density of the diamond.
154 | One carat is 0.2 gram. Because x, y and z are in mm, the resulting density is in g/mm^3:
155 | ```{r e.10.3.2.1_2}
156 | diamonds_q1 <- diamonds_q1 %>% 
157 |   mutate(density = carat * 0.2 / vol)
158 | head(select(diamonds_q1, carat, cut, vol, density))
159 | ```
160 | 
161 | 
162 | - The price per carat.
163 | Price per carat is simply price/carat
164 | ```{r e.10.3.2.1_3}
165 | diamonds_q1 <- diamonds_q1 %>% 
166 |   mutate(pricepc = price/carat)
167 | head(select(diamonds_q1, carat, price, vol, density, pricepc))
168 | ```
169 | 
170 | 
171 | - Log transformation of carat and price.
172 | ```{r e.10.3.2.1_4}
173 | diamonds_q1 <- diamonds_q1 %>% 
174 |   mutate(caratl = log2(carat),
175 |          pricel = log2(price))
176 | head(select(diamonds_q1, carat, price, caratl, pricel))
177 | ```
178 | 
179 | 
180 | **_Question 2_** How can you improve the data density of ggplot(diamonds, aes(x, z)) +
181 | stat_bin2d(). What transformation makes it easier to extract outliers?
182 | 
183 | ```{r}
184 | ggplot(diamonds, aes(x, z)) +
185 |   stat_bin2d()
186 | ```
187 | 
188 | I am not quite sure about what the question is asking for.
189 | 
190 | **_Question 3_** The depth variable is just the width of the diamond (average of x and y)
191 | divided by its height (z) multiplied by 100 and round to the nearest integer.
192 | Compute the depth yourself and compare it to the existing depth variable.
193 | Summarise your findings with a plot.
194 | 
195 | Depth is calculated as height (z) divided by width (average of x and y), times 100. The wording
196 | of the question has the ratio inverted.
197 | ```{r e.10.3.2.3}
198 | diamonds_q1 <- diamonds_q1 %>% 
199 |   mutate(depth_cal = round(z / ((x + y)/2) * 100))
200 | head(select(diamonds_q1, depth_cal, depth))
201 | ```
202 | 
203 | Make a plot:
204 | We calculate the difference between the given depth and our calculated depth_cal, then plot it.
205 | Positive values mean our calculation is below the original depth;
206 | negative values mean it is above.
207 | ```{r e.10.3.2.3_plot}
208 | diamonds_q1 <- diamonds_q1 %>% 
209 |   mutate(diff = depth - depth_cal,
210 |          seq = seq_along(carat))
211 | diamonds_q1
212 | ggplot(diamonds_q1) +
213 |   geom_area(aes(x = seq, y = diff)) +
214 |   xlim(800, 1500) + ylim(-2, 2)
215 | ```
216 | 
217 | You can adjust xlim and ylim to zoom in on different areas of the plot. The values here
218 | are only one example.
219 | 
220 | **_Question 4_** Compare the distribution of symmetry for diamonds with x > y vs. x < y.
221 | ```{r e.10.3.2.4_plot1}
222 | diamonds_q1 %>%
223 |   filter(x > y) %>%
224 |   ggplot(aes(x)) +
225 |   geom_histogram(binwidth = 0.2)
226 | ```
227 | 
228 | ```{r e.10.3.2.4_plot2}
229 | diamonds_q1 %>%
230 |   filter(y > x) %>%
231 |   ggplot(aes(x)) +
232 |   geom_histogram(binwidth = 0.2)
233 | ```
234 | 
235 | 
236 | ## *Exercises 10.4.3*
237 | **_Question 1_** For each year in the ggplot2movies::movies data determine the percent of
238 | movies with missing budgets. Visualise the result.
239 | ```{r view, eval=FALSE}
240 | movies <- ggplot2movies::movies
241 | View(movies)
242 | ```
243 | 
244 | We need to create a boolean column called with_budget, so if the movie has a budget, we place TRUE,
245 | else we place FALSE
246 | 
247 | ```{r e.10.4.3.1}
248 | movies1 <- movies %>% 
249 |   mutate(with_budget = ifelse(is.na(budget), FALSE, TRUE)) %>%
250 |   group_by(year) %>%
251 |   summarise(
252 |     hasBudget = sum(with_budget),
253 |     total = n(),
254 |     propNoBudget = 1 - hasBudget/total
255 |   )
256 | ggplot(movies1, aes(year, propNoBudget)) +
257 |   geom_area(aes(fill = "linen")) +
258 |   labs(fill = expression(atop("No Budget"))) +
259 |   theme(legend.title = element_text(vjust = 0))
260 |   
261 | movies1
262 | ```
263 | 
264 | As you can see from the plot, the percentage of movies with missing budget decreased with
265 | the progression of time.
266 | 
267 | **_Question 2_** How does the average length of a movie change over time? Display your
268 | answer with a plot, including some display of uncertainty.
269 | 
270 | We only need to group the data by year, and then calculate the mean of length, then use a 
271 | line plot (since we have time and a continuous variable).
272 | 
273 | ```{r e.10.4.3.2}
274 | q2 <- movies %>% 
275 |   group_by(year) %>% 
276 |   summarise(mean = mean(length))
277 | q2
278 | ggplot(q2, aes(year, mean)) +
279 |   geom_line()
280 | ```
281 | 
282 | As we can see, the average length of movies increased from 1890s to now.
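
The question also asks for a display of uncertainty. One possible sketch, assuming the ggplot2movies package is installed: it draws a band of two standard errors around the yearly mean. The names `length_by_year`, `mean_length` and `se` are illustrative, and the two-standard-error band is only one choice among several:

```r
library(dplyr)
library(ggplot2)

movies <- ggplot2movies::movies

# Yearly mean length plus its standard error (sd / sqrt(n))
length_by_year <- movies %>%
  group_by(year) %>%
  summarise(
    n = n(),
    mean_length = mean(length, na.rm = TRUE),
    se = sd(length, na.rm = TRUE) / sqrt(n)
  )

# Ribbon spans mean +/- 2 standard errors; years with one movie have no se
p_length <- ggplot(length_by_year, aes(year, mean_length)) +
  geom_ribbon(aes(ymin = mean_length - 2 * se, ymax = mean_length + 2 * se),
              alpha = 0.3) +
  geom_line()
p_length
```

The band is wide where a year has few movies and narrow where it has many, which helps judge how much of the trend to trust.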
283 | 
284 | **_Question 3_** For each combination of diamond quality (e.g. cut, colour and clarity),
285 | count the number of diamonds, the average price and the average size.
286 | Visualise the results.
287 | 
288 | Assume size means depth
289 | ```{r e.10.4.3.3_cut}
290 | diamonds_q3 <- diamonds %>% 
291 |   group_by(cut) %>%
292 |   summarise(
293 |     number = n(),
294 |     avgprice = mean(price),
295 |     avgsize = mean(depth)
296 |   )
297 | diamonds_q3
298 | 
299 | ggplot(diamonds, aes(cut)) +
300 |   geom_bar() +
301 |   ggtitle("Number of diamonds in each cut")
302 | 
303 | ggplot(diamonds_q3, aes(cut, avgprice)) +
304 |   geom_bar(stat = "identity") +
305 |   ggtitle("Average Price of Each Cut")
306 | 
307 | ggplot(diamonds_q3, aes(cut, avgsize)) +
308 |   geom_bar(stat = "identity") +
309 |   ggtitle("Average Size of Each Cut")
310 | 
311 | ```
312 | 
313 | ```{r e.10.4.3.3_color}
314 | diamonds_q3 <- diamonds %>% 
315 |   group_by(color) %>%
316 |   summarise(
317 |     number = n(),
318 |     avgprice = mean(price),
319 |     avgsize = mean(depth)
320 |   )
321 | 
322 | ggplot(diamonds, aes(color)) +
323 |   geom_bar() +
324 |   ggtitle("Number of diamonds in Each Color")
325 | 
326 | ggplot(diamonds_q3, aes(color, avgprice)) +
327 |   geom_bar(stat = "identity") +
328 |   ggtitle("Average Price of Each Color")
329 | 
330 | ggplot(diamonds_q3, aes(color, avgsize)) +
331 |   geom_bar(stat = "identity") +
332 |   ggtitle("Average Size of Each Color")
333 | ```
334 | 
335 | ```{r e.10.4.3.3_clarity}
336 | diamonds_q3 <- diamonds %>% 
337 |   group_by(clarity) %>%
338 |   summarise(
339 |     number = n(),
340 |     avgprice = mean(price),
341 |     avgsize = mean(depth)
342 |   )
343 | 
344 | ggplot(diamonds, aes(clarity)) +
345 |   geom_bar() +
346 |   ggtitle("Number of diamonds in Each Clarity")
347 | 
348 | ggplot(diamonds_q3, aes(clarity, avgprice)) +
349 |   geom_bar(stat = "identity") +
350 |   ggtitle("Average Price of Each Clarity")
351 | 
352 | ggplot(diamonds_q3, aes(clarity, avgsize)) +
353 |   geom_bar(stat = "identity") +
354 |   ggtitle("Average Size of Each Clarity")
355 | ```
356 | 
357 | 
358 | **_Question 4_** Compute a histogram of carat by “hand” using a binwidth of 0.1. Display
359 | the results with geom_bar(stat = "identity"). (Hint: you might need to
360 | create a new variable first.)
361 | 
362 | Not sure about the best way to compute by hand.
363 | 
364 | ```{r e.10.4.3.4}
365 | ggplot(diamonds, aes(carat)) +
366 |   geom_histogram(binwidth = 0.1)
367 | ```
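
One way to compute the bins by hand is sketched below. The names `carat_bins` and `bin` are illustrative, and the bin edges are assumed to sit at multiples of 0.1, which is not how geom_histogram() places its default bins, so the two plots can differ slightly:

```r
library(dplyr)
library(ggplot2)

# Work in whole hundredths of a carat to avoid floating point surprises
# (e.g. 0.3 / 0.1 is slightly less than 3), then drop the last digit
carat_bins <- diamonds %>%
  mutate(bin = (round(carat * 100) %/% 10) / 10) %>%
  count(bin)

# bin is the left edge of each interval, so shift bars by half a binwidth
p_hand <- ggplot(carat_bins, aes(bin + 0.05, n)) +
  geom_bar(stat = "identity", width = 0.1) +
  xlab("carat")
p_hand
```

Each row of `carat_bins` holds the left edge of a bin and the number of diamonds in the interval from that edge up to, but not including, the next one.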
368 | 
369 | 
370 | **_Question 5_** In the baseball example, the batting average seems to increase as the
371 | number of at bats increases. Why?
372 | 
373 | Not familiar with baseball, but practice makes perfect?
374 | 
375 | ## *Exercises 10.5.1*
376 | **_Question 1_** Translate each of the examples in this chapter to use the pipe.
377 | 
378 | Most of my code above already uses pipes; translating every example would be too much.
379 | 
380 | **_Question 2_** What does the following pipe do?
381 | ```{r, eval=FALSE}
382 | library(magrittr)
383 | x <- runif(100)
384 | x %>%
385 | subtract(mean(.)) %>%
386 | raise_to_power(2) %>%
387 | mean() %>%
388 | sqrt()
389 | ```
390 | 
391 | It loads the magrittr package, then assigns 100 uniformly distributed random values between
392 | 0 and 1 to x. It subtracts the mean of x from each value, raises the differences to the power
393 | of 2, takes the mean of the result, and finally takes the square root. This computes the
394 | population STANDARD DEVIATION of x (the square root of the variance, dividing by n rather than n - 1).
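
The result of the pipe can be compared with base R's sd(), which divides by n - 1 instead of n. This is a small check under that assumption; the seed and the names `pipe_result` and `population_sd` are arbitrary:

```r
library(magrittr)

set.seed(1)  # arbitrary seed, only to make the run repeatable
x <- runif(100)

# The pipe from the question: root of the mean squared deviation
pipe_result <- x %>%
  subtract(mean(.)) %>%
  raise_to_power(2) %>%
  mean() %>%
  sqrt()

# sd() uses n - 1, so rescale it to the n-denominator version
n <- length(x)
population_sd <- sd(x) * sqrt((n - 1) / n)

all.equal(pipe_result, population_sd)
```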
395 | 
396 | **_Question 3_** Which player in the Batting dataset has had the most consistently good
397 | performance over the course of their career?
398 | 
399 | Not familiar with baseball, but one approach is to choose a criterion, group the data by player, and arrange by that criterion in descending order.


--------------------------------------------------------------------------------
/ggplot2_solutions_chapter11.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "ggplot2_solutions_chapter11"
  3 | author: "Nade Kang"
  4 | date: "July 14, 2018"
  5 | output: html_document
  6 | ---
  7 | 
  8 | # Solution Manual for ggplot2 Elegant Graphics for Data Analysis by Hadley Wickham
  9 | # ggplot2 Chapter 11 Modeling for Visualization
 10 | 
 11 | ## *Exercise 11.2.1*
 12 | ### Load Packages
 13 | ```{r setup, results='hide'}
 14 | library(tidyverse)
 15 | library(ggthemes)
 16 | ```
 17 | 
 18 | **_Question 1_** What happens if you repeat the above analysis with all diamonds? (Not
 19 | just all diamonds with two or fewer carats). What does the strange geometry
 20 | of log(carat) vs relative price represent? What does the diagonal line
 21 | without any points represent?
 22 | 
 23 | Applying the analysis to all diamonds, without filtering diamonds with <= 2 carat.
 24 | ```{r e.11.2.1.1}
 25 | diamonds_all <- diamonds %>% 
 26 |   mutate(lprice = log2(price),
 27 |          lcarat = log2(carat))
 28 | head(select(diamonds_all, carat, price, lcarat, lprice))
 29 | 
 30 | ggplot(diamonds_all, aes(lcarat, lprice)) +
 31 |   geom_bin2d() +
 32 |   geom_smooth(method = "lm", color = "linen", size = 2, se = FALSE) +
 33 |   ggtitle("Log Price Plotted Against Log Carat\n for All Diamonds") +
 34 |   theme(plot.title = element_text(hjust = 0.5))
 35 | 
 36 | mod_all <- lm(lprice ~ lcarat, data = diamonds_all)
 37 | coef(summary(mod_all))
 38 | 
 39 | diamonds_all <- diamonds_all %>% 
 40 |   mutate(rel_price = resid(mod_all))
 41 | 
 42 | ggplot(diamonds_all, aes(lcarat, rel_price)) +
 43 |   geom_bin2d() +
 44 |   ggtitle("Relative Price Plotted Against Log Carat\n for All Diamonds") +
 45 |   theme(plot.title = element_text(hjust = 0.5))
 46 | 
 47 | ```
 48 | 
 49 | As the log carat increases, the relative price drops gradually below zero. This means that 
 50 | diamonds with large carat are cheaper than expected.
 51 | 
 52 | **_Question 2_** I made an unsupported assertion that lower-quality diamonds tend to be
 53 | larger. Support my claim with a plot.
 54 | 
 55 | ```{r e.11.2.1.2}
 56 | ggplot(diamonds_all, aes(color, carat)) +
 57 |   geom_boxplot(aes(color = cut))
 58 | ```
 59 | 
 60 | Indeed, from this boxplot of carat against color, grouped by cut, we can see that,
 61 | judging by color, D (the best) has smaller carat, while H, I and J have larger carat.
 62 | In addition, Fair cut is generally larger than other cuts in terms of carat.
 63 | 
 64 | **_Question 3_** Can you create a plot that simultaneously shows the effect of colour, cut,
 65 | and clarity on relative price? If there’s too much information to show on
 66 | one plot, think about how you might create a sequence of plots to convey
 67 | the same message.
 68 | 
 69 | We can use facet_wrap() to help with the plot:
 70 | 
 71 | ```{r e.11.2.1.3}
 72 | ggplot(diamonds_all, aes(cut, rel_price)) +
 73 |   geom_boxplot(aes(color = color)) +
 74 |   facet_wrap(~clarity)
 75 | ```
 76 | 
 77 | 
 78 | **_Question 4_** How do depth and table relate to the relative price?
 79 | 
 80 | ```{r e.11.2.1.4_plot1}
 81 | ggplot(diamonds_all, aes(depth, rel_price)) +
 82 |   geom_point()
 83 | ```
 84 | 
 85 | Depth doesn't show a clear pattern with relative price: for
 86 | a given depth, the relative price can be high or low. This is because the price
 87 | of a diamond usually depends on color, cut, and clarity rather than on depth.
 88 | 
 89 | ```{r e.11.2.1.4_plot2}
 90 | ggplot(diamonds_all, aes(table, rel_price)) +
 91 |   geom_point()
 92 | ```
 93 | 
 94 | Table and relative price show an interesting pattern. As we can see from the plot, as table
 95 | increases from 50 to about 59, the spread of relative price grows, and then shrinks as table
 96 | continues to increase.
 97 | 
 98 | ## *Exercises 11.3.1*
 99 | 
100 | **_Question 1_** The final plot shows a lot of short-term noise in the overall trend. How
101 | could you smooth this further to focus on long-term changes?
102 | ```{r}
103 | deseas <- function(x, month) {
104 |   resid(lm(x ~ factor(month), na.action = na.exclude))
105 | }
106 | 
107 | txhousing <- txhousing %>% 
108 |   group_by(city) %>% 
109 |   mutate(rel_sales = deseas(log(sales), month))
110 | 
111 | ggplot(txhousing, aes(date, rel_sales)) + 
112 |   geom_line(aes(group = city), alpha = 1/5) + 
113 |   geom_line(stat = "summary", fun.y = "mean", colour = "red")
114 | 
115 | ggplot(txhousing, aes(date, rel_sales)) + 
116 |   geom_line(aes(group = city), alpha = 1/5) + 
117 |   geom_line(stat = "summary", fun.y = "mean", colour = "red") +
118 |   geom_smooth(method = "loess", se = FALSE)
119 | 
120 | ```
121 | 
122 | 
123 | **_Question 2_** If you look closely (e.g. + xlim(2008, 2012)) at the long-term trend you’ll
124 | notice a weird pattern in 2009–2011. It looks like there was a big dip in
125 | 2010. Is this dip “real”? (i.e. can you spot it in the original data)
126 | 
127 | **_Question 3_** What other variables in the TX housing data show strong seasonal effects?
128 | Does this technique help to remove them?
129 | 
130 | **_Question 4_** Not all the cities in this data set have complete time series. Use your dplyr
131 | skills to figure out how much data each city is missing. Display the results
132 | with a visualisation.
133 | 
134 | **_Question 5_** Replicate the computation that stat_summary() did with dplyr so you can
135 | plot the data “by hand”.


--------------------------------------------------------------------------------
/ggplot2_solutions_chapter12.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "ggplot2_solutions_chapter12"
 3 | author: "Nade Kang"
 4 | date: "July 14, 2018"
 5 | output: html_document
 6 | ---
 7 | 
 8 | # Solution Manual for ggplot2 Elegant Graphics for Data Analysis by Hadley Wickham
 9 | # ggplot2 Chapter 12 Programming with ggplot2
10 | 
11 | ## *Exercise 12.2.1*
12 | ### Load Packages
13 | ```{r setup, results='hide'}
14 | library(tidyverse)
15 | library(ggthemes)
16 | ```
17 | 
18 | **_Question 1_** Create an object that represents a pink histogram with 100 bins.
19 | 
20 | We name this object as hist100:
21 | ```{r e.12.2.1.1}
22 | hist100 <- geom_histogram(bins  = 100, color = "pink", fill = "pink")
23 | 
24 | # Test this object with diamonds:
25 | ggplot(diamonds, aes(price)) +
26 |   hist100
27 | ```
28 | 
29 | 
30 | **_Question 2_** Create an object that represents a fill scale with the Blues ColorBrewer
31 | palette.
32 | 
33 | See p. 138 of the textbook. The default palette of scale_fill_brewer() is already Blues, but we name it explicitly.
34 | 
35 | ```{r e.12.2.1.2}
36 | bluePalette <- scale_fill_brewer(palette = "Blues")
37 | 
38 | # Test this object with diamonds dataset:
39 | ggplot(diamonds, aes(cut, price, fill = cut)) +
40 |   geom_bar(stat = "summary", fun.y = "mean") +
41 |   bluePalette
42 | ```
43 | 
44 | 
45 | **_Question 3_** Read the source code for theme_grey(). What are its arguments? How does
46 | it work?
47 | 
48 | ```{r}
49 | theme_grey
50 | ```
51 | 
52 | 
53 | **_Question 4_** Create scale_colour_wesanderson(). It should have a parameter to pick the
54 | palette from the wesanderson package, and create either a continuous or
55 | discrete scale.
56 | 
57 | Package not available for R3.4.4
58 | 
59 | 


--------------------------------------------------------------------------------
/ggplot2_solutions_chapter2.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "ggplot2_chapter2"
  3 | author: "Nade Kang"
  4 | date: "July 6, 2018"
  5 | output: html_document
  6 | ---
  7 | # Solution Manual for ggplot2 Elegant Graphics for Data Analysis by Hadley Wickham
  8 | # ggplot2 Chapter 2 Getting Started with ggplot2
  9 | 
 10 | ## *Exercise 2.2.1*
 11 | ### Load Packages
 12 | ```{r setup}
 13 | library(tidyverse)
 14 | ```
 15 | **_Question 1_** List five functions that you could use to get more information about the mpg dataset.
 16 | ```{r e.2.2.1.1, eval=FALSE}
 17 | View(mpg)
 18 | head(mpg)
 19 | tail(mpg)
 20 | dim(mpg)
 21 | names(mpg)
 22 | ```
 23 | 
 24 | **_Question 2_** How can you find out what other datasets are included with ggplot2?
 25 | ```{r e.2.2.1.2}
 26 | data(package = "ggplot2")
 27 | ```
 28 | 
 29 | **_Question 3_** Apart from the US, most countries use fuel consumption (fuel consumed
 30 | over fixed distance) rather than fuel economy (distance travelled with fixed
 31 | amount of fuel). How could you convert cty and hwy into the European
 32 | standard of l/100km?
 33 | ```{r e.2.2.1.3}
 34 | head(mpg)
 35 | mpg_km <- mpg %>%
 36 |   mutate(cty_l100km = 235.215 / cty,
 37 |          hwy_l100km = 235.215 / hwy)
 38 | mpg_km %>% select(cty:hwy_l100km)
 39 | ```
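
The constant 235.215 comes from the unit definitions, and because consumption is the inverse of economy, it is divided by the mpg value rather than multiplied. A minimal base R sketch, assuming US gallons; the function name `mpg_to_l100km` is illustrative:

```r
# Exact unit definitions
km_per_mile <- 1.609344
litres_per_us_gallon <- 3.785411784

# Litres needed to travel 100 km, given miles per US gallon
mpg_to_l100km <- function(miles_per_gallon) {
  100 * litres_per_us_gallon / (km_per_mile * miles_per_gallon)
}

mpg_to_l100km(1)          # the conversion constant, about 235.215
mpg_to_l100km(c(20, 30))  # higher mpg gives lower l/100km
```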
 40 | **_Question 4_** Which manufacturer has the most models in this dataset? Which model
 41 | has the most variations? Does your answer change if you remove the redundant
 42 | specification of drive train (e.g. “pathfinder 4wd”, “a4 quattro”)
 43 | from the model name?
 44 | 
 45 | Manufacturer with the most models:
 46 | ```{r e.2.2.1.4_mostmodels}
 47 | mpg %>%
 48 |   group_by(manufacturer) %>%
 49 |   summarise(n = n(), models = n_distinct(model)) %>%
 50 |   arrange(desc(n))
 51 | ```
 52 | Dodge has the most rows (37), but rows are not models: the models column counts the distinct model names per manufacturer.
 53 | 
 54 | The model that has the most variations:
 55 | ```{r e.2.2.1.4_mostvariations}
 56 | unique(mpg$model)
 57 | ```
 58 | 
 59 | It seems that camry has the most model variations.
 60 | To remove the redundant specification:
 61 | ```{r e.2.2.1.4_mostvarrm}
 62 | library(stringr)
 63 | str_trim(str_replace_all(unique(mpg$model), c("quattro" = "", "4wd" = "",
 64 |                                               "2wd" = "", "awd" = "")))
 65 | ```
 66 | Still, it seems camry has the most model variation after removing redundancy.
 67 | 
 68 | ## *Exercise 2.3.1*
 69 | **_Question 1_** How would you describe the relationship between cty and hwy? Do you
 70 | have any concerns about drawing conclusions from that plot?
 71 | To understand the relationship, we need to make a plot:
 72 | ```{r e.2.3.1.1_cty_hwy_plot}
 73 | ggplot(mpg, aes(x = cty, y = hwy)) +
 74 |   geom_point()
 75 | ```
 76 | 
 77 | It appears that there is a positive linear relationship between cty and hwy. 
 78 | 
 79 | **_Question 2_** What does ggplot(mpg, aes(model, manufacturer)) + geom_point() show? Is it useful? How could
 80 | you modify the data to make it more informative?
 81 | ```{r e.2.3.1.2_manu_model_plot}
 82 | ggplot(mpg, aes(model, manufacturer)) +
 83 |   geom_point()
 84 | ```
 85 | This plot has problems. First, the model names on the x-axis are too long, so the plot doesn't
 86 | show the full names, which makes it hard to read. Second, the plot
 87 | doesn't help people understand the relationship between model and manufacturer, because a
 88 | manufacturer may have several models (e.g. audi and toyota).
 89 | 
 90 | A better approach is to count each manufacturer + model combination:
 91 | ```{r e.2.3.1.2_manu_model}
 92 | df <- mpg %>%
 93 |   mutate(manuModel = paste(manufacturer, model, sep = " "))
 94 | 
 95 | df  %>%
 96 |   select(manufacturer, model, manuModel)
 97 | 
 98 | ggplot(df, aes(x = manuModel)) +
 99 |   geom_bar() +
100 |   coord_flip()
101 | ```
102 | 
103 | **_Question 3_** Describe the data, aesthetic mappings and layers used for each of the
104 | following plots. You’ll need to guess a little because you haven’t seen all
105 | the datasets and functions yet, but use your common sense! See if you can
106 | predict what the plot will look like before running the code.
107 | 
108 | 1. ggplot(mpg, aes(cty, hwy)) + geom_point()
109 | 2. ggplot(diamonds, aes(carat, price)) + geom_point()
110 | 3. ggplot(economics, aes(date, unemploy)) + geom_line()
111 | 4. ggplot(mpg, aes(cty)) + geom_histogram()
112 | 
113 | ```{r e.2.3.1.3_summry_plot}
114 | summary(ggplot(mpg, aes(cty, hwy)) + geom_point())
115 | ```
116 | As you can see, we can use the summary() function to get full details about a
117 | plot object. In general, each plot above has one dataset, maps one or two variables
118 | from that dataset to aesthetics, and has one layer.
119 | 
120 | ## *Exercises 2.4.1*
121 | **_Question 1_** Experiment with the colour, shape and size aesthetics. What happens when you map 
122 | them to continuous values? What about categorical values? What happens when you use 
123 | more than one aesthetic in a plot?
124 | Using mpg dataset as an example, first I map color, shape, and size to continuous 
125 | variables:
126 | ```{r e.2.4.1.1_plot1}
127 | ggplot(mpg, aes(cty, hwy, color = +displ)) +
128 |   geom_jitter()
129 | ```
130 | What you get is a continuous color scale; you can use the +/- sign to reverse the direction of
131 | the color scale.
132 | 
133 | The problem is that color and size work with continuous variables, but shape
134 | doesn't, because a continuous variable has more distinct values than there are available shapes
135 | to represent them.
136 | ```{r e.2.4.1.1_plot2, eval=FALSE}
137 | ggplot(mpg, aes(cty, hwy, shape = displ)) +
138 |   geom_point()
139 | 
140 | # You get:
141 | # Error: A continuous variable can not be mapped to shape
142 | ```
143 | 
144 | You can use more than one aesthetic in a plot, such as:
145 | ```{r e.2.4.1.1_plot3}
146 | ggplot(mpg, aes(cty, hwy, size = displ, color = displ)) +
147 |   geom_point()
148 | ```
149 | 
150 | **_Question 2_** What happens if you map a continuous variable to shape? Why? What
151 | happens if you map trans to shape? Why?
152 | The first part has been answered in the previous question.
153 | The second part to map trans to shape:
154 | ```{r e.2.4.1.2}
155 | ggplot(mpg, aes(cty, hwy, shape = trans)) +
156 |   geom_point()
157 | ```
158 | 
159 | The plot generates a warning that shape for more than 6 discrete values becomes hard to discriminate.
160 | 
161 | **_Question 3_** How is drive train related to fuel economy? How is drive train related to
162 | engine size and class?
163 | 
164 | ```{r e.2.4.1.3_plot1}
165 | ggplot(mpg, aes(drv, cty)) + 
166 |   geom_boxplot() +
167 |   scale_x_discrete(labels = c("Front wheel", "Rear wheel", "Four wheel"),
168 |                    limits = c("f", "r", "4"))
169 | ```
170 | 
171 | Front-wheel drive appears to be the most efficient in terms of city miles per gallon.
172 | 
173 | For drive train, engine size, and class, we need to reorder the class based on engine size first
174 | with median, and then plot class on x-axis and engine size on y-axis, with drive train as color.
175 | ```{r e.2.4.1.3_plot2}
176 | ggplot(mpg, aes(reorder(class, displ, FUN = median), displ, color = drv)) +
177 |   geom_jitter(width = 0.5)
178 | ```
179 | 
180 | 
181 | ## Exercises 2.5.1
182 | **_Question 1_** What happens if you try to facet by a continuous variable like hwy? What
183 | about cyl? What’s the key difference?
184 | 
185 | ```{r e.2.5.1.1_plot1}
186 | ggplot(mpg, aes(x = cty, y = displ)) +
187 |   geom_point() +
188 |   facet_wrap(~ hwy)
189 | ```
190 | 
191 | When you run facet_wrap(~continuous) with continuous variable, the whole plot becomes hard to
192 | grasp because there are too many graphs.
193 | 
194 | Then we try to run the same thing with cyl:
195 | 
196 | ```{r e.2.5.1.1_plot2}
197 | ggplot(mpg, aes(displ, cty)) +
198 |   geom_point() +
199 |   facet_wrap(~ cyl)
200 | ```
201 | 
202 | With cyl, which has only four distinct values, the plot is much easier to read. The key
203 | difference is that hwy has far more distinct values than cyl does.
204 | 
205 | **_Question 2_** Use facetting to explore the three-way relationship between fuel economy,
206 | engine size, and number of cylinders. How does facetting by number of
207 | cylinders change your assessment of the relationship between engine size
208 | and fuel economy?
209 | 
210 | The pattern can be seen in the last plot.
211 | 
212 | **_Question 3_** Read the documentation for facet_wrap(). What arguments can you use to
213 | control how many rows and columns appear in the output?
214 | 
215 | ```{r e.2.5.1.3, eval=FALSE}
216 | ?facet_wrap
217 | ```
218 | 
219 | We can use nrow and ncol to control the rows and columns that appear in the output.
220 | 
221 | **_Question 4_** What does the scales argument to facet_wrap() do? When might you
222 | use it?
223 | 
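224 | The scales argument controls whether axis scales are shared across panels: fixed ("fixed", the default), free ("free"), or free in one dimension ("free_x", "free_y"). Free scales are useful when panels cover very different ranges.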
225 | 
226 | ```{r, results="hide"}
227 | library(mgcv)
228 | library(MASS)
229 | ```
230 | 
231 | ## Exercises 2.6.6
232 | **_Question 1_** What’s the problem with the plot created by ggplot(mpg, aes(cty, hwy))
233 | + geom_point()? Which of the geoms described above is most effective at
234 | remedying the problem?
235 | 
236 | ```{r e.2.6.6.1_plot1}
237 | ggplot(mpg, aes(cty, hwy)) +
238 |   geom_point()
239 | ```
240 | 
241 | The problem with this plot is overplotting: many points coincide, so the graph doesn't show all the
242 | available data points in the dataset.
243 | 
244 | The solution to this problem is to use geom_jitter().
245 | ```{r e.2.6.6.1_plot2}
246 | ggplot(mpg, aes(cty, hwy)) +
247 |   geom_jitter()
248 | ```
249 | 
250 | **_Question 2_** One challenge with ggplot(mpg, aes(class, hwy)) + geom_boxplot() is that
251 | the ordering of class is alphabetical, which is not terribly useful. How
252 | could you change the factor levels to be more informative?
253 | Rather than reordering the factor by hand, you can do it automatically
254 | based on the data: ggplot(mpg, aes(reorder(class, hwy), hwy)) +
255 | geom_boxplot(). What does reorder() do? Read the documentation.
256 | 
257 | You can check reorder() using:
258 | ```{r e.2.6.6.2, eval=FALSE}
259 | ?reorder
260 | ```
261 | 
262 | reorder is a generic function. The "default" method treats its first argument as a categorical variable, 
263 | and reorders its levels based on the values of a second variable, usually numeric.
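
As a rough illustration of that description, the sketch below compares the alphabetical order with the reordered levels. The variable name class_by_hwy and the choice of FUN = median are assumptions made for the example; reorder() uses the mean unless told otherwise.

```{r e.2.6.6.2_sketch}
library(ggplot2)

# Default alphabetical order of the class labels
sort(unique(mpg$class))

# reorder() returns a factor whose levels are sorted by a summary of hwy
# within each class (here the median, passed explicitly)
class_by_hwy <- reorder(mpg$class, mpg$hwy, FUN = median)
levels(class_by_hwy)

# The same call inside aes() orders the boxplots from low to high hwy
ggplot(mpg, aes(reorder(class, hwy, FUN = median), hwy)) +
  geom_boxplot() +
  xlab("class (ordered by median hwy)")
```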
264 | 
265 | **_Question 3_** Explore the distribution of the carat variable in the diamonds dataset. What
266 | binwidth reveals the most interesting patterns?
267 | 
268 | We can generate several plots with different binwidth:
269 | ```{r e.2.6.6.3_plot1}
270 | ggplot(diamonds, aes(carat)) +
271 |   geom_histogram(binwidth = 1) +
272 |   ggtitle(expression(atop("Carat Barplot with", "binwidth = 1"))) +
273 |   xlab("Carat") +
274 |   ylab("Count/Number") +
275 |   theme(plot.title = element_text(hjust = 0.5), title = element_text(size = 14),
276 |         text = element_text(size = 9))
277 | ```
278 | 
279 | As we change the binwidth:
280 | ```{r e.2.6.6.3_plot2}
281 | ggplot(diamonds, aes(carat)) +
282 |   geom_histogram(binwidth = 0.5) +
283 |   ggtitle(expression(atop("Carat Barplot with", "binwidth = 0.5"))) +
284 |   xlab("Carat") +
285 |   ylab("Count/Number") +
286 |   theme(plot.title = element_text(hjust = 0.5), title = element_text(size = 14),
287 |         text = element_text(size = 9))
288 | ```
289 | 
290 | ```{r e.2.6.6.3_plot3}
291 | ggplot(diamonds, aes(x = carat)) +
292 |   geom_histogram(binwidth = 0.01) +
293 |   ggtitle(expression(atop("Carat Barplot with", "binwidth = 0.01"))) +
294 |   xlim(0.3, 3) +
295 |   xlab("Carat") +
296 |   ylab("Count/Number") +
297 |   theme(plot.title = element_text(hjust = 0.5), title = element_text(size = 14),
298 |         text = element_text(size = 9))
299 | ```
300 | 
301 | I am not familiar with the diamond industry, but the spikes at round carat values look interesting,
302 | and there must be a reason for this pattern.
303 | 
304 | **_Question 4_** Explore the distribution of the price variable in the diamonds data. How
305 | does the distribution vary by cut?
306 | 
307 | To check this:
308 | ```{r e.2.6.6.4_plot1}
309 | ggplot(diamonds, aes(x = cut, y = price, color = cut)) +
310 |   geom_boxplot()
311 | ```
312 | 
313 | ```{r e.2.6.6.4_plot2}
314 | ggplot(diamonds, aes(x = price, y =..density.., color = cut)) +
315 |   geom_freqpoly(binwidth = 200)
316 | ```
317 | 
318 | 
319 | Fair cut diamonds tend to have a higher price than Very Good cut diamonds. One possible reason is
320 | that these Fair diamonds are larger, so buyers are paying for the size rather than
321 | for the cut, because not everyone is an expert in diamonds.
322 | 
323 | **_Question 5_** You now know (at least) three ways to compare the distributions of
324 | subgroups: geom_violin(), geom_freqpoly() and the colour aesthetic, or
325 | geom_histogram() and facetting. What are the strengths and weaknesses
326 | of each approach? What other approaches could you try?
327 | 
328 | The strengths and weaknesses are discussed in the chapter already.
329 | 
330 | **_Question 6_** Read the documentation for geom_bar(). What does the weight aesthetic do?
331 | There are two types of bar charts: geom_bar makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col instead. geom_bar uses stat_count by default: it counts the number of cases at each x position. geom_col uses stat_identity: it leaves the data as is.
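
To make the effect of weight concrete, here is a small sketch using the diamonds data. Using carat as the weight is just an assumption for the example; any numeric column would do.

```{r e.2.6.6.6_sketch}
library(ggplot2)

# Unweighted: bar height is the number of diamonds in each cut
p_count <- ggplot(diamonds, aes(cut)) +
  geom_bar()

# Weighted: bar height is the sum of carat within each cut
p_weight <- ggplot(diamonds, aes(cut, weight = carat)) +
  geom_bar() +
  ylab("Total carat")

p_count
p_weight
```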
332 | 
333 | **_END OF CHAPTER 2 EXERCISES_**


--------------------------------------------------------------------------------
/ggplot2_solutions_chapter3.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "ggplot2_chapter3"
  3 | author: "Nade Kang"
  4 | date: "July 8, 2018"
  5 | output: html_document
  6 | ---
  7 | # Solution Manual for ggplot2 Elegant Graphics for Data Analysis by Hadley Wickham
  8 | # ggplot2 Chapter 3 Toolbox
  9 | 
 10 | ## *Exercise 3.2.1*
 11 | ### Load Packages
 12 | ```{r setup, results='hide'}
 13 | knitr::opts_chunk$set(echo = TRUE)
 14 | library(tidyverse)
 15 | ```
 16 | 
 17 | **_Question 1_** What geoms would you use to draw each of the following named plots?
 18 | 1. Scatterplot: geom_point()
 19 | 2. Line chart: geom_line()
 20 | 3. Histogram: geom_histogram()
 21 | 4. Bar chart: geom_bar()
 22 | 5. Pie chart: geom_bar() + coord_polar("y")
 23 | 
 24 | **_Question 2_** What’s the difference between geom_path() and geom_polygon()? What’s the
 25 | difference between geom_path() and geom_line()?
 26 | 
 27 | geom_path() connects the observations in the order in which they appear in the data. 
 28 | geom_line() connects them in order of the variable on the x axis. 
 29 | Polygons are very similar to paths (as drawn by geom_path) except that the start and 
 30 | end points are connected and the inside is coloured by fill.
 33 | 
 34 | **_Question 3_** What low-level geoms are used to draw geom_smooth()? What about
 35 | geom_boxplot() and geom_violin()?
 36 | 
 37 | geom_point(), geom_path(), and geom_area() are used to draw with geom_smooth().
 38 | geom_rect(), geom_line(), geom_point() are used for geom_boxplot().
 39 | geom_area() and geom_path() are used for geom_violin().
 40 | 
 41 | ## *Exercises 3.5.5*
 42 | 
 43 | **_Question 1_** Draw a boxplot of hwy for each value of cyl, without turning cyl into a
 44 | factor. What extra aesthetic do you need to set?
 45 | 
 46 | ```{r e.3.5.5.1}
 47 | ggplot(mpg, aes(cyl, hwy, group = cyl)) +
 48 |   geom_boxplot()
 49 | ```
 50 | 
 51 | You simply add group = cyl inside the overall aes() call in ggplot().
 52 | 
 53 | **_Question 2_** Modify the following plot so that you get one boxplot per integer value of
 54 | displ.
 55 | 
 56 | ```{r q2, eval=FALSE}
 57 | ggplot(mpg, aes(displ, cty)) +
 58 | geom_boxplot()
 59 | ```
 60 | 
 61 | ```{r e.3.5.5.2}
 62 | ggplot(mpg, aes(displ, cty)) +
 63 |   geom_boxplot(aes(group = displ))
 64 | ```
 65 | 
 66 | **_Question 3_** When illustrating the difference between mapping continuous and discrete
 67 | colours to a line, the discrete example needed aes(group = 1). Why? What
 68 | happens if that is omitted? What’s the difference between aes(group = 1)
 69 | and aes(group = 2)? Why?
 70 | 
 71 | The important thing (for a line graph with a factor on the horizontal axis) is to manually specify the grouping. By default ggplot2 uses the combination of all categorical variables in the plot to group geoms. That doesn't work for this plot because each point ends up in its own group, so no line can be drawn between the points. Manually specifying group = 1 indicates that you want a single line connecting all the points. Any constant does the same job, so aes(group = 2) gives the same result as aes(group = 1).
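
A minimal sketch of the two cases. The three-row data frame df and the plot names are made up for the illustration:

```{r e.3.5.5.3_sketch}
library(ggplot2)

df <- data.frame(x = 1:3, y = 1:3, colour = c(1, 3, 5))

# Without an explicit group, the discrete colour splits the data into three
# one-point groups, so geom_line() has nothing to connect
p_no_group <- ggplot(df, aes(x, y, colour = factor(colour))) +
  geom_line() +
  geom_point(size = 5)

# group = 1 puts every row in the same group, so one line joins all points
p_group <- ggplot(df, aes(x, y, colour = factor(colour))) +
  geom_line(aes(group = 1)) +
  geom_point(size = 5)

p_group
```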
 72 | 
 73 | **_Question 4_** How many bars are in each of the following plots?
 74 | ```{r}
 75 | ggplot(mpg, aes(drv)) + geom_bar()
 76 | ggplot(mpg, aes(drv, fill = hwy, group = hwy)) + geom_bar()
 77 | library(dplyr)
 78 | mpg2 <- mpg %>% arrange(hwy) %>% mutate(id = seq_along(hwy))
 79 | ggplot(mpg2, aes(drv, fill = hwy, group = id)) + geom_bar()
 80 | ```
 81 | 
 82 | All have 3 bars.
 83 | 
 84 | **_Question 5_** Install the babynames package. It contains data about the popularity of
 85 | babynames in the US. Run the following code and fix the resulting graph.
 86 | Why does this graph make me unhappy?
 87 | ```{r, eval=FALSE}
 88 | install.packages("babynames")
 89 | ```
 90 | ```{r e.3.5.5.5}
 91 | library(babynames)
 92 | hadley <- dplyr::filter(babynames, name == "Hadley")
 93 | ggplot(hadley, aes(year, n)) +
 94 | geom_line()
 95 | ```
 96 | We want to see how the number of newborns named "Hadley" has changed over time. This graph doesn't
 97 | show the full picture: the data has one row per year and sex, so a single line zigzags between the
 98 | male and female counts for each year. Female babies named "Hadley" appear in the data starting in
 99 | the 1960s, so the counts have to be drawn separately for each sex, for example by mapping sex to
100 | colour.
101 | ```{r e.3.5.5.5_plot}
102 | hadley
103 | male <- hadley %>%
104 |   filter(sex == "M")
105 | male
106 | 
107 | female <- hadley %>%
108 |   filter(sex == "F")
109 | female
110 | 
111 | ggplot(hadley) +
112 |   geom_line(aes(year, n, color = sex))
113 | ```
114 | 
115 | ## *Exercises 3.11.1*
116 | **_Question 1_** What binwidth tells you the most interesting story about the distribution
117 | of carat?
118 | ```{r e.3.11.1_plot1}
119 | ggplot(diamonds, aes(carat)) +
120 |   geom_histogram(binwidth = 0.01) +
121 |   xlim(0.4, 2)
122 | ```
123 | 
124 | Zooming in using xlim, we can see that a binwidth = 0.01 reveals some interesting pattern
125 | in terms of carat and number of diamonds.
126 | 
127 | **_Question 2_** Draw a histogram of price. What interesting patterns do you see?
128 | 
129 | ```{r e.3.11.1.2}
130 | ggplot(diamonds, aes(price)) +
131 |   geom_histogram(binwidth = 30) +
132 |   xlim(0, 10000)
133 | ```
134 | 
135 | Not quite familiar with diamonds, so it's hard for me to say what kind of pattern there is.
136 | But you could try to adjust the binwidth and xlim() to see different patterns.
137 | 
138 | **_Question 3_** How does the distribution of price vary with clarity?
139 | 
140 | ```{r e.3.11.1.3_1}
141 | ggplot(diamonds, aes(clarity, price)) +
142 |   geom_boxplot()
143 | ```
144 | 
145 | ```{r e.3.11.3_2}
146 | ggplot(diamonds, aes(clarity, price)) +
147 |   geom_violin()
148 | ```
149 | 
150 | Again, I am not an expert in terms of the different levels of clarity. Therefore, I am not a 
151 | good judge in terms of the meaning of the plots.
152 | 
153 | **_Question 4_** Overlay a frequency polygon and density plot of depth. What computed
154 | variable do you need to map to y to make the two plots comparable? (You
155 | can either modify geom_freqpoly() or geom_density().)
156 | 
157 | ```{r e.3.11.4}
158 | ggplot(diamonds, aes(depth)) +
159 |   geom_density(alpha = 0.1) +
160 |   geom_freqpoly()
161 | ```
162 | 
163 | Unsure about the answer.
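
One possible answer, offered as a sketch rather than a definitive solution: map y to the computed density in geom_freqpoly(), so that both layers are on the density scale. This assumes ggplot2 3.3.0 or later, where after_stat() is available (older versions write ..density.. instead); the binwidth of 0.2 and the grey fill are arbitrary choices.

```{r e.3.11.4_sketch}
library(ggplot2)

# geom_density() already draws density on the y axis; geom_freqpoly() draws
# counts by default, so its y is remapped to the computed density
p_depth <- ggplot(diamonds, aes(depth)) +
  geom_density(alpha = 0.1, fill = "grey50") +
  geom_freqpoly(aes(y = after_stat(density)), binwidth = 0.2)

p_depth
```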


--------------------------------------------------------------------------------
/ggplot2_solutions_chapter4.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "ggplot2_solutions_chapter4"
 3 | author: "Nade Kang"
 4 | date: "July 8, 2018"
 5 | output: html_document
 6 | ---
 7 | 
 8 | # Solution Manual for ggplot2 Elegant Graphics for Data Analysis by Hadley Wickham
 9 | # ggplot2 Chapter 4 Mastering the Grammar
10 | 
11 | ## *Exercise 4.5.1*
12 | ### Load Packages
13 | ```{r setup, results='hide'}
14 | knitr::opts_chunk$set(echo = TRUE)
15 | library(tidyverse)
16 | ```
17 | 
18 | **_Question 1_** One of the best ways to get a handle on how the grammar works is to
19 | apply it to the analysis of existing graphics. For each of the graphics listed
20 | below, write down the components of the graphic. Don’t worry if you don’t
21 | know what the corresponding functions in ggplot2 are called (or if they
22 | even exist!), instead focussing on recording the key elements of a plot so
23 | you could communicate it to someone else.
24 | 
25 | 1. “Napoleon’s march” by Charles John Minard:
26 | http://www.datavis.ca/gallery/re-minard.php
27 | 
28 | This is based on my basic understanding, it may not be correct:
29 | ggplot(data, aes(x = year, y # of graphics)) +
30 |  geom_point() +
31 |  geom_hline(yintercept = some mean or median value) +
32 |  geom_smooth(method = "loess", se = FALSE) +
33 |  geom_text(some text annotation)
34 |  
35 | 2. “Where the Heat and the Thunder Hit Their Shots”, by Jeremy White,
36 | Joe Ward, and Matthew Ericson at The New York Times.
37 | http://nyti.ms/1duzTvY
38 | 
39 | ggplot(data, aes(x = coordX, y = coordY)) +
40 |  geom_Hexagonal(data = attempt_points, shape = number of attempts, color = points per region) +
41 |  ggtitle() +
42 |  geom_text(annotation) +
43 |  coord_quick_basketball_court()
44 | 
45 | 3. “London Cycle Hire Journeys”, by James Cheshire.
46 | http://bit.ly/1S2cyRy
47 | 
48 | ggplot(data) +
49 |  geom_map()
50 | 
51 | 4. The Pew Research Center’s favorite data visualizations of 2014:
52 | http://pewrsr.ch/1KZSSN6
53 | 
54 | First graph used area plot, vertical line, title, annotation and faceting
55 | Second used animation.
56 | 


--------------------------------------------------------------------------------
/ggplot2_solutions_chapter5.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "ggplot2_solutions_chapter5"
  3 | author: "Nade Kang"
  4 | date: "July 9, 2018"
  5 | output: html_document
  6 | ---
  7 | 
  8 | # Solution Manual for ggplot2 Elegant Graphics for Data Analysis by Hadley Wickham
  9 | # ggplot2 Chapter 5 Build a Plot Layer by Layer
 10 | 
 11 | ## *Exercise 5.3.1*
 12 | ### Load Packages
 13 | ```{r setup, results='hide'}
 14 | knitr::opts_chunk$set(echo = TRUE)
 15 | library(tidyverse)
 16 | ```
 17 | 
 18 | **_Question 1_** The first two arguments to ggplot are data and mapping. The first two
 19 | arguments to all layer functions are mapping and data. Why does the order
 20 | of the arguments differ? (Hint: think about what you set most commonly.)
 21 | 
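 22 | ggplot() takes data first because you usually set the dataset once for the whole plot, while geom layers take mapping first because in a layer you most often change the aesthetics and reuse the plot's data.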
 23 | 
 24 | **_Question 2_** The following code uses dplyr to generate some summary statistics about
 25 | each class of car (you’ll learn how it works in Chap. 10).
 26 | 
 27 | ```{r}
 28 | library(dplyr)
 29 | class <- mpg %>% 
 30 |   group_by(class) %>% 
 31 |   summarise(n = n(), hwy = mean(hwy))
 32 | class
 33 | ```
 34 | 
 35 | ```{r e.5.3.1.2}
 36 | ggplot(mpg, aes(class, hwy)) +
 37 |   geom_jitter(width = 0.05, size = 2) +
 38 |   geom_point(aes(y = hwy), data = class, size = 4, color = "red") +
 39 |   geom_text(aes(y = 10, label = paste0("n = ", n)), data = class)
 40 | ```
 41 | 
 42 | The process behind the scenes:
 43 | First, you use the ggplot() function to define the main dataset and aesthetics you want to use. Here, we want to plot hwy (y-axis) as points against class (x-axis). This is the first line of code.
 44 | 
 45 | Second, remember that ggplot2 builds a graph layer by layer. The second line of code is the first layer that we add. Instead of a scatterplot using geom_point(), we use a jitter plot to avoid overplotting. This layer uses the same dataset (mpg) and the same aesthetics (x = class, y = hwy) as ggplot(), so nothing needs to be overridden. To make it look similar to the original plot in the textbook, we narrow the jitter with width = 0.05 (you can also try 0.1 or 0.2), and we set size = 2; the exact size doesn't matter much.
 46 | 
 47 | Third, we recreate the red dots from the original graph; this is the second layer. I used geom_point() because the red dots are simply points. Each red dot marks the mean hwy for the corresponding class. The original mpg dataset does not contain this value, so this layer needs its own data. The aesthetic stays y = hwy, the same as in ggplot(), but we set data = class, so each class maps to a single y value, its mean hwy. We also set size = 4 so the dots look big, and color = "red".
 48 | 
 49 | Finally, we add the layer of labels. For text annotations we use geom_text(). We set aes(y = 10) because this places the labels at y = 10. The class data only holds the count n, but the original graph shows "n = " followed by the count, so we use paste0() to concatenate the string and the number: label = paste0("n = ", n). We again set data = class, because this layer does not use the original mpg data.
 50 | 
 51 | ## *Exercise 5.4.3*
 52 | **_Question 1_** Simplify the following plot specifications:
 53 | There is a typo in the textbook: the variable is displ, not disp.
 54 | ```{r, eval=FALSE}
 55 | ggplot(mpg) + geom_point(aes(mpg$displ, mpg$hwy)) 
 56 | 
 57 | ggplot() + geom_point(mapping = aes(y = hwy, x = cty), data = mpg) + geom_smooth(data = mpg, mapping = aes(cty, hwy))
 58 | 
 59 | ggplot(diamonds, aes(carat, price)) + geom_point(aes(log(brainwt), log(bodywt)), data = msleep)
 60 | ```
 61 | 
 62 | ```{r e.5.4.3.1_1}
 63 | ggplot(mpg, aes(displ, hwy)) + geom_point()
 64 | ```
 65 | 
 66 | ```{r e.5.4.3.1_2}
 67 | ggplot(mpg, aes(cty, hwy)) + geom_point() + geom_smooth()
 68 | ```
 69 | 
 70 | ```{r e.5.4.3.1_3}
 71 | ggplot(msleep, aes(log(brainwt), log(bodywt))) + geom_point()
 72 | ```
 73 | 
 74 | 
 75 | **_Question 2_** What does the following code do? Does it work? Does it make sense?
 76 | Why/why not?
 77 | 
 78 | ```{r}
 79 | ggplot(mpg) + geom_point(aes(class, cty)) + geom_boxplot(aes(trans, hwy)) + coord_flip()
 80 | ```
 81 | 
 82 | This plot uses the mpg dataset without setting any aesthetics in ggplot(). It then
 83 | adds a first layer, a scatterplot with class on the x-axis and cty on the y-axis.
 84 | After that, it adds a boxplot layer with trans on the x-axis and hwy on the y-axis,
 85 | and coord_flip() swaps the two axes so the long class and trans labels stay readable.
 86 | 
 87 | The code runs, but the result doesn't make sense: 1. trans and class are mixed together on
 88 | one axis, and a boxplot is mixed with a scatterplot. 2. The other axis carries two different
 89 | variables, cty and hwy. Both are continuous, but they differ in nature because one
 90 | is city miles per gallon and the other is highway miles per gallon.
 91 | In the end, the axis labels only show class and cty; trans and hwy are not labelled
 92 | at all.
 93 | 
 94 | **_Question 3_** What happens if you try to use a continuous variable on the x axis in one
 95 | layer, and a categorical variable in another layer? What happens if you do
 96 | it in the opposite order?
 97 | 
 98 | ```{r e.5.4.3_1, eval=FALSE}
 99 | ggplot(mpg) +
100 |   geom_point(aes(hwy, cyl)) +
101 |   geom_point(aes(drv, cty))
102 | ```
103 | 
104 | ```{r e.5.4.3_2}
105 | ggplot(mpg) +
106 |   geom_point(aes(drv, cty)) +
107 |   geom_point(aes(hwy, cyl))
108 | ```
109 | 
110 | If the first layer puts a categorical variable on the x-axis and the second layer a continuous
111 | one, the plot will run but the result doesn't make sense at all. However, if you do it the opposite way,
112 | R will report an error.
113 | 
114 | ## *Exercises 5.5.1*
115 | **_Question 1_** Download and print out the ggplot2 cheatsheet from
116 | http://www.rstudio.com/resources/cheatsheets/ so you have a handy visual reference for all
117 | the geoms.
118 | 
119 | Easy to do, just download.
120 | 
121 | **_Question 2_** Look at the documentation for the graphical primitive geoms. Which aesthetics
122 | do they use? How can you summarise them in a compact form?
123 | 
124 | Graphical primitives such as geom_point(), geom_path(), geom_rect(), geom_polygon(), geom_text() and 
125 | so on share common aesthetics such as x, y, color, alpha, fill, group and size.
126 | Other aesthetics, such as geom_point()'s stroke or geom_path()'s and geom_rect()'s linetype, are specific 
127 | to those functions. But the common ones are available in most of them.
128 | 
129 | **_Question 3_** What’s the best way to master an unfamiliar geom? List three resources
130 | to help you get started.
131 | 
132 | You can use Winston Chang's R Graphics Cookbook, which shows you the recipe for different types of plots.
133 | Or you could use Hadley Wickham's ggplot2: Elegant Graphics for Data Analysis.
134 | Or you can just use ?function to read the R documentation with examples to learn.
135 | 
136 | **_Question 4_** For each of the plots below, identify the geom used to draw it.
137 | The first one on the upper left is violin
138 | ```{r e.5.5.1.4_plot1}
139 | ggplot(mpg, aes(drv, displ)) +
140 |   geom_violin(size = 1, color = "black")
141 | ```
142 | 
143 | The one on upper right is a jitter plot with size and alpha features. It is not geom_point()
144 | because this results in overplotting.
145 | ```{r e.5.5.1.4_plot2}
146 | ggplot(mpg) +
147 |   geom_jitter(aes(x = hwy, y = cty, size = cyl), alpha = 0.1) +  # alpha is a constant, so it is set outside aes()
148 |   theme(legend.position = "none")
149 | ```
150 | 
151 | The middle left plot is a hexagonal plot
152 | ```{r e.5.5.1.4_plot3}
153 | ggplot(mpg) +
154 |   geom_hex(aes(x = hwy, y = cty))
155 | ```
156 | 
157 | Middle right is a jitter plot, thanks to Stack Overflow user PoGibas for the help. PoGibas mentioned:
158 | "By default jitter in geom_jitter is too large and we need to specify our own height and width of jitter by using position_jitter function."
159 | ```{r e.5.5.1.4_plot4}
160 | ggplot(mpg, aes(x = cyl, y = drv)) +
161 |   geom_jitter(width = 0.05, height = 0.05)
162 | # or geom_jitter(position = position_jitter(0.05, 0.05))
163 | ```
164 | 
165 | Next, we have to use the economics dataset in ggplot2.
166 | ```{r e.5.5.1.4_plot5}
167 | ggplot(economics, aes(date, psavert)) +
168 |   geom_area()
169 | ```
170 | 
171 | Finally, the last plot.
172 | ```{r e.5.5.1.4_plot6}
173 | ggplot(economics, aes(uempmed, psavert)) +
174 |   geom_path(size = 1)
175 | ```
176 | 
177 | **_Question 5_** For each of the following problems, suggest a useful geom:
178 | - Display how a variable has changed over time.
179 | geom_line()
180 | - Show the detailed distribution of a single variable.
181 | geom_histogram()
182 | - Focus attention on the overall trend in a large dataset.
183 | geom_line(), geom_area()
184 | - Draw a map.
185 | geom_sf(), geom_polygon(), coord_quickmap()
186 | - Label outlying points.
187 | geom_point(), geom_text()
188 | 
189 | ## *Exercises 5.6.2*
190 | **_Question 1_** The code below creates a similar dataset to stat_smooth(). Use the appropriate
191 | geoms to mimic the default geom_smooth() display.
192 | 
193 | ```{r}
194 | mod <- loess(hwy ~ displ, data = mpg)
195 | smoothed <- data.frame(displ = seq(1.6, 7, length = 50))
196 | pred <- predict(mod, newdata = smoothed, se = TRUE)
197 | smoothed$hwy <- pred$fit
198 | smoothed$hwy_lwr <- pred$fit - 1.96 * pred$se.fit
199 | smoothed$hwy_upr <- pred$fit + 1.96 * pred$se.fit
200 | smoothed
201 | ```
202 | 
203 | We need to use geom_ribbon() instead of geom_area(), even though "area" sounds like the right choice for shading the region between
204 | the lower and upper bounds. geom_ribbon() and geom_area() both display a y interval for given x values.
205 | However, geom_area() is a special case of geom_ribbon() in which ymin is fixed at 0.
206 | 
207 | Unless we use other means to control it, the order of the layers matters. If we added the filled geom_ribbon()
208 | last, it would cover the smooth line and the data points.
209 | ```{r e.5.6.2.1}
210 | ggplot(mpg, aes(displ, hwy)) +
211 |   geom_ribbon(data = smoothed, aes(ymin = hwy_lwr, ymax = hwy_upr), fill = "grey") +
212 |   geom_point(position = "jitter") +
213 |   geom_line(data = smoothed, aes(y = hwy), size = 1, color = "red")
214 | ```
215 | 
216 | 
217 | **_Question 2_** What stats were used to create the following plots?
218 | Not quite sure about this one.
219 | 
220 | **_Question 3_** Read the help for stat_sum() then use geom_count() to create a plot that
221 | shows the proportion of cars that have each combination of drv and trans.
222 | 
223 | ```{r}
224 | mpg %>% 
225 |   group_by(drv, trans) %>% 
226 |   summarise(n = n()) %>% 
227 |   mutate(drvtrans = paste0(drv, "-", trans)) %>% 
228 |   ggplot() +
229 |   geom_bar(aes(drvtrans, n), stat = "identity") +
230 |   coord_flip()
231 | ```
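
The bar chart above shows counts for each combination. A sketch that stays closer to the wording of the question uses geom_count() with the prop variable computed by stat_sum(). It assumes ggplot2 3.3.0 or later for after_stat(); the name p_prop and the max_size value are illustrative choices.

```{r e.5.6.2.3_sketch}
library(ggplot2)

# stat_sum() computes n (the count) and prop (the proportion within a group);
# group = 1 makes prop a share of all cars instead of a share of each cell
p_prop <- ggplot(mpg, aes(drv, trans)) +
  geom_count(aes(size = after_stat(prop), group = 1)) +
  scale_size_area(max_size = 10)

p_prop
```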
232 | 
233 | ## *Exercises 5.7.1*
234 | **_Question 1_** When might you use position_nudge()? Read the documentation.
235 | 
236 | position_nudge is generally useful for adjusting the position of items on discrete scales by a small amount. Nudging is built in to geom_text because it's so useful for moving labels a small distance from what they're labelling.
237 | 
238 | e.g.:
239 | ```{r, eval=FALSE, results='hide'}
240 | ggplot(df, aes(x, y)) +
241 |   geom_point() +
242 |   geom_text(aes(label = y), position = position_nudge(y = -0.1))
243 | ```
244 | 
245 | **_Question 2_** Many position adjustments can only be used with a few geoms. For example,
246 | you can’t stack boxplots or errors bars. Why not? What properties
247 | must a geom possess in order to be stackable? What properties must it
248 | possess to be dodgeable?
249 | 
250 | Not sure.
251 | 
252 | **_Question 3_** Why might you use geom_jitter() instead of geom_count()? What are the
253 | advantages and disadvantages of each technique?
254 | 
255 | geom_count() counts the number of observations at each location and then maps the count to point size:
256 | the more observations a location has, the larger the point.
257 | 
258 | geom_jitter() adds random noise to each point and shows how the observations are distributed in the plot.
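
A side-by-side sketch of the two techniques on the same overplotted data. The jitter amounts and alpha values are arbitrary assumptions:

```{r e.5.7.1.3_sketch}
library(ggplot2)

# cty and hwy are integers, so many cars share exactly the same coordinates
p_jitter <- ggplot(mpg, aes(cty, hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5)

# geom_count() keeps the exact positions and encodes the tally as point size
p_count <- ggplot(mpg, aes(cty, hwy)) +
  geom_count(alpha = 0.5)

p_jitter
p_count
```

Jittering keeps one point per observation but moves the points away from their true values, while counting keeps the true positions but asks the reader to judge quantities from point sizes.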
259 | 
260 | **_Question 4_** When might you use a stacked area plot? What are the advantages and
261 | disadvantages compared to a line plot?
262 | 
263 | Use a stacked area plot when you want to show how a total breaks down by category across the x values.
264 | The advantage is that it shows each category's contribution to the whole, but sometimes it is
265 | hard to figure out the real magnitude of an individual category.


--------------------------------------------------------------------------------
/ggplot2_solutions_chapter6.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "ggplot2_solutions_chapter6"
  3 | author: "Nade Kang"
  4 | date: "July 10, 2018"
  5 | output: html_document
  6 | ---
  7 | 
  8 | # Solution Manual for ggplot2 Elegant Graphics for Data Analysis by Hadley Wickham
  9 | # ggplot2 Chapter 6 Scales Axes and Legends
 10 | 
 11 | ## *Exercise 6.2.1*
 12 | ### Load Packages
 13 | ```{r setup, results='hide'}
 14 | knitr::opts_chunk$set(echo = TRUE)
 15 | library(tidyverse)
 16 | ```
 17 | 
 18 | **_Question 1_** What happens if you pair a discrete variable to a continuous scale? What
 19 | happens if you pair a continuous variable to a discrete scale?
 20 | 
 21 | Pair continuous variables with continuous scales first, and then with discrete scales:
 22 | ```{r e.6.2.1.1_plot1}
 23 | ggplot(mpg, aes(displ, hwy)) +
 24 | geom_point() +
 25 | scale_x_continuous() +
 26 | scale_y_continuous()
 27 | ```
 28 | ```{r e.6.2.1.1_plot2}
 29 | ggplot(mpg, aes(displ, hwy)) +
 30 | geom_point() +
 31 | scale_x_discrete() +
 32 | scale_y_discrete()
 33 | ```
 34 | 
 35 | Let's compare the two graphs. The first plot used both scale_x_continuous() and scale_y_continuous().
 36 | The second plot used scale_x_discrete() and scale_y_discrete() for the two axes. The difference between
 37 | the two graphs doesn't lie in the positions of the points, but in the background and the units on
 38 | the two axes. If you use a discrete scale for a continuous variable, you won't see the hwy units on the y-axis, nor will
 39 | you see the units for displ on the x-axis.
 40 | 
 41 | On the other hand, we can try continuous scale on discrete variables:
 42 | ```{r e.6.2.1.1_plot3}
 43 | ggplot(mpg, aes(class, hwy)) +
 44 |   geom_jitter(width = 0.05, height = 0.05)
 45 | ```
 46 | 
 47 | Now we change the scale on the previous plot:
 48 | ```{r e.6.2.1.1_plot4, eval=FALSE}
 49 | ggplot(mpg, aes(class, hwy)) +
 50 |   geom_jitter(width = 0.05, height = 0.05) +
 51 |   scale_x_continuous()
 52 | # Error: Discrete value supplied to continuous scale
 53 | ```
 54 | 
 55 | We were not allowed to do so because of the error message: Discrete value
 56 | supplied to continuous scale.
 57 | 
 58 | So in conclusion, we could supply discrete scale to continuous variables, but not
 59 | vice versa.
 60 | 
 61 | **_Question 2_** Simplify the following plot specifications to make them easier to understand.
 62 | ```{r, eval=FALSE}
 63 | ggplot(mpg, aes(displ)) +
 64 | scale_y_continuous("Highway mpg") +
 65 | scale_x_continuous() +
 66 | geom_point(aes(y = hwy))
 67 | ```
 68 | 
 69 | The code can be simplified as below:
 70 | 
 71 | ```{r e.6.2.1.2_plot1}
 72 | ggplot(mpg, aes(displ, hwy)) +
 73 |   geom_point() +
 74 |   ylab("Highway mpg")
 75 | ```
 76 | 
 77 | ```{r, eval=FALSE}
 78 | ggplot(mpg, aes(y = displ, x = class)) +
 79 | scale_y_continuous("Displacement (l)") +
 80 | scale_x_discrete("Car type") +
 81 | scale_x_discrete("Type of car") +
 82 | scale_colour_discrete() +
 83 | geom_point(aes(colour = drv)) +
 84 | scale_colour_discrete("Drive\ntrain")
 85 | ```
 86 | 
 87 | This can be simplified down to:
 88 | ```{r e.6.2.1.2_plot2}
 89 | ggplot(mpg, aes(class, displ)) +
 90 |   geom_point(aes(color = drv)) +
 91 |   labs(x = "Type of car", y = "Displacement (l)", colour = "Drive\ntrain")
 92 | ```
 93 | 
 94 | ## *Exercises 6.3.3*
 95 | **_Question 1_** Recreate the following graphic:
 96 | Adjust the y axis label so that the parentheses are the right size.
 97 | 
 98 | ```{r e.6.3.3.1}
 99 | ggplot(mpg, aes(displ, hwy)) +
100 |   geom_point() +
101 |   scale_x_continuous("Displacement",
102 |                      breaks = c(2,3,4,5,6,7),
103 |                      labels = c("2k", "3k", "4k", "5k", "6k", "7k")) +
104 |   scale_y_continuous(quote(Highway (Miles/Gallon)))
105 | ```
106 | 
107 | **_Question 2_** List the three different types of object you can supply to the breaks argument.
108 | How do breaks and labels differ?
109 | 
110 | Breaks takes one of the following as input: 
111 | - NULL for no breaks
112 | - waiver() for the default breaks computed by the transformation object
113 | - A numeric vector of positions
114 | - A function that takes the limits as input and returns breaks as output
115 | 
116 | Labels takes one of the following as input:
117 | - NULL for no labels
118 | - waiver() for the default computed by the transformation object
119 | - A character vector giving labels (must be same length as breaks)
120 | - A function that takes the breaks as input and returns labels as output
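
The two lists above can be sketched as follows: breaks decide where the ticks go, and labels decide what text is shown at those positions. The label text and the helper functions are made up for the example.

```{r e.6.3.3.2_sketch}
library(ggplot2)

# breaks as a numeric vector of positions, labels as a character vector
# of the same length
p_vec <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(breaks = c(2, 4, 6), labels = c("two", "four", "six"))

# breaks as a function of the limits, labels as a function of the breaks
p_fun <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(
    breaks = function(limits) seq(ceiling(limits[1]), floor(limits[2]), by = 1),
    labels = function(breaks) paste0(breaks, " l")
  )

p_vec
p_fun
```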
121 | 
122 | **_Question 3_** Recreate the following plot:
123 | 
124 | ```{r e.6.3.3.3}
125 | ggplot(mpg, aes(displ, hwy, color = drv)) +
126 |   geom_point() +
127 |   scale_color_discrete(labels = c("4wd", "fwd", "rwd"))
128 | ```
129 | 
130 | 
131 | **_Question 4_** What label function allows you to create mathematical expressions? What
132 | label function converts 1 to 1st, 2 to 2nd, and so on?
133 | 
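134 | quote() creates a mathematical expression for a single label; as label functions, the scales package offers scales::math_format() and scales::parse_format().
135 | To browse the available label functions, type scales:: and scroll through the completions whose names end in _format:
136 | labels = scales::ordinal_format() converts 1, 2, 3 to 1st, 2nd, 3rd and so on.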
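
A short sketch of the ordinal label function in use. The data frame results and its columns are invented for the illustration:

```{r e.6.3.3.4_sketch}
library(ggplot2)

# ordinal_format() returns a function that turns numbers into ordinals
to_ordinal <- scales::ordinal_format()
to_ordinal(1:4)

# Hypothetical data: finishing place against score
results <- data.frame(place = 1:5, score = c(90, 85, 70, 65, 50))

p_ordinal <- ggplot(results, aes(place, score)) +
  geom_point() +
  scale_x_continuous(breaks = 1:5, labels = scales::ordinal_format())

p_ordinal
```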
137 | 
138 | 
139 | **_Question 5_** What are the three most important arguments that apply to both axes
140 | and legends? What do they do? Compare and contrast their operation for
141 | axes vs. legends.
142 | Maybe labs()? Not sure what three most important means.
143 | 
144 | 
145 | ## *Exercises 6.4.4*
146 | **_Question 1_** How do you make legends appear to the left of the plot?
147 | theme(legend.position = "left")
148 | 
149 | **_Question 2_** What’s gone wrong with this plot? How could you fix it?
150 | 
151 | ```{r}
152 | ggplot(mpg, aes(displ, hwy)) +
153 | geom_point(aes(colour = drv, shape = drv)) +
154 | scale_colour_discrete("Drive train")
155 | ```
156 | 
157 | The plot created two legends on the right-hand side, where you could just use one.
158 | 
159 | Here, quoting from the book: "In order for legends to be merged, they must have the same name. So if you
160 | change the name of one of the scales, you’ll need to change it for all of them."
161 | 
162 | The original plot renames only the colour scale. The shape scale keeps its default name,
163 | so the two legends no longer share a name and cannot merge. There are two ways to fix this
164 | plot:
165 | 
166 | ```{r e.6.4.4.2_plot1}
167 | ggplot(mpg, aes(displ, hwy)) +
168 |   geom_point(aes(color = drv, shape = drv)) +
169 |   labs(color = "Drive train", shape = "Drive train")
170 | ```
171 | 
172 | ```{r e.6.4.4.2_plot2}
173 | ggplot(mpg, aes(displ, hwy)) + 
174 |   geom_point(aes(colour = drv, shape = drv)) + 
175 |   scale_colour_discrete("Drive train") +
176 |   scale_shape_discrete("Drive train")
177 | ```
178 | 
179 | **_Question 3_** Can you recreate the code for this plot?
180 | 
181 | Answers are above in Question 2 answer.
182 | 
183 | ## *Exercises 6.5.1*
184 | **_Question 1_** The following code creates two plots of the mpg dataset. Modify the code
185 | so that the legend and axes match, without using facetting!
186 | ```{r, eval=FALSE}
187 | fwd <- subset(mpg, drv == "f")
188 | rwd <- subset(mpg, drv == "r")
189 | ggplot(fwd, aes(displ, hwy, colour = class)) + geom_point()
190 | ggplot(rwd, aes(displ, hwy, colour = class)) + geom_point()
191 | ```
192 | 
193 | Use expand_limits() to give the colour scale of both plots every level of class, so that
194 | both legends list the same classes with the same colours.
195 | 
196 | The question also asks for matching axes, so set the same xlim() and ylim() on both plots
197 | as well.
198 | 
199 | ```{r, e.6.5.1.1_plot1}
200 | fwd <- subset(mpg, drv == "f")
201 | rwd <- subset(mpg, drv == "r")
202 | ggplot(fwd, aes(displ, hwy, colour = class)) + 
203 |   geom_point() +
204 |   scale_color_discrete("Class") +
205 |   xlim(0, 10) +
206 |   ylim(0, 45) +
207 |   expand_limits(color = c("2seater", "compact", "midsize", "minivan",
208 |                                    "pickup", "subcompact", "suv"))
209 | ```
210 | 
211 | ```{r, e.6.5.1.1_plot2}
212 | ggplot(rwd, aes(displ, hwy, colour = class)) + 
213 |   geom_point() +
214 |   scale_color_discrete("Class") +
215 |   xlim(0, 10) +
216 |   ylim(0, 45) +
217 |   expand_limits(color = c("2seater", "compact", "midsize", "minivan",
218 |                                    "pickup", "subcompact", "suv"))
219 | ```
220 | 
221 | 
222 | **_Question 2_** What does expand limits() do and how does it work? Read the source
223 | code.
224 | 
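225 | From the documentation: sometimes you may want to ensure limits include a single value, for all panels or all plots. expand_limits() is a thin wrapper around geom_blank() that makes it easy to add such values: it puts the supplied values into a data frame and adds them as an invisible layer, so the scales are trained on them.
226 | 
227 | This is how expand_limits() solves Question 1.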
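
The wrapper idea can be sketched by writing the blank layer by hand and comparing the result with expand_limits(). This is an illustration of the principle under the assumption that a one-row data frame is enough, not a copy of the actual source code.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(mpg, wt)) + geom_point()

# Built-in helper: make sure the x scale includes 0
p_helper <- p + expand_limits(x = 0)

# Hand-written equivalent: a blank layer whose only data point is x = 0
p_manual <- p +
  geom_blank(aes(x = x), data = data.frame(x = 0), inherit.aes = FALSE)

# Both plots should report the same trained x range
range_helper <- layer_scales(p_helper)$x$range$range
range_manual <- layer_scales(p_manual)$x$range$range
print(range_helper)
print(range_manual)
```

Because the blank layer draws nothing, its only effect is to extend the range that the scale is trained on.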
228 | 
229 | **_Question 3_** What happens if you add two xlim() calls to the same plot? Why?
230 | 
231 | Use question 1's plot as an example, let's test what happens when we add two xlim() calls.
232 | ```{r e.6.5.1.3}
233 | ggplot(rwd, aes(displ, hwy, colour = class)) + 
234 |   geom_point() +
235 |   scale_color_discrete("Class") +
236 |   xlim(0, 10) +
237 |   xlim(0, 20) +
238 |   ylim(0, 45) +
239 |   expand_limits(color = c("2seater", "compact", "midsize", "minivan",
240 |                                    "pickup", "subcompact", "suv"))
241 | ```
242 | 
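243 | The second xlim(0, 20) call overrides the first xlim(0, 10) call, because each xlim() adds an x scale and a plot keeps only one scale per aesthetic. ggplot2 prints a message that the existing x scale is being replaced. As
244 | a result, the finished plot has an x-axis ranging from 0 to 20 instead of from 0 to 10.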
245 | 
246 | **_Question 4_** What does scale x continuous(limits = c(NA, NA)) do?
247 | Let's use the same plot:
248 | ```{r e.6.5.1.4_plot1}
249 | ggplot(fwd, aes(displ, hwy, colour = class)) + 
250 |   geom_point() +
251 |   xlim(0, 20)
252 | ```
253 | 
254 | ```{r e.6.5.1.4_plot2}
255 | ggplot(fwd, aes(displ, hwy, colour = class)) + 
256 |   geom_point() +
257 |   xlim(0, 20) +
258 |   scale_x_continuous(limits = c(NA, NA)) 
259 | ```
260 | 
261 | The new scale replaces the one added by xlim(), and limits = c(NA, NA) tells ggplot2 to compute both limits from the data, so the xlim(0, 20) limits no longer apply.
262 | 
263 | ## *Exercises 6.6.5*
264 | **_Question 1_** Compare and contrast the four continuous colour scales with the four
265 | discrete scales.
266 | 
267 | Please refer to the textbook.
268 | 
269 | **_Question 2_** Explore the distribution of the built-in colors() using the luv colours
270 | dataset.
271 | 
272 | ```{r e.6.6.5.2}
273 | # luv_colours already stores each colour name in its `col` column
274 | 
275 | ggplot(luv_colours, aes(u, v)) +
276 |   geom_point(aes(color = col), size = 2) +
277 |   scale_color_identity() +
278 |   coord_equal()
279 | 
280 | ```
281 | 
282 | 


--------------------------------------------------------------------------------
/ggplot2_solutions_chapter7.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "ggplot2_solutions_chapter7"
 3 | author: "Nade Kang"
 4 | date: "July 11, 2018"
 5 | output: html_document
 6 | ---
 7 | 
 8 | # Solution Manual for ggplot2 Elegant Graphics for Data Analysis by Hadley Wickham
 9 | # ggplot2 Chapter 7 Positioning
10 | 
11 | ## *Exercise 7.2.7*
12 | ### Load Packages
13 | ```{r setup, results='hide'}
14 | knitr::opts_chunk$set(echo = TRUE)
15 | library(tidyverse)
16 | ```
17 | 
18 | **_Question 1_** Diamonds: display the distribution of price conditional on cut and carat.
19 | Try facetting by cut and grouping by carat. Try facetting by carat and
20 | grouping by cut. Which do you prefer?
21 | 
22 | First we create two bases:
23 | ```{r e.7.2.7.1_base}
24 | base_cut <- ggplot(diamonds)
25 | base_cut
26 | base_cut + geom_histogram(aes(x = price, fill = cut), position = "dodge")
27 | 
28 | base_carat <- ggplot(diamonds, aes(carat, price))
29 | base_carat + geom_point()
30 | ```
31 | 
32 | Facet by cut and group by carat:
33 | ```{r e.7.2.7.1_plot1}
34 | base_carat + geom_point(aes(color = carat)) +
35 |   facet_wrap(~ cut, scales = "free")
36 | ```
37 | 
38 | Facet by carat and group by cut:
39 | ```{r e.7.2.7.1_plot2}
40 | diamonds2 <- diamonds
41 | diamonds2$carat_i <- cut_interval(diamonds$carat, 10)
42 | 
43 | # Plot diamonds2 so the facet variable is a column of the plotted data
44 | ggplot(diamonds2, aes(cut, price)) +
45 |   geom_boxplot(aes(color = cut)) +
46 |   coord_flip() +
47 |   facet_wrap(~ carat_i)
48 | ```
49 | From my point of view, facetting by cut and grouping by carat works better.
50 | 
51 | **_Question 2_** Diamonds: compare the relationship between price and carat for each
52 | colour. What makes it hard to compare the groups? Is grouping better
53 | or facetting? If you use facetting, what annotation might you add to make
54 | it easier to see the differences between panels?
55 | 
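56 | With facetting, each colour is drawn in its own panel, so we can no longer compare the colours
57 | directly on one plot. With grouping, the colours overlap heavily, which also makes comparison hard. Either approach can
58 | work, depending on your taste; with facetting, adding the same reference line (for example the overall trend) to every panel makes the differences easier to see.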
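
One possible annotation is sketched below: a single trend line fitted to all diamonds and repeated in every panel. The linear fit and the red colour are assumptions chosen to keep the example fast, and `all_diamonds` is a name made up for this sketch.

```r
library(ggplot2)

# Drop the facetting variable so this layer is drawn in every panel
all_diamonds <- subset(diamonds, select = -color)

p <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(data = all_diamonds, method = "lm", formula = y ~ x,
              colour = "red", se = FALSE) +
  facet_wrap(~ color)
p
```

Each panel then shows how its colour group sits relative to the same overall line.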
59 | 
60 | **_Question 3_** Why is facet wrap() generally more useful than facet grid()?
61 | 
62 | Reading from the documentation:
63 | facet_wrap wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than facet_grid because most displays are roughly rectangular.
64 | 
65 | **_Question 4_** Recreate the following plot. It facets mpg2 by class, overlaying a smooth
66 | curve fit to the full dataset.
67 | 
68 | ```{r mpg2}
69 | mpg2 <- subset(mpg, cyl != 5 & drv %in% c("4", "f") & class != "2seater")
70 | head(mpg2)
71 | ```
72 | ```{r}
73 | ggplot(mpg2, aes(displ, hwy)) + 
74 |   geom_smooth(data = select(mpg2, -class), se = FALSE) + 
75 |   geom_point() + 
76 |   facet_wrap(~class, nrow = 2)
77 | ```
78 | 
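79 | Dropping class from the data of the smooth layer means that layer is not split by the facetting variable, so the curve for the full dataset appears in every panel. From the documentation of the data argument: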
80 | The data to be displayed in this layer. There are three options:
81 | If NULL, the default, the data is inherited from the plot data as specified in the call to ggplot.
82 | A data.frame, or other object, will override the plot data. All objects will be fortified to produce a data frame. See fortify for which variables will be created.
83 | A function will be called with a single argument, the plot data. The return value must be a data.frame, and will be used as the layer data.
84 | 


--------------------------------------------------------------------------------
/ggplot2_solutions_chapter8.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "ggplot2_solutions_chapter8"
  3 | author: "Nade Kang"
  4 | date: "July 11, 2018"
  5 | output: html_document
  6 | ---
  7 | 
  8 | # Solution Manual for ggplot2 Elegant Graphics for Data Analysis by Hadley Wickham
  9 | # ggplot2 Chapter 8 Themes
 10 | 
 11 | ## *Exercise 8.2.1*
 12 | ### Load Packages
 13 | ```{r setup, results='hide'}
 14 | library(tidyverse)
 15 | library(ggthemes)
 16 | ```
 17 | 
 18 | **_Question 1_** Try out all the themes in ggthemes. Which do you like the best?
 19 | ```{r e.8.2.1.1}
 20 | df <- data.frame(x = c(1:4), y = c(5,8))
 21 | df
 22 | ggplot(df, aes(x, y)) +
 23 |   geom_point() +
 24 |   theme_calc() +ggtitle("theme_calc()")
 25 | 
 26 | ggplot(df, aes(x, y)) +
 27 |   geom_point() +
 28 |   theme_economist() +ggtitle("theme_economist()") +
 29 |   ylim(0, 10)
 30 | 
 31 | ggplot(df, aes(x, y)) +
 32 |   geom_point() +
 33 |   theme_excel() +ggtitle("theme_excel()") +
 34 |   ylim(0, 10)
 35 | 
 36 | ggplot(df, aes(x, y)) +
 37 |   geom_point() +
 38 |   theme_fivethirtyeight() +ggtitle("theme_fivethirtyeight()") +
 39 |   ylim(0, 10)
 40 | 
 41 | ggplot(df, aes(x, y)) +
 42 |   geom_point() +
 43 |   theme_pander() +ggtitle("theme_pander()") +
 44 |   ylim(0, 10)
 45 | 
 46 | ggplot(df, aes(x, y)) +
 47 |   geom_point() +
 48 |   theme_solarized() +ggtitle("theme_solarized()") +
 49 |   ylim(0, 10)
 50 | 
 51 | ggplot(df, aes(x, y)) +
 52 |   geom_point() +
 53 |   theme_solarized_2() +ggtitle("theme_solarized_2()") +
 54 |   ylim(0, 10)
 55 | 
 56 | ggplot(df, aes(x, y)) +
 57 |   geom_point() +
 58 |   theme_tufte() +ggtitle("theme_tufte()") +
 59 |   ylim(0, 10)
 60 | 
 61 | ggplot(df, aes(x, y)) +
 62 |   geom_point() +
 63 |   theme_stata() +ggtitle("theme_stata()") +
 64 |   ylim(0, 10)
 65 | 
 66 | ggplot(df, aes(x, y)) +
 67 |   geom_point() +
 68 |   theme_wsj() +ggtitle("theme_wsj()") +
 69 |   ylim(0, 10)
 70 | ```
 71 | 
 72 | I like theme_solarized(), solarized_2, wsj, stata, and tufte.
 73 | 
 74 | **_Question 2_** What aspects of the default theme do you like? What don’t you like?
 75 | What would you change?
 76 | ```{r}
 77 | ggplot(df, aes(x,y)) +
 78 |   geom_point() +
 79 |   theme_grey()
 80 | ```
 81 | 
 82 | This is largely a matter of personal preference. I'll try changing the theme when plotting data.
 83 | 
 84 | **_Question 3_** Look at the plots in your favourite scientific journal. What theme do they
 85 | most resemble? What are the main differences?
 86 | 
 87 | Not quite sure; probably some of the Stata themes are more common? Some finance journals
 88 | use theme_bw() quite often.
 89 | 
 90 | ## *Exercises 8.4.6*
 91 | **_Question 1_** Create the ugliest plot possible! (Contributed by Andrew D. Steen, University
 92 | of Tennessee - Knoxville)
 93 | 
 94 | ```{r e.8.4.6.1}
 95 | ggplot(mpg, aes(displ, hwy)) +
 96 |   geom_point(position = "jitter", aes(color = class), size = 20) +
 97 |   ggtitle("Hope this is ugly enough!!!!! Looks like a prison") +
 98 |   labs(x = "Displacement eGINE", y = "highWAY miles Per miles", color = "type of CAR") +
 99 |   theme(legend.background = element_rect(fill = "black"),
100 |         legend.key = element_rect("lightblue"),
101 |         legend.text = element_text(color = "white", face = "bold", size = 8),
102 |         legend.title = element_text(color = "white", face = "bold", size = 4),
103 |         plot.title = element_text(hjust = 1, size = 18, face = "bold", color = "red"),
104 |         plot.background = element_rect(fill = "black", color = "grey50"),
105 |         panel.background = element_rect(fill = "black", color = "lightblue"),
106 |         panel.grid.major = element_line(colour = "white", size = 3, linetype = "dotted"),
107 |         panel.grid.minor = element_line(color = "red", size = 4),
108 |         axis.text.x = element_text(angle = 90, vjust = 1, size = 10, color = "red"),
109 |         axis.text.y = element_text(angle = 180, hjust = 2, size = 20, color = "white")
110 |         )
111 |   
112 | ```
113 | 
114 | **_Question 2_** theme dark() makes the inside of the plot dark, but not the outside. Change
115 | the plot background to black, and then update the text settings so you
116 | can still read the labels.
117 | 
118 | The best way to do so is to make the axis texts and axis titles color white.
119 | 
120 | ```{r e.8.4.6.2}
121 | ggplot(mpg, aes(displ, hwy)) +
122 |   geom_point(position = "jitter", aes(color = class), size = 2) +
123 |   theme_dark() +
124 |   ggtitle("HWY against DISPL") +
125 |   theme(plot.background = element_rect(fill = "black")) +
126 |   labs(x = "Displacement", y = "Highway Miles Per Gallon", color = "Type of Car") +
127 |   theme(axis.text.x = element_text(size = 10, color = "white"),
128 |         axis.text.y = element_text(size = 10, color = "white"),
129 |         axis.title = element_text(color = "white"),
130 |         plot.title = element_text(color = "white", hjust = 0.5))
131 | ```
132 | 
133 | **_Question 3_** Make an elegant theme that uses “linen” as the background colour and a
134 | serif font for the text.
135 | 
136 | ```{r e.8.4.6.3}
137 | ggplot(mpg, aes(displ, hwy)) +
138 |   geom_point(aes(color = class)) +
139 |   ggtitle("HWY AGAINST DISPL") +
140 |   labs(x = "Displacement", y = "Highway Miles Per Gallon", color = "Type of Car") +
141 |   theme(plot.background = element_rect(fill = "linen"),
142 |         panel.background = element_rect(fill = "linen"),
143 |         plot.title = element_text(hjust = 0.5, size = 15, family = "serif"),
144 |         text = element_text(family = "serif"),
145 |         legend.background = element_rect(fill = "linen"),
146 |         legend.text = element_text(family = "serif"),
147 |         panel.grid.major = element_line(linetype = "dotted", color = "black", size = 0.4),
148 |         panel.grid.minor = element_line(linetype = "dotted", color = "grey50", size = 0.4),
149 |         axis.line = element_line(colour = "grey50"))
150 | ```
151 | 
152 | **_Question 4_** Systematically explore the effects of hjust when you have a multiline title.
153 | Why doesn’t vjust do anything?
154 | 
155 | When you have a long title, it will show up as the following: the title is not completely shown in the output plot.
156 | ```{r e.8.4.6.4_plot1}
157 | ggplot(mpg, aes(displ, hwy)) +
158 |   geom_point(aes(color = class)) +
159 |   ggtitle("This is a very long title to explore the effects of hjust and vjust so let's see what will happen to the plot and answer the question")
160 | ```
161 | 
162 | Let's try different values of vjust:
163 | ```{r e.8.4.6.4_plot2}
164 | ggplot(mpg, aes(displ, hwy)) +
165 |   geom_point(aes(color = class)) +
166 |   ggtitle("This is a very long title to explore the effects of hjust and vjust so let's see what will happen to the plot and answer the question") +
167 |   theme(plot.title = element_text(vjust = 0.5))
168 | ```
169 | 
170 | Whether vjust is 0, 0.5 or 1, it doesn't help with the issue.
171 | ```{r e.8.4.6.4_plot3}
172 | ggplot(mpg, aes(displ, hwy)) +
173 |   geom_point(aes(color = class)) +
174 |   ggtitle("This is a very long title to explore the effects of hjust and vjust so let's see what will happen to the plot and answer the question") +
175 |   theme(plot.title = element_text(vjust = 0))
176 | ```
177 | 
178 | The reason is that vjust controls vertical justification. The area reserved for the title is most likely only as tall as the text itself, so there is
179 | no spare vertical room to move within, and vjust cannot bring back a title that is cut off. Instead, we can use hjust combined with the nested
180 | expression(atop()) functions.
181 | 
182 | expression(atop()) lets you split the title into separate strings, separated by a comma, and stacks them on two lines.
183 | Adding hjust = 0.5 then centres the multi-line title.
184 | ```{r e.8.4.6.4_plot4}
185 | ggplot(mpg, aes(displ, hwy)) +
186 |   geom_point(aes(color = class)) +
187 |   ggtitle(expression(atop("This is a very long title to explore the effects of hjust", "and vjust so let's see what will happen to the plot and answer the question"))) +
188 |   theme(plot.title = element_text(hjust = 0.5))
189 | ```
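
A more systematic comparison can be sketched with a title broken by a newline character and a small set of hjust values. The title text and the names `hjust_values` and `plots` are made up for this illustration.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  ggtitle("A title that is split\nover two lines")

# One plot per hjust value: left, centred and right aligned
hjust_values <- c(0, 0.5, 1)
plots <- lapply(hjust_values, function(h) {
  base + theme(plot.title = element_text(hjust = h))
})
names(plots) <- paste0("hjust_", hjust_values)

plots[["hjust_0.5"]]
```

Printing each element of `plots` in turn shows how the whole two-line title moves horizontally with hjust.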
190 | 
191 | 


--------------------------------------------------------------------------------
/ggplot2_solutions_chapter9.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "ggplot2_solutions_chapter9"
 3 | author: "Nade Kang"
 4 | date: "July 12, 2018"
 5 | output: html_document
 6 | ---
 7 | 
 8 | # Solution Manual for ggplot2 Elegant Graphics for Data Analysis by Hadley Wickham
 9 | # ggplot2 Chapter 9 Data Analysis
10 | 
11 | ## *Exercise 9.3.3*
12 | ### Load Packages
13 | ```{r setup, results='hide'}
14 | library(tidyverse)
15 | library(ggthemes)
16 | ```
17 | 
18 | **_Question 1_** How can you translate each of the initial example datasets into the other
19 | form?
20 | 
21 | I do not have the ec2 dataset. Instead, I'll use the data in tidyverse:
22 | ```{r e.9.3.3.1}
23 | table4b
24 | table4b_trans <- table4b %>%
25 |   gather(key = "year", value = "population", `1999`, `2000`)
26 | table4b_trans
27 | 
28 | table4b_trans %>% 
29 |   spread(key = "year", value = "population")
30 | ```
31 | 
32 | **_Question 2_** How can you convert back and forth between the economics and
33 | economics long datasets built into ggplot2?
34 | 
35 | Looking at the economics and economics_long datasets, it is easy to see that
36 | the latter is simply the gathered economics dataset.
37 | The extra value01 column is value rescaled to the range 0 to 1 within each variable.
38 | 
39 | ```{r e.9.3.3.2}
40 | economics
41 | economics_long
42 | 
43 | economics_longe <- economics %>% 
44 |   gather(pce, pop, psavert, uempmed, unemploy, key = "variable", value = "value")
45 | economics_longe
46 | ```
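
The round trip, including a value01 column, can be sketched as below. The rescaling formula is my reading of what value01 contains, and the names `econ_long` and `econ_wide` are made up for this example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Wide to long: gather every column except date, then rescale each variable to [0, 1]
econ_long <- economics %>%
  gather(key = "variable", value = "value", -date) %>%
  group_by(variable) %>%
  mutate(value01 = (value - min(value)) / (max(value) - min(value))) %>%
  ungroup()

# Long to wide: drop the derived column, then spread the variables back out
econ_wide <- econ_long %>%
  select(-value01) %>%
  spread(key = variable, value = value)

head(econ_long)
head(econ_wide)
```

The value01 column has to be dropped before spreading, otherwise each row keeps a distinct value01 and the variables do not line up on one row per date.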
47 | 
48 | **_Question 3_** Install the EDAWR package from https://github.com/rstudio/EDAWR.
49 | Tidy the storms, population and tb datasets.
50 | 
51 | EDAWR is not on CRAN, so it cannot be installed by
52 | typing install.packages("EDAWR"). It has to be installed from the GitHub repository given in the question, for example with devtools::install_github("rstudio/EDAWR").
53 | 
54 | With install.packages() I got a warning:
55 | Warning in install.packages :
56 |   package ‘EDAWR’ is not available (for R version 3.4.4)
57 | So I could not complete this question here.
58 | 
59 | ## *Exercise 9.4.1*
60 | **_Question 1_** Install the EDAWR package from https://github.com/rstudio/EDAWR.
61 | Tidy the who dataset.
62 | 
63 | I could not complete this one either, for the same reason.
64 | 
65 | **_Question 2_** Work through the demos included in the tidyr package (demo(package =
66 | "tidyr"))
67 | ```{r, eval=FALSE}
68 | (demo(package = "tidyr"))
69 | ```
70 | 
71 | ```{r e.9.4.1.2_1}
72 | smiths <- tidyr::smiths
73 | smiths
74 | 
75 | smiths_gather <- smiths %>% 
76 |   gather(age, weight, height, key = "characters", value = "value")
77 | smiths_gather
78 | ```
79 | 
80 | We can spread the population dataset:
81 | 
82 | ```{r e.9.4.1.2_2}
83 | population <- tidyr::population
84 | head(population)
85 | 
86 | pop_spread <- population %>% 
87 |   spread(key = "year", value = "population")
88 | pop_spread
89 | ```
90 | 
91 | 


--------------------------------------------------------------------------------
/random practice notes.R:
--------------------------------------------------------------------------------
  1 | ?diamonds
  2 | head(diamonds)
  3 | ?facet_wrap
  4 | ?geom_smooth
  5 | rect <- data.frame(x = 50, y = 50)
  6 | line <- data.frame(x = c(1, 200), y = c(100, 1))
  7 | 
  8 | base <- ggplot(mapping = aes(x, y)) +
  9 |   geom_tile(data = rect, aes(width = 50, height = 50)) +
 10 |   geom_line(data = line) +
 11 |   xlab(NULL) + ylab(NULL)
 12 | base
 13 | base + coord_polar("x")
 14 | base + coord_polar("y")
 15 | base + coord_flip()
 16 | base + coord_trans(y = "log10")
 17 | base + coord_fixed()
 18 | 
 19 | 
 20 | cnmap <-  ggplot(map_data("world", region = "China"), aes(long, lat, group = group)) +
 21 |   geom_polygon(fill = "white", colour = "black") +
 22 |   xlab(NULL) + ylab(NULL)
 23 | cnmap
 24 | 
 25 | nzmap <- ggplot(map_data("nz"), aes(long, lat, group = group)) +
 26 |   geom_polygon(fill = "white", colour = "black") +
 27 |   xlab(NULL) + ylab(NULL)
 28 | nzmap
 29 | ?map_data
 30 | ?purrr
 31 | ?map
 32 | help(package='maps')
 33 | 
 34 | #install.packages("ggthemes")
 35 | library(ggthemes)
 36 | df <- data.frame(x = 1:3, y = 1:3)
 37 | base <- ggplot(df, aes(x, y)) + geom_point()
 38 | base
 39 | base + theme(plot.background = element_rect(fill = "grey88", color = NA))
 40 | base + theme(plot.background = element_rect(color = "red", size =2))
 41 | base + theme(plot.background = element_rect(fill = "linen"))
 42 | base +theme(plot.background = element_rect(fill = "linen"), panel.background = element_rect(fill = "linen"),
 43 |             panel.grid.major = element_line(color = "black", size = 0.5, linetype = "dotted"),
 44 |             panel.grid.minor = element_line(color = "red", size = 0.1, linetype = "dotted"))
 45 | base + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
 46 |              panel.background = element_blank())
 47 | base + theme(axis.title.x = element_blank(), axis.title.y = element_blank(), panel.background = element_blank(),
 48 |              axis.line = element_line(color = "grey50"))
 49 | 
 50 | oldtheme <- theme_update(
 51 |   plot.background = element_rect(fill = "lightblue3", color = NA),
 52 |   panel.background = element_rect(fill = "linen", color = NA),
 53 |   axis.text = element_text(color = "linen"),
 54 |   axis.title = element_text(color = "linen")
 55 | )
 56 | base
 57 | theme_set(oldtheme)
 58 | base
 59 | ?theme_update
 60 | 
 61 | p <- ggplot(mtcars, aes(mpg, wt)) +
 62 |   geom_point()
 63 | p
 64 | old <- theme_set(theme_bw())
 65 | p
 66 | theme_set(old)
 67 | p
 68 | 
 69 | 
 70 | df <- data.frame(x = 1:3, y = 1:3)
 71 | base <- ggplot(df, aes(x, y)) + geom_point()
 72 | 
 73 | base + theme(axis.line = element_line(color = "grey50", size = 1))
 74 | base + theme(axis.text = element_text(color = "blue", size= 10))
 75 | base + theme(axis.text.x = element_text(angle = -90, vjust = 0.5))
 76 | 
 77 | 
 78 | df <- data.frame(
 79 |   x = c("label", "a long label", "an even longer label"),
 80 |   y = 1:3
 81 | )
 82 | base <- ggplot(df, aes(x, y)) + geom_point()
 83 | base
 84 | base + theme(axis.text.x = element_text(angle = -30, vjust = 1, hjust = 0)) +
 85 |   xlab(NULL) + ylab(NULL)
 86 | 
 87 | 
 88 | df <- data.frame(x = 1:4, y = 1:4, z = rep(c("a", "b"), each = 2))
 89 | base <- ggplot(df, aes(x, y, colour = z)) + geom_point()
 90 | base
 91 | base + theme(legend.background = element_rect(fill = "lemonchiffon",
 92 |                                                color = "grey50",
 93 |                                                size = 1))
 94 | base + theme(legend.key = element_rect(color = "grey50"),
 95 |              legend.key.width = unit(0.9, "cm"),
 96 |              legend.key.height = unit(0.75, "cm"))
 97 | base + theme(legend.title = element_text(size = 15, face = "bold"),
 98 |              legend.text = element_text(size = 13))
 99 | base <- ggplot(df, aes(x, y)) + geom_point()
100 | base + theme(panel.background = element_rect(fill = "lightblue"))
101 | base + theme(panel.grid.major = element_line(color = "grey50", size = 0.8, linetype = "dotted"))
102 | base + theme(panel.grid.major.x = element_line(color = "grey50", size = 0.8, linetype = "dotted"))
103 | base2 <- base + theme(plot.background = element_rect(color = "grey50"))
104 | base2 + theme(aspect.ratio = 2/1)
105 | 
106 | diamonds_ok <- diamonds %>% filter(x > 0, y > 0, y < 20)
107 | ggplot(diamonds_ok, aes(x,y)) +
108 |   geom_bin2d() +
109 |   geom_abline(slope = 1, color = "white", size = 1, alpha = 0.5)
110 | ?diamonds
111 | dia <- filter(diamonds, z > 50)
112 | dia
113 | is.na(1 + 2i)
114 | ?is.na
115 | is.na(c(NA, NA, 2, 4))
116 | (xx <- c(0:4))
117 | xx
118 | is.na(xx) <- c(2, 4)
119 | xx
120 | !is.na(xx)
121 | 
122 | ?!is.na
123 | c <- c()
124 | is.na(NULL)
125 | ?is.na
126 | is.na(2*3)
127 | 
128 | ?xor
129 | x <- TRUE
130 | !x
131 | xor(x, !x)
132 | 
133 | # install.packages("ggplot2movies")
134 | movies
135 | View(movies)
136 | x <- 5.0e+05
137 | as.numeric(x)
138 | 
139 | options(scipen = 999)
140 | typeof(movies$budget)
141 | movies3 <- movies %>% 
142 |   mutate(budget = as.integer(budget))
143 | View(movies3)
144 | 
145 | 
146 | diamonds_ok2 <- mutate(diamonds_ok,
147 |                        sam = x - y,
148 |                        size = sqrt(x ^ 2 + y ^ 2))
149 | diamonds_ok2
150 | ggplot(diamonds_ok2, aes(size, sam)) +
151 |   stat_bin2d()
152 | 
153 | ggplot(diamonds_ok2, aes(abs(sam))) +
154 |   geom_histogram(binwidth = 0.10)
155 | diamonds_ok3 <- filter(diamonds_ok2, abs(sam) < 0.20)
156 | ggplot(diamonds_ok3, aes(abs(sam))) +
157 |   geom_histogram(binwidth = 0.01)
158 | 
159 | width <- diamonds$x * diamonds$y /2
160 | width / diamonds$z * 100
161 | 
162 | diamonds$depth / 100 / diamonds$z * 2
163 | ?diamonds
164 | 
165 | 
166 | by_clarity <- diamonds %>% 
167 |   group_by(clarity) %>% 
168 |   summarise(
169 |     n = n(),
170 |     mean = mean(price),
171 |     lq = quantile(price, 0.25),
172 |     uq = quantile(price, 0.75)
173 |   )
174 | by_clarity
175 | ggplot(by_clarity, aes(clarity, mean)) +
176 |   geom_linerange(aes(ymin = lq, ymax = uq)) +
177 |   geom_line(aes(group = 1), color = "grey50") +
178 |   geom_point(aes(size = n))
179 | typeof(movies$length)
180 | mean(movies$length)
181 | ?diamonds
182 | 
183 | runif(100)
184 | ?runif
185 | ?qqplot
186 | ?id.method
187 | ?gvlma
188 | 
189 | library(car)
190 | states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy",
191 |                                       "Income", "Frost")])
192 | states
193 | fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
194 | summary(fit)
195 | qqPlot(fit, labels = row.names(states), id.method = "identify", simulate = TRUE, main = "Q-Q Plot")
196 | 
197 | x_values <- c(Murder, Population, Illiteracy, Income, Frost)
198 | 
199 | plot(Murder ~ Population + Illiteracy + Population + Income + Frost, data = states)
200 | 
201 | library(car)
202 | states <- as.data.frame(state.x77[,c("Murder", "Population",
203 |                                      "Illiteracy", "Income", "Frost")])
204 | fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
205 | qqPlot(fit, labels=row.names(states), id.method="identify",
206 |        simulate=TRUE, main="Q-Q Plot")
207 | 
208 | 
209 | diamonds2 <- diamonds %>% 
210 |   filter(carat <= 2) %>% 
211 |   mutate(lcarat = log2(carat),
212 |          lprice = log2(price))
213 | diamonds2
214 | 
215 | ggplot(diamonds2, aes(lcarat, lprice)) +
216 |   geom_bin2d() +
217 |   geom_smooth(method = "lm", color = "white", size = 2, se = FALSE)
218 | mod <- lm(lprice ~ lcarat, data = diamonds2)
219 | summary(mod)
220 | 
221 | diamonds2 <- diamonds2 %>% 
222 |   mutate(rel_price = resid(mod))
223 | ggplot(diamonds2, aes(carat, rel_price)) +
224 |   geom_bin2d()
225 | xgrid <- seq(-2, 1, by = 1/3)
226 | xgrid
227 | data.frame(logx = xgrid, x = round(2 ^ xgrid, 2))
228 | color_cut <- diamonds2 %>%
229 |   group_by(color, cut) %>%
230 |   summarise(
231 |     price = mean(price),
232 |     rel_price = mean(rel_price)
233 |   )
234 | color_cut
235 | ggplot(color_cut, aes(color, price))  +
236 |   geom_line(aes(group = cut), colour = "grey80") +
237 |   geom_point(aes(colour = cut))
238 | ggplot(color_cut, aes(color, rel_price)) +
239 |   geom_line(aes(group = cut), colour = "grey80") +
240 |   geom_point(aes(colour = cut))
241 | ?diamonds
242 | 
243 | mtcars
244 | new <- mtcars %>%
245 |   gather(-mpg, -hp, -cyl, key = "var", value = "value")
246 | View(mtcars)
247 | View(new)
248 | geom_histogram
249 | install.packages("wesanderson")
250 | library(wesanderson)
251 | 


--------------------------------------------------------------------------------