├── Easy_Mistakes_to_Avoid.Rmd
├── Easy_Mistakes_to_Avoid.html
├── Lecture_01_Data_Communication.Rmd
├── Lecture_01_Data_Communication.html
├── Lecture_01_Faucet.jpg
├── Lecture_02_Bible_Cross_Ref.gif
├── Lecture_02_Bible_Cross_Ref.png
├── Lecture_02_Clutter_and_Focus.Rmd
├── Lecture_02_Clutter_and_Focus.html
├── Lecture_02_Coffee.jpg
├── Lecture_02_DuBois.PNG
├── Lecture_02_Healthcare.PNG
├── Lecture_02_Minard.gif
├── Lecture_02_NAEP.csv
├── Lecture_02_NAEP.xlsx
├── Lecture_02_Network_Graph.jpg
├── Lecture_02_Unemployment.PNG
├── Lecture_03_Audience.Rmd
├── Lecture_03_Audience.html
├── Lecture_03_Cardash.jpg
├── Lecture_03_Preattentive.PNG
├── Lecture_04_Common_R_Problems.Rmd
├── Lecture_04_Common_R_Problems.html
├── Lecture_04_R_Walkthrough.R
├── Lecture_04_R_and_RMarkdown.Rmd
├── Lecture_04_R_and_RMarkdown.html
├── Lecture_04_example_rmarkdown.Rmd
├── Lecture_04_second_example.Rmd
├── Lecture_05_Example_Sprail.Rmd
├── Lecture_05_Example_Sprail.html
├── Lecture_05_and_06_Data_Wrangling.Rmd
├── Lecture_05_and_06_Data_Wrangling.html
├── Lecture_05_and_06_example_code.R
├── Lecture_06_GG_Components.png
├── Lecture_07_EDA.Rmd
├── Lecture_07_EDA.html
├── Lecture_08_Basics_of_ggplot2.Rmd
├── Lecture_08_Basics_of_ggplot2.html
├── Lecture_09_Structuring_ggplot.Rmd
├── Lecture_09_Structuring_ggplot.html
├── Lecture_10_Styling_ggplot.Rmd
├── Lecture_10_Styling_ggplot.html
├── Lecture_11_Bob_Ross.Rmd
├── Lecture_11_Bob_Ross.html
├── Lecture_12_Hexmap.png
├── Lecture_12_State_Heatmap.png
├── Lecture_12_Story_and_Geometry.Rmd
├── Lecture_12_Story_and_Geometry.html
├── Lecture_13_Statistics.Rmd
├── Lecture_13_Statistics.html
├── Lecture_13_climate.png
├── Lecture_14_Clutter.png
├── Lecture_14_Color.png
├── Lecture_14_Dashboards_in_R.Rmd
├── Lecture_14_Dashboards_in_R.html
├── Lecture_14_Example_Dashboard.Rmd
├── Lecture_14_Flexdashboard.png
├── Lecture_14_Flexdashboard_row.png
├── Lecture_14_Grids.png
├── Lecture_14_KPI.png
├── Lecture_15_Example_Shiny.Rmd
├── Lecture_15_Interactive_Visuals_in_R.Rmd
├── Lecture_15_Interactive_Visuals_in_R.html
├── Lecture_15b_Bob_Ross_My_ATUS_Dashboard_Final.Rmd
├── Lecture_16_Data_Import.png
├── Lecture_16_Intro_to_Tableau.Rmd
├── Lecture_16_Intro_to_Tableau.html
├── Lecture_16_Scorecard.twb
├── Lecture_16_Scorecard.twbx
├── Lecture_16_Scorecard_YouDoIt.twb
├── Lecture_16_Tableau_Boxplot.png
├── Lecture_16_Tableau_Employment.png
├── Lecture_16_Tableau_Main.png
├── Lecture_16_Tableau_Map.png
├── Lecture_16_Tableau_Repayment.png
├── Lecture_17_Axis.png
├── Lecture_17_CancelledFlights.png
├── Lecture_17_CumulativeCancellations.png
├── Lecture_17_DelaybyDay.png
├── Lecture_17_Delayed_Flights.twb
├── Lecture_17_Delayed_Flights_Answers.twb
├── Lecture_17_MiddayFlights.png
├── Lecture_17_MiddayFlightsAxes.png
├── Lecture_17_More_Tableau.Rmd
├── Lecture_17_More_Tableau.html
├── Lecture_17_Percent.png
├── Lecture_17_Table_Calc.png
├── Lecture_17_Tickmarks.png
├── Lecture_18_A_Mess.png
├── Lecture_18_AddWorksheets.png
├── Lecture_18_Dashboard_Format.png
├── Lecture_18_Dashboards_in_Tableau.Rmd
├── Lecture_18_Dashboards_in_Tableau.html
├── Lecture_18_Oster_Data.twb
├── Lecture_18_Oster_Data_packaged.twbx
├── README.md
├── SPrail.zip
├── SPrail_april.csv
├── SPrail_july.csv
├── SPrail_june.csv
├── SPrail_may.csv
├── Scorecard_for_Tableau.csv
├── gganimate_guide.Rmd
├── gganimate_guide.html
├── king_dailyvisits.Rdata
├── no_border.css
├── oster_vitamine_data.Rdata
├── oster_vitamine_data.hyper
├── robocalls.csv
└── robocalls_excel.xlsx
/Lecture_01_Data_Communication.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 1: Data Communication"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 |
18 | ```{r setup, include=FALSE}
19 | knitr::opts_chunk$set(echo = FALSE)
20 | library(tidyverse)
21 | library(gganimate)
22 | ```
23 |
24 | ## Data Communication
25 |
26 | ```{r, results = 'asis'}
27 | cat("
28 | ")
34 | ```
35 |
36 | Welcome to Data Communication!
37 |
38 | ## Data Communication
39 |
40 | This class is about
41 |
42 | - How to turn data into a message
43 | - How to cleanly communicate that message
44 | - Technically, how to create that communication
45 |
46 | ## Media
47 |
48 | This will largely focus on *data visualization* but will also cover:
49 |
50 | - Exploratory data analysis
51 | - Tables
52 | - Notebooks
53 | - Workflow
54 | - Data cleaning and manipulation
55 |
56 | ## What We're Doing
57 |
58 | - This is sort of a mutt of a class, covering coding, communications concepts, data management and cleaning, and even some analytical stuff
59 | - That's because this is *really* a course about preparing you to go out into the real world and work with data
60 | - The thing I really want you to come away with after this class is this:
61 | - When you are working with data, **think carefully about what you're doing, and make sure you're doing it right.**
62 | - It really is just **thinking carefully** and **attending to detail**, not technical skill. This is more important than coding. We'll have to code, but code can only be as good as what you're trying to make it do
63 |
64 | ## What We're Doing
65 |
66 | - You're going to leave the world of a classroom where the project is set up so your prof knows there *is* a right answer, what that right answer is, and they can check your work
67 | - In the real world, *you* will be expected to produce good work on your own, which means *you'll* be checking your work, and there won't be a "wrong answer" buzzer that sounds if you make a mistake
68 | - I want you to build enough confidence in working with data that you can look at something the computer has spit out and think "the computer only does what it's told. I know what this *should* be. This is not yet right."
69 |
70 | ## What We're Doing
71 |
72 | So when you do **anything** with data - load a file, create a variable, make a graph, do an analysis, ask:
73 |
74 | 1. What am I **trying to do**?
75 | 1. Did it **work properly**? How can I **CHECK** if it worked properly?
76 | 1. If it's a form of communication, have I made it **as clear as possible**? Try to find places where people might get confused.
77 |
78 | Never think "I pushed the button/wrote the line that made the thing go. I'm done." It's on you to fulfill the requirement in a way that's **good**. Take pride in your work, or it will be bad.
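
One way to act on question 2 is to write the check down as code rather than eyeballing it. This is only a minimal sketch, assuming the built-in `mtcars` data and the `dplyr` package; the variable `wt_kg` and the plausibility bounds are illustrative choices, not part of the course materials.

```r
library(dplyr)

# What am I trying to do?
# Convert car weight from thousands of pounds (wt) to kilograms
cars_kg <- mtcars %>%
  mutate(wt_kg = wt * 1000 * 0.4536)

# Did it work properly? Check instead of assuming
summary(cars_kg$wt_kg)                    # do the magnitudes look like car weights?
stopifnot(!anyNA(cars_kg$wt_kg))          # no missing values were created
stopifnot(all(cars_kg$wt_kg > 500 &
              cars_kg$wt_kg < 3000))      # illustrative plausibility bounds, in kg
stopifnot(nrow(cars_kg) == nrow(mtcars))  # no rows gained or lost
```

If any of these checks fails, the computer did exactly what it was told, and what it was told was not yet right.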
79 |
80 | ## Noticing
81 |
82 | More than anything else, this class is about *noticing things*. I'm serious.
83 |
84 | - Data analysis is full of opportunities to mess up. Mistakes are unavoidable, and results that make no sense just happen
85 | - Being a good analyst means noticing mistakes, and noticing what your output looks like, and taking responsibility for fixing it
86 | - Be careful in your work. Take time. Review.
87 | - That's what I'm hoping to teach, and something I *will* be assessing you on
88 |
89 | ## What We're Doing
90 |
91 | It takes no technical/coding skill, or extensive explanation from me, to see why this graph is bad. If you produce this graph, you should take it upon yourself to fix it. Ask: how can I improve this?
92 |
93 | ```{r, echo = FALSE, eval = TRUE, fig.height = 3, fig.width = 6}
94 | data(gapminder, package = 'gapminder')
95 | countries <- gapminder %>%
96 | pull(country) %>%
97 | unique() %>%
98 | sort()
99 | gapminder %>%
100 | filter(year == 2007) %>%
101 | mutate(lexp = lifeExp/max(lifeExp) + .1) %>%
102 | mutate(countryy_2 = as.numeric(factor(country))) %>%
103 | ggplot(aes(x = countryy_2, y = lexp)) + geom_line() +
104 | scale_x_continuous(breaks = 1:length(countries), labels = \(x) countries[x]) +
105 | labs(title = 'Life expectancy compared to most')
106 |
107 | ```
108 |
109 | ## Admin
110 |
111 | Let's do some housekeeping:
112 |
113 | - Course website
114 | - Syllabus
115 | - Expectations and assignments
116 |
117 | ## Resources
118 |
119 | In addition to the course website and Healy's [Data Visualization](https://kieranhealy.org/publications/dataviz/):
120 |
121 | - The [R Graphics Cookbook](https://r-graphics.org/)
122 | - [Data Visualization blogs](https://www.tableau.com/learn/articles/best-data-visualization-blogs)
123 | - [LOST-STATS](https://lost-stats.github.io)
124 | - The internet writ large
125 | - Code for all these slides (RMarkdown and ggplot2 examples) is on the course GitHub page
126 |
127 | ## What is Data Communication?
128 |
129 | - There is a lot of information in the world
130 | - And a lot of information at your fingertips
131 | - *Too much*
132 | - And so we simplify to tell the *story* underlying the data
133 |
134 | ## What is Data Communication?
135 |
136 | - I have a **result** from my data
137 | - I want you to **understand and believe** my result
138 | - How can I demonstrate this result to you so you'll understand it?
139 | - How can I present the data so that you understand where the result came from and why you should agree that the result is accurate?
140 |
141 | ## The Map and the Territory
142 |
143 | - Someone asks you for directions to Dick's on Broadway
144 | - Do you hand them your 3.2GB perfectly detailed shapefile of Capitol Hill?
145 | - The answer is in there, and much more precisely than you could possibly tell them
146 | - But it doesn't really answer their question, right?
147 |
148 | ## The Map and the Territory
149 |
150 | - The goal of a data *analyst* is to take that shapefile and figure out how to get to Dick's
151 | - The goal of a data communicator is to take what the data analyst figured out and figure out *what part of the map to show you to help you understand how to get to Dick's*
152 | - A good data communicator will make understanding the directions easy and obvious
153 |
154 | ## Storytelling With Data
155 |
156 | 1. Understand the context
157 | 1. Figure out *the story* (what you want the reader to understand, and why)
158 | 1. Choose an appropriate visual display
159 | 1. Eliminate clutter
160 | 1. Focus attention where you want it
161 | 1. Think like a designer
162 | 1. Tell the story
163 |
164 | ## Examples Outside of Data
165 |
166 | - Let's consider some examples of effective communication of information outside the narrow range of "data communication"
167 |
168 |
169 | ## How Does This Faucet Work?
170 |
171 | ```{r, out.width="650px"}
172 | knitr::include_graphics("Lecture_01_Faucet.jpg")
173 | ```
174 |
175 | ## Understand the context
176 |
177 | The person who designed this faucet understands, hopefully, how water pipes work and how opening a valve can allow water to flow
178 |
179 |
180 | ## Figure out the story
181 |
182 | *What is important for the audience to know*?
183 |
184 | I don't care if the user understands how pipes work, or their history of water usage.
185 |
186 | I need the user to know how to properly turn the water on
187 |
188 |
189 | ## Choose an appropriate visual display
190 |
191 | We have a handle close to the source of the water
192 |
193 | It implies the ways the handle can be turned - towards us or away, left and right
194 |
195 | The display doesn't give us any information about how those directions relate to pressure or temperature (what version of a faucet might?)
196 |
197 | ## Eliminate clutter
198 |
199 | Nothin' but handle
200 |
201 | We could have other stuff here - sink stopper, an LCD with the weather report, but do we need it?
202 |
203 | ## Focus attention where you want it
204 |
205 | There's nowhere to look but the handle (other than the spigot, not pictured)
206 |
207 | The shape and design pushes you towards it - it calls for a hand!
208 |
209 | ## Think like a designer
210 |
211 | We want the user to understand that they can pull or rotate the handle to affect the water flow
212 |
213 | This design *affords* both of those uses
214 |
215 | And nothing else
216 |
217 | There aren't a lot of ways to use this wrong, other than messing up pressure vs. temperature
218 |
219 | ## Tell the story
220 |
221 | Water flow can be controlled by twisting this handle
222 |
223 | If you were a monkey who had never experienced plumbing, it would only take you about ten seconds to follow the design to that handle, pull it, and learn about the connection between handles and water flow
224 |
225 | ## Gapminder
226 |
227 | - Let's move into some data
228 | - Gapminder (from the Gapminder institute) is a data set that, among other things, shows how differences between countries change over time
229 | - One thing it is commonly used to show is that economic development aids health development
230 | - GDP per capita $\rightarrow$ life expectancy
231 | - Also, generally, both of those things have improved over time
232 |
233 | ## Gapminder
234 |
235 | ```{r}
236 | data(gapminder, package = 'gapminder')
237 | gapminder
238 | ```
239 |
240 | ## Understand the Context
241 |
242 | How do things work here?
243 |
244 | - We know that life expectancy and GDP per capita go together closely
245 |
246 | ## Figure out the story
247 |
248 | What do we want people to learn?
249 |
250 | - GDP per capita and life expectancy go together strongly
251 | - Both GDP per capita and life expectancy have increased a lot over time (i.e. things on this front are getting better!)
252 |
253 | This is useful information and I can see why someone would want to understand this for its own sake, to understand the world better (and perhaps have some actionable takeaway as a result!)
254 |
255 | ## Choose an appropriate visual display
256 |
257 | - We want something that will show a relationship between two variables with many observations
258 | - NOW WE HAVE SOME OUTPUT. Put on those "how can we make it better?" goggles: the first output we get is NOT DONE
259 |
260 | ```{r, fig.width=6, fig.height=4}
261 | ggplot(gapminder,
262 | aes(x = gdpPercap, y = lifeExp)) +
263 | geom_point() +
264 | labs(x = "GDP per Capita", y = "Life Expectancy")
265 | ```
266 |
267 | ## Eliminate clutter
268 |
269 | - That's a lot of dots! Can we tell the same story by focusing on just a few countries?
270 | - Also, that's a lot of background ink...
271 |
272 | ```{r, fig.width=6, fig.height=4}
273 | ggplot(gapminder %>%
274 | filter(country %in% c('Brazil','India','Spain')),
275 | aes(x = gdpPercap, y = lifeExp, color = country)) +
276 | geom_line() +
277 | theme_minimal() +
278 | labs(x = "GDP per Capita", y = "Life Expectancy",
279 | color = 'Country') +
280 | theme(legend.position = c(.8,.2),
281 | legend.background = element_rect())
282 | ```
283 |
284 | ## Focus attention where you want it
285 |
286 | - Those few high-GDP observations are taking up a LOT of space, as opposed to that left blob. Let's put the x-axis on a log scale
287 |
288 | ```{r, fig.width=6, fig.height=4}
289 | ggplot(gapminder %>%
290 | filter(country %in% c('Brazil','India','Spain')),
291 | aes(x = gdpPercap, y = lifeExp, color = country)) +
292 | geom_line() +
293 | theme_minimal() +
294 | scale_x_log10() +
295 | labs(x = "GDP per Capita (log scale)", y = "Life Expectancy",
296 | color = 'Country') +
297 | theme(legend.position = c(.8,.2),
298 | legend.background = element_rect())
299 | ```
300 |
301 | ## Think like a designer
302 |
303 | - Why make the reader work?
304 | - Also, realize this graph sort of feels like it's moving forward in time. Uh-oh...
305 |
306 | ```{r, fig.width=6, fig.height=4}
307 | ggplot(gapminder %>%
308 | filter(country %in% c('Brazil','India','Spain')),
309 | aes(x = gdpPercap, y = lifeExp, color = country)) +
310 | geom_line() +
311 | theme_minimal() +
312 | scale_x_log10() +
313 | labs(x = "GDP per Capita (log scale)", y = "Life Expectancy",
314 | color = 'Country') +
315 | annotate('text',x = 2300, y = 66, label = 'India', color = 'green') +
316 | annotate('text',x = 10000, y = 68, label = 'Brazil', color = 'red') +
317 | annotate('text',x = 25000, y = 75, label = 'Spain', color = 'blue') +
318 | guides(color = 'none')
319 | ```
320 |
321 | ## Tell the story
322 |
323 | - Realize that we've lost the "things get better over time" angle
324 | - And also lost the part where we want to talk about the whole world!
325 | - Use what we've done so far to think about how we can show the dual GDP-and-life-expectancy improvements *over time* for everyone
326 |
327 | ## What could still be improved?
328 |
329 | ```{r}
330 | options(gganimate.dev_args = list(width = 650, height = 400))
331 | (ggplot(gapminder,
332 | aes(x = gdpPercap, y = lifeExp, color = continent)) +
333 | geom_point() +
334 | theme_minimal() +
335 | scale_x_log10() +
336 | labs(x = "GDP per Capita (log scale)", y = "Life Expectancy",
337 | title = "GDP and Life Expectancy by Country, 1952-2007",
338 | color = 'Continent') +
339 | transition_time(year) +
340 | theme(axis.title.x = element_text(size=15),
341 | axis.title.y = element_text(size=15),
342 | title = element_text(size=15))) %>%
343 | animate(nframes = 200,end_pause = 30)
344 | ```
345 |
346 | ## Let's See What We Can Get
347 |
348 | - Find a "good" chart from [informationisbeautiful.net](https://informationisbeautiful.net)
349 | - And a "bad" chart from [junkcharts.typepad.com/](https://junkcharts.typepad.com/)
350 | - And go through our steps. What are they trying to say? How do they do it? What mistakes do they make? What is unclear?
351 | - We will discuss
352 |
353 |
354 |
--------------------------------------------------------------------------------
/Lecture_01_Faucet.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_01_Faucet.jpg
--------------------------------------------------------------------------------
/Lecture_02_Bible_Cross_Ref.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Bible_Cross_Ref.gif
--------------------------------------------------------------------------------
/Lecture_02_Bible_Cross_Ref.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Bible_Cross_Ref.png
--------------------------------------------------------------------------------
/Lecture_02_Clutter_and_Focus.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 2: Clutter and Focus"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | library(tidyverse)
19 | library(lubridate)
20 | library(gghighlight)
21 | library(ggthemes)
22 | ```
23 |
24 |
25 |
26 | ## Clutter and Focus
27 |
28 | ```{r, results = 'asis'}
29 | cat("
30 | ")
36 | ```
37 |
38 | - We want to tell a story with our data
39 | - That is, we want to *demonstrate some interesting finding* from our data
40 | - Some stories are clear
41 | - Others are complex
42 | - They all need to be clear
43 |
44 | ## Clutter and Focus
45 |
46 | - A common mistake is to *show you the data* rather than *show you the story*
47 | - The story often exists within part of the data
48 | - We want to understand some sort of *continuity* or some sort of *distinction*
49 | - Everything that takes away from that makes the story more opaque
50 |
51 | ## Examples
52 |
53 | - Let's look at some data visualizations that are cluttered
54 | - Some are still *beautiful* and lovely and works of art
55 | - Some are impressive with the amount of information they're able to get on one graph
56 | - But they can obscure the point they're trying to make
57 | - With each, let's think about why
58 |
59 | ## Bible Cross-References
60 |
61 | - The following graph shows how different chapters of the Bible refer to each other
62 |
63 | ---
64 |
65 | ```{r}
66 | knitr::include_graphics('Lecture_02_Bible_Cross_Ref.gif')
67 | ```
68 |
69 | ## Coffee
70 |
71 | - How much coffee do South Koreans drink?
72 | - Simple question, right?
73 |
74 | ---
75 |
76 | ```{r}
77 | knitr::include_graphics('Lecture_02_Coffee.jpg')
78 | ```
79 |
80 |
81 | ## Health care
82 |
83 | - This graph just wants to get across three statistics about health care
84 | - And how those statistics vary by state
85 |
86 | ---
87 |
88 | ```{r}
89 | knitr::include_graphics('Lecture_02_Healthcare.PNG')
90 | ```
91 |
92 | ## Network Graphs
93 |
94 | - You see a lot of *network graphs* floating around these days
95 | - They look super cool
96 | - But they can be hard to actually learn anything from
97 | - Unless the story is just "things sure are connected, aren't they?" which isn't all that interesting
98 |
99 | ---
100 |
101 | ```{r}
102 | knitr::include_graphics('Lecture_02_Network_Graph.jpg')
103 | ```
104 |
105 | ## Napoleon's March
106 |
107 | - This is considered one of the most beautiful and impressive pieces of data viz ever, about Napoleon's army during its march into and retreat from Russia
108 | - Distance from Paris, size of army, elevation, places where the army met itself...
109 | - It absolutely tells an intricate and detailed story - but what can you take away from it if you don't already know what it's trying to say?
110 | - What parts of the story does it guide you towards focusing on? Which get lost?
111 |
112 | ---
113 |
114 | ```{r}
115 | knitr::include_graphics('Lecture_02_Minard.gif')
116 | ```
117 |
118 | ## Clutter and Focus
119 |
120 | - These graphs vary in quality but share some common issues:
121 | - They add things to the graph that do not need to be there or are difficult to understand
122 | - They do not focus your attention on a single takeaway
123 |
124 | ## Clutter and Focus
125 |
126 | - Clutter adds cognitive load - it makes your viz more difficult to understand
127 | - There are many very low-level visual shortcuts in the human brain we can take advantage of to avoid this
128 | - And to make sure that we can anticipate how the reader will interpret the visualization
129 |
130 | ## Gestalt Principles of Visual Perception
131 |
132 | - Proximity
133 | - Similarity
134 | - Enclosure
135 | - Closure
136 | - Continuity
137 | - Connection
138 |
139 | ## Proximity
140 |
141 | - If you put things close to each other, they tend to get thought of as being associated
142 | - And, at the very least, they are easy to compare visually
143 | - Putting two similar things close together says "these things are very similar!"
144 | - Putting two distinct things close together says "compare these please!"
145 |
146 | ## Proximity
147 |
148 | | | Budget |
149 | |-----------|---------|
150 | | Sales | $10000 |
151 | | Marketing | $15000 |
152 | | | |
153 | | R&D | $5000 |
154 | | HR | $6000 |
155 |
156 | ## Similarity
157 |
158 | - Simply put, things that are made similar in some way (shape, color, shade, etc.) are part of the same whole
159 |
160 | Note on using color for this: Remember colorblindness!
161 |
162 | - Most common is red/green/(orange), blue/yellow also relatively common
163 | - When picking colors, avoid having your contrasts be red/green or blue/yellow. Red/blue is a popular choice here.
164 | - For complete accessibility, focus on highly contrasting *shades*
165 | - Colorblind-friendly palettes are available!
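
As a hedged sketch of that last point, the example below uses the viridis color scale that ships with ggplot2 and doubles up the grouping on shape, so the groups stay distinct even without color. The transmission labels are built from `mtcars$am`; the object names `cars_labeled` and `p` and the `end = 0.8` setting are illustrative choices.

```r
library(ggplot2)

# Label transmission type from mtcars$am (0 = automatic, 1 = manual)
cars_labeled <- transform(
  mtcars,
  Transmission = ifelse(am == 0, 'Automatic', 'Manual')
)

# Map transmission to BOTH color and shape, so the two groups stay
# distinguishable for a reader who cannot tell the colors apart
p <- ggplot(cars_labeled,
            aes(x = wt, y = mpg, color = Transmission, shape = Transmission)) +
  geom_point(size = 3) +
  scale_color_viridis_d(end = 0.8) +  # viridis scale built into ggplot2
  theme_minimal() +
  labs(x = 'Car Weight',
       y = 'Mileage (MPG)',
       color = 'Transmission Type',
       shape = 'Transmission Type')
p
```

Another option, if the `ggthemes` package is loaded, is `scale_color_colorblind()`. Either way, it is worth checking the result in grayscale to confirm the shades contrast.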
166 |
167 | ## Similarity
168 |
169 | ```{r}
170 | data(mtcars)
171 | mtcars <- mtcars %>%
172 | mutate(Transmission = case_when(
173 | am == 0 ~ 'Automatic',
174 | am == 1 ~ 'Manual'
175 | ))
176 | ggplot(mtcars, aes(x = wt, y = mpg, color = Transmission))+
177 | geom_point() +
178 | theme_minimal() +
179 | labs(x = 'Car Weight',
180 | y = 'Mileage (MPG)',
181 | color = 'Transmission Type')
182 | ```
183 |
184 | ## Enclosure
185 |
186 | - If you put a physical enclosure around some things, they will be perceived as being part of a group
187 | - This can be especially handy if you're already using some other forms of similarity, or need to indicate areas that WOULD be part of that group if you had data there.
188 | - Enclosures say "these things are in a group!" - perhaps you want to compare that group to other things, or focus attention *within* the group?
189 |
190 | ## Enclosure
191 |
192 | ```{r}
193 | knitr::include_graphics('Lecture_02_Unemployment.PNG')
194 | ```
195 |
196 | ## Continuity
197 |
198 | - When there are gaps, we tend to "fill in" in the most intuitive way
199 |
200 | ## Continuity
201 |
202 | - This year-on-year change graph has a gap at Feb. 29. But what does our brain do?
203 |
204 | ```{r, fig.height = 4}
205 | set.seed(1000)
206 | # Seven-day moving average
207 | ma <- function(x,n=7){as.numeric(stats::filter(as.ts(x),rep(1/n,n), sides=1))}
208 | tb1 <- tibble(dayofyr = 0:90,Day = ymd('2020-01-01'),Sales2020 = ma(runif(91,min=.5,max = .6))) %>%
209 | mutate(Day = Day + days(dayofyr)) %>%
210 | mutate(Mon = month(Day),
211 | DoM = day(Day))
212 | tb2 <- tibble(dayofyr = 0:89,Day = ymd('2019-01-01'),Sales2019 = ma(runif(90,min=.47,max=.57))) %>%
213 | mutate(Day = Day + days(dayofyr)) %>%
214 | mutate(Mon = month(Day),
215 | DoM = day(Day)) %>%
216 | select(-dayofyr, -Day)
217 | graphdat <- tb1 %>%
218 | left_join(tb2,by=c('Mon','DoM')) %>%
219 | mutate(YOY = (Sales2020/Sales2019) - 1)
220 |
221 | ggplot(graphdat, aes(x = Day, y = YOY)) +
222 | geom_line() +
223 | labs(y = "Year-on-Year Change",
224 | title = "Changes in Sales Year-to-Year") +
225 | theme_minimal() +
226 | scale_y_continuous(labels = scales::percent)
227 | ```
228 |
229 | ## Connection
230 |
231 | - If you put a literal connection between two points, people will interpret them as being connected!
232 | - Connecting two things says "one of these things is adjacent or linked to the other"
233 | - No big surprise
234 | - See any line graph for this.
235 | - This is also a good reason *not* to use a line graph when your x-axis doesn't have an order to it!
236 |
237 | ## How?
238 |
239 | - With these concepts in mind, how can we use them to emphasize information?
240 | - And improve clarity
241 | - Let's begin with a bad first-pass graph and improve it as we can
242 | - Let's tell a story about how Washington's 4th graders fare in math vs. other coastal states
243 |
244 | ## State NAEP Test Scores
245 |
246 | ```{r}
247 | naep <- read_csv('Lecture_02_NAEP.csv')
248 | ggplot(naep,
249 | aes(x = StateCode, y = Score)) +
250 | geom_col() +
251 | theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = -.1)) +
252 | labs(title = "Fourth Grade Math NAEP Score by State",
253 | x = "State",
254 | y = "4th Grade Math NAEP")
255 | ```
256 |
257 | ## Ordering Information
258 |
259 | - Information should be presented in an order that highlights the information of interest
260 | - Comparisons should be easy to make
261 | - Ask: what should be comparable?
262 | - We should be able to tell which test scores are higher and which are lower
263 | - We use *proximity* to make more-similar scores go together
264 | - **Think**: what bar ordering makes the takeaway as easy to see as possible?
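
One possible way to get bars in value order is base R's `reorder()`, which sorts the levels of a factor by another variable. This is only a sketch on stand-in data: average `mpg` by cylinder count from the built-in `mtcars` plays the role of a score by state, and the names `avg_mpg` and `p_ordered` are illustrative.

```r
library(ggplot2)

# Stand-in for "score by state": average mpg for each cylinder count
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
avg_mpg$cyl <- factor(avg_mpg$cyl)

# reorder() sorts the factor levels by mpg, lowest first,
# so ggplot draws the bars from lowest to highest
avg_mpg$cyl <- reorder(avg_mpg$cyl, avg_mpg$mpg)
levels(avg_mpg$cyl)

p_ordered <- ggplot(avg_mpg, aes(x = cyl, y = mpg)) +
  geom_col() +
  labs(x = 'Cylinders', y = 'Average MPG')
p_ordered
```

Sorting the rows of the data alone does not change the bar order; ggplot2 follows the factor levels, which is why the levels are what get reordered here.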
265 |
266 | ## Proximity: Example
267 |
268 | ```{r}
269 | ggplot(naep %>%
270 | arrange(Score) %>%
271 | mutate(StateCode = factor(StateCode, levels = .$StateCode)),
272 | aes(x = StateCode, y = Score)) +
273 | geom_col() +
274 | theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = -.1)) +
275 | labs(title = "Fourth Grade Math NAEP Score by State",
276 | x = "State",
277 | y = "4th Grade Math NAEP")
278 | ```
279 |
280 | ## Using similarity
281 |
282 | - We may be interested in comparing Washington vs. other coastal states and saying something about the difference.
283 | - We can do this easily by giving all the coastal states a similar *something*
284 | - With bar graphs, an obvious pick is color
285 |
286 | ## Similarity: Example
287 |
288 | ```{r}
289 | ggplot(naep %>%
290 | arrange(Score) %>%
291 | mutate(StateCode = factor(StateCode, levels = .$StateCode),
292 | Coastal = factor(Coastal, labels = c('Not Coastal','Coastal'))),
293 | aes(x = StateCode, y = Score, fill = Coastal)) +
294 | geom_col() +
295 | theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = -.1)) +
296 | labs(title = "Fourth Grade Math NAEP Score by State",
297 | x = "State",
298 | y = "4th Grade Math NAEP")
299 | ```
300 |
301 | ## Clarity
302 |
303 | - We have our comparison clarified
304 | - Let's see what else we can do with this graph
305 |
306 | ## Slices of Data
307 |
308 | - What data is important to our story, and what data is not?
309 | - We are interested in comparing Washington to other coastal states, so why do we need all these other states?
310 | - Having stuff on the graph says "this is worth your attention somehow." If it's not important to the story, it will confuse or distract from the story
311 | - Chuck 'em!
312 | - This will also give us room to rotate those axis labels
313 |
314 |
315 | ## Clarity
316 |
317 | ```{r}
318 | ggplot(naep %>%
319 | arrange(Score) %>%
320 | mutate(StateCode = factor(StateCode, levels = .$StateCode)) %>%
321 | filter(Coastal == 1),
322 | aes(x = StateCode, y = Score)) +
323 | geom_col() +
324 | labs(title = "Fourth Grade Math NAEP Score by State",
325 | x = "State",
326 | y = "4th Grade Math NAEP")
327 | ```
328 |
329 | ## Clarity
330 |
331 | - And dare we?
332 | - It's a debate as to whether it ever makes sense for your $y$-axis not to start at zero. But here it's really making things hard to see. Clarity could be improved by starting the axis at, say, 100!
333 | - We will leave this as-is, but it is something to think about improving in the future
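
If you do decide to start a bar chart's axis above zero, one way to sketch it is with `coord_cartesian()`, which zooms the view without dropping any data. The example uses average `mpg` by cylinder count from the built-in `mtcars` as stand-in data; the window of 10 to 30 and the names `avg_mpg` and `p_zoom` are illustrative assumptions.

```r
library(ggplot2)

# Stand-in data: average mpg for each cylinder count
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
avg_mpg$cyl <- factor(avg_mpg$cyl)

# coord_cartesian() only changes the visible window, so every bar is
# still drawn; it is just cut off at the bottom of the window
p_zoom <- ggplot(avg_mpg, aes(x = cyl, y = mpg)) +
  geom_col() +
  coord_cartesian(ylim = c(10, 30)) +
  labs(x = 'Cylinders',
       y = 'Average MPG (axis starts at 10)')
p_zoom
```

Setting the same range with `scale_y_continuous(limits = ...)` behaves differently: bars that begin at zero fall outside the limits and are removed. Since a truncated axis exaggerates differences in bar height, it helps to say so in the axis label.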
334 |
335 |
336 | ## Contrast
337 |
338 | - We can use contrast to make Washington stand out again
339 | - Use a light color for others so they fade more into the back
340 |
341 |
342 | ## Contrast
343 |
344 | ```{r}
345 | ggplot(naep %>%
346 | arrange(Score) %>%
347 | mutate(StateCode = factor(StateCode, levels = .$StateCode)) %>%
348 | filter(Coastal == 1),
349 | aes(x = StateCode, y = Score)) +
350 | geom_col() +
351 | labs(title = "Fourth Grade Math NAEP Score by State",
352 | x = "State",
353 | y = "4th Grade Math NAEP") +
354 | gghighlight(StateCode == 'WA')
355 | ```
356 |
357 | ## Simplify
358 |
359 | - Enclosure allows us to get rid of a lot of the borders
360 | - And we can remove the backing ink too
361 | - **Don't underestimate the benefits of removing background ink.** It can really be distracting! Cleaner graphs look cleaner and are often nicer.
362 |
363 | ## Simplify
364 |
365 | ```{r}
366 | ggplot(naep %>%
367 | arrange(Score) %>%
368 | mutate(StateCode = factor(StateCode, levels = .$StateCode)) %>%
369 | filter(Coastal == 1),
370 | aes(x = StateCode, y = Score)) +
371 | geom_col() +
372 | labs(title = "Fourth Grade Math NAEP Score by State",
373 | x = "State",
374 | y = "4th Grade Math NAEP") +
375 | gghighlight(StateCode == 'WA') +
376 | theme_tufte()
377 | ```
378 |
379 | ## Easily Following Information
380 |
381 | - **Think**: have we made the presentation as easy to see as possible? How could we make it more clear?
382 | - Especially with long text labels, horizontal bar charts are much easier to read and compare
383 |
384 | ## Easily Following Information
385 |
386 | ```{r}
387 | ggplot(naep %>%
388 | arrange(Score) %>%
389 | mutate(StateCode = factor(StateCode, levels = .$StateCode),
390 | State = factor(State, levels = .$State)) %>%
391 | filter(Coastal == 1),
392 | aes(x = StateCode, y = Score)) +
393 | geom_col() +
394 | labs(title = "Fourth Grade Math NAEP Score by State",
395 | x = "State",
396 | y = "4th Grade Math NAEP") +
397 | gghighlight(StateCode == 'WA') +
398 | theme_tufte() +
399 | coord_flip()
400 | ```
401 |
402 | ## Label Data Directly
403 |
404 | - **Think**: have we provided *affordances*? Have we put the right answer in the place where you'd think to look for it?
405 | - Why make the reader work? Put the label right where it's needed
406 | - Also, remove the cognitive steps of translating the markers
407 |
408 | ## Label Data Directly
409 |
410 | ```{r}
411 | ggplot(naep %>%
412 | arrange(Score) %>%
413 | mutate(StateCode = factor(StateCode, levels = .$StateCode),
414 | State = factor(State, levels = .$State)) %>%
415 | filter(Coastal == 1),
416 | aes(x = StateCode, y = Score)) +
417 | geom_col(fill = 'blue') +
418 | geom_text(aes(label = State, x = StateCode, y = Score + 3), hjust = 0) +
419 | scale_y_continuous(limits = c(0,300)) +
420 | labs(title = "Fourth Grade Math NAEP Score by State",
421 | x = "State",
422 | y = "4th Grade Math NAEP") +
423 | gghighlight(StateCode == 'WA') +
424 | theme_tufte() +
425 | coord_flip() +
426 | theme(axis.text.y = element_blank(),
427 | axis.ticks.y = element_blank())
428 | ```
429 |
430 | ## Now - Your Turn!
431 |
432 | - You'll be creating a graph by hand
433 | - Think carefully about how to make data comparable
434 | - And how to contrast as needed
435 | - And how to tell the story
436 |
437 | ## Your Turn
438 |
439 | - Story: Sales and Marketing may be more expensive, but they haven't grown as much as R&D and HR.
440 |
441 |
442 | ```{r}
443 | df <- read.csv(text="Dept: Sales Marketing R&D HR
444 | 2018 10000 15000 5000 6000
445 | 2019 10500 15000 7200 6500
446 | 2020 10600 16000 9300 8000",sep='\t')
447 |
448 | df %>%
449 | knitr::kable(caption = 'Costs in thousands of dollars by department by year',
450 | format = 'html') %>%
451 | kableExtra::kable_styling(bootstrap_options = 'striped')
452 | ```
453 |
454 |
--------------------------------------------------------------------------------
/Lecture_02_Coffee.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Coffee.jpg
--------------------------------------------------------------------------------
/Lecture_02_DuBois.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_DuBois.PNG
--------------------------------------------------------------------------------
/Lecture_02_Healthcare.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Healthcare.PNG
--------------------------------------------------------------------------------
/Lecture_02_Minard.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Minard.gif
--------------------------------------------------------------------------------
/Lecture_02_NAEP.csv:
--------------------------------------------------------------------------------
1 | State,StateCode,Score,Coastal
2 | Minnesota ,MN,248,0
3 | Massachusetts ,MA,247,1
4 | Virginia ,VA,247,1
5 | Florida ,FL,246,1
6 | New Jersey ,NJ,246,1
7 | Wyoming ,WY,246,0
8 | Indiana ,IN,245,0
9 | New Hampshire ,NH,245,0
10 | Pennsylvania ,PA,244,0
11 | Utah ,UT,244,0
12 | Nebraska ,NE,244,0
13 | Texas ,TX,244,0
14 | Connecticut ,CT,243,1
15 | North Dakota ,ND,243,0
16 | Idaho ,ID,242,0
17 | Colorado ,CO,242,0
18 | Wisconsin ,WI,242,0
19 | North Carolina ,NC,241,1
20 | South Dakota ,SD,241,0
21 | Ohio ,OH,241,0
22 | Montana ,MT,241,0
23 | Maine ,ME,241,1
24 | Mississippi ,MS,241,0
25 | Iowa ,IA,241,0
26 | Tennessee ,TN,240,0
27 | Washington ,WA,240,1
28 | Kansas ,KS,239,0
29 | Rhode Island ,RI,239,1
30 | Delaware ,DE,239,1
31 | Kentucky ,KY,239,0
32 | Vermont ,VT,239,0
33 | Hawaii ,HI,239,1
34 | Maryland ,MD,239,1
35 | Missouri ,MO,238,0
36 | Georgia ,GA,238,1
37 | Arizona ,AZ,238,0
38 | Illinois ,IL,237,0
39 | Oklahoma ,OK,237,0
40 | South Carolina ,SC,237,1
41 | New York ,NY,237,1
42 | Oregon ,OR,236,1
43 | Michigan ,MI,236,0
44 | Nevada ,NV,236,0
45 | California ,CA,235,1
46 | District of Columbia ,DC,235,1
47 | Arkansas ,AR,233,0
48 | Alaska ,AK,232,1
49 | West Virginia ,WV,231,0
50 | Louisiana ,LA,231,0
51 | New Mexico ,NM,231,0
52 | Alabama ,AL,230,0
53 | Puerto Rico ,PR,185,1
54 |
--------------------------------------------------------------------------------
/Lecture_02_NAEP.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_NAEP.xlsx
--------------------------------------------------------------------------------
/Lecture_02_Network_Graph.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Network_Graph.jpg
--------------------------------------------------------------------------------
/Lecture_02_Unemployment.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Unemployment.PNG
--------------------------------------------------------------------------------
/Lecture_03_Audience.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 3: Focus and Audience"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | {
19 | library(tidyverse)
20 | library(lubridate)
21 | library(gghighlight)
22 | library(ggthemes)
23 | library(directlabels)
24 | library(RColorBrewer)
25 | }
26 | ```
27 |
28 | ## Who is it For?
29 |
30 | ```{r, results = 'asis'}
31 | cat("
32 | ")
38 | ```
39 |
40 | - It's impossible to do good communication without thinking about who you're communicating *to*
41 | - Something you make for a highly technical audience should be different from something you make for a non-technical boss, or an outsider, or the general population
42 | - And, crucially, all of it should be different from what you'd make *for yourself*
43 | - **Communication is not about what you say, it's about what people hear**. Different people hear differently.
44 |
45 | ## Who is it For?
46 |
47 | Always ask:
48 |
49 | - Who is going to see this?
50 | - *What do they know?*
51 | - What don't they know?
52 | - What do they *need to know?*
53 |
54 | ## What do they Know?
55 |
56 | - Keep in mind the technical skill of your audience - have they seen this graph type before? Do they know what a median is? Can my boss read a graph (yes really)?
57 | - Keep in mind the *contextual* knowledge of your audience - are you using lingo they don't know? Abbreviations?
58 | - Keep in mind *where their head is at* - you've been staring at the data for hours so you know "AR" means "annual revenue." Is that what they'll think if they see "AR"? You've done a million projects before so you know that $1 million is way above-average for March. Will they recognize that?
59 |
60 | ## Presentation
61 |
62 | - Your audience does not know what you know
63 | - In many cases they may not care as much as you care
64 | - And even if they know and care, they may not understand the way you think about it
65 | - Don't be afraid to hold their hand
66 | - Even if they *are* very familiar with what you're doing, they'll appreciate clarity. It will let them know that *you* know what you're doing
67 | - **Don't assume people will read your work super super closely. Make your work such that the right conclusion to draw is also the most obvious one**
68 |
69 | ## Tips
70 |
71 | - Have someone else look at your work and tell you what they think it means (maybe trade with them in this class?)
72 | - After finishing something, give yourself a little break from it and come back. Ask: "if I had no idea about what this was, would I figure it out? What conclusion would I draw? Would it be the right one?"
73 | - And "What could I change to make the proper conclusion even more obvious?" How can you **focus** attention?
74 | - It can rarely be too obvious! Rarely are there points for subtlety here.
75 | - As with *anything* (not just thinking about audience) always **look at your work and ask if it looks right**
76 |
77 |
78 | ## Example
79 |
80 | - What are some communication errors here?
81 |
82 | ```{r, echo = FALSE, eval = TRUE}
83 | ggplot(readRDS('king_dailyvisits.Rdata'), aes(x = reorder(naics_title, visits_by_day), y = visits_by_day)) +
84 | geom_col(color = 'lightblue') +
85 | coord_flip() +
86 | ggpubr::theme_pubr() +
87 | labs(y = 'Visits',
88 | x = NULL,
89 | title = 'Top Five Industries')
90 | ```
91 |
92 | ## Mistakes
93 |
94 | - What are "visits"? (Logged mobile-phone visits to locations in these industries, but how would you know?)
95 | - Top five industries... at what? Where? (Seattle in June 2020, but how would you know?)
96 | - Scientific notation (Not everyone can read this!)
97 | - What are those industries? Nature parks we can guess, maybe the restaurant types. But what are Lessors of Nonresidential Buildings (except Miniwarehouses) and Snack and Nonalcoholic Beverage Bars? (Malls and Cafes)
98 | - Why are restaurants two industries?
99 |
100 |
101 | ## The Three Rules of Presentation
102 |
103 | A good way to focus attention on the story you want them to understand is to prepare them to receive information
104 |
105 | - Tell them what you're going to tell them
106 | - Tell them
107 | - Tell them what you told them
108 |
109 |
110 | ## Presentation
111 |
112 | - When transmitting information, always let the reader know where to go. **Give them a bucket to put the information in before you give them the information**.
113 | - Really well-done research or argumentative writing makes the reader think of the next point you're going to make before you even make it
114 | - Similarly, well-done data communication doesn't assume knowledge on the reader's part and is willing to direct their attention.
115 |
116 | ## A Demonstration
117 |
118 | - Watch for how [this neuroscientist](https://www.youtube.com/watch?v=opqIa5Jiwuw) explains something to different audiences
119 | - How does he adjust both content and delivery style to who he's talking to?
120 | - How does he give them a bucket for the information he's about to provide?
121 | - How does he get them engaged with what he's talking about?
122 |
123 | ## Guiding Our Audience
124 |
125 | Beyond considering our audience's capabilities, we also want to think about how to guide their attention
126 |
127 | - How can we nudge them as to where to look, what to compare, what continuities to see?
128 | - How can we prepare them to receive the story we want them to understand?
129 |
130 | ## Focusing Attention
131 |
132 | - We want to tell a story with our data
133 | - Some stories are simple
134 | - Others are complex
135 | - They all need to be clear
136 | - And we may need to walk the reader along
137 |
138 | ## Foreground and Background
139 |
140 | - We are already trying to focus our reader by making the viz in the first place
141 | - After all, we could avoid focusing them entirely by just showing the full data set, but we don't do that
142 | - Can we focus them further to make the story even clearer?
143 |
144 | ## Foreground and Background
145 |
146 | - Think about what you want to highlight - these are the parts that will draw the eye and get the most attention
147 | - You don't have to limit yourself to *only* including the information that drives the story home (although do get rid of as much clutter as possible)
148 | - But the important parts, the ones that produce the right conclusion, should be *as visible as possible*
149 | - In writing, you may have learned about inverted-pyramid structure
150 | - In visualization there are other tricks we can use
151 |
152 | ## My Mom's Car
153 |
154 | - What does the driver want to know? And what is highlighted?
155 |
156 | 
157 |
158 | ## My Mom's Car
159 |
160 | - The driver probably wants to know how fast they are going
161 | - And how full their tank is
162 | - Instead the focus is heavily on the RPM of the engine, and the other things are tucked back, faded out by glare, and not where you immediately look
163 | - (more confusing is that the RPM is a dial that looks like a typical speedometer, but isn't)
164 |
165 |
166 | ## Preattentive Attributes
167 |
168 |
169 | - How can we direct the eye where we need it to go?
170 | - Thankfully, (most of) you have been looking with your eyes your whole life!
171 | - So this should be pretty intuitive
172 |
173 | ---
174 |
175 | ```{r}
176 | knitr::include_graphics('Lecture_03_Preattentive.PNG')
177 | ```
178 |
179 | ## Preattentive Attributes in Images
180 |
181 | - Our brain knows what to look out for
182 | - This should be pretty intuitive
183 | - The purpose here is to emphasize one important thing!
184 | - Let's go through some commonly usable ones:
185 |
186 | ## Color
187 |
188 | ```{r}
189 | data(gapminder, package='gapminder')
190 |
191 | ggplot(gapminder %>%
192 | filter(year == 2007),
193 | aes(x = gdpPercap, y = lifeExp, color = continent == 'Asia')) +
194 | geom_point() +
195 | scale_x_log10() +
196 | labs(x = 'GDP Per Capita (log scale)',
197 | y = 'Life Expectancy',
198 | title = 'Asia\'s Low-Income Countries have High Life Expectancy') +
199 | theme_minimal() +
200 | guides(color = 'none') +
201 | scale_color_manual(values = c('#33ccFF','red')) +
202 | annotate(geom='text',x=20000,y=65,color='red',label='Asia',size=8)
203 | ```
204 |
205 | ## Intensity
206 |
207 | ```{r}
208 | ggplot(gapminder %>%
209 | filter(year == 2007),
210 | aes(x = gdpPercap, y = lifeExp, color = continent == 'Asia', alpha = continent == 'Asia')) +
211 | geom_point() +
212 | scale_x_log10() +
213 | labs(x = 'GDP Per Capita (log scale)',
214 | y = 'Life Expectancy',
215 | title = 'Asia\'s Low-Income Countries have High Life Expectancy') +
216 | theme_minimal() +
217 | guides(color = 'none', alpha = 'none') +
218 | scale_color_manual(values = c('#33ccFF','red')) +
219 | scale_alpha_manual(values = c(.4,1)) +
220 | annotate(geom='text',x=20000,y=65,color='red',label='Asia',size=8)
221 | ```
222 |
223 | ## Size
224 |
225 |
226 | ```{r}
227 | ggplot(gapminder %>%
228 | filter(year == 2007),
229 | aes(x = gdpPercap, y = lifeExp,
230 | color = continent == 'Asia',
231 | alpha = continent == 'Asia',
232 | size = continent == 'Asia')) +
233 | geom_point() +
234 | scale_x_log10() +
235 | labs(x = 'GDP Per Capita (log scale)',
236 | y = 'Life Expectancy',
237 | title = 'Asia\'s Low-Income Countries have High Life Expectancy') +
238 | theme_minimal() +
239 | guides(color = 'none', alpha = 'none', size = 'none') +
240 | scale_color_manual(values = c('#33ccFF','red')) +
241 | scale_alpha_manual(values = c(.4,1)) +
242 | scale_size_manual(values = c(1,2)) +
243 | annotate(geom='text',x=20000,y=65,color='red',label='Asia',size=8)
244 | ```
245 |
246 | ## Width
247 |
248 | ```{r}
249 | gapyrs<- gapminder %>%
250 | group_by(continent, year) %>%
251 | summarize(lifeExp = mean(lifeExp))
252 |
253 | ggplot(gapyrs, aes(x = year, y = lifeExp, group = continent, size = continent == 'Asia')) +
254 | geom_line() +
255 | geom_dl(aes(x = year + .5,label = continent), method = 'last.bumpup') +
256 | scale_x_continuous(limits = c(1950,2015)) +
257 | labs(x = 'Year',
258 | y = 'Average Life Expectancy',
259 | title = 'Overall Life Expectancy in Asia is Low') +
260 | scale_size_manual(values = c(.5, 1.5)) +
261 | guides(size = 'none') +
262 | theme_tufte()
263 | ```
264 |
265 | ## Enclosure
266 |
267 | ```{r}
268 | ggplot(gapminder %>%
269 | filter(year == 2007),
270 | aes(x = gdpPercap, y = lifeExp)) +
271 | geom_point() +
272 | scale_x_log10() +
273 | labs(x = 'GDP Per Capita (log scale)',
274 | y = 'Life Expectancy',
275 | title = 'There is a Small Cluster of Rich and Long-Lived Countries') +
276 | theme_minimal() +
277 | annotate('rect', xmin = 16000, xmax = 50000, ymin = 68, ymax= 85, alpha = .2)+
278 | annotate('text', x = 16000, y = 62, hjust = 0,
279 | label = 'Even here,\nthere is a\npositive\nrelationship.')
280 | ```
281 |
282 | ## Keep in Mind
283 |
284 | - With all of these attributes, some are more important than others!
285 | - And more intensity generally means more attention
286 | - Rather than providing a direct order, remember that it's subjective - try things out, see what is intuitive
287 |
288 | ## An Example
289 |
290 | ```{r}
291 | data(SPrail, package='pmdplyr')
292 |
293 | SPrail_dow <- SPrail %>%
294 | filter(!is.na(fare)) %>%
295 | mutate(fare = factor(fare)) %>%
296 | mutate(dow = factor(weekdays(start_date), levels = c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))) %>%
297 | group_by(fare, dow) %>%
298 | summarize(price = mean(price, na.rm = TRUE))
299 |
300 | ggplot(SPrail_dow, aes(x = fare, y = price, fill = dow)) +
301 | geom_bar(position = 'dodge', stat = 'identity') +
302 | labs(fill = 'Day of Week',
303 | x = 'Fare Type',
304 | y = 'Average Price',
305 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') +
306 | theme_minimal()
307 |
308 | ```
309 |
310 | ## Use Color Sparingly
311 |
312 | ```{r}
313 | SPrail_dow <- SPrail %>%
314 | filter(!is.na(fare)) %>%
315 | mutate(fare = factor(fare)) %>%
316 | mutate(dow = factor(weekdays(start_date), levels = c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))) %>%
317 | mutate(weekend = dow %in% c('Saturday','Sunday')) %>%
318 | group_by(fare, dow, weekend) %>%
319 | summarize(price = mean(price, na.rm = TRUE)) %>%
320 | ungroup()
321 |
322 | ggplot(SPrail_dow, aes(x = fare, y = price, group = dow, fill = weekend)) +
323 | geom_bar(position = 'dodge', stat = 'identity') +
324 | labs(fill = 'Weekend',
325 | x = 'Fare Type',
326 | y = 'Average Price',
327 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') +
328 | theme_minimal()
329 |
330 | ```
331 |
332 | ## Choose Colors for Focus
333 |
334 | ```{r}
335 |
336 | ggplot(SPrail_dow, aes(x = fare, y = price, group = dow, fill = weekend)) +
337 | geom_bar(position = 'dodge', stat = 'identity') +
338 | labs(fill = 'Weekend',
339 | x = 'Fare Type',
340 | y = 'Average Price',
341 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') +
342 | theme_minimal() +
343 | scale_fill_manual(values = c('#33ccFF','red'))
344 |
345 | ```
346 |
347 |
348 | ## Also Make Them Look Good
349 |
350 | ```{r}
351 | ggplot(SPrail_dow, aes(x = fare, y = price, group = dow, fill = weekend)) +
352 | geom_bar(position = 'dodge', stat = 'identity') +
353 | labs(fill = 'Weekend',
354 | x = 'Fare Type',
355 | y = 'Average Price',
356 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') +
357 | theme_minimal() +
358 | scale_fill_manual(values = c('gray','#33ccFF'))
359 | ```
360 |
361 | ## Choose Intensity for Focus
362 |
363 | ```{r}
364 | ggplot(SPrail_dow, aes(x = fare,
365 | y = price,
366 | group = dow,
367 | fill = weekend,
368 | alpha = fare == 'Flexible')) +
369 | geom_bar(position = 'dodge', stat = 'identity') +
370 | labs(fill = 'Weekend',
371 | x = 'Fare Type',
372 | y = 'Average Price',
373 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') +
374 | theme_tufte() +
375 | scale_fill_manual(values = c('gray','#33ccFF')) +
376 | scale_alpha_manual(values = c(.4,1)) +
377 | guides(alpha = 'none')
378 | ```
379 |
380 | ## Proximity for Focus and Comparison
381 |
382 | ```{r}
383 | SPrail_dow <- SPrail %>%
384 | filter(!is.na(fare)) %>%
385 | mutate(fare = factor(fare, levels = c('Adulto ida','Promo','Promo +','Flexible'))) %>%
386 | mutate(dow = factor(weekdays(start_date), levels = c('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'))) %>%
387 | mutate(weekend = dow %in% c('Saturday','Sunday')) %>%
388 | group_by(fare, dow, weekend) %>%
389 | summarize(price = mean(price, na.rm = TRUE)) %>%
390 | ungroup()
391 |
392 | ggplot(SPrail_dow, aes(x = fare,
393 | y = price,
394 | group = dow,
395 | fill = weekend,
396 | alpha = fare == 'Flexible')) +
397 | geom_bar(position = 'dodge', stat = 'identity') +
398 | labs(fill = 'Weekend',
399 | x = 'Fare Type',
400 | y = 'Average Price',
401 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') +
402 | theme_tufte() +
403 | scale_fill_manual(values = c('gray','#33ccFF')) +
404 | scale_alpha_manual(values = c(.4,1)) +
405 | guides(alpha = 'none')
406 | ```
407 |
408 | ## Declutter
409 |
410 | - Labels might not even be necessary!
411 |
412 | ```{r}
413 | ggplot(SPrail_dow, aes(x = fare,
414 | y = price,
415 | group = dow,
416 | fill = weekend,
417 | alpha = fare == 'Flexible')) +
418 | geom_bar(position = 'dodge', stat = 'identity') +
419 | labs(fill = 'Weekend',
420 | x = 'Fare Type',
421 | y = 'Average Price',
422 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') +
423 | theme_tufte() +
424 | scale_fill_manual(values = c('gray','#33ccFF')) +
425 | scale_alpha_manual(values = c(.4,1)) +
426 | guides(alpha = 'none', fill = 'none') +
427 | annotate('text',x = .87, y = 36, label = 'Weekday',color = 'darkgray', alpha = 1) +
428 | annotate('text',x = 1.34, y = 36, label = 'Weekend',color = '#33ccFF', alpha = 1)
429 | ```
430 |
431 |
432 | ## Could we Go Further?
433 |
434 | - Declutter with days-of-week $\rightarrow$ just one point for weekday, one for weekend
435 | - Do we need the other ticket types?
436 | - Slope graph?
437 | - Move the title in closer?
438 |
439 | ## Designing for Focus
440 |
441 | - Highlight what's important
442 | - Remove distractions
443 | - Unimportant stuff CAN be backgrounded
444 |
445 | ## Annotation
446 |
447 | - Annotation can be a great way of pointing out features of the data
448 | - If you can tell the reader what you mean rather than making them infer it, that's probably good
449 | - (although if they'd always infer it without you having to tell them, that's even better)
450 | - Remember - *walk your audience through the data*, and don't be afraid to interpret for them!
451 |
452 | ## Annotation
453 |
454 | - Tell them what you're going to tell them: Title the graph with the story or insight
455 | - Tell them: Have a graph that demonstrates that story or insight
456 | - Tell them what you told them: put a label or arrow or other attention-drawing element that makes clear how the graph supports that title
457 |
458 | ## Explanation with Audience and Focus
459 |
460 | - Pick something that you know a lot about, and pick something interesting about that thing.
461 | - Prepare a ~1 minute explanation of that interesting thing
462 | - Take turns giving that explanation to your partner
463 | - Then, after both explanations have been given, have them explain back to you the thing you explained to them. How accurate is it?
--------------------------------------------------------------------------------
/Lecture_03_Cardash.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_03_Cardash.jpg
--------------------------------------------------------------------------------
/Lecture_03_Preattentive.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_03_Preattentive.PNG
--------------------------------------------------------------------------------
/Lecture_04_Common_R_Problems.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Common R Mistakes in Data Viz"
3 | author: "Nick HK"
4 | date: "`r Sys.Date()`"
5 | output:
6 | rmdformats::readthedown:
7 | toc: 2
8 | ---
9 |
10 | ```{r setup, include=FALSE}
11 | knitr::opts_chunk$set(echo = TRUE)
12 | ```
13 |
14 | # Common R Problems
15 |
16 | This document catalogues some of the requests for help I get, and mistakes I see in looking at student work, particularly in my Data Communications class. I also, of course, cover how to fix these things.
17 |
18 | I've divided this into two sections: (1) things breaking, and (2) mistakes in coding.
19 |
20 | # Things Breaking
21 |
22 | These are common errors you might get, and common things that might cause errors.
23 |
24 | ## General troubleshooting tips
25 |
26 | 1. Run your code one line at a time. Between each line, check the output and make sure it looks like you expect it to look. Often, errors occur because an earlier line of code didn't do what we intended it to do. If you don't spot any problems there, keep going until you hit the error message.
27 | 1. Read the error message and see if you can make sense of what it says
28 | 1. Often, even if you don't understand the message, it will give you a hint as to what is causing the problem, and so you'll know where to look to double-check your code
29 | 1. Google
30 |
31 | If you still need help, I am your professor and I am here to help you! Here are some tips for asking me for coding help:
32 |
33 | 1. Let me know what the problem is ("it's not working" isn't too helpful)
34 | 1. What is the error message or warning, if there is one, and what lines of code are generating it?
35 | 1. What are you trying to accomplish with the code? What *should* it do?
36 | 1. Send along your code as well. The original file is a good idea, although then I won't be able to help on my phone and you'll have to wait until I get to a computer. Copy/pasting code and error message into an email works well. A screenshot (Google "how to take a screenshot" and either Mac or Windows, as appropriate) also works well. Using your phone to take a picture of your computer screen is the least-good way to share your problem.
37 |
38 | ## There is no package called
39 |
40 | If you get an error like `there is no package called 'packagename'`, that means that you've tried to load or use a package - perhaps using `library()` or something like that - but you don't have that package installed.
41 |
42 | The Fix:
43 |
44 | 1. Did you spell the package name correctly? Maybe you do have the package, but you made a typo
45 | 1. If you spelled it correctly, you may not have installed it. Install it (usually) with `install.packages('packagename')` where `'packagename'` is replaced with the name of the package you actually want.
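
As a quick check before installing anything, the sketch below asks R whether a package is available. The helper name `is_installed` is made up for this example, and `gghighlight` simply stands in for whatever package you need. Run the install line in the Console, not inside a document.

```{r, eval = FALSE}
# is_installed() is a hypothetical helper, not part of any package
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

is_installed('stats')        # TRUE: stats ships with R
is_installed('gghighlght')   # FALSE: the name is misspelled, so R can't find it

# Only if the correctly spelled name also comes back FALSE, install it once:
# install.packages('gghighlight')
```

If the correctly spelled name returns `TRUE` and `library()` still fails, the typo is probably in your `library()` call.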
46 |
47 | ## Could not find function
48 |
49 | If you get an error like `could not find function "functionname"`, that means it's trying to look for that function, but cannot find it.
50 |
51 | The Fix:
52 |
53 | 1. Did you spell the function name correctly? Maybe you made a typo.
54 | 1. If the function is in a package, did you load the package? Remember, the package must be re-loaded every time you restart R. If you're working in Quarto, you must load the package inside of your Quarto script, since Quarto starts from a blank slate.
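
A small illustration of the second fix, assuming the **stringr** package is installed (it comes with the tidyverse); any package and function would behave the same way:

```{r, eval = FALSE}
# In a fresh R session, before the package is loaded, this line would fail:
# str_to_upper('hello')    # Error: could not find function "str_to_upper"

# Option 1: load the package, then call the function
library(stringr)
str_to_upper('hello')      # "HELLO"

# Option 2: name the package explicitly, no library() needed
stringr::str_to_upper('hello')
```

The `package::function()` form is handy when you only need one function, or when you want to be sure which package a function is coming from.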
55 |
56 | ## Object not found
57 |
58 | If you get an error like `object 'objectname' not found`, that means that you've made a reference to an object called `objectname`, but there is no such object to refer to.
59 |
60 | The Fix:
61 |
62 | 1. Did you spell the object's name correctly? Maybe you made a typo.
63 | 1. Did you remember to create the object before referring to it? For example, the code `mean(x); x <- 1:10` would not work, because you try to take the mean of `x` before it's been created. If you're working in Quarto, remember that it will start from a blank slate, so you must create all objects inside of your Quarto script.
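
The ordering problem described above looks like this in practice. The fix is just to create the object first:

```{r, eval = FALSE}
# In a fresh session this fails, because x does not exist yet:
# mean(x)      # Error: object 'x' not found

x <- 1:10      # create the object first...
mean(x)        # ...then use it: 5.5

exists('x')    # TRUE once the object has been created
```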
64 |
65 | ## Column doesn't exist
66 |
67 | If you get an error like `column 'columnname' doesn't exist`, that means it's trying to refer to a column name that is not actually in your data.
68 |
69 | The Fix:
70 |
71 | 1. Did you spell the column's name correctly? Maybe you made a typo.
72 | 1. Are you referring to the proper dataset? Maybe you created the variable in `my_data2` but you're trying to access it in `my_data`.
73 | 1. Does your column name have spaces or other strange characters in it like dashes? If so, in many applications (anywhere you're not treating the column name as a string) you have to surround the column name in backticks (`r "\u0060"`) so it knows where the column name starts and ends.
74 | 1. Did you perhaps drop the column before trying to refer to it? For example, in the below code, the `group_by() %>% summarize()` will only keep columns named in the `group_by()` or in the `summarize()`. So even though the `mpg` column was there at the start, it no longer is by the time we refer to it.
75 |
76 | ```{r, eval = FALSE}
77 | mtcars %>%
78 | group_by(am) %>%
79 | summarize(mean_hp = mean(hp)) %>%
80 | mutate(hp_by_mpg = mean_hp/mpg)
81 | ```
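
One possible repair for the chunk above, sketched on the assumption that the goal is average horsepower relative to average mpg within each group: compute the `mpg` summary inside `summarize()` so it survives. The names `mean_mpg`, `hp_summary`, and `odd_names` are made up for this example. The last lines illustrate the backtick point from item 3.

```{r, eval = FALSE}
library(dplyr)

names(mtcars)   # check spelling against the actual column names

hp_summary <- mtcars %>%
  group_by(am) %>%
  # keep everything you need later by creating it inside summarize()
  summarize(mean_hp = mean(hp), mean_mpg = mean(mpg)) %>%
  mutate(hp_by_mpg = mean_hp/mean_mpg)

# A column name with spaces needs backticks when it isn't a string
odd_names <- data.frame('miles per gallon' = mtcars$mpg, check.names = FALSE)
odd_names %>%
  summarize(avg_mpg = mean(`miles per gallon`))
```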
82 |
83 | ## Warning: Package was built under R version
84 |
85 | If you load a package and get a warning that looks like `Warning: package 'packagename' was built under R version...`, that just means that the package was built using a slightly newer version of R than yours. It's generally not a problem as long as you have a somewhat-recent version of R.
86 |
87 | The Fix:
88 |
89 | 1. Ignore it
90 | 1. If you want it to go away, update your R installation.
91 |
92 | ## Rtools is required to build packages from source
93 |
94 | If you are installing packages, you may be asked whether you'd like to build packages from source. If you say "Yes", but you do not have Rtools installed, you will get an error, and the package won't install properly.
95 |
96 | The Fix:
97 |
98 | 1. When asked whether you want to install packages from source, say "No."
99 | 1. Install Rtools before installing packages. The easiest way to do this is to install the **installr** package, and then run `installr::install.rtools()`.
100 |
101 | ## File does not exist, or No such file or directory
102 |
103 | If you get an error like `'filename.csv' does not exist`, that means that you're trying to load a file, but the file is not in the location you're telling R to look.
104 |
105 | The Fix:
106 |
107 | 1. Did you spell the filename correctly? Maybe you made a typo. Or, perhaps, you're trying to open an Excel file (`.xlsx`) but telling R to look for a CSV (`.csv`). Or you wrote "file.csv" but the file is "file (1).csv".
108 | 1. Is the file perhaps open in Excel? Sometimes, Microsoft Office will hold files hostage and not let them be opened in other programs while they're open in Office. Close the Excel window that has the file open.
109 | 1. **Most often**, this is a working directory issue. If you tell R to look for `'filename.csv'`, it will look for that file in the *working directory*. Or, if you tell it to look for `'data/filename.csv'`, it will look for a folder called "data" in the working directory, and for filename.csv inside of that. In recent versions of RStudio, the current working directory is listed right on top of the Console. Did you check whether the file is actually in that folder? If you're running Quarto, it will set the working directory to the folder where the .qmd file is located. Did you put the file in the right spot relative to where the qmd is stored? See my [video on filepaths](https://www.youtube.com/watch?v=NG7Y0kkGR8g) for more help.
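
To check a working-directory problem directly, the base R calls below show where R is looking and what it can see there. The file and folder names are the same placeholders used above.

```{r, eval = FALSE}
getwd()                            # the folder R is currently looking in
list.files()                       # the files R can actually see in that folder

file.exists('filename.csv')        # TRUE only if the file sits in the working directory
file.exists('data/filename.csv')   # TRUE only if it sits in a "data" subfolder
```

If `file.exists()` returns `FALSE`, compare the output of `list.files()` with the name you typed. The mismatch is usually a typo, a wrong extension, or the file being in a different folder.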
110 |
111 |
112 | ## My code works fine in RStudio, but my Quarto won't render
113 |
114 | If your code runs fine when you run it by hand, but your Quarto throws errors, these are the most common issues:
115 |
116 | 1. You may have loaded a library while working on your code, but not loaded that same library in your Quarto. Include all necessary `library()` functions inside your Quarto document.
117 | 1. You may have created an object while working on your code, but not created that same object in your Quarto. Make sure all objects are created within the Quarto itself.
118 | 1. To troubleshoot both (1) and (2) above, restart your R session (Session -> Restart R) and clear your environment (hit the broom icon in the Environment tab), and THEN run your code line by line to see where the error is.
119 | 1. If you're loading in a file, you may have a different working directory than the Quarto is using. Remember, Quarto sets the working directory to the folder where the file is saved. See the previous "File does not exist" issue above.
120 | 1. Quarto hates when you try to install packages inside the Quarto (and it's a bad idea anyway). Make sure there are no `install.packages()` or `remotes::install_github()` lines in your Quarto doc.
121 | 1. Did you make sure your code is inside of a chunk? Code outside of a chunk won't run.
122 |
123 | If none of those fix it, then look in the "Render" tab to see what exactly the error message is. Keep in mind that, in Quarto, the line number associated with the error message is the line number for the **start of the chunk where the error is**, not the **actual line where the error occurs.**
124 |
125 | ## Object of type 'closure' is not subsettable
126 |
127 | This is the most aggravating and least-informative R error. "Subsettable" is a hint though - you're trying to take a subset of something (such as picking an element from a vector, or picking a column from a data set), but the thing you're doing it to is a "closure" that you can't do that to.
128 |
129 | Most commonly, this is a case of accidentally giving the name of a function where you mean to give the name of a vector or data frame. For example:
130 |
131 | ```{r, eval = FALSE}
132 | # My data set is about how mean you are, so I'll call it mean_data
133 | mean_data <- tibble(person = c('Me','You'), meanness = c(1000,999))
134 | # Now let's look at that meanness variable, but oops, I'm referring to the function mean() instead of my data mean_data
135 | mean$meanness
136 | # Error in mean$meanness : object of type 'closure' is not subsettable
137 | ```
138 |
139 | The Fix:
140 |
141 | 1. See where it says the error is "in" - that's the place where you've referred to the wrong thing. Fix it!
142 |
143 | ## You probably made a typo or didn't balance your parentheses
144 |
145 | Just in general, a very large percentage of errors students come to me with occur simply because they made a typo, or didn't balance their parentheses (pairing each ( with a )).
146 |
147 | The Fix:
148 |
149 | 1. If you get an error message of any kind, read the relevant part of the code carefully to make sure you didn't make a typo
150 | 1. Avoid code with lots and lots of nested parentheses (this is something that the **tidyverse** piping we learn is intended to avoid). Then, for whatever parentheses you do have, click on each one in RStudio - it will highlight its "partner" or show where there is a missing partner.
151 |
152 |
153 | # Mistakes in Coding
154 |
155 | These are common mistakes I see students making, which may not always lead to an error, but sometimes will. And these mistakes will often lead to incorrect results even if the code runs okay.
156 |
157 | ## Not checking
158 |
159 | This is not specific to R, or really to coding. But I am begging you to do this:
160 |
161 | After every line of code, check whether the code did what you expected it to do.
162 |
163 | Ask yourself: what was that line of code trying to do? What should the data look like after that line runs?
164 |
165 | Look at the data after you run the code. Does it look as expected?
166 |
167 | If you created a variable you'd expect to take certain values, perhaps use a function like `summary()` or `unique()` to see if it takes those values.
168 |
169 | If you tried to collapse the data so there should be one row per, say, country, look at your data (click it in the Environment pane) to see whether that happened.
170 |
171 | I have been using R for years (and coding generally for decades). I can look at your code and, often, know immediately whether it's going to do something unexpected. You (probably) don't have that luxury. But you can replicate 90% of it by just *checking your work*. I cannot emphasize enough how much of a good idea this actually is and I'm not just saying this. **I** do this *despite those decades of experience*. Or perhaps *because* of it.
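
Here's a sketch of what that checking might look like, using the built-in `mtcars` data. The names `checked`, `transmission`, and `by_trans` are made up for this example.

```{r, eval = FALSE}
library(tidyverse)
# Create a variable that should only ever be 'Manual' or 'Automatic'
checked <- mtcars %>%
  mutate(transmission = ifelse(am == 1, 'Manual', 'Automatic'))
# Check: does it take only those two values?
unique(checked$transmission)
# Collapse to what should be one row per transmission type
by_trans <- checked %>%
  group_by(transmission) %>%
  summarize(mpg = mean(mpg))
# Check: are there really only two rows?
nrow(by_trans)
```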
172 |
173 | ## Using numbers as strings
174 |
175 | When you put something in quotes, like '1', it will treat it as a *string variable* - a set of characters to be read as text.
176 |
177 | If you want a variable to be treated as a number, **do not put it in quotes.** This way lies madness.
178 |
179 | R is pretty loose, and will allow you to do things like ask whether `'2' > 1`, and will answer yes. This *feels* like it's okay to put the 2 in quotes. But it's not. What R is doing is saying "oof... I can't compare a number to a letter, but you're asking me to. I'll just convert this number to a string as well, so I'm really doing `'2' > '1'`." And since '2' is alphabetically after '1', that's true, which is the desired result. Great! Except it also means that `'02' > 1` is false, since '02' is alphabetically before '1'. Same with `'120' > '20'` being false. Oops.
180 |
181 | The Fix:
182 |
183 | 1. If it's a number, and you want it treated like a numeric value, don't put quotes around it.
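
You can try the comparisons above in the Console yourself. The last line is one way to recover if a number has already come in as a string, using base R's `as.numeric()`.

```{r, eval = FALSE}
'2' > 1       # TRUE, but only because '2' sorts after '1' as text
'02' > 1      # FALSE: '02' sorts before '1' as text
'120' > '20'  # FALSE: text is compared character by character
120 > 20      # TRUE: without quotes, these are compared as numbers
# If a number came in as a string, convert it before comparing
as.numeric('120') > 20  # TRUE
```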
184 |
185 | ## Using strings for variable names
186 |
187 | If your main coding experience has been with Python so far, you're probably used to always having to put variable names in quotes. If I want to get the variable `mpg` from the Pandas DataFrame `mtcars`, I do `mtcars['mpg']` or `mtcars.loc[:, 'mpg']`.
188 |
189 | R is different. R makes a lot of use of what's called "non-standard evaluation" which basically means you can often write variable names like regular code instead of putting it in strings. So we *could* do `mtcars[['mpg']]`, but usually we'd do something like `mtcars$mpg` or `mtcars %>% pull(mpg)`.
190 |
191 | Confusingly, sometimes R *does* require quotes (`mtcars[['mpg']]` works but `mtcars[[mpg]]` doesn't), and sometimes it will accept things either way (`mtcars %>% select(mpg)` and `mtcars %>% select('mpg')` both work).
192 |
193 | But often, especially when working with the **tidyverse** as we do, using strings for variable names can get you in trouble. For example, you'll be scratching your head for a while to figure out why `ggplot(data, aes(x = 'x', y = 'y')) + geom_point()` doesn't work, before realizing that **ggplot2** *doesn't* accept variable names as strings, and it's literally trying to graph the letter "x" against the letter "y". Oops.
194 |
195 | The Fix:
196 |
197 | 1. It's generally a good idea to default to not using quotes around variable names.
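
To see the **ggplot2** problem for yourself, here's a small sketch using the built-in `mtcars` data (my choice of data set, just for illustration):

```{r, eval = FALSE}
library(tidyverse)
# Quoted: graphs the literal text 'wt' against the literal text 'mpg',
# so every point lands on the same spot
ggplot(mtcars, aes(x = 'wt', y = 'mpg')) + geom_point()
# Unquoted: graphs the wt column against the mpg column
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
```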
198 |
199 | ## Letting NAs propagate
200 |
201 | In this class we do a whole lot of summary statistics! And you may find yourself with a lot of `NA` (missing) values in the results. Why is that?
202 |
203 | R's standard summarizing functions, like `mean()` or `sum()`, will return `NA` if *any* of the values you're giving it are `NA`. `mean(c(1,2,3,4,5, NA))` returns `NA`, not `3`.
204 |
205 | So, if you're doing something like a `group_by() %>% summarize()` with a function like `mean()` in it, then if there's a missing value *anywhere in your data*, you'll get back an `NA`.
206 |
207 | The Fix:
208 |
209 | 1. Whenever you use a function like `mean()`, there's usually a `na.rm = TRUE` option you can add to drop any `NA` values before computing the function. This is usually what you want.
210 | 1. Note the `na.rm = TRUE` option goes inside the `mean()` or whatever function, not the `mutate()` or `summarize()`.
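
As a sketch of where the option goes, here's an example using the `starwars` data that comes with **dplyr** (my choice of data set; its `height` column has some missing values):

```{r, eval = FALSE}
library(tidyverse)
mean(c(1, 2, 3, 4, 5, NA))                # NA
mean(c(1, 2, 3, 4, 5, NA), na.rm = TRUE)  # 3
# In a summarize(), na.rm = TRUE goes inside mean(), not inside summarize()
starwars %>%
  group_by(species) %>%
  summarize(mean_height = mean(height, na.rm = TRUE))
```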
211 |
212 | ## Indiscriminate na.omit()
213 |
214 | One way to deal with missing values is to purge them from your data entirely. I don't teach this in class, but I find a lot of students will look online and find the `na.omit()` function for doing this.
215 |
216 | However, be aware that `na.omit()` removes rows for which there's a missing value in *any* column. If you're analyzing three columns of data, but your data set has 200 columns, then you lose a row if that missing value is *anywhere*. That's probably not what you want.
217 |
218 | The Fix:
219 |
220 | 1. Limit your data only to the columns you're going to use (perhaps with `select()`) before running `na.omit()`
221 | 1. Instead use the superior `drop_na()` in **tidyr/tidyverse**, which also lets you specify the set of variables to look for NAs in. `data %>% drop_na(a, b)` will only drop rows if they have missing values in the `a` or `b` columns, not any others.
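
Here's the difference on a tiny made-up data set (the tibble `d` and its columns exist only for this example):

```{r, eval = FALSE}
library(tidyverse)
d <- tibble(a = c(1, NA, 3), b = c(1, 2, 3), c = c(NA, 2, 3))
# na.omit() drops any row with an NA in any column: only the third row survives
na.omit(d)
# drop_na(a, b) only checks a and b: the first and third rows survive
d %>% drop_na(a, b)
```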
222 |
223 | ## Mixing up <-, =, and ==
224 |
225 | `<-` is for assigning objects. You can say `a <- 1`, and that will create an object `a` with the value `1` inside.
226 |
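227 | `=` *can also* be for assigning objects, just like `<-`, but is also for assigning arguments and options in functions. In the function `mean()`, which takes the argument `x`, I should say `mean(x = 1:10)` and not `mean(x <- 1:10)`. The second one happens to run, but it also creates an object `x` in your environment as a side effect, which is not what you meant.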
228 |
229 | `==` is for *checking* whether two things are equal. `a == 1` will check whether the object `a` is equal to `1`, and will return `TRUE` if it is, or `FALSE` otherwise.
230 |
231 | Commonly, this will cause problems in cases like this:
232 |
233 | ```{r, eval = FALSE}
234 | data %>%
235 | filter(state = "WA")
236 | ```
237 |
238 | which we might expect to filter our data to rows that are in the state of Washington, but instead treats `state = "WA"` as setting an argument named `state`, not as a comparison. Recent versions of **dplyr** will stop with an error suggesting you meant `==`.
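
The corrected version uses `==`. Here's a sketch with a small made-up data set standing in for `data`:

```{r, eval = FALSE}
library(tidyverse)
# A small made-up data set for illustration
data <- tibble(state = c('WA', 'OR', 'WA'), value = 1:3)
# == checks each row's state against "WA" and keeps the rows where it's TRUE
data %>%
  filter(state == "WA")
```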
239 |
240 | ## summarize() without summaries
241 |
242 | `summarize()` is a **dplyr/tidyverse** function that summarizes the data down to have one row, or one row per group if you've previously specified a `group_by()`.
243 |
244 | However, for it to do this, you need to tell it *how* to summarize that data. So, each argument needs to be a function that returns *one value*. Or else it won't work.
245 |
246 | Most commonly, students will just list a bunch of variables in their `summarize()` and expect it to take the mean of each of them. But it doesn't do that, and will instead return a non-summarized data set.
247 |
248 | ```{r, eval = FALSE}
249 | # wrong:
250 | mtcars %>%
251 | group_by(am) %>%
252 | summarize(mpg, hp)
253 |
254 | # right:
255 | mtcars %>%
256 | group_by(am) %>%
257 | summarize(mpg = mean(mpg), hp = mean(hp))
258 |
259 | # or if you want to be fancy
260 | mtcars %>%
261 | group_by(am) %>%
262 | summarize(across(c(mpg, hp), mean))
263 | ```
264 |
265 | ## Working way too hard
266 |
267 | Data communications is a class in which we need to learn some coding to do what we need to do, but it's not a coding class. If there's a difficult task, I've usually gone over a simple solution somewhere in class.
268 |
269 | If you're trawling arcane StackExchange answers and pasting together a bunch of code you don't understand, or if you're spending tedious hours writing out a thousand lines of code, there's a good chance that you're doing something the hard way, when I've provided an easy way in class.
270 |
271 | Some tips for avoiding this:
272 |
273 | 1. If you're writing tiny variations on the same code a zillion times, try doing it in a loop instead
274 | 1. If you're having difficulty working with a date variable, there's probably a function that does things for you in the **lubridate** package
275 | 1. If you're having difficulty working with a string variable, there's probably a function that does things for you in the **stringr** package
276 | 1. If you're spending hours and hours on what feels like a problem that should have a simple answer, it probably does. Email me.
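
As a taste of what I mean, here's a sketch of two one-liners from **lubridate** and **stringr** that replace a lot of hand-rolled code. The date and the county name are made up for the example.

```{r, eval = FALSE}
library(tidyverse)
library(lubridate)
# Dates: pull the month out of a date instead of chopping up the text yourself
month(ymd('2020-03-04'))                   # 3
# Strings: remove part of a string
str_replace('King County', ' County', '')  # 'King'
```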
--------------------------------------------------------------------------------
/Lecture_04_R_Walkthrough.R:
--------------------------------------------------------------------------------
1 | # Packages
2 | install.packages(c('tidyverse','vtable'))
3 | # One at a time!
4 | library(tidyverse)
5 |
6 | # Getting data
7 | # From packages
8 | data(storms, package = 'dplyr')
9 | # From files
10 | # d <- read_csv('exampledata.csv')
11 |
12 | # R objects
13 | a <- 1
14 | b <- 'hello'
15 |
16 | # Vectors
17 | a <- 11:20
18 | a[1]
19 | b <- c(1, 1, 4, 4, 2, 3, 6)
20 |
21 | # Logicals
22 | b == 1
23 | sum(b == 1)
24 | mean(b == 1)
25 | (b == 1) | (b == 4)
26 | (b >= 1) & (b < 4)
27 |
28 | # Variable types
29 | as.character(a)
30 | factor(b)
31 | b
32 |
33 | # Data frames / tibbles
34 | storms
35 | summary(storms)
36 | str(storms)
37 | names(storms)
38 | vtable::vt(storms)
39 | library(vtable)
40 | sumtable(storms)
41 | storms$category
42 | table(storms$category)
43 | storms[['category']]
44 | table(storms[['category']])
45 |
46 |
47 | # Functions
48 | mean(b)
49 | table(b)
50 | mean(b, na.rm = TRUE)
51 | sumtable(storms, vars = names(storms)[5:8])
52 |
53 | # Autocomplete
54 | # data(
55 | # mea
56 |
57 | # Help
58 | help(mean)
59 | ??mean
60 | # help pane
61 |
62 | # Rstudio notes
63 | # Working directory
64 | # Environment pane and broom
65 | # Viewer and file pane, and export
66 |
67 | # Swirl
68 | # install.packages('swirl')
69 | library(swirl); swirl()
70 | # Note the base package 4 on EDA! also 1 on basic programming
71 | # See https://github.com/swirldev/swirl_courses for more
72 |
73 | # Troubleshooting
74 | mean(c('a','b'))
75 | sumtable(mean)
76 | feols(Y~X)
77 |
--------------------------------------------------------------------------------
/Lecture_04_R_and_RMarkdown.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 4: R and Quarto"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | {
19 | library(tidyverse)
20 | library(lubridate)
21 | library(gghighlight)
22 | library(ggthemes)
23 | library(directlabels)
24 | library(RColorBrewer)
25 | }
26 | rinline <- function(code) {
27 | sprintf('``` `r %s` ```', code)
28 | }
29 | ```
30 |
31 | ## Tech!
32 |
33 | ```{r, results = 'asis'}
34 | cat("
35 | ")
41 | ```
42 |
43 | - As we go through the course, we're going to need to keep data visualization and communication principles close at hand
44 | - But it's also unavoidable that actually using them will require some technical skill on our part
45 | - In this class we'll be using R
46 | - Specifically, the **tidyverse** set of packages, and Quarto
47 | - I don't expect you to become R experts. You're not learning R so much as learning **dplyr** and **ggplot2** and some extra details
48 |
49 | ## R
50 |
51 | - Why R instead of Python?
52 | - It's good to know more than one language (nothing ever lasts!)
53 | - **ggplot2** is a top-tier visualization package and makes it much easier to implement our principles
54 | - Data-cleaning is a bit more intuitive than in **pandas** (and faster too if you're using **data.table** but we aren't)
55 | - Python > R in designing software for production, or in some machine learning apps. But we aren't doing those!
56 |
57 | ## R - The Basics
58 |
59 | - Let's switch over to an RStudio walkthrough. We'll cover:
60 | - Getting set up
61 | - The RStudio window
62 | - Using basic R elements
63 | - I'll also show you how to set up the **swirl** package so you can do some walkthroughs on your own
64 |
65 | ## Loading in data
66 |
67 | - For the most part the **tidyverse** package will cover all our needs
68 | - For loading in (and saving) data, I strongly suggest the use of the **rio** package
69 | - This package is new, so you won't find it on a lot of Google searches for "R load CSV file"
70 | - But it's super easy. Doesn't even matter what file format it is. `import('filename.csv')` or `.Rdata` or `.xlsx` etc. all work immediately. `import_list()` can import all the sheets of an Excel workbook, or a bunch of files, all at once
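
- A minimal sketch of what that looks like (the filenames here are placeholders, not real files):

```{r, eval = FALSE, echo = TRUE}
library(rio)
# import() works out the file format from the extension
d <- import('filename.csv')
# import_list() reads every sheet of a workbook into a list of data frames
all_sheets <- import_list('filename.xlsx')
```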
71 |
72 | ## My Promise
73 |
74 | - There are many ways to do anything in R
75 | - But if I've shown it in class, almost certainly the way I'm showing it is the easiest, especially in the context of this class, as I've chosen stuff that makes sense for us specifically
76 | - Don't make yourself work too hard! If you're trying to figure something out, ask "did we cover this in class?" If so, go back to this material - it will probably be less work than whatever Google solution you find
77 | - Like, seriously. The number of times students spend hours and hours trying to stitch together a solution from dozens of StackExchange answers, when it's something I've given a one-line solution to in class, is *a lot*
78 |
79 | ## Quarto
80 |
81 | - Quarto is a text and code processing system
82 | - Text and code in -> Formatted output out (HTML, Word, PDF, slides, books, etc.)
83 | - These slides are made in Quarto. I write books and papers in it too.
84 |
85 | ## Markdown
86 |
87 | - Markdown is a "markup language" like HTML or LaTeX. All your formatting goes in as raw text, and then you run it through an interpreter to get output
88 | - Markdown is *super super simple*
89 | - Let's walk through the [Markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)
90 | - Note HTML code also works!
91 |
92 | ## Markdown
93 |
94 | - Why Markdown instead of, like, Word? or JuPyteR?
95 | - Can be automated. Includes code and output directly in the document, including for in-line numbers
96 | - Much easier to share via something like, e.g., GitHub
97 | - Output is in standard formats like Word or HTML so they don't need any sort of setup to see it
98 |
99 | ## The YAML
100 |
101 | - The YAML is the set of text at the top of a Quarto that tells it how to interpret all that text. Here's an example YAML:
102 |
103 | ```{r, eval = FALSE, echo = TRUE}
104 | ---
105 | title: "Untitled"
106 | format:
107 | html:
108 | embed-resources: true
109 | toc: true
110 | editor: visual
111 | ---
112 | ```
113 |
114 |
115 | ## Code Chunks
116 |
117 | - Include code in-line (fantastic for memos) with `r rinline('mean(1:3)')` which makes `r mean(1:3)`
118 | - Start a new code chunk with a forward slash, /. Let's take a look at an example Quarto doc
119 | - *Code goes inside the chunks, writing goes outside*
120 | - Note: want a data frame to nicely format as a table? use `knitr::kable()`.
121 |
122 | ## Code Chunks
123 |
124 | - Common options to set in each code chunk, or altogether in the YAML:
125 | - `echo`: Show code or not?
126 | - `eval`: Evaluate the code inside or not?
127 | - `warning` and `message`: Show warnings and messages from the code?
128 | - `include`: Include *any output at all* or not?
129 |
130 | ## Common Mistakes
131 |
132 | - Let's take a look at the [Common Errors and Mistakes Cheatsheet](https://nickch-k.github.io/DataCommSlides/Lecture_04_Common_R_Problems.html)
133 | - Common Quarto mistakes: forgetting to make sure that *all the code you need to run from fresh is in the Quarto*; including `install.packages()` (don't do it!); working directory issues; using the "Import Data" wizard
134 |
135 | ## Let's do it
136 |
137 | - Make a new Quarto HTML document
138 | - YAML: Turn the toc on, title it and add your name
139 | - Make one big section and title it (#) and two small (##)
140 | - In the setup chunk, load the **tidyverse** and load `data(storms)`
141 | - Write some text in the first small one, including a link to something
142 | - In the second, add a code chunk that does a `summary()` of `storms`
143 | - Then, some more text, with in-line code to get the mean of `storms$wind`.
144 | - Bonus: load the **scales** package and use `number()` to format the in-line text.
--------------------------------------------------------------------------------
/Lecture_04_example_rmarkdown.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "My Rmarkdown Doc"
3 | author: "Nick HK"
4 | date: "`r Sys.Date()`"
5 | output: html_document
6 | ---
7 |
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = FALSE)
10 |
11 | library(tidyverse)
12 | library(vtable)
13 | my_mean_data <- 100:200
14 | ```
15 |
16 | ## R Markdown
17 |
18 | Write what I like: `hi this is code`, *this is italics* [Google](https://google.com) `r 2+2`
19 |
20 | This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see .
21 |
22 | When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
23 |
24 | ```{r, echo = TRUE}
25 | knitr::kable(cars) %>%
26 | kableExtra::kable_styling('striped')
27 | ```
28 |
29 | ```{r cars, eval = FALSE}
30 | summary(cars)
31 | ```
32 |
33 | ## Including Plots
34 |
35 | You can also embed plots, for example:
36 |
37 | ```{r pressure, echo=FALSE}
38 | plot(pressure)
39 | ```
40 |
41 | Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
42 |
--------------------------------------------------------------------------------
/Lecture_04_second_example.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "My Storms Stuff"
3 | author: "Nick HK"
4 | date: "`r Sys.Date()`"
5 | output: html_document
6 | ---
7 |
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = TRUE)
10 | # install.packages('tidyverse')
11 | library(tidyverse)
12 | library(scales)
13 |
14 | data(storms)
15 | ```
16 |
17 | # My Storms Analysis
18 |
19 | ## Storms 1
20 |
21 | Storms are bad [look at the weather](https://weather.com).
22 |
23 | ## Storms 2
24 |
25 | ```{r}
26 | summary(storms)
27 | ```
28 | The mean wind speed is `r number(mean(storms$wind))`
29 |
--------------------------------------------------------------------------------
/Lecture_05_Example_Sprail.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture_05_Example_SPrail"
3 | author: "Nick Huntington-Klein"
4 | date: "7/8/2021"
5 | output:
6 | html_document:
7 | toc: TRUE
8 | toc_float: TRUE
9 | ---
10 |
11 | ```{r setup, include=FALSE}
12 | knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
13 |
14 | {
15 | library(tidyverse)
16 | library(purrr)
17 | library(kableExtra)
18 | library(lubridate)
19 | }
20 |
21 | SPrail <- list.files(pattern = 'SPrail_') %>%
22 | map(read_csv) %>%
23 | bind_rows()
24 |
25 | ```
26 |
27 | ## Is the data tidy?
28 |
29 |
30 | ```{r}
31 | head(SPrail)
32 | ```
33 |
34 | Looks like one row per ticket (albeit without any sort of identifier) and everything is in its own column. Seems tidy to me! Although an ID would be nice. We could make one with `mutate(id = 1:n())`
35 |
36 | ## Data Processing
37 |
38 | ```{r}
39 | SPrail <- SPrail %>%
40 | filter(fare == 'Promo') %>%
41 | mutate(month = month(start_date))
42 |
43 | months <- tibble(month = 1:12, monthname = month.name)
44 |
45 | SPrail <- SPrail %>%
46 | left_join(months, by = 'month')
47 | ```
48 |
49 | ## Now to Make a Table!
50 |
51 | ```{r}
52 | meanprice <- SPrail %>%
53 | group_by(destination, monthname) %>%
54 | summarize(meanprice = mean(price, na.rm = TRUE), N = n()) %>%
55 | ungroup() %>%
56 | arrange(-meanprice) %>%
57 | slice(1:5)
58 |
59 | meanprice %>%
60 | mutate(meanprice = scales::dollar(meanprice)) %>%
61 | knitr::kable() %>%
62 | kable_styling(bootstrap_options = 'striped')
63 | ```
64 |
65 | The most expensive tickets are for `r meanprice %>% slice(1) %>% pull(destination)` in `r meanprice %>% slice(1) %>% pull(monthname)`, at a price of `r meanprice %>% slice(1) %>% pull(meanprice) %>% scales::dollar()`. On average the five most expensive routes sold `r meanprice %>% pull(N) %>% mean()` tickets apiece.
66 |
--------------------------------------------------------------------------------
/Lecture_05_and_06_example_code.R:
--------------------------------------------------------------------------------
1 | library(tidyverse)
2 | data1 <- tibble(Month = 1:4, MonthName = c('Jan','Feb','Mar','Apr'),
3 | Sales = c(6,8,1,2), RD = c(4,1,2,4))
4 | data2 <- tibble(Department = 'Sales', Director = 'Lavy')
5 |
6 | joined_data <- data1 %>%
7 | pivot_longer(cols = c('Sales','RD'),
8 | names_to = 'Department',
9 | values_to = 'Budget') %>%
10 | full_join(data2, by = 'Department')
11 |
12 |
13 | joined_data %>%
14 | filter(!(MonthName %in% c('Jan','Feb')))
15 | joined_data %>%
16 | filter((MonthName %in% c('Jan','Feb')) & Department == 'Sales')
17 | joined_data %>%
18 | filter((MonthName %in% c('Jan','Feb')) | Department == 'Sales')
19 |
20 | data("construction")
21 | construction %>%
22 | select(Year, Month, West) %>%
23 | filter(Month %in% c('July','August','September')) %>%
24 | arrange(-West) %>%
25 | slice(1:2)
26 |
27 | joined_data
28 | joined_data %>%
29 | group_by(MonthName) %>%
30 | mutate(MonthAverage = sum(Budget)/n())
31 |
32 | joined_data %>%
33 | group_by(MonthName) %>%
34 | summarize(MonthAverage = mean(Budget))
35 |
36 | joined_data %>%
37 | group_by(MonthName) %>%
38 | summarize(proportion_above_5 = mean(Budget > 5))
39 |
40 |
41 | data(construction)
42 | new_cons <- construction %>%
43 | mutate(avg = (Northeast + Midwest + South + West)/4) %>%
44 | rename(`Price Average` = avg) %>%
45 | mutate(firsthalf = case_when(
46 | row_number() <= 6 ~ 'First half',
47 | TRUE ~ 'Second half'
48 | )) %>%
49 | group_by(firsthalf) %>%
50 | summarize(`Price Average` = mean(`Price Average`))
51 | new_cons %>%
52 | pull(firsthalf) %>%
53 | is.character()
54 | str(new_cons)
55 | class(construction$Year)
56 | sapply(construction, class)
57 |
58 | summarized <- joined_data %>%
59 | mutate(Department = factor(Department)) %>%
60 | group_by(Department) %>%
61 | summarize(Budget =mean(Budget)) %>%
62 | mutate(Department = reorder(Department, -Budget)) %>%
63 | arrange(Department)
64 | summarized$Department
65 |
66 |
67 | library(lubridate)
68 | ymd('2020-01-01')
69 | ym('202001')
70 | my('012020')
71 | mdy('01-01-2020')
72 | construction %>%
73 | mutate(year_as_date = ym(paste(Year, '01'))) %>%
74 | pull(year_as_date)
75 |
76 | tibble(year = 2015:2018, month = c(1,6,2,4), day = c(5,6,1,2)) %>%
77 | mutate(date = dmy(paste(day, month, year))) %>%
78 | mutate(quarter = quarter(date),
79 | year_new = year(date)) %>%
80 | mutate(month_label = month.name[month])
81 | ymd('2020-01-01') + years(1)
82 | ymd('2020-01-01') + months(3)
83 | ymd('2020-03-31') - months(1)
84 |
85 |
86 | tibble(year = c(2015,2015,2016,2016), month = c(1,6,2,4), day = c(5,6,1,2)) %>%
87 | mutate(date = dmy(paste(day, month, year))) %>%
88 | mutate(quarter = quarter(date),
89 | year_new = year(date)) %>%
90 | mutate(month_label = month.name[month]) %>%
91 | mutate(year_date = floor_date(date, 'month'))
92 |
93 | c('my string', 'my second string')
94 | paste('a','b','c','d', sep = ' and ')
95 | paste0('a','b','c',' d')
96 |
97 | construction
98 | construction %>%
99 | mutate(Month = case_when(
100 | Month == 'January' ~ 'Janubary',
101 | TRUE ~ Month
102 | ))
103 | cat('Look at the \n result in my graph')
104 | str_sub('hello', 2, 4)
105 |
106 | construction %>%
107 | mutate(Month = str_replace(Month, 'Ju', 'Jub'))
108 |
109 | fips_to_names %>%
110 | mutate(Just_name = word(countyname, 1))
111 | fips_to_names %>%
112 | mutate(Just_name = str_replace(countyname, 'County|Parish', ''))
113 |
114 |
115 | data(gss_cat)
116 | changed_gss_cat <- gss_cat %>%
117 | mutate(birthyear = ym(paste0(year, '-06')) - years(age)) %>%
118 | mutate(birthyear = floor_date(birthyear, 'year')) %>%
119 | mutate(race = factor(race, levels = c('Black','White','Other'))) %>%
120 | mutate(income_num = word(rincome,1)) %>%
121 | mutate(income_num = str_replace(income_num, '\\$','')) %>%
122 | mutate(income_num = as.numeric(income_num))
123 |
124 | table(gss_cat$rincome)
125 |
126 | stockgrowth <- tibble(ticker = c('AMZN','AMZN', 'AMZN', 'WMT', 'WMT','WMT'),
127 | date = as.Date(rep(c('2020-03-04','2020-03-05','2020-03-06'), 2)),
128 | stock_price = c(103,103.4,107,85.2, 86.3, 85.6)) %>%
129 | arrange(ticker, date) %>% group_by(ticker) %>%
130 | mutate(price_growth_since_march_4 = stock_price/first(stock_price) - 1,
131 | price_growth_daily = stock_price/lag(stock_price, 1) - 1)
132 | stockgrowth
133 |
134 | do_bps <- function(p) {
135 | bps <- p*10000
136 | return(bps)
137 | }
138 | do_bps(.00006)
139 | library(purrr)
140 | c(.00006, .05, 1) %>% map_dbl(do_bps)
141 |
142 | list.files(pattern = 'trends_') %>%
143 | map_df(read_csv)
144 |
145 |
146 | my_function <- function(i) {
147 | print(i+1)
148 | }
149 | # Print each of the numbers 1:10 plus one (so, 2 through 11)
150 | 1:10 %>% map(my_function)
151 |
152 | xplusa <- function(x, a) {
153 | return(x+a)
154 | }
155 |
156 | my_plus_a_p <- function(a) {
157 | data(mtcars)
158 | mtcars %>%
159 | mutate(across(ends_with('p'), ~ xplusa(.x, a = a))) %>%
160 | return()
161 | }
162 |
163 | 1:5 %>% map_df(my_plus_a_p) %>%
164 | write_csv('my_filename.csv')
165 |
166 | save(my_data, construction, file = 'my_data_and_construction.Rdata')
167 | load('my_data_and_construction.Rdata')
168 | saveRDS(my_data, 'mydata.rds')
169 | my_data <- readRDS('mydata.rds')
170 | write_csv(my_data, 'mydata.csv')
--------------------------------------------------------------------------------
/Lecture_06_GG_Components.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_06_GG_Components.png
--------------------------------------------------------------------------------
/Lecture_07_EDA.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 7: Exploratory Data Analysis"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | {
19 | library(tidyverse)
20 | library(lubridate)
21 | library(gghighlight)
22 | library(ggthemes)
23 | library(directlabels)
24 | library(RColorBrewer)
25 | }
26 | rinline <- function(code) {
27 | sprintf('``` `r %s` ```', code)
28 | }
29 | ```
30 |
31 | ## The Purpose of Analysis
32 |
33 | ```{r, results = 'asis'}
34 | cat("
35 | ")
41 | ```
42 |
43 | - What are we even *doing* with data?
44 | - We want to see what sorts of stuff is in the data, so that we know what things we *should* be reporting on
45 | - This requires us to *explore* the data
46 | - Even if we come in with a strong idea of what we want to find, learning about what the data looks like is a good idea
47 |
48 | ## The Purpose of Analysis
49 |
50 | We want to:
51 |
52 | - Understand our data (read the docs, too!!!)
53 | - Detect mistakes
54 | - Get a sense of what our variables look like
55 | - Figure out what some of the relationships are
56 |
57 | ## The Purpose of Analysis
58 |
59 | Always be on the lookout (in EDA and in your code!)
60 |
61 | - Things that are different that shouldn't be different ("why does this one person have a height 8x that of anyone else?")
62 | - Things that are the same that shouldn't be the same ("Why is the mean income for Americans and Canadians exactly the same to the 8th decimal place?")
63 | - Relationships that are very surprising ("Why is age negatively correlated with height?")
64 |
65 | ## Exploring Data
66 |
67 | - Univariate non-graphical
68 | - Univariate graphical
69 | - Multivariate non-graphical
70 | - Multivariate graphical
71 |
72 | ## Univariate Exploratory Analysis
73 |
74 | Looking for:
75 |
76 | - What kind of values we have
77 | - Are there massive outliers? Are there values that look *incorrect*?
78 | - What is the distribution? Is there skew? If it's a factor, does one category dominate?
79 | - You'll want to do this with nearly every variable you have
80 | - Getting all the little presentation details is less important, but the graph should still be readable (labels etc.) On that note, EDA isn't always *concise* - some tables/graphs will slip off the edge of these slides.
81 |
82 | ## Univariate Exploratory Analysis
83 |
84 | - Think about *what features of the variable it makes sense to explore*
85 | - Central tendencies (mean, median, etc.?)
86 | - Tails? Skew? Percentiles?
87 | - If it's a factor, is showing *all* the categories informative? Do you need to collapse first? Or summarize?
88 |
89 | ## Univariate Exploratory Analysis Tools
90 |
91 | - `summary()` shows lots of good info about a variable's distribution (also works to summarize other objects)
92 |
93 | ```{r, echo = TRUE}
94 | summary(iris$Sepal.Length)
95 | summary(iris$Species)
96 | ```
97 |
98 |
99 | ## Univariate Exploratory Analysis Tools
100 |
101 | - `str()` shows variable classes (important!) and some values
102 |
103 | ```{r, echo = TRUE}
104 | str(iris)
105 | ```
106 |
107 | ## Univariate Exploratory Analysis Tools
108 |
109 | - `vtable()` in **vtable** is a more readable and flexible version of that. `lush = TRUE` for more info
110 |
111 | ```{r, echo = TRUE}
112 | library(vtable)
113 | vtable(iris, lush = TRUE)
114 | ```
115 |
116 | ## Univariate Exploratory Analysis Tools
117 |
118 | - `sumtable()` in **vtable** is a full table of summary statistics
119 |
120 | ```{r, echo = TRUE}
121 | sumtable(iris)
122 | ```
123 |
124 |
125 | ## Univariate Exploratory Analysis Tools
126 |
127 | - Moving on to graphical tools! In **ggplot**, `geom_density()` and `geom_histogram()` can both show full distributions (when might this *not* be useful?)
128 |
129 | ```{r, echo = TRUE, fig.height = 3}
130 | ggplot(iris, aes(x = Sepal.Length)) + geom_density()
131 | ```
132 |
133 | ## Univariate Exploratory Analysis Tools
134 |
135 | - Boxplots aren't for general-audience consumption but they show some easy summary statistics
136 |
137 | ```{r, echo = TRUE, fig.height = 3}
138 | ggplot(iris, aes(x = Sepal.Length)) + geom_boxplot() + coord_flip()
139 | ```
140 |
141 | ## Univariate Exploratory Analysis
142 |
143 | - Load the `gov_transfers` data from the **causaldata** package
144 | - Read the docs! `help(gov_transfers)`
145 | - Explore the variables that are in there on a univariate basis
146 |
147 | ## Multivariate Exploratory Analysis
148 |
149 | - We want to know *how variables are related* in multivariate analysis
150 | - Again, *fit the analysis to the context*. Don't treat a continuous variable as a factor, and if you treat a discrete variable as continuous know that a scatterplot won't work! etc.
151 | - Don't just fire and forget. Think about whether the output is actually informative.
152 | - More than 2 variables are possible too! Just add more groups/aesthetics.
153 |
154 | ## Multivariate Exploratory Analysis Tools
155 |
156 | - For continuous vs. continuous, nothing wrong with starting with a linear correlation!
157 | - Bivariate OLS is just a rescaled correlation
158 |
159 | ```{r, echo = TRUE}
160 | cor(iris$Sepal.Length, iris$Sepal.Width)
161 | lm(Sepal.Length ~ Sepal.Width, data = iris)
162 | ```
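
## Multivariate Exploratory Analysis Tools

To see the "rescaled correlation" point in numbers, here is a small check on the same `iris` columns. It is only a sketch, and the object names are made up for illustration: the bivariate OLS slope should match the correlation multiplied by `sd(y)/sd(x)`.

```{r, echo = TRUE}
r <- cor(iris$Sepal.Length, iris$Sepal.Width)
# Rescale the correlation by the ratio of standard deviations (outcome over predictor)
slope_from_cor <- r * sd(iris$Sepal.Length) / sd(iris$Sepal.Width)
# Slope coefficient from the same regression as on the previous slide
slope_from_lm <- coef(lm(Sepal.Length ~ Sepal.Width, data = iris))[['Sepal.Width']]
all.equal(slope_from_cor, slope_from_lm)
```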
163 |
164 |
165 | ## Multivariate Exploratory Analysis Tools
166 |
167 | - If one variable is discrete or a factor, don't underestimate plain ol' `group_by() %>% summarize()`
168 | - In fact, many headaches of trying to get R to do something for you can be avoided by just doing `group_by() %>% summarize()` yourself
169 | - Don't forget `na.rm = TRUE` as appropriate
170 |
171 | ```{r, echo = TRUE}
172 | iris %>%
173 | group_by(Species) %>%
174 | summarise(mean.SL = mean(Sepal.Length, na.rm = TRUE),
175 | sd.SL = sd(Sepal.Length, na.rm = TRUE))
176 | ```
177 |
178 | ## Multivariate Exploratory Analysis Tools
179 |
180 | - `sumtable` has a `group` option for the same purpose (note `summ` can customize the summary functions)
181 |
182 | ```{r, echo = TRUE}
183 | iris %>% sumtable(group = 'Species')
184 | ```
185 |
186 | ## Multivariate Exploratory Analysis Tools
187 |
188 | - `tabyl` in **janitor** is great for discrete vs. discrete; use `adorn` functions to get percents instead of counts. Watch the denominator! It affects interpretation a LOT
189 |
190 | ```{r, echo = TRUE}
191 | library(janitor)
192 | mtcars %>% tabyl(am, vs) %>%
193 | adorn_percentages(denominator = 'row') %>%
194 | adorn_pct_formatting() %>% adorn_title()
195 | ```
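
## Multivariate Exploratory Analysis Tools

To illustrate why the denominator matters, the same counts can be turned into row shares or column shares. This sketch uses base R's `prop.table()` rather than **janitor** so it runs without extra packages; in `tabyl` terms the two versions would correspond to `denominator = 'row'` and `denominator = 'col'`. The object names are illustrative only.

```{r, echo = TRUE}
counts <- table(am = mtcars$am, vs = mtcars$vs)
# Share within each row: of cars with a given am, what fraction has each vs?
row_share <- prop.table(counts, margin = 1)
# Share within each column: of cars with a given vs, what fraction has each am?
col_share <- prop.table(counts, margin = 2)
round(row_share, 2)
round(col_share, 2)
```

Same cells, same counts, but the two tables answer different questions.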
196 |
197 | ## Multivariate Exploratory Analysis
198 |
199 | - Moving on to graphs!
200 | - If one variable is a factor/discrete you can do pretty much any kind of graph you like, but grouped
201 |
202 | ```{r, echo = TRUE, fig.height = 3}
203 | mtcars %>% group_by(vs, am) %>%
204 | summarize(mean_mpg = mean(mpg)) %>%
205 | ggplot(aes(x = vs, fill = factor(am), y = mean_mpg)) + geom_col(position = 'dodge')
206 | ```
207 |
208 | ## Multivariate Exploratory Analysis
209 |
210 | - When it comes to densities, typical geoms look cluttered. I recommend `geom_density_ridges` in **ggridges**, or `geom_violin` (or use facets)
211 |
212 | ```{r, echo = TRUE, fig.height = 3}
213 | library(ggridges); library(patchwork)
214 | p1 <- ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) + geom_density_ridges()
215 | p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_violin()
216 | p1 + p2
217 | ```
218 |
219 | ## Multivariate Exploratory Analysis
220 |
221 | - For two continuous variables, things can be tricky, but a scatterplot is a good place to start
222 |
223 | ```{r, echo = TRUE, fig.height = 4}
224 | ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
225 | ```
226 |
227 | ## Automation
228 |
229 | - As a starting place, `ggpairs` in **GGally** will automatically pick univariate and multivariate comparisons to start
230 |
231 | ```{r, echo = TRUE, eval = FALSE}
232 | GGally::ggpairs(iris)
233 | ```
234 |
235 | ## Automation
236 |
237 | ```{r}
238 | GGally::ggpairs(iris)
239 | ```
240 |
241 | ## Multivariate Exploratory Analysis
242 |
243 | - Let's do some of this in our `gov_transfers` data!
244 |
245 | ## Going into Detail
246 |
247 | - This has all been running with little input from us at this point
248 | - Yes, we have to figure out which kind makes sense, but we're just exploring
249 | - Going for a bit more detail requires us to have a *question* to answer, and then we can look into that
250 | - This targets us towards the variables/relationships to study, and *how* to explore them further
251 | - Perhaps, for example, checking trivariate relationships, one example being "*in this subgroup* how are X and Y related?"
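
## Going into Detail

One possible way to ask "in this subgroup, how are X and Y related?" is to compute the same statistic overall and then within groups. This is a sketch on `iris` with illustrative object names, not the only way to do it. Here the overall correlation comes out negative while the within-species correlations are positive, which is exactly the kind of thing a bivariate look can miss.

```{r, echo = TRUE}
library(dplyr)
# Overall relationship, ignoring Species
overall_cor <- cor(iris$Sepal.Length, iris$Sepal.Width)
# The same relationship within each Species subgroup
by_species <- iris %>%
  group_by(Species) %>%
  summarize(cor_SL_SW = cor(Sepal.Length, Sepal.Width))
overall_cor
by_species
```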
252 |
253 | ## Careful!
254 |
255 | Be aware of what EDA does not do:
256 |
257 | - Checking *every* relationship will tend to uncover non-real relationships by random chance
258 | - Consider doing EDA on a random subset of the data for this reason, so you can check whether something interesting is actually there in the other/full data and don't trick yourself
259 | - Hardly ever establishes a *causal relationship* - avoid causal language unless you can really back it up!
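
## Careful!

A minimal sketch of the random-subset idea, assuming a simple 50/50 split is acceptable (the split share, the seed, and the object names are arbitrary choices for illustration): explore in one half, then see whether what you found shows up in the half you held back.

```{r, echo = TRUE}
set.seed(1000)  # arbitrary seed so the split is reproducible
explore_rows <- sample(nrow(iris), size = floor(nrow(iris) / 2))
iris_explore <- iris[explore_rows, ]   # poke around in this half
iris_holdout <- iris[-explore_rows, ]  # check interesting findings here afterwards
cor(iris_explore$Petal.Length, iris_explore$Petal.Width)
cor(iris_holdout$Petal.Length, iris_holdout$Petal.Width)
```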
260 |
261 | ## Let's Do More (if there's time)
262 |
263 | - [https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Clothing.csv](https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Clothing.csv)
264 | - Data description: [https://rdrr.io/rforge/Stat2Data/man/Clothing.html](https://rdrr.io/rforge/Stat2Data/man/Clothing.html)
265 | - Explore! What do we find?
--------------------------------------------------------------------------------
/Lecture_08_Basics_of_ggplot2.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 8: Basics of ggplot2"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | library(tidyverse)
19 | ```
20 |
21 |
22 |
23 | ## What is ggplot2?
24 |
25 | ```{r, results = 'asis'}
26 | cat("
27 | ")
33 | ```
34 |
35 | - **ggplot2** is a package in R, `library(ggplot2)` or `library(tidyverse)`
36 | - It is based on the *grammar of graphics*
37 | - Lots of major publications use it (538, The Economist, BBC, ...)
38 | - Zillions of extensions for doing anything you like
39 | - (also runs in Python using **plotnine**, and the newest version of **seaborn** is based on the Grammar of Graphics, so everything you learn here will also teach you plotting in Python)
40 |
41 | ## Resources
42 |
43 | - [An overview of the Grammar of Graphics concept](https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149)
44 | - [ggplot2 cheat sheet from RStudio](https://github.com/rstudio/cheatsheets/blob/master/data-visualization.pdf)
45 | - [ggplot2 book by Hadley Wickham](https://ggplot2-book.org/)
46 | - [LOST-Stats Figures](https://lost-stats.github.io/Presentation/Figures/Figures.html)
47 | - [R Graphics Cookbook](https://r-graphics.org/)
48 | - The **esquisse** or **swirlr** packages for learning **ggplot2**
49 |
50 | ## Why use ggplot?
51 |
52 | - Huge amounts of power and flexibility
53 | - Manipulating data is easy in R
54 | - Easy to edit
55 | - Handles multidimensional data effortlessly
56 | - Once you know how to generate graphs in a flexible coding system, **it will be easier to do what you want in code than in a more canned and restricted piece of software**
57 |
58 | ## Tidy Tuesday
59 |
60 | - Every week, the R for Data Science community posts a data set you can use
61 | - An assignment for this class will be to participate in (some) Tidy Tuesday weeks
62 | - See the repository [here](https://github.com/rfordatascience/tidytuesday)
63 | - Now on to **ggplot2**!
64 |
65 | ## The Grammar of Graphics
66 |
67 | - There's a method here!
68 | - We want to describe graphics in the way we use language
69 | - Think about syntax and nouns and verbs
70 | - We can build a plot from the ground up, element by element
71 | - This also makes it easy to make changes or switch things out
72 |
73 | ## Components of the Grammar of Graphics
74 |
75 | ```{r}
76 | # https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149
77 | knitr::include_graphics('Lecture_06_GG_Components.png')
78 | ```
79 |
80 | ## Components of the Grammar of Graphics
81 |
82 | - Data: Identify the dimensions you want to visualize. Prepare the data so that it contains the points you want to visualize, or the values those points will be calculated from
83 | - Aesthetics: Axes are based on the data dimensions. Also includes other "axes" like size, shape, color, etc.
84 | - Scale: Do we need to scale the potential values, use a specific scale to represent multiple values or a range?
85 |
86 | ## Components of the Grammar of Graphics
87 |
88 | - Geometric objects (`geom_`s): How should the data be drawn on the graph? Lines? Points? Bars?
89 | - Statistics: Do we want to show calculations based on data rather than data itself?
90 | - Facets: Do we want to spread one of the "axes" across multiple graphs?
91 | - Coordinate system: Should it be cartesian or polar (or something else)? Flipped? Log scale?
92 |
93 | ## Setting up a ggplot Graph
94 |
95 | We can focus, at least to start, on:
96 |
97 | - Data
98 | - Aesthetic
99 | - Geometry
100 |
101 | ## Data
102 |
103 | - Data is just a regular data set (`data.frame` or `tibble`) as you'd normally work with it in R.
104 | - **ggplot2** is set up to work directly with data frames. References to variables will look in the dataset you give it
105 |
106 | ## Aesthetic
107 |
108 | - The *aesthetic* determines the "axes" of our graph
109 | - Literally the `x=` and `y=` axes, but also other axes on which data is differentiated, like `color=`, `size=`, `linetype=`, `fill=`, etc.
110 | - `aes(x=mpg,y=hp,color=transmission)` puts the `mpg` variable on the x-axis, `hp` on the y-axis, and colors things differently by `transmission`
111 | - `group=` to separate groups for things like labeling without making them visibly different
112 | - Different aesthetic values for different types of graphs/geometries
113 |
114 | ## For Example
115 |
116 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
117 | library(tidyverse)
118 | mtcars <- mtcars %>%
119 | mutate(Transmission = factor(am, labels = c('Automatic','Manual')),
120 | CarName = row.names(mtcars))
121 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) +
122 | geom_point()
123 | ```
124 |
125 | ## Common Aesthetics
126 |
127 | See `help()` for a given geometry to see what it accepts. But common are:
128 |
129 | - `x`, `y`: position on the x and y axis
130 | - `color`, `fill`: color is for coloring things generally. For shapes, `color` is the outline and `fill` is the inside
131 | - `label`: text labels
132 | - `size`: guess what this does
133 | - `alpha`: Transparency
134 | - `linetype`, `linewidth`: for lines or outlines, the type/thickness of line (solid, dashed...)
135 | - `xmin`/`xmax`/`xend` (and same for `y`): For line segments or shaded ranges, where do they start/end?
136 |
137 |
138 | ## Geometry
139 |
140 | - The *geometry* is what actually gets drawn. In the last slide, `geom_point()` drew points.
141 | - There are many different geometries. Let's look at the [ggplot2 cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization.pdf)
142 | - Common: `geom_point`, `geom_line`, `geom_col`, `geom_text`
143 |
144 | ## Geometry: geom_line
145 |
146 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
147 | data(economics)
148 | ggplot(economics, aes(x = date, y = uempmed)) +
149 | geom_line()
150 | ```
151 |
152 |
153 |
154 | ## Geometry: geom_text
155 |
156 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
157 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission, label = CarName)) +
158 | geom_text()
159 | ```
160 |
161 | ## Explicit bar graphs like Excel: geom_col
162 |
163 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
164 | data <- tibble(category = c('Apple','Banana','Carrot'),
165 | quality = c(6,4,3))
166 | ggplot(data, aes(x = category,y=quality)) + geom_col()
167 | ```
168 |
169 | ## Or explicit grouped charts: geom_col
170 |
171 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
172 | data <- tibble(category = c('Apple','Banana','Carrot','Apple','Banana','Carrot'),
173 | person = c('Me','Me','Me','You','You','You'),
174 | quality = c(6,4,3,1,6,3))
175 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge')
176 | ```
177 |
178 | ## Hm?
179 |
180 | What was that `position = 'dodge'`?
181 |
182 | - Used when you have stacked elements with the same `x` aesthetic that you want to be side-by-side
183 | - With `geom_col()` or `geom_bar()` you can set `position = 'dodge'` to make the different-fill bars go side by side
184 | - Can apply `position = position_dodge(.9)` with other geometries to make them line up. For example...
185 |
186 | ## Dodging
187 |
188 | ```{r}
189 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge') +
190 | geom_text(aes(label = quality), position = position_dodge(.9), vjust = -.4)
191 | ```
192 |
193 |
194 | ## Calculations
195 |
196 | - Some geometries do calculations/summaries of your data before graphing
197 | - Common: `geom_density`/`geom_histogram`/`geom_dotplot`, `geom_bar`, `geom_smooth`
198 |
199 | ## Demonstrating Distributions
200 |
201 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
202 | ggplot(mtcars, aes(x = mpg)) + geom_histogram()
203 | ```
204 |
205 | ## Demonstrating Distributions
206 |
207 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
208 | ggplot(mtcars, aes(x = mpg)) + geom_dotplot()
209 | ```
210 |
211 | ## Demonstrating Distributions
212 |
213 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
214 | ggplot(mtcars, aes(x = mpg)) + geom_density()
215 | ```
216 |
217 | ## Demonstrating Distributions
218 |
219 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
220 | ggplot(mtcars, aes(y = mpg, x = factor(am))) + geom_violin()
221 | ```
222 |
223 |
224 |
225 | ## Discrete Distributions: geom_bar
226 |
227 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
228 | ggplot(mtcars, aes(x = Transmission)) + geom_bar()
229 | ```
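
## Discrete Distributions: geom_bar

To make the "calculation" visible, the counting that `geom_bar()` does can be done by hand and the result drawn with `geom_col()`. This is a sketch meant to give roughly the same picture as the previous slide; `am_counts` is just an illustrative name.

```{r, echo = TRUE, fig.height = 4, fig.width = 6}
library(dplyr); library(ggplot2)
# Do the counting ourselves, then draw the pre-computed heights with geom_col()
am_counts <- mtcars %>%
  mutate(Transmission = factor(am, labels = c('Automatic','Manual'))) %>%
  count(Transmission)
ggplot(am_counts, aes(x = Transmission, y = n)) + geom_col()
```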
230 |
231 | ## Best-fit lines with geom_smooth
232 |
233 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
234 | ggplot(mtcars, aes(x = mpg, y = hp)) +
235 | geom_point() + geom_smooth()
236 | ```
237 |
238 | ## Trendlines with geom_smooth
239 |
240 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
241 | # Or make it straight, and separate by Transmission, and no bands
242 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) +
243 | geom_point() + geom_smooth(method = 'lm', se = FALSE)
244 | ```
245 |
246 | ## Imitation and Flattery
247 |
248 | - Uses `data(iris)`
249 |
250 | ```{r, fig.height = 4, fig.width=6}
251 | ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
252 | geom_point()
253 | ```
254 |
255 | ## Imitation and Flattery
256 |
257 | - Uses `data(economics_long)`
258 | - `economics_long <- economics_long %>%` `filter(variable %in% c('psavert','uempmed'))` keeps just the two variables we want
259 |
260 | ```{r, fig.height = 4, fig.width=6}
261 | data(economics_long)
262 | economics_long <- economics_long %>% filter(variable %in% c('psavert','uempmed'))
263 | ggplot(economics_long, aes(x = date, y = value, linetype = variable)) +
264 | geom_line()
265 | ```
266 |
267 | ## Imitation and Flattery
268 |
269 | - Uses `data(economics)`
270 |
271 | ```{r, fig.height = 4, fig.width=6}
272 | data(economics)
273 | ggplot(economics, aes(x = uempmed)) + geom_density()
274 | ```
275 |
276 | ## Moving Forward
277 |
278 | - Everything we've done up to now isn't all *that* different from any sort of graphing software
279 | - But what really makes **ggplot2** shine is its consistency and internal logic
280 | - You'll have full access to the *scales* of the graph, as well as to the components that make it up
281 | - And manipulating those things is fairly intuitive to do rather than head-banging work
282 | - That's what we'll work on next time!
--------------------------------------------------------------------------------
/Lecture_09_Structuring_ggplot.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 9: Structuring ggplot2"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | library(tidyverse)
19 | library(ggrepel)
20 | library(scales)
21 | ```
22 |
23 |
24 |
25 | ## Structuring ggplot
26 |
27 | ```{r, results = 'asis'}
28 | cat("
29 | ")
35 | ```
36 |
37 | - Last time we went through the basics of the grammar of graphics
38 | - And how to put together a ggplot graph:
39 | - Data, aesthetics, geometry
40 |
41 | ## Today!
42 |
43 | - We'll talk about different ways of changing the *structure* of your ggplot
44 | - Scales and coordinates, labels and legends, new geometries
45 | - There are infinite possibilities here, we can't cover everything
46 | - The goal is to reduce unknown unknowns so you know where to look/Google
47 | - Also, just start typing in RStudio and see what autocompletes!
48 |
49 | ## The Key Pieces
50 |
51 | A common structure for a `ggplot()` command with a fair amount of customization might be:
52 |
53 | ```{r, echo = TRUE, eval = FALSE}
54 | ggplot(data, aes(x = xvar, y = yvar, color = colorvar)) +
55 | geom_mygeometry(otheraesthetic = value) +
56 | scale_aesthetic_type(labels = something) +
57 | labs(x = 'My X Var', y = 'My Y Var', title = 'My Title') +
58 | guides(color = 'none')
59 | ```
60 |
61 | ## Scales
62 |
63 | - Every axis (aesthetic element) has a *scale*
64 | - This determines how *a value in the data* (1, 2, 3) turns into *a value on the graph* (100 pixels to the right, 200 pixels, 300 pixels; or "red", "blue", "green")
65 | - These scales are fully manipulable and label-able!
66 | - `scale_axisname_type`
67 |
68 | ## Discrete and Continuous Scales
69 |
70 | - For example:
71 | - `scale_x_continuous` for a continuous x-axis
72 | - `scale_x_discrete` for a discrete one
73 | - Or similarly `scale_y_continuous`, `scale_color_continuous`, `scale_color_gradient`, `scale_fill_discrete`, and so on and so on
74 |
75 | ## Discrete and Continuous Scales
76 |
77 | - Use to:
78 | - Set colors, set limits (want the axis to extend beyond its values? `limits`)
79 | - Name the scale
80 | - Label values (for discrete values, or perhaps as percents?)
81 |
82 | ## Setting Values Manually
83 |
84 | - The `_manual` scales, like `scale_fill_manual` or `scale_color_manual`, let you set the values of a discrete aesthetic by hand
85 | - For example if `x` is different companies and you want to pick each company's brand color for its bar
86 |
87 | ## Examples
88 |
89 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
90 | data <- tibble(category = c('Apple','Banana','Carrot','Apple','Banana','Carrot'),
91 | person = c('Me','Me','Me','You','You','You'),
92 | quality = c(.06,.04,.03,.01,.06,.03))
93 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge')
94 | ```
95 |
96 | ## Setting Scales
97 |
98 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
99 | library(scales)
100 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge') +
101 | scale_y_continuous(labels = label_percent(), limits = c(0,.1)) +
102 | scale_x_discrete(position = 'top') +
103 | scale_fill_manual(values = c('Apple'='red','Banana'='yellow','Carrot'='orange'))
104 | ```
105 |
106 | ## Setting Scales
107 |
108 | - `limits` in the `scale` function can be used to specify the vector of values to be included, or as a range for continuous variables
109 | - `labels` is also handy. `labels = c('Red Apple',` `'Yellow Banana','Orange Carrot')` would relabel the legend (or axis labels)
110 | - In a continuous function, say for dollars, `scale_x_continuous(labels = scales::label_dollar())` would put it in dollar terms. More on this in a moment
111 | - Or dates: `scale_x_date(labels = function(x)` `paste(month.abb[month(x)],year(x)))`
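
## Setting Scales

The date-labeling idea above might look like this in a full plot. It is a sketch using the `economics` data from **ggplot2**, and `month_year` is a hypothetical helper name, not a built-in function.

```{r, echo = TRUE, fig.height = 4, fig.width = 6}
library(ggplot2); library(lubridate)
# Hypothetical helper: turns a Date into text like 'Mar 2020'
month_year <- function(x) paste(month.abb[month(x)], year(x))
ggplot(economics, aes(x = date, y = uempmed)) + geom_line() +
  scale_x_date(labels = month_year)
```

Any function that takes the break values and returns text can go in `labels=`, which is why a hand-written one works here.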
112 |
113 | ## Setting Color or Fill Scales
114 |
115 | - Selecting a color scale is important - we've discussed discrete and continuous palettes
116 | - There are a bunch of `scale_color_`/`scale_fill_` functions that solely exist to help with this!
117 | - (not to mention entire packages like **paletteer**)
118 |
119 | ## Setting Color or Fill Scales
120 |
121 | Especially useful are:
122 |
123 | - `scale_color_gradient()` for gradient scales (or `_gradient2()` for diverging scales with a "middle" in them), `scale_color_viridis()` also has some great gradient scales (either discrete or continuous!)
124 | - `scale_color_brewer()`/`scale_fill_brewer()` functions for discrete values, or `_distiller()` for continuous values, or `_fermenter()` for binned
125 |
126 | ## Setting Color or Fill Scales
127 |
128 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
129 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge') +
130 | scale_y_continuous(labels = label_percent(), limits = c(0,.1)) +
131 | scale_x_discrete(position = 'top') +
132 | scale_fill_brewer(palette = 'Dark2')
133 | ```
134 |
135 |
136 | ## Setting Color or Fill Scales
137 |
138 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
139 | ggplot(data, aes(x = person, y = quality, group = category, fill = quality)) + geom_col(position = 'dodge') +
140 | scale_y_continuous(labels = label_percent(), limits = c(0,.1)) +
141 | scale_x_discrete(position = 'top') +
142 | scale_fill_viridis_c()
143 | ```
144 |
145 | ## Setting Color or Fill Scales
146 |
147 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
148 | ggplot(data, aes(x = person, y = quality, group = category, fill = quality)) + geom_col(position = 'dodge') +
149 | scale_y_continuous(labels = label_percent(), limits = c(0,.1)) +
150 | scale_x_discrete(position = 'top') +
151 | scale_fill_gradient2(midpoint = .03)
152 | ```
153 |
154 | ## Transformations
155 |
156 | - We can transform values as we plot them
157 | - Many `scale_something_continuous` entries have a `trans` option, set to `date`, `log`, `probability`, `reciprocal`, `sqrt`, `reverse`, etc. etc. to perform that transformation before plotting
158 | - You can also do things like `scale_x_log10` or `scale_y_reverse` for some common transforms
159 | - `scale_something_binned()` takes a continuous value and puts it in bins, handy sometimes for simplification
160 |
161 | ## Transformations
162 |
163 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
164 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) +
165 | geom_point()
166 | ```
167 |
168 | ## Transformations
169 |
170 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
171 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) +
172 | geom_point() +
173 | scale_x_log10() +
174 | scale_y_continuous(trans='reverse') +
175 | scale_color_binned()
176 | ```
177 |
178 | ## Log Scales
179 |
180 | When to use log scales?
181 |
182 | - When data is highly skewed, so a few huge observations are drawing all focus
183 | - When the relationship is multiplicative so you want to show, say, percentage growth
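
## Log Scales

A quick arithmetic check of the "multiplicative" point, using made-up numbers: doubling from 100 to 200 and doubling from 1000 to 2000 are very different absolute changes, but they cover the same distance on a log scale.

```{r, echo = TRUE}
# Two doublings with different absolute gaps...
c(200 - 100, 2000 - 1000)
# ...but the same gap once both are on a log scale
c(log10(200) - log10(100), log10(2000) - log10(1000))
```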
184 |
185 | ## Labeling Scales for Continuous Variables
186 |
187 | Scale functions have a `labels=` option.
188 |
189 | - USE IT!!
190 | - This is the quickest and easiest way to make your graph look a zillion times better and more professional
191 | - We already covered this for categorical variables. `labels = c('A','B')` gives the categories the names A and B, in that order (tip: check the order first)
192 | - For continuous variables, the **scales** package helps us immensely here
193 |
194 | ## scales
195 |
196 | Two main types of functions in **scales**:
197 |
198 | - Transformation functions like `dollar()`: `dollar(10)` creates $10 (NOTE: handy sometimes in RMarkdown text! Also note this creates text, not numbers, so don't use them in `aes()` unless you want the variable to be a string)
199 | - Labeling functions like `label_dollar()` designed to slot directly into the `labels=` argument. `scale_y_continuous(labels = label_dollar())` turns all your y-axis labels into the dollar equivalent
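
## scales

A small sketch of the difference between the two kinds of function (object names are illustrative): `dollar()` is applied to values and hands back text, while `label_percent()` hands back a *function* that can then be passed to `labels=`.

```{r, echo = TRUE}
library(scales)
# Applied directly to a value: the result is text, not a number
ten_dollars <- dollar(10)
ten_dollars
class(ten_dollars)
# label_ functions return a function, which is what labels= expects
pct_labeller <- label_percent()
pct_labeller(c(.01, .06))
```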
200 |
201 |
202 | ## scales
203 |
204 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
205 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge') +
206 | scale_y_continuous(labels = label_percent(), limits = c(0,.1))
207 | ```
208 |
209 | ## Scales
210 |
211 | The `label_` functions have lots of options! You can set the `accuracy` (precision), decide how to break up big numbers (`big.mark`) or scale things down to, say, thousands! (`scale=1/1000, suffix = 'k'`)
212 |
213 | ```{r, echo = TRUE, fig.height = 2, fig.width = 6}
214 | data(gapminder, package = 'gapminder')
215 | ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point() +
216 | scale_x_log10(labels = label_dollar(accuracy = 1, scale = 1/1000, suffix = 'k'))
217 | ```
218 |
219 | ## Documentation
220 |
221 | - There are a zillion options
222 | - Be sure to use `help(whatever)` before using something
223 | - Not sure what the function is? Check the **ggplot2** documentation page
224 | - Or just start typing `scale_` or whatever and see what pops up in autocomplete in RStudio
225 |
226 | ## Labels with labs()
227 |
228 | - Please label your axes and graphs!!
229 | - Label legends too by naming their axes
230 | - Subtitles and captions available too
231 | - Annotation and labeling data points we'll do next time
232 |
233 | ## All Kinds of Labels
234 |
235 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
236 | data(mtcars)
237 | mtcars <- mtcars %>% mutate(CarName = row.names(mtcars))
238 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + geom_point() +
239 | scale_x_log10() + scale_y_reverse() + scale_color_binned() +
240 | labs(x = 'Miles per Gallon', y = 'Horsepower', color = 'Car Weight',
241 | title = 'Title', subtitle = 'Subtitle', caption = 'Caption')
242 | ```
243 |
244 | ## Moving the Legend
245 |
246 | - Working with **ggplot2** legends is one of the trickier things
247 | - We'll do a lot more with aesthetic details next time, but for now let's just move it and outline it.
248 |
249 | ## Moving the Legend
250 |
251 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
252 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + geom_point() +
253 | scale_x_log10() + scale_y_reverse() + scale_color_binned() +
254 | labs(x = 'Miles per Gallon', y = 'Horsepower', color = 'Car Weight',
255 | title = 'Title', subtitle = 'Subtitle', caption = 'Caption') +
256 | theme(legend.position = c(.9,.3), legend.box.background = element_rect(color='black'))
257 | ```
258 |
259 | ## Or Removing It
260 |
261 |
262 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
263 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + geom_point() +
264 | scale_x_log10() + scale_y_reverse() + scale_color_binned() +
265 | labs(x = 'Miles per Gallon', y = 'Horsepower', color = 'Car Weight') +
266 | guides(color = 'none')
267 | ```
268 |
269 | ## Overlaid Graphs
270 |
271 | - In addition to adding chart elements, we can add whole new geometries over the top
272 | - Some stack naturally (adding a `geom_smooth` best fit line over top, for example)
273 | - For others we may have to change the data or aesthetic as we go
274 |
275 | ## Overlaid Graphs
276 |
277 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
278 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + geom_point() +
279 | scale_x_log10() + scale_y_reverse() + scale_color_binned() +
280 | labs(x = 'Miles per Gallon', y = 'Horsepower', color = 'Car Weight') +
281 | geom_smooth(method='lm', se = FALSE)
282 | ```
283 |
284 | ## Overlaid Graphs
285 |
286 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
287 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + geom_point() +
288 | scale_x_log10() + scale_y_reverse() + scale_color_binned() +
289 | labs(x = 'Miles per Gallon', y = 'Horsepower', color = 'Car Weight') +
290 | geom_text_repel(data = mtcars %>% slice(1:5),aes(label = CarName),hjust=-1)
291 | ```
292 |
293 | ## Facets
294 |
295 | - Separate graphs by group!
296 |
297 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
298 | ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point() +
299 | facet_wrap('cyl') +
300 | labs(x = 'Miles per Gallon', y = 'Horsepower', title = 'Horsepower vs. MPG by Cylinders')
301 | ```
302 |
303 |
304 | ## ggforce's facet_zoom
305 |
306 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
307 | library(ggforce)
308 | ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
309 | geom_point() +
310 | facet_zoom(x = Species == 'versicolor')
311 | ```
312 |
313 | ## Multiple-Graph Notes
314 |
315 | - For overlaying the *same type* of graph, say two line graphs for two variables, don't `+` different geometries. You're almost always better off doing a `pivot_longer()` first and just using `aes()`.
316 | - For sticking together multiple completely disparate graphs, don't use facets. Instead, load **patchwork** and combine them as desired (we'll discuss this later)
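
## Multiple-Graph Notes

A minimal sketch of the `pivot_longer()` approach, assuming the `economics` data that ships with **ggplot2** and two of its columns (`psavert` and `uempmed`) picked purely for illustration. The names `econ_long`, `series`, and `value` are made up for this example.

```{r, echo = TRUE, fig.height = 4, fig.width = 6}
library(tidyverse)
# Stack the two variables into one 'value' column, labeled by 'series'
econ_long <- economics %>%
  select(date, psavert, uempmed) %>%
  pivot_longer(cols = c(psavert, uempmed),
               names_to = 'series', values_to = 'value')
# One geom_line() draws both lines because color is mapped to series
ggplot(econ_long, aes(x = date, y = value, color = series)) +
  geom_line() +
  labs(x = NULL, y = 'Value', color = 'Series')
```

With the data in long form, the legend and colors come for free, and a third variable only means adding one more column name to `cols`.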
317 |
318 | ## Imitation and Flattery
319 |
320 | - Pick one of these graphs, or one from the previous lecture
321 | - Go into the geometry's help file and recreate an example, or the graph in these slides
322 | - Mess with the scales and labels! Try to get it to work!
323 |
--------------------------------------------------------------------------------
/Lecture_10_Styling_ggplot.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 10: Styling ggplot2"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | library(tidyverse)
19 | library(ggrepel)
20 | library(ggpubr)
21 | ```
22 |
23 |
24 |
25 | ## Styling ggplot
26 |
27 | ```{r, results = 'asis'}
28 | cat("
29 | ")
35 | ```
36 |
37 | - We should have a decent idea at this point of how to put a **ggplot2** graph together
38 | - But how do we make it look good?
39 |
40 | ## Today!
41 |
42 | - We'll talk about how to set aesthetic attributes along every axis
43 | - As well as overall styling and theming
44 | - Highlighting, putting graphs together, making them interactive, annotating...
45 | - **Think about this stuff**. Your graphs will look unprofessional if you don't, and I'll start grading down all-default styling or missed opportunities for improvement.
46 |
47 | ## Aesthetic Characteristics
48 |
49 | - As you'd expect, we'll be setting the values of certain aesthetic properties
50 | - Which ones apply depends on the geometry. See [this page](https://ggplot2.tidyverse.org/articles/ggplot2-specs.html). Some common ones:
51 | - color, linetype, fill, size, shape (and soon: linewidth)
52 | - Text aesthetics: family (font), fontface, hjust, vjust (see **extrafont** package to use all your system fonts)
53 |
54 | ## Decorating Axes vs. Decorating Geometries
55 |
56 | - When a characteristic is set *inside* of an `aes()`, it becomes an axis and must be mapped to a variable name
57 | - *Outside* of an `aes()` (and in the geometry), it's a setting applied to the entire geometry
58 |
59 | ## Aesthetic Characteristics
60 |
61 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
62 | # We've seen this before
63 | mtcars <- mtcars %>%
64 | mutate(Transmission = factor(am, labels = c('Automatic','Manual')))
65 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) +
66 | geom_point()
67 | ```
68 |
69 | ## Aesthetic Characteristics
70 |
71 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
72 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission,
73 | size = wt, shape = Transmission)) +
74 | geom_point()
75 | ```
76 |
77 | ## Aesthetic Characteristics
78 |
79 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
80 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission,
81 | size = wt)) +
82 | geom_point(shape = 10)
83 | ```
84 |
85 | ## Aesthetic Characteristics
86 |
87 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
88 | ggplot(mtcars, aes(x = Transmission, fill = factor(cyl))) +
89 | geom_bar(position = 'dodge', linetype = 'dashed', color = 'black')
90 | ```
91 |
92 | ## Coordinates
93 |
94 | - Everything *behind* the geometries can be manipulated as well
95 | - For example, we can change the coordinate system (usually to `coord_flip` it for bar graphs)
96 |
97 | ## Coordinates
98 |
99 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
100 | library(ggalt)
101 | mtcars2 <- mtcars %>% group_by(Transmission) %>% summarize(count = n())
102 | ggplot(mtcars2, aes(x = Transmission,y=count)) + geom_lollipop()
103 | ```
104 |
105 | ## Coordinates
106 |
107 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
108 | # geom_lollipop in particular has a 'horizontal' option anyway
109 | # so this is just for demonstration
110 | ggplot(mtcars2, aes(x = Transmission,y=count)) +
111 | geom_lollipop(color = 'red', size = 2) + coord_flip()
112 | ```
113 |
114 | ## Axes
115 |
116 | - We can control axis text with `axis.text` and the ticks we mark on the `scale`
117 | - `element_text` provides text objects
118 |
119 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6}
120 | ggplot(mtcars2, aes(x = Transmission,y=count)) +
121 | geom_lollipop(color = 'red', size = 2) + coord_flip() +
122 | labs(x = '', y = '') +
123 | scale_y_continuous(breaks = c(10,20), limits = c(0,20)) +
124 | theme(axis.text.x = element_text(size = 12,family='serif'),
125 | axis.text.y = element_text(size = 16,family='serif', angle = 10))
126 | ```
127 |
128 | ## Axes
129 |
130 | ```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6}
131 | ggplot(mtcars2, aes(x = Transmission,y=count)) +
132 | geom_lollipop(color = 'red', size = 2) + coord_flip() +
133 | labs(x = '', y = '') +
134 | scale_y_continuous(breaks = c(10,20), limits = c(0,20)) +
135 | theme(axis.text.x = element_text(size = 12,family='serif'),
136 | axis.text.y = element_text(size = 16,family='serif', angle = 10))
137 | ```
138 |
139 |
140 | ## Axes
141 |
142 | - Order for discrete axes can be changed with `limits`
143 |
144 | ```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6}
145 | ggplot(mtcars2, aes(x = Transmission,y=count)) +
146 | geom_lollipop(color = 'red', size = 2) + coord_flip() +
147 | scale_x_discrete(limits = c("Manual", "Automatic"))
148 | ```
149 |
150 | ## Theming
151 |
152 | - Pretty much all elements of presentation that aren't in a geometry can be controlled with `theme()`
153 | - The options are endless! See `help(theme)`
154 | - We set different aspects of the theme using `element_` functions like `element_text()`, `element_line()`, `element_rect()` which take aesthetic settings like `size`, `color`, etc.
155 |
156 | ## Axes
157 |
158 | - Change axis and tick lines with `axis.line` and `element_line`
159 | - Most everything can be changed generally (`axis.line` or even `line`) or specifically (`axis.ticks.x`)
160 |
161 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6}
162 | ggplot(mtcars2, aes(x = Transmission,y=count)) +
163 | geom_lollipop(color = 'red', size = 2) + coord_flip() +
164 | labs(x = '', y = '') +
165 | scale_y_continuous(breaks = c(10,20), limits = c(0,20)) +
166 | theme(axis.line = element_line(color = 'red'),
167 | axis.ticks.x = element_line(size = 5))
168 | ```
169 |
170 | ## Axes
171 |
172 | ```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6}
173 | ggplot(mtcars2, aes(x = Transmission,y=count)) +
174 | geom_lollipop(color = 'red', size = 2) + coord_flip() +
175 | labs(x = '', y = '') +
176 | scale_y_continuous(breaks = c(10,20), limits = c(0,20)) +
177 | theme(axis.line = element_line(color = 'red'),
178 | axis.ticks.x = element_line(size = 5))
179 | ```
180 |
181 | ## Background
182 |
183 | - Work on the `panel` to change what goes behind that geometry! `element_rect` might come up!
184 | - Pretty much ANY element can be eliminated with `element_blank()`
185 | - This gonna be ugly
186 |
187 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6}
188 | ggplot(mtcars2, aes(x = Transmission,y=count)) +
189 | geom_lollipop(color = 'red', size = 2) + coord_flip() +
190 | theme(panel.background = element_rect(color = 'blue', fill = 'yellow'),
191 | panel.grid.major.x = element_line(color = 'green'),
192 | panel.grid.major.y = element_blank())
193 | ```
194 |
195 | ## Background
196 |
197 | ```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6}
198 | ggplot(mtcars2, aes(x = Transmission,y=count)) +
199 | geom_lollipop(color = 'red', size = 2) + coord_flip() +
200 | theme(panel.background = element_rect(color = 'blue', fill = 'yellow'),
201 | panel.grid.major.x = element_line(color = 'green'),
202 | panel.grid.major.y = element_blank())
203 | ```
204 |
205 | ## Legends
206 |
207 | - We showed a bit last time how legends can be changed.
208 | - Note in `element_text` (and elsewhere, like annotation or `geom_text`), `hjust` and `vjust` align text horizontally and vertically. Set 0/1 for Left/Right (`hjust`) or Bottom/Top (`vjust`)
209 | - `legend.position` is given as a proportion of the frame. .8 = 80% of the way to the right.
210 |
211 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6}
212 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
213 | theme(legend.text = element_text(hjust = .5, family = 'serif'),
214 | legend.title = element_text(hjust = 1, family = 'serif', face = 'bold'),
215 | legend.background = element_rect(color = 'black', fill = 'white'),
216 | legend.position = c(.8,.7))
217 | ```
218 |
219 | ## Legends
220 |
221 | ```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6}
222 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
223 | theme(legend.text = element_text(hjust = .5, family = 'serif'),
224 | legend.title = element_text(hjust = 1, family = 'serif', face = 'bold'),
225 | legend.background = element_rect(color = 'black', fill = 'white'),
226 | legend.position = c(.8,.7))
227 | ```
228 |
229 | ## Themes
230 |
231 | - The theme is immensely flexible. We didn't even cover everything, like the plot title/caption
232 | - You can save your `theme()` settings as an object to use repeatedly as a house style
233 | - Or use a prepackaged one! You can even tack on more `theme()` customization afterwards.
234 | - Many `theme_` settings you can tack on. I only use a few...
235 | - (many many others... want to look like FiveThirtyEight or The Economist? Check **ggthemes**)
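
## Themes

A rough sketch of the "house style" idea: start from a prepackaged theme, tack on a few `theme()` tweaks, and store the result in an object. The name `house_style` and the specific settings are only examples, not a recommended style.

```{r, echo = TRUE, fig.height = 4, fig.width = 6}
library(ggplot2)
# house_style is a made-up name; the settings are arbitrary examples
house_style <- theme_classic(base_family = 'serif') +
  theme(legend.position = 'bottom',
        plot.title = element_text(face = 'bold'))
# Add the saved theme to any plot, just like a prepackaged one
p_house <- ggplot(mtcars, aes(x = mpg, y = hp, color = factor(am))) +
  geom_point() +
  labs(title = 'Horsepower vs. MPG', color = 'Transmission (am)') +
  house_style
p_house
```

Since `house_style` is just an object, it can live in a script you `source()` at the top of each document.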
236 |
237 | ## Prepackaged Themes: Default
238 |
239 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6}
240 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point()
241 | ```
242 |
243 | ## Prepackaged Themes: Classic
244 |
245 | ```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6}
246 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
247 | theme_classic()
248 | ```
249 |
250 | ## Prepackaged Themes: void (for things like flowcharts)
251 |
252 | ```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6}
253 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
254 | theme_void()
255 | ```
256 |
257 | ## Prepackaged Themes: tufte (in ggthemes)
258 |
259 | ```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6}
260 | library(ggthemes)
261 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
262 | theme_tufte()
263 | ```
264 |
265 | ## Prepackaged Themes: theme_pubr (in ggpubr), my go-to
266 |
267 | ```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6}
268 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
269 | theme_pubr()
270 | ```
271 |
272 | ## General Theming Notes
273 |
274 | - How can we think about theming generally?
275 | - We want, as always, to drive focus towards our story
276 | - This means not making the theme distracting, but using it to drive attention to what we want
277 | - Need focus on the x-axis labels? Make em bold!
278 | - Often we don't need distracting background color (as there is in the default theme). Usually get rid of it!
279 |
280 | ## Other Styling
281 |
282 | - Let's move on to some other ways we can build focus and clarity
283 | - Highlighting and annotations!
284 |
285 | ## Highlighting
286 |
287 | - Highlight certain values and gray others with the **gghighlight** package
288 |
289 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6}
290 | library(gghighlight); data(gapminder, package = 'gapminder')
291 | ggplot(gapminder, aes(x = year, y = lifeExp, color=country)) + geom_line(size = 1.5) +
292 | labs(x = NULL, y = "Life Expectancy", title = "North America Only") +
293 | scale_x_continuous(limits=c(1950,2015),
294 | breaks = c(1950,1970,1990,2010))+
295 | gghighlight(country %in% c('United States','Canada','Mexico'),
296 | unhighlighted_params = aes(size=.1),
297 | label_params=list(direction='y',nudge_x=10)) +
298 | theme_minimal(base_family='serif')
299 | ```
300 |
301 | ## Highlighting
302 |
303 |
304 | ```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6}
305 | library(gghighlight); data(gapminder, package = 'gapminder')
306 | ggplot(gapminder, aes(x = year, y = lifeExp, color=country)) + geom_line(size = 1.5) +
307 | labs(x = NULL, y = "Life Expectancy", title = "North America Only") +
308 | scale_x_continuous(limits=c(1950,2015),
309 | breaks = c(1950,1970,1990,2010))+
310 | gghighlight(country %in% c('United States','Canada','Mexico'),
311 | unhighlighted_params = aes(size=.1),
312 | label_params=list(direction='y',nudge_x=10)) +
313 | theme_minimal(base_family='serif')
314 | ```
315 |
316 | ## Highlighting
317 |
318 | - Or do it manually with `scale_X_manual()`
319 |
320 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6}
321 | gapminder %>% mutate(color_name = ifelse(country %in% c('United States','Canada','Mexico'), as.character(country), 'Other')) %>%
322 | ggplot(aes(x = year, y = lifeExp, group = country, color=color_name, size = color_name, alpha = color_name)) + geom_line() +
323 | geom_text(aes(label = ifelse(year == 2007 & color_name != 'Other', color_name,'')),
324 | hjust = 0, size = 13/.pt) +
325 | labs(x = NULL, y = "Life Expectancy", title = "North America Only") +
326 | scale_x_continuous(limits=c(1950,2025),
327 | breaks = c(1950,1970,1990,2010))+
328 | scale_color_manual(values = c('red','forestgreen','gray','blue')) +
329 | scale_size_manual(values = c(1.5,1.5,.1,1.5)) +
330 | scale_alpha_manual(values = c(1,1,.2,1)) +
331 | theme_minimal(base_family='serif') +
332 | guides(color = 'none', size = 'none', alpha = 'none')
333 | ```
334 |
335 | ## Highlighting
336 |
337 | ```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6}
338 | gapminder %>% mutate(color_name = ifelse(country %in% c('United States','Canada','Mexico'), as.character(country), 'Other')) %>%
339 | ggplot(aes(x = year, y = lifeExp, group = country, color=color_name, size = color_name, alpha = color_name)) + geom_line() +
340 | geom_text(aes(label = ifelse(year == 2007 & color_name != 'Other', color_name,'')),
341 | hjust = 0, size = 13/.pt) +
342 | labs(x = NULL, y = "Life Expectancy", title = "North America Only") +
343 | scale_x_continuous(limits=c(1950,2025),
344 | breaks = c(1950,1970,1990,2010))+
345 | scale_color_manual(values = c('red','forestgreen','gray','blue')) +
346 | scale_size_manual(values = c(1.5,1.5,.1,1.5)) +
347 | scale_alpha_manual(values = c(1,1,.2,1)) +
348 | theme_minimal(base_family='serif') +
349 | guides(color = 'none', size = 'none', alpha = 'none')
350 | ```
351 |
352 | ## Annotation
353 |
354 | - There are also many options for *annotating* our data
355 | - For direct data point annotation, especially of line graphs, you can do this with some data manipulation and `geom_text` (see previous slide)
356 | - For placing a block of text on the graph I recommend **ggannotate**
357 | - I also recommend looking into [**ggtext**](https://wilkelab.org/ggtext/) which allows rich-text formatting of all text on the graph, but we won't be covering it here.
358 |
359 | ## Direct Line Labels
360 |
361 | ```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6}
362 | p <- gapminder %>%
363 | group_by(country) %>%
364 | mutate(country_label = ifelse(year == max(year), as.character(country), NA_character_)) %>%
365 | filter(country %in% c('United States','Canada','Mexico')) %>%
366 | ggplot(aes(x = year, y = lifeExp, color=country)) + geom_line(size = 1.5) +
367 | scale_x_continuous(limits = c(1950,2025))+ theme_classic()+guides(color='none')+
368 | geom_text(aes(x = year + 2,label = country_label), hjust = 0)
369 | p
370 | ```
371 |
372 | ## Overall Annotation
373 |
374 | - You can use the regular **ggplot2** `annotate()` function, but this takes some work
375 | - **ggannotate** will let you work hands-on with it, and it gives you code at the end
376 | - Also works for adding lines and arrows
377 | - The `geom_mark` functions from **ggforce** are good for shape highlights ([here](https://ggforce.data-imaginist.com/reference/geom_mark_circle.html))
378 | - Excel-esque but the end result is more rigorous
379 | - Install with `remotes::install_github` `("mattcowgill/ggannotate")`, not `install.packages()`
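
## Overall Annotation

For comparison, here is roughly what the by-hand `annotate()` route looks like. The coordinates and label text are chosen by eye for this example, which is exactly the fiddly part that **ggannotate** takes over. `p_note` is a made-up name.

```{r, echo = TRUE, fig.height = 4, fig.width = 6}
library(ggplot2)
# Positions are in data units and were picked by trial and error
p_note <- ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point() +
  annotate(geom = 'text', x = 22, y = 250, hjust = 0,
           label = 'High-MPG cars tend to\nhave less horsepower') +
  annotate(geom = 'segment', x = 27, y = 225, xend = 30, yend = 120,
           arrow = arrow(length = unit(0.1, 'inches')))
p_note
```

Each `annotate()` call adds one layer, so the text and the arrow are positioned separately.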
380 |
381 | ## ggannotate
382 |
383 | - Let's see what happens when we run this...
384 |
385 |
386 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6}
387 | # Remember we created p a few slides ago
388 | ggannotate::ggannotate(p)
389 | ```
390 |
391 | ## ggannotate
392 |
393 | - We get code chunks to add to our graph to get our annotations
394 |
395 | ```{r, echo=FALSE, eval =TRUE, fig.height = 5, fig.width = 7}
396 | p + geom_text(data = data.frame(x = 1994.88673320614, y = 60.8970008395899, label = "Mexico catches up considerably\nthrough the latter half of the 20th century."),
397 | mapping = aes(x = x, y = y, label = label),
398 | inherit.aes = FALSE) +
399 | geom_curve(data = data.frame(x = 1994.11374242598, y = 63.8009475579236, xend = 1991.27944289874, yend = 69.80243744248),
400 | mapping = aes(x = x, y = y, xend = xend, yend = yend),
401 | arrow = arrow(30L, unit(0.1, "inches"),
402 | "last", "closed"),
403 | inherit.aes = FALSE)
404 | ```
405 |
406 | ## Export
407 |
408 | - Alright, we've crafted our perfect graph
409 | - What do we do now?
410 | - `ggsave()` will save our result to file (or we can Export Image in the Plots frame of RStudio)
411 | - But we have some more options still!
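
## Export

A small sketch of a `ggsave()` call. The file name, size, and resolution are placeholders to adjust for wherever the graph is going, and `p_export` is a made-up name.

```{r, echo = TRUE, eval = FALSE}
library(ggplot2)
p_export <- ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point()
# The file type is picked from the extension; width and height are in inches here
ggsave('my_graph.png', plot = p_export,
       width = 6, height = 4, units = 'in', dpi = 300)
```

If `plot` is left out, `ggsave()` saves the last plot displayed, so naming the plot explicitly is the safer habit.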
412 |
413 | ## Sticking Graphs Together
414 |
415 | - If you have multiple plots that go together, consider using **patchwork** to put them all into one
416 | - With the library loaded, you can just add plots together and they get stitched!
417 | - Let's stitch together some of the plots from this presentation
418 | - `+` to stick together, `|` to put "beside", `-` to move to next column, `/` for the next row
419 |
420 |
421 | ## Patchwork
422 |
423 | ```{r, echo = FALSE, eval = TRUE}
424 | p1 <- ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
425 | theme_tufte()
426 | p2 <- ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission,
427 | size = wt, shape = Transmission)) +
428 | geom_point()
429 | p3 <- ggplot(mtcars2, aes(x = Transmission,y=count)) +
430 | geom_lollipop(color = 'red', size = 2) + coord_flip() +
431 | theme(panel.background = element_rect(color = 'blue', fill = 'yellow'),
432 | panel.grid.major.x = element_line(color = 'green'),
433 | panel.grid.major.y = element_blank())
434 | p4 <- ggplot(mtcars, aes(x = Transmission, fill = factor(cyl))) +
435 | geom_bar(position = 'dodge', linetype = 'dashed', color = 'black')
436 | ```
437 |
438 |
439 | ```{r, echo= TRUE, eval = TRUE, fig.height = 5, fig.width = 12}
440 | library(patchwork)
441 | # Named and saved these earlier
442 | (p1 | p2 - p3)/p4
443 | ```
444 |
445 |
446 | ## Practice
447 |
448 | - We may not have time today to practice much of this
449 | - But I want you to take a basic graph, a real basic one, like this:
450 | - `ggplot(mtcars, aes(x = mpg, y = hp)) +` `geom_point()`
451 | - And just see how you can make it look. Mess around with it
452 | - The only goal is to try stuff out.
--------------------------------------------------------------------------------
/Lecture_11_Bob_Ross.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Painting With the Prof"
3 | author: "Nick Huntington-Klein"
4 | date: "`r Sys.Date()`"
5 | output: html_document
6 | ---
7 |
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = FALSE)
10 | ```
11 |
12 | Our setup code:
13 |
14 | ```{r, echo = TRUE}
15 | # install.packages(c('ggpubr','ggforce','ggalt'))
16 |
17 | library(tidyverse)
18 | library(ggpubr)
19 | library(ggforce)
20 | library(ggalt)
21 |
22 | google <- read_csv('https://github.com/NickCH-K/causalbook/raw/main/EventStudies/google_stock_data.csv') %>%
23 | pivot_longer(cols = c('Google_Return','SP500_Return'))
24 | spend <- read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/gss_spending.csv')
25 | #https://vincentarelbundock.github.io/Rdatasets/doc/stevedata/gss_spending.html
26 | ```
27 |
28 | # Google's Alphabet Announcement
29 |
30 | On August 10, 2015, Google announced that they were reorganizing the company to fit underneath Alphabet, a new umbrella company. The GOOG stock would now be stock in Alphabet. Did this announcement affect the Google stock price?
31 |
32 | ```{r, echo = FALSE}
33 | gdata <- google %>%
34 | mutate(name = case_when(
35 | name == 'Google_Return' ~ 'Google Return',
36 | name == "SP500_Return" ~ "S&P 500 Return"
37 | ))
38 |
39 | ggplot(gdata, aes(x = Date, y = value, linetype = name, color = name)) +
40 | geom_line(size = 1) +
41 | geom_vline(aes(xintercept = as.Date('2015-08-10')), linetype = 'dashed') +
42 | annotate(geom = 'label',x = as.Date('2015-08-10'), y = .1,
43 | label = 'Alphabet\nAnnouncement',
44 | family = 'serif', size = 13/.pt,
45 | hjust = .5)+
46 | scale_y_continuous(labels = scales::percent) +
47 | scale_color_manual(values = c('red','gray')) +
48 | theme_pubr() +
49 | theme(text = element_text(family = 'serif', size = 13),
50 | legend.position = c(.2, .9)) +
51 | labs(y = 'Daily Return', title = 'Google Stock Return Around Alphabet Announcement',
52 | color = NULL) +
53 | guides(linetype = 'none') +
54 | facet_zoom(xlim = c(as.Date('2015-08-05'),as.Date('2015-08-15')), zoom.size = 1)
55 |
56 | ```
57 |
58 |
59 | # Life by Education
60 |
61 | ```{r, echo = TRUE}
62 | spend <- read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/gss_spending.csv')
63 | #https://vincentarelbundock.github.io/Rdatasets/doc/stevedata/gss_spending.html
64 |
65 | spend <- spend %>%
66 | dplyr::filter(!is.na(degree)) %>%
67 | transmute(BA = degree >= 3,
68 | income_above_median = rincom16 >= median(rincom16, na.rm = TRUE),
69 | conservative = polviews >= 5,
70 | news_reader = news %in% 1:2,
71 | full_time = wrkstat == 1) %>%
72 | group_by(BA) %>%
73 | summarize(across(everything(), function(x) mean(x, na.rm = TRUE)))
74 |
75 | vtable::vtable(spend, labels = c('Has Bachelor or Grad Degree',
76 | 'Income above median',
77 | '5-7 on political views scale',
78 | 'Reads news at least a few times a week or daily',
79 | 'Works full-time'))
80 | ```
81 |
82 | ```{r, echo = FALSE}
83 | knitr::kable(spend) %>% kableExtra::kable_styling(bootstrap_options = 'striped')
84 | ```
85 |
86 | ```{r, echo = TRUE}
87 | spend <- spend %>%
88 | pivot_longer(cols = c('income_above_median','conservative','news_reader','full_time'))
89 | spend <- spend %>%
90 | pivot_wider(id_cols = name, names_from = BA) %>%
91 | rename(ba_value = `TRUE`, no_ba_value = `FALSE`)
92 | ```
93 |
94 | ```{r}
95 | ggplot(spend, aes(x = no_ba_value, xend = ba_value, y = name)) +
96 | geom_dumbbell(size_x = 3, size_xend = 3, colour = 'black', colour_x = 'red', colour_xend = 'blue') +
97 | labs(x = 'Proportion', y = NULL,
98 | title = 'A Different Financial World for BA-Holders') +
99 | annotate(geom = 'text', x = spend$no_ba_value[3], y = 'news_reader', label = 'No BA', vjust = -1) +
100 | annotate(geom = 'text', x = spend$ba_value[3], y = 'news_reader', label = 'BA', vjust = -1) +
101 | theme_minimal() +
102 | theme(panel.grid.major.y = element_blank(),
103 | panel.grid.minor.y = element_blank(),
104 | text = element_text(size = 15)) +
105 | scale_x_continuous(labels = scales::percent, limits = c(0,1)) +
106 | scale_y_discrete(labels = c( 'Conservative', 'Works Full-Time','Income above Median','Frequent News Reader'))
107 | ```
--------------------------------------------------------------------------------
/Lecture_12_Hexmap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_12_Hexmap.png
--------------------------------------------------------------------------------
/Lecture_12_State_Heatmap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_12_State_Heatmap.png
--------------------------------------------------------------------------------
/Lecture_13_climate.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_13_climate.png
--------------------------------------------------------------------------------
/Lecture_14_Clutter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_Clutter.png
--------------------------------------------------------------------------------
/Lecture_14_Color.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_Color.png
--------------------------------------------------------------------------------
/Lecture_14_Dashboards_in_R.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 14: Dashboards in R"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | library(tidyverse)
19 | library(paletteer)
20 | options("kableExtra.html.bsTable" = T)
21 | rinlinevarname <- function(code){
22 | html <- '``` `CODE` ```
'
23 | sub("CODE", code, html)
24 | }
25 | ```
26 |
27 |
28 | ## Dashboards
29 |
30 | ```{r, results = 'asis'}
31 | cat("
32 | ")
38 | ```
39 |
40 | - Everything we've done so far has focused on *single images*
41 | - But there's only so much you can fit on one image! (although it's still a lot if done right)
42 | - The hot new thing is *dashboards*
43 | - Dashboards are *many* visualizations and tables on a single page to let you really explore the data
44 | - Often, interactive - hover-over for more information, or use menus to change which of the underlying data you're looking at
45 |
46 | ## Tips for Dashboards
47 |
48 | - (Inspired by [this article](https://www.tableau.com/about/blog/2017/10/7-tips-and-tricks-dashboard-experts-76821).)
49 | - Avoid clutter both in presentation *and* interactivity
50 | - Use a grid
51 | - Use consistent and clear colors and fonts
52 | - No scrolling! Dashboards aren't articles (those are neat too though)
53 | - Draw focus within visualizations as always (use differences *between* visualizations as well)
54 | - Also draw focus *between* visualizations with location and size
55 |
56 | ## Avoid Clutter
57 |
58 | - We know this one!
59 | - In the context of dashboards, it also means not presenting *all* the results you can
60 | - Think about what's redundant, what's unnecessary, what's distracting, and what's likely to be misinterpreted
61 | - Get rid of 'em!
62 | - Also avoid big walls of color
63 |
64 | ## Avoid Clutter
65 |
66 | ```{r}
67 | knitr::include_graphics('Lecture_14_Clutter.png')
68 | ```
69 |
70 |
71 | ## Fonts and Colors
72 |
73 | - Big and bold for up-top conclusions
74 | - Smaller, generally serif fonts for other graphs
75 | - Avoid using LOTS of fonts
76 | - You can help the reader see the structure by having one font for BIG, another for medium, another for small
77 |
78 | ## Fonts and Colors
79 |
80 | - When it comes to color, one or two is often enough!
81 | - Instead of lots of color designating groups, standard here is to break groups into single-group graphs (facets)
82 |
83 |
84 | ```{r}
85 | knitr::include_graphics('Lecture_14_Color.png')
86 | ```
87 |
88 | ## Focus
89 |
90 | - As always, figure out the story and focus on that
91 | - Buzzword here is KPI: "Key performance indicator". Make it big, make it stand out!
92 | - Notice how readable ALL of these have been at tiny sizes!
93 |
94 | ```{r}
95 | knitr::include_graphics('Lecture_14_KPI.png')
96 | ```
97 |
98 |
99 | ## Examples
100 |
101 | - Let's explore each of these examples and see what works and what might not
102 | - [Year-to-Date Income Statement](https://public.tableau.com/views/IncomeStatement_4/IncomeStatement?:embed=y&:display_count=yes&publish=yes&:showVizHome=no)
103 | - [NBA Scoring Dashboard](https://rpubs.com/jcheng/nba1)
104 | - [Economic Indicators for the USA](https://research.stlouisfed.org/dashboard/1151)
105 |
106 | ## Dashboards in R
107 |
108 | - R dashboards provide you access to a huge range of free premade graphical tools through htmlwidgets
109 | - See the [htmlwidgets gallery](https://www.htmlwidgets.org/showcase_leaflet.html)
110 | - Good for everything, but the maps do really stand out
111 | - As you learn these tools you can include them not just in dashboards but in RMarkdown docs, too
112 | - And also, it's not just htmlwidgets, but also Shiny!
113 |
114 | ## Shiny
115 |
116 | - Shiny is more internal to R than htmlwidgets; it's an R-specific interactive site builder
117 | - You can add user controls to allow them to change parameters around and see the results
118 | - Great for dashboards, but even beyond that, learning Shiny for the purpose of dashboards will also give you the tools to create data-based *apps* that you can upload on their own
119 | - Super cool stuff you can do if you branch out with this!
120 |
121 | ## On to dashboards
122 |
123 | - There are two approaches to doing dashboards in R. One is Shiny-focused, and the other is htmlwidgets-focused (but lets you include Shiny stuff)
124 | - We'll be going with the latter, as it's easier, but the Shiny one is good too.
125 | - Our approach will be to use **flexdashboard**
126 | - Install **flexdashboard** with `install.packages('flexdashboard')`
127 | - Then you can open a new Flex Dashboard by going File $\rightarrow$ New File $\rightarrow$ R Markdown $\rightarrow$ From Template $\rightarrow$ Flex Dashboard.
128 | - Plenty of info on [https://rmarkdown.rstudio.com/flexdashboard/](https://rmarkdown.rstudio.com/flexdashboard/)
129 |
130 | ## What Do We See?
131 |
132 | ```{r}
133 | knitr::include_graphics('Lecture_14_Flexdashboard.png')
134 | ```
135 |
136 | ## What Do We See?
137 |
138 | - Up top is the YAML - we know this from working with RMarkdown docs. Basic info about the doc itself
139 | - Then we have our *layout structure* - columns of stuff, with multiple graphs per column
140 | - If we prefer, we can lay things out row-wise (see next slide), or use a "storyboard" layout, which goes from page to page like a flipbook
141 | - Then within each column we have our graphs, labeled with three hashes
142 |
143 | ## Row orientation
144 |
145 | ```{r}
146 | knitr::include_graphics('Lecture_14_Flexdashboard_row.png')
147 | ```
148 |
149 | ## Content
150 |
151 | - We can of course fill those charts with regular-ol **ggplot2** output
152 | - (or **gganimate**, or whatever else produces an image file)
153 | - Or Shiny stuff, which we'll get to next time
154 | - But also there are a bunch of packages designed to output **htmlwidgets** that we can include! Many of them look real nice
155 | - Let's peruse: [The HTMLWidgets R Gallery](http://gallery.htmlwidgets.org/)
156 | - We'll go more into detail on these next time
157 |
158 |
159 | ## Material
160 |
161 | - When it comes to any complex programmatic system like this, you always want to look at the docs at least a little
162 | - Let's just spend some time walking through [https://rmarkdown.rstudio.com/flexdashboard/](https://rmarkdown.rstudio.com/flexdashboard/) if only to see what's possible and what we need to know
163 | - Then come back here
164 |
165 | ## Practice
166 |
167 | - Open a Flex Dashboard template
168 | - Load any data set (in the Dashboard code)
169 | - Make a Flex Dashboard and set its `theme`
170 | - Include three graphs: one regular **ggplot2**, that same graph again but with a different theme, and one graph from [http://gallery.htmlwidgets.org/](http://gallery.htmlwidgets.org/) - an easy pick is [dygraphs](https://rstudio.github.io/dygraphs/) line graphs (although you do have to format your data as `xts`... try the **tbl2xts** package and its `tbl_xts()` function)
171 | - If you can make it look nice that's great! But focus more on getting it to work.
--------------------------------------------------------------------------------
/Lecture_14_Example_Dashboard.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Some Categories"
3 | output:
4 | flexdashboard::flex_dashboard:
5 | orientation: columns
6 | vertical_layout: fill
7 | ---
8 |
9 | ```{r setup, include=FALSE}
10 | library(flexdashboard)
11 | library(collapsibleTree)
12 | library(gapminder)
13 | library(tidyverse)
14 | library(plotly)
15 | library(ggthemes)
16 | data(gapminder)
17 | gapminder <- gapminder %>%
18 | filter(year == max(year)) %>%
19 | arrange(-pop) %>%
20 | slice(1:50) %>%
21 | mutate(GDP = gdpPercap*pop)
22 | ```
23 |
24 | Column {data-width=450}
25 | -----------------------------------------------------------------------
26 |
27 | ### GDP vs. Population
28 |
29 | ```{r}
30 | p <- ggplot(gapminder,
31 | aes(x = pop, y = GDP, color = continent)) +
32 | geom_point() +
33 | scale_x_log10(labels = function(x) scales::number(x, scale = 1/1000000)) +
34 | scale_y_log10(labels = function(x) scales::dollar(x, scale = 1/1000000000, accuracy = 1)) +
35 | theme_tufte() +
36 | guides(color = "none") +
37 | labs(x= "Population (Millions)", y = "GDP (Billions)") +
38 | annotate(geom = 'label', x = 300000000, y = 500000000000, label = 'Unsurprisingly, bigger\ncountries have bigger GDPs', hjust= 0, vjust = 0)
39 | p
40 | ```
41 |
42 | ### GDP vs. Population (Interactive)
43 |
44 | ```{r}
45 | ggplotly(p)
46 | ```
47 |
48 | Column {data-width=550}
49 | -----------------------------------------------------------------------
50 |
51 | ### Percentage of GDP and Continent GDP
52 |
53 | ```{r}
54 | gapminder %>%
55 | collapsibleTreeSummary(
56 | hierarchy = c("continent","country"),
57 | root = "50 Biggest Countries",
58 | percentOfParent = TRUE,
59 | width = 800,
60 | attribute = "GDP",
61 | zoomable = FALSE
62 | )
63 | ```
64 |
65 |
--------------------------------------------------------------------------------
/Lecture_14_Flexdashboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_Flexdashboard.png
--------------------------------------------------------------------------------
/Lecture_14_Flexdashboard_row.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_Flexdashboard_row.png
--------------------------------------------------------------------------------
/Lecture_14_Grids.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_Grids.png
--------------------------------------------------------------------------------
/Lecture_14_KPI.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_KPI.png
--------------------------------------------------------------------------------
/Lecture_15_Example_Shiny.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "mtcars Dashboard with Shiny"
3 | output:
4 | flexdashboard::flex_dashboard:
5 | orientation: columns
6 | vertical_layout: fill
7 | runtime: shiny
8 | ---
9 |
10 | ```{r global, include=FALSE}
11 | library(flexdashboard)
12 | library(tidyverse)
13 | data(mtcars)
14 | mtcars <- mtcars %>%
15 | mutate(Transmission = factor(am, labels = c('Automatic','Manual')))
16 | ```
17 |
18 | Column {.sidebar}
19 | -----------------------------------------------------------------------
20 |
21 | The mean of a variable by whether the car is an automatic or a manual.
22 |
23 | ```{r}
24 | selectInput("varname", label = "Variable to Analyze:",
25 | choices = names(mtcars), selected = 'mpg')
26 |
27 | textInput('col1', label = 'Color for Automatics', value = 'red')
28 |
29 | textInput('col2', label = 'Color for Manuals', value = 'black')
30 | ```
31 |
32 | Column
33 | -----------------------------------------------------------------------
34 |
35 | ### Mean by Transmission Type
36 |
37 | ```{r}
38 | renderPlot({
39 | ggplot(mtcars, aes_string(x = 'Transmission', y = input$varname, fill = 'Transmission')) +
40 | geom_bar(stat = 'summary', fun = 'mean') +
41 | labs(y = input$varname) +
42 | theme_minimal() +
43 | scale_fill_manual(values = c(ifelse(input$col1 %in% colors(), input$col1, 'black'),
44 | ifelse(input$col2 %in% colors(), input$col2, 'black'))) +
45 | guides(fill = "none")
46 | })
47 |
48 |
49 | ```
50 |
51 |
52 |
--------------------------------------------------------------------------------
/Lecture_15_Interactive_Visuals_in_R.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 15: Interactive Visuals in R"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | library(tidyverse)
19 | library(paletteer)
20 | library(DT)
21 | library(dygraphs)
22 | library(leaflet)
23 | library(vtable)
24 | library(ggiraph)
25 | options("kableExtra.html.bsTable" = T)
26 | rinlinevarname <- function(code){
27 | html <- '``` `CODE` ```
'
28 | sub("CODE", code, html)
29 | }
30 | ```
31 |
32 |
33 | ## Dashboards
34 |
35 | ```{r, results = 'asis'}
36 | cat("
37 | ")
43 | ```
44 |
45 | - Y'know, back in *my* day we had static images and gosh darnit we appreciated what we had!!!
46 | - These days, it's sort of expected that a lot of visuals are interactive, especially dashboards
47 | - Thankfully there are some straightforward ways to make interactive visuals in R
48 | - Note that these will all be reliant on HTML. You won't be able to output these to Word or PDF etc.
49 |
50 | ## Dashboards
51 |
52 | Three main approaches we'll take today:
53 |
54 | - **htmlwidgets**, focusing on **leaflet** (maps), **DT** (tables), and **dygraphs** (line graphs and time series)
55 | - **ggiraph** for interactivity via tooltip closely tied to **ggplot2**
56 | - **shiny** controls in our **flexdashboard** dashboards
57 |
58 | But of course there is a universe of other stuff
59 |
60 | ## htmlwidgets
61 |
62 | - **htmlwidgets** is a set of R front-ends for a bunch of different JavaScript-based visualization packages
63 | - We can stroll through the gallery [here](https://www.htmlwidgets.org/showcase_leaflet.html)
64 | - There are many, but I'll focus on just a few to recommend:
65 | - **leaflet** for maps, **DT** for tables, and **dygraphs** for line graphs and time series
66 |
67 | ## DT
68 |
69 | - Let's do an easy one first. **DT** makes tables interactive. No futzin'
70 |
71 | ```{r, echo = TRUE}
72 | iris %>% vtable(out = 'return', lush = TRUE) %>% datatable()
73 | ```
74 |
75 | ## dygraphs
76 |
77 | - For plotting time series data as a line graph! Requires time series data (`xts`) input
78 |
79 | ```{r, echo = TRUE}
80 | ggplot2::economics %>% tbl2xts::tbl_xts(cols_to_xts = 'uempmed') %>%
81 | dygraph(ylab = 'Median Weeks Unemployed and the 08-09 recession', height = 300) %>% dyShading(from = '2008-07-01', to = '2009-07-01')
82 | ```
83 |
84 | ## leaflet
85 |
86 | - Leaflet is for making maps. It's quite easy (and while not **ggplot2**-compatible, functions in a similar way to **ggplot2** in a lot of ways), especially since it lets you skip the hardest part of graphing maps: messing with *shapefiles*
87 | - Shapefiles are basically the data of the map itself, and you have to find it, download it, and then deal with the eight zillion different file formats and converting between them
88 | - It's definitely possible, and not even all that bad once you've done a few (and there are plenty of resources out there for working with, say, **ggmap** or `geom_polygon`) but also you can just skip that with **leaflet**
89 | - (also, just a heads up, but the *Excel* map-making tool is also pretty neat and doesn't require shapefiles)
90 |
91 | ## leaflet
92 |
93 | - The example uses location markers, but area/polygons work too (or a dataset of entries so you don't have to put them in one at a time)
94 |
95 |
96 | ```{r, echo = TRUE, eval = FALSE}
97 | leaflet() %>% addTiles() %>%
98 | addCircleMarkers(lng = -122.3172284009, lat = 47.6105934921, popup = 'Seattle University', color = '#AA0000') %>%
99 | addCircleMarkers(lng = -122.303200, lat = 47.655548, popup = 'University of Washington', color = '#363C74')
100 | ```
101 |
102 | ## leaflet
103 |
104 | ```{r, echo = FALSE, eval = TRUE}
105 | leaflet() %>% addTiles() %>%
106 | addCircleMarkers(lng = -122.3172284009, lat = 47.6105934921, popup = 'Seattle University', color = '#AA0000') %>%
107 | addCircleMarkers(lng = -122.303200, lat = 47.655548, popup = 'University of Washington', color = '#363C74')
108 | ```
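
The "dataset of entries" version mentioned earlier can be sketched like this. It's a minimal sketch that reuses the two campuses from the slide; the data frame and its column names are my own.

```r
library(leaflet)

# One row per marker; coordinates copied from the example above
campuses <- data.frame(
  name  = c("Seattle University", "University of Washington"),
  lng   = c(-122.3172284009, -122.303200),
  lat   = c(47.6105934921, 47.655548),
  color = c("#AA0000", "#363C74")
)

# The ~ tells leaflet to look the variable up in the data
m <- leaflet(data = campuses) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, popup = ~name, color = ~color)
m
```

Adding a marker is now just adding a row to the data.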
109 |
110 |
111 | ## plotly
112 |
113 | - Plotly is a big confusing graphing system of its very own that you have to learn from the ground up
114 | - Or you can just use the `ggplotly()` function in **plotly** to immediately turn your **ggplot** interactive
115 | - It's not super *flexible* but it is *fast*
116 | - You lose control over some stuff - tooltips, legends
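
You do keep one lever over the tooltips: the `tooltip` argument of `ggplotly()`. A minimal sketch, using `mtcars` as stand-in data (my choice, not part of the lecture example):

```r
library(ggplot2)
library(plotly)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", color = "Cylinders")

# tooltip = picks which aesthetics show up when you hover over a point
w <- ggplotly(p, tooltip = c("x", "y"))
w
```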
117 |
118 | ## ggiraph
119 |
120 | - Shouldn't we just be able to make an interactive **ggplot2** graph? Yes!
121 | - **ggiraph** slots super easily into standard **ggplot2** usage, and adds interactivity
122 | - It's not fully interactive but it adds highlighting and tooltips, which often is all you need
123 | - Just replace the geometry with an `_interactive` version and add a `tooltip` aesthetic. That's it!
124 |
125 | ## ggiraph
126 |
127 | ```{r, echo = TRUE, eval = FALSE}
128 | p <- ggplot2::economics %>% mutate(tooltip = paste0(date, '\n', scales::number(uempmed, accuracy = .1), ' weeks')) %>%
129 | ggplot(aes(x = date, y = uempmed)) + geom_line() + labs(x = 'Date', y = 'Median Unemployment Duration (Weeks)') +
130 | scale_y_continuous(labels = function(x) scales::number(x, accuracy = 1)) +
131 | geom_point_interactive(aes(tooltip = tooltip), size = .5) +
132 | theme_classic() + theme(text = element_text(family = 'serif', size = 13))
133 | girafe(ggobj = p)
134 | ```
135 |
136 | ## ggiraph
137 |
138 | ```{r, echo = FALSE, eval = TRUE}
139 | p <- ggplot2::economics %>% mutate(tooltip = paste0(date, '\n', scales::number(uempmed, accuracy = .1), ' weeks')) %>%
140 | ggplot(aes(x = date, y = uempmed)) + geom_line() + labs(x = 'Date', y = 'Median Unemployment Duration (Weeks)') +
141 | scale_y_continuous(labels = function(x) scales::number(x, accuracy = 1)) +
142 | geom_point_interactive(aes(tooltip = tooltip), size = .5) +
143 | theme_classic() + theme(text = element_text(family = 'serif', size = 13))
144 | girafe(ggobj = p)
145 | ```
146 |
147 | ## Making a tooltip
148 |
149 | - It's generally a good idea with **ggiraph** to make the tooltip by hand.
150 | - You can do this by `mutate`ing a new variable where you `paste0()` all the information you want together, using `\n` to break lines
151 |
152 | ```{r, echo = TRUE, eval = FALSE}
153 | p <- iris %>% group_by(Species) %>% summarize(`Sepal Length` = mean(Sepal.Length)) %>%
154 | mutate(tooltip = paste0('Species: ', stringr::str_to_title(Species), '\nAverage Sepal Length: ', `Sepal Length`)) %>%
155 | ggplot(aes(x = Species, y = `Sepal Length`, tooltip = tooltip)) +
156 | geom_col_interactive(fill = 'firebrick') +
157 | theme_minimal()
158 | girafe(ggobj = p)
159 | ```
160 |
161 | ## ggiraph downsides
162 |
163 | - Line graphs are a bit tricky, but you can also easily overlay a `geom_point_interactive` on top of a `geom_line`
164 | - While it works great out of the box in a memo, getting the sizing right inside of a **flexdashboard** isn't automatic, and in particular it doesn't like fitting in very *wide* dashboard spaces
165 | - You can adjust the size with `width_svg` and `height_svg` in the `girafe()` function (the older `ggiraph()` function also took `width`), but it's not perfect
166 | - Or just put it in a squarish space
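
A minimal sketch of the sizing fix, assuming a recent **ggiraph** where `girafe()` is the function that builds the widget. The 6-by-4 size is just a starting guess to tune against your own dashboard pane, and `mtcars` is stand-in data.

```r
library(ggplot2)
library(ggiraph)

p <- ggplot(mtcars, aes(x = wt, y = mpg, tooltip = rownames(mtcars))) +
  geom_point_interactive()

# width_svg and height_svg are in inches and set the shape of the graphic;
# a wider-than-tall shape suits a wide dashboard pane better
g <- girafe(ggobj = p, width_svg = 6, height_svg = 4)
g
```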
167 |
168 | ## Shiny
169 |
170 | - **shiny** is a way of making RMarkdown completely interactive. You can basically build web apps with it in a very straightforward way
171 | - **shiny** output can even be stored for free (with limited traffic access) on [shinyapps.io](https://www.shinyapps.io)
172 | - While there's a dedicated **shiny** environment for programming with complete flexibility, we'll focus on the kind that's easy to slot into a **flexdashboard** by adding `runtime: shiny` at the top
173 |
174 | ## Shiny
175 |
176 | - **shiny** gives us the ability to put interactive controls on our dashboard (selectors, etc.)
177 | - And then in `render` or `reactive()` functions, we can make our analysis update based on the controls!
178 | - So for one example, maybe there's a dropdown menu to pick a brand: Pepsi or Coke, and once you select it, the graphs update to only show Pepsi or Coke
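
The Pepsi-or-Coke idea can be sketched like this. The `sales` data is made up purely for illustration, and the object names are mine. In a real **flexdashboard** the control would sit unassigned in the sidebar chunk and the `renderPlot()` unassigned in a main chunk; they're assigned here only so the sketch runs on its own.

```r
library(shiny)
library(dplyr)
library(ggplot2)

# Made-up sales data, purely for illustration
sales <- data.frame(
  brand = rep(c("Pepsi", "Coke"), each = 3),
  month = rep(1:3, times = 2),
  units = c(10, 12, 9, 14, 13, 15)
)

# Sidebar: the dropdown. The user's choice is stored in input$brand
brand_picker <- selectInput("brand", label = "Brand:", choices = c("Pepsi", "Coke"))

# reactive() re-runs the filter whenever input$brand changes,
# and anything that calls brand_data() updates along with it
brand_data <- reactive({
  sales %>% filter(brand == input$brand)
})

# Main panel: the plot redraws each time brand_data() changes
brand_plot <- renderPlot({
  ggplot(brand_data(), aes(x = month, y = units)) + geom_line()
})
```

Putting the filter in `reactive()` pays off when several graphs share the same subset: the filtering happens once per change, not once per graph.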
179 |
180 | ## Shiny and flexdashboard
181 |
182 | - Let's walk through [https://rmarkdown.rstudio.com/flexdashboard/shiny.html](https://rmarkdown.rstudio.com/flexdashboard/shiny.html)
183 | - Pay close attention to the source code for the example
184 | - How do the input functions come together, and how are the choices referred to in the visualization?
185 | - (Even deeper documentation at [https://mastering-shiny.org](https://mastering-shiny.org) or the [Shiny Dev Center](https://shiny.rstudio.com/))
186 |
187 | ## Inputs
188 |
189 | - What kinds of inputs can we have?
190 | - Dropdown menu (`selectInput`), a slider (`sliderInput`), radio buttons (`radioButtons`), text (`textInput`), numbers (`numericInput`), a checkbox (`checkboxInput`), dates or date ranges (`dateInput` and `dateRangeInput`) and file upload (`fileInput`)
191 | - For each, there are options for layout and the available choices.
192 | - Let's look at one in more detail, `selectInput`
193 |
194 | ## selectInput
195 |
196 | - This could be well used to, for example, choose which of many ways to look at the data, or which variable to graph, or which subset of the data to look at
197 | - The syntax is `selectInput(inputId, label, choices,` `selected = NULL, multiple = FALSE,` `selectize = TRUE, width = NULL,` `size = NULL)`
198 | - (the = parts are just the defaults)
199 |
200 | ## selectInput
201 |
202 | - The `inputId` is the slot in `input$` where the result will be stored. So with `inputId = 'subset'`, you can later use `input$subset` to know what was selected. That's not what the user sees for a title though, they see `label`
203 | - `choices` are the options, with default `selected`, in a standard vector format. So maybe to choose whether to graph independent or chain restaurants, `choices = c('Choose Restaurant Type' = '',` `'Independent','Chain')`
204 | - `multiple` determines whether multiple options can be selected
205 | - Then the minute details
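
A minimal sketch of two of those details together: named `choices` (the name is what the user sees, the value is what gets stored) and `multiple = TRUE`. The display labels are made up, and `type` is the hypothetical column from the restaurant example.

```r
library(shiny)

# Names are what the user sees; values are what lands in input$restaurantType
picker <- selectInput(
  inputId  = "restaurantType",
  label    = "Type of Restaurant",
  choices  = c("Independent restaurants" = "Independent",
               "Chain restaurants" = "Chain"),
  selected = "Independent",
  multiple = TRUE
)

# With multiple = TRUE the input is a character vector, so filter with %in%,
# e.g. inside a render function: filter(type %in% input$restaurantType)
```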
206 |
207 |
208 | ## Output
209 |
210 | - You can include output to be updated as the options change with `renderPlot()` (for plots), `renderPrint()` (for any object being printed / shown on its own), `renderTable()` for tables of data, and `renderText()` for actual text output.
211 | - Inside of these functions we just take a regular plot / object / data table / text, refer to elements from the sidebar with `input$inputId`, wrap that in `{}`, and wrap THAT in the appropriate `render` function
212 |
213 | ## Output
214 |
215 | (Using fake data) in the global chunk: `library(tidyverse)` and `data(RestaurantData)`. In the sidebar column, an R chunk with:
216 |
217 | ```{r, echo = TRUE, eval = FALSE}
218 | selectInput('restaurantType', label = 'Type of Restaurant', choices = c('Choose Restaurant Type' = '','Independent','Chain'))
219 | ```
220 |
221 | ## Output
222 |
223 | And then in the next column,
224 |
225 | ```{r, echo = TRUE, eval = FALSE}
226 | renderTable({
227 | RestaurantData %>% filter(type == input$restaurantType)
228 | })
229 | ```
230 |
231 | This will show a table of the data, filtered to whichever restaurant type you pick (independent or chain).
232 |
233 | ## Example
234 |
235 | - Let's see the example dashboard I have [here](https://nickch-k.shinyapps.io/Lecture_18_Example_Shiny/) and also the example code.
236 | - I have used the `mtcars` data we've used many times before
237 | - I've included a `geom_bar(stat='summary',fun = 'mean')` graph summarizing the mean of a variable by the type of transmission
238 | - But I let you pick the variable! (with `selectInput`) (Hint: use `aes_string` to input a string variable to **ggplot2**; this means everything in it must be a string)
239 | - And also the color scheme (with `textInput`). Note if you put in something that's not a color (which I check with `%in% colors()`) it will correct to black.
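
A side note on `aes_string`: recent versions of **ggplot2** mark it as deprecated, and the suggested replacement is the `.data` pronoun inside a regular `aes()`. A minimal sketch of the same graph done that way, with a plain `varname` string standing in for `input$varname` so it runs outside of **shiny**:

```r
library(ggplot2)

# Stand-in for input$varname so the sketch runs outside of shiny
varname <- "mpg"

mtcars2 <- transform(
  mtcars,
  Transmission = factor(am, labels = c("Automatic", "Manual"))
)

# .data[[varname]] looks up the column named by the string;
# the other aesthetics can stay as ordinary unquoted names
p <- ggplot(mtcars2, aes(x = Transmission, y = .data[[varname]], fill = Transmission)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(y = varname) +
  guides(fill = "none")
p
```

Only the aesthetic that comes from the control needs `.data[[ ]]`, so nothing else has to be turned into a string.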
240 |
241 | ## The global chunk
242 |
243 | - Code in a **shiny** dash really comes in three parts: the `global` chunk, and then (as we've covered) the sidebar and the main part
244 | - Everything in `global` will only be run *once*, very handy and speedy for when you change a control!
245 | - Put as much as possible - *everything* that won't need to be updated when a control changes - in the global chunk
246 |
247 | ## Dynamic Documents
248 |
249 | - We're not going to do it in this class, but a nice heads up is that you can use shiny and flexdashboard to create *automatic documents*
250 | - The `params` argument in the YAML can take in inputs
251 | - And you can set whatever settings you want on your dashboard and hit a button to make it create a customized report in RMarkdown using those settings!
252 | - Huge time-saver, either for documents to be created automatically or for creating a frontend for people at your company to easily create customized reports
253 | - Alternate extra credit opportunity for this class: Make an automatic report generator
254 |
255 | ## Practice
256 |
257 | - Open a new **flexdashboard**
258 | - Set `runtime: shiny`
259 | - Rename that setup chunk `global` and load up `data(storms)`.
260 | - Add a `selectInput()` to select a `status`
261 | - Make some graph that only uses the data from that `status`
262 | - Then make it a **ggiraph**!
263 |
--------------------------------------------------------------------------------
/Lecture_15b_Bob_Ross_My_ATUS_Dashboard_Final.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "ATUS Activity Dashboard "
3 | output:
4 | flexdashboard::flex_dashboard:
5 | orientation: columns
6 | vertical_layout: fill
7 | theme:
8 | version: 4
9 | runtime: shiny
10 | ---
11 |
12 | ```{r global, include=FALSE}
13 | {
14 | library(flexdashboard)
15 | library(atus)
16 | library(tidyverse)
17 | library(scales)
18 | library(plotly)
19 | library(shiny)
20 | library(ggiraph)
21 | library(dygraphs)
22 | }
23 |
24 | data(atusact)
25 |
26 | # Check that everything is 5 or 6 digits long
27 | # atusact %>% pull(tiercode) %>% nchar() %>% table()
28 |
29 | # Check that we can use the first four digits of tucaseid to find the year
30 | # data(atusresp)
31 | # atusresp <- atusresp %>%
32 | # mutate(year_from_id = str_sub(tucaseid, 1, 4))
33 | # table(atusresp$tuyear, atusresp$year_from_id)
34 |
35 |
36 | # Get RID of last two digits
37 | atusact <- atusact %>%
38 | ungroup() %>%
39 | mutate(tiercode = floor(tiercode/100)) %>% # Get four-digit codes
40 | mutate(code_2digit = floor(tiercode/100)) %>% # Get two-digit codes
41 | mutate(tiercode = factor(tiercode),
42 | code_2digit = factor(code_2digit)) %>%
43 | mutate(Year = as.numeric(str_sub(tucaseid, 1, 4))) # Get the year
44 |
45 | # Alternate approach using string manipulation
46 | # atusact %>%
47 | # mutate(six_digit_code = str_pad(tiercode, 6, 'left', '0')) %>%
48 | # mutate(tiercode_2digit = str_sub(six_digit_code, 1, 2)) %>%
49 | # mutate(tiercode_4digit = str_sub(six_digit_code, 1, 4))
50 |
51 |
52 | # Get share of each two-digit code that is the four-digit code (as a function of dur)
53 | # Overall
54 | overall_share <- atusact %>%
55 | group_by(tiercode, code_2digit) %>%
56 | summarize(total = sum(dur)) %>%
57 | group_by(code_2digit) %>%
58 | mutate(share = total/sum(total))
59 |
60 | # Within year
61 | overall_share_by_year <- atusact %>%
62 | group_by(tiercode, code_2digit, Year) %>%
63 | summarize(total = sum(dur)) %>%
64 | group_by(code_2digit, Year) %>%
65 | mutate(share = total/sum(total))
66 |
67 |
68 | # Get total time spent on each two-digit code each year
69 | total_time_per_year <- atusact %>%
70 | group_by(code_2digit, Year) %>%
71 | summarize(`Total Minutes (Millions)` = sum(dur)/1000000) %>%
72 | ungroup() %>%
73 | mutate(Date = lubridate::ym(paste0(Year, '-01'))) %>%
74 | select(-Year)
75 | ```
76 |
77 | Column {.sidebar}
78 | -----------------------------------------------------------------------
79 |
80 | ```{r}
81 | selectInput('twodigit', 'Select a Category of Activities:',
82 | sort(unique(overall_share$code_2digit)))
83 | ```
84 |
85 | Column {data-width=650}
86 | -----------------------------------------------------------------------
87 |
88 | ### Share of Each Activity Over Time
89 |
90 | ```{r}
91 | renderPlotly({
92 | p1 <- overall_share_by_year %>%
93 | filter(code_2digit == input$twodigit) %>%
94 | mutate(Share = round(share*100, 1)) %>%
95 | ggplot(aes(x = Year, y = Share, fill = tiercode)) +
96 | geom_area() +
97 | scale_y_continuous(labels = label_percent(scale = 1)) +
98 | theme_classic() +
99 | labs(fill = 'Activity')
100 | ggplotly(p1)
101 | })
102 | ```
103 |
104 | ### Total Amount of Activity Over Time
105 |
106 | ```{r}
107 | renderDygraph({
108 | total_time_per_year %>%
109 | filter(code_2digit == input$twodigit) %>%
110 | tbl2xts::tbl_xts() %>%
111 | dygraph() %>%
112 | dyRangeSelector()
113 | })
114 | ```
115 |
116 | Column {data-width=350}
117 | -----------------------------------------------------------------------
118 |
119 | ### Share of Each Activity
120 |
121 | ```{r}
122 | renderGirafe({
123 | p3 <- overall_share %>%
124 | filter(code_2digit == input$twodigit) %>%
125 | mutate(tooltip = paste0('Total Minutes (Mil.): ',
126 | number(total, scale = 1/1000000, accuracy = .1),
127 | '\nPercentage: ',
128 | percent(share, accuracy = .1))) %>%
129 | ggplot(aes(x = reorder(tiercode, share), y = share, tooltip = tooltip)) +
130 | geom_col_interactive(fill = 'lightblue', color = 'black') +
131 | scale_y_continuous(labels = label_percent()) +
132 | labs(x = 'Activity', y = 'Share') +
133 | theme_classic() +
134 | coord_flip()
135 | girafe(ggobj = p3)
136 | })
137 | ```
138 |
139 |
--------------------------------------------------------------------------------
/Lecture_16_Data_Import.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Data_Import.png
--------------------------------------------------------------------------------
/Lecture_16_Intro_to_Tableau.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 16: Intro to Tableau"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | library(tidyverse)
19 | library(knitr)
20 | library(paletteer)
21 | library(ggrepel)
22 | library(directlabels)
23 | library(gghighlight)
24 | library(lubridate)
25 |
26 | rinlinevarname <- function(code){
27 | html <- '``` `CODE` ```
'
28 | sub("CODE", code, html)
29 | }
30 | ```
31 |
32 |
33 | ## Graphing Software
34 |
35 | ```{r, results = 'asis'}
36 | cat("
37 | ")
43 | ```
44 |
45 | - A lot of this class has been spent going over **ggplot2** in R
46 | - Today we'll be going over Tableau
47 | - (not to mention all we leave out - PowerBI, **matplotlib**/**Seaborn** in Python, every stats package other than R, automated stuff like Azure, etc. etc.)
48 | - Why introduce these different tools, and what do they do?
49 |
50 | ## Graphing Software
51 |
52 | - It's most important that we understand the underlying concepts of data communication
53 | - But also we have to know how to implement those concepts
54 | - And since so much of this is subjective, we have to practice, practice, practice. Working with the software is a requirement!
55 | - Plus, stuff like learning the grammar of graphics, baked into **ggplot2**, can help strengthen those concepts
56 |
57 | ## Graphing Software
58 |
59 | - R: High-power but with a learning curve. **Great for any size data sets**, **great for summarizing**, **endless customization**. **Great for contrasting groups**
60 | - **If the data isn't already in graphable format, R itself is crucial for cleaning/formatting it**
61 | - Supports **dashboards**, **notebooks**, **interactivity**, and **animation**
62 | - Until you're used to it, **a lot of work to get going though!**
63 | - Because there is **direct access to the underlying pieces of the graph**, there is no better tool for making **one beautiful image**
64 | - Requires additional work for working with enormous databases
65 |
66 | ## Graphing Software
67 |
68 | - Tableau: High-power, with less learning curve, but more constrained. **Great for any size data sets**, **great for summarizing**, **great for contrasting groups**
69 | - Great at making a **host of graphs at once to explore a data set thoroughly**
70 | - It can **smartly suggest for you what kind of graph to use**
71 | - Built-in easy to work with tools for **enormous databases** and **dashboards**
72 | - Everything is **interactive by default**
73 | - **Easier than R** and also less work but **deep customization a bit more difficult**
74 |
75 | ## The Canned-Coded Scale
76 |
77 | - It is extremely common in software to have to choose between a deeper, more customizable option (such as coding something in **ggplot2**) or a more canned, one-click-does-it option (like Tableau or Excel)
78 | - Until you're very familiar with the coding option, the canned option will be faster and easier...
79 | - *IF* the thing you want to do is something the canned-software authors anticipated and decided to make easy
80 | - If it's not, it will usually be *much* harder than just coding the thing up
81 |
82 | ## The Canned-Coded Scale
83 |
84 | - Tableau is a very cool tool and I encourage you to use it (as are other more canned systems like PowerBI, Azure, etc.)
85 | - But don't be surprised when you start thinking "hmm but what if my visualization was different in THIS nonstandard way" and start Googling, only to find it's a fourteen-step process that doesn't actually work
86 | - Using Tableau for the simple stuff, and immediately shifting to code for anything that's not stock-standard, is a decent strategy!
87 | - That said, let's explore Tableau
88 |
89 | ## Tableau
90 |
91 | - As opposed to (the way we've used) R, Tableau uses *database* logic - it sends the calculation **to the data** rather than bringing the data to the calculation
92 | - So we'll have "connections" to data rather than bringing data in; also, if data updates, so will Tableau
93 | - And the output we have won't include the underlying data, only our calculations of the data.
94 | - Let's walk through some usage of Tableau
95 |
96 |
97 | ## Importing Data
98 |
99 | - A connection can be "Live" (database-like) or an "Extract" (bring it in!)
100 | - Specify how variables should be read by clicking on them
101 | - Tableau will guess as much as possible
102 | - Unlike R we do filtering, selection, and recoding at THIS step
103 | - Strongly recommended to do all data-cleaning in R BEFORE importing to Tableau - Tableau *can* clean data but it's a real drag
104 |
105 | ## Importing Data
106 |
107 | ```{r}
108 | knitr::include_graphics('Lecture_16_Data_Import.png')
109 | ```
110 |
111 | ## Importing Data
112 |
113 | - NAs can make it think it's a string, let's tell it earnings is a number
114 | - Let's group `pred_degree_awarded_ipeds`: 1 = less-than-two-year college, 2 = two-year, 3 = four-year+
115 | - Could also do a filter here, or only select certain variables
116 | - Or create variables calculated from others - let's make an employment rate
117 |
118 | ## The Tableau Screen
119 |
120 | - Dimensions (categorical) and measures (continuous numeric) on the left. Tableau really cares about discrete vs. continuous.
121 | - Columns and Rows areas to drag to
122 | - (single image export: Worksheet $\rightarrow$ Copy image)
123 | - "Show me" default graphs on the right. Decent for initial exploration but generally not what you want to work with
124 |
125 | ---
126 |
127 | ```{r}
128 | knitr::include_graphics('Lecture_16_Tableau_Main.png')
129 | ```
130 |
131 |
132 | ## Working with Variables
133 |
134 | - To start making a graph, think about what you want on the x-axis (column) or y-axis (row)
135 | - How it treats a variable can then be manipulated: discrete vs. continuous and measure vs. dimension. Measures get aggregated/summarized in some specified way, dimensions don't!
136 | - Plenty of aggregation options for continuous variables
137 |
138 | ## Working with Variables
139 |
140 | - Also click the variables to do things like sort the order or turn them "dual axis" (two variables on the same axis)
141 | - In general, like Excel, click what you want to edit, like axis titles
142 | - Note with a graph we can find the data that goes into it
143 | - Some variables like "Measure Names" can become their own "shelf" like column/row for more complex diagrams
144 |
145 | ## Common Problems in Tableau
146 |
147 | - If something's not working, check whether your variable has the correct continuous/discrete measure/dimension settings
148 | - Note that sometimes "grouped" variables don't work the same as regular categorical variables - make life easier by just making them strings grouped as you like in R and import them
149 | - Did you try Googling to see how you can do (nonstandard thing X)?
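
The "group it in R first" advice can be sketched like this. The rows are made up to stand in for the real data, and the new column name and file name are mine; the labels follow the coding of `pred_degree_awarded_ipeds` given earlier.

```r
library(dplyr)

# Made-up rows standing in for the real data
scorecard <- data.frame(
  pred_degree_awarded_ipeds = c(1, 2, 3, 3, NA)
)

# Turn the numeric codes into the string groups we want Tableau to see
scorecard <- scorecard %>%
  mutate(degree_type = case_when(
    pred_degree_awarded_ipeds == 1 ~ "Less-than-two-year",
    pred_degree_awarded_ipeds == 2 ~ "Two-year",
    pred_degree_awarded_ipeds == 3 ~ "Four-year+"
  ))

# Write a clean file for Tableau to connect to. na = "" keeps missing values
# from arriving as the text "NA", which can turn a numeric column into a string
out_path <- file.path(tempdir(), "scorecard_clean.csv")
write.csv(scorecard, out_path, row.names = FALSE, na = "")
```

Tableau then sees `degree_type` as an ordinary string dimension, with no grouping step needed on its side.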
150 |
151 | ## Basic Example
152 |
153 | - Let's remake it!
154 |
155 | ```{r}
156 | knitr::include_graphics('Lecture_16_Tableau_Employment.png')
157 | ```
158 |
159 | ## Basic Example
160 |
161 | - This one is trickier!
162 |
163 | ```{r}
164 | knitr::include_graphics('Lecture_16_Tableau_Repayment.png')
165 | ```
166 |
167 | ## Basic Example
168 |
169 | - Let's do this one together: what should be in rows and columns here?
170 |
171 | ```{r}
172 | knitr::include_graphics('Lecture_16_Tableau_Map.png')
173 | ```
174 |
175 | ## Another Example
176 |
177 | - We can also set labels by clicking-and-labeling
178 | - Let's replicate the graph on the next page
179 |
180 | ---
181 |
182 | ```{r}
183 | knitr::include_graphics('Lecture_16_Tableau_Boxplot.png')
184 | ```
185 |
186 | ## And Explore
187 |
188 | - Click around and make your own graph
189 | - See what kinds of customizations you can do
190 |
--------------------------------------------------------------------------------
/Lecture_16_Scorecard.twbx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Scorecard.twbx
--------------------------------------------------------------------------------
/Lecture_16_Tableau_Boxplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Tableau_Boxplot.png
--------------------------------------------------------------------------------
/Lecture_16_Tableau_Employment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Tableau_Employment.png
--------------------------------------------------------------------------------
/Lecture_16_Tableau_Main.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Tableau_Main.png
--------------------------------------------------------------------------------
/Lecture_16_Tableau_Map.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Tableau_Map.png
--------------------------------------------------------------------------------
/Lecture_16_Tableau_Repayment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Tableau_Repayment.png
--------------------------------------------------------------------------------
/Lecture_17_Axis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_Axis.png
--------------------------------------------------------------------------------
/Lecture_17_CancelledFlights.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_CancelledFlights.png
--------------------------------------------------------------------------------
/Lecture_17_CumulativeCancellations.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_CumulativeCancellations.png
--------------------------------------------------------------------------------
/Lecture_17_DelaybyDay.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_DelaybyDay.png
--------------------------------------------------------------------------------
/Lecture_17_MiddayFlights.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_MiddayFlights.png
--------------------------------------------------------------------------------
/Lecture_17_MiddayFlightsAxes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_MiddayFlightsAxes.png
--------------------------------------------------------------------------------
/Lecture_17_More_Tableau.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 17: More Tableau"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | library(tidyverse)
19 | library(knitr)
20 | library(paletteer)
21 | library(ggrepel)
22 | library(directlabels)
23 | library(gghighlight)
24 | library(lubridate)
25 |
26 | rinlinevarname <- function(code){
27 | html <- '``` `CODE` ```
'
28 | sub("CODE", code, html)
29 | }
30 | ```
31 |
32 |
33 | ## Tableau
34 |
35 | ```{r, results = 'asis'}
36 | cat("
37 | ")
43 | ```
44 |
45 | - Last time we got some basic graphs going and got familiar with the software
46 | - This time we'll go a little deeper!
47 | - First, working our way around the decoration and customization options
48 | - Then, making some more complex graphs with Table Calculations and Shelves
49 |
50 | ## Decoration
51 |
52 | - We can adjust the presentation of most elements of the graph by double clicking on them, Excel style
53 | - For example, axes
54 | - From here we can adjust tick marks, scale, consistency of scale across multi-part graphs, and numerical formatting, as well as things like "include 0 on the axis or no?"
55 |
56 | ---
57 |
58 | ```{r}
59 | knitr::include_graphics('Lecture_17_Axis.png')
60 | ```
61 |
62 | ## Formatting Numbers
63 |
64 | - When we click on an axis, we can also see the "Scale" segment on the left
65 | - We can set things like percentages being .1 or 10%, or the units of the variable (like $)
66 | - (If you're having trouble, RIGHT-click the axis and hit Format)
67 |
68 | ## Decoration
69 |
70 | - Let's take the example worksheet and walk through adjusting the axis
71 | - Look at Sheet 1
72 | - Make the y-axis tick marks percentages with one decimal place
73 | - And make the x-axis only have ticks on days 0, 10, 20, 30.
74 | - Also change axis titles to "Day of Month" and "Percent Cancelled"
75 |
76 | ## Decoration
77 |
78 | ```{r}
79 | knitr::include_graphics('Lecture_17_Tickmarks.png')
80 | ```
81 |
82 | ## Decoration
83 |
84 | ```{r}
85 | knitr::include_graphics('Lecture_17_Percent.png')
86 | ```
87 |
88 |
89 | ## The Result
90 |
91 | ```{r}
92 | knitr::include_graphics('Lecture_17_CancelledFlights.png')
93 | ```
94 |
95 | ## Colors
96 |
97 | - We can make a variable be distinguished by color by dragging it to the Color part of the Marks pane
98 | - (This will add color to the *variable*, not its position, so drag it from the left variables pane, not away from its column/row if it's there!)
99 | - Then click Color to manually adjust the colors used or pick a palette, or pick opacity
100 |
101 | ## Other Axes
102 |
103 | - Just like the aesthetics in our **ggplot2** `aes()`, we can drag variables to be used on the Size, Label, etc. axes
104 | - Having variables *just* on these axes and not Column or Row will create different kinds of output - Labels-only will give a table, for example
105 | - Also, not really an axis, but by clicking data points we can zoom in on them, exclude them, or exclude others
106 | - The Marks card is the home of these, and we can edit things directly there
107 |
108 | ## Decoration
109 |
110 | - On Sheet 2
111 | - Let's color our data by Cancelled
112 | - And make the size differ by Distance
113 | - And exclude flights before 3AM
114 | - Make an annotation of an area (What story does this tell?)
115 | - And title it "Departure Times and Delays"
116 |
117 | ## Decoration
118 |
119 | ```{r}
120 | knitr::include_graphics('Lecture_17_MiddayFlightsAxes.png')
121 | ```
122 |
123 | ---
124 |
125 | ```{r}
126 | knitr::include_graphics('Lecture_17_MiddayFlights.png')
127 | ```
128 |
129 |
130 | ## Table Calculations
131 |
132 | - By clicking the down-arrow on a numeric variable on an axis, we can change it from being presented raw to being presented relative to other values
133 | - We already covered how we can make it a calculation/summary: mean, median, etc.
134 | - Table Calculations extend the analysis to make the presentation *relative to other calculations*
135 | - e.g. changing a line graph from the values to growth-from-previous-period, or moving average
136 | - Or making a bar graph percent-of-total, or rank
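
For comparison, here is a rough sketch of what a few of these Table Calculations compute, written in R with **dplyr**. The `flights` data and its column names are made up for illustration; in Tableau the equivalent work is done from the down-arrow menu.

```r
library(dplyr)

# Made-up daily counts, for illustration only
flights <- tibble(day = 1:5, cancelled = c(10, 20, 30, 20, 20))

flights <- flights %>%
  mutate(
    pct_of_total = cancelled / sum(cancelled),      # percent of total
    change_from_prev = cancelled - lag(cancelled),  # difference from previous period
    rank_desc = min_rank(desc(cancelled))           # rank, largest value first
  )
```

Each new column depends on the other rows as well as its own, which is the sense in which these calculations are relative.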
137 |
138 |
139 | ## Table Calculations
140 |
141 | - Let's change the bar graph from count to percent of total
142 | - Now you change the line graph from percent to change-since-previous
143 | - (Also, while we're at it, add data point labels to the line graph)
144 | - (And right-click an empty space to add an annotation)
145 |
146 | ## Other Down-Arrow Adjustments
147 |
148 | - We can also use this down-arrow to sort
149 | - For example, sorting a bar by its height, or the value of another variable
150 | - You can do this using another variable directly, or a table calculation (like a COUNT) of a variable
151 |
152 | ## Getting More Complex
153 |
154 | - If you want to do more detailed adjustments to variables, things like complex created variables, you can do so with Create Calculated Field
155 | - But this does require that you learn the Tableau coding system. You can Google for how to do stuff, but it's often easier just to do it in R and export to Tableau (same as with joins/etc.)
156 | - One exception is if you just want a basic condition checked
157 | - `IF [X] == Value THEN "A" ELSE "B" END`
158 | - Can also use `ELSEIF` for more-flexible grouped variables than with Groups
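
If you go the R route instead, a minimal sketch of the same kind of condition might look like this. The `dep_delay` column and the cutoffs are assumptions for illustration, not part of the example workbook.

```r
library(dplyr)

# Made-up departure delays in minutes
delays <- tibble(dep_delay = c(-5, 10, 45, 200))

delays <- delays %>%
  mutate(
    # Two-way condition, like IF ... THEN "A" ELSE "B" END
    late = if_else(dep_delay > 0, "Late", "On time"),
    # Ordered conditions, like IF ... ELSEIF ... ELSE ... END
    delay_group = case_when(
      dep_delay <= 0 ~ "On time",
      dep_delay <= 60 ~ "Under an hour",
      TRUE ~ "Over an hour"
    )
  )
```

`case_when()` checks its conditions in order and uses the first one that is true, so it behaves like a chain of `ELSEIF` branches.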
159 |
160 | ## Animation
161 |
162 | - You can do basic animations by adding a variable to "Pages"; this will make a slideshow going through each value
163 | - This is only one kind of animated graph but it is very easy to make! Let's try one.
164 |
165 | ## Sharing Work
166 |
167 | - Tableau workbooks can be saved and shared
168 | - If you want them to be interactive for someone without Tableau, have them use [Tableau Reader](https://www.tableau.com/products/reader)
169 | - Or just a static image with Worksheet $\rightarrow$ Export
170 | - Or make a dashboard (we'll get to that next time)!
171 |
172 |
173 | ## Sharing Work
174 |
175 | - If you're sharing with someone who has Tableau, and your data isn't enormous, use a **packaged workbook**.
176 | - This will bundle the data set in with your work so they can actually open it, making a `.twbx` file
177 | - If you don't do this and just send a `.twb` then they won't be able to see your work at all!!
178 | - If you turn in your Tidy Tuesday homework as a `.twb` I won't be able to read it and you'll have to resubmit.
179 |
180 | ## Practice
181 |
182 | - Let's walk through creating the graph on the following slide...
183 |
184 | ---
185 |
186 | ```{r}
187 | knitr::include_graphics('Lecture_17_CumulativeCancellations.png')
188 | ```
189 |
190 | ## Practice
191 |
192 | - Now you recreate this graph
193 |
194 | ```{r}
195 | knitr::include_graphics('Lecture_17_DelaybyDay.png')
196 | ```
197 |
198 |
199 | ## Practice
200 |
201 | - Make your own graph with this data
--------------------------------------------------------------------------------
/Lecture_17_Percent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_Percent.png
--------------------------------------------------------------------------------
/Lecture_17_Table_Calc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_Table_Calc.png
--------------------------------------------------------------------------------
/Lecture_17_Tickmarks.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_Tickmarks.png
--------------------------------------------------------------------------------
/Lecture_18_A_Mess.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_18_A_Mess.png
--------------------------------------------------------------------------------
/Lecture_18_AddWorksheets.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_18_AddWorksheets.png
--------------------------------------------------------------------------------
/Lecture_18_Dashboard_Format.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_18_Dashboard_Format.png
--------------------------------------------------------------------------------
/Lecture_18_Dashboards_in_Tableau.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 18: Dashboards in Tableau"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: simple
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
18 | library(tidyverse)
19 | options("kableExtra.html.bsTable" = T)
20 | rinlinevarname <- function(code){
21 | html <- '``` `CODE` ```
'
22 | sub("CODE", code, html)
23 | }
24 | ```
25 |
26 |
27 | ## Dashboards in Tableau
28 |
29 | ```{r, results = 'asis'}
30 | cat("
31 | ")
37 | ```
38 |
39 | - Tableau is designed for dashboards, and they make it easy
40 | - Most of what we'll cover today you could probably figure out yourself by clicking "New Dashboard" at the bottom and clicking around
41 | - So we'll keep this brief and then get to trying stuff out
42 |
43 | ## Creating a Dashboard
44 |
45 | - Create your visualizations and tables
46 | - Then down at the bottom, next to the Sheets, there's a + icon for "New Dashboard"
47 | - Click it!
48 |
49 | ## Adding Worksheets to the Dashboard
50 |
51 | - From here you'll see:
52 | - The Layout option (in a sec)
53 | - Your worksheets - drag them in!
54 | - Options at the bottom to add navigation, images, text, web pages
55 |
56 | ## Adding Worksheets to the Dashboard
57 |
58 | ```{r}
59 | knitr::include_graphics('Lecture_18_AddWorksheets.png')
60 | ```
61 |
62 | ## Layout
63 |
64 | - Tableau Dashboards can be laid out for desktop, mobile, or tablet
65 | - Check your Size options, and try Device Preview with a few things on
66 | - Also check the Layout tab for other options
67 | - Keep the audience in mind, and whether some visualizations might not translate to tiny sizes!
68 |
69 | ## Layout
70 |
71 | - You can "float" elements of the dashboard but this isn't generally a good idea
72 | - What *is* a good idea is moving things like legends (if you have them) close to where the actual viz that uses them is
73 | - You don't want the legend for Viz B sitting next to Viz A - confusing!
74 |
75 | ## Formatting
76 |
77 | - Format $\rightarrow$ Dashboard brings up the formatting panel for things like overall coloring, fonts, etc.
78 | - You can also format the individual worksheets included by clicking on them as normal
79 | - Any re-decoration you want to do at this point is on the table!
80 |
81 | ## Formatting
82 |
83 | ```{r}
84 | knitr::include_graphics('Lecture_18_Dashboard_Format.png')
85 | ```
86 |
87 | ## Filters
88 |
89 | - You can add interactive filters to look at different parts of the data, and apply those filters broadly
90 | - Select a viz that uses the variable you want to filter on and hit the "use as filter" button (looks like a funnel). Then click the down arrow and choose Filter to select the variable to filter on
91 | - A filter will appear. On the down arrow for THAT, you can change the kind of filter you get (dropdown, etc.) and also which worksheets the filter applies to - can update the whole dashboard at once with one filter if you like!
92 |
93 | ## Considerations for Dashboards
94 |
95 | If you're planning to end up with a dashboard...
96 |
97 | - Hold off on annotations until the end; what looks good on a single image might get cramped on multiples
98 | - Consider how you might reduce the number of legends required to understand things. Lots of legends can get confusing
99 | - Try to make dashboard and worksheet formatting consistent
100 | - Use a grid layout! Don't get wild
101 | - Consider including the "Export to PDF" option
102 |
103 | ---
104 |
105 | ```{r}
106 | knitr::include_graphics('Lecture_18_A_Mess.png')
107 | ```
108 |
109 | ## Sharing Dashboards
110 |
111 | - Save as an image (no interactivity) with Dashboard $\rightarrow$ Export Image
112 | - Send the workbook which can be opened with Tableau or (free) Tableau Reader
113 | - Tableau Server and Tableau Online store the workbook online - automatically updated as necessary. Users can also subscribe for update notifications. Requires paid license
114 |
115 | ## Let's Do This
116 |
117 | - See the provided workbook file Lecture_18_Oster_Data
118 | - This contains data from NHANES, the National Health and Nutrition Examination Survey
119 |
120 | ## Let's Do This
121 |
122 | - Specifically, an extract from Oster (2020), who was trying to tell the following story:
123 | - From 1999-2004, Vitamin E was a recommended supplement
124 | - Before 1999 and after 2004, it wasn't, and was actually recommended against
125 | - From 1999-2004, people who pay the most attention to their health will be more likely to follow the recommendation, i.e. take Vitamin E
126 | - So, during that time, the relationship between taking Vitamin E and health outcomes like mortality should be stronger
127 |
128 | ## Let's Do This
129 |
130 | - Try to tell this story in a dashboard!
131 | - Careful: how can you show that a *relationship* gets stronger or weaker over time? This is trivariate!
--------------------------------------------------------------------------------
/Lecture_18_Oster_Data_packaged.twbx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_18_Oster_Data_packaged.twbx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DataCommSlides
2 | Slides for Seattle University Course on Data Communications
3 |
4 | The recommended book for this course is Kieran Healy's [Data Visualization](http://socviz.co/)
5 |
6 | NOTE: Some Tableau data files are not included in the repo since they are too large.
7 |
8 | Also, check out the accompanying [video series](https://nickchk.com/videos.html#dcomm) for the class. As of now, these videos have not been updated for the newest version of the class, and instead follow along with the old version.
9 |
10 | ## Principles of Data Communication
11 |
12 | [Lecture 1: Data Communication](https://nickch-k.github.io/DataCommSlides/Lecture_01_Data_Communication.html#/)
13 |
14 | [Lecture 2: Clutter and Focus](https://nickch-k.github.io/DataCommSlides/Lecture_02_Clutter_and_Focus.html#/)
15 |
16 | [Lecture 3: Audience](https://nickch-k.github.io/DataCommSlides/Lecture_03_Audience.html#/)
17 |
18 | ## R, Workflow, and Data Wrangling
19 |
20 | [Lecture 4: R and RMarkdown](https://nickch-k.github.io/DataCommSlides/Lecture_04_R_and_RMarkdown.html#/)
21 |
22 | [Common R Problems](https://nickch-k.github.io/DataCommSlides/Lecture_04_Common_R_Problems.html#/)
23 |
24 | [Lecture 5 and Lecture 6: Data Wrangling](https://nickch-k.github.io/DataCommSlides/Lecture_05_and_06_Data_Wrangling.html#/)
25 |
26 | ## Exploratory Data Analysis
27 |
28 | [Lecture 7: Exploratory Data Analysis](https://nickch-k.github.io/DataCommSlides/Lecture_07_EDA.html#/)
29 |
30 |
31 | ## Data Visualization in R/ggplot2
32 |
33 | [Lecture 8: Basics of ggplot2](https://nickch-k.github.io/DataCommSlides/Lecture_08_Basics_of_ggplot2.html#/)
34 |
35 | [Lecture 9: Structuring ggplot2](https://nickch-k.github.io/DataCommSlides/Lecture_09_Structuring_ggplot.html#/)
36 |
37 | [Lecture 10: Styling ggplot2](https://nickch-k.github.io/DataCommSlides/Lecture_10_Styling_ggplot.html#/)
38 |
39 | [Lecture 11: Bob Ross Day](https://nickch-k.github.io/DataCommSlides/Lecture_11_Bob_Ross.html#/) (this one may not translate to a non-classroom scenario; the idea here is that we create these graphs together one step at a time, building them up from scratch and showing the process)
40 |
41 |
42 | ## Fine-Tuning our Communications
43 |
44 | [Lecture 12: Story and Geometry](https://nickch-k.github.io/DataCommSlides/Lecture_12_Story_and_Geometry.html#/)
45 |
46 | [Lecture 13: Communicating Statistics and Uncertainty](https://nickch-k.github.io/DataCommSlides/Lecture_13_Statistics.html#/)
47 |
48 | ## Dashboards and Interactivity
49 |
50 | [Lecture 14: Dashboards in R](https://nickch-k.github.io/DataCommSlides/Lecture_14_Dashboards_in_R.html#/)
51 |
52 | [Lecture 15: Interactivity](https://nickch-k.github.io/DataCommSlides/Lecture_15_Interactive_Visuals_in_R.html#/)
53 |
54 | [Example Shiny Dashboard](https://nickch-k.shinyapps.io/Lecture_18_Example_Shiny/)
55 |
56 | ## Tableau
57 |
58 | [Lecture 16: Introduction to Tableau](https://nickch-k.github.io/DataCommSlides/Lecture_16_Intro_to_Tableau.html#/)
59 |
60 | [Lecture 17: More Tableau](https://nickch-k.github.io/DataCommSlides/Lecture_17_More_Tableau.html#/)
61 |
62 | [Lecture 18: Dashboards in Tableau](https://nickch-k.github.io/DataCommSlides/Lecture_18_Dashboards_in_Tableau.html#/)
63 |
64 |
65 | ## Additional Materials
66 |
67 | [Easy Mistakes to Avoid, and Things You Must Not Do](https://nickch-k.github.io/DataCommSlides/Easy_Mistakes_to_Avoid.html)
--------------------------------------------------------------------------------
/SPrail.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/SPrail.zip
--------------------------------------------------------------------------------
/SPrail_july.csv:
--------------------------------------------------------------------------------
1 | insert_date,origin,destination,start_date,end_date,train_type,price,train_class,fare
2 | 2019-05-08T09:04:07Z,MADRID,BARCELONA,2019-07-05T16:30:00Z,2019-07-05T19:20:00Z,AVE,102.15,Turista Plus,Promo
3 | 2019-05-09T21:01:16Z,MADRID,SEVILLA,2019-07-06T12:00:00Z,2019-07-06T14:32:00Z,AVE,53.4,Turista,Promo
4 | 2019-05-09T09:43:19Z,MADRID,BARCELONA,2019-07-01T15:30:00Z,2019-07-01T18:40:00Z,AVE,69.8,Turista Plus,Promo
5 | 2019-05-09T21:14:08Z,MADRID,BARCELONA,2019-07-05T20:30:00Z,2019-07-05T23:40:00Z,AVE,58.15,Turista,Promo
6 | 2019-05-08T19:11:36Z,MADRID,BARCELONA,2019-07-02T18:30:00Z,2019-07-02T21:20:00Z,AVE,85.1,Turista,Promo
7 | 2019-05-07T07:47:46Z,BARCELONA,MADRID,2019-07-03T19:00:00Z,2019-07-03T22:00:00Z,AVE,85.1,Turista,Promo
8 | 2019-05-09T13:50:14Z,BARCELONA,MADRID,2019-07-04T11:00:00Z,2019-07-04T13:45:00Z,AVE,85.1,Turista,Promo
9 | 2019-05-07T03:42:13Z,MADRID,BARCELONA,2019-07-01T17:00:00Z,2019-07-01T19:30:00Z,AVE,88.95,Turista,Promo
10 |
--------------------------------------------------------------------------------
/gganimate_guide.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Animations in ggplot - gganimate"
3 | author: "Gareth Green"
4 | output: slidy_presentation
5 | ---
6 |
7 | ```{r echo = FALSE}
8 | # Course: 5210 Communicating Data
9 | # Purpose: Demonstrate gganimate
10 | # Date: August 5, 2019
11 | # Author: Gareth Green
12 |
13 | ```
14 |
15 | ```{r echo = FALSE}
16 | # Clear environment of variables and functions
17 | rm(list = ls(all = TRUE))
18 |
19 | # Clear environment of packages
20 | if (!is.null(sessionInfo()$otherPkgs)) lapply(paste0("package:", names(sessionInfo()$otherPkgs)), detach, character.only = TRUE, unload = TRUE)
21 |
22 | ```
23 |
24 | ```{r echo = FALSE, message = FALSE, warning = FALSE}
25 | # Load libraries
26 | library(tidyverse)
27 | library(gganimate)
28 |
29 | ```
30 |
31 | ```{r echo = FALSE}
32 | # Load data
33 | qp1 <- read.csv("qp1_data.csv")
34 |
35 | ```
36 |
37 |
38 | Animated visuals
39 | =======================================
40 |
41 |
42 |
43 | + How many of you have seen this famous [gapminder video of Hans Rosling](https://www.bing.com/videos/search?q=hans+rosling&view=detail&mid=4CC1172C798C8C3BD2A94CC1172C798C8C3BD2A9&FORM=VIRE)?
44 |
45 | + Animated visuals can help tell a story
46 |
47 | - especially in a presentation
48 | - you can tell the story while the visual progresses
49 |
50 | + Animations are especially good for time series
51 |
52 | - can work well for other forms of motion as well
53 | - for example, how median price changes with number of bedrooms
54 | - key is that members of a group differ over the variable that you are animating over
55 |
56 | + Only use if the animation adds substance
57 |
58 | - do not just use animations to show sizzle
59 |
60 | + You can save these in your TA and call them later
61 |
62 | - `anim_save()` is much like `ggsave()`
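
A minimal sketch of saving an animation, assuming a GIF renderer such as the **gifski** package is installed. It uses the built-in `mtcars` data rather than the course data, and the file name is made up.

```r
library(ggplot2)
library(gganimate)

# Simple scatterplot that steps through the values of cyl
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  transition_states(cyl)

# Render the animation, then write it to disk
anim <- animate(p, nframes = 50)
anim_save("cars_by_cyl.gif", animation = anim)
```

If `animation` is left out, `anim_save()` saves the most recently rendered animation, much as `ggsave()` defaults to the last plot.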
63 |
64 |
65 |
66 |
67 | Animated visual basics
68 | =======================================
69 |
70 |
71 |
72 | + I want to show you a similar animation to the one Hans used
73 |
74 | + We will be using gganimate
75 |
76 | - I will show several examples of how to use gganimate
77 | - there are several different functions and they continue to develop new ones
78 |
79 | + Here we use the function `transition_time()`
80 |
81 | - this has the visual transition along a variable that is not in the plot
82 | - see that `year` is not in `ggplot()` but it is in `transition_time(year)`
83 |
84 | ```{r ann1, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 5}
85 | # Load package with gapminder data
86 | library(gapminder)
87 |
88 | # Build a faceted animation graph
89 | ggplot(data = gapminder,
90 | mapping = aes(x = gdpPercap/1000, y = lifeExp,
91 | size = pop, color = country)) +
92 | geom_point(alpha = 0.7, show.legend = FALSE) +
93 | scale_colour_manual(values = country_colors) +
94 | scale_size(range = c(2, 12)) +
95 | scale_x_log10() +
96 | facet_wrap(~continent) +
97 | theme_bw() +
98 |
99 | # Here comes the gganimate specific bits
100 | labs(title = "Real GDP and life expectancy grow together: Year {frame_time}",
101 | x = "GDP per capita ($000)", y = "Life expectancy") +
102 | transition_time(year) +
103 | ease_aes("linear")
104 |
105 | ```
106 |
107 |
108 |
109 |
110 | Time-series animation visuals
111 | =======================================
112 |
113 |
114 |
115 | + Here is a typical time series visual
116 |
117 | - here I use `transition_reveal()` by the x-variable in `ggplot()`
118 | - also set `frame = yr_built` for the visual to transition along
119 |
120 | + I use `animate()` to set the speed of the reveal
121 |
122 | ```{r ann2, message = FALSE, warning = FALSE}
123 | # Graph median price by year built
124 | p <- qp1 %>%
125 | select(yr_built, waterfront, price) %>%
126 | group_by(yr_built, waterfront) %>%
127 | summarize(med_price = median(price/1000)) %>%
128 | ggplot(aes(x = yr_built, y = med_price, frame = yr_built)) +
129 | geom_line(aes(color = as.factor(waterfront))) +
130 | scale_color_manual(name = "Waterfront", labels = c("No", "Yes"), values = c("navy", "blue"),
131 | guide = guide_legend(reverse=TRUE)) +
132 | theme_classic() +
133 | theme(axis.title.x = element_blank(),
134 | legend.position="bottom") +
135 | labs(title = "Waterfront median price is much more variable than non-waterfront",
136 | x = "", y = "Median price ($000)") +
137 | transition_reveal(yr_built)
138 |
139 | # Run as a GIF animation and control speed
140 | animate(p, nframes = 300)
141 |
142 | ```
143 |
144 |
145 |
146 | Animating distinct stages of a visual
147 | =======================================
148 |
149 |
150 |
151 | + Can also animate visuals that have discrete states
152 |
153 | - good for bar graphs or similar
154 |
155 | + the variable in `transition_states()` does not have to be called in `ggplot()`
156 |
157 | - e.g. could have animated by `view`
158 |
159 | ```{r ann3, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 5}
160 | # Animate a bar graph
161 | qp1 %>%
162 | select(price, bedrooms, waterfront) %>%
163 | group_by(bedrooms, waterfront) %>%
164 | filter(bedrooms < 7 & bedrooms > 0) %>%
165 | summarise(med_price = median(price/1000)) %>%
166 | ggplot(aes(x = bedrooms, y = med_price, fill = as.factor(waterfront))) +
167 | geom_bar(stat = "identity", position = "dodge") +
168 | theme_classic() +
169 | scale_fill_manual(name = "Waterfront", labels = c("No", "Yes"), values = c("lightblue", "blue"),
170 | guide = guide_legend(reverse=TRUE)) +
171 | labs(title = "Median price grows more per bedroom for waterfront homes",
172 | x = "Number of bedrooms", y = "Median price ($000)") +
173 | transition_states(bedrooms, wrap = FALSE) +
174 | shadow_mark()
175 |
176 | ```
177 |
178 |
179 |
180 |
181 |
--------------------------------------------------------------------------------
/king_dailyvisits.Rdata:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/king_dailyvisits.Rdata
--------------------------------------------------------------------------------
/no_border.css:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/oster_vitamine_data.Rdata:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/oster_vitamine_data.Rdata
--------------------------------------------------------------------------------
/oster_vitamine_data.hyper:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/oster_vitamine_data.hyper
--------------------------------------------------------------------------------
/robocalls_excel.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/robocalls_excel.xlsx
--------------------------------------------------------------------------------