├── Easy_Mistakes_to_Avoid.Rmd ├── Easy_Mistakes_to_Avoid.html ├── Lecture_01_Data_Communication.Rmd ├── Lecture_01_Data_Communication.html ├── Lecture_01_Faucet.jpg ├── Lecture_02_Bible_Cross_Ref.gif ├── Lecture_02_Bible_Cross_Ref.png ├── Lecture_02_Clutter_and_Focus.Rmd ├── Lecture_02_Clutter_and_Focus.html ├── Lecture_02_Coffee.jpg ├── Lecture_02_DuBois.PNG ├── Lecture_02_Healthcare.PNG ├── Lecture_02_Minard.gif ├── Lecture_02_NAEP.csv ├── Lecture_02_NAEP.xlsx ├── Lecture_02_Network_Graph.jpg ├── Lecture_02_Unemployment.PNG ├── Lecture_03_Audience.Rmd ├── Lecture_03_Audience.html ├── Lecture_03_Cardash.jpg ├── Lecture_03_Preattentive.PNG ├── Lecture_04_Common_R_Problems.Rmd ├── Lecture_04_Common_R_Problems.html ├── Lecture_04_R_Walkthrough.R ├── Lecture_04_R_and_RMarkdown.Rmd ├── Lecture_04_R_and_RMarkdown.html ├── Lecture_04_example_rmarkdown.Rmd ├── Lecture_04_second_example.Rmd ├── Lecture_05_Example_Sprail.Rmd ├── Lecture_05_Example_Sprail.html ├── Lecture_05_and_06_Data_Wrangling.Rmd ├── Lecture_05_and_06_Data_Wrangling.html ├── Lecture_05_and_06_example_code.R ├── Lecture_06_GG_Components.png ├── Lecture_07_EDA.Rmd ├── Lecture_07_EDA.html ├── Lecture_08_Basics_of_ggplot2.Rmd ├── Lecture_08_Basics_of_ggplot2.html ├── Lecture_09_Structuring_ggplot.Rmd ├── Lecture_09_Structuring_ggplot.html ├── Lecture_10_Styling_ggplot.Rmd ├── Lecture_10_Styling_ggplot.html ├── Lecture_11_Bob_Ross.Rmd ├── Lecture_11_Bob_Ross.html ├── Lecture_12_Hexmap.png ├── Lecture_12_State_Heatmap.png ├── Lecture_12_Story_and_Geometry.Rmd ├── Lecture_12_Story_and_Geometry.html ├── Lecture_13_Statistics.Rmd ├── Lecture_13_Statistics.html ├── Lecture_13_climate.png ├── Lecture_14_Clutter.png ├── Lecture_14_Color.png ├── Lecture_14_Dashboards_in_R.Rmd ├── Lecture_14_Dashboards_in_R.html ├── Lecture_14_Example_Dashboard.Rmd ├── Lecture_14_Flexdashboard.png ├── Lecture_14_Flexdashboard_row.png ├── Lecture_14_Grids.png ├── Lecture_14_KPI.png ├── Lecture_15_Example_Shiny.Rmd ├── 
Lecture_15_Interactive_Visuals_in_R.Rmd ├── Lecture_15_Interactive_Visuals_in_R.html ├── Lecture_15b_Bob_Ross_My_ATUS_Dashboard_Final.Rmd ├── Lecture_16_Data_Import.png ├── Lecture_16_Intro_to_Tableau.Rmd ├── Lecture_16_Intro_to_Tableau.html ├── Lecture_16_Scorecard.twb ├── Lecture_16_Scorecard.twbx ├── Lecture_16_Scorecard_YouDoIt.twb ├── Lecture_16_Tableau_Boxplot.png ├── Lecture_16_Tableau_Employment.png ├── Lecture_16_Tableau_Main.png ├── Lecture_16_Tableau_Map.png ├── Lecture_16_Tableau_Repayment.png ├── Lecture_17_Axis.png ├── Lecture_17_CancelledFlights.png ├── Lecture_17_CumulativeCancellations.png ├── Lecture_17_DelaybyDay.png ├── Lecture_17_Delayed_Flights.twb ├── Lecture_17_Delayed_Flights_Answers.twb ├── Lecture_17_MiddayFlights.png ├── Lecture_17_MiddayFlightsAxes.png ├── Lecture_17_More_Tableau.Rmd ├── Lecture_17_More_Tableau.html ├── Lecture_17_Percent.png ├── Lecture_17_Table_Calc.png ├── Lecture_17_Tickmarks.png ├── Lecture_18_A_Mess.png ├── Lecture_18_AddWorksheets.png ├── Lecture_18_Dashboard_Format.png ├── Lecture_18_Dashboards_in_Tableau.Rmd ├── Lecture_18_Dashboards_in_Tableau.html ├── Lecture_18_Oster_Data.twb ├── Lecture_18_Oster_Data_packaged.twbx ├── README.md ├── SPrail.zip ├── SPrail_april.csv ├── SPrail_july.csv ├── SPrail_june.csv ├── SPrail_may.csv ├── Scorecard_for_Tableau.csv ├── gganimate_guide.Rmd ├── gganimate_guide.html ├── king_dailyvisits.Rdata ├── no_border.css ├── oster_vitamine_data.Rdata ├── oster_vitamine_data.hyper ├── robocalls.csv └── robocalls_excel.xlsx /Lecture_01_Data_Communication.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 1: Data Communication" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 
| 18 | ```{r setup, include=FALSE} 19 | knitr::opts_chunk$set(echo = FALSE) 20 | library(tidyverse) 21 | library(gganimate) 22 | ``` 23 | 24 | ## Data Communication 25 | 26 | ```{r, results = 'asis'} 27 | cat(" 28 | ") 34 | ``` 35 | 36 | Welcome to Data Communication! 37 | 38 | ## Data Communication 39 | 40 | This class is about 41 | 42 | - How to turn data into a message 43 | - How to cleanly communicate that message 44 | - Technically, how to create that communication 45 | 46 | ## Media 47 | 48 | This will largely focus on *data visualization* but will also cover: 49 | 50 | - Exploratory data analysis 51 | - Tables 52 | - Notebooks 53 | - Workflow 54 | - Data cleaning and manipulation 55 | 56 | ## What We're Doing 57 | 58 | - This is sort of a mutt of a class, covering coding, communications concepts, data management and cleaning, and even some analytical stuff 59 | - That's because this is *really* a course about preparing you to go out into the real world and work with data 60 | - The thing I really want you to come away with after this class is this: 61 | - When you are working with data, **think carefully about what you're doing, and make sure you're doing it right.** 62 | - It really is just **thinking carefully** and **attending to detail**, not technical skill. This is more important than coding. 
We'll have to code, but code can only be as good as what you're trying to make it do 63 | 64 | ## What We're Doing 65 | 66 | - You're going to leave the world of a classroom where the project is set up so your prof knows there *is* a right answer, what that right answer is, and they can check your work 67 | - In the real world, *you* will be expected to produce good work on your own, which means *you'll* be checking your work, and there won't be a "wrong answer" buzzer that sounds if you make a mistake 68 | - I want you to build enough confidence in working with data that you can look at something the computer has spit out and think "the computer only does what it's told. I know what this *should* be. This is not yet right." 69 | 70 | ## What We're Doing 71 | 72 | So when you do **anything** with data - load a file, create a variable, make a graph, do an analysis, ask: 73 | 74 | 1. What am I **trying to do**? 75 | 1. Did it **work properly**? How can I **CHECK** if it worked properly? 76 | 1. If it's a form of communication, have you made it **as clear as possible**? Try to find places where people might get confused. 77 | 78 | Never think "I pushed the button/wrote the line that made the thing go. I'm done." It's on you to fulfill the requirement in a way that's **good**. Take pride in your work, or it will be bad. 79 | 80 | ## Noticing 81 | 82 | More than anything else, this class is about *noticing things*. I'm serious. 83 | 84 | - Data analysis is full of opportunities to mess up. Mistakes are unavoidable, and results that make no sense just happen 85 | - Being a good analyst means noticing mistakes, and noticing what your output looks like, and taking responsibility for fixing it 86 | - Be careful in your work. Take time. Review. 87 | - That's what I'm hoping to teach, and something I *will* be assessing you on 88 | 89 | ## What We're Doing 90 | 91 | It takes no technical/coding skill, or extensive explanation from me, to see why this graph is bad. 
If you produce this graph, you should take it upon yourself to fix it. Ask: how can I improve this? 92 | 93 | ```{r, echo = FALSE, eval = TRUE, fig.height = 3, fig.width = 6} 94 | data(gapminder, package = 'gapminder') 95 | countries <- gapminder %>% 96 | pull(country) %>% 97 | unique() %>% 98 | sort() 99 | gapminder %>% 100 | filter(year == 2007) %>% 101 | mutate(lexp = lifeExp/max(lifeExp) + .1) %>% 102 | mutate(countryy_2 = as.numeric(factor(country))) %>% 103 | ggplot(aes(x = countryy_2, y = lexp)) + geom_line() + 104 | scale_x_continuous(breaks = 1:length(countries), labels = \(x) countries[x]) + 105 | labs(title = 'Life expectancy compared to most') 106 | 107 | ``` 108 | 109 | ## Admin 110 | 111 | Let's do some housekeeping: 112 | 113 | - Course website 114 | - Syllabus 115 | - Expectations and assignments 116 | 117 | ## Resources 118 | 119 | In addition to the course website and Healy's [Data Visualization](https://kieranhealy.org/publications/dataviz/): 120 | 121 | - The [R Graphics Cookbook](https://r-graphics.org/) 122 | - [Data Visualization blogs](https://www.tableau.com/learn/articles/best-data-visualization-blogs) 123 | - [LOST-STATS](https://lost-stats.github.io) 124 | - The internet writ large 125 | - Code for all these slides (RMarkdown and ggplot2 examples) is on the course GitHub page 126 | 127 | ## What is Data Communication? 128 | 129 | - There is a lot of information in the world 130 | - And a lot of information at your fingertips 131 | - *Too much* 132 | - And so we simplify to tell the *story* underlying the data 133 | 134 | ## What is Data Communication? 135 | 136 | - I have a **result** from my data 137 | - I want you to **understand and believe** my result 138 | - How can I demonstrate this result to you so you'll understand it? 139 | - How can I present the data so that you understand where the result came from and why you should agree that the result is accurate?
140 | 141 | ## The Map and the Territory 142 | 143 | - Someone asks you for directions to Dick's on Broadway 144 | - Do you hand them your 3.2GB perfectly detailed shapefile of Capitol Hill? 145 | - The answer is in there, and much more precisely than you could possibly tell them 146 | - But it doesn't really answer their question, right? 147 | 148 | ## The Map and the Territory 149 | 150 | - The goal of a data *analyst* is to take that shapefile and figure out how to get to Dick's 151 | - The goal of a data communicator is to take what the data analyst figured out and figure out *what part of the map to show you to help you understand how to get to Dick's* 152 | - A good data communicator will make understanding the directions easy and obvious 153 | 154 | ## Storytelling With Data 155 | 156 | 1. Understand the context 157 | 1. Figure out *the story* (what you want the reader to understand, and why) 158 | 1. Choose an appropriate visual display 159 | 1. Eliminate clutter 160 | 1. Focus attention where you want it 161 | 1. Think like a designer 162 | 1. Tell the story 163 | 164 | ## Examples Outside of Data 165 | 166 | - Let's consider some examples of effective communication of information outside the narrow range of "data communication" 167 | 168 | 169 | ## How Does This Faucet Work? 170 | 171 | ```{r, out.width="650px"} 172 | knitr::include_graphics("Lecture_01_Faucet.jpg") 173 | ``` 174 | 175 | ## Understand the context 176 | 177 | The person who designed this faucet understands, hopefully, how water pipes work and how opening a valve can allow water to flow 178 | 179 | 180 | ## Figure out the story 181 | 182 | *What is important for the audience to know*? 183 | 184 | I don't care if the user understands how pipes work, or their history of water usage.
185 | 186 | I need the user to know how to properly turn the water on 187 | 188 | 189 | ## Choose an appropriate visual display 190 | 191 | We have a handle close to the source of the water 192 | 193 | It implies the ways the handle can be turned - towards us or away, left and right 194 | 195 |
The display doesn't give us any information about how those directions relate to pressure or temperature
(what version of a faucet might?) 196 | 197 | ## Eliminate clutter 198 | 199 | Nothin' but handle 200 | 201 | We could have other stuff here - sink stopper, an LCD with the weather report, but do we need it? 202 | 203 | ## Focus attention where you want it 204 | 205 | There's nowhere to look but the handle (other than the spigot, not pictured) 206 | 207 | The shape and design pushes you towards it - it calls for a hand! 208 | 209 | ## Think like a designer 210 | 211 | We want the user to understand that they can pull or rotate the handle to affect the water flow 212 | 213 | This design *affords* both of those uses 214 | 215 | And nothing else 216 | 217 | There aren't a lot of ways to use this wrong, other than messing up pressure vs. temperature 218 | 219 | ## Tell the story 220 | 221 | Water flow can be controlled by twisting this handle 222 | 223 | If you were a monkey who had never experienced plumbing, it would only take you about ten seconds to follow the design to that handle, pull it, and learn about the connection between handles and water flow 224 | 225 | ## Gapminder 226 | 227 | - Let's move into some data 228 | - Gapminder (from the Gapminder institute) is a data set that, among other things, shows how differences between countries change over time 229 | - One thing it is commonly used to show is that economic development aids health development 230 | - GDP per capita $\rightarrow$ life expectancy 231 | - Also, generally, both of those things have improved over time 232 | 233 | ## Gapminder 234 | 235 | ```{r} 236 | data(gapminder, package = 'gapminder') 237 | gapminder 238 | ``` 239 | 240 | ## Understand the Context 241 | 242 | How do things work here? 243 | 244 | - We know that life expectancy and GDP per capita go together closely 245 | 246 | ## Figure out the story 247 | 248 | What do we want people to learn? 
249 | 250 | - GDP per capita and life expectancy go together strongly 251 | - Both GDP per capita and life expectancy have increased a lot over time (i.e. things on this front are getting better!) 252 | 253 | This is useful information and I can see why someone would want to understand this for its own sake, to understand the world better (and perhaps have some actionable takeaway as a result!) 254 | 255 | ## Choose an appropriate visual display 256 | 257 | - We want something that will show a relationship between two variables with many observations 258 | - NOW WE HAVE SOME OUTPUT. Put on those "how can we make it better?" goggles: the first output we get is NOT DONE 259 | 260 | ```{r, fig.width=6, fig.height=4} 261 | ggplot(gapminder, 262 | aes(x = gdpPercap, y = lifeExp)) + 263 | geom_point() + 264 | labs(x = "GDP per Capita", y = "Life Expectancy") 265 | ``` 266 | 267 | ## Eliminate clutter 268 | 269 | - That's a lot of dots! Can we tell the same story by focusing on just a few countries? 270 | - Also, that's a lot of background ink... 271 | 272 | ```{r, fig.width=6, fig.height=4} 273 | ggplot(gapminder %>% 274 | filter(country %in% c('Brazil','India','Spain')), 275 | aes(x = gdpPercap, y = lifeExp, color = country)) + 276 | geom_line() + 277 | theme_minimal() + 278 | labs(x = "GDP per Capita", y = "Life Expectancy", 279 | color = 'Country') + 280 | theme(legend.position = c(.8,.2), 281 | legend.background = element_rect()) 282 | ``` 283 | 284 | ## Focus attention where you want it 285 | 286 | - Those few high-GDP observations are drawing a LOT of space, as opposed to that left blob. 
Let's put the x-axis on a log scale 287 | 288 | ```{r, fig.width=6, fig.height=4} 289 | ggplot(gapminder %>% 290 | filter(country %in% c('Brazil','India','Spain')), 291 | aes(x = gdpPercap, y = lifeExp, color = country)) + 292 | geom_line() + 293 | theme_minimal() + 294 | scale_x_log10() + 295 | labs(x = "GDP per Capita (log scale)", y = "Life Expectancy", 296 | color = 'Country') + 297 | theme(legend.position = c(.8,.2), 298 | legend.background = element_rect()) 299 | ``` 300 | 301 | ## Think like a designer 302 | 303 | - Why make the reader work? 304 | - Also, realize this graph sort of feels like it's moving forward in time. Uh-oh... 305 | 306 | ```{r, fig.width=6, fig.height=4} 307 | ggplot(gapminder %>% 308 | filter(country %in% c('Brazil','India','Spain')), 309 | aes(x = gdpPercap, y = lifeExp, color = country)) + 310 | geom_line() + 311 | theme_minimal() + 312 | scale_x_log10() + 313 | labs(x = "GDP per Capita (log scale)", y = "Life Expectancy", 314 | color = 'Country') + 315 | annotate('text',x = 2300, y = 66, label = 'India', color = 'green') + 316 | annotate('text',x = 10000, y = 68, label = 'Brazil', color = 'red') + 317 | annotate('text',x = 25000, y = 75, label = 'Spain', color = 'blue') + 318 | guides(color = 'none') 319 | ``` 320 | 321 | ## Tell the story 322 | 323 | - Realize that we've lost the "things get better over time" angle 324 | - And also lost the part where we want to talk about the whole world! 325 | - Use what we've done so far to think about how we can show the dual GDP-and-life-expectancy improvements *over time* for everyone 326 | 327 | ## What could still be improved? 
328 | 329 | ```{r} 330 | options(gganimate.dev_args = list(width = 650, height = 400)) 331 | (ggplot(gapminder, 332 | aes(x = gdpPercap, y = lifeExp, color = continent)) + 333 | geom_point() + 334 | theme_minimal() + 335 | scale_x_log10() + 336 | labs(x = "GDP per Capita (log scale)", y = "Life Expectancy", 337 | title = "GDP and Life Expectancy by Country, 1952-2007", 338 | color = 'Continent') + 339 | transition_time(year) + 340 | theme(axis.title.x = element_text(size=15), 341 | axis.title.y = element_text(size=15), 342 | title = element_text(size=15))) %>% 343 | animate(nframes = 200,end_pause = 30) 344 | ``` 345 | 346 | ## Let's See What We Can Get 347 | 348 | - Find a "good" chart from [informationisbeautiful.net](https://informationisbeautiful.net) 349 | - And a "bad" chart from [junkcharts.typepad.com/](https://junkcharts.typepad.com/) 350 | - And go through our steps. What are they trying to say? How do they do it? What mistakes do they make? What is unclear? 351 | - We will discuss 352 | 353 | 354 | -------------------------------------------------------------------------------- /Lecture_01_Faucet.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_01_Faucet.jpg -------------------------------------------------------------------------------- /Lecture_02_Bible_Cross_Ref.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Bible_Cross_Ref.gif -------------------------------------------------------------------------------- /Lecture_02_Bible_Cross_Ref.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Bible_Cross_Ref.png 
-------------------------------------------------------------------------------- /Lecture_02_Clutter_and_Focus.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 2: Clutter and Focus" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | library(tidyverse) 19 | library(lubridate) 20 | library(gghighlight) 21 | library(ggthemes) 22 | ``` 23 | 24 | 25 | 26 | ## Clutter and Focus 27 | 28 | ```{r, results = 'asis'} 29 | cat(" 30 | ") 36 | ``` 37 | 38 | - We want to tell a story with our data 39 | - That is, we want to *demonstrate some interesting finding* from our data 40 | - Some stories are clear 41 | - Others are complex 42 | - They all need to be clear 43 | 44 | ## Clutter and Focus 45 | 46 | - A common mistake is to *show you the data* rather than *show you the story* 47 | - The story often exists within part of the data 48 | - We want to understand some sort of *continuity* or some sort of *distinction* 49 | - Everything that takes away from that makes the story more opaque 50 | 51 | ## Examples 52 | 53 | - Let's look at some data visualizations that are cluttered 54 | - Some are still *beautiful* and lovely and works of art 55 | - Some are impressive with the amount of information they're able to get on one graph 56 | - But they can obscure the point they're trying to make 57 | - With each, let's think about why 58 | 59 | ## Bible Cross-References 60 | 61 | - The following graph shows how different chapters of the bible refer to each other 62 | 63 | --- 64 | 65 | ```{r} 66 | knitr::include_graphics('Lecture_02_Bible_Cross_Ref.gif') 67 | ``` 68 | 69 | 
## Coffee 70 | 71 | - How much coffee do South Koreans drink? 72 | - Simple question, right? 73 | 74 | --- 75 | 76 | ```{r} 77 | knitr::include_graphics('Lecture_02_Coffee.jpg') 78 | ``` 79 | 80 | 81 | ## Health care 82 | 83 | - This graph just wants to get across three statistics about health care 84 | - And how those statistics vary by state 85 | 86 | --- 87 | 88 | ```{r} 89 | knitr::include_graphics('Lecture_02_Healthcare.PNG') 90 | ``` 91 | 92 | ## Network Graphs 93 | 94 | - You see a lot of *network graphs* floating around these days 95 | - They look super cool 96 | - But they can be hard to actually learn anything from 97 | - Unless the story is just "things sure are connected, aren't they?" which isn't all that interesting 98 | 99 | --- 100 | 101 | ```{r} 102 | knitr::include_graphics('Lecture_02_Network_Graph.jpg') 103 | ``` 104 | 105 | ## Napoleon's March 106 | 107 | - This is considered one of the most beautiful and impressive pieces of data viz ever, about Napoleon's army during its march and retreat from Russia 108 | - Distance from Paris, size of army, elevation, places where the army met itself... 109 | - It absolutely tells an intricate and detailed story - but what can you take away from it if you don't already know what it's trying to say? 110 | - What parts of the story does it guide you towards focusing on? Which get lost?
111 | 112 | --- 113 | 114 | ```{r} 115 | knitr::include_graphics('Lecture_02_Minard.gif') 116 | ``` 117 | 118 | ## Clutter and Focus 119 | 120 | - These graphs vary in quality but share some common issues: 121 | - They add things to the graph that do not need to be there or are difficult to understand 122 | - They do not focus your attention on a single takeaway 123 | 124 | ## Clutter and Focus 125 | 126 | - Clutter adds cognitive load - it makes your viz more difficult to understand 127 | - There are many very low-level visual shortcuts in the human brain we can take advantage of to avoid this 128 | - And to make sure that we can anticipate how the reader will interpret the visualization 129 | 130 | ## Gestalt Principles of Visual Perception 131 | 132 | - Proximity 133 | - Similarity 134 | - Enclosure 135 | - Closure 136 | - Continuity 137 | - Connection 138 | 139 | ## Proximity 140 | 141 | - If you put things close to each other, they tend to get thought of as being associated 142 | - And they are at least visually comparable 143 | - Putting two similar things close together says "these things are very similar!" 144 | - Putting two distinct things close together says "compare these please!" 145 | 146 | ## Proximity 147 | 148 | | | Budget | 149 | |-----------|---------| 150 | | Sales | $10000 | 151 | | Marketing | $15000 | 152 | | | | 153 | | R&D | $5000 | 154 | | HR | $6000 | 155 | 156 | ## Similarity 157 | 158 | - Simply put, things that are made similar in some way (shape, color, shade, etc.) are perceived as part of the same whole 159 | 160 | Note on using color for this: Remember colorblindness! 161 | 162 | - Most common is red/green/(orange), blue/yellow also relatively common 163 | - When picking colors, avoid having your contrasts be red/green or blue/yellow. Red/blue is a popular choice here. 164 | - For complete accessibility, focus on highly contrasting *shades* 165 | - Colorblind-friendly palettes are available!
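
Colorblind-friendly palettes are easy to apply in ggplot2. The following is a minimal sketch, not the only way to do it: it assumes R 4.0 or later (for the built-in Okabe-Ito palette) and uses the built-in `mtcars` data. The names `car_data` and `cb_colors` are only illustrative, and mapping `shape` as well as `color` is one optional way to avoid relying on color alone.

```r
library(ggplot2)

# Label transmission type so the legend is readable (0 = automatic, 1 = manual)
car_data <- mtcars
car_data$Transmission <- ifelse(car_data$am == 0, "Automatic", "Manual")

# Okabe-Ito is a colorblind-friendly palette that ships with base R (4.0+).
# Drop the first entry (black) and keep the next two colors.
# unname() matters: scale_color_manual() would otherwise try to match the
# palette's color names against the Transmission values.
cb_colors <- unname(palette.colors(n = 3, palette = "Okabe-Ito"))[2:3]

p <- ggplot(car_data,
            aes(x = wt, y = mpg, color = Transmission, shape = Transmission)) +
  geom_point(size = 3) +
  scale_color_manual(values = cb_colors) +
  theme_minimal() +
  labs(x = "Car Weight",
       y = "Mileage (MPG)",
       color = "Transmission Type",
       shape = "Transmission Type")
p
```

Giving `color` and `shape` the same legend title lets ggplot2 merge them into one legend, so the second cue adds no clutter.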
166 | 167 | ## Similarity 168 | 169 | ```{r} 170 | data(mtcars) 171 | mtcars <- mtcars %>% 172 | mutate(Transmission = case_when( 173 | am == 0 ~ 'Automatic', 174 | am == 1 ~ 'Manual' 175 | )) 176 | ggplot(mtcars, aes(x = wt, y = mpg, color = Transmission))+ 177 | geom_point() + 178 | theme_minimal() + 179 | labs(x = 'Car Weight', 180 | y = 'Mileage (MPG)', 181 | color = 'Transmission Type') 182 | ``` 183 | 184 | ## Enclosure 185 | 186 | - If you put a physical enclosure around some things, they will be perceived as being part of a group 187 | - This can be especially handy if you're already using some other forms of similarity, or need to indicate areas that WOULD be part of that group if you had data there. 188 | - Enclosures say "these things are in a group!" - perhaps you want to compare that group to other things, or focus attention *within* the group? 189 | 190 | ## Enclosure 191 | 192 | ```{r} 193 | knitr::include_graphics('Lecture_02_Unemployment.PNG') 194 | ``` 195 | 196 | ## Continuity 197 | 198 | - When there are gaps, we tend to "fill in" in the most intuitive way 199 | 200 | ## Continuity 201 | 202 | - This year-on-year change graph has a gap at Feb. 29. But what does our brain do?
203 | 204 | ```{r, fig.height = 4} 205 | set.seed(1000) 206 | # Seven-day moving average 207 | ma <- function(x,n=7){as.numeric(stats::filter(as.ts(x),rep(1/n,n), sides=1))} 208 | tb1 <- tibble(dayofyr = 0:90,Day = ymd('2020-01-01'),Sales2020 = ma(runif(91,min=.5,max = .6))) %>% 209 | mutate(Day = Day + days(dayofyr)) %>% 210 | mutate(Mon = month(Day), 211 | DoM = day(Day)) 212 | tb2 <- tibble(dayofyr = 0:89,Day = ymd('2019-01-01'),Sales2019 = ma(runif(90,min=.47,max=.57))) %>% 213 | mutate(Day = Day + days(dayofyr)) %>% 214 | mutate(Mon = month(Day), 215 | DoM = day(Day)) %>% 216 | select(-dayofyr, -Day) 217 | graphdat <- tb1 %>% 218 | left_join(tb2,by=c('Mon','DoM')) %>% 219 | mutate(YOY = (Sales2020/Sales2019) - 1) 220 | 221 | ggplot(graphdat, aes(x = Day, y = YOY)) + 222 | geom_line() + 223 | labs(y = "Year-on-Year Change", 224 | title = "Changes in Sales Year-to-Year") + 225 | theme_minimal() + 226 | scale_y_continuous(labels = scales::percent) 227 | ``` 228 | 229 | ## Connection 230 | 231 | - If you put a literal connection between two points, people will interpret them as being connected! 232 | - Connecting two things says "one of these things is adjacent or linked to the other" 233 | - No big surprise 234 | - See any line graph for this. 235 | - This is also a good reason *not* to use a line graph when your x-axis doesn't have an order to it! 236 | 237 | ## How? 238 | 239 | - With these concepts in mind, how can we use them to emphasize information? 240 | - And improve clarity 241 | - Let's begin with a bad first-pass graph and improve it as we can 242 | - Let's tell a story about how Washington's 4th graders fare in math vs.
other coastal states 243 | 244 | ## State NAEP Test Scores 245 | 246 | ```{r} 247 | naep <- read_csv('Lecture_02_NAEP.csv') 248 | ggplot(naep, 249 | aes(x = StateCode, y = Score)) + 250 | geom_col() + 251 | theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = -.1)) + 252 | labs(title = "Fourth Grade Math NAEP Score by State", 253 | x = "State", 254 | y = "4th Grade Math NAEP") 255 | ``` 256 | 257 | ## Ordering Information 258 | 259 | - Information should be presented in an order that highlights the information of interest 260 | - Comparisons should be easy to make 261 | - Ask: what should be comparable? 262 | - We should be able to tell "more" or "less" test score 263 | - We use *proximity* to make more-similar scores go together 264 | - **Think**: what bar ordering makes the takeaway as easy to see as possible? 265 | 266 | ## Proximity: Example 267 | 268 | ```{r} 269 | ggplot(naep %>% 270 | arrange(Score) %>% 271 | mutate(StateCode = factor(StateCode, levels = .$StateCode)), 272 | aes(x = StateCode, y = Score)) + 273 | geom_col() + 274 | theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = -.1)) + 275 | labs(title = "Fourth Grade Math NAEP Score by State", 276 | x = "State", 277 | y = "4th Grade Math NAEP") 278 | ``` 279 | 280 | ## Using similarity 281 | 282 | - We may be interested in comparing Washington vs. other coastal states and saying something about the difference. 
283 | - We can do this easily by giving all the coastal states a similar *something* 284 | - With bar graphs, an obvious pick is color 285 | 286 | ## Similarity: Example 287 | 288 | ```{r} 289 | ggplot(naep %>% 290 | arrange(Score) %>% 291 | mutate(StateCode = factor(StateCode, levels = .$StateCode), 292 | Coastal = factor(Coastal, labels = c('Not Coastal','Coastal'))), 293 | aes(x = StateCode, y = Score, fill = Coastal)) + 294 | geom_col() + 295 | theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = -.1)) + 296 | labs(title = "Fourth Grade Math NAEP Score by State", 297 | x = "State", 298 | y = "4th Grade Math NAEP") 299 | ``` 300 | 301 | ## Clarity 302 | 303 | - We have our comparison clarified 304 | - Let's see what else we can do with this graph 305 | 306 | ## Slices of Data 307 | 308 | - What data is important to our story, and what data is not? 309 | - We are interested in comparing Washington to other coastal states, so why do we need all these other states? 310 | - Having stuff on the graph says "this is worth your attention somehow." If it's not important to the story, it will confuse or distract from the story 311 | - Chuck 'em! 312 | - This will also give us room to rotate those axis labels 313 | 314 | 315 | ## Clarity 316 | 317 | ```{r} 318 | ggplot(naep %>% 319 | arrange(Score) %>% 320 | mutate(StateCode = factor(StateCode, levels = .$StateCode)) %>% 321 | filter(Coastal == 1), 322 | aes(x = StateCode, y = Score)) + 323 | geom_col() + 324 | labs(title = "Fourth Grade Math NAEP Score by State", 325 | x = "State", 326 | y = "4th Grade Math NAEP") 327 | ``` 328 | 329 | ## Clarity 330 | 331 | - And dare we? 332 | - It's a debate as to whether it ever makes sense for your $y$-axis not to start at zero. But here it's really making things hard to see. Clarity could be improved by starting the axis at, say, 100!
333 | - We will leave this as-is, but it is something to think about for future improvement 334 | 335 | 336 | ## Contrast 337 | 338 | - We can use contrast to make Washington stand out again 339 | - Use a light color for others so they fade more into the back 340 | 341 | 342 | ## Contrast 343 | 344 | ```{r} 345 | ggplot(naep %>% 346 | arrange(Score) %>% 347 | mutate(StateCode = factor(StateCode, levels = .$StateCode)) %>% 348 | filter(Coastal == 1), 349 | aes(x = StateCode, y = Score)) + 350 | geom_col() + 351 | labs(title = "Fourth Grade Math NAEP Score by State", 352 | x = "State", 353 | y = "4th Grade Math NAEP") + 354 | gghighlight(StateCode == 'WA') 355 | ``` 356 | 357 | ## Simplify 358 | 359 | - Enclosure allows us to get rid of a lot of the borders 360 | - And we can remove the backing ink too 361 | - **Don't underestimate the benefits of removing background ink.** It can really be distracting! Cleaner graphs look cleaner and are often nicer. 362 | 363 | ## Simplify 364 | 365 | ```{r} 366 | ggplot(naep %>% 367 | arrange(Score) %>% 368 | mutate(StateCode = factor(StateCode, levels = .$StateCode)) %>% 369 | filter(Coastal == 1), 370 | aes(x = StateCode, y = Score)) + 371 | geom_col() + 372 | labs(title = "Fourth Grade Math NAEP Score by State", 373 | x = "State", 374 | y = "4th Grade Math NAEP") + 375 | gghighlight(StateCode == 'WA') + 376 | theme_tufte() 377 | ``` 378 | 379 | ## Easily Following Information 380 | 381 | - **Think**: have we made the presentation as easy to see as possible? How could we make it more clear?
382 | - Especially with long text labels, horizontal bar charts are much easier to read and compare 383 | 384 | ## Easily Following Information 385 | 386 | ```{r} 387 | ggplot(naep %>% 388 | arrange(Score) %>% 389 | mutate(StateCode = factor(StateCode, levels = .$StateCode), 390 | State = factor(State, levels = .$State)) %>% 391 | filter(Coastal == 1), 392 | aes(x = StateCode, y = Score)) + 393 | geom_col() + 394 | labs(title = "Fourth Grade Math NAEP Score by State", 395 | x = "State", 396 | y = "4th Grade Math NAEP") + 397 | gghighlight(StateCode == 'WA') + 398 | theme_tufte() + 399 | coord_flip() 400 | ``` 401 | 402 | ## Label Data Directly 403 | 404 | - **Think**: have we provided *affordances*? Have we put the right answer in the place where you'd think to look for it? 405 | - Why make the reader work? Put the label right where it's needed 406 | - Also, remove the cognitive steps of translating the markers 407 | 408 | ## Label Data Directly 409 | 410 | ```{r} 411 | ggplot(naep %>% 412 | arrange(Score) %>% 413 | mutate(StateCode = factor(StateCode, levels = .$StateCode), 414 | State = factor(State, levels = .$State)) %>% 415 | filter(Coastal == 1), 416 | aes(x = StateCode, y = Score)) + 417 | geom_col(fill = 'blue') + 418 | geom_text(aes(label = State, x = StateCode, y = Score + 3), hjust = 0) + 419 | scale_y_continuous(limits = c(0,300)) + 420 | labs(title = "Fourth Grade Math NAEP Score by State", 421 | x = "State", 422 | y = "4th Grade Math NAEP") + 423 | gghighlight(StateCode == 'WA') + 424 | theme_tufte() + 425 | coord_flip() + 426 | theme(axis.text.y = element_blank(), 427 | axis.ticks.y = element_blank()) 428 | ``` 429 | 430 | ## Now - Your Turn! 431 | 432 | - You'll be creating a graph by hand 433 | - Think carefully about how to make data comparable 434 | - And how to contrast as needed 435 | - And how to tell the story 436 | 437 | ## Your Turn 438 | 439 | - Story: Sales and Marketing may be more expensive, but they haven't grown as much as R&D and HR. 
440 | 441 | 442 | ```{r} 443 | df <- read.csv(text="Dept: Sales Marketing R&D HR 444 | 2018 10000 15000 5000 6000 445 | 2019 10500 15000 7200 6500 446 | 2020 10600 16000 9300 8000",sep='\t') 447 | 448 | df %>% 449 | knitr::kable(caption = 'Costs in thousands of dollars by department by year', 450 | format = 'html') %>% 451 | kableExtra::kable_styling(bootstrap_options = 'striped') 452 | ``` 453 | 454 | -------------------------------------------------------------------------------- /Lecture_02_Coffee.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Coffee.jpg -------------------------------------------------------------------------------- /Lecture_02_DuBois.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_DuBois.PNG -------------------------------------------------------------------------------- /Lecture_02_Healthcare.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Healthcare.PNG -------------------------------------------------------------------------------- /Lecture_02_Minard.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Minard.gif -------------------------------------------------------------------------------- /Lecture_02_NAEP.csv: -------------------------------------------------------------------------------- 1 | State,StateCode,Score,Coastal 2 | Minnesota ,MN,248,0 3 | Massachusetts ,MA,247,1 4 | Virginia ,VA,247,1 5 | Florida ,FL,246,1 6 | New Jersey ,NJ,246,1 7 | Wyoming 
,WY,246,0 8 | Indiana ,IN,245,0 9 | New Hampshire ,NH,245,0 10 | Pennsylvania ,PA,244,0 11 | Utah ,UT,244,0 12 | Nebraska ,NE,244,0 13 | Texas ,TX,244,0 14 | Connecticut ,CT,243,1 15 | North Dakota ,ND,243,0 16 | Idaho ,ID,242,0 17 | Colorado ,CO,242,0 18 | Wisconsin ,WI,242,0 19 | North Carolina ,NC,241,1 20 | South Dakota ,SD,241,0 21 | Ohio ,OH,241,0 22 | Montana ,MT,241,0 23 | Maine ,ME,241,1 24 | Mississippi ,MS,241,0 25 | Iowa ,IA,241,0 26 | Tennessee ,TN,240,0 27 | Washington ,WA,240,1 28 | Kansas ,KS,239,0 29 | Rhode Island ,RI,239,1 30 | Delaware ,DE,239,1 31 | Kentucky ,KY,239,0 32 | Vermont ,VT,239,0 33 | Hawaii ,HI,239,1 34 | Maryland ,MD,239,1 35 | Missouri ,MO,238,0 36 | Georgia ,GA,238,1 37 | Arizona ,AZ,238,0 38 | Illinois ,IL,237,0 39 | Oklahoma ,OK,237,0 40 | South Carolina ,SC,237,1 41 | New York ,NY,237,1 42 | Oregon ,OR,236,1 43 | Michigan ,MI,236,0 44 | Nevada ,NV,236,0 45 | California ,CA,235,1 46 | District of Columbia ,DC,235,1 47 | Arkansas ,AR,233,0 48 | Alaska ,AK,232,1 49 | West Virginia ,WV,231,0 50 | Louisiana ,LA,231,0 51 | New Mexico ,NM,231,0 52 | Alabama ,AL,230,0 53 | Puerto Rico ,PR,185,1 54 | -------------------------------------------------------------------------------- /Lecture_02_NAEP.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_NAEP.xlsx -------------------------------------------------------------------------------- /Lecture_02_Network_Graph.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Network_Graph.jpg -------------------------------------------------------------------------------- /Lecture_02_Unemployment.PNG: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_02_Unemployment.PNG -------------------------------------------------------------------------------- /Lecture_03_Audience.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 3: Focus and Audience" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | { 19 | library(tidyverse) 20 | library(lubridate) 21 | library(gghighlight) 22 | library(ggthemes) 23 | library(directlabels) 24 | library(RColorBrewer) 25 | } 26 | ``` 27 | 28 | ## Who is it For? 29 | 30 | ```{r, results = 'asis'} 31 | cat(" 32 | ") 38 | ``` 39 | 40 | - It's impossible to do good communication without thinking about who you're communicating *to* 41 | - Something you make for a highly technical audience should be different from something you make for a non-technical boss, or an outsider, or the general population 42 | - And, crucially, all of it should be different from what you'd make *for yourself* 43 | - **Communication is not about what you say, it's about what people hear**. Different people hear differently. 44 | 45 | ## Who is it For? 46 | 47 | Always ask: 48 | 49 | - Who is going to see this? 50 | - *What do they know?* 51 | - What don't they know? 52 | - What do they *need to know?* 53 | 54 | ## What do they Know? 55 | 56 | - Keep in mind the technical skill of your audience - have they seen this graph type before? Do they know what a median is? Can my boss read a graph (yes really)? 
57 | - Keep in mind the *contextual* knowledge of your audience - are you using lingo they don't know? Abbreviations? 58 | - Keep in mind *where their head is at* - you've been staring at the data for hours so you know "AR" means "annual revenue." Is that what they'll think if they see "AR"? You've done a million projects before so you know that $1 million is way above-average for March. Will they recognize that? 59 | 60 | ## Presentation 61 | 62 | - Your audience does not know what you know 63 | - In many cases they may not care as much as you care 64 | - And even if they know and care, they may not understand the way you think about it 65 | - Don't be afraid to hold their hand 66 | - Even if they *are* very familiar with what you're doing, they'll appreciate clarity. It will let them know that *you* know what you're doing 67 | - **Don't assume people will read your work super super closely. Make your work such that the right conclusion to draw is also the most obvious one** 68 | 69 | ## Tips 70 | 71 | - Have someone else look at your work and tell you what they think it means (maybe trade with them in this class?) 72 | - After finishing something, give yourself a little break from it and come back. Ask: "if I had no idea about what this was, would I figure it out? What conclusion would I draw? Would it be the right one?" 73 | - And "What could I change to make the proper conclusion even more obvious?" How can you **focus** attention? 74 | - It can rarely be too obvious! Rarely are there points for subtlety here. 75 | - As with *anything* (not just thinking about audience) always **look at your work and ask if it looks right** 76 | 77 | 78 | ## Example 79 | 80 | - What are some communication errors here? 
81 | 82 | ```{r, echo = FALSE, eval = TRUE} 83 | ggplot(readRDS('king_dailyvisits.Rdata'), aes(x = reorder(naics_title, visits_by_day), y = visits_by_day)) + 84 | geom_col(color = 'lightblue') + 85 | coord_flip() + 86 | ggpubr::theme_pubr() + 87 | labs(y = 'Visits', 88 | x = NULL, 89 | title = 'Top Five Industries') 90 | ``` 91 | 92 | ## Mistakes 93 | 94 | - What are "visits"? (Logged mobile-phone visits to locations in these industries, but how would you know?) 95 | - Top five industries... at what? Where? (Seattle in June 2020, but how would you know?) 96 | - Scientific notation (Not everyone can read this!) 97 | - What are those industries? Nature parks we can guess, maybe the restaurant types. But what are Lessors of Nonresidential Buildings (except Miniwarehouses) and Snack and Nonalcoholic Beverage Bars? (Malls and Cafes) 98 | - Why are restaurants two industries? 99 | 100 | 101 | ## The Three Rules of Presentation 102 | 103 | A good way to focus attention on the story you want them to understand is to prepare them to receive information 104 | 105 | - Tell them what you're going to tell them 106 | - Tell them 107 | - Tell them what you told them 108 | 109 | 110 | ## Presentation 111 | 112 | - When transmitting information, always let the reader know where to go. **Give them a bucket to put the information in before you give them the information**. 113 | - Really well-done research or argumentative writing makes the reader think of the next point you're going to make before you even make it 114 | - Similarly, well done data communication doesn't assume knowledge of the reader and is willing to direct their attention. 115 | 116 | ## A Demonstration 117 | 118 | - Watch for how [this neuroscientist](https://www.youtube.com/watch?v=opqIa5Jiwuw) explains something to different audiences 119 | - How does he adjust both content and delivery style to who he's talking to? 120 | - How does he give them a bucket for the information he's about to provide? 
121 | - How does he get them engaged with what he's talking about? 122 | 123 | ## Guiding Our Audience 124 | 125 | Beyond considering our audience's capabilities, we also want to think about how to guide their attention 126 | 127 | - How can we nudge them as to where to look, what to compare, what continuities to see? 128 | - How can we prepare them to receive the story we want them to understand? 129 | 130 | ## Focusing Attention 131 | 132 | - We want to tell a story with our data 133 | - Some stories are simple 134 | - Others are complex 135 | - They all need to be clear 136 | - And we may need to walk the reader along 137 | 138 | ## Foreground and Background 139 | 140 | - We are already trying to focus our reader by making the viz in the first place 141 | - After all, we could avoid focusing them entirely by just showing the full data set, but we don't do that 142 | - Can we focus them further to make the story even clearer? 143 | 144 | ## Foreground and Background 145 | 146 | - Think about what you want to highlight - these are the parts that will draw the eye and get the most attention 147 | - You don't have to limit yourself to *only* including the information that drives the story home (although do get rid of as much clutter as possible) 148 | - But the important parts, the ones that produce the right conclusion, should be *as visible as possible* 149 | - In writing, you may have learned about inverted-pyramid structure 150 | - In visualization there are other tricks we can use 151 | 152 | ## My Mom's Car 153 | 154 | - What does the driver want to know? And what is highlighted? 
155 | 156 | ![The Dash of My Mom's Car](Lecture_03_Cardash.jpg) 157 | 158 | ## My Mom's Car 159 | 160 | - The driver probably wants to know how fast they are going 161 | - And how full their tank is 162 | - Instead the focus is heavily on the RPM of the engine, and the other things are tucked back, faded out by glare, and not where you immediately look 163 | - (more confusing is that the RPM is a dial that looks like a typical speedometer, but isn't) 164 | 165 | 166 | ## Preattentive Attributes 167 | 168 | 169 | - How can we direct the eye where we need it to go? 170 | - Thankfully, (most of) you have been looking with your eyes your whole life! 171 | - So this should be pretty intuitive 172 | 173 | --- 174 | 175 | ```{r} 176 | knitr::include_graphics('Lecture_03_Preattentive.PNG') 177 | ``` 178 | 179 | ## Preattentive Attributes in Images 180 | 181 | - Our brain knows what to look out for 182 | - This should be pretty intuitive 183 | - The purpose here is to emphasize one important thing!
184 | - Let's go through some commonly usable ones: 185 | 186 | ## Color 187 | 188 | ```{r} 189 | data(gapminder, package='gapminder') 190 | 191 | ggplot(gapminder %>% 192 | filter(year == 2007), 193 | aes(x = gdpPercap, y = lifeExp, color = continent == 'Asia')) + 194 | geom_point() + 195 | scale_x_log10() + 196 | labs(x = 'GDP Per Capita (log scale)', 197 | y = 'Life Expectancy', 198 | title = 'Asia\'s Low-Income Countries have High Life Expectancy') + 199 | theme_minimal() + 200 | guides(color = FALSE) + 201 | scale_color_manual(values = c('#33ccFF','red')) + 202 | annotate(geom='text',x=20000,y=65,color='red',label='Asia',size=8) 203 | ``` 204 | 205 | ## Intensity 206 | 207 | ```{r} 208 | ggplot(gapminder %>% 209 | filter(year == 2007), 210 | aes(x = gdpPercap, y = lifeExp, color = continent == 'Asia', alpha = continent == 'Asia')) + 211 | geom_point() + 212 | scale_x_log10() + 213 | labs(x = 'GDP Per Capita (log scale)', 214 | y = 'Life Expectancy', 215 | title = 'Asia\'s Low-Income Countries have High Life Expectancy') + 216 | theme_minimal() + 217 | guides(color = FALSE, alpha = FALSE) + 218 | scale_color_manual(values = c('#33ccFF','red')) + 219 | scale_alpha_manual(values = c(.4,1)) + 220 | annotate(geom='text',x=20000,y=65,color='red',label='Asia',size=8) 221 | ``` 222 | 223 | ## Size 224 | 225 | 226 | ```{r} 227 | ggplot(gapminder %>% 228 | filter(year == 2007), 229 | aes(x = gdpPercap, y = lifeExp, 230 | color = continent == 'Asia', 231 | alpha = continent == 'Asia', 232 | size = continent == 'Asia')) + 233 | geom_point() + 234 | scale_x_log10() + 235 | labs(x = 'GDP Per Capita (log scale)', 236 | y = 'Life Expectancy', 237 | title = 'Asia\'s Low-Income Countries have High Life Expectancy') + 238 | theme_minimal() + 239 | guides(color = FALSE, alpha = FALSE, size = FALSE) + 240 | scale_color_manual(values = c('#33ccFF','red')) + 241 | scale_alpha_manual(values = c(.4,1)) + 242 | scale_size_manual(values = c(1,2)) + 243 | 
annotate(geom='text',x=20000,y=65,color='red',label='Asia',size=8) 244 | ``` 245 | 246 | ## Width 247 | 248 | ```{r} 249 | gapyrs<- gapminder %>% 250 | group_by(continent, year) %>% 251 | summarize(lifeExp = mean(lifeExp)) 252 | 253 | ggplot(gapyrs, aes(x = year, y = lifeExp, group = continent, size = continent == 'Asia')) + 254 | geom_line() + 255 | geom_dl(aes(x = year + .5,label = continent), method = 'last.bumpup') + 256 | scale_x_continuous(limits = c(1950,2015)) + 257 | labs(x = 'Year', 258 | y = 'Average Life Expectancy', 259 | title = 'Overall Life Expectancy in Asia is Low') + 260 | scale_size_manual(values = c(.5, 1.5)) + 261 | guides(size = FALSE) + 262 | theme_tufte() 263 | ``` 264 | 265 | ## Enclosure 266 | 267 | ```{r} 268 | ggplot(gapminder %>% 269 | filter(year == 2007), 270 | aes(x = gdpPercap, y = lifeExp)) + 271 | geom_point() + 272 | scale_x_log10() + 273 | labs(x = 'GDP Per Capita (log scale)', 274 | y = 'Life Expectancy', 275 | title = 'There is a Small Cluster of Rich and Long-Lived Countries') + 276 | theme_minimal() + 277 | annotate('rect', xmin = 16000, xmax = 50000, ymin = 68, ymax= 85, alpha = .2)+ 278 | annotate('text', x = 16000, y = 62, hjust = 0, 279 | label = 'Even here,\nthere is a\npositive\nrelationship.') 280 | ``` 281 | 282 | ## Keep in Mind 283 | 284 | - With all of these attributes, some are more important than others! 
285 | - And more intensity generally means more attention 286 | - Rather than providing a direct order, remember that it's subjective - try things out, see what is intuitive 287 | 288 | ## An Example 289 | 290 | ```{r} 291 | data(SPrail, package='pmdplyr') 292 | 293 | SPrail_dow <- SPrail %>% 294 | filter(!is.na(fare)) %>% 295 | mutate(fare = factor(fare)) %>% 296 | mutate(dow = factor(weekdays(start_date), levels = c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))) %>% 297 | group_by(fare, dow) %>% 298 | summarize(price = mean(price, na.rm = TRUE)) 299 | 300 | ggplot(SPrail_dow, aes(x = fare, y = price, fill = dow)) + 301 | geom_bar(position = 'dodge', stat = 'identity') + 302 | labs(fill = 'Day of Week', 303 | x = 'Fare Type', 304 | y = 'Average Price', 305 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') + 306 | theme_minimal() 307 | 308 | ``` 309 | 310 | ## Use Color Sparingly 311 | 312 | ```{r} 313 | SPrail_dow <- SPrail %>% 314 | filter(!is.na(fare)) %>% 315 | mutate(fare = factor(fare)) %>% 316 | mutate(dow = factor(weekdays(start_date), levels = c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))) %>% 317 | mutate(weekend = dow %in% c('Saturday','Sunday')) %>% 318 | group_by(fare, dow, weekend) %>% 319 | summarize(price = mean(price, na.rm = TRUE)) %>% 320 | ungroup() 321 | 322 | ggplot(SPrail_dow, aes(x = fare, y = price, group = dow, fill = weekend)) + 323 | geom_bar(position = 'dodge', stat = 'identity') + 324 | labs(fill = 'Weekend', 325 | x = 'Fare Type', 326 | y = 'Average Price', 327 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') + 328 | theme_minimal() 329 | 330 | ``` 331 | 332 | ## Choose Colors for Focus 333 | 334 | ```{r} 335 | 336 | ggplot(SPrail_dow, aes(x = fare, y = price, group = dow, fill = weekend)) + 337 | geom_bar(position = 'dodge', stat = 'identity') + 338 | labs(fill = 'Weekend', 339 | x = 'Fare Type', 340 | y = 'Average Price', 341 | title = 
'Flexible-Ticket Resale Prices are Lower on Weekends') + 342 | theme_minimal() + 343 | scale_fill_manual(values = c('#33ccFF','red')) 344 | 345 | ``` 346 | 347 | 348 | ## Also Make Them Look Good 349 | 350 | ```{r} 351 | ggplot(SPrail_dow, aes(x = fare, y = price, group = dow, fill = weekend)) + 352 | geom_bar(position = 'dodge', stat = 'identity') + 353 | labs(fill = 'Weekend', 354 | x = 'Fare Type', 355 | y = 'Average Price', 356 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') + 357 | theme_minimal() + 358 | scale_fill_manual(values = c('gray','#33ccFF')) 359 | ``` 360 | 361 | ## Choose Intensity for Focus 362 | 363 | ```{r} 364 | ggplot(SPrail_dow, aes(x = fare, 365 | y = price, 366 | group = dow, 367 | fill = weekend, 368 | alpha = fare == 'Flexible')) + 369 | geom_bar(position = 'dodge', stat = 'identity') + 370 | labs(fill = 'Weekend', 371 | x = 'Fare Type', 372 | y = 'Average Price', 373 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') + 374 | theme_tufte() + 375 | scale_fill_manual(values = c('gray','#33ccFF')) + 376 | scale_alpha_manual(values = c(.4,1)) + 377 | guides(alpha = FALSE) 378 | ``` 379 | 380 | ## Proximity for Focus and Comparison 381 | 382 | ```{r} 383 | SPrail_dow <- SPrail %>% 384 | filter(!is.na(fare)) %>% 385 | mutate(fare = factor(fare, levels = c('Adulto ida','Promo','Promo +','Flexible'))) %>% 386 | mutate(dow = factor(weekdays(start_date), levels = c('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'))) %>% 387 | mutate(weekend = dow %in% c('Saturday','Sunday')) %>% 388 | group_by(fare, dow, weekend) %>% 389 | summarize(price = mean(price, na.rm = TRUE)) %>% 390 | ungroup() 391 | 392 | ggplot(SPrail_dow, aes(x = fare, 393 | y = price, 394 | group = dow, 395 | fill = weekend, 396 | alpha = fare == 'Flexible')) + 397 | geom_bar(position = 'dodge', stat = 'identity') + 398 | labs(fill = 'Weekend', 399 | x = 'Fare Type', 400 | y = 'Average Price', 401 | title = 'Flexible-Ticket Resale 
Prices are Lower on Weekends') + 402 | theme_tufte() + 403 | scale_fill_manual(values = c('gray','#33ccFF')) + 404 | scale_alpha_manual(values = c(.4,1)) + 405 | guides(alpha = FALSE) 406 | ``` 407 | 408 | ## Declutter 409 | 410 | - Labels might not even be necessary! 411 | 412 | ```{r} 413 | ggplot(SPrail_dow, aes(x = fare, 414 | y = price, 415 | group = dow, 416 | fill = weekend, 417 | alpha = fare == 'Flexible')) + 418 | geom_bar(position = 'dodge', stat = 'identity') + 419 | labs(fill = 'Weekend', 420 | x = 'Fare Type', 421 | y = 'Average Price', 422 | title = 'Flexible-Ticket Resale Prices are Lower on Weekends') + 423 | theme_tufte() + 424 | scale_fill_manual(values = c('gray','#33ccFF')) + 425 | scale_alpha_manual(values = c(.4,1)) + 426 | guides(alpha = FALSE,fill = FALSE) + 427 | annotate('text',x = .87, y = 36, label = 'Weekday',color = 'darkgray', alpha = 1) + 428 | annotate('text',x = 1.34, y = 36, label = 'Weekend',color = '#33ccFF', alpha = 1) 429 | ``` 430 | 431 | 432 | ## Could we Go Further? 433 | 434 | - Declutter with days-of-week $\rightarrow$ just one point for weekday, one for weekend 435 | - Do we need the other ticket types? 436 | - Slope graph? 437 | - Move the title in closer? 438 | 439 | ## Designing for Focus 440 | 441 | - Highlight what's important 442 | - Remove distractions 443 | - Unimportant stuff CAN be backgrounded 444 | 445 | ## Annotation 446 | 447 | - Annotation can be a great way of pointing out features of the data 448 | - If you can tell the reader what you mean rather than making them infer it, that's probably good 449 | - (although if they'd always infer it without you having to tell it, that's even better) 450 | - Remember - *walk your audience through the data*, and don't be afraid to interpret for them! 
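
As a rough sketch of the kind of annotation described above: the plot below uses the built-in `mtcars` data purely as a stand-in, and the label text, coordinates, and the object name `p` are all made up for illustration. The `annotate()` layers draw a text label and an arrow pointing at the feature the title is about.

```{r}
library(ggplot2)

# Illustrative only: mtcars stands in for real data, and the label wording is a made-up example
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = 'Weight (1000 lbs)',
       y = 'Miles per Gallon',
       title = 'Heavier Cars Get Fewer Miles per Gallon') +
  theme_minimal() +
  # A text label placed in empty space near the points it describes
  annotate('text', x = 4.5, y = 25,
           label = 'The heaviest cars\nget the lowest mpg') +
  # An arrow from the label toward those points
  annotate('segment', x = 4.5, xend = 5.2, y = 23, yend = 16,
           arrow = arrow(length = unit(0.2, 'cm')))
p
```

The label sits in white space rather than on top of the data, and the arrow does the work of connecting the words to the points, so the reader does not have to infer which part of the graph is meant.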
451 | 452 | ## Annotation 453 | 454 | - Tell them what you're going to tell them: Title the graph with the story or insight 455 | - Tell them: Have a graph that demonstrates that story or insight 456 | - Tell them what you told them: put a label or arrow or other attention-drawing element that makes clear how the graph supports that title 457 | 458 | ## Explanation with Audience and Focus 459 | 460 | - Pick something that you know a lot about, and pick something interesting about that thing. 461 | - Prepare a ~1 minute explanation of that interesting thing 462 | - Take turns giving that explanation to your partner 463 | - Then, after both explanations have been given, have them explain back to you the thing you explained to them. How accurate is it? -------------------------------------------------------------------------------- /Lecture_03_Cardash.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_03_Cardash.jpg -------------------------------------------------------------------------------- /Lecture_03_Preattentive.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_03_Preattentive.PNG -------------------------------------------------------------------------------- /Lecture_04_Common_R_Problems.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Common R Mistakes in Data Viz" 3 | author: "Nick HK" 4 | date: "`r Sys.Date()`" 5 | output: 6 | rmdformats::readthedown: 7 | toc: 2 8 | --- 9 | 10 | ```{r setup, include=FALSE} 11 | knitr::opts_chunk$set(echo = TRUE) 12 | ``` 13 | 14 | # Common R Problems 15 | 16 | This document catalogues some of the requests for help I get, and mistakes I see in looking at student work, particularly in my 
Data Communications class. I also, of course, cover how to fix these things. 17 | 18 | I've divided this into two sections: (1) things breaking, and (2) mistakes in coding. 19 | 20 | # Things Breaking 21 | 22 | These are common errors you might get, and common things that might cause errors. 23 | 24 | ## General troubleshooting tips 25 | 26 | 1. Run your code one line at a time. Between each line, check the output and make sure it looks like you expect it to look. Often, errors occur because an earlier line of code didn't do what we intended it to do. If you don't spot any problems there, keep going until you hit the error message. 27 | 1. Read the error message and see if you can make sense of what it says 28 | 1. Often, even if you don't understand the message, it will give you a hint as to what is causing the problem, and so you'll know where to look to double-check your code 29 | 1. Google 30 | 31 | If you still need help, I am your professor and I am here to help you! Here's some tips for asking me for coding help: 32 | 33 | 1. Let me know what the problem is ("it's not working" isn't too helpful) 34 | 1. What is the error message or warning, if there is one, and what lines of code are generating it? 35 | 1. What are you trying to accomplish with the code? What *should* it do? 36 | 1. Send along your code as well. The original file is a good idea, although then I won't be able to help on my phone and you'll have to wait until I get to a computer. Copy/pasting code and error message into an email works well. A screenshot (Google "how to take a screenshot" and either Mac or Windows, as appropriate) also works well. Using your phone to take a picture of your computer screen is the least-good way to share your problem. 
37 | 38 | ## There is no package called 39 | 40 | If you get an error like `there is no package called 'packagename'`, that means that you've tried to load or use a package - perhaps using `library()` or something like that - but you don't have that package installed. 41 | 42 | The Fix: 43 | 44 | 1. Did you spell the package name correctly? Maybe you do have the package, but you made a typo 45 | 1. If you spelled it correctly, you may not have installed it. Install it (usually) with `install.packages('packagename')` where `'packagename'` is replaced with the name of the package you actually want. 46 | 47 | ## Could not find function 48 | 49 | If you get an error like `could not find function "functionname"`, that means it's trying to look for that function, but cannot find it. 50 | 51 | The Fix: 52 | 53 | 1. Did you spell the function name correctly? Maybe you made a typo. 54 | 1. If the function is in a package, did you load the package? Remember, the package must be re-loaded every time you restart R. If you're working in Quarto, you must load the package inside of your Quarto script, since Quarto starts from a blank slate. 55 | 56 | ## Object not found 57 | 58 | If you get an error like `object 'objectname' not found`, that means that you've made a reference to an object called `objectname`, but there is no such object to refer to. 59 | 60 | The Fix: 61 | 62 | 1. Did you spell the object's name correctly? Maybe you made a typo. 63 | 1. Did you remember to create the object before referring to it? For example, the code `mean(x); x <- 1:10` would not work, because you try to take the mean of `x` before it's been created. If you're working in Quarto, remember that it will start from a blank slate, so you must create all objects inside of your Quarto script. 64 | 65 | ## Column doesn't exist 66 | 67 | If you get an error like `column 'columnname' doesn't exist`, that means it's trying to refer to a column name that is not actually in your data. 
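
Before working through the fixes, it often helps to look at which column names the data actually has. A minimal sketch, using the built-in `mtcars` data as a stand-in for your own (the name `miles_per_gallon` is deliberately made up and does not exist in `mtcars`):

```{r}
# mtcars is a built-in data set, used here only as a stand-in for your own data
names(mtcars)

# Check one specific name before you use it
'mpg' %in% names(mtcars)

# A made-up name that is NOT a column in mtcars
'miles_per_gallon' %in% names(mtcars)
```

Running `names()` again after each step of a pipeline shows when a column appears or disappears.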
68 | 69 | The Fix: 70 | 71 | 1. Did you spell the column's name correctly? Maybe you made a typo. 72 | 1. Are you referring to the proper dataset? Maybe you created the variable in `my_data2` but you're trying to access it in `my_data`. 73 | 1. Does your column name have spaces or other strange characters in it like dashes? If so, in many applications (anywhere you're not treating the column name as a string) you have to surround the column name in backticks (`r "\u0060"`) so it knows where the column name starts and ends. 74 | 1. Did you perhaps drop the column before trying to refer to it? For example, in the below code, the `group_by() %>% summarize()` will only keep columns named in the `group_by()` or in the `summarize()`. So even though the `mpg` column was there at the start, it no longer is by the time we refer to it. 75 | 76 | ```{r, eval = FALSE} 77 | mtcars %>% 78 | group_by(am) %>% 79 | summarize(mean_hp = mean(hp)) %>% 80 | mutate(hp_by_mpg = mean_hp/mpg) 81 | ``` 82 | 83 | ## Warning: Package was compiled with R version 84 | 85 | If you load a package and get a warning that looks like `Warning: packagename was compiled with R version...`, that just means that the package was compiled using a slightly newer version of R than yours. It's generally not a problem as long as you have a somewhat-recent version of R. 86 | 87 | The Fix: 88 | 89 | 1. Ignore it 90 | 1. If you want it to go away, update your R installation. 91 | 92 | ## Rtools is required to build packages from source 93 | 94 | If you are installing packages, you may be asked whether you'd like to build packages from source. If you say "Yes", but you do not have Rtools installed, you will get an error, and the package won't install properly. 95 | 96 | The Fix: 97 | 98 | 1. When asked whether you want to install packages from source, say "No." 99 | 1. Install Rtools before installing packages. 
The easiest way to do this is to install the **installr** package, and then run `installr::install.rtools()`. 100 | 101 | ## File does not exist, or No such file or directory 102 | 103 | If you get an error like `'filename.csv' does not exist`, that means that you're trying to load a file, but the file is not in the location you're telling R to look. 104 | 105 | The Fix: 106 | 107 | 1. Did you spell the filename correctly? Maybe you made a typo. Or, perhaps, you're trying to open an Excel file (`.xlsx`) but telling R to look for a CSV (`.csv`). Or you wrote "file.csv" but the file is "file (1).csv". 108 | 1. Is the file perhaps open in Excel? Sometimes, Microsoft Office will hold files hostage and not let them be opened in other programs while they're open in Office. Close the Excel window that has the file open. 109 | 1. **Most often**, this is a working directory issue. If you tell R to look for `'filename.csv'`, it will look for that file in the *working directory*. Or, if you tell it to look for `'data/filename.csv'`, it will look for a folder called "data" in the working directory, and for filename.csv inside of that. In recent versions of RStudio, the current working directory is listed right on top of the Console. Did you check whether the file is actually in that folder? If you're running Quarto, it will set the working directory to the folder where the .qmd file is located. Did you put the file in the right spot relative to where the qmd is stored? See my [video on filepaths](https://www.youtube.com/watch?v=NG7Y0kkGR8g) for more help. 110 | 111 | 112 | ## My code works fine in RStudio, but my Quarto won't render 113 | 114 | If your code runs fine when you run it by hand, but your Quarto throws errors, these are the most common issues: 115 | 116 | 1. You may have loaded a library while working on your code, but not loaded that same library in your Quarto. Include all necessary `library()` calls inside your Quarto document. 117 | 1.
You may have created an object while working on your code, but not created that same object in your Quarto. Make sure all objects are created within the Quarto itself. 118 | 1. To troubleshoot both (1) and (2) above, restart your R session (Session -> Restart R) and clear your environment (hit the broom icon in the Environment tab), and THEN run your code line by line to see where the error is. 119 | 1. If you're loading in a file, you may have a different working directory than the Quarto is using. Remember, Quarto sets the working directory to the folder where the file is saved. See the previous "File does not exist" issue above. 120 | 1. Quarto hates when you try to install packages inside the Quarto (and it's a bad idea anyway). Make sure there are no `install.packages()` or `remotes::install_github()` lines in your Quarto doc. 121 | 1. Did you make sure your code is inside of a chunk? Code outside of a chunk won't run. 122 | 123 | If none of those fix it, then look in the "Render" tab to see what exactly the error message is. Keep in mind that, in Quarto, the line number associated with the error message is the line number for the **start of the chunk where the error is**, not the **actual line where the error occurs.** 124 | 125 | ## Object of type 'closure' is not subsettable 126 | 127 | This is the most aggravating and least-informative R error. "Subsettable" is a hint though - you're trying to take a subset of something (such as picking an element from a vector, or picking a column from a data set), but the thing you're doing it to is a "closure" that you can't do that to. 128 | 129 | Most commonly, this is a case of accidentally giving the name of a function where you mean to give the name of a vector or data frame. 
For example: 130 | 131 | ```{r, eval = FALSE} 132 | # My data set is about how mean you are, so I'll call it mean_data 133 | mean_data <- tibble(person = c('Me','You'), meanness = c(1000,999)) 134 | # Now let's look at that meanness variable, but oops, I'm referring to the function mean() instead of my data mean_data 135 | mean$meanness 136 | # Error in mean$meanness : object of type 'closure' is not subsettable 137 | ``` 138 | 139 | The Fix: 140 | 141 | 1. See where it says the error is "in" - that's the place where you've referred to the wrong thing. Fix it! 142 | 143 | ## You probably made a typo or didn't balance your parentheses 144 | 145 | Just in general, a very large percentage of errors students come to me with occur simply because they made a typo, or didn't balance their parentheses (pairing each ( with a )). 146 | 147 | The Fix: 148 | 149 | 1. If you get an error message of any kind, read the relevant part of the code carefully to make sure you didn't make a typo. 150 | 1. Avoid coding where you have lots and lots of nested parentheses (this is something that the **tidyverse** piping we learn is intended to avoid). Then, for whatever parentheses you do have, click on each one in RStudio - it will highlight its "partner" or show where there is a missing partner. 151 | 152 | 153 | # Mistakes in Coding 154 | 155 | These are common mistakes I see students making, which may not always lead to an error, but sometimes will. And these mistakes will often lead to incorrect results even if the code runs okay. 156 | 157 | ## Not checking 158 | 159 | This is not specific to R, or really to coding. But I am begging you to do this: 160 | 161 | After every line of code, check whether the code did what you expected it to do. 162 | 163 | Ask yourself: what was that line of code trying to do? What should the data look like after that line runs? 164 | 165 | Look at the data after you run the code. Does it look as expected? 
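As a rough sketch of what that checking can look like (the `high_mpg` variable and the `checked` and `by_cyl` objects are made up for illustration, using the built-in `mtcars` data), each step is followed by a quick look at whether it did what was intended:

```r
library(dplyr)

# Hypothetical step: flag cars with above-median fuel efficiency
checked <- mtcars %>%
  mutate(high_mpg = mpg > median(mpg))

# Check 1: does the new variable only take the values we expect (TRUE/FALSE)?
unique(checked$high_mpg)
summary(checked$high_mpg)

# Hypothetical step: collapse to one row per number of cylinders
by_cyl <- checked %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))

# Check 2: is there really one row per group?
nrow(by_cyl) == n_distinct(mtcars$cyl)
```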
166 | 167 | If you created a variable you'd expect to take certain values, perhaps use a function like `summary()` or `unique()` to see if it takes those values. 168 | 169 | If you tried to collapse the data so there should be one row per, say, country, look at your data (click it in the Environment pane) to see whether that happened. 170 | 171 | I have been using R for years (and coding generally for decades). I can look at your code and, often, know immediately whether it's going to do something unexpected. You (probably) don't have that luxury. But you can replicate 90% of it by just *checking your work*. I cannot emphasize enough how much of a good idea this actually is and I'm not just saying this. **I** do this *despite those decades of experience*. Or perhaps *because* of it. 172 | 173 | ## Using numbers as strings 174 | 175 | When you put something in quotes, like '1', R will treat it as a *string variable* - a set of characters to be read as text. 176 | 177 | If you want a variable to be treated as a number, **do not put it in quotes.** This way lies madness. 178 | 179 | R is pretty loose, and will allow you to do things like ask whether `'2' > 1`, and will answer yes. This *feels* like it's okay to put the 2 in quotes. But it's not. What R is doing is saying "oof... I can't compare a number to a letter, but you're asking me to. I'll just convert this number to a string as well, so I'm really doing `'2' > '1'`." And since '2' is alphabetically after '1', that's true, which is the desired result. Great! Except it also means that `'02' > 1` is false, since '02' is alphabetically before '1'. Same with `'120'>'20'` being false. Oops. 180 | 181 | The Fix: 182 | 183 | 1. If it's a number, and you want it treated like a numeric value, don't put quotes around it. 184 | 185 | ## Using strings for variable names 186 | 187 | If your main coding experience has been with Python so far, you're probably used to always having to put variable names in quotes. 
If I want to get the variable `mpg` from the Pandas DataFrame `mtcars`, I do `mtcars['mpg']` or `mtcars.loc[:, 'mpg']`. 188 | 189 | R is different. R makes a lot of use of what's called "non-standard evaluation" which basically means you can often write variable names like regular code instead of putting them in strings. So we *could* do `mtcars[['mpg']]`, but usually we'd do something like `mtcars$mpg` or `mtcars %>% pull(mpg)`. 190 | 191 | Confusingly, sometimes R *does* require quotes (`mtcars[['mpg']]` works but `mtcars[[mpg]]` doesn't), and sometimes it will accept things either way (`mtcars %>% select(mpg)` and `mtcars %>% select('mpg')` both work). 192 | 193 | But often, especially when working with the **tidyverse** as we do, using strings for variable names can get you in trouble. For example, you'll be scratching your head for a while to figure out why `ggplot(data, aes(x = 'x', y = 'y')) + geom_point()` doesn't work, before realizing that **ggplot2** *doesn't* accept variable names as strings, and it's literally trying to graph the letter "x" against the letter "y". Oops. 194 | 195 | The Fix: 196 | 197 | 1. It's generally a good idea to default to not using quotes around variable names. 198 | 199 | ## Letting NAs propagate 200 | 201 | In this class we do a whole lot of summary statistics! And you may find yourself with a lot of `NA` (missing) values in the results. Why is that? 202 | 203 | R's standard summarizing functions, like `mean()` or `sum()`, will return `NA` if *any* of the values you're giving it are `NA`. `mean(c(1,2,3,4,5, NA))` returns `NA`, not `3`. 204 | 205 | So, if you're doing something like a `group_by() %>% summarize()` with a function like `mean()` in it, then if there's a missing value *anywhere in your data*, you'll get back an `NA`. 206 | 207 | The Fix: 208 | 209 | 1. Whenever you use a function like `mean()`, there's usually a `na.rm = TRUE` option you can add to drop any `NA` values before computing the function. 
This is usually what you want. 210 | 1. Note the `na.rm = TRUE` option goes inside the `mean()` or whatever function, not the `mutate()` or `summarize()`. 211 | 212 | ## Indiscriminate na.omit() 213 | 214 | One way to deal with missing values is to purge them from your data entirely. I don't teach this in class, but I find a lot of students will look online and find the `na.omit()` function for doing this. 215 | 216 | However, be aware that `na.omit()` removes rows for which there's a missing value in *any* column. If you're analyzing three columns of data, but your data set has 200 columns, then you lose a row if that missing value is *anywhere*. That's probably not what you want. 217 | 218 | The Fix: 219 | 220 | 1. Limit your data only to the columns you're going to use (perhaps with `select()`) before running `na.omit()` 221 | 1. Instead use the superior `drop_na()` in **tidyr/tidyverse**, which also lets you specify the set of variables to look for NAs in. `data %>% drop_na(a, b)` will only drop rows if they have missing values in the `a` or `b` columns, not any others. 222 | 223 | ## Mixing up <-, =, and == 224 | 225 | `<-` is for assigning objects. You can say `a <- 1`, and that will create an object `a` with the value `1` inside. 226 | 227 | `=` *can also* be for assigning objects, just like `<-`, but is also for assigning arguments and options in functions. In the function `mean()`, which takes the argument `x`, I should say `mean(x = 1:10)`, not `mean(x <- 1:10)`. The latter happens to run, but only because it first creates an object `x` in your environment and then passes it along, which is rarely what you want. 228 | 229 | `==` is for *checking* whether two things are equal. `a == 1` will check whether the object `a` is equal to `1`, and will return `TRUE` if it is, or `FALSE` otherwise. 
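A minimal sketch of the three operators side by side, using made-up objects `a` and `avg`:

```r
# `<-` assigns: this creates an object called a holding the value 1
a <- 1

# `=` inside a function call names an argument: here it sets x for mean()
avg <- mean(x = 1:10)

# `==` compares, returning TRUE or FALSE without changing anything
a == 1
a == 2

# The comparison is done element by element on a vector
c(1, 2, 1) == 1
```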
230 | 231 | Commonly, this will cause problems in cases like this: 232 | 233 | ```{r, eval = FALSE} 234 | data %>% 235 | filter(state = "WA") 236 | ``` 237 | 238 | which we might expect to filter our data to rows that are in the state of Washington, but instead treats `state = "WA"` as a named argument rather than a comparison. Recent versions of **dplyr** stop with an error pointing out that you probably meant `==`. 239 | 240 | ## summarize() without summaries 241 | 242 | `summarize()` is a **dplyr/tidyverse** function that summarizes the data down to have one row, or one row per group if you've previously specified a `group_by()`. 243 | 244 | However, for it to do this, you need to tell it *how* to summarize that data. So, each argument needs to be a function that returns *one value*. Or else it won't work. 245 | 246 | Most commonly, students will just list a bunch of variables in their `summarize()` and expect it to take the mean of each of them. But it doesn't do that, and will instead return a non-summarized data set. 247 | 248 | ```{r, eval = FALSE} 249 | # wrong: 250 | mtcars %>% 251 | group_by(am) %>% 252 | summarize(mpg, hp) 253 | 254 | # right: 255 | mtcars %>% 256 | group_by(am) %>% 257 | summarize(mpg = mean(mpg), hp = mean(hp)) 258 | 259 | # or if you want to be fancy 260 | mtcars %>% 261 | group_by(am) %>% 262 | summarize(across(c(mpg, hp), mean)) 263 | ``` 264 | 265 | ## Working way too hard 266 | 267 | Data communications is a class in which we need to learn some coding to do what we need to do, but it's not a coding class. If there's a difficult task, I've usually gone over a simple solution somewhere in class. 268 | 269 | If you're trawling arcane StackExchange answers and pasting together a bunch of code you don't understand, or if you're spending tedious hours writing out a thousand lines of code, there's a good chance that you're doing something the hard way, when I've provided an easy way in class. 270 | 271 | Some tips for avoiding this: 272 | 273 | 1. 
If you're writing tiny variations on the same code a zillion times, try doing it in a loop instead 274 | 1. If you're having difficulty working with a date variable, there's probably a function that does things for you in the **lubridate** package 275 | 1. If you're having difficulty working with a string variable, there's probably a function that does things for you in the **stringr** package 276 | 1. If you're spending hours and hours on what feels like a problem that should have a simple answer, it probably does. Email me. -------------------------------------------------------------------------------- /Lecture_04_R_Walkthrough.R: -------------------------------------------------------------------------------- 1 | # Packages 2 | install.packages(c('tidyverse','vtable')) 3 | # One at a time! 4 | library(tidyverse) 5 | 6 | # Getting data 7 | # From packages 8 | data(storms, package = 'dplyr') 9 | # From files 10 | # d <- read_csv('exampledata.csv') 11 | 12 | # R objects 13 | a <- 1 14 | b <- 'hello' 15 | 16 | # Vectors 17 | a <- 11:20 18 | a[1] 19 | b <- c(1, 1, 4, 4, 2, 3, 6) 20 | 21 | # Logicals 22 | b == 1 23 | sum(b == 1) 24 | mean(b == 1) 25 | (b == 1) | (b == 4) 26 | (b >= 1) & (b < 4) 27 | 28 | # Variable types 29 | as.character(a) 30 | factor(b) 31 | b 32 | 33 | # Data frames / tibbles 34 | storms 35 | summary(storms) 36 | str(storms) 37 | names(storms) 38 | vtable::vt(storms) 39 | library(vtable) 40 | sumtable(storms) 41 | storms$category 42 | table(storms$category) 43 | storms[['category']] 44 | table(storms[['category']]) 45 | 46 | 47 | # Functions 48 | mean(b) 49 | table(b) 50 | mean(b, na.rm = TRUE) 51 | sumtable(storms, vars = names(storms)[5:8]) 52 | 53 | # Autocomplete 54 | # data( 55 | # mea 56 | 57 | # Help 58 | help(mean) 59 | ??mean 60 | # help pane 61 | 62 | # Rstudio notes 63 | # Working directory 64 | # Environment pane and broom 65 | # Viewer and file pane, and export 66 | 67 | # Swirl 68 | # install.packages('swirl') 69 | swirl() 70 | # 
Note the base package 4 on EDA! also 1 on basic programming 71 | # See https://github.com/swirldev/swirl_courses for more 72 | 73 | # Troubleshooting 74 | mean(c('a','b')) 75 | sumtable(mean) 76 | feols(Y~X) 77 | -------------------------------------------------------------------------------- /Lecture_04_R_and_RMarkdown.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 4: R and Quarto" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | { 19 | library(tidyverse) 20 | library(lubridate) 21 | library(gghighlight) 22 | library(ggthemes) 23 | library(directlabels) 24 | library(RColorBrewer) 25 | } 26 | rinline <- function(code) { 27 | sprintf('``` `r %s` ```', code) 28 | } 29 | ``` 30 | 31 | ## Tech! 32 | 33 | ```{r, results = 'asis'} 34 | cat(" 35 | ") 41 | ``` 42 | 43 | - As we go through the course, we're going to need to keep data visualization and communication principles close at hand 44 | - But it's also unavoidable that actually using them will require some technical skill on our part 45 | - In this class we'll be using R 46 | - Specifically, the **tidyverse** set of packages, and Quarto 47 | - I don't expect you to become R experts. You're not learning R so much as learning **dplyr** and **ggplot2** and some extra details 48 | 49 | ## R 50 | 51 | - Why R instead of Python? 52 | - It's good to know more than one language (nothing ever lasts!) 
53 | - **ggplot2** is a top-tier visualization package and makes it much easier to implement our principles 54 | - Data-cleaning is a bit more intuitive than in **pandas** (and faster too if you're using **data.table** but we aren't) 55 | - Python > R in designing software for production, or in some machine learning apps. But we aren't doing those! 56 | 57 | ## R - The Basics 58 | 59 | - Let's switch over to an RStudio walkthrough. We'll cover: 60 | - Getting set up 61 | - The RStudio window 62 | - Using basic R elements 63 | - I'll also show you how to set up the **swirl** package so you can do some walkthroughs on your own 64 | 65 | ## Loading in data 66 | 67 | - For the most part the **tidyverse** package will cover all our needs 68 | - For loading in (and saving) data, I strongly suggest the use of the **rio** package 69 | - This package is new, so you won't find it on a lot of Google searches for "R load CSV file" 70 | - But it's super easy. Doesn't even matter what file format it is. `import('filename.csv')` or `.Rdata` or `.xlsx` etc. all work immediately. `import_list()` can import all the sheets of an Excel workbook, or a bunch of files, all at once 71 | 72 | ## My Promise 73 | 74 | - There are many ways to do anything in R 75 | - But if I've shown it in class, almost certainly the way I'm showing it is the easiest, especially in the context of this class, as I've chosen stuff that makes sense for us specifically 76 | - Don't make yourself work too hard! If you're trying to figure something out, ask "did we cover this in class?" If so, go back to this material - it will probably be less work than whatever Google solution you find 77 | - Like, seriously. 
The number of times students spend hours and hours trying to stitch together a solution from dozens of StackExchange answers, when it's something I've given a one-line solution to in class, is *a lot* 78 | 79 | ## Quarto 80 | 81 | - Quarto is a text and code processing system 82 | - Text and code in -> Formatted output out (HTML, Word, PDF, slides, books, etc.) 83 | - These slides are made in Quarto. I write books and papers in it too. 84 | 85 | ## Markdown 86 | 87 | - Markdown is a "markup language" like HTML or LaTeX. All your formatting goes in as raw text, and then you run it through an interpreter to get output 88 | - Markdown is *super super simple* 89 | - Let's walk through the [Markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) 90 | - Note HTML code also works! 91 | 92 | ## Markdown 93 | 94 | - Why Markdown instead of, like, Word? or JuPyteR? 95 | - Can be automated. Includes code and output directly in the document, including for in-line numbers 96 | - Much easier to share via something like, e.g., GitHub 97 | - Output is in standard formats like Word or HTML so they don't need any sort of setup to see it 98 | 99 | ## The YAML 100 | 101 | - The YAML is the set of text at the top of a Quarto that tells it how to interpret all that text. Here's an example YAML: 102 | 103 | ```{r, eval = FALSE, echo = TRUE} 104 | --- 105 | title: "Untitled" 106 | format: 107 | html: 108 | embed-resources: true 109 | toc: true 110 | editor: visual 111 | --- 112 | ``` 113 | 114 | 115 | ## Code Chunks 116 | 117 | - Include code in-line (fantastic for memos) with `r rinline('mean(1:3)')` which makes `r mean(1:3)` 118 | - Start a new code chunk with a forward slash, /. Let's take a look at an example Quarto doc 119 | - *Code goes inside the chunks, writing goes outside* 120 | - Note: want a data frame to nicely format as a table? use `knitr::kable()`. 
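As a small illustration of that last point (assuming the **knitr** package is installed, and using the built-in `mtcars` data rather than anything from class), a chunk like this prints a formatted table instead of raw console output:

```r
library(knitr)

# Keep the table small: first five rows, first four columns
small_cars <- head(mtcars[, 1:4], 5)

# kable() turns the data frame into a Markdown table for the rendered document
car_table <- kable(small_cars, caption = "First five cars in mtcars")
car_table
```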
121 | 122 | ## Code Chunks 123 | 124 | - Common options to set in each code chunk, or altogether in the YAML: 125 | - `echo`: Show code or not? 126 | - `eval`: Evaluate the code inside or not? 127 | - `warning` and `message`: Show warnings and messages from the code? 128 | - `include`: Include *any output at all* or not? 129 | 130 | ## Common Mistakes 131 | 132 | - Let's take a look at the [Common Errors and Mistakes Cheatsheet](https://nickch-k.github.io/DataCommSlides/Lecture_04_Common_R_Problems.html) 133 | - Common Quarto mistakes: forgetting to be sure that *all the code you need to run from fresh is in the Quarto*, including `install.packages()` (don't do it!), working directory issues, using the "Import Data" wizard 134 | 135 | ## Let's do it 136 | 137 | - Make a new Quarto HTML document 138 | - YAML: Turn the toc on, title it and add your name 139 | - Make one big section and title it (#) and two small (##) 140 | - In the setup chunk, load the **tidyverse** and load `data(storms)` 141 | - Write some text in the first small one, including a link to something 142 | - In the second, add a code chunk that does a `summary()` of `storms` 143 | - Then, some more text, with an in-line text to get the mean of `storms$wind`. 144 | - Bonus: load the **scales** package and use `number()` to format the in-line text. -------------------------------------------------------------------------------- /Lecture_04_example_rmarkdown.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "My Rmarkdown Doc" 3 | author: "Nick HK" 4 | date: "`r Sys.Date()`" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = FALSE) 10 | 11 | library(tidyverse) 12 | library(vtable) 13 | my_mean_data <- 100:200 14 | ``` 15 | 16 | ## R Markdown 17 | 18 | Write what I like: `hi this is code`, *this is italics* [Google](https://google.com) `r 2+2` 19 | 20 | This is an R Markdown document. 
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see . 21 | 22 | When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this: 23 | 24 | ```{r, echo = TRUE} 25 | knitr::kable(cars) %>% 26 | kableExtra::kable_styling('striped') 27 | ``` 28 | 29 | ```{r cars, eval = FALSE} 30 | summary(cars) 31 | ``` 32 | 33 | ## Including Plots 34 | 35 | You can also embed plots, for example: 36 | 37 | ```{r pressure, echo=FALSE} 38 | plot(pressure) 39 | ``` 40 | 41 | Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot. 42 | -------------------------------------------------------------------------------- /Lecture_04_second_example.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "My Storms Stuff" 3 | author: "Nick HK" 4 | date: "`r Sys.Date()`" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | # install.packages('tidyverse') 11 | library(tidyverse) 12 | library(scales) 13 | 14 | data(storms) 15 | ``` 16 | 17 | # My Storms Analysis 18 | 19 | ## Storms 1 20 | 21 | Storms are bad [look at the weather](https://weather.com). 
22 | 23 | ## Storms 2 24 | 25 | ```{r} 26 | summary(storms) 27 | ``` 28 | The mean wind speed is `r number(mean(storms$wind))` 29 | -------------------------------------------------------------------------------- /Lecture_05_Example_Sprail.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture_05_Example_SPrail" 3 | author: "Nick Huntington-Klein" 4 | date: "7/8/2021" 5 | output: 6 | html_document: 7 | toc: TRUE 8 | toc_float: TRUE 9 | --- 10 | 11 | ```{r setup, include=FALSE} 12 | knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE) 13 | 14 | { 15 | library(tidyverse) 16 | library(purrr) 17 | library(kableExtra) 18 | library(lubridate) 19 | } 20 | 21 | SPrail <- list.files(pattern = 'SPrail_') %>% 22 | map(read_csv) %>% 23 | bind_rows() 24 | 25 | ``` 26 | 27 | ## Is the data tidy? 28 | 29 | 30 | ```{r} 31 | head(SPrail) 32 | ``` 33 | 34 | Looks like one row per ticket (albeit without any sort of identifier) and everything is in its own column. Seems tidy to me! Although an ID would be nice. We could make one with `mutate(id = 1:n())` 35 | 36 | ## Data Processing 37 | 38 | ```{r} 39 | SPrail <- SPrail %>% 40 | filter(fare == 'Promo') %>% 41 | mutate(month = month(start_date)) 42 | 43 | months <- tibble(month = 1:12, monthname = month.name) 44 | 45 | SPrail <- SPrail %>% 46 | left_join(months) 47 | ``` 48 | 49 | ## Now to Make a Table! 
50 | 51 | ```{r} 52 | meanprice <- SPrail %>% 53 | group_by(destination, monthname) %>% 54 | summarize(meanprice = mean(price, na.rm = TRUE), N = n()) %>% 55 | ungroup() %>% 56 | arrange(-meanprice) %>% 57 | slice(1:5) 58 | 59 | meanprice %>% 60 | mutate(meanprice = scales::dollar(meanprice)) %>% 61 | knitr::kable() %>% 62 | kable_styling(bootstrap_options = 'striped') 63 | ``` 64 | 65 | The most expensive tickets are for `r meanprice %>% slice(1) %>% pull(destination)` in `r meanprice %>% slice(1) %>% pull(monthname)`, at a price of `r meanprice %>% slice(1) %>% pull(meanprice) %>% scales::dollar()`. On average the five most expensive routes sold `r meanprice %>% pull(N) %>% mean()` tickets apiece. 66 | -------------------------------------------------------------------------------- /Lecture_05_and_06_example_code.R: -------------------------------------------------------------------------------- 1 | library(tidyverse) 2 | data1 <- tibble(Month = 1:4, MonthName = c('Jan','Feb','Mar','Apr'), 3 | Sales = c(6,8,1,2), RD = c(4,1,2,4)) 4 | data2 <- tibble(Department = 'Sales', Director = 'Lavy') 5 | 6 | joined_data <- data1 %>% 7 | pivot_longer(cols = c('Sales','RD'), 8 | names_to = 'Department', 9 | values_to = 'Budget') %>% 10 | full_join(data2, by = 'Department') 11 | 12 | 13 | joined_data %>% 14 | filter(!(MonthName %in% c('Jan','Feb'))) 15 | joined_data %>% 16 | filter((MonthName %in% c('Jan','Feb')) & Department == 'Sales') 17 | joined_data %>% 18 | filter((MonthName %in% c('Jan','Feb')) | Department == 'Sales') 19 | 20 | data("construction") 21 | construction %>% 22 | select(Year, Month, West) %>% 23 | filter(Month %in% c('July','August','September')) %>% 24 | arrange(-West) %>% 25 | slice(1:2) 26 | 27 | joined_data 28 | joined_data %>% 29 | group_by(MonthName) %>% 30 | mutate(MonthAverage = sum(Budget)/n()) 31 | 32 | joined_data %>% 33 | group_by(MonthName) %>% 34 | summarize(MonthAverage = mean(Budget)) 35 | 36 | joined_data %>% 37 | group_by(MonthName) %>% 38 
| summarize(proportion_above_6 = mean(Budget > 5)) 39 | 40 | 41 | data(construction) 42 | new_cons <- construction %>% 43 | mutate(avg = (Northeast + Midwest + South + West)/4) %>% 44 | rename(`Price Average` = avg) %>% 45 | mutate(firsthalf = case_when( 46 | row_number() <= 6 ~ 'First half', 47 | TRUE ~ 'Second half' 48 | )) %>% 49 | group_by(firsthalf) %>% 50 | summarize(`Price Average` = mean(`Price Average`)) 51 | new_cons %>% 52 | pull(firsthalf) %>% 53 | is.character() 54 | str(new_cons) 55 | class(construction$Year) 56 | sapply(construction, class) 57 | 58 | summarized <- joined_data %>% 59 | mutate(Department = factor(Department)) %>% 60 | group_by(Department) %>% 61 | summarize(Budget = mean(Budget)) %>% 62 | mutate(Department = reorder(Department, -Budget)) %>% 63 | arrange(Department) 64 | summarized$Department 65 | 66 | 67 | library(lubridate) 68 | ymd('2020-01-01') 69 | ym('202001') 70 | my('012020') 71 | mdy('01-01-2020') 72 | construction %>% 73 | mutate(year_as_date = ym(paste(Year, '01'))) %>% 74 | pull(year_as_date) 75 | 76 | tibble(year = 2015:2018, month = c(1,6,2,4), day = c(5,6,1,2)) %>% 77 | mutate(date = dmy(paste(day, month, year))) %>% 78 | mutate(quarter = quarter(date), 79 | year_new = year(date)) %>% 80 | mutate(month_label = month.name[month]) 81 | ymd('2020-01-01') + years(1) 82 | ymd('2020-01-01') + months(3) 83 | ymd('2020-03-31') - months(1) 84 | 85 | 86 | tibble(year = c(2015,2015,2016,2016), month = c(1,6,2,4), day = c(5,6,1,2)) %>% 87 | mutate(date = dmy(paste(day, month, year))) %>% 88 | mutate(quarter = quarter(date), 89 | year_new = year(date)) %>% 90 | mutate(month_label = month.name[month]) %>% 91 | mutate(year_date = floor_date(date, 'month')) 92 | 93 | c('my string', 'my second string') 94 | paste('a','b','c','d', sep = ' and ') 95 | paste0('a','b','c',' d') 96 | 97 | construction 98 | construction %>% 99 | mutate(Month = case_when( 100 | Month == 'January' ~ 'Janubary', 101 | TRUE ~ Month 102 | )) 103 | cat('Look at the \n result in my 
graph') 104 | str_sub('hello', 2, 4) 105 | 106 | construction %>% 107 | mutate(Month = str_replace(Month, 'Ju', 'Jub')) 108 | 109 | fips_to_names %>% 110 | mutate(Just_name = word(countyname, 1)) 111 | fips_to_names %>% 112 | mutate(Just_name = str_replace(countyname, 'County|Parish', '')) 113 | 114 | 115 | data(gss_cat) 116 | changed_gss_cat <- gss_cat %>% 117 | mutate(birthyear = ym(paste0(year, '-06')) - years(age)) %>% 118 | mutate(birthyear = floor_date(birthyear, 'year')) %>% 119 | mutate(race = factor(race, levels = c('Black','White','Other'))) %>% 120 | mutate(income_num = word(rincome,1)) %>% 121 | mutate(income_num = str_replace(income_num, '\\$','')) %>% 122 | mutate(income_num = as.numeric(income_num)) 123 | 124 | table(gss_cat$rincome) 125 | 126 | stockgrowth <- tibble(ticker = c('AMZN','AMZN', 'AMZN', 'WMT', 'WMT','WMT'), 127 | date = as.Date(rep(c('2020-03-04','2020-03-05','2020-03-06'), 2)), 128 | stock_price = c(103,103.4,107,85.2, 86.3, 85.6)) %>% 129 | arrange(ticker, date) %>% group_by(ticker) %>% 130 | mutate(price_growth_since_march_4 = stock_price/first(stock_price) - 1, 131 | price_growth_daily = stock_price/lag(stock_price, 1) - 1) 132 | stockgrowth 133 | 134 | do_bps <- function(p) { 135 | bps <- p*10000 136 | return(bps) 137 | } 138 | do_bps(.00006) 139 | library(purrr) 140 | c(.00006, .05, 1) %>% map_dbl(do_bps) 141 | 142 | list.files(pattern = 'trends_') %>% 143 | map_df(read_csv) 144 | 145 | 146 | my_function <- function(i) { 147 | print(i+1) 148 | } 149 | # Print the numbers 1:10 150 | 1:10 %>% map(my_function) 151 | 152 | xplusa <- function(x, a) { 153 | return(x+a) 154 | } 155 | 156 | my_plus_a_p <- function(a) { 157 | data(mtcars) 158 | mtcars %>% 159 | mutate(across(ends_with('p'), xplusa, a = a)) %>% 160 | return() 161 | } 162 | 163 | 1:5 %>% map_df(my_plus_a_p) %>% 164 | write_csv('my_filename.csv') 165 | 166 | save(my_data, construction, file = 'my_data_and_construction.Rdata') 167 | load('my_data_and_construction.Rdata') 168 | 
saveRDS(my_data, 'mydata.Rdata') 169 | my_data <- readRDS('mydata.Rdata') 170 | write_csv(my_data, 'mydata.csv') -------------------------------------------------------------------------------- /Lecture_06_GG_Components.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_06_GG_Components.png -------------------------------------------------------------------------------- /Lecture_07_EDA.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 7: Exploratory Data Analysis" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | { 19 | library(tidyverse) 20 | library(lubridate) 21 | library(gghighlight) 22 | library(ggthemes) 23 | library(directlabels) 24 | library(RColorBrewer) 25 | } 26 | rinline <- function(code) { 27 | sprintf('``` `r %s` ```', code) 28 | } 29 | ``` 30 | 31 | ## The Purpose of Analysis 32 | 33 | ```{r, results = 'asis'} 34 | cat(" 35 | ") 41 | ``` 42 | 43 | - What are we even *doing* with data? 44 | - We want to see what sorts of stuff is in the data, so that we know what things we *should* be reporting on 45 | - This requires us to *explore* the data 46 | - Even if we come in with a strong idea of what we want to find, learning about what the data looks like is a good idea 47 | 48 | ## The Purpose of Analysis 49 | 50 | We want to: 51 | 52 | - Understand our data (read the docs, too!!!) 
53 | - Detect mistakes 54 | - Get a sense of what our variables look like 55 | - Figure out what some of the relationships are 56 | 57 | ## The Purpose of Analysis 58 | 59 | Always be on the lookout (in EDA and in your code!) for: 60 | 61 | - Things that are different that shouldn't be different ("why does this one person have a height 8x that of anyone else?") 62 | - Things that are the same that shouldn't be the same ("Why is the mean income for Americans and Canadians exactly precisely the same to the 8th decimal place?") 63 | - Relationships that are very surprising ("Why is age negatively correlated with height?") 64 | 65 | ## Exploring Data 66 | 67 | - Univariate non-graphical 68 | - Univariate graphical 69 | - Multivariate non-graphical 70 | - Multivariate graphical 71 | 72 | ## Univariate Exploratory Analysis 73 | 74 | Looking for: 75 | 76 | - What kind of values we have 77 | - Are there massive outliers? Are there values that look *incorrect*? 78 | - What is the distribution? Is there skew? If it's a factor, does one category dominate? 79 | - You'll want to do this with nearly every variable you have 80 | - Getting all the little presentation details is less important, but the graph should still be readable (labels etc.) On that note, EDA isn't always *concise* - some tables/graphs will slip off the edge of these slides. 81 | 82 | ## Univariate Exploratory Analysis 83 | 84 | - Think about *what features of the variable it makes sense to explore* 85 | - Central tendencies (mean, median, etc.?) 86 | - Tails? Skew? Percentiles? 87 | - If it's a factor, is showing *all* the categories informative? Do you need to collapse first? Or summarize? 
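One way to collapse a factor before summarizing it, sketched with `fct_lump_n()` from **forcats** (loaded with the **tidyverse**) and the `gss_cat` data that ships with it. The choice of `relig` and of keeping three categories is just for illustration:

```r
library(dplyr)
library(forcats)

# Keep the 3 most common religions and lump everything else into "Other"
collapsed <- gss_cat %>%
  mutate(relig = fct_lump_n(relig, n = 3)) %>%
  count(relig, sort = TRUE)

# A handful of rows is much easier to read than one row per original category
collapsed
```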
88 | 89 | ## Univariate Exploratory Analysis Tools 90 | 91 | - `summary()` shows lots of good info about a variable's distribution (also works to summarize other objects) 92 | 93 | ```{r, echo = TRUE} 94 | summary(iris$Sepal.Length) 95 | summary(iris$Species) 96 | ``` 97 | 98 | 99 | ## Univariate Exploratory Analysis Tools 100 | 101 | - `str()` shows variable classes (important!) and some values 102 | 103 | ```{r, echo = TRUE} 104 | str(iris) 105 | ``` 106 | 107 | ## Univariate Exploratory Analysis Tools 108 | 109 | - `vtable()` in **vtable** is a more readable and flexible version of that. `lush = TRUE` for more info 110 | 111 | ```{r, echo = TRUE} 112 | library(vtable) 113 | vtable(iris, lush = TRUE) 114 | ``` 115 | 116 | ## Univariate Exploratory Analysis Tools 117 | 118 | - `sumtable()` in **vtable** is a full table of summary statistics 119 | 120 | ```{r, echo = TRUE} 121 | sumtable(iris) 122 | ``` 123 | 124 | 125 | ## Univariate Exploratory Analysis Tools 126 | 127 | - Moving on to graphical tools! In **ggplot**, `geom_density()` and `geom_histogram()` can both show full distributions (when might this *not* be useful?) 128 | 129 | ```{r, echo = TRUE, fig.height = 3} 130 | ggplot(iris, aes(x = Sepal.Length)) + geom_density() 131 | ``` 132 | 133 | ## Univariate Exploratory Analysis Tools 134 | 135 | - Boxplots aren't for general-audience consumption but they show some easy summary statistics 136 | 137 | ```{r, echo = TRUE, fig.height = 3} 138 | ggplot(iris, aes(x = Sepal.Length)) + geom_boxplot() + coord_flip() 139 | ``` 140 | 141 | ## Univariate Exploratory Analysis 142 | 143 | - Load the `gov_transfers` data from the **causaldata** package 144 | - Read the docs! `help(gov_transfers)` 145 | - Explore the variables that are in there on a univariate basis 146 | 147 | ## Multivariate Exploratory Analysis 148 | 149 | - We want to know *how variables are related* in multivariate analysis 150 | - Again, *fit the analysis to the context*. 
Don't treat a continuous variable as a factor, and if you treat a discrete variable as continuous know that a scatterplot won't work! etc. 151 | - Don't just fire and forget. Think about whether the output is actually informative. 152 | - more than 2 variables are possible too! Just add more groups/aesthetics. 153 | 154 | ## Multivariate Exploratory Analysis Tools 155 | 156 | - For continuous vs. continuous, nothing wrong with starting with a linear correlation! 157 | - Bivariate OLS is just a rescaled correlation 158 | 159 | ```{r, echo = TRUE} 160 | cor(iris$Sepal.Length, iris$Sepal.Width) 161 | lm(Sepal.Length ~ Sepal.Width, data = iris) 162 | ``` 163 | 164 | 165 | ## Multivariate Exploratory Analysis Tools 166 | 167 | - If one variable is discrete or a factor, don't underestimate plain ol' `group_by() %>% summarize()` 168 | - In fact, many headaches of trying to get R to do something for you can be avoided by just doing `group_by() %>% summarize()` yourself 169 | - Don't forget `na.rm = TRUE` as appropriate 170 | 171 | ```{r, echo = TRUE} 172 | iris %>% 173 | group_by(Species) %>% 174 | summarise(mean.SL = mean(Sepal.Length, na.rm = TRUE), 175 | sd.SL = sd(Sepal.Length, na.rm = TRUE)) 176 | ``` 177 | 178 | ## Multivariate Exploratory Analysis Tools 179 | 180 | - `sumtable` has a `group` option for the same purpose (note `summ` can customize the summary functions) 181 | 182 | ```{r, echo = TRUE} 183 | iris %>% sumtable(group = 'Species') 184 | ``` 185 | 186 | ## Multivariate Exploratory Analysis Tools 187 | 188 | - `tabyl` in **janitor** is great for discrete vs. discrete; use `adorn` functions to get percents instead of counts. Watch the denominator! It affects interpretation a LOT 189 | 190 | ```{r, echo = TRUE} 191 | library(janitor) 192 | mtcars %>% tabyl(am, vs) %>% 193 | adorn_percentages(denominator = 'row') %>% 194 | adorn_pct_formatting() %>% adorn_title() 195 | ``` 196 | 197 | ## Multivariate Exploratory Analysis 198 | 199 | - Moving on to graphs! 
200 | - If one variable is a factor/discrete you can do pretty much any kind of graph you like, but grouped 201 | 202 | ```{r, echo = TRUE, fig.height = 3} 203 | mtcars %>% group_by(vs, am) %>% 204 | summarize(mean_mpg = mean(mpg)) %>% 205 | ggplot(aes(x = vs, fill = factor(am), y = mean_mpg)) + geom_col(position = 'dodge') 206 | ``` 207 | 208 | ## Multivariate Exploratory Analysis 209 | 210 | - When it comes to densities, typical geoms look cluttered. I recommend `geom_density_ridges` in **ggridges**, or `geom_violin` (or use facets) 211 | 212 | ```{r, echo = TRUE, fig.height = 3} 213 | library(ggridges); library(patchwork) 214 | p1 <- ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) + geom_density_ridges() 215 | p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_violin() 216 | p1 + p2 217 | ``` 218 | 219 | ## Multivariate Exploratory Analysis 220 | 221 | - For two continuous variables, things can be tricky, but a scatterplot is a good place to start 222 | 223 | ```{r, echo = TRUE, fig.height = 4} 224 | ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() 225 | ``` 226 | 227 | ## Automation 228 | 229 | - As a starting place, `ggpairs` in **GGally** will automatically pick univariate and multivariate comparisons to start 230 | 231 | ```{r, echo = TRUE, eval = FALSE} 232 | GGally::ggpairs(iris) 233 | ``` 234 | 235 | ## Automation 236 | 237 | ```{r} 238 | GGally::ggpairs(iris) 239 | ``` 240 | 241 | ## Multivariate Exploratory Analysis 242 | 243 | - Let's do some of this in our `gov_transfers` data! 
244 | 245 | ## Going into Detail 246 | 247 | - This has all been running with little input from us at this point 248 | - Yes we have to figure out which kind makes sense but we're just exploring 249 | - Going for a bit more detail requires us to have a *question* to answer, and then we can look into that 250 | - This targets us towards the variables/relationships to study, and *how* to explore them further 251 | - Perhaps, for example, checking trivariate relationships, one example being "*in this subgroup* how are X and Y related?" 252 | 253 | ## Careful! 254 | 255 | Be aware of what EDA does not do: 256 | 257 | - Checking *every* relationship will tend to uncover non-real relationships by random chance 258 | - Consider doing EDA on a random subset data for this reason, so you can check if something interesting actually is there in the other/full data and don't trick yourself 259 | - Hardly ever establishes a *causal relationship* - avoid causal language unless you can really back it up! 260 | 261 | ## Let's Do More (if there's time) 262 | 263 | - [https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Clothing.csv](https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Clothing.csv) 264 | - Data description: [https://rdrr.io/rforge/Stat2Data/man/Clothing.html](https://rdrr.io/rforge/Stat2Data/man/Clothing.html) 265 | - Explore! What do we find? 
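The random-subset suggestion from the "Careful!" slide could be sketched roughly as follows. This is a minimal illustration under assumptions: it uses `iris` as a stand-in, a 50/50 split, and an arbitrary seed; none of these choices come from the original slides.

```r
# Split the data in half at random: explore on one half, confirm on the other
set.seed(1234)  # arbitrary seed so the split is reproducible
explore_rows <- sample(nrow(iris), size = floor(nrow(iris) / 2))

explore <- iris[explore_rows, ]   # do all the "looking around" here
holdout <- iris[-explore_rows, ]  # only touch this to check what you found

# Something spotted while exploring...
cor(explore$Sepal.Length, explore$Sepal.Width)

# ...can then be checked in data that played no part in finding it
cor(holdout$Sepal.Length, holdout$Sepal.Width)
```

If a relationship shows up in `explore` but vanishes in `holdout`, that is a hint it may have been noise rather than something real.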
-------------------------------------------------------------------------------- /Lecture_08_Basics_of_ggplot2.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 8: Basics of ggplot2" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | library(tidyverse) 19 | ``` 20 | 21 | 22 | 23 | ## What is ggplot2? 24 | 25 | ```{r, results = 'asis'} 26 | cat(" 27 | ") 33 | ``` 34 | 35 | - **ggplot2** is a package in R, `library(ggplot2)` or `library(tidyverse)` 36 | - It is based on the *grammar of graphics* 37 | - Lots of major publications use it (538, The Economist, BBC, ...) 38 | - Zillions of extensions for doing anything you like 39 | - (also runs in Python using **plotnine**, and the newest version of **seaborn** is based on the Grammar of Graphics, so everything you learn here will also teach you plotting in Python) 40 | 41 | ## Resources 42 | 43 | - [An overview of the Grammar of Graphics concept](https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149) 44 | - [ggplot2 cheat sheet from RStudio](https://github.com/rstudio/cheatsheets/blob/master/data-visualization.pdf) 45 | - [ggplot2 book by Hadley Wickham](https://ggplot2-book.org/) 46 | - [LOST-Stats Figures](https://lost-stats.github.io/Presentation/Figures/Figures.html) 47 | - [R Graphics Cookbook](https://r-graphics.org/) 48 | - The **esquisse** or **swirlr** packages for learning **ggplot2** 49 | 50 | ## Why use ggplot? 
51 | 52 | - Huge amounts of power and flexibility 53 | - Manipulating data easy in R 54 | - Easy to edit 55 | - Handles multidimensional data effortlessly 56 | - Once you know how to generate graphs in a flexible coding system, **it will be easier to do what you want in code than in a more canned and restricted piece of software** 57 | 58 | ## Tidy Tuesday 59 | 60 | - Every week, the R for Data Science community posts a data set you can use 61 | - An assignment for this class will be to participate in (some) Tidy Tuesday weeks 62 | - See the repository [here](https://github.com/rfordatascience/tidytuesday) 63 | - Now on to **ggplot2**! 64 | 65 | ## The Grammar of Graphics 66 | 67 | - There's a method here! 68 | - We want to describe graphics in the way we use language 69 | - Think about syntax and nouns and verbs 70 | - We can build a plot from the ground up, element by element 71 | - This also makes it easy to make changes or switch things out 72 | 73 | ## Components of the Grammar of Graphics 74 | 75 | ```{r} 76 | # https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149 77 | knitr::include_graphics('Lecture_06_GG_Components.png') 78 | ``` 79 | 80 | ## Components of the Grammar of Graphics 81 | 82 | - Data: Identify the dimensions you want to visualize. Prepare data so that it contains the points you want to visualize, or calculate from 83 | - Aesthetics: Axes are based on the data dimensions. Also includes other "axes" like size, shape, color, etc. 84 | - Scale: Do we need to scale the potential values, use a specific scale to represent multiple values or a range? 85 | 86 | ## Components of the Grammar of Graphics 87 | 88 | - Geometric objects (`geom_`s): How should the data be drawn on the graph? Lines? Points? Bars? 89 | - Statistics: Do we want to show calculations based on data rather than data itself? 
90 | - Facets: Do we want to spread one of the "axes" across multiple graphs? 91 | - Coordinate system: Should it be cartesian or polar (or something else)? Flipped? Log scale? 92 | 93 | ## Setting up a ggplot Graph 94 | 95 | We can focus, at least to start, on: 96 | 97 | - Data 98 | - Aesthetic 99 | - Geometry 100 | 101 | ## Data 102 | 103 | - Data is just a regular data set (`data.frame` or `tibble`) as you'd normally work with it in R. 104 | - **ggplot2** is set up to work directly with data frames. References to variables will look in the dataset you give it 105 | 106 | ## Aesthetic 107 | 108 | - The *aesthetic* determines the "axes" of our graph 109 | - Literally the `x=` and `y=` axes, but also other axes on which data is differentiated, like `color=`, `size=`, `linetype=`, `fill=`, etc. 110 | - `aes(x=mpg,y=hp,color=transmission)` puts the `mpg` variable on the x-axis, `hp` on the y-axis, and colors things differently by `transmission` 111 | - `group=` to separate groups for things like labeling without making them visibly different 112 | - Different aesthetic values for different types of graphs/geometries 113 | 114 | ## For Example 115 | 116 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 117 | library(tidyverse) 118 | mtcars <- mtcars %>% 119 | mutate(Transmission = factor(am, labels = c('Automatic','Manual')), 120 | CarName = row.names(mtcars)) 121 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + 122 | geom_point() 123 | ``` 124 | 125 | ## Common Aesthetics 126 | 127 | See `help()` for a given geometry to see what it accepts. But common are: 128 | 129 | - `x`, `y`: position on the x and y axis 130 | - `color`, `fill`: color is for coloring things generally. For shapes, `color` is the outline and `fill` is the inside 131 | - `label`: text labels 132 | - `size`: guess what this does 133 | - `alpha`: Transparency 134 | - `linetype`, `linewidth`: for lines or outlines, the type/thickness of line (solid, dashed...) 
135 | - `xmin`/`xmax`/`xend` (and same for `y`): For line segments or shaded ranges, where do they start/end? 136 | 137 | 138 | ## Geometry 139 | 140 | - The *geometry* is what actually gets drawn. In the last slide, `geom_point()` drew points. 141 | - There are many different geometries. Let's look at the [ggplot2 cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization.pdf) 142 | - Common: `geom_point`, `geom_line`, `geom_col`, `geom_text` 143 | 144 | ## Geometry: geom_line 145 | 146 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 147 | data(economics) 148 | ggplot(economics, aes(x = date, y = uempmed)) + 149 | geom_line() 150 | ``` 151 | 152 | 153 | 154 | ## Geometries: geom_text 155 | 156 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 157 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission, label = CarName)) + 158 | geom_text() 159 | ``` 160 | 161 | ## Explicit bar graphs like Excel: geom_col 162 | 163 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 164 | data <- tibble(category = c('Apple','Banana','Carrot'), 165 | quality = c(6,4,3)) 166 | ggplot(data, aes(x = category,y=quality)) + geom_col() 167 | ``` 168 | 169 | ## Or explicit grouped charts: geom_col 170 | 171 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 172 | data <- tibble(category = c('Apple','Banana','Carrot','Apple','Banana','Carrot'), 173 | person = c('Me','Me','Me','You','You','You'), 174 | quality = c(6,4,3,1,6,3)) 175 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge') 176 | ``` 177 | 178 | ## Hm? 179 | 180 | What was that `position = 'dodge'`? 181 | 182 | - Used when you have stacked elements with the same `x` aesthetic that you want to be side-by-side 183 | - With `geom_col()` or `geom_bar()` you can set `position = 'dodge'` to make the different-fill bars go side by side 184 | - Can apply `position = position_dodge(.9)` with other geometries to make them line up. For example... 
185 | 186 | ## Dodging 187 | 188 | ```{r} 189 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge') + 190 | geom_text(aes(label = quality), position = position_dodge(.9), vjust = -.4) 191 | ``` 192 | 193 | 194 | ## Calculations 195 | 196 | - Some geometries do calculations/summaries of your data before graphing 197 | - Common: `geom_density`/`geom_histogram`/`geom_dotplot`, `geom_bar`, `geom_smooth` 198 | 199 | ## Demonstrating Distributions 200 | 201 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 202 | ggplot(mtcars, aes(x = mpg)) + geom_histogram() 203 | ``` 204 | 205 | ## Demonstrating Distributions 206 | 207 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 208 | ggplot(mtcars, aes(x = mpg)) + geom_dotplot() 209 | ``` 210 | 211 | ## Demonstrating Distributions 212 | 213 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 214 | ggplot(mtcars, aes(x = mpg)) + geom_density() 215 | ``` 216 | 217 | ## Demonstrating Distributions 218 | 219 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 220 | ggplot(mtcars, aes(y = mpg, x = factor(am))) + geom_violin() 221 | ``` 222 | 223 | 224 | 225 | ## Discrete Distributions: geom_bar 226 | 227 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 228 | ggplot(mtcars, aes(x = Transmission)) + geom_bar() 229 | ``` 230 | 231 | ## Best-fit lines with geom_smooth 232 | 233 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 234 | ggplot(mtcars, aes(x = mpg, y = hp)) + 235 | geom_point() + geom_smooth() 236 | ``` 237 | 238 | ## Trendlines with geom_smooth 239 | 240 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 241 | # Or make it straight, and separate by Transmission, and no bands 242 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + 243 | geom_point() + geom_smooth(method = 'lm', se = FALSE) 244 | ``` 245 | 246 | ## Imitation and Flattery 247 | 248 | - Uses `data(iris)` 249 | 250 | ```{r, fig.height = 4, fig.width=6} 251 | ggplot(iris, aes(x = Sepal.Width, y = 
Sepal.Length, color = Species)) + 252 | geom_point() 253 | ``` 254 | 255 | ## Imitation and Flattery 256 | 257 | - Uses `data(economics_long)` 258 | - `economics_long <- economics_long %>%` `filter(variable %in% c('psavert','uempmed'))` keeps just the two variables we want 259 | 260 | ```{r, fig.height = 4, fig.width=6} 261 | data(economics_long) 262 | economics_long <- economics_long %>% filter(variable %in% c('psavert','uempmed')) 263 | ggplot(economics_long, aes(x = date, y = value, linetype = variable)) + 264 | geom_line() 265 | ``` 266 | 267 | ## Imitation and Flattery 268 | 269 | - Uses `data(economics)` 270 | 271 | ```{r, fig.height = 4, fig.width=6} 272 | data(economics) 273 | ggplot(economics, aes(x = uempmed)) + geom_density() 274 | ``` 275 | 276 | ## Moving Forward 277 | 278 | - Everything we've done up to now isn't all *that* different from any sort of graphing software 279 | - But what really makes **ggplot2** shine is its consistency and internal logic 280 | - You'll have full access to the *scales* of the graph, as well as to the components that make it up 281 | - And manipulating those things is fairly intuitive to do rather than head-banging work 282 | - That's what we'll work on next time! 
-------------------------------------------------------------------------------- /Lecture_09_Structuring_ggplot.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 9: Structuring ggplot2" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | library(tidyverse) 19 | library(ggrepel) 20 | library(scales) 21 | ``` 22 | 23 | 24 | 25 | ## Structuring ggplot 26 | 27 | ```{r, results = 'asis'} 28 | cat(" 29 | ") 35 | ``` 36 | 37 | - Last time we went through the basics of the grammar of graphics 38 | - And how to put together a ggplot graph: 39 | - Data, aesthetics, geometry 40 | 41 | ## Today! 42 | 43 | - We'll talk about different ways of changing the *structure* of your ggplot 44 | - Scales and coordinates, labels and legends, new geometries 45 | - There are infinite possibilities here, we can't cover everything 46 | - The goal is to reduce unknown unknowns so you know where to look/Google 47 | - Also, just start typing in RStudio and see what autocompletes! 
48 | 49 | ## The Key Pieces 50 | 51 | A common structure for a `ggplot()` command with a fair amount of customization might be: 52 | 53 | ```{r, echo = TRUE, eval = FALSE} 54 | ggplot(data, aes(x = xvar, y = yvar, color = colorvar)) + 55 | geom_mygeometry(otheraesthetic = value) + 56 | scale_aesthetic_type(labels = something) + 57 | labs(x = 'My X Var', y = 'My Y Var', title = 'My Title') + 58 | guides(color = 'none') 59 | ``` 60 | 61 | ## Scales 62 | 63 | - Every axis (aesthetic element) has a *scale* 64 | - This determines how *a value in the data* (1, 2, 3) turns into *a value on the graph* (100 pixels to the right, 200 pixels, 300 pixels; or "red", "blue", "green") 65 | - These scales are fully manipulable and label-able! 66 | - `scale_axisname_type` 67 | 68 | ## Discrete and Continuous Scales 69 | 70 | - For example: 71 | - `scale_x_continuous` for a continuous x-axis 72 | - `scale_x_discrete` for a discrete one 73 | - Or similarly `scale_y_continuous`, `scale_color_continuous`, `scale_color_gradient`, `scale_fill_discrete`, and so on and so on 74 | 75 | ## Discrete and Continuous Scales 76 | 77 | - Use to: 78 | - Set colors, set limits (want the axis to extend beyond its values? `limits`) 79 | - Name the scale 80 | - Label values (for discrete values, or perhaps as percents?) 
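The three uses listed above (limits, naming, labeling) might look something like this in practice. This is a small sketch rather than code from the lecture; it assumes the built-in `mtcars` data, and the axis names and limits are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(am), y = mpg)) +
  geom_boxplot() +
  # Discrete x scale: name the scale and relabel the 0/1 values
  scale_x_discrete(name = 'Transmission',
                   labels = c('0' = 'Automatic', '1' = 'Manual')) +
  # Continuous y scale: name it and extend the axis beyond the data
  scale_y_continuous(name = 'Miles per Gallon', limits = c(0, 40))

p
```

Using a named vector for `labels` ties each label to a specific value, which sidesteps having to check what order the categories come in.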
81 | 82 | ## Setting Values Manually 83 | 84 | - `scale_axisname_manual` (for example `scale_fill_manual` or `scale_color_manual`) for discrete plots will let you set things by hand 85 | - For example if `x` is different companies and you want to pick each company's brand color for its bar, map `fill` to company too and set the colors with `scale_fill_manual` 86 | 87 | ## Examples 88 | 89 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 90 | data <- tibble(category = c('Apple','Banana','Carrot','Apple','Banana','Carrot'), 91 | person = c('Me','Me','Me','You','You','You'), 92 | quality = c(.06,.04,.03,.01,.06,.03)) 93 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge') 94 | ``` 95 | 96 | ## Setting Scales 97 | 98 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 99 | library(scales) 100 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge') + 101 | scale_y_continuous(labels = label_percent(), limits = c(0,.1)) + 102 | scale_x_discrete(position = 'top') + 103 | scale_fill_manual(values = c('Apple'='red','Banana'='yellow','Carrot'='orange')) 104 | ``` 105 | 106 | ## Setting Scales 107 | 108 | - `limits` in the `scale` function can be used to specify the vector of values to be included, or as a range for continuous variables 109 | - `labels` is also handy. `labels = c('Red Apple',` `'Yellow Banana','Orange Carrot')` would relabel the legend (or axis labels) 110 | - In a continuous function, say for dollars, `scale_x_continuous(labels = scales::label_dollar())` would put it in dollar terms. More on this in a moment 111 | - Or dates: `scale_x_date(labels = function(x)` `paste(month.abb[month(x)],year(x)))` 112 | 113 | ## Setting Color or Fill Scales 114 | 115 | - Selecting a color scale is important - we've discussed discrete and continuous palettes 116 | - There are a bunch of `scale_color_`/`scale_fill_` functions that solely exist to help with this! 
117 | - (not to mention entire packages like **paletteer**) 118 | 119 | ## Setting Color or Fill Scales 120 | 121 | Especially useful are: 122 | 123 | - `scale_color_gradient()` for gradient scales (or `_gradient2()` for diverging scales with a "middle" in them), `scale_color_viridis()` also has some great gradient scales (either discrete or continuous!) 124 | - `scale_color_brewer()`/`scale_fill_brewer()` functions for discrete values, or `_distiller()` for continuous values, or `_fermenter()` for binned 125 | 126 | ## Setting Color or Fill Scales 127 | 128 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 129 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge') + 130 | scale_y_continuous(labels = label_percent(), limits = c(0,.1)) + 131 | scale_x_discrete(position = 'top') + 132 | scale_fill_brewer(palette = 'Dark2') 133 | ``` 134 | 135 | 136 | ## Setting Color or Fill Scales 137 | 138 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 139 | ggplot(data, aes(x = person, y = quality, group = category, fill = quality)) + geom_col(position = 'dodge') + 140 | scale_y_continuous(labels = label_percent(), limits = c(0,.1)) + 141 | scale_x_discrete(position = 'top') + 142 | scale_fill_viridis_c() 143 | ``` 144 | 145 | ## Setting Color or Fill Scales 146 | 147 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 148 | ggplot(data, aes(x = person, y = quality, group = category, fill = quality)) + geom_col(position = 'dodge') + 149 | scale_y_continuous(labels = label_percent(), limits = c(0,.1)) + 150 | scale_x_discrete(position = 'top') + 151 | scale_fill_gradient2(midpoint = .03) 152 | ``` 153 | 154 | ## Transformations 155 | 156 | - We can transform values as we plot them 157 | - Many `scale_something_continuous` entries have a `trans` option, set to `date`, `log`, `probability`, `reciprocal`, `sqrt`, `reverse`, etc. etc. 
to perform that transformation before plotting 158 | - You can also do things like `scale_x_log10` or `scale_y_reverse` for some common transforms 159 | - `scale_something_binned()` takes a continuous value and puts it in bins, handy sometimes for simplification 160 | 161 | ## Transformations 162 | 163 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 164 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + 165 | geom_point() 166 | ``` 167 | 168 | ## Transformations 169 | 170 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 171 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + 172 | geom_point() + 173 | scale_x_log10() + 174 | scale_y_continuous(trans='reverse') + 175 | scale_color_binned() 176 | ``` 177 | 178 | ## Log Scales 179 | 180 | When to use log scales? 181 | 182 | - When data is highly skewed, so a few huge observations are drawing all focus 183 | - When the relationship is multiplicative so you want to show, say, percentage growth 184 | 185 | ## Labeling Scales for Continuous Variables 186 | 187 | Scale functions have a `labels=` option. 188 | 189 | - USE IT!! 190 | - This is the quickest and easiest way to make your graph look a zillion times better and more professional 191 | - We already covered this for categorical variables. `labels = c('A','B')` gives the categories the names A and B, in that order (tip: check the order first) 192 | - For continuous variables, the **scales** package helps us immensely here 193 | 194 | ## scales 195 | 196 | Two main types of functions in **scales**: 197 | 198 | - Transformation functions like `dollar()`: `dollar(10)` creates $10 (NOTE: handy sometimes in RMarkdown text! Also note this creates text, not numbers, so don't use them in `aes()` unless you want the variable to be a string) 199 | - Labeling functions like `label_dollar()` designed to slot directly into the `labels=` argument. 
`scale_y_continuous(labels = label_dollar())` turns all your y-axis labels into the dollar equivalent 200 | 201 | 202 | ## scales 203 | 204 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 205 | ggplot(data, aes(x = person, y = quality, fill = category)) + geom_col(position = 'dodge') + 206 | scale_y_continuous(labels = label_percent(), limits = c(0,.1)) 207 | ``` 208 | 209 | ## Scales 210 | 211 | The `label_` functions have lots of options! You can set the `accuracy` (precision), decide how to break up big numbers (`big.mark`) or scale things down to, say, thousands! (`scale=1/1000, suffix = 'k'`) 212 | 213 | ```{r, echo = TRUE, fig.height = 2, fig.width = 6} 214 | data(gapminder, package = 'gapminder') 215 | ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point() + 216 | scale_x_log10(labels = label_dollar(accuracy = 1, scale = 1/1000, suffix = 'k')) 217 | ``` 218 | 219 | ## Documentation 220 | 221 | - There are a zillion options 222 | - Be sure to use `help(whatever)` before using something 223 | - Not sure what the function is? Check the **ggplot2** documentation page 224 | - Or just start typing `scales_` or whatever and see what pops up in autocomplete in RStudio 225 | 226 | ## Labels with labs() 227 | 228 | - Please label your axes and graphs!! 
229 | - Label legends too by naming their axes 230 | - Captions and subcaptions available too 231 | - Annotation and labeling data points we'll do next time 232 | 233 | ## All Kinds of Labels 234 | 235 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 236 | data(mtcars) 237 | mtcars <- mtcars %>% mutate(CarName = row.names(mtcars)) 238 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + geom_point() + 239 | scale_x_log10() + scale_y_reverse() + scale_color_binned() + 240 | labs(x = 'Miles per Gallon', y = 'Horsepower', color = 'Car Weight', 241 | title = 'Title', subtitle = 'Subtitle', caption = 'Caption') 242 | ``` 243 | 244 | ## Moving the Legend 245 | 246 | - Working with **ggplot2** legends is one of the trickier things 247 | - we'll do a lot more with aesthetic details next time but for now let's just move it and outline it. 248 | 249 | ## Moving the Legend 250 | 251 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 252 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + geom_point() + 253 | scale_x_log10() + scale_y_reverse() + scale_color_binned() + 254 | labs(x = 'Miles per Gallon', y = 'Horsepower', color = 'Car Weight', 255 | title = 'Title', subtitle = 'Subtitle', caption = 'Caption') + 256 | theme(legend.position = c(.9,.3), legend.box.background = element_rect(color='black')) 257 | ``` 258 | 259 | ## Or Removing It 260 | 261 | 262 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 263 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + geom_point() + 264 | scale_x_log10() + scale_y_reverse() + scale_color_binned() + 265 | labs(x = 'Miles per Gallon', y = 'Horsepower', color = 'Car Weight') + 266 | guides(color = 'none') 267 | ``` 268 | 269 | ## Overlaid Graphs 270 | 271 | - In addition to adding chart elements, we can add whole new geometries over the top 272 | - Some stack naturally (adding a `geom_smooth` best fit line over top, for example) 273 | - For others we may have to change the data or aesthetic as we go 274 | 275 | ## Overlaid 
Graphs 276 | 277 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 278 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + geom_point() + 279 | scale_x_log10() + scale_y_reverse() + scale_color_binned() + 280 | labs(x = 'Miles per Gallon', y = 'Horsepower', color = 'Car Weight') + 281 | geom_smooth(method='lm', se = FALSE) 282 | ``` 283 | 284 | ## Overlaid Graphs 285 | 286 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 287 | ggplot(mtcars, aes(x = mpg, y = hp, color = wt)) + geom_point() + 288 | scale_x_log10() + scale_y_reverse() + scale_color_binned() + 289 | labs(x = 'Miles per Gallon', y = 'Horsepower', color = 'Car Weight') + 290 | geom_text_repel(data = mtcars %>% slice(1:5),aes(label = CarName),hjust=-1) 291 | ``` 292 | 293 | ## Facets 294 | 295 | - Separate graphs by group! 296 | 297 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 298 | ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point() + 299 | facet_wrap('cyl') + 300 | labs(x = 'Miles per Gallon', y = 'Horsepower', title = 'Horsepower vs. MPG by Cylinders') 301 | ``` 302 | 303 | 304 | ## ggforce's facet_zoom 305 | 306 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 307 | library(ggforce) 308 | ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) + 309 | geom_point() + 310 | facet_zoom(x = Species == 'versicolor') 311 | ``` 312 | 313 | ## Multiple-Graph Notes 314 | 315 | - For overlaying the *same type* of graph, say two line graphs for two variables, don't `+` different geometries, you're almost always better off doing a `pivot_longer()` first and just using `aes()`. 316 | - For sticking together multiple completely disparate graphs, don't use facets, instead load **patchwork** and combine them as desired - we'll discuss this later 317 | 318 | ## Imitation and Flattery 319 | 320 | - Pick one of these graphs or from the previous lecture 321 | - Go into the geometry's help file and recreate an example, or the graph in these slides 322 | - Mess with the scales and labels! 
Try to get it to work! 323 | -------------------------------------------------------------------------------- /Lecture_10_Styling_ggplot.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 10: Styling ggplot2" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | library(tidyverse) 19 | library(ggrepel) 20 | library(ggpubr) 21 | ``` 22 | 23 | 24 | 25 | ## Styling ggplot 26 | 27 | ```{r, results = 'asis'} 28 | cat(" 29 | ") 35 | ``` 36 | 37 | - We should have a decent idea at this point of how to put a **ggplot2** graph together 38 | - But how do we make it look good? 39 | 40 | ## Today! 41 | 42 | - We'll talk about how to set aesthetic attributes along every axis 43 | - As well as overall styling and theming 44 | - Highlighting, putting graphs together, making them interactive, annotating... 45 | - **Think about this stuff**. Your graphs will look unprofessional if you don't, and I'll start grading down all-default styling or missed opportunities for improvement. 46 | 47 | ## Aesthetic Characteristics 48 | 49 | - As you'd expect, we'll be setting the values of certain aesthetic properties 50 | - Which ones apply depend on the geometry. See [this page](https://ggplot2.tidyverse.org/articles/ggplot2-specs.html). Some common ones: 51 | - color, linetype, fill, size, shape (and soon: linewidth) 52 | - Text aesthetics: family (font), fontface, hjust, vjust (see **extrafont** package to use all your system fonts) 53 | 54 | ## Decorating Axes vs. 
Decorating Geometries 55 | 56 | - When a characteristic is set *inside* of an `aes()`, it becomes an axis and must be mapped to a variable name 57 | - *Outside* of an `aes()` (and in the geometry), it's a setting applied to the entire geometry 58 | 59 | ## Aesthetic Characteristics 60 | 61 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 62 | # We've seen this before 63 | mtcars <- mtcars %>% 64 | mutate(Transmission = factor(am, labels = c('Automatic','Manual'))) 65 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + 66 | geom_point() 67 | ``` 68 | 69 | ## Aesthetic Characteristics 70 | 71 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 72 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission, 73 | size = wt, shape = Transmission)) + 74 | geom_point() 75 | ``` 76 | 77 | ## Aesthetic Characteristics 78 | 79 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 80 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission, 81 | size = wt)) + 82 | geom_point(shape = 10) 83 | ``` 84 | 85 | ## Aesthetic Characteristics 86 | 87 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 88 | ggplot(mtcars, aes(x = Transmission, fill = factor(cyl))) + 89 | geom_bar(position = 'dodge', linetype = 'dashed', color = 'black') 90 | ``` 91 | 92 | ## Coordinates 93 | 94 | - Everything *behind* the geometries can be manipulated as well 95 | - For example, we can change the coordinate system (usually to `coord_flip` it for bar graphs) 96 | 97 | ## Coordinates 98 | 99 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 100 | library(ggalt) 101 | mtcars2 <- mtcars %>% group_by(Transmission) %>% summarize(count = n()) 102 | ggplot(mtcars2, aes(x = Transmission,y=count)) + geom_lollipop() 103 | ``` 104 | 105 | ## Coordinates 106 | 107 | ```{r, echo = TRUE, fig.height = 4, fig.width = 6} 108 | # geom_lollipop in particular has a 'horizontal' option anyway 109 | # so this is just for demonstration 110 | ggplot(mtcars2, aes(x = Transmission,y=count)) + 111 | 
geom_lollipop(color = 'red', size = 2) + coord_flip() 112 | ``` 113 | 114 | ## Axes 115 | 116 | - We can control axis text with `axis.text` and the ticks we mark on the `scale` 117 | - `element_text` provides text objects 118 | 119 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6} 120 | ggplot(mtcars2, aes(x = Transmission,y=count)) + 121 | geom_lollipop(color = 'red', size = 2) + coord_flip() + 122 | labs(x = '', y = '') + 123 | scale_y_continuous(breaks = c(10,20), limits = c(0,20)) + 124 | theme(axis.text.x = element_text(size = 12,family='serif'), 125 | axis.text.y = element_text(size = 16,family='serif', angle = 10)) 126 | ``` 127 | 128 | ## Axes 129 | 130 | ```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6} 131 | ggplot(mtcars2, aes(x = Transmission,y=count)) + 132 | geom_lollipop(color = 'red', size = 2) + coord_flip() + 133 | labs(x = '', y = '') + 134 | scale_y_continuous(breaks = c(10,20), limits = c(0,20)) + 135 | theme(axis.text.x = element_text(size = 12,family='serif'), 136 | axis.text.y = element_text(size = 16,family='serif', angle = 10)) 137 | ``` 138 | 139 | 140 | ## Axes 141 | 142 | - Order for discrete axes can be changed with `limits` 143 | 144 | ```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6} 145 | ggplot(mtcars2, aes(x = Transmission,y=count)) + 146 | geom_lollipop(color = 'red', size = 2) + coord_flip() + 147 | scale_x_discrete(limits = c("Manual", "Automatic")) 148 | ``` 149 | 150 | ## Theming 151 | 152 | - Pretty much all elements of presentation that aren't in a geometry can be controlled with `theme()` 153 | - The options are endless! See `help(theme)` 154 | - We set different aspects of the theme using `element_` functions like `element_text()`, `element_line()`, `element_rect()` which take aesthetic settings like `size`, `color`, etc. 
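
As a minimal sketch of how the `element_` functions slot into `theme()` (the colors and sizes here are arbitrary choices for illustration, not a recommended style):

```{r, echo = TRUE, fig.height = 4, fig.width = 6}
library(ggplot2)

# Each theme() argument names a plot element; the element_ function
# describes how that element should be drawn
p_theme <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  labs(title = 'Horsepower vs. MPG') +
  theme(plot.title = element_text(size = 16, face = 'bold'),  # a text element
        panel.grid.major = element_line(color = 'gray80'),    # a line element
        panel.background = element_rect(fill = 'white'),      # a rectangle element
        panel.grid.minor = element_blank())                   # drop an element entirely
p_theme
```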
155 | 156 | ## Axes 157 | 158 | - Change axis and tick lines with `axis.line` and `element_line` 159 | - Most everything can be changed generally (`axis.line` or even `line`) or specifically (`axis.ticks.x`) 160 | 161 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6} 162 | ggplot(mtcars2, aes(x = Transmission,y=count)) + 163 | geom_lollipop(color = 'red', size = 2) + coord_flip() + 164 | labs(x = '', y = '') + 165 | scale_y_continuous(breaks = c(10,20), limits = c(0,20)) + 166 | theme(axis.line = element_line(color = 'red'), 167 | axis.ticks.x = element_line(size = 5)) 168 | ``` 169 | 170 | ## Axes 171 | 172 | ```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6} 173 | ggplot(mtcars2, aes(x = Transmission,y=count)) + 174 | geom_lollipop(color = 'red', size = 2) + coord_flip() + 175 | labs(x = '', y = '') + 176 | scale_y_continuous(breaks = c(10,20), limits = c(0,20)) + 177 | theme(axis.line = element_line(color = 'red'), 178 | axis.ticks.x = element_line(size = 5)) 179 | ``` 180 | 181 | ## Background 182 | 183 | - Work on the `panel` to change what goes behind that geometry! `element_rect` might come up! 
- Pretty much ANY element can be eliminated with `element_blank()`
- This is gonna be ugly

```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6}
ggplot(mtcars2, aes(x = Transmission,y=count)) +
  geom_lollipop(color = 'red', size = 2) + coord_flip() +
  theme(panel.background = element_rect(color = 'blue', fill = 'yellow'),
        panel.grid.major.x = element_line(color = 'green'),
        panel.grid.major.y = element_blank())
```

## Background

```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6}
ggplot(mtcars2, aes(x = Transmission,y=count)) +
  geom_lollipop(color = 'red', size = 2) + coord_flip() +
  theme(panel.background = element_rect(color = 'blue', fill = 'yellow'),
        panel.grid.major.x = element_line(color = 'green'),
        panel.grid.major.y = element_blank())
```

## Legends

- We showed a bit last time how legends can be changed.
- Note in `element_text` (and elsewhere, like annotation or `geom_text`), `hjust` and `vjust` align text horizontally and vertically. Set 0/1 for Left/Right or Bottom/Top
- `legend.position` is given as a proportion of the frame. .8 = 80% of the way to the right.
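
To make the `hjust` values concrete, here is a small sketch (the data frame and labels are made up for illustration) that anchors three labels at the same x position with different `hjust` settings:

```{r, echo = TRUE, fig.height = 4, fig.width = 6}
library(ggplot2)

# Hypothetical demo data: three labels, all anchored at x = 0
just_demo <- data.frame(x = 0, y = 1:3,
                        hjust = c(0, .5, 1),
                        label = c('hjust = 0', 'hjust = 0.5', 'hjust = 1'))

p_just <- ggplot(just_demo, aes(x = x, y = y, label = label, hjust = hjust)) +
  geom_vline(xintercept = 0, linetype = 'dashed') +  # marks the anchor point
  geom_text() +
  scale_x_continuous(limits = c(-1, 1))
p_just
```

With `hjust = 0` the text starts at the anchor (left-aligned), and with `hjust = 1` it ends there (right-aligned).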
210 | 211 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6} 212 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() + 213 | theme(legend.text = element_text(hjust = .5, family = 'serif'), 214 | legend.title = element_text(hjust = 1, family = 'serif', face = 'bold'), 215 | legend.background = element_rect(color = 'black', fill = 'white'), 216 | legend.position = c(.8,.7)) 217 | ``` 218 | 219 | ## Legends 220 | 221 | ```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6} 222 | ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() + 223 | theme(legend.text = element_text(hjust = .5, family = 'serif'), 224 | legend.title = element_text(hjust = 1, family = 'serif', face = 'bold'), 225 | legend.background = element_rect(color = 'black', fill = 'white'), 226 | legend.position = c(.8,.7)) 227 | ``` 228 | 229 | ## Themes 230 | 231 | - The theme is immensely flexible. We didn't even cover everything, like the plot title/caption 232 | - You can save your `theme()` settings as an object to use repeatedly as a house style 233 | - Or use a prepackaged one! You can even tack on more `theme()` customization afterwards. 234 | - Many `theme_` settings you can tack on. I only use a few... 235 | - (many many others... want to look like FiveThirtyEight or The Economist? 
Check **ggthemes**)

## Prepackaged Themes: Default

```{r, echo = TRUE, fig.height = 4, fig.width = 6}
ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point()
```

## Prepackaged Themes: Classic

```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6}
ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
  theme_classic()
```

## Prepackaged Themes: void (for things like flowcharts)

```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6}
ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
  theme_void()
```

## Prepackaged Themes: tufte (in ggthemes)

```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6}
library(ggthemes)
ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
  theme_tufte()
```

## Prepackaged Themes: theme_pubr (in ggpubr), my go-to

```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6}
ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
  theme_pubr()
```

## General Theming Notes

- How can we think about theming generally?
- We want, as always, to drive focus towards our story
- This means not making the theme distracting, but using it to drive attention to what we want
- Need focus on the x-axis labels? Make 'em bold!
- Often we don't need distracting background color (as there is in the default theme). Usually get rid of it!

## Other Styling

- Let's move on to some other ways we can build focus and clarity
- Highlighting and annotations!
284 | 285 | ## Highlighting 286 | 287 | - Highlight certain values and gray others with the **gghighlight** package 288 | 289 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6} 290 | library(gghighlight); data(gapminder, package = 'gapminder') 291 | ggplot(gapminder, aes(x = year, y = lifeExp, color=country)) + geom_line(size = 1.5) + 292 | labs(x = NULL, y = "Life Expectancy", title = "North America Only") + 293 | scale_x_continuous(limits=c(1950,2015), 294 | breaks = c(1950,1970,1990,2010))+ 295 | gghighlight(country %in% c('United States','Canada','Mexico'), 296 | unhighlighted_params = aes(size=.1), 297 | label_params=list(direction='y',nudge_x=10)) + 298 | theme_minimal(base_family='serif') 299 | ``` 300 | 301 | ## Highlighting 302 | 303 | 304 | ```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6} 305 | library(gghighlight); data(gapminder, package = 'gapminder') 306 | ggplot(gapminder, aes(x = year, y = lifeExp, color=country)) + geom_line(size = 1.5) + 307 | labs(x = NULL, y = "Life Expectancy", title = "North America Only") + 308 | scale_x_continuous(limits=c(1950,2015), 309 | breaks = c(1950,1970,1990,2010))+ 310 | gghighlight(country %in% c('United States','Canada','Mexico'), 311 | unhighlighted_params = aes(size=.1), 312 | label_params=list(direction='y',nudge_x=10)) + 313 | theme_minimal(base_family='serif') 314 | ``` 315 | 316 | ## Highlighting 317 | 318 | - Or do it manually with `scale_X_manual()` 319 | 320 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6} 321 | gapminder %>% mutate(color_name = ifelse(country %in% c('United States','Canada','Mexico'), as.character(country), 'Other')) %>% 322 | ggplot(aes(x = year, y = lifeExp, group = country, color=color_name, size = color_name, alpha = color_name)) + geom_line() + 323 | geom_text(aes(label = ifelse(year == 2007 & color_name != 'Other', color_name,'')), 324 | hjust = 0, size = 13/.pt) + 325 | labs(x = NULL, y = "Life Expectancy", title = "North America 
Only") + 326 | scale_x_continuous(limits=c(1950,2025), 327 | breaks = c(1950,1970,1990,2010))+ 328 | scale_color_manual(values = c('red','forestgreen','gray','blue')) + 329 | scale_size_manual(values = c(1.5,1.5,.1,1.5)) + 330 | scale_alpha_manual(values = c(1,1,.2,1)) + 331 | theme_minimal(base_family='serif') + 332 | guides(color = 'none', size = 'none', alpha = 'none') 333 | ``` 334 | 335 | ## Highlighting 336 | 337 | ```{r, echo = FALSE, eval = TRUE, fig.height = 4, fig.width = 6} 338 | gapminder %>% mutate(color_name = ifelse(country %in% c('United States','Canada','Mexico'), as.character(country), 'Other')) %>% 339 | ggplot(aes(x = year, y = lifeExp, group = country, color=color_name, size = color_name, alpha = color_name)) + geom_line() + 340 | geom_text(aes(label = ifelse(year == 2007 & color_name != 'Other', color_name,'')), 341 | hjust = 0, size = 13/.pt) + 342 | labs(x = NULL, y = "Life Expectancy", title = "North America Only") + 343 | scale_x_continuous(limits=c(1950,2025), 344 | breaks = c(1950,1970,1990,2010))+ 345 | scale_color_manual(values = c('red','forestgreen','gray','blue')) + 346 | scale_size_manual(values = c(1.5,1.5,.1,1.5)) + 347 | scale_alpha_manual(values = c(1,1,.2,1)) + 348 | theme_minimal(base_family='serif') + 349 | guides(color = 'none', size = 'none', alpha = 'none') 350 | ``` 351 | 352 | ## Annotation 353 | 354 | - There are also many options for *annotating* our data 355 | - For direct data point annotation, especially of line graphs, you can do this with some data manipulation and `geom_text` (see previous slide) 356 | - For putting a block of text I recommend **ggannotate** 357 | - I also recommend looking into [**ggtext**](https://wilkelab.org/ggtext/) which allows rich-text formatting of all text on the graph, but we won't be covering it here. 
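
For comparison with the interactive approach, a hand-written block of text uses the regular **ggplot2** `annotate()` function with coordinates picked by eye. A minimal sketch (the coordinates and wording are arbitrary, chosen to fit `mtcars`):

```{r, echo = TRUE, fig.height = 4, fig.width = 6}
library(ggplot2)

p_note <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  # A block of text placed at hand-picked coordinates
  annotate(geom = 'text', x = 22, y = 250, hjust = 0,
           label = 'Higher-mpg cars tend to\nhave less horsepower') +
  # An arrow running from below the text down toward the points
  annotate(geom = 'segment', x = 27, xend = 30, y = 225, yend = 125,
           arrow = arrow(length = unit(0.1, 'inches')))
p_note
```

Getting the coordinates right usually takes a few rounds of trial and error.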
358 | 359 | ## Direct Line Labels 360 | 361 | ```{r, echo = TRUE, eval = TRUE, fig.height = 4, fig.width = 6} 362 | p <- gapminder %>% 363 | group_by(country) %>% 364 | mutate(country_label = ifelse(year == max(year), as.character(country), NA_character_)) %>% 365 | filter(country %in% c('United States','Canada','Mexico')) %>% 366 | ggplot(aes(x = year, y = lifeExp, color=country)) + geom_line(size = 1.5) + 367 | scale_x_continuous(limits = c(1950,2025))+ theme_classic()+guides(color='none')+ 368 | geom_text(aes(x = year + 2,label = country_label), hjust = 0) 369 | p 370 | ``` 371 | 372 | ## Overall Annotation 373 | 374 | - You can use the regular **ggplot2** `annotate()` function, but this takes some work 375 | - **ggannotate** will let you work hands-on with it, and it gives you code at the end 376 | - Also works for adding lines and arrows 377 | - The `geom_mark` functions from **ggforce** are good for shape highlights ([here](https://ggforce.data-imaginist.com/reference/geom_mark_circle.html)) 378 | - Excel-esque but the end result is more rigorous 379 | - Install with `remotes::install_github` `("mattcowgill/ggannotate")`, not `install.packages()` 380 | 381 | ## ggannotate 382 | 383 | - Let's see what happens when we run this... 
384 | 385 | 386 | ```{r, echo = TRUE, eval = FALSE, fig.height = 4, fig.width = 6} 387 | # Remember we created p a few slides ago 388 | ggannotate::ggannotate(p) 389 | ``` 390 | 391 | ## ggannotate 392 | 393 | - We get code chunks to add to our graph to get our annotations 394 | 395 | ```{r, echo=FALSE, eval =TRUE, fig.height = 5, fig.width = 7} 396 | p + geom_text(data = data.frame(x = 1994.88673320614, y = 60.8970008395899, label = "Mexico catches up considerably\nthrough the latter half of the 20th century."), 397 | mapping = aes(x = x, y = y, label = label), 398 | inherit.aes = FALSE) + 399 | geom_curve(data = data.frame(x = 1994.11374242598, y = 63.8009475579236, xend = 1991.27944289874, yend = 69.80243744248), 400 | mapping = aes(x = x, y = y, xend = xend, yend = yend), 401 | arrow = arrow(30L, unit(0.1, "inches"), 402 | "last", "closed"), 403 | inherit.aes = FALSE) 404 | ``` 405 | 406 | ## Export 407 | 408 | - Alright, we've crafted our perfect graph 409 | - What do we do now? 410 | - `ggsave()` will save our result to file (or we can Export Image in the Plots frame of RStudio) 411 | - But we have some more options still! 412 | 413 | ## Sticking Graphs Together 414 | 415 | - If you have multiple plots that go together, consider using **patchwork** to put them all into one 416 | - With the library loaded, you can just add plots together and they get stitched! 
- Let's stitch together some of the plots from this presentation
- `+` to stick together, `|` to put "beside", `-` to move to next column, `/` for the next row


## Patchwork

```{r, echo = FALSE, eval = TRUE}
p1 <- ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission)) + geom_point() +
  theme_tufte()
p2 <- ggplot(mtcars, aes(x = mpg, y = hp, color = Transmission,
                         size = wt, shape = Transmission)) +
  geom_point()
p3 <- ggplot(mtcars2, aes(x = Transmission,y=count)) +
  geom_lollipop(color = 'red', size = 2) + coord_flip() +
  theme(panel.background = element_rect(color = 'blue', fill = 'yellow'),
        panel.grid.major.x = element_line(color = 'green'),
        panel.grid.major.y = element_blank())
p4 <- ggplot(mtcars, aes(x = Transmission, fill = factor(cyl))) +
  geom_bar(position = 'dodge', linetype = 'dashed', color = 'black')
```


```{r, echo= TRUE, eval = TRUE, fig.height = 5, fig.width = 12}
library(patchwork)
# Named and saved these earlier
(p1 | p2 - p3)/p4
```


## Practice

- We may not have time today to practice much of this
- But I want you to take a basic graph, a real basic one, like this:
- `ggplot(mtcars, aes(x = mpg, y = hp)) +` `geom_point()`
- And just see how you can make it look. Mess around with it
- The only goal is to try stuff out.
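
One possible starting point, offered as a sketch and not as the "right" answer: save a few `theme()` settings as a reusable house style, apply it to the basic graph, and write the result out with `ggsave()`. The object names, colors, and labels below are all arbitrary choices for illustration.

```{r, echo = TRUE, fig.height = 4, fig.width = 6}
library(ggplot2)

# A reusable "house style": a prepackaged theme plus a few tweaks
house_style <- theme_classic() +
  theme(text = element_text(family = 'serif', size = 13),
        axis.title = element_text(face = 'bold'))

p_practice <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(color = 'firebrick', size = 2) +
  labs(x = 'Miles per Gallon', y = 'Horsepower',
       title = 'More Powerful Cars Get Worse Mileage') +
  house_style

# Save to a temporary file; swap in a real path for actual use
out_file <- tempfile(fileext = '.png')
ggsave(out_file, plot = p_practice, width = 6, height = 4)
```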
--------------------------------------------------------------------------------
/Lecture_11_Bob_Ross.Rmd:
--------------------------------------------------------------------------------
---
title: "Painting With the Prof"
author: "Nick Huntington-Klein"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

Our setup code:

```{r, echo = TRUE}
# install.packages(c('ggpubr','ggforce','ggalt'))

library(tidyverse)
library(ggpubr)
library(ggforce)
library(ggalt)

google <- read_csv('https://github.com/NickCH-K/causalbook/raw/main/EventStudies/google_stock_data.csv') %>%
  pivot_longer(cols = c('Google_Return','SP500_Return'))
spend <- read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/gss_spending.csv')
#https://vincentarelbundock.github.io/Rdatasets/doc/stevedata/gss_spending.html
```

# Google's Alphabet Announcement

On August 10, 2015, Google announced that they were reorganizing the company to fit underneath Alphabet, a new umbrella company. The GOOG stock would now be stock in Alphabet. Did this announcement affect the Google stock price?

```{r, echo = FALSE}
gdata <- google %>%
  mutate(name = case_when(
    name == 'Google_Return' ~ 'Google Return',
    name == "SP500_Return" ~ "S&P 500 Return"
  ))

ggplot(gdata, aes(x = Date, y = value, linetype = name, color = name)) +
  geom_line(size = 1) +
  geom_vline(aes(xintercept = as.Date('2015-08-10')), linetype = 'dashed') +
  annotate(geom = 'label',x = as.Date('2015-08-10'), y = .1,
           label = 'Alphabet\nAnnouncement',
           family = 'serif', size = 13/.pt,
           hjust = .5)+
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = c('red','gray')) +
  theme_pubr() +
  theme(text = element_text(family = 'serif', size = 13),
        legend.position = c(.2, .9)) +
  labs(y = 'Daily Return', title = 'Google Stock Return Around Alphabet Announcement',
       color = NULL) +
  guides(linetype = 'none') +
  facet_zoom(xlim = c(as.Date('2015-08-05'),as.Date('2015-08-15')), zoom.size = 1)

```


# Life by Education

```{r, echo = TRUE}
spend <- read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/gss_spending.csv')
#https://vincentarelbundock.github.io/Rdatasets/doc/stevedata/gss_spending.html

spend <- spend %>%
  dplyr::filter(!is.na(degree)) %>%
  transmute(BA = degree >= 3,
            income_above_median = rincom16 >= median(rincom16, na.rm = TRUE),
            conservative = polviews >= 5,
            news_reader = news %in% 1:2,
            full_time = wrkstat == 1) %>%
  group_by(BA) %>%
  # everything() covers all non-grouping columns
  summarize(across(everything(), function(x) mean(x, na.rm = TRUE)))

vtable::vtable(spend, labels = c('Has Bachelor or Grad Degree',
                                 'Income above median',
                                 '5-7 on political views scale',
                                 'Reads news at least a few times a week or daily',
                                 'Works full-time'))
```

```{r, echo = FALSE}
knitr::kable(spend) %>% kableExtra::kable_styling(bootstrap_options = 'striped')
```

```{r, echo = 
TRUE} 87 | spend <- spend %>% 88 | pivot_longer(cols = c('income_above_median','conservative','news_reader','full_time')) 89 | spend <- spend %>% 90 | pivot_wider(id_cols = name, names_from = BA) %>% 91 | rename(ba_value = `TRUE`, no_ba_value = `FALSE`) 92 | ``` 93 | 94 | ```{r} 95 | ggplot(spend, aes(x = no_ba_value, xend = ba_value, y = name)) + 96 | geom_dumbbell(size_x = 3, size_xend = 3, colour = 'black', colour_x = 'red', colour_xend = 'blue') + 97 | labs(x = 'Proportion', y = NULL, 98 | title = 'A Different Financial World for BA-Holders') + 99 | annotate(geom = 'text', x = spend$no_ba_value[3], y = 'news_reader', label = 'No BA', vjust = -1) + 100 | annotate(geom = 'text', x = spend$ba_value[3], y = 'news_reader', label = 'BA', vjust = -1) + 101 | theme_minimal() + 102 | theme(panel.grid.major.y = element_blank(), 103 | panel.grid.minor.y = element_blank(), 104 | text = element_text(size = 15)) + 105 | scale_x_continuous(labels = scales::percent, limits = c(0,1)) + 106 | scale_y_discrete(labels = c( 'Conservative', 'Works Full-Time','Income above Median','Frequent News Reader')) 107 | ``` -------------------------------------------------------------------------------- /Lecture_12_Hexmap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_12_Hexmap.png -------------------------------------------------------------------------------- /Lecture_12_State_Heatmap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_12_State_Heatmap.png -------------------------------------------------------------------------------- /Lecture_13_climate.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_13_climate.png -------------------------------------------------------------------------------- /Lecture_14_Clutter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_Clutter.png -------------------------------------------------------------------------------- /Lecture_14_Color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_Color.png -------------------------------------------------------------------------------- /Lecture_14_Dashboards_in_R.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 14: Dashboards in R" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | library(tidyverse) 19 | library(paletteer) 20 | options("kableExtra.html.bsTable" = T) 21 | rinlinevarname <- function(code){ 22 | html <- '``` `CODE` ```' 23 | sub("CODE", code, html) 24 | } 25 | ``` 26 | 27 | 28 | ## Dashboards 29 | 30 | ```{r, results = 'asis'} 31 | cat(" 32 | ") 38 | ``` 39 | 40 | - Everything we've done so far has focused on *single images* 41 | - But there's only so much you can fit on one image! 
(although it's still a lot if done right) 42 | - The hot new thing is *dashboards* 43 | - Dashboards are *many* visualizations and tables on a single page to let you really explore the data 44 | - Often, interactive - hover-over for more information, or use menus to change which of the underlying data you're looking at 45 | 46 | ## Tips for Dashboards 47 | 48 | - (Inspired by [this article](https://www.tableau.com/about/blog/2017/10/7-tips-and-tricks-dashboard-experts-76821).) 49 | - Avoid clutter both in presentation *and* interactivity 50 | - Use a grid 51 | - Use consistent and clear colors and fonts 52 | - No scrolling! Dashboards aren't articles (those are neat too though) 53 | - Draw focus within visualizations as always (use differences *between* visualizations as well) 54 | - Also draw focus *between* visualizations with location and size 55 | 56 | ## Avoid Clutter 57 | 58 | - We know this one! 59 | - In the context of dashboards, it also means not presenting *all* the results you can 60 | - Think about what's redundant, what's unnecessary, what's distracting, and what's likely to be misinterpreted 61 | - Get rid of 'em! 62 | - Also avoid big walls of color 63 | 64 | ## Avoid Clutter 65 | 66 | ```{r} 67 | knitr::include_graphics('Lecture_14_Clutter.png') 68 | ``` 69 | 70 | 71 | ## Fonts and Colors 72 | 73 | - Big and bold for up-top conclusions 74 | - Smaller, generally serif fonts for other graphs 75 | - Avoid using LOTS of fonts 76 | - You can help the reader see the structure by having one font for BIG, another for medium, another for small 77 | 78 | ## Fonts and Colors 79 | 80 | - When it comes to color, one or two is often enough! 
- Instead of lots of color designating groups, standard here is to break groups into single-group graphs (facets)


```{r}
knitr::include_graphics('Lecture_14_Color.png')
```

## Focus

- As always, figure out the story and focus on that
- Buzzword here is KPI: "Key performance indicator". Make it big, make it stand out!
- Notice how readable ALL of these have been at tiny sizes!

```{r}
knitr::include_graphics('Lecture_14_KPI.png')
```


## Examples

- Let's explore each of these examples and see what works and what might not
- [Year-to-Date Income Statement](https://public.tableau.com/views/IncomeStatement_4/IncomeStatement?:embed=y&:display_count=yes&publish=yes&:showVizHome=no)
- [NBA Scoring Dashboard](https://rpubs.com/jcheng/nba1)
- [Economic Indicators for the USA](https://research.stlouisfed.org/dashboard/1151)

## Dashboards in R

- R dashboards give you access to a huge range of free premade graphical tools through htmlwidgets
- See the [htmlwidgets gallery](https://www.htmlwidgets.org/showcase_leaflet.html)
- Good for everything, but the maps do really stand out
- As you learn these tools you can include them not just in dashboards but in RMarkdown docs, too
- And also, it's not just htmlwidgets, but also Shiny!

## Shiny

- Shiny is more internal to R than htmlwidgets; it's an R-specific interactive site builder
- You can add user controls that let users change parameters and see the results
- Great for dashboards, but even beyond that, learning Shiny for the purpose of dashboards will also give you the tools to create data-based *apps* that you can upload on their own
- Super cool stuff you can do if you branch out with this!

## On to dashboards

- There are two approaches to doing dashboards in R.
One is Shiny-focused, and the other is htmlwidgets focused (but lets you include Shiny stuff) 124 | - We'll be going with the latter, as it's easier, but the Shiny one is good too. 125 | - Our approach will be to use **flexdashboard** 126 | - Install **flexdashboard** with `install.packages('flexdashboard')` 127 | - Then you can open a new Flex Dashboard by going File $\rightarrow$ New File $\rightarrow$ R Markdown $\rightarrow$ From Template $\rightarrow$ Flex Dashboard. 128 | - Plenty of info on [https://rmarkdown.rstudio.com/flexdashboard/](https://rmarkdown.rstudio.com/flexdashboard/) 129 | 130 | ## What Do We See? 131 | 132 | ```{r} 133 | knitr::include_graphics('Lecture_14_Flexdashboard.png') 134 | ``` 135 | 136 | ## What Do We See? 137 | 138 | - Up top is the YAML - we know this from working with RMarkdown docs. Basic info about the doc itself 139 | - Then we have our *layout structure* - columns of stuff, with multiple graphs per columns 140 | - We can if we prefer lay things out row-wise (see next slide), and "storyboard" where it goes from page to page like a flipbook 141 | - Then within each column we have our graphs, labeled with three hashes 142 | 143 | ## Row orientation 144 | 145 | ```{r} 146 | knitr::include_graphics('Lecture_14_Flexdashboard_row.png') 147 | ``` 148 | 149 | ## Content 150 | 151 | - We can of course fill those charts with regular-ol **ggplot2** output 152 | - (or **gganimate**, or whatever else produces an image file) 153 | - Or Shiny stuff, which we'll get to next time 154 | - But also there are a bunch of packages designed to output **htmlwidgets** that we can include! 
Many of them look real nice
- Let's peruse: [The HTMLWidgets R Gallery](http://gallery.htmlwidgets.org/)
- We'll go more into detail on these next time


## Material

- When it comes to any complex programmatic system like this, you always want to look at the docs at least a little
- Let's just spend some time walking through [https://rmarkdown.rstudio.com/flexdashboard/](https://rmarkdown.rstudio.com/flexdashboard/) if only to see what's possible and what we need to know
- Then come back here

## Practice

- Open a Flex Dashboard template
- Load any data set (in the Dashboard code)
- Make a Flex Dashboard and set its `theme`
- Include three graphs: one regular **ggplot2**, that same graph again but with a different theme, and one graph from [http://gallery.htmlwidgets.org/](http://gallery.htmlwidgets.org/) - an easy pick is [dygraphs](https://rstudio.github.io/dygraphs/) line graphs (although you do have to format your data as `xts`... try the **tbl2xts** package and its `tbl_xts()` function)
- If you can make it look nice that's great! But focus more on getting it to work.
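
If you would rather not add another package, the `xts` conversion can also be sketched directly with `xts::xts()`. This assumes the **xts** and **dygraphs** packages are installed, and it uses the `economics` data that ships with **ggplot2** purely as a stand-in for your own data:

```{r, echo = TRUE}
library(xts)
library(dygraphs)

# ggplot2::economics has a `date` column and several monthly series
econ <- ggplot2::economics

# xts() pairs the values with the dates that index them
unemp_xts <- xts(econ$unemploy, order.by = econ$date)
colnames(unemp_xts) <- 'unemploy'

dygraph(unemp_xts, ylab = 'Number Unemployed (Thousands)')
```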
--------------------------------------------------------------------------------
/Lecture_14_Example_Dashboard.Rmd:
--------------------------------------------------------------------------------
---
title: "Some Categories"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
---

```{r setup, include=FALSE}
library(flexdashboard)
library(collapsibleTree)
library(gapminder)
library(tidyverse)
library(plotly)
library(ggthemes)
data(gapminder)
# Keep the 50 most populous countries in the most recent year
gapminder <- gapminder %>%
  filter(year == max(year)) %>%
  arrange(-pop) %>%
  slice(1:50) %>%
  mutate(GDP = gdpPercap*pop)
```

Column {data-width=450}
-----------------------------------------------------------------------

### GDP vs. Population

```{r}
p <- ggplot(gapminder,
            aes(x = pop, y = GDP, color = continent)) +
  geom_point() +
  scale_x_log10(labels = function(x) scales::number(x, scale = 1/1000000)) +
  scale_y_log10(labels = function(x) scales::dollar(x, scale = 1/1000000000, accuracy = 1)) +
  theme_tufte() +
  guides(color = 'none') +
  labs(x= "Population (Millions)", y = "GDP (Billions)") +
  annotate(geom = 'label', x = 300000000, y = 500000000000, label = 'Unsurprisingly, bigger\ncountries have bigger GDPs', hjust= 0, vjust = 0)
p
```

### Chart B

```{r}
ggplotly(p)
```

Column {data-width=550}
-----------------------------------------------------------------------

### Percentage of GDP and Continent GDP

```{r}
gapminder %>%
  collapsibleTreeSummary(
    hierarchy = c("continent","country"),
    root = "50 Biggest Countries",
    percentOfParent = TRUE,
    width = 800,
    attribute = "GDP",
    zoomable = FALSE
  )
```

--------------------------------------------------------------------------------
/Lecture_14_Flexdashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_Flexdashboard.png -------------------------------------------------------------------------------- /Lecture_14_Flexdashboard_row.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_Flexdashboard_row.png -------------------------------------------------------------------------------- /Lecture_14_Grids.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_Grids.png -------------------------------------------------------------------------------- /Lecture_14_KPI.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_14_KPI.png -------------------------------------------------------------------------------- /Lecture_15_Example_Shiny.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "mtcars Dashboard with Shiny" 3 | output: 4 | flexdashboard::flex_dashboard: 5 | orientation: columns 6 | vertical_layout: fill 7 | runtime: shiny 8 | --- 9 | 10 | ```{r global, include=FALSE} 11 | library(flexdashboard) 12 | library(tidyverse) 13 | data(mtcars) 14 | mtcars <- mtcars %>% 15 | mutate(Transmission = factor(am, labels = c('Automatic','Manual'))) 16 | ``` 17 | 18 | Column {.sidebar} 19 | ----------------------------------------------------------------------- 20 | 21 | The mean of a variable by whether the car is an automatic or a manual. 
22 | 23 | ```{r} 24 | selectInput("varname", label = "Variable to Analyze:", 25 | choices = names(mtcars), selected = 'mpg') 26 | 27 | textInput('col1', label = 'Color for Automatics', value = 'red') 28 | 29 | textInput('col2', label = 'Color for Manuals', value = 'black') 30 | ``` 31 | 32 | Column 33 | ----------------------------------------------------------------------- 34 | 35 | ### Mean by Transmission Type 36 | 37 | ```{r} 38 | renderPlot({ 39 | ggplot(mtcars, aes_string(x = 'Transmission', y = input$varname, fill = 'Transmission')) + 40 | geom_bar(stat = 'summary', fun = 'mean') + 41 | labs(y = input$varname) + 42 | theme_minimal() + 43 | scale_fill_manual(values = c(ifelse(input$col1 %in% colors(), input$col1, 'black'), 44 | ifelse(input$col2 %in% colors(), input$col2, 'black'))) + 45 | guides(fill = FALSE) 46 | }) 47 | 48 | 49 | ``` 50 | 51 | 52 | -------------------------------------------------------------------------------- /Lecture_15_Interactive_Visuals_in_R.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 15: Interactive Visuals in R" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | library(tidyverse) 19 | library(paletteer) 20 | library(DT) 21 | library(dygraphs) 22 | library(leaflet) 23 | library(vtable) 24 | library(ggiraph) 25 | options("kableExtra.html.bsTable" = T) 26 | rinlinevarname <- function(code){ 27 | html <- '``` `CODE` ```' 28 | sub("CODE", code, html) 29 | } 30 | ``` 31 | 32 | 33 | ## Dashboards 34 | 35 | ```{r, results = 'asis'} 36 | cat(" 37 | ") 43 | ``` 44 | 45 | - Y'know, back in *my* day we had static images 
and gosh darnit we appreciated what we had!!! 46 | - These days, it's sort of expected that a lot of visuals are interactive, especially dashboards 47 | - Thankfully there are some straightforward ways to make interactive visuals in R 48 | - Note that these will all be reliant on HTML. You won't be able to output these to Word or PDF etc. 49 | 50 | ## Dashboards 51 | 52 | Three main approaches we'll take today: 53 | 54 | - **htmlwidgets**, focusing on **leaflet** (maps), **DT** (tables), and **dygraphs** (line graphs and time series) 55 | - **ggiraph** for interactivity via tooltip closely tied to **ggplot2** 56 | - **shiny** controls in our **flexdashboard** dashboards 57 | 58 | But of course there is a universe of other stuff 59 | 60 | ## htmlwidgets 61 | 62 | - **htmlwidgets** is a set of R front-ends for a bunch of different JavaScript-based visualization packages 63 | - We can stroll through the gallery [here](https://www.htmlwidgets.org/showcase_leaflet.html) 64 | - There are many, but I'll focus on just a few to recommend: 65 | - **leaflet** for maps, **DT** for tables, and **dygraphs** for line graphs and time series 66 | 67 | ## DT 68 | 69 | - Let's do an easy one first. **DT** makes tables interactive. No futzin' 70 | 71 | ```{r, echo = TRUE} 72 | iris %>% vtable(out = 'return', lush = TRUE) %>% datatable() 73 | ``` 74 | 75 | ## dygraphs 76 | 77 | - For plotting time series data as a line graph! Requires time series data (`xts`) input 78 | 79 | ```{r, echo = TRUE} 80 | ggplot2::economics %>% tbl2xts::tbl_xts(cols_to_xts = 'uempmed') %>% 81 | dygraph(ylab = 'Unemployment Rate and the 08-09 recession', height = 300) %>% dyShading(from = '2008-07-01', to = '2009-07-01') 82 | ``` 83 | 84 | ## leaflet 85 | 86 | - Leaflet is for making maps. 
It's quite easy (and while not **ggplot2**-compatible, functions in a similar way to **ggplot2** in a lot of ways), especially since it lets you skip the hardest part of graphing maps: messing with *shapefiles* 87 | - Shapefiles are basically the data of the map itself, and you have to find it, download it, and then deal with the eight zillion different file formats and converting between them 88 | - It's definitely possible, and not even all that bad once you've done a few (and there are plenty of resources out there for working with, say, **ggmap** or `geom_polygon`) but also you can just skip that with **leaflet** 89 | - (also, just a heads up, but the *Excel* map-making tool is also pretty neat and doesn't require shapefiles) 90 | 91 | ## leaflet 92 | 93 | - The example uses location markers, but area/polygons work too (or a dataset of entries so you don't have to put them in one at a time) 94 | 95 | 96 | ```{r, echo = TRUE, eval = FALSE} 97 | leaflet() %>% addTiles() %>% 98 | addCircleMarkers(lng = -122.3172284009, lat = 47.6105934921, popup = 'Seattle University', color = '#AA0000') %>% 99 | addCircleMarkers(lng = -122.303200, lat = 47.655548, popup = 'University of Washington', color = '#363C74') 100 | ``` 101 | 102 | ## leaflet 103 | 104 | ```{r, echo = FALSE, eval = TRUE} 105 | leaflet() %>% addTiles() %>% 106 | addCircleMarkers(lng = -122.3172284009, lat = 47.6105934921, popup = 'Seattle University', color = '#AA0000') %>% 107 | addCircleMarkers(lng = -122.303200, lat = 47.655548, popup = 'University of Washington', color = '#363C74') 108 | ``` 109 | 110 | 111 | ## plotly 112 | 113 | - Plotly is a big confusing graphing system of its very own that you have to learn from the ground up 114 | - Or you can just use the `ggplotly()` function in **plotly** to immediately turn your **ggplot** interactive 115 | - It's not super *flexible* but it is *fast* 116 | - You lose control over some stuff - tooltips, legends 117 | 118 | ## ggiraph 119 | 120 | - Shouldn't 
we just be able to make an interactive **ggplot2** graph? Yes! 121 | - **ggiraph** slots super easily in to standard **ggplot2** usage, and adds interactivity 122 | - It's not fully interactive but it adds highlighting and tooltips, which often is all you need 123 | - Just replace the geometry with an `_interactive` version and add a `tooltip` aesthetic. That's it! 124 | 125 | ## ggiraph 126 | 127 | ```{r, echo = TRUE, eval = FALSE} 128 | p <- ggplot2::economics %>% mutate(tooltip = paste0(date, '\n', scales::percent(uempmed/100, accuracy = .1))) %>% 129 | ggplot(aes(x = date, y = uempmed)) + geom_line() + labs(x = 'Date', y = 'Unemployment Rate') + 130 | scale_y_continuous(labels = function(x) scales::percent(x/100, accuracy = 1)) + 131 | geom_point_interactive(aes(tooltip = tooltip), size = .5) + 132 | theme_classic() + theme(text = element_text(family = 'serif', size = 13)) 133 | ggiraph(ggobj = p) 134 | ``` 135 | 136 | ## ggiraph 137 | 138 | ```{r, echo = FALSE, eval = TRUE} 139 | p <- ggplot2::economics %>% mutate(tooltip = paste0(date, '\n', scales::percent(uempmed/100, accuracy = .1))) %>% 140 | ggplot(aes(x = date, y = uempmed)) + geom_line() + labs(x = 'Date', y = 'Unemployment Rate') + 141 | scale_y_continuous(labels = function(x) scales::percent(x/100, accuracy = 1)) + 142 | geom_point_interactive(aes(tooltip = tooltip), size = .5) + 143 | theme_classic() + theme(text = element_text(family = 'serif', size = 13)) 144 | ggiraph(ggobj = p) 145 | ``` 146 | 147 | ## Making a tooltip 148 | 149 | - It's generally a good idea with **ggiraph** to make the tooltip by hand. 
150 | - You can do this by `mutate`ing a new variable where you `paste0()` all the information you want together, using `\n` to break lines 151 | 152 | ```{r, echo = TRUE, eval = FALSE} 153 | p <- iris %>% group_by(Species) %>% summarize(`Sepal Length` = mean(Sepal.Length)) %>% 154 | mutate(tooltip = paste0('Species: ', stringr::str_to_title(Species), '\nAverage Sepal Length: ', `Sepal Length`)) %>% 155 | ggplot(aes(x = Species, y = `Sepal Length`, tooltip = tooltip)) + 156 | geom_col_interactive(fill = 'firebrick') + 157 | theme_minimal() 158 | ggiraph(ggobj = p) 159 | ``` 160 | 161 | ## ggiraph downsides 162 | 163 | - Line graphs are a bit tricky, but you can also easily overlay a `geom_point_interactive` on top of a `geom_line` 164 | - While it works great out of the box in a memo, getting the sizing right inside of a **flexdashboard** isn't automatic, and in particular it doesn't like fitting in very *wide* dashboard spaces 165 | - You can adjust the width with either `width` or `width_svg` and `height_svg` in the `ggiraph()` function, but it's not perfect 166 | - Or just put it in a squarish space 167 | 168 | ## Shiny 169 | 170 | - **shiny** is a way of making RMarkdown completely interactive. You can basically build web apps with it in a very straightforward way 171 | - **shiny** output can even be hosted for free (with limited traffic) on [shinyapps.io](https://www.shinyapps.io) 172 | - While there's a dedicated **shiny** environment for programming with complete flexibility, we'll focus on the kind that's easy to slot into a **flexdashboard** by adding `runtime: shiny` at the top 173 | 174 | ## Shiny 175 | 176 | - **shiny** gives us the ability to put interactive controls on our dashboard (selectors, etc.) 177 | - And then in `render` or `reactive()` functions, we can make our analysis update based on the controls!
178 | - So for one example, maybe there's a dropdown menu to pick a brand: Pepsi or Coke, and once you select it, the graphs update to only show Pepsi or Coke 179 | 180 | ## Shiny and flexdashboard 181 | 182 | - Let's walk through [https://rmarkdown.rstudio.com/flexdashboard/shiny.html](https://rmarkdown.rstudio.com/flexdashboard/shiny.html) 183 | - Pay close attention to the source code for the example 184 | - How do the input functions come together, and how are the choices referred to in the visualization? 185 | - (Even deeper documentation at [https://mastering-shiny.org](https://mastering-shiny.org) or the [Shiny Dev Center](https://shiny.rstudio.com/)) 186 | 187 | ## Inputs 188 | 189 | - What kinds of inputs can we have? 190 | - Dropdown menu (`selectInput`), a slider (`sliderInput`), radio buttons (`radioButtons`), text (`textInput`), numbers (`numericInput`), a checkbox (`checkboxInput`), dates or date ranges (`dateInput` and `dateRangeInput`) and file upload (`fileInput`) 191 | - For each, there are options for layout and the available choices. 192 | - Let's look at one in more detail, `selectInput` 193 | 194 | ## selectInput 195 | 196 | - This is well suited to, for example, choosing which of many ways to look at the data, which variable to graph, or which subset of the data to look at 197 | - The syntax is `selectInput(inputId, label, choices,` `selected = NULL, multiple = FALSE,` `selectize = TRUE, width = NULL,` `size = NULL)` 198 | - (the = parts are just the defaults) 199 | 200 | ## selectInput 201 | 202 | - The `inputId` is the slot in `input$` where the result will be stored. So with `inputId = 'subset'`, you can later use `input$subset` to know what was selected. That's not what the user sees for a title though, they see `label` 203 | - `choices` are the options, with default `selected`, in a standard vector format.
So maybe to choose whether to graph independent or chain restaurants, `choices = c('Choose Restaurant Type' = '',` `'Independent','Chain')` 204 | - `multiple` determines whether multiple options can be selected 205 | - Then the minute details 206 | 207 | 208 | ## Output 209 | 210 | - You can include output to be updated as the options change with `renderPlot()` (for plots), `renderPrint()` (for any object being printed / shown on its own), `renderTable()` for tables of data, and `renderText()` for actual text output. 211 | - Inside of these functions we just take a regular plot / object / data table / text, refer to elements from the sidebar with `input$inputId`, wrap that in `{}`, and wrap THAT in the appropriate `render` function 212 | 213 | ## Output 214 | 215 | (Using fake data) in the global chunk: `library(tidyverse)` and `data(RestaurantData)`. In the sidebar column, an R chunk with: 216 | 217 | ```{r, echo = TRUE, eval = FALSE} 218 | selectInput('restaurantType', label = 'Type of Restaurant', choices = c('Choose Restaurant Type' = '','Independent','Chain')) 219 | ``` 220 | 221 | ## Output 222 | 223 | And then in the next column, 224 | 225 | ```{r, echo = TRUE, eval = FALSE} 226 | renderTable({ 227 | RestaurantData %>% filter(type == input$restaurantType) 228 | }) 229 | ``` 230 | 231 | This will show a table of the data, letting you pick whether to show independent or chain restaurants. 232 | 233 | ## Example 234 | 235 | - Let's see the example dashboard I have [here](https://nickch-k.shinyapps.io/Lecture_18_Example_Shiny/) and also the example code. 236 | - I have used the `mtcars` data we've used many times before 237 | - I've included a `geom_bar(stat='summary',fun = 'mean')` graph summarizing the mean of a variable by the type of transmission 238 | - But I let you pick the variable!
(with `selectInput`) (Hint: use `aes_string` to input a string variable to **ggplot2**, this means everything must be a string in it) 239 | - And also the color scheme (with `textInput`). Note if you put in something that's not a color (which I check with `%in% colors()`) it will correct to black. 240 | 241 | ## The global chunk 242 | 243 | - Code in a **shiny** dash really comes in three parts: the `global` chunk, and then (as we've covered) the sidebar and the main part 244 | - Everything in `global` will only be run *once*, very handy and speedy for when you change a control! 245 | - Put as much as possible - *everything* that won't need to be updated when a control changes - in the global chunk 246 | 247 | ## Dynamic Documents 248 | 249 | - We're not going to do it in this class, but a nice heads up is that you can use shiny and flexdashboard to create *automatic documents* 250 | - The `params` argument in the YAML can take in inputs 251 | - And you can set whatever settings you want on your dashboard and hit a button to make it create a customized report in RMarkdown using those settings! 252 | - Huge time-saver, either for documents to be created automatically or for creating a frontend for people at your company to easily create customized reports 253 | - Alternate extra credit opportunity for this class: Make an automatic report generator 254 | 255 | ## Practice 256 | 257 | - Open a new **flexdashboard** 258 | - Set `runtime: shiny` 259 | - Rename that setup chunk `global` and load up `data(storms)`. 260 | - Add a `selectInput()` to select a `status` 261 | - Make some graph that only uses the data from that `status` 262 | - Then make it a **ggiraph**! 
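
## Practice

A rough sketch of the filter-then-plot logic behind the exercise above, written as plain functions so it can be tried outside a dashboard. The helper names `storms_for_status()` and `plot_status()` are made up for illustration (they are not part of **shiny** or **flexdashboard**), and the tooltip contents and the choice of a wind-vs-pressure scatterplot are just assumptions about what a reasonable answer might look like.

```r
library(dplyr)

# storms ships with dplyr; in the dashboard this would go in the global chunk
data(storms, package = 'dplyr')

# Hypothetical helper: keep only one storm status and build a tooltip by hand
storms_for_status <- function(chosen_status) {
  storms %>%
    filter(status == chosen_status) %>%
    mutate(tooltip = paste0(name, ' (', year, ')\nWind: ', wind, ' knots'))
}

# Hypothetical helper: a plain ggplot2 scatterplot of the filtered data.
# Swapping geom_point() for ggiraph::geom_point_interactive(aes(tooltip = tooltip))
# is the step that would turn it into a ggiraph.
plot_status <- function(chosen_status) {
  ggplot2::ggplot(storms_for_status(chosen_status),
                  ggplot2::aes(x = wind, y = pressure)) +
    ggplot2::geom_point(alpha = .3) +
    ggplot2::labs(x = 'Wind Speed (Knots)', y = 'Pressure (Millibars)') +
    ggplot2::theme_minimal()
}

# In the sidebar chunk (assumed, not run here):
#   selectInput('status', label = 'Storm Status:',
#               choices = as.character(unique(storms$status)))
# In the main column chunk (assumed, not run here):
#   renderPlot({ plot_status(input$status) })

# Quick check of the filtering step outside the dashboard
hurricanes <- storms_for_status('hurricane')
```

In the dashboard itself, the idea would be to keep the helpers in the `global` chunk so they are only defined once, and leave only the `render` call to rerun when the control changes.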
263 | -------------------------------------------------------------------------------- /Lecture_15b_Bob_Ross_My_ATUS_Dashboard_Final.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "ATUS Activity Dashboard " 3 | output: 4 | flexdashboard::flex_dashboard: 5 | orientation: columns 6 | vertical_layout: fill 7 | theme: 8 | version: 4 9 | runtime: shiny 10 | --- 11 | 12 | ```{r global, include=FALSE} 13 | { 14 | library(flexdashboard) 15 | library(atus) 16 | library(tidyverse) 17 | library(scales) 18 | library(plotly) 19 | library(shiny) 20 | library(ggiraph) 21 | library(dygraphs) 22 | } 23 | 24 | data(atusact) 25 | 26 | # Check that everything is 5 or 6 digits long 27 | # atusact %>% pull(tiercode) %>% nchar() %>% table() 28 | 29 | # Check that we can use the first four digits of tucaseid to find the year 30 | # data(atusresp) 31 | # atusresp <- atusresp %>% 32 | # mutate(year_from_id = str_sub(tucaseid, 1, 4)) 33 | # table(atusresp$tuyear, atusresp$year_from_id) 34 | 35 | 36 | # Get RID of last two digits 37 | atusact <- atusact %>% 38 | ungroup() %>% 39 | mutate(tiercode = floor(tiercode/100)) %>% # Get four-digit codes 40 | mutate(code_2digit = floor(tiercode/100)) %>% # Get two-digit codes 41 | mutate(tiercode = factor(tiercode), 42 | code_2digit = factor(code_2digit)) %>% 43 | mutate(Year = as.numeric(str_sub(tucaseid, 1, 4))) # Get the year 44 | 45 | # Alternate approach using string manipulation 46 | # atusact %>% 47 | # mutate(six_digit_code = str_pad(tiercode, 6, 'left', '0')) %>% 48 | # mutate(tiercode_2digit = str_sub(six_digit_code, 1, 2)) %>% 49 | # mutate(tiercode_4digit = str_sub(six_digit_code, 1, 4)) 50 | 51 | 52 | # Get share of each two-digit code that is the four-digit code (as a function of dur) 53 | # Overall 54 | overall_share <- atusact %>% 55 | group_by(tiercode, code_2digit) %>% 56 | summarize(total = sum(dur)) %>% 57 | group_by(code_2digit) %>% 58 | mutate(share = total/sum(total)) 59
| 60 | # Within year 61 | overall_share_by_year <- atusact %>% 62 | group_by(tiercode, code_2digit, Year) %>% 63 | summarize(total = sum(dur)) %>% 64 | group_by(code_2digit, Year) %>% 65 | mutate(share = total/sum(total)) 66 | 67 | 68 | # Get total time spent on each two-digit code each year 69 | total_time_per_year <- atusact %>% 70 | group_by(code_2digit, Year) %>% 71 | summarize(`Total Minutes (Millions)` = sum(dur)/1000000) %>% 72 | ungroup() %>% 73 | mutate(Date = lubridate::ym(paste0(Year, '-01'))) %>% 74 | select(-Year) 75 | ``` 76 | 77 | Column {.sidebar} 78 | ----------------------------------------------------------------------- 79 | 80 | ```{r} 81 | selectInput('twodigit', 'Select a Category of Activities:', 82 | sort(unique(overall_share$code_2digit))) 83 | ``` 84 | 85 | Column {data-width=650} 86 | ----------------------------------------------------------------------- 87 | 88 | ### Share of Each Activity Over Time 89 | 90 | ```{r} 91 | renderPlotly({ 92 | p1 <- overall_share_by_year %>% 93 | filter(code_2digit == input$twodigit) %>% 94 | mutate(Share = round(share*100, 1)) %>% 95 | ggplot(aes(x = Year, y = Share, fill = tiercode)) + 96 | geom_area() + 97 | scale_y_continuous(labels = label_percent(scale = 1)) + 98 | theme_classic() + 99 | labs(fill = 'Activity') 100 | ggplotly(p1) 101 | }) 102 | ``` 103 | 104 | ### Total Amount of Activity Over Time 105 | 106 | ```{r} 107 | renderDygraph({ 108 | total_time_per_year %>% 109 | filter(code_2digit == input$twodigit) %>% 110 | tbl2xts::tbl_xts() %>% 111 | dygraph() %>% 112 | dyRangeSelector() 113 | }) 114 | ``` 115 | 116 | Column {data-width=350} 117 | ----------------------------------------------------------------------- 118 | 119 | ### Share of Each Activity 120 | 121 | ```{r} 122 | renderggiraph({ 123 | p3 <- overall_share %>% 124 | filter(code_2digit == input$twodigit) %>% 125 | mutate(tooltip = paste0('Total Minutes (Mil.): ', 126 | number(total, scale = 1/1000000, accuracy = .1), 127 | 
'\nPercentage: ', 128 | percent(share, accuracy = .1))) %>% 129 | ggplot(aes(x = reorder(tiercode, share), y = share, tooltip = tooltip)) + 130 | geom_col_interactive(fill = 'lightblue', color = 'black') + 131 | scale_y_continuous(labels = label_percent()) + 132 | labs(x = 'Activity', y = 'Share') + 133 | theme_classic() + 134 | coord_flip() 135 | ggiraph(ggobj = p3) 136 | }) 137 | ``` 138 | 139 | -------------------------------------------------------------------------------- /Lecture_16_Data_Import.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Data_Import.png -------------------------------------------------------------------------------- /Lecture_16_Intro_to_Tableau.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 16: Intro to Tableau" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | library(tidyverse) 19 | library(knitr) 20 | library(paletteer) 21 | library(ggrepel) 22 | library(directlabels) 23 | library(gghighlight) 24 | library(lubridate) 25 | 26 | rinlinevarname <- function(code){ 27 | html <- '``` `CODE` ```' 28 | sub("CODE", code, html) 29 | } 30 | ``` 31 | 32 | 33 | ## Graphing Software 34 | 35 | ```{r, results = 'asis'} 36 | cat(" 37 | ") 43 | ``` 44 | 45 | - A lot of this class has been spent going over **ggplot2** in R 46 | - Today we'll be going over Tableau 47 | - (not to mention all we leave out - PowerBI, **matplotlib**/**Seaborn** in Python, every stats package other than R, 
automated stuff like Azure, etc. etc.) 48 | - Why introduce these different tools, and what do they do? 49 | 50 | ## Graphing Software 51 | 52 | - It's most important that we understand the underlying concepts of data communication 53 | - But also we have to know how to implement those concepts 54 | - And since so much of this is subjective, we have to practice, practice, practice. Working with the software is a requirement! 55 | - Plus, stuff like learning the grammar of graphics, baked into **ggplot2**, can help strengthen those concepts 56 | 57 | ## Graphing Software 58 | 59 | - R: High-power but with a learning curve. **Great for any size data sets**, **great for summarizing**, **endless customization**. **Great for contrasting groups** 60 | - **If the data isn't already in graphable format, R itself is crucial for cleaning/formatting it** 61 | - Supports **dashboards**, **notebooks**, **interactivity**, and **animation** 62 | - Until you're used to it, **a lot of work to get going though!** 63 | - Because there is **direct access to the underlying pieces of the graph**, there is no better tool for making **one beautiful image** 64 | - Requires additional work for working with enormous databases 65 | 66 | ## Graphing Software 67 | 68 | - Tableau: High-power, with less learning curve, but more constrained. 
**Great for any size data sets**, **great for summarizing**, **great for contrasting groups** 69 | - Great at making a **host of graphs at once to explore a data set thoroughly** 70 | - It can **smartly suggest for you what kind of graph to use** 71 | - Built-in, easy-to-work-with tools for **enormous databases** and **dashboards** 72 | - Everything is **interactive by default** 73 | - **Easier than R** and also less work but **deep customization a bit more difficult** 74 | 75 | ## The Canned-Coded Scale 76 | 77 | - It is extremely common in software to have to choose between a deeper, more customizable option (such as coding something in **ggplot2**) or a more canned, one-click-does-it option (like Tableau or Excel) 78 | - Until you're very familiar with the coding option, the canned option will be faster and easier... 79 | - *IF* the thing you want to do is something the canned-software authors anticipated and decided to make easy 80 | - If it's not, it will usually be *much* harder than just coding the thing up 81 | 82 | ## The Canned-Coded Scale 83 | 84 | - Tableau is a very cool tool and I encourage you to use it (as are other more canned systems like PowerBI, Azure, etc.) 85 | - But don't be surprised when you start thinking "hmm but what if my visualization was different in THIS nonstandard way" and start Googling, only to find it's a fourteen-step process that doesn't actually work 86 | - Using Tableau for the simple stuff, and immediately shifting to code for anything that's not stock-standard, is a decent strategy! 87 | - That said, let's explore Tableau 88 | 89 | ## Tableau 90 | 91 | - As opposed to (the way we've used) R, Tableau uses *database* logic - it sends the calculation **to the data** rather than bringing the data to the calculation 92 | - So we'll have "connections" to data rather than bringing data in; also, if data updates, so will Tableau 93 | - And the output we have will include underlying data, but only our calculations of the data.
94 | - Let's walk through some usage of Tableau 95 | 96 | 97 | ## Importing Data 98 | 99 | - A connection can be "Live" (database-like) or an "Extract" (bring it in!) 100 | - Specify how variables should be read by clicking on them 101 | - Tableau will guess as much as possible 102 | - Unlike R we do filtering, selection, and recoding at THIS step 103 | - Strongly recommended to do all data-cleaning in R BEFORE importing to Tableau - Tableau *can* clean data but it's a real drag 104 | 105 | ## Importing Data 106 | 107 | ```{r} 108 | knitr::include_graphics('Lecture_16_Data_Import.png') 109 | ``` 110 | 111 | ## Importing Data 112 | 113 | - NAs can make it think it's a string, let's tell it earnings is a number 114 | - Let's group `pred_degree_awarded_ipeds`: 1 = less-than-two-year college, 2 = two-year, 3 = four-year+ 115 | - Could also do a filter here, or only select certain variables 116 | - Or create variables calculated from others - let's make an employment rate 117 | 118 | ## The Tableau Screen 119 | 120 | - Dimensions (categorical) and measures (continuous numeric) on the left. Tableau really cares about discrete vs. continuous. 121 | - Columns and Rows areas to drag to 122 | - (single image export: Worksheet $\rightarrow$ Copy image) 123 | - "Show me" default graphs on the right. Decent for initial exploration but generally not what you want to work with 124 | 125 | --- 126 | 127 | ```{r} 128 | knitr::include_graphics('Lecture_16_Tableau_Main.png') 129 | ``` 130 | 131 | 132 | ## Working with Variables 133 | 134 | - To start making a graph, think about what you want on the x-axis (column) or y-axis (row) 135 | - How it treats a variable can then be manipulated: discrete vs. continuous and measure vs. dimension. Measures get aggregated/summarized in some specified way, dimensions don't! 
136 | - Plenty of aggregation options for continuous variables 137 | 138 | ## Working with Variables 139 | 140 | - Also click the variables to do things like sort the order or turn them "dual axis" (two variables on the same axis) 141 | - In general, like Excel, click what you want to edit, like axis titles 142 | - Note with a graph we can find the data that goes into it 143 | - Some variables like "Measure Names" can become their own "shelf" like column/row for more complex diagrams 144 | 145 | ## Common Problems in Tableau 146 | 147 | - If something's not working, check whether your variable has the correct continuous/discrete measure/dimension settings 148 | - Note that sometimes "grouped" variables don't work the same as regular categorical variables - make life easier by just making them strings grouped as you like in R and import them 149 | - Did you try Googling to see how you can do (nonstandard thing X)? 150 | 151 | ## Basic Example 152 | 153 | - Let's remake it! 154 | 155 | ```{r} 156 | knitr::include_graphics('Lecture_16_Tableau_Employment.png') 157 | ``` 158 | 159 | ## Basic Example 160 | 161 | - This one is trickier! 162 | 163 | ```{r} 164 | knitr::include_graphics('Lecture_16_Tableau_Repayment.png') 165 | ``` 166 | 167 | ## Basic Example 168 | 169 | - Let's do this one together: what should be in rows and columns here? 
170 | 171 | ```{r} 172 | knitr::include_graphics('Lecture_16_Tableau_Map.png') 173 | ``` 174 | 175 | ## Another Example 176 | 177 | - We can also set labels by clicking-and-labeling 178 | - Let's replicate the graph on the next page 179 | 180 | --- 181 | 182 | ```{r} 183 | knitr::include_graphics('Lecture_16_Tableau_Boxplot.png') 184 | ``` 185 | 186 | ## And Explore 187 | 188 | - Click around and make your own graph 189 | - See what kinds of customizations you can do 190 | -------------------------------------------------------------------------------- /Lecture_16_Scorecard.twbx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Scorecard.twbx -------------------------------------------------------------------------------- /Lecture_16_Tableau_Boxplot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Tableau_Boxplot.png -------------------------------------------------------------------------------- /Lecture_16_Tableau_Employment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Tableau_Employment.png -------------------------------------------------------------------------------- /Lecture_16_Tableau_Main.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Tableau_Main.png -------------------------------------------------------------------------------- /Lecture_16_Tableau_Map.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Tableau_Map.png -------------------------------------------------------------------------------- /Lecture_16_Tableau_Repayment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_16_Tableau_Repayment.png -------------------------------------------------------------------------------- /Lecture_17_Axis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_Axis.png -------------------------------------------------------------------------------- /Lecture_17_CancelledFlights.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_CancelledFlights.png -------------------------------------------------------------------------------- /Lecture_17_CumulativeCancellations.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_CumulativeCancellations.png -------------------------------------------------------------------------------- /Lecture_17_DelaybyDay.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_DelaybyDay.png -------------------------------------------------------------------------------- /Lecture_17_MiddayFlights.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_MiddayFlights.png -------------------------------------------------------------------------------- /Lecture_17_MiddayFlightsAxes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_MiddayFlightsAxes.png -------------------------------------------------------------------------------- /Lecture_17_More_Tableau.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 17: More Tableau" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | library(tidyverse) 19 | library(knitr) 20 | library(paletteer) 21 | library(ggrepel) 22 | library(directlabels) 23 | library(gghighlight) 24 | library(lubridate) 25 | 26 | rinlinevarname <- function(code){ 27 | html <- '``` `CODE` ```' 28 | sub("CODE", code, html) 29 | } 30 | ``` 31 | 32 | 33 | ## Tableau 34 | 35 | ```{r, results = 'asis'} 36 | cat(" 37 | ") 43 | ``` 44 | 45 | - Last time we got some basic graphs going and got familiar with the software 46 | - This time we'll go a little deeper! 
47 | - First, working our way around the decoration and customization options 48 | - Then, making some more complex graphs with Table Calculations and Shelves 49 | 50 | ## Decoration 51 | 52 | - We can adjust the presentation of most elements of the graph by double-clicking on them, Excel style 53 | - For example, axes 54 | - From here we can adjust tick marks, scale, consistency of scale across multi-part graphs, and numerical formatting, as well as things like "include 0 on the axis or no?" 55 | 56 | --- 57 | 58 | ```{r} 59 | knitr::include_graphics('Lecture_17_Axis.png') 60 | ``` 61 | 62 | ## Formatting Numbers 63 | 64 | - We can also see on the left when we click on an axis the "Scale" segment 65 | - We can set things like percentages being .1 or 10%, or the units of the variable (like $) 66 | - (If you're having trouble, RIGHT-click the axis and hit Format) 67 | 68 | ## Decoration 69 | 70 | - Let's take the example worksheet and walk through adjusting the axis 71 | - Look at Sheet 1 72 | - Make the y-axis tick marks percentages with one decimal place 73 | - And make the x-axis only have ticks on days 0, 10, 20, 30. 74 | - Also change axis titles to "Day of Month" and "Percent Cancelled" 75 | 76 | ## Decoration 77 | 78 | ```{r} 79 | knitr::include_graphics('Lecture_17_Tickmarks.png') 80 | ``` 81 | 82 | ## Decoration 83 | 84 | ```{r} 85 | knitr::include_graphics('Lecture_17_Percent.png') 86 | ``` 87 | 88 | 89 | ## The Result 90 | 91 | ```{r} 92 | knitr::include_graphics('Lecture_17_CancelledFlights.png') 93 | ``` 94 | 95 | ## Colors 96 | 97 | - We can make a variable be distinguished by color by dragging it to the Color part of the Marks pane 98 | - (This will add color to the *variable*, not its position, so drag it from the left variables pane, not away from its column/row if it's there!) 
99 | - Then click Color to manually adjust the colors used or pick a palette, or pick opacity 100 | 101 | ## Other Axes 102 | 103 | - Just like the axes in our **ggplot2** `aes()`thetic, we can similarly drag variables to be used on the Size, Label, etc. axes 104 | - Having variables *just* on these axes and not Column or Row will create different kinds of output - Labels-only will give a table, for example 105 | - Also, not really an axis, but by clicking data points we can zoom in on them, exclude them, or exclude others 106 | - The Marks card is the home of these, and we can edit things directly there 107 | 108 | ## Decoration 109 | 110 | - On Sheet 2 111 | - Let's color our data by Cancelled 112 | - And make the size differ by Distance 113 | - And exclude flights before 3AM 114 | - Make an annotation of an area (What story does this tell?) 115 | - And title it "Departure Times and Delays" 116 | 117 | ## Decoration 118 | 119 | ```{r} 120 | knitr::include_graphics('Lecture_17_MiddayFlightsAxes.png') 121 | ``` 122 | 123 | --- 124 | 125 | ```{r} 126 | knitr::include_graphics('Lecture_17_MiddayFlights.png') 127 | ``` 128 | 129 | 130 | ## Table Calculations 131 | 132 | - By clicking the down-arrow on a numeric variable on an axis, we can change it from being presented raw to being presented relative to other stuff 133 | - We already covered how we can make it a calculation/summary: mean, median, etc. 134 | - Table Calculations extend the analysis to make the presentation *relative to other calculations* 135 | - i.e. 
changing a line graph from the values to growth-from-previous-period, or moving average 136 | - Or making a bar graph percent-of-total, or rank 137 | 138 | 139 | ## Table Calculations 140 | 141 | - Let's change the bar graph from count to percent of total 142 | - Now you change the line graph from percent to change-since-previous 143 | - (Also, while we're at it, add data point labels to the line graph) 144 | - (And right-click an empty space to add an annotation) 145 | 146 | ## Other Down-Arrow Adjustments 147 | 148 | - We can also use this down-arrow to sort 149 | - For example, sorting a bar by its height, or the value of another variable 150 | - You can do this using another variable directly, or a table calculation (like a COUNT) of a variable 151 | 152 | ## Getting More Complex 153 | 154 | - If you want to do more detailed adjustments to variables, things like complex created variables, you can do so with Create Calculated Field 155 | - But this does require that you learn the Tableau coding system. You can Google for how to do stuff, but it's often easier just to do it in R and export to Tableau (same as with joins/etc.) 156 | - One exception is if you just want a basic condition checked 157 | - `IF [X] == Value THEN "A" ELSE "B" END` 158 | - Can also use `ELSEIF` for more-flexible grouped variables than with Groups 159 | 160 | ## Animation 161 | 162 | - You can do basic animations by adding a variable to "Pages"; this will make a slideshow going through each value 163 | - This is only one kind of animated graph but it is very easy to make! Let's try one. 164 | 165 | ## Sharing Work 166 | 167 | - Tableau workbooks can be saved and shared 168 | - If you want them to be interactive for someone without Tableau, have them use [Tableau Reader](https://www.tableau.com/products/reader) 169 | - Or just a static image with Worksheet $\rightarrow$ Export 170 | - Or make a dashboard (we'll get to that next time)! 
171 | 172 | 173 | ## Sharing Work 174 | 175 | - If you're sharing with someone who has Tableau, and your data isn't enormous, use a **packaged workbook**. 176 | - This will bundle the data set in with your work so they can actually open it, making a `.twbx` file 177 | - If you don't do this and just send a `.twb` then they won't be able to see your work at all!! 178 | - If you turn in your Tidy Tuesday homework as a `.twb` I won't be able to read it and you'll have to resubmit. 179 | 180 | ## Practice 181 | 182 | - Let's walk through creating the graph on the following slide... 183 | 184 | --- 185 | 186 | ```{r} 187 | knitr::include_graphics('Lecture_17_CumulativeCancellations.png') 188 | ``` 189 | 190 | ## Practice 191 | 192 | - Now you recreate this graph 193 | 194 | ```{r} 195 | knitr::include_graphics('Lecture_17_DelaybyDay.png') 196 | ``` 197 | 198 | 199 | ## Practice 200 | 201 | - Make your own graph with this data -------------------------------------------------------------------------------- /Lecture_17_Percent.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_Percent.png -------------------------------------------------------------------------------- /Lecture_17_Table_Calc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_Table_Calc.png -------------------------------------------------------------------------------- /Lecture_17_Tickmarks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_17_Tickmarks.png -------------------------------------------------------------------------------- /Lecture_18_A_Mess.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_18_A_Mess.png -------------------------------------------------------------------------------- /Lecture_18_AddWorksheets.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_18_AddWorksheets.png -------------------------------------------------------------------------------- /Lecture_18_Dashboard_Format.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_18_Dashboard_Format.png -------------------------------------------------------------------------------- /Lecture_18_Dashboards_in_Tableau.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 18: Dashboards in Tableau" 3 | author: "Nick Huntington-Klein" 4 | date: "`r format(Sys.time(), '%d %B, %Y')`" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: simple 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) 18 | library(tidyverse) 19 | options("kableExtra.html.bsTable" = T) 20 | rinlinevarname <- function(code){ 21 | html <- '``` `CODE` ```' 22 | sub("CODE", code, html) 23 | } 24 | ``` 25 | 26 | 27 | ## Dashboards in Tableau 28 | 29 | ```{r, results = 'asis'} 30 | cat(" 31 | ") 37 | ``` 38 | 39 | - Tableau is designed for dashboards, and they make it easy 40 | - Most of what we'll cover today you could probably figure out yourself by clicking "New Dashboard" at the bottom and clicking 
around 41 | - So we'll keep this brief and then get to trying stuff out 42 | 43 | ## Creating a Dashboard 44 | 45 | - Create your visualizations and tables 46 | - Then down at the bottom, next to the Sheets, there's a + icon for "New Dashboard" 47 | - Click it! 48 | 49 | ## Adding Worksheets to the Dashboard 50 | 51 | - From here you'll see: 52 | - The Layout option (in a sec) 53 | - Your worksheets - drag them in! 54 | - Options at the bottom to add navigation, images, text, web pages 55 | 56 | ## Adding Worksheets to the Dashboard 57 | 58 | ```{r} 59 | knitr::include_graphics('Lecture_18_AddWorksheets.png') 60 | ``` 61 | 62 | ## Layout 63 | 64 | - Tableau Dashboards can be laid out for desktop, mobile, or tablet 65 | - Check your Size options, and try Device Preview with a few things on 66 | - Also check the Layout tab for other options 67 | - Keep the audience in mind, and whether some visualizations might not translate to tiny sizes! 68 | 69 | ## Layout 70 | 71 | - You can "float" elements of the dashboard but this isn't generally a good idea 72 | - What *is* a good idea is moving things like legends (if you have them) close to where the actual viz that uses them is 73 | - You don't want the legend for Viz B sitting next to Viz A - confusing! 74 | 75 | ## Formatting 76 | 77 | - Format $\rightarrow$ Dashboard brings up the formatting panel for things like overall coloring, fonts, etc. 78 | - You can also format the individual worksheets included by clicking on them as normal 79 | - Any re-decoration you want to do at this point is on the table! 
80 | 81 | ## Formatting 82 | 83 | ```{r} 84 | knitr::include_graphics('Lecture_18_Dashboard_Format.png') 85 | ``` 86 | 87 | ## Filters 88 | 89 | - You can add interactive filters to look at different parts of the data, and apply those filters broadly 90 | - Select a viz that uses the variable you want to filter on and hit the "use as filter" button (looks like a funnel) and then the down arrow and Filter to select the variable to filter on 91 | - A filter will appear. On the down arrow for THAT, you can change the kind of filter you get (dropdown, etc.) and also which worksheets the filter applies to - can update the whole dashboard at once with one filter if you like! 92 | 93 | ## Considerations for Dashboards 94 | 95 | If you're planning to end up with a dashboard... 96 | 97 | - Hold off on annotations until the end; what looks good on a single image might get cramped on multiples 98 | - Consider how you might reduce the number of legends required to understand things. Lots of legends can get confusing 99 | - Try to make dashboard and worksheet formatting consistent 100 | - Use a grid layout! Don't get wild 101 | - Consider including the "Export to PDF" option 102 | 103 | --- 104 | 105 | ```{r} 106 | knitr::include_graphics('Lecture_18_A_Mess.png') 107 | ``` 108 | 109 | ## Sharing Dashboards 110 | 111 | - Save as an image (no interactivity) with Dashboard $\rightarrow$ Export Image 112 | - Send the workbook which can be opened with Tableau or (free) Tableau Reader 113 | - Tableau Server and Tableau Online store the workbook online - automatically updated as necessary. Users can also subscribe for update notifications. 
Requires paid license 114 | 115 | ## Let's Do This 116 | 117 | - See the provided workbook file Lecture_18_Oster_Data 118 | - This contains data from NHANES, the National Health and Nutrition Examination Survey 119 | 120 | ## Let's Do This 121 | 122 | - Specifically, an extract from Oster (2020), who was trying to tell the following story: 123 | - From 1999-2004, Vitamin E was a recommended supplement 124 | - Before 1999 and after 2004, it wasn't, and was actually recommended against 125 | - From 1999-2004, people who pay the most attention to their health will be more likely to follow the recommendation, i.e. take Vitamin E 126 | - So, during that time, the relationship between taking Vitamin E and health outcomes like mortality should be stronger 127 | 128 | ## Let's Do This 129 | 130 | - Try to tell this story in a dashboard! 131 | - Careful: how can you show that a *relationship* gets stronger or weaker over time? This is trivariate! -------------------------------------------------------------------------------- /Lecture_18_Oster_Data_packaged.twbx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/Lecture_18_Oster_Data_packaged.twbx -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DataCommSlides 2 | Slides for Seattle University Course on Data Communications 3 | 4 | The recommended book for this course is Kieran Healy's [Data Visualization](http://socviz.co/) 5 | 6 | NOTE: Some Tableau data files are not included in the repo since they are too large. 7 | 8 | Also, check out the accompanying [video series](https://nickchk.com/videos.html#dcomm) for the class. As of now, these videos have not been updated for the newest version of the class, and instead follow along with the old version. 
9 | 10 | ## Principles of Data Communication 11 | 12 | [Lecture 1: Data Communication](https://nickch-k.github.io/DataCommSlides/Lecture_01_Data_Communication.html#/) 13 | 14 | [Lecture 2: Clutter and Focus](https://nickch-k.github.io/DataCommSlides/Lecture_02_Clutter_and_Focus.html#/) 15 | 16 | [Lecture 3: Audience](https://nickch-k.github.io/DataCommSlides/Lecture_03_Audience.html#/) 17 | 18 | ## R, Workflow, and Data Wrangling 19 | 20 | [Lecture 4: R and RMarkdown](https://nickch-k.github.io/DataCommSlides/Lecture_04_R_and_RMarkdown.html#/) 21 | 22 | [Common R Problems](https://nickch-k.github.io/DataCommSlides/Lecture_04_Common_R_Problems.html#/) 23 | 24 | [Lecture 5 and Lecture 6: Data Wrangling](https://nickch-k.github.io/DataCommSlides/Lecture_05_and_06_Data_Wrangling.html#/) 25 | 26 | ## Exploratory Data Analysis 27 | 28 | [Lecture 7: Exploratory Data Analysis](https://nickch-k.github.io/DataCommSlides/Lecture_07_EDA.html#/) 29 | 30 | 31 | ## Data Visualization in R/ggplot2 32 | 33 | [Lecture 8: Basics of ggplot2](https://nickch-k.github.io/DataCommSlides/Lecture_08_Basics_of_ggplot2.html#/) 34 | 35 | [Lecture 9: Structuring ggplot2](https://nickch-k.github.io/DataCommSlides/Lecture_09_Structuring_ggplot.html#/) 36 | 37 | [Lecture 10: Styling ggplot2](https://nickch-k.github.io/DataCommSlides/Lecture_10_Styling_ggplot.html#/) 38 | 39 | [Lecture 11: Bob Ross Day](https://nickch-k.github.io/DataCommSlides/Lecture_11_Bob_Ross.html#/) (this one may not translate to a non-classroom scenario; the idea here is that we create these graphs together one step at a time, building them up from scratch and showing the process) 40 | 41 | 42 | ## Fine-Tuning our Communications 43 | 44 | [Lecture 12: Story and Geometry](https://nickch-k.github.io/DataCommSlides/Lecture_12_Story_and_Geometry.html#/) 45 | 46 | [Lecture 13: Communicating Statistics and Uncertainty](https://nickch-k.github.io/DataCommSlides/Lecture_13_Statistics.html#/) 47 | 48 | ## Dashboards and 
Interactivity 49 | 50 | [Lecture 14: Dashboards in R](https://nickch-k.github.io/DataCommSlides/Lecture_14_Dashboards_in_R.html#/) 51 | 52 | [Lecture 15: Interactivity](https://nickch-k.github.io/DataCommSlides/Lecture_15_Interactive_Visuals_in_R.html#/) 53 | 54 | [Example Shiny Dashboard](https://nickch-k.shinyapps.io/Lecture_18_Example_Shiny/) 55 | 56 | ## Tableau 57 | 58 | [Lecture 16: Introduction to Tableau](https://nickch-k.github.io/DataCommSlides/Lecture_16_Intro_to_Tableau.html#/) 59 | 60 | [Lecture 17: More Tableau](https://nickch-k.github.io/DataCommSlides/Lecture_17_More_Tableau.html#/) 61 | 62 | [Lecture 18: Dashboards in Tableau](https://nickch-k.github.io/DataCommSlides/Lecture_18_Dashboards_in_Tableau.html#/) 63 | 64 | 65 | ## Additional Materials 66 | 67 | [Easy Mistakes to Avoid, and Things You Must Not Do](https://nickch-k.github.io/DataCommSlides/Easy_Mistakes_to_Avoid.html) -------------------------------------------------------------------------------- /SPrail.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/SPrail.zip -------------------------------------------------------------------------------- /SPrail_july.csv: -------------------------------------------------------------------------------- 1 | insert_date,origin,destination,start_date,end_date,train_type,price,train_class,fare 2 | 2019-05-08T09:04:07Z,MADRID,BARCELONA,2019-07-05T16:30:00Z,2019-07-05T19:20:00Z,AVE,102.15,Turista Plus,Promo 3 | 2019-05-09T21:01:16Z,MADRID,SEVILLA,2019-07-06T12:00:00Z,2019-07-06T14:32:00Z,AVE,53.4,Turista,Promo 4 | 2019-05-09T09:43:19Z,MADRID,BARCELONA,2019-07-01T15:30:00Z,2019-07-01T18:40:00Z,AVE,69.8,Turista Plus,Promo 5 | 2019-05-09T21:14:08Z,MADRID,BARCELONA,2019-07-05T20:30:00Z,2019-07-05T23:40:00Z,AVE,58.15,Turista,Promo 6 | 
2019-05-08T19:11:36Z,MADRID,BARCELONA,2019-07-02T18:30:00Z,2019-07-02T21:20:00Z,AVE,85.1,Turista,Promo 7 | 2019-05-07T07:47:46Z,BARCELONA,MADRID,2019-07-03T19:00:00Z,2019-07-03T22:00:00Z,AVE,85.1,Turista,Promo 8 | 2019-05-09T13:50:14Z,BARCELONA,MADRID,2019-07-04T11:00:00Z,2019-07-04T13:45:00Z,AVE,85.1,Turista,Promo 9 | 2019-05-07T03:42:13Z,MADRID,BARCELONA,2019-07-01T17:00:00Z,2019-07-01T19:30:00Z,AVE,88.95,Turista,Promo 10 | -------------------------------------------------------------------------------- /gganimate_guide.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Animations in ggplot - gganimate" 3 | author: "Gareth Green" 4 | output: slidy_presentation 5 | --- 6 | 7 | ```{r echo = FALSE} 8 | # Course: 5210 Communicating Data 9 | # Purpose: Demonstrate gganimate 10 | # Date: August 5, 2019 11 | # Author: Gareth Green 12 | 13 | ``` 14 | 15 | ```{r echo = FALSE} 16 | # Clear environment of variables and functions 17 | rm(list = ls(all = TRUE)) 18 | 19 | # Clear environment of packages 20 | if(is.null(sessionInfo()$otherPkgs) == FALSE)lapply(paste("package:", names(sessionInfo()$otherPkgs), sep=""), detach, character.only = TRUE, unload = TRUE) 21 | 22 | ``` 23 | 24 | ```{r echo = FALSE, message = FALSE, warning = FALSE} 25 | # Load libraries 26 | library(tidyverse) 27 | library(gganimate) 28 | 29 | ``` 30 | 31 | ```{r echo = FALSE} 32 | # Load data 33 | qp1 <- read.csv("qp1_data.csv") 34 | 35 | ``` 36 | 37 | 38 | Animated visuals 39 | ======================================= 40 | 41 | 
42 | 43 | + How many of you have seen this famous [gapminder video of Hans Rosling](https://www.bing.com/videos/search?q=hans+rosling&view=detail&mid=4CC1172C798C8C3BD2A94CC1172C798C8C3BD2A9&FORM=VIRE)? 44 | 45 | + Animated visuals can help tell a story 46 | 47 | - especially in a presentation 48 | - you can tell the story while the visual progresses 49 | 50 | + Animations are especially good for time series 51 | 52 | - can work well for other forms of motion as well 53 | - for example, how median price changes with number of bedrooms 54 | - key is that members of a group differ over the variable that you are animating over 55 | 56 | + Only use if the animation adds substance 57 | 58 | - do not just use animations to show sizzle 59 | 60 | + You can save these in your TA and call them later 61 | 62 | - `anim_save()` is much like `ggsave()` 63 | 64 | 
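As a rough sketch of the save-and-reuse idea above, the chunk below builds a tiny animation and writes it to disk with `anim_save()`. The built-in `mtcars` data, the file name `cyl_animation.gif`, and the object names `p_demo` and `anim` are placeholders chosen for illustration and are not part of the course materials. Rendering a GIF also assumes a renderer package such as **gifski** is installed.

```r
library(ggplot2)
library(gganimate)

# Hypothetical example plot: weight vs. mpg, one state per cylinder count
p_demo <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Cylinders: {closest_state}") +
  transition_states(cyl)

# Render once and keep the result so it can be saved or displayed again
anim <- animate(p_demo, nframes = 40, fps = 10)

# Save explicitly; if `animation` is omitted, anim_save() falls back to
# the most recently rendered animation (last_animation())
anim_save("cyl_animation.gif", animation = anim)
```

The saved file can then be pulled into a later chunk with `knitr::include_graphics("cyl_animation.gif")`, much as a plot written by `ggsave()` would be.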
65 | 66 | 67 | Animated visual basics 68 | ======================================= 69 | 70 |
71 | 72 | + I want to show you a similar animation to the one Hans used 73 | 74 | + We will be using gganimate 75 | 76 | - I will show several examples of how to use gganimate 77 | - there are several different functions and the developers continue to add new ones 78 | 79 | + Here we use the function `transition_time()` 80 | 81 | - this has the visual transition along a variable that is not in the plot 82 | - see that `year` is not in `ggplot()` but it is in `transition_time(year)` 83 | 84 | ```{r ann1, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 5} 85 | # Load package with gapminder data 86 | library(gapminder) 87 | 88 | # Build a faceted animation graph 89 | ggplot(data = gapminder, 90 | mapping = aes(x = gdpPercap/1000, y = lifeExp, 91 | size = pop, color = country)) + 92 | geom_point(alpha = 0.7, show.legend = FALSE) + 93 | scale_colour_manual(values = country_colors) + 94 | scale_size(range = c(2, 12)) + 95 | scale_x_log10() + 96 | facet_wrap(~continent) + 97 | theme_bw() + 98 | 99 | # Here comes the gganimate specific bits 100 | labs(title = "Real GDP and life expectancy grow together: Year {frame_time}", 101 | x = "GDP per capita ($000)", y = "Life expectancy") + 102 | transition_time(year) + 103 | ease_aes("linear") 104 | 105 | ``` 106 | 107 | 
108 | 109 | 110 | Time-series animation visuals 111 | ======================================= 112 | 113 |
114 | 115 | + Here is a typical time series visual 116 | 117 | - here I use `transition_reveal()` by the x-variable in `ggplot()` 118 | - also set `frame = yr_built` for the visual to transition along 119 | 120 | + I use `animate()` to set the speed of the reveal 121 | 122 | ```{r ann2, message = FALSE, warning = FALSE} 123 | # Graph median price by year built 124 | p <- qp1 %>% 125 | select(yr_built, waterfront, price) %>% 126 | group_by(yr_built, waterfront) %>% 127 | summarize(med_price = median(price/1000)) %>% 128 | ggplot(aes(x = yr_built, y = med_price, frame = yr_built)) + 129 | geom_line(aes(color = as.factor(waterfront))) + 130 | scale_color_manual(name = "Waterfront", labels = c("No", "Yes"), values = c("navy", "blue"), 131 | guide = guide_legend(reverse=TRUE)) + 132 | theme_classic() + 133 | theme(axis.title.x = element_blank(), 134 | legend.position="bottom") + 135 | labs(title = "Waterfront median price is much more variable than non-waterfront", 136 | x = "", y = "Median price ($000)") + 137 | transition_reveal(yr_built) 138 | 139 | # Run as a GIF animation and control speed 140 | animate(p, nframes = 300) 141 | 142 | ``` 143 | 144 | 
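To make the speed control a little more concrete: `animate()` accepts both `nframes` (total frames) and `fps` (frames per second), so playback lasts roughly `nframes / fps` seconds. The sketch below uses ggplot2's built-in `economics` data as a stand-in for `qp1`, which is not bundled here; the object names and the specific frame counts are illustrative assumptions, not settings from the lecture.

```r
library(ggplot2)
library(gganimate)

# Stand-in time series: US unemployment from ggplot2's `economics` data
p_reveal <- ggplot(economics, aes(x = date, y = unemploy / 1000)) +
  geom_line() +
  labs(x = "", y = "Unemployed (millions)") +
  transition_reveal(date)

# 60 frames at 10 fps plays for about 6 seconds
slow <- animate(p_reveal, nframes = 60, fps = 10)

# Half the frames at the same frame rate plays for about 3 seconds,
# so the line is revealed twice as fast
fast <- animate(p_reveal, nframes = 30, fps = 10)
```

Raising `nframes` while holding `fps` fixed, as in `animate(p, nframes = 300)`, lengthens the animation and makes the reveal smoother, at the cost of a longer render.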
145 | 146 | Animating distinct stages of a visual 147 | ======================================= 148 | 149 |
150 | 151 | + Can also animate visuals that have discrete states 152 | 153 | - good for bar graphs or similar 154 | 155 | + The variable in `transition_states()` does not have to be called in `ggplot()` 156 | 157 | - i.e. could have done by `view` 158 | 159 | ```{r ann3, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 5} 160 | # Animate a bar graph 161 | qp1 %>% 162 | select(price, bedrooms, waterfront) %>% 163 | group_by(bedrooms, waterfront) %>% 164 | filter(bedrooms < 7 & bedrooms > 0) %>% 165 | summarise(med_price = median(price/1000)) %>% 166 | ggplot(aes(x = bedrooms, y = med_price, fill = as.factor(waterfront))) + 167 | geom_bar(stat = "identity", position = "dodge") + 168 | theme_classic() + 169 | scale_fill_manual(name = "Waterfront", labels = c("No", "Yes"), values = c("lightblue", "blue"), 170 | guide = guide_legend(reverse=TRUE)) + 171 | labs(title = "Median price grows more per bedroom for waterfront homes", 172 | x = "Number of bedrooms", y = "Median price ($000)") + 173 | transition_states(bedrooms, wrap = FALSE) + 174 | shadow_mark() 175 | 176 | ``` 177 | 178 | 
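The point that the state variable need not appear in `ggplot()` can be sketched with a small stand-in example. Here the built-in `mtcars` data replaces `qp1`, and the bars transition over `am` (transmission type), which is not mapped to any aesthetic. The data set, object names, and timing values are assumptions made for illustration only.

```r
library(ggplot2)
library(gganimate)

# Median mpg for each combination of cylinders and transmission type
bars <- aggregate(mpg ~ cyl + am, data = mtcars, FUN = median)

# `am` appears only in transition_states(), not in aes()
p_states <- ggplot(bars, aes(x = factor(cyl), y = mpg)) +
  geom_col() +
  labs(title = "Transmission (am): {closest_state}",
       x = "Cylinders", y = "Median mpg") +
  # state_length and transition_length are relative: each state is held
  # twice as long as the move between states; wrap = FALSE skips the
  # transition from the last state back to the first
  transition_states(am, transition_length = 1, state_length = 2, wrap = FALSE)

# Render the animation (30 frames at 10 fps, about 3 seconds)
animate(p_states, nframes = 30, fps = 10)
```

Unlike the chunk above, this sketch leaves out `shadow_mark()`, so each state replaces the previous one instead of leaving earlier bars on screen.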
179 | 180 | 181 | -------------------------------------------------------------------------------- /king_dailyvisits.Rdata: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/king_dailyvisits.Rdata -------------------------------------------------------------------------------- /no_border.css: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /oster_vitamine_data.Rdata: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/oster_vitamine_data.Rdata -------------------------------------------------------------------------------- /oster_vitamine_data.hyper: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/oster_vitamine_data.hyper -------------------------------------------------------------------------------- /robocalls_excel.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/DataCommSlides/1dcb927b8d9220ad785ac6043e3f5c9498d2eefa/robocalls_excel.xlsx --------------------------------------------------------------------------------