├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── LICENSE
├── NAMESPACE
├── README.md
├── SportsDataTutorials.Rproj
└── inst
    └── tutorials
        ├── 01-basics
        │   ├── 01-basics.Rmd
        │   └── images
        │       ├── basics1.png
        │       ├── basics2.png
        │       ├── basics3.png
        │       └── console1.png
        ├── 02-databasics
        │   ├── 02-databasics.Rmd
        │   └── images
        │       ├── clean1.png
        │       ├── clean2.png
        │       ├── clean3.png
        │       ├── columnnames.png
        │       ├── data1.png
        │       └── environment1.png
        ├── 03-aggregates1
        │   ├── 03-aggregates1.Rmd
        │   └── images
        │       ├── IMG_1345.png
        │       └── IMG_1347.png
        ├── 04-aggregates2
        │   ├── 04-aggregates2.Rmd
        │   └── images
        │       ├── IMG_1345.png
        │       └── IMG_1347.png
        ├── 05-mutating
        │   ├── 05-mutating.Rmd
        │   └── images
        │       └── environment.png
        ├── 06-filters
        │   └── 06-filters.Rmd
        ├── 07-transforming
        │   └── 07-transforming.Rmd
        ├── 08-significancetests
        │   ├── 08-significancetests.Rmd
        │   └── images
        │       └── simulations2.png
        ├── 09-regressions
        │   ├── 09-regressions.Rmd
        │   └── 09-regressions_files
        │       └── figure-html
        │           └── correlationchart-1.png
        ├── 10-residuals
        │   ├── 10-residuals.Rmd
        │   └── 10-residuals_files
        │       └── figure-html
        │           ├── correlationchart-1.png
        │           ├── correlationchart2-1.png
        │           ├── correlationchart3-1.png
        │           └── correlationchart4-1.png
        ├── 11-z-scores
        │   ├── 11-z-scores.Rmd
        │   └── images
        │       └── simulations2.png
        ├── 12-usingpackages
        │   └── 12-usingpackages.Rmd
        ├── 13-bar-charts
        │   └── 13-bar-charts.Rmd
        ├── 14-stacked-bars
        │   └── 14-stacked-bars.Rmd
        ├── 15-waffle-charts
        │   └── 15-waffle-charts.Rmd
        ├── 16-line-charts
        │   └── 16-line-charts.Rmd
        ├── 17-step-charts
        │   └── 17-step-charts.Rmd
        ├── 18-slope-charts
        │   └── 18-slope-charts.Rmd
        ├── 19-scatterplots
        │   └── 19-scatterplots.Rmd
        ├── 20-bubble-charts
        │   └── 20-bubble-charts.Rmd
        ├── 21-beeswarm-plots
        │   └── 21-beeswarm-plots.Rmd
        ├── 22-bump-charts
        │   └── 22-bump-charts.Rmd
        ├── 23-dumbbell-charts
        │   └── 23-dumbbell-charts.Rmd
        ├── 24-tables
        │   └── 24-tables.Rmd
        ├── 25-faceting
        │   └── 25-faceting.Rmd
        ├── 26-arranging-plots
        │   └── 26-arranging-plots.Rmd
        ├── 27-encircling-points
        │   └── 27-encircling-points.Rmd
        ├── 28-blogging
        │   ├── 28-blogging.Rmd
        │   └── images
        │       ├── blog1.png
        │       ├── blog12.png
        │       ├── blog13.png
        │       ├── blog14.png
        │       ├── blog15.png
        │       ├── blog2.png
        │       ├── blog3.png
        │       ├── blog4.png
        │       ├── blog5.png
        │       ├── blog7.png
        │       └── blog9.png
        ├── 29-text-cleaning
        │   └── 29-text-cleaning.Rmd
        ├── 30-annotations-and-headlines
        │   └── 30-annotations-and-headlines.Rmd
        ├── 31-color
        │   └── 31-color.Rmd
        ├── 32-finishing-touches
        │   ├── 32-finishing-touches.Rmd
        │   └── images
        │       └── chartannotated.png
        ├── 33-rvest
        │   └── 33-rvest.Rmd
        ├── 34-multiple-regression
        │   └── 34-multiple-regression.Rmd
        ├── 35-clustering
        │   └── 35-clustering.Rmd
        ├── 36-simulations
        │   ├── 36-simulations.Rmd
        │   └── images
        │       ├── simulations1.png
        │       └── simulations2.png
        └── 37-joins
            └── 37-joins.Rmd
/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^SportsDataTutorials\.Rproj$
2 | ^\.Rproj\.user$
3 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # History files
2 | .Rhistory
3 | .Rapp.history
4 |
5 | # Session Data files
6 | .RData
7 |
8 | # User-specific files
9 | .Ruserdata
10 |
11 | # Example code in package build process
12 | *-Ex.R
13 |
14 | # Output files from R CMD build
15 | /*.tar.gz
16 |
17 | # Output files from R CMD check
18 | /*.Rcheck/
19 |
20 | # RStudio files
21 | .Rproj.user/
22 |
23 | # produced vignettes
24 | vignettes/*.html
25 | vignettes/*.pdf
26 |
27 | # OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
28 | .httr-oauth
29 |
30 | # knitr and R markdown default cache directories
31 | *_cache/
32 | /cache/
33 |
34 | # Temporary files created by R markdown
35 | *.utf8.md
36 | *.knit.md
37 |
38 | # R Environment Variables
39 | .Renviron
40 | .Rproj.user
41 |
42 | .DS_Store
43 |
44 | *.html
45 |
46 | *.Rproj
47 |
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: SportsDataTutorials
2 | Title: The Companion Tutorials To UNL's SPMC 350 Sports Data Analysis Course
3 | Version: 1.0.0.0000
4 | Authors@R:
5 |     person(given = "Matt",
6 |            family = "Waite",
7 |            role = c("aut", "cre"),
8 |            email = "matt.waite@unl.edu")
9 | Description: Walks students through the skills needed for UNL's SPMC 350 course.
10 | License: MIT + file LICENSE
11 | Encoding: UTF-8
12 | LazyData: true
13 | Roxygen: list(markdown = TRUE)
14 | RoxygenNote: 7.1.1
15 | Suggests:
16 |     learnr
17 | Imports:
18 |     gradethis (>= 0.2.3.9001)
19 | Remotes:
20 |     rstudio/gradethis
21 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 Matt Waite
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 |
3 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # LearnR Sports Data Tutorials
2 |
3 | Interactive R tutorials for SPMC350 at the University of Nebraska-Lincoln's College of Journalism and Mass Communications.
4 |
5 | To get them started, you need to do the following:
6 |
7 | 1. Install the latest version of R on your computer (a recent version will do if you already have it installed).
8 | 2. Install the latest version of RStudio.
9 | 3. Then, in the console of RStudio, run the following command:
10 |
11 | `install.packages(c("tidyverse", "rmarkdown", "lubridate", "janitor", "cowplot", "learnr", "wehoop", "worldfootballR", "remotes", "devtools", "waffle", "ggrepel", "ggbeeswarm", "ggbump", "ggalt", "ggtext", "rvest", "Hmisc", "cluster"))`
12 |
13 | 4. After that finishes, run the following command:
14 |
15 | `devtools::install_github("mattwaite/SportsDataTutorials")`
16 |
17 | If you are prompted that newer versions exist, decline to install them.
18 |
19 |
--------------------------------------------------------------------------------
/SportsDataTutorials.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 | ProjectId: 980dcc1a-fd77-4f59-9d16-5f9ff36726fe
3 |
4 | RestoreWorkspace: Default
5 | SaveWorkspace: Default
6 | AlwaysSaveHistory: Default
7 |
8 | EnableCodeIndexing: Yes
9 | UseSpacesForTab: Yes
10 | NumSpacesForTab: 2
11 | Encoding: UTF-8
12 |
13 | RnwWeave: Sweave
14 | LaTeX: pdfLaTeX
15 |
16 | BuildType: Package
17 | PackageUseDevtools: Yes
18 | PackageInstallArgs: --no-multiarch --with-keep.source
19 |
--------------------------------------------------------------------------------
/inst/tutorials/01-basics/01-basics.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 1: The Basics"
3 | output:
4 |   learnr::tutorial:
5 |     progressive: true
6 |     allow_skip: true
7 |     theme: united
8 |     ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 |   Learn the very basics of R, tools that will help you down the road.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | library(glue)
19 | knitr::opts_chunk$set(echo = FALSE)
20 | tutorial_options(exercise.completion=FALSE)
21 | ```
22 |
23 | ```{r load-data, message=FALSE, warning=FALSE}
24 | g <- 62
25 | hr <- 72
26 | hpg <- 72/62
27 | ```
28 |
29 | ## The Goal
30 |
31 | This lesson aims to equip you with the fundamental tools of R programming for sports data analysis. You'll learn to use R as a calculator, create variables, and understand different data types - all essential skills for handling sports statistics. By applying these concepts to real Nebraska baseball data, you'll see how simple R commands can quickly yield meaningful insights like home runs per game. You'll also be introduced to R Notebooks, a powerful tool for combining code and explanations. Mastering these basics is crucial; they're the building blocks for all the complex analyses you'll tackle in future lessons. By the end of this tutorial, you'll have taken your first steps towards becoming a sports data analyst.
32 |
33 | ## The Basics
34 |
35 | R is a programming language, one specifically geared toward statistical analysis. Like all programming languages, it has certain built-in functions and you can interact with it in multiple ways. The first, and most basic, is the console. Think of the console like talking directly to R. It's direct, but it has some drawbacks and some quirks we'll get into later.
36 |
37 | {width="100%"}
38 | For now, we're going to use windows in this tutorial like console windows. You'll type commands into the boxes below and hit the Run Code button on the top right of the box. Be sure that you're reading closely -- **things like spaces, capitalization and spelling all matter**. Some things matter to the point that if you miss them, the code won't work. Other things are stylistic. It takes time and practice to know the difference.
39 |
40 |
41 | **Common Mistake**
42 | Everyone makes mistakes. Everyone makes dumb mistakes. Everyone overlooks dumb mistakes for hours only to be shown the missing comma. You aren't dumb. You're like every single person who has ever tried to learn how to code.
43 |
44 |
45 | One of the most difficult hurdles for beginning students of mathematics to get over is notation. If you miss a day, tune out for a class, or just never get around to asking what something means, you'll be lost. So let's just cover some bases and make sure we all understand some ultra-basic notation.
46 |
47 | You might laugh at these, but someone reading them is looking up Gods on Wikipedia to thank. So do yourself a favor and refresh.
48 |
49 | | Symbol | Meaning | Example |
50 | |--------|----------------|---------|
51 | | \+ | addition | 5+2 |
52 | | \- | subtraction | 6-2 |
53 | | \* | multiplication | 7\*2 |
54 | | / | division | 8/2 |
55 | | \^ | exponent | 2\^3 |
56 | | sqrt | square root | sqrt(4) |
57 |
58 | One of the most important and often overlooked concepts in basic math is the order that you do the calculations. When you have something like `5+5*5^2`, which gets done first?
59 |
60 | In math, and in code, order matters. If you must do A before you do B, then A needs to appear before and run before B. So you need to keep track of when things happen.
61 |
62 | Thankfully, math teachers have provided us an easy to remember mnemonic that you probably learned in sixth grade and tried to forget until now.
63 |
64 | **PEMDAS** -- **P**arentheses, **E**xponents, **M**ultiplication, **D**ivision, **A**ddition, **S**ubtraction
65 |
66 | What that means is when you look at a mathematical formula, you do the calculations in the order PEMDAS tells you. Something in parentheses? Do that first. Is there an exponent? Do that next. Multiplication or division? Those come next, worked left to right, and so forth. Knowing PEMDAS will save you from stupid mistakes down the road.
67 |
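As a quick illustration of PEMDAS, here is a minimal sketch of how R evaluates the expression from above. The second calculation adds parentheses, which are not part of the original expression, only to show how they change the order. The variable names are made up for this example.

```r
# Exponent first (5^2 is 25), then multiplication (5 * 25 is 125), then addition.
no_parens <- 5 + 5 * 5^2
no_parens

# Parentheses force the addition to happen first: (5 + 5) is 10, then 10 * 25.
with_parens <- (5 + 5) * 5^2
with_parens
```

Same numbers, same symbols, different answers. The only thing that changed is the order in which R did the work.
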
68 | ### Exercise 1: Using R like a calculator
69 |
70 | *Write the code necessary to add 17 and 34 together.* In this case, it's super simple.
71 |
72 | ```{r addition, exercise=TRUE, exercise.reveal_solution = FALSE}
73 |
74 | ```
75 |
76 | ```{r addition-solution, exercise.reveal_solution = FALSE}
77 | 17+34
78 | ```
79 |
80 | ```{r addition-check}
81 | grade_this_code()
82 | ```
83 |
84 | Congrats, you've run some code. It's not very complex, and you knew the answer beforehand (maybe, probably), but you get the idea.
85 |
86 | ## Variables
87 |
88 | So we've learned that we can compute things. We can also store things. **In programming languages, these are called variables**.
89 |
90 |
91 | **Key Concept**
92 | A variable is a name that stores stuff. It can be as simple as a name, or as big as a dataset of every baseball game ever played. It's just a way to name a thing.
93 |
94 |
95 | We can assign things to variables using `<-`. And then we can do things with them.
96 |
97 |
98 | **Key Concept**
99 | The `<-` is called an assignment operator.
100 |
101 |
102 | With variables, we can assign names to values. Let's try it.
103 |
104 | ### Exercise 2: Variables
105 |
106 | One of the key concepts of programming that you'll need to learn is that **most things are a pattern**. You'll see that doing certain things takes the same form each time, and all you're doing is changing small parts of the pattern to fit what you want to do.
107 |
108 |
109 | **Key Concept**
110 | Declaring variables is a pattern. The pattern here is `Name Of The Thing <- Value Of The Thing`.
111 |
112 |
113 | Names can be anything you want them to be, **but they have to be one word**, and it's better if they have no numbers or symbols in them. So you can create a variable called NebraskaWillWinTheNationalTitleThisSeasonMattRhuleIsGod if you want. But some questions for you: What is that variable going to contain? Is it obvious by the name? And do you want to try and type that again? How about three more times? Yeah, me neither.
114 |
115 | > **5 Rules For Naming Variables**
1. One word, no spaces.
2. If you must use a number, it has to be at the end, never at the beginning.
3. No symbols (like ! or ? or \# or anything like it).
4. Keep them short, but never one letter. If your variable is holding the number of touchdowns a team scored, don't call it t.
5. Make them understandable. If your variable is holding the number of touchdowns a team scored, call it touchdowns. Keep it simple.
116 |
117 | The next thing we need to know about are the values. There's a bunch of different kinds of values, but three that we're going to work with the most are **strings**, **numbers** and, later, **data**. A string is just text. Your name is a string. Your home address is a string. For the most part, strings aren't hard -- they're anything you wouldn't do math on. So, for example, your phone number is a string, even though it's made up of numbers. If we add your phone number to mine, do we get something meaningful? No. Same with ZIP codes. They're strings.
118 |
119 | **To record a string, you put it in quotes.** So `myname <- "Matt"` is how to set the value of the variable myname to Matt.
120 |
121 | **To record a number, do *not* use quotes.** So `myage <- 46` is how to set the value of the variable myage to 46.
122 |
123 |
124 | **Key Concept**
125 | Strings go in quotes. Numbers do not have quotes. The use of quotes around something changes the type of data it is.
126 |
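To see how quotes change the type, here is a small sketch that uses base R's `class()` function to ask R what kind of value each variable holds. The `zipcode` variable and its value are made up for illustration. R's own name for a string is "character".

```r
myname <- "Matt"    # quotes make this a string (R calls the type "character")
myage <- 46         # no quotes, so this is a number (R calls the type "numeric")
zipcode <- "68588"  # made-up example: all digits, but stored as a string

class(myname)
class(myage)
class(zipcode)
```

The ZIP code looks like a number, but because it is in quotes, R treats it as text.
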
127 |
128 | Where this becomes a problem is where you try to make a string into a number or try to do math with strings.
129 |
130 | In the window below, create two variables called Number1 and Number2. Set Number1 to 5 and Number2 to "5" with the quotes. Then try adding Number1 + Number2. What happens?
131 |
132 | ```{r intentional-error-one, exercise=TRUE}
133 |
134 | ```
135 |
136 | You get an error. `non-numeric argument to binary operator` is a rather computer-ish way of saying you tried to add 5 and Zebra. You can't add those two things together. So when you see that error, **the non-numeric part is your hint that you're trying to do math on something that isn't a number**. To fix it, just remove the quotes.
137 |
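If you'd like to look at that error without your code stopping, one possible sketch is below. It assumes base R's `tryCatch()` and `conditionMessage()` to capture the message, and it uses `as.numeric()` to convert the string as an alternative to retyping the value without quotes. The names `message_text` and `fixed` are made up for this example.

```r
Number1 <- 5
Number2 <- "5"

# tryCatch() captures the error so we can read its message instead of stopping.
message_text <- tryCatch(
  Number1 + Number2,
  error = function(e) conditionMessage(e)
)
message_text

# as.numeric() converts the string "5" into the number 5, so the math works.
fixed <- Number1 + as.numeric(Number2)
fixed
```

Converting with `as.numeric()` is handy later on, when the quotes come from a data file and you can't just delete them by hand.
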
138 | Let's do something more real and calculate the number of home runs per game the Nebraska baseball team hit this season. To do this, we need to know the number of home runs the team hit and the number of games they played. Then it's just simple division.
139 |
140 | [You can find season stats here](https://storage.googleapis.com/huskers-com-prod/2024/06/04/TE2BF2HsILQBpGGrpJgNm3IKFlj3V9fWUX3r1Jd1.pdf) at the bottom of the table in a row labeled Totals. The number of games is in the column named GP-GS, which stands for Games Played - Games Started. See why naming things what they are is a rule in this class?
141 |
142 | We're going to name the first variable HomeRuns and the second variable Games. **Pay Attention: HomeRuns, not Home Runs. The space will break it.** Set them equal to the numbers you got from the table, then divide HomeRuns by Games.
143 |
144 | *Write the code necessary to create the variables and set them to the number of home runs Nebraska hit in the 2024 season and the number of games played.*
145 |
146 | ```{r variables-one, exercise=TRUE}
147 |
148 | ```
149 |
150 | ```{r variables-one-hint}
151 | HomeRuns <- ??
152 | Games <- ??
153 | SOMETHING / SOMETHING
154 | ```
155 |
156 | ```{r variables-one-solution, exercise.reveal_solution = FALSE}
157 | HomeRuns <- 72
158 | Games <- 62
159 | HomeRuns / Games
160 | ```
161 |
162 | ```{r variables-one-check}
163 | grade_this_code()
164 | ```
165 |
166 | ```{r basicsresults, exercise=FALSE, exercise.eval=TRUE, exercise.setup = "load-data", results='asis'}
167 | glue("Looking at our output, we can read that as the Huskers hit {format(round(hpg, 1))} home runs per game this past season. So if you watched a game, chances are you were going to see the Good Guys hit at least one bomb. Some days you'd get fewer, some days you'd get more, but on average, they hit just about {format(round(hpg, 1))} per game.")
168 | ```
169 |
170 | This isn't terribly useful just yet. If we wanted to know where the Huskers ranked in college baseball, it would take us forever to look up every team and do this math. But as you now know, variables can hold anything we want to store in them. **Soon enough, that HomeRuns value will hold every team in college baseball, and Games will hold the games that every team played**. The code doesn't change all that much, but we need to crawl before we walk.
171 |
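As a rough preview of that idea, here is a sketch where each variable holds several numbers at once. The first pair is the Nebraska totals from this lesson. The other two teams are invented purely for illustration, and `HomeRunsPerGame` is a name made up for this example.

```r
# First value in each is Nebraska from this lesson; the rest are made up.
HomeRuns <- c(72, 55, 90)
Games <- c(62, 55, 60)

# Division happens pair by pair: the first team's home runs over the
# first team's games, the second over the second, and so on.
HomeRunsPerGame <- HomeRuns / Games
round(HomeRunsPerGame, 2)
```

Notice the division line is the same code you wrote in the exercise. Only what the variables hold has changed.
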
172 | ## Notebooks
173 |
174 | Outside of these tutorials, you will work in notebooks. Notebooks are a place where we can combine text, where we explain our thinking, with code, where we do our work. Modern data science requires transparency. Notebooks are where we are transparent. Here is my data, here is my code, here is my thinking, you can see it all and judge for yourself.
175 |
176 | To make a notebook, go up to the top left of your RStudio window and you'll see a green plus sign. Click that, and go down to R Notebook. Or, you can go to the File menu and go to New File and then to R Notebook.
177 |
178 | {width="100%"}
179 |
180 | You'll notice, in your notebook, a lot of automatically generated text. We do not want that. So you'll need to select it all - command A on a Mac, control A on Windows - and hit delete. You should have a blank page now.
181 |
182 | Text -- explanations -- goes between code chunks. So at first, just write a sentence.
183 |
184 | Then, you can click the green C in the tool bar and go to R or use the keyboard shortcut to add a chunk: Ctrl + Alt + I on Windows or Command + Option + I on Mac.
185 |
186 | {width="100%"}
187 |
188 | Then, it's simply a matter of putting things where they need to go.
189 |
190 | {width="100%"}
191 |
192 | When it's time to turn it in, you just go to File \> Save and give the file a name and save it. When asked, you're turning in the .Rmd file you just saved.
193 |
194 | ## Recap
195 |
196 | In this lesson, you've taken your first steps into R programming for sports data analysis. You've learned how to use R as a calculator and been introduced to variables: how to create them, name them effectively, and use them to store both numbers and text (strings). You've applied these skills to real sports data, calculating home runs per game for the Nebraska baseball team. Along the way, you've encountered common pitfalls like trying to perform math on strings, and you've learned about R Notebooks as a tool for combining code and explanations. These foundational skills will serve as the building blocks for more advanced analyses in your future work with sports data.
197 |
198 | ## Terms to know
199 |
200 | - **R**: A programming language for statistical computing and graphics.
201 | - **Console**: The direct interface for entering commands in R.
202 | - **Variable**: A container for storing data values.
203 | - **Assignment Operator (<-)**: The symbol used to assign values to variables in R.
204 | - **String**: A data type used to represent text, enclosed in quotes.
205 | - **Numeric**: A data type used to represent numbers.
206 | - **Notebook**: An interactive document that combines code, output, and narrative text.
207 | - **Code Chunk**: A section in an R Notebook where R code is written and executed.
208 | - **Package**: A collection of R functions, data, and documentation.
209 | - **Library**: A term often used interchangeably with "package" in R.
210 |
--------------------------------------------------------------------------------
/inst/tutorials/01-basics/images/basics1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/01-basics/images/basics1.png
--------------------------------------------------------------------------------
/inst/tutorials/01-basics/images/basics2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/01-basics/images/basics2.png
--------------------------------------------------------------------------------
/inst/tutorials/01-basics/images/basics3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/01-basics/images/basics3.png
--------------------------------------------------------------------------------
/inst/tutorials/01-basics/images/console1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/01-basics/images/console1.png
--------------------------------------------------------------------------------
/inst/tutorials/02-databasics/images/clean1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/02-databasics/images/clean1.png
--------------------------------------------------------------------------------
/inst/tutorials/02-databasics/images/clean2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/02-databasics/images/clean2.png
--------------------------------------------------------------------------------
/inst/tutorials/02-databasics/images/clean3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/02-databasics/images/clean3.png
--------------------------------------------------------------------------------
/inst/tutorials/02-databasics/images/columnnames.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/02-databasics/images/columnnames.png
--------------------------------------------------------------------------------
/inst/tutorials/02-databasics/images/data1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/02-databasics/images/data1.png
--------------------------------------------------------------------------------
/inst/tutorials/02-databasics/images/environment1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/02-databasics/images/environment1.png
--------------------------------------------------------------------------------
/inst/tutorials/03-aggregates1/images/IMG_1345.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/03-aggregates1/images/IMG_1345.png
--------------------------------------------------------------------------------
/inst/tutorials/03-aggregates1/images/IMG_1347.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/03-aggregates1/images/IMG_1347.png
--------------------------------------------------------------------------------
/inst/tutorials/04-aggregates2/04-aggregates2.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 4: Aggregates, Part 2"
3 | output:
4 |   learnr::tutorial:
5 |     progressive: true
6 |     allow_skip: true
7 |     theme: united
8 |     ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 |   Learn more about how to take lots of little things and total them up into bigger things.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | library(glue)
19 | knitr::opts_chunk$set(echo = FALSE)
20 | tutorial_options(exercise.completion=FALSE)
21 | ```
22 |
23 | ## The Goal
24 |
25 | The goal of this lesson is to take what we did with the `group_by()` and `summarise()` functions that you learned about in the last lesson and add some new tools to it. Not everything is a count. Sometimes, you need an average. Or a median. Or the biggest number. Or the smallest number. By the end of this lesson, you'll be able to answer questions like "What position averages the most kills?" The examples will use NCAA women's volleyball statistics, but the techniques you learn can be applied to any sport or dataset.
26 |
27 | ## The Basics
28 |
29 | Hopefully, this is starting to feel a little familiar. Remember: It's all a pattern. Just about every single time it's the same start. Load libraries, load data, look at the data, answer the questions.
30 |
31 | Let's get to work.
32 |
33 | ```{r load-tidyverse, exercise=TRUE}
34 | library(tidyverse)
35 | ```
36 |
37 | ```{r load-tidyverse-solution}
38 | library(tidyverse)
39 | ```
40 |
41 | ```{r load-tidyverse-check}
42 | grade_this_code()
43 | ```
44 |
45 | Now, we load the data.
46 |
47 | ```{r load-data, message=FALSE, warning=FALSE}
48 | volleyballplayers <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_volleyball_players_2024.csv")
49 |
50 | nplayers <- nrow(volleyballplayers)
51 |
52 | avgsets <- volleyballplayers |>
53 |   group_by(position) |>
54 |   summarise(
55 |     count = n(),
56 |     mean_sets = mean(s),
57 |     median_sets = median(s)
58 |   ) |>
59 |   arrange(desc(mean_sets)) |> slice(1)
60 |
61 | killsleader <- volleyballplayers |> arrange(desc(kills)) |> slice(1)
62 | ```
63 |
64 | ```{r load-data-exercise, exercise = TRUE}
65 | volleyballplayers <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_volleyball_players_2024.csv")
66 | ```
67 |
68 | ```{r load-data-exercise-solution}
69 | volleyballplayers <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_volleyball_players_2024.csv")
70 | ```
71 |
72 | ```{r load-data-exercise-check}
73 | grade_this_code()
74 | ```
75 |
76 | Now, let's get a peek at our data.
77 |
78 | ```{r head-data, exercise=TRUE, exercise.setup = "load-data"}
79 | head(_____)
80 | ```
81 |
82 | ```{r head-data-solution, exercise.reveal_solution = FALSE}
83 | head(volleyballplayers)
84 | ```
85 |
86 | ```{r head-data-check}
87 | grade_this_code()
88 | ```
89 |
90 | ::: {#head-data-hint}
91 | **Hint:** The thing you need is to the left of a \<- in a block above.
92 | :::
93 |
94 | Now that we're set up, we can answer more questions with data.
95 |
96 | ### Exercise 1: More in summarize
97 |
98 | In the last example, we grouped some data together and counted it up, but there's so much more you can do.
99 |
100 | Sticking with our NCAA volleyball player data, we can calculate any number of measures inside summarize. Here, we'll use R's built-in `mean` function to calculate ... well, you get the idea.
101 |
102 | Let's look just at the number of sets (s) each position gets. **In summarize, what you do left of the equal sign is giving something a name**. We should name our new column what it is: `mean_sets`. Then let's arrange it by the average number of sets (hint: that wording, average number of sets, is a trick. You arrange by the name of a column created in summarize).
103 |
104 |
105 | **Common Mistake**
106 | What goes inside `mean()` has to be a column name. You can find those above where you ran `head()`.
107 |
108 |
109 | ```{r group-by-4, exercise=TRUE, exercise.setup = "load-data"}
110 | _____ |>
111 |   group_by(___) |>
112 |   summarise(
113 |     ____ = mean(____)
114 |   ) |>
115 |   arrange(desc(____))
116 | ```
117 |
118 | ```{r group-by-4-solution, exercise.reveal_solution = FALSE}
119 | volleyballplayers |>
120 |   group_by(position) |>
121 |   summarise(
122 |     mean_sets = mean(s)
123 |   ) |>
124 |   arrange(desc(mean_sets))
125 | ```
126 |
127 | ```{r group-by-4-check}
128 | grade_this_code()
129 | ```
130 |
131 | ::: {#group-by-4-hint}
132 | **Hint:** What goes in `group_by()` and `mean()` comes from column names you find with `head()`. What goes before the = is the name given in the previous paragraph.
133 | :::
134 |
135 | ```{r avgsets, exercise=FALSE, exercise.eval=TRUE, exercise.setup = "load-data", results='asis'}
136 | glue("So there are {nplayers} players in the data. Of them, {avgsets$count} are of the position {avgsets$position}, which, if you know volleyball, you can figure out pretty quickly. The average {avgsets$position} plays in {format(avgsets$mean_sets, digits=1)} sets.")
137 | ```
138 |
139 | ### Exercise 2: More than one number
140 |
141 | You can do multiple measures in a single `summarize` step as well. There's a simple way to think about this pattern. You might have noticed in the previous exercise, you first give your column a name, then an equal sign, then you do some math stuff. In the last exercise, it was `mean()`. All of that -- `nameofcolumn = math()` -- creates a column with a name and puts numbers in it. To add another one, just add a comma to the end of that line, and repeat the pattern.
142 |
143 | Let's add median_sets to our data analysis and arrange it by that.
144 |
145 | ```{r group-by-5, exercise=TRUE, exercise.setup = "load-data"}
146 | _____ |>
147 | group_by(___) |>
148 | summarise(
149 | ____ = mean(____),
150 | ____ = median(____)
151 | ) |>
152 | arrange(desc(____))
153 | ```
154 |
155 | ```{r group-by-5-solution, exercise.reveal_solution = FALSE}
156 | volleyballplayers |>
157 | group_by(position) |>
158 | summarise(
159 | mean_sets = mean(s),
160 | median_sets = median(s)
161 | ) |>
162 | arrange(desc(median_sets))
163 | ```
164 |
165 | ```{r group-by-5-check}
166 | grade_this_code()
167 | ```
168 |
169 | ::: {#group-by-5-hint}
170 | **Hint:** What goes in `group_by`, `mean()` and `median()` all comes from column names you find with `head()`. What goes before the = are the names given in the text above.
171 | :::
172 |
173 | ### Exercise 3: Arranging two ways
174 |
175 | One thing to keep in mind is this: You can use any of these verbs -- group_by, summarize, arrange -- on their own without the others. For example, what player had the most kills in 2024? To find this out, we don't need a group_by or summarize, so what goes in arrange is just the field name. Let's put this in order of kills.
176 |
177 | First let's get the most kills.
178 |
179 | ```{r group-by-6, exercise=TRUE, exercise.setup = "load-data"}
180 | volleyballplayers |> arrange(desc(____))
181 | ```
182 |
183 | ```{r group-by-6-solution, exercise.reveal_solution = FALSE}
184 | volleyballplayers |> arrange(desc(kills))
185 | ```
186 |
187 | ```{r group-by-6-check}
188 | grade_this_code()
189 | ```
190 |
191 | Now we can get the least by removing desc() and just doing arrange by our field name.
192 |
193 |
194 | **Key Concept:**
195 | The default behavior of `arrange` is smallest to largest. If you do `arrange(touchdowns)`, you'll get a list of the fewest touchdowns first. `desc()` reverses that.
196 |
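As a quick sketch of the two sort directions, here is the same made-up table sorted both ways. The `toy` table and its `aces` column are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(name = c("A", "B", "C"), aces = c(12, 0, 305))

smallest_first <- toy |> arrange(aces)        # default: ascending order
largest_first  <- toy |> arrange(desc(aces))  # desc() flips the order

smallest_first
largest_first
```
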
197 |
198 | ```{r group-by-7, exercise=TRUE, exercise.setup = "load-data"}
199 | volleyballplayers |> arrange(____)
200 | ```
201 |
202 | ```{r group-by-7-solution, exercise.reveal_solution = FALSE}
203 | volleyballplayers |> arrange(kills)
204 | ```
205 |
206 | ```{r group-by-7-check}
207 | grade_this_code()
208 | ```
209 |
210 | ```{r killer, exercise=FALSE, exercise.eval=TRUE, exercise.setup = "load-data", results='asis'}
211 | glue("The kill leader in the NCAA? {killsleader$name} of {killsleader$team} with {killsleader$kills} kills. The lowest? A whole lot of people with 0.")
212 | ```
213 |
214 | That's a huge difference.
215 |
216 | So when choosing a measure of the middle, you have to ask yourself -- could I have extremes? Because a median won't be sensitive to extremes. It will be the point at which half the numbers are above and half are below. The average or mean will be a measure of the middle, but if you have a bunch of pine riders and then one iron superstar, the average could be wildly skewed.
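
A tiny sketch shows how far one extreme value can pull the mean away from the median. The numbers are made up: four bench players and one star.

```r
# Made-up point totals: four bench players and one star
points <- c(0, 0, 0, 0, 50)

mean(points)    # 10: pulled upward by the single large value
median(points)  # 0: the middle value ignores the extreme
```

Neither number is wrong. They answer different questions, which is why it pays to look at both before picking one.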
217 |
218 | We'll work more on that in the future.
219 |
220 | ### Exercise 4: Even more aggregates
221 |
222 | There's a ton of things we can do in summarize -- we'll work with more of them as the course progresses -- but here's a few other questions you can ask.
223 |
224 | Which position in the NCAA generates the most points? And what is the highest and lowest point total for that position? And how wide is the spread between points? We can find that with `sum` to add up the points to get the total points, `mean` to find the average, `min` to find the minimum points, `max` to find the maximum points and `sd` to find the standard deviation, which measures the spread.
225 |
226 | ```{r group-by-8, exercise=TRUE, exercise.setup = "load-data"}
227 | volleyballplayers |>
228 | group_by(____) |>
229 | summarise(
230 | total = sum(pts),
231 | avg_points = mean(pts),
232 | min_points = min(pts),
233 | max_points = max(pts),
234 | stdev_points = sd(pts)) |>
235 | arrange(desc(total))
236 | ```
237 |
238 | ```{r group-by-8-solution, exercise.reveal_solution = FALSE}
239 | volleyballplayers |>
240 | group_by(position) |>
241 | summarise(
242 | total = sum(pts),
243 | avg_points = mean(pts),
244 | min_points = min(pts),
245 | max_points = max(pts),
246 | stdev_points = sd(pts)) |>
247 | arrange(desc(total))
248 | ```
249 |
250 | ```{r group-by-8-check}
251 | grade_this_code()
252 | ```
253 |
254 | ::: {#group-by-8-hint}
255 | **Hint:** Breathe deep. Slow down. Think about it. What's missing here is stuff you've done already.
256 | :::
257 |
258 |
259 | **Common Mistake:**
260 | So many students of this class have struggled with the simple difference between a count (`n()`) and a sum (`sum()`). Let's pretend I have 5 players who have each scored 2 points. If I count them, I get 5. If I sum them, I get 10 -- 2+2+2+2+2 = 10.
261 |
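The five-player example above can be run directly. This is a minimal sketch with an invented `players` table, just to see `n()` and `sum()` side by side.

```r
library(dplyr)

# The five pretend players from the text above, 2 points each
players <- tibble(
  player = c("A", "B", "C", "D", "E"),
  pts = c(2, 2, 2, 2, 2)
)

totals <- players |>
  summarise(
    count = n(),      # how many rows (players) there are: 5
    total = sum(pts)  # the points added together: 10
  )

totals
```
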
262 |
263 | No surprise in our data analysis: outside hitters have the most points. Most of the stats here we can intuitively understand -- a minimum of zero means they probably didn't play at all. A large max means that player was good at volleyball. The one that you should give an extra check is the standard deviation. When your standard deviation is large, that's a sign you've got some huge points players and a bunch of bench players and you should consider which measure of the middle you're going to use.
264 |
265 | ## The Recap
266 |
267 | In data analysis, `group_by` and `summarize` are two of the most basic, but most common functions. With these functions, you can take every game played and turn it into season totals, team statistics and many others. You can take every pitch thrown and look at them by type. You can look at a volleyball team in conference and out of conference. In just two lessons, you've learned a *huge* amount of basic data analysis. You'll use this pattern *a lot*. What remains is creative application of what you've learned.
268 |
269 | ## Terms to Know
270 |
271 | - **mean()**: A function that calculates the arithmetic average of a set of numbers.
272 | - **median()**: A function that finds the middle value in a sorted set of numbers.
273 | - **sum()**: A function that adds up all the values in a specified column.
274 | - **arrange()**: A function used to order rows in a dataset based on values in specified columns.
275 | - **desc()**: A function used within arrange() to specify descending order for sorting.
276 |
--------------------------------------------------------------------------------
/inst/tutorials/04-aggregates2/images/IMG_1345.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/04-aggregates2/images/IMG_1345.png
--------------------------------------------------------------------------------
/inst/tutorials/04-aggregates2/images/IMG_1347.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/04-aggregates2/images/IMG_1347.png
--------------------------------------------------------------------------------
/inst/tutorials/05-mutating/05-mutating.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 5: Mutating data"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to make new columns of data.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(glue)
18 | library(tidyverse)
19 | knitr::opts_chunk$set(echo = FALSE)
20 | tutorial_options(exercise.completion=FALSE)
21 | ```
22 |
23 | ## The Goal
24 |
25 | Sometimes your data comes in fully baked. Some sports, some leagues, they provide just about everything you need. But a great deal of the time, you won't have all of the columns you need. Say you want to compare quarterbacks. You know how many touchdown passes they've thrown. You know how many interceptions they've thrown. But a common metric for quarterbacks is the touchdown/interception ratio, which simply divides the number of touchdowns by the number of interceptions. If that's a big number, your quarterback is very good. If it's a small number, your quarterback is terrible. To calculate that, we're going to introduce a new verb: `mutate()`
26 |
27 | ## The Basics
28 |
29 | One of the most common data analysis techniques is to create a new stat out of existing stats provided by a league or sport. More often than not, this is meant to level the playing field, so to speak. It's hard to compare players or teams who have played a different number of games, or appeared for a different number of minutes. Often what we need is a per game or per attempt metric.
30 |
31 | First we'll import the tidyverse so we can read in our data and begin to work with it.
32 |
33 | ```{r load-tidyverse, exercise=TRUE}
34 | library(tidyverse)
35 | ```
36 | ```{r load-tidyverse-solution}
37 | library(tidyverse)
38 | ```
39 | ```{r load-tidyverse-check}
40 | grade_this_code()
41 | ```
42 |
43 | We're going to look at last season's women's soccer data and ask a couple of questions. Your first task is to import the data. For this exercise, you simply need to run this:
44 |
45 | ```{r mutating-load-data, message=FALSE, warning=FALSE}
46 | soccer <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_soccer_teams_2024.csv")
47 |
48 | ruthless <- soccer |> mutate(
49 | gpg = goals/games
50 | ) |>
51 | arrange(desc(gpg)) |> slice(1)
52 |
53 | snipers <- soccer |>
54 | mutate(sogpct = (so_g/sh_att)*100
55 | ) |>
56 | arrange(desc(sogpct)) |> slice(1)
57 |
58 | maxgames <- soccer |> summarize(max_games = max(games))
59 |
60 | confsnipers <- soccer |>
61 | group_by(conference) |>
62 | summarize(
63 | total_sog = sum(so_g),
64 | total_shatt = sum(sh_att)
65 | ) |>
66 | mutate(sogpct = (total_sog/total_shatt)*100
67 | ) |>
68 | arrange(desc(sogpct)) |> slice(1)
69 | ```
70 | ```{r mutating-load-data-exercise, exercise = TRUE}
71 | soccer <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_soccer_teams_2024.csv")
72 | ```
73 | ```{r mutating-load-data-exercise-solution}
74 | soccer <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_soccer_teams_2024.csv")
75 | ```
76 | ```{r mutating-load-data-exercise-check}
77 | grade_this_code()
78 | ```
79 |
80 | Remember, if you want to see the first six rows -- handy to take a peek at your data -- you can use the function `head`.
81 |
82 | ```{r head-data, exercise=TRUE, exercise.setup = "mutating-load-data"}
83 | head(____)
84 | ```
85 | ```{r head-data-solution}
86 | head(soccer)
87 | ```
88 | ```{r head-data-check}
89 | grade_this_code()
90 | ```
91 |
92 | ### Exercise 1: Calculating goals per game
93 |
94 | The code to calculate anything with mutate is pretty simple. Remember, with `summarize`, we used `n()` to count things. With `mutate`, we use very similar syntax to calculate a new value using other values in our dataset. So in this case, **we're trying to do goals divided by games** -- goals per game -- but we're doing it with columns.
95 |
96 | If we look at what we got when we ran `head`, you'll see there's a `games` column and a `goals` column. Then, to help us, we'll use arrange again to sort it, so we get the most ruthless squad over one year. Similar to summarize, we'll want to give what we create a name using =.
97 |
98 |
99 | **Key Concept:**
100 | Similar to summarize, the pattern here is `nameofyournewcolumn = existingcolumn/existingcolumn` if you want to divide two columns. You can add them, subtract them, multiply them -- you can make any form of math you need this way.
101 |
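Here is a minimal sketch of that pattern using the touchdown/interception ratio from the start of this lesson. The quarterbacks and their numbers are invented for illustration.

```r
library(dplyr)

# Hypothetical quarterbacks; the names and numbers are invented
qbs <- tibble(
  player = c("QB One", "QB Two"),
  touchdowns = c(30, 12),
  interceptions = c(5, 12)
)

qbs_ratio <- qbs |>
  mutate(td_int_ratio = touchdowns / interceptions) |>  # new column = existing / existing
  arrange(desc(td_int_ratio))

qbs_ratio
```

Unlike `summarize`, `mutate` keeps every row and every original column. It just adds the new column on the end.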
102 |
103 | Replace the words in the blanks with the correct parts and **name the new column we're creating as gpg**.
104 |
105 | ```{r mutate-change, exercise=TRUE, exercise.setup = "mutating-load-data", message=FALSE}
106 | ____ |>
107 | mutate(____ = ____/____
108 | )
109 | ```
110 | ```{r mutate-change-solution, exercise.reveal_solution = FALSE}
111 | soccer |> mutate(
112 | gpg = goals/games
113 | )
114 | ```
115 | ```{r mutate-change-check}
116 | grade_this_code()
117 | ```
118 |
119 | Click the black arrow at the top right of the results box to get all the way to the end, where you'll see gpg.
120 |
121 | But does this tell us who the most ruthless team is? No. We need to arrange to do that.
122 |
123 | ```{r mutate-change2, exercise=TRUE, exercise.setup = "mutating-load-data", message=FALSE}
124 | ____ |>
125 | mutate(____ = ____/____
126 | ) |>
127 | arrange(desc(gpg))
128 | ```
129 | ```{r mutate-change2-solution, exercise.reveal_solution = FALSE}
130 | soccer |> mutate(
131 | gpg = goals/games
132 | ) |>
133 | arrange(desc(gpg))
134 | ```
135 | ```{r mutate-change2-check}
136 | grade_this_code()
137 | ```
138 |
139 | ```{r ruthless, exercise=FALSE, exercise.eval=TRUE, exercise.setup = "mutating-load-data", results='asis'}
140 | glue("So in that season, {ruthless$team} were the most ruthless team scoring about {format(ruthless$gpg, digits=2)} goals per game.")
141 | ```
142 |
143 | ### Exercise 2: Calculating a percentage
144 |
145 | Another common analysis task is converting two numbers into a percentage. How often do shooters make shots? You can look at made shots, but how then do you compare a player who has played a lot of games to one that hasn't played as many? You don't. That's why you create a percentage to compare the two.
146 |
147 | Sticking with our soccer data, which team was the most accurate when it comes to shots on goal? Who was pressuring the keeper the most? Sure, we could just arrange the shots on goal column and see who was at the top, but how then do you compare a team that shoots a lot with a team who doesn't but is ruthlessly efficient when they do?
148 |
149 |
150 | **Key Concept:**
151 | To calculate a percentage, we need to remember a formula: The smaller thing / the total number.
152 |
153 |
154 | The smaller thing is the outcome you're measuring. The total number is all the chances a team or a player had at that thing. For example: Completion percentage. How do you calculate that? You divide completions by the total number of passes thrown. But if you do that, you'll get a number less than 1. That's because it's not yet a percentage like you are accustomed to seeing. To get that, you have to multiply the result of your division by 100.
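
The completion percentage arithmetic can be sketched in a few lines. The 18 and 24 are made-up numbers, not from any dataset.

```r
# Made-up numbers: 18 completions on 24 passes thrown
completions <- 18
attempts <- 24

completions / attempts          # 0.75, a proportion less than 1
(completions / attempts) * 100  # 75, the percentage you're used to seeing
```
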
155 |
156 | To get shots on goal percentage, we're going to do something very similar to the first exercise, but adding in the multiplication. **The little thing in this case is `so_g` (short for Shots on Goal). The big thing is `sh_att` (short for Shot Attempts).**
157 |
158 | **Let's call our new column sogpct, short for Shot on Goal Percentage.**
159 |
160 | ```{r mutate-change-percent, exercise=TRUE, exercise.setup = "mutating-load-data", message=FALSE, warning=FALSE}
161 | soccer |>
162 | mutate(____ = (____/____)*____
163 | ) |>
164 | arrange(desc(____))
165 | ```
166 | ```{r mutate-change-percent-solution, exercise.reveal_solution = FALSE}
167 | soccer |>
168 | mutate(sogpct = (so_g/sh_att)*100
169 | ) |>
170 | arrange(desc(sogpct))
171 | ```
172 | ```{r mutate-change-percent-check}
173 | grade_this_code()
174 | ```
175 |
176 | ```{r accurate, exercise=FALSE, exercise.eval=TRUE, exercise.setup = "mutating-load-data", results='asis'}
177 | glue("Who is putting the most pressure on the keeper shot after shot? {snipers$team}, who are on goal {format(snipers$sogpct, digits=1)}% of the time.
178 |
179 | A note about this. Note that {snipers$team} played in {snipers$games} matches. The team that played the most played in {maxgames$max_games}. We'll learn how to deal with that in the next tutorial.")
180 | ```
181 |
182 | ### Exercise 3: Combining what we know
183 |
184 | With this data, we have every team and their stats. **But what if we wanted to know which is the most ruthless conference?** Here is an example of how you can use what you learned in the last tutorial with what you learned here.
185 |
186 | To take a dataset of every team and get each conference, we need to use group by again. Remember the package of Skittles and putting them in piles? What we're doing is putting teams into piles based on something you're going to have to fill in.
187 |
188 | Then, to calculate a percentage for the conference, we need to add up the two pieces we need in summarize *before* we mutate a new column. We're going to sum the shots on goal and the shot attempts. You've used those two columns before -- look at your previous work.
189 |
190 | Then we're going to create a shots on goal percentage -- call it sogpct. To make that percentage, you only have what you made in summarize to work with. What you made is what you named. What you named is always left of the =.
191 |
192 | ```{r mutate-change-percent-arrange, exercise=TRUE, exercise.setup = "mutating-load-data", message=FALSE, warning=FALSE}
193 | soccer |>
194 | group_by(____) |>
195 | summarize(
196 | total_sog = sum(____),
197 | total_shatt = sum(____)
198 | ) |>
199 | mutate(____ = (____/____)*___
200 | ) |>
201 | arrange(desc(____))
202 | ```
203 | ```{r mutate-change-percent-arrange-solution, exercise.reveal_solution = FALSE, warning=FALSE}
204 | soccer |>
205 | group_by(conference) |>
206 | summarize(
207 | total_sog = sum(so_g),
208 | total_shatt = sum(sh_att)
209 | ) |>
210 | mutate(sogpct = (total_sog/total_shatt)*100
211 | ) |>
212 | arrange(desc(sogpct))
213 | ```
214 | ```{r mutate-change-percent-arrange-check}
215 | grade_this_code()
216 | ```
217 |
218 | ```{r conf, exercise=FALSE, exercise.eval=TRUE, exercise.setup = "mutating-load-data", results='asis'}
219 | glue("What conference do you not want to be a keeper? The {confsnipers$conference} Conference, who are on goal {format(confsnipers$sogpct, digits=1)}% of the time.")
220 | ```
221 |
222 | ## The Recap
223 |
224 | In this lesson, we've explored the power of the mutate() function, a key tool in data analysis that allows us to create new variables based on existing ones. We learned how to calculate per-game statistics, such as goals per game, which helps level the playing field when comparing teams or players with different numbers of games played. We also delved into creating percentages, a crucial skill for comparing efficiencies across different sample sizes. By combining mutate() with other functions we've previously learned, like group_by(), summarize(), and arrange(), we've seen how we can perform more complex analyses, such as finding the most accurate shooting conference in women's soccer. These skills are fundamental in sports analytics, allowing us to derive meaningful insights that go beyond the basic statistics provided in raw data. Remember, the ability to create new, insightful metrics is often what separates good analysis from great analysis in the world of sports data.
225 |
226 | ## Terms to Know
227 |
228 | - **mutate()**: A function used to create new columns in a dataset based on existing columns.
229 | - **Percentage Calculation**: A formula used to express a part as a fraction of the whole, typically multiplied by 100 to get a percentage. Formula: (smaller number / total number) * 100.
230 | - **Per-Game Statistics**: Metrics calculated by dividing a cumulative statistic by the number of games played, allowing for fairer comparisons between players or teams with different numbers of games.
231 |
--------------------------------------------------------------------------------
/inst/tutorials/05-mutating/images/environment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/05-mutating/images/environment.png
--------------------------------------------------------------------------------
/inst/tutorials/07-transforming/07-transforming.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 7: Transforming Data"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to turn data on its side and back again.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | library(glue)
19 | knitr::opts_chunk$set(echo = FALSE)
20 | tutorial_options(exercise.completion=FALSE)
21 | ```
22 |
23 | ## The Goal
24 |
25 | The goal of this lesson is to introduce you to the concept of data reshaping, specifically focusing on transforming data between "long" and "wide" formats using the pivot_longer() and pivot_wider() functions from the tidyr package. You'll learn why data sometimes needs to be reshaped, how to recognize long and wide data formats, and when to use each transformation. By the end of this lesson, you'll be able to restructure datasets to facilitate different types of analyses and visualizations, a crucial skill in sports analytics where data often comes in various formats. You'll apply these techniques to NBA per-game statistics, demonstrating how data reshaping can help answer questions about team performance across seasons.
26 |
27 | ## The Basics
28 |
29 | Sometimes long data needs to be wide, and sometimes wide data needs to be long. I'll explain.
30 |
31 | You are soon going to discover that long before you can visualize data, **you need to have it in a form that the visualization library can deal with**. One of the ways that isn't immediately obvious is **how your data is cast**. Most of the data you will encounter will be **wide -- each row will represent a single entity with multiple measures for that entity**. So think of teams. The row of your dataset could have the team name, wins this season, wins last season, wins the season before that and so on.
32 |
33 | But what if your visualization library needs one row for each measure? So team, season and the wins. Nebraska, 2022, 4. That's one row. Then the next row is Nebraska, 2021, 3. That's the next row. That's where recasting your data comes in.
34 |
35 | We can use a library called `tidyr` to `pivot_longer` or `pivot_wider` the data, depending on what we need. We'll use a dataset of NBA per-game statistics. We're going to try and answer some questions about offensive improvement.
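
To picture the two shapes before loading the real data, here is a small sketch. The Nebraska rows come from the example above. The second team and its numbers are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Long: one row per team per season
# (Nebraska rows are from the text; "Team B" is invented)
long <- tibble(
  team = c("Nebraska", "Nebraska", "Team B", "Team B"),
  season = c(2022, 2021, 2022, 2021),
  wins = c(4, 3, 8, 9)
)

# Wide: one row per team, one column per season
wide <- long |> pivot_wider(names_from = season, values_from = wins)

wide
```

The long version has four rows. The wide version has two rows, one per team, with the seasons spread across the top as column names.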
36 |
37 | First we need some libraries.
38 |
39 | ```{r load-tidyverse, exercise=TRUE}
40 | library(tidyverse)
41 | ```
42 | ```{r load-tidyverse-solution}
43 | library(tidyverse)
44 | ```
45 | ```{r load-tidyverse-check}
46 | grade_this_code()
47 | ```
48 |
49 | Now we'll load the data.
50 |
51 | ```{r transforming-load-data, message=FALSE, warning=FALSE}
52 | nba <- read_csv("https://mattwaite.github.io/sportsdatafiles/nbalong.csv")
53 |
54 | nbawide <- nba |> pivot_wider(names_from = Season, values_from = PTS)
55 |
56 | tops <- nbawide |>
57 | mutate(Difference = `2023` - `2022`) |>
58 | arrange(desc(Difference)) |>
59 | slice(1)
60 | ```
61 | ```{r transforming-load-data-exercise, exercise = TRUE}
62 | nba <- read_csv("https://mattwaite.github.io/sportsdatafiles/nbalong.csv")
63 | ```
64 | ```{r transforming-load-data-exercise-solution}
65 | nba <- read_csv("https://mattwaite.github.io/sportsdatafiles/nbalong.csv")
66 | ```
67 | ```{r transforming-load-data-exercise-check}
68 | grade_this_code()
69 | ```
70 |
71 | As per usual, let's take a look at this with head.
72 |
73 | ```{r head-data, exercise=TRUE, exercise.setup = "transforming-load-data"}
74 | head(____)
75 | ```
76 | ```{r head-data-solution}
77 | head(nba)
78 | ```
79 | ```{r head-data-check}
80 | grade_this_code()
81 | ```
82 |
83 | As you can see, each row represents one season for one team. The problem we face is if we want to calculate change between years, we can't do that. For mutate to work, the columns have to be side by side. We can't subtract this year from last year if they're on two different *rows*. We can if they're in two different *columns*.
84 |
85 | ### Exercise 1: Making data wider
86 |
87 | To fix this, we use `pivot_wider` because we're making long data wide.
88 |
89 | Making wide data out of long data is relatively simple. `pivot_wider` is a function that takes two inputs -- `names_from` and `values_from`. How do we know which goes in which? When you have three fields like we do, it's pretty easy. The first thing you know is that it's not the name of the team or the player. We know, in the picture in our minds what this should look like, the name is the first thing. That leaves two other things -- the season and the offensive rating. If you think of what would look good as column names -- the headers across the top -- ask yourself, is it the season names -- 2019, 2020, etc. -- or is it **all** the possible combinations of offensive ratings. Go with the one that *isn't* going to create a million columns for the `names_from` bit.
90 |
91 | ```{r pivoting-wider, exercise=TRUE, exercise.setup = "transforming-load-data"}
92 | nba |> pivot_wider(names_from = ____, values_from = ____)
93 | ```
94 | ```{r pivoting-wider-solution, exercise.reveal_solution = FALSE}
95 | nba |> pivot_wider(names_from = Season, values_from = PTS)
96 | ```
97 | ```{r pivoting-wider-check}
98 | grade_this_code()
99 | ```
100 |
101 | You've gone from 270 rows to 30. Why? Because now each row is a team. There are 30 teams in the NBA. And one row has all of the seasons of data for that team, instead of there being nine rows for each team because there are nine seasons and each season for each team is a row.
102 |
103 | ### Exercise 2: Chaining commands to answer a question
104 |
105 | Now that we have wide data, we can answer this question: Which offense improved the most this season compared with the previous season?
106 |
107 | The first thing we will need to do is save our data into a new dataframe. We do that with a `<-`. We'll save it into a new dataframe called nbawide.
108 |
109 | Then, we're going to use mutate to subtract last season's points per game from this season's. **A quirk of dplyr: Because the column names now start with a number -- for example 2019 -- we have to put backticks around the seasons.** The backtick says "this is okay."
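
Here is a short sketch of why the backticks matter. The table is invented, with season-style column names like the ones `pivot_wider` produces.

```r
library(dplyr)

# Invented numbers; columns are named like seasons
wide <- tibble(
  team = c("Team A", "Team B"),
  `2019` = c(110, 105),
  `2020` = c(112, 111)
)

changed <- wide |>
  mutate(change = `2020` - `2019`)  # backticks: use the columns named 2020 and 2019

changed
```

Without the backticks, R would read `2020 - 2019` as plain arithmetic and put 1 in every row, instead of subtracting one column from the other.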
110 |
111 | Finally, we'll arrange to see who was tops.
112 |
113 | ```{r pivoting-mutate, exercise=TRUE, exercise.setup = "transforming-load-data"}
114 | nbawide <- nba |> pivot_wider(names_from = ____, values_from = ____)
115 |
116 | nbawide |>
117 | mutate(Difference = `____` - `____`) |>
118 | arrange(desc(____))
119 | ```
120 | ```{r pivoting-mutate-solution, exercise.reveal_solution = FALSE}
121 | nbawide <- nba |> pivot_wider(names_from = Season, values_from = PTS)
122 |
123 | nbawide |>
124 | mutate(Difference = `2023` - `2022`) |>
125 | arrange(desc(Difference))
126 | ```
127 | ```{r pivoting-mutate-check}
128 | grade_this_code()
129 | ```
130 |
131 | ```{r topoffense, exercise=FALSE, exercise.eval=TRUE, exercise.setup = "transforming-load-data", results='asis'}
132 | glue("It was the {tops$Team} who improved the most, getting {format(tops$Difference, digits=2)} points better in one season.")
133 | ```
134 |
135 | ### Exercise 3: Going back to long
136 |
137 | `pivot_longer` is a function much like `pivot_wider` in that it takes input. In this case, `pivot_longer` needs *three* things: which columns are getting pivoted, and what to name the two new columns they get pivoted into. The columns go in `cols`; the names go in `names_to` and `values_to`, because one is going to be a label and one is going to be a ... value.
138 |
139 | Given our data, we have Team and then a bunch of columns that are years. What we want to see is something like Team, Season, PTS, just like we had when we started.
140 |
141 | Why are we doing this? In the future, you are going to find the data you need and it is not going to be in the form you want it. You'll have to pivot it, one way or the other.
142 |
143 | The hardest part of pivoting longer is telling it which columns to pivot. You spell this out in `cols`. With simpler datasets, you can just say which ones. The columns we want to pivot are the ones with the numbers -- the year columns. There are some shortcuts to help us. Since all of the columns we want to make rows start with 20, we can use that pattern in our `cols` directive. Then we give that column a name -- Season -- and the values for each year need a name too. Those are the offensive ratings, but we're going to call them PTS. Replace the blanks here and you can see how it works.
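
As a sketch of how the three pieces fit together, here is the same idea on an invented table. The `year` and `wins` names are made up for this example.

```r
library(dplyr)
library(tidyr)

# Invented wide data: one row per team, one column per year
wide <- tibble(
  team = c("Team A", "Team B"),
  `2019` = c(3, 9),
  `2020` = c(4, 8)
)

long <- wide |>
  pivot_longer(
    cols = starts_with("20"),  # every column whose name begins with "20"
    names_to = "year",         # the old column names land in this column
    values_to = "wins"         # the old cell values land in this column
  )

long
```

One thing to watch: the old column names come back as text, so `year` here is a character column, not a number.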
144 |
145 | ```{r pivoting-longer, exercise=TRUE, exercise.setup = "transforming-load-data"}
146 | nbawide |> pivot_longer(cols = starts_with("20"), names_to = "____", values_to = "____")
147 | ```
148 | ```{r pivoting-longer-solution, exercise.reveal_solution = FALSE}
149 | nbawide |> pivot_longer(cols = starts_with("20"), names_to = "Season", values_to = "PTS")
150 | ```
151 | ```{r pivoting-longer-check}
152 | grade_this_code()
153 | ```
154 |
155 | We're back to 270 rows.
156 |
157 | ## The Recap
158 |
159 | In this lesson, we've explored the essential skill of data reshaping using the `pivot_wider()` and `pivot_longer()` functions from the tidyr package. We learned how to transform data from a long format, where each row represents a single observation, to a wide format, where each row contains multiple observations for a single entity. This transformation allowed us to perform calculations across seasons, such as comparing team performance year-over-year. We also practiced the reverse process, converting wide data back to long format, which is often necessary for certain types of analyses or visualizations. Through hands-on exercises with NBA data, you've seen how these transformations can help answer real-world sports analytics questions, like identifying the team with the most improved offense. Remember, the ability to reshape data is a powerful tool in your data analysis toolkit, allowing you to adapt your dataset to the specific requirements of your analysis or visualization tasks.
160 |
161 | ## Terms to Know
162 |
163 | - **pivot_wider()**: A function from the tidyr package used to transform data from long to wide format, spreading unique values from one column into multiple new columns.
164 | - **pivot_longer()**: A function from the tidyr package used to transform data from wide to long format, gathering multiple columns into key-value pairs.
165 | - **Long format**: A data structure where each row represents a single observation, and each column represents a variable. In long format, you often have repeated entries in identifier columns.
166 | - **Wide format**: A data structure where each row represents a single entity with multiple observations in separate columns. This format often has fewer rows but more columns than long format.
--------------------------------------------------------------------------------
/inst/tutorials/08-significancetests/08-significancetests.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 8: Significance tests"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to see if differences are meaningful.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | knitr::opts_chunk$set(echo = FALSE)
19 | tutorial_options(exercise.completion=FALSE)
20 | ```
21 |
22 | ## The Goal
23 |
24 | The goal of this lesson is to introduce you to significance testing, a crucial statistical tool for determining whether observed differences in data are meaningful or merely due to chance. You'll learn about the concept of hypothesis testing, including null hypotheses and p-values, and how to interpret these results. We'll focus on using R's t.test() function to compare means between two groups, applying this to real-world sports data from the NBA's "bubble" season during the COVID-19 pandemic. By the end of this lesson, you'll be able to formulate hypotheses about sports data, conduct significance tests, and draw evidence-based conclusions about whether factors like the presence of fans significantly impact game outcomes. This skill is essential for moving beyond simple descriptive statistics and making more robust, inferential claims in sports analytics.
25 |
26 | ## The Basics
27 |
28 | Now that we've worked with data a little, it's time to start asking more probing questions of our data. One of the most probing questions we can ask -- one that so few sports journalists ask -- is if the difference between this thing and the normal thing is real.
29 |
30 | We had a perfect natural experiment in sports to show how significance tests work. In 2020, to salvage its season and get to the playoffs, the NBA put its players in a bubble -- more accurately a hotel complex at Disney World in Orlando -- and had them play games without fans.
31 |
32 | So are the games different from other regular season games that had fans?
33 |
34 | To answer this, we need to understand that a significance test is a way to determine if two numbers are *significantly* different from each other. Generally speaking, we're asking if a subset of data -- a sample -- is different from the total data pool -- the population. Typically, this relies on data being in a normal distribution.
35 |
36 | ```{r, echo=FALSE}
37 | knitr::include_graphics(rep("images/simulations2.png"))
38 | ```
39 |
40 | If it is, then we know certain things about it. The mean -- the average -- will be a line right at the peak of cases. And about 68 percent of cases will be in that red area -- within one standard deviation of the mean.
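The share of cases inside one standard deviation can be checked with base R's `pnorm()`, which returns the proportion of a normal distribution that falls below a given number of standard deviations. This is a minimal sketch that assumes a perfectly normal distribution, which real sports data only approximate.

```r
# Share of a normal distribution within one standard deviation of the mean.
# pnorm(1) is the share below +1 SD; pnorm(-1) is the share below -1 SD.
within_one_sd <- pnorm(1) - pnorm(-1)

# Express it as a percentage, rounded to one decimal place.
round(within_one_sd * 100, 1)
```

The result is a little over two-thirds of all cases, which is why values far from the mean are treated as unusual.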
41 |
42 | A significance test will determine if a sample taken from that group is different from the total.
43 |
44 | Significance testing involves stating a hypothesis. In our case, our hypothesis is that there is a difference between bubble games without people and regular games with people.
45 |
46 | In statistics, the **null hypothesis** is the opposite of your hypothesis. In this case, that there is no difference between fans and no fans.
47 |
48 | What we're driving toward is a metric called a p-value, which is the probability that you'd get a result at least as extreme as your sample mean *if the null hypothesis is true.* So in our case, it's the probability we'd see a difference like the one we get if there was really no difference between fans and no fans. If that probability is below .05, then we consider the difference significant and we reject the null hypothesis.
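One way to get a feel for the .05 cutoff is to simulate a world where the null hypothesis is true by construction. The sketch below is illustrative only: the group names, the mean of 46 and the standard deviation of 5 are made-up assumptions, not values from the NBA data. It draws two groups from the same distribution 1,000 times and counts how often a t-test still comes back below .05.

```r
set.seed(42)  # make the simulation repeatable

# Hypothetical setup: both groups come from the SAME distribution,
# so any difference between them is pure chance.
pvalues <- replicate(1000, {
  fans   <- rnorm(50, mean = 46, sd = 5)
  nofans <- rnorm(50, mean = 46, sd = 5)
  t.test(fans, nofans)$p.value
})

# Share of simulated experiments that look "significant" anyway.
false_alarm_rate <- mean(pvalues < 0.05)
false_alarm_rate
```

The share should land near 5 percent. That is what the threshold buys you: when there is no real difference, you will be fooled roughly 1 time in 20.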
49 |
50 | So let's see. We'll need a log of every game in the covid NBA season. In this data, there's a field called COVID, which labels the game as a regular game or a bubble game.
51 |
52 | First we need some libraries.
53 |
54 | ```{r load-tidyverse, exercise=TRUE}
55 | library(tidyverse)
56 | ```
57 | ```{r load-tidyverse-solution}
58 | library(tidyverse)
59 | ```
60 | ```{r load-tidyverse-check}
61 | grade_this_code()
62 | ```
63 |
64 | And import the data.
65 |
66 | ```{r significance-load-data, message=FALSE, warning=FALSE}
67 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/nbabubble.csv")
68 | ```
69 | ```{r significance-load-data-exercise, exercise = TRUE}
70 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/nbabubble.csv")
71 | ```
72 | ```{r significance-load-data-exercise-solution}
73 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/nbabubble.csv")
74 | ```
75 | ```{r significance-load-data-exercise-check}
76 | grade_this_code()
77 | ```
78 |
79 | As per usual, let's take a look at this with head.
80 |
81 | ```{r head-data, exercise=TRUE, exercise.setup = "significance-load-data"}
82 | head(____)
83 | ```
84 | ```{r head-data-solution}
85 | head(logs)
86 | ```
87 | ```{r head-data-check}
88 | grade_this_code()
89 | ```
90 |
91 | ### Exercise 1: Let's test.
92 |
93 | First, let's just look at shooting. Here's a theory: fans make players nervous. The screaming makes players tense up, and tension makes for bad shooting. An alternative to this: screaming fans make you defend harder. So my hypothesis is that not only is the shooting different, it's lower with fans.
94 |
95 | Typically speaking, with significance tests, the process involves creating two different means and then running a bunch of formulas on them. R makes this easy by giving you a `t.test` function, which does all the work for you. What we have to tell it is what is the value we are testing (hint: we're looking at team shooting percentage), what is our grouping label field (hint: look at the last sentence right before the box where you download the data), and from what data (hint: look at the loading data step).
96 |
97 | ```{r ttest1, exercise=TRUE, exercise.setup = "significance-load-data"}
98 | t.test(____ ~ ____, data=____)
99 | ```
100 | ```{r ttest1-solution, exercise.reveal_solution = FALSE}
101 | t.test(TeamFGPCT ~ COVID, data=logs)
102 | ```
103 | ```{r ttest1-check}
104 | grade_this_code()
105 | ```
106 |
107 | Now let's talk about the output. I prefer to read these bottom up. So at the bottom, it says that the mean field goal percentage in an NBA game With Fans is 46.08 percent. The mean in games Without Fans is 46.45 percent. That means teams shot about a third of a percentage point BETTER without fans on average.
108 |
109 | But, some games people can't miss, some games players couldn't make a bucket if the hoop was a hula hoop. We learned that averages can be skewed by extremes. So the next thing we need to look at is the **p-value**. Remember, this is the probability that we'd get this sample mean -- the without fans mean -- if there was no difference between fans and no fans.
110 |
111 | The probability? 0.3894 or 38.9 percent. That means in almost 4 in 10 times, we WOULD get this answer.
112 |
113 | Remember, if the probability is below .05, then we determine that this number is statistically significant. We'll talk more about statistical significance soon, but in this case, statistical significance would mean that the data support our hypothesis: shooting percentages are different without fans than with.
114 |
115 | Except, our number is NOT lower than .05. It's considerably higher than that. That means the data do NOT support our hypothesis and we *fail to reject the null hypothesis.*
116 |
117 | We can safely say that professional basketball players shoot just as well with fans as without. So sit down and eat your popcorn.
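The bottom-up reading described above can also be done in code, because `t.test()` returns an object whose pieces you can pull out by name. This is a minimal sketch on made-up data: the `toy` dataframe, its `pct` and `group` columns and the simulated numbers are all hypothetical stand-ins, not the NBA logs.

```r
set.seed(8)  # make the made-up data repeatable

# Hypothetical toy data: 40 invented games per group.
toy <- data.frame(
  pct = c(rnorm(40, mean = 46, sd = 5), rnorm(40, mean = 46.4, sd = 5)),
  group = rep(c("With Fans", "Without Fans"), each = 40)
)

# Same formula interface as the exercise: value ~ grouping field.
result <- t.test(pct ~ group, data = toy)

result$estimate  # the two group means printed at the bottom of the output
result$p.value   # the probability to compare against .05

# The decision rule from the text, written out.
decision <- if (result$p.value < 0.05) {
  "reject the null hypothesis"
} else {
  "fail to reject the null hypothesis"
}
decision
```

Pulling out `p.value` this way is handy when you run many tests and don't want to read each printout by eye.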
118 |
119 | ## Rejecting the null hypothesis
120 |
121 | So what does it look like when your hypothesis is correct?
122 |
123 | Let's test another thing that may have been impacted by bubble games: the refs. It's almost a feature of sports: Yelling at the refs when they make a call, good or bad. Amazing how often the goodness or badness of the call depends on who you are rooting for.
124 |
125 | My hypothesis is that without fans to yell at refs, they weren't calling fouls like they were when the fans were there to ... ahem ... encourage them to do their jobs.
126 |
127 | ### Exercise 2: Do people affect the referees?
128 |
129 | First things first: We need to make a dataframe called refs where we'll add up the team and the opponent's personal fouls. This will tell us if the refs were more active with people or without. Will our t-test tell us it's significant?
130 |
131 | ```{r ttest2, exercise=TRUE, exercise.setup = "significance-load-data"}
132 | ____ <- logs |> mutate(totalfouls = ____ + ____)
133 |
134 | t.test(____ ~ ____, data=____)
135 | ```
136 | ```{r ttest2-solution, exercise.reveal_solution = FALSE}
137 | refs <- logs |> mutate(totalfouls = TeamPersonalFouls + OpponentPersonalFouls)
138 |
139 | t.test(totalfouls ~ COVID, data=refs)
140 | ```
141 | ```{r ttest2-check}
142 | grade_this_code()
143 | ```
144 |
145 | So again, start at the bottom. With Fans, refs called about 41 fouls per game. Without fans? More like 46. Without fans, the refs were calling MORE fouls.
146 |
147 | But is it significant?
148 |
149 | Look at the p-value. It's 2.247e-15. Is that less than .05? Depends on if you can remember how to read scientific notation. What that says is the p-value is 2.247 x 10 to the -15 power. Not remembering negative exponents? Take the decimal point and move it 15 places to the left. So point then 14 zeros then 2247 or .000000000000002247. Is that less than .05? Yes. Yes it is.
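If counting decimal places by hand feels error-prone, R can do the conversion and the comparison for you. A small sketch, using only the p-value quoted above:

```r
p <- 2.247e-15

# Print the value as a plain decimal with 18 digits after the point.
plain <- sprintf("%.18f", p)
plain

# The comparison that actually matters: is it below the .05 threshold?
p < 0.05
```

In practice you rarely need the long decimal. Letting R answer `p < 0.05` directly is enough.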
150 |
151 | That means we **reject the null hypothesis** that there is no difference between fans and no fans when it comes to the refs calling fouls in NBA games. They absolutely did call more fouls in the bubble without fans. Our hypothesis that the bubble affected refereeing is correct, but our supposition about it is not. I thought they'd back off. Instead, they called more.
152 |
153 | We're going to be analyzing these bubble games for *years* trying to find the true impact of fans.
154 |
155 | ## The Recap
156 |
157 | In this lesson, we've delved into the world of significance testing, a powerful tool in sports analytics for determining whether observed differences are statistically meaningful. We learned about formulating hypotheses, understanding null hypotheses, and interpreting p-values. Using the t.test() function in R, we analyzed real NBA data to investigate whether the absence of fans during the "bubble" season affected game outcomes. We discovered that while shooting percentages weren't significantly different, the number of fouls called by referees was significantly higher without fans present. This exercise demonstrated how statistical tests can challenge our assumptions and reveal unexpected insights. Remember, a p-value below 0.05 generally indicates a statistically significant difference, allowing us to reject the null hypothesis. However, it's crucial to interpret these results in context and consider practical significance alongside statistical significance.
158 |
159 | ## Terms to Know
160 |
161 | - Significance test: A statistical method used to determine if there is a meaningful difference between groups or if an observed effect is likely due to chance.
162 | - Hypothesis: A proposed explanation or prediction about a phenomenon, often used as the starting point for statistical testing.
163 | - Null hypothesis: The assumption that there is no significant difference or effect, which is tested against the alternative hypothesis.
164 | - p-value: The probability of obtaining results at least as extreme as the observed results, assuming that the null hypothesis is true. Generally, a p-value less than 0.05 is considered statistically significant.
165 | - t-test: A statistical test used to compare the means of two groups and determine if they are significantly different from each other.
166 | - Statistical significance: The likelihood that a relationship between two or more variables is caused by something other than random chance. Typically indicated by a p-value less than 0.05.
--------------------------------------------------------------------------------
/inst/tutorials/08-significancetests/images/simulations2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/08-significancetests/images/simulations2.png
--------------------------------------------------------------------------------
/inst/tutorials/09-regressions/09-regressions_files/figure-html/correlationchart-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/09-regressions/09-regressions_files/figure-html/correlationchart-1.png
--------------------------------------------------------------------------------
/inst/tutorials/10-residuals/10-residuals_files/figure-html/correlationchart-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/10-residuals/10-residuals_files/figure-html/correlationchart-1.png
--------------------------------------------------------------------------------
/inst/tutorials/10-residuals/10-residuals_files/figure-html/correlationchart2-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/10-residuals/10-residuals_files/figure-html/correlationchart2-1.png
--------------------------------------------------------------------------------
/inst/tutorials/10-residuals/10-residuals_files/figure-html/correlationchart3-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/10-residuals/10-residuals_files/figure-html/correlationchart3-1.png
--------------------------------------------------------------------------------
/inst/tutorials/10-residuals/10-residuals_files/figure-html/correlationchart4-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/10-residuals/10-residuals_files/figure-html/correlationchart4-1.png
--------------------------------------------------------------------------------
/inst/tutorials/11-z-scores/11-z-scores.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 11: Z-Scores"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to compare teams and players across leagues and time.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | library(glue)
19 | knitr::opts_chunk$set(echo = FALSE)
20 | tutorial_options(exercise.completion=FALSE)
21 | ```
22 |
23 | ## The Goal
24 |
25 | The goal of this lesson is to introduce you to z-scores, a powerful statistical tool for standardizing and comparing data across different groups or time periods. You'll learn what z-scores are, how they're calculated, and why they're particularly useful in sports analytics. We'll focus on using R to calculate z-scores for volleyball team performance, allowing us to compare teams across different seasons and conferences. By the end of this lesson, you'll be able to use z-scores to create composite measures of performance, rank teams relative to their peers, and make meaningful comparisons that account for differences in competition levels or eras. You'll also learn how to interpret z-scores and communicate their meaning effectively to a non-technical audience. This skill will enable you to conduct more sophisticated analyses of team and player performance, helping you identify truly standout performances in their proper context.
26 |
27 | ## The Basics
28 |
29 | Z-scores are a handy way to standardize numbers so you can compare things across groupings or time. In this class, we may want to compare teams by year, or era. We can use z-scores to answer questions like who was the greatest X of all time, because a z-score can put them in context to their era.
30 |
31 | A z-score is a measure of how far a particular stat is from the mean. It's measured in standard deviations from that mean. A standard deviation is a measure of how much variation -- how spread out -- numbers are in a data set. What it means here, with regards to z-scores, is that zero is perfectly average. If it's 1, it's one standard deviation above the mean, and 34 percent of all cases are between 0 and 1.
32 |
33 | ```{r, echo=FALSE}
34 | knitr::include_graphics(rep("images/simulations2.png"))
35 | ```
36 |
37 | If you think of the normal distribution, it means that 84.1 percent of all cases are below that 1. If it were -1, it would mean the number is one standard deviation below the mean, and 84.1 percent of cases would be above that -1. So if you have numbers with z-scores of 3 or even 4, that means that number is waaaaaay above the mean.
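These percentages come straight from the normal distribution, and base R's `pnorm()` will reproduce them. The sketch assumes a perfectly normal distribution, and the variable names are just labels chosen for this example.

```r
below_plus_one       <- pnorm(1)                       # share of cases below z = 1
above_minus_one      <- pnorm(-1, lower.tail = FALSE)  # share of cases above z = -1
between_zero_and_one <- pnorm(1) - pnorm(0)            # share between the mean and z = 1

# Show all three as percentages, rounded to one decimal place.
shares <- round(c(below_plus_one, above_minus_one, between_zero_and_one) * 100, 1)
shares
```

The first two values match because the normal distribution is symmetric around the mean.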
38 |
39 | So let's use the remarkable career of Nebraska volleyball legend and CoJMC alumna Lexi Rodriguez, who, if you weren't paying attention, was very good at volleyball. She finished her career by breaking Nebraska's all-time digs record, overtaking another Nebraska legend, Justine Wong-Orantes. In her senior season, Rodriguez was up for player of the year awards. But was it her best season as a Husker?
40 |
41 | ### Calculating a Z score in R
42 |
43 | For this we'll need the season totals for the last four seasons. First, load the tidyverse.
44 |
45 | ```{r load-tidyverse, exercise=TRUE}
46 | library(tidyverse)
47 | ```
48 | ```{r load-tidyverse-solution}
49 | library(tidyverse)
50 | ```
51 | ```{r load-tidyverse-check}
52 | grade_this_code()
53 | ```
54 |
55 | And load the data.
56 |
57 | ```{r zscore-load-data, message=FALSE, warning=FALSE, results=FALSE}
58 | seasons <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_volleyball_players_2124.csv")
59 |
60 | perset <- seasons |>
61 | mutate(
62 | digsperset = digs/s,
63 | acesperset = aces/s
64 | ) |>
65 | select(season, team, conference, name, digsperset, acesperset)
66 |
67 | playerzscore <- perset |>
68 | group_by(season) |>
69 | mutate(
70 | aceszscore = as.numeric(scale(acesperset, center = TRUE, scale = TRUE)),
71 | digszscore = as.numeric(scale(digsperset, center = TRUE, scale = TRUE)),
72 | totalzscore = aceszscore + digszscore
73 | ) |> ungroup()
74 | ```
75 | ```{r zscore-load-data-exercise, exercise = TRUE}
76 | seasons <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_volleyball_players_2124.csv")
77 | ```
78 | ```{r zscore-load-data-exercise-solution}
79 | seasons <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_volleyball_players_2124.csv")
80 | ```
81 | ```{r zscore-load-data-exercise-check}
82 | grade_this_code()
83 | ```
84 |
85 | Let's `glimpse` the data and see what we've got.
86 |
87 | ```{r glimpse-data, exercise=TRUE, exercise.setup = "zscore-load-data"}
88 | glimpse(seasons)
89 | ```
90 | ```{r glimpse-data-solution}
91 | glimpse(seasons)
92 | ```
93 | ```{r glimpse-data-check}
94 | grade_this_code()
95 | ```
96 |
97 | ### Exercise 1: Preparing data
98 |
99 | The first thing we need to do is create some metrics that let us compare seasons in which players played different amounts. Then we'll select some fields we think represent a good libero and a few things to help us keep things straight.
100 |
101 | Every player plays a different number of sets in a season, and that has to do with how good you are and how good your team is. In volleyball, being on a really good team can mean playing *fewer* sets in a season, because you're just sweeping everyone. But then you get the NCAA tournament, and get more matches. So it's just a mess. To be able to work with this, we need to standardize around *per set* metrics. Sets in our data are marked as s.
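To see why raw totals can mislead, here is a small sketch with two invented players. The names and numbers are hypothetical. Only the column names `digs` and `s` mirror the real data.

```r
# Hypothetical season totals for two made-up players.
players <- data.frame(
  name = c("Player A", "Player B"),
  digs = c(400, 350),
  s    = c(130, 100)   # sets played
)

# Per-set rate: total digs divided by sets played.
players$digsperset <- players$digs / players$s
players
```

Player A has more total digs, but Player B gets more digs per set. The per-set number is the fairer comparison when playing time differs.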
102 |
103 | To get that, we're going to use mutate to create `digsperset` and `acesperset`. Then, we're going to use select to get season, team, conference, name and our metrics.
104 |
105 | ```{r zscore2, exercise=TRUE, exercise.setup = "zscore-load-data"}
106 | perset <- seasons |>
107 | mutate(
108 | digsperset = ____/____,
109 | acesperset = ____/____
110 | ) |>
111 | select(season, team, conference, name, ____, ____)
112 | ```
113 | ```{r zscore2-solution, exercise.reveal_solution = FALSE}
114 | perset <- seasons |>
115 | mutate(
116 | digsperset = digs/s,
117 | acesperset = aces/s
118 | ) |>
119 | select(season, team, conference, name, digsperset, acesperset)
120 | ```
121 | ```{r zscore2-check}
122 | grade_this_code()
123 | ```
124 |
125 | ### Exercise 2: Create a z-score
126 |
127 | The first thing we're going to do is deal with having multiple seasons. We want the z-score to be centered on the average for *that season*, so each Husker gets ranked against their competition, not all players in all seasons. That way, we compare how good 2021 Lexi was compared to 2024 Lexi. Lexi from 2021 can't play a season of matches from 2024 or 2023 or 2022, so we can only compare her to her competition. How do we make z-scores for each season? Simple: group by.
128 |
129 | Then, we're *NOT* going to use summarize. We'll use mutate. That has the effect of putting all the 2021 players together in a pile, calculating each player's z-score and then just leaving it there instead of folding it all together into a single number (like summarize would).
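The difference between summarize and mutate after a group by is easiest to see on a tiny example. This sketch uses a made-up `toy` dataframe with invented values. Only the column names `season` and `digsperset` echo the lesson's data.

```r
library(dplyr)

# Hypothetical data: three invented players in each of two seasons.
toy <- data.frame(
  season     = c(2021, 2021, 2021, 2022, 2022, 2022),
  digsperset = c(2, 3, 4, 4, 5, 6)
)

# summarize folds each season down to one row.
folded <- toy |>
  group_by(season) |>
  summarize(mean_digs = mean(digsperset))

# mutate keeps every row and scores it against its own season.
scored <- toy |>
  group_by(season) |>
  mutate(digszscore = as.numeric(scale(digsperset))) |>
  ungroup()

nrow(folded)
scored
```

Notice that 4 digs per set is the best mark in the made-up 2021 group but the worst in 2022, so the same raw number earns opposite z-scores. That is the whole point of grouping by season.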
130 |
131 | To calculate a z-score in R, the easiest way is to use the `scale` function in base R. To use it, you use `scale(FieldName, center=TRUE, scale=TRUE)`. The center and scale indicate if you want to subtract the mean and if you want to divide by the standard deviation, respectively. We do.
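To confirm what `scale` is doing, you can compute a z-score by hand and compare. This sketch uses an invented vector of numbers; nothing here comes from the volleyball data.

```r
# Hypothetical per-set values for five made-up players.
x <- c(2.1, 3.4, 4.0, 4.6, 5.9)

# By hand: subtract the mean, then divide by the standard deviation.
by_hand <- (x - mean(x)) / sd(x)

# With scale(). It returns a one-column matrix, so as.numeric() flattens it
# back into a plain vector that fits neatly inside mutate.
with_scale <- as.numeric(scale(x, center = TRUE, scale = TRUE))

all.equal(by_hand, with_scale)
```

The two approaches agree, which is why the lesson wraps `scale` in `as.numeric` rather than writing out the formula each time.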
132 |
133 | When we have multiple z-scores, it's pretty standard practice to add them together into a composite score. That's what we're doing at the end here with `totalzscore`.
134 |
135 | ```{r zscore3, exercise=TRUE, exercise.setup = "zscore-load-data"}
136 | playerzscore <- perset |>
137 | group_by(____) |>
138 | mutate(
139 | aceszscore = as.numeric(scale(____perset, center = TRUE, scale = TRUE)),
140 | digszscore = as.numeric(scale(____perset, center = TRUE, scale = TRUE)),
141 | totalzscore = ____zscore + ____zscore
142 | ) |> ungroup()
143 | ```
144 | ```{r zscore3-solution, exercise.reveal_solution = FALSE}
145 | playerzscore <- perset |>
146 | group_by(season) |>
147 | mutate(
148 | aceszscore = as.numeric(scale(acesperset, center = TRUE, scale = TRUE)),
149 | digszscore = as.numeric(scale(digsperset, center = TRUE, scale = TRUE)),
150 | totalzscore = aceszscore + digszscore
151 | ) |> ungroup()
152 | ```
153 | ```{r zscore3-check}
154 | grade_this_code()
155 | ```
156 |
157 | So now we have a dataframe called `playerzscore` that has all players from the 2021-2024 seasons with Z scores. What does it look like?
158 |
159 | ```{r head-data, exercise=TRUE, exercise.setup = "zscore-load-data"}
160 | head(playerzscore)
161 | ```
162 | ```{r head-data-solution}
163 | head(playerzscore)
164 | ```
165 | ```{r head-data-check}
166 | grade_this_code()
167 | ```
168 |
169 | A way to interpret this data -- a player at zero is precisely average. The larger the positive number, the more exceptional they are. The larger the negative number, the more truly terrible they are (relative to all DI talent, which is orders of magnitude better than any of us).
170 |
171 | ### Exercise 3: Arrange to the rescue
172 |
173 | So who were the best players in the country over this time? Arrange can tell us. We just made a composite z-score -- a total if you will. Let's use that in arrange to see who is the best by our measure.
174 |
175 | ```{r zscore4, exercise=TRUE, exercise.setup = "zscore-load-data"}
176 | playerzscore |> arrange(desc(____))
177 | ```
178 | ```{r zscore4-solution, exercise.reveal_solution = FALSE}
179 | playerzscore |> arrange(desc(totalzscore))
180 | ```
181 | ```{r zscore4-check}
182 | grade_this_code()
183 | ```
184 |
185 | And the greatest libero in the last four years is ... wait what?
186 |
187 | So we have some work to do here if this is the answer we're getting, but what if we look a little closer?
188 |
189 | ### Exercise 4: Filtering to see Lexi
190 |
191 | Obviously, competition matters, and we could do some filtering to get only Power Four teams in here and do it again. But we set out to ask which was Lexi's best season. So let's filter for it.
192 |
193 | ```{r zscore5, exercise=TRUE, exercise.setup = "zscore-load-data"}
194 | playerzscore |>
195 | filter(name == "____") |>
196 | arrange(desc(____)) |>
197 | select(season, name, totalzscore)
198 | ```
199 | ```{r zscore5-solution, exercise.reveal_solution = FALSE}
200 | playerzscore |>
201 | filter(name == "Lexi Rodriguez") |>
202 | arrange(desc(totalzscore)) |>
203 | select(season, name, totalzscore)
204 | ```
205 | ```{r zscore5-check}
206 | grade_this_code()
207 | ```
208 |
209 | So, as we can see, with our composite Z Score, the best Lexi isn't this year. It isn't even last year. It was 2022.
210 |
211 | In 2024, breaking records and on the doorstep of Player of the Year, Lexi turned in her third best season, relative to competition. Huh.
212 |
213 | ## Writing about z-scores
214 |
215 | The great thing about z-scores is that they make it very easy for you, the sports analyst, to create your own measures of who is better than who. The downside: Only a small handful of sports fans know what the hell a z-score is.
216 |
217 | As such, you should try as hard as you can to avoid writing about them.
218 |
219 | If the word z-score appears in your story or in a chart, you need to explain what it is. "The ranking uses a statistical measure of the distance from the mean called a z-score" is a good way to go about it. You don't need a full stats textbook definition, just a quick explanation. And keep it simple.
220 |
221 | **Never use z-score in a headline.** Write around it. Away from it. Z-score in a headline is attention repellent. You won't get anyone to look at it. So "Tottenham tops in z-score" bad, "Tottenham tops in the Premier League" good.
222 |
223 | ## The Recap
224 |
225 | In this lesson, we've explored the concept of z-scores and their application in sports analytics. We learned that z-scores provide a standardized measure of how far a data point is from the mean, expressed in terms of standard deviations. This allows us to compare performances across different seasons, leagues, or eras. We practiced calculating z-scores in R using the scale() function and applied this to volleyball player statistics. We created composite z-scores to rank players based on multiple performance metrics, demonstrating how this technique can provide insights that raw statistics alone might miss. However, we also encountered some unexpected results, highlighting the importance of critically evaluating our measures and considering additional context. We discussed the challenges of communicating z-scores to a general audience and strategies for presenting these insights effectively. Remember, while z-scores are a powerful tool for standardizing comparisons, they should be used in conjunction with other analytical methods and domain knowledge for comprehensive sports analysis.
226 |
227 | ## Terms to Know
228 |
229 | - Z-score: A statistical measure that indicates how many standard deviations an element is from the mean.
230 | - Standard Deviation: A measure of the amount of variation or dispersion of a set of values.
231 | - Normal Distribution: A probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean.
232 | - scale() function: An R function used to compute z-scores by centering and scaling numeric variables.
233 | - Composite Z-score: A combined measure created by adding multiple individual z-scores together.
--------------------------------------------------------------------------------
/inst/tutorials/11-z-scores/images/simulations2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/11-z-scores/images/simulations2.png
--------------------------------------------------------------------------------
/inst/tutorials/12-usingpackages/12-usingpackages.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 12: Using packages to get data"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to use a library to import data instead of a csv.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | library(wehoop)
19 | knitr::opts_chunk$set(echo = FALSE)
20 | tutorial_options(exercise.completion=FALSE)
21 | ```
22 |
23 | ## The Goal
24 |
25 | The goal of this lesson is to introduce you to using specialized R packages to directly import sports data into your analysis workflow. You'll learn how to leverage libraries like those in the SportsDataverse to access rich datasets without needing to manually download CSV files or scrape websites. By the end of this lesson, you'll be able to use the wehoop package to import women's basketball data, including both box scores and play-by-play information. You'll practice filtering, grouping, and analyzing this data to gain insights into team performance and game dynamics. This skill will significantly expand your ability to access and work with diverse sports datasets, allowing you to conduct more sophisticated and up-to-date analyses.
26 |
27 | ## The Basics
28 |
29 | There is a growing number of packages and repositories of sports data, largely because there's a growing number of people who want to analyze that data. We've done it ourselves with simple Google Sheets tricks. Then there's rvest, a package for scraping the data yourself from websites, which you'll do later. But with these packages, someone has done the work of gathering the data for you. All you have to learn are the commands to get it.
30 |
31 | One very promising collection of libraries is something called the [SportsDataverse](https://sportsdataverse.org/), which has a collection of packages covering specific sports, all of which are in various stages of development. Some are more complete than others, but they are all being actively worked on by developers. Packages of interest in this class are:
32 |
33 | * [cfbfastR, for college football](https://saiemgilani.github.io/cfbfastR/).
34 | * [hoopR, for men's professional and college basketball](https://saiemgilani.github.io/hoopR/).
35 | * [wehoop, for women's professional and college basketball](https://saiemgilani.github.io/wehoop/).
36 | * [baseballr, for professional and college baseball](https://billpetti.github.io/baseballr/).
37 | * [worldfootballR, for soccer data from around the world](https://jaseziv.github.io/worldfootballR/).
38 | * [hockeyR, for NHL hockey data](https://hockeyr.netlify.app/).
39 | * [recruitR, for college sports recruiting](https://saiemgilani.github.io/recruitR/).
40 |
41 | Not part of the SportsDataverse, but in the same neighborhood, is [nflfastR](https://www.nflfastr.com/), which can provide NFL play-by-play data.
42 | Because they're all under development, not all of them can be installed with just a simple `install.packages("something")`. Some require a little work, some require API keys or paid accounts with certain providers.
43 |
44 | The main issue for you is to read the documentation carefully.
45 |
46 | ## Using wehoop
47 |
48 | wehoop gives us a good view into what using libraries like this is all about. And, as of this writing, Nebraska's women's basketball team is having a helluva season.
49 |
50 | First things first, we need to load our libraries.
51 |
52 | ```{r load-tidyverse, exercise=TRUE}
53 | library(tidyverse)
54 | library(wehoop)
55 | ```
56 | ```{r load-tidyverse-solution}
57 | library(tidyverse)
58 | library(wehoop)
59 | ```
60 | ```{r load-tidyverse-check}
61 | grade_this_code()
62 | ```
63 |
64 | Normally, you would be reading the documentation for wehoop or any of the libraries in the SportsDataverse, but some of them are a little thin (and hey, they're free, so I'm not complaining). The best part of [the documentation](https://wehoop.sportsdataverse.org/index.html) for wehoop is in the functional reference, which is a road map to everything it can do. One of the more interesting things to me with wehoop, cfbfastR and hoopR is play-by-play data. But that data gets very large, very fast. Let's start with something smaller, like player box scores.
65 |
66 | Here's how to get player box scores for the most recent season (which are numbered by the year they end in).
67 |
68 | ```{r packages-load-data, message=FALSE, warning=FALSE, results=FALSE}
69 | playerboxscores <- load_wbb_player_box(seasons = 2025)
70 | ```
71 | ```{r packages-load-data-exercise, exercise = TRUE}
72 | playerboxscores <- load_wbb_player_box(seasons = 2025)
73 | ```
74 | ```{r packages-load-data-exercise-solution}
75 | playerboxscores <- load_wbb_player_box(seasons = 2025)
76 | ```
77 | ```{r packages-load-data-exercise-check}
78 | grade_this_code()
79 | ```
80 |
81 | Let's `glimpse` the data because it's very wide and has a lot of columns.
82 |
83 | ```{r glimpse-data, exercise=TRUE, exercise.setup = "packages-load-data"}
84 | glimpse(playerboxscores)
85 | ```
86 | ```{r glimpse-data-solution}
87 | glimpse(playerboxscores)
88 | ```
89 | ```{r glimpse-data-check}
90 | grade_this_code()
91 | ```
92 |
93 | Without noticing it, you've already seen the magic. You ran that one line of code, and now you have data, just like before. You didn't have to download a CSV. You didn't have to get it yourself. You just have data. The code you ran pulled the data from a remote resource, and now, like all the other data we've used all along, you can answer questions with it.
94 |
95 | ### Exercise 1: Who has started the most for the Huskers?
96 |
97 | If you look at the data, there is one field that is a TRUE/FALSE field of who started the game. Does Amy Williams tinker with her lineups?
98 |
99 | First, since we have box scores for every team, we need to filter to just Nebraska. Then, we're going to use the & sign -- which is a way of saying AND. The filter is the team AND the starting status being true. Then we'll group by the player's name (look at glimpse) and count. This time, we're going to use a shortcut for counting called `tally`.
100 |
101 | ```{r starters, exercise=TRUE, exercise.setup = "packages-load-data"}
102 | playerboxscores |>
103 | filter(team_short_display_name == "____" & starter == ____) |>
104 | group_by(____) |>
105 | tally(sort=TRUE)
106 | ```
107 | ```{r starters-solution, exercise.reveal_solution = FALSE}
108 | playerboxscores |>
109 | filter(team_short_display_name == "Nebraska" & starter == TRUE) |>
110 | group_by(athlete_display_name) |>
111 | tally(sort=TRUE)
112 | ```
113 | ```{r starters-check}
114 | grade_this_code()
115 | ```
116 |
117 | At least at the time of this writing, Coach Williams seems to be looking for a few pieces.
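For reference, `tally()` after `group_by()` is just shorthand for counting the rows in each group. Here is a minimal sketch on a made-up data frame; the `starts` data and the player labels are invented for illustration and are not wehoop output.

```r
library(dplyr)

# Hypothetical data: one row per game a player started
starts <- data.frame(player = c("A", "B", "A", "C", "A", "B"))

# group_by() + tally() counts rows per group; sort = TRUE puts the biggest first
by_tally <- starts |>
  group_by(player) |>
  tally(sort = TRUE)

# count() does the grouping and the counting in one step
by_count <- starts |>
  count(player, sort = TRUE)

by_tally
by_count
```

Both versions produce a column called `n` holding the count for each player, so you can use whichever reads better to you.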
118 |
119 | ## Another example
120 |
121 | The wehoop package also has play by play data in it. A warning -- this step will take a beat or two to finish. It's going to pull in hundreds of thousands of plays.
122 |
123 | ### Exercise 2: Get play-by-play data
124 |
125 | The `load_wbb_pbp` function works very similarly to the box scores. You're going to provide the season you want. We want this season.
126 |
127 | ```{r packages2-load-data, message=FALSE, warning=FALSE, results=FALSE}
128 | playbyplay <- load_wbb_pbp(2025)
129 | ```
130 | ```{r packages2-load-data-exercise, exercise = TRUE}
131 | playbyplay <- load_wbb_pbp(____)
132 | ```
133 | ```{r packages2-load-data-exercise-solution, exercise.reveal_solution = FALSE}
134 | playbyplay <- load_wbb_pbp(2025)
135 | ```
136 | ```{r packages2-load-data-exercise-check}
137 | grade_this_code()
138 | ```
139 |
140 | And you'll want to glimpse it too.
141 |
142 | ```{r glimpse2-data, exercise=TRUE, exercise.setup = "packages2-load-data"}
143 | glimpse(playbyplay)
144 | ```
145 | ```{r glimpse2-data-solution}
146 | glimpse(playbyplay)
147 | ```
148 | ```{r glimpse2-data-check}
149 | grade_this_code()
150 | ```
151 |
152 | There are a *lot* of questions you can ask and answer with play-by-play data if you have some creativity. You can filter down to the last two minutes of a game and see who is getting the ball. You can look at who makes the plays when the score is close. There's a lot we can do.
153 |
154 | For now, let's try something simple -- what's the most common shooting play for Nebraska this season?
155 |
156 | ### Exercise 3: Shot's in the air, who is doing what?
157 |
158 | Like the box scores, we're going to start with a filter. We're going to use both & and |. The | means OR. So we want shooting plays AND where the home team OR the away team is Nebraska. Because R evaluates & before |, we wrap the OR part in parentheses so the shooting-play condition applies to both home and away games. After that, it's group by and tally.
159 |
160 | ```{r playbyplay, exercise=TRUE, exercise.setup = "packages2-load-data"}
161 | ____ |>
162 | filter(shooting_play == ____ & (home_team_name == "____" | away_team_name == "____")) |>
163 | group_by(text) |>
164 | tally(sort=TRUE)
165 | ```
166 | ```{r playbyplay-solution, exercise.reveal_solution = FALSE}
167 | playbyplay |>
168 | filter(shooting_play == TRUE & (home_team_name == "Nebraska" | away_team_name == "Nebraska")) |>
169 | group_by(text) |>
170 | tally(sort=TRUE)
171 | ```
172 | ```{r playbyplay-check}
173 | grade_this_code()
174 | ```
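A note on combining & and | in one filter: R evaluates & before |, so where you put parentheses changes which rows survive. The sketch below uses three invented vectors (`shooting`, `home`, `away`) standing in for play-by-play columns; they are assumptions for illustration, not real wehoop data.

```r
# Hypothetical stand-ins for four plays
shooting <- c(TRUE, FALSE, TRUE, FALSE)
home <- c("Nebraska", "Iowa", "Iowa", "Iowa")
away <- c("Iowa", "Nebraska", "Nebraska", "Ohio State")

# Without parentheses this reads as (shooting AND home) OR away,
# so every play from a Nebraska away game passes, shooting play or not
without_parens <- shooting & home == "Nebraska" | away == "Nebraska"

# With parentheses, the shooting condition applies to home and away games alike
with_parens <- shooting & (home == "Nebraska" | away == "Nebraska")

without_parens
with_parens
```

The second play is a non-shooting play from a Nebraska away game. It slips through in the first version and is dropped in the second.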
175 |
176 | The most common plays -- missed shots and made free throws. But if you look at basketball stats, that makes sense. There's a lot more defensive rebounds than offensive. There's a lot more missed threes than made ones, typically. But as Nebraska goes further on this run, you now have the ability to get very granular data to analyze how Amy Williams and the team are doing it.
177 |
178 | ## The Recap
179 |
180 | In this lesson, we've explored the power of specialized R packages for accessing sports data, focusing on the wehoop package for women's basketball. We learned how to load player box scores and play-by-play data directly into our R environment without the need for manual data downloads. We practiced filtering this data to focus on specific teams, grouping by various factors, and using functions like tally() to quickly summarize information. Through exercises analyzing Nebraska's starting lineups and most common shooting plays, we've seen how these packages can provide rich, detailed data for in-depth sports analysis. Remember, while these packages greatly simplify data access, it's crucial to read their documentation carefully as they may have specific requirements or limitations. This approach to data import opens up new possibilities for timely, comprehensive sports analytics across various leagues and sports.
181 |
182 | ## Terms to Know
183 |
184 | - SportsDataverse: A collection of R packages designed for accessing and analyzing sports data across various leagues and sports.
185 | - wehoop: An R package specifically for accessing women's basketball data, including player box scores and play-by-play information.
186 | - tally(): A dplyr function that provides a shortcut for counting observations within groups, often used with group_by() for quick summaries.
187 | - API (Application Programming Interface): A set of protocols and tools that allow different software applications to communicate with each other, often used by sports data packages to retrieve data.
188 | - Documentation: Written materials that explain how to use a software package, including function descriptions, parameters, and usage examples, crucial for effectively utilizing sports data packages.
189 |
--------------------------------------------------------------------------------
/inst/tutorials/13-bar-charts/13-bar-charts.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 13: Bar charts"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to start turning data into graphics.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | library(scales)
19 | knitr::opts_chunk$set(echo = FALSE)
20 | tutorial_options(exercise.completion=FALSE)
21 | ```
22 |
23 | ## The Goal
24 |
25 | The goal of this lesson is to introduce you to creating bar charts using ggplot2, one of the most powerful and flexible data visualization tools in R. By the end of this tutorial, you'll understand the basic structure of a ggplot command, how to create simple bar charts, and how to customize them to better represent your data. You'll learn how to filter data to focus on specific subsets, reorder bars for clearer presentation, use layering to highlight particular data points, and adjust the orientation of your charts. These skills will form the foundation for creating more complex and insightful visualizations as you progress in your sports data analysis journey. Remember, effective data visualization is key to communicating your findings clearly and compellingly in sports analytics.
26 |
27 | ## The Basics
28 |
29 | With `ggplot2`, we dive into the world of programmatic data visualization. The `ggplot2` library implements something called the grammar of graphics. The main concepts are:
30 |
31 | * aesthetics - which in this case means the data which we are going to plot
32 | * geometries - which means the shape the data is going to take
33 | * scales - which means any transformations we might make on the data
34 | * facets - which means how we might graph many elements of the same dataset in the same space
35 | * layers - which means how we might lay multiple geometries over top of each other to reveal new information.
36 |
37 | Hadley Wickham, who is behind most of the libraries we have used in this course to date, wrote about his layered grammar of graphics in [this 2010 paper that is worth your time to read](https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf).
38 |
39 | Let's dive in using data on college softball in 2024. This workflow will give you a clear picture of what your work in this class will be like for much of the rest of the semester. One way to think of this workflow is that your R Notebook is now your digital sketchbook, where you will try different types of visualizations to find ones that work. Then, you will write the code that adds necessary and required parts to finish it.
40 |
41 | First, load the tidyverse.
42 |
43 | ```{r load-tidyverse, exercise=TRUE}
44 | library(tidyverse)
45 | ```
46 | ```{r load-tidyverse-solution}
47 | library(tidyverse)
48 | ```
49 | ```{r load-tidyverse-check}
50 | grade_this_code()
51 | ```
52 |
53 | And the data.
54 |
55 | ```{r ggplot-load-data, message=FALSE, warning=FALSE}
56 | softball <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_softball_teams_2024.csv")
57 |
58 | huskers <- softball |>
59 | filter(
60 | team == "Nebraska"
61 | )
62 |
63 | big10 <- softball |>
64 | filter(
65 | conference == "Big Ten"
66 | )
67 | ```
68 | ```{r ggplot-load-data-exercise, exercise = TRUE}
69 | softball <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_softball_teams_2024.csv")
70 | ```
71 | ```{r ggplot-load-data-exercise-solution}
72 | softball <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_softball_teams_2024.csv")
73 | ```
74 | ```{r ggplot-load-data-exercise-check}
75 | grade_this_code()
76 | ```
77 |
78 | And let's `glimpse` the data to remind us what is all there, because it's wide and has a lot of columns.
81 |
82 | ```{r glimpse-data, exercise=TRUE, exercise.setup = "ggplot-load-data"}
83 | glimpse(softball)
84 | ```
85 | ```{r glimpse-data-solution}
86 | glimpse(softball)
87 | ```
88 | ```{r glimpse-data-check}
89 | grade_this_code()
90 | ```
91 |
92 | Now that we're set up, when making graphics, you're rarely ever going straight from data to graphic. Normally, you'll have way too much data to visualize, or you need to do something to that data in order to visualize it. In our case, we have way too many schools in our data to make a reasonable bar chart. Our first challenge, then, is to narrow the pile.
93 |
94 | ### Exercise 1: Set up your data.
95 |
96 | To do this, we're going to do a few things we've done before. We're going to filter all the softball teams in Division I down to just the Big Ten. We'll put that into a new dataframe called `big10`.
97 |
98 | ```{r ggplot-top, exercise=TRUE, exercise.setup = "ggplot-load-data", message=FALSE}
99 | ____ <- ____ |>
100 | filter(
101 | conference == "____"
102 | )
103 | ```
104 | ```{r ggplot-top-solution, exercise.reveal_solution = FALSE}
105 | big10 <- softball |>
106 | filter(
107 | conference == "Big Ten"
108 | )
109 | ```
110 | ```{r ggplot-top-check}
111 | grade_this_code()
112 | ```
113 |
114 | Now, we should have 14 teams that we can look at. That's far more manageable than 306 -- the number of softball teams in Division I.
115 |
116 | ## The bar chart
117 |
118 | The easiest thing we can do is create a simple bar chart of our data. **Bar charts show magnitude. They invite you to compare how much more or less one thing is compared to others.**
119 |
120 | We could, for instance, create a bar chart of home runs each team hit. We might be asking the question: Which Big Ten team relies on the long ball to score runs? To do that, we simply tell `ggplot2` what our dataset is, what element of the data we want to make the bar chart out of (which is the aesthetic), and the geometry type (which is the geom). It looks like this:
121 |
122 | `ggplot() + geom_bar(data=big10, aes(x=team))`
123 |
124 | We start with `ggplot()` which is creating a blank canvas for us to work in. The `geom_bar()` is the geometry -- the form the data will take. We will learn many geometries over the next several lessons. `big10` is our data, `aes` means aesthetics, `x=team` explicitly tells `ggplot2` that our x value -- our horizontal value -- is the team field from the data. Why team? We want one bar per team, no? Put the team on the x axis is what that is saying. And what do we get when we run that?
125 |
126 | ```{r ggplot-bar1, exercise=TRUE, exercise.setup = "ggplot-load-data", message=FALSE}
127 | ggplot() + geom_bar(data=big10, aes(x=team))
128 | ```
129 | ```{r ggplot-bar1-solution, exercise.reveal_solution = FALSE}
130 | ggplot() + geom_bar(data=big10, aes(x=team))
131 | ```
132 | ```{r ggplot-bar1-check}
133 | grade_this_code()
134 | ```
135 |
136 | We get ... weirdness. We expected to see bars of different sizes, but we get all with a count of 1. What gives? Well, this is the default behavior. What we have here works like a histogram: `ggplot2` helpfully counted up the number of times each team appears and made bars as long as the count. Since we only have one record per team, the count is always 1. How do we fix this? By adding `weight` to our aesthetic.
137 |
138 | ### Exercise 2: The weight of responsibility
139 |
140 | You saw how it was done before. This just adds weight to the aes. The weight is always going to be a number, and our number is the column for home runs.
141 |
142 | ```{r ggplot-bar2, exercise=TRUE, exercise.setup = "ggplot-load-data", message=FALSE}
143 | ggplot() +
144 | geom_bar(data=big10, aes(x=team, weight=____))
145 | ```
146 | ```{r ggplot-bar2-solution, exercise.reveal_solution = FALSE}
147 | ggplot() +
148 | geom_bar(data=big10, aes(x=team, weight=hr))
149 | ```
150 | ```{r ggplot-bar2-check}
151 | grade_this_code()
152 | ```
153 |
154 | Now we get bars of different sizes, and you can clearly see a difference between teams.
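To see what `weight` changes under the hood, here is a small sketch on an invented three-team data frame (`teams` and its `hr` values are made up for illustration). It uses `layer_data()` from `ggplot2` to pull out the numbers the chart is drawn from.

```r
library(ggplot2)

# Hypothetical data: one row per team
teams <- data.frame(team = c("A", "B", "C"), hr = c(10, 25, 40))

# Default behavior: geom_bar() counts rows, so every bar has a height of 1
counted <- ggplot() +
  geom_bar(data = teams, aes(x = team))

# With weight, each row contributes its hr value instead of 1
weighted <- ggplot() +
  geom_bar(data = teams, aes(x = team, weight = hr))

# The count column holds the computed bar heights
layer_data(counted)$count
layer_data(weighted)$count
```

The first set of heights is all ones; the second matches the `hr` column, which is why the bars finally differ.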
155 |
156 | Let's work on a different version of this to see if we can find a clearer story.
157 |
158 | ### Exercise 3: Reordering
159 |
160 | `ggplot2`'s default behavior is to sort the data by the x axis variable. It's in alphabetical order. Rarely ever is that useful.
161 |
162 | To change the order of the bars, we have to `reorder` it. With `reorder`, we first have to tell `ggplot` what we are reordering, and then we have to tell it HOW we are reordering it. Nearly 99 times out of 100, that HOW we're reordering it is going to be the number. So it's reorder(FIELD, SORTFIELD). So let's put the bars in order of how many home runs they hit.
163 |
164 | ```{r ggplot-bar3, exercise=TRUE, exercise.setup = "ggplot-load-data", message=FALSE}
165 | ggplot() +
166 | geom_bar(data=big10, aes(x=reorder(team, ____), weight=____))
167 | ```
168 | ```{r ggplot-bar3-solution, exercise.reveal_solution = FALSE}
169 | ggplot() +
170 | geom_bar(data=big10, aes(x=reorder(team, hr), weight=hr))
171 | ```
172 | ```{r ggplot-bar3-check}
173 | grade_this_code()
174 | ```
175 |
176 | There, now our bars are in some kind of order.
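If you're curious what `reorder` actually does, it turns the field into a factor whose levels are sorted by the second argument. A minimal sketch with made-up values (the `team` and `hr` vectors below are invented for illustration):

```r
# Hypothetical teams and home run totals
team <- c("A", "B", "C")
hr <- c(40, 10, 25)

# Levels are sorted from the smallest hr to the largest
ascending <- levels(reorder(team, hr))

# Negating the sort field flips the order: largest first
descending <- levels(reorder(team, -hr))

ascending
descending
```

`ggplot2` draws bars in the order of those levels, so the minus sign is a handy way to reverse a chart.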
177 |
178 | ### Exercise 4: Layering
179 |
180 | Our data has 14 teams in it, but what if we wanted to focus on one team? We can use layering for that.
181 | First, we need to make a dataframe of one. Let's focus on Nebraska for ... reasons. Then, we just add another geom_bar to our chart.
182 |
183 | To our new geom, we're going to add one new thing -- a fill. Let's make the color red. Again. For reasons.
184 |
185 | ```{r ggplot-bar4, exercise=TRUE, exercise.setup = "ggplot-load-data", message=FALSE}
186 | huskers <- softball |>
187 | filter(
188 | team == "Nebraska"
189 | )
190 |
191 | ggplot() +
192 | geom_bar(data=big10, aes(x=reorder(____, ____), weight=____)) +
193 | geom_bar(data=huskers, aes(x=reorder(____, ____), weight=____), fill="____")
194 | ```
195 | ```{r ggplot-bar4-solution, exercise.reveal_solution = FALSE}
196 | huskers <- softball |>
197 | filter(
198 | team == "Nebraska"
199 | )
200 |
201 | ggplot() +
202 | geom_bar(data=big10, aes(x=reorder(team, hr), weight=hr)) +
203 | geom_bar(data=huskers, aes(x=reorder(team, hr), weight=hr), fill="red")
204 | ```
205 | ```{r ggplot-bar4-check}
206 | grade_this_code()
207 | ```
208 |
209 | Where Nebraska ranks is pretty clear now, eh?
210 |
211 | ## One last trick: coord flip
212 |
213 | Sometimes, we don't want vertical bars. Maybe we think this would look better horizontal. Maybe our x-axis labels will never fit on the x-axis. How do we do that? By adding `coord_flip()` to our code. It does what it says -- it inverts the coordinates of the figures.
214 |
215 | ### Exercise 5: Flip the coordinates
216 |
217 | ```{r ggplot-bar5, exercise=TRUE, exercise.setup = "ggplot-load-data", message=FALSE}
218 | ggplot() +
219 | geom_bar(data=big10, aes(x=reorder(team, hr), weight=hr)) +
220 | geom_bar(data=huskers, aes(x=reorder(team, hr), weight=hr), fill="red") +
221 | ____()
222 | ```
223 | ```{r ggplot-bar5-solution, exercise.reveal_solution = FALSE}
224 | ggplot() +
225 | geom_bar(data=big10, aes(x=reorder(team, hr), weight=hr)) +
226 | geom_bar(data=huskers, aes(x=reorder(team, hr), weight=hr), fill="red") +
227 | coord_flip()
228 | ```
229 | ```{r ggplot-bar5-check}
230 | grade_this_code()
231 | ```
232 |
233 | Now we've got something working. There's a lot of work left to do to make this publishable, but we'll get to that in due time.
234 |
235 | ## The Recap
236 |
237 | In this lesson, we've taken our first steps into the world of data visualization with ggplot2, focusing on bar charts. We've learned how to create basic bar charts using geom_bar(), and how to customize them to better represent our data. We explored key concepts such as filtering data to focus on specific subsets (like the Big Ten conference), using the weight aesthetic to represent values other than count, and reordering bars to improve readability. We also discovered how to use layering to highlight specific data points (like Nebraska's performance) and how to flip coordinates for horizontal bar charts. These foundational skills in creating and customizing bar charts will serve as building blocks for more complex visualizations in future lessons. Remember, effective data visualization is an iterative process - don't be afraid to experiment with different approaches to find the most clear and compelling way to present your sports data insights.
238 |
239 | ## Terms to Know
240 |
241 | - ggplot2: A data visualization package for R that implements the grammar of graphics, allowing for the creation of complex plots from data in a layered fashion.
242 | - `geom_bar()`: A ggplot2 function used to create bar charts, which represent data with rectangular bars proportional to the values they represent.
243 | - Aesthetics (aes): In ggplot2, aesthetics define how variables in the data are mapped to visual properties of the plot, such as x and y positions, colors, sizes, etc.
244 | - `weight`: An aesthetic in geom_bar() that allows you to represent values other than simple counts in your bar chart.
245 | - `reorder()`: A function used to change the order of categorical variables in a plot, often used to sort bars by their values for better readability.
246 | - `coord_flip()`: A ggplot2 function that switches the x and y axes, useful for creating horizontal bar charts.
247 | - Layering: The process of adding multiple geometric objects (geoms) to a plot, allowing for more complex and informative visualizations.
248 | - `fill`: An aesthetic that controls the interior color of geometric shapes in a plot, such as the color inside bars in a bar chart.
249 | - `data`: The argument in ggplot2 functions that specifies which dataset to use for creating the visualization.
250 |
--------------------------------------------------------------------------------
/inst/tutorials/14-stacked-bars/14-stacked-bars.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 14: Stacked bar charts"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to add nuance to a bar chart.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | knitr::opts_chunk$set(echo = FALSE)
19 | tutorial_options(exercise.completion=FALSE)
20 | ```
21 |
22 | ## The Goal
23 |
24 | The goal of this lesson is to build upon your basic bar chart skills and introduce you to creating stacked bar charts using ggplot2. Stacked bar charts are powerful tools for visualizing the composition of a whole and comparing parts within it. By the end of this tutorial, you'll understand how to prepare your data for stacked bar charts, use the fill aesthetic to create layers within bars, and interpret the resulting visualizations. You'll learn how to create charts that show both total values and their components. This lesson will enhance your ability to create more nuanced and informative visualizations, allowing you to present complex relationships in sports data in a clear and compelling manner.
25 |
26 | ## The Basics
27 |
28 | One of the elements of data visualization excellence is **inviting comparison**. Often that comes in showing **what proportion a thing is in relation to the whole thing**. With bar charts, we're showing magnitude of the whole thing. If we have information about the parts of the whole, **we can stack them on top of each other to compare them, showing both the whole and the components**. And it's a simple change to what we've already done.
29 |
30 | We're going to use a dataset of college football games from this season.
31 |
32 | Load the tidyverse.
33 |
34 | ```{r load-tidyverse, exercise=TRUE}
35 | library(tidyverse)
36 | ```
37 | ```{r load-tidyverse-solution}
38 | library(tidyverse)
39 | ```
40 | ```{r load-tidyverse-check}
41 | grade_this_code()
42 | ```
43 |
44 | And the data.
45 |
46 | ```{r stacked-load-data, message=FALSE, warning=FALSE}
47 | football <- read_csv("https://mattwaite.github.io/sportsdatafiles/footballlogs23.csv")
48 |
49 | big <- football |>
50 | group_by(Conference, Team) |>
51 | summarise(
52 | SeasonRushingYards = sum(RushingYds),
53 | SeasonPassingYards = sum(PassingYds),
54 | ) |> filter(Conference == "Big Ten Conference")
55 |
56 | biglong <- big |>
57 | pivot_longer(
58 | cols=starts_with("Season"),
59 | names_to="Type",
60 | values_to="Yards")
61 | ```
62 | ```{r stacked-load-data-exercise, exercise = TRUE}
63 | football <- read_csv("https://mattwaite.github.io/sportsdatafiles/footballlogs23.csv")
64 | ```
65 | ```{r stacked-load-data-exercise-solution}
66 | football <- read_csv("https://mattwaite.github.io/sportsdatafiles/footballlogs23.csv")
67 | ```
68 | ```{r stacked-load-data-exercise-check}
69 | grade_this_code()
70 | ```
71 |
72 | What we have here is every game in college football in 2023. Let's take a quick glimpse.
73 |
74 | ```{r glimpse-data, exercise=TRUE, exercise.setup = "stacked-load-data"}
75 | glimpse(____)
76 | ```
77 | ```{r glimpse-data-solution}
78 | glimpse(football)
79 | ```
80 | ```{r glimpse-data-check}
81 | grade_this_code()
82 | ```
83 |
84 | The question we want to answer is this: Who had the most prolific offenses in the Big Ten? And how did they get there?
85 |
86 | So to make this chart, we have to just add one thing to a bar chart like we did in the previous chapter. However, it's not that simple.
87 |
88 | ### Exercise 1: Preparing your data
89 |
90 | We have game data, and we need season data. To get that, we need to do some group by and sum work. And since we're only interested in the Big Ten, we have some filtering to do too. For this, we're going to measure offensive production by rushing yards and passing yards. So if we have all the games a team played, and the rushing and passing yards for each of those games, what we need to do to get the season totals is just add them up. We'll put all of that into a new dataframe called `big`.
91 |
92 | ```{r stacked-big, exercise=TRUE, exercise.setup = "stacked-load-data", message=FALSE}
93 | ____ <- football |>
94 | group_by(Conference, Team) |>
95 | summarise(
96 | SeasonRushingYards = sum(____),
97 | SeasonPassingYards = sum(____)
98 | ) |>
99 | filter(Conference == "Big Ten Conference")
100 | ```
101 | ```{r stacked-big-solution, exercise.reveal_solution = FALSE}
102 | big <- football |>
103 | group_by(Conference, Team) |>
104 | summarise(
105 | SeasonRushingYards = sum(RushingYds),
106 | SeasonPassingYards = sum(PassingYds)
107 | ) |>
108 | filter(Conference == "Big Ten Conference")
109 | ```
110 | ```{r stacked-big-check}
111 | grade_this_code()
112 | ```
113 |
114 | By looking at this, we can see we got what we needed. We have 14 teams (though remember that head only shows you six) and numbers that look like season totals for yards.
115 |
116 | ```{r head-data, exercise=TRUE, exercise.setup = "stacked-load-data"}
117 | head(big)
118 | ```
119 | ```{r head-data-solution}
120 | head(big)
121 | ```
122 | ```{r head-data-check}
123 | grade_this_code()
124 | ```
125 |
126 | Now, the problem we have is that ggplot wants long data and this data is wide.
127 |
128 | ### Exercise 2: Making wide data long
129 |
130 | Remember transforming data? Lesson 7? This is where that work is going to pay off. We need to pivot this data longer. In order to do that, we need to say which columns are being pivoted. Note in the head block above, we have four columns. Two columns are the Team and the Conference. The other columns start with the word Season. That, my friends, is a gigantic hint.
131 |
132 | We're going to save this work to a new dataframe called biglong.
133 |
134 | ```{r stacked-pivot, exercise=TRUE, exercise.setup = "stacked-load-data", message=FALSE}
135 | ____ <- ____ |>
136 | pivot_longer(
137 | cols=starts_with("____"),
138 | names_to="Type",
139 | values_to="Yards")
140 | ```
141 | ```{r stacked-pivot-solution, exercise.reveal_solution = FALSE}
142 | biglong <- big |>
143 | pivot_longer(
144 | cols=starts_with("Season"),
145 | names_to="Type",
146 | values_to="Yards")
147 | ```
148 | ```{r stacked-pivot-check}
149 | grade_this_code()
150 | ```
151 |
152 | If this worked, you'll have two rows for each team: One for rushing yards, one for passing yards. This is what ggplot needs.
153 |
154 | ```{r head2-data, exercise=TRUE, exercise.setup = "stacked-load-data"}
155 | head(biglong)
156 | ```
157 | ```{r head2-data-solution}
158 | head(biglong)
159 | ```
160 | ```{r head2-data-check}
161 | grade_this_code()
162 | ```
163 |
164 | ### Exercise 3: Making your first plot
165 |
166 | Building on what we learned in the last chapter, we know we can turn this into a bar chart with an x value, a weight and a geom_bar. What we are going to add is a `fill`. The `fill` will stack bars on each other based on which element it is. In this case, we can fill the bar by Type, which means it will stack the number of rushing yards on top of passing yards and we can see how they compare.
167 |
168 | ```{r stacked-plot1, exercise=TRUE, exercise.setup = "stacked-load-data", message=FALSE}
169 | ggplot() +
170 | geom_bar(data=____, aes(x=____, weight=Yards, fill=____)) +
171 | coord_flip()
172 | ```
173 | ```{r stacked-plot1-solution, exercise.reveal_solution = FALSE}
174 | ggplot() +
175 | geom_bar(data=biglong, aes(x=Team, weight=Yards, fill=Type)) +
176 | coord_flip()
177 | ```
178 | ```{r stacked-plot1-check}
179 | grade_this_code()
180 | ```
181 |
182 | What's the problem with this chart?
183 |
184 | There's a couple of things, one of which we'll deal with now: The ordering is alphabetical (from the bottom up). So let's `reorder` the teams by Yards.
185 |
186 | ### Exercise 4: Reordering your bars
187 |
188 | ```{r stacked-plot2, exercise=TRUE, exercise.setup = "stacked-load-data", message=FALSE}
189 | ggplot() +
190 | geom_bar(data=____, aes(x=reorder(____, ____), weight=Yards, fill=____)) +
191 | coord_flip()
192 | ```
193 | ```{r stacked-plot2-solution, exercise.reveal_solution = FALSE}
194 | ggplot() +
195 | geom_bar(data=biglong, aes(x=reorder(Team, Yards), weight=Yards, fill=Type)) +
196 | coord_flip()
197 | ```
198 | ```{r stacked-plot2-check}
199 | grade_this_code()
200 | ```
201 |
202 | And just like that ... Ohio State and Penn State are out on top. Where is Nebraska?
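One detail worth knowing: with long data, each team has two rows, and `reorder` sorts by the mean of the sort field for each team unless told otherwise. Because every team has the same number of rows here, sorting by the mean gives the same order as sorting by the total. A sketch with invented numbers (the `Team` and `Yards` vectors are made up for illustration):

```r
# Hypothetical long data: two rows per team (rushing, then passing)
Team <- c("A", "A", "B", "B", "C", "C")
Yards <- c(100, 300, 250, 250, 50, 200)

# Default: levels sorted by each team's mean Yards
by_mean <- levels(reorder(Team, Yards))

# Explicit: levels sorted by each team's total Yards
by_sum <- levels(reorder(Team, Yards, FUN = sum))

by_mean
by_sum
```

If teams ever had different numbers of rows, passing `FUN = sum` would be the safer way to order by totals.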
203 |
204 | ## The Recap
205 |
206 | In this lesson, we've expanded our data visualization skills by learning how to create stacked bar charts using ggplot2. We started by preparing our data, using group_by() and summarise() to aggregate game-level data into season totals, and then transforming it from wide to long format using pivot_longer(). This data preparation step is crucial for creating effective stacked bar charts. We then built on our basic bar chart knowledge, adding the fill aesthetic to create layers within our bars that represent different components of the whole (in this case, rushing and passing yards). We also reinforced the importance of reordering our data for more meaningful visualizations. Through this process, we've learned how to create charts that not only show total offensive production for Big Ten football teams, but also reveal the balance between rushing and passing yards for each team. This type of visualization allows for rich comparisons and insights, demonstrating the power of stacked bar charts in sports data analysis. Remember, effective data visualization often requires careful data preparation and thoughtful design choices to clearly communicate your insights.
207 |
208 | ## Terms to Know
209 |
210 | - Stacked Bar Chart: A type of bar chart where bars are divided into segments, each representing a subcategory of the main category. The total height of the bar represents the sum of all subcategories.
211 | - `pivot_longer()`: A function from the tidyr package used to transform data from wide to long format, which is often necessary for creating stacked bar charts in ggplot2.
212 | - `fill`: An aesthetic in ggplot2 used to assign different colors to subcategories within a stacked bar chart, allowing for visual distinction between components.
213 | - `reorder()`: A function used to change the order of categorical variables in a plot, often used in bar charts to sort bars by their total values for better readability.
214 | - Long format: A data structure where each row represents a single observation and each column represents a variable. This format is typically required for creating stacked bar charts in ggplot2.
215 |
--------------------------------------------------------------------------------
/inst/tutorials/15-waffle-charts/15-waffle-charts.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 15: Waffle charts"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to make another chart that shows magnitude and composition.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(waffle)
18 | knitr::opts_chunk$set(echo = FALSE)
19 | tutorial_options(exercise.completion=FALSE)
20 | ```
21 |
22 | ## The Goal
23 |
24 | The goal of this lesson is to introduce you to waffle charts, an effective alternative to pie charts for visualizing proportions and compositions. By the end of this tutorial, you'll understand why waffle charts are often preferred over pie charts for certain types of data visualization. You'll learn how to create basic waffle charts using the waffle package in R, including how to prepare your data using vectors, customize chart appearance, and create side-by-side comparisons using the iron() function. We'll apply these concepts to real sports data, visualizing offensive statistics from a college football game. This lesson will enhance your data visualization toolkit, allowing you to present proportional data in a clear, engaging, and easily interpretable format that's particularly useful for sports analytics.
25 |
26 | ## The Basics
27 |
28 | Pie charts are the devil. They should be an instant F in any data visualization class. The problem? How carefully can you evaluate angles and area? Unless the differences are blindingly obvious and there are only a few categories, not well. If you've got 25 categories, how can you tell the difference between 7 and 9 percent? You can't.
29 |
30 | So let's introduce a better way: The Waffle Chart. Some call it a square pie chart. I personally hate that. Waffles it is.
31 |
32 | **A waffle chart is designed to show you parts of the whole -- proportionality**. How many yards on offense come from rushing or passing. How many singles, doubles, triples and home runs make up a team's hits. How many of a basketball team's shots are two-pointers versus three-pointers.
33 |
34 | Now load the waffle library. For this exercise, we don't need the tidyverse.
35 |
36 | ```{r load-waffle, exercise=TRUE}
37 | library(waffle)
38 | ```
39 | ```{r load-waffle-solution}
40 | library(waffle)
41 | ```
42 | ```{r load-waffle-check}
43 | grade_this_code()
44 | ```
45 |
46 | ## Making waffles with vectors
47 |
48 | Let's look at the heartbreak that was Nebraska vs. Wisconsin in fall 2022 in college football. A standout heartbreak in a decade of heartbreaks. [Here's the box score](https://www.espn.com/college-football/matchup/_/gameId/401405142), which we'll use for this part of the walkthrough.
49 |
50 | Just remember -- we lost this game by 1 point.
51 |
52 | The easiest way to do waffle charts is to make vectors of your data and plug them in. To make a vector, we use the `c` or concatenate function. We did this all the way back in Lesson 1.
53 |
54 | So let's look at offense. Rushing vs passing.
55 |
56 | ```{r waffle-load-data, message=FALSE, warning=FALSE}
57 | nu <- c("Rushing"=106, "Passing"=65)
58 | wi <- c("Rushing"=235, "Passing"=83)
59 |
60 | total <- c("Nebraska"=171, "Wisconsin"=318)
61 |
62 | nu2 <- c("Rushing"=106, "Passing"=65, 147)
63 | wi2 <- c("Rushing"=235, "Passing"=83)
64 | ```
65 | ```{r waffle-load-data-exercise, exercise = TRUE}
66 | nu <- c("Rushing"=106, "Passing"=65)
67 | wi <- c("Rushing"=235, "Passing"=83)
68 | ```
69 | ```{r waffle-load-data-exercise-solution}
70 | nu <- c("Rushing"=106, "Passing"=65)
71 | wi <- c("Rushing"=235, "Passing"=83)
72 | ```
73 | ```{r waffle-load-data-exercise-check}
74 | grade_this_code()
75 | ```
76 |
77 | So what does the breakdown of the night look like?
78 |
79 | ### Exercise 1: Make a waffle chart
80 |
81 | The waffle library can break this down in a way that's easier on the eyes than a pie chart. We call the `waffle` function, add the data for Nebraska, specify the number of rows (10 is a good start), give it a title and an x value label, and to clean up a quirk of the library, we've got to specify colors.
82 |
83 | ```{r waffle, exercise=TRUE, exercise.setup = "waffle-load-data", message=FALSE}
84 | waffle(
85 | ____,
86 | rows = ____,
87 | title="Nebraska's offense",
88 | xlab="1 square = 1 yard",
89 | colors = c("black", "red")
90 | )
91 | ```
92 | ```{r waffle-solution, exercise.reveal_solution = FALSE}
93 | waffle(
94 | nu,
95 | rows = 10,
96 | title="Nebraska's offense",
97 | xlab="1 square = 1 yard",
98 | colors = c("black", "red")
99 | )
100 | ```
101 | ```{r waffle-check}
102 | grade_this_code()
103 | ```
104 |
105 | Or, we could make this two teams in the same chart by just changing the vector up. Nebraska had 171 yards that night. Wisconsin had 318.
106 |
107 | ### Exercise 2: Two teams, one waffle
108 |
109 | ```{r waffle2, exercise=TRUE, exercise.setup = "waffle-load-data", message=FALSE}
110 | total <- c("Nebraska"=____, "____"=318)
111 |
112 | waffle(
113 | ____,
114 | rows = 10,
115 | title="Nebraska vs Wisconsin: total yards",
116 | xlab="1 square = 1 yard",
117 | colors = c("red", "black")
118 | )
119 | ```
120 | ```{r waffle2-solution, exercise.reveal_solution = FALSE}
121 | total <- c("Nebraska"=171, "Wisconsin"=318)
122 |
123 | waffle(
124 | total,
125 | rows = 10,
126 | title="Nebraska vs Wisconsin: total yards",
127 | xlab="1 square = 1 yard",
128 | colors = c("red", "black")
129 | )
130 | ```
131 | ```{r waffle2-check}
132 | grade_this_code()
133 | ```
134 |
135 | No matter how you look at this game, it just comes back to heartache.
136 |
137 | ## Two waffles = waffle iron
138 |
139 | What does it look like if we compare the two teams using the two vectors in the same chart? To do that -- and I am not making this up -- you have to create a waffle iron. Get it? Waffle charts? Iron?
140 |
141 | ### Exercise 3: The waffle iron
142 |
143 | To make a waffle iron, you wrap your `waffle` functions in an `iron` function. The `iron` is just a wrapper -- it just combines them together. Each waffle functions separately in the iron.
144 |
145 | In this block, we're going to use the first vectors we made, `nu` and `wi`. You'll see, the iron is just two waffles with a comma between them.
146 |
147 | ```{r waffle3, exercise=TRUE, exercise.setup = "waffle-load-data", message=FALSE}
148 | iron(
149 | waffle(____,
150 | rows = 10,
151 | title="Nebraska's offense",
152 | xlab="1 square = 1 yard",
153 | colors = c("black", "red")
154 | ),
155 | waffle(____,
156 | rows = 10,
157 | title="Wisconsin's offense",
158 | xlab="1 square = 1 yard",
159 | colors = c("black", "red")
160 | )
161 | )
162 | ```
163 | ```{r waffle3-solution, exercise.reveal_solution = FALSE}
164 | iron(
165 | waffle(nu,
166 | rows = 10,
167 | title="Nebraska's offense",
168 | xlab="1 square = 1 yard",
169 | colors = c("black", "red")
170 | ),
171 | waffle(wi,
172 | rows = 10,
173 | title="Wisconsin's offense",
174 | xlab="1 square = 1 yard",
175 | colors = c("black", "red")
176 | )
177 | )
178 | ```
179 | ```{r waffle3-check}
180 | grade_this_code()
181 | ```
182 |
183 | What do you notice about this chart? Notice how the squares aren't the same size? Well, Wisconsin out-gained Nebraska by a chunk ... AND ONLY WON BY A POINT. So the squares aren't the same size because the numbers aren't the same.
184 |
185 | ### Exercise 4: Adding padding
186 |
187 | We can fix the uneven box sizes by adding an unnamed padding number so the number of yards adds up to the same thing for both teams. Wisconsin's total yards ended up at 318. To make the squares the same, we need to make the total for everyone be 318. Nebraska had 171 yards, so we need to add 147 to Nebraska. REMEMBER: Don't name it or it'll show up in the legend.
188 |
189 | Now, in our waffle iron, if we don't give that padding a color, we'll get an error. So we need to make it white. Which, given our white background, means it will disappear.
190 |
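If you'd rather not do the subtraction by hand, here's a minimal sketch of letting R work out the padding. It reuses the `nu` and `wi` vectors from earlier; the `padding` name is just ours for illustration, not something the waffle library requires.

```r
# Yards from the box score, same vectors as earlier in the lesson
nu <- c("Rushing" = 106, "Passing" = 65)
wi <- c("Rushing" = 235, "Passing" = 83)

# The padding is the gap between the two teams' totals: 318 - 171
padding <- sum(wi) - sum(nu)
padding

# Add it to Nebraska's vector WITHOUT a name so it stays out of the legend
nu2 <- c(nu, padding)

# Both vectors now add up to the same number of squares
sum(nu2) == sum(wi)
```

The exercise below still types the 147 in directly, so stick with that there. This is just a way to check your math.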
191 | ```{r waffle4, exercise=TRUE, exercise.setup = "waffle-load-data", message=FALSE}
192 | nu2 <- c("Rushing"=106, "Passing"=65, 147)
193 | wi2 <- c("Rushing"=235, "Passing"=83)
194 |
195 | iron(
196 | waffle(____,
197 | rows = 10,
198 | title="Nebraska's offense",
199 | xlab="1 square = 1 yard",
200 | colors = c("black", "red", "____")
201 | ),
202 | waffle(____,
203 | rows = 10,
204 | title="Wisconsin's offense",
205 | xlab="1 square = 1 yard",
206 | colors = c("black", "red")
207 | )
208 | )
209 | ```
210 | ```{r waffle4-solution, exercise.reveal_solution = FALSE}
211 | nu2 <- c("Rushing"=106, "Passing"=65, 147)
212 | wi2 <- c("Rushing"=235, "Passing"=83)
213 |
214 | iron(
215 | waffle(nu2,
216 | rows = 10,
217 | title="Nebraska's offense",
218 | xlab="1 square = 1 yard",
219 | colors = c("black", "red", "white")
220 | ),
221 | waffle(wi2,
222 | rows = 10,
223 | title="Wisconsin's offense",
224 | xlab="1 square = 1 yard",
225 | colors = c("black", "red")
226 | )
227 | )
228 | ```
229 | ```{r waffle4-check}
230 | grade_this_code()
231 | ```
232 |
233 | ### Exercise 5: Many units, one box
234 |
235 | One last thing we can do is change the 1 square = 1 yard bit -- which makes the squares really small in this case -- by dividing our vector. Because of how R makes vectors, you can just divide it by a number and R will know to divide the numbers inside the vector by that number. Take what you just did and divide it by 2 and see what happens. Reminder: in R, divide is just the slash.
236 |
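Before you put it in a waffle, here's a quick sketch of what dividing a vector does. It uses the same `nu2` vector as the exercise; try it in your own console if you want to see the numbers.

```r
nu2 <- c("Rushing" = 106, "Passing" = 65, 147)

# Dividing a vector divides every element inside it, and the names stay put.
# 106 becomes 53, 65 becomes 32.5 and the padding of 147 becomes 73.5.
halved <- nu2 / 2
halved
```

Notice that odd numbers leave you with half yards, so the bigger the number you divide by, the rougher your squares get as a count of yards.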
237 | ```{r waffle5, exercise=TRUE, exercise.setup = "waffle-load-data", message=FALSE}
238 | nu2 <- c("Rushing"=106, "Passing"=65, 147)
239 | wi2 <- c("Rushing"=235, "Passing"=83)
240 |
241 | iron(
242 | waffle(____/____,
243 | rows = 10,
244 | title="Nebraska's offense",
245 | xlab="1 square = 2 yards",
246 | colors = c("black", "red", "____")
247 | ),
248 | waffle(____/____,
249 | rows = 10,
250 | title="Wisconsin's offense",
251 | xlab="1 square = 2 yards",
252 | colors = c("black", "red")
253 | )
254 | )
255 | ```
256 | ```{r waffle5-solution, exercise.reveal_solution = FALSE}
257 | nu2 <- c("Rushing"=106, "Passing"=65, 147)
258 | wi2 <- c("Rushing"=235, "Passing"=83)
259 |
260 | iron(
261 | waffle(nu2/2,
262 | rows = 10,
263 | title="Nebraska's offense",
264 | xlab="1 square = 2 yards",
265 | colors = c("black", "red", "white")
266 | ),
267 | waffle(wi2/2,
268 | rows = 10,
269 | title="Wisconsin's offense",
270 | xlab="1 square = 2 yards",
271 | colors = c("black", "red")
272 | )
273 | )
274 | ```
275 | ```{r waffle5-check}
276 | grade_this_code()
277 | ```
278 |
279 | News flash: Wisconsin beat Nebraska by one point. One.
280 |
281 | ## The Recap
282 |
283 | In this lesson, we've explored the power of waffle charts as an effective alternative to pie charts for visualizing proportional data in sports analytics. We started by learning how to create simple waffle charts using vectors and the waffle() function, applying this to visualize offensive statistics from a college football game. We then advanced to creating side-by-side comparisons using the iron() function, which allowed us to effectively contrast data from two teams. We also learned important techniques for improving our visualizations, such as adding padding to ensure consistent square sizes and adjusting the scale to represent multiple units per square. Throughout the lesson, we've seen how waffle charts can clearly display both the magnitude and composition of data, making them an excellent tool for presenting sports statistics. Remember, effective data visualization is about choosing the right tool for your data and audience, and waffle charts offer a clear, engaging way to present proportional data that's easy for viewers to interpret and compare.
284 |
285 | ## Terms to Know
286 |
287 | - Waffle Chart: A chart type that represents data as squares in a grid, where each square represents a specific value or proportion of the whole.
288 | - Vector: In R, a basic data structure that contains elements of the same type, used to create the data for waffle charts.
289 | - `c()` Function: The concatenate function in R used to create vectors by combining values.
290 | - `waffle()` Function: The main function from the waffle package used to create individual waffle charts.
291 | - `iron()` Function: A function from the waffle package used to combine multiple waffle charts into a single visualization.
292 | - Padding: Extra values added to a vector to make the total consistent across different data sets, allowing for easier comparison in waffle charts.
--------------------------------------------------------------------------------
/inst/tutorials/16-line-charts/16-line-charts.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 16: Line charts"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to show change over time.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | knitr::opts_chunk$set(echo = FALSE)
19 | tutorial_options(exercise.completion=FALSE)
20 | ```
21 |
22 | ## The Goal
23 |
24 | The goal of this lesson is to introduce you to line charts, a powerful visualization tool for displaying trends over time. By the end of this tutorial, you'll understand when to use line charts, how to create them using ggplot2, and how to customize them for clearer data representation. You'll learn to plot single and multiple lines, adjust axes for better context, and add layers to compare different datasets. We'll apply these concepts to real college basketball data, analyzing team performance across a season. This lesson will enhance your ability to visualize and interpret time-series data, a crucial skill in sports analytics for understanding performance trends, seasonal patterns, and comparative team dynamics.
25 |
26 | ## The Basics
27 |
28 | So far, we've talked about how bar charts -- stacked or otherwise -- are good for showing the relative size of a thing compared to another thing. Stacked bars and waffle charts are good at showing proportions of a whole.
29 |
30 | **Line charts are good for showing change over time.**
31 |
32 | Let's look at how we can answer this question: How is Nebraska's basketball season progressing?
33 |
34 | We'll use the logs of every game in college basketball.
35 |
36 | Let's start getting all that we need. We can use the tidyverse shortcut.
37 |
38 | ```{r load-tidyverse, exercise=TRUE}
39 | library(tidyverse)
40 | ```
41 | ```{r load-tidyverse-solution}
42 | library(tidyverse)
43 | ```
44 | ```{r load-tidyverse-check}
45 | grade_this_code()
46 | ```
47 |
48 | And the data.
49 |
50 | ```{r line-load-data, message=FALSE, warning=FALSE}
51 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/logs25.csv")
52 |
53 | nu <- logs |> filter(Team == "Nebraska")
54 | pu <- logs |> filter(Team == "Purdue")
55 | big <- logs |> filter(Conference == "Big Ten")
56 | average <- logs |> group_by(Date) |> summarise(mean_shooting=mean(TeamFGPCT))
57 |
58 | ```
59 | ```{r line-load-data-exercise, exercise = TRUE}
60 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/logs25.csv")
61 | ```
62 | ```{r line-load-data-exercise-solution}
63 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/logs25.csv")
64 | ```
65 | ```{r line-load-data-exercise-check}
66 | grade_this_code()
67 | ```
68 |
69 | ### Exercise 1: Prepare your data
70 |
71 | This data has every game from every team in it, so we need to use filtering to limit it, because we just want to look at Nebraska. If you don't remember how, flip back to Lesson 6. Let's make a new dataframe called nu for Nebraska's games.
72 |
73 | ```{r nu, exercise=TRUE, exercise.setup = "line-load-data", message=FALSE}
74 | ____ <- logs |> filter(Team == "____")
75 | ```
76 | ```{r nu-solution, exercise.reveal_solution = FALSE}
77 | nu <- logs |> filter(Team == "Nebraska")
78 | ```
79 | ```{r nu-check}
80 | grade_this_code()
81 | ```
82 |
83 | We can get a peek at it to make sure it all worked.
84 |
85 | ```{r head-data, exercise=TRUE, exercise.setup = "line-load-data"}
86 | head(nu)
87 | ```
88 | ```{r head-data-solution}
89 | head(nu)
90 | ```
91 | ```{r head-data-check}
92 | grade_this_code()
93 | ```
94 |
95 | Because this data has just Nebraska data in it, the dates are formatted correctly, and the data is long data (instead of wide), we have what we need to make line charts.
96 |
97 | ### Exercise 2: Your first line chart
98 |
99 | Line charts, unlike bar charts, do have a y-axis. So in our ggplot step, we have to define what our x and y axes are. In this case, the x axis is our Date -- *the most common x axis in line charts is going to be a date of some variety* -- and y in this case is up to us. We've seen from previous walkthroughs that how well a team shoots the ball has a lot to do with how well a team does in a season, so let's chart that.
100 |
101 | ```{r line1, exercise=TRUE, exercise.setup = "line-load-data", message=FALSE}
102 | ggplot() + geom_line(data=____, aes(x=____, y=____FGPCT))
103 | ```
104 | ```{r line1-solution, exercise.reveal_solution = FALSE}
105 | ggplot() + geom_line(data=nu, aes(x=Date, y=TeamFGPCT))
106 | ```
107 | ```{r line1-check}
108 | grade_this_code()
109 | ```
110 |
111 | See some problems here? Whatever you came up with, the real problem is that the y-axis doesn't start at zero.
112 |
113 | ### Exercise 3: Zeroing your axes
114 |
115 | Using a non-zero axis makes a line chart look more dramatic than it is. To make the axis what you want, you can use `scale_x_continuous` or `scale_y_continuous` and pass in `limits`, a vector with the bottom and top values you want. The bottom value is zero. The upper value is up to you. Let's try .7 or 70 percent.
116 |
117 | ```{r line2, exercise=TRUE, exercise.setup = "line-load-data", message=FALSE}
118 | ggplot() +
119 | geom_line(data=____, aes(x=____, y=____FGPCT)) +
120 | scale_y_continuous(limits = c(0, .7))
121 | ```
122 | ```{r line2-solution, exercise.reveal_solution = FALSE}
123 | ggplot() +
124 | geom_line(data=nu, aes(x=Date, y=TeamFGPCT)) +
125 | scale_y_continuous(limits = c(0, .7))
126 | ```
127 | ```{r line2-check}
128 | grade_this_code()
129 | ```
130 |
131 | Note also that our X axis labels are automated. It knows it's a date and it just labels it by month.
132 |
133 | ## One line line charts are bad
134 |
135 | With datasets, we want to invite comparison. So let's answer the question visually. Let's put two lines on the same chart. How does Nebraska compare to Purdue, for example?
136 |
137 | ### Exercise 4: Adding another line
138 |
139 | When you have multiple dataframes built from the same data, you can add multiple geoms. All you have to do is swap out the data and add a + between them.
140 |
141 | ```{r line3, exercise=TRUE, exercise.setup = "line-load-data", message=FALSE}
142 | pu <- logs |> filter(Team == "____")
143 |
144 | ggplot() +
145 | geom_line(data=____, aes(x=Date, y=TeamFGPCT), color="red") +
146 | geom_line(data=____, aes(x=Date, y=TeamFGPCT), color="gold") +
147 | scale_y_continuous(limits = c(0, .7))
148 | ```
149 | ```{r line3-solution, exercise.reveal_solution = FALSE}
150 | pu <- logs |> filter(Team == "Purdue")
151 |
152 | ggplot() +
153 | geom_line(data=nu, aes(x=Date, y=TeamFGPCT), color="red") +
154 | geom_line(data=pu, aes(x=Date, y=TeamFGPCT), color="gold") +
155 | scale_y_continuous(limits = c(0, .7))
156 | ```
157 | ```{r line3-check}
158 | grade_this_code()
159 | ```
160 |
161 | When you're doing this on your own in a notebook, REMEMBER COPY AND PASTE IS A THING. Nothing changes except what data you are using and maybe the color. The less typing you're doing, the better.
162 |
163 | So visually speaking, what is the difference between Nebraska and Purdue's season? Does it tell a story?
164 |
165 | ## A lot of lines = context.
166 |
167 | There are some line charts where a small number of lines is good. A good way to add more context? Add more lines. But unlike what we're doing above, some lines don't need attention. I'll explain.
168 |
169 | Let's add all of the Big Ten to our line chart, but this time, we're going to fade them into the background by making them grey. Then, our red line for Nebraska and the gold line of Purdue will really stand out.
170 |
171 | ### Exercise 5: Adding the B1G
172 |
173 | An important thing to know is that geoms render in the order that they appear. The first one is on the bottom. The next one is the second layer. The third layer of three is on top. If you put the Big Ten layer last, the lines will cover everything in grey. If you put it first, the only grey lines that get covered are Nebraska's and Purdue's ... by the red and gold lines for Nebraska and Purdue.
174 |
175 | NOTE: When you have more than one thing in a dataframe -- teams, players, whatever -- and you're making a line chart out of it, you need to add the `group` to the aesthetic. In this case, the group is the Team.
176 |
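Here's a rough sketch of what `group` does, using a tiny made-up dataframe instead of the real logs. The `toy` data, the team names and the numbers are all invented for illustration; only the column names match what we've been using.

```r
library(ggplot2)

# Hypothetical toy data: two teams, three games each
toy <- data.frame(
  Date = as.Date(rep(c("2025-01-01", "2025-01-08", "2025-01-15"), times = 2)),
  Team = rep(c("Team A", "Team B"), each = 3),
  TeamFGPCT = c(.41, .45, .39, .52, .48, .50)
)

# With group=Team, each team gets its own line.
# Take group=Team out and ggplot connects every point into one zigzag.
p <- ggplot() +
  geom_line(data = toy, aes(x = Date, y = TeamFGPCT, group = Team), color = "grey") +
  scale_y_continuous(limits = c(0, .7))

p
```

Try it both ways in a notebook. The zigzag you get without the group is the giveaway that ggplot doesn't know where one team ends and the next begins.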
177 | ```{r line4, exercise=TRUE, exercise.setup = "line-load-data", message=FALSE}
178 | big <- logs |> filter(Conference == "Big Ten")
179 |
180 | ggplot() +
181 | geom_line(data=____, aes(x=Date, y=TeamFGPCT, group=Team), color="____") +
182 | geom_line(data=____, aes(x=Date, y=TeamFGPCT), color="red") +
183 | geom_line(data=____, aes(x=Date, y=TeamFGPCT), color="gold") +
184 | scale_y_continuous(limits = c(0, .7))
185 | ```
186 | ```{r line4-solution, exercise.reveal_solution = FALSE}
187 | big <- logs |> filter(Conference == "Big Ten")
188 |
189 | ggplot() +
190 | geom_line(data=big, aes(x=Date, y=TeamFGPCT, group=Team), color="grey") +
191 | geom_line(data=nu, aes(x=Date, y=TeamFGPCT), color="red") +
192 | geom_line(data=pu, aes(x=Date, y=TeamFGPCT), color="gold") +
193 | scale_y_continuous(limits = c(0, .7))
194 | ```
195 | ```{r line4-check}
196 | grade_this_code()
197 | ```
198 |
199 | What do we see here? How have Nebraska's and Purdue's seasons evolved against all the rest of the teams in the Big Ten?
200 |
201 | ### Exercise 6: Adding the average
202 |
203 | But how does that compare to the average? We can add that pretty easily by creating a new dataframe with it and adding another geom_line.
204 |
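If the group and summarize step feels abstract, here's a small sketch on made-up numbers so you can check the math by eye. The `toy` and `toy_average` names and the values in them are invented for illustration; the real exercise below uses the full `logs` data.

```r
library(dplyr)

# Hypothetical toy data: two teams playing on the same two dates
toy <- tibble(
  Date = as.Date(c("2025-01-01", "2025-01-01", "2025-01-08", "2025-01-08")),
  Team = c("Team A", "Team B", "Team A", "Team B"),
  TeamFGPCT = c(.40, .50, .44, .48)
)

# One row per date, with the average shooting percentage of every team that played that day
toy_average <- toy |>
  group_by(Date) |>
  summarise(mean_shooting = mean(TeamFGPCT))

toy_average
```

Four games go in, two rows come out: one average for each date. That's the shape a single average line needs.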
205 | ```{r average, exercise=TRUE, exercise.setup = "line-load-data"}
206 | average <- logs |> group_by(Date) |> summarise(mean_shooting=mean(TeamFGPCT))
207 | ```
208 | ```{r average-solution}
209 | average <- logs |> group_by(Date) |> summarise(mean_shooting=mean(TeamFGPCT))
210 | ```
211 | ```{r average-check}
212 | grade_this_code()
213 | ```
214 |
215 | And now we have another geom_line to add.
216 |
217 | ```{r line5, exercise=TRUE, exercise.setup = "line-load-data", message=FALSE}
218 | big <- logs |> filter(Conference == "Big Ten")
219 |
220 | ggplot() +
221 | geom_line(data=____, aes(x=Date, y=TeamFGPCT, group=Team), color="____") +
222 | geom_line(data=____, aes(x=Date, y=TeamFGPCT), color="red") +
223 | geom_line(data=____, aes(x=Date, y=TeamFGPCT), color="gold") +
224 | geom_line(data=____, aes(x=Date, y=mean_shooting), color="black") +
225 | scale_y_continuous(limits = c(0, .7))
226 | ```
227 | ```{r line5-solution, exercise.reveal_solution = FALSE}
228 | big <- logs |> filter(Conference == "Big Ten")
229 |
230 | ggplot() +
231 | geom_line(data=big, aes(x=Date, y=TeamFGPCT, group=Team), color="grey") +
232 | geom_line(data=nu, aes(x=Date, y=TeamFGPCT), color="red") +
233 | geom_line(data=pu, aes(x=Date, y=TeamFGPCT), color="gold") +
234 | geom_line(data=average, aes(x=Date, y=mean_shooting), color="black") +
235 | scale_y_continuous(limits = c(0, .7))
236 | ```
237 | ```{r line5-check}
238 | grade_this_code()
239 | ```
240 |
241 |
242 | ## The Recap
243 |
244 | In this lesson, we've explored the power of line charts for visualizing trends over time in sports data. We started with creating basic line charts using ggplot2, learning how to plot a single team's performance throughout a season. We then advanced to more complex visualizations, comparing multiple teams on the same chart and adding context with conference-wide data. We covered important techniques such as zeroing axes for honest representation, using color to highlight specific data series, and incorporating average lines for broader context. Throughout the lesson, we've seen how line charts can effectively show performance fluctuations, allow for easy comparisons between teams, and reveal patterns that might be missed in tabular data. Remember, while line charts are excellent for showing trends over time, they're most effective when used thoughtfully - consider your audience and the story you're trying to tell when deciding how many lines to include and how to style them. This skill in creating and interpreting line charts will be invaluable as you continue to analyze and present sports performance data.
245 |
246 | ## Terms to Know
247 |
248 | - Line Chart: A type of graph that displays data points connected by straight line segments, typically used to show trends over time.
249 | - `geom_line()`: A ggplot2 function used to create line charts by connecting data points with lines.
250 | - Time Series: A sequence of data points indexed in time order, often visualized using line charts.
251 | - x-axis: The horizontal axis in a chart, typically representing time in line charts.
252 | - y-axis: The vertical axis in a chart, typically representing the measured variable in line charts.
253 | - `scale_y_continuous()` or `scale_x_continuous()`: ggplot2 functions used to customize the y-axis or x-axis, including setting limits and breaks.
254 | - Layering: The technique of adding multiple geom_line() functions to create a chart with multiple lines.
255 | - Context Line: A line added to a chart to provide additional context, such as an average or benchmark, often distinguished by color or style.
--------------------------------------------------------------------------------
/inst/tutorials/21-beeswarm-plots/21-beeswarm-plots.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 21: Beeswarm plots"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to make scatterplots grouped by categories on a number line.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | library(ggrepel)
19 | library(ggbeeswarm)
20 | knitr::opts_chunk$set(echo = FALSE)
21 | tutorial_options(exercise.completion=FALSE)
22 | ```
23 |
24 | ## The Goal
25 |
26 | The goal of this lesson is to introduce you to beeswarm plots, an innovative visualization technique that combines elements of scatterplots, histograms, and bar charts. By the end of this tutorial, you'll understand when to use beeswarm plots, how to create them using the ggbeeswarm package in R, and how to interpret the distributions they reveal. You'll learn to prepare data for beeswarm plots, create basic plots, layer data to highlight specific groups, and add labels using ggrepel. We'll apply these concepts to real college volleyball data, analyzing player performance across different positions. You'll also explore variations like quasirandom and jitter plots. This lesson will enhance your ability to visualize the distribution of individual data points within categories, a crucial skill for understanding player performance patterns and making nuanced comparisons in sports analytics.
27 |
28 | ## The Basics
29 |
30 | A beeswarm plot is sometimes called a column scatterplot. It's an effective way to show how individual things -- teams, players, etc. -- are distributed along a numberline. The column is a grouping -- say positions in basketball -- and the dots are players, and the dots cluster where the numbers are more common. So think of it like a histogram mixed with a scatterplot crossed with a bar chart.
31 |
32 | An example will help.
33 |
34 | Let's use the NU Women's Volleyball team as our example. Let's look at a new metric: kills. Who is doing the work for NU?
35 |
36 | To make beeswarm plots, you need a library that adds some geoms to ggplot. In this case it's called ggbeeswarm, and you installed it way back at the beginning. But any time you need a library, and it's on CRAN, you can go to your console and install it with `install.packages("ggbeeswarm")`
37 |
38 | We'll need to load ggbeeswarm, the tidyverse and, for later, ggrepel.
39 |
40 | ```{r load-tidyverse, exercise=TRUE}
41 | library(tidyverse)
42 | library(ggbeeswarm)
43 | library(ggrepel)
44 | ```
45 | ```{r load-tidyverse-solution}
46 | library(tidyverse)
47 | library(ggbeeswarm)
48 | library(ggrepel)
49 | ```
50 | ```{r load-tidyverse-check}
51 | grade_this_code()
52 | ```
53 |
54 | And the data.
55 |
56 | ```{r beeswarm-load-data, message=FALSE, warning=FALSE}
57 | players <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_volleyball_players_2024.csv")
58 |
59 | set.seed(1234)
60 |
61 | activeplayers <- players |> filter(s > 75 & kills > 10 & is.na(position) == FALSE)
62 |
63 | nu <- activeplayers |>
64 | filter(team == "Nebraska")
65 | ```
66 | ```{r beeswarm-load-data-exercise, exercise = TRUE}
67 | players <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_volleyball_players_2024.csv")
68 | ```
69 | ```{r beeswarm-load-data-exercise-solution}
70 | players <- read_csv("https://mattwaite.github.io/sportsdatafiles/ncaa_womens_volleyball_players_2024.csv")
71 | ```
72 | ```{r beeswarm-load-data-exercise-check}
73 | grade_this_code()
74 | ```
75 |
76 | Another bit of setup: we need to set the seed for the random number generator. The library "jitters" the dots in the beeswarm randomly. If we don't set the seed, we'll get different results each time. Setting the seed means we get the same look. You *can* use any number you want. I use 1234, so that's what we'll use here.
77 |
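Here's a quick sketch of what setting the seed buys you, using base R's `runif` to pull random numbers. The `first` and `second` names are just ours for illustration.

```r
# Same seed, same "random" numbers
set.seed(1234)
first <- runif(3)

set.seed(1234)
second <- runif(3)

# TRUE: resetting the seed replays the same sequence
identical(first, second)
```

That's the same reason your beeswarm dots will land in the same spots every time you run the chart.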
78 | ```{r beeswarm1, exercise=TRUE, exercise.setup = "beeswarm-load-data", message=FALSE}
79 | set.seed(1234)
80 | ```
81 | ```{r beeswarm1-solution, exercise.reveal_solution = FALSE}
82 | set.seed(1234)
83 | ```
84 | ```{r beeswarm1-check}
85 | grade_this_code()
86 | ```
87 |
88 | We know this data has a lot of players who didn't play, so let's get rid of them.
89 |
90 | ### Exercise 1: Active players only please.
91 |
92 | Recall in an earlier lesson, we used `filter` to get rid of players who don't play much. Let's do that again, keeping only players who have played in more than 75 sets -- `s` in the data -- and who have more than 10 kills. Believe it or not, there are players who have appeared in 75 sets who didn't rack up a single kill. We'll also drop anyone without a listed position, which is what `is.na(position) == FALSE` does. We'll name our dataframe activeplayers.
93 |
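To see how those three conditions work together, here's a sketch on four made-up players. The `toy_players` data and the `kept` name are invented for illustration; only the column names match the real data.

```r
library(dplyr)

# Hypothetical players: only A passes all three tests
toy_players <- tibble(
  name = c("A", "B", "C", "D"),
  s = c(110, 80, 20, 95),
  kills = c(300, 4, 50, 120),
  position = c("OH", "DS", "MB", NA)
)

# B has too few kills, C has too few sets, D has no position listed
kept <- toy_players |>
  filter(s > 75 & kills > 10 & is.na(position) == FALSE)

kept
```

The `&` means a player has to pass every test to stay. Miss one and you're out.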
94 | ```{r beeswarm2, exercise=TRUE, exercise.setup = "beeswarm-load-data", message=FALSE}
95 | ____ <- ____ |>
96 | filter(
97 | s > ____ &
98 | kills > ____ &
99 | is.na(position) == FALSE)
100 | ```
101 | ```{r beeswarm2-solution, exercise.reveal_solution = FALSE}
102 | activeplayers <- players |>
103 | filter(
104 | s > 75 &
105 | kills > 10 &
106 | is.na(position) == FALSE)
107 | ```
108 | ```{r beeswarm2-check}
109 | grade_this_code()
110 | ```
111 |
112 | ### Exercise 2: Your first beeswarm
113 |
114 | It works very much like you would expect, if you think about it. The group value -- the columns that you put your dots in -- is the x, the number is the y. We're going to beeswarm by position -- `position` -- and the dots will be `kills`.
115 |
116 | ```{r beeswarm5, exercise=TRUE, exercise.setup = "beeswarm-load-data", message=FALSE}
117 | ggplot() +
118 | geom_beeswarm(data=____, aes(x=____, y=____), color="grey")
119 | ```
120 | ```{r beeswarm5-solution, exercise.reveal_solution = FALSE}
121 | ggplot() +
122 | geom_beeswarm(data=activeplayers, aes(x=position, y=kills), color="grey")
123 | ```
124 | ```{r beeswarm5-check}
125 | grade_this_code()
126 | ```
127 |
128 | You can see that there are a lot of setters who get kills, just not many of them. It's not their job. Same with Defensive Specialists and Liberos. You'll also see there's some creativity with naming positions -- OPP for opposite side hitter, which is the same thing as an RS, or right side hitter, which are really all Outside Hitters, or OH. A middle hitter -- MH -- is just another name for MB or middle blocker. If we were going to publish this, we'd have to fix that.
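
One way to fix the position names, sketched here on made-up rows, is a `case_when` inside a `mutate`. The `toy_positions` data and the `position_clean` column are invented for this example, and folding OPP and RS into OH is a judgment call based on the description above, not a rule from the data.

```r
library(dplyr)

# Made-up rows; the position codes are the ones mentioned above
toy_positions <- tibble(
  name = c("A", "B", "C", "D", "E"),
  position = c("OPP", "RS", "OH", "MH", "MB")
)

cleaned <- toy_positions |>
  mutate(
    position_clean = case_when(
      position %in% c("OPP", "RS") ~ "OH",  # treat both as outside hitters
      position == "MH" ~ "MB",              # middle hitter = middle blocker
      TRUE ~ position                       # leave everything else alone
    )
  )

cleaned
```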
129 |
130 | So where are the Nebraska players in that mix?
131 |
132 | ### Exercise 3: Filtering for Nebraska
133 |
134 | We'll filter players on Nebraska who meet our criteria.
135 |
136 | ```{r beeswarm6, exercise=TRUE, exercise.setup = "beeswarm-load-data", message=FALSE}
137 | nu <- activeplayers |>
138 | filter(team == "____")
139 | ```
140 | ```{r beeswarm6-solution, exercise.reveal_solution = FALSE}
141 | nu <- activeplayers |>
142 | filter(team == "Nebraska")
143 | ```
144 | ```{r beeswarm6-check}
145 | grade_this_code()
146 | ```
147 |
148 | There are six Huskers who have played in more than 75 sets and who have more than 10 kills.
149 |
150 | But how good are they as killers compared to the rest of college volleyball?
151 |
152 | ### Exercise 4: Layering in Nebraska
153 |
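154 | When you add another beeswarm, the pattern is the same as the first: another `geom_beeswarm` with its own data and its own color. Older versions of the library also wanted you to tell it whether you were grouping on the x value and would warn you if you didn't. The code here doesn't pass that argument, so if you see a warning about it, check which version of ggbeeswarm you have installed.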
155 |
156 | ```{r beeswarm7, exercise=TRUE, exercise.setup = "beeswarm-load-data", message=FALSE}
157 | ggplot() +
158 | geom_beeswarm(
159 | data=____,
160 | aes(x=____, y=____), color="grey") +
161 | geom_beeswarm(
162 | data=____,
163 | aes(x=____, y=____), color="red")
164 | ```
165 | ```{r beeswarm7-solution, exercise.reveal_solution = FALSE}
166 | ggplot() +
167 | geom_beeswarm(
168 | data=activeplayers,
169 | aes(x=position, y=kills), color="grey") +
170 | geom_beeswarm(
171 | data=nu,
172 | aes(x=position, y=kills), color="red")
173 | ```
174 | ```{r beeswarm7-check}
175 | grade_this_code()
176 | ```
177 |
178 | What does this say about Nebraska? The number of players, where they rank, by position?
179 |
180 | ### Exercise 5: Labeling
181 |
182 | This is where we can use ggrepel. Let's add a text layer and label the dots with the player's name.
183 |
184 | ```{r beeswarm8, exercise=TRUE, exercise.setup = "beeswarm-load-data", message=FALSE}
185 | ggplot() +
186 | geom_beeswarm(
187 | data=____,
188 | aes(x=____, y=____), color="grey") +
189 | geom_beeswarm(
190 | data=____,
191 | aes(x=____, y=____), color="red") +
192 | geom_text_repel(
193 | data=nu,
194 | aes(x=____, y=____, label=name))
195 | ```
196 | ```{r beeswarm8-solution, exercise.reveal_solution = FALSE}
197 | ggplot() +
198 | geom_beeswarm(
199 | data=activeplayers,
200 | aes(x=position, y=kills), color="grey") +
201 | geom_beeswarm(
202 | data=nu,
203 | aes(x=position, y=kills), color="red") +
204 | geom_text_repel(
205 | data=nu,
206 | aes(x=position, y=kills, label=name))
207 | ```
208 | ```{r beeswarm8-check}
209 | grade_this_code()
210 | ```
211 |
212 | What stories do you see?
213 |
214 | ## A few other options
215 |
216 | There are a couple of variations on `geom_beeswarm` that may work better for your application. They are `geom_quasirandom`, which also comes from the ggbeeswarm library, and `geom_jitter`, which is part of ggplot2 itself.
217 |
218 | There's not a lot to change from our example to see what they do.
219 |
220 | ```{r beeswarm9, exercise=TRUE, exercise.setup = "beeswarm-load-data", message=FALSE}
221 | ggplot() +
222 | geom_quasirandom(
223 | data=activeplayers,
224 | aes(x=position, y=kills), color="grey") +
225 | geom_quasirandom(
226 | data=nu,
227 | aes(x=position, y=kills), color="red") +
228 | geom_text_repel(
229 | data=nu,
230 | aes(x=position, y=kills, label=name))
231 | ```
232 | ```{r beeswarm9-solution, exercise.reveal_solution = FALSE}
233 | ggplot() +
234 | geom_quasirandom(
235 | data=activeplayers,
236 | aes(x=position, y=kills), color="grey") +
237 | geom_quasirandom(
238 | data=nu,
239 | aes(x=position, y=kills), color="red") +
240 | geom_text_repel(
241 | data=nu,
242 | aes(x=position, y=kills, label=name))
243 | ```
244 | ```{r beeswarm9-check}
245 | grade_this_code()
246 | ```
247 |
248 | Quasirandom spreads out the dots you see in beeswarm using -- you guessed it -- quasirandom spacing.
249 |
250 | ```{r beeswarm10, exercise=TRUE, exercise.setup = "beeswarm-load-data", message=FALSE}
251 | ggplot() +
252 | geom_jitter(
253 | data=activeplayers,
254 | aes(x=position, y=kills), color="grey") +
255 | geom_jitter(
256 | data=nu,
257 | aes(x=position, y=kills), color="red") +
258 | geom_text_repel(
259 | data=nu,
260 | aes(x=position, y=kills, label=name))
261 | ```
262 | ```{r beeswarm10-solution, exercise.reveal_solution = FALSE}
263 | ggplot() +
264 | geom_jitter(
265 | data=activeplayers,
266 | aes(x=position, y=kills), color="grey") +
267 | geom_jitter(
268 | data=nu,
269 | aes(x=position, y=kills), color="red") +
270 | geom_text_repel(
271 | data=nu,
272 | aes(x=position, y=kills, label=name))
273 | ```
274 | ```{r beeswarm10-check}
275 | grade_this_code()
276 | ```
277 |
278 | `geom_jitter` scatters the dots randomly across the width of the column. By default it also nudges each dot up or down a little, so a dot's height is only approximately that player's kills.
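
If you want the dots to stay at their exact value, `geom_jitter` takes `width` and `height` arguments. Here's a minimal sketch on made-up data (the `toy` dataframe and the 0.2 width are assumptions for illustration) that turns off the vertical jitter and checks that the plotted heights match the data.

```r
library(ggplot2)

# Made-up kills for two positions
toy <- data.frame(
  position = c("OH", "OH", "OH", "MB", "MB"),
  kills = c(120, 120, 95, 80, 80)
)

set.seed(1234)

# height = 0 turns off the up-and-down jitter; width controls the sideways spread
toy_jitter <- ggplot() +
  geom_jitter(
    data = toy,
    aes(x = position, y = kills),
    width = 0.2, height = 0, color = "grey")

# Pull the coordinates ggplot will actually draw
plotted <- layer_data(toy_jitter)

# TRUE: every dot sits at its real kill count
all(plotted$y == toy$kills)
```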
279 |
280 | Which one is right for you? You're going to have to experiment and decide. This is the art in the art and science.
281 |
282 | ## The Recap
283 |
284 | In this lesson, we've explored the power of beeswarm plots for visualizing the distribution of individual data points within categories in sports data. We started by learning how to prepare data, filtering for active players to ensure meaningful visualizations. We then progressed from creating basic beeswarm plots to more complex layered plots, highlighting specific teams within the context of all players. We covered important techniques such as setting a random seed for reproducibility, using color to differentiate groups, and adding non-overlapping labels with ggrepel. We also explored variations of the beeswarm plot, including quasirandom and jitter plots, discussing how each can offer different insights into data distribution. Throughout the lesson, we've seen how beeswarm plots can reveal patterns in player performance across different positions in volleyball, allowing for nuanced comparisons between teams and the broader player population. Remember, while beeswarm plots are powerful tools for visualizing categorical-numerical relationships, the choice between beeswarm, quasirandom, or jitter plots often depends on your specific data and the story you're trying to tell. Always experiment with different approaches to find the most effective visualization for your sports analytics insights.
285 |
286 | ## Terms to Know
287 |
288 | - Beeswarm Plot: A type of plot that displays the distribution of individual data points within categories, combining elements of scatterplots, histograms, and bar charts.
289 | - `ggbeeswarm`: An R package that extends ggplot2 to create beeswarm and related plots.
290 | - `geom_beeswarm()`: A function from the ggbeeswarm package used to create beeswarm plots in ggplot2.
291 | - `set.seed()`: An R function used to set a seed for the random number generator, ensuring reproducibility in plots with random elements.
292 | - Jittering: A technique of adding small random offsets to data points to reduce overplotting in visualizations.
293 | - `geom_quasirandom()`: A function from ggbeeswarm that creates a variation of the beeswarm plot with quasirandom point placement.
294 | - `geom_jitter()`: A ggplot2 function that creates a plot similar to a beeswarm plot but with points scattered randomly across the width of each category.
--------------------------------------------------------------------------------
/inst/tutorials/23-dumbbell-charts/23-dumbbell-charts.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 23: Dumbbell charts"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to show the difference between two points on a number line.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | library(ggalt)
19 | library(ggtext)
20 | knitr::opts_chunk$set(echo = FALSE)
21 | tutorial_options(exercise.completion=FALSE)
22 | ```
23 |
24 | ## The Goal
25 |
26 | The goal of this lesson is to introduce you to dumbbell charts, a powerful visualization technique for comparing two related values across different categories. By the end of this tutorial, you'll understand when to use dumbbell charts, how to create them using the ggalt package in R, and how to customize them for clearer data representation. You'll learn to prepare data for dumbbell charts, create basic plots, add colors to differentiate between measures, and use reordering to enhance the story your data tells. We'll apply these concepts to real college football data, visualizing the difference between giveaways and takeaways for teams in the Big Ten conference. This lesson will enhance your ability to effectively communicate comparisons between paired data points, a crucial skill for analyzing and presenting sports statistics such as offensive versus defensive performance metrics.
27 |
28 | ## The Basics
29 |
30 | Second to my love of waffle charts because of their name and because I'm always hungry, dumbbell charts are an excellently named way of **showing the difference between two things on a number line** -- a start and a finish, for instance. Or the difference between two related things. Say, turnovers and assists.
31 |
32 | Dumbbell charts come batteries included in `ggalt` which we used in Lesson 25. Like usual, you already installed this if you followed the install instructions at the beginning of the course. But if you're having trouble getting this to work, go to your console in RStudio and install it with `install.packages("ggalt")`.
33 |
34 | Let's give it a whirl.
35 |
36 | ```{r load-tidyverse, exercise=TRUE, message=FALSE, warning=FALSE}
37 | library(tidyverse)
38 | library(ggalt)
39 | library(ggtext)
40 |
41 | ```
42 | ```{r load-tidyverse-solution}
43 | library(tidyverse)
44 | library(ggalt)
45 | library(ggtext)
46 | ```
47 | ```{r load-tidyverse-check}
48 | grade_this_code()
49 | ```
50 |
51 | For this, let's use college football game logs from this season. Load it.
52 |
53 | ```{r dumbbell-load-data, message=FALSE, warning=FALSE}
54 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/footballlogs24.csv")
55 |
56 | turnovers <- logs |>
57 | group_by(Team, Conference) |>
58 | summarise(
59 | Giveaways = sum(TotalTurnovers),
60 | Takeaways = sum(DefTotalTurnovers)) |>
61 | filter(Conference == "Big Ten Conference")
62 | ```
63 | ```{r dumbbell-load-data-exercise, exercise = TRUE}
64 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/footballlogs24.csv")
65 | ```
66 | ```{r dumbbell-load-data-exercise-solution}
67 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/footballlogs24.csv")
68 | ```
69 | ```{r dumbbell-load-data-exercise-check}
70 | grade_this_code()
71 | ```
72 |
73 | Let's remember what we're looking at here:
74 |
75 | ```{r head-data, exercise=TRUE, exercise.setup = "dumbbell-load-data"}
76 | head(logs)
77 | ```
78 | ```{r head-data-solution}
79 | head(logs)
80 | ```
81 | ```{r head-data-check}
82 | grade_this_code()
83 | ```
84 |
85 | For this example, let's look at the difference between a team's giveaways -- turnovers lost -- versus takeaways, or turnovers gained.
86 |
87 | ### Exercise 1: Preparing the data
88 |
89 | To get this, we're going to add up all offensive turnovers and defensive turnovers for a team in a season and take a look at where they come out. Remember: TotalTurnovers is giving the ball away (bad turnovers) and DefTotalTurnovers is takeaways, good turnovers. To make this readable, I'm going to focus on the Big Ten.
90 |
91 | ```{r dumbbell1, exercise=TRUE, exercise.setup = "dumbbell-load-data", message=FALSE}
92 | turnovers <- ____ |>
93 | group_by(Team, Conference) |>
94 | summarise(
95 | Giveaways = sum(____Turnovers),
96 | Takeaways = sum(____Turnovers)) |>
97 | filter(Conference == "Big Ten Conference")
98 | ```
99 | ```{r dumbbell1-solution, exercise.reveal_solution = FALSE}
100 | turnovers <- logs |>
101 | group_by(Team, Conference) |>
102 | summarise(
103 | Giveaways = sum(TotalTurnovers),
104 | Takeaways = sum(DefTotalTurnovers)) |>
105 | filter(Conference == "Big Ten Conference")
106 | ```
107 | ```{r dumbbell1-check}
108 | grade_this_code()
109 | ```
110 |
111 | Now, the way that the `geom_dumbbell` works is pretty simple when viewed through what we've done before. There are just some tweaks.
112 |
113 | ### Exercise 2: The first dumbbell
114 |
115 | First: We start with the y axis. The reason is we want our dumbbells going left and right, so the label is going to be on the y axis.
116 |
117 | Second: Our x is actually two things: x and xend. What you put in there will decide where on the line the dot appears.
118 |
119 | Because we grouped by and summarized earlier, we have three fields to work with: Team, Takeaways and Giveaways.
120 |
121 | ```{r dumbbell2, exercise=TRUE, exercise.setup = "dumbbell-load-data", message=FALSE}
122 | ggplot() +
123 | geom_dumbbell(
124 | data=____,
125 | aes(y=____, x=____, xend=____)
126 | )
127 | ```
128 | ```{r dumbbell2-solution, exercise.reveal_solution = FALSE}
129 | ggplot() +
130 | geom_dumbbell(
131 | data=turnovers,
132 | aes(y=Team, x=Takeaways, xend=Giveaways)
133 | )
134 | ```
135 | ```{r dumbbell2-check}
136 | grade_this_code()
137 | ```
138 |
139 | Well, that's a chart alright, but what dot is the giveaways and what are the takeaways? To fix this, we'll add colors.
140 |
141 | ### Exercise 3: Colors and size
142 |
143 | So our choice of colors here is important. We want giveaways to be seen as bad and takeaways to be seen as good. So let's try red for giveaways and green for takeaways. To make this work, we'll need to set three things, all using the British spelling of color, so `colour`. The, uh, `colour` is the bar between the dots, the `colour_x` is the color of the x value dot and the `colour_xend` is the color of the xend dot. So in our setup, takeaways are x, they're good, so they're green. Giveaways are xend, they're bad, so they're red. While we're at it, we'll add a size to make the dots stand out.
144 |
145 | ```{r dumbbell3, exercise=TRUE, exercise.setup = "dumbbell-load-data", message=FALSE}
146 | ggplot() +
147 | geom_dumbbell(
148 | data=____,
149 | aes(y=____, x=Takeaways, xend=Giveaways),
150 | size = 2,
151 | colour = "____",
152 | colour_x = "____",
153 | colour_xend = "____")
154 | ```
155 | ```{r dumbbell3-solution, exercise.reveal_solution = FALSE}
156 | ggplot() +
157 | geom_dumbbell(
158 | data=turnovers,
159 | aes(y=Team, x=Takeaways, xend=Giveaways),
160 | size = 2,
161 | colour = "grey",
162 | colour_x = "green",
163 | colour_xend = "red")
164 | ```
165 | ```{r dumbbell3-check}
166 | grade_this_code()
167 | ```
168 |
169 | And now we have a chart that is trying to tell a story. We know, logically, that green on the right is good. A long distance between green and red? Better.
170 |
171 | ### Exercise 4: Arrange helps tell the story
172 |
173 | But what if we sort it by good turnovers?
174 |
175 | ```{r dumbbell4, exercise=TRUE, exercise.setup = "dumbbell-load-data", message=FALSE}
176 | ggplot() +
177 | geom_dumbbell(
178 | data=____,
179 | aes(y=reorder(____, ____aways), x=Takeaways, xend=Giveaways),
180 | size = 2,
181 | colour = "____",
182 | colour_x = "____",
183 | colour_xend = "____")
184 | ```
185 | ```{r dumbbell4-solution, exercise.reveal_solution = FALSE}
186 | ggplot() +
187 | geom_dumbbell(
188 | data=turnovers,
189 | aes(y=reorder(Team, Takeaways), x=Takeaways, xend=Giveaways),
190 | size = 2,
191 | colour = "grey",
192 | colour_x = "green",
193 | colour_xend = "red")
194 | ```
195 | ```{r dumbbell4-check}
196 | grade_this_code()
197 | ```
198 |
199 | What story does this tell?
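
If you're curious what `reorder` is doing, here's a small sketch with made-up numbers. The team names and values are invented. It shows the order `reorder` produces when sorting by takeaways, and, as one possible alternative, by turnover margin.

```r
# Made-up season totals for three teams
teams <- c("Team A", "Team B", "Team C")
takeaways <- c(20, 5, 12)
giveaways <- c(18, 2, 15)

# Sorted by takeaways, smallest first: Team B, Team C, Team A
by_takeaways <- levels(reorder(teams, takeaways))
by_takeaways

# Sorted by margin (takeaways minus giveaways): Team C, Team A, Team B
by_margin <- levels(reorder(teams, takeaways - giveaways))
by_margin
```

On a chart, the first level is drawn at the bottom of the y axis, so the team with the largest value ends up on top.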
200 |
201 | ### Exercise 5: Which dot is which?
202 |
203 | One downside to the dumbbell chart library is that it doesn't handle legends well. Adding a box to label which is the red dot and which is the green dot is harder than it should be. An easier way is to use ggtext and make the colors red and green in some text explaining the graphic.
204 |
205 | With ggtext, we need to add `labs` to our chart -- a headline and some chatter, called a `title` and a `subtitle` in ggplot. Then, in the subtitle, we can add some HTML to our text that will turn text red and green to match our `colour_x` and `colour_xend`.
206 |
207 | ```{r dumbbell5, exercise=TRUE, exercise.setup = "dumbbell-load-data", message=FALSE}
208 | ggplot() +
209 | geom_dumbbell(
210 | data=____,
211 | aes(y=reorder(____, ____aways), x=Takeaways, xend=Giveaways),
212 | size = 2,
213 | colour = "____",
214 | colour_x = "____",
215 | colour_xend = "____") +
216 | labs(
217 | title="Giveaways vs takeaways in the Big Ten",
218 | subtitle = "If <span style='color:green'>takeaways</span> are on the right, that's good. If <span style='color:red'>giveaways</span> are on the right, that's bad."
219 | ) +
220 | theme(
221 | plot.subtitle = element_textbox_simple()
222 | )
223 | ```
224 | ```{r dumbbell5-solution, exercise.reveal_solution = FALSE}
225 | ggplot() +
226 | geom_dumbbell(
227 | data=turnovers,
228 | aes(y=reorder(Team, Takeaways), x=Takeaways, xend=Giveaways),
229 | size = 2,
230 | colour = "grey",
231 | colour_x = "green",
232 | colour_xend = "red") +
233 | labs(
234 | title="Giveaways vs takeaways in the Big Ten",
235 | subtitle = "If <span style='color:green'>takeaways</span> are on the right, that's good. If <span style='color:red'>giveaways</span> are on the right, that's bad."
236 | ) +
237 | theme(
238 | plot.subtitle = element_textbox_simple()
239 | )
240 | ```
241 | ```{r dumbbell5-check}
242 | grade_this_code()
243 | ```
244 |
245 | ## The Recap
246 |
247 | In this lesson, we've explored the power of dumbbell charts for visualizing comparisons between paired data points in sports statistics. We started by learning how to prepare data for dumbbell charts, grouping and summarizing game-level statistics into team-level metrics. We then progressed from creating basic dumbbell charts to more informative visualizations by adding meaningful colors, adjusting sizes for better visibility, and reordering data to highlight patterns. We covered important techniques such as using color to differentiate between positive and negative metrics (takeaways vs. giveaways), and leveraging the reorder function to sort teams based on performance. We also addressed the challenge of adding legends to dumbbell charts by incorporating color-coded text explanations using ggtext. Throughout the process, we saw how dumbbell charts can effectively show the relationship between two related metrics across multiple categories, providing insights into team performance in a visually intuitive way. Remember, the effectiveness of a dumbbell chart often lies in the thoughtful use of color, ordering, and supplementary text to guide the viewer's interpretation of the data.
248 |
249 | ## Terms to Know
250 |
251 |
252 | - Dumbbell Chart: A type of chart that visualizes the difference between two related values for multiple categories, resembling a dumbbell or barbell shape.
253 | - `ggalt`: An R package that extends ggplot2 to create alternative plot types, including dumbbell charts.
254 | - `geom_dumbbell()`: A function from the ggalt package used to create dumbbell charts in ggplot2.
255 | - `colour_x` and `colour_xend`: Parameters in geom_dumbbell() used to specify colors for the start and end points of each dumbbell.
256 | - `ggtext`: An R package that extends ggplot2's text rendering capabilities, allowing for formatted text in plot labels.
257 | - `element_textbox_simple()`: A function from ggtext that allows for the use of formatted text in plot subtitles and other text elements.
258 |
--------------------------------------------------------------------------------
/inst/tutorials/25-faceting/25-faceting.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 25: Faceting"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to make a lot of graphics with little code.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | knitr::opts_chunk$set(echo = FALSE)
19 | tutorial_options(exercise.completion=FALSE)
20 | ```
21 |
22 | ## The Goal
23 |
24 | The goal of this lesson is to introduce you to faceting, a powerful technique in ggplot2 for creating multiple related charts in a single visualization. By the end of this tutorial, you'll understand how to use facet_wrap() and facet_grid() to create small multiples, allowing you to compare trends across different categories or groups. You'll learn when to use faceting, how to implement it with various chart types, and how to customize faceted plots for better readability. We'll apply these concepts to real college football data, analyzing offensive performance across Big Ten teams. This lesson will enhance your ability to present complex, multi-dimensional data in a clear and intuitive format, enabling viewers to easily compare patterns and trends across different subsets of your data.
25 |
26 | ## The Basics
27 |
28 | Sometimes the easiest way to spot a trend is to chart a bunch of small things side by side. Edward Tufte, one of the best-known data visualization thinkers on the planet, calls this "small multiples." ggplot calls it a facet wrap or a facet grid, depending on how the charts are arranged.
29 |
30 | Who had the best offense in the Big Ten last season? We could answer this a number of ways, but the best way to show people would be visually. Let's use Small Multiples.
31 |
32 | As always, we start with libraries.
33 |
34 | ```{r load-tidyverse, exercise=TRUE}
35 | library(tidyverse)
36 | ```
37 | ```{r load-tidyverse-solution}
38 | library(tidyverse)
39 | ```
40 | ```{r load-tidyverse-check}
41 | grade_this_code()
42 | ```
43 |
44 | We're going to use the logs of college football games last season.
45 |
46 | ```{r facet-load-data, message=FALSE, warning=FALSE}
47 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/footballlogs24.csv")
48 |
49 | big10 <- logs |> filter(Conference == "Big Ten Conference")
50 | ```
51 | ```{r facet-load-data-exercise, exercise = TRUE}
52 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/footballlogs24.csv")
53 | ```
54 | ```{r facet-load-data-exercise-solution}
55 | logs <- read_csv("https://mattwaite.github.io/sportsdatafiles/footballlogs24.csv")
56 | ```
57 | ```{r facet-load-data-exercise-check}
58 | grade_this_code()
59 | ```
60 |
61 | Let's narrow our pile and look just at the Big Ten.
62 |
63 | ```{r facet1, exercise=TRUE, exercise.setup = "facet-load-data", message=FALSE}
64 | big10 <- logs |> filter(Conference == "Big Ten Conference")
65 | ```
66 | ```{r facet1-solution, exercise.reveal_solution = FALSE}
67 | big10 <- logs |> filter(Conference == "Big Ten Conference")
68 | ```
69 | ```{r facet1-check}
70 | grade_this_code()
71 | ```
72 |
73 | The first thing we can do is look at a line chart, like we have done in previous lessons.
74 |
75 | ### Exercise 1: Make a line chart
76 |
77 | We'll do something simple like a line chart of OffenseAvg -- the yards per play on offense -- and we're going to group on Team.
78 |
79 | ```{r facet2, exercise=TRUE, exercise.setup = "facet-load-data", message=FALSE}
80 | ggplot() +
81 | geom_____(data=____, aes(x=Date, y=____, group=____))
82 | ```
83 | ```{r facet2-solution, exercise.reveal_solution = FALSE}
84 | ggplot() +
85 | geom_line(data=big10, aes(x=Date, y=OffenseAvg, group=Team))
86 | ```
87 | ```{r facet2-check}
88 | grade_this_code()
89 | ```
90 |
91 | And, not surprisingly, we get a hairball. Who is the best offense week in and out in the Big Ten? We could color certain lines, but that would limit us to focus on one team. **What if we did all of them at once?** We do that with a `facet_wrap`.
92 |
93 | Will that give us an answer?
94 |
95 | ### Exercise 2: The facet wrap
96 |
97 | The only thing we MUST pass into a `facet_wrap` is the field we're going to separate the charts by. We precede that field with a tilde, and in our case we want the Team field.
98 |
99 | ```{r facet3, exercise=TRUE, exercise.setup = "facet-load-data", message=FALSE}
100 | ggplot() +
101 | geom_____(data=____, aes(x=Date, y=____, group=____)) +
102 | facet_wrap(~____)
103 | ```
104 | ```{r facet3-solution, exercise.reveal_solution = FALSE}
105 | ggplot() +
106 | geom_line(data=big10, aes(x=Date, y=OffenseAvg, group=Team)) +
107 | facet_wrap(~Team)
108 | ```
109 | ```{r facet3-check}
110 | grade_this_code()
111 | ```
112 |
113 | Answer: Not immediately clear, but we can look at this and analyze it. We could add a piece of annotation to help us out.
114 |
115 | ### Exercise 3: Put everyone in context
116 |
117 | Let's add a line to every chart that's the conference average yards per play.
118 |
119 | ```{r facet4, exercise=TRUE, exercise.setup = "facet-load-data", message=FALSE}
120 | big10 |> summarise(mean(OffenseAvg))
121 | ```
122 | ```{r facet4-solution, exercise.reveal_solution = FALSE}
123 | big10 |> summarise(mean(OffenseAvg))
124 | ```
125 | ```{r facet4-check}
126 | grade_this_code()
127 | ```
128 |
129 | Copy that number, and we're going to add it into the facet chart with a `geom_hline`.
130 |
131 | ```{r facet5, exercise=TRUE, exercise.setup = "facet-load-data", message=FALSE}
132 | ggplot() +
133 | geom_hline(yintercept=____, color="blue") +
134 | geom_____(data=____, aes(x=Date, y=____, group=____)) +
135 | facet_wrap(~____)
136 | ```
137 | ```{r facet5-solution, exercise.reveal_solution = FALSE}
138 | ggplot() +
139 | geom_hline(yintercept=5.708621, color="blue") +
140 | geom_line(data=big10, aes(x=Date, y=OffenseAvg, group=Team)) +
141 | facet_wrap(~Team)
142 | ```
143 | ```{r facet5-check}
144 | grade_this_code()
145 | ```
146 |
147 | What do you see here? How do teams compare? How do they change over time? I'm not asking you these questions because they're an assignment -- I'm asking because that's exactly what this chart helps answer. Your brain will immediately start making those connections.
148 |
149 | ## Facet grid vs facet wraps
150 |
151 | Facet grids allow us to put teams on the same plane, versus just repeating them. And we can specify that plane as vertical or horizontal. For example, here's our chart from above, but using facet_grid to stack them.
152 |
153 | ### Exercise 4: Facet grids
154 |
155 | ```{r facet6, exercise=TRUE, exercise.setup = "facet-load-data", message=FALSE}
156 | ggplot() +
157 | geom_hline(yintercept=____, color="blue") +
158 | geom_____(data=____, aes(x=Date, y=____, group=____)) +
159 | facet_grid(Team ~ .)
160 | ```
161 | ```{r facet6-solution, exercise.reveal_solution = FALSE}
162 | ggplot() +
163 | geom_hline(yintercept=5.708621, color="blue") +
164 | geom_line(data=big10, aes(x=Date, y=OffenseAvg, group=Team)) +
165 | facet_grid(Team ~ .)
166 | ```
167 | ```{r facet6-check}
168 | grade_this_code()
169 | ```
170 |
171 | And here they are next to each other:
172 |
173 | ```{r facet7, exercise=TRUE, exercise.setup = "facet-load-data", message=FALSE}
174 | ggplot() +
175 | geom_hline(yintercept=____, color="blue") +
176 | geom_____(data=____, aes(x=Date, y=____, group=____)) +
177 | facet_grid(. ~ Team)
178 | ```
179 | ```{r facet7-solution, exercise.reveal_solution = FALSE}
180 | ggplot() +
181 | geom_hline(yintercept=5.708621, color="blue") +
182 | geom_line(data=big10, aes(x=Date, y=OffenseAvg, group=Team)) +
183 | facet_grid(. ~ Team)
184 | ```
185 | ```{r facet7-check}
186 | grade_this_code()
187 | ```
188 |
189 | Two notes:
190 |
191 | 1. We'd have some work to do with the labeling on this -- we'll get to that -- but you can see where this is valuable comparing a group of things. One warning: Don't go too crazy with this or it loses its visual power.
192 | 2. There's nothing special about a line chart with faceting. You can facet any kind of ggplot chart.
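
To illustrate the second note, here's a minimal sketch of faceting a bar chart instead of a line chart. The `toy_games` data is made up, and `ncol = 1` is an optional `facet_wrap` argument, used here to stack the charts in a single column.

```r
library(ggplot2)

# Made-up yards per play for two teams over three weeks
toy_games <- data.frame(
  Team = rep(c("Team A", "Team B"), each = 3),
  Week = rep(1:3, times = 2),
  OffenseAvg = c(5.0, 6.0, 7.0, 4.0, 5.0, 6.0)
)

# Same faceting idea, different geom: one small bar chart per team
toy_bars <- ggplot() +
  geom_bar(data = toy_games, aes(x = Week, weight = OffenseAvg)) +
  facet_wrap(~Team, ncol = 1)

toy_bars
```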
193 |
194 | ## The Recap
195 |
196 | In this lesson, we've explored the power of faceting in ggplot2 to create small multiples, a technique that allows for easy comparison across multiple categories or groups. We started by creating a basic line chart of offensive performance for Big Ten football teams, then transformed it into a faceted plot using facet_wrap(). We learned how to add context to our faceted plots by including a reference line for the conference average. We then explored the difference between facet_wrap() and facet_grid(), understanding how each can be used to arrange our small multiples either in a grid or in rows/columns. We saw how faceting can reveal patterns and trends that might be obscured in a single, overlapping chart. Throughout the lesson, we applied these techniques to real college football data, demonstrating how faceting can provide insights into team performance over time. Remember, while faceting is a powerful tool, it's important to use it judiciously to maintain clarity in your visualizations. As we progressed, we also touched on the importance of proper labeling and the versatility of faceting across different chart types.
197 |
198 | ## Terms to Know
199 |
200 | - Faceting: A technique in ggplot2 for creating multiple related charts in a single visualization, allowing for easy comparison across categories or groups.
201 | - Small Multiples: A series of similar graphs or charts using the same scale and axes, allowing for comparison across different categories or time periods.
202 | - `facet_wrap()`: A ggplot2 function that wraps a series of plots into a rectangular layout.
203 | - `facet_grid()`: A ggplot2 function that creates a grid of plots with fixed horizontal and vertical axes.
204 | - Tilde (~): In the context of faceting, used to specify the variable(s) by which to facet the plot.
--------------------------------------------------------------------------------
/inst/tutorials/27-encircling-points/27-encircling-points.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 27: Shapes as storytelling"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to draw attention to some points.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | library(ggalt)
19 | library(ggrepel)
20 | knitr::opts_chunk$set(echo = FALSE)
21 | tutorial_options(exercise.completion=FALSE)
22 | ```
23 |
24 | ## The Goal
25 |
26 | The goal of this lesson is to introduce you to the technique of encircling points on a scatterplot using the ggalt package in R and annotation with rectangles and `geom_rect`. By the end of this tutorial, you'll understand how to use the `geom_encircle()` and `annotate` functions to draw attention to specific data points or groups of points in your visualizations. You'll learn how to create basic scatterplots, layer additional data points, and then highlight key information using circles and text labels. We'll apply these concepts to real college basketball data, analyzing player performance in terms of shooting percentage and rebounds. This lesson will enhance your ability to create more informative and visually compelling scatterplots, allowing you to guide your audience's attention to the most important aspects of your data story in sports analytics.
27 |
28 | ## The Basics
29 |
30 | One thing we've talked about all semester is drawing attention to the thing you want to draw attention to. We've used color and labels to do that so far. Let's add another layer to it -- a shape around the point or points you want to highlight.
31 |
32 | Remember: The point of all of this is to draw the eye to what you are trying to show your reader. You want people to see the story you are trying to tell.
33 |
34 | It's not hard to draw a shape in ggplot -- it is a challenge to put it in the right place. But, there is a library to the rescue that makes this super easy -- `ggalt`.
35 |
36 | Like usual, you already installed this if you followed the install instructions at the beginning of the course. But if you're having trouble getting this to work, go to your console in RStudio and install it with `install.packages("ggalt")`.
37 |
38 | There's a bunch of things that `ggalt` does, but one of the most useful for us is the function `geom_encircle`. Let's dive in.
39 |
40 | ```{r load-tidyverse, exercise=TRUE}
41 | library(tidyverse)
42 | library(ggalt)
43 | library(ggrepel)
44 | ```
45 | ```{r load-tidyverse-solution}
46 | library(tidyverse)
47 | library(ggalt)
48 | library(ggrepel)
49 | ```
50 | ```{r load-tidyverse-check}
51 | grade_this_code()
52 | ```
53 |
54 | Let's say we want to highlight Brice Williams and his place in college basketball. Let's take a look at this season's player data.
55 |
56 | And while we're loading it, let's filter out anyone who hasn't played 100 minutes.
57 |
58 | ```{r encircle-load-data, message=FALSE, warning=FALSE}
59 | players <- read_csv("https://mattwaite.github.io/sportsdatafiles/players25.csv") |> filter(MP > 100)
60 |
61 | nu <- players |> filter(Team == "Nebraska Cornhuskers Men's")
62 |
63 | bw <- nu |> filter(Player == "Brice Williams")
64 | ```
65 | ```{r encircle-load-data-exercise, exercise = TRUE}
66 | players <- read_csv("https://mattwaite.github.io/sportsdatafiles/players25.csv") |>
67 | filter(MP > 100)
68 | ```
69 | ```{r encircle-load-data-exercise-solution}
70 | players <- read_csv("https://mattwaite.github.io/sportsdatafiles/players25.csv") |>
71 | filter(MP > 100)
72 | ```
73 | ```{r encircle-load-data-exercise-check}
74 | grade_this_code()
75 | ```
76 |
77 | We've done this before, but let's make a standard scatterplot of points and player efficiency rating.
78 |
79 | ### Exercise 1: Making the first scatterplot
80 |
81 | The question we're after here: Who is scoring a lot of points, and who is using his chances to score efficiently? Our data is `players`, and our two fields are `PTS` and `PER`.
82 |
83 | ```{r encircle1, exercise=TRUE, exercise.setup = "encircle-load-data", message=FALSE}
84 | ggplot() + geom_point(data=____, aes(x=____, y=____))
85 | ```
86 | ```{r encircle1-solution, exercise.reveal_solution = FALSE}
87 | ggplot() + geom_point(data=players, aes(x=PTS, y=PER))
88 | ```
89 | ```{r encircle1-check}
90 | grade_this_code()
91 | ```
92 |
93 | We can see right away that there are some dots at the edges -- lots of points but maybe they aren't contributing elsewhere. Where are the Huskers in this? We can get that with filtering and layering.
94 |
95 | ### Exercise 2: Adding the Huskers as a layer
96 |
97 | Like we have done in the past, let's make a dataframe of Nebraska players and add a layer to our graph that puts Nebraska's dots in red. While we're at it, we'll make everyone else grey. And let's label the Nebraska players.
98 |
99 | ```{r encircle2, exercise=TRUE, exercise.setup = "encircle-load-data", message=FALSE}
100 | nu <- players |> filter(Team == "Nebraska Cornhuskers Men's")
101 |
102 | ggplot() +
103 | geom_point(data=____, aes(x=____, y=____), color="grey") +
104 | geom_point(data=____, aes(x=____, y=____), color="red") +
105 | geom_text_repel(data=____, aes(x=PTS, y=PER, label=Player))
106 | ```
107 | ```{r encircle2-solution, exercise.reveal_solution = FALSE}
108 | nu <- players |> filter(Team == "Nebraska Cornhuskers Men's")
109 |
110 | ggplot() +
111 | geom_point(data=players, aes(x=PTS, y=PER), color="grey") +
112 | geom_point(data=nu, aes(x=PTS, y=PER), color="red") +
113 | geom_text_repel(data=nu, aes(x=PTS, y=PER, label=Player))
114 | ```
115 | ```{r encircle2-check}
116 | grade_this_code()
117 | ```
118 |
119 | Given that this chart is about Brice Williams, I'll give you three guesses who that dot is scoring a lot of points really efficiently. Your first two guesses don't count and his last name rhymes with rice illiams.
120 |
121 | But just this quickly, we're on the path to something publishable. We'll need to label a dot and we'll need to drop the default grey and add some headlines and all that. And, for the most part, we've got a solid chart.
122 |
123 | But what if we could really draw the eye to a dot or group of them? In `ggalt`, there is a geom called `geom_encircle`, which ... does what you think it does. It encircles all the dots in a dataset.
124 |
125 | ### Exercise 3: geom_encircle
126 |
127 | Now we'll add `geom_encircle`. For the data, we'll use a new dataframe called `bw` that holds just Brice Williams, and we'll copy the aes from our `nu` geom_point. Then, we need to give the encirclement a shape using `s_shape` -- which is a number between 0 and 1 -- and then how far away from the dots to draw the circle using `expand`, which is another number between 0 and 1.
128 |
129 | Let's start with `s_shape` .03 and `expand` .03. Getting to the right numbers involves a lot of trial and error. Sometimes you'll want a bigger circle -- closer to one -- and sometimes you'll want a small one.
130 |
131 | ```{r encircle3, exercise=TRUE, exercise.setup = "encircle-load-data", message=FALSE}
132 | bw <- nu |> filter(Player == "Brice Williams")
133 |
134 | ggplot() +
135 | geom_point(data=____, aes(x=____, y=____), color="grey") +
136 | geom_point(data=____, aes(x=____, y=____), color="red") +
137 | geom_text_repel(data=nu, aes(x=PTS, y=PER, label=Player)) +
138 | geom_encircle(data=bw, aes(x=____, y=____), s_shape=.03, expand=.03, colour="red")
139 | ```
140 | ```{r encircle3-solution, exercise.reveal_solution = FALSE}
141 | bw <- nu |> filter(Player == "Brice Williams")
142 |
143 | ggplot() +
144 | geom_point(data=players, aes(x=PTS, y=PER), color="grey") +
145 | geom_point(data=nu, aes(x=PTS, y=PER), color="red") +
146 | geom_text_repel(data=nu, aes(x=PTS, y=PER, label=Player)) +
147 | geom_encircle(data=bw, aes(x=PTS, y=PER), s_shape=.03, expand=.03, colour="red")
148 | ```
149 | ```{r encircle3-check}
150 | grade_this_code()
151 | ```
152 |
153 | That's not bad. If you disagree, now you begin the process of fiddling with `s_shape` and `expand` until you get it just right.
154 |
155 | ## Using annotate to draw rectangles
156 |
157 | What if, instead of a circle, you wanted to add a rectangle to highlight the zone of good players (or something like that)? Rectangles work great to highlight zones or, if you have time as your x axis, events and time periods. Think of it this way -- if you had a line chart of Nebraska's wins over time, you could add rectangles over the years for certain coaches you wanted to highlight.
158 |
159 | ### Exercise 4: Using a rectangle instead of a circle
160 |
161 | So how do you make rectangles? With `annotate` and some interpolation. What's interpolation? It's looking at the x and y axes and interpreting where to draw the lines.
162 |
163 | Looking at our previous work, what if we use 200 points and a PER of 20 as the floor for our good zone? The ceiling is the top of the chart -- and ggplot conveniently lets us use infinity (`Inf`) for that.
164 |
165 | One more thing? We should label our zone. It looks like there's a spot at 300 points and 35 PER to fit some text.
166 |
167 | ```{r encircle4, exercise=TRUE, exercise.setup = "encircle-load-data", message=FALSE}
168 | ggplot() +
169 | geom_point(data=players, aes(x=PTS, y=PER), color="grey") +
170 | geom_point(data=nu, aes(x=PTS, y=PER), color="red") +
171 | geom_text_repel(data=nu, aes(x=PTS, y=PER, label=Player)) +
172 | annotate("rect", fill = "blue", alpha = 0.1,
173 | xmin = ____, xmax = Inf,
174 | ymin = ____, ymax = Inf) +
175 | geom_text(aes(x=____, y=____, label="Good player zone"), size=3)
176 | ```
177 | ```{r encircle4-solution, exercise.reveal_solution = FALSE}
178 | ggplot() +
179 | geom_point(data=players, aes(x=PTS, y=PER), color="grey") +
180 | geom_point(data=nu, aes(x=PTS, y=PER), color="red") +
181 | geom_text_repel(data=nu, aes(x=PTS, y=PER, label=Player)) +
182 | annotate("rect", fill = "blue", alpha = 0.1,
183 | xmin = 200, xmax = Inf,
184 | ymin = 20, ymax = Inf) +
185 | geom_text(aes(x=300, y=35, label="Good player zone"), size=3)
186 | ```
187 | ```{r encircle4-check}
188 | grade_this_code()
189 | ```
190 |
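A minimal sketch of the same idea on made-up data, in case you want to experiment outside the graded exercise. The `toy` dataframe and its values are invented for illustration, and `annotate("text", ...)` is shown here as an alternative to `geom_text()` for a one-off label, not as the lesson's required approach.

```r
library(ggplot2)

# Hypothetical data standing in for the players dataframe
toy <- data.frame(
  PTS = c(50, 120, 210, 340, 410),
  PER = c(8, 14, 22, 27, 31)
)

p <- ggplot() +
  geom_point(data = toy, aes(x = PTS, y = PER), color = "grey") +
  # Inf stretches the rectangle to the edge of the plotting area
  annotate("rect", fill = "blue", alpha = 0.1,
           xmin = 200, xmax = Inf,
           ymin = 20, ymax = Inf) +
  # annotate("text") places a single label without needing a dataframe
  annotate("text", x = 300, y = 35, label = "Good player zone", size = 3)

p
```

Because both the rectangle and the label are annotations, neither depends on the data, so you can move the zone by changing only the four boundary values.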
191 | Now let's clean this up and make it presentable with labels, a theme and some alterations to that theme.
192 |
193 | ```{r encircle5, exercise=TRUE, exercise.setup = "encircle-load-data", message=FALSE}
194 | ggplot() +
195 | geom_point(data=players, aes(x=PTS, y=PER), color="grey") +
196 | geom_point(data=nu, aes(x=PTS, y=PER), color="red") +
197 | geom_text_repel(data=nu, aes(x=PTS, y=PER, label=Player)) +
198 | annotate("rect", fill = "blue", alpha = 0.1,
199 | xmin = ____, xmax = Inf,
200 | ymin = ____, ymax = Inf) +
201 | geom_text(aes(x=300, y=35, label="Good player zone"), size=3) +
202 | labs(title="Brice Williams is Nebraska's MVP", subtitle="Not just Nebraska's best, but one of the best in the Big Ten.", x="Points scored", y="Player Efficiency Rating") +
203 | theme_minimal() +
204 | theme(
205 | plot.title = element_text(size = 16, face = "bold"),
206 | axis.title = element_text(size = 8),
207 | plot.subtitle = element_text(size=10),
208 | panel.grid.minor = element_blank()
209 | )
210 | ```
211 | ```{r encircle5-solution, exercise.reveal_solution = FALSE}
212 | ggplot() +
213 | geom_point(data=players, aes(x=PTS, y=PER), color="grey") +
214 | geom_point(data=nu, aes(x=PTS, y=PER), color="red") +
215 | geom_text_repel(data=nu, aes(x=PTS, y=PER, label=Player)) +
216 | annotate("rect", fill = "blue", alpha = 0.1,
217 | xmin = 200, xmax = Inf,
218 | ymin = 20, ymax = Inf) +
219 | geom_text(aes(x=300, y=35, label="Good player zone"), size=3) +
220 | labs(title="Brice Williams is Nebraska's MVP", subtitle="Not just Nebraska's best, but one of the best in the Big Ten.", x="Points scored", y="Player Efficiency Rating") +
221 | theme_minimal() +
222 | theme(
223 | plot.title = element_text(size = 16, face = "bold"),
224 | axis.title = element_text(size = 8),
225 | plot.subtitle = element_text(size=10),
226 | panel.grid.minor = element_blank()
227 | )
228 | ```
229 | ```{r encircle5-check}
230 | grade_this_code()
231 | ```
232 |
233 | ## The Recap
234 |
235 | In this lesson, we've explored the powerful technique of encircling points on a scatterplot using the ggalt package in R and making rectangles with annotate. We started by creating a basic scatterplot of college basketball player performance, focusing on points and efficiency. We then learned how to layer additional data points to highlight specific teams or players. The key skill we developed was using `geom_encircle()` and `annotate()` to draw attention to particular data points, in this case showcasing Brice Williams' exceptional performance for Nebraska. Throughout the lesson, we emphasized the importance of visual hierarchy and guiding the viewer's eye to the most crucial information. We also touched on how to make our visualizations more polished and publication-ready by adding appropriate titles, labels, and theming. Remember, the goal of these techniques is not just to present data, but to tell a compelling story with your visualizations.
236 |
237 | ## Terms to Know
238 |
239 | - `ggalt`: An extension package for ggplot2 that provides additional geoms and statistical transformations.
240 | - `geom_encircle()`: A function from ggalt that allows you to draw a circle or shape around specified data points on a plot.
241 | - `s_shape`: A parameter in geom_encircle() that controls the shape of the encircling line, with values between 0 and 1.
242 | - `expand`: A parameter in geom_encircle() that determines how far the encircling line extends from the data points.
--------------------------------------------------------------------------------
/inst/tutorials/28-blogging/images/blog1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/28-blogging/images/blog1.png
--------------------------------------------------------------------------------
/inst/tutorials/28-blogging/images/blog12.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/28-blogging/images/blog12.png
--------------------------------------------------------------------------------
/inst/tutorials/28-blogging/images/blog13.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/28-blogging/images/blog13.png
--------------------------------------------------------------------------------
/inst/tutorials/28-blogging/images/blog14.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/28-blogging/images/blog14.png
--------------------------------------------------------------------------------
/inst/tutorials/28-blogging/images/blog15.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/28-blogging/images/blog15.png
--------------------------------------------------------------------------------
/inst/tutorials/28-blogging/images/blog2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/28-blogging/images/blog2.png
--------------------------------------------------------------------------------
/inst/tutorials/28-blogging/images/blog3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/28-blogging/images/blog3.png
--------------------------------------------------------------------------------
/inst/tutorials/28-blogging/images/blog4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/28-blogging/images/blog4.png
--------------------------------------------------------------------------------
/inst/tutorials/28-blogging/images/blog5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/28-blogging/images/blog5.png
--------------------------------------------------------------------------------
/inst/tutorials/28-blogging/images/blog7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/28-blogging/images/blog7.png
--------------------------------------------------------------------------------
/inst/tutorials/28-blogging/images/blog9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/28-blogging/images/blog9.png
--------------------------------------------------------------------------------
/inst/tutorials/31-color/31-color.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 31: Using color the right way"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how the limited use of color can draw attention and focus your story.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | library(ggrepel)
19 | library(ggtext)
20 | knitr::opts_chunk$set(echo = FALSE)
21 | tutorial_options(exercise.completion=FALSE)
22 | ```
23 |
24 | ## The Goal
25 |
26 | The goal of this lesson is to introduce you to the strategic use of color in sports data visualizations. By the end of this tutorial, you'll understand how to effectively use color to enhance your charts and graphs, drawing attention to key data points and reinforcing your story. You'll learn the basic rules of color usage in data visualization, including limiting color palette, using contrast for emphasis, and avoiding color abuse. We'll apply these concepts to real sports data, demonstrating how thoughtful color choices can transform a basic chart into a compelling visual narrative. This lesson will sharpen your ability to create more professional and impactful visualizations for your sports analytics blog, helping you guide your audience's attention and effectively communicate complex insights through color.
27 |
28 | ## The Basics
29 |
30 | The main focus of this whole class -- indeed the whole sports media major -- is to tell a story. If the chart is not telling a story, then it doesn't serve a purpose and we've failed. And if our chart does tell a story, but the reader can't find it, then that means we've still failed the main purpose.
31 |
32 | Some charts do a lot of work showing the reader what the story is before we even add anything to it. Some need more help. One way we can offer that help is to use color.
33 |
34 | Color can be very powerful. It can also ruin a graphic. And the right use of color isn't science -- it's art. That means color has been argued about for centuries, even in the world of graphics.
35 |
36 | The basic rules of color we're going to use are:
37 |
38 | 1. Limit the number of colors. The fewer the better. If everything is a color, nothing is a color.
39 | 2. Use contrasting colors to draw attention. A color in a sea of grey stands out.
40 | 3. Don't abuse color. Choose appropriate colors, avoid the defaults.
41 |
42 | Where do these rules come from? Experience of people who have made graphics before. Looking at what has succeeded and what has failed.
43 |
44 | **Rule 1:** Alberto Cairo, a professor in the University of Miami School of Communication and expert in data visualization, wrote in his book The Functional Art that limiting color was "not just a minimalist aesthetic choice, but a practical one. Limiting the amount of colors and different fonts in your graphics will help you create a sense of unity in the composition."
45 |
46 | He went on:
47 |
48 | "The best way to disorient your readers is to fill your graphic with objects colored in pure accent tones. Pure colors are uncommon in nature, so limit them to highlight whatever is important in your graphics, and use subdued hues — grays, light blues, and greens — for everything else. In other words: If you know that the brain prioritizes what it pays attention to, prioritize beforehand."
49 |
50 | **Rule 2:** Swiss cartographer Eduard Imhof wrote in 1982 his first rule of color composition: "Pure, bright or very strong colors have loud unbearable effects when they stand unrelieved over large areas adjacent to each other, but extraordinary effects can be achieved when they are used sparingly on or between dull background tones."
51 |
52 | **Rule 3:** Edward Tufte, in Envisioning Information, wrote that adding color was the easy part. "The often scant benefits derived from coloring data indicate that putting a good color in a good place is a complex matter. Indeed, so difficult and subtle that avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm."
53 |
54 | ## An example of using color
55 |
56 | We're going to build a simple bar chart and layer in some color as an example. A story coming out of the NCAA tournament in 2023 might be how Big Ten teams again made up a large number of tournament teams but most didn't make it past the second round and Michigan State was the last B1G team dancing. So let's explore that.
57 |
58 | First load the tidyverse.
59 |
60 | ```{r load-tidyverse, exercise=TRUE}
61 | library(tidyverse)
62 | ```
63 | ```{r load-tidyverse-solution}
64 | library(tidyverse)
65 | ```
66 | ```{r load-tidyverse-check}
67 | grade_this_code()
68 | ```
69 |
70 | And the data. We're going to do four things here: first, we'll load in the season stats for each team in college basketball in 2023. Then we'll filter for Big Ten teams to get us a dataframe of 14 teams called `big`. We'll then filter for B1G teams that made the tournament, which we'll call `dance`. Lastly, we'll make a dataframe of just Michigan State called `mi`.
71 |
72 | ```{r colors-load-data, message=FALSE, warning=FALSE}
73 | stats <- read_csv("https://mattwaite.github.io/sportsdatafiles/stats23.csv")
74 |
75 | teams <- c("Nebraska", "Iowa", "Northwestern", "Minnesota", "Wisconsin", "Illinois", "Indiana", "Purdue", "Michigan", "Michigan State", "Ohio State", "Penn State", "Rutgers", "Maryland")
76 |
77 | tournament <- c("Iowa", "Northwestern", "Maryland", "Indiana", "Penn State", "Illinois", "Michigan State", "Purdue")
78 |
79 | big <- stats |> filter(School %in% teams)
80 |
81 | dance <- stats |> filter(School %in% tournament)
82 |
83 | mi <- stats |> filter(School == "Michigan State")
84 | ```
85 | ```{r colors-load-data-exercise, exercise = TRUE}
86 | stats <- read_csv("https://mattwaite.github.io/sportsdatafiles/stats23.csv")
87 |
88 | teams <- c("Nebraska", "Iowa", "Northwestern", "Minnesota", "Wisconsin", "Illinois", "Indiana", "Purdue", "Michigan", "Michigan State", "Ohio State", "Penn State", "Rutgers", "Maryland")
89 |
90 | tournament <- c("Iowa", "Northwestern", "Maryland", "Indiana", "Penn State", "Illinois", "Michigan State", "Purdue")
91 |
92 | big <- stats |> filter(School %in% teams)
93 |
94 | dance <- stats |> filter(School %in% tournament)
95 |
96 | mi <- stats |> filter(School == "Michigan State")
97 | ```
98 | ```{r colors-load-data-exercise-solution}
99 | stats <- read_csv("https://mattwaite.github.io/sportsdatafiles/stats23.csv")
100 |
101 | teams <- c("Nebraska", "Iowa", "Northwestern", "Minnesota", "Wisconsin", "Illinois", "Indiana", "Purdue", "Michigan", "Michigan State", "Ohio State", "Penn State", "Rutgers", "Maryland")
102 |
103 | tournament <- c("Iowa", "Northwestern", "Maryland", "Indiana", "Penn State", "Illinois", "Michigan State", "Purdue")
104 |
105 | big <- stats |> filter(School %in% teams)
106 |
107 | dance <- stats |> filter(School %in% tournament)
108 |
109 | mi <- stats |> filter(School == "Michigan State")
110 | ```
111 | ```{r colors-load-data-exercise-check}
112 | grade_this_code()
113 | ```
114 |
115 | And we can see that `big` data with `head`. The `dance` and `mi` dataframes have the same columns.
116 |
117 | ```{r head-data, exercise=TRUE, exercise.setup = "colors-load-data"}
118 | head(big)
119 | ```
120 | ```{r head-data-solution}
121 | head(big)
122 | ```
123 | ```{r head-data-check}
124 | grade_this_code()
125 | ```
126 |
127 | ### Exercise 1: Making the first bar
128 |
129 | Let's start by making a simple bar chart of `OverallSRS`. It's just a simple rating of each team. We're going to re-order it and coord_flip it like we've done before.
130 |
131 | ```{r colors-bar, exercise=TRUE, exercise.setup = "colors-load-data", message=FALSE}
132 | ggplot() +
133 | geom_bar(data=big, aes(x=reorder(School, ____), weight=____)) +
134 | coord_flip()
135 | ```
136 | ```{r colors-bar-solution, exercise.reveal_solution = FALSE}
137 | ggplot() +
138 | geom_bar(data=big, aes(x=reorder(School, OverallSRS), weight=OverallSRS)) +
139 | coord_flip()
140 | ```
141 | ```{r colors-bar-check}
142 | grade_this_code()
143 | ```
144 |
145 | Now we've got a base.
146 |
147 | ### Exercise 2: Using color to reduce attention
148 |
149 | Let's make that base fade into the background by changing the color to light grey (which is one word in ggplot). With bar charts, we're changing the `fill`. Changing the `color` will only change the outline of the bar.
150 |
151 | ```{r colors-bar2, exercise=TRUE, exercise.setup = "colors-load-data", message=FALSE}
152 | ggplot() +
153 | geom_bar(data=big, aes(x=reorder(School, ____), weight=____), fill="____") +
154 | coord_flip()
155 | ```
156 | ```{r colors-bar2-solution, exercise.reveal_solution = FALSE}
157 | ggplot() +
158 | geom_bar(data=big, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="lightgrey") +
159 | coord_flip()
160 | ```
161 | ```{r colors-bar2-check}
162 | grade_this_code()
163 | ```
164 |
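To see the difference between `fill` and `color` on a bar, here is a minimal sketch on made-up data. The `toy` dataframe, its school names and its ratings are invented for illustration and are not part of the lesson's data.

```r
library(ggplot2)

# Hypothetical ratings for three made-up schools
toy <- data.frame(
  School = c("A", "B", "C"),
  Rating = c(5, 12, 9)
)

p <- ggplot() +
  geom_bar(
    data = toy,
    aes(x = reorder(School, Rating), weight = Rating),
    fill = "lightgrey",  # interior of each bar
    color = "black"      # outline of each bar only
  ) +
  coord_flip()

p
```

Dropping the `color` argument removes the outline and leaves the light grey interior, which is the look the exercise above is after.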
165 | Now we can add layers.
166 |
167 | ### Exercise 3: More layers, more colors
168 |
169 | Now we're going to add in `dance` and `mi` to our chart. We want them to stand out from the rest of the B1G, so we're going to make `dance` dark grey (again, one word) and we'll use Michigan State's hex code to shade them in.
170 |
171 | ```{r colors-bar3, exercise=TRUE, exercise.setup = "colors-load-data", message=FALSE}
172 | ggplot() +
173 | geom_bar(data=big, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="____") +
174 | geom_bar(data=dance, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="____") +
175 | geom_bar(data=____, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="#18453B") +
176 | coord_flip()
177 | ```
178 | ```{r colors-bar3-solution, exercise.reveal_solution = FALSE}
179 | ggplot() +
180 | geom_bar(data=big, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="lightgrey") +
181 | geom_bar(data=dance, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="darkgrey") +
182 | geom_bar(data=mi, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="#18453B") +
183 | coord_flip()
184 | ```
185 | ```{r colors-bar3-check}
186 | grade_this_code()
187 | ```
188 |
189 | See how you can now see who the tournament teams were, and where Michigan State ranked in the conference? Odd to see they were the last Big Ten team left in the tournament when much higher-rated teams went out before them, no? That's often where stories lie.
190 |
191 | We've got one last color-based thing to do -- get rid of the grey background.
192 |
193 | ### Exercise 4: Getting out of the way of our colors
194 |
195 | `ggplot` by default adds a grey background to every chart. I don't know why, but it's there. That grey takes away from our contrast and makes the reader's eye wander more. We want to draw attention to our shapes, and use color to draw the eye to the shape we want them to see. Something that impacts that is bad, so we want to get rid of it.
196 |
197 | The fastest way is to use pre-made themes. The two that come in the most handy are `theme_minimal` and `theme_light`, both of which ditch the grey background.
198 |
199 | Let's use `theme_minimal` here.
200 |
201 | ```{r colors-bar4, exercise=TRUE, exercise.setup = "colors-load-data", message=FALSE}
202 | ggplot() +
203 | geom_bar(data=big, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="____") +
204 | geom_bar(data=dance, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="____") +
205 | geom_bar(data=____, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="#18453B") +
206 | ____() +
207 | coord_flip()
208 | ```
209 | ```{r colors-bar4-solution, exercise.reveal_solution = FALSE}
210 | ggplot() +
211 | geom_bar(data=big, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="lightgrey") +
212 | geom_bar(data=dance, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="darkgrey") +
213 | geom_bar(data=mi, aes(x=reorder(School, OverallSRS), weight=OverallSRS), fill="#18453B") +
214 | theme_minimal() +
215 | coord_flip()
216 | ```
217 | ```{r colors-bar4-check}
218 | grade_this_code()
219 | ```
220 |
221 | This chart has some work left -- headlines, some text to explain the dark grey bars, the axis labels are garbage -- but this chart tells a story and our reader can find it.
222 |
223 | ## The Recap
224 |
225 | In this lesson, we've explored the powerful role of color in enhancing sports data visualizations. We began by discussing the three fundamental rules of color usage: limiting the number of colors, using contrast to draw attention, and avoiding color abuse. We then applied these principles to a real-world example using college basketball data from the NCAA tournament. Through hands-on exercises, we transformed a basic bar chart into a visually compelling story about Big Ten teams' performance. We practiced using color strategically to fade less important data into the background, highlight key information, and guide the viewer's eye to the most crucial elements of our visualization. We also learned how to remove distracting elements like default backgrounds to further emphasize our color choices. By mastering these techniques, you'll be able to create more effective, professional-looking visualizations that not only present data but tell a clear, visually engaging story for your sports analytics blog.
226 |
227 | ## Terms to Know
228 |
229 | - Color Contrast: The use of different colors to make certain elements stand out from others in a visualization.
230 | - Color Palette: A set of colors chosen to be used together in a visualization, often selected for their harmony and effectiveness in conveying information.
231 | - `fill`: A ggplot2 parameter used to set the interior color of graphical elements like bars in a bar chart.
232 | - `color`: A ggplot2 parameter that sets the outline color of graphical elements.
233 | - Hex Code: A six-digit code that represents a specific color in digital design, often used for precise color selection in visualizations.
234 |
--------------------------------------------------------------------------------
/inst/tutorials/32-finishing-touches/images/chartannotated.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/32-finishing-touches/images/chartannotated.png
--------------------------------------------------------------------------------
/inst/tutorials/36-simulations/36-simulations.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sports Data Lesson 36: Simulation"
3 | output:
4 | learnr::tutorial:
5 | progressive: true
6 | allow_skip: true
7 | theme: united
8 | ace_theme: github
9 | runtime: shiny_prerendered
10 | description: >
11 | Learn how to determine if it was bad luck or some kind of cosmic weirdness.
12 | ---
13 |
14 | ```{r setup, include=FALSE}
15 | library(learnr)
16 | library(gradethis)
17 | library(tidyverse)
18 | knitr::opts_chunk$set(echo = FALSE)
19 | tutorial_options(exercise.completion=FALSE)
20 | ```
21 |
22 | ## The Goal
23 |
24 | The goal of this lesson is to introduce you to the concept of statistical simulation and its applications in sports analytics. By the end of this tutorial, you'll understand how to use simulations to determine whether unusual performances or streaks are likely due to random chance or indicative of a significant change in performance. We'll focus on using R's random binomial distribution function to simulate shooting percentages over multiple trials. You'll learn how to interpret the results of these simulations, understand the concept of statistical distributions, and apply this knowledge to real-world sports scenarios. This skill will enhance your ability to differentiate between genuine performance issues and normal variability in sports data, providing valuable insights for player evaluation, strategy development, and performance analysis.
25 |
26 | ## The Basics
27 |
28 | In the 2017-2018 season, James Palmer Jr. took 139 three-point attempts and made 43 of them for a .309 shooting percentage. A few weeks into the next season, he was 7 for 39 -- a paltry .179.
29 |
30 | Is something wrong or is this just bad luck?
31 |
32 | Luck is something that comes up a lot in sports. Is a team unlucky? Or a player? One way we can get at this is by simulating outcomes based on their typical percentages. Simulations work by choosing random values within a range based on a distribution. Two of the most common are the normal distribution and the binomial distribution. In a normal distribution, most cases appear around the mean, about 68 percent of cases are within one standard deviation of the mean, and the further away from the mean you get, the rarer things become. A binomial distribution counts the number of successes in a fixed number of tries, like made shots out of 39 attempts, and with enough tries it takes on a roughly similar bell shape.
33 |
34 | ```{r, echo=FALSE}
35 | knitr::include_graphics(rep("images/simulations2.png"))
36 | ```
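
As a quick check on the bell-curve description above, the share of a normal distribution that falls within one, two and three standard deviations of the mean can be computed with base R's `pnorm()`. This is only an illustrative sketch; the names `k` and `within_sd` are made up for the example and are not part of the lesson's code.

```r
# Number of standard deviations on either side of the mean
k <- 1:3

# Share of a normal distribution that falls within k standard deviations
within_sd <- pnorm(k) - pnorm(-k)

round(within_sd, 3)
```

The first value is roughly .68, which is where the "about 68 percent within one standard deviation" rule of thumb comes from.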
37 |
38 | Let's simulate 39 three-point attempts 1000 times with his season-long shooting percentage and see if this could just be random chance or something else.
39 |
40 | We do this using a base R function called `rbinom`, which draws random values from a binomial distribution. What that means is that, in any stretch of 39 attempts, there's a chance James Palmer Jr. shoots above or below his season-long three-point shooting percentage, with results near that percentage being the most likely. If we randomly draw from that distribution 1000 times, how many times will it come up 7, like this example?
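
Before running the full simulation, it may help to see what a small `rbinom()` call returns. In the sketch below, `n` is the number of simulated stretches, `size` is the number of attempts in each stretch, and `prob` is the chance that any single attempt goes in. The names `first_run` and `second_run` are illustrative only.

```r
# Fix the random number generator so the draws below repeat exactly
set.seed(1234)

# Five simulated stretches of 39 attempts, each shot made with probability .309
first_run <- rbinom(n = 5, size = 39, prob = .309)

# Resetting the same seed reproduces the same five counts
set.seed(1234)
second_run <- rbinom(n = 5, size = 39, prob = .309)

first_run
identical(first_run, second_run)
```

Each value in `first_run` is the number of made shots in one simulated stretch of 39 attempts. Because the seed is reset before the second call, `identical()` returns `TRUE`, which is why the exercises in this lesson start with `set.seed()`.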
41 |
42 | ```{r sim1, exercise=TRUE, message=FALSE}
43 | set.seed(1234)
44 |
45 | simulations <- rbinom(n = 1000, size = 39, prob = .309)
46 |
47 | table(simulations)
48 | ```
49 | ```{r sim1-solution, exercise.reveal_solution = FALSE}
50 | set.seed(1234)
51 |
52 | simulations <- rbinom(n = 1000, size = 39, prob = .309)
53 |
54 | table(simulations)
55 | ```
56 | ```{r sim1-check}
57 | grade_this_code()
58 | ```
59 |
60 | How do we read this? The first row and the second row form a pair. The top row is the number of shots made. The number immediately under it is the number of simulations where that occurred.
61 |
62 | ```{r, echo=FALSE}
63 | knitr::include_graphics(rep("images/simulations1.png"))
64 | ```
65 |
66 | So what we see is that, given his season-long shooting percentage, it's not out of the realm of randomness that with just 39 attempts, Palmer has hit only 7. In 1000 simulations, it comes up 35 times. Is he below where he should be? Yes. Will he likely improve and soon? Unless something is very wrong, yes. And indeed, by the end of the season, he finished with a .313 shooting percentage from 3-point range. So we can say he was just unlucky.
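
One way to put a number on "not out of the realm of randomness" is to count how often the simulated shooter makes 7 or fewer, and to compare that with the exact binomial probability from base R's `pbinom()`. The sketch below reuses the seed and parameters from the exercise above; the names `made`, `sim_share` and `exact_share` are illustrative, not part of the lesson's code.

```r
set.seed(1234)

# 1,000 simulated stretches of 39 attempts at a .309 make rate
made <- rbinom(n = 1000, size = 39, prob = .309)

# Share of simulated stretches with 7 or fewer makes
sim_share <- mean(made <= 7)

# Exact probability of 7 or fewer makes in 39 attempts
exact_share <- pbinom(7, size = 39, prob = .309)

c(simulated = sim_share, exact = exact_share)
```

Counting "7 or fewer" rather than "exactly 7" asks the more useful question: how often would a .309 shooter have a stretch at least this bad? The simulated share will wobble a little from seed to seed, while the `pbinom()` value does not.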
67 |
68 | ## Cold streaks
69 |
70 | During the Western Illinois game in the 2018-2019 season, the team, shooting .329 on the season from behind the arc, went 0-15 in the second half. How strange is that?
71 |
72 | ```{r sim2, exercise=TRUE, message=FALSE}
73 | set.seed(1234)
74 |
75 | simulations <- rbinom(n = 1000, size = 15, prob = .329)
76 |
77 | table(simulations)
78 | ```
79 | ```{r sim2-solution, exercise.reveal_solution = FALSE}
80 | set.seed(1234)
81 |
82 | simulations <- rbinom(n = 1000, size = 15, prob = .329)
83 |
84 | table(simulations)
85 | ```
86 | ```{r sim2-check}
87 | grade_this_code()
88 | ```
89 |
90 | Short answer: Really weird. If you simulate 15 threes 1000 times, sometimes you'll see them miss all of them, but only a few times -- five times, in this case. In the other 995 simulated halves, the team makes at least one. So going ice cold is not totally out of the realm of random chance, but it's highly unlikely.
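
The simulated count can be checked against the exact binomial probability. This sketch uses base R's `dbinom()` and assumes, as the simulation does, that every attempt is independent and has the same .329 chance of going in. The names `p_zero`, `p_by_hand` and `p_none_in_1000` are illustrative.

```r
# Exact probability of missing all 15 attempts at a .329 make rate
p_zero <- dbinom(0, size = 15, prob = .329)

# The same number by hand: 15 independent misses in a row
p_by_hand <- (1 - .329)^15

# Chance that a batch of 1,000 simulated halves contains no 0-for-15 at all
p_none_in_1000 <- (1 - p_zero)^1000

round(c(p_zero = p_zero, p_by_hand = p_by_hand, p_none_in_1000 = p_none_in_1000), 4)
```

The exact chance of an 0-for-15 half works out to roughly a quarter of one percent, or about 1 in 400. Seeing it a handful of times in 1000 simulations is therefore expected, even though any single half like that is rare.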
91 |
92 | ## The Recap
93 |
94 | In this lesson, we've explored the powerful technique of statistical simulation and its application in sports analytics. We began with the normal and binomial distributions and how they relate to player performance. Using R's rbinom() function, we learned how to simulate shooting percentages over multiple trials, applying this to real examples like James Palmer Jr.'s three-point shooting slump. We discovered how to interpret the results of these simulations, determining whether observed performances are within the realm of normal variability or truly unusual. We also applied this technique to team performance, analyzing the likelihood of extreme cold streaks. This hands-on approach has equipped you with the skills to use simulations for evaluating player and team performances, helping to distinguish between random fluctuations and genuine issues. Remember, while simulations provide valuable insights, they should be used in conjunction with other analytical tools and contextual knowledge of the sport for comprehensive analysis.
95 |
96 | ## Terms to Know
97 |
98 | - Simulation: A statistical technique that uses random sampling to model real-world scenarios and estimate probabilities of different outcomes.
99 | - Normal Distribution: A symmetrical, bell-shaped probability distribution where data tends to cluster around the mean, with decreasing frequency as values move away from the center.
100 | - Binomial Distribution: A probability distribution that represents the number of successes in a fixed number of independent trials, each with the same probability of success.
101 | - `rbinom()`: An R function used to generate random numbers from a binomial distribution, often used in simulations.
102 | - Set Seed: Fixing the starting point of the random number generator, done in R with the `set.seed()` function, so that a simulation gives the same results every time it runs.
103 | - Probability: The likelihood of an event occurring, expressed as a number between 0 and 1.
104 | - Standard Deviation: A measure of the amount of variation or dispersion of a set of values from their mean.
105 | - Random Variability: Natural fluctuations in performance or outcomes that occur by chance rather than due to a specific cause.
--------------------------------------------------------------------------------
/inst/tutorials/36-simulations/images/simulations1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/36-simulations/images/simulations1.png
--------------------------------------------------------------------------------
/inst/tutorials/36-simulations/images/simulations2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattwaite/SportsDataTutorials/66d572a712d55b1c55e90b04486ce6264ed640c4/inst/tutorials/36-simulations/images/simulations2.png
--------------------------------------------------------------------------------