├── README.md
├── docs
    ├── index.html
    ├── nba2min.html
    └── nba_development.html
├── homefouls.qmd
├── logs1824.csv
├── nba2min.html
├── nba_development.html
├── nba_two_minutes.Rmd
└── nicar25_wbb.Rmd


/README.md:
--------------------------------------------------------------------------------
 1 | # Bringing data journalism to the sports section
 2 | 
 3 | ### THIS REPO: https://github.com/dwillis/nicar25-sports
 4 | 
 5 | Materials for this NICAR 2025 session.
 6 | 
 7 | * Matt Waite, University of Nebraska
 8 | * MaryJo Webster, Minnesota Star Tribune
 9 | * Derek Willis, University of Maryland
10 | 
11 | ### Resources
12 | 
13 | * [Sports DataVerse](https://sportsdataverse.org/)
14 | * [Sports Data Analysis and Visualization](https://www.thescoop.org/sports/)
15 | * [Using R Packages to get data](https://www.thescoop.org/sports/usingpackages.html)
16 | 
17 | ### Examples
18 | 
19 | * [Behind the Minnesota Vikings' Wild Season](https://www.startribune.com/a-look-at-the-data-behind-the-minnesota-vikings-wild-unpredictable-season/600241956/)
20 | 
21 | We wanted a story that would especially appeal to our digital readers that heavily used data and graphics to look back at the Vikings' crazy season, just as they were heading into a playoff game. The biggest piece on this story were the play-by-play win probability charts, which is from data that you can pull using the espnscrapeR package. (My [video tutorials](https://sites.google.com/view/mj-basic-data-academy/intro-to-r/getting-nfl-data?authuser=0) show how to do that for one or multiple games.)
22 | 
23 | I also used play-by-play data downloaded with the espnscrapeR package to look at the point differential at the end of each quarter. The play by play data is super useful because it not only shows every play, but also has a record for the end of each quarter, the two-minute warnings and you can also find the start of overtime.
24 | 
25 | * In November, we also published [this data-heavy piece on the Vikings](https://www.startribune.com/9-charts-that-show-the-minnesota-vikings-stunning-turnaround-from-last-season/600227084/). It also leaned heavily on the win probability data. We also used some NFL NextGen stats to look at average separation stats for Justin Jefferson. You can get that data with this little snippet: 
26 | 
27 | ```
28 | receiving_next_gen <-  load_nextgen_stats(
29 |   seasons = TRUE,
30 |   stat_type = "receiving",
31 |   file_type = getOption("nflreadr.prefer", default = "rds")
32 | )
33 | ```
34 | 
35 | And then we also got passing stats on Kirk Cousins that show how often he was throwing into tight coverage (known as "aggressiveness")
36 | You can get that data with this: 
37 | 
38 | ```
39 | passing_next_gen <-  load_nextgen_stats(
40 |   seasons = TRUE,
41 |   stat_type = "passing",
42 |   file_type = getOption("nflreadr.prefer", default = "rds")
43 | )
44 | ```
45 | 
46 | * [Justin Jefferson piece](https://www.startribune.com/justin-jefferson-minnesota-vikings-statistics-all-pro-mvp-randy-moss/600247729/?refresh=true), February 2023:  
47 | Most of this is just high-level stuff taken from various websites, but there are two pieces where we pulled data from an API using R. The main one is the bubble chart that shows how Jefferson led the team on offensive yards (There is a nifty package called "[packcircles](mimestream://messagethread/p448483/message/p511829?UUID=52FF0A60-4ECD-493C-B780-C97901E30DF2&loadRemoteResources#https://r-graph-gallery.com/305-basic-circle-packing-with-one-level.html)" that works with plotly package to make a bubble chart. The one in our story was made with other technology, though).  I also pulled the receiving yards per game data from the espnscrapR package, play by play data.
48 | 
49 | * [For Players, Scoring Matters. But Shouldn’t Other Things Too?](https://sash-wat.github.io/JOUR479XSite/_site/posts/soccer-analysis/)
50 | * [For WVU, There’s No Place Like Home](https://herhoopstats.substack.com/p/for-wvu-theres-no-place-like-home)
51 | 
52 | * [Three statistics that show why the Lynx can be playoff contenders](https://www.startribune.com/lynx-wnba-championship-contenders-assists-three-point-percentage-struggles-in-paint/601111960?utm_source=gift)
53 |       Source: WNBA data from R package called [wehoop](https://wehoop.sportsdataverse.org/index.html)
54 | * [We plotted every hit Joe Mauer collected in his 15-year career](https://strib.gift/l2if57f85)
55 |       Source: Purchased from [Sports Info Solutions](https://www.sportsinfosolutions.com/research-analysis/)  Other websites didn't go far enough back in time. 
56 | * [Minnesota is national leader in girls sports participation](https://www.startribune.com/minnesota-is-national-leader-in-girls-high-school-sports-participation/600182377?utm_source=gift)
57 |       Source: [Federation of State High School Associations](https://www.nfhs.org/)
58 | * [Gophers NIL deals](https://www.startribune.com/gophers-nil-deals-minnesota-social-media-women-football-dinkytown-athletes/600355016?utm_source=gift).  [Example of NIL data](https://docs.google.com/spreadsheets/d/1hmeMj2jM7DSnjfJOsjtl_axHDYZivG7PO1ebx4C5adE/edit?usp=sharing)
59 | 
60 | * [The NBA's three-point revolution has gotten extreme](https://wapo.st/43khZsn)
61 | * [NCAA Game Excitement Index](https://lukebenz.com/post/gei/)
62 | * [College Basketball Referee Logs](https://blessyourchart.shinyapps.io/cbb-ref-logs/)
63 | * [The Gyminternet](https://lse-me204.github.io/me204-2024-project-jesatuts2/)
64 | 


--------------------------------------------------------------------------------
/homefouls.qmd:
--------------------------------------------------------------------------------
  1 | # Bringing Data Analysis to the Sports Section
  2 | 
  3 | ## NICAR 2024
  4 | 
  5 | #### Matt Waite, University of Nebraska
  6 | 
  7 | #### Derek Willis, University of Maryland
  8 | 
  9 | #### <https://github.com/dwillis/nicar24-sports>
 10 | 
 11 | Some of the best sports data stories start with a good question. How about this one: which teams benefit the most from home-court officiating?
 12 | 
 13 | First, we'll load the tidyverse.
 14 | 
 15 | ```{r}
 16 | library(tidyverse)
 17 | ```
 18 | 
 19 | ```{r}
 20 | logs <- read_csv("logs1824.csv")
 21 | 
 22 | logs <- logs |> mutate(foul_diff = TeamPersonalFouls-OpponentPersonalFouls, steal_diff = TeamSteals- OpponentSteals, turnover_diff = TeamTurnovers- OpponentTurnovers)
 23 | ```
 24 | 
 25 | 
 26 | ```{r}
 27 | 
 28 | team_fouls_location_season <- logs |> 
 29 |   filter(!is.na(TeamPersonalFouls)) |> 
 30 |   group_by(Conference, Team, Season, HomeAway) |> 
 31 |   summarize(avg_fouls = mean(TeamPersonalFouls)) |> 
 32 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 33 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 34 |   mutate(Diff = Home-Away)
 35 | 
 36 | conference_fouls_location_season <- logs |> 
 37 |   filter(!is.na(TeamPersonalFouls)) |> 
 38 |   group_by(Conference, Season, HomeAway) |> 
 39 |   summarize(avg_fouls = mean(TeamPersonalFouls)) |> 
 40 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 41 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 42 |   mutate(Diff = Home-Away)
 43 | 
 44 | team_fouls_location <- logs |> 
 45 |   filter(!is.na(TeamPersonalFouls)) |> 
 46 |   group_by(Conference, Team, HomeAway) |> 
 47 |   summarize(avg_fouls = mean(TeamPersonalFouls)) |> 
 48 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 49 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 50 |   mutate(Diff = Home-Away)
 51 | 
 52 | 
 53 | conference_fouls_location <- logs |> 
 54 |   filter(!is.na(TeamPersonalFouls)) |> 
 55 |   group_by(Conference, HomeAway) |> 
 56 |   summarize(avg_fouls = mean(TeamPersonalFouls)) |> 
 57 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 58 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 59 |   mutate(Diff = Home-Away)
 60 | ```
 61 | ## Averages for all teams
 62 | 
 63 | ```{r}
 64 | logs |> filter(!is.na(TeamPersonalFouls)) |> group_by(HomeAway) |> summarize(avg_fouls = mean(TeamPersonalFouls))
 65 | ```
 66 | 
 67 | ```{r}
 68 | team_fouldiff_location_season <- logs |> 
 69 |   filter(!is.na(foul_diff)) |> 
 70 |   group_by(Conference, Team, Season, HomeAway) |> 
 71 |   summarize(avg_fouls = mean(foul_diff)) |> 
 72 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 73 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 74 |   mutate(Diff = Home-Away)
 75 | 
 76 | conference_fouldiff_location_season <- logs |> 
 77 |   filter(!is.na(foul_diff)) |> 
 78 |   group_by(Conference, Season, HomeAway) |> 
 79 |   summarize(avg_fouls = mean(foul_diff)) |> 
 80 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 81 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 82 |   mutate(Diff = Home-Away)
 83 | 
 84 | team_fouldiff_location <- logs |> 
 85 |   filter(!is.na(foul_diff)) |> 
 86 |   group_by(Team, HomeAway) |> 
 87 |   summarize(avg_fouls = mean(foul_diff)) |> 
 88 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 89 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 90 |   mutate(Diff = Home-Away)
 91 | 
 92 | 
 93 | conference_fouldiff_location <- logs |> 
 94 |   filter(!is.na(foul_diff)) |> 
 95 |   group_by(Conference, HomeAway) |> 
 96 |   summarize(avg_fouls = mean(foul_diff)) |> 
 97 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 98 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 99 |   mutate(Diff = Home-Away)
100 | ```
101 | ### How many teams have averaged fewer fouls at home and away over the past seven seasons?
102 | 
103 | ```{r}
104 | team_fouldiff_location_season |> filter(Home < 0, Away < 0) |> group_by(Team) |> summarize(count = n()) |> arrange(desc(count))
105 | ```
106 | 
107 | ### Teams that are good at this - are they alike? Yes, except South Carolina and Western Illinois.
108 | 
109 | ```{r}
110 | team_fouldiff_location |> 
111 |   filter(Away < 0, Home < 0) |> 
112 |   select(Team, Away, Home, Diff) |> 
113 |   arrange(Diff)
114 | ```
115 | 
116 | ### Drilling down into game-level data for West Virginia
117 | 
118 | 
119 | ```{r}
120 | logs |> filter(Team == "West Virginia", !is.na(W_L), is.na(HomeAway)) |>  select(Season, Date, Opponent, W_L, foul_diff, steal_diff, turnover_diff) |>  arrange(desc(Date)) |> distinct()
121 | ```
122 | 
123 | 
124 | 
125 | 


--------------------------------------------------------------------------------
/nba_two_minutes.Rmd:
--------------------------------------------------------------------------------
  1 | #Crunch Time Calls: NBA Referees Score 93% Accuracy in Final Two-Minute Scenarios
  2 | By Pablo Suarez
  3 | 
  4 | ```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
  5 | 
  6 | library(tidyverse)
  7 | library(ggplot2)
  8 | library(ggrepel)
  9 | library(janitor)
 10 | library(lubridate)
 11 | library(ggalt)
 12 | library(cowplot)
 13 | library(dplyr)
 14 | library(scales)
 15 | library(gt)
 16 | library(knitr)
 17 | 
 18 | ```
 19 | 
 20 | The NBA Officiating Last Two Minute Report exists to help provide transparency and assessment of officiating in the final two minutes of games at or within three points. An analysis of all reports between March 2015 and Nov. 2024 suggests that NBA referees are adequately accurate when the stakes and pressure are highest.
 21 | 
 22 | This analysis is possible thanks to a publicly accessible dataset of all individual reports released within that timeframe compiled by GitHub user "atlhawksfanatic." The data includes nearly 84,000 calls and provides contextual information from each assessed game including the call type, players initiating or impacted by the call, game dates, official names and whether a game was broadcast on national TV.
 23 | 
 24 | ```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
 25 | 
 26 | l2m <- read_csv("https://raw.githubusercontent.com/pabs-s/jour479x_fall_2024/refs/heads/main/presentations/l2m.csv")
 27 | colnames(l2m)
 28 | 
 29 | ```
 30 | 
 31 | The accuracy of a call is denoted by one of five labels -- correct call (CC), correct no call (CNC), incorrect call (IC), incorrect no call (INC) and NA. For context, calls that fall under "NA" are considered "only observable with the help of a stopwatch, zoom or other technical support" and are typically not deemed to be incorrectly officiated.  
 32 | 
 33 | The first step in this analysis was to determine an accuracy baseline for referees. Among the nearly 84,000 assessed calls, roughly 6,000 fell under the umbrella of incorrect calls (IC and INC). An average of these figures indicates that in these late-game scenarios, NBA referees are 93% accurate based on their assessment in the reports. 
 34 | 
 35 | In an academic environment and on a 4.0 GPA scale, that percentage would translate to the equivalent of an A-. That's "good." It's not perfect, but contrary to the beliefs of the most jaded fans, referees are humans too. NBA refs lag in comparison to their NFL counterparts who were found by the league to be "typically accurate on 98.9% of calls" (https://operations.nfl.com/officiating/nfl-officials-preparing-for-success/#:~:text=Each%20NFL%20game%20averages%20around,consequences%20for%20not%20achieving%20expectations) and aligned more closely with MLB umpires who scored an accuracy of 92.8% in 2023 according to Statcast (https://blogs.fangraphs.com/strike-three-lets-check-in-on-umpire-accuracy/).
 36 | 
 37 | ```{r}
 38 | 
 39 | incorrect_calls <- l2m |>
 40 |   filter(decision %in% c("INC", "IC")) |>
 41 |   group_by(call_type) |>
 42 |   summarise(count = n()) |>
 43 |   arrange(desc(count))
 44 | 
 45 | tot_inc <- sum(incorrect_calls$count, na.rm = TRUE) #5912
 46 | #83,987
 47 | 
 48 | acc_avg <- (5912/83987) * 100 
 49 | 
 50 | 
 51 | cat("Average accuracy of NBA referees in final two-minutes:", acc_avg, "%\n")
 52 | 
 53 | 
 54 | ```
 55 | 
 56 | That baseline for NBA referees seems to hold true throughout the years. Take a look at the distribution of calls from year to year. Only the plays classified as correct no calls observe the most fluctuation. That's encouraging for referees as it indicates that the reports are primarily validating (for the most part) correct rule interpretations and judgments on the court, rather than correcting consequential mistakes.
 57 | 
 58 | ```{r}
 59 | 
 60 | l2m_summary <- l2m |>
 61 |   filter(decision %in% c("IC", "INC", "CC", "CNC")) |> 
 62 |   filter(season >= 2015 & season <= 2024) |> 
 63 |   group_by(season, decision) |> 
 64 |   summarise(count = n())
 65 | 
 66 | ggplot(l2m_summary, aes(x = season, y = count, color = decision, group = decision)) +
 67 |   geom_line(size = 1) +
 68 |   geom_point(size = 2) +
 69 |   labs(
 70 |     title = "NBA Late-Game Refereeing is Consistent Except for Correct No Calls",
 71 |     subtitle = "Correct and incorrect calls by year from L2M reports.",
 72 |     x = "Season",
 73 |     y = "Number of Calls",
 74 |     color = "Decision Type",
 75 |     caption = "Data: atlhawksfanatic (GitHub) | By Pablo Suarez"
 76 |   ) +
 77 |   scale_color_manual(
 78 |     values = c("CC" = "green", "CNC" = "blue", "IC" = "red", "INC" = "orange"),
 79 |     labels = c("CC" = "Correct Call", 
 80 |                "CNC" = "Correct No Call", 
 81 |                "IC" = "Incorrect Call", 
 82 |                "INC" = "Incorrect No Call")
 83 |   ) +
 84 |   scale_x_continuous(breaks = seq(2015, 2025, by = 1)) +
 85 |   theme_minimal() +
 86 |   theme(
 87 |     axis.text.x = element_text(angle = 45, hjust = 1),
 88 |     legend.position = "bottom")
 89 | 
 90 | ```
 91 | Overall, the broad state of these incorrect calls over the years comes from common fouls, but more specifically shooting and personal fouls. There are nearly 1,400 instances of incorrectly officiated shooting fouls in the dataset.
 92 | 
 93 | ```{r}
 94 | 
 95 | top_ten_inc <- incorrect_calls |>
 96 |   slice_head(n = 10)  
 97 | 
 98 | ggplot(top_ten_inc, aes(x = reorder(call_type, count), y = count)) +
 99 |   geom_bar(stat = "identity", fill = "blue") +  
100 |   coord_flip() +  
101 |   scale_y_continuous(breaks = seq(0, max(top_ten_inc$count), by = 100)) +
102 |   labs(
103 |     title = "Shooting Fouls Spell End of Game Trouble for NBA Refs",
104 |     x = "Call Type",
105 |     y = "Count of Incorrect Calls",
106 |     caption = "Data: atlhawksfanatic (GitHub) | By Pablo Suarez") +
107 |   theme_minimal()
108 | 
109 | ```
110 | That sounds like a lot, but don't lose the plot here. There is almost a decade's worth of games represented in the data, and plenty of additional plays where referees assessed a shooting foul correctly (almost 19,000 to boot). The "good" far outweighs the "bad" in this case. In fact, percentage-wise, NBA referees incorrectly assess these fouls only approximately 6.5% of the time, which outpaces the 7% baseline.
111 | 
112 | ```{r}
113 | 
114 | shooting_fouls <- l2m |>
115 |   filter(call_type == "Foul: Shooting")  
116 | 
117 | shootfoul_count <- shooting_fouls |>
118 |   group_by(decision) |>        
119 |   summarise(count = n())
120 | 
121 | ggplot(shootfoul_count, aes(x = decision, y = count, fill = decision)) +
122 |   geom_bar(stat = "identity") +
123 |   labs(
124 |     title = "NBA Refs Are Overwhelmingly Accurate with Shooting Foul Calls",
125 |     x = "Decision Type",
126 |     y = "Count",
127 |     caption = "Data: atlhawksfanatic (GitHub) | By Pablo Suarez"
128 |   ) +
129 |   theme_minimal()
130 | 
131 | ```
132 | 
133 | ```{r}
134 | 
135 | total_shooting_fouls <- nrow(shooting_fouls)
136 | 
137 | inc_sf <- shooting_fouls |>
138 |   filter(decision %in% c("IC", "INC")) |>  
139 |   nrow()
140 | 
141 | pct_inc_sf <- (inc_sf / total_shooting_fouls) * 100
142 | 
143 | cat("Percentage of IC + INC out of total shooting fouls:", pct_inc_sf, "%\n")
144 | 
145 | ```
146 | 
147 | Here's another way to look at these incorrect calls. Usually, for facet wraps like this, the y-axis should be uniform for each chart. However, I opted to scale them to show each call's proportion in relation to each other. This not only better showcases problem calls for referees, but we also learn more in general because a uniform y-axis would only further prove what is called more often. 
148 | 
149 | ```{r}
150 | 
151 | filtered_l2m <- l2m |>
152 |   filter(call_type %in% c(
153 |     "Foul: Shooting",
154 |     "Foul: Personal",
155 |     "Foul: Offensive",
156 |     "Turnover: Traveling",
157 |     "Foul: Loose Ball",
158 |     "Turnover: 3 Second Violation",
159 |     "Violation: Lane",
160 |     "Foul: Away from Play",
161 |     "Stoppage: Out of Bounds"
162 |   )) |>
163 |   mutate(call_type = case_when(
164 |     call_type == "Turnover: 3 Second Violation" ~ "3 Sec. Violation",
165 |     call_type == "Stoppage: Out of Bounds" ~ "Out of Bounds",
166 |     TRUE ~ call_type 
167 |   ))
168 | 
169 | ggplot(filtered_l2m, aes(x = decision, fill = decision)) +
170 |   geom_bar() +
171 |   facet_wrap(~call_type, scales = "free_y") +  
172 |   theme_minimal() +
173 |   labs(
174 |     title = "Violations are Flying Under Referee Radars",
175 |     subtitle = "Distribution of top nine calls in games since March 2015.",
176 |     x = "Decision",
177 |     y = "Count",
178 |     fill = "Decision",
179 |     caption = "Data: atlhawksfanatic (GitHub) | By Pablo Suarez"
180 |   ) +
181 |   theme(
182 |     axis.text.x = element_text(angle = 45, hjust = 1),
183 |     strip.text = element_text(size = 10)
184 |   )
185 | 
186 | ```
187 | 
188 | It's worth noting above that calls like lane violations pale in comparison to fouls in terms of call frequency (81 lane violations vs nearly 1,400 shooting fouls). However, regardless of the frequency, each of these calls are consequential in late-game scenarios given the increased perceived value of each possession. 
189 | 
190 | If referees are struggling when faced with catching these violations, that is an issue worth highlighting. It's also worth discussing the NA values. If referees can't discern whether a violation is occurring in real-time without the assistance of replay, then perhaps it warrants a stricter rule from the NBA as a deterrent, a separate referee solely responsible for catching these calls or any sort of intervention with permitted sideline technology.
191 | 
192 | Again, these violations are a drop in the bucket compared to fouls. But in close games, every call matters and who's to say how better accuracy might alter the outcome.
193 | 
194 | ###Impact on Players and Teams
195 | 
196 | Let's address the elephant in the room: Which players and teams are adversely or positively impacted by NBA refereeing?
197 | 
198 | The reports use two labels to denote players involved in a given call: committing and disadvantaged. To a degree, it's straightforward to interpret. The player initiating the call is "committing" the act, while the player at the receiving end is considered "disadvantaged." Given that the data contains correct no calls or that players "flop," it's not always the case that the disadvantaged player is necessarily the one who is always negatively impacted by the outcome of a call. But assessing the true impact of each call is a much more taxing task and something that will require going call-by-call. I settled for the labels and how the report interpreted these calls in the meantime.
199 | 
200 | ```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
201 | 
202 | inc_ic_calls <- l2m |>
203 |   filter(decision %in% c("INC", "IC"))  
204 | 
205 | committing_counts <- inc_ic_calls |>
206 |   group_by(committing) |>
207 |   summarise(com_count = n(), .groups = "drop")  
208 | 
209 | disadvantaged_counts <- inc_ic_calls |>
210 |   group_by(disadvantaged) |>
211 |   summarise(dis_count = n(), .groups = "drop")
212 | 
213 | player_counts <- full_join(
214 |   committing_counts, disadvantaged_counts,
215 |   by = c("committing" = "disadvantaged")) |>
216 |   rename(player = committing) 
217 | 
218 | player_counts <- player_counts |>
219 |   mutate(
220 |     dis_count = replace_na(dis_count, 0),
221 |     com_count = replace_na(com_count, 0))
222 | 
223 | player_counts <- player_counts |>
224 |   mutate(total_count = dis_count + com_count) |>
225 |   arrange(desc(total_count))
226 | 
227 | top_15_players <- player_counts |>
228 |   filter(!is.na(player)) |> 
229 |   arrange(desc(total_count)) |> 
230 |   slice(1:15)
231 | 
232 | ```
233 | 
234 | ```{r}
235 | 
236 | ggplot(top_15_players, aes(x = dis_count, xend = com_count, y = reorder(player, total_count))) +
237 |   geom_dumbbell(
238 |     colour = "grey", 
239 |     size = 2,         
240 |     dot_guide = TRUE, 
241 |     dot_guide_size = 0.5) +
242 |   geom_point(aes(x = dis_count), color = "blue", size = 3) +  
243 |   geom_point(aes(x = com_count), color = "red", size = 3) +   
244 |   labs(
245 |     title = "Top 15 Players: Disadvantaged vs. Committing Counts",
246 |     subtitle = "Involvement in incorrect calls and no calls in NBA L2M report.",
247 |     x = "Number of Plays",
248 |     y = "Player",
249 |     caption = "Blue = Disadvantaged, Red = Committing | Data: atlhawksfanatic (GitHub) | By Pablo Suarez") +
250 |   theme_minimal()
251 | 
252 | ```
253 | 
254 | Above is a list of the 15 players with the highest combined counts of disadvantaged and committed calls. 
255 | 
256 | Nikola Jokic leads all NBA players for references in plays with incorrect calls and no calls with a count of 117. He's also the only player in the top 8 that has been the disadvantaged player more than the committing player. DeMar DeRozan showed the highest disparity among disadvantaged players, but he was involved in 47 less plays than Jokic.
257 | 
258 | Conversely, players like Al Horford, Draymond Green and Karl-Anthony Towns are major offenders in terms of having high disparity between the instances where they are disadvantaged compared to committing fouls or turnovers.
259 | 
260 | Roughly half of these players are considered centers or power forwards (a.k.a., big men). This might be because their position on the court would make them susceptible to contact. The players who would be considered exceptions to this rule, like James Harden, Westbrook and Jimmy Butler, have also made a living by frequently getting into the paint for layups. 
261 | 
262 | These findings make Jokic's elite offensive production even more impressive with this context. Naturally, I wanted to dig even deeper into his calls.
263 | 
264 | ```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
265 | 
266 | jokic <- l2m |> filter(disadvantaged == "Nikola Jokic" | committing == "Nikola Jokic")
267 | jokic_inc_calls <- jokic |> filter(decision %in% c("INC", "IC")) 
268 | 
269 | jokic_count <- jokic_inc_calls |>
270 |   group_by(call_type, decision) |>
271 |   summarise(count = n(), .groups = "drop") |>
272 |   arrange(desc(count))
273 | 
274 | jokic1 <- ggplot(jokic_count, aes(x = call_type, y = count, fill = decision)) +
275 |   geom_bar(stat = "identity") +
276 |   scale_fill_manual(
277 |     values = c("INC" = "blue", "IC" = "red")) +
278 |   labs(
279 |     title = "Jokic Gets a Good Whistle with Offensive Fouls",
280 |     x = "Call Type",
281 |     y = "Count",
282 |     fill = "Decision",
283 |     caption = "Data: atlhawksfanatic (GitHub) | By Pablo Suarez") +
284 |   theme_minimal() +
285 |   theme(
286 |     axis.text.x = element_text(angle = 45, hjust = 1),
287 |     plot.margin = margin(10, 10, 10, 30))
288 | 
289 | jokic_call_summary <- jokic_inc_calls |>
290 |   filter(committing == "Nikola Jokic" | disadvantaged == "Nikola Jokic") |> 
291 |   mutate(role = case_when(
292 |     committing == "Nikola Jokic" ~ "Committing",
293 |     disadvantaged == "Nikola Jokic" ~ "Disadvantaged")) |> 
294 |   group_by(call_type, role) |> 
295 |   summarise(count = n()) |>
296 |   arrange(desc(count))
297 | 
298 | jokic2 <- ggplot(jokic_call_summary, aes(x = call_type, y = count, fill = role)) +
299 |   geom_bar(stat = "identity") +
300 |   scale_fill_manual(
301 |     values = c("Disadvantaged" = "blue", "Committing" = "red")) +
302 |   labs(
303 |     x = "Call Type",
304 |     fill = "Role") +
305 |   theme_minimal() +
306 |   theme(
307 |     axis.text.x = element_text(angle = 45, hjust = 1))
308 | 
309 | ```
310 | 
311 | Isolating the two-minute report data to only show instances where he was referenced returned 1,562 plays. Filtering that data even further, I pulled all of the plays involving Jokic where the call was either incorrectly called or not called. That narrowed the data down to 117 plays. That means that out of all of the last two-minute report plays involving Jokic, roughly 7.5% of them ended with an incorrect call by the officials. Slightly higher than the league baseline that was previously established.
312 | 
313 | So what about a distribution of his committing versus disadvantaged fouls?
314 | 
315 | ```{r}
316 | 
317 | plot_grid(jokic1, jokic2)
318 | 
319 | ```
320 | 
321 | It appears that the incorrect calls Jokic is predominantly involved in are mostly fouls, but specifically personal, shooting and loose ball fouls. It looks like he often find himself on the unfortunate side of calls for personal and shooting fouls.
322 | 
323 | However, this also paints a less flattering image of Jokic. He's not that innocent. He appears to frequently initiate contact on the offensive side of the ball, and often gets away with more than he probably should when it comes to offensive fouls. NBA referees have deemed plenty of incorrect no calls regarding his offensive fouls and they disproportionately benefit him.
324 | 
325 | Regardless, it just seems like contact always finds him.
326 | 
327 | Shifting focus, here is that same analysis applied to teams in order to identify the most disadvantaged teams when it comes to incorrect calls. 
328 | 
329 | ```{r}
330 | 
331 | teams <- c("Celtics", "Nets", "Knicks", "76ers", "Raptors", 
332 |            "Bulls", "Cavaliers", "Pistons", "Pacers", "Bucks", 
333 |            "Hawks", "Hornets", "Heat", "Magic", "Wizards", 
334 |            "Nuggets", "Timberwolves", "Thunder", "Trail Blazers", 
335 |            "Jazz", "Warriors", "Clippers", "Lakers", "Suns", 
336 |            "Kings", "Mavericks", "Rockets", "Grizzlies", 
337 |            "Pelicans", "Spurs")
338 | 
339 | top_teams <- player_counts |>
340 |   filter(!is.na(player), player %in% teams) |>  
341 |   arrange(desc(total_count)) |>  
342 |   slice(1:15)  
343 | 
344 | ggplot(top_teams, aes(x = dis_count, xend = com_count, y = reorder(player, total_count))) +
345 |   geom_dumbbell(
346 |     colour = "grey", 
347 |     size = 2,         
348 |     dot_guide = TRUE, 
349 |     dot_guide_size = 0.5) +
350 |   geom_point(aes(x = dis_count), color = "blue", size = 3) +  
351 |   geom_point(aes(x = com_count), color = "red", size = 3) +   
352 |   labs(
353 |     title = "Top 15 Teams: Disadvantaged vs. Committing Counts",
354 |     subtitle = "Involvement in incorrect calls and no calls in NBA L2M report.",
355 |     x = "Number of Plays",
356 |     y = "Player",
357 |     caption = "Blue = Disadvantaged, Red = Committing | Data: atlhawksfanatic (GitHub) | By Pablo Suarez") +
358 |   theme_minimal()
359 | 
360 | ```
361 | 
362 | It's deja vu for Jokic and his Nuggets. After following the same process to identify the top 15 teams and their involvement in incorrect calls and no calls, the Denver Nuggets lead the league with 53 disadvantaged calls. 
363 | 
364 | It's also interesting that the Memphis Grizzlies lead all teams with five committing calls. That's possibly because there are many more possibilities for a call to be disadvantageous for a team rather than a team collectively commit some sort of offense.  
365 | 
366 | For context, the graphic above is strictly for calls that are attributed directly to the team in the reports. Getting an even clearer picture of the frequency for a team's advantageous or disadvantageous calls requires totaling all instances of referenced players. For what it's worth. I think it's possible to figure out moving forward, as the "comments" column in the dataframe contains abbreviations for referenced players. Here's an example involving Jokic himself, "Jokic (DEN) establishes himself in a screening position..." I would filter the column for those team abbreviations and then group those rows based on the committing vs disadvantaged result.
367 | 
368 | Okay, tangent over, now back to the Nuggets. Here is a distribution of those calls involving the team.
369 | 
370 | ```{r}
371 | 
372 | nuggets <- l2m |> filter(disadvantaged == "Nuggets" | committing == "Nuggets")
373 | nuggets_inc_calls <- nuggets |> filter(decision %in% c("INC", "IC")) 
374 | 
375 | nuggets_count <- nuggets_inc_calls |>
376 |   group_by(call_type, decision) |>
377 |   summarise(count = n(), .groups = "drop") |>
378 |   arrange(desc(count))
379 | 
380 | ggplot(nuggets_count, aes(x = call_type, y = count, fill = decision)) +
381 |   geom_bar(stat = "identity") +
382 |   scale_fill_manual(
383 |     values = c("INC" = "blue", "IC" = "red")) +
384 |   labs(
385 |     title = "Nuggets Suffer from Uncalled Opponent Travels and Def. 3 Seconds",
386 |     x = "Call Type",
387 |     y = "Count",
388 |     fill = "Decision",
389 |     caption = "Data: atlhawksfanatic (GitHub) | By Pablo Suarez") +
390 |   theme_minimal() +
391 |   theme(
392 |     axis.text.x = element_text(angle = 45, hjust = 1),
393 |     plot.margin = margin(10, 10, 10, 30))
394 | 
395 | 
396 | ```
397 | 
398 | Unfortunately for the Nuggets, it appears that they are not necessarily getting the traveling and defensive three second calls that they are owed. Between both calls, that's essentially 36 possessions the team was owed since 2015. It'd be interesting to understand how impactful those calls could have been in close games. 
399 | 
400 | Equally compelling is the disparity between those two specific types of calls and every other call referenced in the chart. Traveling has already proven to be troublesome for referees to accurately assess in real-time. However, this is the first time observing defensive three seconds as a problem area as well. 
401 | As noted in other sections, these calls (or no calls) aren't being highlighted here because they are a systemic problem across the NBA. Referees have already proven they get it right far more often than they get it wrong. But it does represent an improvement area that referees can consider moving forward.
402 | 
403 | ### Impact by National TV Network
404 | 
405 | Considering that the two-minute reports note whether a game is played on national TV and that it specifies the network if it is, it's worth assessing referee performance in games where "everybody" is watching. 
406 | 
407 | ```{r}
408 | 
409 | nat_tv <- l2m |>
410 |   filter(national_tv != "no")
411 | 
412 | tot_nat_tv <- nrow(nat_tv) 
413 | 
414 | inc_nattv_calls <- nat_tv |> 
415 |   filter(decision %in% c("IC", "INC")) |> 
416 |   nrow()
417 | 
418 | pct_inc_nattv_calls <- (inc_nattv_calls / tot_nat_tv) * 100
419 | 
420 | cat("Percentage of IC + INC calls out of total number of calls in national TV games:", pct_inc_nattv_calls, "%\n")
421 | 
422 | ```
423 | 
424 | Using a similar process to how the 7% baseline was established, the percentage of incorrect calls on national TV is just slightly better than average at 6.8%. 
425 | 
426 | It's not a significant difference, but it's also not nothing. Intuitively, it makes sense. The NBA has a vested interest in ensuring that games shown nationally are adequately officiated. That's not to say other games aren't, but more viewers equals more scrutiny and the league is better off saving its slightly more accurate crews for these games.
427 | 
428 | Let's then take a look at the difference among networks. The reports specify three national networks -- ESPN, TNT and NBA TV. 
429 | 
430 | ```{r}
431 | 
432 | espn <- l2m |>
433 |   filter(national_tv == "ESPN")
434 | 
435 | tot_espn <- nrow(espn)
436 | 
437 | inc_espn <- espn |> 
438 |   filter(decision %in% c("IC", "INC")) |> 
439 |   nrow()
440 | 
441 | pct_espn <- (inc_espn / tot_espn) * 100
442 | 
443 | cat("Percentage of IC + INC calls out of total number of calls in ESPN games:", pct_espn, "%\n")
444 | 
445 | ```
446 | 
447 | ```{r}
448 | 
449 | tnt <- l2m |>
450 |   filter(national_tv == "TNT")
451 | 
452 | tot_tnt <- nrow(tnt)
453 | 
454 | inc_tnt <- tnt |> 
455 |   filter(decision %in% c("IC", "INC")) |> 
456 |   nrow()
457 | 
458 | pct_tnt <- (inc_tnt / tot_tnt) * 100
459 | 
460 | cat("Percentage of IC + INC calls out of total number of calls in TNT games:", pct_tnt, "%\n")
461 | 
462 | ```
463 | 
464 | ```{r}
465 | 
466 | nbatv <- l2m |>
467 |   filter(national_tv == "NBATV")
468 | 
469 | tot_nbatv <- nrow(nbatv)
470 | 
471 | inc_nbatv <- nbatv |> 
472 |   filter(decision %in% c("IC", "INC")) |> 
473 |   nrow()
474 | 
475 | pct_nbatv <- (inc_nbatv / tot_nbatv) * 100
476 | 
477 | cat("Percentage of IC + INC calls out of total number of calls in NBA TV games:", pct_nbatv, "%\n")
478 | 
479 | ```
480 | 
481 | Referee performance across ESPN and TNT games is relatively normal. It's the NBA TV games that see a significant increase in the percentage of incorrectly called and uncalled plays, a roughly 5% increase to be more precise. That seems unusual compared to what has been observed up to this point.  
482 | 
483 | Getting to the root of the problem requires understanding what is predominantly called in these NBA TV games.
484 | 
485 | ```{r}
486 | 
487 | nbatv_count <- l2m |>
488 |   filter(national_tv == "NBATV") |> 
489 |   group_by(call_type) |> 
490 |   summarise(count = n()) |> 
491 |   arrange(desc(count)) |> 
492 |   slice_head(n = 10)
493 | 
494 | ggplot(nbatv_count, aes(x = reorder(call_type, count), y = count)) +
495 |   geom_bar(stat = "identity", fill = "blue") +
496 |   coord_flip() + 
497 |   labs(
498 |     title = "Fouls Are A Dime a Dozen in NBA TV Games",
499 |     x = "Call Type",
500 |     y = "Count",
501 |     caption = "Data: atlhawksfanatic (GitHub) | By Pablo Suarez"
502 |   ) +
503 |   theme_minimal()
504 | 
505 | ```
506 | 
507 | This finding aligns with the previous counts of calls made in the last two minutes. A better idea would be to see the distribution of all of these calls.
508 | 
509 | ```{r}
510 | 
511 | filtered_nbatv <- nbatv |>
512 |   filter(call_type %in% c(
513 |     "Foul: Shooting",
514 |     "Foul: Personal",
515 |     "Foul: Offensive",
516 |     "Foul: Loose Ball",
517 |     "Instant Replay: Support Ruling",
518 |     "Turnover: Traveling",
519 |     "Turnover: 24 Second Violation",
520 |     "Turnover: 5 Second Inbound",
521 |     "Foul: Personal Take",
522 |     "Foul: Defensive 3 Second"
523 |   )) |>
524 |   mutate(call_type = case_when(
525 |     call_type == "Turnover: 24 Second Violation" ~ "24 Sec. Violation",
526 |     call_type == "Instant Replay: Support Ruling" ~ "Instant Replay",
527 |     call_type == "Turnover: 5 Second Inbound" ~ "5 Sec. Inbound",
528 |     TRUE ~ call_type 
529 |   ))
530 | 
531 | ggplot(filtered_nbatv, aes(x = decision, fill = decision)) +
532 |   geom_bar() +
533 |   facet_wrap(~call_type, scales = "free_y") +  
534 |   theme_minimal() +
535 |   labs(
536 |     title = "Traveling Violations are Also a Problem Here",
537 |     subtitle = "Distribution of top ten calls in NBA TV games since March 2015.",
538 |     x = "Decision",
539 |     y = "Count",
540 |     fill = "Decision",
541 |     caption = "Data: atlhawksfanatic (GitHub) | By Pablo Suarez"
542 |   ) +
543 |   theme(
544 |     axis.text.x = element_text(angle = 45, hjust = 1),
545 |     strip.text = element_text(size = 10)
546 |   )
547 | 
548 | ```
549 | 
550 | Similar to the overall findings from the data, there seems to be an issue with calling certain violations in these NBA TV games. The 5-second inbound calls aren't entirely as worrisome as they appear. There are only five total instances across the NBA TV dataframe I made and four of them happened in 2015, while the remaining one occurred in 2016. Although it is a bit surprising that these calls either haven't taken place or been reviewed since then. 
551 | 
552 | The traveling calls, again, aren't necessarily frequently called, but the proportion of incorrect calls should be cause for attention from referees. That's especially considering that unlike the overall trends, incorrect calls are outpacing correctly judged travels on NBA TV games.
553 | 
554 | If there's anything that referees and the NBA can take away from this, it's that there's some work to do in terms of addressing calls beyond common fouls.
555 | 
556 | 
557 | 
558 | 
559 | 
560 | 
561 | 
562 | 
563 | 
564 | 
565 | 
566 | 
567 | 
568 | 


--------------------------------------------------------------------------------
/nicar25_wbb.Rmd:
--------------------------------------------------------------------------------
  1 | # Bringing Data Analysis to the Sports Section
  2 | 
  3 | ## NICAR 2025
  4 | 
  5 | #### Matt Waite, University of Nebraska
  6 | 
  7 | #### MaryJo Webster, Minnesota Star Tribune
  8 | 
  9 | #### Derek Willis, University of Maryland
 10 | 
 11 | 
 12 | #### <https://github.com/dwillis/nicar25-sports>
 13 | 
 14 | Some of the best sports data stories start with a good question. How about this one: which teams benefit the most from home-court officiating?
 15 | 
 16 | First, we'll load the tidyverse.
 17 | 
 18 | ```{r}
 19 | library(tidyverse)
 20 | ```
 21 | 
 22 | Our data is game-level records from the 2017-18 season through last week. After loading it, we'll want to add columns to measure the foul differential between the two teams, and I'll through in steal differential, too, since that's often associated with aggressive defense.
 23 | 
 24 | ```{r}
 25 | logs <- read_csv("logs1824.csv")
 26 | 
 27 | logs <- logs |> mutate(foul_diff = TeamPersonalFouls-OpponentPersonalFouls, steal_diff = TeamSteals- OpponentSteals, turnover_diff = TeamTurnovers- OpponentTurnovers)
 28 | ```
 29 | 
 30 | Let's take a look at our data so we can get familiar with it. 
 31 | 
 32 | ## Averages for all teams
 33 | 
 34 | ```{r}
 35 | logs |> filter(!is.na(TeamPersonalFouls)) |> group_by(HomeAway) |> summarize(avg_fouls = mean(TeamPersonalFouls))
 36 | ```
 37 | 
 38 | 
 39 | Then, we'll start our group_by and summarize work, for teams and conferences, thanks to `pivot_wider`:
 40 | 
 41 | ```{r}
 42 | 
 43 | team_fouls_location_season <- logs |> 
 44 |   filter(!is.na(TeamPersonalFouls)) |> 
 45 |   group_by(Conference, Team, Season, HomeAway) |> 
 46 |   summarize(avg_fouls = mean(TeamPersonalFouls)) |> 
 47 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 48 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 49 |   mutate(Diff = Home-Away)
 50 | 
 51 | conference_fouls_location_season <- logs |> 
 52 |   filter(!is.na(TeamPersonalFouls)) |> 
 53 |   group_by(Conference, Season, HomeAway) |> 
 54 |   summarize(avg_fouls = mean(TeamPersonalFouls)) |> 
 55 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 56 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 57 |   mutate(Diff = Home-Away)
 58 | 
 59 | team_fouls_location <- logs |> 
 60 |   filter(!is.na(TeamPersonalFouls)) |> 
 61 |   group_by(Conference, Team, HomeAway) |> 
 62 |   summarize(avg_fouls = mean(TeamPersonalFouls)) |> 
 63 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 64 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 65 |   mutate(Diff = Home-Away)
 66 | 
 67 | 
 68 | conference_fouls_location <- logs |> 
 69 |   filter(!is.na(TeamPersonalFouls)) |> 
 70 |   group_by(Conference, HomeAway) |> 
 71 |   summarize(avg_fouls = mean(TeamPersonalFouls)) |> 
 72 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 73 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 74 |   mutate(Diff = Home-Away)
 75 | 
 76 | team_fouldiff_location_season <- logs |> 
 77 |   filter(!is.na(foul_diff)) |> 
 78 |   group_by(Conference, Team, Season, HomeAway) |> 
 79 |   summarize(avg_fouls = mean(foul_diff)) |> 
 80 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 81 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 82 |   mutate(Diff = Home-Away)
 83 | 
 84 | conference_fouldiff_location_season <- logs |> 
 85 |   filter(!is.na(foul_diff)) |> 
 86 |   group_by(Conference, Season, HomeAway) |> 
 87 |   summarize(avg_fouls = mean(foul_diff)) |> 
 88 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 89 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 90 |   mutate(Diff = Home-Away)
 91 | 
 92 | team_fouldiff_location <- logs |> 
 93 |   filter(!is.na(foul_diff)) |> 
 94 |   group_by(Team, HomeAway) |> 
 95 |   summarize(avg_fouls = mean(foul_diff)) |> 
 96 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
 97 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
 98 |   mutate(Diff = Home-Away)
 99 | 
100 | conference_fouldiff_location <- logs |> 
101 |   filter(!is.na(foul_diff)) |> 
102 |   group_by(Conference, HomeAway) |> 
103 |   summarize(avg_fouls = mean(foul_diff)) |> 
104 |   pivot_wider(names_from = HomeAway, values_from = avg_fouls) |> 
105 |   rename(Home = `NA`, Away = `@`, Neutral = N) |> 
106 |   mutate(Diff = Home-Away)
107 | ```
108 | ### How many teams have averaged fewer fouls at home and away over the past seven seasons?
109 | 
110 | ```{r}
111 | team_fouldiff_location_season |> filter(Home < 0, Away < 0) |> group_by(Team) |> summarize(count = n()) |> arrange(desc(count))
112 | ```
113 | 
114 | ### Teams that are good at this - are they alike? Yes, except South Carolina and Western Illinois.
115 | 
116 | ```{r}
117 | team_fouldiff_location |> 
118 |   filter(Away < 0, Home < 0) |> 
119 |   select(Team, Away, Home, Diff) |> 
120 |   arrange(Diff)
121 | ```
122 | 
123 | ### Drilling down into game-level data for West Virginia
124 | 
125 | 
126 | ```{r}
127 | logs |> filter(Team == "West Virginia", !is.na(W_L), is.na(HomeAway)) |>  select(Season, Date, Opponent, W_L, foul_diff, steal_diff, turnover_diff) |>  arrange(desc(Date)) |> distinct()
128 | ```
129 | 
130 | 
131 | 


--------------------------------------------------------------------------------