├── LICENSE
├── README.md
├── data_gathering
│   ├── README.md
│   ├── api_functions.py
│   ├── constants.py
│   ├── json_to_parquet.py
│   └── nhl_api_examples.py
├── generate_pdf.py
├── imgs
│   ├── README.md
│   ├── readme_imgs
│   │   ├── pie_plot_sample.png
│   │   ├── rank_hbar_plot_sample.png
│   │   ├── report_sample1.jpg
│   │   ├── report_sample2.jpg
│   │   ├── report_sample3.jpg
│   │   └── rink_image_sample.png
│   ├── simple_rink_grey.jpg
│   └── team_logos
│       ├── 1.jpg
│       ├── 10.jpg
│       ├── 12.jpg
│       ├── 13.jpg
│       ├── 14.jpg
│       ├── 15.jpg
│       ├── 16.jpg
│       ├── 17.jpg
│       ├── 18.jpg
│       ├── 19.jpg
│       ├── 2.jpg
│       ├── 20.jpg
│       ├── 21.jpg
│       ├── 22.jpg
│       ├── 23.jpg
│       ├── 24.jpg
│       ├── 25.jpg
│       ├── 26.jpg
│       ├── 28.jpg
│       ├── 29.jpg
│       ├── 3.jpg
│       ├── 30.jpg
│       ├── 4.jpg
│       ├── 5.jpg
│       ├── 52.jpg
│       ├── 53.jpg
│       ├── 54.jpg
│       ├── 55.jpg
│       ├── 6.jpg
│       ├── 7.jpg
│       ├── 8.jpg
│       └── 9.jpg
├── report.pdf
├── report_generation
│   ├── README.md
│   ├── data_query.py
│   └── plotting_functions.py
└── requirements.txt
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Brendan Artley
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Analytics Report Generator
2 |
3 | ## 11/17/2023: NHL stats API updated, repo no longer maintained.
4 |
5 | ## Overview / Purpose
6 |
7 | The generated report is meant to provide coaches and players with a snapshot of their overall performance for a given season. The report can also be used by opposing teams to get a sense of where specific players are threatening in the offensive zone. This can be useful in guiding player development or preparing players for likely matchups in upcoming games.
8 |
9 |
10 |
11 | *(Sample report pages: imgs/readme_imgs/report_sample1.jpg, report_sample2.jpg, and report_sample3.jpg)*
14 |
15 |
16 |
17 | ## API Requests (Grequests)
18 |
19 | The first part of the project involved querying the NHL API to get all the games, players, events during games, rankings, general statistics, and more. Given that the NHL API server does not enforce any rate limiting, we can leverage the power of asynchronous requests and get the information much faster.
20 |
21 | See more about the NHL API here --> [NHL Api Docs](https://gitlab.com/dword4/nhlapi/-/blob/master/stats-api.md#configurations)
22 |
23 | Asynchronous requests send requests in parallel to reduce the time spent waiting for data to be sent from the server. That being said, doing large-scale requests in parallel can create server lag, so I performed these requests in batches of 100 to reduce server strain.
24 |
25 | For each batch, the API returns data in a JSON format and I used a context manager to write this to disk.
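
Roughly, the batching and writing look like the sketch below. This is a simplified version of what `data_gathering/api_functions.py` does; `urls` and the output path are placeholders.

```python
import json
import grequests

def fetch_in_batches(urls, out_path, batch_size=100):
    """Send GET requests in batches and write one JSON object per line."""
    with open(out_path, "w", encoding="utf-16") as f:
        for b in range(0, len(urls), batch_size):
            batch = urls[b:b + batch_size]
            reqs = (grequests.get(u) for u in batch)   # unsent requests
            responses = grequests.map(reqs)            # sent concurrently via gevent
            for r in responses:
                if r is not None and r.ok:
                    f.write(json.dumps(r.json()) + "\n")
```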
26 |
27 | ## JSON to Parquet (Pyspark)
28 |
29 | Although JSON will work for small datasets, I wanted to convert the JSON data to Parquet to consume less space, limit IO operations, and increase scalability. The benefits of this data format conversion become more evident as the size of the dataset grows.
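
The conversion itself is only a few lines of PySpark; here is a minimal sketch with a placeholder file name (the real version, which loops over the whole `raw_data` directory, lives in `data_gathering/json_to_parquet.py`).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json_to_parquet").getOrCreate()

# read newline-delimited JSON and write it back out as Parquet
df = spark.read.json("./raw_data/livefeed_p2.json")
df.write.parquet("./raw_data/parquet/livefeed_p2")
```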
30 |
31 | ## Generating Plots (Matplotlib, Pandas)
32 |
33 | Currently, there are 5 plots generated with Matplotlib on the reports. First, using Pyspark we filter our dataset based on the player_id and the season specified, and then convert that data to a pandas dataframe (which can be read by Matplotlib).
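
For example, the shot events for a single player and season are pulled out of the Parquet data roughly like this (condensed from `report_generation/data_query.py`, using the sample ID and season listed at the bottom of this README):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("report query").getOrCreate()
p_id, season = 8471698, 20202021   # sample player (Oshie) and season

df = spark.read.parquet("./raw_data/parquet/livefeed_p2")
player_df = (df.where((df["p1_id"] == p_id) &
                      (df["p1_type"].isin(["Shooter", "Scorer"])) &
                      (df["season"] == season))
               .select("x_coordinate", "y_coordinate", "event", "period", "periodTime")
               .toPandas())         # small result, safe to hand to pandas / Matplotlib
```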
34 |
35 | The first two plots are pie plots that show the player's shot accuracy and the percentage of shots that result in a goal. The following is an example of the shot/goal percentage pie plot.
36 |
37 | *(See imgs/readme_imgs/pie_plot_sample.png)*
38 |
39 | The next two scatter plots show the location of the player's shots on the ice. In one plot, shots are marked as either a goal or no goal, and in the other plot, shots are marked as on-net or missed-net. This is indicated by the color of the markers and the plot legend. Here is an example of the player shot accuracy scatter plot.
40 |
41 | *(See imgs/readme_imgs/rink_image_sample.png)*
42 |
43 | The initial rink image was in color and contained unnecessary markings that needed to be removed. Given that I am on a student budget, Photoshop was out of the question, but there is a great alternative called Photopea. This free, in-browser tool works similarly to Photoshop and was perfect for the small edits needed on this image.
44 |
45 | Check out PhotoPea here --> https://www.photopea.com/
46 |
47 | The final plot generated with Matplotlib is the horizontal bar plot that indicates the player's season rankings. The color scheme and style of the plot are motivated by Tableau-style bar plots. When selecting the colors for the report I used Colormind, as well as Dopely Colors' color-blender tool to create gradients.
48 |
49 | Here is an example of the bar plot that is included in the report.
50 |
51 | *(See imgs/readme_imgs/rank_hbar_plot_sample.png)*
52 |
53 | Check out Colormind here --> http://colormind.io/
54 |
55 | Check out dopelyColors here --> https://colors.dopely.top/color-blender/
56 |
57 | ## PDF Generation (FPDF)
58 |
59 | "PyFPDF is a library for PDF document generation under Python, ported from PHP"
60 |
61 | This library was very interesting to work with. Although there is some documentation, it was not always complete and required experimentation to achieve the desired result. This package is very customizable and is well suited to automate professional report generation.
62 |
63 | I used this package in this project to create a title page, footer, header, and page count, and to insert the Matplotlib plots. Once all the information has been added, the report is saved in PDF format.
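
The overall pattern in `generate_pdf.py` is to subclass `FPDF`, override `header`/`footer`, and place each image by its x/y coordinates. Here is a trimmed-down sketch of that pattern; the paths and coordinates are illustrative only.

```python
from fpdf import FPDF

class ReportPDF(FPDF):
    def header(self):
        self.set_font('Arial', 'B', 15)
        self.cell(0, 10, 'Annual Player Report', 0, 0, 'C')
        self.ln(20)

    def footer(self):
        self.set_y(-15)                  # 1.5 cm from the bottom of the page
        self.set_font('Arial', 'I', 8)
        self.cell(0, 10, 'Page ' + str(self.page_no()), 0, 0, 'R')

pdf = ReportPDF()
pdf.add_page()
pdf.set_font('Times', '', 12)
pdf.image('./tmp/pie_plot1.jpg', x=120, y=20, w=75)   # positions are in mm on an A4 page
pdf.output('report.pdf', 'F')
```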
64 |
65 | See more about the FPDF Library here --> [FPDF Documentation](https://pyfpdf.readthedocs.io/en/latest/index.html)
66 |
67 | ## Tips / Notes
68 |
69 | - Initially, it was taking ~7 mins to generate each report using .png images, but this decreased to ~10 seconds when using the jpg/jpeg format. Without images on the report, the PDF generation is quite fast and runs in <1 sec
70 |
71 | - It is important to keep track of the page dimensions, because components are mostly added by x and y coordinates
72 |
73 | ## Improvements / Additions
74 |
75 | - There is not a great way of looking up player_ids without going to NHL.com and manually searching the player, then selecting the ID from the URL. This could be solved with some sort of bash auto-complete search package? Not sure if there are any tools for this.
76 |
77 | - I have not added any significant player_id checks to ensure that the ID entered is valid. Currently, the code will throw an error because the data query does not produce enough information to generate all the plots. The same problem occurs if a goalie ID is entered or if a player did not play in the specified season.
78 |
79 | - The data is currently stored locally as parquet files, but in production, it would more likely be in a Database / Cloud storage. Could be useful to store this data in the Cloud.
80 |
81 | ## Running the Code
82 |
83 | To run the code in the project and start generating reports, follow the steps below.
84 |
85 | 1. Install requirements with pip
86 |
87 | `pip install -r requirements.txt`
88 |
89 | 2. Make API query to build the dataset
90 |
91 | `python data_gathering/api_functions.py`
92 |
93 | Note: Try to do this step outside of game times (when the server has low traffic), as it will take roughly 15-20 mins and will be taxing if other users are using the API
94 |
95 | 3. Convert JSON to Parquet
96 |
97 | `python data_gathering/json_to_parquet.py`
98 |
99 | 4. Generate reports based on ID and Season
100 |
101 | `python generate_pdf.py <player_id> <season>`
102 |
103 | `python generate_pdf.py 8471698 20202021`
104 |
105 | ID Samples
106 | - Hoglander - 8481535
107 | - Oshie - 8471698
108 | - Marchand - 8473419
109 |
110 | Season Samples
111 | - 20202021
112 | - 20132014
113 | - 20162017
114 |
--------------------------------------------------------------------------------
/data_gathering/README.md:
--------------------------------------------------------------------------------
1 | # Data Gathering
2 |
3 | `api_functions.py`
4 |
5 | This is the meat and potatoes of the data gathering process. This script contains all of the API requests and the JSON filtering applied to the responses.
6 |
7 | `constants.py`
8 |
9 | Contains the constants that always prepend the endpoint for each request.
10 |
11 | `json_to_parquet.py`
12 |
13 | Given a directory name, converts all JSON files in that directory to a parquet format using Pyspark
14 |
15 | `nhl_api_examples.py`
16 |
17 | Provides a few simple examples of how to make a request to the NHL API (see the short example below).
18 |
19 | See more comprehensive endpoint examples here --> [NHL Api Docs](https://gitlab.com/dword4/nhlapi/-/blob/master/stats-api.md#configurations)
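
For reference, a full request is just the base URL from `constants.py` plus an endpoint. A minimal sketch mirroring `nhl_api_examples.py` (assuming it is run from this directory, like the other scripts, so that `constants` is importable):

```python
import requests
import constants  # NHL_STATS_API = 'https://statsapi.web.nhl.com/'

# player info example (ID: 8474568 - Luke Schenn)
endpoint = "api/v1/people/8474568"
r = requests.get(constants.NHL_STATS_API + endpoint)
print(r.json()["people"][0]["fullName"])
```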
20 |
21 |
22 |
--------------------------------------------------------------------------------
/data_gathering/api_functions.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import grequests
3 | import constants
4 | import json
5 | from tqdm import tqdm
6 | import re
7 | import os
8 |
9 | def get_game_ids(just_regular_season, filter_by_team, season_start, season_end, teamId=23):
10 | """
11 | inputs example ..(just_regular_season=True, filter_by_team=False, season_start=20112012, season_end=20202021, teamId=23)
12 |
13 |     Given a starting season, an ending season, a regular-season boolean, a team-filter boolean, a team id, etc.
14 |     Returns all the game_ids for the specified season range as an array/list.
15 |
16 | - season_start and season_end are inclusive
17 | """
18 | #log
19 | print("\n -- Getting Game ID's -- \n")
20 |
21 | #variables
22 | id_array = []
23 | url = constants.NHL_STATS_API
24 | endpoint = "api/v1/schedule/?season="
25 |
26 | #urls
27 | urls = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)]
28 | if just_regular_season:
29 | urls = [val + "&gameType=R" for val in urls]
30 | if filter_by_team:
31 | urls = [val + "&teamId={}".format(teamId) for val in urls]
32 |
33 | #making requests - 10x faster than simple requests
34 | reqs = (grequests.get(u) for u in tqdm(urls, desc="API Requests"))
35 | responses = grequests.map(reqs)
36 |
37 | #Fetch ID's from json
38 | for r in tqdm(responses, desc="Collecting List"):
39 | if r.ok:
40 | d = r.json()
41 | for i in range(len(d["dates"])):
42 | for j in range(len(d["dates"][i]["games"])):
43 | id_array.append(d["dates"][i]["games"][j]["gamePk"])
44 |
45 | return id_array
46 |
47 | def get_all_player_ids_in_range(season_start, season_end):
48 | """
49 |     Given a season range, returns a list of all the players
50 |     on an NHL roster in those seasons.
51 | """
52 | #log
53 | print("\n -- Getting Players in Season Range -- \n")
54 |
55 | #constants
56 | url = constants.NHL_STATS_API
57 | endpoint = "api/v1/teams?expand=team.roster&season="
58 | urls = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)]
59 | players = set()
60 |
61 | #Doing API calls in batches
62 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"):
63 | batch = urls[b:b+100]
64 |
65 | #making requests
66 | reqs = (grequests.get(u) for u in batch)
67 | responses = grequests.map(reqs)
68 |
69 | #writing data to file
70 | for r in responses:
71 | if r.ok:
72 | d = r.json()
73 | for i in range(len(d['teams'])):
74 | team_data = d['teams'][i]
75 | for player in team_data['roster']['roster']:
76 | if "person" in player and "id" in player["person"]:
77 | players.add(player["person"]["id"])
78 |
79 | return list(players)
80 |
81 | def simple_game_stats(fname, season_start, season_end, just_regular_season=True):
82 | """
83 |     Returns scores, teams, and shots for every game in the given season range.
84 |     Outputs the results in JSON format into the raw_data folder.
85 | """
86 | #log
87 | print("\n -- Getting Simple Game Stats -- \n")
88 |
89 | #URL format
90 | fname = "linescore_" + fname
91 | url = constants.NHL_STATS_API
92 | endpoint = "api/v1/schedule/?expand=schedule.linescore&season="
93 |
94 | #Creating each individual URL
95 | urls = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)]
96 | if just_regular_season:
97 | urls = [x + "&gameType=R" for x in urls]
98 |
99 | #making requests
100 | reqs = (grequests.get(u) for u in urls)
101 | responses = grequests.map(reqs)
102 |
103 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f:
104 |
105 | rj = {"gamePk":"",
106 | "season":"",
107 | "away_name":"",
108 | "away_id":"",
109 | "away_goals":"",
110 | "away_shotsOnGoal":"",
111 | "home_name":"",
112 | "home_id":"",
113 | "home_goals":"",
114 | "home_shotsOnGoal":""
115 | }
116 |
117 | stats_tracked = ["goals", "shotsOnGoal"]
118 | team_info = ["id", "name"]
119 |
120 | for r in tqdm(responses, desc="Writing Data to Disk"):
121 | if r is not None and r.ok and r.json() is not None:
122 | d = r.json()
123 |
124 | #looping through every game in specified season
125 | for i in range(len(d["dates"])):
126 | for j in range(len(d["dates"][i]["games"])):
127 | game = d["dates"][i]["games"][j]
128 | if game["status"]["abstractGameState"] == 'Final':
129 |                             if "gamePk" in game and "season" in game:
130 | rj["gamePk"] = game["gamePk"]
131 | rj["season"] = game["season"]
132 |
133 | for event in stats_tracked:
134 |                                 if event in game["linescore"]["teams"]["away"] and event in game["linescore"]["teams"]["home"]:
135 | rj["away_{}".format(event)] = game["linescore"]["teams"]["away"][event]
136 | rj["home_{}".format(event)] = game["linescore"]["teams"]["home"][event]
137 | else:
138 | continue
139 |
140 | for item in team_info:
141 |                                 if item in game["linescore"]["teams"]["away"]["team"] and item in game["linescore"]["teams"]["home"]["team"]:
142 | rj["away_{}".format(item)] = game["linescore"]["teams"]["away"]["team"][item]
143 | rj["home_{}".format(item)] = game["linescore"]["teams"]["home"]["team"][item]
144 |
145 | f.write(json.dumps(rj) + '\n')
146 |
147 | f.seek(0, 2) # seek to end of file
148 | f.seek(f.tell() - 2, 0) # seek to the second last char of file
149 | f.truncate()
150 | pass
151 |
152 | def get_all_game_event_stats(game_id_array, fname):
153 | """
154 | Given a list of game ID's, returns raw event data for each game
155 | in a format suitable for pyspark.
156 | """
157 | #log
158 | print("\n -- Getting All Game Events -- \n")
159 |
160 | #Variables
161 | fname = "livefeed_" + fname
162 | url = constants.NHL_STATS_API
163 | endpoint = "api/v1/game/{}/feed/live"
164 |
165 | #Formatting URL's
166 | urls = [url + endpoint.format(date) for date in game_id_array]
167 |
168 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f:
169 | #Doing API calls in batches
170 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"):
171 | batch = urls[b:b+100]
172 |
173 | #Making requests - 10x faster than simple requests
174 | reqs = (grequests.get(u) for u in batch)
175 | responses = grequests.map(reqs)
176 |
177 | rj = {}
178 |
179 | for resp in responses:
180 | if resp is not None and resp.ok and resp.json() is not None:
181 | d = resp.json()
182 |
183 | if "gamePk" in d:
184 | rj["gamePk"] = d["gamePk"]
185 | else:
186 | continue
187 |
188 | rj["season"] = d["gameData"]['game']['season']
189 |
190 | for val in d["liveData"]["plays"]["allPlays"]:
191 |
192 | rj["event"] = val["result"]["event"]
193 | rj["periodTime"] = val["about"]["periodTime"]
194 | rj["dateTime"] = val["about"]["dateTime"]
195 | rj["period"] = val["about"]["period"]
196 |
197 | #on ice-coordinates
198 | if len(val["coordinates"]) == 2:
199 | rj["x_coordinate"] = val["coordinates"]["x"]
200 | rj["y_coordinate"] = val["coordinates"]["y"]
201 | else:
202 | rj["x_coordinate"] = None
203 | rj["y_coordinate"] = None
204 |
205 | #players involved, can be up to 4
206 | if "players" in val:
207 | num_players = len(val["players"])
208 | if num_players > 4:
209 | print(" ---------- Event > 4 Players?? ---------- ")
210 | for i in range(4):
211 | if i < num_players:
212 | rj["p{}_id".format(i+1)] = val["players"][i]["player"]["id"]
213 | rj["p{}_type".format(i+1)] = val["players"][i]["playerType"]
214 | rj["p{}_name".format(i+1)] = val["players"][i]["player"]["fullName"]
215 | else:
216 | rj["p{}_id".format(i+1)] = None
217 | rj["p{}_type".format(i+1)] = None
218 | rj["p{}_name".format(i+1)] = None
219 |
220 | else:
221 | for i in range(4):
222 | rj["p{}_id".format(i+1)] = None
223 | rj["p{}_type".format(i+1)] = None
224 | rj["p{}_name".format(i+1)] = None
225 |
226 | #team-id if relevant
227 | if "team" in val:
228 | rj["team_id"] = val["team"]["id"]
229 | else:
230 | rj["team_id"] = None
231 |
232 | #write output to JSON
233 | f.write(json.dumps(rj) + '\n')
234 | else:
235 | print("{}".format(resp.status_code))
236 |
237 |         #remove the trailing '\n' at the end of the written JSON file
238 | f.seek(0, 2) # seek to end of file
239 | f.seek(f.tell() - 2, 0) # seek to the second last char of file
240 | f.truncate()
241 |
242 | def get_players_season_goal_stats(player_id_array, season_start, season_end, fname, just_regular_season=True):
243 | """
244 | Given an array of player_id's and a season,
245 | returns the goal statistics for each player
246 | in JSON format.
247 | """
248 | #log
249 | print("\n -- Getting Player Goal Stats -- \n")
250 |
251 | #constants
252 | fname = "goalsByGameSituationStats_" + fname
253 | url = constants.NHL_STATS_API
254 | endpoint = "api/v1/people/{}/stats?stats=goalsByGameSituation&season="
255 | urls = []
256 |
257 | url_format = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)]
258 | if just_regular_season:
259 | url_format = [x + "&gameType=R" for x in url_format]
260 |
261 | goal_stats_tracked = [
262 | 'goalsInFirstPeriod',
263 | 'goalsInSecondPeriod',
264 | 'goalsInThirdPeriod',
265 | 'gameWinningGoals',
266 | 'emptyNetGoals',
267 | 'shootOutGoals',
268 | 'shootOutShots',
269 | 'goalsTrailingByOne',
270 | 'goalsTrailingByThreePlus',
271 | 'goalsWhenTied',
272 | 'goalsLeadingByOne',
273 | 'goalsLeadingByTwo',
274 | 'goalsLeadingByThreePlus',
275 | 'penaltyGoals',
276 | 'penaltyShots',
277 | ]
278 |
279 | #Creating each individual URL
280 | for p_id in player_id_array:
281 | for url in url_format:
282 | urls.append(url.format(p_id))
283 |
284 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f:
285 | rj = {}
286 |
287 | #Doing API calls in batches
288 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"):
289 | batch = urls[b:b+100]
290 |
291 | #making requests
292 | reqs = (grequests.get(u) for u in batch)
293 | responses = grequests.map(reqs)
294 |
295 | for r in responses:
296 | if r is not None and r.ok and r.json() is not None:
297 | d = r.json()
298 | rj['p_id'] = int(re.findall(r'[//][0-9]+', r.url)[0][1:])
299 | rj['season'] = re.findall(r'\d+', r.url)[-1]
300 | if len(d["stats"][0]["splits"]) != 0:
301 | for ev in goal_stats_tracked:
302 | if ev in d["stats"][0]["splits"][0]["stat"]:
303 | rj[ev] = d["stats"][0]["splits"][0]["stat"][ev]
304 | else:
305 | rj[ev] = 0
306 | f.write(json.dumps(rj) + '\n')
307 |
308 |         #remove the trailing '\n' at the end of the written JSON file
309 | f.seek(0, 2) # seek to end of file
310 | f.seek(f.tell() - 2, 0) # seek to the second last char of file
311 | f.truncate()
312 |
313 | def get_players_season_general_stats(player_id_array, season_start, season_end, fname, just_regular_season=True):
314 | """
315 | Given an array of player_id's and a season,
316 | returns the general statistics for each player
317 | in JSON format.
318 | """
319 | #log
320 | print("\n -- Getting Player General Stats -- \n")
321 |
322 |
323 | #constants
324 | fname = "statsSingleSeason_" + fname
325 | url = constants.NHL_STATS_API
326 | endpoint = "api/v1/people/{}/stats?stats=statsSingleSeason&season="
327 | urls = []
328 |
329 | url_format = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)]
330 | if just_regular_season:
331 | url_format = [x + "&gameType=R" for x in url_format]
332 |
333 | stats_tracked = [
334 | 'timeOnIce',
335 | 'assists',
336 | 'goals',
337 | 'pim',
338 | 'shots',
339 | 'games',
340 | 'hits',
341 | 'powerPlayGoals',
342 | 'powerPlayPoints',
343 | 'powerPlayTimeOnIce',
344 | 'evenTimeOnIce',
345 | 'penaltyMinutes' ,
346 | 'faceOffPct',
347 | 'shotPct',
348 | 'gameWinningGoals',
349 | 'overTimeGoals',
350 | 'shortHandedGoals',
351 | 'shortHandedPoints',
352 | 'shortHandedTimeOnIce',
353 | 'blocked',
354 | 'plusMinus',
355 | 'points',
356 | 'shifts',
357 | 'timeOnIcePerGame',
358 | 'evenTimeOnIcePerGame',
359 | 'shortHandedTimeOnIcePerGame',
360 | 'powerPlayTimeOnIcePerGame',
361 | ]
362 |
363 | #Creating each individual URL
364 | for p_id in player_id_array:
365 | for url in url_format:
366 | urls.append(url.format(p_id))
367 |
368 | #writing data to file
369 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f:
370 | rj = {}
371 |
372 | #Doing API calls in batches
373 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"):
374 | batch = urls[b:b+100]
375 |
376 | #making requests
377 | reqs = (grequests.get(u) for u in batch)
378 | responses = grequests.map(reqs)
379 |
380 | for r in responses:
381 | if r is not None and r.ok and r.json() is not None:
382 | d = r.json()
383 | rj['p_id'] = int(re.findall(r'[//][0-9]+', r.url)[0][1:])
384 | if len(d["stats"][0]["splits"]) != 0:
385 | rj["season"] = d["stats"][0]["splits"][0]['season']
386 | for ev in stats_tracked:
387 | if ev in d["stats"][0]["splits"][0]["stat"]:
388 | rj[ev] = d["stats"][0]["splits"][0]["stat"][ev]
389 | else:
390 | rj[ev] = None
391 |
392 | f.write(json.dumps(rj) + '\n')
393 |
394 |         #remove the trailing '\n' at the end of the written JSON file
395 | f.seek(0, 2) # seek to end of file
396 | f.seek(f.tell() - 2, 0) # seek to the second last char of file
397 | f.truncate()
398 |
399 | def get_players_season_stat_rankings(player_id_array, season_start, season_end, fname):
400 | """
401 | Given an array of player_id's and a season range,
402 |     returns the rankings across numerous statistics
403 | for each player in a JSON format.
404 | """
405 | #log
406 | print("\n -- Getting Player Rankings -- \n")
407 |
408 | #constants
409 | fname = "regularSeasonStatRankings_" + fname
410 | url = constants.NHL_STATS_API
411 | endpoint = "api/v1/people/{}/stats?stats=regularSeasonStatRankings&season="
412 | urls = []
413 |
414 | url_format = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)]
415 |
416 | stats_tracked = ['rankPowerPlayGoals',
417 | 'rankBlockedShots',
418 | 'rankAssists',
419 | 'rankShotPct',
420 | 'rankGoals',
421 | 'rankHits',
422 | 'rankPenaltyMinutes',
423 | 'rankShortHandedGoals',
424 | 'rankPlusMinus',
425 | 'rankShots',
426 | 'rankPoints',
427 | 'rankOvertimeGoals',
428 | 'rankGamesPlayed',
429 | ]
430 |
431 | #Creating each individual URL
432 | for p_id in player_id_array:
433 | for url in url_format:
434 | urls.append(url.format(p_id))
435 |
436 | #writing data to file
437 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f:
438 | rj = {}
439 |
440 | #Doing API calls in batches
441 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"):
442 | batch = urls[b:b+100]
443 |
444 | #making requests
445 | reqs = (grequests.get(u) for u in batch)
446 | responses = grequests.map(reqs)
447 |
448 | for r in responses:
449 | if r is not None and r.ok and r.json() is not None:
450 | d = r.json()
451 | rj['p_id'] = int(re.findall(r'[//][0-9]+', r.url)[0][1:])
452 |
453 | if len(d["stats"][0]["splits"]) != 0:
454 | rj["season"] = d["stats"][0]["splits"][0]['season']
455 | for ev in stats_tracked:
456 | if ev in d["stats"][0]["splits"][0]["stat"]:
457 | rj[ev] = d["stats"][0]["splits"][0]["stat"][ev]
458 | else:
459 | rj[ev] = None
460 |
461 | f.write(json.dumps(rj) + '\n')
462 |
463 |         #remove the trailing '\n' at the end of the written JSON file
464 | f.seek(0, 2) # seek to end of file
465 | f.seek(f.tell() - 2, 0) # seek to the second last char of file
466 | f.truncate()
467 |
468 | def get_players_info_by_season(player_id_array, fname):
469 | """
470 | Given an array of player_id's and a season range,
471 | returns player info by season for each player in a
472 | JSON format.
473 | """
474 | #log
475 | print("\n -- Getting Player Info -- \n")
476 |
477 | #constants
478 | fname = "yearByYear_" + fname
479 | url = constants.NHL_STATS_API
480 | endpoint = "api/v1/people/{}"
481 | endpoint2 = "api/v1/people/{}/stats?stats=yearByYear"
482 | urls = []
483 |
484 | #Creating each individual URL
485 | for p_id in player_id_array:
486 | urls.append(url + endpoint.format(p_id))
487 | urls.append(url + endpoint2.format(p_id))
488 |
489 | #writing data to file
490 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f:
491 | rj = {}
492 |
493 | #Doing API calls in batches
494 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"):
495 | batch = urls[b:b+100]
496 |
497 | #making requests
498 | reqs = (grequests.get(u) for u in batch)
499 | responses = grequests.map(reqs)
500 |
501 | for i in range (0, len(responses), 2):
502 | r = responses[i]
503 | r2 = responses[i+1]
504 | if r is not None and r.ok and r.json() is not None:
505 | d = r.json()
506 | if "people" in d:
507 | rj["p_id"] = d["people"][0]["id"]
508 | rj["fullName"] = d["people"][0]["fullName"]
509 |
510 | if r2 is not None and r2.ok and r2.json() is not None:
511 | d2 = r2.json()
512 | if "stats" in d2 and "splits" in d2["stats"][0]:
513 | season_counter = 0
514 | last_season = ""
515 | for split in d2["stats"][0]["splits"]:
516 | if split["league"]["name"] == "National Hockey League":
517 | if split["season"] == last_season:
518 | season_counter += 1
519 | else:
520 | season_counter = 0
521 |
522 | rj["team_id"] = split["team"]["id"]
523 | rj["team_name"] = split["team"]["name"]
524 | rj["season"] = split["season"]
525 | rj["team_num_this_season"] = season_counter
526 |
527 | last_season = split["season"]
528 | f.write(json.dumps(rj) + '\n')
529 |
530 |         #remove the trailing '\n' at the end of the written JSON file
531 | f.seek(0, 2) # seek to end of file
532 | f.seek(f.tell() - 2, 0) # seek to the second last char of file
533 | f.truncate()
534 |
535 | def check_raw_data_directory():
536 | """
537 | Checks if the temporary directory has already been created
538 | """
539 | if os.path.isdir("./raw_data") and len(os.listdir("./raw_data")) != 0:
540 | sys.exit(" ERROR:\n - Raw_data contains files? Remove data and store elsewhere before proceeding -")
541 | else:
542 | if not os.path.isdir("./raw_data"):
543 | os.mkdir("./raw_data")
544 | pass
545 |
546 | def main(output):
547 | ss = 20112012
548 | se = 20202021
549 |
550 | #check raw_data directory
551 | check_raw_data_directory()
552 |
553 | #get ids
554 | all_game_ids = get_game_ids(just_regular_season=True, filter_by_team=False, season_start=ss, season_end=se)
555 | all_player_ids = get_all_player_ids_in_range(season_start=ss, season_end=se)
556 |
557 | #game events + stats
558 | simple_game_stats(fname=output, season_start=ss, season_end=se, just_regular_season=True)
559 | get_all_game_event_stats(all_game_ids, fname=output)
560 |
561 | #player_stats
562 | get_players_season_goal_stats(all_player_ids, season_start=ss, season_end=se, fname=output)
563 | get_players_season_general_stats(all_player_ids, season_start=ss, season_end=se, fname=output)
564 | get_players_season_stat_rankings(all_player_ids, season_start=ss, season_end=se, fname = output)
565 |
566 | get_players_info_by_season(all_player_ids, fname = output)
567 | return
568 |
569 | if __name__ == '__main__':
570 | output = sys.argv[1]
571 | main(output)
--------------------------------------------------------------------------------
/data_gathering/constants.py:
--------------------------------------------------------------------------------
1 | NHL_STATS_API = 'https://statsapi.web.nhl.com/'
2 | NHL_RECORDS_API = 'https://records.nhl.com/'
--------------------------------------------------------------------------------
/data_gathering/json_to_parquet.py:
--------------------------------------------------------------------------------
1 | import sys
2 | assert sys.version_info >= (3, 5) # make sure we have Python 3.5+
3 | import os
4 |
5 | from pyspark.sql import SparkSession, functions, types #type:ignore
6 |
7 | def main():
8 | """
9 | Converting the JSON Data to Parquet
10 | """
11 | inputs = "./raw_data/"
12 |
13 | for file in os.listdir(inputs):
14 | if file[-5:] == ".json":
15 | df = spark.read.json(inputs + file)
16 | df.write.parquet("./raw_data/parquet/{}".format(file[:-5]))
17 | pass
18 |
19 | if __name__ == '__main__':
20 | spark = SparkSession.builder.appName('example code').getOrCreate()
21 | assert spark.version >= '3.0' # make sure we have Spark 3.0+
22 | spark.sparkContext.setLogLevel('WARN')
23 | sc = spark.sparkContext
24 | main()
--------------------------------------------------------------------------------
/data_gathering/nhl_api_examples.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import requests
3 |
4 | def main():
5 | '''
6 | See documentation on NHL API by Drew Hynes.
7 |
8 | https://gitlab.com/dword4/nhlapi/-/blob/master/stats-api.md
9 | '''
10 | #Startpoint for every API request
11 | url = 'https://statsapi.web.nhl.com/'
12 |
13 | #team stats example (ID: 23 - Canucks, Franchise ID: 20)
14 | endpoint = "api/v1/teams/23"
15 |
16 | #player stats example (ID: 8474568 - Luke Schenn)
17 | endpoint = "api/v1/people/8474568"
18 |
19 |     #game live feed example (ID: 2020020018)
20 | endpoint = "api/v1/game/2020020018/feed/live"
21 |
22 | r = requests.get(url + endpoint)
23 | return
24 |
25 | if __name__ == '__main__':
26 | main()
--------------------------------------------------------------------------------
/generate_pdf.py:
--------------------------------------------------------------------------------
1 | # Import FPDF class
2 | from fpdf import FPDF
3 | import report_generation.plotting_functions
4 | import os
5 | import sys
6 |
7 | def main(p_id, season):
8 |
9 | goal_stats_list, player_info, player_stats_list = report_generation.plotting_functions.generate_all_plots(p_id = p_id, season = season)
10 |
11 | class PDF(FPDF):
12 |
13 | def __init__(self, player_name, season, organization, team_id):
14 | FPDF.__init__(self) #initializes parent class
15 | self.player_name = player_name
16 | self.season = season
17 | self.organization = organization
18 | self.team_id = team_id
19 |
20 | def header(self):
21 | # Logo (name, x, y, w = 0, h = 0)
22 | # w,h = 0 means automatic
23 | self.image('./imgs/team_logos/{}.jpg'.format(self.team_id), 10, 8, 15, 0)
24 | # font (font,bold,size)
25 | self.set_font('Arial', 'B', 15)
26 | # Move to the right
27 | self.cell(80)
28 | # Title (w,h,text,border,ln,align)
29 | if self.page_no()==1:
30 | pass
31 | elif self.page_no()==2:
32 | self.cell(30, 10, '{} - Goals / Shots'.format(self.player_name), 0, 0, 'C')
33 | elif self.page_no()==3:
34 | self.cell(30, 10, '{} - Shots / Rank / Other'.format(self.player_name), 0, 0, 'C')
35 | # Line break
36 | self.ln(20)
37 |
38 | # Page footer
39 | def footer(self):
40 | if self.page_no()!=1:
41 | # Position at 1.5 cm from bottom
42 | self.set_y(-15)
43 | # Arial italic 8
44 | self.set_font('Arial', 'I', 8)
45 | # Page number
46 | self.cell(0, 10, 'Page ' + str(self.page_no()-1), 0, 0, 'R')
47 |
48 | # ---------- Instantiation of Inherited Class ----------
49 | pdf = PDF(player_info["fullName"], player_info["season"], player_info["team_name"], player_info["team_id"])
50 | pdf.alias_nb_pages()
51 |
52 | # ---------- First Page ----------
53 | pdf.add_page()
54 | pdf.set_font('Times', '', 18)
55 | pdf.ln(h = 30)
56 | pdf.cell(w=0, h=10, txt="Annual Player Report", border=0, ln=1, align="C")
57 | pdf.cell(w=0, h=10, txt=pdf.season[:4] + " / " + pdf.season[4:], border=0, ln=1, align="C")
58 | pdf.cell(w=0, h=10, txt=pdf.organization, border=0, ln=1, align="C")
59 | pdf.cell(w=0, h=10, txt=pdf.player_name, border=0, ln=1, align="C")
60 | pdf.image('./tmp/player.jpg', x = 85, y = 110, w = 40, h = 0, type = '', link = '')
61 |
62 | # ---------- Second Page ----------
63 |
64 | # Since we do not need to draw lines anymore, there is no need to separate
65 | # headers from data matrix.
66 |
67 | pdf.add_page()
68 | pdf.set_font('Times', '', 12)
69 |
70 | table_cell_height = 9
71 | table_cell_width_col1 = 60
72 | table_cell_width_col2 = 20
73 |
74 |     # Table header cells (darker fill)
75 | pdf.set_fill_color(189,210,236) #(r,g,b)
76 | pdf.cell(table_cell_width_col1, table_cell_height, "Goal Statistics", border=1, align='C', fill=True)
77 | pdf.cell(table_cell_width_col2, table_cell_height, "Count", border=1, ln=1, align='C', fill=True)
78 |
79 | pdf.set_fill_color(235,241,249)
80 | for row in goal_stats_list:
81 | for i, datum in enumerate(row):
82 |             # Enter data in columns
83 | if i == 0:
84 | pdf.cell(table_cell_width_col1, table_cell_height, str(datum), border=1, fill=True)
85 | else:
86 | pdf.cell(table_cell_width_col2, table_cell_height, str(datum), border=1, align='C', fill=True)
87 |
88 | pdf.ln(table_cell_height)
89 |
90 | WIDTH = 210
91 | HEIGHT = 297
92 | pdf.image('./tmp/pie_plot1.jpg', x = 120, y = 20, w = (WIDTH-60)//2, h = 0, type = '', link = '')
93 | pdf.image('./tmp/pie_plot2.jpg', x = 120, y = 95, w = (WIDTH-60)//2, h = 0, type = '', link = '')
94 | pdf.image('./tmp/rink_image1.jpg', x = 50, y = 180, w = 110, h = 0, type = '', link = '')
95 |
96 | # ---------- Third Page ----------
97 | # Shot plot and other stats
98 | pdf.add_page()
99 | pdf.set_font('Times', '', 12)
100 | pdf.ln(100)
101 |
102 | table_cell_height = 9
103 | table_cell_width_col1 = 60
104 | table_cell_width_col2 = 20
105 |
106 |     # Table header cells (darker fill)
107 | pdf.set_fill_color(189,210,236) #(r,g,b)
108 | pdf.cell(table_cell_width_col1, table_cell_height, "Other Statistics", border=1, align='C', fill=True)
109 | pdf.cell(table_cell_width_col2, table_cell_height, "Count", border=1, ln=1, align='C', fill=True)
110 |
111 | pdf.set_fill_color(235,241,249)
112 | for row in player_stats_list:
113 | for i, datum in enumerate(row):
114 |             # Enter data in columns
115 | if i == 0:
116 | pdf.cell(table_cell_width_col1, table_cell_height, str(datum), border=1, fill=True)
117 | else:
118 | pdf.cell(table_cell_width_col2, table_cell_height, str(datum), border=1, align='C', fill=True)
119 |
120 | pdf.ln(table_cell_height)
121 |
122 | pdf.image('./tmp/rink_image2.jpg', x = 50, y = 30, w = 110, h = 0, type = '', link = '')
123 | pdf.image('./tmp/rank_hbar_plot1.jpg', x = 110, y = 130, w = 76, h = 0, type = '', link = '')
124 |
125 | # Clear tmp directory
126 | for file in os.listdir("./tmp"):
127 | os.remove("./tmp/" + file)
128 |
129 | # ---------- Save PDF to Output ----------
130 | pdf.output('report.pdf','F')
131 | print(" --- Complete --- ")
132 |
133 | if __name__ == "__main__":
134 | try:
135 | p_id = int(sys.argv[1].strip())
136 | season = int(sys.argv[2].strip())
137 | if len(sys.argv[2].strip()) != 8:
138 | sys.exit(" --------- Check input format --------- ")
139 | except:
140 | sys.exit(" --------- Check input format --------- ")
141 |
142 | main(p_id, season)
--------------------------------------------------------------------------------
/imgs/README.md:
--------------------------------------------------------------------------------
1 | ## Imgs
2 |
3 | Contains the images used in the main README.md file, as well as the team logos, which are named by their respective team IDs.
--------------------------------------------------------------------------------
/imgs/readme_imgs/pie_plot_sample.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/pie_plot_sample.png
--------------------------------------------------------------------------------
/imgs/readme_imgs/rank_hbar_plot_sample.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/rank_hbar_plot_sample.png
--------------------------------------------------------------------------------
/imgs/readme_imgs/report_sample1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/report_sample1.jpg
--------------------------------------------------------------------------------
/imgs/readme_imgs/report_sample2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/report_sample2.jpg
--------------------------------------------------------------------------------
/imgs/readme_imgs/report_sample3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/report_sample3.jpg
--------------------------------------------------------------------------------
/imgs/readme_imgs/rink_image_sample.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/rink_image_sample.png
--------------------------------------------------------------------------------
/imgs/simple_rink_grey.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/simple_rink_grey.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/1.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/10.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/10.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/12.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/12.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/13.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/13.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/14.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/14.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/15.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/15.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/16.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/16.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/17.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/17.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/18.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/18.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/19.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/19.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/2.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/20.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/20.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/21.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/21.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/22.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/22.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/23.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/23.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/24.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/24.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/25.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/25.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/26.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/26.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/28.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/28.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/29.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/29.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/3.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/30.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/30.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/4.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/5.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/5.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/52.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/52.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/53.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/53.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/54.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/54.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/55.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/55.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/6.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/6.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/7.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/7.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/8.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/8.jpg
--------------------------------------------------------------------------------
/imgs/team_logos/9.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/9.jpg
--------------------------------------------------------------------------------
/report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/report.pdf
--------------------------------------------------------------------------------
/report_generation/README.md:
--------------------------------------------------------------------------------
1 | ## Report Generation
2 |
3 | `data_query.py`
4 |
5 | The data query script contains the PySpark code that starts a Spark session and filters the data by player_id and season. The script returns a pandas dataframe, a couple of short arrays, and a dictionary, which are used to create the plots that are included in the report.
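
A rough sketch of how `plotting_functions.py` consumes this, assuming the project is run from the repo root; the ID and season are the samples from the main README:

```python
import report_generation.data_query as data_query

# returns: shot-event dataframe, rank list, goal-stat list, player-info dict, other-stat list
player_df, rank_list, goal_stats_list, player_info, player_stats_list = data_query.query(
    p_id=8471698, season=20202021)
```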
6 |
7 | `plotting_functions.py`
8 |
9 | This script contains the custom Matplotlib functions used to generate the plots in the report.
10 |
--------------------------------------------------------------------------------
/report_generation/data_query.py:
--------------------------------------------------------------------------------
1 | import sys
2 | assert sys.version_info >= (3, 5) # make sure we have Python 3.5+
3 |
4 | from pyspark.sql import SparkSession, functions, types #type:ignore
5 |
6 | # add more functions as necessary
7 |
8 | def query(p_id, season):
9 |
10 | print("Player ID: " + str(p_id))
11 | print("Season: " + str(season))
12 |
13 | #start spark session
14 | spark = SparkSession.builder.appName('example code').getOrCreate()
15 | assert spark.version >= '3.0' # make sure we have Spark 3.0+
16 | spark.sparkContext.setLogLevel('WARN')
17 | sc = spark.sparkContext
18 |
19 | #player events
20 | df = spark.read.parquet("./raw_data/parquet/livefeed_p2")
21 | player_df = (df.where(
22 | (df["p1_id"] == p_id) &
23 | (df["p1_type"].isin(["Shooter","Scorer"])) &
24 | (df["season"] == season))
25 | .select("x_coordinate","y_coordinate","event","p1_id","p1_name","period","periodTime"))
26 |
27 | #player rankings
28 | rank_df = spark.read.parquet("./raw_data/parquet/regularSeasonStatRankings_p2")
29 | rank_df = rank_df.where((rank_df["p_id"] == p_id) & (rank_df["season"] == season))
30 | rank_list = [[col,rank_df.take(1)[0][col]] for col in rank_df.columns if col not in ["p_id", "season"]]
31 |
32 | #player goal stats
33 | goal_stats_df = spark.read.parquet("./raw_data/parquet/goalsByGameSituationStats_p2")
34 | goal_stats_df = goal_stats_df.where((goal_stats_df["p_id"] == p_id) & (goal_stats_df["season"] == season))
35 | goal_stats_list = [[col,goal_stats_df.take(1)[0][col]] for col in goal_stats_df.columns if col not in ["p_id", "season"]]
36 |
37 | #player other stats
38 | p_stats_df = spark.read.parquet("./raw_data/parquet/statsSingleSeason_p2")
39 | p_stats_df = p_stats_df.where((p_stats_df["p_id"] == p_id) & (p_stats_df["season"] == season))
40 |
41 | stats = ['assists', 'goals', 'games', 'hits', 'powerPlayPoints',
42 | 'penaltyMinutes', 'faceOffPct', 'blocked', 'plusMinus',
43 | 'points', 'shifts', 'timeOnIcePerGame', 'evenTimeOnIcePerGame',
44 | 'shortHandedTimeOnIcePerGame', 'powerPlayTimeOnIcePerGame']
45 |
46 | player_stats_list = [[col,p_stats_df.take(1)[0][col]] for col in p_stats_df.columns if col in stats]
47 |
48 | #player information
49 | p_info_df = spark.read.parquet("./raw_data/parquet/yearByYear_p2")
50 | p_info_df = (p_info_df.where((p_info_df["p_id"] == p_id) &
51 | (p_info_df["season"] == season))
52 | .orderBy(p_info_df["team_num_this_season"], ascending=False))
53 | player_info = {col:p_info_df.take(1)[0][col] for col in p_info_df.columns}
54 |
55 | return player_df.toPandas(), rank_list, goal_stats_list, player_info, player_stats_list
56 |
57 | # if __name__ == 'query_data':
58 | # spark = SparkSession.builder.appName('example code').getOrCreate()
59 | # assert spark.version >= '3.0' # make sure we have Spark 3.0+
60 | # spark.sparkContext.setLogLevel('WARN')
61 | # sc = spark.sparkContext
62 | # player_df, rank_list = main()
--------------------------------------------------------------------------------
/report_generation/plotting_functions.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import matplotlib.pyplot as plt
3 | import sys
4 | import os
5 | from PIL import Image
6 | import warnings
7 | import requests
8 | import report_generation.data_query
9 |
10 | warnings.filterwarnings("ignore", category=DeprecationWarning)
11 | plt.rcParams.update({'font.size': 22})
12 |
13 | def shot_scatter_plot(df, rink_image_fname, event, legend_labels, colors, out_fname):
14 | """
15 | Given a list of parameters, creates shot plot and saves
16 | image to temporary file before being added to report.
17 | """
18 | #read data
19 | df.loc[df['x_coordinate'] >= 0, 'y_coordinate',] = -1*df["y_coordinate"]
20 | df.loc[df['x_coordinate'] >= 0, 'x_coordinate',] = -1*df["x_coordinate"]
21 |
22 | rink_img = plt.imread(rink_image_fname)
23 |
24 | #plot data
25 | plt.figure(figsize=(10,10))
26 | plt.scatter(df.loc[df['event'] == event]["x_coordinate"], df.loc[df['event'] == event]["y_coordinate"], c=colors[0], s=100, zorder=3)
27 | plt.scatter(df.loc[df['event'] != event]["x_coordinate"], df.loc[df['event'] != event]["y_coordinate"], c=colors[1], s=100, zorder=1)
28 | plt.imshow(rink_img, cmap="gray", extent=[-100, 100, -42.5, 42.5])
29 | plt.xlim(left=-100, right=0)
30 | plt.ylim(bottom=-42.5, top=42.5)
31 | plt.legend(legend_labels, prop={'size': 22})
32 | plt.axis('off')
33 |
34 | #need to add os call that checks if the file exists, and create DIR if not
35 |     plt.savefig('./tmp/{}.png'.format(out_fname), dpi=300, bbox_inches='tight')
36 | pass
37 |
38 | def shot_pie_plot(df, event, legend_labels, colors, out_fname):
39 |
40 | #preprocess data
41 | if event == "Goal":
42 | goal_pct = round(len(df.loc[df['event'] == event]["x_coordinate"])/len(df), 3)*100
43 | sa = 180
44 | else:
45 | goal_pct = round(len(df.loc[df['event'] != event]["x_coordinate"])/len(df), 3)*100
46 | sa = 270
47 |
48 | #pie plot figure
49 | sizes = [goal_pct, 100-goal_pct]
50 | explodes = [0.25, 0]
51 | plt.figure(figsize=(10,10))
52 | patches, texts, _ = plt.pie(sizes, explode=explodes, autopct='%1.1f%%',shadow=True, startangle=sa, colors=colors)
53 | plt.legend(patches, legend_labels, loc="best", prop={'size': 22})
54 | plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
55 |
56 | if event == "Goal":
57 | plt.title("Shot Scored?")
58 | else:
59 | plt.title("Shot on Net?")
60 |
61 | #save figure
62 |     plt.savefig('./tmp/{}.png'.format(out_fname), dpi=300, bbox_inches='tight')
63 | pass
64 |
65 | def by_period_bar_plot(df, event, color, out_fname):
66 | """
67 | Given a dataframe, returns a matplotlib bar plot of
68 | the number of goals scored each period
69 | """
70 | if event != "Goals" and event != "Shots":
71 | sys.exit(" ---------- Invalid Event: {} ---------- ".format(event))
72 |
73 | #processing data
74 | if event == "Goals":
75 | goal_dict = dict(df.loc[(df["event"] == "Goal") & (df["period"].isin([1,2,3]))]["period"].value_counts().sort_index())
76 | else:
77 | goal_dict = dict(df.loc[df["period"].isin([1,2,3])]["period"].value_counts().sort_index())
78 |
79 | #creating figure
80 | plt.figure(figsize=(10,5))
81 | plt.bar(goal_dict.keys(), goal_dict.values(), color = color, width = 0.4, tick_label=[1,2,3], zorder=3)
82 |
83 | #remove ticks and borders
84 | plt.tick_params(bottom=False, left=False)
85 | for i, spine in enumerate(plt.gca().spines):
86 | if i != 2:
87 | plt.gca().spines[spine].set_visible(False)
88 |
89 | #labels / grid
90 | plt.gca().yaxis.grid(zorder=0)
91 | plt.xlabel("Period")
92 | plt.ylabel(event)
93 | plt.title(event + " by Period")
94 | plt.xticks(fontsize=14)
95 | plt.yticks(fontsize=14)
96 |
97 | #save figure
98 | plt.savefig('./tmp/{}.png'.format(out_fname), dpi=300, bbox_inches='tight')
99 | pass
100 |
101 | def rankings_hbar_plot(data2, out_fname):
102 | """
103 | Given player statistics rankings for the season,
104 | creates a horizontal bar plot.
105 |
106 | - Need to modify function to take a data format other than nested array
107 | """
108 |
109 | def sort_rankings(data):
110 | """
111 | Given list of rankings, returns sorted array
112 | """
113 | l = []
114 | res = []
115 | for i, val in enumerate(data):
116 | l.append([i, int(val[1][:-2])])
117 | l = sorted(l, key = lambda x: x[1], reverse=True)
118 | for val in l:
119 | res.append([data2[val[0]][0][4:], data2[val[0]][1]])
120 | return res[::-1]
121 |
122 | data2 = sort_rankings(data2)
123 | data = {"Stat": [x[0] for x in data2], "Rank": [x[1] for x in data2]}
124 |
125 | df = pd.DataFrame(data, index = data["Stat"])
126 | fig, ax = plt.subplots(figsize=(5,18))
127 |
128 | #range - #1f77b4 --> #aec7e8
129 | colors = ['#297db8','#3382bb','#3e88bf','#488ec3',
130 | '#5294c7','#5c99ca','#679fce','#71a5d2','#7baad5',
131 | '#85b0d9','#8fb6dd','#9abce1','#a4c1e4']
132 |
133 | p1 = ax.barh(data["Stat"], data["Rank"], color = colors)
134 | ax.set_title('Regular Season Rankings\n', loc='right')
135 | ax.margins(x=0.1, y=0)
136 | ax.spines['right'].set_visible(False)
137 | ax.spines['top'].set_visible(False)
138 | ax.spines['bottom'].set_visible(False)
139 | ax.set_xticks([])
140 | ax.set_xticklabels([])
141 | ax.invert_yaxis()
142 |
143 | for rect, label in zip(ax.patches, [x[1] for x in data2]):
144 | height = rect.get_y() + (rect.get_height() / 2) + 0.15
145 | width = rect.get_width() + rect.get_x() + 1
146 | ax.text(
147 | width, height, label, ha="left", va="bottom"
148 | )
149 |
150 | plt.savefig('./tmp/{}.png'.format(out_fname), dpi=300, bbox_inches='tight')
151 |
152 | def convert_pngs_to_jpegs(fpath = "./tmp"):
153 | """
154 | Given a directory, converts all .png images
155 | to JPEG's
156 | """
157 | for img in os.listdir(fpath):
158 | if img[-4:] == ".png":
159 | Image.open('{}/{}'.format(fpath, img)).convert('RGB').save('{}/{}.jpg'.format(fpath, img[:-4]), 'JPEG', quality=95)
160 | os.remove('{}/{}'.format(fpath, img))
161 | pass
162 |
163 | def get_player_image(player_id, fpath = "./tmp"):
164 | """
165 | Downloads player image for title page
166 | of the report
167 | """
168 | url = 'https://cms.nhl.bamgrid.com/images/headshots/current/168x168/{}@2x.jpg'.format(player_id)
169 | r = requests.get(url, stream=True)
170 | if r.ok:
171 | Image.open(r.raw).save(fpath + "/player.jpg", 'JPEG', quality=95)
172 | pass
173 |
174 | def check_tmp_directory():
175 | """
176 | Checks if the temporary directory has already been created
177 | """
178 | if os.path.isdir("./tmp") and len(os.listdir("./tmp")) != 0:
179 | sys.exit(" ERROR:\n - Delete \"./tmp\" contents to continue -")
180 | else:
181 | if not os.path.isdir("./tmp"):
182 | os.mkdir("./tmp")
183 | pass
184 |
185 | def generate_all_plots(p_id, season):
186 |
187 | check_tmp_directory()
188 | print(" --- Querying Data --- ")
189 |
190 | player_df, rank_list, goal_stats_list, player_info, player_stats_list = report_generation.data_query.query(p_id, season)
191 | # add check to see if player not found in that season
192 | print(" --- Generating Plots --- ")
193 |
194 |     rink_im = "./imgs/simple_rink_grey.jpg"
195 |
196 | goal_colors = ["#88B4AA", "#e0ddbd"]
197 | on_net_colors = ["#e0ddbd", "#77A6C0"]
198 |
199 | #scatter plot rink imgs
200 | shot_scatter_plot(player_df, rink_im, event="Goal", legend_labels=["Goal", "No Goal"], colors = goal_colors, out_fname="rink_image1")
201 | shot_scatter_plot(player_df, rink_im, event="Missed Shot", legend_labels=["Missed Net", "On Net"], colors = on_net_colors[::-1], out_fname="rink_image2")
202 |
203 | #pie plot imgs
204 | shot_pie_plot(player_df, event="Goal", legend_labels=["Goal", "No Goal"], colors = goal_colors, out_fname="pie_plot1")
205 | shot_pie_plot(player_df, event="Missed Shot", legend_labels=["On Net", "Missed Net"], colors = on_net_colors, out_fname="pie_plot2")
206 |
207 | #rank plot
208 | rankings_hbar_plot(rank_list, out_fname = "rank_hbar_plot1")
209 |
210 | #getting player + team image
211 | get_player_image(player_id = p_id)
212 |
213 | #converting formats to jpegs
214 | convert_pngs_to_jpegs(fpath = "./tmp")
215 |
216 | return goal_stats_list, player_info, player_stats_list
217 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | appnope==0.1.2
2 | backcall==0.2.0
3 | certifi==2021.10.8
4 | charset-normalizer==2.0.9
5 | cycler==0.11.0
6 | decorator==5.1.0
7 | fonttools==4.28.5
8 | fpdf==1.7.2
9 | gevent==21.12.0
10 | greenlet==1.1.2
11 | grequests==0.6.0
12 | idna==3.3
13 | ipython==7.31.1
14 | jedi==0.18.1
15 | kiwisolver==1.3.2
16 | matplotlib==3.5.1
17 | matplotlib-inline==0.1.3
18 | numpy==1.21.4
19 | packaging==21.3
20 | pandas==1.3.5
21 | parso==0.8.3
22 | pexpect==4.8.0
23 | pickleshare==0.7.5
24 | Pillow==9.0.1
25 | prompt-toolkit==3.0.24
26 | ptyprocess==0.7.0
27 | py4j==0.10.9.2
28 | Pygments==2.10.0
29 | pyparsing==3.0.6
30 | pyspark==3.2.0
31 | python-dateutil==2.8.2
32 | pytz==2021.3
33 | requests==2.26.0
34 | six==1.16.0
35 | tqdm==4.62.3
36 | traitlets==5.1.1
37 | urllib3==1.26.7
38 | wcwidth==0.2.5
39 | zope.event==4.5.0
40 | zope.interface==5.4.0
41 |
--------------------------------------------------------------------------------