├── LICENSE ├── README.md ├── data_gathering ├── README.md ├── api_functions.py ├── constants.py ├── json_to_parquet.py └── nhl_api_examples.py ├── generate_pdf.py ├── imgs ├── README.md ├── readme_imgs │ ├── pie_plot_sample.png │ ├── rank_hbar_plot_sample.png │ ├── report_sample1.jpg │ ├── report_sample2.jpg │ ├── report_sample3.jpg │ └── rink_image_sample.png ├── simple_rink_grey.jpg └── team_logos │ ├── 1.jpg │ ├── 10.jpg │ ├── 12.jpg │ ├── 13.jpg │ ├── 14.jpg │ ├── 15.jpg │ ├── 16.jpg │ ├── 17.jpg │ ├── 18.jpg │ ├── 19.jpg │ ├── 2.jpg │ ├── 20.jpg │ ├── 21.jpg │ ├── 22.jpg │ ├── 23.jpg │ ├── 24.jpg │ ├── 25.jpg │ ├── 26.jpg │ ├── 28.jpg │ ├── 29.jpg │ ├── 3.jpg │ ├── 30.jpg │ ├── 4.jpg │ ├── 5.jpg │ ├── 52.jpg │ ├── 53.jpg │ ├── 54.jpg │ ├── 55.jpg │ ├── 6.jpg │ ├── 7.jpg │ ├── 8.jpg │ └── 9.jpg ├── report.pdf ├── report_generation ├── README.md ├── data_query.py └── plotting_functions.py └── requirements.txt /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Brendan Artley 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Analytics Report Generator 2 | 3 | ## 11/17/2023: NHL stats API updated, repo no longer maintained. 4 | 5 | ## Overview / Purpose 6 | 7 | The generated report is meant to provide coaches and players with a snapshot of their overall performance for a given season. The report can also be used by opposing teams to get a sense of where specific players are threatening in the offensive zone. This can be useful in guiding player development or preparing players for likely matchups in upcoming games. 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
*(Sample report pages 1, 2, and 3: see `imgs/readme_imgs/report_sample1.jpg` through `report_sample3.jpg`.)*
16 | 17 | ## API Requests (Grequests) 18 | 19 | The first part of the project involved querying the NHL API to get all the games, players, events during games, rankings, general statistics, and more. Given that the NHL_API server does not have any API limiting, we can leverage the power of asynchronous requests and get information much faster. 20 | 21 | See more about the NHL API here --> [NHL Api Docs](https://gitlab.com/dword4/nhlapi/-/blob/master/stats-api.md#configurations) 22 | 23 | Asynchronous requests send requests in parallel to reduce the time spent waiting for data to be sent from the server. That being said, doing large-scale requests in parallel can create server lag, so I performed these requests in batches of 100 to reduce server strain. 24 | 25 | For each batch, the API returns data in a JSON format and I used a context manager to write this to disk. 26 | 27 | ## JSON to Parquet (Pyspark) 28 | 29 | Although JSON will work for small datasets, I wanted to convert the JSON data to Parquet to consume less space, limit IO operations, and increase scalability. The benefits of this data format conversion become more evident as the size of the dataset grows. 30 | 31 | ## Genenerating Plots (Matplotlib, Pandas) 32 | 33 | Currently, there are 5 plots generated with Matplotlib on the reports. First of all, using Pypark we filter our dataset based on the player_id and the season specified, and then convert that data to a pandas data frame (which can be read by Matplotlib). 34 | 35 | The first two plots are pie plots that show the players shot accuracy and the percentage of shots that result in a goal. The following is an example of the shot/goal percentage pie plot. 36 | 37 | 38 | 39 | The next two scatter plots show the location of the player shots on the ice. In one plot, shots are marked as either a goal or no goal, and in the other plot, shots are marked as on-net or as missed-net. This is indicated by the color of the markers and the plot legend. Here is an example of the player shot accuracy scatter plot. 40 | 41 | 42 | 43 | The initial rink image was in color and contained unnecessary markings that needed removing. Given I am on a student budget, photoshop was out of the question, but there is a great alternative called Photopea. This in-browser, free tool works similarly to photoshop and was perfect for the small edits needed in this image. 44 | 45 | Check out PhotoPea here --> https://www.photopea.com/ 46 | 47 | The final plot that was generated with Matplotlib was the horizontal bar plot that indicates the season ranking for the player. The color scheme and style of the plot are motivated by Tableau-style bar plots. When selecting the optimal colors for the report I used Colormind, and also Dopelycolor's color blender tool to create gradients. 48 | 49 | Here is an example of the bar plot that is included in the report 50 | 51 | 52 | 53 | Check out Colormind here --> http://colormind.io/ 54 | 55 | Check out dopelyColors here --> https://colors.dopely.top/color-blender/ 56 | 57 | ## PDF Generation (FPDF) 58 | 59 | "PyFPDF is a library for PDF document generation under Python, ported from PHP" 60 | 61 | This library was very interesting to work with. Although there is some documentation, it was not always complete and required experimentation to achieve the desired result. This package is very customizable and is well suited to automate professional report generation. 
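As a rough sketch, the basic pattern (mirroring `generate_pdf.py`) is to subclass `FPDF`, override `header`/`footer`, and then place cells and images by coordinate; the file paths and text below are placeholders:

```python
from fpdf import FPDF

class ReportPDF(FPDF):
    def header(self):
        # drawn automatically at the top of every page
        self.set_font('Arial', 'B', 15)
        self.cell(0, 10, 'Annual Player Report', border=0, ln=1, align='C')

    def footer(self):
        # 1.5 cm from the bottom of every page, with the page number
        self.set_y(-15)
        self.set_font('Arial', 'I', 8)
        self.cell(0, 10, 'Page ' + str(self.page_no()), 0, 0, 'R')

pdf = ReportPDF()
pdf.add_page()
pdf.set_font('Times', '', 12)
pdf.cell(0, 10, 'Some table cell text', border=1, ln=1, align='C')
pdf.image('./tmp/pie_plot1.jpg', x=120, y=20, w=75)  # components are placed by x/y (in mm)
pdf.output('report.pdf', 'F')
```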
62 | 63 | I used this package in this project to create a title page, footer, header, page count, and to insert the Matplotlib plots. Once all the information was added to the report, the report is saved in a PDF format. 64 | 65 | See more about the FPDF Library here --> [FPDF Documentation](https://pyfpdf.readthedocs.io/en/latest/index.html) 66 | 67 | Tips / Notes 68 | 69 | - Initially, it was taking ~7mins to generate each report using .png images, but this decreased to ~10 seconds when using a jpg/jpeg format. Without images on the report, the PDF generation is quite fast and runs in <1sec 70 | 71 | - It is important to keep track of the page dimensions, because components are mostly added by x and y coordinates 72 | 73 | ## Improvements / Additions 74 | 75 | - There is not a great way of looking up player_ids without going to NHL.com and manually searching the player, then selecting the ID from the URL. This could be solved with some sort of bash auto-complete search package? Not sure if there are any tools for this. 76 | 77 | - I have not put any significant player_id checks to ensure that the id entered is correct. Currently, the code will throw an error because the data_query does not produce enough information to generate all the plots. The same problem occurs if a goalie ID is entered or if a player did not play in the specified season. 78 | 79 | - The data is currently stored locally as parquet files, but in production, it would more likely be in a Database / Cloud storage. Could be useful to store this data in the Cloud. 80 | 81 | ## Running the Code 82 | 83 | To run the code in the project and start generating reports, follow the steps below. 84 | 85 | 1. Install requirements with pip 86 | 87 | `pip install -r requirements.txt` 88 | 89 | 2. Make API query to build the dataset 90 | 91 | `python data_gathering/api_functions.py` 92 | 93 | Note: Try and do this step outside of game times (when the server has low traffic) as this will take roughly 15-20 mins and will be taxing if other users are using the API 94 | 95 | 3. Convert JSON to Parquet 96 | 97 | `python data_gathering/json_to_parquet.py` 98 | 99 | 4. Generate reports based on ID and Season 100 | 101 | `python generate_pdf.py ` 102 | 103 | `python generate_pdf.py 8471698 20202021` 104 | 105 | ID Sample's 106 | - Hoglander - 8481535 107 | - Oshie - 8471698 108 | - Marchand - 8473419 109 | 110 | Season Samples 111 | - 20202021 112 | - 20132014 113 | - 20162017 114 | -------------------------------------------------------------------------------- /data_gathering/README.md: -------------------------------------------------------------------------------- 1 | # Data Gathering 2 | 3 | `api_functions.py` 4 | 5 | This is the meat and potatoes of the data gathering process. This script contains all the API requests and JSON filtering used when sending requests to the API. 6 | 7 | `constants.py` 8 | 9 | Contains the constants that always prepend the endpoint for each request. 10 | 11 | `json_to_parquet.py` 12 | 13 | Given a directory name, converts all JSON files in that directory to a parquet format using Pyspark 14 | 15 | `nhl_api_examples.py` 16 | 17 | Provides a few simple example of how to make a request to the NHL API. 
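For instance, a single request using the team endpoint from `nhl_api_examples.py` (team ID 23 is the Canucks) looks roughly like this; note that, per the main README, the legacy stats API is no longer maintained as of late 2023:

```python
import requests

# Start point for every request, plus one endpoint (team ID 23 = Vancouver Canucks)
url = 'https://statsapi.web.nhl.com/'
endpoint = 'api/v1/teams/23'

r = requests.get(url + endpoint)
if r.ok:
    data = r.json()  # response shape is described in the NHL API docs linked below
    print(data['teams'][0]['name'])
```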
18 | 19 | See more comprehensive endpoint examples here --> [NHL Api Docs](https://gitlab.com/dword4/nhlapi/-/blob/master/stats-api.md#configurations) 20 | 21 | 22 | -------------------------------------------------------------------------------- /data_gathering/api_functions.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import grequests 3 | import constants 4 | import json 5 | from tqdm import tqdm 6 | import re 7 | import os 8 | 9 | def get_game_ids(just_regular_season, filter_by_team, season_start, season_end, teamId=23): 10 | """ 11 | inputs example ..(just_regular_season=True, filter_by_team=False, season_start=20112012, season_end=20202021, teamId=23) 12 | 13 | Given starting season, ending season, and regular season boolean, team boolean, team id etc.. 14 | Returns all the game_ids for specified the season range as an array/list. 15 | 16 | - season_start and season_end are inclusive 17 | """ 18 | #log 19 | print("\n -- Getting Game ID's -- \n") 20 | 21 | #variables 22 | id_array = [] 23 | url = constants.NHL_STATS_API 24 | endpoint = "api/v1/schedule/?season=" 25 | 26 | #urls 27 | urls = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)] 28 | if just_regular_season: 29 | urls = [val + "&gameType=R" for val in urls] 30 | if filter_by_team: 31 | urls = [val + "&teamId={}".format(teamId) for val in urls] 32 | 33 | #making requests - 10x faster than simple requests 34 | reqs = (grequests.get(u) for u in tqdm(urls, desc="API Requests")) 35 | responses = grequests.map(reqs) 36 | 37 | #Fetch ID's from json 38 | for r in tqdm(responses, desc="Collecting List"): 39 | if r.ok: 40 | d = r.json() 41 | for i in range(len(d["dates"])): 42 | for j in range(len(d["dates"][i]["games"])): 43 | id_array.append(d["dates"][i]["games"][j]["gamePk"]) 44 | 45 | return id_array 46 | 47 | def get_all_player_ids_in_range(season_start, season_end): 48 | """ 49 | Given a season, returns a list of all the players 50 | on an NHL roster that year. 51 | """ 52 | #log 53 | print("\n -- Getting Players in Season Range -- \n") 54 | 55 | #constants 56 | url = constants.NHL_STATS_API 57 | endpoint = "api/v1/teams?expand=team.roster&season=" 58 | urls = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)] 59 | players = set() 60 | 61 | #Doing API calls in batches 62 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"): 63 | batch = urls[b:b+100] 64 | 65 | #making requests 66 | reqs = (grequests.get(u) for u in batch) 67 | responses = grequests.map(reqs) 68 | 69 | #writing data to file 70 | for r in responses: 71 | if r.ok: 72 | d = r.json() 73 | for i in range(len(d['teams'])): 74 | team_data = d['teams'][i] 75 | for player in team_data['roster']['roster']: 76 | if "person" in player and "id" in player["person"]: 77 | players.add(player["person"]["id"]) 78 | 79 | return list(players) 80 | 81 | def simple_game_stats(fname, season_start, season_end, just_regular_season=True): 82 | """ 83 | Returns scores, teams, shots for every game from 2010-2020. 
84 | Output's json format into the raw folder 85 | """ 86 | #log 87 | print("\n -- Getting Simple Game Stats -- \n") 88 | 89 | #URL format 90 | fname = "linescore_" + fname 91 | url = constants.NHL_STATS_API 92 | endpoint = "api/v1/schedule/?expand=schedule.linescore&season=" 93 | 94 | #Creating each individual URL 95 | urls = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)] 96 | if just_regular_season: 97 | urls = [x + "&gameType=R" for x in urls] 98 | 99 | #making requests 100 | reqs = (grequests.get(u) for u in urls) 101 | responses = grequests.map(reqs) 102 | 103 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f: 104 | 105 | rj = {"gamePk":"", 106 | "season":"", 107 | "away_name":"", 108 | "away_id":"", 109 | "away_goals":"", 110 | "away_shotsOnGoal":"", 111 | "home_name":"", 112 | "home_id":"", 113 | "home_goals":"", 114 | "home_shotsOnGoal":"" 115 | } 116 | 117 | stats_tracked = ["goals", "shotsOnGoal"] 118 | team_info = ["id", "name"] 119 | 120 | for r in tqdm(responses, desc="Writing Data to Disk"): 121 | if r is not None and r.ok and r.json() is not None: 122 | d = r.json() 123 | 124 | #looping through every game in specified season 125 | for i in range(len(d["dates"])): 126 | for j in range(len(d["dates"][i]["games"])): 127 | game = d["dates"][i]["games"][j] 128 | if game["status"]["abstractGameState"] == 'Final': 129 | if "gamePk" and "season" in game: 130 | rj["gamePk"] = game["gamePk"] 131 | rj["season"] = game["season"] 132 | 133 | for event in stats_tracked: 134 | if event in game["linescore"]["teams"]["away"] and game["linescore"]["teams"]["home"]: 135 | rj["away_{}".format(event)] = game["linescore"]["teams"]["away"][event] 136 | rj["home_{}".format(event)] = game["linescore"]["teams"]["home"][event] 137 | else: 138 | continue 139 | 140 | for item in team_info: 141 | if item in game["linescore"]["teams"]["away"]["team"] and game["linescore"]["teams"]["home"]["team"]: 142 | rj["away_{}".format(item)] = game["linescore"]["teams"]["away"]["team"][item] 143 | rj["home_{}".format(item)] = game["linescore"]["teams"]["home"]["team"][item] 144 | 145 | f.write(json.dumps(rj) + '\n') 146 | 147 | f.seek(0, 2) # seek to end of file 148 | f.seek(f.tell() - 2, 0) # seek to the second last char of file 149 | f.truncate() 150 | pass 151 | 152 | def get_all_game_event_stats(game_id_array, fname): 153 | """ 154 | Given a list of game ID's, returns raw event data for each game 155 | in a format suitable for pyspark. 
156 | """ 157 | #log 158 | print("\n -- Getting All Game Events -- \n") 159 | 160 | #Variables 161 | fname = "livefeed_" + fname 162 | url = constants.NHL_STATS_API 163 | endpoint = "api/v1/game/{}/feed/live" 164 | 165 | #Formatting URL's 166 | urls = [url + endpoint.format(date) for date in game_id_array] 167 | 168 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f: 169 | #Doing API calls in batches 170 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"): 171 | batch = urls[b:b+100] 172 | 173 | #Making requests - 10x faster than simple requests 174 | reqs = (grequests.get(u) for u in batch) 175 | responses = grequests.map(reqs) 176 | 177 | rj = {} 178 | 179 | for resp in responses: 180 | if resp is not None and resp.ok and resp.json() is not None: 181 | d = resp.json() 182 | 183 | if "gamePk" in d: 184 | rj["gamePk"] = d["gamePk"] 185 | else: 186 | continue 187 | 188 | rj["season"] = d["gameData"]['game']['season'] 189 | 190 | for val in d["liveData"]["plays"]["allPlays"]: 191 | 192 | rj["event"] = val["result"]["event"] 193 | rj["periodTime"] = val["about"]["periodTime"] 194 | rj["dateTime"] = val["about"]["dateTime"] 195 | rj["period"] = val["about"]["period"] 196 | 197 | #on ice-coordinates 198 | if len(val["coordinates"]) == 2: 199 | rj["x_coordinate"] = val["coordinates"]["x"] 200 | rj["y_coordinate"] = val["coordinates"]["y"] 201 | else: 202 | rj["x_coordinate"] = None 203 | rj["y_coordinate"] = None 204 | 205 | #players involved, can be up to 4 206 | if "players" in val: 207 | num_players = len(val["players"]) 208 | if num_players > 4: 209 | print(" ---------- Event > 4 Players?? ---------- ") 210 | for i in range(4): 211 | if i < num_players: 212 | rj["p{}_id".format(i+1)] = val["players"][i]["player"]["id"] 213 | rj["p{}_type".format(i+1)] = val["players"][i]["playerType"] 214 | rj["p{}_name".format(i+1)] = val["players"][i]["player"]["fullName"] 215 | else: 216 | rj["p{}_id".format(i+1)] = None 217 | rj["p{}_type".format(i+1)] = None 218 | rj["p{}_name".format(i+1)] = None 219 | 220 | else: 221 | for i in range(4): 222 | rj["p{}_id".format(i+1)] = None 223 | rj["p{}_type".format(i+1)] = None 224 | rj["p{}_name".format(i+1)] = None 225 | 226 | #team-id if relevant 227 | if "team" in val: 228 | rj["team_id"] = val["team"]["id"] 229 | else: 230 | rj["team_id"] = None 231 | 232 | #write output to JSON 233 | f.write(json.dumps(rj) + '\n') 234 | else: 235 | print("{}".format(resp.status_code)) 236 | 237 | #remove the '/n' at the end of written JSON file 238 | f.seek(0, 2) # seek to end of file 239 | f.seek(f.tell() - 2, 0) # seek to the second last char of file 240 | f.truncate() 241 | 242 | def get_players_season_goal_stats(player_id_array, season_start, season_end, fname, just_regular_season=True): 243 | """ 244 | Given an array of player_id's and a season, 245 | returns the goal statistics for each player 246 | in JSON format. 
247 | """ 248 | #log 249 | print("\n -- Getting Player Goal Stats -- \n") 250 | 251 | #constants 252 | fname = "goalsByGameSituationStats_" + fname 253 | url = constants.NHL_STATS_API 254 | endpoint = "api/v1/people/{}/stats?stats=goalsByGameSituation&season=" 255 | urls = [] 256 | 257 | url_format = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)] 258 | if just_regular_season: 259 | url_format = [x + "&gameType=R" for x in url_format] 260 | 261 | goal_stats_tracked = [ 262 | 'goalsInFirstPeriod', 263 | 'goalsInSecondPeriod', 264 | 'goalsInThirdPeriod', 265 | 'gameWinningGoals', 266 | 'emptyNetGoals', 267 | 'shootOutGoals', 268 | 'shootOutShots', 269 | 'goalsTrailingByOne', 270 | 'goalsTrailingByThreePlus', 271 | 'goalsWhenTied', 272 | 'goalsLeadingByOne', 273 | 'goalsLeadingByTwo', 274 | 'goalsLeadingByThreePlus', 275 | 'penaltyGoals', 276 | 'penaltyShots', 277 | ] 278 | 279 | #Creating each individual URL 280 | for p_id in player_id_array: 281 | for url in url_format: 282 | urls.append(url.format(p_id)) 283 | 284 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f: 285 | rj = {} 286 | 287 | #Doing API calls in batches 288 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"): 289 | batch = urls[b:b+100] 290 | 291 | #making requests 292 | reqs = (grequests.get(u) for u in batch) 293 | responses = grequests.map(reqs) 294 | 295 | for r in responses: 296 | if r is not None and r.ok and r.json() is not None: 297 | d = r.json() 298 | rj['p_id'] = int(re.findall(r'[//][0-9]+', r.url)[0][1:]) 299 | rj['season'] = re.findall(r'\d+', r.url)[-1] 300 | if len(d["stats"][0]["splits"]) != 0: 301 | for ev in goal_stats_tracked: 302 | if ev in d["stats"][0]["splits"][0]["stat"]: 303 | rj[ev] = d["stats"][0]["splits"][0]["stat"][ev] 304 | else: 305 | rj[ev] = 0 306 | f.write(json.dumps(rj) + '\n') 307 | 308 | #remove the '/n' at the end of written JSON file 309 | f.seek(0, 2) # seek to end of file 310 | f.seek(f.tell() - 2, 0) # seek to the second last char of file 311 | f.truncate() 312 | 313 | def get_players_season_general_stats(player_id_array, season_start, season_end, fname, just_regular_season=True): 314 | """ 315 | Given an array of player_id's and a season, 316 | returns the general statistics for each player 317 | in JSON format. 
318 | """ 319 | #log 320 | print("\n -- Getting Player General Stats -- \n") 321 | 322 | 323 | #constants 324 | fname = "statsSingleSeason_" + fname 325 | url = constants.NHL_STATS_API 326 | endpoint = "api/v1/people/{}/stats?stats=statsSingleSeason&season=" 327 | urls = [] 328 | 329 | url_format = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)] 330 | if just_regular_season: 331 | url_format = [x + "&gameType=R" for x in url_format] 332 | 333 | stats_tracked = [ 334 | 'timeOnIce', 335 | 'assists', 336 | 'goals', 337 | 'pim', 338 | 'shots', 339 | 'games', 340 | 'hits', 341 | 'powerPlayGoals', 342 | 'powerPlayPoints', 343 | 'powerPlayTimeOnIce', 344 | 'evenTimeOnIce', 345 | 'penaltyMinutes' , 346 | 'faceOffPct', 347 | 'shotPct', 348 | 'gameWinningGoals', 349 | 'overTimeGoals', 350 | 'shortHandedGoals', 351 | 'shortHandedPoints', 352 | 'shortHandedTimeOnIce', 353 | 'blocked', 354 | 'plusMinus', 355 | 'points', 356 | 'shifts', 357 | 'timeOnIcePerGame', 358 | 'evenTimeOnIcePerGame', 359 | 'shortHandedTimeOnIcePerGame', 360 | 'powerPlayTimeOnIcePerGame', 361 | ] 362 | 363 | #Creating each individual URL 364 | for p_id in player_id_array: 365 | for url in url_format: 366 | urls.append(url.format(p_id)) 367 | 368 | #writing data to file 369 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f: 370 | rj = {} 371 | 372 | #Doing API calls in batches 373 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"): 374 | batch = urls[b:b+100] 375 | 376 | #making requests 377 | reqs = (grequests.get(u) for u in batch) 378 | responses = grequests.map(reqs) 379 | 380 | for r in responses: 381 | if r is not None and r.ok and r.json() is not None: 382 | d = r.json() 383 | rj['p_id'] = int(re.findall(r'[//][0-9]+', r.url)[0][1:]) 384 | if len(d["stats"][0]["splits"]) != 0: 385 | rj["season"] = d["stats"][0]["splits"][0]['season'] 386 | for ev in stats_tracked: 387 | if ev in d["stats"][0]["splits"][0]["stat"]: 388 | rj[ev] = d["stats"][0]["splits"][0]["stat"][ev] 389 | else: 390 | rj[ev] = None 391 | 392 | f.write(json.dumps(rj) + '\n') 393 | 394 | #remove the '/n' at the end of written JSON file 395 | f.seek(0, 2) # seek to end of file 396 | f.seek(f.tell() - 2, 0) # seek to the second last char of file 397 | f.truncate() 398 | 399 | def get_players_season_stat_rankings(player_id_array, season_start, season_end, fname): 400 | """ 401 | Given an array of player_id's and a season range, 402 | returns the rankings across numerous statisitcs 403 | for each player in a JSON format. 
404 | """ 405 | #log 406 | print("\n -- Getting Player Rankings -- \n") 407 | 408 | #constants 409 | fname = "regularSeasonStatRankings_" + fname 410 | url = constants.NHL_STATS_API 411 | endpoint = "api/v1/people/{}/stats?stats=regularSeasonStatRankings&season=" 412 | urls = [] 413 | 414 | url_format = ["{}{}{}{}".format(url, endpoint, i, i+1) for i in range(season_start // 10000, (season_end // 10000)+1)] 415 | 416 | stats_tracked = ['rankPowerPlayGoals', 417 | 'rankBlockedShots', 418 | 'rankAssists', 419 | 'rankShotPct', 420 | 'rankGoals', 421 | 'rankHits', 422 | 'rankPenaltyMinutes', 423 | 'rankShortHandedGoals', 424 | 'rankPlusMinus', 425 | 'rankShots', 426 | 'rankPoints', 427 | 'rankOvertimeGoals', 428 | 'rankGamesPlayed', 429 | ] 430 | 431 | #Creating each individual URL 432 | for p_id in player_id_array: 433 | for url in url_format: 434 | urls.append(url.format(p_id)) 435 | 436 | #writing data to file 437 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f: 438 | rj = {} 439 | 440 | #Doing API calls in batches 441 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"): 442 | batch = urls[b:b+100] 443 | 444 | #making requests 445 | reqs = (grequests.get(u) for u in batch) 446 | responses = grequests.map(reqs) 447 | 448 | for r in responses: 449 | if r is not None and r.ok and r.json() is not None: 450 | d = r.json() 451 | rj['p_id'] = int(re.findall(r'[//][0-9]+', r.url)[0][1:]) 452 | 453 | if len(d["stats"][0]["splits"]) != 0: 454 | rj["season"] = d["stats"][0]["splits"][0]['season'] 455 | for ev in stats_tracked: 456 | if ev in d["stats"][0]["splits"][0]["stat"]: 457 | rj[ev] = d["stats"][0]["splits"][0]["stat"][ev] 458 | else: 459 | rj[ev] = None 460 | 461 | f.write(json.dumps(rj) + '\n') 462 | 463 | #remove the '/n' at the end of written JSON file 464 | f.seek(0, 2) # seek to end of file 465 | f.seek(f.tell() - 2, 0) # seek to the second last char of file 466 | f.truncate() 467 | 468 | def get_players_info_by_season(player_id_array, fname): 469 | """ 470 | Given an array of player_id's and a season range, 471 | returns player info by season for each player in a 472 | JSON format. 
473 | """ 474 | #log 475 | print("\n -- Getting Player Info -- \n") 476 | 477 | #constants 478 | fname = "yearByYear_" + fname 479 | url = constants.NHL_STATS_API 480 | endpoint = "api/v1/people/{}" 481 | endpoint2 = "api/v1/people/{}/stats?stats=yearByYear" 482 | urls = [] 483 | 484 | #Creating each individual URL 485 | for p_id in player_id_array: 486 | urls.append(url + endpoint.format(p_id)) 487 | urls.append(url + endpoint2.format(p_id)) 488 | 489 | #writing data to file 490 | with open("./raw_data/{}.json".format(fname), 'w', encoding='utf-16') as f: 491 | rj = {} 492 | 493 | #Doing API calls in batches 494 | for b in tqdm(range(0, len(urls), 100), desc="Iterating Batches"): 495 | batch = urls[b:b+100] 496 | 497 | #making requests 498 | reqs = (grequests.get(u) for u in batch) 499 | responses = grequests.map(reqs) 500 | 501 | for i in range (0, len(responses), 2): 502 | r = responses[i] 503 | r2 = responses[i+1] 504 | if r is not None and r.ok and r.json() is not None: 505 | d = r.json() 506 | if "people" in d: 507 | rj["p_id"] = d["people"][0]["id"] 508 | rj["fullName"] = d["people"][0]["fullName"] 509 | 510 | if r2 is not None and r2.ok and r2.json() is not None: 511 | d2 = r2.json() 512 | if "stats" in d2 and "splits" in d2["stats"][0]: 513 | season_counter = 0 514 | last_season = "" 515 | for split in d2["stats"][0]["splits"]: 516 | if split["league"]["name"] == "National Hockey League": 517 | if split["season"] == last_season: 518 | season_counter += 1 519 | else: 520 | season_counter = 0 521 | 522 | rj["team_id"] = split["team"]["id"] 523 | rj["team_name"] = split["team"]["name"] 524 | rj["season"] = split["season"] 525 | rj["team_num_this_season"] = season_counter 526 | 527 | last_season = split["season"] 528 | f.write(json.dumps(rj) + '\n') 529 | 530 | #remove the '/n' at the end of written JSON file 531 | f.seek(0, 2) # seek to end of file 532 | f.seek(f.tell() - 2, 0) # seek to the second last char of file 533 | f.truncate() 534 | 535 | def check_raw_data_directory(): 536 | """ 537 | Checks if the temporary directory has already been created 538 | """ 539 | if os.path.isdir("./raw_data") and len(os.listdir("./raw_data")) != 0: 540 | sys.exit(" ERROR:\n - Raw_data contains files? 
Remove data and store elsewhere before proceeding -") 541 | else: 542 | if not os.path.isdir("./raw_data"): 543 | os.mkdir("./raw_data") 544 | pass 545 | 546 | def main(output): 547 | ss = 20112012 548 | se = 20202021 549 | 550 | #check raw_data directory 551 | check_raw_data_directory() 552 | 553 | #get ids 554 | all_game_ids = get_game_ids(just_regular_season=True, filter_by_team=False, season_start=ss, season_end=se) 555 | all_player_ids = get_all_player_ids_in_range(season_start=ss, season_end=se) 556 | 557 | #game events + stats 558 | simple_game_stats(fname=output, season_start=ss, season_end=se, just_regular_season=True) 559 | get_all_game_event_stats(all_game_ids, fname=output) 560 | 561 | #player_stats 562 | get_players_season_goal_stats(all_player_ids, season_start=ss, season_end=se, fname=output) 563 | get_players_season_general_stats(all_player_ids, season_start=ss, season_end=se, fname=output) 564 | get_players_season_stat_rankings(all_player_ids, season_start=ss, season_end=se, fname = output) 565 | 566 | get_players_info_by_season(all_player_ids, fname = output) 567 | return 568 | 569 | if __name__ == '__main__': 570 | output = sys.argv[1] 571 | main(output) -------------------------------------------------------------------------------- /data_gathering/constants.py: -------------------------------------------------------------------------------- 1 | NHL_STATS_API = 'https://statsapi.web.nhl.com/' 2 | NHL_RECORDS_API = 'https://records.nhl.com/' -------------------------------------------------------------------------------- /data_gathering/json_to_parquet.py: -------------------------------------------------------------------------------- 1 | import sys 2 | assert sys.version_info >= (3, 5) # make sure we have Python 3.5+ 3 | import os 4 | 5 | from pyspark.sql import SparkSession, functions, types #type:ignore 6 | 7 | def main(): 8 | """ 9 | Converting the JSON Data to Parquet 10 | """ 11 | inputs = "./raw_data/" 12 | 13 | for file in os.listdir(inputs): 14 | if file[-5:] == ".json": 15 | df = spark.read.json(inputs + file) 16 | df.write.parquet("./raw_data/parquet/{}".format(file[:-5])) 17 | pass 18 | 19 | if __name__ == '__main__': 20 | spark = SparkSession.builder.appName('example code').getOrCreate() 21 | assert spark.version >= '3.0' # make sure we have Spark 3.0+ 22 | spark.sparkContext.setLogLevel('WARN') 23 | sc = spark.sparkContext 24 | main() -------------------------------------------------------------------------------- /data_gathering/nhl_api_examples.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import requests 3 | 4 | def main(): 5 | ''' 6 | See documentation on NHL API by Drew Hynes. 
7 | 8 | https://gitlab.com/dword4/nhlapi/-/blob/master/stats-api.md 9 | ''' 10 | #Startpoint for every API request 11 | url = 'https://statsapi.web.nhl.com/' 12 | 13 | #team stats example (ID: 23 - Canucks, Franchise ID: 20) 14 | endpoint = "api/v1/teams/23" 15 | 16 | #player stats example (ID: 8474568 - Luke Schenn) 17 | endpoint = "api/v1/people/8474568" 18 | 19 | #game ID example (ID: ) 20 | endpoint = "api/v1/game/2020020018/feed/live" 21 | 22 | r = requests.get(url + endpoint) 23 | return 24 | 25 | if __name__ == '__main__': 26 | main() -------------------------------------------------------------------------------- /generate_pdf.py: -------------------------------------------------------------------------------- 1 | # Import FPDF class 2 | from fpdf import FPDF 3 | import report_generation.plotting_functions 4 | import os 5 | import sys 6 | 7 | def main(p_id, season): 8 | 9 | goal_stats_list, player_info, player_stats_list = report_generation.plotting_functions.generate_all_plots(p_id = p_id, season = season) 10 | 11 | class PDF(FPDF): 12 | 13 | def __init__(self, player_name, season, organization, team_id): 14 | FPDF.__init__(self) #initializes parent class 15 | self.player_name = player_name 16 | self.season = season 17 | self.organization = organization 18 | self.team_id = team_id 19 | 20 | def header(self): 21 | # Logo (name, x, y, w = 0, h = 0) 22 | # w,h = 0 means automatic 23 | self.image('./imgs/team_logos/{}.jpg'.format(self.team_id), 10, 8, 15, 0) 24 | # font (font,bold,size) 25 | self.set_font('Arial', 'B', 15) 26 | # Move to the right 27 | self.cell(80) 28 | # Title (w,h,text,border,ln,align) 29 | if self.page_no()==1: 30 | pass 31 | elif self.page_no()==2: 32 | self.cell(30, 10, '{} - Goals / Shots'.format(self.player_name), 0, 0, 'C') 33 | elif self.page_no()==3: 34 | self.cell(30, 10, '{} - Shots / Rank / Other'.format(self.player_name), 0, 0, 'C') 35 | # Line break 36 | self.ln(20) 37 | 38 | # Page footer 39 | def footer(self): 40 | if self.page_no()!=1: 41 | # Position at 1.5 cm from bottom 42 | self.set_y(-15) 43 | # Arial italic 8 44 | self.set_font('Arial', 'I', 8) 45 | # Page number 46 | self.cell(0, 10, 'Page ' + str(self.page_no()-1), 0, 0, 'R') 47 | 48 | # ---------- Instantiation of Inherited Class ---------- 49 | pdf = PDF(player_info["fullName"], player_info["season"], player_info["team_name"], player_info["team_id"]) 50 | pdf.alias_nb_pages() 51 | 52 | # ---------- First Page ---------- 53 | pdf.add_page() 54 | pdf.set_font('Times', '', 18) 55 | pdf.ln(h = 30) 56 | pdf.cell(w=0, h=10, txt="Annual Player Report", border=0, ln=1, align="C") 57 | pdf.cell(w=0, h=10, txt=pdf.season[:4] + " / " + pdf.season[4:], border=0, ln=1, align="C") 58 | pdf.cell(w=0, h=10, txt=pdf.organization, border=0, ln=1, align="C") 59 | pdf.cell(w=0, h=10, txt=pdf.player_name, border=0, ln=1, align="C") 60 | pdf.image('./tmp/player.jpg', x = 85, y = 110, w = 40, h = 0, type = '', link = '') 61 | 62 | # ---------- Second Page ---------- 63 | 64 | # Since we do not need to draw lines anymore, there is no need to separate 65 | # headers from data matrix. 
66 | 67 | pdf.add_page() 68 | pdf.set_font('Times', '', 12) 69 | 70 | table_cell_height = 9 71 | table_cell_width_col1 = 60 72 | table_cell_width_col2 = 20 73 | 74 | # Here we add more padding by passing 2*th as height 75 | pdf.set_fill_color(189,210,236) #(r,g,b) 76 | pdf.cell(table_cell_width_col1, table_cell_height, "Goal Statistics", border=1, align='C', fill=True) 77 | pdf.cell(table_cell_width_col2, table_cell_height, "Count", border=1, ln=1, align='C', fill=True) 78 | 79 | pdf.set_fill_color(235,241,249) 80 | for row in goal_stats_list: 81 | for i, datum in enumerate(row): 82 | # Enter data in colums 83 | if i == 0: 84 | pdf.cell(table_cell_width_col1, table_cell_height, str(datum), border=1, fill=True) 85 | else: 86 | pdf.cell(table_cell_width_col2, table_cell_height, str(datum), border=1, align='C', fill=True) 87 | 88 | pdf.ln(table_cell_height) 89 | 90 | WIDTH = 210 91 | HEIGHT = 297 92 | pdf.image('./tmp/pie_plot1.jpg', x = 120, y = 20, w = (WIDTH-60)//2, h = 0, type = '', link = '') 93 | pdf.image('./tmp/pie_plot2.jpg', x = 120, y = 95, w = (WIDTH-60)//2, h = 0, type = '', link = '') 94 | pdf.image('./tmp/rink_image1.jpg', x = 50, y = 180, w = 110, h = 0, type = '', link = '') 95 | 96 | # ---------- Third Page ---------- 97 | # Shot plot and other stats 98 | pdf.add_page() 99 | pdf.set_font('Times', '', 12) 100 | pdf.ln(100) 101 | 102 | table_cell_height = 9 103 | table_cell_width_col1 = 60 104 | table_cell_width_col2 = 20 105 | 106 | # Here we add more padding by passing 2*th as height 107 | pdf.set_fill_color(189,210,236) #(r,g,b) 108 | pdf.cell(table_cell_width_col1, table_cell_height, "Other Statistics", border=1, align='C', fill=True) 109 | pdf.cell(table_cell_width_col2, table_cell_height, "Count", border=1, ln=1, align='C', fill=True) 110 | 111 | pdf.set_fill_color(235,241,249) 112 | for row in player_stats_list: 113 | for i, datum in enumerate(row): 114 | # Enter data in colums 115 | if i == 0: 116 | pdf.cell(table_cell_width_col1, table_cell_height, str(datum), border=1, fill=True) 117 | else: 118 | pdf.cell(table_cell_width_col2, table_cell_height, str(datum), border=1, align='C', fill=True) 119 | 120 | pdf.ln(table_cell_height) 121 | 122 | pdf.image('./tmp/rink_image2.jpg', x = 50, y = 30, w = 110, h = 0, type = '', link = '') 123 | pdf.image('./tmp/rank_hbar_plot1.jpg', x = 110, y = 130, w = 76, h = 0, type = '', link = '') 124 | 125 | # Clear tmp directory 126 | for file in os.listdir("./tmp"): 127 | os.remove("./tmp/" + file) 128 | 129 | # ---------- Save PDF to Output ---------- 130 | pdf.output('report.pdf','F') 131 | print(" --- Complete --- ") 132 | 133 | if __name__ == "__main__": 134 | try: 135 | p_id = int(sys.argv[1].strip()) 136 | season = int(sys.argv[2].strip()) 137 | if len(sys.argv[2].strip()) != 8: 138 | sys.exit(" --------- Check input format --------- ") 139 | except: 140 | sys.exit(" --------- Check input format --------- ") 141 | 142 | main(p_id, season) -------------------------------------------------------------------------------- /imgs/README.md: -------------------------------------------------------------------------------- 1 | ## Imgs 2 | 3 | Contains images used in the main README.md file, and also the team_logos named by their respective team id's. 
-------------------------------------------------------------------------------- /imgs/readme_imgs/pie_plot_sample.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/pie_plot_sample.png -------------------------------------------------------------------------------- /imgs/readme_imgs/rank_hbar_plot_sample.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/rank_hbar_plot_sample.png -------------------------------------------------------------------------------- /imgs/readme_imgs/report_sample1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/report_sample1.jpg -------------------------------------------------------------------------------- /imgs/readme_imgs/report_sample2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/report_sample2.jpg -------------------------------------------------------------------------------- /imgs/readme_imgs/report_sample3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/report_sample3.jpg -------------------------------------------------------------------------------- /imgs/readme_imgs/rink_image_sample.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/readme_imgs/rink_image_sample.png -------------------------------------------------------------------------------- /imgs/simple_rink_grey.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/simple_rink_grey.jpg -------------------------------------------------------------------------------- /imgs/team_logos/1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/1.jpg -------------------------------------------------------------------------------- /imgs/team_logos/10.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/10.jpg -------------------------------------------------------------------------------- /imgs/team_logos/12.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/12.jpg -------------------------------------------------------------------------------- /imgs/team_logos/13.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/13.jpg -------------------------------------------------------------------------------- /imgs/team_logos/14.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/14.jpg -------------------------------------------------------------------------------- /imgs/team_logos/15.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/15.jpg -------------------------------------------------------------------------------- /imgs/team_logos/16.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/16.jpg -------------------------------------------------------------------------------- /imgs/team_logos/17.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/17.jpg -------------------------------------------------------------------------------- /imgs/team_logos/18.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/18.jpg -------------------------------------------------------------------------------- /imgs/team_logos/19.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/19.jpg -------------------------------------------------------------------------------- /imgs/team_logos/2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/2.jpg -------------------------------------------------------------------------------- /imgs/team_logos/20.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/20.jpg -------------------------------------------------------------------------------- /imgs/team_logos/21.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/21.jpg -------------------------------------------------------------------------------- /imgs/team_logos/22.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/22.jpg 
-------------------------------------------------------------------------------- /imgs/team_logos/23.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/23.jpg -------------------------------------------------------------------------------- /imgs/team_logos/24.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/24.jpg -------------------------------------------------------------------------------- /imgs/team_logos/25.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/25.jpg -------------------------------------------------------------------------------- /imgs/team_logos/26.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/26.jpg -------------------------------------------------------------------------------- /imgs/team_logos/28.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/28.jpg -------------------------------------------------------------------------------- /imgs/team_logos/29.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/29.jpg -------------------------------------------------------------------------------- /imgs/team_logos/3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/3.jpg -------------------------------------------------------------------------------- /imgs/team_logos/30.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/30.jpg -------------------------------------------------------------------------------- /imgs/team_logos/4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/4.jpg -------------------------------------------------------------------------------- /imgs/team_logos/5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/5.jpg -------------------------------------------------------------------------------- /imgs/team_logos/52.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/52.jpg -------------------------------------------------------------------------------- /imgs/team_logos/53.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/53.jpg -------------------------------------------------------------------------------- /imgs/team_logos/54.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/54.jpg -------------------------------------------------------------------------------- /imgs/team_logos/55.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/55.jpg -------------------------------------------------------------------------------- /imgs/team_logos/6.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/6.jpg -------------------------------------------------------------------------------- /imgs/team_logos/7.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/7.jpg -------------------------------------------------------------------------------- /imgs/team_logos/8.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/8.jpg -------------------------------------------------------------------------------- /imgs/team_logos/9.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/imgs/team_logos/9.jpg -------------------------------------------------------------------------------- /report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brendanartley/Analytics-Report-Generator/3c69d6f5ac92f0acdf3ff9a70587249f6c16c868/report.pdf -------------------------------------------------------------------------------- /report_generation/README.md: -------------------------------------------------------------------------------- 1 | ## Report Generation 2 | 3 | `data_query.py` 4 | 5 | The data query script contains the pyspark script that starts a spark session, and filters data by the player_id and season. The sript returns a pandas dataframe, a couple short arrays, and a dictionary. These are used to create that plots that are included in the report. 
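A rough usage sketch is shown below; it mirrors the call made in `plotting_functions.generate_all_plots`, assumes the parquet data already exists under `./raw_data/parquet/`, is run from the repo root, and uses the sample ID/season values from the main README:

```python
from report_generation import data_query

# Sample values from the main README: Oshie, 2020-2021 season
player_df, rank_list, goal_stats_list, player_info, player_stats_list = data_query.query(8471698, 20202021)

print(player_df.head())         # pandas DataFrame of shot/goal events with x/y rink coordinates
print(player_info["fullName"])  # dict also holding team_id, team_name and season
```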
6 | 7 | `plotting_functions.py` 8 | 9 | This script contains the custom Matplotlib functions used the generate the plots in the report 10 | -------------------------------------------------------------------------------- /report_generation/data_query.py: -------------------------------------------------------------------------------- 1 | import sys 2 | assert sys.version_info >= (3, 5) # make sure we have Python 3.5+ 3 | 4 | from pyspark.sql import SparkSession, functions, types #type:ignore 5 | 6 | # add more functions as necessary 7 | 8 | def query(p_id, season): 9 | 10 | print("Player ID: " + str(p_id)) 11 | print("Season: " + str(season)) 12 | 13 | #start spark session 14 | spark = SparkSession.builder.appName('example code').getOrCreate() 15 | assert spark.version >= '3.0' # make sure we have Spark 3.0+ 16 | spark.sparkContext.setLogLevel('WARN') 17 | sc = spark.sparkContext 18 | 19 | #player events 20 | df = spark.read.parquet("./raw_data/parquet/livefeed_p2") 21 | player_df = (df.where( 22 | (df["p1_id"] == p_id) & 23 | (df["p1_type"].isin(["Shooter","Scorer"])) & 24 | (df["season"] == season)) 25 | .select("x_coordinate","y_coordinate","event","p1_id","p1_name","period","periodTime")) 26 | 27 | #player rankings 28 | rank_df = spark.read.parquet("./raw_data/parquet/regularSeasonStatRankings_p2") 29 | rank_df = rank_df.where((rank_df["p_id"] == p_id) & (rank_df["season"] == season)) 30 | rank_list = [[col,rank_df.take(1)[0][col]] for col in rank_df.columns if col not in ["p_id", "season"]] 31 | 32 | #player goal stats 33 | goal_stats_df = spark.read.parquet("./raw_data/parquet/goalsByGameSituationStats_p2") 34 | goal_stats_df = goal_stats_df.where((goal_stats_df["p_id"] == p_id) & (goal_stats_df["season"] == season)) 35 | goal_stats_list = [[col,goal_stats_df.take(1)[0][col]] for col in goal_stats_df.columns if col not in ["p_id", "season"]] 36 | 37 | #player other stats 38 | p_stats_df = spark.read.parquet("./raw_data/parquet/statsSingleSeason_p2") 39 | p_stats_df = p_stats_df.where((p_stats_df["p_id"] == p_id) & (p_stats_df["season"] == season)) 40 | 41 | stats = ['assists', 'goals', 'games', 'hits', 'powerPlayPoints', 42 | 'penaltyMinutes', 'faceOffPct', 'blocked', 'plusMinus', 43 | 'points', 'shifts', 'timeOnIcePerGame', 'evenTimeOnIcePerGame', 44 | 'shortHandedTimeOnIcePerGame', 'powerPlayTimeOnIcePerGame'] 45 | 46 | player_stats_list = [[col,p_stats_df.take(1)[0][col]] for col in p_stats_df.columns if col in stats] 47 | 48 | #player information 49 | p_info_df = spark.read.parquet("./raw_data/parquet/yearByYear_p2") 50 | p_info_df = (p_info_df.where((p_info_df["p_id"] == p_id) & 51 | (p_info_df["season"] == season)) 52 | .orderBy(p_info_df["team_num_this_season"], ascending=False)) 53 | player_info = {col:p_info_df.take(1)[0][col] for col in p_info_df.columns} 54 | 55 | return player_df.toPandas(), rank_list, goal_stats_list, player_info, player_stats_list 56 | 57 | # if __name__ == 'query_data': 58 | # spark = SparkSession.builder.appName('example code').getOrCreate() 59 | # assert spark.version >= '3.0' # make sure we have Spark 3.0+ 60 | # spark.sparkContext.setLogLevel('WARN') 61 | # sc = spark.sparkContext 62 | # player_df, rank_list = main() -------------------------------------------------------------------------------- /report_generation/plotting_functions.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import matplotlib.pyplot as plt 3 | import sys 4 | import os 5 | from PIL import Image 6 | import 
warnings 7 | import requests 8 | import report_generation.data_query 9 | 10 | warnings.filterwarnings("ignore", category=DeprecationWarning) 11 | plt.rcParams.update({'font.size': 22}) 12 | 13 | def shot_scatter_plot(df, rink_image_fname, event, legend_labels, colors, out_fname): 14 | """ 15 | Given a list of parameters, creates shot plot and saves 16 | image to temporary file before being added to report. 17 | """ 18 | #read data 19 | df.loc[df['x_coordinate'] >= 0, 'y_coordinate',] = -1*df["y_coordinate"] 20 | df.loc[df['x_coordinate'] >= 0, 'x_coordinate',] = -1*df["x_coordinate"] 21 | 22 | rink_img = plt.imread(rink_image_fname) 23 | 24 | #plot data 25 | plt.figure(figsize=(10,10)) 26 | plt.scatter(df.loc[df['event'] == event]["x_coordinate"], df.loc[df['event'] == event]["y_coordinate"], c=colors[0], s=100, zorder=3) 27 | plt.scatter(df.loc[df['event'] != event]["x_coordinate"], df.loc[df['event'] != event]["y_coordinate"], c=colors[1], s=100, zorder=1) 28 | plt.imshow(rink_img, cmap="gray", extent=[-100, 100, -42.5, 42.5]) 29 | plt.xlim(left=-100, right=0) 30 | plt.ylim(bottom=-42.5, top=42.5) 31 | plt.legend(legend_labels, prop={'size': 22}) 32 | plt.axis('off') 33 | 34 | #need to add os call that checks if the file exists, and create DIR if not 35 | plt.savefig('./{}.png'.format("./tmp/" + out_fname), dpi=300, bbox_inches='tight') 36 | pass 37 | 38 | def shot_pie_plot(df, event, legend_labels, colors, out_fname): 39 | 40 | #preprocess data 41 | if event == "Goal": 42 | goal_pct = round(len(df.loc[df['event'] == event]["x_coordinate"])/len(df), 3)*100 43 | sa = 180 44 | else: 45 | goal_pct = round(len(df.loc[df['event'] != event]["x_coordinate"])/len(df), 3)*100 46 | sa = 270 47 | 48 | #pie plot figure 49 | sizes = [goal_pct, 100-goal_pct] 50 | explodes = [0.25, 0] 51 | plt.figure(figsize=(10,10)) 52 | patches, texts, _ = plt.pie(sizes, explode=explodes, autopct='%1.1f%%',shadow=True, startangle=sa, colors=colors) 53 | plt.legend(patches, legend_labels, loc="best", prop={'size': 22}) 54 | plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle. 
55 | 56 | if event == "Goal": 57 | plt.title("Shot Scored?") 58 | else: 59 | plt.title("Shot on Net?") 60 | 61 | #save figure 62 | plt.savefig('./{}.png'.format("./tmp/" + out_fname), dpi=300, bbox_inches='tight') 63 | pass 64 | 65 | def by_period_bar_plot(df, event, color, out_fname): 66 | """ 67 | Given a dataframe, returns a matplotlib bar plot of 68 | the number of goals scored each period 69 | """ 70 | if event != "Goals" and event != "Shots": 71 | sys.exit(" ---------- Invalid Event: {} ---------- ".format(event)) 72 | 73 | #processing data 74 | if event == "Goals": 75 | goal_dict = dict(df.loc[(df["event"] == "Goal") & (df["period"].isin([1,2,3]))]["period"].value_counts().sort_index()) 76 | else: 77 | goal_dict = dict(df.loc[df["period"].isin([1,2,3])]["period"].value_counts().sort_index()) 78 | 79 | #creating figure 80 | plt.figure(figsize=(10,5)) 81 | plt.bar(goal_dict.keys(), goal_dict.values(), color = color, width = 0.4, tick_label=[1,2,3], zorder=3) 82 | 83 | #remove ticks and borders 84 | plt.tick_params(bottom=False, left=False) 85 | for i, spine in enumerate(plt.gca().spines): 86 | if i != 2: 87 | plt.gca().spines[spine].set_visible(False) 88 | 89 | #labels / grid 90 | plt.gca().yaxis.grid(zorder=0) 91 | plt.xlabel("Period") 92 | plt.ylabel(event) 93 | plt.title(event + " by Period") 94 | plt.xticks(fontsize=14) 95 | plt.yticks(fontsize=14) 96 | 97 | #save figure 98 | plt.savefig('./tmp/{}.png'.format(out_fname), dpi=300, bbox_inches='tight') 99 | pass 100 | 101 | def rankings_hbar_plot(data2, out_fname): 102 | """ 103 | Given player statistics rankings for the season, 104 | creates a horizontal bar plot. 105 | 106 | - Need to modify function to take a data format other than nested array 107 | """ 108 | 109 | def sort_rankings(data): 110 | """ 111 | Given list of rankings, returns sorted array 112 | """ 113 | l = [] 114 | res = [] 115 | for i, val in enumerate(data): 116 | l.append([i, int(val[1][:-2])]) 117 | l = sorted(l, key = lambda x: x[1], reverse=True) 118 | for val in l: 119 | res.append([data2[val[0]][0][4:], data2[val[0]][1]]) 120 | return res[::-1] 121 | 122 | data2 = sort_rankings(data2) 123 | data = {"Stat": [x[0] for x in data2], "Rank": [x[1] for x in data2]} 124 | 125 | df = pd.DataFrame(data, index = data["Stat"]) 126 | fig, ax = plt.subplots(figsize=(5,18)) 127 | 128 | #range - #1f77b4 --> #aec7e8 129 | colors = ['#297db8','#3382bb','#3e88bf','#488ec3', 130 | '#5294c7','#5c99ca','#679fce','#71a5d2','#7baad5', 131 | '#85b0d9','#8fb6dd','#9abce1','#a4c1e4'] 132 | 133 | p1 = ax.barh(data["Stat"], data["Rank"], color = colors) 134 | ax.set_title('Regular Season Rankings\n', loc='right') 135 | ax.margins(x=0.1, y=0) 136 | ax.spines['right'].set_visible(False) 137 | ax.spines['top'].set_visible(False) 138 | ax.spines['bottom'].set_visible(False) 139 | ax.set_xticks([]) 140 | ax.set_xticklabels([]) 141 | ax.invert_yaxis() 142 | 143 | for rect, label in zip(ax.patches, [x[1] for x in data2]): 144 | height = rect.get_y() + (rect.get_height() / 2) + 0.15 145 | width = rect.get_width() + rect.get_x() + 1 146 | ax.text( 147 | width, height, label, ha="left", va="bottom" 148 | ) 149 | 150 | plt.savefig('./tmp/{}.png'.format(out_fname), dpi=300, bbox_inches='tight') 151 | 152 | def convert_pngs_to_jpegs(fpath = "./tmp"): 153 | """ 154 | Given a directory, converts all .png images 155 | to JPEG's 156 | """ 157 | for img in os.listdir(fpath): 158 | if img[-4:] == ".png": 159 | Image.open('{}/{}'.format(fpath, img)).convert('RGB').save('{}/{}.jpg'.format(fpath, img[:-4]), 
'JPEG', quality=95) 160 | os.remove('{}/{}'.format(fpath, img)) 161 | pass 162 | 163 | def get_player_image(player_id, fpath = "./tmp"): 164 | """ 165 | Downloads player image for title page 166 | of the report 167 | """ 168 | url = 'https://cms.nhl.bamgrid.com/images/headshots/current/168x168/{}@2x.jpg'.format(player_id) 169 | r = requests.get(url, stream=True) 170 | if r.ok: 171 | Image.open(r.raw).save(fpath + "/player.jpg", 'JPEG', quality=95) 172 | pass 173 | 174 | def check_tmp_directory(): 175 | """ 176 | Checks if the temporary directory has already been created 177 | """ 178 | if os.path.isdir("./tmp") and len(os.listdir("./tmp")) != 0: 179 | sys.exit(" ERROR:\n - Delete \"./tmp\" contents to continue -") 180 | else: 181 | if not os.path.isdir("./tmp"): 182 | os.mkdir("./tmp") 183 | pass 184 | 185 | def generate_all_plots(p_id, season): 186 | 187 | check_tmp_directory() 188 | print(" --- Querying Data --- ") 189 | 190 | player_df, rank_list, goal_stats_list, player_info, player_stats_list = report_generation.data_query.query(p_id, season) 191 | # add check to see if player not found in that season 192 | print(" --- Generating Plots --- ") 193 | 194 | rink_im = "/Users/brendanartley/dev/Sports-Analytics/imgs/simple_rink_grey.jpg" 195 | 196 | goal_colors = ["#88B4AA", "#e0ddbd"] 197 | on_net_colors = ["#e0ddbd", "#77A6C0"] 198 | 199 | #scatter plot rink imgs 200 | shot_scatter_plot(player_df, rink_im, event="Goal", legend_labels=["Goal", "No Goal"], colors = goal_colors, out_fname="rink_image1") 201 | shot_scatter_plot(player_df, rink_im, event="Missed Shot", legend_labels=["Missed Net", "On Net"], colors = on_net_colors[::-1], out_fname="rink_image2") 202 | 203 | #pie plot imgs 204 | shot_pie_plot(player_df, event="Goal", legend_labels=["Goal", "No Goal"], colors = goal_colors, out_fname="pie_plot1") 205 | shot_pie_plot(player_df, event="Missed Shot", legend_labels=["On Net", "Missed Net"], colors = on_net_colors, out_fname="pie_plot2") 206 | 207 | #rank plot 208 | rankings_hbar_plot(rank_list, out_fname = "rank_hbar_plot1") 209 | 210 | #getting player + team image 211 | get_player_image(player_id = p_id) 212 | 213 | #converting formats to jpegs 214 | convert_pngs_to_jpegs(fpath = "./tmp") 215 | 216 | return goal_stats_list, player_info, player_stats_list 217 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | appnope==0.1.2 2 | backcall==0.2.0 3 | certifi==2021.10.8 4 | charset-normalizer==2.0.9 5 | cycler==0.11.0 6 | decorator==5.1.0 7 | fonttools==4.28.5 8 | fpdf==1.7.2 9 | gevent==21.12.0 10 | greenlet==1.1.2 11 | grequests==0.6.0 12 | idna==3.3 13 | ipython==7.31.1 14 | jedi==0.18.1 15 | kiwisolver==1.3.2 16 | matplotlib==3.5.1 17 | matplotlib-inline==0.1.3 18 | numpy==1.21.4 19 | packaging==21.3 20 | pandas==1.3.5 21 | parso==0.8.3 22 | pexpect==4.8.0 23 | pickleshare==0.7.5 24 | Pillow==9.0.1 25 | prompt-toolkit==3.0.24 26 | ptyprocess==0.7.0 27 | py4j==0.10.9.2 28 | Pygments==2.10.0 29 | pyparsing==3.0.6 30 | pyspark==3.2.0 31 | python-dateutil==2.8.2 32 | pytz==2021.3 33 | requests==2.26.0 34 | six==1.16.0 35 | tqdm==4.62.3 36 | traitlets==5.1.1 37 | urllib3==1.26.7 38 | wcwidth==0.2.5 39 | zope.event==4.5.0 40 | zope.interface==5.4.0 41 | --------------------------------------------------------------------------------