├── .github
│   └── workflows
│       └── production-features-pipeline.yml
├── .gitignore
├── README.md
├── data
│   └── .gitkeep
├── images
│   ├── basketball_money2.jpg
│   ├── calibration.png
│   ├── confusion_matrix.png
│   ├── correlation_bar_chart.png
│   ├── distributions.png
│   ├── neptune.png
│   ├── streamlit2.png
│   ├── streamlit_example.jpg
│   ├── train_vs_test_shapley.jpg
│   └── train_vs_test_shapley.svg
├── notebooks
│   ├── 01_eda.ipynb
│   ├── 02_data_processing.ipynb
│   ├── 03_train_test_split.ipynb
│   ├── 04_model_baseline.ipynb
│   ├── 05_feature_engineering.ipynb
│   ├── 06_feature_selection.ipynb
│   ├── 07_model_testing.ipynb
│   ├── 08_backfill_features.ipynb
│   ├── 09_production_features_pipeline.ipynb
│   ├── 10_model_training_pipeline.ipynb
│   ├── feature_names.json
│   ├── hopsworks.ipynb
│   ├── lightgbm.json
│   ├── model_data.json
│   └── xgboost.json
├── requirements.txt
└── src
    ├── common_functions.py
    ├── data_processing.py
    ├── feature_engineering.py
    ├── hopsworks_utils.py
    ├── model_training.py
    ├── optuna_objectives.py
    ├── streamlit_app.py
    └── webscraping.py
/.github/workflows/production-features-pipeline.yml:
--------------------------------------------------------------------------------
1 | 
2 | name: production-features-pipeline
3 | 
4 | on:
5 |   workflow_dispatch:
6 |   # schedule:
7 |   #   - cron: '0 8 * * *'
8 | 
9 | jobs:
10 |   scrape_features:
11 |     # runs-on: windows-latest
12 |     runs-on: ubuntu-latest
13 |     steps:
14 |       - name: checkout repo content
15 |         uses: actions/checkout@v2
16 | 
17 |       - name: setup python
18 |         uses: actions/setup-python@v2
19 |         with:
20 |           python-version: '3.9.13'
21 | 
22 |       - name: install python packages
23 |         run: |
24 |           python -m pip install --upgrade pip
25 |           pip install -r requirements.txt
26 |           python -m pip install jupyter nbconvert nbformat scrapingant-client
27 | 
28 |       - name: execute python workflows from notebook
29 |         env:
30 |           HOPSWORKS_API_KEY: ${{ secrets.HOPSWORKS_API_KEY }}
31 |           NEPTUNE_API_TOKEN: ${{ secrets.NEPTUNE_API_TOKEN }}
32 |           SCRAPINGANT_API_KEY: ${{ secrets.SCRAPINGANT_API_KEY }}
33 | 
34 |         run:
35 |           jupyter nbconvert --to notebook --execute notebooks/09_production_features_pipeline.ipynb
36 | 
37 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | /venv
2 | /old
3 | /.ipynb_checkpoints
4 | .hw_api_key
5 | .ipynb_checkpoints
6 | .neptune
7 | __pycache__
8 | ca_chain
9 | certfile
10 | keyfile
11 | geckodriver.log
12 | .env
13 | .vscode
14 | .venv/
15 | *.csv
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # NBA game predictor
3 | [Click here to see it in action](https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/)
4 |
5 |
6 |
13 |
14 | (header image)
15 |
16 | #### Table of contents
17 | - [Introduction](#introduction)
18 | - [Problem](#problem-increase-the-profitability-of-betting-on-nba-games)
19 | - [Initial step](#initial-step-predict-the-probability-that-the-home-team-will-win-each-game)
20 | - [Plan](#plan)
21 | - [Overview](#overview)
22 | - [Future Possibilities](#future-possibilities)
23 | - [Structure](#structure)
24 | - [Data](#data)
25 | - [EDA and data processing](#eda-and-data-processing)
26 | - [Train/validation/test split](#train--testvalidation-split)
27 | - [Baseline models](#baseline-models)
28 | - [Feature engineering](#feature-engineering)
29 | - [Model training/testing](#model-trainingtesting)
30 | - [Streamlit app](#streamlit-app)
31 | - [Feedback](#feedback)
32 |
33 | ## Introduction
34 |
35 | This project is a demonstration of my ability to quickly learn, develop, and deploy end-to-end machine learning technologies. I am currently seeking to change careers into Machine Learning / Data Science.
36 |
37 | I chose to predict the winner of NBA games because:
38 | - multiple games are played every day during the season, so I can see how my model performs on a daily basis
39 | - picking a game winner is easy for a casual audience to understand and appreciate
40 | - there is a lot of data available
41 | - it can be used to make money (via betting strategies). I have always been interested in making money.
42 |
43 | I am actually not really a big fan of the NBA, but I have watched a few games and have basic knowledge. I have never done any sports betting either, but I have always loved exploration and discovery; the possibility of finding something that somebody else has "missed" is very appealing to me, especially in terms of competition and of making money.
44 |
45 | ## Problem: Increase the profitability of betting on NBA games
46 |
47 | ### Initial Step: Predict the probability that the home team will win each game
48 |
49 | Machine learning classification models will be used to predict the probability that the home team wins each game, based upon historical data. This is a first step toward developing a betting strategy that increases the profitability of betting on NBA games.
50 |
51 | *Disclaimer*
52 |
53 | In reality, a betting strategy is a rather complex problem with many elements beyond simply picking the winner of each game. Huge amounts of manpower and money have been invested in developing such strategies, and it is not likely that a learning project will be able to compete very well with such efforts. However, it may provide an extra element of insight that could be used to improve the profitability of an existing betting strategy.
54 |
55 | Project Repository: [https://github.com/cmunch1/nba-prediction](https://github.com/cmunch1/nba-prediction)
56 |
57 | ### Plan
58 |
59 | - Gradient boosted tree models (XGBoost and LightGBM) will be utilized to determine the probability that the home team will win each game.
60 | - The model probability will be calibrated against the true probability distribution using sklearn's CalibratedClassifierCV.
61 | - The probability of winning will be important in developing betting strategies because such strategies will not bet on every game, just on games with better expected values.
62 | - Pipelines will be set up to scrape new data from the NBA website every day and to retrain the model when desired.
63 | - The model will be deployed online using a [streamlit app](https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/) to predict and report winning probabilities every day.
64 |
65 |
66 |
67 |
68 | ### Overview
69 |
70 | - Historical game data is retrieved from Kaggle.
71 | - EDA, Data Processing, and Feature Engineering are used to develop the best model in either XGBoost or LightGBM.
72 | - Data and model are added to a serverless Feature Store and Model Registry.
73 | - Model is deployed online as a Streamlit app
74 | - Pipelines are setup to:
75 | - Scrape new data from NBA website and add to Feature Store every day using Github Actions
76 | - Retrain model and tune hyperparameters
77 |
78 | Tools Used:
79 |
80 | - VS Code w/ Copilot - IDE
81 | - Pandas - data manipulation
82 | - XGBoost - modeling
83 | - LightGBM - modeling
84 | - Scikit-learn - probability calibration
85 | - Optuna - hyperparameter tuning
86 | - Neptune.ai - experiment tracking
87 | - Selenium - data scraping and processing
88 | - ScrapingAnt - data scraping
89 | - BeautifulSoup - data processing of scraped data
90 | - Hopsworks.ai - Feature Store and Model Registry
91 | - Github Actions - running notebooks to scrape new data, predict winning probabilities, and retrain models
92 | - Streamlit - web app deployment
93 |
94 | ### Future Possibilities
95 |
96 | Continual improvements might include:
97 |
98 | - More feature engineering/selection
99 | - More data sources (player stats, injuries, etc...)
100 | - A/B testing against outside and internal models (e.g. other predictor projects, Vegas odds, betting site odds, etc...)
101 | - Track model performance over time, testing for model drift, etc...
102 | - Develop optimized retraining criteria (e.g. time periods, number of games, model drift, etc...)
103 | - Better data visualizations of model predictions and performance
104 | - Develop betting strategies based upon model predictions
105 | - Ensemble betting strategies with other models and strategies, including human experts
106 | - Track model performance against other models and betting strategies
107 |
108 |
109 | ### Structure
110 |
111 | Jupyter Notebooks were used for initial development and testing and are labeled 01 through 10 in the main directory. Notebooks 01 thru 06 are primarily just historical records and notes for the development process.
112 |
113 | Key functions were moved to .py files in the src directory once the functions were stable.
114 |
115 | Notebooks 07, 09, and 10 are used in production. I chose to keep the notebooks instead of full conversion to scripts because:
116 |
117 | - I think they look better in terms of documentation
118 | - I prefer to peruse the notebook output after model testing and retraining sometimes instead of relying just on experiment tracking logs
119 | - I haven't yet conceptually decided on my preferred way of structuring my model testing pipelines for best reusability and maintainability (e.g. should I use custom wrapper functions to invoke experiment logging so that I can easily change providers, or should I just choose one provider and stick with their API?)
120 |
121 |
122 |
123 | ### Data
124 |
125 | Data from the 2013 thru 2021 seasons has been archived on Kaggle. New data is scraped from the NBA website.
126 |
127 | Currently available data includes:
128 |
129 | - games_details.csv .. (each-game player stats for everyone on the roster)
130 | - games.csv .......... (each-game team stats: final scores, points scored, field-goal & free-throw percentages, etc...)
131 | - players.csv ........ (index of players' names and teams)
132 | - ranking.csv ........ (incremental daily record of standings, games played, won, lost, win%, home record, road record)
133 | - teams.csv .......... (index of team info such as city and arena names and also head coach)
134 |
135 | NOTES
136 | - games.csv is the primary data source and will be the only data used initially
137 | - games_details.csv details individual player stats for each game and may be added to the model later
138 | - ranking.csv data is essentially cumulative averages from the beginning of the season and is not really needed as these and other rolling averages can be calculated from the games.csv data
139 |
140 |
141 | **New Data**
142 |
143 | New data is scraped from [https://www.nba.com/stats/teams/boxscores](https://www.nba.com/stats/teams/boxscores)
144 |
145 |
146 | **Data Leakage**
147 |
148 | The data for each game are stats for the *completed* game. We want to predict the winner *before* the game is played, not after. The model should only use data that would be available before the game is played. Our model features will primarily be rolling stats for the previous games (e.g. average assists for previous 5 games) while excluding the current game.
149 |
150 | I mention this because I did see several similar projects online that failed to take this into account. If the goal is simply to predict which stats are important for winning games, then the model can be trained on the entire dataset. However, if the goal is to predict the winner of a game like we are trying to do, then the model must be trained on data that would only be available before the game is played.
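As a concrete illustration, here is a minimal sketch (on a hypothetical mini box score) of the difference between a leaky rolling mean and the leakage-free version that shifts by one game before averaging; the engineered features in this project follow the second pattern:

```python
import pandas as pd

# hypothetical mini box score: four games for one team, in date order
games = pd.DataFrame({"TEAM_ID": [1, 1, 1, 1],
                      "PTS":     [100, 110, 90, 120]})

grouped = games.groupby("TEAM_ID")["PTS"]

# Leaky: the rolling window includes the current game's own stats
games["PTS_AVG_LAST_2_LEAKY"] = grouped.transform(lambda s: s.rolling(2).mean())

# Leakage-free: shift(1) first, so only *previous* games enter the average
games["PTS_AVG_LAST_2"] = grouped.transform(lambda s: s.shift(1).rolling(2).mean())

print(games)
```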
151 |
152 | ### EDA and Data Processing
153 |
154 | Exploratory Data Analysis (EDA) and Data Processing are summarized and detailed in the notebooks. Some examples include:
155 |
156 | Histograms of various features
157 |
158 |
159 |
160 | Correlations between features
161 |
162 |
163 |
164 |
165 | ### Train / Test/Validation Split
166 |
167 | - Latest season is used as Test/Validation data and previous seasons are used as Train data
168 |
169 | ### Baseline Models
170 |
171 | Simple If-Then Models
172 |
173 | - Home team always wins (Accuracy = 0.59, AUC = 0.50 on Train data; Accuracy = 0.49, AUC = 0.50 on Test data; see the sketch after this list)
174 |
175 | ML Models
176 |
177 | - LightGBM (Accuracy = 0.58, AUC = 0.64 on Test data)
178 | - XGBoost (Accuracy = 0.59, AUC = 0.61 on Test data)
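The "home team always wins" baseline referenced above can be reproduced in a few lines; this is a sketch assuming the test split written by notebook 03 and its HOME_TEAM_WINS column:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

test = pd.read_csv("data/test.csv")
y_true = test["HOME_TEAM_WINS"]

# "home team always wins" predicts 1 for every game
always_home = pd.Series(1, index=test.index)
print("Accuracy:", accuracy_score(y_true, always_home))

# a constant score carries no ranking information, so AUC is 0.5 by construction
print("AUC:", roc_auc_score(y_true, always_home.astype(float)))
```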
179 |
180 | ### Feature Engineering
181 |
182 | - Convert game date to month only
183 | - Compile rolling means for various time periods for each team as home team and as visitor team (see the sketch after this list)
184 | - Compile current win streak for each team as home team and as visitor team
185 | - Compile head-to-head matchup data for each team pair
186 | - Compile rolling means for various time periods for each team regardless of home or visitor status
187 | - Compile current win streak for each team regardless of home or visitor status
188 | - Subtract the league average rolling means from each team's rolling means
189 |
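A minimal sketch of two of the features above, the leakage-free current win streak and a league-average-adjusted rolling mean. The frame is a hypothetical stand-in and the league average is simplified to a per-season mean; the real implementation lives in src/feature_engineering.py:

```python
import pandas as pd

# hypothetical per-team game log in date order: 1 = win, 0 = loss
df = pd.DataFrame({"TEAM_ID": [1, 1, 1, 1, 1],
                   "SEASON":  [2022] * 5,
                   "WIN":     [1, 1, 0, 1, 1],
                   "PTS":     [105, 112, 98, 120, 101]})

def current_streak(wins: pd.Series) -> pd.Series:
    """Consecutive wins entering each game (shift(1) excludes the current game)."""
    prior = wins.shift(1, fill_value=0)
    return prior.groupby((prior == 0).cumsum()).cumcount()

df["WIN_STREAK"] = df.groupby("TEAM_ID")["WIN"].transform(current_streak)

# rolling mean of points over the last 3 games, excluding the current one
df["PTS_AVG_LAST_3_ALL"] = (df.groupby("TEAM_ID")["PTS"]
                              .transform(lambda s: s.shift(1).rolling(3).mean()))

# subtract the league-wide average of the same rolling stat
league_avg = df.groupby("SEASON")["PTS_AVG_LAST_3_ALL"].transform("mean")
df["PTS_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG"] = df["PTS_AVG_LAST_3_ALL"] - league_avg
```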
190 |
191 | ### Model Training/Testing
192 |
193 | **Models**
194 | - LightGBM
195 | - XGBoost
196 |
197 | The native Python API (rather than the Scikit-learn wrapper) is used for initial testing of both models because it provides easy access to built-in Shapley values, which are used for feature importance analysis and for adversarial validation. Since Shapley values are local to each dataset, they can be used to check whether the train and test datasets have the same feature importances; if they do not, the model may not generalize very well.
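Below is a sketch of that adversarial check with XGBoost's native API, on synthetic stand-in data (the notebooks run it on the engineered features):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)
X_test = rng.normal(size=(500, 5))

dtrain = xgb.DMatrix(X_train, label=y_train)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)

# pred_contribs=True returns one Shapley value per feature per row
# (the last column is the bias term, dropped here)
shap_train = booster.predict(dtrain, pred_contribs=True)[:, :-1]
shap_test = booster.predict(xgb.DMatrix(X_test), pred_contribs=True)[:, :-1]

# rank features by mean |SHAP| in each set; large disagreements between the
# two rankings suggest the model may not generalize well
print(np.abs(shap_train).mean(axis=0))
print(np.abs(shap_test).mean(axis=0))
```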
198 |
199 | The Scikit-learn wrapper is used later in production because it allows for easier probability calibration using sklearn's CalibratedClassifierCV.
200 |
201 |
202 |
203 | **Evaluation**
204 | - AUC is the primary metric; Accuracy is secondary (it is more meaningful to casual users)
205 | - Shapley values compared: Train set vs Test/Validation set
206 | - Test/Validation set is split: early half vs later half
207 |
208 |
209 |
210 |
211 | **Experiment Tracking**
212 |
213 | Notebook 07 integrates Neptune.ai for experiment tracking and Optuna for hyperparameter tuning.
214 |
215 | Experiment tracking logs can be viewed here: [https://app.neptune.ai/cmunch1/nba-prediction/experiments?split=tbl&dash=charts&viewId=979e20ed-e172-4c33-8aae-0b1aa1af3602](https://app.neptune.ai/cmunch1/nba-prediction/experiments?split=tbl&dash=charts&viewId=979e20ed-e172-4c33-8aae-0b1aa1af3602)
216 |
217 |
218 |
219 |
220 | **Probability Calibration**
221 |
222 | Scikit-learn's CalibratedClassifierCV is used to ensure that the model probabilities are calibrated against the true probability distribution. The Brier loss score is used to automatically select the best calibration method (sigmoid, isotonic, or none).
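A sketch of that selection loop on synthetic stand-in data (the production notebook applies it to the tuned model and engineered features):

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + rng.normal(size=2000) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

scores = {}

# "none": the uncalibrated model's raw probabilities
raw = LGBMClassifier(n_estimators=100).fit(X_train, y_train)
scores["none"] = brier_score_loss(y_val, raw.predict_proba(X_val)[:, 1])

# sigmoid and isotonic calibration, each scored on held-out data
for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(LGBMClassifier(n_estimators=100),
                                        method=method, cv=5)
    calibrated.fit(X_train, y_train)
    scores[method] = brier_score_loss(y_val, calibrated.predict_proba(X_val)[:, 1])

# lowest Brier score wins
print(min(scores, key=scores.get), scores)
```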
223 |
224 |
225 |
226 |
227 |
228 | ### Production Features Pipeline
229 |
230 | Notebook 09 is run by a GitHub Actions workflow every morning.
231 |
232 | - It scrapes the stats from the previous day's games, updates all the rolling statistics and streaks, and adds them to the Feature Store.
233 | - It scrapes the upcoming game matchups for the current day and adds them to the Feature Store so that the Streamlit app can use them to make its daily predictions.
234 |
235 | A variable can be set to use either Selenium or ScrapingAnt for scraping the data. ScrapingAnt is used in production because of its built-in proxy server.
236 |
237 | - The Selenium notebook worked fine when run locally, but there were issues when running it in GitHub Actions, likely due to the IP address and anti-bot measures on the NBA website (which would require a proxy server to address)
238 | - ScrapingAnt is a cloud-based scraper with a Python API that handles the proxy server issues. An account is required, but the free account is sufficient for this project (a minimal usage sketch follows)
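A minimal usage sketch with ScrapingAnt's Python client; the environment variable matches the secret name used in the workflow:

```python
import os
from scrapingant_client import ScrapingAntClient

client = ScrapingAntClient(token=os.environ["SCRAPINGANT_API_KEY"])

# requests are routed through ScrapingAnt's proxy network, which is what
# avoids the anti-bot issues seen with plain Selenium in GitHub Actions
result = client.general_request("https://www.nba.com/stats/teams/boxscores")
html = result.content  # raw HTML, ready for BeautifulSoup parsing
```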
239 |
240 | ### Model Training Pipeline
241 |
242 | Notebook 10 retrieves the most current data, executes Notebook 07 to handle hyperparameter tuning, model training, and calibration, and then adds the model to the Model Registry. The time periods used for the train set and test set can be adjusted so that the model can be tested only on the most current games.
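One way to execute one notebook from another programmatically, sketched with nbformat and nbconvert's ExecutePreprocessor (the same machinery the GitHub Actions workflow invokes through the nbconvert CLI); the timeout and working directory here are illustrative:

```python
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# load and run notebook 07 in place, as if every cell were executed by hand
nb = nbformat.read("notebooks/07_model_testing.ipynb", as_version=4)
executor = ExecutePreprocessor(timeout=3600, kernel_name="python3")
executor.preprocess(nb, {"metadata": {"path": "notebooks/"}})
```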
243 |
244 | ### Streamlit App
245 |
246 | The Streamlit app is deployed at streamlit.io and can be accessed here: [https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/](https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/)
247 |
248 | It uses the model in the Model Registry to predict the win probability of the home team for the current day's upcoming games.
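A hedged sketch of that flow; the model name, artifact file name, and matchup-loading step are illustrative placeholders, not the project's actual registry entries:

```python
import hopsworks
import joblib
import pandas as pd
import streamlit as st

project = hopsworks.login()  # HOPSWORKS_API_KEY read from the environment
mr = project.get_model_registry()

# download a registered model (name/version are hypothetical)
model_dir = mr.get_model("nba_home_win_model", version=1).download()
model = joblib.load(f"{model_dir}/model.pkl")  # hypothetical artifact name

# in the real app, today's matchups are read back from the Feature Store
todays_games = pd.DataFrame()  # placeholder
if not todays_games.empty:
    probs = model.predict_proba(todays_games)[:, 1]
    st.dataframe(todays_games.assign(HOME_WIN_PROB=probs))
```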
249 |
250 |
251 |
252 | ### Feedback
253 |
254 | Thanks for taking the time to read about my project. This is my primary "portfolio" project in my quest to change careers and find an entry-level position in Machine Learning / Data Science. I appreciate any feedback.
255 |
256 | Project Repository: [https://github.com/cmunch1/nba-prediction](https://github.com/cmunch1/nba-prediction)
257 |
258 | My Linked-In profile: [https://www.linkedin.com/in/chris-munch/](https://www.linkedin.com/in/chris-munch/)
259 |
260 | Twitter: [https://twitter.com/curiovana](https://twitter.com/curiovana)
261 |
--------------------------------------------------------------------------------
/data/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/data/.gitkeep
--------------------------------------------------------------------------------
/images/basketball_money2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/basketball_money2.jpg
--------------------------------------------------------------------------------
/images/calibration.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/calibration.png
--------------------------------------------------------------------------------
/images/confusion_matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/confusion_matrix.png
--------------------------------------------------------------------------------
/images/correlation_bar_chart.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/correlation_bar_chart.png
--------------------------------------------------------------------------------
/images/distributions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/distributions.png
--------------------------------------------------------------------------------
/images/neptune.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/neptune.png
--------------------------------------------------------------------------------
/images/streamlit2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/streamlit2.png
--------------------------------------------------------------------------------
/images/streamlit_example.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/streamlit_example.jpg
--------------------------------------------------------------------------------
/images/train_vs_test_shapley.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/train_vs_test_shapley.jpg
--------------------------------------------------------------------------------
/notebooks/03_train_test_split.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "defc34ab-7465-454b-a148-f18f857eafaa",
6 | "metadata": {},
7 | "source": [
8 | "### Train / Test Split\n",
9 | "\n",
10 | "This notebook splits the data into train data and test data.\n",
11 | "\n",
12 | "It selects the latest season and uses this as the test data."
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 1,
18 | "id": "f270d24d-731b-4a33-a737-599871c1b024",
19 | "metadata": {},
20 | "outputs": [],
21 | "source": [
22 | "import pandas as pd\n",
23 | "import numpy as np\n",
24 | "\n",
25 | "from pathlib import Path #for Windows/Linux compatibility\n",
26 | "DATAPATH = Path(r'data')\n"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": 2,
32 | "id": "45005c7b-e29c-48cc-adf2-47b4cf6a7863",
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "df = pd.read_csv(DATAPATH / \"transformed.csv\")"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "id": "2ddfe2f2-e69e-4f41-a047-87e2caee29f7",
42 | "metadata": {},
43 | "source": [
44 | "**Split Data to Train and Test**"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 3,
50 | "id": "2ec58305-22d8-4e01-ad13-a6052967fb43",
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "latest_season = df['SEASON'].unique().max()\n",
55 | "\n",
56 | "train = df[df['SEASON'] < (latest_season)]\n",
57 | "test = df[df['SEASON'] >= (latest_season - 1)]\n",
58 | "\n",
59 | "train.to_csv(DATAPATH / \"train.csv\",index=False)\n",
60 | "test.to_csv(DATAPATH / \"test.csv\",index=False)\n"
61 | ]
62 | }
63 | ],
64 | "metadata": {
65 | "kernelspec": {
66 | "display_name": "nba3",
67 | "language": "python",
68 | "name": "python3"
69 | },
70 | "language_info": {
71 | "codemirror_mode": {
72 | "name": "ipython",
73 | "version": 3
74 | },
75 | "file_extension": ".py",
76 | "mimetype": "text/x-python",
77 | "name": "python",
78 | "nbconvert_exporter": "python",
79 | "pygments_lexer": "ipython3",
80 | "version": "3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]"
81 | },
82 | "vscode": {
83 | "interpreter": {
84 | "hash": "4655998f62ad965cbd25df51edb717f2326f5df53d53899f0ae604225aa5ae06"
85 | }
86 | }
87 | },
88 | "nbformat": 4,
89 | "nbformat_minor": 5
90 | }
91 |
--------------------------------------------------------------------------------
/notebooks/06_feature_selection.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "ea2619eb-c30b-457d-abfa-a7b4b4dd584c",
6 | "metadata": {},
7 | "source": [
8 | "## Feature Selection"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "id": "f71632a1-ef6a-4404-b662-e8227de30fa1",
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "import numpy as np\n",
20 | "\n",
21 | "pd.set_option('display.max_columns', 500)\n",
22 | "\n",
23 | "from sklearn.feature_selection import SelectKBest, chi2\n",
24 | "\n",
25 | "\n",
26 | "import matplotlib.pyplot as plt\n",
27 | "import seaborn as sns\n",
28 | "\n",
29 | "from pathlib import Path #for Windows/Linux compatibility\n",
30 | "DATAPATH = Path(r'data')\n"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 2,
36 | "id": "d4aabdeb-0167-490e-bc37-65a62f7c0821",
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "train = pd.read_csv(DATAPATH / \"train_features.csv\")\n",
41 | "test = pd.read_csv(DATAPATH / \"test_features.csv\")\n"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 4,
47 | "id": "2d3c9ab6-9a07-46c5-8d57-d75df3171036",
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "train.to_csv(DATAPATH / \"train_selected.csv\",index=False)\n",
52 | "test.to_csv(DATAPATH / \"test_selected.csv\",index=False)"
53 | ]
54 | }
55 | ],
56 | "metadata": {
57 | "kernelspec": {
58 | "display_name": "Python 3 (ipykernel)",
59 | "language": "python",
60 | "name": "python3"
61 | },
62 | "language_info": {
63 | "codemirror_mode": {
64 | "name": "ipython",
65 | "version": 3
66 | },
67 | "file_extension": ".py",
68 | "mimetype": "text/x-python",
69 | "name": "python",
70 | "nbconvert_exporter": "python",
71 | "pygments_lexer": "ipython3",
72 | "version": "3.9.13"
73 | }
74 | },
75 | "nbformat": 4,
76 | "nbformat_minor": 5
77 | }
78 |
--------------------------------------------------------------------------------
/notebooks/feature_names.json:
--------------------------------------------------------------------------------
1 | {"game_date_est": "GAME_DATE_EST", "game_id": "GAME_ID", "home_team_id": "HOME_TEAM_ID", "visitor_team_id": "VISITOR_TEAM_ID", "season": "SEASON", "pts_home": "PTS_home", "fg_pct_home": "FG_PCT_home", "ft_pct_home": "FT_PCT_home", "fg3_pct_home": "FG3_PCT_home", "ast_home": "AST_home", "reb_home": "REB_home", "pts_away": "PTS_away", "fg_pct_away": "FG_PCT_away", "ft_pct_away": "FT_PCT_away", "fg3_pct_away": "FG3_PCT_away", "ast_away": "AST_away", "reb_away": "REB_away", "home_team_wins": "HOME_TEAM_WINS", "target": "TARGET", "month": "MONTH", "home_team_win_streak": "HOME_TEAM_WIN_STREAK", "home_team_wins_avg_last_3_home": "HOME_TEAM_WINS_AVG_LAST_3_HOME", "home_team_wins_avg_last_7_home": "HOME_TEAM_WINS_AVG_LAST_7_HOME", "home_team_wins_avg_last_10_home": "HOME_TEAM_WINS_AVG_LAST_10_HOME", "home_pts_home_avg_last_3_home": "HOME_PTS_home_AVG_LAST_3_HOME", "home_pts_home_avg_last_7_home": "HOME_PTS_home_AVG_LAST_7_HOME", "home_pts_home_avg_last_10_home": "HOME_PTS_home_AVG_LAST_10_HOME", "home_fg_pct_home_avg_last_3_home": "HOME_FG_PCT_home_AVG_LAST_3_HOME", "home_fg_pct_home_avg_last_7_home": "HOME_FG_PCT_home_AVG_LAST_7_HOME", "home_fg_pct_home_avg_last_10_home": "HOME_FG_PCT_home_AVG_LAST_10_HOME", "home_ft_pct_home_avg_last_3_home": "HOME_FT_PCT_home_AVG_LAST_3_HOME", "home_ft_pct_home_avg_last_7_home": "HOME_FT_PCT_home_AVG_LAST_7_HOME", "home_ft_pct_home_avg_last_10_home": "HOME_FT_PCT_home_AVG_LAST_10_HOME", "home_fg3_pct_home_avg_last_3_home": "HOME_FG3_PCT_home_AVG_LAST_3_HOME", "home_fg3_pct_home_avg_last_7_home": "HOME_FG3_PCT_home_AVG_LAST_7_HOME", "home_fg3_pct_home_avg_last_10_home": "HOME_FG3_PCT_home_AVG_LAST_10_HOME", "home_ast_home_avg_last_3_home": "HOME_AST_home_AVG_LAST_3_HOME", "home_ast_home_avg_last_7_home": "HOME_AST_home_AVG_LAST_7_HOME", "home_ast_home_avg_last_10_home": "HOME_AST_home_AVG_LAST_10_HOME", "home_reb_home_avg_last_3_home": "HOME_REB_home_AVG_LAST_3_HOME", "home_reb_home_avg_last_7_home": "HOME_REB_home_AVG_LAST_7_HOME", "home_reb_home_avg_last_10_home": "HOME_REB_home_AVG_LAST_10_HOME", "home_pts_home_avg_last_3_home_minus_league_avg": "HOME_PTS_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_pts_home_avg_last_7_home_minus_league_avg": "HOME_PTS_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_pts_home_avg_last_10_home_minus_league_avg": "HOME_PTS_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_fg_pct_home_avg_last_3_home_minus_league_avg": "HOME_FG_PCT_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_fg_pct_home_avg_last_7_home_minus_league_avg": "HOME_FG_PCT_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_fg_pct_home_avg_last_10_home_minus_league_avg": "HOME_FG_PCT_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_ft_pct_home_avg_last_3_home_minus_league_avg": "HOME_FT_PCT_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_ft_pct_home_avg_last_7_home_minus_league_avg": "HOME_FT_PCT_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_ft_pct_home_avg_last_10_home_minus_league_avg": "HOME_FT_PCT_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_fg3_pct_home_avg_last_3_home_minus_league_avg": "HOME_FG3_PCT_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_fg3_pct_home_avg_last_7_home_minus_league_avg": "HOME_FG3_PCT_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_fg3_pct_home_avg_last_10_home_minus_league_avg": "HOME_FG3_PCT_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_ast_home_avg_last_3_home_minus_league_avg": "HOME_AST_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_ast_home_avg_last_7_home_minus_league_avg": "HOME_AST_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", 
"home_ast_home_avg_last_10_home_minus_league_avg": "HOME_AST_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_reb_home_avg_last_3_home_minus_league_avg": "HOME_REB_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_reb_home_avg_last_7_home_minus_league_avg": "HOME_REB_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_reb_home_avg_last_10_home_minus_league_avg": "HOME_REB_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "visitor_team_win_streak": "VISITOR_TEAM_WIN_STREAK", "visitor_team_wins_avg_last_3_visitor": "VISITOR_TEAM_WINS_AVG_LAST_3_VISITOR", "visitor_team_wins_avg_last_7_visitor": "VISITOR_TEAM_WINS_AVG_LAST_7_VISITOR", "visitor_team_wins_avg_last_10_visitor": "VISITOR_TEAM_WINS_AVG_LAST_10_VISITOR", "visitor_pts_away_avg_last_3_visitor": "VISITOR_PTS_away_AVG_LAST_3_VISITOR", "visitor_pts_away_avg_last_7_visitor": "VISITOR_PTS_away_AVG_LAST_7_VISITOR", "visitor_pts_away_avg_last_10_visitor": "VISITOR_PTS_away_AVG_LAST_10_VISITOR", "visitor_fg_pct_away_avg_last_3_visitor": "VISITOR_FG_PCT_away_AVG_LAST_3_VISITOR", "visitor_fg_pct_away_avg_last_7_visitor": "VISITOR_FG_PCT_away_AVG_LAST_7_VISITOR", "visitor_fg_pct_away_avg_last_10_visitor": "VISITOR_FG_PCT_away_AVG_LAST_10_VISITOR", "visitor_ft_pct_away_avg_last_3_visitor": "VISITOR_FT_PCT_away_AVG_LAST_3_VISITOR", "visitor_ft_pct_away_avg_last_7_visitor": "VISITOR_FT_PCT_away_AVG_LAST_7_VISITOR", "visitor_ft_pct_away_avg_last_10_visitor": "VISITOR_FT_PCT_away_AVG_LAST_10_VISITOR", "visitor_fg3_pct_away_avg_last_3_visitor": "VISITOR_FG3_PCT_away_AVG_LAST_3_VISITOR", "visitor_fg3_pct_away_avg_last_7_visitor": "VISITOR_FG3_PCT_away_AVG_LAST_7_VISITOR", "visitor_fg3_pct_away_avg_last_10_visitor": "VISITOR_FG3_PCT_away_AVG_LAST_10_VISITOR", "visitor_ast_away_avg_last_3_visitor": "VISITOR_AST_away_AVG_LAST_3_VISITOR", "visitor_ast_away_avg_last_7_visitor": "VISITOR_AST_away_AVG_LAST_7_VISITOR", "visitor_ast_away_avg_last_10_visitor": "VISITOR_AST_away_AVG_LAST_10_VISITOR", "visitor_reb_away_avg_last_3_visitor": "VISITOR_REB_away_AVG_LAST_3_VISITOR", "visitor_reb_away_avg_last_7_visitor": "VISITOR_REB_away_AVG_LAST_7_VISITOR", "visitor_reb_away_avg_last_10_visitor": "VISITOR_REB_away_AVG_LAST_10_VISITOR", "visitor_team_wins_avg_last_3_visitor_minus_league_avg": "VISITOR_TEAM_WINS_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_team_wins_avg_last_7_visitor_minus_league_avg": "VISITOR_TEAM_WINS_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_team_wins_avg_last_10_visitor_minus_league_avg": "VISITOR_TEAM_WINS_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_pts_away_avg_last_3_visitor_minus_league_avg": "VISITOR_PTS_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_pts_away_avg_last_7_visitor_minus_league_avg": "VISITOR_PTS_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_pts_away_avg_last_10_visitor_minus_league_avg": "VISITOR_PTS_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg_pct_away_avg_last_3_visitor_minus_league_avg": "VISITOR_FG_PCT_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg_pct_away_avg_last_7_visitor_minus_league_avg": "VISITOR_FG_PCT_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg_pct_away_avg_last_10_visitor_minus_league_avg": "VISITOR_FG_PCT_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_ft_pct_away_avg_last_3_visitor_minus_league_avg": "VISITOR_FT_PCT_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_ft_pct_away_avg_last_7_visitor_minus_league_avg": "VISITOR_FT_PCT_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_ft_pct_away_avg_last_10_visitor_minus_league_avg": 
"VISITOR_FT_PCT_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg3_pct_away_avg_last_3_visitor_minus_league_avg": "VISITOR_FG3_PCT_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg3_pct_away_avg_last_7_visitor_minus_league_avg": "VISITOR_FG3_PCT_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg3_pct_away_avg_last_10_visitor_minus_league_avg": "VISITOR_FG3_PCT_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_ast_away_avg_last_3_visitor_minus_league_avg": "VISITOR_AST_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_ast_away_avg_last_7_visitor_minus_league_avg": "VISITOR_AST_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_ast_away_avg_last_10_visitor_minus_league_avg": "VISITOR_AST_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_reb_away_avg_last_3_visitor_minus_league_avg": "VISITOR_REB_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_reb_away_avg_last_7_visitor_minus_league_avg": "VISITOR_REB_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_reb_away_avg_last_10_visitor_minus_league_avg": "VISITOR_REB_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "matchup_winpct_3_x": "MATCHUP_WINPCT_3_x", "matchup_winpct_7_x": "MATCHUP_WINPCT_7_x", "matchup_winpct_10_x": "MATCHUP_WINPCT_10_x", "matchup_win_streak_x": "MATCHUP_WIN_STREAK_x", "win_streak_x": "WIN_STREAK_x", "home_away_streak_x": "HOME_AWAY_STREAK_x", "team1_win_avg_last_3_all_x": "TEAM1_win_AVG_LAST_3_ALL_x", "team1_win_avg_last_7_all_x": "TEAM1_win_AVG_LAST_7_ALL_x", "team1_win_avg_last_10_all_x": "TEAM1_win_AVG_LAST_10_ALL_x", "team1_win_avg_last_15_all_x": "TEAM1_win_AVG_LAST_15_ALL_x", "pts_avg_last_3_all_x": "PTS_AVG_LAST_3_ALL_x", "pts_avg_last_7_all_x": "PTS_AVG_LAST_7_ALL_x", "pts_avg_last_10_all_x": "PTS_AVG_LAST_10_ALL_x", "pts_avg_last_15_all_x": "PTS_AVG_LAST_15_ALL_x", "fg_pct_avg_last_3_all_x": "FG_PCT_AVG_LAST_3_ALL_x", "fg_pct_avg_last_7_all_x": "FG_PCT_AVG_LAST_7_ALL_x", "fg_pct_avg_last_10_all_x": "FG_PCT_AVG_LAST_10_ALL_x", "fg_pct_avg_last_15_all_x": "FG_PCT_AVG_LAST_15_ALL_x", "ft_pct_avg_last_3_all_x": "FT_PCT_AVG_LAST_3_ALL_x", "ft_pct_avg_last_7_all_x": "FT_PCT_AVG_LAST_7_ALL_x", "ft_pct_avg_last_10_all_x": "FT_PCT_AVG_LAST_10_ALL_x", "ft_pct_avg_last_15_all_x": "FT_PCT_AVG_LAST_15_ALL_x", "fg3_pct_avg_last_3_all_x": "FG3_PCT_AVG_LAST_3_ALL_x", "fg3_pct_avg_last_7_all_x": "FG3_PCT_AVG_LAST_7_ALL_x", "fg3_pct_avg_last_10_all_x": "FG3_PCT_AVG_LAST_10_ALL_x", "fg3_pct_avg_last_15_all_x": "FG3_PCT_AVG_LAST_15_ALL_x", "ast_avg_last_3_all_x": "AST_AVG_LAST_3_ALL_x", "ast_avg_last_7_all_x": "AST_AVG_LAST_7_ALL_x", "ast_avg_last_10_all_x": "AST_AVG_LAST_10_ALL_x", "ast_avg_last_15_all_x": "AST_AVG_LAST_15_ALL_x", "reb_avg_last_3_all_x": "REB_AVG_LAST_3_ALL_x", "reb_avg_last_7_all_x": "REB_AVG_LAST_7_ALL_x", "reb_avg_last_10_all_x": "REB_AVG_LAST_10_ALL_x", "reb_avg_last_15_all_x": "REB_AVG_LAST_15_ALL_x", "pts_avg_last_3_all_minus_league_avg_x": "PTS_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "pts_avg_last_7_all_minus_league_avg_x": "PTS_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "pts_avg_last_10_all_minus_league_avg_x": "PTS_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "pts_avg_last_15_all_minus_league_avg_x": "PTS_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_3_all_minus_league_avg_x": "FG_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_7_all_minus_league_avg_x": "FG_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_10_all_minus_league_avg_x": "FG_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_15_all_minus_league_avg_x": "FG_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", 
"ft_pct_avg_last_3_all_minus_league_avg_x": "FT_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "ft_pct_avg_last_7_all_minus_league_avg_x": "FT_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "ft_pct_avg_last_10_all_minus_league_avg_x": "FT_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "ft_pct_avg_last_15_all_minus_league_avg_x": "FT_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_3_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_7_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_10_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_15_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_3_all_minus_league_avg_x": "AST_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_7_all_minus_league_avg_x": "AST_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_10_all_minus_league_avg_x": "AST_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_15_all_minus_league_avg_x": "AST_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_3_all_minus_league_avg_x": "REB_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_7_all_minus_league_avg_x": "REB_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_10_all_minus_league_avg_x": "REB_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_15_all_minus_league_avg_x": "REB_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "win_streak_y": "WIN_STREAK_y", "home_away_streak_y": "HOME_AWAY_STREAK_y", "team1_win_avg_last_3_all_y": "TEAM1_win_AVG_LAST_3_ALL_y", "team1_win_avg_last_7_all_y": "TEAM1_win_AVG_LAST_7_ALL_y", "team1_win_avg_last_10_all_y": "TEAM1_win_AVG_LAST_10_ALL_y", "team1_win_avg_last_15_all_y": "TEAM1_win_AVG_LAST_15_ALL_y", "pts_avg_last_3_all_y": "PTS_AVG_LAST_3_ALL_y", "pts_avg_last_7_all_y": "PTS_AVG_LAST_7_ALL_y", "pts_avg_last_10_all_y": "PTS_AVG_LAST_10_ALL_y", "pts_avg_last_15_all_y": "PTS_AVG_LAST_15_ALL_y", "fg_pct_avg_last_3_all_y": "FG_PCT_AVG_LAST_3_ALL_y", "fg_pct_avg_last_7_all_y": "FG_PCT_AVG_LAST_7_ALL_y", "fg_pct_avg_last_10_all_y": "FG_PCT_AVG_LAST_10_ALL_y", "fg_pct_avg_last_15_all_y": "FG_PCT_AVG_LAST_15_ALL_y", "ft_pct_avg_last_3_all_y": "FT_PCT_AVG_LAST_3_ALL_y", "ft_pct_avg_last_7_all_y": "FT_PCT_AVG_LAST_7_ALL_y", "ft_pct_avg_last_10_all_y": "FT_PCT_AVG_LAST_10_ALL_y", "ft_pct_avg_last_15_all_y": "FT_PCT_AVG_LAST_15_ALL_y", "fg3_pct_avg_last_3_all_y": "FG3_PCT_AVG_LAST_3_ALL_y", "fg3_pct_avg_last_7_all_y": "FG3_PCT_AVG_LAST_7_ALL_y", "fg3_pct_avg_last_10_all_y": "FG3_PCT_AVG_LAST_10_ALL_y", "fg3_pct_avg_last_15_all_y": "FG3_PCT_AVG_LAST_15_ALL_y", "ast_avg_last_3_all_y": "AST_AVG_LAST_3_ALL_y", "ast_avg_last_7_all_y": "AST_AVG_LAST_7_ALL_y", "ast_avg_last_10_all_y": "AST_AVG_LAST_10_ALL_y", "ast_avg_last_15_all_y": "AST_AVG_LAST_15_ALL_y", "reb_avg_last_3_all_y": "REB_AVG_LAST_3_ALL_y", "reb_avg_last_7_all_y": "REB_AVG_LAST_7_ALL_y", "reb_avg_last_10_all_y": "REB_AVG_LAST_10_ALL_y", "reb_avg_last_15_all_y": "REB_AVG_LAST_15_ALL_y", "pts_avg_last_3_all_minus_league_avg_y": "PTS_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "pts_avg_last_7_all_minus_league_avg_y": "PTS_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "pts_avg_last_10_all_minus_league_avg_y": "PTS_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "pts_avg_last_15_all_minus_league_avg_y": "PTS_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_3_all_minus_league_avg_y": "FG_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_7_all_minus_league_avg_y": "FG_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_10_all_minus_league_avg_y": 
"FG_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_15_all_minus_league_avg_y": "FG_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_3_all_minus_league_avg_y": "FT_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_7_all_minus_league_avg_y": "FT_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_10_all_minus_league_avg_y": "FT_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_15_all_minus_league_avg_y": "FT_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_3_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_7_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_10_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_15_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_3_all_minus_league_avg_y": "AST_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_7_all_minus_league_avg_y": "AST_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_10_all_minus_league_avg_y": "AST_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_15_all_minus_league_avg_y": "AST_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_3_all_minus_league_avg_y": "REB_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_7_all_minus_league_avg_y": "REB_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_10_all_minus_league_avg_y": "REB_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_15_all_minus_league_avg_y": "REB_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "win_streak_x_minus_y": "WIN_STREAK_x_minus_y", "home_away_streak_x_minus_y": "HOME_AWAY_STREAK_x_minus_y", "team1_win_avg_last_3_all_x_minus_y": "TEAM1_win_AVG_LAST_3_ALL_x_minus_y", "team1_win_avg_last_7_all_x_minus_y": "TEAM1_win_AVG_LAST_7_ALL_x_minus_y", "team1_win_avg_last_10_all_x_minus_y": "TEAM1_win_AVG_LAST_10_ALL_x_minus_y", "team1_win_avg_last_15_all_x_minus_y": "TEAM1_win_AVG_LAST_15_ALL_x_minus_y", "pts_avg_last_3_all_x_minus_y": "PTS_AVG_LAST_3_ALL_x_minus_y", "pts_avg_last_7_all_x_minus_y": "PTS_AVG_LAST_7_ALL_x_minus_y", "pts_avg_last_10_all_x_minus_y": "PTS_AVG_LAST_10_ALL_x_minus_y", "pts_avg_last_15_all_x_minus_y": "PTS_AVG_LAST_15_ALL_x_minus_y", "fg_pct_avg_last_3_all_x_minus_y": "FG_PCT_AVG_LAST_3_ALL_x_minus_y", "fg_pct_avg_last_7_all_x_minus_y": "FG_PCT_AVG_LAST_7_ALL_x_minus_y", "fg_pct_avg_last_10_all_x_minus_y": "FG_PCT_AVG_LAST_10_ALL_x_minus_y", "fg_pct_avg_last_15_all_x_minus_y": "FG_PCT_AVG_LAST_15_ALL_x_minus_y", "ft_pct_avg_last_3_all_x_minus_y": "FT_PCT_AVG_LAST_3_ALL_x_minus_y", "ft_pct_avg_last_7_all_x_minus_y": "FT_PCT_AVG_LAST_7_ALL_x_minus_y", "ft_pct_avg_last_10_all_x_minus_y": "FT_PCT_AVG_LAST_10_ALL_x_minus_y", "ft_pct_avg_last_15_all_x_minus_y": "FT_PCT_AVG_LAST_15_ALL_x_minus_y", "fg3_pct_avg_last_3_all_x_minus_y": "FG3_PCT_AVG_LAST_3_ALL_x_minus_y", "fg3_pct_avg_last_7_all_x_minus_y": "FG3_PCT_AVG_LAST_7_ALL_x_minus_y", "fg3_pct_avg_last_10_all_x_minus_y": "FG3_PCT_AVG_LAST_10_ALL_x_minus_y", "fg3_pct_avg_last_15_all_x_minus_y": "FG3_PCT_AVG_LAST_15_ALL_x_minus_y", "ast_avg_last_3_all_x_minus_y": "AST_AVG_LAST_3_ALL_x_minus_y", "ast_avg_last_7_all_x_minus_y": "AST_AVG_LAST_7_ALL_x_minus_y", "ast_avg_last_10_all_x_minus_y": "AST_AVG_LAST_10_ALL_x_minus_y", "ast_avg_last_15_all_x_minus_y": "AST_AVG_LAST_15_ALL_x_minus_y", "reb_avg_last_3_all_x_minus_y": "REB_AVG_LAST_3_ALL_x_minus_y", "reb_avg_last_7_all_x_minus_y": "REB_AVG_LAST_7_ALL_x_minus_y", "reb_avg_last_10_all_x_minus_y": "REB_AVG_LAST_10_ALL_x_minus_y", "reb_avg_last_15_all_x_minus_y": 
"REB_AVG_LAST_15_ALL_x_minus_y"}
--------------------------------------------------------------------------------
/notebooks/hopsworks.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "2f0c58a3-8738-42ad-be18-8e13c171108f",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import os\n",
11 | "\n",
12 | "import pandas as pd\n",
13 | "import numpy as np\n",
14 | "\n",
15 | "import hopsworks\n",
16 | "\n",
17 | "from datetime import datetime, timedelta\n",
18 | "from pytz import timezone\n",
19 | "\n",
20 | "from src.webscraping_a import (\n",
21 | " scrape_to_dataframe,\n",
22 | " convert_columns,\n",
23 | " combine_home_visitor, \n",
24 | " get_todays_matchups,\n",
25 | ")\n",
26 | "\n",
27 | "from src.data_processing import (\n",
28 | " process_games,\n",
29 | " add_TARGET,\n",
30 | ")\n",
31 | "\n",
32 | "from src.feature_engineering import (\n",
33 | " process_features,\n",
34 | ")\n",
35 | "\n",
36 | "from src.hopsworks_utils import (\n",
37 | " save_feature_names,\n",
38 | " convert_feature_names,\n",
39 | ")\n",
40 | "\n",
41 | "import json\n",
42 | "\n",
43 | "from pathlib import Path #for Windows/Linux compatibility\n",
44 | "DATAPATH = Path(r'data')"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 2,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "from dotenv import load_dotenv\n",
54 | "\n",
55 | "load_dotenv()\n",
56 | "\n",
57 | "try:\n",
58 | " HOPSWORKS_API_KEY = os.environ['HOPSWORKS_API_KEY']\n",
59 | "except:\n",
60 | " raise Exception('Set environment variable HOPSWORKS_API_KEY')"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": 3,
66 | "id": "ba69a88c-691e-4f8d-b68e-e79f5582a939",
67 | "metadata": {},
68 | "outputs": [
69 | {
70 | "name": "stdout",
71 | "output_type": "stream",
72 | "text": [
73 | "Connected. Call `.close()` to terminate connection gracefully.\n",
74 | "\n",
75 | "Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/3350\n",
76 | "Connected. Call `.close()` to terminate connection gracefully.\n"
77 | ]
78 | },
79 | {
80 | "name": "stderr",
81 | "output_type": "stream",
82 | "text": [
83 | "DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses\n"
84 | ]
85 | }
86 | ],
87 | "source": [
88 | "project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)\n",
89 | "fs = project.get_feature_store()"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 4,
95 | "id": "641d97e5-0888-42c3-a9b1-97c9ef2b7082",
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "rolling_stats_fg = fs.get_feature_group(\n",
100 | " name=\"rolling_stats\",\n",
101 | " version=1,\n",
102 | ")"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 5,
108 | "id": "69a2b254",
109 | "metadata": {},
110 | "outputs": [
111 | {
112 | "name": "stdout",
113 | "output_type": "stream",
114 | "text": [
115 | "2023-01-31 08:30:10,072 INFO: USE `nba_predictor_featurestore`\n",
116 | "2023-01-31 08:30:10,441 INFO: SELECT `fg0`.`game_date_est` `game_date_est`, `fg0`.`game_id` `game_id`, `fg0`.`home_team_id` `home_team_id`, `fg0`.`visitor_team_id` `visitor_team_id`, `fg0`.`season` `season`, `fg0`.`pts_home` `pts_home`, `fg0`.`fg_pct_home` `fg_pct_home`, `fg0`.`ft_pct_home` `ft_pct_home`, `fg0`.`fg3_pct_home` `fg3_pct_home`, `fg0`.`ast_home` `ast_home`, `fg0`.`reb_home` `reb_home`, `fg0`.`pts_away` `pts_away`, `fg0`.`fg_pct_away` `fg_pct_away`, `fg0`.`ft_pct_away` `ft_pct_away`, `fg0`.`fg3_pct_away` `fg3_pct_away`, `fg0`.`ast_away` `ast_away`, `fg0`.`reb_away` `reb_away`, `fg0`.`home_team_wins` `home_team_wins`\n",
117 | "FROM `nba_predictor_featurestore`.`rolling_stats_1` `fg0`\n"
118 | ]
119 | },
120 | {
121 | "name": "stderr",
122 | "output_type": "stream",
123 | "text": [
124 | "UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy\n"
125 | ]
126 | },
127 | {
128 | "data": {
129 | "text/html": [
130 | "\n",
131 | "\n",
144 | "
\n",
145 | " \n",
146 | " \n",
147 | " | \n",
148 | " GAME_DATE_EST | \n",
149 | " GAME_ID | \n",
150 | " HOME_TEAM_ID | \n",
151 | " VISITOR_TEAM_ID | \n",
152 | " SEASON | \n",
153 | " PTS_home | \n",
154 | " FG_PCT_home | \n",
155 | " FT_PCT_home | \n",
156 | " FG3_PCT_home | \n",
157 | " AST_home | \n",
158 | " REB_home | \n",
159 | " PTS_away | \n",
160 | " FG_PCT_away | \n",
161 | " FT_PCT_away | \n",
162 | " FG3_PCT_away | \n",
163 | " AST_away | \n",
164 | " REB_away | \n",
165 | " HOME_TEAM_WINS | \n",
166 | "
\n",
167 | " \n",
168 | " \n",
169 | " \n",
170 | " 1384 | \n",
171 | " 2023-01-31 | \n",
172 | " 22200765 | \n",
173 | " 1610612752 | \n",
174 | " 1610612747 | \n",
175 | " 2022 | \n",
176 | " 0 | \n",
177 | " 0.0 | \n",
178 | " 0.0 | \n",
179 | " 0.0 | \n",
180 | " 0 | \n",
181 | " 0 | \n",
182 | " 0 | \n",
183 | " 0.0 | \n",
184 | " 0.0 | \n",
185 | " 0.0 | \n",
186 | " 0 | \n",
187 | " 0 | \n",
188 | " 0 | \n",
189 | "
\n",
190 | " \n",
191 | " 3042 | \n",
192 | " 2023-01-31 | \n",
193 | " 22200767 | \n",
194 | " 1610612749 | \n",
195 | " 1610612766 | \n",
196 | " 2022 | \n",
197 | " 0 | \n",
198 | " 0.0 | \n",
199 | " 0.0 | \n",
200 | " 0.0 | \n",
201 | " 0 | \n",
202 | " 0 | \n",
203 | " 0 | \n",
204 | " 0.0 | \n",
205 | " 0.0 | \n",
206 | " 0.0 | \n",
207 | " 0 | \n",
208 | " 0 | \n",
209 | " 0 | \n",
210 | "
\n",
211 | " \n",
212 | " 8239 | \n",
213 | " 2023-01-31 | \n",
214 | " 22200764 | \n",
215 | " 1610612739 | \n",
216 | " 1610612748 | \n",
217 | " 2022 | \n",
218 | " 0 | \n",
219 | " 0.0 | \n",
220 | " 0.0 | \n",
221 | " 0.0 | \n",
222 | " 0 | \n",
223 | " 0 | \n",
224 | " 0 | \n",
225 | " 0.0 | \n",
226 | " 0.0 | \n",
227 | " 0.0 | \n",
228 | " 0 | \n",
229 | " 0 | \n",
230 | " 0 | \n",
231 | "
\n",
232 | " \n",
233 | " 8907 | \n",
234 | " 2023-01-31 | \n",
235 | " 22200766 | \n",
236 | " 1610612741 | \n",
237 | " 1610612746 | \n",
238 | " 2022 | \n",
239 | " 0 | \n",
240 | " 0.0 | \n",
241 | " 0.0 | \n",
242 | " 0.0 | \n",
243 | " 0 | \n",
244 | " 0 | \n",
245 | " 0 | \n",
246 | " 0.0 | \n",
247 | " 0.0 | \n",
248 | " 0.0 | \n",
249 | " 0 | \n",
250 | " 0 | \n",
251 | " 0 | \n",
252 | "
\n",
253 | " \n",
254 | " 21628 | \n",
255 | " 2023-01-31 | \n",
256 | " 22200768 | \n",
257 | " 1610612743 | \n",
258 | " 1610612740 | \n",
259 | " 2022 | \n",
260 | " 0 | \n",
261 | " 0.0 | \n",
262 | " 0.0 | \n",
263 | " 0.0 | \n",
264 | " 0 | \n",
265 | " 0 | \n",
266 | " 0 | \n",
267 | " 0.0 | \n",
268 | " 0.0 | \n",
269 | " 0.0 | \n",
270 | " 0 | \n",
271 | " 0 | \n",
272 | " 0 | \n",
273 | "
\n",
274 | " \n",
275 | "
\n",
276 | "
"
277 | ],
278 | "text/plain": [
279 | " GAME_DATE_EST GAME_ID HOME_TEAM_ID VISITOR_TEAM_ID SEASON \\\n",
280 | "1384 2023-01-31 22200765 1610612752 1610612747 2022 \n",
281 | "3042 2023-01-31 22200767 1610612749 1610612766 2022 \n",
282 | "8239 2023-01-31 22200764 1610612739 1610612748 2022 \n",
283 | "8907 2023-01-31 22200766 1610612741 1610612746 2022 \n",
284 | "21628 2023-01-31 22200768 1610612743 1610612740 2022 \n",
285 | "\n",
286 | " PTS_home FG_PCT_home FT_PCT_home FG3_PCT_home AST_home REB_home \\\n",
287 | "1384 0 0.0 0.0 0.0 0 0 \n",
288 | "3042 0 0.0 0.0 0.0 0 0 \n",
289 | "8239 0 0.0 0.0 0.0 0 0 \n",
290 | "8907 0 0.0 0.0 0.0 0 0 \n",
291 | "21628 0 0.0 0.0 0.0 0 0 \n",
292 | "\n",
293 | " PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away \\\n",
294 | "1384 0 0.0 0.0 0.0 0 0 \n",
295 | "3042 0 0.0 0.0 0.0 0 0 \n",
296 | "8239 0 0.0 0.0 0.0 0 0 \n",
297 | "8907 0 0.0 0.0 0.0 0 0 \n",
298 | "21628 0 0.0 0.0 0.0 0 0 \n",
299 | "\n",
300 | " HOME_TEAM_WINS \n",
301 | "1384 0 \n",
302 | "3042 0 \n",
303 | "8239 0 \n",
304 | "8907 0 \n",
305 | "21628 0 "
306 | ]
307 | },
308 | "execution_count": 5,
309 | "metadata": {},
310 | "output_type": "execute_result"
311 | }
312 | ],
313 | "source": [
314 | "BASE_FEATURES = ['game_date_est',\n",
315 | " 'game_id',\n",
316 | " 'home_team_id',\n",
317 | " 'visitor_team_id',\n",
318 | " 'season',\n",
319 | " 'pts_home',\n",
320 | " 'fg_pct_home',\n",
321 | " 'ft_pct_home',\n",
322 | " 'fg3_pct_home',\n",
323 | " 'ast_home',\n",
324 | " 'reb_home',\n",
325 | " 'pts_away',\n",
326 | " 'fg_pct_away',\n",
327 | " 'ft_pct_away',\n",
328 | " 'fg3_pct_away',\n",
329 | " 'ast_away',\n",
330 | " 'reb_away',\n",
331 | " 'home_team_wins',\n",
332 | "]\n",
333 | "\n",
334 | "ds_query = rolling_stats_fg.select(BASE_FEATURES)\n",
335 | "df_old = ds_query.read()\n",
336 | "df_old = convert_feature_names(df_old)\n",
337 | "df_old[df_old['PTS_home'] == 0]\n"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": 6,
343 | "metadata": {},
344 | "outputs": [
345 | {
346 | "data": {
347 | "text/html": [
348 | "\n",
349 | "\n",
362 | "
\n",
363 | " \n",
364 | " \n",
365 | " | \n",
366 | " GAME_DATE_EST | \n",
367 | " GAME_ID | \n",
368 | " HOME_TEAM_ID | \n",
369 | " VISITOR_TEAM_ID | \n",
370 | " SEASON | \n",
371 | " PTS_home | \n",
372 | " FG_PCT_home | \n",
373 | " FT_PCT_home | \n",
374 | " FG3_PCT_home | \n",
375 | " AST_home | \n",
376 | " REB_home | \n",
377 | " PTS_away | \n",
378 | " FG_PCT_away | \n",
379 | " FT_PCT_away | \n",
380 | " FG3_PCT_away | \n",
381 | " AST_away | \n",
382 | " REB_away | \n",
383 | " HOME_TEAM_WINS | \n",
384 | "
\n",
385 | " \n",
386 | " \n",
387 | " \n",
388 | " 1384 | \n",
389 | " 2023-01-31 | \n",
390 | " 22200765 | \n",
391 | " 1610612752 | \n",
392 | " 1610612747 | \n",
393 | " 2022 | \n",
394 | " 0 | \n",
395 | " 0.00000 | \n",
396 | " 0.0000 | \n",
397 | " 0.000000 | \n",
398 | " 0 | \n",
399 | " 0 | \n",
400 | " 0 | \n",
401 | " 0.00000 | \n",
402 | " 0.0000 | \n",
403 | " 0.000000 | \n",
404 | " 0 | \n",
405 | " 0 | \n",
406 | " 0 | \n",
407 | "
\n",
408 | " \n",
409 | " 1705 | \n",
410 | " 2023-01-30 | \n",
411 | " 22200759 | \n",
412 | " 1610612744 | \n",
413 | " 1610612760 | \n",
414 | " 2022 | \n",
415 | " 128 | \n",
416 | " 51.09375 | \n",
417 | " 80.0000 | \n",
418 | " 42.593750 | \n",
419 | " 37 | \n",
420 | " 44 | \n",
421 | " 120 | \n",
422 | " 49.50000 | \n",
423 | " 85.0000 | \n",
424 | " 45.812500 | \n",
425 | " 23 | \n",
426 | " 43 | \n",
427 | " 1 | \n",
428 | "
\n",
429 | " \n",
430 | " 1755 | \n",
431 | " 2023-01-29 | \n",
432 | " 22200754 | \n",
433 | " 1610612746 | \n",
434 | " 1610612739 | \n",
435 | " 2022 | \n",
436 | " 99 | \n",
437 | " 41.59375 | \n",
438 | " 85.1875 | \n",
439 | " 10.500000 | \n",
440 | " 17 | \n",
441 | " 45 | \n",
442 | " 122 | \n",
443 | " 56.40625 | \n",
444 | " 73.6875 | \n",
445 | " 60.593750 | \n",
446 | " 35 | \n",
447 | " 35 | \n",
448 | " 0 | \n",
449 | "
\n",
450 | " \n",
451 | " 3042 | \n",
452 | " 2023-01-31 | \n",
453 | " 22200767 | \n",
454 | " 1610612749 | \n",
455 | " 1610612766 | \n",
456 | " 2022 | \n",
457 | " 0 | \n",
458 | " 0.00000 | \n",
459 | " 0.0000 | \n",
460 | " 0.000000 | \n",
461 | " 0 | \n",
462 | " 0 | \n",
463 | " 0 | \n",
464 | " 0.00000 | \n",
465 | " 0.0000 | \n",
466 | " 0.000000 | \n",
467 | " 0 | \n",
468 | " 0 | \n",
469 | " 0 | \n",
470 | "
\n",
471 | " \n",
472 | " 3825 | \n",
473 | " 2023-01-30 | \n",
474 | " 22200756 | \n",
475 | " 1610612753 | \n",
476 | " 1610612755 | \n",
477 | " 2022 | \n",
478 | " 119 | \n",
479 | " 42.40625 | \n",
480 | " 80.0000 | \n",
481 | " 37.906250 | \n",
482 | " 23 | \n",
483 | " 50 | \n",
484 | " 109 | \n",
485 | " 47.09375 | \n",
486 | " 78.3125 | \n",
487 | " 36.687500 | \n",
488 | " 28 | \n",
489 | " 45 | \n",
490 | " 1 | \n",
491 | "
\n",
492 | " \n",
493 | " 5048 | \n",
494 | " 2023-01-30 | \n",
495 | " 22200763 | \n",
496 | " 1610612737 | \n",
497 | " 1610612757 | \n",
498 | " 2022 | \n",
499 | " 125 | \n",
500 | " 46.81250 | \n",
501 | " 80.0000 | \n",
502 | " 43.312500 | \n",
503 | " 26 | \n",
504 | " 45 | \n",
505 | " 129 | \n",
506 | " 54.40625 | \n",
507 | " 88.8750 | \n",
508 | " 47.500000 | \n",
509 | " 24 | \n",
510 | " 32 | \n",
511 | " 0 | \n",
512 | "
\n",
513 | " \n",
514 | " 5716 | \n",
515 | " 2023-01-29 | \n",
516 | " 22200753 | \n",
517 | " 1610612754 | \n",
518 | " 1610612763 | \n",
519 | " 2022 | \n",
520 | " 100 | \n",
521 | " 45.59375 | \n",
522 | " 90.5000 | \n",
523 | " 28.093750 | \n",
524 | " 21 | \n",
525 | " 38 | \n",
526 | " 112 | \n",
527 | " 48.31250 | \n",
528 | " 77.3125 | \n",
529 | " 24.296875 | \n",
530 | " 29 | \n",
531 | " 41 | \n",
532 | " 0 | \n",
533 | "
\n",
534 | " \n",
535 | " 8239 | \n",
536 | " 2023-01-31 | \n",
537 | " 22200764 | \n",
538 | " 1610612739 | \n",
539 | " 1610612748 | \n",
540 | " 2022 | \n",
541 | " 0 | \n",
542 | " 0.00000 | \n",
543 | " 0.0000 | \n",
544 | " 0.000000 | \n",
545 | " 0 | \n",
546 | " 0 | \n",
547 | " 0 | \n",
548 | " 0.00000 | \n",
549 | " 0.0000 | \n",
550 | " 0.000000 | \n",
551 | " 0 | \n",
552 | " 0 | \n",
553 | " 0 | \n",
554 | "
\n",
555 | " \n",
556 | " 8381 | \n",
557 | " 2023-01-29 | \n",
558 | " 22200752 | \n",
559 | " 1610612748 | \n",
560 | " 1610612766 | \n",
561 | " 2022 | \n",
562 | " 117 | \n",
563 | " 48.40625 | \n",
564 | " 73.8750 | \n",
565 | " 32.312500 | \n",
566 | " 26 | \n",
567 | " 36 | \n",
568 | " 122 | \n",
569 | " 54.18750 | \n",
570 | " 77.3125 | \n",
571 | " 37.500000 | \n",
572 | " 25 | \n",
573 | " 47 | \n",
574 | " 0 | \n",
575 | "
\n",
576 | " \n",
577 | " 8727 | \n",
578 | " 2023-01-30 | \n",
579 | " 22200762 | \n",
580 | " 1610612761 | \n",
581 | " 1610612756 | \n",
582 | " 2022 | \n",
583 | " 106 | \n",
584 | " 44.90625 | \n",
585 | " 89.5000 | \n",
586 | " 27.296875 | \n",
587 | " 19 | \n",
588 | " 44 | \n",
589 | " 114 | \n",
590 | " 49.40625 | \n",
591 | " 81.0000 | \n",
592 | " 39.312500 | \n",
593 | " 28 | \n",
594 | " 42 | \n",
595 | " 0 | \n",
596 | "
\n",
597 | " \n",
598 | " 8907 | \n",
599 | " 2023-01-31 | \n",
600 | " 22200766 | \n",
601 | " 1610612741 | \n",
602 | " 1610612746 | \n",
603 | " 2022 | \n",
604 | " 0 | \n",
605 | " 0.00000 | \n",
606 | " 0.0000 | \n",
607 | " 0.000000 | \n",
608 | " 0 | \n",
609 | " 0 | \n",
610 | " 0 | \n",
611 | " 0.00000 | \n",
612 | " 0.0000 | \n",
613 | " 0.000000 | \n",
614 | " 0 | \n",
615 | " 0 | \n",
616 | " 0 | \n",
617 | "
\n",
618 | " \n",
619 | " 9991 | \n",
620 | " 2023-01-30 | \n",
621 | " 22200758 | \n",
622 | " 1610612758 | \n",
623 | " 1610612750 | \n",
624 | " 2022 | \n",
625 | " 118 | \n",
626 | " 47.50000 | \n",
627 | " 68.1875 | \n",
628 | " 30.000000 | \n",
629 | " 20 | \n",
630 | " 50 | \n",
631 | " 111 | \n",
632 | " 46.68750 | \n",
633 | " 52.0000 | \n",
634 | " 30.796875 | \n",
635 | " 26 | \n",
636 | " 49 | \n",
637 | " 1 | \n",
638 | "
\n",
639 | " \n",
640 | " 10092 | \n",
641 | " 2023-01-30 | \n",
642 | " 22200760 | \n",
643 | " 1610612764 | \n",
644 | " 1610612759 | \n",
645 | " 2022 | \n",
646 | " 127 | \n",
647 | " 55.81250 | \n",
648 | " 88.1875 | \n",
649 | " 53.312500 | \n",
650 | " 32 | \n",
651 | " 45 | \n",
652 | " 106 | \n",
653 | " 43.31250 | \n",
654 | " 65.1875 | \n",
655 | " 24.093750 | \n",
656 | " 26 | \n",
657 | " 43 | \n",
658 | " 1 | \n",
659 | "
\n",
660 | " \n",
661 | " 15857 | \n",
662 | " 2023-01-30 | \n",
663 | " 22200757 | \n",
664 | " 1610612747 | \n",
665 | " 1610612751 | \n",
666 | " 2022 | \n",
667 | " 104 | \n",
668 | " 39.31250 | \n",
669 | " 62.1875 | \n",
670 | " 37.906250 | \n",
671 | " 24 | \n",
672 | " 53 | \n",
673 | " 121 | \n",
674 | " 47.40625 | \n",
675 | " 78.8750 | \n",
676 | " 39.000000 | \n",
677 | " 24 | \n",
678 | " 50 | \n",
679 | " 0 | \n",
680 | "
\n",
681 | " \n",
682 | " 15938 | \n",
683 | " 2023-01-30 | \n",
684 | " 22200761 | \n",
685 | " 1610612765 | \n",
686 | " 1610612742 | \n",
687 | " 2022 | \n",
688 | " 105 | \n",
689 | " 45.59375 | \n",
690 | " 70.3750 | \n",
691 | " 35.906250 | \n",
692 | " 18 | \n",
693 | " 40 | \n",
694 | " 111 | \n",
695 | " 49.40625 | \n",
696 | " 69.6875 | \n",
697 | " 29.406250 | \n",
698 | " 17 | \n",
699 | " 40 | \n",
700 | " 0 | \n",
701 | "
\n",
702 | " \n",
703 | " 19613 | \n",
704 | " 2023-01-29 | \n",
705 | " 22200755 | \n",
706 | " 1610612740 | \n",
707 | " 1610612749 | \n",
708 | " 2022 | \n",
709 | " 110 | \n",
710 | " 44.09375 | \n",
711 | " 71.3750 | \n",
712 | " 38.187500 | \n",
713 | " 25 | \n",
714 | " 38 | \n",
715 | " 135 | \n",
716 | " 55.18750 | \n",
717 | " 60.0000 | \n",
718 | " 39.500000 | \n",
719 | " 27 | \n",
720 | " 57 | \n",
721 | " 0 | \n",
722 | "
\n",
723 | " \n",
724 | " 21628 | \n",
725 | " 2023-01-31 | \n",
726 | " 22200768 | \n",
727 | " 1610612743 | \n",
728 | " 1610612740 | \n",
729 | " 2022 | \n",
730 | " 0 | \n",
731 | " 0.00000 | \n",
732 | " 0.0000 | \n",
733 | " 0.000000 | \n",
734 | " 0 | \n",
735 | " 0 | \n",
736 | " 0 | \n",
737 | " 0.00000 | \n",
738 | " 0.0000 | \n",
739 | " 0.000000 | \n",
740 | " 0 | \n",
741 | " 0 | \n",
742 | " 0 | \n",
743 | "
\n",
744 | " \n",
745 | "
\n",
746 | "
"
747 | ],
748 | "text/plain": [
749 | " GAME_DATE_EST GAME_ID HOME_TEAM_ID VISITOR_TEAM_ID SEASON \\\n",
750 | "1384 2023-01-31 22200765 1610612752 1610612747 2022 \n",
751 | "1705 2023-01-30 22200759 1610612744 1610612760 2022 \n",
752 | "1755 2023-01-29 22200754 1610612746 1610612739 2022 \n",
753 | "3042 2023-01-31 22200767 1610612749 1610612766 2022 \n",
754 | "3825 2023-01-30 22200756 1610612753 1610612755 2022 \n",
755 | "5048 2023-01-30 22200763 1610612737 1610612757 2022 \n",
756 | "5716 2023-01-29 22200753 1610612754 1610612763 2022 \n",
757 | "8239 2023-01-31 22200764 1610612739 1610612748 2022 \n",
758 | "8381 2023-01-29 22200752 1610612748 1610612766 2022 \n",
759 | "8727 2023-01-30 22200762 1610612761 1610612756 2022 \n",
760 | "8907 2023-01-31 22200766 1610612741 1610612746 2022 \n",
761 | "9991 2023-01-30 22200758 1610612758 1610612750 2022 \n",
762 | "10092 2023-01-30 22200760 1610612764 1610612759 2022 \n",
763 | "15857 2023-01-30 22200757 1610612747 1610612751 2022 \n",
764 | "15938 2023-01-30 22200761 1610612765 1610612742 2022 \n",
765 | "19613 2023-01-29 22200755 1610612740 1610612749 2022 \n",
766 | "21628 2023-01-31 22200768 1610612743 1610612740 2022 \n",
767 | "\n",
768 | " PTS_home FG_PCT_home FT_PCT_home FG3_PCT_home AST_home REB_home \\\n",
769 | "1384 0 0.00000 0.0000 0.000000 0 0 \n",
770 | "1705 128 51.09375 80.0000 42.593750 37 44 \n",
771 | "1755 99 41.59375 85.1875 10.500000 17 45 \n",
772 | "3042 0 0.00000 0.0000 0.000000 0 0 \n",
773 | "3825 119 42.40625 80.0000 37.906250 23 50 \n",
774 | "5048 125 46.81250 80.0000 43.312500 26 45 \n",
775 | "5716 100 45.59375 90.5000 28.093750 21 38 \n",
776 | "8239 0 0.00000 0.0000 0.000000 0 0 \n",
777 | "8381 117 48.40625 73.8750 32.312500 26 36 \n",
778 | "8727 106 44.90625 89.5000 27.296875 19 44 \n",
779 | "8907 0 0.00000 0.0000 0.000000 0 0 \n",
780 | "9991 118 47.50000 68.1875 30.000000 20 50 \n",
781 | "10092 127 55.81250 88.1875 53.312500 32 45 \n",
782 | "15857 104 39.31250 62.1875 37.906250 24 53 \n",
783 | "15938 105 45.59375 70.3750 35.906250 18 40 \n",
784 | "19613 110 44.09375 71.3750 38.187500 25 38 \n",
785 | "21628 0 0.00000 0.0000 0.000000 0 0 \n",
786 | "\n",
787 | " PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away \\\n",
788 | "1384 0 0.00000 0.0000 0.000000 0 0 \n",
789 | "1705 120 49.50000 85.0000 45.812500 23 43 \n",
790 | "1755 122 56.40625 73.6875 60.593750 35 35 \n",
791 | "3042 0 0.00000 0.0000 0.000000 0 0 \n",
792 | "3825 109 47.09375 78.3125 36.687500 28 45 \n",
793 | "5048 129 54.40625 88.8750 47.500000 24 32 \n",
794 | "5716 112 48.31250 77.3125 24.296875 29 41 \n",
795 | "8239 0 0.00000 0.0000 0.000000 0 0 \n",
796 | "8381 122 54.18750 77.3125 37.500000 25 47 \n",
797 | "8727 114 49.40625 81.0000 39.312500 28 42 \n",
798 | "8907 0 0.00000 0.0000 0.000000 0 0 \n",
799 | "9991 111 46.68750 52.0000 30.796875 26 49 \n",
800 | "10092 106 43.31250 65.1875 24.093750 26 43 \n",
801 | "15857 121 47.40625 78.8750 39.000000 24 50 \n",
802 | "15938 111 49.40625 69.6875 29.406250 17 40 \n",
803 | "19613 135 55.18750 60.0000 39.500000 27 57 \n",
804 | "21628 0 0.00000 0.0000 0.000000 0 0 \n",
805 | "\n",
806 | " HOME_TEAM_WINS \n",
807 | "1384 0 \n",
808 | "1705 1 \n",
809 | "1755 0 \n",
810 | "3042 0 \n",
811 | "3825 1 \n",
812 | "5048 0 \n",
813 | "5716 0 \n",
814 | "8239 0 \n",
815 | "8381 0 \n",
816 | "8727 0 \n",
817 | "8907 0 \n",
818 | "9991 1 \n",
819 | "10092 1 \n",
820 | "15857 0 \n",
821 | "15938 0 \n",
822 | "19613 0 \n",
823 | "21628 0 "
824 | ]
825 | },
826 | "execution_count": 6,
827 | "metadata": {},
828 | "output_type": "execute_result"
829 | }
830 | ],
831 | "source": [
832 | "df_old[df_old['GAME_ID'] > 22200751]"
833 | ]
834 | }
835 | ],
836 | "metadata": {
837 | "kernelspec": {
838 | "display_name": "nba3",
839 | "language": "python",
840 | "name": "python3"
841 | },
842 | "language_info": {
843 | "codemirror_mode": {
844 | "name": "ipython",
845 | "version": 3
846 | },
847 | "file_extension": ".py",
848 | "mimetype": "text/x-python",
849 | "name": "python",
850 | "nbconvert_exporter": "python",
851 | "pygments_lexer": "ipython3",
852 | "version": "3.9.13"
853 | },
854 | "orig_nbformat": 4,
855 | "vscode": {
856 | "interpreter": {
857 | "hash": "4655998f62ad965cbd25df51edb717f2326f5df53d53899f0ae604225aa5ae06"
858 | }
859 | }
860 | },
861 | "nbformat": 4,
862 | "nbformat_minor": 2
863 | }
864 |
--------------------------------------------------------------------------------
/notebooks/lightgbm.json:
--------------------------------------------------------------------------------
1 | {"lambda_l1": 0.014353286456597142, "lambda_l2": 1.8797714715567863e-05, "learning_rate": 0.0074219704795943425, "max_depth": 5, "n_estimators": 8165, "feature_fraction": 0.7165888165225329, "bagging_fraction": 0.41679785470194425, "bagging_freq": 5, "num_leaves": 965, "min_child_samples": 94, "min_data_per_groups": 89}
--------------------------------------------------------------------------------
/notebooks/model_data.json:
--------------------------------------------------------------------------------
1 | {"model_name": "xgboost", "calibration_method": "Model + Isotonic", "brier_loss": 0.22892725648801726, "metrics": {"AUC": 0.6314368770764119, "Accuracy": 0.6584507042253521}}
--------------------------------------------------------------------------------
/notebooks/xgboost.json:
--------------------------------------------------------------------------------
1 | {"num_round": 868, "learning_rate": 0.018405745552075914, "max_bin": 558, "max_depth": 2, "alpha": 8.281713514214621, "gamma": 11.304706114754095, "reg_lambda": 11.806523119478767, "colsample_bytree": 0.4339252336236323, "subsample": 0.5326856953922422, "min_child_weight": 6.179632416051588, "scale_pos_weight": 1}
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | alembic==1.8.1
2 | altair==4.2.0
3 | asttokens==2.2.0
4 | async-generator==1.10
5 | attrs==22.1.0
6 | autopage==0.5.1
7 | avro==1.10.2
8 | backcall==0.2.0
9 | beautifulsoup4==4.11.1
10 | blinker==1.5
11 | boto3==1.26.20
12 | botocore==1.29.20
13 | cachetools==5.2.0
14 | certifi==2022.9.24
15 | cffi==1.15.1
16 | charset-normalizer==2.1.1
17 | click==8.1.3
18 | cliff==4.1.0
19 | cloudpickle==2.2.0
20 | cmaes==0.9.0
21 | cmd2==2.4.2
22 | colorama==0.4.6
23 | colorlog==6.7.0
24 | commonmark==0.9.1
25 | confluent-kafka==1.8.2
26 | contourpy==1.0.6
27 | cryptography==38.0.4
28 | cycler==0.11.0
29 | debugpy==1.6.4
30 | decorator==5.1.1
31 | entrypoints==0.4
32 | exceptiongroup==1.0.4
33 | executing==1.2.0
34 | fastavro==1.4.11
35 | fastjsonschema==2.16.2
36 | fonttools==4.38.0
37 | furl==2.1.3
38 | future==0.18.2
39 | gitdb==4.0.10
40 | GitPython==3.1.29
41 | great-expectations==0.14.12
42 | greenlet==2.0.1
43 | h11==0.14.0
44 | hopsworks==3.0.4
45 | hsfs==3.0.5
46 | hsml==3.0.3
47 | html5lib==1.1
48 | idna==3.4
49 | importlib-metadata==5.1.0
50 | importlib-resources==5.10.0
51 | ipykernel==6.17.1
52 | ipython==8.7.0
53 | ipywidgets==8.0.2
54 | javaobj-py3==0.4.3
55 | jedi==0.18.2
56 | Jinja2==3.0.3
57 | jmespath==1.0.1
58 | joblib==1.2.0
59 | jsonpatch==1.32
60 | jsonpointer==2.3
61 | jsonschema==4.17.3
62 | jupyter_client==7.4.7
63 | jupyter_core==5.1.0
64 | jupyterlab-widgets==3.0.3
65 | kiwisolver==1.4.4
66 | lightgbm==3.3.2
67 | llvmlite==0.39.1
68 | lxml==4.9.1
69 | Mako==1.2.4
70 | MarkupSafe==2.0.1
71 | matplotlib==3.6.2
72 | matplotlib-inline==0.1.6
73 | mistune==0.8.4
74 | mock==4.0.3
75 | nbformat==5.7.0
76 | nest-asyncio==1.5.6
77 | numba==0.56.4
78 | numpy==1.22.4
79 | optuna==2.10.1
80 | orderedmultidict==1.0.1
81 | outcome==1.2.0
82 | packaging==21.3
83 | pandas==1.4.3
84 | parso==0.8.3
85 | pbr==5.11.0
86 | pickleshare==0.7.5
87 | Pillow==9.3.0
88 | platformdirs==2.5.4
89 | plotly==5.9.0
90 | prettytable==3.5.0
91 | prompt-toolkit==3.0.33
92 | protobuf==3.20.3
93 | psutil==5.9.4
94 | pure-eval==0.2.2
95 | pyarrow==10.0.1
96 | pyasn1==0.4.8
97 | pyasn1-modules==0.2.8
98 | pycparser==2.21
99 | pycryptodomex==3.16.0
100 | pydeck==0.8.0
101 | Pygments==2.13.0
102 | PyHopsHive==0.6.4.1.dev0
103 | pyhumps==1.6.1
104 | pyjks==20.0.0
105 | Pympler==1.0.1
106 | PyMySQL==1.0.2
107 | pyparsing==2.4.7
108 | pyperclip==1.8.2
109 | pyreadline3==3.4.1
110 | pyrsistent==0.19.2
111 | PySocks==1.7.1
112 | python-dateutil==2.8.2
113 | python-dotenv==0.21.0
114 | pytz==2022.6
115 | pytz-deprecation-shim==0.1.0.post0
116 | PyVirtualDisplay==3.0
117 | PyYAML==6.0
118 | pyzmq==24.0.1
119 | requests==2.28.1
120 | rich==12.6.0
121 | ruamel.yaml==0.17.17
122 | ruamel.yaml.clib==0.2.7
123 | s3transfer==0.6.0
124 | scikit-learn==1.1.3
125 | scipy==1.9.3
126 | seaborn==0.11.2
127 | selenium==4.6.0
128 | semver==2.13.0
129 | shap==0.41.0
130 | six==1.16.0
131 | slicer==0.0.7
132 | smmap==5.0.0
133 | sniffio==1.3.0
134 | sortedcontainers==2.4.0
135 | soupsieve==2.3.2.post1
136 | SQLAlchemy==1.4.44
137 | stack-data==0.6.2
138 | stevedore==4.1.1
139 | streamlit==1.15.2
140 | sweetviz==2.1.4
141 | tenacity==8.1.0
142 | termcolor==2.1.1
143 | threadpoolctl==3.1.0
144 | thrift==0.16.0
145 | toml==0.10.2
146 | toolz==0.12.0
147 | tornado==6.2
148 | tqdm==4.64.0
149 | traitlets==5.6.0
150 | trio==0.22.0
151 | trio-websocket==0.9.2
152 | twofish==0.3.0
153 | typing_extensions==4.4.0
154 | tzdata==2022.7
155 | tzlocal==4.2
156 | urllib3==1.26.13
157 | validators==0.20.0
158 | watchdog==2.2.0
159 | wcwidth==0.2.5
160 | webdriver-manager==3.8.4
161 | webencodings==0.5.1
162 | widgetsnbextension==4.0.3
163 | wsproto==1.2.0
164 | xgboost==1.6.1
165 | zipp==3.11.0
166 |
--------------------------------------------------------------------------------
/src/common_functions.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 |
4 | import itertools
5 |
6 | import matplotlib.pyplot as plt
7 | from matplotlib.colors import TwoSlopeNorm
8 |
9 | import sweetviz as sv
10 |
11 | from datetime import datetime
12 |
13 |
14 | def plot_corr_barchart(df1: pd.DataFrame, drop_cols: list, n: int = 30) -> None:
15 | """
16 | Plots a color-gradient bar chart showing top n correlations between features
17 |
18 | Args:
19 | df1 (pd.DataFrame): the dataframe to plot
20 | drop_cols (list): list of columns not to include in plot
21 | n (int): number of top n correlations to plot
22 |
23 | Returns:
24 | None
25 |
26 | Sources:
27 | https://typefully.com/levikul09/j6qzwR0
28 | https://stackoverflow.com/questions/17778394/list-highest-correlation-pairs-from-a-large-correlation-matrix-in-pandas
29 |
30 | """
31 |
32 | df1 = df1.drop(columns=drop_cols)
33 | useful_columns = df1.select_dtypes(include=['number']).columns
34 |
35 | def get_redundant_pairs(df):
36 | pairs_to_drop = set()
37 | cols = df.columns
38 | for i in range(0,df.shape[1]):
39 | for j in range(0,i+1):
40 | pairs_to_drop.add((cols[i],cols[j]))
41 | return pairs_to_drop
42 |
43 | def get_correlations(df,n=n):
44 |         au_corr = df.corr(method = 'spearman').unstack() # spearman used because not all features are normally distributed
45 | labels_to_drop = get_redundant_pairs(df)
46 | au_corr = au_corr.drop(labels = labels_to_drop).sort_values(ascending=False)
47 | top_n = au_corr[0:n]
48 | bottom_n = au_corr[-n:]
49 | top_corr = pd.concat([top_n, bottom_n])
50 | return top_corr
51 |
52 | corrplot = get_correlations(df1[useful_columns])
53 |
54 |
55 | fig, ax = plt.subplots(figsize=(15,10))
56 | norm = TwoSlopeNorm(vmin=-1, vcenter=0, vmax =1)
57 | colors = [plt.cm.RdYlGn(norm(c)) for c in corrplot.values]
58 |
59 | print(corrplot)
60 |
61 | corrplot.plot.barh(color=colors)
62 |
63 | return
64 |
65 |
66 | def plot_corr_vs_target(target: str, df1: pd.DataFrame, drop_cols: list, n: int = 30) -> None:
67 | """
68 | Plots a color-gradient bar chart showing top n correlations between features and target
69 |
70 | Args:
71 | target (str): the name of the target column
72 | df1 (pd.DataFrame): the dataframe to plot
73 | drop_cols (list): list of columns not to include in plot
74 | n (int): number of top n correlations to plot
75 |
76 | Returns:
77 | None
78 |
79 | """
80 |
81 |
82 | target_series = df1[target]
83 | df1 = df1.drop(columns=drop_cols)
84 |
85 | x = df1.corrwith(target_series, method = 'spearman',numeric_only=True).sort_values(ascending=False)
86 | top_n = x[0:n]
87 | bottom_n = x[-n:]
88 | top_corr = pd.concat([top_n, bottom_n])
89 | x = top_corr
90 |
91 | print(x)
92 |
93 | fig, ax = plt.subplots(figsize=(15,10))
94 | norm = TwoSlopeNorm(vmin=-1, vcenter=0, vmax =1)
95 | colors = [plt.cm.RdYlGn(norm(c)) for c in x.values]
96 | x.plot.barh(color=colors)
97 |
98 | return
99 |
100 |
101 | def plot_confusion_matrix(cm: np.ndarray,
102 | target_names: list,
103 | title: str ='Confusion matrix',
104 |                           cmap=None,
105 | normalize: bool =True):
106 | """
107 | given a sklearn confusion matrix (cm), make a nice plot
108 |
109 | Arguments
110 | ---------
111 | cm: confusion matrix from sklearn.metrics.confusion_matrix
112 |
113 | target_names: given classification classes such as [0, 1, 2]
114 | the class names, for example: ['high', 'medium', 'low']
115 |
116 | title: the text to display at the top of the matrix
117 |
118 | cmap: the gradient of the values displayed from matplotlib.pyplot.cm
119 | see http://matplotlib.org/examples/color/colormaps_reference.html
120 | plt.get_cmap('jet') or plt.cm.Blues
121 |
122 | normalize: If False, plot the raw numbers
123 | If True, plot the proportions
124 |
125 | Usage
126 | -----
127 | plot_confusion_matrix(cm = cm, # confusion matrix created by
128 | # sklearn.metrics.confusion_matrix
129 | normalize = True, # show proportions
130 | target_names = y_labels_vals, # list of names of the classes
131 | title = best_estimator_name) # title of graph
132 |
133 | Citation
134 | ---------
135 | http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
136 |
137 | Source:
138 | https://stackoverflow.com/questions/19233771/sklearn-plot-confusion-matrix-with-labels/50386871#50386871
139 |
140 | """
141 |
142 | accuracy = np.trace(cm) / np.sum(cm).astype('float')
143 | misclass = 1 - accuracy
144 |
145 | if cmap is None:
146 | cmap = plt.get_cmap('Blues')
147 |
148 | fig = plt.figure(figsize=(8, 6))
149 | plt.imshow(cm, interpolation='nearest', cmap=cmap)
150 | plt.title(title)
151 | plt.colorbar()
152 |
153 | if target_names is not None:
154 | tick_marks = np.arange(len(target_names))
155 | plt.xticks(tick_marks, target_names, rotation=45)
156 | plt.yticks(tick_marks, target_names)
157 |
158 | if normalize:
159 | cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
160 |
161 |
162 | thresh = cm.max() / 1.5 if normalize else cm.max() / 2
163 | for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
164 | if normalize:
165 | plt.text(j, i, "{:0.4f}".format(cm[i, j]),
166 | horizontalalignment="center",
167 | color="white" if cm[i, j] > thresh else "black")
168 | else:
169 | plt.text(j, i, "{:,}".format(cm[i, j]),
170 | horizontalalignment="center",
171 | color="white" if cm[i, j] > thresh else "black")
172 |
173 |
174 | plt.tight_layout()
175 | plt.ylabel('True label')
176 | plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
177 | plt.show()
178 |
179 | return fig
180 |
181 | def run_sweetviz_report(df: pd.DataFrame, TARGET: str) -> None:
182 | """
183 | Generates a sweetviz report and saves it to a html file
184 |
185 | Args:
186 | df (pd.DataFrame): the dataframe to analyze
187 | TARGET (str): the name of the target column
188 | """
189 |
190 | report_label = datetime.today().strftime('%Y-%m-%d_%H_%M')
191 |
192 | my_report = sv.analyze(df,target_feat=TARGET)
193 | my_report.show_html(filepath='SWEETVIZ_' + report_label + '.html')
194 |
195 | return
196 |
197 |
198 | def run_sweetviz_comparison(df1: pd.DataFrame, df1_name: str, df2: pd.DataFrame, df2_name: str, TARGET: str, report_label: str) -> None:
199 | """
200 | Generates a sweetviz comparison report between two dataframes and saves it to a html file
201 |
202 | Args:
203 | df1 (pd.DataFrame): the first dataframe to analyze
204 | df1_name (str): the name of the first dataframe
205 | df2 (pd.DataFrame): the second dataframe to analyze
206 | df2_name (str): the name of the second dataframe
207 | TARGET (str): the name of the target column
208 | report_label (str): identifier incorporated into the filename of the report
209 | """
210 |
211 | report_label = report_label + datetime.today().strftime('%Y-%m-%d_%H_%M')
212 |
213 | my_report = sv.compare([df1, df1_name], [df2, df2_name], target_feat=TARGET,pairwise_analysis="off")
214 | my_report.show_html(filepath='SWEETVIZ_' + report_label + '.html')
215 |
216 | return
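217 |
218 | # Example usage (a minimal sketch, not part of the original module; `df`, `y_test`,
219 | # and `y_pred` are assumed to come from the notebooks in this repo):
220 | #
221 | #     from sklearn.metrics import confusion_matrix
222 | #
223 | #     plot_corr_vs_target('TARGET', df, drop_cols=['GAME_DATE_EST', 'GAME_ID'], n=20)
224 | #     cm = confusion_matrix(y_test, y_pred)
225 | #     plot_confusion_matrix(cm, target_names=['HOME_LOSS', 'HOME_WIN'], normalize=True)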
--------------------------------------------------------------------------------
/src/data_processing.py:
--------------------------------------------------------------------------------
1 |
2 | import pandas as pd
3 |
4 | def process_games(games: pd.DataFrame) -> pd.DataFrame:
5 | """
6 | Performs basic data cleaning on the games dataset.
7 |
8 | Args:
9 | games (pd.DataFrame): the raw games dataframe
10 |
11 | Returns:
12 | the cleaned games dataframe
13 |
14 | """
15 |
16 |
17 | # remove preseason games (GAME_ID begins with a 1)
18 | games = games[games['GAME_ID'] > 20000000]
19 |
20 |     # flag postseason games (GAME_ID of 30000000 and above)
21 | games['PLAYOFF'] = (games['GAME_ID'] >= 30000000).astype('int8')
22 |
23 | # remove duplicates (each GAME_ID should be unique)
24 | games = games[~games.duplicated(subset=['GAME_ID'])]
25 |
26 | # drop unnecessary fields
27 | all_columns = games.columns.tolist()
28 | drop_columns = ['GAME_STATUS_TEXT', 'TEAM_ID_home', 'TEAM_ID_away']
29 | use_columns = [item for item in all_columns if item not in drop_columns]
30 | games = games[use_columns]
31 |
32 | return games
33 |
34 |
35 | def process_ranking(ranking: pd.DataFrame) -> pd.DataFrame:
36 | """
37 | Performs basic data cleaning on the ranking dataset.
38 |
39 | Args:
40 | ranking (pd.DataFrame): the raw ranking dataframe
41 |
42 | Returns:
43 | the cleaned ranking dataframe
44 |
45 | """
46 |
47 |
48 | # remove preseason rankings (SEASON_ID begins with 1)
49 | ranking = ranking[ranking['SEASON_ID'] > 20000]
50 |
51 | # convert home record and road record to numeric
52 | ranking['HOME_W'] = ranking['HOME_RECORD'].apply(lambda x: x.split('-')[0]).astype('int')
53 | ranking['HOME_L'] = ranking['HOME_RECORD'].apply(lambda x: x.split('-')[1]).astype('int')
54 | ranking['HOME_W_PCT'] = ranking['HOME_W'] / ( ranking['HOME_W'] + ranking['HOME_L'] )
55 |
56 | ranking['ROAD_W'] = ranking['ROAD_RECORD'].apply(lambda x: x.split('-')[0]).astype('int')
57 | ranking['ROAD_L'] = ranking['ROAD_RECORD'].apply(lambda x: x.split('-')[1]).astype('int')
58 | ranking['ROAD_W_PCT'] = ranking['ROAD_W'] / ( ranking['ROAD_W'] + ranking['ROAD_L'] )
59 |
60 | # encode CONFERENCE as an integer (just using pandas - not importing sklearn for just one feature)
61 | ranking['CONFERENCE'] = ranking['CONFERENCE'].apply(lambda x: 0 if x=='East' else 1 ).astype('int')
62 |
63 | # remove duplicates (there should only be one TEAM_ID per STANDINGSDATE)
64 | ranking = ranking[~ranking.duplicated(subset=['TEAM_ID','STANDINGSDATE'])]
65 |
66 | # drop unnecessary fields
67 | drop_fields = ['SEASON_ID', 'LEAGUE_ID', 'RETURNTOPLAY', 'TEAM', 'HOME_RECORD', 'ROAD_RECORD']
68 | ranking = ranking.drop(drop_fields,axis=1)
69 |
70 | return ranking
71 |
72 |
73 | def process_games_details(details: pd.DataFrame) -> pd.DataFrame:
74 | """
75 | Performs basic data cleaning on the games_details dataset.
76 |
77 | Args:
78 | details (pd.DataFrame): the raw games_details dataframe
79 |
80 | Returns:
81 | the cleaned games_details dataframe
82 |
83 | """
84 |
85 |
86 | # convert MIN:SEC to float
87 |     df = details.loc[details['MIN'].str.contains(':',na=False)].copy()  # copy to avoid chained-assignment issues
88 | df['MIN_whole'] = df['MIN'].apply(lambda x: x.split(':')[0]).astype("int8")
89 | df['MIN_seconds'] = df['MIN'].apply(lambda x: x.split(':')[1]).astype("int8")
90 | df['MIN'] = df['MIN_whole'] + (df['MIN_seconds'] / 60)
91 |
92 | details['MIN'].loc[details['MIN'].str.contains(':',na=False)] = df['MIN']
93 | details['MIN'] = details['MIN'].astype("float16")
94 |
95 | # convert negatives to positive
96 | details['MIN'].loc[details['MIN'] < 0] = -(details['MIN'])
97 |
98 | # update START_POSITION if did not play (MIN = NaN)
99 | details['START_POSITION'].loc[details['MIN'].isna()] = 'NP'
100 |
101 | # update START_POSITION if null
102 | details['START_POSITION'] = details['START_POSITION'].fillna('NS')
103 |
104 | # drop unnecessary fields
105 | drop_fields = ['COMMENT', 'TEAM_ABBREVIATION', 'TEAM_CITY', 'PLAYER_NAME', 'NICKNAME']
106 | details = details.drop(drop_fields,axis=1)
107 |
108 | return details
109 |
110 |
111 | def add_TARGET(df: pd.DataFrame) -> pd.DataFrame:
112 | """
113 | Adds a TARGET column to the dataframe by copying HOME_TEAM_WINS.
114 |
115 | Args:
116 | df (pd.DataFrame): the dataframe to add the TARGET column to
117 |
118 | Returns:
119 | the games dataframe with a TARGET column
120 |
121 | """
122 |
123 | df['TARGET'] = df['HOME_TEAM_WINS']
124 |
125 | return df
126 |
127 | def split_train_test(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
128 | """
129 | Splits the dataframe into train and test sets.
130 |
131 |     Uses the latest season as the test set and the rest as the train set.
132 |     The second-latest season is also included in the test set to allow for feature engineering (rolling features need prior games).
133 |
134 | Args:
135 | df (pd.DataFrame): the dataframe to split
136 |
137 | Returns:
138 | the train and test dataframes
139 |
140 | """
141 |
142 | latest_season = df['SEASON'].unique().max()
143 |
144 | train = df[df['SEASON'] < (latest_season)]
145 | test = df[df['SEASON'] >= (latest_season - 1)]
146 |
147 | return train, test
148 |
149 |
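150 | # Example usage (a minimal sketch, not part of the original module; assumes a raw
151 | # games.csv with the columns referenced above has been downloaded to data/):
152 | #
153 | #     games = pd.read_csv('data/games.csv')
154 | #     games = process_games(games)
155 | #     games = add_TARGET(games)
156 | #     train, test = split_train_test(games)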
--------------------------------------------------------------------------------
/src/feature_engineering.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 |
4 | def process_features(df: pd.DataFrame) -> pd.DataFrame:
5 | """
6 | Master function to perform all the steps of feature engineering
7 |
8 | Args:
9 | df (pd.DataFrame): the dataframe to process
10 |
11 | Returns:
12 | the processed dataframe
13 |
14 |
15 |
16 | Feature engineering to add:
17 | - rolling averages of key stats,
18 | - win/lose streaks,
19 | - home/away streaks,
20 | - specific matchup (team X vs team Y) rolling averages and streaks
21 | - home team rolling stats minus visitor team rolling stats
22 | - rolling stats minus current league average
23 |
24 | Functions include:
25 | - fix_datatypes(): converts date to proper format and reduces memory footprint of ints and floats
26 | - add_date_features(): adds a feature for month number from game date
27 | - remove_playoff_games(): playoff games may bias the statistics
28 | - add_rolling_home_visitor(): rolling avgs and streaks for home/visitor team when playing as home/visitor
29 | - process_games_consecutively(): separate home team stats from visitor team stats for each game and stack these together by game date
30 | - add_past_performance_all(): rolling avgs and streaks no matter if playing as home or visitor team
31 | - process_x_minus_league_avg: subtract league avg rolling stats from team's rolling stats
32 |     - add_matchups(): rolling avgs and streaks for games in which Team A played Team B
33 | - combine_new_features(): combine back home team and visitor team features so each game has only one row again
34 | - process_x_minus_y(): subtract visitor team rolling stats from home rolling stats
35 | """
36 |
37 | # lengths of rolling averages and streaks to calculate for each team
38 | # we will try a variety of lengths to see which works best
39 | home_visitor_roll_list = [3, 7, 10] #lengths to use when restricting to home or visitor role
40 | all_roll_list = [3, 7, 10, 15] #lengths to use when NOT restricting to home or visitor role
41 |
42 | long_integer_fields = ['GAME_ID', 'HOME_TEAM_ID', 'VISITOR_TEAM_ID', 'SEASON']
43 | short_integer_fields = ['PTS_home', 'AST_home', 'REB_home', 'PTS_away', 'AST_away', 'REB_away']
44 | date_fields = ['GAME_DATE_EST']
45 |
46 | df = fix_datatypes(df, date_fields, short_integer_fields, long_integer_fields)
47 | df = add_date_features(df)
48 | df = remove_playoff_games(df)
49 | df = add_rolling_home_visitor(df, "HOME", home_visitor_roll_list)
50 | df = add_rolling_home_visitor(df, "VISITOR", home_visitor_roll_list)
51 |
52 | df_consecutive = process_games_consecutively(df)
53 | df_consecutive = add_matchups(df_consecutive, home_visitor_roll_list)
54 | df_consecutive = add_past_performance_all(df_consecutive, all_roll_list)
55 |
56 | #add these features back to main dataframe
57 | df = combine_new_features(df,df_consecutive)
58 |
59 | df = process_x_minus_y(df)
60 |
61 | return df
62 |
63 |
64 | def fix_datatypes(df: pd.DataFrame, date_columns: list, short_integer_fields: list, long_integer_fields: list)-> pd.DataFrame:
65 | """
66 | Converts date to proper format and reduces memory footprint of ints and floats
67 |
68 | Args:
69 | df (pd.DataFrame): the dataframe to process
70 |
71 | Returns:
72 | the processed dataframe
73 |
74 | """
75 |
76 | for field in date_columns:
77 | df[field] = pd.to_datetime(df[field])
78 |
79 | #convert long integer fields to int32 from int64
80 | for field in long_integer_fields:
81 | df[field] = df[field].astype('int32')
82 |
83 | #convert specific fields to int16 to avoid type issues with hopsworks.ai
84 | for field in short_integer_fields:
85 | df[field] = df[field].astype('int16')
86 |
87 | #convert to positive. For some reason, some values have been saved as negative numbers
88 | for field in short_integer_fields:
89 | df[field] = df[field].abs()
90 |
91 | #convert the remaining int64s to int8
92 | for field in df.select_dtypes(include=['int64']).columns.tolist():
93 | df[field] = df[field].astype('int8')
94 |
95 | #convert float64s to float16s
96 | for field in df.select_dtypes(include=['float64']).columns.tolist():
97 | df[field] = df[field].astype('float16')
98 |
99 | return df
100 |
101 |
102 | def add_date_features(df: pd.DataFrame)-> pd.DataFrame:
103 | """
104 | Creates new features from the game date, which will hopefully be more useful for the model
105 |
106 |     Currently it simply extracts the month from the game date and adds it as a feature, which limits the cardinality of the date feature.
107 |
108 | Args:
109 | df (pd.DataFrame): the dataframe to process
110 |
111 | Returns:
112 | the processed dataframe
113 | """
114 |
115 | df['MONTH'] = df['GAME_DATE_EST'].dt.month
116 |
117 | return df
118 |
119 |
120 | def remove_playoff_games(df: pd.DataFrame)-> pd.DataFrame:
121 | """
122 | Remove playoff games
123 |
124 | Playoff games may bias the statistics because they are not played under the same conditions as regular season games and are played in a tournament format.
125 |
126 | Args:
127 | df (pd.DataFrame): the dataframe to process
128 |
129 | Returns:
130 | the processed dataframe
131 | """
132 |
133 |
134 | # Filter to only non-playoff games and then drop the PLAYOFF feature
135 |
136 | df = df[df["PLAYOFF"] == 0]
137 | df = df.drop("PLAYOFF", axis = 1)
138 |
139 | return df
140 |
141 |
142 | def add_rolling_home_visitor(df: pd.DataFrame, location: str, roll_list: list)-> pd.DataFrame:
143 | """
144 | Add rolling avgs and win/lose streaks for home/visitor team when playing as home/visitor for a variety of rolling lengths
145 |
146 | This function also invokes another function to calculate the league average rolling stats for that moment in time and subtracts these from the team's rolling stats.
147 |
148 | Args:
149 | df (pd.DataFrame): the dataframe to process
150 | location (str): "HOME" or "VISITOR"
151 | roll_list (list): list of number of games for each rolling mean, e.g. [3, 5, 7, 10, 15]
152 |
153 | Returns:
154 | the processed dataframe
155 |
156 |
157 | We are adding features that show how well the home team has done in its last home games and how well the visitor team has done in its last away games.
158 | We are also determining the current win streak for each team (negative if losing streak) when playing as home or visitor team.
159 |
160 | """
161 |
162 | # compile stats for home or visitor team
163 | location_id = location + "_TEAM_ID"
164 |
165 | # sort games by the order in which they were played for each home or visitor team
166 | df = df.sort_values(by = [location_id, 'GAME_DATE_EST'], axis=0, ascending=[True, True,], ignore_index=True)
167 |
168 | # Win streak, negative if a losing streak
169 | df[location + '_TEAM_WIN_STREAK'] = df['HOME_TEAM_WINS'].groupby((df['HOME_TEAM_WINS'].shift() != df.groupby([location_id])['HOME_TEAM_WINS'].shift(2)).cumsum()).cumcount() + 1
170 | # if home team lost the last game of the streak, then the streak must be a losing streak. make it negative
171 | df[location + '_TEAM_WIN_STREAK'].loc[df['HOME_TEAM_WINS'].shift() == 0] = -1 * df[location + '_TEAM_WIN_STREAK']
172 |
173 | # If visitor, the streak has opposite meaning (3 wins in a row for home team is 3 losses in a row for visitor)
174 | if location == 'VISITOR':
175 | df[location + '_TEAM_WIN_STREAK'] = - df[location + '_TEAM_WIN_STREAK']
176 |
177 |
178 | # rolling means
179 | feature_list = ['HOME_TEAM_WINS', 'PTS_home', 'FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home']
180 |
181 | if location == 'VISITOR':
182 | feature_list = ['HOME_TEAM_WINS', 'PTS_away', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away']
183 |
184 |
185 | roll_feature_list = []
186 | for feature in feature_list:
187 | for roll in roll_list:
188 | roll_feature_name = location + '_' + feature + '_AVG_LAST_' + str(roll) + '_' + location
189 | if feature == 'HOME_TEAM_WINS': #remove the "HOME_" for better readability
190 | roll_feature_name = location + '_' + feature[5:] + '_AVG_LAST_' + str(roll) + '_' + location
191 | roll_feature_list.append(roll_feature_name)
192 |             df[roll_feature_name] = df.groupby([location_id])[feature].rolling(roll, closed= "left").mean().values # group by the same key used in the sort above so the positional assignment aligns
193 |
194 |
195 |
196 |     # determine league avg for each stat and then subtract it from each team's avg
197 | # as a measure of how well that team compares to all teams in that moment in time
198 |
199 | #remove win averages from roll list - the league average will always be 0.5 (half the teams win, half lose)
200 |     roll_feature_list = [x for x in roll_feature_list if not x.startswith(location + '_TEAM_WINS')]
201 | #print(location_id)
202 | df = process_x_minus_league_avg(df, roll_feature_list, location_id)
203 |
204 |
205 | return df
206 |
207 |
208 | def process_games_consecutively(df_data: pd.DataFrame)-> pd.DataFrame:
209 | """
210 |
211 | Separate home team stats from visitor team stats for each game and stack these together by game date.
212 |
213 | (Each game record will go from a single row, Home/Visitor combined, to two rows, one for home team and one for visitor)
214 |
215 | Args:
216 | df (pd.DataFrame): the dataframe to process
217 |
218 | Returns:
219 | the processed dataframe
220 | """
221 |
222 | # re-organize so that all of a team's games can be listed in chronological order whether HOME or VISITOR
223 | # this will facilitate feature engineering (winpct vs team X, 5-game winpct, current win streak, etc...)
224 |
225 | # before this step, the data is stored by game, and each game has 2 teams
226 | # this function will separate each teams stats so that each game has 2 rows (one for each team) instead of one combined row
227 |
228 | #this data will need to be re-linked back to the main dataframe after all processing is done,
229 | #joining TEAM1 to HOME_TEAM_ID for all records and then TEAM1 to VISITOR_TEAM_ID for all records
230 |
231 | #TEAM1 will be the key field. TEAM2 is used solely to process past team matchups
232 |
233 | # all the home games for each team will be selected and then stacked with all the away games
234 |
235 | df_home = pd.DataFrame()
236 | df_home['GAME_DATE_EST'] = df_data['GAME_DATE_EST']
237 | df_home['GAME_ID'] = df_data['GAME_ID']
238 | df_home['TEAM1'] = df_data['HOME_TEAM_ID']
239 | df_home['TEAM1_home'] = 1
240 | df_home['TEAM1_win'] = df_data['HOME_TEAM_WINS']
241 | df_home['TEAM2'] = df_data['VISITOR_TEAM_ID']
242 | df_home['SEASON'] = df_data['SEASON']
243 |
244 | df_home['PTS'] = df_data['PTS_home']
245 | df_home['FG_PCT'] = df_data['FG_PCT_home']
246 | df_home['FT_PCT'] = df_data['FT_PCT_home']
247 | df_home['FG3_PCT'] = df_data['FG3_PCT_home']
248 | df_home['AST'] = df_data['AST_home']
249 | df_home['REB'] = df_data['REB_home']
250 |
251 | # now for visitor teams
252 |
253 | df_visitor = pd.DataFrame()
254 | df_visitor['GAME_DATE_EST'] = df_data['GAME_DATE_EST']
255 | df_visitor['GAME_ID'] = df_data['GAME_ID']
256 | df_visitor['TEAM1'] = df_data['VISITOR_TEAM_ID']
257 | df_visitor['TEAM1_home'] = 0
258 | df_visitor['TEAM1_win'] = df_data['HOME_TEAM_WINS'].apply(lambda x: 1 if x == 0 else 0)
259 | df_visitor['TEAM2'] = df_data['HOME_TEAM_ID']
260 | df_visitor['SEASON'] = df_data['SEASON']
261 |
262 | df_visitor['PTS'] = df_data['PTS_away']
263 | df_visitor['FG_PCT'] = df_data['FG_PCT_away']
264 | df_visitor['FT_PCT'] = df_data['FT_PCT_away']
265 | df_visitor['FG3_PCT'] = df_data['FG3_PCT_away']
266 | df_visitor['AST'] = df_data['AST_away']
267 | df_visitor['REB'] = df_data['REB_away']
268 |
269 | # merge dfs
270 |
271 | df = pd.concat([df_home, df_visitor])
272 |
273 | column2 = df.pop('TEAM1')
274 | column3 = df.pop('TEAM1_home')
275 | column4 = df.pop('TEAM2')
276 | column5 = df.pop('TEAM1_win')
277 |
278 | df.insert(2,'TEAM1', column2)
279 | df.insert(3,'TEAM1_home', column3)
280 | df.insert(4,'TEAM2', column4)
281 | df.insert(5,'TEAM1_win', column5)
282 |
283 | df = df.sort_values(by = ['TEAM1', 'GAME_ID'], axis=0, ascending=[True, True], ignore_index=True)
284 |
285 | return df
286 |
287 |
288 | def add_matchups(df: pd.DataFrame, roll_list: list)-> pd.DataFrame:
289 | """
290 |     Add rolling win pcts and win/lose streaks for each time Team A played Team B, for a variety of rolling windows
291 |
292 | Args:
293 | df (pd.DataFrame): the dataframe to process
294 | roll_list (list): list of number of games for each rolling mean, e.g. [3, 5, 7, 10, 15]
295 |
296 | Returns:
297 | the processed dataframe
298 | """
299 |
300 |
301 | # group all the games that 2 teams played each other
302 | # calculate home team win pct and the home team win/lose streak
303 |
304 |
305 | df = df.sort_values(by = ['TEAM1', 'TEAM2','GAME_DATE_EST'], axis=0, ascending=[True, True, True], ignore_index=True)
306 |
307 | for roll in roll_list:
308 | df['MATCHUP_WINPCT_' + str(roll)] = df.groupby(['TEAM1','TEAM2'])['TEAM1_win'].rolling(roll, closed= "left").mean().values
309 |
310 | df['MATCHUP_WIN_STREAK'] = df['TEAM1_win'].groupby((df['TEAM1_win'].shift() != df.groupby(['TEAM1','TEAM2'])['TEAM1_win'].shift(2)).cumsum()).cumcount() + 1
311 |
312 | # if team1 lost the last game of the streak, then the streak must be a losing streak. make it negative
313 | df['MATCHUP_WIN_STREAK'].loc[df['TEAM1_win'].shift() == 0] = -1 * df['MATCHUP_WIN_STREAK']
314 |
315 |
316 | return df
317 |
318 |
319 | def add_past_performance_all(df: pd.DataFrame, roll_list: list)-> pd.DataFrame:
320 | """
321 | Add rolling avgs, win/lose streak, and home/away streak no matter if playing as home or visitor team.
322 |
323 | Args:
324 | df (pd.DataFrame): the dataframe to process
325 | roll_list (list): list of number of games for each rolling mean, e.g. [3, 5, 7, 10, 15]
326 |
327 | Returns:
328 | the processed dataframe
329 | """
330 |
331 | # add features showing how well each team has done in its last games
332 | # regardless whether they were at home or away
333 |
334 | # add rolling means and win streaks (negative number if losing streak)
335 |
336 | #this data will need to be re-linked back to the main dataframe after all processing is done,
337 | #joining TEAM1 to HOME_TEAM_ID for all records and then TEAM1 to VISITOR_TEAM_ID for all records
338 |
339 | #TEAM1 will be the key field. TEAM2 was used solely to process past team matchups
340 |
341 |
342 | df = df.sort_values(by = ['TEAM1','GAME_DATE_EST'], axis=0, ascending=[True, True,], ignore_index=True)
343 |
344 |     #streak of games won/lost, made negative if a losing streak
345 | df['WIN_STREAK'] = df['TEAM1_win'].groupby((df['TEAM1_win'].shift() != df.groupby(['TEAM1'])['TEAM1_win'].shift(2)).cumsum()).cumcount() + 1
346 |
347 | # if team1 lost the last game of the streak, then the streak must be a losing streak. make it negative
348 | df['WIN_STREAK'].loc[df['TEAM1_win'].shift() == 0] = -1 * df['WIN_STREAK']
349 |
350 | #streak of games played at home/away, make negative if away streak
351 | df['HOME_AWAY_STREAK'] = df['TEAM1_home'].groupby((df['TEAM1_home'].shift() != df.groupby(['TEAM1'])['TEAM1_home'].shift(2)).cumsum()).cumcount() + 1
352 |
353 |     # if team1 played the last game of the streak away, then the streak must be an away streak. make it negative
354 | df['HOME_AWAY_STREAK'].loc[df['TEAM1_home'].shift() == 0] = -1 * df['HOME_AWAY_STREAK']
355 |
356 | #rolling means
357 |
358 | feature_list = ['TEAM1_win', 'PTS', 'FG_PCT', 'FT_PCT', 'FG3_PCT', 'AST', 'REB']
359 |
360 | #create new feature names based upon rolling period
361 |
362 | roll_feature_list =[]
363 |
364 | for feature in feature_list:
365 | for roll in roll_list:
366 | roll_feature_name = feature + '_AVG_LAST_' + str(roll) + '_ALL'
367 | roll_feature_list.append(roll_feature_name)
368 | df[roll_feature_name] = df.groupby(['TEAM1'])[feature].rolling(roll, closed= "left").mean().values
369 |
370 |     # determine league avg for each stat and then subtract it from each team's average
371 | # as a measure of how well that team compares to all teams in that moment in time
372 |
373 | #remove win averages from roll list - the league average will always be 0.5 (half the teams win, half lose)
374 | roll_feature_list = [x for x in roll_feature_list if not x.startswith('TEAM1_win')]
375 |
376 | df = process_x_minus_league_avg(df, roll_feature_list, 'TEAM1')
377 |
378 |
379 | return df
380 |
381 |
382 |
383 | def process_x_minus_league_avg(df: pd.DataFrame, feature_list: list, team_feature: str)-> pd.DataFrame:
384 | """
385 | Calculate the league average for every day of the season and then subtract the league average of each stat from the team's current stat for that day.
386 |
387 |     This provides a measure of how good the team is compared to the rest of the league at that moment in time.
388 |
389 | Args:
390 | df (pd.DataFrame): the dataframe to process
391 | feature_list (list): list of features to be used for subtraction, e.g. [PTS_AVG_LAST_5_ALL, REB_AVG_LAST_20_ALL]
392 | team_feature (str): the team's role (subset of data) that is being worked upon ("HOME_TEAM_ID", "VISITOR_TEAM_ID", or "TEAM1" for all roles)
393 |
394 | Returns:
395 | the processed dataframe
396 |
397 | """
398 |
399 | # create a temp dataframe so that every date can be front-filled
400 | # we need the current average for all 30 teams for every day during the season
401 | # whether that team played or not.
402 | # We will front-fill from previous days to ensure that every day has stats for every team
403 |
404 |     # df.to_csv("df.csv", index=False)  # debug snapshot, disabled
405 |
406 | # create feature list for temp dataframe to hold league averages
407 | temp_feature_list = feature_list.copy()
408 | temp_feature_list.append(team_feature)
409 | temp_feature_list.append("GAME_DATE_EST")
410 |
411 | df_temp = df[temp_feature_list]
412 |     # print(temp_feature_list)  # debug output, disabled
413 |     # df_temp.to_csv("df_temp.csv", index=False)  # debug snapshot, disabled
414 |
415 |
416 | # populate the dataframe with all days played and forward fill previous value if a particular team did not play that day
417 | # https://stackoverflow.com/questions/70362869
418 | df_temp = (df_temp.set_index('GAME_DATE_EST',)
419 | .groupby([team_feature])[feature_list]
420 | .apply(lambda x: x.asfreq('d', method = "ffill"))
421 | .reset_index()
422 | [temp_feature_list]
423 | )
424 |
425 | # find the average across all teams for each day
426 | df_temp = df_temp.groupby(['GAME_DATE_EST'])[feature_list].mean().reset_index()
427 |
428 | # rename features for merging
429 | df_temp = df_temp.add_suffix('_LEAGUE_AVG')
430 | temp_features = df_temp.columns
431 |
432 | # merge all-team averages with each record so that they can be subtracted
433 | df = df.sort_values(by = 'GAME_DATE_EST', axis=0, ascending= True, ignore_index=True)
434 | df = pd.merge(df, df_temp, left_on='GAME_DATE_EST', right_on='GAME_DATE_EST_LEAGUE_AVG', how="left",)
435 | # subtract league average for each feature
436 | for feature in feature_list:
437 | df[feature + "_MINUS_LEAGUE_AVG"] = df[feature] - df[feature + "_LEAGUE_AVG"]
438 |
439 | # drop temp features that were only used for subtraction
440 | df = df.drop(temp_features, axis = 1)
441 |
442 | return df
443 |
444 |
445 | def combine_new_features(df: pd.DataFrame, df_consecutive: pd.DataFrame)-> pd.DataFrame:
446 | """
447 |     Merge the features created in the consecutive dataframe back into the main dataframe.
448 |
449 | The consecutive dataframe was used to derive features regardless of whether the team was home or away, and now we need to add those features back to the main dataframe.
450 |
451 | Args:
452 | df (pd.DataFrame): the main dataframe where each row is a game with both a home team and a visitor team
453 | df_consecutive (pd.DataFrame): the dataframe where each row is a game with only one team (either home or visitor)
454 |
455 | Returns:
456 | the merged dataframe
457 | """
458 |
459 |
460 | # add back all the new features created in the consecutive dataframe to the main dataframe
461 | # all data for TEAM1 will be applied to the home team and then again to the visitor team
462 | # except for head-to-head MATCHUP data, which will only be applied to home team (redundant to include for both)
463 |     # the suffix '_x' will be appended to feature names when adding to home team
464 |     # the suffix '_y' will be appended to feature names when adding to visitor team
465 | # to match the existing convention in the dataset
466 |
467 | #first select out the new features
468 | all_features = df_consecutive.columns.tolist()
469 | link_features = ['GAME_ID', 'TEAM1', ]
470 | redundant_features = ['GAME_DATE_EST','TEAM1_home','TEAM1_win','TEAM2','SEASON','PTS', 'FG_PCT', 'FT_PCT', 'FG3_PCT', 'AST', 'REB',]
471 | matchup_features = [x for x in all_features if "MATCHUP" in x]
472 | ignore_features = link_features + redundant_features
473 |
474 | new_features = [x for x in all_features if x not in ignore_features]
475 |
476 | # first home teams
477 |
478 | df1 = df_consecutive[df_consecutive['TEAM1_home'] == 1]
479 | #add "_x" to new features
480 | df1.columns = [x + '_x' if x in new_features else x for x in df1.columns]
481 | #drop features that don't need to be merged
482 | df1 = df1.drop(redundant_features,axis=1)
483 | #change TEAM1 to HOME_TEAM_ID for easy merging
484 | df1 = df1.rename(columns={'TEAM1': 'HOME_TEAM_ID'})
485 | df = pd.merge(df, df1, how="left", on=["GAME_ID", "HOME_TEAM_ID"])
486 |
487 |     #don't include matchup features for visitor team since they are equivalent for both home and visitor
488 | new_features = [x for x in new_features if x not in matchup_features]
489 | df_consecutive = df_consecutive.drop(matchup_features,axis=1)
490 |
491 | # next visitor teams
492 |
493 | df2 = df_consecutive[df_consecutive['TEAM1_home'] == 0]
494 | #add "_y" to new features
495 | df2.columns = [x + '_y' if x in new_features else x for x in df2.columns]
496 | #drop features that don't need to be merged
497 | df2 = df2.drop(redundant_features,axis=1)
498 | #change TEAM1 to VISITOR_TEAM_ID for easy merging
499 | df2 = df2.rename(columns={'TEAM1': 'VISITOR_TEAM_ID'})
500 | df = pd.merge(df, df2, how="left", on=["GAME_ID", "VISITOR_TEAM_ID"])
501 |
502 | return df
503 |
504 |
505 | def process_x_minus_y(df: pd.DataFrame)-> pd.DataFrame:
506 | """
507 | Subtract visitor team rolling stats from home rolling stats.
508 |
509 | This may (or may not) be useful for the model to explicitly see the difference between the two teams. GBM models may be able to handle this automatically, but other models may not.
510 |
511 | Args:
512 | df (pd.DataFrame): the dataframe to process
513 |
514 | Returns:
515 | the processed dataframe
516 | """
517 |
518 | # field_x - field_y
519 |
520 | # remove the current games stats since they are data leaks - we don't know these until after the game is played
521 | useful_features = remove_non_rolling(df)
522 |
523 | comparison_features = [x for x in useful_features if "_y" in x]
524 |
525 | #don't include redundant features. (x - league_avg) - (y - league_avg) = x-y
526 | comparison_features = [x for x in comparison_features if "_MINUS_LEAGUE_AVG" not in x]
527 |
528 | for feature in comparison_features:
529 | feature_base = feature[:-2] #remove "_y" from the end
530 | df[feature_base + "_x_minus_y"] = df[feature_base + "_x"] - df[feature_base + "_y"]
531 |
532 | #df = df.drop("CONFERENCE_x_minus_y") #category variable not meaningful?
533 |
534 | return df
535 |
536 |
537 | def remove_non_rolling(df: pd.DataFrame) -> list:
538 | """
539 | Returns a list of columns in a dataframe with the current games stats removed, leaving only rolling averages and streaks
540 |
541 | Args:
542 | df (pd.DataFrame): the dataframe to process
543 |
544 | Returns:
545 | list: only the columns that are rolling averages and streaks
546 | """
547 |
548 | # remove non-rolling features - these are data leaks
549 | # they are stats from the actual game that decides winner/loser,
550 | # but we don't know these stats before a game is played
551 |
552 | # These must be retained in the database to recalculate rolling avgs and streaks in the future,
553 | # so are filtered out as appropriate instead of deleted
554 |
555 | drop_columns =[]
556 |
557 | all_columns = df.columns.tolist()
558 |
559 | drop_columns1 = ['HOME_TEAM_WINS', 'PTS_home', 'FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home']
560 | drop_columns2 = ['PTS_away', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away']
561 |
562 | drop_columns = drop_columns + drop_columns1
563 | drop_columns = drop_columns + drop_columns2
564 |
565 | use_columns = [item for item in all_columns if item not in drop_columns]
566 |
567 | return use_columns
568 |
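569 | # Example usage (a minimal sketch, not part of the original module; assumes `games`
570 | # is the cleaned games dataframe produced by src/data_processing.py, PLAYOFF flag included):
571 | #
572 | #     from src.data_processing import process_games
573 | #
574 | #     games = process_games(pd.read_csv('data/games.csv'))
575 | #     features = process_features(games)
576 | #     print([c for c in features.columns if 'AVG_LAST' in c][:5])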
--------------------------------------------------------------------------------
/src/hopsworks_utils.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import json
3 | import hopsworks
4 |
5 | from datetime import datetime, timedelta
6 |
7 | def save_feature_names(df: pd.DataFrame) -> str:
8 | """
9 | Saves dictionary of {lower case feature names : original mixed-case feature names} to JSON file
10 |
11 | Args:
12 | df (pd.DataFrame): the dataframe with the features to be saved
13 |
14 | Returns:
15 | "File Saved."
16 | """
17 |
18 | # hopsworks "sanitizes" feature names by converting to all lowercase
19 | # this function saves the original so that they can be re-mapped later
20 | # for code re-usability
21 |
22 | original_f_names = df.columns.tolist()
23 | hopsworks_f_names = [x.lower() for x in original_f_names]
24 |
25 | # create a dictionary
26 | feature_mapper = {hopsworks_f_names[i]: original_f_names[i] for i in range(len(hopsworks_f_names))}
27 |
28 | with open("feature_names.json", "w") as fp:
29 | json.dump(feature_mapper, fp)
30 |
31 | return "File Saved."
32 |
33 |
34 | def convert_feature_names(df: pd.DataFrame) -> pd.DataFrame:
35 | """
36 | Converts hopsworks.ai lower-case feature names back to original mixed-case feature names that have been saved in JSON file
37 |
38 | Args:
39 | df (pd.DataFrame): the dataframe with features in all lower-case
40 |
41 | Returns:
42 | pd.DataFrame: the dataframe with features in original mixed-case
43 | """
44 |
45 | # hopsworks converts all feature names to lower-case, while the original feature names use mixed-case
46 |     # converting these back to original format is needed for optimal code re-usability.
47 |
48 | # read in list of original feature names
49 | with open('feature_names.json', 'rb') as fp:
50 | feature_mapper = json.load(fp)
51 |
52 | df = df.rename(columns=feature_mapper)
53 |
54 | return df
55 |
56 |
57 | def create_train_test_data(HOPSWORKS_API_KEY:str, STARTDATE:str, DAYS:int) -> tuple[pd.DataFrame, pd.DataFrame]:
58 | """
59 | Returns train and test data from Hopsworks.ai feature store based upon how many DAYS back to use as test data
60 |
61 | Args:
62 | HOPSWORKS_API_KEY (str): subscription key for Hopsworks.ai
63 | STARTDATE (str): start date for train data, format YYYY-MM-DD
64 | DAYS (int): number of days back to use as test data, the train data will be all data except the last DAYS
65 |
66 | Returns:
67 | Train and Test data as pandas dataframes
68 | """
69 |
70 | # log into hopsworks.ai and create a feature view object from the feature group
71 | # the api makes it easier to retrieve training/test data from a feature view than a feature group
72 |
73 | project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
74 | fs = project.get_feature_store()
75 |
76 | rolling_stats_fg = fs.get_or_create_feature_group(
77 | name="rolling_stats",
78 | version=2,
79 | )
80 |
81 | query = rolling_stats_fg.select_all()
82 |
83 | feature_view = fs.create_feature_view(
84 | name = 'rolling_stats_fv',
85 | version = 2,
86 | query = query
87 | )
88 |
89 | # calculate the start and end dates for the train and test data and then retrieve the data from the feature view
90 | # the train data will be all data except the last DAYS
91 |
92 |
93 | TODAY = datetime.now()
94 | LASTYEAR = (TODAY - timedelta(days=DAYS)).strftime('%Y-%m-%d')
95 | TODAY = TODAY.strftime('%Y-%m-%d')
96 |
97 | td_train, td_job = feature_view.create_training_data(
98 | start_time=STARTDATE,
99 | end_time=LASTYEAR,
100 | description='All data except last ' + str(DAYS) + ' days',
101 | data_format="csv",
102 | coalesce=True,
103 | write_options={'wait_for_job': False},
104 | )
105 |
106 | td_test, td_job = feature_view.create_training_data(
107 | start_time=LASTYEAR,
108 | end_time=TODAY,
109 | description='Last ' + str(DAYS) + ' days',
110 | data_format="csv",
111 | coalesce=True,
112 | write_options={'wait_for_job': False},
113 | )
114 |
115 | train = feature_view.get_training_data(td_train)[0]
116 | test = feature_view.get_training_data(td_test)[0]
117 |
118 | # hopsworks converts all feature names to lower-case, while the original feature names use mixed-case
119 |     # converting these back to original format is needed for optimal code re-usability.
120 |
121 | train = convert_feature_names(train)
122 | test = convert_feature_names(test)
123 |
124 | # fix date format (truncate to YYYY-MM-DD)
125 | train["GAME_DATE_EST"] = train["GAME_DATE_EST"].str[:10]
126 | test["GAME_DATE_EST"] = test["GAME_DATE_EST"].str[:10]
127 |
128 | # feature view is no longer needed, so delete it
129 | feature_view.delete()
130 |
131 | return train, test
132 |
133 |
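134 | # Example usage (a minimal sketch, not part of the original module; the start date is a
135 | # placeholder, and HOPSWORKS_API_KEY is read from the environment as in the GitHub workflow):
136 | #
137 | #     import os
138 | #
139 | #     train, test = create_train_test_data(
140 | #         HOPSWORKS_API_KEY=os.environ['HOPSWORKS_API_KEY'],
141 | #         STARTDATE='2012-10-01',
142 | #         DAYS=365,
143 | #     )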
--------------------------------------------------------------------------------
/src/model_training.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 |
4 | from collections import defaultdict
5 |
6 | import matplotlib.pyplot as plt
7 | from matplotlib.gridspec import GridSpec
8 |
9 | from sklearn.calibration import (
10 | CalibrationDisplay,
11 | )
12 |
13 | from sklearn.metrics import (
14 | precision_score,
15 | recall_score,
16 | f1_score,
17 | brier_score_loss,
18 | log_loss,
19 | roc_auc_score,
20 | )
21 |
22 |
23 | def encode_categoricals(df: pd.DataFrame, category_columns: list, MODEL_NAME: str, ENABLE_CATEGORICAL: bool) -> pd.DataFrame:
24 | """
25 | Encode categorical features as integers for use in XGBoost and LightGBM
26 |
27 | Args:
28 | df (pd.DataFrame): the dataframe to process
29 | category_columns (list): list of columns to encode as categorical
30 | MODEL_NAME (str): the name of the model being used
31 | ENABLE_CATEGORICAL (bool): whether or not to enable categorical features in the model
32 |
33 | Returns:
34 | the dataframe with categorical features encoded
35 |
36 |
37 | """
38 |
39 | # To use special category feature capabilities in XGB and LGB, categorical features must be ints from 0 to N-1
40 | # Conversion can be accomplished by simple subtraction for several features
41 | # (these category capabilities may or may not be used, but encoding does not hurt anything)
42 |
43 | first_team_ID = df['HOME_TEAM_ID'].min()
44 | first_season = df['SEASON'].min()
45 |
46 | # subtract lowest value from each to create a range of 0 thru N-1
47 | df['HOME_TEAM_ID'] = (df['HOME_TEAM_ID'] - first_team_ID).astype('int8') #team ID - 1610612737 = 0 thru 29
48 | df['VISITOR_TEAM_ID'] = (df['VISITOR_TEAM_ID'] - first_team_ID).astype('int8')
49 | df['SEASON'] = (df['SEASON'] - first_season).astype('int8')
50 |
51 | # if xgb experimental categorical capabilities are to be used, then features must be of category type
52 | if MODEL_NAME == "xgboost":
53 | if ENABLE_CATEGORICAL:
54 | for field in category_columns:
55 | df[field] = df[field].astype('category')
56 |
57 | return df
58 |
59 |
60 | def plot_calibration_curve(clf_list: list, X_train: pd.DataFrame, y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame, n_bins: int = 10) -> None:
61 | """
62 | Plots calibration curves for a list of classifiers vs ideal probability distribution
63 |
64 | Args:
65 | clf_list (list): the classifiers to plot
66 |         X_train (pd.DataFrame): training data
67 |         y_train (pd.DataFrame): labels for training data
68 |         X_test (pd.DataFrame): test data
69 |         y_test (pd.DataFrame): labels for test data
70 | n_bins (int, optional): how many bins to use for calibration. Defaults to 10.
71 |
72 | Returns:
73 | None
74 |
75 | FROM: https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html
76 | """
77 |
78 |
79 | fig = plt.figure(figsize=(10, 10))
80 | gs = GridSpec(4, 2)
81 | colors = plt.cm.get_cmap("Dark2")
82 |
83 | ax_calibration_curve = fig.add_subplot(gs[:2, :2])
84 | calibration_displays = {}
85 | for i, (clf, name) in enumerate(clf_list):
86 | clf.fit(X_train, y_train)
87 | display = CalibrationDisplay.from_estimator(
88 | clf,
89 | X_test,
90 | y_test,
91 | n_bins=n_bins,
92 | name=name,
93 | ax=ax_calibration_curve,
94 | color=colors(i),
95 | )
96 | calibration_displays[name] = display
97 |
98 | ax_calibration_curve.grid()
99 | ax_calibration_curve.set_title(f"Calibration plots (bins = {n_bins})")
100 |
101 | # Add histogram
102 | grid_positions = [(2, 0), (2, 1), (3, 0), (3, 1)]
103 | for i, (_, name) in enumerate(clf_list):
104 | row, col = grid_positions[i]
105 | ax = fig.add_subplot(gs[row, col])
106 |
107 | ax.hist(
108 | calibration_displays[name].y_prob,
109 | range=(0, 1),
110 | bins=n_bins,
111 | label=name,
112 | color=colors(i),
113 | )
114 | ax.set(title=name, xlabel="Mean predicted probability", ylabel="Count")
115 |
116 | plt.tight_layout()
117 | plt.show()
118 |
119 | return
120 |
121 |
122 | def calculate_classification_metrics(clf_list: list, X_train: pd.DataFrame, y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame) -> tuple[pd.DataFrame, list]:
123 | """
124 | Calculates classification metrics for a list of classifiers. Brier score, log loss, precision, recall, f1, and roc_auc are calculated.
125 |
126 | Args:
127 | clf_list (list): the classifiers to calculate metrics for
128 | X_train (pd.DataFrame): training data
129 | y_train (pd.DataFrame): labels for training data
130 | X_test (pd.DataFrame): test data
131 | y_test (pd.DataFrame): labels for test data
132 |
133 | Returns:
134 | tuple: (pd.DataFrame) of the metrics and (list) of (fitted model, name) tuples
135 |
136 | FROM: https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html
137 | """
138 |
139 |
140 |
141 | scores = defaultdict(list)
142 |
143 | for i, (clf, name) in enumerate(clf_list):
144 | clf.fit(X_train, y_train)
145 | y_prob = clf.predict_proba(X_test)
146 | y_pred = clf.predict(X_test)
147 | scores["Classifier"].append(name)
148 |
149 | for metric in [brier_score_loss, log_loss]:
150 | score_name = metric.__name__.replace("_", " ").replace("score", "").capitalize()
151 | scores[score_name].append(metric(y_test, y_prob[:, 1]))
152 |
153 | for metric in [precision_score, recall_score, f1_score, roc_auc_score]:
154 | score_name = metric.__name__.replace("_", " ").replace("score", "").capitalize()
155 | scores[score_name].append(metric(y_test, y_pred))
156 |
157 | score_df = pd.DataFrame(scores).set_index("Classifier")
158 | score_df = score_df.round(decimals=3)
159 |
160 | # update clf_list with the trained model
161 | clf_list[i] = (clf, name)
162 |
163 |
164 | return score_df, clf_list
--------------------------------------------------------------------------------
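
A usage sketch for the two helpers above; the classifier choices are illustrative, and `X_train`/`y_train`/`X_test`/`y_test` are assumed to come from the project's train/test split notebooks:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from src.model_training import (
    calculate_classification_metrics,
    plot_calibration_curve,
)

# each entry pairs an unfitted classifier with a display name,
# matching the (clf, name) tuples both helpers expect
clf_list = [
    (LogisticRegression(max_iter=1000), "Logistic Regression"),
    (RandomForestClassifier(n_estimators=200, random_state=42), "Random Forest"),
]

# fits each model, scores it on the test set, and returns the fitted models
score_df, clf_list = calculate_classification_metrics(
    clf_list, X_train, y_train, X_test, y_test
)
print(score_df)

# calibration curves plus per-model histograms of predicted probabilities
plot_calibration_curve(clf_list, X_train, y_train, X_test, y_test, n_bins=10)
```
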
/src/optuna_objectives.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
4 | import xgboost as xgb
5 |
6 | import lightgbm as lgb
7 | from lightgbm import (
8 | early_stopping,
9 | log_evaluation,
10 | )
11 |
12 | from sklearn.model_selection import (
13 | StratifiedKFold,
14 | TimeSeriesSplit,
15 | )
16 |
17 | from sklearn.metrics import (
18 | accuracy_score,
19 | roc_auc_score,
20 | )
21 |
22 |
23 |
24 | def XGB_objective(trial, train, target, STATIC_PARAMS, ENABLE_CATEGORICAL, NUM_BOOST_ROUND, OPTUNA_CV, OPTUNA_FOLDS, SEED):
25 |
26 |
27 | train_oof = np.zeros((train.shape[0],))
28 |
29 | # per-fold DMatrix objects are built inside the CV loop below
32 |
33 | xgb_params= {
34 | 'num_round': trial.suggest_int('num_round', 2, 1000),
35 | 'learning_rate': trial.suggest_float('learning_rate', 1E-3, 1),
36 | 'max_bin': trial.suggest_int('max_bin', 2, 1000),
37 | 'max_depth': trial.suggest_int('max_depth', 1, 8),
38 | 'alpha': trial.suggest_float('alpha', 1E-16, 12),
39 | 'gamma': trial.suggest_float('gamma', 1E-16, 12),
40 | 'reg_lambda': trial.suggest_float('reg_lambda', 1E-16, 12),
41 | 'colsample_bytree': trial.suggest_float('colsample_bytree', 1E-16, 1.0),
42 | 'subsample': trial.suggest_float('subsample', 1E-16, 1.0),
43 | 'min_child_weight': trial.suggest_float('min_child_weight', 1E-16, 12),
44 | 'scale_pos_weight': trial.suggest_int('scale_pos_weight', 1, 15),
45 | }
46 |
47 | xgb_params = xgb_params | STATIC_PARAMS
48 |
49 | #pruning_callback = optuna.integration.XGBoostPruningCallback(trial, "evaluation-auc")
50 |
51 | if OPTUNA_CV == "StratifiedKFold":
52 | kf = StratifiedKFold(n_splits=OPTUNA_FOLDS, shuffle=True, random_state=SEED)
53 | elif OPTUNA_CV == "TimeSeriesSplit":
54 | kf = TimeSeriesSplit(n_splits=OPTUNA_FOLDS)
55 |
56 |
57 | for f, (train_ind, val_ind) in (enumerate(kf.split(train, target))):
58 |
59 | train_df, val_df = train.iloc[train_ind], train.iloc[val_ind]
60 |
61 | train_target, val_target = target[train_ind], target[val_ind]
62 |
63 | train_dmatrix = xgb.DMatrix(train_df, label=train_target,enable_categorical=ENABLE_CATEGORICAL)
64 | val_dmatrix = xgb.DMatrix(val_df, label=val_target,enable_categorical=ENABLE_CATEGORICAL)
65 |
66 |
67 | model = xgb.train(xgb_params,
68 | train_dmatrix,
69 | num_boost_round = NUM_BOOST_ROUND,
70 | #callbacks=[pruning_callback],
71 | )
72 |
73 | temp_oof = model.predict(val_dmatrix)
74 |
75 | train_oof[val_ind] = temp_oof
76 |
77 | #print(roc_auc_score(val_target, temp_oof))
78 |
79 | val_score = roc_auc_score(target, train_oof)
80 |
81 | return val_score
82 |
83 |
84 | def LGB_objective(trial, train, target, category_columns, STATIC_PARAMS, ENABLE_CATEGORICAL, NUM_BOOST_ROUND, OPTUNA_CV, OPTUNA_FOLDS, SEED, EARLY_STOPPING):
85 |
86 |
87 |
88 | train_oof = np.zeros((train.shape[0],))
89 |
90 |
91 | lgb_params= {
92 | "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
93 | "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
94 | "learning_rate": trial.suggest_loguniform('learning_rate', 1e-4, 0.5),
95 | "max_depth": trial.suggest_categorical('max_depth', [5,10,20,40,100, -1]),
96 | "n_estimators": trial.suggest_int("n_estimators", 50, 200000),
97 | "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
98 | "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
99 | "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
100 | "num_leaves": trial.suggest_int("num_leaves", 2, 1000),
101 | "min_child_samples": trial.suggest_int("min_child_samples", 5, 300),
102 | "cat_smooth" : trial.suggest_int('min_data_per_groups', 1, 100)
103 | }
104 |
105 | lgb_params = lgb_params | STATIC_PARAMS
106 |
107 | #pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "auc")
108 |
109 | if OPTUNA_CV == "StratifiedKFold":
110 | kf = StratifiedKFold(n_splits=OPTUNA_FOLDS, shuffle=True, random_state=SEED)
111 | elif OPTUNA_CV == "TimeSeriesSplit":
112 | kf = TimeSeriesSplit(n_splits=OPTUNA_FOLDS)
113 |
114 |
115 | for f, (train_ind, val_ind) in (enumerate(kf.split(train, target))):
116 |
117 | train_df, val_df = train.iloc[train_ind], train.iloc[val_ind]
118 |
119 | train_target, val_target = target[train_ind], target[val_ind]
120 |
121 | train_lgbdataset = lgb.Dataset(train_df, label=train_target,categorical_feature=category_columns)
122 | val_lgbdataset = lgb.Dataset(val_df, label=val_target, reference = train_lgbdataset, categorical_feature=category_columns)
123 |
124 |
125 | model = lgb.train(lgb_params,
126 | train_lgbdataset,
127 | valid_sets=val_lgbdataset,
128 | #num_boost_round = NUM_BOOST_ROUND,
129 | callbacks=[#log_evaluation(LOG_EVALUATION),
130 | early_stopping(EARLY_STOPPING,verbose=False),
131 | #pruning_callback,
132 | ]
133 | #verbose_eval= VERBOSE_EVAL,
134 | )
135 |
136 | temp_oof = model.predict(val_df)
137 |
138 | train_oof[val_ind] = temp_oof
139 |
140 | #print(roc_auc_score(val_target, temp_oof))
141 |
142 | val_score = roc_auc_score(target, train_oof)
143 |
144 | return val_score
--------------------------------------------------------------------------------
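
The objectives take their configuration as extra arguments, so one way to wire them into an Optuna study is `functools.partial`; the parameter values below are placeholders, and `train`/`target` are assumed to be prepared upstream:

```python
from functools import partial

import optuna

from src.optuna_objectives import XGB_objective

objective = partial(
    XGB_objective,
    train=train,    # feature dataframe (assumed prepared upstream)
    target=target,  # binary home-team-win labels
    STATIC_PARAMS={"objective": "binary:logistic", "eval_metric": "auc"},
    ENABLE_CATEGORICAL=False,
    NUM_BOOST_ROUND=500,
    OPTUNA_CV="StratifiedKFold",
    OPTUNA_FOLDS=5,
    SEED=42,
)

# the objective returns out-of-fold ROC AUC, so maximize it
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_value, study.best_params)
```
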
/src/streamlit_app.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | import streamlit as st
4 | import hopsworks
5 | import joblib
6 | import pandas as pd
7 | import numpy as np
8 | import json
9 | import time
10 | from datetime import timedelta, datetime
11 | import xgboost as xgb
12 |
13 | from src.hopsworks_utils import (
14 | convert_feature_names,
15 | )
16 |
17 | from src.feature_engineering import (
18 | fix_datatypes,
19 | remove_non_rolling,
20 | )
21 |
22 |
23 | # Load hopsworks API key from .env file
24 |
25 | from dotenv import load_dotenv
26 |
27 | load_dotenv()
28 |
29 | try:
30 | HOPSWORKS_API_KEY = os.environ['HOPSWORKS_API_KEY']
31 | except KeyError:
32 | raise Exception('Set environment variable HOPSWORKS_API_KEY')
33 |
34 |
35 |
36 | def fancy_header(text, font_size=24):
37 |     res = f'<span style="font-size: {font_size}px;">{text}</span>'
38 |     st.markdown(res, unsafe_allow_html=True)
39 |
40 | def get_model(project, model_name, evaluation_metric, sort_metrics_by):
41 | """Retrieve desired model from the Hopsworks Model Registry."""
42 |
43 | mr = project.get_model_registry()
44 | # get best model based on custom metrics
45 | model = mr.get_best_model(model_name,
46 | evaluation_metric,
47 | sort_metrics_by)
48 |
49 |     # download model from Hopsworks and load the pickled estimator
50 |     model_dir = model.download()
51 |     with open(os.path.join(model_dir, "model.pkl"), 'rb') as f:
52 |         loaded_model = joblib.load(f)
54 |
55 |
56 | return loaded_model
57 |
58 |
59 | # dictionary to convert team ids to team names
60 |
61 | nba_team_names = {
62 | 1610612737: "Atlanta Hawks",
63 | 1610612738: "Boston Celtics",
64 | 1610612739: "Cleveland Cavaliers",
65 | 1610612740: "New Orleans Pelicans",
66 | 1610612741: "Chicago Bulls",
67 | 1610612742: "Dallas Mavericks",
68 | 1610612743: "Denver Nuggets",
69 | 1610612744: "Golden State Warriors",
70 | 1610612745: "Houston Rockets",
71 | 1610612746: "LA Clippers",
72 | 1610612754: "Indiana Pacers",
73 | 1610612747: "Los Angeles Lakers",
74 | 1610612763: "Memphis Grizzlies",
75 | 1610612748: "Miami Heat",
76 | 1610612749: "Milwaukee Bucks",
77 | 1610612750: "Minnesota Timberwolves",
78 | 1610612751: "Brooklyn Nets",
79 | 1610612752: "New York Knicks",
80 | 1610612753: "Orlando Magic",
81 | 1610612755: "Philadelphia 76ers",
82 | 1610612756: "Phoenix Suns",
83 | 1610612757: "Portland Trail Blazers",
84 | 1610612758: "Sacramento Kings",
85 | 1610612759: "San Antonio Spurs",
86 | 1610612760: "Oklahoma City Thunder",
87 | 1610612761: "Toronto Raptors",
88 | 1610612762: "Utah Jazz",
89 | 1610612764: "Washington Wizards",
90 | 1610612765: "Detroit Pistons",
91 | 1610612766: "Charlotte Hornets",
92 | }
93 |
94 | # Streamlit app
95 | st.title('NBA Prediction Project')
96 |
97 | st.sidebar.header('⚙️ Working Progress')
98 | progress_bar = st.sidebar.progress(0)
99 | st.write(36 * "-")
100 | fancy_header('\n📡 Connecting to Hopsworks Feature Store...')
101 |
102 |
103 | # Connect to Hopsworks Feature Store and get Feature Group
104 | project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
105 | fs = project.get_feature_store()
106 |
107 | rolling_stats_fg = fs.get_feature_group(
108 | name="rolling_stats",
109 | version=2,
110 | )
111 |
112 | st.write("Successfully connected!✔️")
113 | progress_bar.progress(20)
114 |
115 |
116 |
117 | # Get data from Feature Store
118 | st.write(36 * "-")
119 | fancy_header('\n☁️ Retrieving data from Feature Store...')
120 |
121 | # filter new games that are scheduled for today
122 | # these are games where no points have been scored yet
123 | ds_query = rolling_stats_fg.filter(rolling_stats_fg.pts_home == 0)
124 | df_todays_matches = ds_query.read()
125 |
126 | if df_todays_matches.shape[0] == 0:
127 | progress_bar.progress(100)
128 | st.write()
129 |     fancy_header('\n 🤷‍♂️ No games scheduled for today! 🤷‍♂️')
130 | st.write()
131 | st.write("Try again tomorrow!")
132 | st.write()
133 | st.write("NBA season and postseason usually runs from October to June.")
134 | st.stop()
135 |
136 | st.write("Successfully retrieved!✔️")
137 | progress_bar.progress(40)
138 | print(df_todays_matches.head(5))
139 |
140 |
141 | # Prepare data for prediction
142 | st.write(36 * "-")
143 | fancy_header('\n☁️ Processing Data for prediction...')
144 |
145 | # convert feature names back to mixed case
146 | df_todays_matches = convert_feature_names(df_todays_matches)
147 |
148 | # Add a column that displays the matchup using the team names
149 | # this will make the display more meaningful
150 | df_todays_matches['MATCHUP'] = df_todays_matches['VISITOR_TEAM_ID'].map(nba_team_names) + " @ " + df_todays_matches['HOME_TEAM_ID'].map(nba_team_names)
151 |
152 | # fix date and other types
153 | df_todays_matches = fix_datatypes(df_todays_matches)
154 |
155 | # remove features not used by model
156 | drop_columns = ['TARGET', 'GAME_DATE_EST', 'GAME_ID', ]
157 | df_todays_matches = df_todays_matches.drop(drop_columns, axis=1)
158 |
159 | # remove stats from today's games - these are blank (the game hasn't been played) and are not used by the model
160 | use_columns = remove_non_rolling(df_todays_matches)
161 | X = df_todays_matches[use_columns]
162 |
163 | # MATCHUP is just for informational display, not used by model
164 | X = X.drop('MATCHUP', axis=1)
165 |
166 | #X_dmatrix = xgb.DMatrix(X) # convert to DMatrix for XGBoost
167 |
168 | st.write(df_todays_matches['MATCHUP'])
169 |
170 | st.write("Successfully processed!✔️")
171 | progress_bar.progress(60)
172 |
173 |
174 | # Load model from Hopsworks Model Registry
175 | st.write(36 * "-")
176 | fancy_header(f"Loading Best Model...")
177 |
178 | model = get_model(project=project,
179 | model_name="xgboost",
180 | evaluation_metric="AUC",
181 | sort_metrics_by="max")
182 |
183 | st.write("Successfully loaded!✔️")
184 | progress_bar.progress(80)
185 |
186 |
187 |
188 | # Predict winning probabilities of home team
189 | st.write(36 * "-")
190 | fancy_header(f"Predicting Winning Probabilities...")
191 |
192 |
193 | #preds = model.predict(X_dmatrix)
194 | preds = model.predict_proba(X)[:,1]
195 |
196 | df_todays_matches['HOME_TEAM_WIN_PROBABILITY'] = preds
197 |
198 | st.dataframe(df_todays_matches[['MATCHUP', 'HOME_TEAM_WIN_PROBABILITY']])
199 |
200 | progress_bar.progress(100)
201 | st.button("Re-run")
202 |
--------------------------------------------------------------------------------
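
To sanity-check the prediction step outside Streamlit, the downloaded model artifact can be loaded and queried directly; the sample-features path below is hypothetical:

```python
import joblib
import pandas as pd

# load the pickled model that get_model() retrieves from the Model Registry
with open("model.pkl", "rb") as f:
    model = joblib.load(f)

# hypothetical CSV with one row of engineered features per scheduled game
X = pd.read_csv("data/todays_features.csv")

# probability that the home team wins each game
print(model.predict_proba(X)[:, 1])
```
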
/src/webscraping.py:
--------------------------------------------------------------------------------
1 |
2 | import pandas as pd
3 | import numpy as np
4 |
5 | import os
6 |
7 |
8 |
9 | #if using scrapingant, import these
10 | from scrapingant_client import ScrapingAntClient
11 |
12 | # if using selenium and chrome, import these
13 | from selenium import webdriver
14 | from selenium.webdriver.chrome.service import Service as ChromiumService
15 | from selenium.webdriver.common.by import By
16 | from selenium.webdriver.chrome.options import Options
17 | from webdriver_manager.core.utils import ChromeType
18 | from webdriver_manager.chrome import ChromeDriverManager
19 |
20 | # if using selenium and firefox, import these
21 | from selenium.webdriver.firefox.service import Service as FirefoxService
22 | from webdriver_manager.firefox import GeckoDriverManager
23 |
24 |
25 | from bs4 import BeautifulSoup as soup
26 |
27 | from datetime import datetime, timedelta
28 | from pytz import timezone
29 |
30 | from pathlib import Path #for Windows/Linux compatibility
31 | DATAPATH = Path(r'data')
32 |
33 | import time
34 |
35 |
36 |
37 |
38 | def activate_web_driver(browser: str) -> webdriver:
39 | """
40 | Activate selenium web driver for use in scraping
41 |
42 | Args:
43 | browser (str): the name of the browser to use, either "firefox" or "chromium"
44 |
45 | Returns:
46 | the selected webdriver
47 | """
48 |
49 | # options for selenium webdrivers, used to assist headless scraping. Still ran into issues, so I used scrapingant instead when running from github actions
50 | options = [
51 | "--headless",
52 | "--window-size=1920,1200",
53 | "--start-maximized",
54 | "--no-sandbox",
55 | "--disable-dev-shm-usage",
56 | "--disable-gpu",
57 | "--ignore-certificate-errors",
58 | "--disable-extensions",
59 | "--disable-popup-blocking",
60 | "--disable-notifications",
61 | "--remote-debugging-port=9222", #https://stackoverflow.com/questions/56637973/how-to-fix-selenium-devtoolsactiveport-file-doesnt-exist-exception-in-python
62 | "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
63 | "--disable-blink-features=AutomationControlled",
64 | ]
65 |
66 | if browser == "firefox":
67 | service = FirefoxService(executable_path=GeckoDriverManager().install())
68 |
69 | firefox_options = webdriver.FirefoxOptions()
70 | for option in options:
71 | firefox_options.add_argument(option)
72 |
73 | driver = webdriver.Firefox(service=service, options=firefox_options)
74 |
75 | else:
76 | service = ChromiumService(ChromeDriverManager(chrome_type=ChromeType.CHROMIUM).install())
77 |
78 | chrome_options = Options()
79 | for option in options:
80 | chrome_options.add_argument(option)
81 |
82 | driver = webdriver.Chrome(service=service, options=chrome_options)
83 |
84 | return driver
85 |
86 |
87 |
88 | def get_new_games(SCRAPINGANT_API_KEY: str, driver: webdriver) -> pd.DataFrame:
89 |
90 | # set search for previous days games; use 2 days to catch up in case of a failed run
91 | DAYS = 2
92 | SEASON = "" #no season will cause website to default to current season, format is "2022-23"
93 | TODAY = datetime.now(timezone('EST')) #nba.com uses US Eastern Standard Time
94 | LASTWEEK = (TODAY - timedelta(days=DAYS))
95 | DATETO = TODAY.strftime("%m/%d/%y")
96 | DATEFROM = LASTWEEK.strftime("%m/%d/%y")
97 |
98 |
99 | df = scrape_to_dataframe(api_key=SCRAPINGANT_API_KEY, driver=driver, Season=SEASON, DateFrom=DATEFROM, DateTo=DATETO, )
100 |
101 | df = convert_columns(df)
102 |
103 |
104 | df = combine_home_visitor(df)
105 |
106 | return df
107 |
108 |
109 |
110 | def parse_ids(data_table):
111 |
112 | # TEAM_ID and GAME_ID are encoded in href= links
113 | # find all the hrefs, add them to a list
114 | # then parse out a list for teams ids and game ids
115 | # and convert these to pandas series
116 |
117 | CLASS_ID = 'Anchor_anchor__cSc3P' #determined by visual inspection of page source code
118 |
119 | # get all the links
120 | links = data_table.find_all('a', {'class':CLASS_ID})
121 |
122 | # get the href part (web addresses)
123 | # href="/stats/team/1610612740" for teams
124 | # href="/game/0022200191" for games
125 | links_list = [i.get("href") for i in links]
126 |
127 | # create a series using last 10 digits of the appropriate links
128 | team_id = pd.Series([i[-10:] for i in links_list if ('stats' in i)])
129 | game_id = pd.Series([i[-10:] for i in links_list if ('/game/' in i)])
130 |
131 | return team_id, game_id
132 |
133 |
134 |
135 | def scrape_to_dataframe(api_key, driver, Season, DateFrom="NONE", DateTo="NONE", stat_type='standard'):
136 |
137 | # go to boxscores webpage at nba.com
138 | # check if the data table is split over multiple pages
139 | # if so, then select the "ALL" choice in pulldown menu to show all on one page
140 | # extract out the html table and convert to dataframe
141 | # parse out GAME_ID and TEAM_ID from href links
142 | # and add these to dataframe
143 |
144 | # if season not provided, then will default to current season
145 | # if DateFrom and DateTo not provided, then don't include in url - pull the whole season
146 | if stat_type == 'standard':
147 | nba_url = "https://www.nba.com/stats/teams/boxscores"
148 | else:
149 | nba_url = "https://www.nba.com/stats/teams/boxscores-"+ stat_type
150 |
151 | if not Season:
152 | nba_url = nba_url + "?DateFrom=" + DateFrom + "&DateTo=" + DateTo
153 | else:
154 | if DateFrom == "NONE" and DateTo == "NONE":
155 | nba_url = nba_url + "?Season=" + Season
156 | else:
157 | nba_url = nba_url + "?Season=" + Season + "&DateFrom=" + DateFrom + "&DateTo=" + DateTo
158 |
159 |     #try 3 times to load the page correctly; scrapingant can sometimes fail on its first try
160 | for i in range(3):
161 | if api_key == "": #if no api key, then use selenium
162 | driver.get(nba_url)
163 | time.sleep(10)
164 | source = soup(driver.page_source, 'html.parser')
165 | else: #if api key, then use scrapingant
166 | client = ScrapingAntClient(token=api_key)
167 | result = client.general_request(nba_url)
168 | source = soup(result.content, 'html.parser')
169 |
170 | # the data table is the key dynamic element that may fail to load
171 | CLASS_ID_TABLE = 'Crom_table__p1iZz' #determined by visual inspection of page source code
172 | data_table = source.find('table', {'class':CLASS_ID_TABLE})
173 |
174 | if data_table is None:
175 | time.sleep(10)
176 | else:
177 | break
178 |
179 |
180 | #check for more than one page
181 | CLASS_ID_PAGINATION = "Pagination_pageDropdown__KgjBU" #determined by visual inspection of page source code
182 | pagination = source.find('div', {'class':CLASS_ID_PAGINATION})
183 |
184 | if api_key == "": #if using selenium, then check for multiple pages
185 | if pagination is not None:
186 | # if multiple pages, first activate pulldown option for All pages to show all rows on one page
187 | CLASS_ID_DROPDOWN = "DropDown_select__4pIg9" #determined by visual inspection of page source code
188 | page_dropdown = driver.find_element(By.XPATH, "//*[@class='" + CLASS_ID_PAGINATION + "']//*[@class='" + CLASS_ID_DROPDOWN + "']")
189 |
190 | page_dropdown.send_keys("ALL") # show all pages
191 | #page_dropdown.click() doesn't work in headless mode
192 | time.sleep(3)
193 | driver.execute_script('arguments[0].click()', page_dropdown) #click() didn't work in headless mode, used this workaround (https://stackoverflow.com/questions/57741875)
194 |
195 | #refresh page data now that it contains all rows of the table
196 | time.sleep(3)
197 | source = soup(driver.page_source, 'html.parser')
198 | data_table = source.find('table', {'class':CLASS_ID_TABLE})
199 |
200 | #print(source)
201 |
202 | # convert the html table to a dataframe
203 | dfs = pd.read_html(str(data_table), header=0)
204 | df = pd.concat(dfs)
205 |
206 | # pull out teams ids and game ids from hrefs and add these to the dataframe
207 | TEAM_ID, GAME_ID = parse_ids(data_table)
208 | df['TEAM_ID'] = TEAM_ID
209 | df['GAME_ID'] = GAME_ID
210 |
211 |
212 | return df
213 |
214 | def convert_columns(df):
215 |
216 | # convert the dataframe to same format and column names as main data
217 |
218 | # drop columns not used
219 | drop_columns = ['Team', 'MIN', 'FGM', 'FGA', '3PM', '3PA', 'FTM', 'FTA', 'OREB', 'DREB', 'STL', 'BLK', 'TOV', 'PF', '+/-',]
220 | df = df.drop(columns=drop_columns)
221 |
222 | #rename columns to match existing dataframes
223 | mapper = {
224 | 'Match Up': 'HOME',
225 | 'Game Date': 'GAME_DATE_EST',
226 | 'W/L': 'HOME_TEAM_WINS',
227 | 'FG%': 'FG_PCT',
228 | '3P%': 'FG3_PCT',
229 | 'FT%': 'FT_PCT',
230 | }
231 | df = df.rename(columns=mapper)
232 |
233 | # reformat column data
234 |
235 | # make HOME true if @ is in the text
236 | # (Match Ups: POR @ DAL or DAl vs POR. Home team always has @)
237 | df['HOME'] = df['HOME'].apply(lambda x: 1 if '@' in x else 0)
238 |
239 | # convert wins to home team wins
240 | # incomplete games will be NaN
241 | df = df[df['HOME_TEAM_WINS'].notna()]
242 | # convert W/L to 1/0
243 | df['HOME_TEAM_WINS'] = df['HOME_TEAM_WINS'].apply(lambda x: 1 if 'W' in x else 0)
244 | # no need to do anything else, win/loss of visitor teams is not used in final dataframe
245 |
246 | #convert date format
247 | df['GAME_DATE_EST'] = pd.to_datetime(df['GAME_DATE_EST'])
248 | df['GAME_DATE_EST'] = df['GAME_DATE_EST'].dt.strftime('%Y-%m-%d')
249 | df['GAME_DATE_EST'] = pd.to_datetime(df['GAME_DATE_EST'])
250 |
251 | return df
252 |
253 | def combine_home_visitor(df):
254 |
255 | # each game currently has one row for home team stats
256 | # and one row for visitor team stats
257 |     # these will be combined into a single row
258 |
259 | # separate home vs visitor
260 | home_df = df[df['HOME'] == 1]
261 | visitor_df = df[df['HOME'] == 0]
262 |
263 | # HOME column no longer needed
264 | home_df = home_df.drop(columns='HOME')
265 | visitor_df = visitor_df.drop(columns='HOME')
266 |
267 | # HOME_TEAM_WINS and GAME_DATE_EST columns not needed for visitor
268 | visitor_df = visitor_df.drop(columns=['HOME_TEAM_WINS','GAME_DATE_EST'])
269 |
270 | # rename TEAM_ID columns
271 | home_df = home_df.rename(columns={'TEAM_ID':'HOME_TEAM_ID'})
272 | visitor_df = visitor_df.rename(columns={'TEAM_ID':'VISITOR_TEAM_ID'})
273 |
274 | # merge the home and visitor data
275 | df = pd.merge(home_df, visitor_df, how="left", on=["GAME_ID"],suffixes=('_home', '_away'))
276 |
277 | # add a column for SEASON
278 | # determine SEASON by parsing GAME_ID
279 | # (e.g. 0022200192 1st 2 digits not used, 3rd digit 2 = regular season, 4th and 5th digit = SEASON)
280 | game_id = df['GAME_ID'].iloc[0]
281 | season = game_id[3:5]
282 |     season = "20" + season
283 | df['SEASON'] = season
284 |
285 | #print(df)
286 |
287 | #convert all object columns to int64
288 | for field in df.select_dtypes(include=['object']).columns.tolist():
289 | df[field] = df[field].astype('int64')
290 |
291 | return df
292 |
293 | def get_todays_matchups(api_key: str, driver: webdriver) -> list:
294 |
295 | '''
296 | Goes to NBA Schedule and scrapes the teams playing today
297 | '''
298 |
299 | NBA_SCHEDULE = "https://www.nba.com/schedule"
300 |
301 |
302 | if api_key == "": #if no api key, then use selenium
303 | driver.get(NBA_SCHEDULE)
304 | time.sleep(10)
305 | source = soup(driver.page_source, 'html.parser')
306 | else: #if api key, then use scrapingant
307 | client = ScrapingAntClient(token=api_key)
308 | result = client.general_request(NBA_SCHEDULE)
309 | source = soup(result.content, 'html.parser')
310 |
311 |
312 |     # Get the block of all of today's games
313 |     # Sometimes the results of yesterday's games are listed first, then today's games
314 |     # Other times yesterday's games are not listed, or, as the playoffs approach, future games are listed
315 |     # We check the date on the first div; if it is not today's date, we move on to the next div
316 | CLASS_GAMES_PER_DAY = "ScheduleDay_sdGames__NGdO5" # the div containing all games for a day
317 | CLASS_DAY = "ScheduleDay_sdDay__3s2Xt" # the heading with the date for the games (e.g. "Wednesday, February 1")
318 | div_games = source.find('div', {'class':CLASS_GAMES_PER_DAY}) # first div may or may not be yesterday's games or even future games when playoffs approach
319 | div_game_day = source.find('h4', {'class':CLASS_DAY})
320 |     today = datetime.today().strftime('%A, %B %d')[:3] # e.g. "Wednesday, February 1" -> "Wed"; comparing weekday prefixes avoids dealing with leading zeros in the day number
321 | todays_games = None
322 |
323 | while div_games:
324 | print(div_game_day.text[:3])
325 | if today == div_game_day.text[:3]:
326 | todays_games = div_games
327 | break
328 | else:
329 | # move to next div
330 | div_games = div_games.find_next('div', {'class':CLASS_GAMES_PER_DAY})
331 | div_game_day = div_game_day.find_next('h4', {'class':CLASS_DAY})
332 |
333 | if todays_games is None:
334 | # no games today
335 | return None, None
336 |
337 | # Get the teams playing
338 | # Each team listed in todays block will have a href with the specified anchor class
339 |     # e.g. href="/team/1610612738/celtics/" (the team id is the 10 digits after "team/")
340 |     CLASS_ID_TEAM = "Anchor_anchor__cSc3P" #anchor class determined by visual inspection of page source code
341 |     team_links = todays_games.find_all('a', {'class':CLASS_ID_TEAM})
342 |     teams_list = [i.get("href") for i in team_links if i.get("href") and "/team/" in i.get("href")]
343 |
344 |     # teams are listed in pairs for each game, visitor first and then home
345 |     matchups = []
346 |     for i in range(0, len(teams_list), 2):
347 |         visitor_id = teams_list[i].partition("team/")[2][:10]
348 |         home_id = teams_list[i+1].partition("team/")[2][:10]
349 |         matchups.append([home_id, visitor_id]) #ids kept as strings; downstream code casts as needed
350 |
351 |
352 |     # Get the game id for each game from the PREVIEW links
362 | # Each game will have two links with the specified anchor class, one for the preview and one to buy tickets
363 |     # both use the same anchor class, so we keep just the PREVIEW links
364 | CLASS_ID = "Anchor_anchor__cSc3P TabLink_link__f_15h"
365 | links = todays_games.find_all('a', {'class':CLASS_ID})
366 | #print(links)
367 |     links = [i for i in links if "PREVIEW" in i.text]
368 | game_id_list = [i.get("href") for i in links]
369 | #print(game_id_list)
370 |
371 | games = []
372 | for game in game_id_list:
373 |         game_id = game.partition("-00")[2].partition("?")[0] # extract game id from the href
374 | if len(game_id) > 0:
375 | games.append(game_id)
376 |
377 |
378 |
379 | return matchups, games
380 |
--------------------------------------------------------------------------------
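
Putting the scraping helpers together, a local run (Selenium, no ScrapingAnt key) might look like this sketch:

```python
from src.webscraping import (
    activate_web_driver,
    get_new_games,
    get_todays_matchups,
)

# an empty API key makes the helpers fall back to Selenium
driver = activate_web_driver("firefox")

new_games = get_new_games(SCRAPINGANT_API_KEY="", driver=driver)
matchups, game_ids = get_todays_matchups(api_key="", driver=driver)

driver.quit()

print(new_games.head())
print(matchups, game_ids)
```
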