├── .github
│   └── workflows
│       └── production-features-pipeline.yml
├── .gitignore
├── README.md
├── data
│   └── .gitkeep
├── images
│   ├── basketball_money2.jpg
│   ├── calibration.png
│   ├── confusion_matrix.png
│   ├── correlation_bar_chart.png
│   ├── distributions.png
│   ├── neptune.png
│   ├── streamlit2.png
│   ├── streamlit_example.jpg
│   ├── train_vs_test_shapley.jpg
│   └── train_vs_test_shapley.svg
├── notebooks
│   ├── 01_eda.ipynb
│   ├── 02_data_processing.ipynb
│   ├── 03_train_test_split.ipynb
│   ├── 04_model_baseline.ipynb
│   ├── 05_feature_engineering.ipynb
│   ├── 06_feature_selection.ipynb
│   ├── 07_model_testing.ipynb
│   ├── 08_backfill_features.ipynb
│   ├── 09_production_features_pipeline.ipynb
│   ├── 10_model_training_pipeline.ipynb
│   ├── feature_names.json
│   ├── hopsworks.ipynb
│   ├── lightgbm.json
│   ├── model_data.json
│   └── xgboost.json
├── requirements.txt
└── src
    ├── common_functions.py
    ├── data_processing.py
    ├── feature_engineering.py
    ├── hopsworks_utils.py
    ├── model_training.py
    ├── optuna_objectives.py
    ├── streamlit_app.py
    └── webscraping.py

/.github/workflows/production-features-pipeline.yml:
--------------------------------------------------------------------------------
1 | 
2 | name: production-features-pipeline
3 | 
4 | on:
5 |   workflow_dispatch:
6 |   # schedule:
7 |   #   - cron: '0 8 * * *'
8 | 
9 | jobs:
10 |   scrape_features:
11 |     #runs-on: windows-latest
12 |     runs-on: ubuntu-latest
13 |     steps:
14 |       - name: checkout repo content
15 |         uses: actions/checkout@v2
16 | 
17 |       - name: setup python
18 |         uses: actions/setup-python@v2
19 |         with:
20 |           python-version: '3.9.13'
21 | 
22 |       - name: install python packages
23 |         run: |
24 |           python -m pip install --upgrade pip
25 |           pip install -r requirements.txt
26 |           python -m pip install jupyter nbconvert nbformat scrapingant-client
27 | 
28 |       - name: execute python workflows from notebook
29 |         env:
30 |           HOPSWORKS_API_KEY: ${{ secrets.HOPSWORKS_API_KEY }}
31 |           NEPTUNE_API_TOKEN: ${{ secrets.NEPTUNE_API_TOKEN }}
32 |           SCRAPINGANT_API_KEY: ${{ secrets.SCRAPINGANT_API_KEY }}
33 | 
34 |         run:
35 |           jupyter nbconvert --to notebook --execute notebooks/09_production_features_pipeline.ipynb
36 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | /venv
2 | /old
3 | /.ipynb_checkpoints
4 | .hw_api_key
5 | .ipynb_checkpoints
6 | .neptune
7 | __pycache__
8 | ca_chain
9 | certfile
10 | keyfile
11 | geckodriver.log
12 | .env
13 | .vscode
14 | .venv/
15 | *.csv
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # NBA game predictor
3 | [Click here to see it in action](https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/)
4 | 
5 | 
6 | 
7 | Let's connect 🤗
8 | 
9 | [Twitter](https://twitter.com/curiovana) •
10 | [LinkedIn](https://www.linkedin.com/in/chris-munch/)
11 | 
12 | 
13 | 
14 | ![image info](./images/basketball_money2.jpg)
15 | 
16 | #### Table of contents
17 | - [Introduction](#introduction)
18 | - [Problem](#problem-increase-the-profitability-of-betting-on-nba-games)
19 | - [Initial step](#initial-step-predict-the-probability-that-the-home-team-will-win-each-game)
20 | - [Plan](#plan)
21 | - [Overview](#overview)
22 | - [Future Possibilities](#future-possibilities)
23 | - [Structure](#structure)
24 | - [Data](#data)
25 | - [EDA and data processing](#eda-and-data-processing)
26 | - [Train/validation/test split](#train--testvalidation-split)
27 | - [Baseline models](#baseline-models)
28 | - [Feature engineering](#feature-engineering)
29 | - [Model training/testing](#model-trainingtesting)
30 | - [Streamlit app](#streamlit-app)
31 | - [Feedback](#feedback)
32 | 
33 | ## Introduction
34 | 
35 | This project is a demonstration of my ability to quickly learn, develop, and deploy end-to-end machine learning technologies. I am currently seeking to change careers into Machine Learning / Data Science.
36 | 
37 | I chose to predict the winner of NBA games because:
38 | - multiple games are played every day during the season, so I can see how my model performs on a daily basis
39 | - picking a game winner is easy for a casual audience to understand and appreciate
40 | - there is a lot of data available
41 | - it can be used to make money (via betting strategies). I have always been interested in making money.
42 | 
43 | I am actually not really a big fan of the NBA, but I have watched a few games and have a basic knowledge of the sport. I have never done any sports betting either, but I have always loved exploration and discovery; the possibility of finding something that somebody else has "missed" is very appealing to me, especially in terms of competition and of making money.
44 | 
45 | ## Problem: Increase the profitability of betting on NBA games
46 | 
47 | ### Initial Step: Predict the probability that the home team will win each game
48 | 
49 | Machine learning classification models will be used to predict the probability that the home team will win each game, based upon historical data. This is a first step in developing a betting strategy that will increase the profitability of betting on NBA games.
50 | 
51 | *Disclaimer*
52 | 
53 | In reality, a betting strategy is a rather complex problem with many elements beyond simply picking the winner of each game. Huge amounts of manpower and money have been invested in developing such strategies, and it is not likely that a learning project will be able to compete very well with such efforts. However, it may provide an extra element of insight that could be used to improve the profitability of an existing betting strategy.
54 | 
55 | Project Repository: [https://github.com/cmunch1/nba-prediction](https://github.com/cmunch1/nba-prediction)
56 | 
57 | ### Plan
58 | 
59 | - Gradient boosted tree models (XGBoost and LightGBM) will be utilized to determine the probability that the home team will win each game.
60 | - The model probability will be calibrated against the true probability distribution using sklearn's CalibratedClassifierCV.
61 | - The probability of winning will be important in developing betting strategies because such strategies will not bet on every game, just on games with better expected values (see the sketch after this list).
62 | - Pipelines will be set up to scrape new data from the NBA website every day and retrain the model when desired.
63 | - The model will be deployed online using a [streamlit app](https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/) to predict and report winning probabilities every day.
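To illustrate the "better expected values" idea above: a bet is only attractive when the probability-weighted payout exceeds the probability-weighted loss. Below is a minimal sketch of that arithmetic (this is not code from this repository, and the probability and odds are hypothetical numbers):

```python
def expected_value(p_win: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Expected profit of a bet: a win pays stake * (odds - 1), a loss costs the stake."""
    return p_win * stake * (decimal_odds - 1) - (1 - p_win) * stake

p_home = 0.62     # hypothetical calibrated model probability for the home team
odds_home = 1.80  # hypothetical decimal odds (imply a ~0.556 win probability)

ev = expected_value(p_home, odds_home)
print(f"EV per unit staked: {ev:+.3f}")  # +0.116 here, so this game is a candidate bet
```

A strategy built on this would skip any game where the expected value is zero or negative, which is why well-calibrated probabilities matter more here than raw classification accuracy.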
64 | 
65 | 
66 | 
67 | 
68 | ### Overview
69 | 
70 | - Historical game data is retrieved from Kaggle.
71 | - EDA, Data Processing, and Feature Engineering are used to develop the best model using either XGBoost or LightGBM.
72 | - The data and model are added to a serverless Feature Store and Model Registry.
73 | - The model is deployed online as a Streamlit app.
74 | - Pipelines are set up to:
75 |     - Scrape new data from the NBA website and add it to the Feature Store every day using GitHub Actions
76 |     - Retrain the model and tune hyperparameters
77 | 
78 | Tools Used:
79 | 
80 | - VS Code w/ Copilot - IDE
81 | - Pandas - data manipulation
82 | - XGBoost - modeling
83 | - LightGBM - modeling
84 | - Scikit-learn - probability calibration
85 | - Optuna - hyperparameter tuning
86 | - Neptune.ai - experiment tracking
87 | - Selenium - data scraping and processing
88 | - ScrapingAnt - data scraping
89 | - BeautifulSoup - data processing of scraped data
90 | - Hopsworks.ai - Feature Store and Model Registry
91 | - GitHub Actions - running notebooks to scrape new data, predict winning probabilities, and retrain models
92 | - Streamlit - web app deployment
93 | 
94 | ### Future Possibilities
95 | 
96 | Continual improvements might include:
97 | 
98 | - More feature engineering/selection
99 | - More data sources (player stats, injuries, etc...)
100 | - A/B testing against outside and internal models (e.g. other predictor projects, Vegas odds, betting site odds, etc...)
101 | - Tracking model performance over time and testing for model drift
102 | - Developing optimized retraining criteria (e.g. time periods, number of games, model drift, etc...)
103 | - Better data visualizations of model predictions and performance
104 | - Developing betting strategies based upon model predictions
105 | - Ensembling betting strategies with other models and strategies, including human experts
106 | - Tracking model performance against other models and betting strategies
107 | 
108 | 
109 | ### Structure
110 | 
111 | Jupyter Notebooks were used for initial development and testing and are labeled 01 through 10 in the main directory. Notebooks 01 through 06 are primarily historical records and notes for the development process.
112 | 
113 | Key functions were moved to .py files in the src directory once they were stable.
114 | 
115 | Notebooks 07, 09, and 10 are used in production. I chose to keep the notebooks instead of fully converting them to scripts because:
116 | 
117 | - I think they look better in terms of documentation
118 | - I sometimes prefer to peruse the notebook output after model testing and retraining instead of relying solely on experiment tracking logs
119 | - I haven't yet conceptually decided on my preferred way of structuring my model testing pipelines for best reusability and maintainability (e.g. should I use custom wrapper functions to invoke experiment logging so that I can easily change providers, or should I just choose one provider and stick with their API?)
120 | 
121 | 
122 | 
123 | ### Data
124 | 
125 | Data from the 2013 through 2021 seasons has been archived on Kaggle. New data is scraped from the NBA website.
126 | 
127 | Currently available data includes:
128 | 
129 | - games_details.csv .. (each-game player stats for everyone on the roster)
130 | - games.csv .......... (each-game team stats: final scores, points scored, field-goal & free-throw percentages, etc...)
131 | - players.csv ........ (index of players' names and teams)
132 | - ranking.csv ........ (incremental daily record of standings, games played, won, lost, win%, home record, road record)
133 | - teams.csv .......... (index of team info such as city and arena names, and also head coach)
134 | 
135 | NOTES
136 | - games.csv is the primary data source and will be the only data used initially
137 | - games_details.csv details individual player stats for each game and may be added to the model later
138 | - ranking.csv data is essentially cumulative averages from the beginning of the season and is not really needed, as these and other rolling averages can be calculated from the games.csv data
139 | 
140 | 
141 | **New Data**
142 | 
143 | New data is scraped from [https://www.nba.com/stats/teams/boxscores](https://www.nba.com/stats/teams/boxscores)
144 | 
145 | 
146 | **Data Leakage**
147 | 
148 | The data for each game are stats for the *completed* game. We want to predict the winner *before* the game is played, not after, so the model should only use data that would be available before the game is played. Our model features will primarily be rolling stats for the previous games (e.g. average assists for the previous 5 games) while excluding the current game, as sketched below.
149 | 
150 | I mention this because I did see several similar projects online that failed to take this into account. If the goal is simply to identify which stats are important for winning games, then the model can be trained on the entire dataset. However, if the goal is to predict the winner of a game, as we are trying to do, then the model must be trained on data that would only be available before the game is played.
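To make the leakage rule concrete, here is a minimal sketch of a leakage-safe rolling feature (toy data; this is not the project's actual feature-engineering code). The `shift(1)` moves each team's history back one game, so the rolling mean never sees the stats of the game being predicted:

```python
import pandas as pd

# Toy box scores: two teams, four games each (made-up numbers)
games = pd.DataFrame({
    "TEAM_ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "GAME_DATE_EST": list(pd.date_range("2021-01-01", periods=4)) * 2,
    "PTS": [100, 110, 96, 120, 99, 105, 112, 101],
}).sort_values(["TEAM_ID", "GAME_DATE_EST"])

# Average points over the previous 3 games, excluding the current game
games["PTS_AVG_LAST_3"] = (
    games.groupby("TEAM_ID")["PTS"]
         .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)

print(games)  # each team's first game gets NaN - there is no prior history yet
```

Without the `shift(1)`, the current game's final stats would leak into its own feature row, which is exactly the mistake described above.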
151 | 
152 | ### EDA and Data Processing
153 | 
154 | Exploratory Data Analysis (EDA) and Data Processing are summarized and detailed in the notebooks. Some examples include:
155 | 
156 | Histograms of various features
157 | 
158 | ![distributions](./images/distributions.png)
159 | 
160 | Correlations between features
161 | 
162 | ![correlations](./images/correlation_bar_chart.png)
163 | 
164 | 
165 | ### Train / Test/Validation Split
166 | 
167 | - The latest season is used as Test/Validation data and previous seasons are used as Train data
168 | 
169 | ### Baseline Models
170 | 
171 | Simple If-Then Models
172 | 
173 | - Home team always wins (Accuracy = 0.59, AUC = 0.50 on Train data; Accuracy = 0.49, AUC = 0.50 on Test data)
174 | 
175 | ML Models
176 | 
177 | - LightGBM (Accuracy = 0.58, AUC = 0.64 on Test data)
178 | - XGBoost (Accuracy = 0.59, AUC = 0.61 on Test data)
179 | 
180 | ### Feature Engineering
181 | 
182 | - Convert the game date to month only
183 | - Compile rolling means over various time periods for each team as home team and as visitor team
184 | - Compile the current win streak for each team as home team and as visitor team
185 | - Compile head-to-head matchup data for each team pair
186 | - Compile rolling means over various time periods for each team regardless of home or visitor status
187 | - Compile the current win streak for each team regardless of home or visitor status
188 | - Subtract the league-average rolling means from each team's rolling means
189 | 
190 | 
191 | ### Model Training/Testing
192 | 
193 | **Models**
194 | - LightGBM
195 | - XGBoost
196 | 
197 | The native Python API (rather than the Scikit-learn wrapper) is used for initial testing of both models because of its easy access to built-in Shapley values, which are used for feature importance analysis and for adversarial validation. (Since Shapley values are local to each dataset, they can be used to determine whether the train and test datasets have the same feature importances. If they do not, it may indicate that the model does not generalize very well.)
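As a rough, self-contained illustration of this adversarial-validation check (synthetic data; this is not the project's notebook code), LightGBM's native API exposes per-row Shapley values through `pred_contrib=True`:

```python
import numpy as np
import lightgbm as lgb

# Synthetic stand-ins for the real train/test feature matrices
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(200, 4))

booster = lgb.train({"objective": "binary", "verbosity": -1},
                    lgb.Dataset(X_train, label=y_train))

def mean_abs_shap(bst, X):
    contribs = bst.predict(X, pred_contrib=True)  # last column is the bias term
    return np.abs(contribs[:, :-1]).mean(axis=0)

# Similar magnitudes/rankings suggest the two sets "agree" on which features matter;
# large disagreements hint that the model may not generalize to the test set.
print("train:", mean_abs_shap(booster, X_train).round(3))
print("test: ", mean_abs_shap(booster, X_test).round(3))
```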
198 | 
199 | The Scikit-learn wrapper is used later in production because it allows for easier probability calibration using sklearn's CalibratedClassifierCV.
200 | 
201 | 
202 | 
203 | **Evaluation**
204 | - AUC is the primary metric; Accuracy is a secondary metric (it is more meaningful to casual users)
205 | - Shapley values compared: Train set vs Test/Validation set
206 | - Test/Validation set is split: early half vs later half
207 | 
208 | ![train vs test shapley](./images/train_vs_test_shapley.jpg)
209 | 
210 | 
211 | **Experiment Tracking**
212 | 
213 | Notebook 07 integrates Neptune.ai for experiment tracking and Optuna for hyperparameter tuning.
214 | 
215 | Experiment tracking logs can be viewed here: [https://app.neptune.ai/cmunch1/nba-prediction/experiments?split=tbl&dash=charts&viewId=979e20ed-e172-4c33-8aae-0b1aa1af3602](https://app.neptune.ai/cmunch1/nba-prediction/experiments?split=tbl&dash=charts&viewId=979e20ed-e172-4c33-8aae-0b1aa1af3602)
216 | 
217 | ![neptune](./images/neptune.png)
218 | 
219 | 
220 | **Probability Calibration**
221 | 
222 | Scikit-learn's CalibratedClassifierCV is used to ensure that the model probabilities are calibrated against the true probability distribution. The Brier loss score is used by the software to automatically select the best calibration method (sigmoid, isotonic, or none).
223 | 
224 | ![calibration](./images/calibration.png)
225 | 
226 | 
227 | 
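The real calibration logic lives in notebook 07; here is a rough, self-contained sketch of the selection idea (synthetic data, LightGBM as the base model): fit an uncalibrated model plus sigmoid- and isotonic-calibrated versions, then keep whichever has the lowest Brier loss on held-out data.

```python
from lightgbm import LGBMClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {"none": LGBMClassifier(verbose=-1).fit(X_tr, y_tr)}
for method in ("sigmoid", "isotonic"):
    cal = CalibratedClassifierCV(LGBMClassifier(verbose=-1), method=method, cv=5)
    candidates[method] = cal.fit(X_tr, y_tr)

# A lower Brier loss means the predicted probabilities sit closer to observed outcomes
scores = {name: brier_score_loss(y_val, clf.predict_proba(X_val)[:, 1])
          for name, clf in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "-> best method:", best)
```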
228 | ### Production Features Pipeline
229 | 
230 | Notebook 09 is run from a GitHub Actions workflow every morning.
231 | 
232 | - It scrapes the stats from the previous day's games, updates all the rolling statistics and streaks, and adds them to the Feature Store.
233 | - It scrapes the upcoming game matchups for the current day and adds them to the Feature Store so that the Streamlit app can use them to make its daily predictions.
234 | 
235 | A variable can be set to use either Selenium or ScrapingAnt for scraping the data. ScrapingAnt is used in production because of its built-in proxy server.
236 | 
237 | - The Selenium notebook worked fine when run locally, but there were issues when running the notebook in GitHub Actions, likely due to the IP address and anti-bot measures on the NBA website (which would require a proxy server to address)
238 | - ScrapingAnt is a cloud-based scraper with a Python API that handles the proxy server issues. An account is required, but the free account is sufficient for this project.
239 | 
240 | ### Model Training Pipeline
241 | 
242 | Notebook 10 retrieves the most current data, executes Notebook 07 to handle hyperparameter tuning, model training, and calibration, and then adds the model to the Model Registry. The time periods used for the train set and test set can be adjusted so that the model can be tested only on the most current games.
243 | 
244 | ### Streamlit App
245 | 
246 | The Streamlit app is deployed at streamlit.io and can be accessed here: [https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/](https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/)
247 | 
248 | It uses the model in the Model Registry to predict the win probability of the home team for the current day's upcoming games.
249 | 
250 | 
251 | 
252 | ### Feedback
253 | 
254 | Thanks for taking the time to read about my project. This is my primary "portfolio" project in my quest to change careers and find an entry-level position in Machine Learning / Data Science. I appreciate any feedback.
255 | 
256 | Project Repository: [https://github.com/cmunch1/nba-prediction](https://github.com/cmunch1/nba-prediction)
257 | 
258 | My LinkedIn profile: [https://www.linkedin.com/in/chris-munch/](https://www.linkedin.com/in/chris-munch/)
259 | 
260 | Twitter: [https://twitter.com/curiovana](https://twitter.com/curiovana)
261 | 
--------------------------------------------------------------------------------
/data/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/data/.gitkeep
--------------------------------------------------------------------------------
/images/basketball_money2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/basketball_money2.jpg
--------------------------------------------------------------------------------
/images/calibration.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/calibration.png
--------------------------------------------------------------------------------
/images/confusion_matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/confusion_matrix.png
--------------------------------------------------------------------------------
/images/correlation_bar_chart.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/correlation_bar_chart.png
--------------------------------------------------------------------------------
/images/distributions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/distributions.png
--------------------------------------------------------------------------------
/images/neptune.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/neptune.png
--------------------------------------------------------------------------------
/images/streamlit2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/streamlit2.png
--------------------------------------------------------------------------------
/images/streamlit_example.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/streamlit_example.jpg
--------------------------------------------------------------------------------
/images/train_vs_test_shapley.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/train_vs_test_shapley.jpg
--------------------------------------------------------------------------------
/notebooks/03_train_test_split.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "id": "defc34ab-7465-454b-a148-f18f857eafaa",
6 |    "metadata": {},
7 |    "source": [
8 |     "### Train / Test Split\n",
9 |     "\n",
10 |     "This notebook splits the data into train data and test data.\n",
11 |     "\n",
12 |     "It selects the latest season and uses this as the test data."
13 |    ]
14 |   },
15 |   {
16 |    "cell_type": "code",
17 |    "execution_count": 1,
18 |    "id": "f270d24d-731b-4a33-a737-599871c1b024",
19 |    "metadata": {},
20 |    "outputs": [],
21 |    "source": [
22 |     "import pandas as pd\n",
23 |     "import numpy as np\n",
24 |     "\n",
25 |     "from pathlib import Path #for Windows/Linux compatibility\n",
26 |     "DATAPATH = Path(r'data')\n"
27 |    ]
28 |   },
29 |   {
30 |    "cell_type": "code",
31 |    "execution_count": 2,
32 |    "id": "45005c7b-e29c-48cc-adf2-47b4cf6a7863",
33 |    "metadata": {},
34 |    "outputs": [],
35 |    "source": [
36 |     "df = pd.read_csv(DATAPATH / \"transformed.csv\")"
37 |    ]
38 |   },
39 |   {
40 |    "cell_type": "markdown",
41 |    "id": "2ddfe2f2-e69e-4f41-a047-87e2caee29f7",
42 |    "metadata": {},
43 |    "source": [
44 |     "**Split Data to Train and Test**"
45 |    ]
46 |   },
47 |   {
48 |    "cell_type": "code",
49 |    "execution_count": 3,
50 |    "id": "2ec58305-22d8-4e01-ad13-a6052967fb43",
51 |    "metadata": {},
52 |    "outputs": [],
53 |    "source": [
54 |     "latest_season = df['SEASON'].unique().max()\n",
55 |     "\n",
56 |     "train = df[df['SEASON'] < (latest_season)]\n",
57 |     "test = df[df['SEASON'] == latest_season]\n",
58 |     "\n",
59 |     "train.to_csv(DATAPATH / \"train.csv\",index=False)\n",
60 |     "test.to_csv(DATAPATH / \"test.csv\",index=False)\n"
61 |    ]
62 |   }
63 |  ],
64 |  "metadata": {
65 |   "kernelspec": {
66 |    "display_name": "nba3",
67 |    "language": "python",
68 |    "name": "python3"
69 |   },
70 |   "language_info": {
71 |    "codemirror_mode": {
72 |     "name": "ipython",
73 |     "version": 3
74 |    },
75 |    "file_extension": ".py",
76 |    "mimetype": "text/x-python",
77 |    "name": "python",
78 |    "nbconvert_exporter": "python",
79 |    "pygments_lexer": "ipython3",
80 |    "version": "3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]"
81 |   },
82 |   "vscode": {
83 |    "interpreter": {
84 |     "hash": "4655998f62ad965cbd25df51edb717f2326f5df53d53899f0ae604225aa5ae06"
85 |    }
86 |   }
87 |  },
88 |  "nbformat": 4,
89 |  "nbformat_minor": 5
90 | }
91 | 
--------------------------------------------------------------------------------
/notebooks/06_feature_selection.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "id": "ea2619eb-c30b-457d-abfa-a7b4b4dd584c",
6 |    "metadata": {},
7 |    "source": [
8 |     "## Feature Selection"
9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "code",
13 |    "execution_count": 1,
14 |    "id": "f71632a1-ef6a-4404-b662-e8227de30fa1",
15 |    "metadata": {},
16 |    "outputs": [],
17 |    "source": [
18 |     "import pandas as pd\n",
19 |     "import numpy as np\n",
20 |     "\n",
21 |     "pd.set_option('display.max_columns', 500)\n",
22 |     "\n",
23 |     "from sklearn.feature_selection import SelectKBest, chi2\n",
24 |     "\n",
25 |     "\n",
26 |     "import matplotlib.pyplot as plt\n",
27 |     "import seaborn as sns\n",
28 |     "\n",
29 |     "from pathlib import Path #for Windows/Linux compatibility\n",
30 |     "DATAPATH = Path(r'data')\n"
31 |    ]
32 |   },
33 |   {
34 |    "cell_type": "code",
35 |    "execution_count": 2,
36 |    "id": "d4aabdeb-0167-490e-bc37-65a62f7c0821",
37 |    "metadata": {},
38 | 
"outputs": [], 39 | "source": [ 40 | "train = pd.read_csv(DATAPATH / \"train_features.csv\")\n", 41 | "test = pd.read_csv(DATAPATH / \"test_features.csv\")\n" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 4, 47 | "id": "2d3c9ab6-9a07-46c5-8d57-d75df3171036", 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "train.to_csv(DATAPATH / \"train_selected.csv\",index=False)\n", 52 | "test.to_csv(DATAPATH / \"test_selected.csv\",index=False)" 53 | ] 54 | } 55 | ], 56 | "metadata": { 57 | "kernelspec": { 58 | "display_name": "Python 3 (ipykernel)", 59 | "language": "python", 60 | "name": "python3" 61 | }, 62 | "language_info": { 63 | "codemirror_mode": { 64 | "name": "ipython", 65 | "version": 3 66 | }, 67 | "file_extension": ".py", 68 | "mimetype": "text/x-python", 69 | "name": "python", 70 | "nbconvert_exporter": "python", 71 | "pygments_lexer": "ipython3", 72 | "version": "3.9.13" 73 | } 74 | }, 75 | "nbformat": 4, 76 | "nbformat_minor": 5 77 | } 78 | -------------------------------------------------------------------------------- /notebooks/feature_names.json: -------------------------------------------------------------------------------- 1 | {"game_date_est": "GAME_DATE_EST", "game_id": "GAME_ID", "home_team_id": "HOME_TEAM_ID", "visitor_team_id": "VISITOR_TEAM_ID", "season": "SEASON", "pts_home": "PTS_home", "fg_pct_home": "FG_PCT_home", "ft_pct_home": "FT_PCT_home", "fg3_pct_home": "FG3_PCT_home", "ast_home": "AST_home", "reb_home": "REB_home", "pts_away": "PTS_away", "fg_pct_away": "FG_PCT_away", "ft_pct_away": "FT_PCT_away", "fg3_pct_away": "FG3_PCT_away", "ast_away": "AST_away", "reb_away": "REB_away", "home_team_wins": "HOME_TEAM_WINS", "target": "TARGET", "month": "MONTH", "home_team_win_streak": "HOME_TEAM_WIN_STREAK", "home_team_wins_avg_last_3_home": "HOME_TEAM_WINS_AVG_LAST_3_HOME", "home_team_wins_avg_last_7_home": "HOME_TEAM_WINS_AVG_LAST_7_HOME", "home_team_wins_avg_last_10_home": "HOME_TEAM_WINS_AVG_LAST_10_HOME", "home_pts_home_avg_last_3_home": "HOME_PTS_home_AVG_LAST_3_HOME", "home_pts_home_avg_last_7_home": "HOME_PTS_home_AVG_LAST_7_HOME", "home_pts_home_avg_last_10_home": "HOME_PTS_home_AVG_LAST_10_HOME", "home_fg_pct_home_avg_last_3_home": "HOME_FG_PCT_home_AVG_LAST_3_HOME", "home_fg_pct_home_avg_last_7_home": "HOME_FG_PCT_home_AVG_LAST_7_HOME", "home_fg_pct_home_avg_last_10_home": "HOME_FG_PCT_home_AVG_LAST_10_HOME", "home_ft_pct_home_avg_last_3_home": "HOME_FT_PCT_home_AVG_LAST_3_HOME", "home_ft_pct_home_avg_last_7_home": "HOME_FT_PCT_home_AVG_LAST_7_HOME", "home_ft_pct_home_avg_last_10_home": "HOME_FT_PCT_home_AVG_LAST_10_HOME", "home_fg3_pct_home_avg_last_3_home": "HOME_FG3_PCT_home_AVG_LAST_3_HOME", "home_fg3_pct_home_avg_last_7_home": "HOME_FG3_PCT_home_AVG_LAST_7_HOME", "home_fg3_pct_home_avg_last_10_home": "HOME_FG3_PCT_home_AVG_LAST_10_HOME", "home_ast_home_avg_last_3_home": "HOME_AST_home_AVG_LAST_3_HOME", "home_ast_home_avg_last_7_home": "HOME_AST_home_AVG_LAST_7_HOME", "home_ast_home_avg_last_10_home": "HOME_AST_home_AVG_LAST_10_HOME", "home_reb_home_avg_last_3_home": "HOME_REB_home_AVG_LAST_3_HOME", "home_reb_home_avg_last_7_home": "HOME_REB_home_AVG_LAST_7_HOME", "home_reb_home_avg_last_10_home": "HOME_REB_home_AVG_LAST_10_HOME", "home_pts_home_avg_last_3_home_minus_league_avg": "HOME_PTS_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_pts_home_avg_last_7_home_minus_league_avg": "HOME_PTS_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_pts_home_avg_last_10_home_minus_league_avg": 
"HOME_PTS_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_fg_pct_home_avg_last_3_home_minus_league_avg": "HOME_FG_PCT_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_fg_pct_home_avg_last_7_home_minus_league_avg": "HOME_FG_PCT_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_fg_pct_home_avg_last_10_home_minus_league_avg": "HOME_FG_PCT_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_ft_pct_home_avg_last_3_home_minus_league_avg": "HOME_FT_PCT_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_ft_pct_home_avg_last_7_home_minus_league_avg": "HOME_FT_PCT_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_ft_pct_home_avg_last_10_home_minus_league_avg": "HOME_FT_PCT_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_fg3_pct_home_avg_last_3_home_minus_league_avg": "HOME_FG3_PCT_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_fg3_pct_home_avg_last_7_home_minus_league_avg": "HOME_FG3_PCT_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_fg3_pct_home_avg_last_10_home_minus_league_avg": "HOME_FG3_PCT_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_ast_home_avg_last_3_home_minus_league_avg": "HOME_AST_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_ast_home_avg_last_7_home_minus_league_avg": "HOME_AST_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_ast_home_avg_last_10_home_minus_league_avg": "HOME_AST_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_reb_home_avg_last_3_home_minus_league_avg": "HOME_REB_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_reb_home_avg_last_7_home_minus_league_avg": "HOME_REB_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_reb_home_avg_last_10_home_minus_league_avg": "HOME_REB_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "visitor_team_win_streak": "VISITOR_TEAM_WIN_STREAK", "visitor_team_wins_avg_last_3_visitor": "VISITOR_TEAM_WINS_AVG_LAST_3_VISITOR", "visitor_team_wins_avg_last_7_visitor": "VISITOR_TEAM_WINS_AVG_LAST_7_VISITOR", "visitor_team_wins_avg_last_10_visitor": "VISITOR_TEAM_WINS_AVG_LAST_10_VISITOR", "visitor_pts_away_avg_last_3_visitor": "VISITOR_PTS_away_AVG_LAST_3_VISITOR", "visitor_pts_away_avg_last_7_visitor": "VISITOR_PTS_away_AVG_LAST_7_VISITOR", "visitor_pts_away_avg_last_10_visitor": "VISITOR_PTS_away_AVG_LAST_10_VISITOR", "visitor_fg_pct_away_avg_last_3_visitor": "VISITOR_FG_PCT_away_AVG_LAST_3_VISITOR", "visitor_fg_pct_away_avg_last_7_visitor": "VISITOR_FG_PCT_away_AVG_LAST_7_VISITOR", "visitor_fg_pct_away_avg_last_10_visitor": "VISITOR_FG_PCT_away_AVG_LAST_10_VISITOR", "visitor_ft_pct_away_avg_last_3_visitor": "VISITOR_FT_PCT_away_AVG_LAST_3_VISITOR", "visitor_ft_pct_away_avg_last_7_visitor": "VISITOR_FT_PCT_away_AVG_LAST_7_VISITOR", "visitor_ft_pct_away_avg_last_10_visitor": "VISITOR_FT_PCT_away_AVG_LAST_10_VISITOR", "visitor_fg3_pct_away_avg_last_3_visitor": "VISITOR_FG3_PCT_away_AVG_LAST_3_VISITOR", "visitor_fg3_pct_away_avg_last_7_visitor": "VISITOR_FG3_PCT_away_AVG_LAST_7_VISITOR", "visitor_fg3_pct_away_avg_last_10_visitor": "VISITOR_FG3_PCT_away_AVG_LAST_10_VISITOR", "visitor_ast_away_avg_last_3_visitor": "VISITOR_AST_away_AVG_LAST_3_VISITOR", "visitor_ast_away_avg_last_7_visitor": "VISITOR_AST_away_AVG_LAST_7_VISITOR", "visitor_ast_away_avg_last_10_visitor": "VISITOR_AST_away_AVG_LAST_10_VISITOR", "visitor_reb_away_avg_last_3_visitor": "VISITOR_REB_away_AVG_LAST_3_VISITOR", "visitor_reb_away_avg_last_7_visitor": "VISITOR_REB_away_AVG_LAST_7_VISITOR", "visitor_reb_away_avg_last_10_visitor": "VISITOR_REB_away_AVG_LAST_10_VISITOR", "visitor_team_wins_avg_last_3_visitor_minus_league_avg": "VISITOR_TEAM_WINS_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", 
"visitor_team_wins_avg_last_7_visitor_minus_league_avg": "VISITOR_TEAM_WINS_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_team_wins_avg_last_10_visitor_minus_league_avg": "VISITOR_TEAM_WINS_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_pts_away_avg_last_3_visitor_minus_league_avg": "VISITOR_PTS_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_pts_away_avg_last_7_visitor_minus_league_avg": "VISITOR_PTS_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_pts_away_avg_last_10_visitor_minus_league_avg": "VISITOR_PTS_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg_pct_away_avg_last_3_visitor_minus_league_avg": "VISITOR_FG_PCT_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg_pct_away_avg_last_7_visitor_minus_league_avg": "VISITOR_FG_PCT_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg_pct_away_avg_last_10_visitor_minus_league_avg": "VISITOR_FG_PCT_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_ft_pct_away_avg_last_3_visitor_minus_league_avg": "VISITOR_FT_PCT_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_ft_pct_away_avg_last_7_visitor_minus_league_avg": "VISITOR_FT_PCT_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_ft_pct_away_avg_last_10_visitor_minus_league_avg": "VISITOR_FT_PCT_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg3_pct_away_avg_last_3_visitor_minus_league_avg": "VISITOR_FG3_PCT_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg3_pct_away_avg_last_7_visitor_minus_league_avg": "VISITOR_FG3_PCT_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg3_pct_away_avg_last_10_visitor_minus_league_avg": "VISITOR_FG3_PCT_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_ast_away_avg_last_3_visitor_minus_league_avg": "VISITOR_AST_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_ast_away_avg_last_7_visitor_minus_league_avg": "VISITOR_AST_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_ast_away_avg_last_10_visitor_minus_league_avg": "VISITOR_AST_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_reb_away_avg_last_3_visitor_minus_league_avg": "VISITOR_REB_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_reb_away_avg_last_7_visitor_minus_league_avg": "VISITOR_REB_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_reb_away_avg_last_10_visitor_minus_league_avg": "VISITOR_REB_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "matchup_winpct_3_x": "MATCHUP_WINPCT_3_x", "matchup_winpct_7_x": "MATCHUP_WINPCT_7_x", "matchup_winpct_10_x": "MATCHUP_WINPCT_10_x", "matchup_win_streak_x": "MATCHUP_WIN_STREAK_x", "win_streak_x": "WIN_STREAK_x", "home_away_streak_x": "HOME_AWAY_STREAK_x", "team1_win_avg_last_3_all_x": "TEAM1_win_AVG_LAST_3_ALL_x", "team1_win_avg_last_7_all_x": "TEAM1_win_AVG_LAST_7_ALL_x", "team1_win_avg_last_10_all_x": "TEAM1_win_AVG_LAST_10_ALL_x", "team1_win_avg_last_15_all_x": "TEAM1_win_AVG_LAST_15_ALL_x", "pts_avg_last_3_all_x": "PTS_AVG_LAST_3_ALL_x", "pts_avg_last_7_all_x": "PTS_AVG_LAST_7_ALL_x", "pts_avg_last_10_all_x": "PTS_AVG_LAST_10_ALL_x", "pts_avg_last_15_all_x": "PTS_AVG_LAST_15_ALL_x", "fg_pct_avg_last_3_all_x": "FG_PCT_AVG_LAST_3_ALL_x", "fg_pct_avg_last_7_all_x": "FG_PCT_AVG_LAST_7_ALL_x", "fg_pct_avg_last_10_all_x": "FG_PCT_AVG_LAST_10_ALL_x", "fg_pct_avg_last_15_all_x": "FG_PCT_AVG_LAST_15_ALL_x", "ft_pct_avg_last_3_all_x": "FT_PCT_AVG_LAST_3_ALL_x", "ft_pct_avg_last_7_all_x": "FT_PCT_AVG_LAST_7_ALL_x", "ft_pct_avg_last_10_all_x": "FT_PCT_AVG_LAST_10_ALL_x", "ft_pct_avg_last_15_all_x": "FT_PCT_AVG_LAST_15_ALL_x", "fg3_pct_avg_last_3_all_x": "FG3_PCT_AVG_LAST_3_ALL_x", "fg3_pct_avg_last_7_all_x": 
"FG3_PCT_AVG_LAST_7_ALL_x", "fg3_pct_avg_last_10_all_x": "FG3_PCT_AVG_LAST_10_ALL_x", "fg3_pct_avg_last_15_all_x": "FG3_PCT_AVG_LAST_15_ALL_x", "ast_avg_last_3_all_x": "AST_AVG_LAST_3_ALL_x", "ast_avg_last_7_all_x": "AST_AVG_LAST_7_ALL_x", "ast_avg_last_10_all_x": "AST_AVG_LAST_10_ALL_x", "ast_avg_last_15_all_x": "AST_AVG_LAST_15_ALL_x", "reb_avg_last_3_all_x": "REB_AVG_LAST_3_ALL_x", "reb_avg_last_7_all_x": "REB_AVG_LAST_7_ALL_x", "reb_avg_last_10_all_x": "REB_AVG_LAST_10_ALL_x", "reb_avg_last_15_all_x": "REB_AVG_LAST_15_ALL_x", "pts_avg_last_3_all_minus_league_avg_x": "PTS_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "pts_avg_last_7_all_minus_league_avg_x": "PTS_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "pts_avg_last_10_all_minus_league_avg_x": "PTS_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "pts_avg_last_15_all_minus_league_avg_x": "PTS_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_3_all_minus_league_avg_x": "FG_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_7_all_minus_league_avg_x": "FG_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_10_all_minus_league_avg_x": "FG_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_15_all_minus_league_avg_x": "FG_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "ft_pct_avg_last_3_all_minus_league_avg_x": "FT_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "ft_pct_avg_last_7_all_minus_league_avg_x": "FT_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "ft_pct_avg_last_10_all_minus_league_avg_x": "FT_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "ft_pct_avg_last_15_all_minus_league_avg_x": "FT_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_3_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_7_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_10_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_15_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_3_all_minus_league_avg_x": "AST_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_7_all_minus_league_avg_x": "AST_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_10_all_minus_league_avg_x": "AST_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_15_all_minus_league_avg_x": "AST_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_3_all_minus_league_avg_x": "REB_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_7_all_minus_league_avg_x": "REB_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_10_all_minus_league_avg_x": "REB_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_15_all_minus_league_avg_x": "REB_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "win_streak_y": "WIN_STREAK_y", "home_away_streak_y": "HOME_AWAY_STREAK_y", "team1_win_avg_last_3_all_y": "TEAM1_win_AVG_LAST_3_ALL_y", "team1_win_avg_last_7_all_y": "TEAM1_win_AVG_LAST_7_ALL_y", "team1_win_avg_last_10_all_y": "TEAM1_win_AVG_LAST_10_ALL_y", "team1_win_avg_last_15_all_y": "TEAM1_win_AVG_LAST_15_ALL_y", "pts_avg_last_3_all_y": "PTS_AVG_LAST_3_ALL_y", "pts_avg_last_7_all_y": "PTS_AVG_LAST_7_ALL_y", "pts_avg_last_10_all_y": "PTS_AVG_LAST_10_ALL_y", "pts_avg_last_15_all_y": "PTS_AVG_LAST_15_ALL_y", "fg_pct_avg_last_3_all_y": "FG_PCT_AVG_LAST_3_ALL_y", "fg_pct_avg_last_7_all_y": "FG_PCT_AVG_LAST_7_ALL_y", "fg_pct_avg_last_10_all_y": "FG_PCT_AVG_LAST_10_ALL_y", "fg_pct_avg_last_15_all_y": "FG_PCT_AVG_LAST_15_ALL_y", "ft_pct_avg_last_3_all_y": "FT_PCT_AVG_LAST_3_ALL_y", "ft_pct_avg_last_7_all_y": "FT_PCT_AVG_LAST_7_ALL_y", "ft_pct_avg_last_10_all_y": "FT_PCT_AVG_LAST_10_ALL_y", "ft_pct_avg_last_15_all_y": 
"FT_PCT_AVG_LAST_15_ALL_y", "fg3_pct_avg_last_3_all_y": "FG3_PCT_AVG_LAST_3_ALL_y", "fg3_pct_avg_last_7_all_y": "FG3_PCT_AVG_LAST_7_ALL_y", "fg3_pct_avg_last_10_all_y": "FG3_PCT_AVG_LAST_10_ALL_y", "fg3_pct_avg_last_15_all_y": "FG3_PCT_AVG_LAST_15_ALL_y", "ast_avg_last_3_all_y": "AST_AVG_LAST_3_ALL_y", "ast_avg_last_7_all_y": "AST_AVG_LAST_7_ALL_y", "ast_avg_last_10_all_y": "AST_AVG_LAST_10_ALL_y", "ast_avg_last_15_all_y": "AST_AVG_LAST_15_ALL_y", "reb_avg_last_3_all_y": "REB_AVG_LAST_3_ALL_y", "reb_avg_last_7_all_y": "REB_AVG_LAST_7_ALL_y", "reb_avg_last_10_all_y": "REB_AVG_LAST_10_ALL_y", "reb_avg_last_15_all_y": "REB_AVG_LAST_15_ALL_y", "pts_avg_last_3_all_minus_league_avg_y": "PTS_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "pts_avg_last_7_all_minus_league_avg_y": "PTS_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "pts_avg_last_10_all_minus_league_avg_y": "PTS_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "pts_avg_last_15_all_minus_league_avg_y": "PTS_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_3_all_minus_league_avg_y": "FG_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_7_all_minus_league_avg_y": "FG_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_10_all_minus_league_avg_y": "FG_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_15_all_minus_league_avg_y": "FG_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_3_all_minus_league_avg_y": "FT_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_7_all_minus_league_avg_y": "FT_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_10_all_minus_league_avg_y": "FT_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_15_all_minus_league_avg_y": "FT_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_3_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_7_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_10_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_15_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_3_all_minus_league_avg_y": "AST_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_7_all_minus_league_avg_y": "AST_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_10_all_minus_league_avg_y": "AST_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_15_all_minus_league_avg_y": "AST_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_3_all_minus_league_avg_y": "REB_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_7_all_minus_league_avg_y": "REB_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_10_all_minus_league_avg_y": "REB_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_15_all_minus_league_avg_y": "REB_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "win_streak_x_minus_y": "WIN_STREAK_x_minus_y", "home_away_streak_x_minus_y": "HOME_AWAY_STREAK_x_minus_y", "team1_win_avg_last_3_all_x_minus_y": "TEAM1_win_AVG_LAST_3_ALL_x_minus_y", "team1_win_avg_last_7_all_x_minus_y": "TEAM1_win_AVG_LAST_7_ALL_x_minus_y", "team1_win_avg_last_10_all_x_minus_y": "TEAM1_win_AVG_LAST_10_ALL_x_minus_y", "team1_win_avg_last_15_all_x_minus_y": "TEAM1_win_AVG_LAST_15_ALL_x_minus_y", "pts_avg_last_3_all_x_minus_y": "PTS_AVG_LAST_3_ALL_x_minus_y", "pts_avg_last_7_all_x_minus_y": "PTS_AVG_LAST_7_ALL_x_minus_y", "pts_avg_last_10_all_x_minus_y": "PTS_AVG_LAST_10_ALL_x_minus_y", "pts_avg_last_15_all_x_minus_y": "PTS_AVG_LAST_15_ALL_x_minus_y", "fg_pct_avg_last_3_all_x_minus_y": "FG_PCT_AVG_LAST_3_ALL_x_minus_y", "fg_pct_avg_last_7_all_x_minus_y": "FG_PCT_AVG_LAST_7_ALL_x_minus_y", 
"fg_pct_avg_last_10_all_x_minus_y": "FG_PCT_AVG_LAST_10_ALL_x_minus_y", "fg_pct_avg_last_15_all_x_minus_y": "FG_PCT_AVG_LAST_15_ALL_x_minus_y", "ft_pct_avg_last_3_all_x_minus_y": "FT_PCT_AVG_LAST_3_ALL_x_minus_y", "ft_pct_avg_last_7_all_x_minus_y": "FT_PCT_AVG_LAST_7_ALL_x_minus_y", "ft_pct_avg_last_10_all_x_minus_y": "FT_PCT_AVG_LAST_10_ALL_x_minus_y", "ft_pct_avg_last_15_all_x_minus_y": "FT_PCT_AVG_LAST_15_ALL_x_minus_y", "fg3_pct_avg_last_3_all_x_minus_y": "FG3_PCT_AVG_LAST_3_ALL_x_minus_y", "fg3_pct_avg_last_7_all_x_minus_y": "FG3_PCT_AVG_LAST_7_ALL_x_minus_y", "fg3_pct_avg_last_10_all_x_minus_y": "FG3_PCT_AVG_LAST_10_ALL_x_minus_y", "fg3_pct_avg_last_15_all_x_minus_y": "FG3_PCT_AVG_LAST_15_ALL_x_minus_y", "ast_avg_last_3_all_x_minus_y": "AST_AVG_LAST_3_ALL_x_minus_y", "ast_avg_last_7_all_x_minus_y": "AST_AVG_LAST_7_ALL_x_minus_y", "ast_avg_last_10_all_x_minus_y": "AST_AVG_LAST_10_ALL_x_minus_y", "ast_avg_last_15_all_x_minus_y": "AST_AVG_LAST_15_ALL_x_minus_y", "reb_avg_last_3_all_x_minus_y": "REB_AVG_LAST_3_ALL_x_minus_y", "reb_avg_last_7_all_x_minus_y": "REB_AVG_LAST_7_ALL_x_minus_y", "reb_avg_last_10_all_x_minus_y": "REB_AVG_LAST_10_ALL_x_minus_y", "reb_avg_last_15_all_x_minus_y": "REB_AVG_LAST_15_ALL_x_minus_y"} -------------------------------------------------------------------------------- /notebooks/hopsworks.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "2f0c58a3-8738-42ad-be18-8e13c171108f", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import os\n", 11 | "\n", 12 | "import pandas as pd\n", 13 | "import numpy as np\n", 14 | "\n", 15 | "import hopsworks\n", 16 | "\n", 17 | "from datetime import datetime, timedelta\n", 18 | "from pytz import timezone\n", 19 | "\n", 20 | "from src.webscraping_a import (\n", 21 | " scrape_to_dataframe,\n", 22 | " convert_columns,\n", 23 | " combine_home_visitor, \n", 24 | " get_todays_matchups,\n", 25 | ")\n", 26 | "\n", 27 | "from src.data_processing import (\n", 28 | " process_games,\n", 29 | " add_TARGET,\n", 30 | ")\n", 31 | "\n", 32 | "from src.feature_engineering import (\n", 33 | " process_features,\n", 34 | ")\n", 35 | "\n", 36 | "from src.hopsworks_utils import (\n", 37 | " save_feature_names,\n", 38 | " convert_feature_names,\n", 39 | ")\n", 40 | "\n", 41 | "import json\n", 42 | "\n", 43 | "from pathlib import Path #for Windows/Linux compatibility\n", 44 | "DATAPATH = Path(r'data')" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "from dotenv import load_dotenv\n", 54 | "\n", 55 | "load_dotenv()\n", 56 | "\n", 57 | "try:\n", 58 | " HOPSWORKS_API_KEY = os.environ['HOPSWORKS_API_KEY']\n", 59 | "except:\n", 60 | " raise Exception('Set environment variable HOPSWORKS_API_KEY')" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "id": "ba69a88c-691e-4f8d-b68e-e79f5582a939", 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stdout", 71 | "output_type": "stream", 72 | "text": [ 73 | "Connected. Call `.close()` to terminate connection gracefully.\n", 74 | "\n", 75 | "Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/3350\n", 76 | "Connected. 
Call `.close()` to terminate connection gracefully.\n" 77 | ] 78 | }, 79 | { 80 | "name": "stderr", 81 | "output_type": "stream", 82 | "text": [ 83 | "DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)\n", 89 | "fs = project.get_feature_store()" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 4, 95 | "id": "641d97e5-0888-42c3-a9b1-97c9ef2b7082", 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "rolling_stats_fg = fs.get_feature_group(\n", 100 | " name=\"rolling_stats\",\n", 101 | " version=1,\n", 102 | ")" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 5, 108 | "id": "69a2b254", 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "2023-01-31 08:30:10,072 INFO: USE `nba_predictor_featurestore`\n", 116 | "2023-01-31 08:30:10,441 INFO: SELECT `fg0`.`game_date_est` `game_date_est`, `fg0`.`game_id` `game_id`, `fg0`.`home_team_id` `home_team_id`, `fg0`.`visitor_team_id` `visitor_team_id`, `fg0`.`season` `season`, `fg0`.`pts_home` `pts_home`, `fg0`.`fg_pct_home` `fg_pct_home`, `fg0`.`ft_pct_home` `ft_pct_home`, `fg0`.`fg3_pct_home` `fg3_pct_home`, `fg0`.`ast_home` `ast_home`, `fg0`.`reb_home` `reb_home`, `fg0`.`pts_away` `pts_away`, `fg0`.`fg_pct_away` `fg_pct_away`, `fg0`.`ft_pct_away` `ft_pct_away`, `fg0`.`fg3_pct_away` `fg3_pct_away`, `fg0`.`ast_away` `ast_away`, `fg0`.`reb_away` `reb_away`, `fg0`.`home_team_wins` `home_team_wins`\n", 117 | "FROM `nba_predictor_featurestore`.`rolling_stats_1` `fg0`\n" 118 | ] 119 | }, 120 | { 121 | "name": "stderr", 122 | "output_type": "stream", 123 | "text": [ 124 | "UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy\n" 125 | ] 126 | }, 127 | { 128 | "data": { 129 | "text/html": [ 130 | "
[HTML dataframe rendering garbled in extraction and omitted; the same rows appear in the text/plain output below]
" 277 | ], 278 | "text/plain": [ 279 | " GAME_DATE_EST GAME_ID HOME_TEAM_ID VISITOR_TEAM_ID SEASON \\\n", 280 | "1384 2023-01-31 22200765 1610612752 1610612747 2022 \n", 281 | "3042 2023-01-31 22200767 1610612749 1610612766 2022 \n", 282 | "8239 2023-01-31 22200764 1610612739 1610612748 2022 \n", 283 | "8907 2023-01-31 22200766 1610612741 1610612746 2022 \n", 284 | "21628 2023-01-31 22200768 1610612743 1610612740 2022 \n", 285 | "\n", 286 | " PTS_home FG_PCT_home FT_PCT_home FG3_PCT_home AST_home REB_home \\\n", 287 | "1384 0 0.0 0.0 0.0 0 0 \n", 288 | "3042 0 0.0 0.0 0.0 0 0 \n", 289 | "8239 0 0.0 0.0 0.0 0 0 \n", 290 | "8907 0 0.0 0.0 0.0 0 0 \n", 291 | "21628 0 0.0 0.0 0.0 0 0 \n", 292 | "\n", 293 | " PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away \\\n", 294 | "1384 0 0.0 0.0 0.0 0 0 \n", 295 | "3042 0 0.0 0.0 0.0 0 0 \n", 296 | "8239 0 0.0 0.0 0.0 0 0 \n", 297 | "8907 0 0.0 0.0 0.0 0 0 \n", 298 | "21628 0 0.0 0.0 0.0 0 0 \n", 299 | "\n", 300 | " HOME_TEAM_WINS \n", 301 | "1384 0 \n", 302 | "3042 0 \n", 303 | "8239 0 \n", 304 | "8907 0 \n", 305 | "21628 0 " 306 | ] 307 | }, 308 | "execution_count": 5, 309 | "metadata": {}, 310 | "output_type": "execute_result" 311 | } 312 | ], 313 | "source": [ 314 | "BASE_FEATURES = ['game_date_est',\n", 315 | " 'game_id',\n", 316 | " 'home_team_id',\n", 317 | " 'visitor_team_id',\n", 318 | " 'season',\n", 319 | " 'pts_home',\n", 320 | " 'fg_pct_home',\n", 321 | " 'ft_pct_home',\n", 322 | " 'fg3_pct_home',\n", 323 | " 'ast_home',\n", 324 | " 'reb_home',\n", 325 | " 'pts_away',\n", 326 | " 'fg_pct_away',\n", 327 | " 'ft_pct_away',\n", 328 | " 'fg3_pct_away',\n", 329 | " 'ast_away',\n", 330 | " 'reb_away',\n", 331 | " 'home_team_wins',\n", 332 | "]\n", 333 | "\n", 334 | "ds_query = rolling_stats_fg.select(BASE_FEATURES)\n", 335 | "df_old = ds_query.read()\n", 336 | "df_old = convert_feature_names(df_old)\n", 337 | "df_old[df_old['PTS_home'] == 0]\n" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 6, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "data": { 347 | "text/html": [ 348 | "
[HTML dataframe rendering garbled in extraction and omitted; the same rows appear in the text/plain output below]
" 747 | ], 748 | "text/plain": [ 749 | " GAME_DATE_EST GAME_ID HOME_TEAM_ID VISITOR_TEAM_ID SEASON \\\n", 750 | "1384 2023-01-31 22200765 1610612752 1610612747 2022 \n", 751 | "1705 2023-01-30 22200759 1610612744 1610612760 2022 \n", 752 | "1755 2023-01-29 22200754 1610612746 1610612739 2022 \n", 753 | "3042 2023-01-31 22200767 1610612749 1610612766 2022 \n", 754 | "3825 2023-01-30 22200756 1610612753 1610612755 2022 \n", 755 | "5048 2023-01-30 22200763 1610612737 1610612757 2022 \n", 756 | "5716 2023-01-29 22200753 1610612754 1610612763 2022 \n", 757 | "8239 2023-01-31 22200764 1610612739 1610612748 2022 \n", 758 | "8381 2023-01-29 22200752 1610612748 1610612766 2022 \n", 759 | "8727 2023-01-30 22200762 1610612761 1610612756 2022 \n", 760 | "8907 2023-01-31 22200766 1610612741 1610612746 2022 \n", 761 | "9991 2023-01-30 22200758 1610612758 1610612750 2022 \n", 762 | "10092 2023-01-30 22200760 1610612764 1610612759 2022 \n", 763 | "15857 2023-01-30 22200757 1610612747 1610612751 2022 \n", 764 | "15938 2023-01-30 22200761 1610612765 1610612742 2022 \n", 765 | "19613 2023-01-29 22200755 1610612740 1610612749 2022 \n", 766 | "21628 2023-01-31 22200768 1610612743 1610612740 2022 \n", 767 | "\n", 768 | " PTS_home FG_PCT_home FT_PCT_home FG3_PCT_home AST_home REB_home \\\n", 769 | "1384 0 0.00000 0.0000 0.000000 0 0 \n", 770 | "1705 128 51.09375 80.0000 42.593750 37 44 \n", 771 | "1755 99 41.59375 85.1875 10.500000 17 45 \n", 772 | "3042 0 0.00000 0.0000 0.000000 0 0 \n", 773 | "3825 119 42.40625 80.0000 37.906250 23 50 \n", 774 | "5048 125 46.81250 80.0000 43.312500 26 45 \n", 775 | "5716 100 45.59375 90.5000 28.093750 21 38 \n", 776 | "8239 0 0.00000 0.0000 0.000000 0 0 \n", 777 | "8381 117 48.40625 73.8750 32.312500 26 36 \n", 778 | "8727 106 44.90625 89.5000 27.296875 19 44 \n", 779 | "8907 0 0.00000 0.0000 0.000000 0 0 \n", 780 | "9991 118 47.50000 68.1875 30.000000 20 50 \n", 781 | "10092 127 55.81250 88.1875 53.312500 32 45 \n", 782 | "15857 104 39.31250 62.1875 37.906250 24 53 \n", 783 | "15938 105 45.59375 70.3750 35.906250 18 40 \n", 784 | "19613 110 44.09375 71.3750 38.187500 25 38 \n", 785 | "21628 0 0.00000 0.0000 0.000000 0 0 \n", 786 | "\n", 787 | " PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away \\\n", 788 | "1384 0 0.00000 0.0000 0.000000 0 0 \n", 789 | "1705 120 49.50000 85.0000 45.812500 23 43 \n", 790 | "1755 122 56.40625 73.6875 60.593750 35 35 \n", 791 | "3042 0 0.00000 0.0000 0.000000 0 0 \n", 792 | "3825 109 47.09375 78.3125 36.687500 28 45 \n", 793 | "5048 129 54.40625 88.8750 47.500000 24 32 \n", 794 | "5716 112 48.31250 77.3125 24.296875 29 41 \n", 795 | "8239 0 0.00000 0.0000 0.000000 0 0 \n", 796 | "8381 122 54.18750 77.3125 37.500000 25 47 \n", 797 | "8727 114 49.40625 81.0000 39.312500 28 42 \n", 798 | "8907 0 0.00000 0.0000 0.000000 0 0 \n", 799 | "9991 111 46.68750 52.0000 30.796875 26 49 \n", 800 | "10092 106 43.31250 65.1875 24.093750 26 43 \n", 801 | "15857 121 47.40625 78.8750 39.000000 24 50 \n", 802 | "15938 111 49.40625 69.6875 29.406250 17 40 \n", 803 | "19613 135 55.18750 60.0000 39.500000 27 57 \n", 804 | "21628 0 0.00000 0.0000 0.000000 0 0 \n", 805 | "\n", 806 | " HOME_TEAM_WINS \n", 807 | "1384 0 \n", 808 | "1705 1 \n", 809 | "1755 0 \n", 810 | "3042 0 \n", 811 | "3825 1 \n", 812 | "5048 0 \n", 813 | "5716 0 \n", 814 | "8239 0 \n", 815 | "8381 0 \n", 816 | "8727 0 \n", 817 | "8907 0 \n", 818 | "9991 1 \n", 819 | "10092 1 \n", 820 | "15857 0 \n", 821 | "15938 0 \n", 822 | "19613 0 \n", 823 | "21628 0 " 824 | ] 825 | }, 826 | 
"execution_count": 6, 827 | "metadata": {}, 828 | "output_type": "execute_result" 829 | } 830 | ], 831 | "source": [ 832 | "df_old[df_old['GAME_ID'] > 22200751]" 833 | ] 834 | } 835 | ], 836 | "metadata": { 837 | "kernelspec": { 838 | "display_name": "nba3", 839 | "language": "python", 840 | "name": "python3" 841 | }, 842 | "language_info": { 843 | "codemirror_mode": { 844 | "name": "ipython", 845 | "version": 3 846 | }, 847 | "file_extension": ".py", 848 | "mimetype": "text/x-python", 849 | "name": "python", 850 | "nbconvert_exporter": "python", 851 | "pygments_lexer": "ipython3", 852 | "version": "3.9.13" 853 | }, 854 | "orig_nbformat": 4, 855 | "vscode": { 856 | "interpreter": { 857 | "hash": "4655998f62ad965cbd25df51edb717f2326f5df53d53899f0ae604225aa5ae06" 858 | } 859 | } 860 | }, 861 | "nbformat": 4, 862 | "nbformat_minor": 2 863 | } 864 | -------------------------------------------------------------------------------- /notebooks/lightgbm.json: -------------------------------------------------------------------------------- 1 | {"lambda_l1": 0.014353286456597142, "lambda_l2": 1.8797714715567863e-05, "learning_rate": 0.0074219704795943425, "max_depth": 5, "n_estimators": 8165, "feature_fraction": 0.7165888165225329, "bagging_fraction": 0.41679785470194425, "bagging_freq": 5, "num_leaves": 965, "min_child_samples": 94, "min_data_per_groups": 89} -------------------------------------------------------------------------------- /notebooks/model_data.json: -------------------------------------------------------------------------------- 1 | {"model_name": "xgboost", "calibration_method": "Model + Isotonic", "brier_loss": 0.22892725648801726, "metrics": {"AUC": 0.6314368770764119, "Accuracy": 0.6584507042253521}} -------------------------------------------------------------------------------- /notebooks/xgboost.json: -------------------------------------------------------------------------------- 1 | {"num_round": 868, "learning_rate": 0.018405745552075914, "max_bin": 558, "max_depth": 2, "alpha": 8.281713514214621, "gamma": 11.304706114754095, "reg_lambda": 11.806523119478767, "colsample_bytree": 0.4339252336236323, "subsample": 0.5326856953922422, "min_child_weight": 6.179632416051588, "scale_pos_weight": 1} -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | alembic==1.8.1 2 | altair==4.2.0 3 | asttokens==2.2.0 4 | async-generator==1.10 5 | attrs==22.1.0 6 | autopage==0.5.1 7 | avro==1.10.2 8 | backcall==0.2.0 9 | beautifulsoup4==4.11.1 10 | blinker==1.5 11 | boto3==1.26.20 12 | botocore==1.29.20 13 | cachetools==5.2.0 14 | certifi==2022.9.24 15 | cffi==1.15.1 16 | charset-normalizer==2.1.1 17 | click==8.1.3 18 | cliff==4.1.0 19 | cloudpickle==2.2.0 20 | cmaes==0.9.0 21 | cmd2==2.4.2 22 | colorama==0.4.6 23 | colorlog==6.7.0 24 | commonmark==0.9.1 25 | confluent-kafka==1.8.2 26 | contourpy==1.0.6 27 | cryptography==38.0.4 28 | cycler==0.11.0 29 | debugpy==1.6.4 30 | decorator==5.1.1 31 | entrypoints==0.4 32 | exceptiongroup==1.0.4 33 | executing==1.2.0 34 | fastavro==1.4.11 35 | fastjsonschema==2.16.2 36 | fonttools==4.38.0 37 | furl==2.1.3 38 | future==0.18.2 39 | gitdb==4.0.10 40 | GitPython==3.1.29 41 | great-expectations==0.14.12 42 | greenlet==2.0.1 43 | h11==0.14.0 44 | hopsworks==3.0.4 45 | hsfs==3.0.5 46 | hsml==3.0.3 47 | html5lib==1.1 48 | idna==3.4 49 | importlib-metadata==5.1.0 50 | importlib-resources==5.10.0 51 | ipykernel==6.17.1 52 | 
ipython==8.7.0 53 | ipywidgets==8.0.2 54 | javaobj-py3==0.4.3 55 | jedi==0.18.2 56 | Jinja2==3.0.3 57 | jmespath==1.0.1 58 | joblib==1.2.0 59 | jsonpatch==1.32 60 | jsonpointer==2.3 61 | jsonschema==4.17.3 62 | jupyter_client==7.4.7 63 | jupyter_core==5.1.0 64 | jupyterlab-widgets==3.0.3 65 | kiwisolver==1.4.4 66 | lightgbm==3.3.2 67 | llvmlite==0.39.1 68 | lxml==4.9.1 69 | Mako==1.2.4 70 | MarkupSafe==2.0.1 71 | matplotlib==3.6.2 72 | matplotlib-inline==0.1.6 73 | mistune==0.8.4 74 | mock==4.0.3 75 | nbformat==5.7.0 76 | nest-asyncio==1.5.6 77 | numba==0.56.4 78 | numpy==1.22.4 79 | optuna==2.10.1 80 | orderedmultidict==1.0.1 81 | outcome==1.2.0 82 | packaging==21.3 83 | pandas==1.4.3 84 | parso==0.8.3 85 | pbr==5.11.0 86 | pickleshare==0.7.5 87 | Pillow==9.3.0 88 | platformdirs==2.5.4 89 | plotly==5.9.0 90 | prettytable==3.5.0 91 | prompt-toolkit==3.0.33 92 | protobuf==3.20.3 93 | psutil==5.9.4 94 | pure-eval==0.2.2 95 | pyarrow==10.0.1 96 | pyasn1==0.4.8 97 | pyasn1-modules==0.2.8 98 | pycparser==2.21 99 | pycryptodomex==3.16.0 100 | pydeck==0.8.0 101 | Pygments==2.13.0 102 | PyHopsHive==0.6.4.1.dev0 103 | pyhumps==1.6.1 104 | pyjks==20.0.0 105 | Pympler==1.0.1 106 | PyMySQL==1.0.2 107 | pyparsing==2.4.7 108 | pyperclip==1.8.2 109 | pyreadline3==3.4.1 110 | pyrsistent==0.19.2 111 | PySocks==1.7.1 112 | python-dateutil==2.8.2 113 | python-dotenv==0.21.0 114 | pytz==2022.6 115 | pytz-deprecation-shim==0.1.0.post0 116 | PyVirtualDisplay==3.0 117 | PyYAML==6.0 118 | pyzmq==24.0.1 119 | requests==2.28.1 120 | rich==12.6.0 121 | ruamel.yaml==0.17.17 122 | ruamel.yaml.clib==0.2.7 123 | s3transfer==0.6.0 124 | scikit-learn==1.1.3 125 | scipy==1.9.3 126 | seaborn==0.11.2 127 | selenium==4.6.0 128 | semver==2.13.0 129 | shap==0.41.0 130 | six==1.16.0 131 | slicer==0.0.7 132 | smmap==5.0.0 133 | sniffio==1.3.0 134 | sortedcontainers==2.4.0 135 | soupsieve==2.3.2.post1 136 | SQLAlchemy==1.4.44 137 | stack-data==0.6.2 138 | stevedore==4.1.1 139 | streamlit==1.15.2 140 | sweetviz==2.1.4 141 | tenacity==8.1.0 142 | termcolor==2.1.1 143 | threadpoolctl==3.1.0 144 | thrift==0.16.0 145 | toml==0.10.2 146 | toolz==0.12.0 147 | tornado==6.2 148 | tqdm==4.64.0 149 | traitlets==5.6.0 150 | trio==0.22.0 151 | trio-websocket==0.9.2 152 | twofish==0.3.0 153 | typing_extensions==4.4.0 154 | tzdata==2022.7 155 | tzlocal==4.2 156 | urllib3==1.26.13 157 | validators==0.20.0 158 | watchdog==2.2.0 159 | wcwidth==0.2.5 160 | webdriver-manager==3.8.4 161 | webencodings==0.5.1 162 | widgetsnbextension==4.0.3 163 | wsproto==1.2.0 164 | xgboost==1.6.1 165 | zipp==3.11.0 166 | -------------------------------------------------------------------------------- /src/common_functions.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | 4 | import itertools 5 | 6 | import matplotlib.pyplot as plt 7 | from matplotlib.colors import TwoSlopeNorm 8 | 9 | import sweetviz as sv 10 | 11 | from datetime import datetime 12 | 13 | 14 | def plot_corr_barchart(df1: pd.DataFrame, drop_cols: list, n: int = 30) -> None: 15 | """ 16 | Plots a color-gradient bar chart showing top n correlations between features 17 | 18 | Args: 19 | df1 (pd.DataFrame): the dataframe to plot 20 | drop_cols (list): list of columns not to include in plot 21 | n (int): number of top n correlations to plot 22 | 23 | Returns: 24 | None 25 | 26 | Sources: 27 | https://typefully.com/levikul09/j6qzwR0 28 | 
https://stackoverflow.com/questions/17778394/list-highest-correlation-pairs-from-a-large-correlation-matrix-in-pandas 29 | 30 | """ 31 | 32 | df1 = df1.drop(columns=drop_cols) 33 | useful_columns = df1.select_dtypes(include=['number']).columns 34 | 35 | def get_redundant_pairs(df): 36 | pairs_to_drop = set() 37 | cols = df.columns 38 | for i in range(0,df.shape[1]): 39 | for j in range(0,i+1): 40 | pairs_to_drop.add((cols[i],cols[j])) 41 | return pairs_to_drop 42 | 43 | def get_correlations(df,n=n): 44 | au_corr = df.corr(method = 'spearman').unstack() #spearman used because not all data is normalized 45 | labels_to_drop = get_redundant_pairs(df) 46 | au_corr = au_corr.drop(labels = labels_to_drop).sort_values(ascending=False) 47 | top_n = au_corr[0:n] 48 | bottom_n = au_corr[-n:] 49 | top_corr = pd.concat([top_n, bottom_n]) 50 | return top_corr 51 | 52 | corrplot = get_correlations(df1[useful_columns]) 53 | 54 | 55 | fig, ax = plt.subplots(figsize=(15,10)) 56 | norm = TwoSlopeNorm(vmin=-1, vcenter=0, vmax =1) 57 | colors = [plt.cm.RdYlGn(norm(c)) for c in corrplot.values] 58 | 59 | print(corrplot) 60 | 61 | corrplot.plot.barh(color=colors) 62 | 63 | return 64 | 65 | 66 | def plot_corr_vs_target(target: str, df1: pd.DataFrame, drop_cols: list, n: int = 30) -> None: 67 | """ 68 | Plots a color-gradient bar chart showing top n correlations between features and target 69 | 70 | Args: 71 | target (str): the name of the target column 72 | df1 (pd.DataFrame): the dataframe to plot 73 | drop_cols (list): list of columns not to include in plot 74 | n (int): number of top n correlations to plot 75 | 76 | Returns: 77 | None 78 | 79 | """ 80 | 81 | 82 | target_series = df1[target] 83 | df1 = df1.drop(columns=drop_cols) 84 | 85 | x = df1.corrwith(target_series, method = 'spearman',numeric_only=True).sort_values(ascending=False) 86 | top_n = x[0:n] 87 | bottom_n = x[-n:] 88 | top_corr = pd.concat([top_n, bottom_n]) 89 | x = top_corr 90 | 91 | print(x) 92 | 93 | fig, ax = plt.subplots(figsize=(15,10)) 94 | norm = TwoSlopeNorm(vmin=-1, vcenter=0, vmax =1) 95 | colors = [plt.cm.RdYlGn(norm(c)) for c in x.values] 96 | x.plot.barh(color=colors) 97 | 98 | return 99 | 100 | 101 | def plot_confusion_matrix(cm: np.ndarray, 102 | target_names: list, 103 | title: str ='Confusion matrix', 104 | cmap: plt.colormaps =None, 105 | normalize: bool =True): 106 | """ 107 | given a sklearn confusion matrix (cm), make a nice plot 108 | 109 | Arguments 110 | --------- 111 | cm: confusion matrix from sklearn.metrics.confusion_matrix 112 | 113 | target_names: given classification classes such as [0, 1, 2] 114 | the class names, for example: ['high', 'medium', 'low'] 115 | 116 | title: the text to display at the top of the matrix 117 | 118 | cmap: the gradient of the values displayed from matplotlib.pyplot.cm 119 | see http://matplotlib.org/examples/color/colormaps_reference.html 120 | plt.get_cmap('jet') or plt.cm.Blues 121 | 122 | normalize: If False, plot the raw numbers 123 | If True, plot the proportions 124 | 125 | Usage 126 | ----- 127 | plot_confusion_matrix(cm = cm, # confusion matrix created by 128 | # sklearn.metrics.confusion_matrix 129 | normalize = True, # show proportions 130 | target_names = y_labels_vals, # list of names of the classes 131 | title = best_estimator_name) # title of graph 132 | 133 | Citation 134 | --------- 135 | http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html 136 | 137 | Source: 138 | 
https://stackoverflow.com/questions/19233771/sklearn-plot-confusion-matrix-with-labels/50386871#50386871 139 | 140 | """ 141 | 142 | accuracy = np.trace(cm) / np.sum(cm).astype('float') 143 | misclass = 1 - accuracy 144 | 145 | if cmap is None: 146 | cmap = plt.get_cmap('Blues') 147 | 148 | fig = plt.figure(figsize=(8, 6)) 149 | plt.imshow(cm, interpolation='nearest', cmap=cmap) 150 | plt.title(title) 151 | plt.colorbar() 152 | 153 | if target_names is not None: 154 | tick_marks = np.arange(len(target_names)) 155 | plt.xticks(tick_marks, target_names, rotation=45) 156 | plt.yticks(tick_marks, target_names) 157 | 158 | if normalize: 159 | cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] 160 | 161 | 162 | thresh = cm.max() / 1.5 if normalize else cm.max() / 2 163 | for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])): 164 | if normalize: 165 | plt.text(j, i, "{:0.4f}".format(cm[i, j]), 166 | horizontalalignment="center", 167 | color="white" if cm[i, j] > thresh else "black") 168 | else: 169 | plt.text(j, i, "{:,}".format(cm[i, j]), 170 | horizontalalignment="center", 171 | color="white" if cm[i, j] > thresh else "black") 172 | 173 | 174 | plt.tight_layout() 175 | plt.ylabel('True label') 176 | plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass)) 177 | plt.show() 178 | 179 | return fig 180 | 181 | def run_sweetviz_report(df: pd.DataFrame, TARGET: str) -> None: 182 | """ 183 | Generates a sweetviz report and saves it to a html file 184 | 185 | Args: 186 | df (pd.DataFrame): the dataframe to analyze 187 | TARGET (str): the name of the target column 188 | """ 189 | 190 | report_label = datetime.today().strftime('%Y-%m-%d_%H_%M') 191 | 192 | my_report = sv.analyze(df,target_feat=TARGET) 193 | my_report.show_html(filepath='SWEETVIZ_' + report_label + '.html') 194 | 195 | return 196 | 197 | 198 | def run_sweetviz_comparison(df1: pd.DataFrame, df1_name: str, df2: pd.DataFrame, df2_name: str, TARGET: str, report_label: str) -> None: 199 | """ 200 | Generates a sweetviz comparison report between two dataframes and saves it to a html file 201 | 202 | Args: 203 | df1 (pd.DataFrame): the first dataframe to analyze 204 | df1_name (str): the name of the first dataframe 205 | df2 (pd.DataFrame): the second dataframe to analyze 206 | df2_name (str): the name of the second dataframe 207 | TARGET (str): the name of the target column 208 | report_label (str): identifier incorporated into the filename of the report 209 | """ 210 | 211 | report_label = report_label + datetime.today().strftime('%Y-%m-%d_%H_%M') 212 | 213 | my_report = sv.compare([df1, df1_name], [df2, df2_name], target_feat=TARGET,pairwise_analysis="off") 214 | my_report.show_html(filepath='SWEETVIZ_' + report_label + '.html') 215 | 216 | return -------------------------------------------------------------------------------- /src/data_processing.py: -------------------------------------------------------------------------------- 1 | 2 | import pandas as pd 3 | 4 | def process_games(games: pd.DataFrame) -> pd.DataFrame: 5 | """ 6 | Performs basic data cleaning on the games dataset. 
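    For example (illustrative IDs): a preseason GAME_ID such as 12200001 begins with 1 and is
    dropped by the GAME_ID > 20000000 filter, while a GAME_ID of 30000000 or above is kept and
    flagged as a playoff game (PLAYOFF = 1).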
7 | 8 | Args: 9 | games (pd.DataFrame): the raw games dataframe 10 | 11 | Returns: 12 | the cleaned games dataframe 13 | 14 | """ 15 | 16 | 17 | # remove preseason games (GAME_ID begins with a 1) 18 | games = games[games['GAME_ID'] > 20000000] 19 | 20 | # flag postseason games (GAME_ID begins with >2) 21 | games['PLAYOFF'] = (games['GAME_ID'] >= 30000000).astype('int8') 22 | 23 | # remove duplicates (each GAME_ID should be unique) 24 | games = games[~games.duplicated(subset=['GAME_ID'])] 25 | 26 | # drop unnecessary fields 27 | all_columns = games.columns.tolist() 28 | drop_columns = ['GAME_STATUS_TEXT', 'TEAM_ID_home', 'TEAM_ID_away'] 29 | use_columns = [item for item in all_columns if item not in drop_columns] 30 | games = games[use_columns] 31 | 32 | return games 33 | 34 | 35 | def process_ranking(ranking: pd.DataFrame) -> pd.DataFrame: 36 | """ 37 | Performs basic data cleaning on the ranking dataset. 38 | 39 | Args: 40 | ranking (pd.DataFrame): the raw ranking dataframe 41 | 42 | Returns: 43 | the cleaned ranking dataframe 44 | 45 | """ 46 | 47 | 48 | # remove preseason rankings (SEASON_ID begins with 1) 49 | ranking = ranking[ranking['SEASON_ID'] > 20000] 50 | 51 | # convert home record and road record to numeric 52 | ranking['HOME_W'] = ranking['HOME_RECORD'].apply(lambda x: x.split('-')[0]).astype('int') 53 | ranking['HOME_L'] = ranking['HOME_RECORD'].apply(lambda x: x.split('-')[1]).astype('int') 54 | ranking['HOME_W_PCT'] = ranking['HOME_W'] / ( ranking['HOME_W'] + ranking['HOME_L'] ) 55 | 56 | ranking['ROAD_W'] = ranking['ROAD_RECORD'].apply(lambda x: x.split('-')[0]).astype('int') 57 | ranking['ROAD_L'] = ranking['ROAD_RECORD'].apply(lambda x: x.split('-')[1]).astype('int') 58 | ranking['ROAD_W_PCT'] = ranking['ROAD_W'] / ( ranking['ROAD_W'] + ranking['ROAD_L'] ) 59 | 60 | # encode CONFERENCE as an integer (just using pandas - not importing sklearn for just one feature) 61 | ranking['CONFERENCE'] = ranking['CONFERENCE'].apply(lambda x: 0 if x=='East' else 1 ).astype('int') 62 | 63 | # remove duplicates (there should only be one TEAM_ID per STANDINGSDATE) 64 | ranking = ranking[~ranking.duplicated(subset=['TEAM_ID','STANDINGSDATE'])] 65 | 66 | # drop unnecessary fields 67 | drop_fields = ['SEASON_ID', 'LEAGUE_ID', 'RETURNTOPLAY', 'TEAM', 'HOME_RECORD', 'ROAD_RECORD'] 68 | ranking = ranking.drop(drop_fields,axis=1) 69 | 70 | return ranking 71 | 72 | 73 | def process_games_details(details: pd.DataFrame) -> pd.DataFrame: 74 | """ 75 | Performs basic data cleaning on the games_details dataset. 
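    For example, a raw MIN value of "34:30" (minutes:seconds) is converted by the code below to the float 34.5.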
76 | 
77 |     Args:
78 |         details (pd.DataFrame): the raw games_details dataframe
79 | 
80 |     Returns:
81 |         the cleaned games_details dataframe
82 | 
83 |     """
84 | 
85 | 
86 |     # convert MIN:SEC to float
87 |     df = details.loc[details['MIN'].str.contains(':',na=False)]
88 |     df['MIN_whole'] = df['MIN'].apply(lambda x: x.split(':')[0]).astype("int8")
89 |     df['MIN_seconds'] = df['MIN'].apply(lambda x: x.split(':')[1]).astype("int8")
90 |     df['MIN'] = df['MIN_whole'] + (df['MIN_seconds'] / 60)
91 | 
92 |     details['MIN'].loc[details['MIN'].str.contains(':',na=False)] = df['MIN']
93 |     details['MIN'] = details['MIN'].astype("float16")
94 | 
95 |     # convert negatives to positive
96 |     details['MIN'].loc[details['MIN'] < 0] = -(details['MIN'])
97 | 
98 |     # update START_POSITION if the player did not play (MIN = NaN)
99 |     details['START_POSITION'].loc[details['MIN'].isna()] = 'NP'
100 | 
101 |     # update START_POSITION if null
102 |     details['START_POSITION'] = details['START_POSITION'].fillna('NS')
103 | 
104 |     # drop unnecessary fields
105 |     drop_fields = ['COMMENT', 'TEAM_ABBREVIATION', 'TEAM_CITY', 'PLAYER_NAME', 'NICKNAME']
106 |     details = details.drop(drop_fields,axis=1)
107 | 
108 |     return details
109 | 
110 | 
111 | def add_TARGET(df: pd.DataFrame) -> pd.DataFrame:
112 |     """
113 |     Adds a TARGET column to the dataframe by copying HOME_TEAM_WINS.
114 | 
115 |     Args:
116 |         df (pd.DataFrame): the dataframe to add the TARGET column to
117 | 
118 |     Returns:
119 |         the games dataframe with a TARGET column
120 | 
121 |     """
122 | 
123 |     df['TARGET'] = df['HOME_TEAM_WINS']
124 | 
125 |     return df
126 | 
127 | def split_train_test(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
128 |     """
129 |     Splits the dataframe into train and test sets.
130 | 
131 |     Splits the latest season as the test set and the rest as the train set.
132 |     The second latest season is also included with the test set to allow for feature engineering.
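    A minimal doctest-style sketch (hypothetical seasons, assuming SEASON holds integer season labels):

        >>> df = pd.DataFrame({'SEASON': [2018, 2019, 2020, 2021, 2022]})
        >>> train, test = split_train_test(df)
        >>> sorted(train['SEASON'].unique())
        [2018, 2019, 2020, 2021]
        >>> sorted(test['SEASON'].unique())
        [2021, 2022]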
133 | 
134 |     Args:
135 |         df (pd.DataFrame): the dataframe to split
136 | 
137 |     Returns:
138 |         the train and test dataframes
139 | 
140 |     """
141 | 
142 |     latest_season = df['SEASON'].unique().max()
143 | 
144 |     train = df[df['SEASON'] < (latest_season)]
145 |     test = df[df['SEASON'] >= (latest_season - 1)]
146 | 
147 |     return train, test
148 | 
149 | 
--------------------------------------------------------------------------------
/src/feature_engineering.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | 
4 | def process_features(df: pd.DataFrame) -> pd.DataFrame:
5 |     """
6 |     Master function to perform all the steps of feature engineering
7 | 
8 |     Args:
9 |         df (pd.DataFrame): the dataframe to process
10 | 
11 |     Returns:
12 |         the processed dataframe
13 | 
14 | 
15 | 
16 |     Feature engineering to add:
17 |         - rolling averages of key stats,
18 |         - win/lose streaks,
19 |         - home/away streaks,
20 |         - specific matchup (team X vs team Y) rolling averages and streaks
21 |         - home team rolling stats minus visitor team rolling stats
22 |         - rolling stats minus current league average
23 | 
24 |     Functions include:
25 |         - fix_datatypes(): converts date to proper format and reduces memory footprint of ints and floats
26 |         - add_date_features(): adds a feature for month number from game date
27 |         - remove_playoff_games(): playoff games may bias the statistics
28 |         - add_rolling_home_visitor(): rolling avgs and streaks for home/visitor team when playing as home/visitor
29 |         - process_games_consecutively(): separate home team stats from visitor team stats for each game and stack these together by game date
30 |         - add_past_performance_all(): rolling avgs and streaks no matter if playing as home or visitor team
31 |         - process_x_minus_league_avg(): subtract league avg rolling stats from team's rolling stats
32 |         - add_matchups(): rolling avgs and streaks for each time when Team A played Team B
33 |         - combine_new_features(): combine back home team and visitor team features so each game has only one row again
34 |         - process_x_minus_y(): subtract visitor team rolling stats from home rolling stats
35 |     """
36 | 
37 |     # lengths of rolling averages and streaks to calculate for each team
38 |     # we will try a variety of lengths to see which works best
39 |     home_visitor_roll_list = [3, 7, 10] #lengths to use when restricting to home or visitor role
40 |     all_roll_list = [3, 7, 10, 15] #lengths to use when NOT restricting to home or visitor role
41 | 
42 |     long_integer_fields = ['GAME_ID', 'HOME_TEAM_ID', 'VISITOR_TEAM_ID', 'SEASON']
43 |     short_integer_fields = ['PTS_home', 'AST_home', 'REB_home', 'PTS_away', 'AST_away', 'REB_away']
44 |     date_fields = ['GAME_DATE_EST']
45 | 
46 |     df = fix_datatypes(df, date_fields, short_integer_fields, long_integer_fields)
47 |     df = add_date_features(df)
48 |     df = remove_playoff_games(df)
49 |     df = add_rolling_home_visitor(df, "HOME", home_visitor_roll_list)
50 |     df = add_rolling_home_visitor(df, "VISITOR", home_visitor_roll_list)
51 | 
52 |     df_consecutive = process_games_consecutively(df)
53 |     df_consecutive = add_matchups(df_consecutive, home_visitor_roll_list)
54 |     df_consecutive = add_past_performance_all(df_consecutive, all_roll_list)
55 | 
56 |     #add these features back to main dataframe
57 |     df = combine_new_features(df,df_consecutive)
58 | 
59 |     df = process_x_minus_y(df)
60 | 
61 |     return df
62 | 
63 | 
64 | def fix_datatypes(df: pd.DataFrame, date_columns: list, short_integer_fields: list, long_integer_fields: list)-> pd.DataFrame:
pd.DataFrame: 65 | """ 66 | Converts date to proper format and reduces memory footprint of ints and floats 67 | 68 | Args: 69 | df (pd.DataFrame): the dataframe to process 70 | 71 | Returns: 72 | the processed dataframe 73 | 74 | """ 75 | 76 | for field in date_columns: 77 | df[field] = pd.to_datetime(df[field]) 78 | 79 | #convert long integer fields to int32 from int64 80 | for field in long_integer_fields: 81 | df[field] = df[field].astype('int32') 82 | 83 | #convert specific fields to int16 to avoid type issues with hopsworks.ai 84 | for field in short_integer_fields: 85 | df[field] = df[field].astype('int16') 86 | 87 | #convert to positive. For some reason, some values have been saved as negative numbers 88 | for field in short_integer_fields: 89 | df[field] = df[field].abs() 90 | 91 | #convert the remaining int64s to int8 92 | for field in df.select_dtypes(include=['int64']).columns.tolist(): 93 | df[field] = df[field].astype('int8') 94 | 95 | #convert float64s to float16s 96 | for field in df.select_dtypes(include=['float64']).columns.tolist(): 97 | df[field] = df[field].astype('float16') 98 | 99 | return df 100 | 101 | 102 | def add_date_features(df: pd.DataFrame)-> pd.DataFrame: 103 | """ 104 | Creates new features from the game date, which will hopefully be more useful for the model 105 | 106 | Currently simply converts game date to just month and add as a feature. This limits cardinality of the date feature. 107 | 108 | Args: 109 | df (pd.DataFrame): the dataframe to process 110 | 111 | Returns: 112 | the processed dataframe 113 | """ 114 | 115 | df['MONTH'] = df['GAME_DATE_EST'].dt.month 116 | 117 | return df 118 | 119 | 120 | def remove_playoff_games(df: pd.DataFrame)-> pd.DataFrame: 121 | """ 122 | Remove playoff games 123 | 124 | Playoff games may bias the statistics because they are not played under the same conditions as regular season games and are played in a tournament format. 125 | 126 | Args: 127 | df (pd.DataFrame): the dataframe to process 128 | 129 | Returns: 130 | the processed dataframe 131 | """ 132 | 133 | 134 | # Filter to only non-playoff games and then drop the PLAYOFF feature 135 | 136 | df = df[df["PLAYOFF"] == 0] 137 | df = df.drop("PLAYOFF", axis = 1) 138 | 139 | return df 140 | 141 | 142 | def add_rolling_home_visitor(df: pd.DataFrame, location: str, roll_list: list)-> pd.DataFrame: 143 | """ 144 | Add rolling avgs and win/lose streaks for home/visitor team when playing as home/visitor for a variety of rolling lengths 145 | 146 | This function also invokes another function to calculate the league average rolling stats for that moment in time and subtracts these from the team's rolling stats. 147 | 148 | Args: 149 | df (pd.DataFrame): the dataframe to process 150 | location (str): "HOME" or "VISITOR" 151 | roll_list (list): list of number of games for each rolling mean, e.g. [3, 5, 7, 10, 15] 152 | 153 | Returns: 154 | the processed dataframe 155 | 156 | 157 | We are adding features that show how well the home team has done in its last home games and how well the visitor team has done in its last away games. 158 | We are also determining the current win streak for each team (negative if losing streak) when playing as home or visitor team. 
159 | 160 | """ 161 | 162 | # compile stats for home or visitor team 163 | location_id = location + "_TEAM_ID" 164 | 165 | # sort games by the order in which they were played for each home or visitor team 166 | df = df.sort_values(by = [location_id, 'GAME_DATE_EST'], axis=0, ascending=[True, True,], ignore_index=True) 167 | 168 | # Win streak, negative if a losing streak 169 | df[location + '_TEAM_WIN_STREAK'] = df['HOME_TEAM_WINS'].groupby((df['HOME_TEAM_WINS'].shift() != df.groupby([location_id])['HOME_TEAM_WINS'].shift(2)).cumsum()).cumcount() + 1 170 | # if home team lost the last game of the streak, then the streak must be a losing streak. make it negative 171 | df[location + '_TEAM_WIN_STREAK'].loc[df['HOME_TEAM_WINS'].shift() == 0] = -1 * df[location + '_TEAM_WIN_STREAK'] 172 | 173 | # If visitor, the streak has opposite meaning (3 wins in a row for home team is 3 losses in a row for visitor) 174 | if location == 'VISITOR': 175 | df[location + '_TEAM_WIN_STREAK'] = - df[location + '_TEAM_WIN_STREAK'] 176 | 177 | 178 | # rolling means 179 | feature_list = ['HOME_TEAM_WINS', 'PTS_home', 'FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home'] 180 | 181 | if location == 'VISITOR': 182 | feature_list = ['HOME_TEAM_WINS', 'PTS_away', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away'] 183 | 184 | 185 | roll_feature_list = [] 186 | for feature in feature_list: 187 | for roll in roll_list: 188 | roll_feature_name = location + '_' + feature + '_AVG_LAST_' + str(roll) + '_' + location 189 | if feature == 'HOME_TEAM_WINS': #remove the "HOME_" for better readability 190 | roll_feature_name = location + '_' + feature[5:] + '_AVG_LAST_' + str(roll) + '_' + location 191 | roll_feature_list.append(roll_feature_name) 192 | df[roll_feature_name] = df.groupby(['HOME_TEAM_ID'])[feature].rolling(roll, closed= "left").mean().values 193 | 194 | 195 | 196 | # determine league avg for each stat and then subtract it from the each team's avg 197 | # as a measure of how well that team compares to all teams in that moment in time 198 | 199 | #remove win averages from roll list - the league average will always be 0.5 (half the teams win, half lose) 200 | roll_feature_list = [x for x in roll_feature_list if not x.startswith('HOME_TEAM_WINS')] 201 | #print(location_id) 202 | df = process_x_minus_league_avg(df, roll_feature_list, location_id) 203 | 204 | 205 | return df 206 | 207 | 208 | def process_games_consecutively(df_data: pd.DataFrame)-> pd.DataFrame: 209 | """ 210 | 211 | Separate home team stats from visitor team stats for each game and stack these together by game date. 212 | 213 | (Each game record will go from a single row, Home/Visitor combined, to two rows, one for home team and one for visitor) 214 | 215 | Args: 216 | df (pd.DataFrame): the dataframe to process 217 | 218 | Returns: 219 | the processed dataframe 220 | """ 221 | 222 | # re-organize so that all of a team's games can be listed in chronological order whether HOME or VISITOR 223 | # this will facilitate feature engineering (winpct vs team X, 5-game winpct, current win streak, etc...) 
224 | 
225 |     # before this step, the data is stored by game, and each game has 2 teams
226 |     # this function will separate each team's stats so that each game has 2 rows (one for each team) instead of one combined row
227 | 
228 |     #this data will need to be re-linked back to the main dataframe after all processing is done,
229 |     #joining TEAM1 to HOME_TEAM_ID for all records and then TEAM1 to VISITOR_TEAM_ID for all records
230 | 
231 |     #TEAM1 will be the key field. TEAM2 is used solely to process past team matchups
232 | 
233 |     # all the home games for each team will be selected and then stacked with all the away games
234 | 
235 |     df_home = pd.DataFrame()
236 |     df_home['GAME_DATE_EST'] = df_data['GAME_DATE_EST']
237 |     df_home['GAME_ID'] = df_data['GAME_ID']
238 |     df_home['TEAM1'] = df_data['HOME_TEAM_ID']
239 |     df_home['TEAM1_home'] = 1
240 |     df_home['TEAM1_win'] = df_data['HOME_TEAM_WINS']
241 |     df_home['TEAM2'] = df_data['VISITOR_TEAM_ID']
242 |     df_home['SEASON'] = df_data['SEASON']
243 | 
244 |     df_home['PTS'] = df_data['PTS_home']
245 |     df_home['FG_PCT'] = df_data['FG_PCT_home']
246 |     df_home['FT_PCT'] = df_data['FT_PCT_home']
247 |     df_home['FG3_PCT'] = df_data['FG3_PCT_home']
248 |     df_home['AST'] = df_data['AST_home']
249 |     df_home['REB'] = df_data['REB_home']
250 | 
251 |     # now for visitor teams
252 | 
253 |     df_visitor = pd.DataFrame()
254 |     df_visitor['GAME_DATE_EST'] = df_data['GAME_DATE_EST']
255 |     df_visitor['GAME_ID'] = df_data['GAME_ID']
256 |     df_visitor['TEAM1'] = df_data['VISITOR_TEAM_ID']
257 |     df_visitor['TEAM1_home'] = 0
258 |     df_visitor['TEAM1_win'] = df_data['HOME_TEAM_WINS'].apply(lambda x: 1 if x == 0 else 0)
259 |     df_visitor['TEAM2'] = df_data['HOME_TEAM_ID']
260 |     df_visitor['SEASON'] = df_data['SEASON']
261 | 
262 |     df_visitor['PTS'] = df_data['PTS_away']
263 |     df_visitor['FG_PCT'] = df_data['FG_PCT_away']
264 |     df_visitor['FT_PCT'] = df_data['FT_PCT_away']
265 |     df_visitor['FG3_PCT'] = df_data['FG3_PCT_away']
266 |     df_visitor['AST'] = df_data['AST_away']
267 |     df_visitor['REB'] = df_data['REB_away']
268 | 
269 |     # merge dataframes
270 | 
271 |     df = pd.concat([df_home, df_visitor])
272 | 
273 |     column2 = df.pop('TEAM1')
274 |     column3 = df.pop('TEAM1_home')
275 |     column4 = df.pop('TEAM2')
276 |     column5 = df.pop('TEAM1_win')
277 | 
278 |     df.insert(2,'TEAM1', column2)
279 |     df.insert(3,'TEAM1_home', column3)
280 |     df.insert(4,'TEAM2', column4)
281 |     df.insert(5,'TEAM1_win', column5)
282 | 
283 |     df = df.sort_values(by = ['TEAM1', 'GAME_ID'], axis=0, ascending=[True, True], ignore_index=True)
284 | 
285 |     return df
286 | 
287 | 
288 | def add_matchups(df: pd.DataFrame, roll_list: list)-> pd.DataFrame:
289 |     """
290 |     Add rolling win pcts and win/lose streaks for each time when Team A played Team B for a variety of rolling windows
291 | 
292 |     Args:
293 |         df (pd.DataFrame): the dataframe to process
294 |         roll_list (list): list of number of games for each rolling mean, e.g. [3, 5, 7, 10, 15]
295 | 
296 |     Returns:
297 |         the processed dataframe
298 |     """
299 | 
300 | 
301 |     # group all the games that 2 teams played each other
302 |     # calculate home team win pct and the home team win/lose streak
303 | 
304 | 
305 |     df = df.sort_values(by = ['TEAM1', 'TEAM2','GAME_DATE_EST'], axis=0, ascending=[True, True, True], ignore_index=True)
306 | 
307 |     for roll in roll_list:
308 |         df['MATCHUP_WINPCT_' + str(roll)] = df.groupby(['TEAM1','TEAM2'])['TEAM1_win'].rolling(roll, closed= "left").mean().values
309 | 
310 |     df['MATCHUP_WIN_STREAK'] = df['TEAM1_win'].groupby((df['TEAM1_win'].shift() != df.groupby(['TEAM1','TEAM2'])['TEAM1_win'].shift(2)).cumsum()).cumcount() + 1
311 | 
312 |     # if team1 lost the last game of the streak, then the streak must be a losing streak. make it negative
313 |     df['MATCHUP_WIN_STREAK'].loc[df['TEAM1_win'].shift() == 0] = -1 * df['MATCHUP_WIN_STREAK']
314 | 
315 | 
316 |     return df
317 | 
318 | 
319 | def add_past_performance_all(df: pd.DataFrame, roll_list: list)-> pd.DataFrame:
320 |     """
321 |     Add rolling avgs, win/lose streak, and home/away streak no matter if playing as home or visitor team.
322 | 
323 |     Args:
324 |         df (pd.DataFrame): the dataframe to process
325 |         roll_list (list): list of number of games for each rolling mean, e.g. [3, 5, 7, 10, 15]
326 | 
327 |     Returns:
328 |         the processed dataframe
329 |     """
330 | 
331 |     # add features showing how well each team has done in its last games
332 |     # regardless of whether they were at home or away
333 | 
334 |     # add rolling means and win streaks (negative number if losing streak)
335 | 
336 |     #this data will need to be re-linked back to the main dataframe after all processing is done,
337 |     #joining TEAM1 to HOME_TEAM_ID for all records and then TEAM1 to VISITOR_TEAM_ID for all records
338 | 
339 |     #TEAM1 will be the key field. TEAM2 was used solely to process past team matchups
340 | 
341 | 
342 |     df = df.sort_values(by = ['TEAM1','GAME_DATE_EST'], axis=0, ascending=[True, True,], ignore_index=True)
343 | 
344 |     #streak of games won/lost, make negative if a losing streak
345 |     df['WIN_STREAK'] = df['TEAM1_win'].groupby((df['TEAM1_win'].shift() != df.groupby(['TEAM1'])['TEAM1_win'].shift(2)).cumsum()).cumcount() + 1
346 | 
347 |     # if team1 lost the last game of the streak, then the streak must be a losing streak. make it negative
348 |     df['WIN_STREAK'].loc[df['TEAM1_win'].shift() == 0] = -1 * df['WIN_STREAK']
349 | 
350 |     #streak of games played at home/away, make negative if away streak
351 |     df['HOME_AWAY_STREAK'] = df['TEAM1_home'].groupby((df['TEAM1_home'].shift() != df.groupby(['TEAM1'])['TEAM1_home'].shift(2)).cumsum()).cumcount() + 1
352 | 
353 |     # if team1 played the last game of the streak away, then the streak must be an away streak. make it negative
354 |     df['HOME_AWAY_STREAK'].loc[df['TEAM1_home'].shift() == 0] = -1 * df['HOME_AWAY_STREAK']
355 | 
356 |     #rolling means
357 | 
358 |     feature_list = ['TEAM1_win', 'PTS', 'FG_PCT', 'FT_PCT', 'FG3_PCT', 'AST', 'REB']
359 | 
360 |     #create new feature names based upon rolling period
361 | 
362 |     roll_feature_list =[]
363 | 
364 |     for feature in feature_list:
365 |         for roll in roll_list:
366 |             roll_feature_name = feature + '_AVG_LAST_' + str(roll) + '_ALL'
367 |             roll_feature_list.append(roll_feature_name)
368 |             df[roll_feature_name] = df.groupby(['TEAM1'])[feature].rolling(roll, closed= "left").mean().values
369 | 
370 |     # determine league avg for each stat and then subtract it from each team's average
371 |     # as a measure of how well that team compares to all teams in that moment in time
372 | 
373 |     #remove win averages from roll list - the league average will always be 0.5 (half the teams win, half lose)
374 |     roll_feature_list = [x for x in roll_feature_list if not x.startswith('TEAM1_win')]
375 | 
376 |     df = process_x_minus_league_avg(df, roll_feature_list, 'TEAM1')
377 | 
378 | 
379 |     return df
380 | 
381 | 
382 | 
383 | def process_x_minus_league_avg(df: pd.DataFrame, feature_list: list, team_feature: str)-> pd.DataFrame:
384 |     """
385 |     Calculate the league average for every day of the season and then subtract the league average of each stat from the team's current stat for that day.
386 | 
387 |     This provides a measure of how good the team is compared to the rest of the league at that moment in time.
388 | 
389 |     Args:
390 |         df (pd.DataFrame): the dataframe to process
391 |         feature_list (list): list of features to be used for subtraction, e.g. [PTS_AVG_LAST_5_ALL, REB_AVG_LAST_20_ALL]
392 |         team_feature (str): the team's role (subset of data) that is being worked upon ("HOME_TEAM_ID", "VISITOR_TEAM_ID", or "TEAM1" for all roles)
393 | 
394 |     Returns:
395 |         the processed dataframe
396 | 
397 |     """
398 | 
399 |     # create a temp dataframe so that every date can be front-filled
400 |     # we need the current average for all 30 teams for every day during the season
401 |     # whether that team played or not.
402 |     # We will front-fill from previous days to ensure that every day has stats for every team
403 | 
404 |     #df.to_csv("df.csv",index=False) #debugging output only
405 | 
406 |     # create feature list for temp dataframe to hold league averages
407 |     temp_feature_list = feature_list.copy()
408 |     temp_feature_list.append(team_feature)
409 |     temp_feature_list.append("GAME_DATE_EST")
410 | 
411 |     df_temp = df[temp_feature_list]
412 |     #print(temp_feature_list) #debugging output only
413 |     #df_temp.to_csv("df_temp.csv",index=False) #debugging output only
414 | 
415 | 
416 |     # populate the dataframe with all days played and forward fill previous value if a particular team did not play that day
417 |     # https://stackoverflow.com/questions/70362869
418 |     df_temp = (df_temp.set_index('GAME_DATE_EST',)
419 |                       .groupby([team_feature])[feature_list]
420 |                       .apply(lambda x: x.asfreq('d', method = "ffill"))
421 |                       .reset_index()
422 |                       [temp_feature_list]
423 |               )
424 | 
425 |     # find the average across all teams for each day
426 |     df_temp = df_temp.groupby(['GAME_DATE_EST'])[feature_list].mean().reset_index()
427 | 
428 |     # rename features for merging
429 |     df_temp = df_temp.add_suffix('_LEAGUE_AVG')
430 |     temp_features = df_temp.columns
431 | 
432 |     # merge all-team averages with each record so that they can be subtracted
433 |     df = df.sort_values(by = 'GAME_DATE_EST', axis=0, ascending= True, ignore_index=True)
434 |     df = pd.merge(df, df_temp, left_on='GAME_DATE_EST', right_on='GAME_DATE_EST_LEAGUE_AVG', how="left",)
435 |     # subtract league average for each feature
436 |     for feature in feature_list:
437 |         df[feature + "_MINUS_LEAGUE_AVG"] = df[feature] - df[feature + "_LEAGUE_AVG"]
438 | 
439 |     # drop temp features that were only used for subtraction
440 |     df = df.drop(temp_features, axis = 1)
441 | 
442 |     return df
443 | 
444 | 
445 | def combine_new_features(df: pd.DataFrame, df_consecutive: pd.DataFrame)-> pd.DataFrame:
446 |     """
447 |     Re-combine the features created in the consecutive dataframe back into the main dataframe.
448 | 
449 |     The consecutive dataframe was used to derive features regardless of whether the team was home or away, and now we need to add those features back to the main dataframe.
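    For example, a consecutive-frame feature such as WIN_STREAK comes back twice: as WIN_STREAK_x
    for the home team and as WIN_STREAK_y for the visitor team, while the MATCHUP_* features are
    attached only once (to the home side).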
450 | 
451 |     Args:
452 |         df (pd.DataFrame): the main dataframe where each row is a game with both a home team and a visitor team
453 |         df_consecutive (pd.DataFrame): the dataframe where each row is a game with only one team (either home or visitor)
454 | 
455 |     Returns:
456 |         the merged dataframe
457 |     """
458 | 
459 | 
460 |     # add back all the new features created in the consecutive dataframe to the main dataframe
461 |     # all data for TEAM1 will be applied to the home team and then again to the visitor team
462 |     # except for head-to-head MATCHUP data, which will only be applied to home team (redundant to include for both)
463 |     # the letter '_x' will be appended to feature names when adding to home team
464 |     # the letter '_y' will be appended to feature names when adding to visitor team
465 |     # to match the existing convention in the dataset
466 | 
467 |     #first select out the new features
468 |     all_features = df_consecutive.columns.tolist()
469 |     link_features = ['GAME_ID', 'TEAM1', ]
470 |     redundant_features = ['GAME_DATE_EST','TEAM1_home','TEAM1_win','TEAM2','SEASON','PTS', 'FG_PCT', 'FT_PCT', 'FG3_PCT', 'AST', 'REB',]
471 |     matchup_features = [x for x in all_features if "MATCHUP" in x]
472 |     ignore_features = link_features + redundant_features
473 | 
474 |     new_features = [x for x in all_features if x not in ignore_features]
475 | 
476 |     # first home teams
477 | 
478 |     df1 = df_consecutive[df_consecutive['TEAM1_home'] == 1]
479 |     #add "_x" to new features
480 |     df1.columns = [x + '_x' if x in new_features else x for x in df1.columns]
481 |     #drop features that don't need to be merged
482 |     df1 = df1.drop(redundant_features,axis=1)
483 |     #change TEAM1 to HOME_TEAM_ID for easy merging
484 |     df1 = df1.rename(columns={'TEAM1': 'HOME_TEAM_ID'})
485 |     df = pd.merge(df, df1, how="left", on=["GAME_ID", "HOME_TEAM_ID"])
486 | 
487 |     #don't include matchup features for visitor team since they are equivalent for both home and visitor
488 |     new_features = [x for x in new_features if x not in matchup_features]
489 |     df_consecutive = df_consecutive.drop(matchup_features,axis=1)
490 | 
491 |     # next visitor teams
492 | 
493 |     df2 = df_consecutive[df_consecutive['TEAM1_home'] == 0]
494 |     #add "_y" to new features
495 |     df2.columns = [x + '_y' if x in new_features else x for x in df2.columns]
496 |     #drop features that don't need to be merged
497 |     df2 = df2.drop(redundant_features,axis=1)
498 |     #change TEAM1 to VISITOR_TEAM_ID for easy merging
499 |     df2 = df2.rename(columns={'TEAM1': 'VISITOR_TEAM_ID'})
500 |     df = pd.merge(df, df2, how="left", on=["GAME_ID", "VISITOR_TEAM_ID"])
501 | 
502 |     return df
503 | 
504 | 
505 | def process_x_minus_y(df: pd.DataFrame)-> pd.DataFrame:
506 |     """
507 |     Subtract visitor team rolling stats from home rolling stats.
508 | 
509 |     This may (or may not) be useful for the model to explicitly see the difference between the two teams. GBM models may be able to handle this automatically, but other models may not.
510 | 
511 |     Args:
512 |         df (pd.DataFrame): the dataframe to process
513 | 
514 |     Returns:
515 |         the processed dataframe
516 |     """
517 | 
518 |     # field_x - field_y
519 | 
520 |     # remove the current game's stats since they are data leaks - we don't know these until after the game is played
521 |     useful_features = remove_non_rolling(df)
522 | 
523 |     comparison_features = [x for x in useful_features if "_y" in x]
524 | 
525 |     #don't include redundant features. (x - league_avg) - (y - league_avg) = x-y
526 |     comparison_features = [x for x in comparison_features if "_MINUS_LEAGUE_AVG" not in x]
527 | 
528 |     for feature in comparison_features:
529 |         feature_base = feature[:-2] #remove "_y" from the end
530 |         df[feature_base + "_x_minus_y"] = df[feature_base + "_x"] - df[feature_base + "_y"]
531 | 
532 |     #df = df.drop("CONFERENCE_x_minus_y") #category variable not meaningful?
533 | 
534 |     return df
535 | 
536 | 
537 | def remove_non_rolling(df: pd.DataFrame) -> list:
538 |     """
539 |     Returns a list of columns in a dataframe with the current game's stats removed, leaving only rolling averages and streaks
540 | 
541 |     Args:
542 |         df (pd.DataFrame): the dataframe to process
543 | 
544 |     Returns:
545 |         list: only the columns that are rolling averages and streaks
546 |     """
547 | 
548 |     # remove non-rolling features - these are data leaks
549 |     # they are stats from the actual game that decides winner/loser,
550 |     # but we don't know these stats before a game is played
551 | 
552 |     # These must be retained in the database to recalculate rolling avgs and streaks in the future,
553 |     # so are filtered out as appropriate instead of deleted
554 | 
555 |     drop_columns =[]
556 | 
557 |     all_columns = df.columns.tolist()
558 | 
559 |     drop_columns1 = ['HOME_TEAM_WINS', 'PTS_home', 'FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home']
560 |     drop_columns2 = ['PTS_away', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away']
561 | 
562 |     drop_columns = drop_columns + drop_columns1
563 |     drop_columns = drop_columns + drop_columns2
564 | 
565 |     use_columns = [item for item in all_columns if item not in drop_columns]
566 | 
567 |     return use_columns
568 | 
--------------------------------------------------------------------------------
/src/hopsworks_utils.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import json
3 | import hopsworks
4 | 
5 | from datetime import datetime, timedelta
6 | 
7 | def save_feature_names(df: pd.DataFrame) -> str:
8 |     """
9 |     Saves dictionary of {lower case feature names : original mixed-case feature names} to JSON file
10 | 
11 |     Args:
12 |         df (pd.DataFrame): the dataframe with the features to be saved
13 | 
14 |     Returns:
15 |         "File Saved."
16 |     """
17 | 
18 |     # hopsworks "sanitizes" feature names by converting to all lowercase
19 |     # this function saves the original so that they can be re-mapped later
20 |     # for code re-usability
21 | 
22 |     original_f_names = df.columns.tolist()
23 |     hopsworks_f_names = [x.lower() for x in original_f_names]
24 | 
25 |     # create a dictionary
26 |     feature_mapper = {hopsworks_f_names[i]: original_f_names[i] for i in range(len(hopsworks_f_names))}
27 | 
28 |     with open("feature_names.json", "w") as fp:
29 |         json.dump(feature_mapper, fp)
30 | 
31 |     return "File Saved."
32 | 
33 | 
34 | def convert_feature_names(df: pd.DataFrame) -> pd.DataFrame:
35 |     """
36 |     Converts hopsworks.ai lower-case feature names back to original mixed-case feature names that have been saved in JSON file
37 | 
38 |     Args:
39 |         df (pd.DataFrame): the dataframe with features in all lower-case
40 | 
41 |     Returns:
42 |         pd.DataFrame: the dataframe with features in original mixed-case
43 |     """
44 | 
45 |     # hopsworks converts all feature names to lower-case, while the original feature names use mixed-case
46 |     # converting these back to original format is needed for optimal code re-usability.
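    # e.g., the stored mapping turns "game_date_est" back into "GAME_DATE_EST" and "pts_home" back into "PTS_home"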
47 | 
48 |     # read in list of original feature names
49 |     with open('feature_names.json', 'rb') as fp:
50 |         feature_mapper = json.load(fp)
51 | 
52 |     df = df.rename(columns=feature_mapper)
53 | 
54 |     return df
55 | 
56 | 
57 | def create_train_test_data(HOPSWORKS_API_KEY:str, STARTDATE:str, DAYS:int) -> tuple[pd.DataFrame, pd.DataFrame]:
58 |     """
59 |     Returns train and test data from Hopsworks.ai feature store based upon how many DAYS back to use as test data
60 | 
61 |     Args:
62 |         HOPSWORKS_API_KEY (str): subscription key for Hopsworks.ai
63 |         STARTDATE (str): start date for train data, format YYYY-MM-DD
64 |         DAYS (int): number of days back to use as test data, the train data will be all data except the last DAYS days
65 | 
66 |     Returns:
67 |         Train and Test data as pandas dataframes
68 |     """
69 | 
70 |     # log into hopsworks.ai and create a feature view object from the feature group
71 |     # the api makes it easier to retrieve training/test data from a feature view than a feature group
72 | 
73 |     project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
74 |     fs = project.get_feature_store()
75 | 
76 |     rolling_stats_fg = fs.get_or_create_feature_group(
77 |         name="rolling_stats",
78 |         version=2,
79 |     )
80 | 
81 |     query = rolling_stats_fg.select_all()
82 | 
83 |     feature_view = fs.create_feature_view(
84 |         name = 'rolling_stats_fv',
85 |         version = 2,
86 |         query = query
87 |     )
88 | 
89 |     # calculate the start and end dates for the train and test data and then retrieve the data from the feature view
90 |     # the train data will be all data except the last DAYS days
91 | 
92 | 
93 |     TODAY = datetime.now()
94 |     LASTYEAR = (TODAY - timedelta(days=DAYS)).strftime('%Y-%m-%d')
95 |     TODAY = TODAY.strftime('%Y-%m-%d')
96 | 
97 |     td_train, td_job = feature_view.create_training_data(
98 |         start_time=STARTDATE,
99 |         end_time=LASTYEAR,
100 |         description='All data except last ' + str(DAYS) + ' days',
101 |         data_format="csv",
102 |         coalesce=True,
103 |         write_options={'wait_for_job': False},
104 |     )
105 | 
106 |     td_test, td_job = feature_view.create_training_data(
107 |         start_time=LASTYEAR,
108 |         end_time=TODAY,
109 |         description='Last ' + str(DAYS) + ' days',
110 |         data_format="csv",
111 |         coalesce=True,
112 |         write_options={'wait_for_job': False},
113 |     )
114 | 
115 |     train = feature_view.get_training_data(td_train)[0]
116 |     test = feature_view.get_training_data(td_test)[0]
117 | 
118 |     # hopsworks converts all feature names to lower-case, while the original feature names use mixed-case
119 |     # converting these back to original format is needed for optimal code re-usability.
120 | 
121 |     train = convert_feature_names(train)
122 |     test = convert_feature_names(test)
123 | 
124 |     # fix date format (truncate to YYYY-MM-DD)
125 |     train["GAME_DATE_EST"] = train["GAME_DATE_EST"].str[:10]
126 |     test["GAME_DATE_EST"] = test["GAME_DATE_EST"].str[:10]
127 | 
128 |     # feature view is no longer needed, so delete it
129 |     feature_view.delete()
130 | 
131 |     return train, test
132 | 
133 | 
--------------------------------------------------------------------------------
/src/model_training.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | 
4 | from collections import defaultdict
5 | 
6 | import matplotlib.pyplot as plt
7 | from matplotlib.gridspec import GridSpec
8 | 
9 | from sklearn.calibration import (
10 |     CalibrationDisplay,
11 | )
12 | 
13 | from sklearn.metrics import (
14 |     precision_score,
15 |     recall_score,
16 |     f1_score,
17 |     brier_score_loss,
18 |     log_loss,
19 |     roc_auc_score,
20 | )
21 | 
22 | 
23 | def encode_categoricals(df: pd.DataFrame, category_columns: list, MODEL_NAME: str, ENABLE_CATEGORICAL: bool) -> pd.DataFrame:
24 |     """
25 |     Encode categorical features as integers for use in XGBoost and LightGBM
26 | 
27 |     Args:
28 |         df (pd.DataFrame): the dataframe to process
29 |         category_columns (list): list of columns to encode as categorical
30 |         MODEL_NAME (str): the name of the model being used
31 |         ENABLE_CATEGORICAL (bool): whether or not to enable categorical features in the model
32 | 
33 |     Returns:
34 |         the dataframe with categorical features encoded
35 | 
36 | 
37 |     """
38 | 
39 |     # To use special category feature capabilities in XGB and LGB, categorical features must be ints from 0 to N-1
40 |     # Conversion can be accomplished by simple subtraction for several features
41 |     # (these category capabilities may or may not be used, but encoding does not hurt anything)
42 | 
43 |     first_team_ID = df['HOME_TEAM_ID'].min()
44 |     first_season = df['SEASON'].min()
45 | 
46 |     # subtract lowest value from each to create a range of 0 thru N-1
47 |     df['HOME_TEAM_ID'] = (df['HOME_TEAM_ID'] - first_team_ID).astype('int8') #team ID - 1610612737 = 0 thru 29
48 |     df['VISITOR_TEAM_ID'] = (df['VISITOR_TEAM_ID'] - first_team_ID).astype('int8')
49 |     df['SEASON'] = (df['SEASON'] - first_season).astype('int8')
50 | 
51 |     # if xgb experimental categorical capabilities are to be used, then features must be of category type
52 |     if MODEL_NAME == "xgboost":
53 |         if ENABLE_CATEGORICAL:
54 |             for field in category_columns:
55 |                 df[field] = df[field].astype('category')
56 | 
57 |     return df
58 | 
59 | 
60 | def plot_calibration_curve(clf_list: list, X_train: pd.DataFrame, y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame, n_bins: int = 10) -> None:
61 |     """
62 |     Plots calibration curves for a list of classifiers vs ideal probability distribution
63 | 
64 |     Args:
65 |         clf_list (list): the classifiers to plot
66 |         X_train (pd.DataFrame): training data
67 |         y_train (pd.DataFrame): labels for training data
68 |         X_test (pd.DataFrame): test data
69 |         y_test (pd.DataFrame): labels for test data
70 |         n_bins (int, optional): how many bins to use for calibration. Defaults to 10.
71 | 
72 |     Returns:
73 |         None
74 | 
75 |     FROM: https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html
76 |     """
77 | 
78 | 
79 |     fig = plt.figure(figsize=(10, 10))
80 |     gs = GridSpec(4, 2)
81 |     colors = plt.cm.get_cmap("Dark2")
82 | 
83 |     ax_calibration_curve = fig.add_subplot(gs[:2, :2])
84 |     calibration_displays = {}
85 |     for i, (clf, name) in enumerate(clf_list):
86 |         clf.fit(X_train, y_train)
87 |         display = CalibrationDisplay.from_estimator(
88 |             clf,
89 |             X_test,
90 |             y_test,
91 |             n_bins=n_bins,
92 |             name=name,
93 |             ax=ax_calibration_curve,
94 |             color=colors(i),
95 |         )
96 |         calibration_displays[name] = display
97 | 
98 |     ax_calibration_curve.grid()
99 |     ax_calibration_curve.set_title(f"Calibration plots (bins = {n_bins})")
100 | 
101 |     # Add histogram
102 |     grid_positions = [(2, 0), (2, 1), (3, 0), (3, 1)]
103 |     for i, (_, name) in enumerate(clf_list):
104 |         row, col = grid_positions[i]
105 |         ax = fig.add_subplot(gs[row, col])
106 | 
107 |         ax.hist(
108 |             calibration_displays[name].y_prob,
109 |             range=(0, 1),
110 |             bins=n_bins,
111 |             label=name,
112 |             color=colors(i),
113 |         )
114 |         ax.set(title=name, xlabel="Mean predicted probability", ylabel="Count")
115 | 
116 |     plt.tight_layout()
117 |     plt.show()
118 | 
119 |     return
120 | 
121 | 
122 | def calculate_classification_metrics(clf_list: list, X_train: pd.DataFrame, y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame) -> tuple[pd.DataFrame, list]:
123 |     """
124 |     Calculates classification metrics for a list of classifiers. Brier score, log loss, precision, recall, f1, and roc_auc are calculated.
125 | 
126 |     Args:
127 |         clf_list (list): the classifiers to calculate metrics for
128 |         X_train (pd.DataFrame): training data
129 |         y_train (pd.DataFrame): labels for training data
130 |         X_test (pd.DataFrame): test data
131 |         y_test (pd.DataFrame): labels for test data
132 | 
133 |     Returns:
134 |         tuple: (dataframe) of the metrics and (list) containing the fitted models and names of the models as strings
135 | 
136 |     FROM: https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html
137 |     """
138 | 
139 |     # fit each classifier, then score its probability and label predictions on the test set
140 | 
141 |     scores = defaultdict(list)
142 | 
143 |     for i, (clf, name) in enumerate(clf_list):
144 |         clf.fit(X_train, y_train)
145 |         y_prob = clf.predict_proba(X_test)
146 |         y_pred = clf.predict(X_test)
147 |         scores["Classifier"].append(name)
148 | 
149 |         for metric in [brier_score_loss, log_loss]:
150 |             score_name = metric.__name__.replace("_", " ").replace("score", "").capitalize()
151 |             scores[score_name].append(metric(y_test, y_prob[:, 1]))
152 | 
153 |         for metric in [precision_score, recall_score, f1_score, roc_auc_score]:
154 |             score_name = metric.__name__.replace("_", " ").replace("score", "").capitalize()
155 |             scores[score_name].append(metric(y_test, y_pred))
156 | 
157 |         score_df = pd.DataFrame(scores).set_index("Classifier")
158 |         score_df = score_df.round(decimals=3) #assign the rounded result back; round() is not in-place
159 | 
160 |         # update clf_list with the trained model
161 |         clf_list[i] = (clf, name)
162 | 
163 | 
164 |     return score_df, clf_list
--------------------------------------------------------------------------------
/src/optuna_objectives.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | 
4 | import xgboost as xgb
5 | 
6 | import lightgbm as lgb
7 | from lightgbm import (
8 |     early_stopping,
9 |     log_evaluation,
10 | )
11 | 
12 | from sklearn.model_selection import (
13 |     StratifiedKFold,
14 |     TimeSeriesSplit,
15 | )
16 | 
17 | from sklearn.metrics import (
/src/optuna_objectives.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | 
4 | import xgboost as xgb
5 | 
6 | import lightgbm as lgb
7 | from lightgbm import (
8 |     early_stopping,
9 |     log_evaluation,
10 | )
11 | 
12 | from sklearn.model_selection import (
13 |     StratifiedKFold,
14 |     TimeSeriesSplit,
15 | )
16 | 
17 | from sklearn.metrics import (
18 |     accuracy_score,
19 |     roc_auc_score,
20 | )
21 | 
22 | 
23 | 
24 | def XGB_objective(trial, train, target, STATIC_PARAMS, ENABLE_CATEGORICAL, NUM_BOOST_ROUND, OPTUNA_CV, OPTUNA_FOLDS, SEED):
25 | 
26 |     # out-of-fold predictions, filled in fold by fold during cross-validation
27 |     train_oof = np.zeros((train.shape[0],))
28 | 
29 |     xgb_params = {
30 |         'num_round': trial.suggest_int('num_round', 2, 1000),
31 |         'learning_rate': trial.suggest_float('learning_rate', 1E-3, 1),
32 |         'max_bin': trial.suggest_int('max_bin', 2, 1000),
33 |         'max_depth': trial.suggest_int('max_depth', 1, 8),
34 |         'alpha': trial.suggest_float('alpha', 1E-16, 12),
35 |         'gamma': trial.suggest_float('gamma', 1E-16, 12),
36 |         'reg_lambda': trial.suggest_float('reg_lambda', 1E-16, 12),
37 |         'colsample_bytree': trial.suggest_float('colsample_bytree', 1E-16, 1.0),
38 |         'subsample': trial.suggest_float('subsample', 1E-16, 1.0),
39 |         'min_child_weight': trial.suggest_float('min_child_weight', 1E-16, 12),
40 |         'scale_pos_weight': trial.suggest_int('scale_pos_weight', 1, 15),
41 |     }
42 | 
43 |     xgb_params = xgb_params | STATIC_PARAMS
44 | 
45 |     #pruning_callback = optuna.integration.XGBoostPruningCallback(trial, "evaluation-auc")
46 | 
47 |     if OPTUNA_CV == "StratifiedKFold":
48 |         kf = StratifiedKFold(n_splits=OPTUNA_FOLDS, shuffle=True, random_state=SEED)
49 |     elif OPTUNA_CV == "TimeSeriesSplit":
50 |         kf = TimeSeriesSplit(n_splits=OPTUNA_FOLDS)
51 | 
52 |     for f, (train_ind, val_ind) in enumerate(kf.split(train, target)):
53 | 
54 |         train_df, val_df = train.iloc[train_ind], train.iloc[val_ind]
55 |         train_target, val_target = target[train_ind], target[val_ind]
56 | 
57 |         train_dmatrix = xgb.DMatrix(train_df, label=train_target, enable_categorical=ENABLE_CATEGORICAL)
58 |         val_dmatrix = xgb.DMatrix(val_df, label=val_target, enable_categorical=ENABLE_CATEGORICAL)
59 | 
60 |         model = xgb.train(xgb_params,
61 |                           train_dmatrix,
62 |                           num_boost_round=NUM_BOOST_ROUND,
63 |                           #callbacks=[pruning_callback],
64 |                           )
65 | 
66 |         temp_oof = model.predict(val_dmatrix)
67 |         train_oof[val_ind] = temp_oof
68 | 
69 |         #print(roc_auc_score(val_target, temp_oof))
70 | 
71 |     # score the full out-of-fold predictions against the true labels
72 |     val_score = roc_auc_score(target, train_oof)
73 | 
74 |     return val_score
75 | 
76 | 
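# Illustrative wiring (a sketch, not part of the module): each objective is meant to be
# partially applied and handed to an Optuna study. The STATIC_PARAMS values, fold count,
# and trial count below are assumptions for illustration:
#
#   import optuna
#   from functools import partial
#
#   objective = partial(XGB_objective, train=train, target=target,
#                       STATIC_PARAMS={'objective': 'binary:logistic', 'eval_metric': 'auc'},
#                       ENABLE_CATEGORICAL=True, NUM_BOOST_ROUND=500,
#                       OPTUNA_CV="StratifiedKFold", OPTUNA_FOLDS=5, SEED=42)
#   study = optuna.create_study(direction="maximize")
#   study.optimize(objective, n_trials=100)
#
# LGB_objective below is wired the same way, with its extra category_columns and
# EARLY_STOPPING arguments.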
77 | def LGB_objective(trial, train, target, category_columns, STATIC_PARAMS, ENABLE_CATEGORICAL, NUM_BOOST_ROUND, OPTUNA_CV, OPTUNA_FOLDS, SEED, EARLY_STOPPING):
78 | 
79 |     # out-of-fold predictions, filled in fold by fold during cross-validation
80 |     train_oof = np.zeros((train.shape[0],))
81 | 
82 |     lgb_params = {
83 |         "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
84 |         "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
85 |         "learning_rate": trial.suggest_float('learning_rate', 1e-4, 0.5, log=True),
86 |         "max_depth": trial.suggest_categorical('max_depth', [5, 10, 20, 40, 100, -1]),
87 |         "n_estimators": trial.suggest_int("n_estimators", 50, 200000),
88 |         "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
89 |         "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
90 |         "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
91 |         "num_leaves": trial.suggest_int("num_leaves", 2, 1000),
92 |         "min_child_samples": trial.suggest_int("min_child_samples", 5, 300),
93 |         "cat_smooth": trial.suggest_int('min_data_per_groups', 1, 100)
94 |     }
95 | 
96 |     lgb_params = lgb_params | STATIC_PARAMS
97 | 
98 |     #pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "auc")
99 | 
100 |     if OPTUNA_CV == "StratifiedKFold":
101 |         kf = StratifiedKFold(n_splits=OPTUNA_FOLDS, shuffle=True, random_state=SEED)
102 |     elif OPTUNA_CV == "TimeSeriesSplit":
103 |         kf = TimeSeriesSplit(n_splits=OPTUNA_FOLDS)
104 | 
105 |     for f, (train_ind, val_ind) in enumerate(kf.split(train, target)):
106 | 
107 |         train_df, val_df = train.iloc[train_ind], train.iloc[val_ind]
108 |         train_target, val_target = target[train_ind], target[val_ind]
109 | 
110 |         train_lgbdataset = lgb.Dataset(train_df, label=train_target, categorical_feature=category_columns)
111 |         val_lgbdataset = lgb.Dataset(val_df, label=val_target, reference=train_lgbdataset, categorical_feature=category_columns)
112 | 
113 |         model = lgb.train(lgb_params,
114 |                           train_lgbdataset,
115 |                           valid_sets=val_lgbdataset,
116 |                           #num_boost_round = NUM_BOOST_ROUND,
117 |                           callbacks=[#log_evaluation(LOG_EVALUATION),
118 |                                      early_stopping(EARLY_STOPPING, verbose=False),
119 |                                      #pruning_callback,
120 |                                      ]
121 |                           #verbose_eval= VERBOSE_EVAL,
122 |                           )
123 | 
124 |         temp_oof = model.predict(val_df)
125 |         train_oof[val_ind] = temp_oof
126 | 
127 |         #print(roc_auc_score(val_target, temp_oof))
128 | 
129 |     # score the full out-of-fold predictions against the true labels
130 |     val_score = roc_auc_score(target, train_oof)
131 | 
132 |     return val_score
--------------------------------------------------------------------------------
/src/streamlit_app.py:
--------------------------------------------------------------------------------
1 | import os
2 | 
3 | import streamlit as st
4 | import hopsworks
5 | import joblib
6 | import pandas as pd
7 | import numpy as np
8 | import json
9 | import time
10 | from datetime import timedelta, datetime
11 | import xgboost as xgb
12 | 
13 | from src.hopsworks_utils import (
14 |     convert_feature_names,
15 | )
16 | 
17 | from src.feature_engineering import (
18 |     fix_datatypes,
19 |     remove_non_rolling,
20 | )
21 | 
22 | 
23 | # Load Hopsworks API key from .env file
24 | 
25 | from dotenv import load_dotenv
26 | 
27 | load_dotenv()
28 | 
29 | try:
30 |     HOPSWORKS_API_KEY = os.environ['HOPSWORKS_API_KEY']
31 | except KeyError:
32 |     raise Exception('Set environment variable HOPSWORKS_API_KEY')
33 | 
34 | 
35 | 
36 | def fancy_header(text, font_size=24):
37 |     res = f'<span style="font-size: {font_size}px;">{text}</span>'  # wrap the text in HTML so the font size is applied via st.markdown
38 |     st.markdown(res, unsafe_allow_html=True)
39 | 
40 | def get_model(project, model_name, evaluation_metric, sort_metrics_by):
41 |     """Retrieve desired model from the Hopsworks Model Registry."""
42 | 
43 |     mr = project.get_model_registry()
44 |     # get best model based on custom metrics
45 |     model = mr.get_best_model(model_name,
46 |                               evaluation_metric,
47 |                               sort_metrics_by)
48 | 
49 |     # download model from Hopsworks
50 |     #model_dir = model.download()
51 |     #print(model_dir)
52 |     with open("model.pkl", 'rb') as f:
53 |         loaded_model = joblib.load(f)
54 | 
55 | 
56 |     return loaded_model
57 | 
58 | 
59 | # dictionary to convert team ids to team names
60 | 
61 | nba_team_names = {
62 |     1610612737: "Atlanta Hawks",
63 |     1610612738: "Boston Celtics",
64 |     1610612739: "Cleveland Cavaliers",
65 |     1610612740: "New Orleans Pelicans",
66 |     1610612741: "Chicago Bulls",
67 |     1610612742: "Dallas Mavericks",
68 |     1610612743: "Denver Nuggets",
69 |     1610612744: "Golden State Warriors",
70 |     1610612745: "Houston Rockets",
71 |     1610612746: "LA Clippers",
72 |     1610612754: "Indiana Pacers",
73 |     1610612747: "Los Angeles Lakers",
74 |     1610612763: "Memphis Grizzlies",
75 |     1610612748: "Miami Heat",
76 |     1610612749: "Milwaukee Bucks",
77 |     1610612750: "Minnesota Timberwolves",
78 |     1610612751: "Brooklyn Nets",
79 |     1610612752: "New York Knicks",
80 |     1610612753: "Orlando Magic",
81 |     1610612755: "Philadelphia 76ers",
82 |     1610612756: "Phoenix Suns",
83 |     1610612757: "Portland Trail Blazers",
84 |     1610612758: "Sacramento Kings",
85 |     1610612759: "San Antonio Spurs",
86 |     1610612760: "Oklahoma City Thunder",
87 |     1610612761: "Toronto Raptors",
88 |     1610612762: "Utah Jazz",
89 |     1610612764: "Washington Wizards",
90 |     1610612765: "Detroit Pistons",
91 |     1610612766: "Charlotte Hornets",
92 | }
93 | 
94 | # Streamlit app
95 | st.title('NBA Prediction Project')
96 | 
97 | st.sidebar.header('⚙️ Working Progress')
98 | progress_bar = st.sidebar.progress(0)
99 | st.write(36 * "-")
100 | fancy_header('\n📡 Connecting to Hopsworks Feature Store...')
101 | 
102 | 
103 | # Connect to Hopsworks Feature Store and get Feature Group
104 | project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
105 | fs = project.get_feature_store()
106 | 
107 | rolling_stats_fg = fs.get_feature_group(
108 |     name="rolling_stats",
109 |     version=2,
110 | )
111 | 
112 | st.write("Successfully connected!✔️")
113 | progress_bar.progress(20)
114 | 
115 | 
116 | 
117 | # Get data from Feature Store
118 | st.write(36 * "-")
119 | fancy_header('\n☁️ Retrieving data from Feature Store...')
120 | 
121 | # filter new games that are scheduled for today
122 | # these are games where no points have been scored yet
123 | ds_query = rolling_stats_fg.filter(rolling_stats_fg.pts_home == 0)
124 | df_todays_matches = ds_query.read()
125 | 
126 | if df_todays_matches.shape[0] == 0:
127 |     progress_bar.progress(100)
128 |     st.write()
129 |     fancy_header('\n 🤷‍♂️ No games scheduled for today! 🤷‍♂️')
130 |     st.write()
131 |     st.write("Try again tomorrow!")
132 |     st.write()
133 |     st.write("The NBA season and postseason usually run from October to June.")
134 |     st.stop()
135 | 
136 | st.write("Successfully retrieved!✔️")
137 | progress_bar.progress(40)
138 | print(df_todays_matches.head(5))
139 | 
140 | 
141 | # Prepare data for prediction
142 | st.write(36 * "-")
143 | fancy_header('\n☁️ Processing data for prediction...')
144 | 
145 | # convert feature names back to mixed case
146 | df_todays_matches = convert_feature_names(df_todays_matches)
147 | 
148 | # Add a column that displays the matchup using the team names
149 | # this will make the display more meaningful
150 | df_todays_matches['MATCHUP'] = df_todays_matches['VISITOR_TEAM_ID'].map(nba_team_names) + " @ " + df_todays_matches['HOME_TEAM_ID'].map(nba_team_names)
151 | 
152 | # fix date and other types
153 | df_todays_matches = fix_datatypes(df_todays_matches)
154 | 
155 | # remove features not used by the model
156 | drop_columns = ['TARGET', 'GAME_DATE_EST', 'GAME_ID', ]
157 | df_todays_matches = df_todays_matches.drop(drop_columns, axis=1)
158 | 
159 | # remove stats from today's games - these are blank (the games haven't been played) and are not used by the model
160 | use_columns = remove_non_rolling(df_todays_matches)
161 | X = df_todays_matches[use_columns]
162 | 
163 | # MATCHUP is just for informational display, not used by the model
164 | X = X.drop('MATCHUP', axis=1)
165 | 
166 | #X_dmatrix = xgb.DMatrix(X) # convert to DMatrix for XGBoost
167 | 
168 | st.write(df_todays_matches['MATCHUP'])
169 | 
170 | st.write("Successfully processed!✔️")
171 | progress_bar.progress(60)
172 | 
173 | 
174 | # Load model from Hopsworks Model Registry
175 | st.write(36 * "-")
176 | fancy_header("Loading Best Model...")
177 | 
178 | model = get_model(project=project,
179 |                   model_name="xgboost",
180 |                   evaluation_metric="AUC",
181 |                   sort_metrics_by="max")
182 | 
183 | st.write("Successfully loaded!✔️")
184 | progress_bar.progress(80)
185 | 
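# A quick sketch of the model interface assumed below (assumption: the registry model
# was saved with the sklearn API, e.g. xgb.XGBClassifier, so predict_proba is available;
# a native Booster would use the commented-out DMatrix path instead):
#
#   preds = model.predict_proba(X)   # shape (n_games, 2)
#   preds[:, 1]                      # column 1 = probability that the home team wins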
186 | 
187 | 
188 | # Predict winning probabilities of home team
189 | st.write(36 * "-")
190 | fancy_header("Predicting Winning Probabilities...")
191 | 
192 | 
193 | #preds = model.predict(X_dmatrix)
194 | preds = model.predict_proba(X)[:,1]
195 | 
196 | df_todays_matches['HOME_TEAM_WIN_PROBABILITY'] = preds
197 | 
198 | st.dataframe(df_todays_matches[['MATCHUP', 'HOME_TEAM_WIN_PROBABILITY']])
199 | 
200 | progress_bar.progress(100)
201 | st.button("Re-run")
--------------------------------------------------------------------------------
/src/webscraping.py:
--------------------------------------------------------------------------------
1 | 
2 | import pandas as pd
3 | import numpy as np
4 | 
5 | import os
6 | 
7 | import asyncio
8 | 
9 | # if using scrapingant, import these
10 | from scrapingant_client import ScrapingAntClient
11 | 
12 | # if using selenium and chrome, import these
13 | from selenium import webdriver
14 | from selenium.webdriver.chrome.service import Service as ChromiumService
15 | from selenium.webdriver.common.by import By
16 | from selenium.webdriver.chrome.options import Options
17 | from webdriver_manager.core.utils import ChromeType
18 | from webdriver_manager.chrome import ChromeDriverManager
19 | 
20 | # if using selenium and firefox, import these
21 | from selenium.webdriver.firefox.service import Service as FirefoxService
22 | from webdriver_manager.firefox import GeckoDriverManager
23 | 
24 | 
25 | from bs4 import BeautifulSoup as soup
26 | 
27 | from datetime import datetime, timedelta
28 | from pytz import timezone
29 | 
30 | from pathlib import Path  # for Windows/Linux compatibility
31 | DATAPATH = Path(r'data')
32 | 
33 | import time
34 | 
35 | 
36 | 
37 | 
38 | def activate_web_driver(browser: str) -> webdriver:
39 |     """
40 |     Activate selenium web driver for use in scraping
41 | 
42 |     Args:
43 |         browser (str): the name of the browser to use, either "firefox" or "chromium"
44 | 
45 |     Returns:
46 |         the selected webdriver
47 |     """
48 | 
49 |     # options for selenium webdrivers, used to assist headless scraping. Still ran into issues, so scrapingant is used instead when running from GitHub Actions
50 |     options = [
51 |         "--headless",
52 |         "--window-size=1920,1200",
53 |         "--start-maximized",
54 |         "--no-sandbox",
55 |         "--disable-dev-shm-usage",
56 |         "--disable-gpu",
57 |         "--ignore-certificate-errors",
58 |         "--disable-extensions",
59 |         "--disable-popup-blocking",
60 |         "--disable-notifications",
61 |         "--remote-debugging-port=9222",  #https://stackoverflow.com/questions/56637973/how-to-fix-selenium-devtoolsactiveport-file-doesnt-exist-exception-in-python
62 |         "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
63 |         "--disable-blink-features=AutomationControlled",
64 |     ]
65 | 
66 |     if browser == "firefox":
67 |         service = FirefoxService(executable_path=GeckoDriverManager().install())
68 | 
69 |         firefox_options = webdriver.FirefoxOptions()
70 |         for option in options:
71 |             firefox_options.add_argument(option)
72 | 
73 |         driver = webdriver.Firefox(service=service, options=firefox_options)
74 | 
75 |     else:
76 |         service = ChromiumService(ChromeDriverManager(chrome_type=ChromeType.CHROMIUM).install())
77 | 
78 |         chrome_options = Options()
79 |         for option in options:
80 |             chrome_options.add_argument(option)
81 | 
82 |         driver = webdriver.Chrome(service=service, options=chrome_options)
83 | 
84 |     return driver
85 | 
86 | 
87 | 
88 | def get_new_games(SCRAPINGANT_API_KEY: str, driver: webdriver) -> pd.DataFrame:
89 | 
90 |     # search for the previous days' games; use 2 days to catch up in case of a failed run
91 |     DAYS = 2
92 |     SEASON = ""  # no season will cause the website to default to the current season; the format is "2022-23"
93 |     TODAY = datetime.now(timezone('EST'))  # nba.com uses US Eastern Standard Time
94 |     LASTWEEK = (TODAY - timedelta(days=DAYS))
95 |     DATETO = TODAY.strftime("%m/%d/%y")
96 |     DATEFROM = LASTWEEK.strftime("%m/%d/%y")
97 | 
98 | 
99 |     df = scrape_to_dataframe(api_key=SCRAPINGANT_API_KEY, driver=driver, Season=SEASON, DateFrom=DATEFROM, DateTo=DATETO, )
100 | 
101 |     df = convert_columns(df)
102 | 
103 | 
104 |     df = combine_home_visitor(df)
105 | 
106 |     return df
107 | 
108 | 
109 | 
110 | def parse_ids(data_table):
111 | 
112 |     # TEAM_ID and GAME_ID are encoded in href= links
113 |     # find all the hrefs, add them to a list,
114 |     # then parse out a list for team ids and game ids
115 |     # and convert these to pandas series
116 | 
117 |     CLASS_ID = 'Anchor_anchor__cSc3P'  # determined by visual inspection of page source code
118 | 
119 |     # get all the links
120 |     links = data_table.find_all('a', {'class':CLASS_ID})
121 | 
122 |     # get the href part (web addresses)
123 |     # href="/stats/team/1610612740" for teams
124 |     # href="/game/0022200191" for games
125 |     links_list = [i.get("href") for i in links]
126 | 
127 |     # create a series using the last 10 digits of the appropriate links
128 |     team_id = pd.Series([i[-10:] for i in links_list if ('stats' in i)])
129 |     game_id = pd.Series([i[-10:] for i in links_list if ('/game/' in i)])
130 | 
131 |     return team_id, game_id
132 | 
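# Illustrative example of what parse_ids extracts (hrefs as in the comments above):
#
#   links_list = ["/stats/team/1610612740", "/game/0022200191"]
#   team_id -> pd.Series(["1610612740"])   # last 10 characters of the 'stats' links
#   game_id -> pd.Series(["0022200191"])   # last 10 characters of the '/game/' links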
133 | 
134 | 
135 | def scrape_to_dataframe(api_key, driver, Season, DateFrom="NONE", DateTo="NONE", stat_type='standard'):
136 | 
137 |     # go to the boxscores webpage at nba.com
138 |     # check if the data table is split over multiple pages
139 |     # if so, then select the "ALL" choice in the pulldown menu to show all rows on one page
140 |     # extract out the html table and convert it to a dataframe
141 |     # parse out GAME_ID and TEAM_ID from href links
142 |     # and add these to the dataframe
143 | 
144 |     # if Season is not provided, the website will default to the current season
145 |     # if DateFrom and DateTo are not provided, don't include them in the url - pull the whole season
146 |     if stat_type == 'standard':
147 |         nba_url = "https://www.nba.com/stats/teams/boxscores"
148 |     else:
149 |         nba_url = "https://www.nba.com/stats/teams/boxscores-" + stat_type
150 | 
151 |     if not Season:
152 |         nba_url = nba_url + "?DateFrom=" + DateFrom + "&DateTo=" + DateTo
153 |     else:
154 |         if DateFrom == "NONE" and DateTo == "NONE":
155 |             nba_url = nba_url + "?Season=" + Season
156 |         else:
157 |             nba_url = nba_url + "?Season=" + Season + "&DateFrom=" + DateFrom + "&DateTo=" + DateTo
158 | 
159 |     # try 3 times to load the page correctly; scrapingant can sometimes fail on its first try
160 |     for i in range(3):
161 |         if api_key == "":  # if no api key, then use selenium
162 |             driver.get(nba_url)
163 |             time.sleep(10)
164 |             source = soup(driver.page_source, 'html.parser')
165 |         else:  # if api key, then use scrapingant
166 |             client = ScrapingAntClient(token=api_key)
167 |             result = client.general_request(nba_url)
168 |             source = soup(result.content, 'html.parser')
169 | 
170 |         # the data table is the key dynamic element that may fail to load
171 |         CLASS_ID_TABLE = 'Crom_table__p1iZz'  # determined by visual inspection of page source code
172 |         data_table = source.find('table', {'class':CLASS_ID_TABLE})
173 | 
174 |         if data_table is None:
175 |             time.sleep(10)
176 |         else:
177 |             break
178 | 
179 | 
180 |     # check for more than one page
181 |     CLASS_ID_PAGINATION = "Pagination_pageDropdown__KgjBU"  # determined by visual inspection of page source code
182 |     pagination = source.find('div', {'class':CLASS_ID_PAGINATION})
183 | 
184 |     if api_key == "":  # if using selenium, then check for multiple pages
185 |         if pagination is not None:
186 |             # if multiple pages, first activate the pulldown option for "ALL" to show all rows on one page
187 |             CLASS_ID_DROPDOWN = "DropDown_select__4pIg9"  # determined by visual inspection of page source code
188 |             page_dropdown = driver.find_element(By.XPATH, "//*[@class='" + CLASS_ID_PAGINATION + "']//*[@class='" + CLASS_ID_DROPDOWN + "']")
189 | 
190 |             page_dropdown.send_keys("ALL")  # show all pages
191 |             #page_dropdown.click() doesn't work in headless mode
192 |             time.sleep(3)
193 |             driver.execute_script('arguments[0].click()', page_dropdown)  # click() didn't work in headless mode, used this workaround (https://stackoverflow.com/questions/57741875)
194 | 
195 |             # refresh the page data now that it contains all rows of the table
196 |             time.sleep(3)
197 |             source = soup(driver.page_source, 'html.parser')
198 |             data_table = source.find('table', {'class':CLASS_ID_TABLE})
199 | 
200 |     #print(source)
201 | 
202 |     # convert the html table to a dataframe
203 |     dfs = pd.read_html(str(data_table), header=0)
204 |     df = pd.concat(dfs)
205 | 
206 |     # pull out team ids and game ids from hrefs and add these to the dataframe
207 |     TEAM_ID, GAME_ID = parse_ids(data_table)
208 |     df['TEAM_ID'] = TEAM_ID
209 |     df['GAME_ID'] = GAME_ID
210 | 
211 | 
212 |     return df
213 | 
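# Example of the URL this builds when no Season is given (dates in MM/DD/YY, as
# passed by get_new_games above):
#
#   scrape_to_dataframe(api_key, driver, Season="", DateFrom="12/01/22", DateTo="12/03/22")
#   -> "https://www.nba.com/stats/teams/boxscores?DateFrom=12/01/22&DateTo=12/03/22"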
214 | def convert_columns(df):
215 | 
216 |     # convert the dataframe to the same format and column names as the main data
217 | 
218 |     # drop columns not used
219 |     drop_columns = ['Team', 'MIN', 'FGM', 'FGA', '3PM', '3PA', 'FTM', 'FTA', 'OREB', 'DREB', 'STL', 'BLK', 'TOV', 'PF', '+/-',]
220 |     df = df.drop(columns=drop_columns)
221 | 
222 |     # rename columns to match existing dataframes
223 |     mapper = {
224 |         'Match Up': 'HOME',
225 |         'Game Date': 'GAME_DATE_EST',
226 |         'W/L': 'HOME_TEAM_WINS',
227 |         'FG%': 'FG_PCT',
228 |         '3P%': 'FG3_PCT',
229 |         'FT%': 'FT_PCT',
230 |     }
231 |     df = df.rename(columns=mapper)
232 | 
233 |     # reformat column data
234 | 
235 |     # make HOME true if @ is in the text
236 |     # (Match Ups appear as "POR @ DAL" or "DAL vs. POR")
237 |     df['HOME'] = df['HOME'].apply(lambda x: 1 if '@' in x else 0)
238 | 
239 |     # convert wins to home team wins
240 |     # incomplete games will be NaN
241 |     df = df[df['HOME_TEAM_WINS'].notna()]
242 |     # convert W/L to 1/0
243 |     df['HOME_TEAM_WINS'] = df['HOME_TEAM_WINS'].apply(lambda x: 1 if 'W' in x else 0)
244 |     # no need to do anything else, the win/loss of visitor teams is not used in the final dataframe
245 | 
246 |     # convert date format
247 |     df['GAME_DATE_EST'] = pd.to_datetime(df['GAME_DATE_EST'])
248 |     df['GAME_DATE_EST'] = df['GAME_DATE_EST'].dt.strftime('%Y-%m-%d')
249 |     df['GAME_DATE_EST'] = pd.to_datetime(df['GAME_DATE_EST'])
250 | 
251 |     return df
252 | 
253 | def combine_home_visitor(df):
254 | 
255 |     # each game currently has one row for home team stats
256 |     # and one row for visitor team stats
257 |     # these will be combined into a single row
258 | 
259 |     # separate home vs visitor
260 |     home_df = df[df['HOME'] == 1]
261 |     visitor_df = df[df['HOME'] == 0]
262 | 
263 |     # HOME column no longer needed
264 |     home_df = home_df.drop(columns='HOME')
265 |     visitor_df = visitor_df.drop(columns='HOME')
266 | 
267 |     # HOME_TEAM_WINS and GAME_DATE_EST columns not needed for visitor
268 |     visitor_df = visitor_df.drop(columns=['HOME_TEAM_WINS','GAME_DATE_EST'])
269 | 
270 |     # rename TEAM_ID columns
271 |     home_df = home_df.rename(columns={'TEAM_ID':'HOME_TEAM_ID'})
272 |     visitor_df = visitor_df.rename(columns={'TEAM_ID':'VISITOR_TEAM_ID'})
273 | 
274 |     # merge the home and visitor data
275 |     df = pd.merge(home_df, visitor_df, how="left", on=["GAME_ID"], suffixes=('_home', '_away'))
276 | 
277 |     # add a column for SEASON
278 |     # determine SEASON by parsing GAME_ID
279 |     # (e.g. in 0022200192, the first 2 digits are not used, the 3rd digit 2 = regular season, and the 4th and 5th digits = SEASON)
280 |     game_id = df['GAME_ID'].iloc[0]
281 |     season = game_id[3:5]
282 |     season = str(20) + season
283 |     df['SEASON'] = season
284 | 
285 |     #print(df)
286 | 
287 |     # convert all object columns to int64
288 |     for field in df.select_dtypes(include=['object']).columns.tolist():
289 |         df[field] = df[field].astype('int64')
290 | 
291 |     return df
292 | 
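# Worked example of the SEASON parsing above:
#
#   game_id = "0022200192"
#   game_id[3:5] -> "22";  "20" + "22" -> SEASON "2022"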
"Wednesday, February 1") 318 | div_games = source.find('div', {'class':CLASS_GAMES_PER_DAY}) # first div may or may not be yesterday's games or even future games when playoffs approach 319 | div_game_day = source.find('h4', {'class':CLASS_DAY}) 320 | today = datetime.today().strftime('%A, %B %d')[:3] # e.g. "Wednesday, February 1" -> "Wed" for convenience with dealing with leading zeros 321 | todays_games = None 322 | 323 | while div_games: 324 | print(div_game_day.text[:3]) 325 | if today == div_game_day.text[:3]: 326 | todays_games = div_games 327 | break 328 | else: 329 | # move to next div 330 | div_games = div_games.find_next('div', {'class':CLASS_GAMES_PER_DAY}) 331 | div_game_day = div_game_day.find_next('h4', {'class':CLASS_DAY}) 332 | 333 | if todays_games is None: 334 | # no games today 335 | return None, None 336 | 337 | # Get the teams playing 338 | # Each team listed in todays block will have a href with the specified anchor class 339 | # e.g. PREVIEW 362 | # Each game will have two links with the specified anchor class, one for the preview and one to buy tickets 363 | # all using the same anchor class, so we will filter out those just for PREVIEW 364 | CLASS_ID = "Anchor_anchor__cSc3P TabLink_link__f_15h" 365 | links = todays_games.find_all('a', {'class':CLASS_ID}) 366 | #print(links) 367 | links = [i for i in links if "PREVIEW" in i] 368 | game_id_list = [i.get("href") for i in links] 369 | #print(game_id_list) 370 | 371 | games = [] 372 | for game in game_id_list: 373 | game_id = game.partition("-00")[2].partition("?")[0] # extract team id from text for link 374 | if len(game_id) > 0: 375 | games.append(game_id) 376 | 377 | #asyncio.run(main()) 378 | 379 | return matchups, games 380 | --------------------------------------------------------------------------------