├── .github
│   └── workflows
│       └── production-features-pipeline.yml
├── .gitignore
├── README.md
├── data
│   └── .gitkeep
├── images
│   ├── basketball_money2.jpg
│   ├── calibration.png
│   ├── confusion_matrix.png
│   ├── correlation_bar_chart.png
│   ├── distributions.png
│   ├── neptune.png
│   ├── streamlit2.png
│   ├── streamlit_example.jpg
│   ├── train_vs_test_shapley.jpg
│   └── train_vs_test_shapley.svg
├── notebooks
│   ├── 01_eda.ipynb
│   ├── 02_data_processing.ipynb
│   ├── 03_train_test_split.ipynb
│   ├── 04_model_baseline.ipynb
│   ├── 05_feature_engineering.ipynb
│   ├── 06_feature_selection.ipynb
│   ├── 07_model_testing.ipynb
│   ├── 08_backfill_features.ipynb
│   ├── 09_production_features_pipeline.ipynb
│   ├── 10_model_training_pipeline.ipynb
│   ├── feature_names.json
│   ├── hopsworks.ipynb
│   ├── lightgbm.json
│   ├── model_data.json
│   └── xgboost.json
├── requirements.txt
└── src
    ├── common_functions.py
    ├── data_processing.py
    ├── feature_engineering.py
    ├── hopsworks_utils.py
    ├── model_training.py
    ├── optuna_objectives.py
    ├── streamlit_app.py
    └── webscraping.py

/.github/workflows/production-features-pipeline.yml:
--------------------------------------------------------------------------------
1 | 
2 | name: production-features-pipeline
3 | 
4 | on:
5 |   workflow_dispatch:
6 |   # schedule:
7 |   #   - cron: '0 8 * * *'
8 | 
9 | jobs:
10 |   scrape_features:
11 |     #runs-on: windows-latest
12 |     runs-on: ubuntu-latest
13 |     steps:
14 |       - name: checkout repo content
15 |         uses: actions/checkout@v2
16 | 
17 |       - name: setup python
18 |         uses: actions/setup-python@v2
19 |         with:
20 |           python-version: '3.9.13'
21 | 
22 |       - name: install python packages
23 |         run: |
24 |           python -m pip install --upgrade pip
25 |           pip install -r requirements.txt
26 |           python -m pip install jupyter nbconvert nbformat scrapingant-client
27 | 
28 |       - name: execute python workflows from notebook
29 |         env:
30 |           HOPSWORKS_API_KEY: ${{ secrets.HOPSWORKS_API_KEY }}
31 |           NEPTUNE_API_TOKEN: ${{ secrets.NEPTUNE_API_TOKEN }}
32 |           SCRAPINGANT_API_KEY: ${{ secrets.SCRAPINGANT_API_KEY }}
33 | 
34 |         run:
35 |           jupyter nbconvert --to notebook --execute notebooks/09_production_features_pipeline.ipynb
36 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | /venv
2 | /old
3 | /.ipynb_checkpoints
4 | .hw_api_key
5 | .ipynb_checkpoints
6 | .neptune
7 | __pycache__
8 | ca_chain
9 | certfile
10 | keyfile
11 | geckodriver.log
12 | .env
13 | .vscode
14 | .venv/
15 | *.csv
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # NBA game predictor
3 | [Click here to see it in action](https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/)
4 | 
5 | 
6 | 
7 | Let's connect 🤗
8 | 
9 | [Twitter](https://twitter.com/curiovana) •
10 | [LinkedIn](https://www.linkedin.com/in/chris-munch/)
11 | 
12 | 
13 | 
14 | ![image info](./images/basketball_money2.jpg)
15 | 
16 | #### Table of contents
17 | - [Introduction](#introduction)
18 | - [Problem](#problem-increase-the-profitability-of-betting-on-nba-games)
19 | - [Initial step](#initial-step-predict-the-probability-that-the-home-team-will-win-each-game)
20 | - [Plan](#plan)
21 | - [Overview](#overview)
22 | - [Future Possibilities](#future-possibilities)
23 | - [Structure](#structure)
24 | - [Data](#data)
25 | - [EDA and data processing](#eda-and-data-processing)
26 | - [Train/validation/test split](#train--testvalidation-split)
27 | - [Baseline models](#baseline-models)
28 | - [Feature engineering](#feature-engineering)
29 | - [Model training/testing](#model-trainingtesting)
30 | - [Streamlit app](#streamlit-app)
31 | - [Feedback](#feedback)
32 | 
33 | ## Introduction
34 | 
35 | This project is a demonstration of my ability to quickly learn, develop, and deploy end-to-end machine learning technologies. I am currently seeking to change careers into Machine Learning / Data Science.
36 | 
37 | I chose to predict the winner of NBA games because:
38 | - multiple games are played every day during the season, so I can see how my model performs on a daily basis
39 | - picking a game winner is easy for a casual audience to understand and appreciate
40 | - there is a lot of data available
41 | - it can be used to make money (via betting strategies). I have always been interested in making money.
42 | 
43 | I am actually not really a big fan of the NBA, but I have watched a few games and have a basic knowledge of the sport. I have never done any sports betting either, but I have always loved exploration and discovery; the possibility of finding something that somebody else has "missed" is very appealing to me, especially in terms of competition and of making money.
44 | 
45 | ## Problem: Increase the profitability of betting on NBA games
46 | 
47 | ### Initial Step: Predict the probability that the home team will win each game
48 | 
49 | Machine learning classification models will be used to predict the probability that the home team will win each game, based upon historical data. This is a first step in developing a betting strategy that will increase the profitability of betting on NBA games.
50 | 
51 | *Disclaimer*
52 | 
53 | In reality, a betting strategy is a rather complex problem with many elements beyond simply picking the winner of each game. Huge amounts of manpower and money have been invested in developing such strategies, and it is not likely that a learning project will be able to compete very well with such efforts. However, it may provide an extra element of insight that could be used to improve the profitability of an existing betting strategy.
54 | 
55 | Project Repository: [https://github.com/cmunch1/nba-prediction](https://github.com/cmunch1/nba-prediction)
56 | 
57 | ### Plan
58 | 
59 | - Gradient boosted tree models (XGBoost and LightGBM) will be utilized to determine the probability that the home team will win each game.
60 | - The model probability will be calibrated against the true probability distribution using sklearn's CalibratedClassifierCV.
61 | - The probability of winning will be important in developing betting strategies because such strategies will not bet on every game, just on games with better expected values (see the sketch after this list).
62 | - Pipelines will be set up to scrape new data from the NBA website every day and retrain the model when desired.
63 | - The model will be deployed online using a [streamlit app](https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/) to predict and report winning probabilities every day.
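To illustrate the "better expected values" idea above: a bet is only attractive when the probability-weighted payout exceeds the probability-weighted loss. Below is a minimal sketch of that arithmetic (this is not code from this repository, and the probability and odds are hypothetical numbers):

```python
def expected_value(p_win: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Expected profit of a bet: a win pays stake * (odds - 1), a loss costs the stake."""
    return p_win * stake * (decimal_odds - 1) - (1 - p_win) * stake

p_home = 0.62     # hypothetical calibrated model probability for the home team
odds_home = 1.80  # hypothetical decimal odds (imply a ~0.556 win probability)

ev = expected_value(p_home, odds_home)
print(f"EV per unit staked: {ev:+.3f}")  # +0.116 here, so this game is a candidate bet
```

A strategy built on this would skip any game where the expected value is zero or negative, which is why well-calibrated probabilities matter more here than raw classification accuracy.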
64 | 
65 | 
66 | 
67 | 
68 | ### Overview
69 | 
70 | - Historical game data is retrieved from Kaggle.
71 | - EDA, Data Processing, and Feature Engineering are used to develop the best model using either XGBoost or LightGBM.
72 | - The data and model are added to a serverless Feature Store and Model Registry.
73 | - The model is deployed online as a Streamlit app.
74 | - Pipelines are set up to:
75 |     - Scrape new data from the NBA website and add it to the Feature Store every day using GitHub Actions
76 |     - Retrain the model and tune hyperparameters
77 | 
78 | Tools Used:
79 | 
80 | - VS Code w/ Copilot - IDE
81 | - Pandas - data manipulation
82 | - XGBoost - modeling
83 | - LightGBM - modeling
84 | - Scikit-learn - probability calibration
85 | - Optuna - hyperparameter tuning
86 | - Neptune.ai - experiment tracking
87 | - Selenium - data scraping and processing
88 | - ScrapingAnt - data scraping
89 | - BeautifulSoup - data processing of scraped data
90 | - Hopsworks.ai - Feature Store and Model Registry
91 | - GitHub Actions - running notebooks to scrape new data, predict winning probabilities, and retrain models
92 | - Streamlit - web app deployment
93 | 
94 | ### Future Possibilities
95 | 
96 | Continual improvements might include:
97 | 
98 | - More feature engineering/selection
99 | - More data sources (player stats, injuries, etc...)
100 | - A/B testing against outside and internal models (e.g. other predictor projects, Vegas odds, betting site odds, etc...)
101 | - Tracking model performance over time and testing for model drift
102 | - Developing optimized retraining criteria (e.g. time periods, number of games, model drift, etc...)
103 | - Better data visualizations of model predictions and performance
104 | - Developing betting strategies based upon model predictions
105 | - Ensembling betting strategies with other models and strategies, including human experts
106 | - Tracking model performance against other models and betting strategies
107 | 
108 | 
109 | ### Structure
110 | 
111 | Jupyter Notebooks were used for initial development and testing and are labeled 01 through 10 in the main directory. Notebooks 01 through 06 are primarily historical records and notes for the development process.
112 | 
113 | Key functions were moved to .py files in the src directory once they were stable.
114 | 
115 | Notebooks 07, 09, and 10 are used in production. I chose to keep the notebooks instead of fully converting them to scripts because:
116 | 
117 | - I think they look better in terms of documentation
118 | - I sometimes prefer to peruse the notebook output after model testing and retraining instead of relying solely on experiment tracking logs
119 | - I haven't yet conceptually decided on my preferred way of structuring my model testing pipelines for best reusability and maintainability (e.g. should I use custom wrapper functions to invoke experiment logging so that I can easily change providers, or should I just choose one provider and stick with their API?)
120 | 
121 | 
122 | 
123 | ### Data
124 | 
125 | Data from the 2013 through 2021 seasons has been archived on Kaggle. New data is scraped from the NBA website.
126 | 
127 | Currently available data includes:
128 | 
129 | - games_details.csv .. (each-game player stats for everyone on the roster)
130 | - games.csv .......... (each-game team stats: final scores, points scored, field-goal & free-throw percentages, etc...)
131 | - players.csv ........ (index of players' names and teams)
132 | - ranking.csv ........ (incremental daily record of standings, games played, won, lost, win%, home record, road record)
133 | - teams.csv .......... (index of team info such as city and arena names, and also head coach)
134 | 
135 | NOTES
136 | - games.csv is the primary data source and will be the only data used initially
137 | - games_details.csv details individual player stats for each game and may be added to the model later
138 | - ranking.csv data is essentially cumulative averages from the beginning of the season and is not really needed, as these and other rolling averages can be calculated from the games.csv data
139 | 
140 | 
141 | **New Data**
142 | 
143 | New data is scraped from [https://www.nba.com/stats/teams/boxscores](https://www.nba.com/stats/teams/boxscores)
144 | 
145 | 
146 | **Data Leakage**
147 | 
148 | The data for each game are stats for the *completed* game. We want to predict the winner *before* the game is played, not after, so the model should only use data that would be available before the game is played. Our model features will primarily be rolling stats for the previous games (e.g. average assists for the previous 5 games) while excluding the current game, as sketched below.
149 | 
150 | I mention this because I did see several similar projects online that failed to take this into account. If the goal is simply to identify which stats are important for winning games, then the model can be trained on the entire dataset. However, if the goal is to predict the winner of a game, as we are trying to do, then the model must be trained on data that would only be available before the game is played.
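To make the leakage rule concrete, here is a minimal sketch of a leakage-safe rolling feature (toy data; this is not the project's actual feature-engineering code). The `shift(1)` moves each team's history back one game, so the rolling mean never sees the stats of the game being predicted:

```python
import pandas as pd

# Toy box scores: two teams, four games each (made-up numbers)
games = pd.DataFrame({
    "TEAM_ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "GAME_DATE_EST": list(pd.date_range("2021-01-01", periods=4)) * 2,
    "PTS": [100, 110, 96, 120, 99, 105, 112, 101],
}).sort_values(["TEAM_ID", "GAME_DATE_EST"])

# Average points over the previous 3 games, excluding the current game
games["PTS_AVG_LAST_3"] = (
    games.groupby("TEAM_ID")["PTS"]
         .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)

print(games)  # each team's first game gets NaN - there is no prior history yet
```

Without the `shift(1)`, the current game's final stats would leak into its own feature row, which is exactly the mistake described above.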
151 | 
152 | ### EDA and Data Processing
153 | 
154 | Exploratory Data Analysis (EDA) and Data Processing are summarized and detailed in the notebooks. Some examples include:
155 | 
156 | Histograms of various features
157 | 
158 | ![distributions](./images/distributions.png)
159 | 
160 | Correlations between features
161 | 
162 | ![correlations](./images/correlation_bar_chart.png)
163 | 
164 | 
165 | ### Train / Test/Validation Split
166 | 
167 | - The latest season is used as Test/Validation data and previous seasons are used as Train data
168 | 
169 | ### Baseline Models
170 | 
171 | Simple If-Then Models
172 | 
173 | - Home team always wins (Accuracy = 0.59, AUC = 0.50 on Train data; Accuracy = 0.49, AUC = 0.50 on Test data)
174 | 
175 | ML Models
176 | 
177 | - LightGBM (Accuracy = 0.58, AUC = 0.64 on Test data)
178 | - XGBoost (Accuracy = 0.59, AUC = 0.61 on Test data)
179 | 
180 | ### Feature Engineering
181 | 
182 | - Convert the game date to month only
183 | - Compile rolling means over various time periods for each team as home team and as visitor team
184 | - Compile the current win streak for each team as home team and as visitor team
185 | - Compile head-to-head matchup data for each team pair
186 | - Compile rolling means over various time periods for each team regardless of home or visitor status
187 | - Compile the current win streak for each team regardless of home or visitor status
188 | - Subtract the league-average rolling means from each team's rolling means
189 | 
190 | 
191 | ### Model Training/Testing
192 | 
193 | **Models**
194 | - LightGBM
195 | - XGBoost
196 | 
197 | The native Python API (rather than the Scikit-learn wrapper) is used for initial testing of both models because of its easy access to built-in Shapley values, which are used for feature importance analysis and for adversarial validation. (Since Shapley values are local to each dataset, they can be used to determine whether the train and test datasets have the same feature importances. If they do not, it may indicate that the model does not generalize very well.)
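As a rough, self-contained illustration of this adversarial-validation check (synthetic data; this is not the project's notebook code), LightGBM's native API exposes per-row Shapley values through `pred_contrib=True`:

```python
import numpy as np
import lightgbm as lgb

# Synthetic stand-ins for the real train/test feature matrices
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(200, 4))

booster = lgb.train({"objective": "binary", "verbosity": -1},
                    lgb.Dataset(X_train, label=y_train))

def mean_abs_shap(bst, X):
    contribs = bst.predict(X, pred_contrib=True)  # last column is the bias term
    return np.abs(contribs[:, :-1]).mean(axis=0)

# Similar magnitudes/rankings suggest the two sets "agree" on which features matter;
# large disagreements hint that the model may not generalize to the test set.
print("train:", mean_abs_shap(booster, X_train).round(3))
print("test: ", mean_abs_shap(booster, X_test).round(3))
```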
198 | 
199 | The Scikit-learn wrapper is used later in production because it allows for easier probability calibration using sklearn's CalibratedClassifierCV.
200 | 
201 | 
202 | 
203 | **Evaluation**
204 | - AUC is the primary metric; Accuracy is a secondary metric (it is more meaningful to casual users)
205 | - Shapley values compared: Train set vs Test/Validation set
206 | - Test/Validation set is split: early half vs later half
207 | 
208 | ![train vs test shapley](./images/train_vs_test_shapley.jpg)
209 | 
210 | 
211 | **Experiment Tracking**
212 | 
213 | Notebook 07 integrates Neptune.ai for experiment tracking and Optuna for hyperparameter tuning.
214 | 
215 | Experiment tracking logs can be viewed here: [https://app.neptune.ai/cmunch1/nba-prediction/experiments?split=tbl&dash=charts&viewId=979e20ed-e172-4c33-8aae-0b1aa1af3602](https://app.neptune.ai/cmunch1/nba-prediction/experiments?split=tbl&dash=charts&viewId=979e20ed-e172-4c33-8aae-0b1aa1af3602)
216 | 
217 | ![neptune](./images/neptune.png)
218 | 
219 | 
220 | **Probability Calibration**
221 | 
222 | Scikit-learn's CalibratedClassifierCV is used to ensure that the model probabilities are calibrated against the true probability distribution. The Brier loss score is used by the software to automatically select the best calibration method (sigmoid, isotonic, or none).
223 | 
224 | ![calibration](./images/calibration.png)
225 | 
226 | 
227 | 
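The real calibration logic lives in notebook 07; here is a rough, self-contained sketch of the selection idea (synthetic data, LightGBM as the base model): fit an uncalibrated model plus sigmoid- and isotonic-calibrated versions, then keep whichever has the lowest Brier loss on held-out data.

```python
from lightgbm import LGBMClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {"none": LGBMClassifier(verbose=-1).fit(X_tr, y_tr)}
for method in ("sigmoid", "isotonic"):
    cal = CalibratedClassifierCV(LGBMClassifier(verbose=-1), method=method, cv=5)
    candidates[method] = cal.fit(X_tr, y_tr)

# A lower Brier loss means the predicted probabilities sit closer to observed outcomes
scores = {name: brier_score_loss(y_val, clf.predict_proba(X_val)[:, 1])
          for name, clf in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "-> best method:", best)
```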
228 | ### Production Features Pipeline
229 | 
230 | Notebook 09 is run from a GitHub Actions workflow every morning.
231 | 
232 | - It scrapes the stats from the previous day's games, updates all the rolling statistics and streaks, and adds them to the Feature Store.
233 | - It scrapes the upcoming game matchups for the current day and adds them to the Feature Store so that the Streamlit app can use them to make its daily predictions.
234 | 
235 | A variable can be set to use either Selenium or ScrapingAnt for scraping the data. ScrapingAnt is used in production because of its built-in proxy server.
236 | 
237 | - The Selenium notebook worked fine when run locally, but there were issues when running the notebook in GitHub Actions, likely due to the IP address and anti-bot measures on the NBA website (which would require a proxy server to address)
238 | - ScrapingAnt is a cloud-based scraper with a Python API that handles the proxy server issues. An account is required, but the free account is sufficient for this project.
239 | 
240 | ### Model Training Pipeline
241 | 
242 | Notebook 10 retrieves the most current data, executes Notebook 07 to handle hyperparameter tuning, model training, and calibration, and then adds the model to the Model Registry. The time periods used for the train set and test set can be adjusted so that the model can be tested only on the most current games.
243 | 
244 | ### Streamlit App
245 | 
246 | The Streamlit app is deployed at streamlit.io and can be accessed here: [https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/](https://cmunch1-nba-prediction-streamlit-app-fs5l47.streamlit.app/)
247 | 
248 | It uses the model in the Model Registry to predict the win probability of the home team for the current day's upcoming games.
249 | 
250 | 
251 | 
252 | ### Feedback
253 | 
254 | Thanks for taking the time to read about my project. This is my primary "portfolio" project in my quest to change careers and find an entry-level position in Machine Learning / Data Science. I appreciate any feedback.
255 | 
256 | Project Repository: [https://github.com/cmunch1/nba-prediction](https://github.com/cmunch1/nba-prediction)
257 | 
258 | My LinkedIn profile: [https://www.linkedin.com/in/chris-munch/](https://www.linkedin.com/in/chris-munch/)
259 | 
260 | Twitter: [https://twitter.com/curiovana](https://twitter.com/curiovana)
261 | 
--------------------------------------------------------------------------------
/data/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/data/.gitkeep
--------------------------------------------------------------------------------
/images/basketball_money2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/basketball_money2.jpg
--------------------------------------------------------------------------------
/images/calibration.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/calibration.png
--------------------------------------------------------------------------------
/images/confusion_matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/confusion_matrix.png
--------------------------------------------------------------------------------
/images/correlation_bar_chart.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/correlation_bar_chart.png
--------------------------------------------------------------------------------
/images/distributions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/distributions.png
--------------------------------------------------------------------------------
/images/neptune.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/neptune.png
--------------------------------------------------------------------------------
/images/streamlit2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/streamlit2.png
--------------------------------------------------------------------------------
/images/streamlit_example.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/streamlit_example.jpg
--------------------------------------------------------------------------------
/images/train_vs_test_shapley.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Paulescu/nba-prediction/fd91c0268fa4ca7a6488e76d21a2264085814c8c/images/train_vs_test_shapley.jpg
--------------------------------------------------------------------------------
/notebooks/03_train_test_split.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "id": "defc34ab-7465-454b-a148-f18f857eafaa",
6 |    "metadata": {},
7 |    "source": [
8 |     "### Train / Test Split\n",
9 |     "\n",
10 |     "This notebook splits the data into train data and test data.\n",
11 |     "\n",
12 |     "It selects the latest season and uses this as the test data."
13 |    ]
14 |   },
15 |   {
16 |    "cell_type": "code",
17 |    "execution_count": 1,
18 |    "id": "f270d24d-731b-4a33-a737-599871c1b024",
19 |    "metadata": {},
20 |    "outputs": [],
21 |    "source": [
22 |     "import pandas as pd\n",
23 |     "import numpy as np\n",
24 |     "\n",
25 |     "from pathlib import Path #for Windows/Linux compatibility\n",
26 |     "DATAPATH = Path(r'data')\n"
27 |    ]
28 |   },
29 |   {
30 |    "cell_type": "code",
31 |    "execution_count": 2,
32 |    "id": "45005c7b-e29c-48cc-adf2-47b4cf6a7863",
33 |    "metadata": {},
34 |    "outputs": [],
35 |    "source": [
36 |     "df = pd.read_csv(DATAPATH / \"transformed.csv\")"
37 |    ]
38 |   },
39 |   {
40 |    "cell_type": "markdown",
41 |    "id": "2ddfe2f2-e69e-4f41-a047-87e2caee29f7",
42 |    "metadata": {},
43 |    "source": [
44 |     "**Split Data to Train and Test**"
45 |    ]
46 |   },
47 |   {
48 |    "cell_type": "code",
49 |    "execution_count": 3,
50 |    "id": "2ec58305-22d8-4e01-ad13-a6052967fb43",
51 |    "metadata": {},
52 |    "outputs": [],
53 |    "source": [
54 |     "latest_season = df['SEASON'].unique().max()\n",
55 |     "\n",
56 |     "train = df[df['SEASON'] < (latest_season)]\n",
57 |     "test = df[df['SEASON'] == latest_season]\n",
58 |     "\n",
59 |     "train.to_csv(DATAPATH / \"train.csv\",index=False)\n",
60 |     "test.to_csv(DATAPATH / \"test.csv\",index=False)\n"
61 |    ]
62 |   }
63 |  ],
64 |  "metadata": {
65 |   "kernelspec": {
66 |    "display_name": "nba3",
67 |    "language": "python",
68 |    "name": "python3"
69 |   },
70 |   "language_info": {
71 |    "codemirror_mode": {
72 |     "name": "ipython",
73 |     "version": 3
74 |    },
75 |    "file_extension": ".py",
76 |    "mimetype": "text/x-python",
77 |    "name": "python",
78 |    "nbconvert_exporter": "python",
79 |    "pygments_lexer": "ipython3",
80 |    "version": "3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]"
81 |   },
82 |   "vscode": {
83 |    "interpreter": {
84 |     "hash": "4655998f62ad965cbd25df51edb717f2326f5df53d53899f0ae604225aa5ae06"
85 |    }
86 |   }
87 |  },
88 |  "nbformat": 4,
89 |  "nbformat_minor": 5
90 | }
91 | 
--------------------------------------------------------------------------------
/notebooks/06_feature_selection.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "id": "ea2619eb-c30b-457d-abfa-a7b4b4dd584c",
6 |    "metadata": {},
7 |    "source": [
8 |     "## Feature Selection"
9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "code",
13 |    "execution_count": 1,
14 |    "id": "f71632a1-ef6a-4404-b662-e8227de30fa1",
15 |    "metadata": {},
16 |    "outputs": [],
17 |    "source": [
18 |     "import pandas as pd\n",
19 |     "import numpy as np\n",
20 |     "\n",
21 |     "pd.set_option('display.max_columns', 500)\n",
22 |     "\n",
23 |     "from sklearn.feature_selection import SelectKBest, chi2\n",
24 |     "\n",
25 |     "\n",
26 |     "import matplotlib.pyplot as plt\n",
27 |     "import seaborn as sns\n",
28 |     "\n",
29 |     "from pathlib import Path #for Windows/Linux compatibility\n",
30 |     "DATAPATH = Path(r'data')\n"
31 |    ]
32 |   },
33 |   {
34 |    "cell_type": "code",
35 |    "execution_count": 2,
36 |    "id": "d4aabdeb-0167-490e-bc37-65a62f7c0821",
37 |    "metadata": {},
38 | 
"outputs": [], 39 | "source": [ 40 | "train = pd.read_csv(DATAPATH / \"train_features.csv\")\n", 41 | "test = pd.read_csv(DATAPATH / \"test_features.csv\")\n" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 4, 47 | "id": "2d3c9ab6-9a07-46c5-8d57-d75df3171036", 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "train.to_csv(DATAPATH / \"train_selected.csv\",index=False)\n", 52 | "test.to_csv(DATAPATH / \"test_selected.csv\",index=False)" 53 | ] 54 | } 55 | ], 56 | "metadata": { 57 | "kernelspec": { 58 | "display_name": "Python 3 (ipykernel)", 59 | "language": "python", 60 | "name": "python3" 61 | }, 62 | "language_info": { 63 | "codemirror_mode": { 64 | "name": "ipython", 65 | "version": 3 66 | }, 67 | "file_extension": ".py", 68 | "mimetype": "text/x-python", 69 | "name": "python", 70 | "nbconvert_exporter": "python", 71 | "pygments_lexer": "ipython3", 72 | "version": "3.9.13" 73 | } 74 | }, 75 | "nbformat": 4, 76 | "nbformat_minor": 5 77 | } 78 | -------------------------------------------------------------------------------- /notebooks/feature_names.json: -------------------------------------------------------------------------------- 1 | {"game_date_est": "GAME_DATE_EST", "game_id": "GAME_ID", "home_team_id": "HOME_TEAM_ID", "visitor_team_id": "VISITOR_TEAM_ID", "season": "SEASON", "pts_home": "PTS_home", "fg_pct_home": "FG_PCT_home", "ft_pct_home": "FT_PCT_home", "fg3_pct_home": "FG3_PCT_home", "ast_home": "AST_home", "reb_home": "REB_home", "pts_away": "PTS_away", "fg_pct_away": "FG_PCT_away", "ft_pct_away": "FT_PCT_away", "fg3_pct_away": "FG3_PCT_away", "ast_away": "AST_away", "reb_away": "REB_away", "home_team_wins": "HOME_TEAM_WINS", "target": "TARGET", "month": "MONTH", "home_team_win_streak": "HOME_TEAM_WIN_STREAK", "home_team_wins_avg_last_3_home": "HOME_TEAM_WINS_AVG_LAST_3_HOME", "home_team_wins_avg_last_7_home": "HOME_TEAM_WINS_AVG_LAST_7_HOME", "home_team_wins_avg_last_10_home": "HOME_TEAM_WINS_AVG_LAST_10_HOME", "home_pts_home_avg_last_3_home": "HOME_PTS_home_AVG_LAST_3_HOME", "home_pts_home_avg_last_7_home": "HOME_PTS_home_AVG_LAST_7_HOME", "home_pts_home_avg_last_10_home": "HOME_PTS_home_AVG_LAST_10_HOME", "home_fg_pct_home_avg_last_3_home": "HOME_FG_PCT_home_AVG_LAST_3_HOME", "home_fg_pct_home_avg_last_7_home": "HOME_FG_PCT_home_AVG_LAST_7_HOME", "home_fg_pct_home_avg_last_10_home": "HOME_FG_PCT_home_AVG_LAST_10_HOME", "home_ft_pct_home_avg_last_3_home": "HOME_FT_PCT_home_AVG_LAST_3_HOME", "home_ft_pct_home_avg_last_7_home": "HOME_FT_PCT_home_AVG_LAST_7_HOME", "home_ft_pct_home_avg_last_10_home": "HOME_FT_PCT_home_AVG_LAST_10_HOME", "home_fg3_pct_home_avg_last_3_home": "HOME_FG3_PCT_home_AVG_LAST_3_HOME", "home_fg3_pct_home_avg_last_7_home": "HOME_FG3_PCT_home_AVG_LAST_7_HOME", "home_fg3_pct_home_avg_last_10_home": "HOME_FG3_PCT_home_AVG_LAST_10_HOME", "home_ast_home_avg_last_3_home": "HOME_AST_home_AVG_LAST_3_HOME", "home_ast_home_avg_last_7_home": "HOME_AST_home_AVG_LAST_7_HOME", "home_ast_home_avg_last_10_home": "HOME_AST_home_AVG_LAST_10_HOME", "home_reb_home_avg_last_3_home": "HOME_REB_home_AVG_LAST_3_HOME", "home_reb_home_avg_last_7_home": "HOME_REB_home_AVG_LAST_7_HOME", "home_reb_home_avg_last_10_home": "HOME_REB_home_AVG_LAST_10_HOME", "home_pts_home_avg_last_3_home_minus_league_avg": "HOME_PTS_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_pts_home_avg_last_7_home_minus_league_avg": "HOME_PTS_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_pts_home_avg_last_10_home_minus_league_avg": 
"HOME_PTS_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_fg_pct_home_avg_last_3_home_minus_league_avg": "HOME_FG_PCT_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_fg_pct_home_avg_last_7_home_minus_league_avg": "HOME_FG_PCT_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_fg_pct_home_avg_last_10_home_minus_league_avg": "HOME_FG_PCT_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_ft_pct_home_avg_last_3_home_minus_league_avg": "HOME_FT_PCT_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_ft_pct_home_avg_last_7_home_minus_league_avg": "HOME_FT_PCT_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_ft_pct_home_avg_last_10_home_minus_league_avg": "HOME_FT_PCT_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_fg3_pct_home_avg_last_3_home_minus_league_avg": "HOME_FG3_PCT_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_fg3_pct_home_avg_last_7_home_minus_league_avg": "HOME_FG3_PCT_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_fg3_pct_home_avg_last_10_home_minus_league_avg": "HOME_FG3_PCT_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_ast_home_avg_last_3_home_minus_league_avg": "HOME_AST_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_ast_home_avg_last_7_home_minus_league_avg": "HOME_AST_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_ast_home_avg_last_10_home_minus_league_avg": "HOME_AST_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "home_reb_home_avg_last_3_home_minus_league_avg": "HOME_REB_home_AVG_LAST_3_HOME_MINUS_LEAGUE_AVG", "home_reb_home_avg_last_7_home_minus_league_avg": "HOME_REB_home_AVG_LAST_7_HOME_MINUS_LEAGUE_AVG", "home_reb_home_avg_last_10_home_minus_league_avg": "HOME_REB_home_AVG_LAST_10_HOME_MINUS_LEAGUE_AVG", "visitor_team_win_streak": "VISITOR_TEAM_WIN_STREAK", "visitor_team_wins_avg_last_3_visitor": "VISITOR_TEAM_WINS_AVG_LAST_3_VISITOR", "visitor_team_wins_avg_last_7_visitor": "VISITOR_TEAM_WINS_AVG_LAST_7_VISITOR", "visitor_team_wins_avg_last_10_visitor": "VISITOR_TEAM_WINS_AVG_LAST_10_VISITOR", "visitor_pts_away_avg_last_3_visitor": "VISITOR_PTS_away_AVG_LAST_3_VISITOR", "visitor_pts_away_avg_last_7_visitor": "VISITOR_PTS_away_AVG_LAST_7_VISITOR", "visitor_pts_away_avg_last_10_visitor": "VISITOR_PTS_away_AVG_LAST_10_VISITOR", "visitor_fg_pct_away_avg_last_3_visitor": "VISITOR_FG_PCT_away_AVG_LAST_3_VISITOR", "visitor_fg_pct_away_avg_last_7_visitor": "VISITOR_FG_PCT_away_AVG_LAST_7_VISITOR", "visitor_fg_pct_away_avg_last_10_visitor": "VISITOR_FG_PCT_away_AVG_LAST_10_VISITOR", "visitor_ft_pct_away_avg_last_3_visitor": "VISITOR_FT_PCT_away_AVG_LAST_3_VISITOR", "visitor_ft_pct_away_avg_last_7_visitor": "VISITOR_FT_PCT_away_AVG_LAST_7_VISITOR", "visitor_ft_pct_away_avg_last_10_visitor": "VISITOR_FT_PCT_away_AVG_LAST_10_VISITOR", "visitor_fg3_pct_away_avg_last_3_visitor": "VISITOR_FG3_PCT_away_AVG_LAST_3_VISITOR", "visitor_fg3_pct_away_avg_last_7_visitor": "VISITOR_FG3_PCT_away_AVG_LAST_7_VISITOR", "visitor_fg3_pct_away_avg_last_10_visitor": "VISITOR_FG3_PCT_away_AVG_LAST_10_VISITOR", "visitor_ast_away_avg_last_3_visitor": "VISITOR_AST_away_AVG_LAST_3_VISITOR", "visitor_ast_away_avg_last_7_visitor": "VISITOR_AST_away_AVG_LAST_7_VISITOR", "visitor_ast_away_avg_last_10_visitor": "VISITOR_AST_away_AVG_LAST_10_VISITOR", "visitor_reb_away_avg_last_3_visitor": "VISITOR_REB_away_AVG_LAST_3_VISITOR", "visitor_reb_away_avg_last_7_visitor": "VISITOR_REB_away_AVG_LAST_7_VISITOR", "visitor_reb_away_avg_last_10_visitor": "VISITOR_REB_away_AVG_LAST_10_VISITOR", "visitor_team_wins_avg_last_3_visitor_minus_league_avg": "VISITOR_TEAM_WINS_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", 
"visitor_team_wins_avg_last_7_visitor_minus_league_avg": "VISITOR_TEAM_WINS_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_team_wins_avg_last_10_visitor_minus_league_avg": "VISITOR_TEAM_WINS_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_pts_away_avg_last_3_visitor_minus_league_avg": "VISITOR_PTS_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_pts_away_avg_last_7_visitor_minus_league_avg": "VISITOR_PTS_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_pts_away_avg_last_10_visitor_minus_league_avg": "VISITOR_PTS_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg_pct_away_avg_last_3_visitor_minus_league_avg": "VISITOR_FG_PCT_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg_pct_away_avg_last_7_visitor_minus_league_avg": "VISITOR_FG_PCT_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg_pct_away_avg_last_10_visitor_minus_league_avg": "VISITOR_FG_PCT_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_ft_pct_away_avg_last_3_visitor_minus_league_avg": "VISITOR_FT_PCT_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_ft_pct_away_avg_last_7_visitor_minus_league_avg": "VISITOR_FT_PCT_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_ft_pct_away_avg_last_10_visitor_minus_league_avg": "VISITOR_FT_PCT_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg3_pct_away_avg_last_3_visitor_minus_league_avg": "VISITOR_FG3_PCT_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg3_pct_away_avg_last_7_visitor_minus_league_avg": "VISITOR_FG3_PCT_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_fg3_pct_away_avg_last_10_visitor_minus_league_avg": "VISITOR_FG3_PCT_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_ast_away_avg_last_3_visitor_minus_league_avg": "VISITOR_AST_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_ast_away_avg_last_7_visitor_minus_league_avg": "VISITOR_AST_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_ast_away_avg_last_10_visitor_minus_league_avg": "VISITOR_AST_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "visitor_reb_away_avg_last_3_visitor_minus_league_avg": "VISITOR_REB_away_AVG_LAST_3_VISITOR_MINUS_LEAGUE_AVG", "visitor_reb_away_avg_last_7_visitor_minus_league_avg": "VISITOR_REB_away_AVG_LAST_7_VISITOR_MINUS_LEAGUE_AVG", "visitor_reb_away_avg_last_10_visitor_minus_league_avg": "VISITOR_REB_away_AVG_LAST_10_VISITOR_MINUS_LEAGUE_AVG", "matchup_winpct_3_x": "MATCHUP_WINPCT_3_x", "matchup_winpct_7_x": "MATCHUP_WINPCT_7_x", "matchup_winpct_10_x": "MATCHUP_WINPCT_10_x", "matchup_win_streak_x": "MATCHUP_WIN_STREAK_x", "win_streak_x": "WIN_STREAK_x", "home_away_streak_x": "HOME_AWAY_STREAK_x", "team1_win_avg_last_3_all_x": "TEAM1_win_AVG_LAST_3_ALL_x", "team1_win_avg_last_7_all_x": "TEAM1_win_AVG_LAST_7_ALL_x", "team1_win_avg_last_10_all_x": "TEAM1_win_AVG_LAST_10_ALL_x", "team1_win_avg_last_15_all_x": "TEAM1_win_AVG_LAST_15_ALL_x", "pts_avg_last_3_all_x": "PTS_AVG_LAST_3_ALL_x", "pts_avg_last_7_all_x": "PTS_AVG_LAST_7_ALL_x", "pts_avg_last_10_all_x": "PTS_AVG_LAST_10_ALL_x", "pts_avg_last_15_all_x": "PTS_AVG_LAST_15_ALL_x", "fg_pct_avg_last_3_all_x": "FG_PCT_AVG_LAST_3_ALL_x", "fg_pct_avg_last_7_all_x": "FG_PCT_AVG_LAST_7_ALL_x", "fg_pct_avg_last_10_all_x": "FG_PCT_AVG_LAST_10_ALL_x", "fg_pct_avg_last_15_all_x": "FG_PCT_AVG_LAST_15_ALL_x", "ft_pct_avg_last_3_all_x": "FT_PCT_AVG_LAST_3_ALL_x", "ft_pct_avg_last_7_all_x": "FT_PCT_AVG_LAST_7_ALL_x", "ft_pct_avg_last_10_all_x": "FT_PCT_AVG_LAST_10_ALL_x", "ft_pct_avg_last_15_all_x": "FT_PCT_AVG_LAST_15_ALL_x", "fg3_pct_avg_last_3_all_x": "FG3_PCT_AVG_LAST_3_ALL_x", "fg3_pct_avg_last_7_all_x": 
"FG3_PCT_AVG_LAST_7_ALL_x", "fg3_pct_avg_last_10_all_x": "FG3_PCT_AVG_LAST_10_ALL_x", "fg3_pct_avg_last_15_all_x": "FG3_PCT_AVG_LAST_15_ALL_x", "ast_avg_last_3_all_x": "AST_AVG_LAST_3_ALL_x", "ast_avg_last_7_all_x": "AST_AVG_LAST_7_ALL_x", "ast_avg_last_10_all_x": "AST_AVG_LAST_10_ALL_x", "ast_avg_last_15_all_x": "AST_AVG_LAST_15_ALL_x", "reb_avg_last_3_all_x": "REB_AVG_LAST_3_ALL_x", "reb_avg_last_7_all_x": "REB_AVG_LAST_7_ALL_x", "reb_avg_last_10_all_x": "REB_AVG_LAST_10_ALL_x", "reb_avg_last_15_all_x": "REB_AVG_LAST_15_ALL_x", "pts_avg_last_3_all_minus_league_avg_x": "PTS_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "pts_avg_last_7_all_minus_league_avg_x": "PTS_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "pts_avg_last_10_all_minus_league_avg_x": "PTS_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "pts_avg_last_15_all_minus_league_avg_x": "PTS_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_3_all_minus_league_avg_x": "FG_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_7_all_minus_league_avg_x": "FG_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_10_all_minus_league_avg_x": "FG_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "fg_pct_avg_last_15_all_minus_league_avg_x": "FG_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "ft_pct_avg_last_3_all_minus_league_avg_x": "FT_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "ft_pct_avg_last_7_all_minus_league_avg_x": "FT_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "ft_pct_avg_last_10_all_minus_league_avg_x": "FT_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "ft_pct_avg_last_15_all_minus_league_avg_x": "FT_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_3_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_7_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_10_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "fg3_pct_avg_last_15_all_minus_league_avg_x": "FG3_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_3_all_minus_league_avg_x": "AST_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_7_all_minus_league_avg_x": "AST_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_10_all_minus_league_avg_x": "AST_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "ast_avg_last_15_all_minus_league_avg_x": "AST_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_3_all_minus_league_avg_x": "REB_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_7_all_minus_league_avg_x": "REB_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_10_all_minus_league_avg_x": "REB_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_x", "reb_avg_last_15_all_minus_league_avg_x": "REB_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_x", "win_streak_y": "WIN_STREAK_y", "home_away_streak_y": "HOME_AWAY_STREAK_y", "team1_win_avg_last_3_all_y": "TEAM1_win_AVG_LAST_3_ALL_y", "team1_win_avg_last_7_all_y": "TEAM1_win_AVG_LAST_7_ALL_y", "team1_win_avg_last_10_all_y": "TEAM1_win_AVG_LAST_10_ALL_y", "team1_win_avg_last_15_all_y": "TEAM1_win_AVG_LAST_15_ALL_y", "pts_avg_last_3_all_y": "PTS_AVG_LAST_3_ALL_y", "pts_avg_last_7_all_y": "PTS_AVG_LAST_7_ALL_y", "pts_avg_last_10_all_y": "PTS_AVG_LAST_10_ALL_y", "pts_avg_last_15_all_y": "PTS_AVG_LAST_15_ALL_y", "fg_pct_avg_last_3_all_y": "FG_PCT_AVG_LAST_3_ALL_y", "fg_pct_avg_last_7_all_y": "FG_PCT_AVG_LAST_7_ALL_y", "fg_pct_avg_last_10_all_y": "FG_PCT_AVG_LAST_10_ALL_y", "fg_pct_avg_last_15_all_y": "FG_PCT_AVG_LAST_15_ALL_y", "ft_pct_avg_last_3_all_y": "FT_PCT_AVG_LAST_3_ALL_y", "ft_pct_avg_last_7_all_y": "FT_PCT_AVG_LAST_7_ALL_y", "ft_pct_avg_last_10_all_y": "FT_PCT_AVG_LAST_10_ALL_y", "ft_pct_avg_last_15_all_y": 
"FT_PCT_AVG_LAST_15_ALL_y", "fg3_pct_avg_last_3_all_y": "FG3_PCT_AVG_LAST_3_ALL_y", "fg3_pct_avg_last_7_all_y": "FG3_PCT_AVG_LAST_7_ALL_y", "fg3_pct_avg_last_10_all_y": "FG3_PCT_AVG_LAST_10_ALL_y", "fg3_pct_avg_last_15_all_y": "FG3_PCT_AVG_LAST_15_ALL_y", "ast_avg_last_3_all_y": "AST_AVG_LAST_3_ALL_y", "ast_avg_last_7_all_y": "AST_AVG_LAST_7_ALL_y", "ast_avg_last_10_all_y": "AST_AVG_LAST_10_ALL_y", "ast_avg_last_15_all_y": "AST_AVG_LAST_15_ALL_y", "reb_avg_last_3_all_y": "REB_AVG_LAST_3_ALL_y", "reb_avg_last_7_all_y": "REB_AVG_LAST_7_ALL_y", "reb_avg_last_10_all_y": "REB_AVG_LAST_10_ALL_y", "reb_avg_last_15_all_y": "REB_AVG_LAST_15_ALL_y", "pts_avg_last_3_all_minus_league_avg_y": "PTS_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "pts_avg_last_7_all_minus_league_avg_y": "PTS_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "pts_avg_last_10_all_minus_league_avg_y": "PTS_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "pts_avg_last_15_all_minus_league_avg_y": "PTS_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_3_all_minus_league_avg_y": "FG_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_7_all_minus_league_avg_y": "FG_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_10_all_minus_league_avg_y": "FG_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "fg_pct_avg_last_15_all_minus_league_avg_y": "FG_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_3_all_minus_league_avg_y": "FT_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_7_all_minus_league_avg_y": "FT_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_10_all_minus_league_avg_y": "FT_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "ft_pct_avg_last_15_all_minus_league_avg_y": "FT_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_3_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_7_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_10_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "fg3_pct_avg_last_15_all_minus_league_avg_y": "FG3_PCT_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_3_all_minus_league_avg_y": "AST_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_7_all_minus_league_avg_y": "AST_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_10_all_minus_league_avg_y": "AST_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "ast_avg_last_15_all_minus_league_avg_y": "AST_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_3_all_minus_league_avg_y": "REB_AVG_LAST_3_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_7_all_minus_league_avg_y": "REB_AVG_LAST_7_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_10_all_minus_league_avg_y": "REB_AVG_LAST_10_ALL_MINUS_LEAGUE_AVG_y", "reb_avg_last_15_all_minus_league_avg_y": "REB_AVG_LAST_15_ALL_MINUS_LEAGUE_AVG_y", "win_streak_x_minus_y": "WIN_STREAK_x_minus_y", "home_away_streak_x_minus_y": "HOME_AWAY_STREAK_x_minus_y", "team1_win_avg_last_3_all_x_minus_y": "TEAM1_win_AVG_LAST_3_ALL_x_minus_y", "team1_win_avg_last_7_all_x_minus_y": "TEAM1_win_AVG_LAST_7_ALL_x_minus_y", "team1_win_avg_last_10_all_x_minus_y": "TEAM1_win_AVG_LAST_10_ALL_x_minus_y", "team1_win_avg_last_15_all_x_minus_y": "TEAM1_win_AVG_LAST_15_ALL_x_minus_y", "pts_avg_last_3_all_x_minus_y": "PTS_AVG_LAST_3_ALL_x_minus_y", "pts_avg_last_7_all_x_minus_y": "PTS_AVG_LAST_7_ALL_x_minus_y", "pts_avg_last_10_all_x_minus_y": "PTS_AVG_LAST_10_ALL_x_minus_y", "pts_avg_last_15_all_x_minus_y": "PTS_AVG_LAST_15_ALL_x_minus_y", "fg_pct_avg_last_3_all_x_minus_y": "FG_PCT_AVG_LAST_3_ALL_x_minus_y", "fg_pct_avg_last_7_all_x_minus_y": "FG_PCT_AVG_LAST_7_ALL_x_minus_y", 
"fg_pct_avg_last_10_all_x_minus_y": "FG_PCT_AVG_LAST_10_ALL_x_minus_y", "fg_pct_avg_last_15_all_x_minus_y": "FG_PCT_AVG_LAST_15_ALL_x_minus_y", "ft_pct_avg_last_3_all_x_minus_y": "FT_PCT_AVG_LAST_3_ALL_x_minus_y", "ft_pct_avg_last_7_all_x_minus_y": "FT_PCT_AVG_LAST_7_ALL_x_minus_y", "ft_pct_avg_last_10_all_x_minus_y": "FT_PCT_AVG_LAST_10_ALL_x_minus_y", "ft_pct_avg_last_15_all_x_minus_y": "FT_PCT_AVG_LAST_15_ALL_x_minus_y", "fg3_pct_avg_last_3_all_x_minus_y": "FG3_PCT_AVG_LAST_3_ALL_x_minus_y", "fg3_pct_avg_last_7_all_x_minus_y": "FG3_PCT_AVG_LAST_7_ALL_x_minus_y", "fg3_pct_avg_last_10_all_x_minus_y": "FG3_PCT_AVG_LAST_10_ALL_x_minus_y", "fg3_pct_avg_last_15_all_x_minus_y": "FG3_PCT_AVG_LAST_15_ALL_x_minus_y", "ast_avg_last_3_all_x_minus_y": "AST_AVG_LAST_3_ALL_x_minus_y", "ast_avg_last_7_all_x_minus_y": "AST_AVG_LAST_7_ALL_x_minus_y", "ast_avg_last_10_all_x_minus_y": "AST_AVG_LAST_10_ALL_x_minus_y", "ast_avg_last_15_all_x_minus_y": "AST_AVG_LAST_15_ALL_x_minus_y", "reb_avg_last_3_all_x_minus_y": "REB_AVG_LAST_3_ALL_x_minus_y", "reb_avg_last_7_all_x_minus_y": "REB_AVG_LAST_7_ALL_x_minus_y", "reb_avg_last_10_all_x_minus_y": "REB_AVG_LAST_10_ALL_x_minus_y", "reb_avg_last_15_all_x_minus_y": "REB_AVG_LAST_15_ALL_x_minus_y"} -------------------------------------------------------------------------------- /notebooks/hopsworks.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "2f0c58a3-8738-42ad-be18-8e13c171108f", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import os\n", 11 | "\n", 12 | "import pandas as pd\n", 13 | "import numpy as np\n", 14 | "\n", 15 | "import hopsworks\n", 16 | "\n", 17 | "from datetime import datetime, timedelta\n", 18 | "from pytz import timezone\n", 19 | "\n", 20 | "from src.webscraping_a import (\n", 21 | " scrape_to_dataframe,\n", 22 | " convert_columns,\n", 23 | " combine_home_visitor, \n", 24 | " get_todays_matchups,\n", 25 | ")\n", 26 | "\n", 27 | "from src.data_processing import (\n", 28 | " process_games,\n", 29 | " add_TARGET,\n", 30 | ")\n", 31 | "\n", 32 | "from src.feature_engineering import (\n", 33 | " process_features,\n", 34 | ")\n", 35 | "\n", 36 | "from src.hopsworks_utils import (\n", 37 | " save_feature_names,\n", 38 | " convert_feature_names,\n", 39 | ")\n", 40 | "\n", 41 | "import json\n", 42 | "\n", 43 | "from pathlib import Path #for Windows/Linux compatibility\n", 44 | "DATAPATH = Path(r'data')" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "from dotenv import load_dotenv\n", 54 | "\n", 55 | "load_dotenv()\n", 56 | "\n", 57 | "try:\n", 58 | " HOPSWORKS_API_KEY = os.environ['HOPSWORKS_API_KEY']\n", 59 | "except:\n", 60 | " raise Exception('Set environment variable HOPSWORKS_API_KEY')" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "id": "ba69a88c-691e-4f8d-b68e-e79f5582a939", 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stdout", 71 | "output_type": "stream", 72 | "text": [ 73 | "Connected. Call `.close()` to terminate connection gracefully.\n", 74 | "\n", 75 | "Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/3350\n", 76 | "Connected. 
Call `.close()` to terminate connection gracefully.\n" 77 | ] 78 | }, 79 | { 80 | "name": "stderr", 81 | "output_type": "stream", 82 | "text": [ 83 | "DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)\n", 89 | "fs = project.get_feature_store()" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 4, 95 | "id": "641d97e5-0888-42c3-a9b1-97c9ef2b7082", 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "rolling_stats_fg = fs.get_feature_group(\n", 100 | " name=\"rolling_stats\",\n", 101 | " version=1,\n", 102 | ")" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 5, 108 | "id": "69a2b254", 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "2023-01-31 08:30:10,072 INFO: USE `nba_predictor_featurestore`\n", 116 | "2023-01-31 08:30:10,441 INFO: SELECT `fg0`.`game_date_est` `game_date_est`, `fg0`.`game_id` `game_id`, `fg0`.`home_team_id` `home_team_id`, `fg0`.`visitor_team_id` `visitor_team_id`, `fg0`.`season` `season`, `fg0`.`pts_home` `pts_home`, `fg0`.`fg_pct_home` `fg_pct_home`, `fg0`.`ft_pct_home` `ft_pct_home`, `fg0`.`fg3_pct_home` `fg3_pct_home`, `fg0`.`ast_home` `ast_home`, `fg0`.`reb_home` `reb_home`, `fg0`.`pts_away` `pts_away`, `fg0`.`fg_pct_away` `fg_pct_away`, `fg0`.`ft_pct_away` `ft_pct_away`, `fg0`.`fg3_pct_away` `fg3_pct_away`, `fg0`.`ast_away` `ast_away`, `fg0`.`reb_away` `reb_away`, `fg0`.`home_team_wins` `home_team_wins`\n", 117 | "FROM `nba_predictor_featurestore`.`rolling_stats_1` `fg0`\n" 118 | ] 119 | }, 120 | { 121 | "name": "stderr", 122 | "output_type": "stream", 123 | "text": [ 124 | "UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy\n" 125 | ] 126 | }, 127 | { 128 | "data": { 129 | "text/html": [ 130 | "
[HTML dataframe rendering garbled in extraction and omitted; the same rows appear in the text/plain output below]
" 277 | ], 278 | "text/plain": [ 279 | " GAME_DATE_EST GAME_ID HOME_TEAM_ID VISITOR_TEAM_ID SEASON \\\n", 280 | "1384 2023-01-31 22200765 1610612752 1610612747 2022 \n", 281 | "3042 2023-01-31 22200767 1610612749 1610612766 2022 \n", 282 | "8239 2023-01-31 22200764 1610612739 1610612748 2022 \n", 283 | "8907 2023-01-31 22200766 1610612741 1610612746 2022 \n", 284 | "21628 2023-01-31 22200768 1610612743 1610612740 2022 \n", 285 | "\n", 286 | " PTS_home FG_PCT_home FT_PCT_home FG3_PCT_home AST_home REB_home \\\n", 287 | "1384 0 0.0 0.0 0.0 0 0 \n", 288 | "3042 0 0.0 0.0 0.0 0 0 \n", 289 | "8239 0 0.0 0.0 0.0 0 0 \n", 290 | "8907 0 0.0 0.0 0.0 0 0 \n", 291 | "21628 0 0.0 0.0 0.0 0 0 \n", 292 | "\n", 293 | " PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away \\\n", 294 | "1384 0 0.0 0.0 0.0 0 0 \n", 295 | "3042 0 0.0 0.0 0.0 0 0 \n", 296 | "8239 0 0.0 0.0 0.0 0 0 \n", 297 | "8907 0 0.0 0.0 0.0 0 0 \n", 298 | "21628 0 0.0 0.0 0.0 0 0 \n", 299 | "\n", 300 | " HOME_TEAM_WINS \n", 301 | "1384 0 \n", 302 | "3042 0 \n", 303 | "8239 0 \n", 304 | "8907 0 \n", 305 | "21628 0 " 306 | ] 307 | }, 308 | "execution_count": 5, 309 | "metadata": {}, 310 | "output_type": "execute_result" 311 | } 312 | ], 313 | "source": [ 314 | "BASE_FEATURES = ['game_date_est',\n", 315 | " 'game_id',\n", 316 | " 'home_team_id',\n", 317 | " 'visitor_team_id',\n", 318 | " 'season',\n", 319 | " 'pts_home',\n", 320 | " 'fg_pct_home',\n", 321 | " 'ft_pct_home',\n", 322 | " 'fg3_pct_home',\n", 323 | " 'ast_home',\n", 324 | " 'reb_home',\n", 325 | " 'pts_away',\n", 326 | " 'fg_pct_away',\n", 327 | " 'ft_pct_away',\n", 328 | " 'fg3_pct_away',\n", 329 | " 'ast_away',\n", 330 | " 'reb_away',\n", 331 | " 'home_team_wins',\n", 332 | "]\n", 333 | "\n", 334 | "ds_query = rolling_stats_fg.select(BASE_FEATURES)\n", 335 | "df_old = ds_query.read()\n", 336 | "df_old = convert_feature_names(df_old)\n", 337 | "df_old[df_old['PTS_home'] == 0]\n" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 6, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "data": { 347 | "text/html": [ 348 | "
[HTML dataframe rendering garbled in extraction and omitted; the same rows appear in the text/plain output below]
" 747 | ], 748 | "text/plain": [ 749 | " GAME_DATE_EST GAME_ID HOME_TEAM_ID VISITOR_TEAM_ID SEASON \\\n", 750 | "1384 2023-01-31 22200765 1610612752 1610612747 2022 \n", 751 | "1705 2023-01-30 22200759 1610612744 1610612760 2022 \n", 752 | "1755 2023-01-29 22200754 1610612746 1610612739 2022 \n", 753 | "3042 2023-01-31 22200767 1610612749 1610612766 2022 \n", 754 | "3825 2023-01-30 22200756 1610612753 1610612755 2022 \n", 755 | "5048 2023-01-30 22200763 1610612737 1610612757 2022 \n", 756 | "5716 2023-01-29 22200753 1610612754 1610612763 2022 \n", 757 | "8239 2023-01-31 22200764 1610612739 1610612748 2022 \n", 758 | "8381 2023-01-29 22200752 1610612748 1610612766 2022 \n", 759 | "8727 2023-01-30 22200762 1610612761 1610612756 2022 \n", 760 | "8907 2023-01-31 22200766 1610612741 1610612746 2022 \n", 761 | "9991 2023-01-30 22200758 1610612758 1610612750 2022 \n", 762 | "10092 2023-01-30 22200760 1610612764 1610612759 2022 \n", 763 | "15857 2023-01-30 22200757 1610612747 1610612751 2022 \n", 764 | "15938 2023-01-30 22200761 1610612765 1610612742 2022 \n", 765 | "19613 2023-01-29 22200755 1610612740 1610612749 2022 \n", 766 | "21628 2023-01-31 22200768 1610612743 1610612740 2022 \n", 767 | "\n", 768 | " PTS_home FG_PCT_home FT_PCT_home FG3_PCT_home AST_home REB_home \\\n", 769 | "1384 0 0.00000 0.0000 0.000000 0 0 \n", 770 | "1705 128 51.09375 80.0000 42.593750 37 44 \n", 771 | "1755 99 41.59375 85.1875 10.500000 17 45 \n", 772 | "3042 0 0.00000 0.0000 0.000000 0 0 \n", 773 | "3825 119 42.40625 80.0000 37.906250 23 50 \n", 774 | "5048 125 46.81250 80.0000 43.312500 26 45 \n", 775 | "5716 100 45.59375 90.5000 28.093750 21 38 \n", 776 | "8239 0 0.00000 0.0000 0.000000 0 0 \n", 777 | "8381 117 48.40625 73.8750 32.312500 26 36 \n", 778 | "8727 106 44.90625 89.5000 27.296875 19 44 \n", 779 | "8907 0 0.00000 0.0000 0.000000 0 0 \n", 780 | "9991 118 47.50000 68.1875 30.000000 20 50 \n", 781 | "10092 127 55.81250 88.1875 53.312500 32 45 \n", 782 | "15857 104 39.31250 62.1875 37.906250 24 53 \n", 783 | "15938 105 45.59375 70.3750 35.906250 18 40 \n", 784 | "19613 110 44.09375 71.3750 38.187500 25 38 \n", 785 | "21628 0 0.00000 0.0000 0.000000 0 0 \n", 786 | "\n", 787 | " PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away \\\n", 788 | "1384 0 0.00000 0.0000 0.000000 0 0 \n", 789 | "1705 120 49.50000 85.0000 45.812500 23 43 \n", 790 | "1755 122 56.40625 73.6875 60.593750 35 35 \n", 791 | "3042 0 0.00000 0.0000 0.000000 0 0 \n", 792 | "3825 109 47.09375 78.3125 36.687500 28 45 \n", 793 | "5048 129 54.40625 88.8750 47.500000 24 32 \n", 794 | "5716 112 48.31250 77.3125 24.296875 29 41 \n", 795 | "8239 0 0.00000 0.0000 0.000000 0 0 \n", 796 | "8381 122 54.18750 77.3125 37.500000 25 47 \n", 797 | "8727 114 49.40625 81.0000 39.312500 28 42 \n", 798 | "8907 0 0.00000 0.0000 0.000000 0 0 \n", 799 | "9991 111 46.68750 52.0000 30.796875 26 49 \n", 800 | "10092 106 43.31250 65.1875 24.093750 26 43 \n", 801 | "15857 121 47.40625 78.8750 39.000000 24 50 \n", 802 | "15938 111 49.40625 69.6875 29.406250 17 40 \n", 803 | "19613 135 55.18750 60.0000 39.500000 27 57 \n", 804 | "21628 0 0.00000 0.0000 0.000000 0 0 \n", 805 | "\n", 806 | " HOME_TEAM_WINS \n", 807 | "1384 0 \n", 808 | "1705 1 \n", 809 | "1755 0 \n", 810 | "3042 0 \n", 811 | "3825 1 \n", 812 | "5048 0 \n", 813 | "5716 0 \n", 814 | "8239 0 \n", 815 | "8381 0 \n", 816 | "8727 0 \n", 817 | "8907 0 \n", 818 | "9991 1 \n", 819 | "10092 1 \n", 820 | "15857 0 \n", 821 | "15938 0 \n", 822 | "19613 0 \n", 823 | "21628 0 " 824 | ] 825 | }, 826 | 
"execution_count": 6, 827 | "metadata": {}, 828 | "output_type": "execute_result" 829 | } 830 | ], 831 | "source": [ 832 | "df_old[df_old['GAME_ID'] > 22200751]" 833 | ] 834 | } 835 | ], 836 | "metadata": { 837 | "kernelspec": { 838 | "display_name": "nba3", 839 | "language": "python", 840 | "name": "python3" 841 | }, 842 | "language_info": { 843 | "codemirror_mode": { 844 | "name": "ipython", 845 | "version": 3 846 | }, 847 | "file_extension": ".py", 848 | "mimetype": "text/x-python", 849 | "name": "python", 850 | "nbconvert_exporter": "python", 851 | "pygments_lexer": "ipython3", 852 | "version": "3.9.13" 853 | }, 854 | "orig_nbformat": 4, 855 | "vscode": { 856 | "interpreter": { 857 | "hash": "4655998f62ad965cbd25df51edb717f2326f5df53d53899f0ae604225aa5ae06" 858 | } 859 | } 860 | }, 861 | "nbformat": 4, 862 | "nbformat_minor": 2 863 | } 864 | -------------------------------------------------------------------------------- /notebooks/lightgbm.json: -------------------------------------------------------------------------------- 1 | {"lambda_l1": 0.014353286456597142, "lambda_l2": 1.8797714715567863e-05, "learning_rate": 0.0074219704795943425, "max_depth": 5, "n_estimators": 8165, "feature_fraction": 0.7165888165225329, "bagging_fraction": 0.41679785470194425, "bagging_freq": 5, "num_leaves": 965, "min_child_samples": 94, "min_data_per_groups": 89} -------------------------------------------------------------------------------- /notebooks/model_data.json: -------------------------------------------------------------------------------- 1 | {"model_name": "xgboost", "calibration_method": "Model + Isotonic", "brier_loss": 0.22892725648801726, "metrics": {"AUC": 0.6314368770764119, "Accuracy": 0.6584507042253521}} -------------------------------------------------------------------------------- /notebooks/xgboost.json: -------------------------------------------------------------------------------- 1 | {"num_round": 868, "learning_rate": 0.018405745552075914, "max_bin": 558, "max_depth": 2, "alpha": 8.281713514214621, "gamma": 11.304706114754095, "reg_lambda": 11.806523119478767, "colsample_bytree": 0.4339252336236323, "subsample": 0.5326856953922422, "min_child_weight": 6.179632416051588, "scale_pos_weight": 1} -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | alembic==1.8.1 2 | altair==4.2.0 3 | asttokens==2.2.0 4 | async-generator==1.10 5 | attrs==22.1.0 6 | autopage==0.5.1 7 | avro==1.10.2 8 | backcall==0.2.0 9 | beautifulsoup4==4.11.1 10 | blinker==1.5 11 | boto3==1.26.20 12 | botocore==1.29.20 13 | cachetools==5.2.0 14 | certifi==2022.9.24 15 | cffi==1.15.1 16 | charset-normalizer==2.1.1 17 | click==8.1.3 18 | cliff==4.1.0 19 | cloudpickle==2.2.0 20 | cmaes==0.9.0 21 | cmd2==2.4.2 22 | colorama==0.4.6 23 | colorlog==6.7.0 24 | commonmark==0.9.1 25 | confluent-kafka==1.8.2 26 | contourpy==1.0.6 27 | cryptography==38.0.4 28 | cycler==0.11.0 29 | debugpy==1.6.4 30 | decorator==5.1.1 31 | entrypoints==0.4 32 | exceptiongroup==1.0.4 33 | executing==1.2.0 34 | fastavro==1.4.11 35 | fastjsonschema==2.16.2 36 | fonttools==4.38.0 37 | furl==2.1.3 38 | future==0.18.2 39 | gitdb==4.0.10 40 | GitPython==3.1.29 41 | great-expectations==0.14.12 42 | greenlet==2.0.1 43 | h11==0.14.0 44 | hopsworks==3.0.4 45 | hsfs==3.0.5 46 | hsml==3.0.3 47 | html5lib==1.1 48 | idna==3.4 49 | importlib-metadata==5.1.0 50 | importlib-resources==5.10.0 51 | ipykernel==6.17.1 52 | 
ipython==8.7.0 53 | ipywidgets==8.0.2 54 | javaobj-py3==0.4.3 55 | jedi==0.18.2 56 | Jinja2==3.0.3 57 | jmespath==1.0.1 58 | joblib==1.2.0 59 | jsonpatch==1.32 60 | jsonpointer==2.3 61 | jsonschema==4.17.3 62 | jupyter_client==7.4.7 63 | jupyter_core==5.1.0 64 | jupyterlab-widgets==3.0.3 65 | kiwisolver==1.4.4 66 | lightgbm==3.3.2 67 | llvmlite==0.39.1 68 | lxml==4.9.1 69 | Mako==1.2.4 70 | MarkupSafe==2.0.1 71 | matplotlib==3.6.2 72 | matplotlib-inline==0.1.6 73 | mistune==0.8.4 74 | mock==4.0.3 75 | nbformat==5.7.0 76 | nest-asyncio==1.5.6 77 | numba==0.56.4 78 | numpy==1.22.4 79 | optuna==2.10.1 80 | orderedmultidict==1.0.1 81 | outcome==1.2.0 82 | packaging==21.3 83 | pandas==1.4.3 84 | parso==0.8.3 85 | pbr==5.11.0 86 | pickleshare==0.7.5 87 | Pillow==9.3.0 88 | platformdirs==2.5.4 89 | plotly==5.9.0 90 | prettytable==3.5.0 91 | prompt-toolkit==3.0.33 92 | protobuf==3.20.3 93 | psutil==5.9.4 94 | pure-eval==0.2.2 95 | pyarrow==10.0.1 96 | pyasn1==0.4.8 97 | pyasn1-modules==0.2.8 98 | pycparser==2.21 99 | pycryptodomex==3.16.0 100 | pydeck==0.8.0 101 | Pygments==2.13.0 102 | PyHopsHive==0.6.4.1.dev0 103 | pyhumps==1.6.1 104 | pyjks==20.0.0 105 | Pympler==1.0.1 106 | PyMySQL==1.0.2 107 | pyparsing==2.4.7 108 | pyperclip==1.8.2 109 | pyreadline3==3.4.1 110 | pyrsistent==0.19.2 111 | PySocks==1.7.1 112 | python-dateutil==2.8.2 113 | python-dotenv==0.21.0 114 | pytz==2022.6 115 | pytz-deprecation-shim==0.1.0.post0 116 | PyVirtualDisplay==3.0 117 | PyYAML==6.0 118 | pyzmq==24.0.1 119 | requests==2.28.1 120 | rich==12.6.0 121 | ruamel.yaml==0.17.17 122 | ruamel.yaml.clib==0.2.7 123 | s3transfer==0.6.0 124 | scikit-learn==1.1.3 125 | scipy==1.9.3 126 | seaborn==0.11.2 127 | selenium==4.6.0 128 | semver==2.13.0 129 | shap==0.41.0 130 | six==1.16.0 131 | slicer==0.0.7 132 | smmap==5.0.0 133 | sniffio==1.3.0 134 | sortedcontainers==2.4.0 135 | soupsieve==2.3.2.post1 136 | SQLAlchemy==1.4.44 137 | stack-data==0.6.2 138 | stevedore==4.1.1 139 | streamlit==1.15.2 140 | sweetviz==2.1.4 141 | tenacity==8.1.0 142 | termcolor==2.1.1 143 | threadpoolctl==3.1.0 144 | thrift==0.16.0 145 | toml==0.10.2 146 | toolz==0.12.0 147 | tornado==6.2 148 | tqdm==4.64.0 149 | traitlets==5.6.0 150 | trio==0.22.0 151 | trio-websocket==0.9.2 152 | twofish==0.3.0 153 | typing_extensions==4.4.0 154 | tzdata==2022.7 155 | tzlocal==4.2 156 | urllib3==1.26.13 157 | validators==0.20.0 158 | watchdog==2.2.0 159 | wcwidth==0.2.5 160 | webdriver-manager==3.8.4 161 | webencodings==0.5.1 162 | widgetsnbextension==4.0.3 163 | wsproto==1.2.0 164 | xgboost==1.6.1 165 | zipp==3.11.0 166 | -------------------------------------------------------------------------------- /src/common_functions.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | 4 | import itertools 5 | 6 | import matplotlib.pyplot as plt 7 | from matplotlib.colors import TwoSlopeNorm 8 | 9 | import sweetviz as sv 10 | 11 | from datetime import datetime 12 | 13 | 14 | def plot_corr_barchart(df1: pd.DataFrame, drop_cols: list, n: int = 30) -> None: 15 | """ 16 | Plots a color-gradient bar chart showing top n correlations between features 17 | 18 | Args: 19 | df1 (pd.DataFrame): the dataframe to plot 20 | drop_cols (list): list of columns not to include in plot 21 | n (int): number of top n correlations to plot 22 | 23 | Returns: 24 | None 25 | 26 | Sources: 27 | https://typefully.com/levikul09/j6qzwR0 28 | 
https://stackoverflow.com/questions/17778394/list-highest-correlation-pairs-from-a-large-correlation-matrix-in-pandas 29 | 30 | """ 31 | 32 | df1 = df1.drop(columns=drop_cols) 33 | useful_columns = df1.select_dtypes(include=['number']).columns 34 | 35 | def get_redundant_pairs(df): 36 | pairs_to_drop = set() 37 | cols = df.columns 38 | for i in range(0,df.shape[1]): 39 | for j in range(0,i+1): 40 | pairs_to_drop.add((cols[i],cols[j])) 41 | return pairs_to_drop 42 | 43 | def get_correlations(df,n=n): 44 | au_corr = df.corr(method = 'spearman').unstack() #spearman used because not all data is normalized 45 | labels_to_drop = get_redundant_pairs(df) 46 | au_corr = au_corr.drop(labels = labels_to_drop).sort_values(ascending=False) 47 | top_n = au_corr[0:n] 48 | bottom_n = au_corr[-n:] 49 | top_corr = pd.concat([top_n, bottom_n]) 50 | return top_corr 51 | 52 | corrplot = get_correlations(df1[useful_columns]) 53 | 54 | 55 | fig, ax = plt.subplots(figsize=(15,10)) 56 | norm = TwoSlopeNorm(vmin=-1, vcenter=0, vmax =1) 57 | colors = [plt.cm.RdYlGn(norm(c)) for c in corrplot.values] 58 | 59 | print(corrplot) 60 | 61 | corrplot.plot.barh(color=colors) 62 | 63 | return 64 | 65 | 66 | def plot_corr_vs_target(target: str, df1: pd.DataFrame, drop_cols: list, n: int = 30) -> None: 67 | """ 68 | Plots a color-gradient bar chart showing top n correlations between features and target 69 | 70 | Args: 71 | target (str): the name of the target column 72 | df1 (pd.DataFrame): the dataframe to plot 73 | drop_cols (list): list of columns not to include in plot 74 | n (int): number of top n correlations to plot 75 | 76 | Returns: 77 | None 78 | 79 | """ 80 | 81 | 82 | target_series = df1[target] 83 | df1 = df1.drop(columns=drop_cols) 84 | 85 | x = df1.corrwith(target_series, method = 'spearman',numeric_only=True).sort_values(ascending=False) 86 | top_n = x[0:n] 87 | bottom_n = x[-n:] 88 | top_corr = pd.concat([top_n, bottom_n]) 89 | x = top_corr 90 | 91 | print(x) 92 | 93 | fig, ax = plt.subplots(figsize=(15,10)) 94 | norm = TwoSlopeNorm(vmin=-1, vcenter=0, vmax =1) 95 | colors = [plt.cm.RdYlGn(norm(c)) for c in x.values] 96 | x.plot.barh(color=colors) 97 | 98 | return 99 | 100 | 101 | def plot_confusion_matrix(cm: np.ndarray, 102 | target_names: list, 103 | title: str ='Confusion matrix', 104 | cmap: plt.colormaps =None, 105 | normalize: bool =True): 106 | """ 107 | given a sklearn confusion matrix (cm), make a nice plot 108 | 109 | Arguments 110 | --------- 111 | cm: confusion matrix from sklearn.metrics.confusion_matrix 112 | 113 | target_names: given classification classes such as [0, 1, 2] 114 | the class names, for example: ['high', 'medium', 'low'] 115 | 116 | title: the text to display at the top of the matrix 117 | 118 | cmap: the gradient of the values displayed from matplotlib.pyplot.cm 119 | see http://matplotlib.org/examples/color/colormaps_reference.html 120 | plt.get_cmap('jet') or plt.cm.Blues 121 | 122 | normalize: If False, plot the raw numbers 123 | If True, plot the proportions 124 | 125 | Usage 126 | ----- 127 | plot_confusion_matrix(cm = cm, # confusion matrix created by 128 | # sklearn.metrics.confusion_matrix 129 | normalize = True, # show proportions 130 | target_names = y_labels_vals, # list of names of the classes 131 | title = best_estimator_name) # title of graph 132 | 133 | Citation 134 | --------- 135 | http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html 136 | 137 | Source: 138 | 
https://stackoverflow.com/questions/19233771/sklearn-plot-confusion-matrix-with-labels/50386871#50386871 139 | 140 | """ 141 | 142 | accuracy = np.trace(cm) / np.sum(cm).astype('float') 143 | misclass = 1 - accuracy 144 | 145 | if cmap is None: 146 | cmap = plt.get_cmap('Blues') 147 | 148 | fig = plt.figure(figsize=(8, 6)) 149 | plt.imshow(cm, interpolation='nearest', cmap=cmap) 150 | plt.title(title) 151 | plt.colorbar() 152 | 153 | if target_names is not None: 154 | tick_marks = np.arange(len(target_names)) 155 | plt.xticks(tick_marks, target_names, rotation=45) 156 | plt.yticks(tick_marks, target_names) 157 | 158 | if normalize: 159 | cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] 160 | 161 | 162 | thresh = cm.max() / 1.5 if normalize else cm.max() / 2 163 | for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])): 164 | if normalize: 165 | plt.text(j, i, "{:0.4f}".format(cm[i, j]), 166 | horizontalalignment="center", 167 | color="white" if cm[i, j] > thresh else "black") 168 | else: 169 | plt.text(j, i, "{:,}".format(cm[i, j]), 170 | horizontalalignment="center", 171 | color="white" if cm[i, j] > thresh else "black") 172 | 173 | 174 | plt.tight_layout() 175 | plt.ylabel('True label') 176 | plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass)) 177 | plt.show() 178 | 179 | return fig 180 | 181 | def run_sweetviz_report(df: pd.DataFrame, TARGET: str) -> None: 182 | """ 183 | Generates a sweetviz report and saves it to a html file 184 | 185 | Args: 186 | df (pd.DataFrame): the dataframe to analyze 187 | TARGET (str): the name of the target column 188 | """ 189 | 190 | report_label = datetime.today().strftime('%Y-%m-%d_%H_%M') 191 | 192 | my_report = sv.analyze(df,target_feat=TARGET) 193 | my_report.show_html(filepath='SWEETVIZ_' + report_label + '.html') 194 | 195 | return 196 | 197 | 198 | def run_sweetviz_comparison(df1: pd.DataFrame, df1_name: str, df2: pd.DataFrame, df2_name: str, TARGET: str, report_label: str) -> None: 199 | """ 200 | Generates a sweetviz comparison report between two dataframes and saves it to a html file 201 | 202 | Args: 203 | df1 (pd.DataFrame): the first dataframe to analyze 204 | df1_name (str): the name of the first dataframe 205 | df2 (pd.DataFrame): the second dataframe to analyze 206 | df2_name (str): the name of the second dataframe 207 | TARGET (str): the name of the target column 208 | report_label (str): identifier incorporated into the filename of the report 209 | """ 210 | 211 | report_label = report_label + datetime.today().strftime('%Y-%m-%d_%H_%M') 212 | 213 | my_report = sv.compare([df1, df1_name], [df2, df2_name], target_feat=TARGET,pairwise_analysis="off") 214 | my_report.show_html(filepath='SWEETVIZ_' + report_label + '.html') 215 | 216 | return -------------------------------------------------------------------------------- /src/data_processing.py: -------------------------------------------------------------------------------- 1 | 2 | import pandas as pd 3 | 4 | def process_games(games: pd.DataFrame) -> pd.DataFrame: 5 | """ 6 | Performs basic data cleaning on the games dataset. 
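    For example (illustrative IDs): a preseason GAME_ID such as 12200001 begins with 1 and is
    dropped by the GAME_ID > 20000000 filter, while a GAME_ID of 30000000 or above is kept and
    flagged as a playoff game (PLAYOFF = 1).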
7 | 8 | Args: 9 | games (pd.DataFrame): the raw games dataframe 10 | 11 | Returns: 12 | the cleaned games dataframe 13 | 14 | """ 15 | 16 | 17 | # remove preseason games (GAME_ID begins with a 1) 18 | games = games[games['GAME_ID'] > 20000000] 19 | 20 | # flag postseason games (GAME_ID begins with >2) 21 | games['PLAYOFF'] = (games['GAME_ID'] >= 30000000).astype('int8') 22 | 23 | # remove duplicates (each GAME_ID should be unique) 24 | games = games[~games.duplicated(subset=['GAME_ID'])] 25 | 26 | # drop unnecessary fields 27 | all_columns = games.columns.tolist() 28 | drop_columns = ['GAME_STATUS_TEXT', 'TEAM_ID_home', 'TEAM_ID_away'] 29 | use_columns = [item for item in all_columns if item not in drop_columns] 30 | games = games[use_columns] 31 | 32 | return games 33 | 34 | 35 | def process_ranking(ranking: pd.DataFrame) -> pd.DataFrame: 36 | """ 37 | Performs basic data cleaning on the ranking dataset. 38 | 39 | Args: 40 | ranking (pd.DataFrame): the raw ranking dataframe 41 | 42 | Returns: 43 | the cleaned ranking dataframe 44 | 45 | """ 46 | 47 | 48 | # remove preseason rankings (SEASON_ID begins with 1) 49 | ranking = ranking[ranking['SEASON_ID'] > 20000] 50 | 51 | # convert home record and road record to numeric 52 | ranking['HOME_W'] = ranking['HOME_RECORD'].apply(lambda x: x.split('-')[0]).astype('int') 53 | ranking['HOME_L'] = ranking['HOME_RECORD'].apply(lambda x: x.split('-')[1]).astype('int') 54 | ranking['HOME_W_PCT'] = ranking['HOME_W'] / ( ranking['HOME_W'] + ranking['HOME_L'] ) 55 | 56 | ranking['ROAD_W'] = ranking['ROAD_RECORD'].apply(lambda x: x.split('-')[0]).astype('int') 57 | ranking['ROAD_L'] = ranking['ROAD_RECORD'].apply(lambda x: x.split('-')[1]).astype('int') 58 | ranking['ROAD_W_PCT'] = ranking['ROAD_W'] / ( ranking['ROAD_W'] + ranking['ROAD_L'] ) 59 | 60 | # encode CONFERENCE as an integer (just using pandas - not importing sklearn for just one feature) 61 | ranking['CONFERENCE'] = ranking['CONFERENCE'].apply(lambda x: 0 if x=='East' else 1 ).astype('int') 62 | 63 | # remove duplicates (there should only be one TEAM_ID per STANDINGSDATE) 64 | ranking = ranking[~ranking.duplicated(subset=['TEAM_ID','STANDINGSDATE'])] 65 | 66 | # drop unnecessary fields 67 | drop_fields = ['SEASON_ID', 'LEAGUE_ID', 'RETURNTOPLAY', 'TEAM', 'HOME_RECORD', 'ROAD_RECORD'] 68 | ranking = ranking.drop(drop_fields,axis=1) 69 | 70 | return ranking 71 | 72 | 73 | def process_games_details(details: pd.DataFrame) -> pd.DataFrame: 74 | """ 75 | Performs basic data cleaning on the games_details dataset. 
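    For example, a raw MIN value of "34:30" (minutes:seconds) is converted by the code below to the float 34.5.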
76 | 
77 |     Args:
78 |         details (pd.DataFrame): the raw games_details dataframe
79 | 
80 |     Returns:
81 |         the cleaned games_details dataframe
82 | 
83 |     """
84 | 
85 | 
86 |     # convert MIN:SEC to float
87 |     df = details.loc[details['MIN'].str.contains(':',na=False)]
88 |     df['MIN_whole'] = df['MIN'].apply(lambda x: x.split(':')[0]).astype("int8")
89 |     df['MIN_seconds'] = df['MIN'].apply(lambda x: x.split(':')[1]).astype("int8")
90 |     df['MIN'] = df['MIN_whole'] + (df['MIN_seconds'] / 60)
91 | 
92 |     details['MIN'].loc[details['MIN'].str.contains(':',na=False)] = df['MIN']
93 |     details['MIN'] = details['MIN'].astype("float16")
94 | 
95 |     # convert negatives to positive
96 |     details['MIN'].loc[details['MIN'] < 0] = -(details['MIN'])
97 | 
98 |     # update START_POSITION if the player did not play (MIN = NaN)
99 |     details['START_POSITION'].loc[details['MIN'].isna()] = 'NP'
100 | 
101 |     # update START_POSITION if null
102 |     details['START_POSITION'] = details['START_POSITION'].fillna('NS')
103 | 
104 |     # drop unnecessary fields
105 |     drop_fields = ['COMMENT', 'TEAM_ABBREVIATION', 'TEAM_CITY', 'PLAYER_NAME', 'NICKNAME']
106 |     details = details.drop(drop_fields,axis=1)
107 | 
108 |     return details
109 | 
110 | 
111 | def add_TARGET(df: pd.DataFrame) -> pd.DataFrame:
112 |     """
113 |     Adds a TARGET column to the dataframe by copying HOME_TEAM_WINS.
114 | 
115 |     Args:
116 |         df (pd.DataFrame): the dataframe to add the TARGET column to
117 | 
118 |     Returns:
119 |         the games dataframe with a TARGET column
120 | 
121 |     """
122 | 
123 |     df['TARGET'] = df['HOME_TEAM_WINS']
124 | 
125 |     return df
126 | 
127 | def split_train_test(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
128 |     """
129 |     Splits the dataframe into train and test sets.
130 | 
131 |     Splits the latest season as the test set and the rest as the train set.
132 |     The second latest season is also included with the test set to allow for feature engineering.
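    A minimal doctest-style sketch (hypothetical seasons, assuming SEASON holds integer season labels):

        >>> df = pd.DataFrame({'SEASON': [2018, 2019, 2020, 2021, 2022]})
        >>> train, test = split_train_test(df)
        >>> sorted(train['SEASON'].unique())
        [2018, 2019, 2020, 2021]
        >>> sorted(test['SEASON'].unique())
        [2021, 2022]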
133 | 
134 |     Args:
135 |         df (pd.DataFrame): the dataframe to split
136 | 
137 |     Returns:
138 |         the train and test dataframes
139 | 
140 |     """
141 | 
142 |     latest_season = df['SEASON'].unique().max()
143 | 
144 |     train = df[df['SEASON'] < (latest_season)]
145 |     test = df[df['SEASON'] >= (latest_season - 1)]
146 | 
147 |     return train, test
148 | 
149 | 
--------------------------------------------------------------------------------
/src/feature_engineering.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | 
4 | def process_features(df: pd.DataFrame) -> pd.DataFrame:
5 |     """
6 |     Master function to perform all the steps of feature engineering
7 | 
8 |     Args:
9 |         df (pd.DataFrame): the dataframe to process
10 | 
11 |     Returns:
12 |         the processed dataframe
13 | 
14 | 
15 | 
16 |     Feature engineering to add:
17 |         - rolling averages of key stats,
18 |         - win/lose streaks,
19 |         - home/away streaks,
20 |         - specific matchup (team X vs team Y) rolling averages and streaks
21 |         - home team rolling stats minus visitor team rolling stats
22 |         - rolling stats minus current league average
23 | 
24 |     Functions include:
25 |         - fix_datatypes(): converts date to proper format and reduces memory footprint of ints and floats
26 |         - add_date_features(): adds a feature for month number from game date
27 |         - remove_playoff_games(): playoff games may bias the statistics
28 |         - add_rolling_home_visitor(): rolling avgs and streaks for home/visitor team when playing as home/visitor
29 |         - process_games_consecutively(): separate home team stats from visitor team stats for each game and stack these together by game date
30 |         - add_past_performance_all(): rolling avgs and streaks no matter if playing as home or visitor team
31 |         - process_x_minus_league_avg(): subtract league avg rolling stats from team's rolling stats
32 |         - add_matchups(): rolling avgs and streaks for each time when Team A played Team B
33 |         - combine_new_features(): combine back home team and visitor team features so each game has only one row again
34 |         - process_x_minus_y(): subtract visitor team rolling stats from home rolling stats
35 |     """
36 | 
37 |     # lengths of rolling averages and streaks to calculate for each team
38 |     # we will try a variety of lengths to see which works best
39 |     home_visitor_roll_list = [3, 7, 10] #lengths to use when restricting to home or visitor role
40 |     all_roll_list = [3, 7, 10, 15] #lengths to use when NOT restricting to home or visitor role
41 | 
42 |     long_integer_fields = ['GAME_ID', 'HOME_TEAM_ID', 'VISITOR_TEAM_ID', 'SEASON']
43 |     short_integer_fields = ['PTS_home', 'AST_home', 'REB_home', 'PTS_away', 'AST_away', 'REB_away']
44 |     date_fields = ['GAME_DATE_EST']
45 | 
46 |     df = fix_datatypes(df, date_fields, short_integer_fields, long_integer_fields)
47 |     df = add_date_features(df)
48 |     df = remove_playoff_games(df)
49 |     df = add_rolling_home_visitor(df, "HOME", home_visitor_roll_list)
50 |     df = add_rolling_home_visitor(df, "VISITOR", home_visitor_roll_list)
51 | 
52 |     df_consecutive = process_games_consecutively(df)
53 |     df_consecutive = add_matchups(df_consecutive, home_visitor_roll_list)
54 |     df_consecutive = add_past_performance_all(df_consecutive, all_roll_list)
55 | 
56 |     #add these features back to main dataframe
57 |     df = combine_new_features(df,df_consecutive)
58 | 
59 |     df = process_x_minus_y(df)
60 | 
61 |     return df
62 | 
63 | 
64 | def fix_datatypes(df: pd.DataFrame, date_columns: list, short_integer_fields: list, long_integer_fields: list)-> pd.DataFrame:
pd.DataFrame: 65 | """ 66 | Converts date to proper format and reduces memory footprint of ints and floats 67 | 68 | Args: 69 | df (pd.DataFrame): the dataframe to process 70 | 71 | Returns: 72 | the processed dataframe 73 | 74 | """ 75 | 76 | for field in date_columns: 77 | df[field] = pd.to_datetime(df[field]) 78 | 79 | #convert long integer fields to int32 from int64 80 | for field in long_integer_fields: 81 | df[field] = df[field].astype('int32') 82 | 83 | #convert specific fields to int16 to avoid type issues with hopsworks.ai 84 | for field in short_integer_fields: 85 | df[field] = df[field].astype('int16') 86 | 87 | #convert to positive. For some reason, some values have been saved as negative numbers 88 | for field in short_integer_fields: 89 | df[field] = df[field].abs() 90 | 91 | #convert the remaining int64s to int8 92 | for field in df.select_dtypes(include=['int64']).columns.tolist(): 93 | df[field] = df[field].astype('int8') 94 | 95 | #convert float64s to float16s 96 | for field in df.select_dtypes(include=['float64']).columns.tolist(): 97 | df[field] = df[field].astype('float16') 98 | 99 | return df 100 | 101 | 102 | def add_date_features(df: pd.DataFrame)-> pd.DataFrame: 103 | """ 104 | Creates new features from the game date, which will hopefully be more useful for the model 105 | 106 | Currently simply converts game date to just month and add as a feature. This limits cardinality of the date feature. 107 | 108 | Args: 109 | df (pd.DataFrame): the dataframe to process 110 | 111 | Returns: 112 | the processed dataframe 113 | """ 114 | 115 | df['MONTH'] = df['GAME_DATE_EST'].dt.month 116 | 117 | return df 118 | 119 | 120 | def remove_playoff_games(df: pd.DataFrame)-> pd.DataFrame: 121 | """ 122 | Remove playoff games 123 | 124 | Playoff games may bias the statistics because they are not played under the same conditions as regular season games and are played in a tournament format. 125 | 126 | Args: 127 | df (pd.DataFrame): the dataframe to process 128 | 129 | Returns: 130 | the processed dataframe 131 | """ 132 | 133 | 134 | # Filter to only non-playoff games and then drop the PLAYOFF feature 135 | 136 | df = df[df["PLAYOFF"] == 0] 137 | df = df.drop("PLAYOFF", axis = 1) 138 | 139 | return df 140 | 141 | 142 | def add_rolling_home_visitor(df: pd.DataFrame, location: str, roll_list: list)-> pd.DataFrame: 143 | """ 144 | Add rolling avgs and win/lose streaks for home/visitor team when playing as home/visitor for a variety of rolling lengths 145 | 146 | This function also invokes another function to calculate the league average rolling stats for that moment in time and subtracts these from the team's rolling stats. 147 | 148 | Args: 149 | df (pd.DataFrame): the dataframe to process 150 | location (str): "HOME" or "VISITOR" 151 | roll_list (list): list of number of games for each rolling mean, e.g. [3, 5, 7, 10, 15] 152 | 153 | Returns: 154 | the processed dataframe 155 | 156 | 157 | We are adding features that show how well the home team has done in its last home games and how well the visitor team has done in its last away games. 158 | We are also determining the current win streak for each team (negative if losing streak) when playing as home or visitor team. 
159 | 160 | """ 161 | 162 | # compile stats for home or visitor team 163 | location_id = location + "_TEAM_ID" 164 | 165 | # sort games by the order in which they were played for each home or visitor team 166 | df = df.sort_values(by = [location_id, 'GAME_DATE_EST'], axis=0, ascending=[True, True,], ignore_index=True) 167 | 168 | # Win streak, negative if a losing streak 169 | df[location + '_TEAM_WIN_STREAK'] = df['HOME_TEAM_WINS'].groupby((df['HOME_TEAM_WINS'].shift() != df.groupby([location_id])['HOME_TEAM_WINS'].shift(2)).cumsum()).cumcount() + 1 170 | # if home team lost the last game of the streak, then the streak must be a losing streak. make it negative 171 | df[location + '_TEAM_WIN_STREAK'].loc[df['HOME_TEAM_WINS'].shift() == 0] = -1 * df[location + '_TEAM_WIN_STREAK'] 172 | 173 | # If visitor, the streak has opposite meaning (3 wins in a row for home team is 3 losses in a row for visitor) 174 | if location == 'VISITOR': 175 | df[location + '_TEAM_WIN_STREAK'] = - df[location + '_TEAM_WIN_STREAK'] 176 | 177 | 178 | # rolling means 179 | feature_list = ['HOME_TEAM_WINS', 'PTS_home', 'FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home'] 180 | 181 | if location == 'VISITOR': 182 | feature_list = ['HOME_TEAM_WINS', 'PTS_away', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away'] 183 | 184 | 185 | roll_feature_list = [] 186 | for feature in feature_list: 187 | for roll in roll_list: 188 | roll_feature_name = location + '_' + feature + '_AVG_LAST_' + str(roll) + '_' + location 189 | if feature == 'HOME_TEAM_WINS': #remove the "HOME_" for better readability 190 | roll_feature_name = location + '_' + feature[5:] + '_AVG_LAST_' + str(roll) + '_' + location 191 | roll_feature_list.append(roll_feature_name) 192 | df[roll_feature_name] = df.groupby(['HOME_TEAM_ID'])[feature].rolling(roll, closed= "left").mean().values 193 | 194 | 195 | 196 | # determine league avg for each stat and then subtract it from the each team's avg 197 | # as a measure of how well that team compares to all teams in that moment in time 198 | 199 | #remove win averages from roll list - the league average will always be 0.5 (half the teams win, half lose) 200 | roll_feature_list = [x for x in roll_feature_list if not x.startswith('HOME_TEAM_WINS')] 201 | #print(location_id) 202 | df = process_x_minus_league_avg(df, roll_feature_list, location_id) 203 | 204 | 205 | return df 206 | 207 | 208 | def process_games_consecutively(df_data: pd.DataFrame)-> pd.DataFrame: 209 | """ 210 | 211 | Separate home team stats from visitor team stats for each game and stack these together by game date. 212 | 213 | (Each game record will go from a single row, Home/Visitor combined, to two rows, one for home team and one for visitor) 214 | 215 | Args: 216 | df (pd.DataFrame): the dataframe to process 217 | 218 | Returns: 219 | the processed dataframe 220 | """ 221 | 222 | # re-organize so that all of a team's games can be listed in chronological order whether HOME or VISITOR 223 | # this will facilitate feature engineering (winpct vs team X, 5-game winpct, current win streak, etc...) 
224 | 
225 |     # before this step, the data is stored by game, and each game has 2 teams
226 |     # this function will separate each team's stats so that each game has 2 rows (one for each team) instead of one combined row
227 | 
228 |     #this data will need to be re-linked back to the main dataframe after all processing is done,
229 |     #joining TEAM1 to HOME_TEAM_ID for all records and then TEAM1 to VISITOR_TEAM_ID for all records
230 | 
231 |     #TEAM1 will be the key field. TEAM2 is used solely to process past team matchups
232 | 
233 |     # all the home games for each team will be selected and then stacked with all the away games
234 | 
235 |     df_home = pd.DataFrame()
236 |     df_home['GAME_DATE_EST'] = df_data['GAME_DATE_EST']
237 |     df_home['GAME_ID'] = df_data['GAME_ID']
238 |     df_home['TEAM1'] = df_data['HOME_TEAM_ID']
239 |     df_home['TEAM1_home'] = 1
240 |     df_home['TEAM1_win'] = df_data['HOME_TEAM_WINS']
241 |     df_home['TEAM2'] = df_data['VISITOR_TEAM_ID']
242 |     df_home['SEASON'] = df_data['SEASON']
243 | 
244 |     df_home['PTS'] = df_data['PTS_home']
245 |     df_home['FG_PCT'] = df_data['FG_PCT_home']
246 |     df_home['FT_PCT'] = df_data['FT_PCT_home']
247 |     df_home['FG3_PCT'] = df_data['FG3_PCT_home']
248 |     df_home['AST'] = df_data['AST_home']
249 |     df_home['REB'] = df_data['REB_home']
250 | 
251 |     # now for visitor teams
252 | 
253 |     df_visitor = pd.DataFrame()
254 |     df_visitor['GAME_DATE_EST'] = df_data['GAME_DATE_EST']
255 |     df_visitor['GAME_ID'] = df_data['GAME_ID']
256 |     df_visitor['TEAM1'] = df_data['VISITOR_TEAM_ID']
257 |     df_visitor['TEAM1_home'] = 0
258 |     df_visitor['TEAM1_win'] = df_data['HOME_TEAM_WINS'].apply(lambda x: 1 if x == 0 else 0)
259 |     df_visitor['TEAM2'] = df_data['HOME_TEAM_ID']
260 |     df_visitor['SEASON'] = df_data['SEASON']
261 | 
262 |     df_visitor['PTS'] = df_data['PTS_away']
263 |     df_visitor['FG_PCT'] = df_data['FG_PCT_away']
264 |     df_visitor['FT_PCT'] = df_data['FT_PCT_away']
265 |     df_visitor['FG3_PCT'] = df_data['FG3_PCT_away']
266 |     df_visitor['AST'] = df_data['AST_away']
267 |     df_visitor['REB'] = df_data['REB_away']
268 | 
269 |     # merge dataframes
270 | 
271 |     df = pd.concat([df_home, df_visitor])
272 | 
273 |     column2 = df.pop('TEAM1')
274 |     column3 = df.pop('TEAM1_home')
275 |     column4 = df.pop('TEAM2')
276 |     column5 = df.pop('TEAM1_win')
277 | 
278 |     df.insert(2,'TEAM1', column2)
279 |     df.insert(3,'TEAM1_home', column3)
280 |     df.insert(4,'TEAM2', column4)
281 |     df.insert(5,'TEAM1_win', column5)
282 | 
283 |     df = df.sort_values(by = ['TEAM1', 'GAME_ID'], axis=0, ascending=[True, True], ignore_index=True)
284 | 
285 |     return df
286 | 
287 | 
288 | def add_matchups(df: pd.DataFrame, roll_list: list)-> pd.DataFrame:
289 |     """
290 |     Add rolling win pcts and win/lose streaks for each time when Team A played Team B for a variety of rolling windows
291 | 
292 |     Args:
293 |         df (pd.DataFrame): the dataframe to process
294 |         roll_list (list): list of number of games for each rolling mean, e.g. [3, 5, 7, 10, 15]
295 | 
296 |     Returns:
297 |         the processed dataframe
298 |     """
299 | 
300 | 
301 |     # group all the games that 2 teams played each other
302 |     # calculate home team win pct and the home team win/lose streak
303 | 
304 | 
305 |     df = df.sort_values(by = ['TEAM1', 'TEAM2','GAME_DATE_EST'], axis=0, ascending=[True, True, True], ignore_index=True)
306 | 
307 |     for roll in roll_list:
308 |         df['MATCHUP_WINPCT_' + str(roll)] = df.groupby(['TEAM1','TEAM2'])['TEAM1_win'].rolling(roll, closed= "left").mean().values
309 | 
310 |     df['MATCHUP_WIN_STREAK'] = df['TEAM1_win'].groupby((df['TEAM1_win'].shift() != df.groupby(['TEAM1','TEAM2'])['TEAM1_win'].shift(2)).cumsum()).cumcount() + 1
311 | 
312 |     # if team1 lost the last game of the streak, then the streak must be a losing streak. make it negative
313 |     df['MATCHUP_WIN_STREAK'].loc[df['TEAM1_win'].shift() == 0] = -1 * df['MATCHUP_WIN_STREAK']
314 | 
315 | 
316 |     return df
317 | 
318 | 
319 | def add_past_performance_all(df: pd.DataFrame, roll_list: list)-> pd.DataFrame:
320 |     """
321 |     Add rolling avgs, win/lose streak, and home/away streak no matter if playing as home or visitor team.
322 | 
323 |     Args:
324 |         df (pd.DataFrame): the dataframe to process
325 |         roll_list (list): list of number of games for each rolling mean, e.g. [3, 5, 7, 10, 15]
326 | 
327 |     Returns:
328 |         the processed dataframe
329 |     """
330 | 
331 |     # add features showing how well each team has done in its last games
332 |     # regardless of whether they were at home or away
333 | 
334 |     # add rolling means and win streaks (negative number if losing streak)
335 | 
336 |     #this data will need to be re-linked back to the main dataframe after all processing is done,
337 |     #joining TEAM1 to HOME_TEAM_ID for all records and then TEAM1 to VISITOR_TEAM_ID for all records
338 | 
339 |     #TEAM1 will be the key field. TEAM2 was used solely to process past team matchups
340 | 
341 | 
342 |     df = df.sort_values(by = ['TEAM1','GAME_DATE_EST'], axis=0, ascending=[True, True,], ignore_index=True)
343 | 
344 |     #streak of games won/lost, make negative if a losing streak
345 |     df['WIN_STREAK'] = df['TEAM1_win'].groupby((df['TEAM1_win'].shift() != df.groupby(['TEAM1'])['TEAM1_win'].shift(2)).cumsum()).cumcount() + 1
346 | 
347 |     # if team1 lost the last game of the streak, then the streak must be a losing streak. make it negative
348 |     df['WIN_STREAK'].loc[df['TEAM1_win'].shift() == 0] = -1 * df['WIN_STREAK']
349 | 
350 |     #streak of games played at home/away, make negative if away streak
351 |     df['HOME_AWAY_STREAK'] = df['TEAM1_home'].groupby((df['TEAM1_home'].shift() != df.groupby(['TEAM1'])['TEAM1_home'].shift(2)).cumsum()).cumcount() + 1
352 | 
353 |     # if team1 played the last game of the streak away, then the streak must be an away streak. make it negative
354 |     df['HOME_AWAY_STREAK'].loc[df['TEAM1_home'].shift() == 0] = -1 * df['HOME_AWAY_STREAK']
355 | 
356 |     #rolling means
357 | 
358 |     feature_list = ['TEAM1_win', 'PTS', 'FG_PCT', 'FT_PCT', 'FG3_PCT', 'AST', 'REB']
359 | 
360 |     #create new feature names based upon rolling period
361 | 
362 |     roll_feature_list =[]
363 | 
364 |     for feature in feature_list:
365 |         for roll in roll_list:
366 |             roll_feature_name = feature + '_AVG_LAST_' + str(roll) + '_ALL'
367 |             roll_feature_list.append(roll_feature_name)
368 |             df[roll_feature_name] = df.groupby(['TEAM1'])[feature].rolling(roll, closed= "left").mean().values
369 | 
370 |     # determine league avg for each stat and then subtract it from each team's average
371 |     # as a measure of how well that team compares to all teams in that moment in time
372 | 
373 |     #remove win averages from roll list - the league average will always be 0.5 (half the teams win, half lose)
374 |     roll_feature_list = [x for x in roll_feature_list if not x.startswith('TEAM1_win')]
375 | 
376 |     df = process_x_minus_league_avg(df, roll_feature_list, 'TEAM1')
377 | 
378 | 
379 |     return df
380 | 
381 | 
382 | 
383 | def process_x_minus_league_avg(df: pd.DataFrame, feature_list: list, team_feature: str)-> pd.DataFrame:
384 |     """
385 |     Calculate the league average for every day of the season and then subtract the league average of each stat from the team's current stat for that day.
386 | 
387 |     This provides a measure of how good the team is compared to the rest of the league at that moment in time.
388 | 
389 |     Args:
390 |         df (pd.DataFrame): the dataframe to process
391 |         feature_list (list): list of features to be used for subtraction, e.g. [PTS_AVG_LAST_5_ALL, REB_AVG_LAST_20_ALL]
392 |         team_feature (str): the team's role (subset of data) that is being worked upon ("HOME_TEAM_ID", "VISITOR_TEAM_ID", or "TEAM1" for all roles)
393 | 
394 |     Returns:
395 |         the processed dataframe
396 | 
397 |     """
398 | 
399 |     # create a temp dataframe so that every date can be front-filled
400 |     # we need the current average for all 30 teams for every day during the season
401 |     # whether that team played or not.
402 |     # We will front-fill from previous days to ensure that every day has stats for every team
403 | 
404 |     #df.to_csv("df.csv",index=False) #debugging output only
405 | 
406 |     # create feature list for temp dataframe to hold league averages
407 |     temp_feature_list = feature_list.copy()
408 |     temp_feature_list.append(team_feature)
409 |     temp_feature_list.append("GAME_DATE_EST")
410 | 
411 |     df_temp = df[temp_feature_list]
412 |     #print(temp_feature_list) #debugging output only
413 |     #df_temp.to_csv("df_temp.csv",index=False) #debugging output only
414 | 
415 | 
416 |     # populate the dataframe with all days played and forward fill previous value if a particular team did not play that day
417 |     # https://stackoverflow.com/questions/70362869
418 |     df_temp = (df_temp.set_index('GAME_DATE_EST',)
419 |                       .groupby([team_feature])[feature_list]
420 |                       .apply(lambda x: x.asfreq('d', method = "ffill"))
421 |                       .reset_index()
422 |                       [temp_feature_list]
423 |               )
424 | 
425 |     # find the average across all teams for each day
426 |     df_temp = df_temp.groupby(['GAME_DATE_EST'])[feature_list].mean().reset_index()
427 | 
428 |     # rename features for merging
429 |     df_temp = df_temp.add_suffix('_LEAGUE_AVG')
430 |     temp_features = df_temp.columns
431 | 
432 |     # merge all-team averages with each record so that they can be subtracted
433 |     df = df.sort_values(by = 'GAME_DATE_EST', axis=0, ascending= True, ignore_index=True)
434 |     df = pd.merge(df, df_temp, left_on='GAME_DATE_EST', right_on='GAME_DATE_EST_LEAGUE_AVG', how="left",)
435 |     # subtract league average for each feature
436 |     for feature in feature_list:
437 |         df[feature + "_MINUS_LEAGUE_AVG"] = df[feature] - df[feature + "_LEAGUE_AVG"]
438 | 
439 |     # drop temp features that were only used for subtraction
440 |     df = df.drop(temp_features, axis = 1)
441 | 
442 |     return df
443 | 
444 | 
445 | def combine_new_features(df: pd.DataFrame, df_consecutive: pd.DataFrame)-> pd.DataFrame:
446 |     """
447 |     Re-combine the features created in the consecutive dataframe back into the main dataframe.
448 | 
449 |     The consecutive dataframe was used to derive features regardless of whether the team was home or away, and now we need to add those features back to the main dataframe.
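    For example, a consecutive-frame feature such as WIN_STREAK comes back twice: as WIN_STREAK_x
    for the home team and as WIN_STREAK_y for the visitor team, while the MATCHUP_* features are
    attached only once (to the home side).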
450 | 
451 |     Args:
452 |         df (pd.DataFrame): the main dataframe where each row is a game with both a home team and a visitor team
453 |         df_consecutive (pd.DataFrame): the dataframe where each row is a game with only one team (either home or visitor)
454 | 
455 |     Returns:
456 |         the merged dataframe
457 |     """
458 | 
459 | 
460 |     # add back all the new features created in the consecutive dataframe to the main dataframe
461 |     # all data for TEAM1 will be applied to the home team and then again to the visitor team
462 |     # except for head-to-head MATCHUP data, which will only be applied to home team (redundant to include for both)
463 |     # the letter '_x' will be appended to feature names when adding to home team
464 |     # the letter '_y' will be appended to feature names when adding to visitor team
465 |     # to match the existing convention in the dataset
466 | 
467 |     #first select out the new features
468 |     all_features = df_consecutive.columns.tolist()
469 |     link_features = ['GAME_ID', 'TEAM1', ]
470 |     redundant_features = ['GAME_DATE_EST','TEAM1_home','TEAM1_win','TEAM2','SEASON','PTS', 'FG_PCT', 'FT_PCT', 'FG3_PCT', 'AST', 'REB',]
471 |     matchup_features = [x for x in all_features if "MATCHUP" in x]
472 |     ignore_features = link_features + redundant_features
473 | 
474 |     new_features = [x for x in all_features if x not in ignore_features]
475 | 
476 |     # first home teams
477 | 
478 |     df1 = df_consecutive[df_consecutive['TEAM1_home'] == 1]
479 |     #add "_x" to new features
480 |     df1.columns = [x + '_x' if x in new_features else x for x in df1.columns]
481 |     #drop features that don't need to be merged
482 |     df1 = df1.drop(redundant_features,axis=1)
483 |     #change TEAM1 to HOME_TEAM_ID for easy merging
484 |     df1 = df1.rename(columns={'TEAM1': 'HOME_TEAM_ID'})
485 |     df = pd.merge(df, df1, how="left", on=["GAME_ID", "HOME_TEAM_ID"])
486 | 
487 |     #don't include matchup features for visitor team since they are equivalent for both home and visitor
488 |     new_features = [x for x in new_features if x not in matchup_features]
489 |     df_consecutive = df_consecutive.drop(matchup_features,axis=1)
490 | 
491 |     # next visitor teams
492 | 
493 |     df2 = df_consecutive[df_consecutive['TEAM1_home'] == 0]
494 |     #add "_y" to new features
495 |     df2.columns = [x + '_y' if x in new_features else x for x in df2.columns]
496 |     #drop features that don't need to be merged
497 |     df2 = df2.drop(redundant_features,axis=1)
498 |     #change TEAM1 to VISITOR_TEAM_ID for easy merging
499 |     df2 = df2.rename(columns={'TEAM1': 'VISITOR_TEAM_ID'})
500 |     df = pd.merge(df, df2, how="left", on=["GAME_ID", "VISITOR_TEAM_ID"])
501 | 
502 |     return df
503 | 
504 | 
505 | def process_x_minus_y(df: pd.DataFrame)-> pd.DataFrame:
506 |     """
507 |     Subtract visitor team rolling stats from home rolling stats.
508 | 
509 |     This may (or may not) be useful for the model to explicitly see the difference between the two teams. GBM models may be able to handle this automatically, but other models may not.
510 | 
511 |     Args:
512 |         df (pd.DataFrame): the dataframe to process
513 | 
514 |     Returns:
515 |         the processed dataframe
516 |     """
517 | 
518 |     # field_x - field_y
519 | 
520 |     # remove the current game's stats since they are data leaks - we don't know these until after the game is played
521 |     useful_features = remove_non_rolling(df)
522 | 
523 |     comparison_features = [x for x in useful_features if "_y" in x]
524 | 
525 |     #don't include redundant features. (x - league_avg) - (y - league_avg) = x-y
526 |     comparison_features = [x for x in comparison_features if "_MINUS_LEAGUE_AVG" not in x]
527 | 
528 |     for feature in comparison_features:
529 |         feature_base = feature[:-2] #remove "_y" from the end
530 |         df[feature_base + "_x_minus_y"] = df[feature_base + "_x"] - df[feature_base + "_y"]
531 | 
532 |     #df = df.drop("CONFERENCE_x_minus_y") #category variable not meaningful?
533 | 
534 |     return df
535 | 
536 | 
537 | def remove_non_rolling(df: pd.DataFrame) -> list:
538 |     """
539 |     Returns a list of columns in a dataframe with the current game's stats removed, leaving only rolling averages and streaks
540 | 
541 |     Args:
542 |         df (pd.DataFrame): the dataframe to process
543 | 
544 |     Returns:
545 |         list: only the columns that are rolling averages and streaks
546 |     """
547 | 
548 |     # remove non-rolling features - these are data leaks
549 |     # they are stats from the actual game that decides winner/loser,
550 |     # but we don't know these stats before a game is played
551 | 
552 |     # These must be retained in the database to recalculate rolling avgs and streaks in the future,
553 |     # so are filtered out as appropriate instead of deleted
554 | 
555 |     drop_columns =[]
556 | 
557 |     all_columns = df.columns.tolist()
558 | 
559 |     drop_columns1 = ['HOME_TEAM_WINS', 'PTS_home', 'FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home']
560 |     drop_columns2 = ['PTS_away', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away']
561 | 
562 |     drop_columns = drop_columns + drop_columns1
563 |     drop_columns = drop_columns + drop_columns2
564 | 
565 |     use_columns = [item for item in all_columns if item not in drop_columns]
566 | 
567 |     return use_columns
568 | 
--------------------------------------------------------------------------------
/src/hopsworks_utils.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import json
3 | import hopsworks
4 | 
5 | from datetime import datetime, timedelta
6 | 
7 | def save_feature_names(df: pd.DataFrame) -> str:
8 |     """
9 |     Saves dictionary of {lower case feature names : original mixed-case feature names} to JSON file
10 | 
11 |     Args:
12 |         df (pd.DataFrame): the dataframe with the features to be saved
13 | 
14 |     Returns:
15 |         "File Saved."
16 |     """
17 | 
18 |     # hopsworks "sanitizes" feature names by converting to all lowercase
19 |     # this function saves the original so that they can be re-mapped later
20 |     # for code re-usability
21 | 
22 |     original_f_names = df.columns.tolist()
23 |     hopsworks_f_names = [x.lower() for x in original_f_names]
24 | 
25 |     # create a dictionary
26 |     feature_mapper = {hopsworks_f_names[i]: original_f_names[i] for i in range(len(hopsworks_f_names))}
27 | 
28 |     with open("feature_names.json", "w") as fp:
29 |         json.dump(feature_mapper, fp)
30 | 
31 |     return "File Saved."
32 | 
33 | 
34 | def convert_feature_names(df: pd.DataFrame) -> pd.DataFrame:
35 |     """
36 |     Converts hopsworks.ai lower-case feature names back to original mixed-case feature names that have been saved in JSON file
37 | 
38 |     Args:
39 |         df (pd.DataFrame): the dataframe with features in all lower-case
40 | 
41 |     Returns:
42 |         pd.DataFrame: the dataframe with features in original mixed-case
43 |     """
44 | 
45 |     # hopsworks converts all feature names to lower-case, while the original feature names use mixed-case
46 |     # converting these back to original format is needed for optimal code re-usability.
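    # e.g., the stored mapping turns "game_date_est" back into "GAME_DATE_EST" and "pts_home" back into "PTS_home"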
47 | 
48 |     # read in list of original feature names
49 |     with open('feature_names.json', 'rb') as fp:
50 |         feature_mapper = json.load(fp)
51 | 
52 |     df = df.rename(columns=feature_mapper)
53 | 
54 |     return df
55 | 
56 | 
57 | def create_train_test_data(HOPSWORKS_API_KEY:str, STARTDATE:str, DAYS:int) -> tuple[pd.DataFrame, pd.DataFrame]:
58 |     """
59 |     Returns train and test data from Hopsworks.ai feature store based upon how many DAYS back to use as test data
60 | 
61 |     Args:
62 |         HOPSWORKS_API_KEY (str): subscription key for Hopsworks.ai
63 |         STARTDATE (str): start date for train data, format YYYY-MM-DD
64 |         DAYS (int): number of days back to use as test data, the train data will be all data except the last DAYS days
65 | 
66 |     Returns:
67 |         Train and Test data as pandas dataframes
68 |     """
69 | 
70 |     # log into hopsworks.ai and create a feature view object from the feature group
71 |     # the api makes it easier to retrieve training/test data from a feature view than a feature group
72 | 
73 |     project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
74 |     fs = project.get_feature_store()
75 | 
76 |     rolling_stats_fg = fs.get_or_create_feature_group(
77 |         name="rolling_stats",
78 |         version=2,
79 |     )
80 | 
81 |     query = rolling_stats_fg.select_all()
82 | 
83 |     feature_view = fs.create_feature_view(
84 |         name = 'rolling_stats_fv',
85 |         version = 2,
86 |         query = query
87 |     )
88 | 
89 |     # calculate the start and end dates for the train and test data and then retrieve the data from the feature view
90 |     # the train data will be all data except the last DAYS days
91 | 
92 | 
93 |     TODAY = datetime.now()
94 |     LASTYEAR = (TODAY - timedelta(days=DAYS)).strftime('%Y-%m-%d')
95 |     TODAY = TODAY.strftime('%Y-%m-%d')
96 | 
97 |     td_train, td_job = feature_view.create_training_data(
98 |         start_time=STARTDATE,
99 |         end_time=LASTYEAR,
100 |         description='All data except last ' + str(DAYS) + ' days',
101 |         data_format="csv",
102 |         coalesce=True,
103 |         write_options={'wait_for_job': False},
104 |     )
105 | 
106 |     td_test, td_job = feature_view.create_training_data(
107 |         start_time=LASTYEAR,
108 |         end_time=TODAY,
109 |         description='Last ' + str(DAYS) + ' days',
110 |         data_format="csv",
111 |         coalesce=True,
112 |         write_options={'wait_for_job': False},
113 |     )
114 | 
115 |     train = feature_view.get_training_data(td_train)[0]
116 |     test = feature_view.get_training_data(td_test)[0]
117 | 
118 |     # hopsworks converts all feature names to lower-case, while the original feature names use mixed-case
119 |     # converting these back to original format is needed for optimal code re-usability.
120 | 
121 |     train = convert_feature_names(train)
122 |     test = convert_feature_names(test)
123 | 
124 |     # fix date format (truncate to YYYY-MM-DD)
125 |     train["GAME_DATE_EST"] = train["GAME_DATE_EST"].str[:10]
126 |     test["GAME_DATE_EST"] = test["GAME_DATE_EST"].str[:10]
127 | 
128 |     # feature view is no longer needed, so delete it
129 |     feature_view.delete()
130 | 
131 |     return train, test
132 | 
133 | 
--------------------------------------------------------------------------------
/src/model_training.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | 
4 | from collections import defaultdict
5 | 
6 | import matplotlib.pyplot as plt
7 | from matplotlib.gridspec import GridSpec
8 | 
9 | from sklearn.calibration import (
10 |     CalibrationDisplay,
11 | )
12 | 
13 | from sklearn.metrics import (
14 |     precision_score,
15 |     recall_score,
16 |     f1_score,
17 |     brier_score_loss,
18 |     log_loss,
19 |     roc_auc_score,
20 | )
21 | 
22 | 
23 | def encode_categoricals(df: pd.DataFrame, category_columns: list, MODEL_NAME: str, ENABLE_CATEGORICAL: bool) -> pd.DataFrame:
24 |     """
25 |     Encode categorical features as integers for use in XGBoost and LightGBM
26 | 
27 |     Args:
28 |         df (pd.DataFrame): the dataframe to process
29 |         category_columns (list): list of columns to encode as categorical
30 |         MODEL_NAME (str): the name of the model being used
31 |         ENABLE_CATEGORICAL (bool): whether or not to enable categorical features in the model
32 | 
33 |     Returns:
34 |         the dataframe with categorical features encoded
35 | 
36 | 
37 |     """
38 | 
39 |     # To use special category feature capabilities in XGB and LGB, categorical features must be ints from 0 to N-1
40 |     # Conversion can be accomplished by simple subtraction for several features
41 |     # (these category capabilities may or may not be used, but encoding does not hurt anything)
42 | 
43 |     first_team_ID = df['HOME_TEAM_ID'].min()
44 |     first_season = df['SEASON'].min()
45 | 
46 |     # subtract lowest value from each to create a range of 0 thru N-1
47 |     df['HOME_TEAM_ID'] = (df['HOME_TEAM_ID'] - first_team_ID).astype('int8') #team ID - 1610612737 = 0 thru 29
48 |     df['VISITOR_TEAM_ID'] = (df['VISITOR_TEAM_ID'] - first_team_ID).astype('int8')
49 |     df['SEASON'] = (df['SEASON'] - first_season).astype('int8')
50 | 
51 |     # if xgb experimental categorical capabilities are to be used, then features must be of category type
52 |     if MODEL_NAME == "xgboost":
53 |         if ENABLE_CATEGORICAL:
54 |             for field in category_columns:
55 |                 df[field] = df[field].astype('category')
56 | 
57 |     return df
58 | 
59 | 
60 | def plot_calibration_curve(clf_list: list, X_train: pd.DataFrame, y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame, n_bins: int = 10) -> None:
61 |     """
62 |     Plots calibration curves for a list of classifiers vs ideal probability distribution
63 | 
64 |     Args:
65 |         clf_list (list): the classifiers to plot
66 |         X_train (pd.DataFrame): training data
67 |         y_train (pd.DataFrame): labels for training data
68 |         X_test (pd.DataFrame): test data
69 |         y_test (pd.DataFrame): labels for test data
70 |         n_bins (int, optional): how many bins to use for calibration. Defaults to 10.
71 | 
72 |     Returns:
73 |         None
74 | 
75 |     FROM: https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html
76 |     """
77 | 
78 | 
79 |     fig = plt.figure(figsize=(10, 10))
80 |     gs = GridSpec(4, 2)
81 |     colors = plt.cm.get_cmap("Dark2")
82 | 
83 |     ax_calibration_curve = fig.add_subplot(gs[:2, :2])
84 |     calibration_displays = {}
85 |     for i, (clf, name) in enumerate(clf_list):
86 |         clf.fit(X_train, y_train)
87 |         display = CalibrationDisplay.from_estimator(
88 |             clf,
89 |             X_test,
90 |             y_test,
91 |             n_bins=n_bins,
92 |             name=name,
93 |             ax=ax_calibration_curve,
94 |             color=colors(i),
95 |         )
96 |         calibration_displays[name] = display
97 | 
98 |     ax_calibration_curve.grid()
99 |     ax_calibration_curve.set_title(f"Calibration plots (bins = {n_bins})")
100 | 
101 |     # Add histogram
102 |     grid_positions = [(2, 0), (2, 1), (3, 0), (3, 1)]
103 |     for i, (_, name) in enumerate(clf_list):
104 |         row, col = grid_positions[i]
105 |         ax = fig.add_subplot(gs[row, col])
106 | 
107 |         ax.hist(
108 |             calibration_displays[name].y_prob,
109 |             range=(0, 1),
110 |             bins=n_bins,
111 |             label=name,
112 |             color=colors(i),
113 |         )
114 |         ax.set(title=name, xlabel="Mean predicted probability", ylabel="Count")
115 | 
116 |     plt.tight_layout()
117 |     plt.show()
118 | 
119 |     return
120 | 
121 | 
122 | def calculate_classification_metrics(clf_list: list, X_train: pd.DataFrame, y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame) -> tuple[pd.DataFrame, list]:
123 |     """
124 |     Calculates classification metrics for a list of classifiers. Brier score, log loss, precision, recall, f1, and roc_auc are calculated.
125 | 
126 |     Args:
127 |         clf_list (list): the classifiers to calculate metrics for
128 |         X_train (pd.DataFrame): training data
129 |         y_train (pd.DataFrame): labels for training data
130 |         X_test (pd.DataFrame): test data
131 |         y_test (pd.DataFrame): labels for test data
132 | 
133 |     Returns:
134 |         tuple: (dataframe) of the metrics and (list) containing the fitted models and names of the models as strings
135 | 
136 |     FROM: https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html
137 |     """
138 | 
139 |     # fit each classifier, then score its probability and label predictions on the test set
140 | 
141 |     scores = defaultdict(list)
142 | 
143 |     for i, (clf, name) in enumerate(clf_list):
144 |         clf.fit(X_train, y_train)
145 |         y_prob = clf.predict_proba(X_test)
146 |         y_pred = clf.predict(X_test)
147 |         scores["Classifier"].append(name)
148 | 
149 |         for metric in [brier_score_loss, log_loss]:
150 |             score_name = metric.__name__.replace("_", " ").replace("score", "").capitalize()
151 |             scores[score_name].append(metric(y_test, y_prob[:, 1]))
152 | 
153 |         for metric in [precision_score, recall_score, f1_score, roc_auc_score]:
154 |             score_name = metric.__name__.replace("_", " ").replace("score", "").capitalize()
155 |             scores[score_name].append(metric(y_test, y_pred))
156 | 
157 |         score_df = pd.DataFrame(scores).set_index("Classifier")
158 |         score_df = score_df.round(decimals=3) #assign the rounded result back; round() is not in-place
159 | 
160 |         # update clf_list with the trained model
161 |         clf_list[i] = (clf, name)
162 | 
163 | 
164 |     return score_df, clf_list
--------------------------------------------------------------------------------
/src/optuna_objectives.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | 
4 | import xgboost as xgb
5 | 
6 | import lightgbm as lgb
7 | from lightgbm import (
8 |     early_stopping,
9 |     log_evaluation,
10 | )
11 | 
12 | from sklearn.model_selection import (
13 |     StratifiedKFold,
14 |     TimeSeriesSplit,
15 | )
16 | 
17 | from sklearn.metrics import (
/src/optuna_objectives.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | 
4 | import xgboost as xgb
5 | 
6 | import lightgbm as lgb
7 | from lightgbm import (
8 |     early_stopping,
9 |     log_evaluation,
10 | )
11 | 
12 | from sklearn.model_selection import (
13 |     StratifiedKFold,
14 |     TimeSeriesSplit,
15 | )
16 | 
17 | from sklearn.metrics import (
18 |     accuracy_score,
19 |     roc_auc_score,
20 | )
21 | 
22 | 
23 | 
24 | def XGB_objective(trial, train, target, STATIC_PARAMS, ENABLE_CATEGORICAL, NUM_BOOST_ROUND, OPTUNA_CV, OPTUNA_FOLDS, SEED):
25 | 
26 |     # out-of-fold predictions, filled in fold by fold during cross-validation
27 |     train_oof = np.zeros((train.shape[0],))
28 | 
29 |     xgb_params = {
30 |         'num_round': trial.suggest_int('num_round', 2, 1000),
31 |         'learning_rate': trial.suggest_float('learning_rate', 1E-3, 1),
32 |         'max_bin': trial.suggest_int('max_bin', 2, 1000),
33 |         'max_depth': trial.suggest_int('max_depth', 1, 8),
34 |         'alpha': trial.suggest_float('alpha', 1E-16, 12),
35 |         'gamma': trial.suggest_float('gamma', 1E-16, 12),
36 |         'reg_lambda': trial.suggest_float('reg_lambda', 1E-16, 12),
37 |         'colsample_bytree': trial.suggest_float('colsample_bytree', 1E-16, 1.0),
38 |         'subsample': trial.suggest_float('subsample', 1E-16, 1.0),
39 |         'min_child_weight': trial.suggest_float('min_child_weight', 1E-16, 12),
40 |         'scale_pos_weight': trial.suggest_int('scale_pos_weight', 1, 15),
41 |     }
42 | 
43 |     xgb_params = xgb_params | STATIC_PARAMS
44 | 
45 |     #pruning_callback = optuna.integration.XGBoostPruningCallback(trial, "evaluation-auc")
46 | 
47 |     if OPTUNA_CV == "StratifiedKFold":
48 |         kf = StratifiedKFold(n_splits=OPTUNA_FOLDS, shuffle=True, random_state=SEED)
49 |     elif OPTUNA_CV == "TimeSeriesSplit":
50 |         kf = TimeSeriesSplit(n_splits=OPTUNA_FOLDS)
51 | 
52 |     for f, (train_ind, val_ind) in enumerate(kf.split(train, target)):
53 | 
54 |         train_df, val_df = train.iloc[train_ind], train.iloc[val_ind]
55 |         train_target, val_target = target[train_ind], target[val_ind]
56 | 
57 |         train_dmatrix = xgb.DMatrix(train_df, label=train_target, enable_categorical=ENABLE_CATEGORICAL)
58 |         val_dmatrix = xgb.DMatrix(val_df, label=val_target, enable_categorical=ENABLE_CATEGORICAL)
59 | 
60 |         model = xgb.train(xgb_params,
61 |                           train_dmatrix,
62 |                           num_boost_round=NUM_BOOST_ROUND,
63 |                           #callbacks=[pruning_callback],
64 |                           )
65 | 
66 |         temp_oof = model.predict(val_dmatrix)
67 |         train_oof[val_ind] = temp_oof
68 | 
69 |         #print(roc_auc_score(val_target, temp_oof))
70 | 
71 |     # score the full out-of-fold predictions against the true labels
72 |     val_score = roc_auc_score(target, train_oof)
73 | 
74 |     return val_score
75 | 
76 | 
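# Illustrative wiring (a sketch, not part of the module): each objective is meant to be
# partially applied and handed to an Optuna study. The STATIC_PARAMS values, fold count,
# and trial count below are assumptions for illustration:
#
#   import optuna
#   from functools import partial
#
#   objective = partial(XGB_objective, train=train, target=target,
#                       STATIC_PARAMS={'objective': 'binary:logistic', 'eval_metric': 'auc'},
#                       ENABLE_CATEGORICAL=True, NUM_BOOST_ROUND=500,
#                       OPTUNA_CV="StratifiedKFold", OPTUNA_FOLDS=5, SEED=42)
#   study = optuna.create_study(direction="maximize")
#   study.optimize(objective, n_trials=100)
#
# LGB_objective below is wired the same way, with its extra category_columns and
# EARLY_STOPPING arguments.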
77 | def LGB_objective(trial, train, target, category_columns, STATIC_PARAMS, ENABLE_CATEGORICAL, NUM_BOOST_ROUND, OPTUNA_CV, OPTUNA_FOLDS, SEED, EARLY_STOPPING):
78 | 
79 |     # out-of-fold predictions, filled in fold by fold during cross-validation
80 |     train_oof = np.zeros((train.shape[0],))
81 | 
82 |     lgb_params = {
83 |         "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
84 |         "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
85 |         "learning_rate": trial.suggest_float('learning_rate', 1e-4, 0.5, log=True),
86 |         "max_depth": trial.suggest_categorical('max_depth', [5, 10, 20, 40, 100, -1]),
87 |         "n_estimators": trial.suggest_int("n_estimators", 50, 200000),
88 |         "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
89 |         "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
90 |         "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
91 |         "num_leaves": trial.suggest_int("num_leaves", 2, 1000),
92 |         "min_child_samples": trial.suggest_int("min_child_samples", 5, 300),
93 |         "cat_smooth": trial.suggest_int('min_data_per_groups', 1, 100)
94 |     }
95 | 
96 |     lgb_params = lgb_params | STATIC_PARAMS
97 | 
98 |     #pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "auc")
99 | 
100 |     if OPTUNA_CV == "StratifiedKFold":
101 |         kf = StratifiedKFold(n_splits=OPTUNA_FOLDS, shuffle=True, random_state=SEED)
102 |     elif OPTUNA_CV == "TimeSeriesSplit":
103 |         kf = TimeSeriesSplit(n_splits=OPTUNA_FOLDS)
104 | 
105 |     for f, (train_ind, val_ind) in enumerate(kf.split(train, target)):
106 | 
107 |         train_df, val_df = train.iloc[train_ind], train.iloc[val_ind]
108 |         train_target, val_target = target[train_ind], target[val_ind]
109 | 
110 |         train_lgbdataset = lgb.Dataset(train_df, label=train_target, categorical_feature=category_columns)
111 |         val_lgbdataset = lgb.Dataset(val_df, label=val_target, reference=train_lgbdataset, categorical_feature=category_columns)
112 | 
113 |         model = lgb.train(lgb_params,
114 |                           train_lgbdataset,
115 |                           valid_sets=val_lgbdataset,
116 |                           #num_boost_round = NUM_BOOST_ROUND,
117 |                           callbacks=[#log_evaluation(LOG_EVALUATION),
118 |                                      early_stopping(EARLY_STOPPING, verbose=False),
119 |                                      #pruning_callback,
120 |                                      ]
121 |                           #verbose_eval= VERBOSE_EVAL,
122 |                           )
123 | 
124 |         temp_oof = model.predict(val_df)
125 |         train_oof[val_ind] = temp_oof
126 | 
127 |         #print(roc_auc_score(val_target, temp_oof))
128 | 
129 |     # score the full out-of-fold predictions against the true labels
130 |     val_score = roc_auc_score(target, train_oof)
131 | 
132 |     return val_score
--------------------------------------------------------------------------------
/src/streamlit_app.py:
--------------------------------------------------------------------------------
1 | import os
2 | 
3 | import streamlit as st
4 | import hopsworks
5 | import joblib
6 | import pandas as pd
7 | import numpy as np
8 | import json
9 | import time
10 | from datetime import timedelta, datetime
11 | import xgboost as xgb
12 | 
13 | from src.hopsworks_utils import (
14 |     convert_feature_names,
15 | )
16 | 
17 | from src.feature_engineering import (
18 |     fix_datatypes,
19 |     remove_non_rolling,
20 | )
21 | 
22 | 
23 | # Load Hopsworks API key from .env file
24 | 
25 | from dotenv import load_dotenv
26 | 
27 | load_dotenv()
28 | 
29 | try:
30 |     HOPSWORKS_API_KEY = os.environ['HOPSWORKS_API_KEY']
31 | except KeyError:
32 |     raise Exception('Set environment variable HOPSWORKS_API_KEY')
33 | 
34 | 
35 | 
36 | def fancy_header(text, font_size=24):
37 |     res = f'<span style="font-size: {font_size}px;">{text}</span>'  # wrap the text in HTML so the font size is applied via st.markdown
38 |     st.markdown(res, unsafe_allow_html=True)
39 | 
40 | def get_model(project, model_name, evaluation_metric, sort_metrics_by):
41 |     """Retrieve desired model from the Hopsworks Model Registry."""
42 | 
43 |     mr = project.get_model_registry()
44 |     # get best model based on custom metrics
45 |     model = mr.get_best_model(model_name,
46 |                               evaluation_metric,
47 |                               sort_metrics_by)
48 | 
49 |     # download model from Hopsworks
50 |     #model_dir = model.download()
51 |     #print(model_dir)
52 |     with open("model.pkl", 'rb') as f:
53 |         loaded_model = joblib.load(f)
54 | 
55 | 
56 |     return loaded_model
57 | 
58 | 
59 | # dictionary to convert team ids to team names
60 | 
61 | nba_team_names = {
62 |     1610612737: "Atlanta Hawks",
63 |     1610612738: "Boston Celtics",
64 |     1610612739: "Cleveland Cavaliers",
65 |     1610612740: "New Orleans Pelicans",
66 |     1610612741: "Chicago Bulls",
67 |     1610612742: "Dallas Mavericks",
68 |     1610612743: "Denver Nuggets",
69 |     1610612744: "Golden State Warriors",
70 |     1610612745: "Houston Rockets",
71 |     1610612746: "LA Clippers",
72 |     1610612754: "Indiana Pacers",
73 |     1610612747: "Los Angeles Lakers",
74 |     1610612763: "Memphis Grizzlies",
75 |     1610612748: "Miami Heat",
76 |     1610612749: "Milwaukee Bucks",
77 |     1610612750: "Minnesota Timberwolves",
78 |     1610612751: "Brooklyn Nets",
79 |     1610612752: "New York Knicks",
80 |     1610612753: "Orlando Magic",
81 |     1610612755: "Philadelphia 76ers",
82 |     1610612756: "Phoenix Suns",
83 |     1610612757: "Portland Trail Blazers",
84 |     1610612758: "Sacramento Kings",
85 |     1610612759: "San Antonio Spurs",
86 |     1610612760: "Oklahoma City Thunder",
87 |     1610612761: "Toronto Raptors",
88 |     1610612762: "Utah Jazz",
89 |     1610612764: "Washington Wizards",
90 |     1610612765: "Detroit Pistons",
91 |     1610612766: "Charlotte Hornets",
92 | }
93 | 
94 | # Streamlit app
95 | st.title('NBA Prediction Project')
96 | 
97 | st.sidebar.header('⚙️ Working Progress')
98 | progress_bar = st.sidebar.progress(0)
99 | st.write(36 * "-")
100 | fancy_header('\n📡 Connecting to Hopsworks Feature Store...')
101 | 
102 | 
103 | # Connect to Hopsworks Feature Store and get Feature Group
104 | project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
105 | fs = project.get_feature_store()
106 | 
107 | rolling_stats_fg = fs.get_feature_group(
108 |     name="rolling_stats",
109 |     version=2,
110 | )
111 | 
112 | st.write("Successfully connected!✔️")
113 | progress_bar.progress(20)
114 | 
115 | 
116 | 
117 | # Get data from Feature Store
118 | st.write(36 * "-")
119 | fancy_header('\n☁️ Retrieving data from Feature Store...')
120 | 
121 | # filter new games that are scheduled for today
122 | # these are games where no points have been scored yet
123 | ds_query = rolling_stats_fg.filter(rolling_stats_fg.pts_home == 0)
124 | df_todays_matches = ds_query.read()
125 | 
126 | if df_todays_matches.shape[0] == 0:
127 |     progress_bar.progress(100)
128 |     st.write()
129 |     fancy_header('\n 🤷‍♂️ No games scheduled for today! 🤷‍♂️')
130 |     st.write()
131 |     st.write("Try again tomorrow!")
132 |     st.write()
133 |     st.write("The NBA season and postseason usually run from October to June.")
134 |     st.stop()
135 | 
136 | st.write("Successfully retrieved!✔️")
137 | progress_bar.progress(40)
138 | print(df_todays_matches.head(5))
139 | 
140 | 
141 | # Prepare data for prediction
142 | st.write(36 * "-")
143 | fancy_header('\n☁️ Processing data for prediction...')
144 | 
145 | # convert feature names back to mixed case
146 | df_todays_matches = convert_feature_names(df_todays_matches)
147 | 
148 | # Add a column that displays the matchup using the team names
149 | # this will make the display more meaningful
150 | df_todays_matches['MATCHUP'] = df_todays_matches['VISITOR_TEAM_ID'].map(nba_team_names) + " @ " + df_todays_matches['HOME_TEAM_ID'].map(nba_team_names)
151 | 
152 | # fix date and other types
153 | df_todays_matches = fix_datatypes(df_todays_matches)
154 | 
155 | # remove features not used by the model
156 | drop_columns = ['TARGET', 'GAME_DATE_EST', 'GAME_ID', ]
157 | df_todays_matches = df_todays_matches.drop(drop_columns, axis=1)
158 | 
159 | # remove stats from today's games - these are blank (the games haven't been played) and are not used by the model
160 | use_columns = remove_non_rolling(df_todays_matches)
161 | X = df_todays_matches[use_columns]
162 | 
163 | # MATCHUP is just for informational display, not used by the model
164 | X = X.drop('MATCHUP', axis=1)
165 | 
166 | #X_dmatrix = xgb.DMatrix(X) # convert to DMatrix for XGBoost
167 | 
168 | st.write(df_todays_matches['MATCHUP'])
169 | 
170 | st.write("Successfully processed!✔️")
171 | progress_bar.progress(60)
172 | 
173 | 
174 | # Load model from Hopsworks Model Registry
175 | st.write(36 * "-")
176 | fancy_header("Loading Best Model...")
177 | 
178 | model = get_model(project=project,
179 |                   model_name="xgboost",
180 |                   evaluation_metric="AUC",
181 |                   sort_metrics_by="max")
182 | 
183 | st.write("Successfully loaded!✔️")
184 | progress_bar.progress(80)
185 | 
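# A quick sketch of the model interface assumed below (assumption: the registry model
# was saved with the sklearn API, e.g. xgb.XGBClassifier, so predict_proba is available;
# a native Booster would use the commented-out DMatrix path instead):
#
#   preds = model.predict_proba(X)   # shape (n_games, 2)
#   preds[:, 1]                      # column 1 = probability that the home team wins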
186 | 
187 | 
188 | # Predict winning probabilities of home team
189 | st.write(36 * "-")
190 | fancy_header("Predicting Winning Probabilities...")
191 | 
192 | 
193 | #preds = model.predict(X_dmatrix)
194 | preds = model.predict_proba(X)[:,1]
195 | 
196 | df_todays_matches['HOME_TEAM_WIN_PROBABILITY'] = preds
197 | 
198 | st.dataframe(df_todays_matches[['MATCHUP', 'HOME_TEAM_WIN_PROBABILITY']])
199 | 
200 | progress_bar.progress(100)
201 | st.button("Re-run")
--------------------------------------------------------------------------------
/src/webscraping.py:
--------------------------------------------------------------------------------
1 | 
2 | import pandas as pd
3 | import numpy as np
4 | 
5 | import os
6 | 
7 | import asyncio
8 | 
9 | # if using scrapingant, import these
10 | from scrapingant_client import ScrapingAntClient
11 | 
12 | # if using selenium and chrome, import these
13 | from selenium import webdriver
14 | from selenium.webdriver.chrome.service import Service as ChromiumService
15 | from selenium.webdriver.common.by import By
16 | from selenium.webdriver.chrome.options import Options
17 | from webdriver_manager.core.utils import ChromeType
18 | from webdriver_manager.chrome import ChromeDriverManager
19 | 
20 | # if using selenium and firefox, import these
21 | from selenium.webdriver.firefox.service import Service as FirefoxService
22 | from webdriver_manager.firefox import GeckoDriverManager
23 | 
24 | 
25 | from bs4 import BeautifulSoup as soup
26 | 
27 | from datetime import datetime, timedelta
28 | from pytz import timezone
29 | 
30 | from pathlib import Path  # for Windows/Linux compatibility
31 | DATAPATH = Path(r'data')
32 | 
33 | import time
34 | 
35 | 
36 | 
37 | 
38 | def activate_web_driver(browser: str) -> webdriver:
39 |     """
40 |     Activate selenium web driver for use in scraping
41 | 
42 |     Args:
43 |         browser (str): the name of the browser to use, either "firefox" or "chromium"
44 | 
45 |     Returns:
46 |         the selected webdriver
47 |     """
48 | 
49 |     # options for selenium webdrivers, used to assist headless scraping. Still ran into issues, so scrapingant is used instead when running from GitHub Actions
50 |     options = [
51 |         "--headless",
52 |         "--window-size=1920,1200",
53 |         "--start-maximized",
54 |         "--no-sandbox",
55 |         "--disable-dev-shm-usage",
56 |         "--disable-gpu",
57 |         "--ignore-certificate-errors",
58 |         "--disable-extensions",
59 |         "--disable-popup-blocking",
60 |         "--disable-notifications",
61 |         "--remote-debugging-port=9222",  #https://stackoverflow.com/questions/56637973/how-to-fix-selenium-devtoolsactiveport-file-doesnt-exist-exception-in-python
62 |         "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
63 |         "--disable-blink-features=AutomationControlled",
64 |     ]
65 | 
66 |     if browser == "firefox":
67 |         service = FirefoxService(executable_path=GeckoDriverManager().install())
68 | 
69 |         firefox_options = webdriver.FirefoxOptions()
70 |         for option in options:
71 |             firefox_options.add_argument(option)
72 | 
73 |         driver = webdriver.Firefox(service=service, options=firefox_options)
74 | 
75 |     else:
76 |         service = ChromiumService(ChromeDriverManager(chrome_type=ChromeType.CHROMIUM).install())
77 | 
78 |         chrome_options = Options()
79 |         for option in options:
80 |             chrome_options.add_argument(option)
81 | 
82 |         driver = webdriver.Chrome(service=service, options=chrome_options)
83 | 
84 |     return driver
85 | 
86 | 
87 | 
88 | def get_new_games(SCRAPINGANT_API_KEY: str, driver: webdriver) -> pd.DataFrame:
89 | 
90 |     # search for the previous days' games; use 2 days to catch up in case of a failed run
91 |     DAYS = 2
92 |     SEASON = ""  # no season will cause the website to default to the current season; the format is "2022-23"
93 |     TODAY = datetime.now(timezone('EST'))  # nba.com uses US Eastern Standard Time
94 |     LASTWEEK = (TODAY - timedelta(days=DAYS))
95 |     DATETO = TODAY.strftime("%m/%d/%y")
96 |     DATEFROM = LASTWEEK.strftime("%m/%d/%y")
97 | 
98 | 
99 |     df = scrape_to_dataframe(api_key=SCRAPINGANT_API_KEY, driver=driver, Season=SEASON, DateFrom=DATEFROM, DateTo=DATETO, )
100 | 
101 |     df = convert_columns(df)
102 | 
103 | 
104 |     df = combine_home_visitor(df)
105 | 
106 |     return df
107 | 
108 | 
109 | 
110 | def parse_ids(data_table):
111 | 
112 |     # TEAM_ID and GAME_ID are encoded in href= links
113 |     # find all the hrefs, add them to a list,
114 |     # then parse out a list for team ids and game ids
115 |     # and convert these to pandas series
116 | 
117 |     CLASS_ID = 'Anchor_anchor__cSc3P'  # determined by visual inspection of page source code
118 | 
119 |     # get all the links
120 |     links = data_table.find_all('a', {'class':CLASS_ID})
121 | 
122 |     # get the href part (web addresses)
123 |     # href="/stats/team/1610612740" for teams
124 |     # href="/game/0022200191" for games
125 |     links_list = [i.get("href") for i in links]
126 | 
127 |     # create a series using the last 10 digits of the appropriate links
128 |     team_id = pd.Series([i[-10:] for i in links_list if ('stats' in i)])
129 |     game_id = pd.Series([i[-10:] for i in links_list if ('/game/' in i)])
130 | 
131 |     return team_id, game_id
132 | 
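# Illustrative example of what parse_ids extracts (hrefs as in the comments above):
#
#   links_list = ["/stats/team/1610612740", "/game/0022200191"]
#   team_id -> pd.Series(["1610612740"])   # last 10 characters of the 'stats' links
#   game_id -> pd.Series(["0022200191"])   # last 10 characters of the '/game/' links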
133 | 
134 | 
135 | def scrape_to_dataframe(api_key, driver, Season, DateFrom="NONE", DateTo="NONE", stat_type='standard'):
136 | 
137 |     # go to the boxscores webpage at nba.com
138 |     # check if the data table is split over multiple pages
139 |     # if so, then select the "ALL" choice in the pulldown menu to show all rows on one page
140 |     # extract out the html table and convert it to a dataframe
141 |     # parse out GAME_ID and TEAM_ID from href links
142 |     # and add these to the dataframe
143 | 
144 |     # if Season is not provided, the website will default to the current season
145 |     # if DateFrom and DateTo are not provided, don't include them in the url - pull the whole season
146 |     if stat_type == 'standard':
147 |         nba_url = "https://www.nba.com/stats/teams/boxscores"
148 |     else:
149 |         nba_url = "https://www.nba.com/stats/teams/boxscores-" + stat_type
150 | 
151 |     if not Season:
152 |         nba_url = nba_url + "?DateFrom=" + DateFrom + "&DateTo=" + DateTo
153 |     else:
154 |         if DateFrom == "NONE" and DateTo == "NONE":
155 |             nba_url = nba_url + "?Season=" + Season
156 |         else:
157 |             nba_url = nba_url + "?Season=" + Season + "&DateFrom=" + DateFrom + "&DateTo=" + DateTo
158 | 
159 |     # try 3 times to load the page correctly; scrapingant can sometimes fail on its first try
160 |     for i in range(3):
161 |         if api_key == "":  # if no api key, then use selenium
162 |             driver.get(nba_url)
163 |             time.sleep(10)
164 |             source = soup(driver.page_source, 'html.parser')
165 |         else:  # if api key, then use scrapingant
166 |             client = ScrapingAntClient(token=api_key)
167 |             result = client.general_request(nba_url)
168 |             source = soup(result.content, 'html.parser')
169 | 
170 |         # the data table is the key dynamic element that may fail to load
171 |         CLASS_ID_TABLE = 'Crom_table__p1iZz'  # determined by visual inspection of page source code
172 |         data_table = source.find('table', {'class':CLASS_ID_TABLE})
173 | 
174 |         if data_table is None:
175 |             time.sleep(10)
176 |         else:
177 |             break
178 | 
179 | 
180 |     # check for more than one page
181 |     CLASS_ID_PAGINATION = "Pagination_pageDropdown__KgjBU"  # determined by visual inspection of page source code
182 |     pagination = source.find('div', {'class':CLASS_ID_PAGINATION})
183 | 
184 |     if api_key == "":  # if using selenium, then check for multiple pages
185 |         if pagination is not None:
186 |             # if multiple pages, first activate the pulldown option for "ALL" to show all rows on one page
187 |             CLASS_ID_DROPDOWN = "DropDown_select__4pIg9"  # determined by visual inspection of page source code
188 |             page_dropdown = driver.find_element(By.XPATH, "//*[@class='" + CLASS_ID_PAGINATION + "']//*[@class='" + CLASS_ID_DROPDOWN + "']")
189 | 
190 |             page_dropdown.send_keys("ALL")  # show all pages
191 |             #page_dropdown.click() doesn't work in headless mode
192 |             time.sleep(3)
193 |             driver.execute_script('arguments[0].click()', page_dropdown)  # click() didn't work in headless mode, used this workaround (https://stackoverflow.com/questions/57741875)
194 | 
195 |             # refresh the page data now that it contains all rows of the table
196 |             time.sleep(3)
197 |             source = soup(driver.page_source, 'html.parser')
198 |             data_table = source.find('table', {'class':CLASS_ID_TABLE})
199 | 
200 |     #print(source)
201 | 
202 |     # convert the html table to a dataframe
203 |     dfs = pd.read_html(str(data_table), header=0)
204 |     df = pd.concat(dfs)
205 | 
206 |     # pull out team ids and game ids from hrefs and add these to the dataframe
207 |     TEAM_ID, GAME_ID = parse_ids(data_table)
208 |     df['TEAM_ID'] = TEAM_ID
209 |     df['GAME_ID'] = GAME_ID
210 | 
211 | 
212 |     return df
213 | 
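# Example of the URL this builds when no Season is given (dates in MM/DD/YY, as
# passed by get_new_games above):
#
#   scrape_to_dataframe(api_key, driver, Season="", DateFrom="12/01/22", DateTo="12/03/22")
#   -> "https://www.nba.com/stats/teams/boxscores?DateFrom=12/01/22&DateTo=12/03/22"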
214 | def convert_columns(df):
215 | 
216 |     # convert the dataframe to the same format and column names as the main data
217 | 
218 |     # drop columns not used
219 |     drop_columns = ['Team', 'MIN', 'FGM', 'FGA', '3PM', '3PA', 'FTM', 'FTA', 'OREB', 'DREB', 'STL', 'BLK', 'TOV', 'PF', '+/-',]
220 |     df = df.drop(columns=drop_columns)
221 | 
222 |     # rename columns to match existing dataframes
223 |     mapper = {
224 |         'Match Up': 'HOME',
225 |         'Game Date': 'GAME_DATE_EST',
226 |         'W/L': 'HOME_TEAM_WINS',
227 |         'FG%': 'FG_PCT',
228 |         '3P%': 'FG3_PCT',
229 |         'FT%': 'FT_PCT',
230 |     }
231 |     df = df.rename(columns=mapper)
232 | 
233 |     # reformat column data
234 | 
235 |     # make HOME true if @ is in the text
236 |     # (Match Ups appear as "POR @ DAL" or "DAL vs. POR")
237 |     df['HOME'] = df['HOME'].apply(lambda x: 1 if '@' in x else 0)
238 | 
239 |     # convert wins to home team wins
240 |     # incomplete games will be NaN
241 |     df = df[df['HOME_TEAM_WINS'].notna()]
242 |     # convert W/L to 1/0
243 |     df['HOME_TEAM_WINS'] = df['HOME_TEAM_WINS'].apply(lambda x: 1 if 'W' in x else 0)
244 |     # no need to do anything else, the win/loss of visitor teams is not used in the final dataframe
245 | 
246 |     # convert date format
247 |     df['GAME_DATE_EST'] = pd.to_datetime(df['GAME_DATE_EST'])
248 |     df['GAME_DATE_EST'] = df['GAME_DATE_EST'].dt.strftime('%Y-%m-%d')
249 |     df['GAME_DATE_EST'] = pd.to_datetime(df['GAME_DATE_EST'])
250 | 
251 |     return df
252 | 
253 | def combine_home_visitor(df):
254 | 
255 |     # each game currently has one row for home team stats
256 |     # and one row for visitor team stats
257 |     # these will be combined into a single row
258 | 
259 |     # separate home vs visitor
260 |     home_df = df[df['HOME'] == 1]
261 |     visitor_df = df[df['HOME'] == 0]
262 | 
263 |     # HOME column no longer needed
264 |     home_df = home_df.drop(columns='HOME')
265 |     visitor_df = visitor_df.drop(columns='HOME')
266 | 
267 |     # HOME_TEAM_WINS and GAME_DATE_EST columns not needed for visitor
268 |     visitor_df = visitor_df.drop(columns=['HOME_TEAM_WINS','GAME_DATE_EST'])
269 | 
270 |     # rename TEAM_ID columns
271 |     home_df = home_df.rename(columns={'TEAM_ID':'HOME_TEAM_ID'})
272 |     visitor_df = visitor_df.rename(columns={'TEAM_ID':'VISITOR_TEAM_ID'})
273 | 
274 |     # merge the home and visitor data
275 |     df = pd.merge(home_df, visitor_df, how="left", on=["GAME_ID"], suffixes=('_home', '_away'))
276 | 
277 |     # add a column for SEASON
278 |     # determine SEASON by parsing GAME_ID
279 |     # (e.g. in 0022200192, the first 2 digits are not used, the 3rd digit 2 = regular season, and the 4th and 5th digits = SEASON)
280 |     game_id = df['GAME_ID'].iloc[0]
281 |     season = game_id[3:5]
282 |     season = str(20) + season
283 |     df['SEASON'] = season
284 | 
285 |     #print(df)
286 | 
287 |     # convert all object columns to int64
288 |     for field in df.select_dtypes(include=['object']).columns.tolist():
289 |         df[field] = df[field].astype('int64')
290 | 
291 |     return df
292 | 
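# Worked example of the SEASON parsing above:
#
#   game_id = "0022200192"
#   game_id[3:5] -> "22";  "20" + "22" -> SEASON "2022"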
"Wednesday, February 1") 318 | div_games = source.find('div', {'class':CLASS_GAMES_PER_DAY}) # first div may or may not be yesterday's games or even future games when playoffs approach 319 | div_game_day = source.find('h4', {'class':CLASS_DAY}) 320 | today = datetime.today().strftime('%A, %B %d')[:3] # e.g. "Wednesday, February 1" -> "Wed" for convenience with dealing with leading zeros 321 | todays_games = None 322 | 323 | while div_games: 324 | print(div_game_day.text[:3]) 325 | if today == div_game_day.text[:3]: 326 | todays_games = div_games 327 | break 328 | else: 329 | # move to next div 330 | div_games = div_games.find_next('div', {'class':CLASS_GAMES_PER_DAY}) 331 | div_game_day = div_game_day.find_next('h4', {'class':CLASS_DAY}) 332 | 333 | if todays_games is None: 334 | # no games today 335 | return None, None 336 | 337 | # Get the teams playing 338 | # Each team listed in todays block will have a href with the specified anchor class 339 | # e.g. PREVIEW 362 | # Each game will have two links with the specified anchor class, one for the preview and one to buy tickets 363 | # all using the same anchor class, so we will filter out those just for PREVIEW 364 | CLASS_ID = "Anchor_anchor__cSc3P TabLink_link__f_15h" 365 | links = todays_games.find_all('a', {'class':CLASS_ID}) 366 | #print(links) 367 | links = [i for i in links if "PREVIEW" in i] 368 | game_id_list = [i.get("href") for i in links] 369 | #print(game_id_list) 370 | 371 | games = [] 372 | for game in game_id_list: 373 | game_id = game.partition("-00")[2].partition("?")[0] # extract team id from text for link 374 | if len(game_id) > 0: 375 | games.append(game_id) 376 | 377 | #asyncio.run(main()) 378 | 379 | return matchups, games 380 | --------------------------------------------------------------------------------