├── README.md
├── Step 1. EDA.ipynb
├── Step 2. Data Cleaning.ipynb
├── Step 3. Model Training.ipynb
├── Step 4. Scripting the Process.ipynb
├── Step 5. Model Deployment.ipynb
├── Step 6. Containerization.ipynb
├── Step 7. Cloud Deployment.ipynb
├── assets
│   ├── allstate_banner-660x120.png
│   ├── docker-process.drawio
│   ├── docker-process.png
│   ├── docker.png
│   ├── ml-iceberg.jpg
│   └── tweet_eda.png
├── requirements.txt
├── scripts
│   ├── Dockerfile
│   ├── clean_data.py
│   ├── data
│   │   ├── test.csv.gz
│   │   ├── test_cleaned.csv.gz
│   │   ├── train.csv.gz
│   │   └── train_cleaned.csv.gz
│   ├── inference.py
│   ├── model
│   │   └── model.bin
│   ├── requirements.txt
│   └── train.py
└── virtual-env
    ├── Pipfile
    ├── Pipfile.lock
    ├── README.md
    └── requirements.txt

/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning Workflow - From EDA to Production
2 |
3 | ## Introduction
4 |
5 | This repo studies and applies the minimal steps involved in a machine learning workflow, done the right way. It was compiled during the first cohort of the ["Machine Learning Zoomcamp"](https://datatalks.club/courses/2021-winter-ml-zoomcamp.html) course, taught by the amazing [@alexeygrigorev](https://github.com/alexeygrigorev).
6 |
7 |
8 |
9 | ## Problem Description
10 |
11 | The problem we will study was held as a Kaggle competition titled ["Allstate Claims Severity"](https://www.kaggle.com/c/allstate-claims-severity/). The data was provided by ["Allstate"](https://www.allstate.com/), a personal insurer in the United States, which was looking for ML-based methods to reduce the cost of insurance claims.
12 | The objective is to predict the _'loss'_ value for a claim, which makes it a __regression__ problem. Submissions for the test data are evaluated on the __Mean Absolute Error (MAE)__ between the predicted loss and the actual loss.
13 | All column names and values in the provided dataset are obfuscated for privacy reasons. Thus, we'll have no __"Domain Knowledge"__ of this problem.
14 |
15 |
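To make the evaluation metric concrete, here is a minimal sketch of how MAE is computed. The helper name `mae_score` and the predicted values are made up for illustration; only the three actual `loss` values are taken from the first rows of the training data. In practice, scikit-learn's `mean_absolute_error` does the same job.

```python
def mae_score(y_true, y_pred):
    """Mean Absolute Error: the average of |actual - predicted|."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# actual losses (first three training rows) vs. made-up predictions
actual = [2213.18, 1283.60, 3005.09]
predicted = [2000.00, 1500.00, 3000.00]

mae = mae_score(actual, predicted)
print(f"MAE: {mae:.2f}")
```

Since MAE is measured in the same units as _'loss'_, the score reads directly as "how far off the predictions are on average".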

16 | 17 |

18 | 19 |
20 |
21 | ## About the Dataset
22 |
23 | The dataset used in this repo is a __"Tabular"__ one, meaning the data is represented in __rows__ and __columns__, corresponding to __samples__ and __features__ respectively.
24 | The columns (features) are of both __categorical__ and __numerical__ types. The train and test datasets contain __188,318__ and __125,546__ rows (samples) respectively, with __130__ feature columns, plus two more columns holding the claim id and the target, named _'id'_ and _'loss'_.
25 |
26 | The dataset is already provided in the repo's __scripts/data__ folder, so you don't need to download it separately.
27 |
28 |
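As a rough sketch of how the two feature types can be told apart, the snippet below groups column names by prefix. It assumes the naming convention visible in the data (`cat…` for categorical and `cont…` for numerical columns) and a 116/14 split between them, which is an assumption consistent with the 130 features mentioned above. `split_feature_groups` is a hypothetical helper, not part of the repo's scripts.

```python
def split_feature_groups(columns):
    """Group column names by prefix: 'cat*' -> categorical, 'cont*' -> numerical."""
    categorical = [c for c in columns if c.startswith("cat")]
    numerical = [c for c in columns if c.startswith("cont")]
    grouped = set(categorical) | set(numerical)
    other = [c for c in columns if c not in grouped]
    return categorical, numerical, other


# Assumed layout of the raw training data: id, cat1..cat116, cont1..cont14, loss
columns = (
    ["id"]
    + [f"cat{i}" for i in range(1, 117)]
    + [f"cont{i}" for i in range(1, 15)]
    + ["loss"]
)

categorical, numerical, other = split_feature_groups(columns)
print(len(categorical), len(numerical), other)
```

With the real data, the column names would come from a DataFrame instead, e.g. `pd.read_csv('scripts/data/train.csv.gz').columns`; pandas infers the gzip compression from the file extension.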
29 |
30 | ## Requirements
31 |
32 | This entire repo relies on the Python programming language to demonstrate a machine learning workflow. The workflow steps are explained, and can be run, in ["Jupyter Notebook"](https://jupyter.org) documents, which let us execute them in an interactive environment. The easiest way to install Python and many popular data science libraries is to set up [Anaconda](https://www.anaconda.com). You may refer to the __virtual-env__ folder in this repo for a quick guide on how to set up a virtual environment without running into conflicts with your current setup.
33 |
34 | I'd also recommend going through the notebooks one by one, in order.
35 |
36 |
37 |
38 | ## Important Note
39 |
40 | For the sake of simplicity, we assume that the __"Data Collection"__ step is already done, since we're going to use a publicly available Kaggle dataset. Please note, though, that this is not the case in real-world scenarios. Most of the time, tabular data is collected by querying multiple database tables and running pretty complex SQL commands. If you're planning to become a machine learning engineer, make sure you know a good deal about databases and writing efficient SQL queries; that, ladies and gents, turns out to be an essential and invaluable asset for an ML engineer.
41 |
42 | The main intention of this workflow is not to achieve the best benchmark score on the dataset, and in no way does it claim to contain the most complete sub-steps.
43 |
44 | Given the above, you might ask: what's the focus here? I can summarize the answer as follows:
45 |
46 | - To take a quick look at the minimal required steps involved in a machine learning problem, from EDA to production.
47 | - To avoid common slips and conduct each step the right way.
48 |
49 | A good machine learning solution involves many steps; the ML algorithm, for instance, is just the tip of the iceberg. Hopefully, the material here gives you a nice view of what you should expect in your journey 😉.
50 |
51 |

52 | 53 |

54 | 55 |
56 | 57 | Pull requests are super welcome. Please don't hesitate to contribute if you think something is missing or needs an improvement. Hope you find the content useful, and don't forget to show your support by hitting the ⭐. 58 | -------------------------------------------------------------------------------- /Step 3. Model Training.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Model Training\n", 8 | "This is the stage in which we cook the ML solution. But just like cooking, we're gonna need some good ingredients first. Let's get our hands on what we have prepared so far." 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "# required library imports & initial settings\n", 18 | "\n", 19 | "import numpy as np\n", 20 | "import pandas as pd\n", 21 | "import seaborn as sns\n", 22 | "import pickle\n", 23 | "\n", 24 | "from sklearn import set_config\n", 25 | "from sklearn.model_selection import StratifiedKFold\n", 26 | "\n", 27 | "from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler, StandardScaler\n", 28 | "from sklearn.impute import SimpleImputer\n", 29 | "\n", 30 | "from sklearn.compose import ColumnTransformer\n", 31 | "from sklearn.pipeline import Pipeline\n", 32 | "\n", 33 | "from sklearn.linear_model import HuberRegressor\n", 34 | "from sklearn.ensemble import RandomForestRegressor\n", 35 | "import xgboost as xgb\n", 36 | "\n", 37 | "from sklearn.metrics import mean_absolute_error\n", 38 | "\n", 39 | "from warnings import simplefilter\n", 40 | "from sklearn.exceptions import ConvergenceWarning\n", 41 | "\n", 42 | "\n", 43 | "# set seed value for reproducibility\n", 44 | "RANDOM_SEED = 1024" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Data Preprocessing" 52 | ] 
53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Reading Cleaned Data" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 2, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "data": { 68 | "text/html": [ 69 | "

	id	cat1	cat2	cat4	cat5	cat6	cat8	cat9	cat10	cat11	...	cont5	cont6	cont7	cont8	cont9	cont10	cont11	cont13	cont14	loss
0	1	A	B	B	A	A	A	B	A	B	...	0.310061	0.718367	0.335060	0.30260	0.67135	0.83510	0.569745	0.822493	0.714843	2213.18
1	2	A	B	A	A	A	A	B	B	A	...	0.885834	0.438917	0.436585	0.60087	0.35127	0.43919	0.338312	0.611431	0.304496	1283.60
2	5	A	B	A	B	A	A	B	B	B	...	0.397069	0.289648	0.315545	0.27320	0.26076	0.32446	0.381398	0.195709	0.774425	3005.09
3	10	B	B	B	A	A	A	B	A	A	...	0.422268	0.440945	0.391128	0.31796	0.32128	0.44467	0.327915	0.605077	0.602642	939.85
4	11	A	B	B	A	A	A	B	B	A	...	0.704268	0.178193	0.247408	0.24564	0.22089	0.21230	0.204687	0.246011	0.432606	2763.85
5 rows × 120 columns

" 235 | ], 236 | "text/plain": [ 237 | " id cat1 cat2 cat4 cat5 cat6 cat8 cat9 cat10 cat11 ... cont5 cont6 \\\n", 238 | "0 1 A B B A A A B A B ... 0.310061 0.718367 \n", 239 | "1 2 A B A A A A B B A ... 0.885834 0.438917 \n", 240 | "2 5 A B A B A A B B B ... 0.397069 0.289648 \n", 241 | "3 10 B B B A A A B A A ... 0.422268 0.440945 \n", 242 | "4 11 A B B A A A B B A ... 0.704268 0.178193 \n", 243 | "\n", 244 | " cont7 cont8 cont9 cont10 cont11 cont13 cont14 loss \n", 245 | "0 0.335060 0.30260 0.67135 0.83510 0.569745 0.822493 0.714843 2213.18 \n", 246 | "1 0.436585 0.60087 0.35127 0.43919 0.338312 0.611431 0.304496 1283.60 \n", 247 | "2 0.315545 0.27320 0.26076 0.32446 0.381398 0.195709 0.774425 3005.09 \n", 248 | "3 0.391128 0.31796 0.32128 0.44467 0.327915 0.605077 0.602642 939.85 \n", 249 | "4 0.247408 0.24564 0.22089 0.21230 0.204687 0.246011 0.432606 2763.85 \n", 250 | "\n", 251 | "[5 rows x 120 columns]" 252 | ] 253 | }, 254 | "execution_count": 2, 255 | "metadata": {}, 256 | "output_type": "execute_result" 257 | } 258 | ], 259 | "source": [ 260 | "# define relative data path (according the current path of this notebook)\n", 261 | "DATA_PATH = './scripts/data/'\n", 262 | "\n", 263 | "df_train_full = pd.read_csv(DATA_PATH+'train_cleaned.csv.gz')\n", 264 | "df_test_full = pd.read_csv(DATA_PATH+'test_cleaned.csv.gz')\n", 265 | "\n", 266 | "df_train_full.head()" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 3, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "# specify feature groups\n", 276 | "features_numerical = [column for column in df_train_full if column.startswith('cont')]\n", 277 | "features_categorical = [column for column in df_train_full if column.startswith('cat')]" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "### Transforming Target Column\n", 285 | "We noticed in EDA stage that our target column in data could benefit from a log-transform. 
We'll apply that transform right here at the beginning, before any splits, since we will be using a stratified method and it is important to keep the distribution of samples in every data split consistent with the target distribution." 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 4, 291 | "metadata": {}, 292 | "outputs": [ 293 | { 294 | "data": { 295 | "text/plain": [ 296 | "" 297 | ] 298 | }, 299 | "execution_count": 4, 300 | "metadata": {}, 301 | "output_type": "execute_result" 302 | }, 303 | { 304 | "data": { 305 | "image/png": "<base64-encoded PNG omitted: histogram of the 'loss' column before the log-transform>", 306 | "text/plain": [ 307 | "

" 308 | ] 309 | }, 310 | "metadata": { 311 | "needs_background": "light" 312 | }, 313 | "output_type": "display_data" 314 | } 315 | ], 316 | "source": [ 317 | "# check out the target column value distribution before log-transform\n", 318 | "\n", 319 | "name_of_target_column = 'loss'\n", 320 | "sns.histplot(df_train_full[name_of_target_column])" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 5, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "name_of_target_column_transformed = name_of_target_column+'_transformed'\n", 330 | "\n", 331 | "# logarithmic transform function: log(1 + value); np.expm1 is its inverse\n", 332 | "def log_transform(value):\n", 333 | "    return np.log1p(value)\n", 334 | "\n", 335 | "# create a new logarithmically transformed target column (vectorized over the whole column)\n", 336 | "df_train_full[name_of_target_column_transformed] = log_transform(\n", 337 | "    df_train_full[name_of_target_column])" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 6, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "data": { 347 | "text/plain": [ 348 | "" 349 | ] 350 | }, 351 | "execution_count": 6, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | }, 355 | { 356 | "data": { 357 | "image/png": 
"<base64-encoded PNG omitted: histogram of the log-transformed 'loss' column>", 358 | "text/plain": [ 359 | "

" 360 | ] 361 | }, 362 | "metadata": { 363 | "needs_background": "light" 364 | }, 365 | "output_type": "display_data" 366 | } 367 | ], 368 | "source": [ 369 | "# check value distribution of transformed target column\n", 370 | "\n", 371 | "sns.histplot(df_train_full[name_of_target_column_transformed])" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "### Subsampling Data\n", 379 | "We have a pretty sizable dataset with 180k+ rows. Needless to say, creating multiple test models and tuning them will take a long time. What should we do then?
\n", 380 | "One simple trick I use in such cases is to create a smaller subsample of the dataset and run the training and tuning experiments on it. Later I can train on the full train data with the parameters I discovered." 381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "execution_count": 7, 386 | "metadata": {}, 387 | "outputs": [], 388 | "source": [ 389 | "# let's shuffle the whole dataframe before subsampling\n", 390 | "df_train_full = df_train_full.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)\n", 391 | "\n", 392 | "# second shuffle with the squared random seed :D\n", 393 | "df_train_full = df_train_full.sample(frac=1, random_state=(RANDOM_SEED**2)).reset_index(drop=True)" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "To make sure that the subsampled data is similar to our original full dataset, it is widely advised to use the __stratified__ method in all sampling scenarios.\n", 401 | "\n", 402 | "But stratified sampling works only on classification problems, which have a target column with classes. Our problem here is a regression problem and the target column contains continuous numerical values.\n", 403 | "\n", 404 | "__So, how can we randomly subsample a fraction of the data so that the target values of the samples follow the same distribution as in our original dataset?__
\n", 405 | "The pandas _qcut()_ function to the rescue! We create discrete bins (groups) for our target column with (nearly) equal value counts in each bin, and do the subsampling using those bins." 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": 8, 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [ 414 | "name_of_target_column_binned = name_of_target_column+'_bin'\n", 415 | "\n", 416 | "# 10 bin groups for target values should work fine\n", 417 | "bin_counts = 10\n", 418 | "\n", 419 | "# create bin column in our dataframe\n", 420 | "df_train_full[name_of_target_column_binned] = pd.qcut(\n", 421 | " df_train_full[name_of_target_column_transformed],\n", 422 | " q=bin_counts,\n", 423 | " labels=list(range(bin_counts)))" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "I think a number around 25k samples is a decent chunk for a (train + validation) subsample:" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 9, 436 | "metadata": {}, 437 | "outputs": [], 438 | "source": [ 439 | "desired_subsample_size = 25000\n", 440 | "subsample_fold_split = int(len(df_train_full)/desired_subsample_size)\n", 441 | "\n", 442 | "# create desired kfold splitter\n", 443 | "skf_subsample_full = StratifiedKFold(\n", 444 | " n_splits=subsample_fold_split, shuffle=True, random_state=RANDOM_SEED)\n", 445 | "\n", 446 | "# make sure target distribution remains the same by utilizing StratifiedKFold data split\n", 447 | "train_indices_remaining, train_indices_subsample = next(\n", 448 | " skf_subsample_full.split(X=df_train_full, y=df_train_full[name_of_target_column_binned]), 0)\n", 449 | "\n", 450 | "df_train_full_subsample = df_train_full.iloc[train_indices_subsample, :].reset_index(drop=True)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "We should now separate the full subsample dataset into \"subsample train\" and \"subsample validation\" in a similar manner:" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 10, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "skf_subsample_trainvalid = StratifiedKFold(\n", 467 | " n_splits=5, shuffle=True, random_state=RANDOM_SEED)\n", 468 | "\n", 469 | "train_indices_subsample, valid_indices_subsample = next(\n", 470 | " skf_subsample_trainvalid.split(X=df_train_full_subsample, y=df_train_full_subsample[name_of_target_column_binned]), 0)\n", 471 | "\n", 472 | "# the indices are positions within the subsample, so index the subsample (not the full dataframe)\n", 473 | "df_train_subsample = df_train_full_subsample.iloc[train_indices_subsample, :].reset_index(drop=True)\n", 473 | "df_valid_subsample = df_train_full_subsample.iloc[valid_indices_subsample, :].reset_index(drop=True)" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": {}, 479 | "source": [ 480 | "Now let's check to see how closely the subsample (and its train/valid splits) represents our full dataset:" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": 11, 486 | "metadata": {}, 487 | "outputs": [ 488 | { 489 | "data": { 490 | "image/png": "[base64 PNG data omitted]", 491 | "text/plain": [ 492 | "

" 493 | ] 494 | }, 495 | "metadata": { 496 | "needs_background": "light" 497 | }, 498 | "output_type": "display_data" 499 | } 500 | ], 501 | "source": [ 502 | "from matplotlib import pyplot as plt\n", 503 | "\n", 504 | "plot_bins = 250\n", 505 | "\n", 506 | "plt.hist(\n", 507 | " x=df_train_full[name_of_target_column_transformed],\n", 508 | " bins=plot_bins,\n", 509 | " alpha=0.6,\n", 510 | " label='Full Train Dataset')\n", 511 | "plt.hist(\n", 512 | " x=df_train_full_subsample[name_of_target_column_transformed],\n", 513 | " bins=plot_bins,\n", 514 | " alpha=0.6,\n", 515 | " label='Full Subsample Dataset')\n", 516 | "plt.hist(\n", 517 | " x=df_train_subsample[name_of_target_column_transformed],\n", 518 | " bins=plot_bins,\n", 519 | " alpha=0.5,\n", 520 | " label='Subsample Train')\n", 521 | "plt.hist(\n", 522 | " x=df_valid_subsample[name_of_target_column_transformed],\n", 523 | " bins=plot_bins,\n", 524 | " alpha=0.5,\n", 525 | " label='Subsample Validation')\n", 526 | "\n", 527 | "plt.yscale('log')\n", 528 | "plt.legend(loc='upper left')\n", 529 | "plt.title('Distribution of Target Column Values')\n", 530 | "plt.rcParams['figure.figsize'] = (1, 1)\n", 531 | "plt.show()" 532 | ] 533 | }, 534 | { 535 | "cell_type": "markdown", 536 | "metadata": {}, 537 | "source": [ 538 | "Distributions look pretty similar. That's great news!
\n", 539 | "__Note:__ The _'Y'_ values (indicating sample counts) in the plot above are logarithmically scaled for easier comparison." 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "Now that we're done with the splits, it's time to build the dataframes containing only the feature columns and, of course, the target column series." 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "execution_count": 12, 552 | "metadata": {}, 553 | "outputs": [], 554 | "source": [ 555 | "features_all = features_numerical + features_categorical\n", 556 | "\n", 557 | "X_train_subsample = df_train_subsample[features_all].copy()\n", 558 | "X_valid_subsample = df_valid_subsample[features_all].copy()\n", 559 | "\n", 560 | "y_train_subsample = df_train_subsample[name_of_target_column_transformed].to_numpy()\n", 561 | "y_valid_subsample = df_valid_subsample[name_of_target_column_transformed].to_numpy()" 562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "### Transforming Numerical Features\n", 569 | "During EDA we noticed the following characteristics for numerical features:
\n", 570 | "- Gaussian-like: _['cont1', 'cont2', 'cont3', 'cont6', 'cont7', 'cont9', 'cont11', 'cont12']_
\n", 571 | "- Non-Gaussian-like: _['cont4', 'cont5', 'cont8', 'cont10', 'cont13', 'cont14']_\n", 572 | "\n", 573 | "As a rule of thumb, we'll use a min-max scaler for non-Gaussian features and a standard scaler for Gaussian-like features.\n", 574 | "\n", 575 | "Remember that we dropped some columns in the data cleaning step, so we should keep only the column names in the groups above that still exist in our dataframe." 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": 13, 581 | "metadata": {}, 582 | "outputs": [], 583 | "source": [ 584 | "gaussian_like = ['cont1', 'cont2', 'cont3', 'cont6', 'cont7', 'cont9', 'cont11', 'cont12']\n", 585 | "non_gaussian_like = ['cont4', 'cont5', 'cont8', 'cont10', 'cont13', 'cont14']\n", 586 | "\n", 587 | "features_numerical_to_normalize = [column for column in non_gaussian_like if column in df_train_subsample.columns.to_list()]\n", 588 | "features_numerical_to_standardize = [column for column in gaussian_like if column in df_train_subsample.columns.to_list()]" 589 | ] 590 | }, 591 | { 592 | "cell_type": "markdown", 593 | "metadata": {}, 594 | "source": [ 595 | "### Handling Categorical Features\n", 596 | "We talked about the curse of dimensionality in the EDA stage. For categorical columns, if the cardinality of a column is below 10, we'll nominate it for one-hot encoding; otherwise it goes to the ordinal encoding list." 
597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": 14, 602 | "metadata": {}, 603 | "outputs": [], 604 | "source": [ 605 | "features_categorical_to_ordinal = list()\n", 606 | "features_categorical_to_onehot = list()\n", 607 | "\n", 608 | "for column, variety in X_train_subsample[features_categorical].nunique().items():\n", 609 | " if variety < 10: features_categorical_to_onehot.append(column)\n", 610 | " else: features_categorical_to_ordinal.append(column)" 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "metadata": {}, 616 | "source": [ 617 | "Let's utilize __scikit-learn's Pipeline__ feature to do the transformations in an efficient way:" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 15, 623 | "metadata": {}, 624 | "outputs": [ 625 | { 626 | "data": { 627 | "text/html": [ 628 | "

ColumnTransformer
num: ['cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont13', 'cont14'] -> SimpleImputer() -> MinMaxScaler() -> StandardScaler()
cat1: ['cat99', 'cat100', 'cat101', 'cat103', 'cat104', 'cat105', 'cat106', 'cat107', 'cat108', 'cat109', 'cat110', 'cat111', 'cat112', 'cat113', 'cat114', 'cat115', 'cat116'] -> SimpleImputer(strategy='most_frequent') -> OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1) -> MinMaxScaler()
cat2: ['cat1', 'cat2', 'cat4', 'cat5', 'cat6', 'cat8', 'cat9', 'cat10', 'cat11', 'cat12', 'cat13', 'cat14', 'cat16', 'cat17', 'cat18', 'cat19', 'cat20', 'cat21', 'cat23', 'cat24', 'cat25', 'cat26', 'cat27', 'cat28', 'cat29', 'cat30', 'cat31', 'cat32', 'cat33', 'cat34', 'cat35', 'cat36', 'cat37', 'cat38', 'cat39', 'cat40', 'cat41', 'cat42', 'cat43', 'cat44', 'cat45', 'cat46', 'cat47', 'cat48', 'cat49', 'cat50', 'cat51', 'cat52', 'cat53', 'cat54', 'cat57', 'cat58', 'cat59', 'cat60', 'cat61', 'cat65', 'cat66', 'cat67', 'cat69', 'cat71', 'cat72', 'cat73', 'cat74', 'cat75', 'cat76', 'cat77', 'cat78', 'cat79', 'cat80', 'cat81', 'cat82', 'cat83', 'cat84', 'cat85', 'cat86', 'cat87', 'cat88', 'cat89', 'cat90', 'cat91', 'cat92', 'cat93', 'cat94', 'cat95', 'cat96', 'cat97', 'cat98', 'cat102'] -> SimpleImputer(fill_value='missing', strategy='constant') -> OneHotEncoder(handle_unknown='ignore', sparse=False)

" 652 | ], 653 | "text/plain": [ 654 | "ColumnTransformer(transformers=[('num',\n", 655 | " Pipeline(steps=[('imputer', SimpleImputer()),\n", 656 | " ('normalizer', MinMaxScaler()),\n", 657 | " ('standardizer',\n", 658 | " StandardScaler())]),\n", 659 | " ['cont1', 'cont2', 'cont3', 'cont4', 'cont5',\n", 660 | " 'cont6', 'cont7', 'cont8', 'cont9', 'cont10',\n", 661 | " 'cont11', 'cont13', 'cont14']),\n", 662 | " ('cat1',\n", 663 | " Pipeline(steps=[('imputer',\n", 664 | " SimpleImputer(strategy='most_frequent')),\n", 665 | " ('ordinal',\n", 666 | " Ord...\n", 667 | " SimpleImputer(fill_value='missing',\n", 668 | " strategy='constant')),\n", 669 | " ('onehot',\n", 670 | " OneHotEncoder(handle_unknown='ignore',\n", 671 | " sparse=False))]),\n", 672 | " ['cat1', 'cat2', 'cat4', 'cat5', 'cat6',\n", 673 | " 'cat8', 'cat9', 'cat10', 'cat11', 'cat12',\n", 674 | " 'cat13', 'cat14', 'cat16', 'cat17', 'cat18',\n", 675 | " 'cat19', 'cat20', 'cat21', 'cat23', 'cat24',\n", 676 | " 'cat25', 'cat26', 'cat27', 'cat28', 'cat29',\n", 677 | " 'cat30', 'cat31', 'cat32', 'cat33', 'cat34', ...])])" 678 | ] 679 | }, 680 | "execution_count": 15, 681 | "metadata": {}, 682 | "output_type": "execute_result" 683 | } 684 | ], 685 | "source": [ 686 | "# create transform pipeline for numerical features\n", 687 | "transformer_numerical = Pipeline(steps=[\n", 688 | " ('imputer', SimpleImputer(strategy='mean')),\n", 689 | " ('normalizer', MinMaxScaler()),\n", 690 | " ('standardizer', StandardScaler()),\n", 691 | "])\n", 692 | "\n", 693 | "# create transform pipelines for categorical features\n", 694 | "transformer_categorical_1 = Pipeline(steps=[\n", 695 | " ('imputer', SimpleImputer(strategy='most_frequent')),\n", 696 | " ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),\n", 697 | " ('normal', MinMaxScaler()),\n", 698 | "])\n", 699 | "transformer_categorical_transformer2 = Pipeline(steps=[\n", 700 | " ('imputer', SimpleImputer(strategy='constant', 
fill_value='missing')),\n", 701 | " ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))\n", 702 | "])\n", 703 | "\n", 704 | "# bundle preprocessing for numerical and categorical data\n", 705 | "preprocessor = ColumnTransformer(\n", 706 | " transformers=[\n", 707 | " ('num', transformer_numerical, features_numerical),\n", 708 | " ('cat1', transformer_categorical_1, features_categorical_to_ordinal),\n", 709 | " ('cat2', transformer_categorical_transformer2, features_categorical_to_onehot),\n", 710 | " ])\n", 711 | "\n", 712 | "\n", 713 | "# take a look at preprocessing pipeline\n", 714 | "set_config(display='diagram')\n", 715 | "preprocessor" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "## Models Training & Tuning Hyperparameters\n", 723 | "We'll cover three models for this study, starting with a __linear__ model, which is the simplest & fastest one: __\"Huber Linear Regression\"__. Later we explore two more sophisticated __non-linear__, tree-based models: __\"Random Forest\"__ & __\"XGBoost\"__.\n", 724 | "\n", 725 | "Hyperparameter tuning can be handled either automatically or manually. There are many great tools for automatic tuning; [optuna](https://optuna.org) is one example. For the sake of simplicity, I'll stick to a __manual approach__ and tune a couple of important parameters one by one, __moving from the most significant to the least__.\n", 726 | "\n", 727 | "__Warning:__ You might get different results if you change the RANDOM_SEED at the beginning of this notebook. Library updates over time can also alter the benchmark numbers displayed here." 
728 | ] 729 | }, 730 | { 731 | "cell_type": "markdown", 732 | "metadata": {}, 733 | "source": [ 734 | "### Huber Linear Regression" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 16, 740 | "metadata": {}, 741 | "outputs": [ 742 | { 743 | "name": "stdout", 744 | "output_type": "stream", 745 | "text": [ 746 | "MAE Score: 1301.2823384574385\n" 747 | ] 748 | } 749 | ], 750 | "source": [ 751 | "# ignore regressor converge warning\n", 752 | "simplefilter('ignore', category=ConvergenceWarning)\n", 753 | "\n", 754 | "\n", 755 | "model_hr = HuberRegressor()\n", 756 | "\n", 757 | "# bundle preprocessing and modeling in a final pipeline\n", 758 | "pipeline_hr = Pipeline(steps=[\n", 759 | " ('preprocessor', preprocessor),\n", 760 | " ('model', model_hr)\n", 761 | "])\n", 762 | "\n", 763 | "# preprocess train data & fit model\n", 764 | "pipeline_hr.fit(X_train_subsample, y_train_subsample)\n", 765 | "\n", 766 | "\n", 767 | "y_pred_subsample_hr = pipeline_hr.predict(X_valid_subsample)\n", 768 | "\n", 769 | "# preprocess validation data and get predictions to evaluate the model\n", 770 | "# don't forget to transfer back target values after inference prediction with np.expm1() ;)\n", 771 | "score_mae_hr = mean_absolute_error(\n", 772 | " np.expm1(y_valid_subsample), np.expm1(y_pred_subsample_hr))\n", 773 | "print('MAE Score:', score_mae_hr)" 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "metadata": {}, 779 | "source": [ 780 | "#### Huber Hyperparameter Tuning\n", 781 | "Although huber does not offer much of parameters for tuning, I'll explore the few possibilities." 
782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": 17, 787 | "metadata": {}, 788 | "outputs": [ 789 | { 790 | "name": "stdout", 791 | "output_type": "stream", 792 | "text": [ 793 | "<>\n", 794 | "Val.\tScore\n", 795 | "0 \t 1301.0487\n", 796 | "1 \t 1300.6491\n", 797 | "10 \t 1300.4415\n", 798 | "100 \t 1295.1346\n", 799 | "200 \t 1293.843\n", 800 | "300 \t 1294.4547\n", 801 | "400 \t 1296.7492\n", 802 | "500 \t 1301.6713\n" 803 | ] 804 | } 805 | ], 806 | "source": [ 807 | "tune_subject = 'Alpha (Regularization Parameter)'\n", 808 | "tune_values = [0, 1, 10, 100, 200, 300, 400, 500]\n", 809 | "\n", 810 | "scores = list()\n", 811 | "print(f'<>\\nVal.\\tScore')\n", 812 | "for value in tune_values:\n", 813 | " model_hr = HuberRegressor(alpha=value)\n", 814 | " pipeline_hr = Pipeline(steps=[\n", 815 | " ('preprocessor', preprocessor),\n", 816 | " ('model', model_hr)\n", 817 | " ])\n", 818 | " pipeline_hr.fit(X_train_subsample, y_train_subsample)\n", 819 | "\n", 820 | " y_pred_subsample_hr = pipeline_hr.predict(X_valid_subsample)\n", 821 | " score_mae_hr = mean_absolute_error(np.expm1(y_valid_subsample), np.expm1(y_pred_subsample_hr))\n", 822 | " scores.append((value, score_mae_hr))\n", 823 | " print(f'{value} \\t {round(score_mae_hr, 4)}')" 824 | ] 825 | }, 826 | { 827 | "cell_type": "markdown", 828 | "metadata": {}, 829 | "source": [ 830 | "We have our first tuned hyperparameter, alpha=200." 
831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": 18, 836 | "metadata": {}, 837 | "outputs": [ 838 | { 839 | "name": "stdout", 840 | "output_type": "stream", 841 | "text": [ 842 | "<>\n", 843 | "Val.\tScore\n", 844 | "2.5 \t 1289.5555\n", 845 | "2.6 \t 1289.8729\n", 846 | "2.7 \t 1289.0714\n", 847 | "2.8 \t 1287.5592\n", 848 | "2.9 \t 1291.7325\n", 849 | "3.0 \t 1289.129\n", 850 | "3.1 \t 1290.182\n", 851 | "3.2 \t 1288.6241\n", 852 | "3.3 \t 1288.6455\n", 853 | "3.4 \t 1289.142\n" 854 | ] 855 | } 856 | ], 857 | "source": [ 858 | "tune_subject = 'Epsilon (Outlier Threshold)'\n", 859 | "tune_values = [round(item, 2) for item in np.arange(2.5, 3.5, 0.1)]\n", 860 | "\n", 861 | "scores = list()\n", 862 | "print(f'<>\\nVal.\\tScore')\n", 863 | "for value in tune_values:\n", 864 | " model_hr = HuberRegressor(alpha=200, epsilon=value)\n", 865 | " pipeline_hr = Pipeline(steps=[\n", 866 | " ('preprocessor', preprocessor),\n", 867 | " ('model', model_hr)\n", 868 | " ])\n", 869 | " pipeline_hr.fit(X_train_subsample, y_train_subsample)\n", 870 | "\n", 871 | " y_pred_subsample_hr = pipeline_hr.predict(X_valid_subsample)\n", 872 | " score_mae_hr = mean_absolute_error(np.expm1(y_valid_subsample), np.expm1(y_pred_subsample_hr))\n", 873 | " scores.append((value, score_mae_hr))\n", 874 | " print(f'{value} \\t {round(score_mae_hr, 4)}')" 875 | ] 876 | }, 877 | { 878 | "cell_type": "markdown", 879 | "metadata": {}, 880 | "source": [ 881 | "That's it. 
Should we choose Huber Linear Regression, these will be our hyperparameters:\n", 882 | "```\n", 883 | "alpha=200\n", 884 | "epsilon=2.8\n", 885 | "```" 886 | ] 887 | }, 888 | { 889 | "cell_type": "markdown", 890 | "metadata": {}, 891 | "source": [ 892 | "### Random Forest\n" 893 | ] 894 | }, 895 | { 896 | "cell_type": "markdown", 897 | "metadata": {}, 898 | "source": [ 899 | "#### Random Forest Initial Training" 900 | ] 901 | }, 902 | { 903 | "cell_type": "code", 904 | "execution_count": 19, 905 | "metadata": {}, 906 | "outputs": [ 907 | { 908 | "name": "stdout", 909 | "output_type": "stream", 910 | "text": [ 911 | "MAE Score: 1256.4217276244276\n" 912 | ] 913 | } 914 | ], 915 | "source": [ 916 | "model_rf = RandomForestRegressor(random_state=RANDOM_SEED, n_jobs=-1)\n", 917 | "\n", 918 | "# bundle preprocessing and modeling in a final pipeline\n", 919 | "pipeline_rf = Pipeline(steps=[\n", 920 | " ('preprocessor', preprocessor),\n", 921 | " ('model', model_rf)\n", 922 | "])\n", 923 | "\n", 924 | "# preprocess train data & fit model\n", 925 | "pipeline_rf.fit(X_train_subsample, y_train_subsample)\n", 926 | "\n", 927 | "\n", 928 | "y_pred_subsample_rf = pipeline_rf.predict(X_valid_subsample)\n", 929 | "\n", 930 | "# preprocess validation data and get predictions to evaluate the model\n", 931 | "# don't forget to transfer back target values after inference prediction with np.expm1() ;)\n", 932 | "score_mae_rf = mean_absolute_error(np.expm1(y_valid_subsample), np.expm1(y_pred_subsample_rf))\n", 933 | "print('MAE Score:', score_mae_rf)" 934 | ] 935 | }, 936 | { 937 | "cell_type": "markdown", 938 | "metadata": {}, 939 | "source": [ 940 | "#### Random Forest Hyperparameter Tuning" 941 | ] 942 | }, 943 | { 944 | "cell_type": "code", 945 | "execution_count": 20, 946 | "metadata": {}, 947 | "outputs": [ 948 | { 949 | "name": "stdout", 950 | "output_type": "stream", 951 | "text": [ 952 | "<>\n", 953 | "Val.\tScore\n", 954 | "3 \t 1447.5336\n", 955 | "5 \t 1348.4137\n", 956 | 
"10 \t 1272.5134\n", 957 | "15 \t 1254.9051\n", 958 | "20 \t 1255.3437\n", 959 | "25 \t 1253.8527\n", 960 | "30 \t 1255.6243\n" 961 | ] 962 | } 963 | ], 964 | "source": [ 965 | "tune_subject = 'Max Depth'\n", 966 | "tune_values = [3, 5, 10, 15, 20, 25, 30]\n", 967 | "\n", 968 | "scores = list()\n", 969 | "print(f'<>\\nVal.\\tScore')\n", 970 | "for value in tune_values:\n", 971 | " model_rf = RandomForestRegressor(max_depth=value, random_state=RANDOM_SEED, n_jobs=-1)\n", 972 | " pipeline_rf = Pipeline(steps=[\n", 973 | " ('preprocessor', preprocessor),\n", 974 | " ('model', model_rf)\n", 975 | " ])\n", 976 | " pipeline_rf.fit(X_train_subsample, y_train_subsample)\n", 977 | "\n", 978 | " y_pred_subsample_rf = pipeline_rf.predict(X_valid_subsample)\n", 979 | " score_mae_rf = mean_absolute_error(np.expm1(y_valid_subsample), np.expm1(y_pred_subsample_rf))\n", 980 | " scores.append((value, score_mae_rf))\n", 981 | " print(f'{value} \\t {round(score_mae_rf, 4)}')" 982 | ] 983 | }, 984 | { 985 | "cell_type": "markdown", 986 | "metadata": {}, 987 | "source": [ 988 | "I think it's fair to move on with max_depth=15 from here." 
989 | ] 990 | }, 991 | { 992 | "cell_type": "code", 993 | "execution_count": 22, 994 | "metadata": {}, 995 | "outputs": [ 996 | { 997 | "name": "stdout", 998 | "output_type": "stream", 999 | "text": [ 1000 | "<>\n", 1001 | "Val.\tScore\n", 1002 | "50 \t 1258.2729\n", 1003 | "100 \t 1254.9051\n", 1004 | "150 \t 1254.7139\n", 1005 | "200 \t 1255.0646\n", 1006 | "250 \t 1254.189\n" 1007 | ] 1008 | } 1009 | ], 1010 | "source": [ 1011 | "tune_subject = 'Number of Estimators'\n", 1012 | "tune_values = list(range(50, 300, 50))\n", 1013 | "\n", 1014 | "scores = list()\n", 1015 | "print(f'<>\\nVal.\\tScore')\n", 1016 | "for value in tune_values:\n", 1017 | " model_rf = RandomForestRegressor(max_depth=15, n_estimators=value, random_state=RANDOM_SEED, n_jobs=-1)\n", 1018 | " pipeline_rf = Pipeline(steps=[\n", 1019 | " ('preprocessor', preprocessor),\n", 1020 | " ('model', model_rf)\n", 1021 | " ])\n", 1022 | " pipeline_rf.fit(X_train_subsample, y_train_subsample)\n", 1023 | "\n", 1024 | " y_pred_subsample_rf = pipeline_rf.predict(X_valid_subsample)\n", 1025 | " score_mae_rf = mean_absolute_error(np.expm1(y_valid_subsample), np.expm1(y_pred_subsample_rf))\n", 1026 | " scores.append((value, score_mae_rf))\n", 1027 | " print(f'{value} \\t {round(score_mae_rf, 4)}')" 1028 | ] 1029 | }, 1030 | { 1031 | "cell_type": "markdown", 1032 | "metadata": {}, 1033 | "source": [ 1034 | "We'll continue with n_estimators=150." 
1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "code", 1039 | "execution_count": 23, 1040 | "metadata": {}, 1041 | "outputs": [ 1042 | { 1043 | "name": "stdout", 1044 | "output_type": "stream", 1045 | "text": [ 1046 | "<>\n", 1047 | "Val.\tScore\n", 1048 | "8 \t 1317.1715\n", 1049 | "16 \t 1274.3149\n", 1050 | "32 \t 1256.0595\n", 1051 | "40 \t 1255.3021\n", 1052 | "48 \t 1254.0412\n", 1053 | "56 \t 1253.0149\n", 1054 | "64 \t 1247.4436\n", 1055 | "72 \t 1252.0038\n", 1056 | "80 \t 1253.6833\n" 1057 | ] 1058 | } 1059 | ], 1060 | "source": [ 1061 | "tune_subject = 'Max Number of Features for Splitting'\n", 1062 | "tune_values = [8, 16, 32, 40, 48, 56, 64, 72, 80]\n", 1063 | "\n", 1064 | "scores = list()\n", 1065 | "print(f'<>\\nVal.\\tScore')\n", 1066 | "for value in tune_values:\n", 1067 | " model_rf = RandomForestRegressor(max_depth=15, n_estimators=150,\n", 1068 | " max_features=value, random_state=RANDOM_SEED, n_jobs=-1)\n", 1069 | " pipeline_rf = Pipeline(steps=[\n", 1070 | " ('preprocessor', preprocessor),\n", 1071 | " ('model', model_rf)\n", 1072 | " ])\n", 1073 | " pipeline_rf.fit(X_train_subsample, y_train_subsample)\n", 1074 | "\n", 1075 | " y_pred_subsample_rf = pipeline_rf.predict(X_valid_subsample)\n", 1076 | " score_mae_rf = mean_absolute_error(np.expm1(y_valid_subsample), np.expm1(y_pred_subsample_rf))\n", 1077 | " scores.append((value, score_mae_rf))\n", 1078 | " print(f'{value} \\t {round(score_mae_rf, 4)}')" 1079 | ] 1080 | }, 1081 | { 1082 | "cell_type": "markdown", 1083 | "metadata": {}, 1084 | "source": [ 1085 | "This brings our RF tuning to an end. 
The best hyperparameters are:\n", 1086 | "```\n", 1087 | "max_depth=15\n", 1088 | "n_estimators=150\n", 1089 | "max_features=64\n", 1090 | "```" 1091 | ] 1092 | }, 1093 | { 1094 | "cell_type": "markdown", 1095 | "metadata": {}, 1096 | "source": [ 1097 | "### XGBoost\n" 1098 | ] 1099 | }, 1100 | { 1101 | "cell_type": "markdown", 1102 | "metadata": {}, 1103 | "source": [ 1104 | "#### XGBoost Initial Training" 1105 | ] 1106 | }, 1107 | { 1108 | "cell_type": "code", 1109 | "execution_count": 24, 1110 | "metadata": {}, 1111 | "outputs": [ 1112 | { 1113 | "name": "stdout", 1114 | "output_type": "stream", 1115 | "text": [ 1116 | "MAE Score: 1249.1984325267313\n" 1117 | ] 1118 | } 1119 | ], 1120 | "source": [ 1121 | "model_xgb = xgb.XGBRegressor()\n", 1122 | "pipeline_xgb = Pipeline(steps=[\n", 1123 | " ('preprocessor', preprocessor),\n", 1124 | " ('model', model_xgb)\n", 1125 | "])\n", 1126 | "pipeline_xgb.fit(X_train_subsample, y_train_subsample)\n", 1127 | "\n", 1128 | "y_pred_subsample_xgb = pipeline_xgb.predict(X_valid_subsample)\n", 1129 | "score_mae_xgb = mean_absolute_error(np.expm1(y_valid_subsample), np.expm1(y_pred_subsample_xgb))\n", 1130 | "print('MAE Score:', score_mae_xgb)" 1131 | ] 1132 | }, 1133 | { 1134 | "cell_type": "markdown", 1135 | "metadata": {}, 1136 | "source": [ 1137 | "#### XGBoost Hyperparameter Tuning" 1138 | ] 1139 | }, 1140 | { 1141 | "cell_type": "markdown", 1142 | "metadata": {}, 1143 | "source": [ 1144 | "We can use __gpu__ for faster hyperparameter search. 
If you have a compatible gpu, enable this option by setting it enabled in following line:" 1145 | ] 1146 | }, 1147 | { 1148 | "cell_type": "code", 1149 | "execution_count": 25, 1150 | "metadata": {}, 1151 | "outputs": [], 1152 | "source": [ 1153 | "gpu_enabled = True\n", 1154 | "\n", 1155 | "tree_method_applied = 'gpu_hist' if gpu_enabled else 'auto'" 1156 | ] 1157 | }, 1158 | { 1159 | "cell_type": "code", 1160 | "execution_count": 26, 1161 | "metadata": {}, 1162 | "outputs": [ 1163 | { 1164 | "name": "stdout", 1165 | "output_type": "stream", 1166 | "text": [ 1167 | "<>\n", 1168 | "Val.\tScore\n", 1169 | "20 \t 1230.7975\n", 1170 | "25 \t 1226.4625\n", 1171 | "30 \t 1221.7926\n", 1172 | "35 \t 1221.6315\n", 1173 | "40 \t 1222.7315\n", 1174 | "45 \t 1225.9955\n", 1175 | "50 \t 1229.8954\n", 1176 | "55 \t 1232.7327\n", 1177 | "60 \t 1235.4017\n", 1178 | "65 \t 1237.1708\n", 1179 | "70 \t 1239.6029\n", 1180 | "75 \t 1243.2009\n", 1181 | "80 \t 1243.9277\n", 1182 | "85 \t 1245.0737\n", 1183 | "90 \t 1245.1414\n", 1184 | "95 \t 1246.275\n" 1185 | ] 1186 | } 1187 | ], 1188 | "source": [ 1189 | "tune_subject = 'Number of Estimators'\n", 1190 | "tune_values = list(range(20, 100, 5))\n", 1191 | "\n", 1192 | "scores = list()\n", 1193 | "print(f'<>\\nVal.\\tScore')\n", 1194 | "for value in tune_values:\n", 1195 | " xgb_params = {\n", 1196 | " 'n_estimators': value,\n", 1197 | "\n", 1198 | " 'objective': 'reg:squarederror',\n", 1199 | " 'nthread': -1,\n", 1200 | " 'tree_method': tree_method_applied,\n", 1201 | "\n", 1202 | " 'seed': RANDOM_SEED,\n", 1203 | " 'verbosity': 1,\n", 1204 | " }\n", 1205 | "\n", 1206 | " model_xgb = xgb.XGBRegressor(**xgb_params)\n", 1207 | " pipeline_xgb = Pipeline(steps=[\n", 1208 | " ('preprocessor', preprocessor),\n", 1209 | " ('model', model_xgb)\n", 1210 | " ])\n", 1211 | " pipeline_xgb.fit(X_train_subsample, y_train_subsample)\n", 1212 | "\n", 1213 | " y_pred_subsample_xgb = pipeline_xgb.predict(X_valid_subsample)\n", 1214 | " score_mae_xgb = 
mean_absolute_error(np.expm1(y_valid_subsample), np.expm1(y_pred_subsample_xgb))\n", 1215 | " scores.append((value, score_mae_xgb))\n", 1216 | " print(f'{value} \\t {round(score_mae_xgb, 4)}')" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "markdown", 1221 | "metadata": {}, 1222 | "source": [ 1223 | "There we go. Our preferred n_estimators equals 35." 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "code", 1228 | "execution_count": 29, 1229 | "metadata": {}, 1230 | "outputs": [ 1231 | { 1232 | "name": "stdout", 1233 | "output_type": "stream", 1234 | "text": [ 1235 | "<>\n", 1236 | "Val.\tScore\n", 1237 | "2 \t 1266.6668\n", 1238 | "3 \t 1231.4775\n", 1239 | "4 \t 1215.8378\n", 1240 | "5 \t 1218.9785\n", 1241 | "6 \t 1221.6315\n", 1242 | "7 \t 1223.4337\n", 1243 | "8 \t 1250.9087\n", 1244 | "9 \t 1258.0747\n", 1245 | "10 \t 1273.4898\n" 1246 | ] 1247 | } 1248 | ], 1249 | "source": [ 1250 | "tune_subject = 'Max Depth'\n", 1251 | "tune_values = list(range(2, 11, 1))\n", 1252 | "\n", 1253 | "scores = list()\n", 1254 | "print(\n", 1255 | " f'<>\\nVal.\\tScore')\n", 1256 | "for value in tune_values:\n", 1257 | " xgb_params = {\n", 1258 | " 'n_estimators': 35,\n", 1259 | " 'max_depth': value,\n", 1260 | "\n", 1261 | " 'objective': 'reg:squarederror',\n", 1262 | " 'nthread': -1,\n", 1263 | " 'tree_method': tree_method_applied,\n", 1264 | "\n", 1265 | " 'seed': RANDOM_SEED,\n", 1266 | " 'verbosity': 1,\n", 1267 | " }\n", 1268 | "\n", 1269 | " model_xgb = xgb.XGBRegressor(**xgb_params)\n", 1270 | " pipeline_xgb = Pipeline(steps=[\n", 1271 | " ('preprocessor', preprocessor),\n", 1272 | " ('model', model_xgb)\n", 1273 | " ])\n", 1274 | " pipeline_xgb.fit(X_train_subsample, y_train_subsample)\n", 1275 | "\n", 1276 | " y_pred_subsample_xgb = pipeline_xgb.predict(X_valid_subsample)\n", 1277 | " score_mae_xgb = mean_absolute_error(\n", 1278 | " np.expm1(y_valid_subsample), np.expm1(y_pred_subsample_xgb))\n", 1279 | " scores.append((value, score_mae_xgb))\n", 1280 | " 
print(f'{value} \\t {round(score_mae_xgb, 4)}')" 1281 | ] 1282 | }, 1283 | { 1284 | "cell_type": "markdown", 1285 | "metadata": {}, 1286 | "source": [ 1287 | "The depth with value 4 is our choice." 1288 | ] 1289 | }, 1290 | { 1291 | "cell_type": "code", 1292 | "execution_count": 30, 1293 | "metadata": {}, 1294 | "outputs": [ 1295 | { 1296 | "name": "stdout", 1297 | "output_type": "stream", 1298 | "text": [ 1299 | "<>\n", 1300 | "Val.\tScore\n", 1301 | "0.1 \t 1380.3735\n", 1302 | "0.11 \t 1333.0651\n", 1303 | "0.12 \t 1297.7281\n", 1304 | "0.13 \t 1280.2472\n", 1305 | "0.14 \t 1259.7375\n", 1306 | "0.15 \t 1252.8473\n", 1307 | "0.16 \t 1242.9069\n", 1308 | "0.17 \t 1240.0223\n", 1309 | "0.18 \t 1237.2628\n", 1310 | "0.19 \t 1230.8648\n", 1311 | "0.2 \t 1230.7833\n", 1312 | "0.21 \t 1223.8763\n", 1313 | "0.22 \t 1223.5119\n", 1314 | "0.23 \t 1219.9929\n", 1315 | "0.24 \t 1217.7635\n", 1316 | "0.25 \t 1220.5119\n", 1317 | "0.26 \t 1224.7322\n", 1318 | "0.27 \t 1213.2179\n", 1319 | "0.28 \t 1219.5806\n", 1320 | "0.29 \t 1218.9915\n" 1321 | ] 1322 | } 1323 | ], 1324 | "source": [ 1325 | "tune_subject = 'Learning Rate'\n", 1326 | "tune_values = [round(item, 2) for item in np.arange(0.1, 0.3, 0.01)]\n", 1327 | "\n", 1328 | "scores = list()\n", 1329 | "print(f'<>\\nVal.\\tScore')\n", 1330 | "for value in tune_values:\n", 1331 | " xgb_params = {\n", 1332 | " 'n_estimators': 35,\n", 1333 | " 'max_depth': 4,\n", 1334 | " 'eta': value,\n", 1335 | "\n", 1336 | " 'objective': 'reg:squarederror',\n", 1337 | " 'nthread': -1,\n", 1338 | " 'tree_method': tree_method_applied,\n", 1339 | "\n", 1340 | " 'seed': RANDOM_SEED,\n", 1341 | " 'verbosity': 1,\n", 1342 | " }\n", 1343 | "\n", 1344 | " model_xgb = xgb.XGBRegressor(**xgb_params)\n", 1345 | " pipeline_xgb = Pipeline(steps=[\n", 1346 | " ('preprocessor', preprocessor),\n", 1347 | " ('model', model_xgb)\n", 1348 | " ])\n", 1349 | " pipeline_xgb.fit(X_train_subsample, y_train_subsample)\n", 1350 | "\n", 1351 | " y_pred_subsample_xgb 
= pipeline_xgb.predict(X_valid_subsample)\n", 1352 | " score_mae_xgb = mean_absolute_error(np.expm1(y_valid_subsample), np.expm1(y_pred_subsample_xgb))\n", 1353 | " scores.append((value, score_mae_xgb))\n", 1354 | " print(f'{value} \\t {round(score_mae_xgb, 4)}')" 1355 | ] 1356 | }, 1357 | { 1358 | "cell_type": "markdown", 1359 | "metadata": {}, 1360 | "source": [ 1361 | "The best learning rate in this sweep is 0.27. This wraps up our XGBoost tuning." 1362 | ] 1363 | }, 1364 | { 1365 | "cell_type": "markdown", 1366 | "metadata": {}, 1367 | "source": [ 1368 | "### Choosing The Best Model\n", 1369 | "XGBoost performed the best among the three tuned models discussed here. It will be our go-to model for the final training on all data and for our training script." 1370 | ] 1371 | }, 1372 | { 1373 | "cell_type": "markdown", 1374 | "metadata": {}, 1375 | "source": [ 1376 | "## Final Training on All Data" 1377 | ] 1378 | }, 1379 | { 1380 | "cell_type": "code", 1381 | "execution_count": 32, 1382 | "metadata": {}, 1383 | "outputs": [ 1384 | { 1385 | "name": "stdout", 1386 | "output_type": "stream", 1387 | "text": [ 1388 | "Training finished :)\n" 1389 | ] 1390 | } 1391 | ], 1392 | "source": [ 1393 | "X_train = df_train_full[features_all].copy()\n", 1394 | "y_train = df_train_full[name_of_target_column_transformed].to_numpy()\n", 1395 | "X_test = df_test_full[features_all].copy()\n", 1396 | "\n", 1397 | "\n", 1398 | "gpu_enabled = True\n", 1399 | "tree_method_applied = 'gpu_hist' if gpu_enabled else 'auto'\n", 1400 | "\n", 1401 | "# tuned hyperparameters\n", 1402 | "xgb_params_final = {\n", 1403 | " 'n_estimators': 35,\n", 1404 | " 'max_depth': 4,\n", 1405 | " 'eta': 0.27,\n", 1406 | "\n", 1407 | " 'objective': 'reg:squarederror',\n", 1408 | " 'nthread': -1,\n", 1409 | " 'tree_method': tree_method_applied,\n", 1410 | "\n", 1411 | " 'seed': RANDOM_SEED,\n", 1412 | " 'verbosity': 1,\n", 1413 | "}\n", 1414 | "\n", 1415 | "model_final = xgb.XGBRegressor(**xgb_params_final)\n", 1416 | "pipeline_final = 
Pipeline(steps=[\n", 1417 | " ('preprocessor', preprocessor),\n", 1418 | " ('model', model_final)\n", 1419 | "])\n", 1420 | "pipeline_final.fit(X_train, y_train)\n", 1421 | "\n", 1422 | "print('Training finished :)')" 1423 | ] 1424 | }, 1425 | { 1426 | "cell_type": "code", 1427 | "execution_count": 33, 1428 | "metadata": {}, 1429 | "outputs": [ 1430 | { 1431 | "name": "stdout", 1432 | "output_type": "stream", 1433 | "text": [ 1434 | "MAE Score: 1200.553121658578\n" 1435 | ] 1436 | } 1437 | ], 1438 | "source": [ 1439 | "# test the final model on the same subsample validation bunch\n", 1440 | "y_pred_subsample = pipeline_final.predict(X_valid_subsample)\n", 1441 | "score_mae_subsample_final = mean_absolute_error(np.expm1(y_valid_subsample), np.expm1(y_pred_subsample))\n", 1442 | "print('MAE Score:', score_mae_subsample_final)" 1443 | ] 1444 | }, 1445 | { 1446 | "cell_type": "markdown", 1447 | "metadata": {}, 1448 | "source": [ 1449 | "As expected, training on all data improved the score on the subsampled validation data. Keep in mind that this score is optimistic if the validation subsample is itself part of the full training data, since the model has then already seen those rows." 
1450 | ] 1451 | }, 1452 | { 1453 | "cell_type": "markdown", 1454 | "metadata": {}, 1455 | "source": [ 1456 | "### Saving Model" 1457 | ] 1458 | }, 1459 | { 1460 | "cell_type": "code", 1461 | "execution_count": 35, 1462 | "metadata": {}, 1463 | "outputs": [ 1464 | { 1465 | "name": "stdout", 1466 | "output_type": "stream", 1467 | "text": [ 1468 | "Model saved successfully.\n" 1469 | ] 1470 | } 1471 | ], 1472 | "source": [ 1473 | "MODEL_PATH = './scripts/model/'\n", 1474 | "\n", 1475 | "with open(MODEL_PATH+'model.bin', 'wb') as output_file:\n", 1476 | " pickle.dump((pipeline_final), output_file)\n", 1477 | " print('Model saved successfully.')\n", 1478 | "output_file.close()" 1479 | ] 1480 | } 1481 | ], 1482 | "metadata": { 1483 | "interpreter": { 1484 | "hash": "97ae724bfa85b9b34df7982b8bb8c7216f435b92902d749e4263f71162bea840" 1485 | }, 1486 | "kernelspec": { 1487 | "display_name": "Python 3.7.12 64-bit ('base': conda)", 1488 | "name": "python3" 1489 | }, 1490 | "language_info": { 1491 | "codemirror_mode": { 1492 | "name": "ipython", 1493 | "version": 3 1494 | }, 1495 | "file_extension": ".py", 1496 | "mimetype": "text/x-python", 1497 | "name": "python", 1498 | "nbconvert_exporter": "python", 1499 | "pygments_lexer": "ipython3", 1500 | "version": "3.7.12" 1501 | }, 1502 | "orig_nbformat": 4 1503 | }, 1504 | "nbformat": 4, 1505 | "nbformat_minor": 2 1506 | } 1507 | -------------------------------------------------------------------------------- /Step 4. Scripting the Process.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Scripting the Process\n", 8 | "Given that we have successfully trained a model, we need to look ahead and consider the possibility of dataset updates and the necessity to retrain a new model based on new data.
\n", 9 | "\n", 10 | "At first sight, this might look like overkill, and you may ask: Why do I need to do this? If I really need to retrain, I can just rerun the data cleaning & model training in the notebooks I already have.
\n", 11 | "👉 __There's a good reason to create an independent script.__ Real-life projects need to be agile, and schedules are always tight. Suppose you receive weekly data updates and would like to update your model on a weekly basis as well, revisiting the data for an EDA review every couple of months. Wouldn't it be nice to be able to change the data preprocessing steps easily and just run a cron job from time to time to retrain a new model on new data, all in an automated process?\n", 12 | "\n", 13 | "Having scripts to automate our ML pipeline is essential in real-life projects. At a minimum, two scripts are needed to achieve this goal: one for cleaning the new dataset based on our EDA insights, and another for training the model. I have created self-explanatory scripts in the __*scripts*__ folder of this repo. Make sure to check them out." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## Creating a Unique Run-time Identifier\n", 21 | "A unique identifier for saved files is necessary to avoid accidentally overwriting previous versions of them. 
I like to use a combination of the current datetime and a random string.\n", 22 | "The few simple lines below create an almost-unique identifier:" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "metadata": {}, 29 | "outputs": [ 30 | { 31 | "name": "stdout", 32 | "output_type": "stream", 33 | "text": [ 34 | "2021.11.04-124459-DWTU\n" 35 | ] 36 | } 37 | ], 38 | "source": [ 39 | "import string\n", 40 | "import random\n", 41 | "from datetime import datetime\n", 42 | "\n", 43 | "unique_run_identifier = ''.join([\n", 44 | " datetime.now().strftime('%Y.%m.%d-%H%M%S-'),\n", 45 | " ''.join(random.choices(string.ascii_uppercase + string.digits, k=4))\n", 46 | "])\n", 47 | "\n", 48 | "print(unique_run_identifier)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "The identifier has the form __\"Date-Time-RandomString\"__.\n", 56 | "\n", 57 | "The nice thing about this identifier is that it embeds both the date and the time, so you can tell exactly when it was generated.
\n", 58 | "You can also increase the _\"k\"_ argument of __*random.choices()*__ to generate a longer random string." 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [] 65 | } 66 | ], 67 | "metadata": { 68 | "interpreter": { 69 | "hash": "97ae724bfa85b9b34df7982b8bb8c7216f435b92902d749e4263f71162bea840" 70 | }, 71 | "kernelspec": { 72 | "display_name": "Python 3.7.12 64-bit ('base': conda)", 73 | "name": "python3" 74 | }, 75 | "language_info": { 76 | "codemirror_mode": { 77 | "name": "ipython", 78 | "version": 3 79 | }, 80 | "file_extension": ".py", 81 | "mimetype": "text/x-python", 82 | "name": "python", 83 | "nbconvert_exporter": "python", 84 | "pygments_lexer": "ipython3", 85 | "version": "3.7.12" 86 | }, 87 | "orig_nbformat": 4 88 | }, 89 | "nbformat": 4, 90 | "nbformat_minor": 2 91 | } 92 | -------------------------------------------------------------------------------- /Step 5. Model Deployment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Model Deployment\n", 8 | "Quite a journey to get here! Do you remember that at the end of step 3 (model training) we saved the trained model as a bin file?
\n", 9 | "Well, now it's time to serve the model to the people who have been waiting for a way to request predictions 😊. __REST APIs__ are the most common way to do this.\n", 10 | "\n", 11 | "In this notebook, we'll review how to do so in the simplest terms." 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## Web Service API\n", 19 | "Our prediction model needs to be wrapped in a web service that is available to the outside world. When it comes to creating API web services, there are plenty of options. [Flask](https://palletsprojects.com/p/flask/) is a micro web application framework that makes creating web APIs a breeze. [FastAPI](https://fastapi.tiangolo.com/) is another fantastic option.\n", 20 | "\n", 21 | "The API wrapper performs the following three tasks:\n", 22 | "- Receive inputs as a JSON string via the POST method\n", 23 | "- Load the prediction model from a saved file\n", 24 | "- Run prediction on the input data and return the result in the response\n", 25 | "\n", 26 | "This is the code that achieves the above goals using Flask:\n", 27 | "\n", 28 | "```python\n", 29 | "# required library imports\n", 30 | "import json\n", 31 | "import pickle\n", 32 | "from pandas import DataFrame\n", 33 | "from numpy import expm1\n", 34 | "from flask import Flask, request\n", 35 | "from waitress import serve\n", 36 | "\n", 37 | "# path to the folder containing our model\n", 38 | "MODEL_PATH = './model/'\n", 39 | "\n", 40 | "\n", 41 | "def load_model(path_to_model):\n", 42 | " with open(path_to_model, 'rb') as model_file:\n", 43 | " model = pickle.load(model_file)\n", 44 | " return model\n", 45 | "\n", 46 | "\n", 47 | "def get_prediction(model, input_data):\n", 48 | " # input data is a json string\n", 49 | " # we have to convert it back to a pandas dataframe object\n", 50 | " # scikit-learn's ColumnTransformer only accepts an array or pandas DataFrame\n", 51 | " dict_obj = json.loads(input_data)\n", 52 | " X 
= DataFrame.from_dict(dict_obj)\n", 53 | " y_pred = expm1(model.predict(X)[0])\n", 54 | "\n", 55 | " # compose result dictionary and return it as a json string\n", 56 | " result = {\n", 57 | " 'prediction': float(y_pred),\n", 58 | " }\n", 59 | " return json.dumps(result)\n", 60 | "\n", 61 | "\n", 62 | "app = Flask('prediction_app')\n", 63 | "@app.route('/predict', methods=['POST'])\n", 64 | "def predict():\n", 65 | " input_json = request.get_json()\n", 66 | " model = load_model(MODEL_PATH+'model.bin')\n", 67 | " prediction_result = get_prediction(model, input_json)\n", 68 | "\n", 69 | " return prediction_result\n", 70 | "\n", 71 | "\n", 72 | "\n", 73 | "if __name__ == \"__main__\":\n", 74 | " serve(app, host='0.0.0.0', port=9696)\n", 75 | "```" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "There are three functions in our prediction file.
\n", 83 | "- The first one, __*load_model()*__, loads the model saved as a pickle file. As a side note, remember that our saved model bundles the data preprocessing pipeline with it.\n", 84 | "- The second function, __*get_prediction()*__, runs the prediction using a model and the input data. The __input data must have the same columns__ as our cleaned dataset, meaning it must contain exactly the features we used for training.\n", 85 | "- The last one, __*predict()*__, is the handler for our API endpoint, managed by Flask. Notice the __*route()*__ decorator and its argument above the function definition. The route can be any name you like; it is what the outside world will use as the API's entry point.\n", 86 | "\n", 87 | "The last line uses the _waitress_ web server to run the Flask app. Waitress is a lightweight WSGI server that has no dependencies except those in the Python standard library. It runs on both Windows & UNIX-based environments.\n", 88 | "The code above is available in the ``scripts/inference.py`` file. We need to run this file in order to make our API accessible.\n", 89 | "\n", 90 | "All you need to do is run the file from a shell inside the __*scripts*__ folder. Type one of the following commands in your command line inside this folder:\n", 91 | "\n", 92 | "```shell\n", 93 | "python inference.py\n", 94 | "```\n", 95 | "\n", 96 | "or\n", 97 | "\n", 98 | "```shell\n", 99 | "python -m inference\n", 100 | "```\n", 101 | "\n", 102 | "That's it. Now our Flask app is running, waiting to serve requests on port 9696 on all network interfaces of our machine. Note that in the last line of ``inference.py``, which calls the __*serve()*__ function, you can pass any port number as the port parameter, as long as it is not used by another app or service running on your machine." 
103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "## Testing an API Call\n", 110 | "Now that we have our prediction web service running on our machine, let's send an API request to our API endpoint and see what we get as a result.
\n", 111 | "We'll need to bundle the API request with some input data. The most accessible data we have is our test dataset. Let's load this and extract a random JSON data point out of it:" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 2, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "import pandas as pd\n", 121 | "\n", 122 | "DATA_PATH = './scripts/data/'\n", 123 | "test_data = pd.read_csv(DATA_PATH+'test_cleaned.csv.gz')" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 3, 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "data": { 133 | "text/plain": [ 134 | "'{\"cat1\":{\"5410\":\"A\"},\"cat2\":{\"5410\":\"A\"},\"cat4\":{\"5410\":\"A\"},\"cat5\":{\"5410\":\"B\"},\"cat6\":{\"5410\":\"A\"},\"cat8\":{\"5410\":\"A\"},\"cat9\":{\"5410\":\"A\"},\"cat10\":{\"5410\":\"A\"},\"cat11\":{\"5410\":\"A\"},\"cat12\":{\"5410\":\"A\"},\"cat13\":{\"5410\":\"A\"},\"cat14\":{\"5410\":\"A\"},\"cat16\":{\"5410\":\"A\"},\"cat17\":{\"5410\":\"A\"},\"cat18\":{\"5410\":\"A\"},\"cat19\":{\"5410\":\"A\"},\"cat20\":{\"5410\":\"A\"},\"cat21\":{\"5410\":\"A\"},\"cat23\":{\"5410\":\"A\"},\"cat24\":{\"5410\":\"A\"},\"cat25\":{\"5410\":\"A\"},\"cat26\":{\"5410\":\"A\"},\"cat27\":{\"5410\":\"A\"},\"cat28\":{\"5410\":\"A\"},\"cat29\":{\"5410\":\"A\"},\"cat30\":{\"5410\":\"A\"},\"cat31\":{\"5410\":\"A\"},\"cat32\":{\"5410\":\"A\"},\"cat33\":{\"5410\":\"A\"},\"cat34\":{\"5410\":\"A\"},\"cat35\":{\"5410\":\"A\"},\"cat36\":{\"5410\":\"A\"},\"cat37\":{\"5410\":\"A\"},\"cat38\":{\"5410\":\"B\"},\"cat39\":{\"5410\":\"A\"},\"cat40\":{\"5410\":\"A\"},\"cat41\":{\"5410\":\"A\"},\"cat42\":{\"5410\":\"A\"},\"cat43\":{\"5410\":\"A\"},\"cat44\":{\"5410\":\"A\"},\"cat45\":{\"5410\":\"A\"},\"cat46\":{\"5410\":\"A\"},\"cat47\":{\"5410\":\"A\"},\"cat48\":{\"5410\":\"A\"},\"cat49\":{\"5410\":\"A\"},\"cat50\":{\"5410\":\"A\"},\"cat51\":{\"5410\":\"A\"},\"cat52\":{\"5410\":\"A\"},\"cat53\":{\"5410\":\"A\"},\"cat54\":{\"5410\":\"A\"},\
"cat57\":{\"5410\":\"A\"},\"cat58\":{\"5410\":\"A\"},\"cat59\":{\"5410\":\"A\"},\"cat60\":{\"5410\":\"A\"},\"cat61\":{\"5410\":\"A\"},\"cat65\":{\"5410\":\"A\"},\"cat66\":{\"5410\":\"A\"},\"cat67\":{\"5410\":\"A\"},\"cat69\":{\"5410\":\"A\"},\"cat71\":{\"5410\":\"A\"},\"cat72\":{\"5410\":\"A\"},\"cat73\":{\"5410\":\"A\"},\"cat74\":{\"5410\":\"A\"},\"cat75\":{\"5410\":\"A\"},\"cat76\":{\"5410\":\"A\"},\"cat77\":{\"5410\":\"D\"},\"cat78\":{\"5410\":\"B\"},\"cat79\":{\"5410\":\"B\"},\"cat80\":{\"5410\":\"D\"},\"cat81\":{\"5410\":\"C\"},\"cat82\":{\"5410\":\"B\"},\"cat83\":{\"5410\":\"B\"},\"cat84\":{\"5410\":\"C\"},\"cat85\":{\"5410\":\"B\"},\"cat86\":{\"5410\":\"D\"},\"cat87\":{\"5410\":\"B\"},\"cat88\":{\"5410\":\"D\"},\"cat89\":{\"5410\":\"A\"},\"cat90\":{\"5410\":\"A\"},\"cat91\":{\"5410\":\"B\"},\"cat92\":{\"5410\":\"H\"},\"cat93\":{\"5410\":\"D\"},\"cat94\":{\"5410\":\"C\"},\"cat95\":{\"5410\":\"C\"},\"cat96\":{\"5410\":\"E\"},\"cat97\":{\"5410\":\"A\"},\"cat98\":{\"5410\":\"C\"},\"cat99\":{\"5410\":\"T\"},\"cat100\":{\"5410\":\"F\"},\"cat101\":{\"5410\":\"A\"},\"cat102\":{\"5410\":\"A\"},\"cat103\":{\"5410\":\"B\"},\"cat104\":{\"5410\":\"K\"},\"cat105\":{\"5410\":\"F\"},\"cat106\":{\"5410\":\"E\"},\"cat107\":{\"5410\":\"H\"},\"cat108\":{\"5410\":\"G\"},\"cat109\":{\"5410\":\"BI\"},\"cat110\":{\"5410\":\"CS\"},\"cat111\":{\"5410\":\"A\"},\"cat112\":{\"5410\":\"J\"},\"cat113\":{\"5410\":\"L\"},\"cat114\":{\"5410\":\"A\"},\"cat115\":{\"5410\":\"O\"},\"cat116\":{\"5410\":\"LM\"},\"cont1\":{\"5410\":0.8464},\"cont2\":{\"5410\":0.681761},\"cont3\":{\"5410\":0.336963},\"cont4\":{\"5410\":0.705434},\"cont5\":{\"5410\":0.281143},\"cont6\":{\"5410\":0.875461},\"cont7\":{\"5410\":0.589016},\"cont8\":{\"5410\":0.96202},\"cont9\":{\"5410\":0.86299},\"cont10\":{\"5410\":0.8351},\"cont11\":{\"5410\":0.82184},\"cont13\":{\"5410\":0.808455},\"cont14\":{\"5410\":0.389813}}'" 135 | ] 136 | }, 137 | "execution_count": 3, 138 | "metadata": {}, 139 | "output_type": "execute_result" 
140 | } 141 | ], 142 | "source": [ 143 | "# choose a sample row\n", 144 | "# this should have all the columns except the 'id' column, which was not used for training\n", 145 | "sample = test_data[test_data.columns[~test_data.columns.isin(['id'])]].sample(n=1, random_state=1024).to_json()\n", 146 | "sample" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "It's time to send an outgoing API request to our prediction endpoint and see the results:" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 4, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "name": "stdout", 163 | "output_type": "stream", 164 | "text": [ 165 | "{'prediction': 2284.148193359375}\n" 166 | ] 167 | } 168 | ], 169 | "source": [ 170 | "import requests\n", 171 | "\n", 172 | "api_url = \"http://localhost:9696/predict\"\n", 173 | "\n", 174 | "\n", 175 | "api_response = requests.post(url=api_url, json=sample).json()\n", 176 | "print(api_response)" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "There you go, we have our API endpoint alive and well!" 184 | ] 185 | } 186 | ], 187 | "metadata": { 188 | "interpreter": { 189 | "hash": "97ae724bfa85b9b34df7982b8bb8c7216f435b92902d749e4263f71162bea840" 190 | }, 191 | "kernelspec": { 192 | "display_name": "Python 3.7.12 64-bit ('base': conda)", 193 | "name": "python3" 194 | }, 195 | "language_info": { 196 | "codemirror_mode": { 197 | "name": "ipython", 198 | "version": 3 199 | }, 200 | "file_extension": ".py", 201 | "mimetype": "text/x-python", 202 | "name": "python", 203 | "nbconvert_exporter": "python", 204 | "pygments_lexer": "ipython3", 205 | "version": "3.7.12" 206 | }, 207 | "orig_nbformat": 4 208 | }, 209 | "nbformat": 4, 210 | "nbformat_minor": 2 211 | } 212 | -------------------------------------------------------------------------------- /Step 6. 
Containerization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Containerization\n", 8 | "\n", 9 | "## Introduction\n", 10 | "Our story began with exploratory data analysis and continued with data cleaning, feature transformations and model training. So far, everything we have done has taken place on our local machine (or cloud-based workstation, if you will). Suppose we are satisfied with our model and the feedback on it is positive. This is the point where we want to make the model available to the outside world and make it easier for our potential customers to access it. We already experimented with model deployment to facilitate this. But what can we do to make the process even more efficient?\n", 11 | "Say hello to packaging and shipping. Packaging everything together gives us more convenient deployment and lets us scale up model instances more quickly when needed. You may have already heard about virtualization and environments. There are different methodologies and various tools that make the struggle less painful. Some well-known products are [Vagrant](https://www.vagrantup.com), [Kubernetes](https://kubernetes.io) and [Docker](https://www.docker.com). __Docker__ is the tool we'll be using.\n", 12 | "\n", 13 | "

\n", 14 | " \n", 15 | "

" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "Docker is pretty simple and straightforward. All we need to do is choose a base image and explicitly state which libraries should be installed and which files should be present (scripts, models, etc). We can even specify a service or script to run when an instance starts and expose the listening port (which means near-instant access to our inference model API). We write the configuration once, then build a Docker image from it; from then on we can ship that image as often as we like and run as many instances of it (known as containers) as we need by pulling and replicating it. Isn't that cool? And you know what's even cooler? All of these steps can be written as instructions in a single file known as a ``Dockerfile`` and executed with a single docker build call. In this notebook, we'll examine these instructions one by one, create the Dockerfile, build the Docker image and start a container instance step by step.\n", 23 | "\n", 24 | "

\n", 25 | " \n", 26 | "

\n", 27 | "\n", 28 | "With all that said, I won't be getting into installing Docker itself. There are extensive guides & resources out there that explain how to install it and [get started](https://www.docker.com/get-started) on your local machine. Docker supports Linux, macOS and Windows, and installing it has become much easier than it used to be. Enough talking; assuming you have already set up Docker, let's roll up our sleeves and begin 🤩." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "## Choosing the Base Image\n", 36 | "[Docker Hub](https://hub.docker.com) is the official home for millions of ready-to-use Docker images. You can find almost anything you like there: different operating systems and software solutions in multiple versions (tags), all available at your fingertips.\n", 37 | "\n", 38 | "There are a couple of points to bear in mind though:\n", 39 | "- Some operating systems are better suited & optimized for specific tasks. Do your own research before choosing one.\n", 40 | "- Always prefer official images to non-official ones, especially if you're serious about pushing your solution to production.\n", 41 | "- Depending on the OS you choose (the Linux distro, in our case), the built-in package manager varies and you'll need to adjust the shell commands for adding and installing libraries. There may even be distro-specific considerations when installing a particular package.\n", 42 | "\n", 43 | "For our use case, we'll need a Linux distribution, and I decided to go with [clearlinux](https://hub.docker.com/_/clearlinux), a [high-performance open-source Linux distribution by Intel](https://clearlinux.org). Please note that we can easily find an OS image with Python pre-installed on Docker Hub. 
Clearlinux is no exception, and the vendor provides [docker images with different python versions](https://hub.docker.com/r/clearlinux/python) running on top of the official clearlinux base image.\n", 44 | "\n", 45 | "The Dockerfile instruction that selects the base image with our desired python3 version is:\n", 46 | "```Dockerfile\n", 47 | "FROM clearlinux/python:3.9\n", 48 | "```" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## Copying Inference Script & Model Files\n", 56 | "Next, we'll copy the inference script & model file into the Docker image we're baking. Let's first make a directory in the image's root folder and name it __*'app'*__. This folder will house the files needed for inference. Here is the instruction:\n", 57 | "```Dockerfile\n", 58 | "# create a new folder named 'app' (if it doesn't exist) in root and set it as the current working directory\n", 59 | "WORKDIR /app\n", 60 | "```\n", 61 | "\n", 62 | "Ok, here comes the copy stage. If we create our Dockerfile in the same directory as our inference script, this is how the COPY instructions should look:\n", 63 | "```Dockerfile\n", 64 | "# copy \"inference.py\" & \"requirements.txt\" => the last argument is the destination, which is the current working directory\n", 65 | "COPY [\"inference.py\", \"requirements.txt\", \"./\"]\n", 66 | "\n", 67 | "# copy the model file to the working directory, keeping the same folder structure => the last argument is the destination inside the current directory; the folder will be created if it doesn't exist\n", 68 | "COPY [\"model/model.bin\", \"model/\"]\n", 69 | "```" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "## Installing Required Libraries\n", 77 | "\n", 78 | "We already have pip available in our Docker image, and that gives us the ability to install the additional packages we require. 
We can either copy the requirements.txt file as we did above (it is found inside the __scripts__ folder of this repo) and run __*pip install -r requirements.txt*__ in a RUN instruction, or simply write a RUN pip install instruction that lists all the required package names:\n", 79 | "```Dockerfile\n", 80 | "RUN pip install -r requirements.txt\n", 81 | "# OR\n", 82 | "RUN pip install numpy pandas scikit-learn xgboost flask waitress\n", 83 | "```\n", 84 | "\n", 85 | "Those are all the libraries we need to include for an inference run." 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "## Exposing Port and Specifying Inference App Entry Point\n", 93 | "With our files copied and the required libraries installed, we can move on to the next step. We need to specify the inference file and the port our web server listens on (defined inside the ``inference.py`` script) for receiving requests. After all, no inference model does wonders without exposure 😉.\n", 94 | "```Dockerfile\n", 95 | "# expose the port used by the web server (defined in the inference.py script)\n", 96 | "EXPOSE 9696\n", 97 | "\n", 98 | "# run the \"python inference.py\" command as soon as the instance is up\n", 99 | "ENTRYPOINT [\"python\", \"inference.py\"]\n", 100 | "```" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "## Dockerfile: Putting It All Together\n", 108 | "Let's sum up everything we have gathered so far inside a ``Dockerfile``:\n", 109 | "\n", 110 | "```Dockerfile\n", 111 | "FROM clearlinux/python:3.9\n", 112 | "WORKDIR /app\n", 113 | "COPY [\"inference.py\", \"requirements.txt\", \"./\"]\n", 114 | "COPY [\"model/model.bin\", \"model/\"]\n", 115 | "RUN pip install -r requirements.txt\n", 116 | "EXPOSE 9696\n", 117 | "ENTRYPOINT [\"python\", \"inference.py\"]\n", 118 | "```\n", 119 | "\n", 120 | "The ``Dockerfile`` with the above lines can be found inside the __scripts__ folder of this repository." 
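Every COPY source in the Dockerfile above is resolved relative to the build context, so a missing ``model/model.bin`` only shows up once the build is already under way. The following is a minimal sketch, not part of this repo, of how that could be checked up front in Python. The helper names ``copy_sources`` and ``missing_sources`` are hypothetical, and the parser only understands the JSON-array (exec) form of COPY used here; shell-form COPY lines and flags such as ``--chown`` are deliberately ignored.

```python
import json
from pathlib import Path


def copy_sources(dockerfile_text):
    """Collect the source paths of JSON-array (exec form) COPY instructions."""
    sources = []
    for line in dockerfile_text.splitlines():
        line = line.strip()
        if not line.upper().startswith("COPY "):
            continue
        argument = line[len("COPY "):].strip()
        if not argument.startswith("["):
            continue  # shell-form COPY is not handled in this sketch
        # the last array element is the destination, everything before it is a source
        *srcs, _destination = json.loads(argument)
        sources.extend(srcs)
    return sources


def missing_sources(dockerfile_text, context_dir):
    """Return the COPY sources that do not exist inside the build context folder."""
    context = Path(context_dir)
    return [src for src in copy_sources(dockerfile_text) if not (context / src).exists()]
```

Assuming the notebook is run from the repo root, ``missing_sources(Path('scripts/Dockerfile').read_text(), 'scripts')`` should come back as an empty list when all the files the image needs are in place.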
121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "## Building the Docker Image\n", 128 | "Now that the Dockerfile is ready, we can run the build command and assemble our custom image accordingly. Navigate to the __scripts__ folder of this repo and run the following command in your terminal to build the custom inference Docker image:\n", 129 | "\n", 130 | "```shell\n", 131 | "docker build -t claims-severity .\n", 132 | "```\n", 133 | "\n", 134 | "Notes:\n", 135 | "- Be careful not to miss the dot at the end of the above command; it tells Docker to use the current folder as the build context.\n", 136 | "- You can replace the ``claims-severity`` name with anything you like.\n", 137 | "\n", 138 | "The build takes a while to complete, depending on your connection speed. After it is finished, you should be able to see your new Docker image among the others by listing them all:\n", 139 | "\n", 140 | "```shell\n", 141 | "docker image ls -a\n", 142 | "```" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "## Running the Inference Instance\n", 150 | "Let's create an instance and run it on our machine. Type this in your terminal and hit ``Enter``:\n", 151 | "\n", 152 | "```shell\n", 153 | "docker run -it --rm -p 9000:9696 claims-severity\n", 154 | "```\n", 155 | "\n", 156 | "Did you notice the two different port numbers in our command? I made them different deliberately to distinguish the host port from the container port. The first port number (9000) belongs to the local host machine, and the second one (9696) is the container port on which the instance's web server listens for requests (it should match the port number we defined in the ``inference.py`` script).\n", 157 | "\n", 158 | "That's it! Our inference instance is ready to serve." 
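A container can take a moment to come up, so a script that sends requests right after ``docker run`` may want to wait for the published host port first. Here is a minimal sketch using only the Python standard library; ``wait_for_port`` is a hypothetical helper name, not something shipped with this repo. It is a coarse check: a successful TCP connection only shows that something is listening on the host port, not that the model inside the container has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires.

    Returns True as soon as a connection is accepted, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # nothing is accepting connections yet; give up once the deadline passes
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the port mapping used above, a call such as ``wait_for_port('localhost', 9000)`` would target the host side of ``-p 9000:9696``.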
159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "## Testing the Docker Instance: Sending an Inference Request to Prediction Endpoint\n", 166 | "If everything went as described, you should be able to test-run an inference request against the running instance." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 1, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "# choosing a random sample from test data\n", 176 | "\n", 177 | "import pandas as pd\n", 178 | "\n", 179 | "DATA_PATH = './scripts/data/'\n", 180 | "test_data = pd.read_csv(DATA_PATH+'test_cleaned.csv.gz')\n", 181 | "\n", 182 | "# choose a sample row\n", 183 | "# this should have all the columns except the 'id' column, which was not used for training\n", 184 | "sample = test_data[test_data.columns[~test_data.columns.isin(['id'])]].sample(n=1, random_state=1024).to_json()" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 2, 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "{'prediction': 2284.148193359375}\n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "# send an outgoing API request to our prediction endpoint and print the results\n", 202 | "\n", 203 | "import requests\n", 204 | "\n", 205 | "api_url = \"http://localhost:9000/predict\"\n", 206 | "\n", 207 | "api_response = requests.post(url=api_url, json=sample).json()\n", 208 | "print(api_response)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "Congrats, it was a success! Our inference instance works flawlessly :)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "## Important Note\n", 223 | "As you may have noticed, we haven't discussed security considerations at all. 
The main reason is that I wanted to keep things as simple as possible for a start. You'll definitely need some more steps to secure the container properly (and probably to limit API inference calls to authorized users only). But that's another story for another time, and we won't be getting into that for now." 224 | ] 225 | } 226 | ], 227 | "metadata": { 228 | "interpreter": { 229 | "hash": "97ae724bfa85b9b34df7982b8bb8c7216f435b92902d749e4263f71162bea840" 230 | }, 231 | "kernelspec": { 232 | "display_name": "Python 3.7.12 64-bit ('base': conda)", 233 | "language": "python", 234 | "name": "python3" 235 | }, 236 | "language_info": { 237 | "codemirror_mode": { 238 | "name": "ipython", 239 | "version": 3 240 | }, 241 | "file_extension": ".py", 242 | "mimetype": "text/x-python", 243 | "name": "python", 244 | "nbconvert_exporter": "python", 245 | "pygments_lexer": "ipython3", 246 | "version": "3.9.12" 247 | }, 248 | "orig_nbformat": 4 249 | }, 250 | "nbformat": 4, 251 | "nbformat_minor": 2 252 | } 253 | -------------------------------------------------------------------------------- /Step 7. Cloud Deployment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Cloud Deployment\n", 8 | "Soon.
\n", 9 | "This part is under revision; It's being cooked right now 🥣." 10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "language_info": { 15 | "name": "python" 16 | }, 17 | "orig_nbformat": 4 18 | }, 19 | "nbformat": 4, 20 | "nbformat_minor": 2 21 | } 22 | -------------------------------------------------------------------------------- /assets/allstate_banner-660x120.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hamedonline/ml-workflow/d021809601db5d4a6053b765df19e717e3fbb5a0/assets/allstate_banner-660x120.png -------------------------------------------------------------------------------- /assets/docker-process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hamedonline/ml-workflow/d021809601db5d4a6053b765df19e717e3fbb5a0/assets/docker-process.png -------------------------------------------------------------------------------- /assets/docker.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hamedonline/ml-workflow/d021809601db5d4a6053b765df19e717e3fbb5a0/assets/docker.png -------------------------------------------------------------------------------- /assets/ml-iceberg.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hamedonline/ml-workflow/d021809601db5d4a6053b765df19e717e3fbb5a0/assets/ml-iceberg.jpg -------------------------------------------------------------------------------- /assets/tweet_eda.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hamedonline/ml-workflow/d021809601db5d4a6053b765df19e717e3fbb5a0/assets/tweet_eda.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | 
xgboost<1.6.0 2 | pandas<1.4.0 3 | jupyterlab 4 | numpy 5 | seaborn 6 | matplotlib 7 | sklearn 8 | plotly 9 | flask 10 | waitress -------------------------------------------------------------------------------- /scripts/Dockerfile: -------------------------------------------------------------------------------- 1 | # docker config for a stand-alone inference script instance 2 | 3 | FROM clearlinux/python:3.9 4 | WORKDIR /app 5 | COPY ["inference.py", "requirements.txt", "./"] 6 | COPY ["model/model.bin", "model/"] 7 | RUN pip install -r requirements.txt 8 | EXPOSE 9696 9 | ENTRYPOINT ["python", "inference.py"] -------------------------------------------------------------------------------- /scripts/clean_data.py: -------------------------------------------------------------------------------- 1 | # required library imports 2 | 3 | import os 4 | import string 5 | import random 6 | import pandas as pd 7 | 8 | from datetime import datetime 9 | 10 | 11 | # create fresh data files after cleaning 12 | data_file_saving = True 13 | 14 | # define relative data path (according the current path of this file) 15 | dirname = os.path.dirname(__file__) 16 | DATA_PATH = dirname+'/data/' 17 | 18 | 19 | # create a unique identifier for run 20 | unique_run_identifier = ''.join([ 21 | datetime.now().strftime('%Y.%m.%d-%H%M%S-'), 22 | ''.join(random.choices(string.ascii_uppercase + string.digits, k=4)) 23 | ]) 24 | 25 | 26 | if __name__ == "__main__": 27 | print('\nReading data...') 28 | # read data into pandas dataframes 29 | df_train_full = pd.read_csv(DATA_PATH+'train.csv.gz') 30 | df_test = pd.read_csv(DATA_PATH+'test.csv.gz') 31 | 32 | 33 | print('\nProcessing data...') 34 | 35 | # specify feature groups & target column 36 | features_numerical = [column for column in df_train_full if column.startswith('cont')] 37 | features_categorical = [column for column in df_train_full if column.startswith('cat')] 38 | column_target = 'loss' 39 | 40 | # drop features we spotted as not useful 
during EDA 41 | very_highly_skewed = ['cat15', 'cat22', 'cat55', 'cat56', 'cat62', 'cat63', 'cat64', 'cat68', 'cat70'] 42 | highly_correlated_numerical = ['cont12'] 43 | highly_correlated_categorical = ['cat3', 'cat7'] 44 | 45 | features_to_drop = highly_correlated_numerical + highly_correlated_categorical + very_highly_skewed 46 | 47 | # drop candidate features from both train & test dataframes 48 | df_train_full = df_train_full.drop(features_to_drop, axis=1).reset_index(drop=True) 49 | df_test = df_test.drop(features_to_drop, axis=1).reset_index(drop=True) 50 | 51 | 52 | # remove outlier samples (identified on EDA stage) 53 | df_train_full = df_train_full.drop(df_train_full[df_train_full['cont9'] < 0.05742].index).reset_index(drop=True) 54 | df_train_full = df_train_full.drop(df_train_full[df_train_full['cont10'] > 0.95643].index).reset_index(drop=True) 55 | 56 | print('\nData processing completed successfully.') 57 | 58 | 59 | if data_file_saving: 60 | print('\nCompressing & saving dataset files...') 61 | # save cleaned data 62 | new_train_file_name_path = f'{DATA_PATH}train_cleaned({unique_run_identifier}).csv.gz' 63 | new_test_file_name_path = f'{DATA_PATH}test_cleaned({unique_run_identifier}).csv.gz' 64 | df_train_full.to_csv(new_train_file_name_path, compression='gzip', index=False) 65 | df_test.to_csv(new_test_file_name_path, compression='gzip', index=False) 66 | print(f'\nDataset files saved successfully in >> \n{new_train_file_name_path} \n{new_test_file_name_path}') 67 | 68 | print('\n\nAll done :)') -------------------------------------------------------------------------------- /scripts/data/test.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hamedonline/ml-workflow/d021809601db5d4a6053b765df19e717e3fbb5a0/scripts/data/test.csv.gz -------------------------------------------------------------------------------- /scripts/data/test_cleaned.csv.gz: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/hamedonline/ml-workflow/d021809601db5d4a6053b765df19e717e3fbb5a0/scripts/data/test_cleaned.csv.gz -------------------------------------------------------------------------------- /scripts/data/train.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hamedonline/ml-workflow/d021809601db5d4a6053b765df19e717e3fbb5a0/scripts/data/train.csv.gz -------------------------------------------------------------------------------- /scripts/data/train_cleaned.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hamedonline/ml-workflow/d021809601db5d4a6053b765df19e717e3fbb5a0/scripts/data/train_cleaned.csv.gz -------------------------------------------------------------------------------- /scripts/inference.py: -------------------------------------------------------------------------------- 1 | # required library imports 2 | import json 3 | import pickle 4 | from pandas import DataFrame 5 | from numpy import expm1 6 | from flask import Flask, request 7 | from waitress import serve 8 | 9 | # path to the folder containing our model 10 | MODEL_PATH = './model/' 11 | # port number to run the web server on 12 | SERVER_PORT = 9696 13 | 14 | 15 | def load_model(path_to_model): 16 | with open(path_to_model, 'rb') as model_file: 17 | model = pickle.load(model_file) 18 | return model 19 | 20 | 21 | def get_prediction(model, input_data): 22 | # input data is a json string 23 | # we have to convert it back to a pandas dataframe object 24 | # scikit-learn's ColumnTransformer only accepts an array or pandas DataFrame 25 | dict_obj = json.loads(input_data) 26 | X = DataFrame.from_dict(dict_obj) 27 | y_pred = expm1(model.predict(X)[0]) 28 | 29 | # compose result dictionary and return it as a json string 30 | result = { 31 | 'prediction': 
float(y_pred), 32 | } 33 | return json.dumps(result) 34 | 35 | 36 | app = Flask('prediction_app') 37 | @app.route('/predict', methods=['POST']) 38 | def predict(): 39 | input_json = request.get_json() 40 | model = load_model(MODEL_PATH+'model.bin') 41 | prediction_result = get_prediction(model, input_json) 42 | 43 | return prediction_result 44 | 45 | 46 | 47 | if __name__ == "__main__": 48 | serve(app, host='0.0.0.0', port=SERVER_PORT) 49 | -------------------------------------------------------------------------------- /scripts/model/model.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hamedonline/ml-workflow/d021809601db5d4a6053b765df19e717e3fbb5a0/scripts/model/model.bin -------------------------------------------------------------------------------- /scripts/requirements.txt: -------------------------------------------------------------------------------- 1 | xgboost<1.6.0 2 | pandas<1.4.0 3 | sklearn 4 | numpy 5 | flask 6 | waitress -------------------------------------------------------------------------------- /scripts/train.py: -------------------------------------------------------------------------------- 1 | # required library imports & initial settings 2 | 3 | import os 4 | import string 5 | import random 6 | import pickle 7 | import numpy as np 8 | import pandas as pd 9 | import xgboost as xgb 10 | 11 | from datetime import datetime 12 | from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler, StandardScaler 13 | from sklearn.impute import SimpleImputer 14 | from sklearn.compose import ColumnTransformer 15 | from sklearn.pipeline import Pipeline 16 | 17 | 18 | # set seed value for reproducibility 19 | RANDOM_SEED = 1024 20 | 21 | # enable GPU utilization 22 | use_gpu = True # set this to False if your GPU is not compatible or a GPU error rises during training 23 | 24 | # create model bin file after training 25 | model_file_saving = True 26 | 27 | # # create 
test predictions csv file 28 | # test_predictions_saving = True 29 | 30 | # define relative data path (according the current path of this file) 31 | dirname = os.path.dirname(__file__) 32 | DATA_PATH = dirname+'/data/' 33 | MODEL_PATH = dirname+'/model/' 34 | MODEL_NAME_PREFIX = 'model_pipeline' 35 | 36 | 37 | # create a unique identifier for run 38 | unique_run_identifier = ''.join([ 39 | datetime.now().strftime('%Y.%m.%d-%H%M%S-'), 40 | ''.join(random.choices(string.ascii_uppercase + string.digits, k=4)) 41 | ]) 42 | 43 | 44 | # logarithmic transform function 45 | def log_transform(value): 46 | return np.log1p(value) 47 | 48 | 49 | if __name__ == "__main__": 50 | print('\nReading data...') 51 | # read data into pandas dataframes 52 | df_train_full = pd.read_csv(DATA_PATH+'train_cleaned.csv.gz') 53 | df_test = pd.read_csv(DATA_PATH+'test_cleaned.csv.gz') 54 | name_of_target_column = 'loss' 55 | name_of_target_column_transformed = name_of_target_column+'_transformed' 56 | 57 | # specify feature groups & target column 58 | features_numerical = [column for column in df_train_full if column.startswith('cont')] 59 | features_categorical = [column for column in df_train_full if column.startswith('cat')] 60 | features_all = features_numerical + features_categorical 61 | 62 | print('\nProcessing data...') 63 | 64 | # create a new logarithmically transformed target column 65 | df_train_full[name_of_target_column_transformed] = df_train_full.apply( 66 | lambda row: log_transform(row[name_of_target_column]), axis=1) 67 | 68 | # let's shuffle the whole dataframe 69 | df_train_full = df_train_full.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True) 70 | # second shuffle with an exponential random seed :D 71 | df_train_full = df_train_full.sample(frac=1, random_state=(RANDOM_SEED**2)).reset_index(drop=True) 72 | 73 | # preparing requirements for numerical features 74 | gaussian_like = ['cont1', 'cont2', 'cont3', 'cont6', 'cont7', 'cont9', 'cont11', 'cont12'] 75 | 
non_gaussian_like = ['cont4', 'cont5', 'cont8', 'cont10', 'cont13', 'cont14'] 76 | features_numerical_to_normalize = [column for column in gaussian_like if column in df_train_full.columns.to_list()] 77 | features_numerical_to_standardize = [column for column in non_gaussian_like if column in df_train_full.columns.to_list()] 78 | 79 | # preparing requirements for categorical features 80 | features_categorical_to_ordinal = list() 81 | features_categorical_to_onehot = list() 82 | for column, variety in df_train_full[features_categorical].nunique().iteritems(): 83 | if variety < 10: features_categorical_to_onehot.append(column) 84 | else: features_categorical_to_ordinal.append(column) 85 | 86 | # create transform pipeline for numerical features 87 | transformer_numerical = Pipeline(steps=[ 88 | ('imputer', SimpleImputer(strategy='mean')), 89 | ('normalizer', MinMaxScaler()), 90 | ('standardizer', StandardScaler()), 91 | ]) 92 | 93 | # create transform pipelines for categorical features 94 | transformer_categorical_1 = Pipeline(steps=[ 95 | ('imputer', SimpleImputer(strategy='most_frequent')), 96 | ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)), 97 | ('normal', MinMaxScaler()), 98 | ]) 99 | transformer_categorical_transformer2 = Pipeline(steps=[ 100 | ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), 101 | ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False)) 102 | ]) 103 | 104 | # bundle preprocessing for numerical and categorical data 105 | preprocessor = ColumnTransformer( 106 | transformers=[ 107 | ('num', transformer_numerical, features_numerical), 108 | ('cat1', transformer_categorical_1, features_categorical_to_ordinal), 109 | ('cat2', transformer_categorical_transformer2, 110 | features_categorical_to_onehot), 111 | ]) 112 | 113 | # create model inputs 114 | X_train = df_train_full[features_all] 115 | y_train = df_train_full[name_of_target_column_transformed].to_numpy() 116 | X_test = 
df_test[features_all] 117 | 118 | tree_method_applied = 'gpu_hist' if use_gpu else 'auto' 119 | 120 | # xgboost hyperparameters (tuned previously) 121 | xgb_params = { 122 | 'n_estimators': 35, 123 | 'max_depth': 4, 124 | 'eta': 0.27, 125 | 126 | 'objective': 'reg:squarederror', 127 | 'nthread': -1, 128 | 'tree_method': tree_method_applied, 129 | 130 | 'seed': RANDOM_SEED, 131 | } 132 | 133 | print('\nTraining the model...') 134 | 135 | # create pipeline & train 136 | model_xgb = xgb.XGBRegressor(**xgb_params) 137 | pipeline_final = Pipeline(steps=[ 138 | ('preprocessor', preprocessor), 139 | ('model', model_xgb) 140 | ]) 141 | pipeline_final.fit(X_train, y_train) 142 | 143 | print('\nTraining completed successfully.') 144 | 145 | 146 | # save model bin file for inference 147 | if model_file_saving: 148 | pipeline_name = f'{MODEL_NAME_PREFIX}_{unique_run_identifier}.bin' 149 | with open(MODEL_PATH+pipeline_name, 'wb') as output_file: 150 | pickle.dump(pipeline_final, output_file) 151 | print(f'\nModel pipeline file saved successfully in >> {MODEL_PATH+pipeline_name}') 152 | # no explicit close() needed: the with-block already closes output_file 153 | 154 | # # make predictions on test data 155 | # y_pred_test = np.expm1(pipeline_final.predict(X_test)) 156 | 157 | print('\nAll done :)') 158 | -------------------------------------------------------------------------------- /virtual-env/Pipfile: -------------------------------------------------------------------------------- 1 | [[source]] 2 | url = "https://pypi.org/simple" 3 | verify_ssl = true 4 | name = "pypi" 5 | 6 | [packages] 7 | xgboost = "<1.6.0" 8 | pandas = "<1.4.0" 9 | sklearn = "*" 10 | numpy = "*" 11 | waitress = "*" 12 | Flask = "*" 13 | 14 | [dev-packages] 15 | 16 | [requires] 17 | python_version = "3.9" 18 | -------------------------------------------------------------------------------- /virtual-env/Pipfile.lock: -------------------------------------------------------------------------------- 1 | { 2 | "_meta": { 3 | "hash": { 4 | "sha256": 
"3d8bd06bb15d3764ce1b25850cc1508db3e7133e08547fa1510bc33769875f0b" 5 | }, 6 | "pipfile-spec": 6, 7 | "requires": { 8 | "python_version": "3.9" 9 | }, 10 | "sources": [ 11 | { 12 | "name": "pypi", 13 | "url": "https://pypi.org/simple", 14 | "verify_ssl": true 15 | } 16 | ] 17 | }, 18 | "default": { 19 | "blinker": { 20 | "hashes": [ 21 | "sha256:4afd3de66ef3a9f8067559fb7a1cbe555c17dcbe15971b05d1b625c3e7abe213", 22 | "sha256:c3d739772abb7bc2860abf5f2ec284223d9ad5c76da018234f6f50d6f31ab1f0" 23 | ], 24 | "markers": "python_version >= '3.7'", 25 | "version": "==1.6.2" 26 | }, 27 | "click": { 28 | "hashes": [ 29 | "sha256:7682dc8afb30297001674575ea00d1814d808d6a36af415a82bd481d37ba7b8e", 30 | "sha256:bb4d8133cb15a609f44e8213d9b391b0809795062913b383c62be0ee95b1db48" 31 | ], 32 | "markers": "python_version >= '3.7'", 33 | "version": "==8.1.3" 34 | }, 35 | "flask": { 36 | "hashes": [ 37 | "sha256:77fd4e1249d8c9923de34907236b747ced06e5467ecac1a7bb7115ae0e9670b0", 38 | "sha256:8c2f9abd47a9e8df7f0c3f091ce9497d011dc3b31effcf4c85a6e2b50f4114ef" 39 | ], 40 | "index": "pypi", 41 | "version": "==2.3.2" 42 | }, 43 | "importlib-metadata": { 44 | "hashes": [ 45 | "sha256:43dd286a2cd8995d5eaef7fee2066340423b818ed3fd70adf0bad5f1fac53fed", 46 | "sha256:92501cdf9cc66ebd3e612f1b4f0c0765dfa42f0fa38ffb319b6bd84dd675d705" 47 | ], 48 | "markers": "python_version < '3.10'", 49 | "version": "==6.6.0" 50 | }, 51 | "itsdangerous": { 52 | "hashes": [ 53 | "sha256:2c2349112351b88699d8d4b6b075022c0808887cb7ad10069318a8b0bc88db44", 54 | "sha256:5dbbc68b317e5e42f327f9021763545dc3fc3bfe22e6deb96aaf1fc38874156a" 55 | ], 56 | "markers": "python_version >= '3.7'", 57 | "version": "==2.1.2" 58 | }, 59 | "jinja2": { 60 | "hashes": [ 61 | "sha256:31351a702a408a9e7595a8fc6150fc3f43bb6bf7e319770cbc0db9df9437e852", 62 | "sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61" 63 | ], 64 | "markers": "python_version >= '3.7'", 65 | "version": "==3.1.2" 66 | }, 67 | "joblib": { 68 | "hashes": [ 69 
| "sha256:091138ed78f800342968c523bdde947e7a305b8594b910a0fea2ab83c3c6d385", 70 | "sha256:e1cee4a79e4af22881164f218d4311f60074197fb707e082e803b61f6d137018" 71 | ], 72 | "markers": "python_version >= '3.7'", 73 | "version": "==1.2.0" 74 | }, 75 | "markupsafe": { 76 | "hashes": [ 77 | "sha256:0576fe974b40a400449768941d5d0858cc624e3249dfd1e0c33674e5c7ca7aed", 78 | "sha256:085fd3201e7b12809f9e6e9bc1e5c96a368c8523fad5afb02afe3c051ae4afcc", 79 | "sha256:090376d812fb6ac5f171e5938e82e7f2d7adc2b629101cec0db8b267815c85e2", 80 | "sha256:0b462104ba25f1ac006fdab8b6a01ebbfbce9ed37fd37fd4acd70c67c973e460", 81 | "sha256:137678c63c977754abe9086a3ec011e8fd985ab90631145dfb9294ad09c102a7", 82 | "sha256:1bea30e9bf331f3fef67e0a3877b2288593c98a21ccb2cf29b74c581a4eb3af0", 83 | "sha256:22152d00bf4a9c7c83960521fc558f55a1adbc0631fbb00a9471e097b19d72e1", 84 | "sha256:22731d79ed2eb25059ae3df1dfc9cb1546691cc41f4e3130fe6bfbc3ecbbecfa", 85 | "sha256:2298c859cfc5463f1b64bd55cb3e602528db6fa0f3cfd568d3605c50678f8f03", 86 | "sha256:28057e985dace2f478e042eaa15606c7efccb700797660629da387eb289b9323", 87 | "sha256:2e7821bffe00aa6bd07a23913b7f4e01328c3d5cc0b40b36c0bd81d362faeb65", 88 | "sha256:2ec4f2d48ae59bbb9d1f9d7efb9236ab81429a764dedca114f5fdabbc3788013", 89 | "sha256:340bea174e9761308703ae988e982005aedf427de816d1afe98147668cc03036", 90 | "sha256:40627dcf047dadb22cd25ea7ecfe9cbf3bbbad0482ee5920b582f3809c97654f", 91 | "sha256:40dfd3fefbef579ee058f139733ac336312663c6706d1163b82b3003fb1925c4", 92 | "sha256:4cf06cdc1dda95223e9d2d3c58d3b178aa5dacb35ee7e3bbac10e4e1faacb419", 93 | "sha256:50c42830a633fa0cf9e7d27664637532791bfc31c731a87b202d2d8ac40c3ea2", 94 | "sha256:55f44b440d491028addb3b88f72207d71eeebfb7b5dbf0643f7c023ae1fba619", 95 | "sha256:608e7073dfa9e38a85d38474c082d4281f4ce276ac0010224eaba11e929dd53a", 96 | "sha256:63ba06c9941e46fa389d389644e2d8225e0e3e5ebcc4ff1ea8506dce646f8c8a", 97 | "sha256:65608c35bfb8a76763f37036547f7adfd09270fbdbf96608be2bead319728fcd", 98 | 
"sha256:665a36ae6f8f20a4676b53224e33d456a6f5a72657d9c83c2aa00765072f31f7", 99 | "sha256:6d6607f98fcf17e534162f0709aaad3ab7a96032723d8ac8750ffe17ae5a0666", 100 | "sha256:7313ce6a199651c4ed9d7e4cfb4aa56fe923b1adf9af3b420ee14e6d9a73df65", 101 | "sha256:7668b52e102d0ed87cb082380a7e2e1e78737ddecdde129acadb0eccc5423859", 102 | "sha256:7df70907e00c970c60b9ef2938d894a9381f38e6b9db73c5be35e59d92e06625", 103 | "sha256:7e007132af78ea9df29495dbf7b5824cb71648d7133cf7848a2a5dd00d36f9ff", 104 | "sha256:835fb5e38fd89328e9c81067fd642b3593c33e1e17e2fdbf77f5676abb14a156", 105 | "sha256:8bca7e26c1dd751236cfb0c6c72d4ad61d986e9a41bbf76cb445f69488b2a2bd", 106 | "sha256:8db032bf0ce9022a8e41a22598eefc802314e81b879ae093f36ce9ddf39ab1ba", 107 | "sha256:99625a92da8229df6d44335e6fcc558a5037dd0a760e11d84be2260e6f37002f", 108 | "sha256:9cad97ab29dfc3f0249b483412c85c8ef4766d96cdf9dcf5a1e3caa3f3661cf1", 109 | "sha256:a4abaec6ca3ad8660690236d11bfe28dfd707778e2442b45addd2f086d6ef094", 110 | "sha256:a6e40afa7f45939ca356f348c8e23048e02cb109ced1eb8420961b2f40fb373a", 111 | "sha256:a6f2fcca746e8d5910e18782f976489939d54a91f9411c32051b4aab2bd7c513", 112 | "sha256:a806db027852538d2ad7555b203300173dd1b77ba116de92da9afbc3a3be3eed", 113 | "sha256:abcabc8c2b26036d62d4c746381a6f7cf60aafcc653198ad678306986b09450d", 114 | "sha256:b8526c6d437855442cdd3d87eede9c425c4445ea011ca38d937db299382e6fa3", 115 | "sha256:bb06feb762bade6bf3c8b844462274db0c76acc95c52abe8dbed28ae3d44a147", 116 | "sha256:c0a33bc9f02c2b17c3ea382f91b4db0e6cde90b63b296422a939886a7a80de1c", 117 | "sha256:c4a549890a45f57f1ebf99c067a4ad0cb423a05544accaf2b065246827ed9603", 118 | "sha256:ca244fa73f50a800cf8c3ebf7fd93149ec37f5cb9596aa8873ae2c1d23498601", 119 | "sha256:cf877ab4ed6e302ec1d04952ca358b381a882fbd9d1b07cccbfd61783561f98a", 120 | "sha256:d9d971ec1e79906046aa3ca266de79eac42f1dbf3612a05dc9368125952bd1a1", 121 | "sha256:da25303d91526aac3672ee6d49a2f3db2d9502a4a60b55519feb1a4c7714e07d", 122 | 
"sha256:e55e40ff0cc8cc5c07996915ad367fa47da6b3fc091fdadca7f5403239c5fec3", 123 | "sha256:f03a532d7dee1bed20bc4884194a16160a2de9ffc6354b3878ec9682bb623c54", 124 | "sha256:f1cd098434e83e656abf198f103a8207a8187c0fc110306691a2e94a78d0abb2", 125 | "sha256:f2bfb563d0211ce16b63c7cb9395d2c682a23187f54c3d79bfec33e6705473c6", 126 | "sha256:f8ffb705ffcf5ddd0e80b65ddf7bed7ee4f5a441ea7d3419e861a12eaf41af58" 127 | ], 128 | "markers": "python_version >= '3.7'", 129 | "version": "==2.1.2" 130 | }, 131 | "numpy": { 132 | "hashes": [ 133 | "sha256:07a8c89a04997625236c5ecb7afe35a02af3896c8aa01890a849913a2309c676", 134 | "sha256:08d9b008d0156c70dc392bb3ab3abb6e7a711383c3247b410b39962263576cd4", 135 | "sha256:201b4d0552831f7250a08d3b38de0d989d6f6e4658b709a02a73c524ccc6ffce", 136 | "sha256:2c10a93606e0b4b95c9b04b77dc349b398fdfbda382d2a39ba5a822f669a0123", 137 | "sha256:3ca688e1b9b95d80250bca34b11a05e389b1420d00e87a0d12dc45f131f704a1", 138 | "sha256:48a3aecd3b997bf452a2dedb11f4e79bc5bfd21a1d4cc760e703c31d57c84b3e", 139 | "sha256:568dfd16224abddafb1cbcce2ff14f522abe037268514dd7e42c6776a1c3f8e5", 140 | "sha256:5bfb1bb598e8229c2d5d48db1860bcf4311337864ea3efdbe1171fb0c5da515d", 141 | "sha256:639b54cdf6aa4f82fe37ebf70401bbb74b8508fddcf4797f9fe59615b8c5813a", 142 | "sha256:8251ed96f38b47b4295b1ae51631de7ffa8260b5b087808ef09a39a9d66c97ab", 143 | "sha256:92bfa69cfbdf7dfc3040978ad09a48091143cffb778ec3b03fa170c494118d75", 144 | "sha256:97098b95aa4e418529099c26558eeb8486e66bd1e53a6b606d684d0c3616b168", 145 | "sha256:a3bae1a2ed00e90b3ba5f7bd0a7c7999b55d609e0c54ceb2b076a25e345fa9f4", 146 | "sha256:c34ea7e9d13a70bf2ab64a2532fe149a9aced424cd05a2c4ba662fd989e3e45f", 147 | "sha256:dbc7601a3b7472d559dc7b933b18b4b66f9aa7452c120e87dfb33d02008c8a18", 148 | "sha256:e7927a589df200c5e23c57970bafbd0cd322459aa7b1ff73b7c2e84d6e3eae62", 149 | "sha256:f8c1f39caad2c896bc0018f699882b345b2a63708008be29b1f355ebf6f933fe", 150 | "sha256:f950f8845b480cffe522913d35567e29dd381b0dc7e4ce6a4a9f9156417d2430", 151 | 
"sha256:fade0d4f4d292b6f39951b6836d7a3c7ef5b2347f3c420cd9820a1d90d794802", 152 | "sha256:fdf3c08bce27132395d3c3ba1503cac12e17282358cb4bddc25cc46b0aca07aa" 153 | ], 154 | "index": "pypi", 155 | "version": "==1.22.3" 156 | }, 157 | "pandas": { 158 | "hashes": [ 159 | "sha256:1e4285f5de1012de20ca46b188ccf33521bff61ba5c5ebd78b4fb28e5416a9f1", 160 | "sha256:2651d75b9a167cc8cc572cf787ab512d16e316ae00ba81874b560586fa1325e0", 161 | "sha256:2c21778a688d3712d35710501f8001cdbf96eb70a7c587a3d5613573299fdca6", 162 | "sha256:32e1a26d5ade11b547721a72f9bfc4bd113396947606e00d5b4a5b79b3dcb006", 163 | "sha256:3345343206546545bc26a05b4602b6a24385b5ec7c75cb6059599e3d56831da2", 164 | "sha256:344295811e67f8200de2390093aeb3c8309f5648951b684d8db7eee7d1c81fb7", 165 | "sha256:37f06b59e5bc05711a518aa10beaec10942188dccb48918bb5ae602ccbc9f1a0", 166 | "sha256:552020bf83b7f9033b57cbae65589c01e7ef1544416122da0c79140c93288f56", 167 | "sha256:5cce0c6bbeb266b0e39e35176ee615ce3585233092f685b6a82362523e59e5b4", 168 | "sha256:5f261553a1e9c65b7a310302b9dbac31cf0049a51695c14ebe04e4bfd4a96f02", 169 | "sha256:60a8c055d58873ad81cae290d974d13dd479b82cbb975c3e1fa2cf1920715296", 170 | "sha256:62d5b5ce965bae78f12c1c0df0d387899dd4211ec0bdc52822373f13a3a022b9", 171 | "sha256:7d28a3c65463fd0d0ba8bbb7696b23073efee0510783340a44b08f5e96ffce0c", 172 | "sha256:8025750767e138320b15ca16d70d5cdc1886e8f9cc56652d89735c016cd8aea6", 173 | "sha256:8b6dbec5f3e6d5dc80dcfee250e0a2a652b3f28663492f7dab9a24416a48ac39", 174 | "sha256:a395692046fd8ce1edb4c6295c35184ae0c2bbe787ecbe384251da609e27edcb", 175 | "sha256:a62949c626dd0ef7de11de34b44c6475db76995c2064e2d99c6498c3dba7fe58", 176 | "sha256:aaf183a615ad790801fa3cf2fa450e5b6d23a54684fe386f7e3208f8b9bfbef6", 177 | "sha256:adfeb11be2d54f275142c8ba9bf67acee771b7186a5745249c7d5a06c670136b", 178 | "sha256:b6b87b2fb39e6383ca28e2829cddef1d9fc9e27e55ad91ca9c435572cdba51bf", 179 | "sha256:bd971a3f08b745a75a86c00b97f3007c2ea175951286cdda6abe543e687e5f2f", 180 | 
"sha256:c69406a2808ba6cf580c2255bcf260b3f214d2664a3a4197d0e640f573b46fd3", 181 | "sha256:d3bc49af96cd6285030a64779de5b3688633a07eb75c124b0747134a63f4c05f", 182 | "sha256:fd541ab09e1f80a2a1760032d665f6e032d8e44055d602d65eeea6e6e85498cb", 183 | "sha256:fe95bae4e2d579812865db2212bb733144e34d0c6785c0685329e5b60fcb85dd" 184 | ], 185 | "index": "pypi", 186 | "version": "==1.3.5" 187 | }, 188 | "python-dateutil": { 189 | "hashes": [ 190 | "sha256:0123cacc1627ae19ddf3c27a5de5bd67ee4586fbdd6440d9748f8abb483d3e86", 191 | "sha256:961d03dc3453ebbc59dbdea9e4e11c5651520a876d0f4db161e8674aae935da9" 192 | ], 193 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'", 194 | "version": "==2.8.2" 195 | }, 196 | "pytz": { 197 | "hashes": [ 198 | "sha256:1d8ce29db189191fb55338ee6d0387d82ab59f3d00eac103412d64e0ebd0c588", 199 | "sha256:a151b3abb88eda1d4e34a9814df37de2a80e301e68ba0fd856fb9b46bfbbbffb" 200 | ], 201 | "version": "==2023.3" 202 | }, 203 | "scikit-learn": { 204 | "hashes": [ 205 | "sha256:065e9673e24e0dc5113e2dd2b4ca30c9d8aa2fa90f4c0597241c93b63130d233", 206 | "sha256:2dd3ffd3950e3d6c0c0ef9033a9b9b32d910c61bd06cb8206303fb4514b88a49", 207 | "sha256:2e2642baa0ad1e8f8188917423dd73994bf25429f8893ddbe115be3ca3183584", 208 | "sha256:44b47a305190c28dd8dd73fc9445f802b6ea716669cfc22ab1eb97b335d238b1", 209 | "sha256:6477eed40dbce190f9f9e9d0d37e020815825b300121307942ec2110302b66a3", 210 | "sha256:6fe83b676f407f00afa388dd1fdd49e5c6612e551ed84f3b1b182858f09e987d", 211 | "sha256:7d5312d9674bed14f73773d2acf15a3272639b981e60b72c9b190a0cffed5bad", 212 | "sha256:7f69313884e8eb311460cc2f28676d5e400bd929841a2c8eb8742ae78ebf7c20", 213 | "sha256:8156db41e1c39c69aa2d8599ab7577af53e9e5e7a57b0504e116cc73c39138dd", 214 | "sha256:8429aea30ec24e7a8c7ed8a3fa6213adf3814a6efbea09e16e0a0c71e1a1a3d7", 215 | "sha256:8b0670d4224a3c2d596fd572fb4fa673b2a0ccfb07152688ebd2ea0b8c61025c", 216 | "sha256:953236889928d104c2ef14027539f5f2609a47ebf716b8cbe4437e85dce42744", 217 | 
"sha256:99cc01184e347de485bf253d19fcb3b1a3fb0ee4cea5ee3c43ec0cc429b6d29f", 218 | "sha256:9c710ff9f9936ba8a3b74a455ccf0dcf59b230caa1e9ba0223773c490cab1e51", 219 | "sha256:ad66c3848c0a1ec13464b2a95d0a484fd5b02ce74268eaa7e0c697b904f31d6c", 220 | "sha256:bf036ea7ef66115e0d49655f16febfa547886deba20149555a41d28f56fd6d3c", 221 | "sha256:dfeaf8be72117eb61a164ea6fc8afb6dfe08c6f90365bde2dc16456e4bc8e45f", 222 | "sha256:e6e574db9914afcb4e11ade84fab084536a895ca60aadea3041e85b8ac963edb", 223 | "sha256:ea061bf0283bf9a9f36ea3c5d3231ba2176221bbd430abd2603b1c3b2ed85c89", 224 | "sha256:fe0aa1a7029ed3e1dcbf4a5bc675aa3b1bc468d9012ecf6c6f081251ca47f590", 225 | "sha256:fe175ee1dab589d2e1033657c5b6bec92a8a3b69103e3dd361b58014729975c3" 226 | ], 227 | "markers": "python_version >= '3.8'", 228 | "version": "==1.2.2" 229 | }, 230 | "scipy": { 231 | "hashes": [ 232 | "sha256:049a8bbf0ad95277ffba9b3b7d23e5369cc39e66406d60422c8cfef40ccc8415", 233 | "sha256:07c3457ce0b3ad5124f98a86533106b643dd811dd61b548e78cf4c8786652f6f", 234 | "sha256:0f1564ea217e82c1bbe75ddf7285ba0709ecd503f048cb1236ae9995f64217bd", 235 | "sha256:1553b5dcddd64ba9a0d95355e63fe6c3fc303a8fd77c7bc91e77d61363f7433f", 236 | "sha256:15a35c4242ec5f292c3dd364a7c71a61be87a3d4ddcc693372813c0b73c9af1d", 237 | "sha256:1b4735d6c28aad3cdcf52117e0e91d6b39acd4272f3f5cd9907c24ee931ad601", 238 | "sha256:2cf9dfb80a7b4589ba4c40ce7588986d6d5cebc5457cad2c2880f6bc2d42f3a5", 239 | "sha256:39becb03541f9e58243f4197584286e339029e8908c46f7221abeea4b749fa88", 240 | "sha256:43b8e0bcb877faf0abfb613d51026cd5cc78918e9530e375727bf0625c82788f", 241 | "sha256:4b3f429188c66603a1a5c549fb414e4d3bdc2a24792e061ffbd607d3d75fd84e", 242 | "sha256:4c0ff64b06b10e35215abce517252b375e580a6125fd5fdf6421b98efbefb2d2", 243 | "sha256:51af417a000d2dbe1ec6c372dfe688e041a7084da4fdd350aeb139bd3fb55353", 244 | "sha256:5678f88c68ea866ed9ebe3a989091088553ba12c6090244fdae3e467b1139c35", 245 | "sha256:79c8e5a6c6ffaf3a2262ef1be1e108a035cf4f05c14df56057b64acc5bebffb6", 246 | 
"sha256:7ff7f37b1bf4417baca958d254e8e2875d0cc23aaadbe65b3d5b3077b0eb23ea", 247 | "sha256:aaea0a6be54462ec027de54fca511540980d1e9eea68b2d5c1dbfe084797be35", 248 | "sha256:bce5869c8d68cf383ce240e44c1d9ae7c06078a9396df68ce88a1230f93a30c1", 249 | "sha256:cd9f1027ff30d90618914a64ca9b1a77a431159df0e2a195d8a9e8a04c78abf9", 250 | "sha256:d925fa1c81b772882aa55bcc10bf88324dadb66ff85d548c71515f6689c6dac5", 251 | "sha256:e7354fd7527a4b0377ce55f286805b34e8c54b91be865bac273f527e1b839019", 252 | "sha256:fae8a7b898c42dffe3f7361c40d5952b6bf32d10c4569098d276b4c547905ee1" 253 | ], 254 | "markers": "python_version < '3.12' and python_version >= '3.8'", 255 | "version": "==1.10.1" 256 | }, 257 | "six": { 258 | "hashes": [ 259 | "sha256:1e61c37477a1626458e36f7b1d82aa5c9b094fa4802892072e49de9c60c4c926", 260 | "sha256:8abb2f1d86890a2dfb989f9a77cfcfd3e47c2a354b01111771326f8aa26e0254" 261 | ], 262 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'", 263 | "version": "==1.16.0" 264 | }, 265 | "sklearn": { 266 | "hashes": [ 267 | "sha256:e23001573aa194b834122d2b9562459bf5ae494a2d59ca6b8aa22c85a44c0e31" 268 | ], 269 | "index": "pypi", 270 | "version": "==0.0" 271 | }, 272 | "threadpoolctl": { 273 | "hashes": [ 274 | "sha256:8b99adda265feb6773280df41eece7b2e6561b772d21ffd52e372f999024907b", 275 | "sha256:a335baacfaa4400ae1f0d8e3a58d6674d2f8828e3716bb2802c44955ad391380" 276 | ], 277 | "markers": "python_version >= '3.6'", 278 | "version": "==3.1.0" 279 | }, 280 | "waitress": { 281 | "hashes": [ 282 | "sha256:7500c9625927c8ec60f54377d590f67b30c8e70ef4b8894214ac6e4cad233d2a", 283 | "sha256:780a4082c5fbc0fde6a2fcfe5e26e6efc1e8f425730863c04085769781f51eba" 284 | ], 285 | "index": "pypi", 286 | "version": "==2.1.2" 287 | }, 288 | "werkzeug": { 289 | "hashes": [ 290 | "sha256:4866679a0722de00796a74086238bb3b98d90f423f05de039abb09315487254a", 291 | "sha256:a987caf1092edc7523edb139edb20c70571c4a8d5eed02e0b547b4739174d091" 292 | ], 293 | "markers": "python_version >= 
'3.8'", 294 | "version": "==2.3.3" 295 | }, 296 | "xgboost": { 297 | "hashes": [ 298 | "sha256:404dc09dca887ef5a9bc0268f882c54b33bfc16ac365a859a11e7b24d49da387", 299 | "sha256:46e43e06b1de260fe4c67e818720485559dab1bed1d97b82275220fab602a2ba", 300 | "sha256:51b5dfe553d78ab92f2c60ccead6abc38196c2961f6952f6ec14b1feba6ffd25", 301 | "sha256:a449b0d0d8a15e72946c9d07e3d4ea153ac12570d054f6020cc5146cb4979261", 302 | "sha256:f9d459ad42da74c60136123ead36fa562086fb886f52fca229477d327d60dad0" 303 | ], 304 | "index": "pypi", 305 | "version": "==1.5.2" 306 | }, 307 | "zipp": { 308 | "hashes": [ 309 | "sha256:112929ad649da941c23de50f356a2b5570c954b65150642bccdd66bf194d224b", 310 | "sha256:48904fc76a60e542af151aded95726c1a5c34ed43ab4134b597665c86d7ad556" 311 | ], 312 | "markers": "python_version >= '3.7'", 313 | "version": "==3.15.0" 314 | } 315 | }, 316 | "develop": {} 317 | } 318 | -------------------------------------------------------------------------------- /virtual-env/README.md: -------------------------------------------------------------------------------- 1 | # A Quick Guide to Creating and Employing Virtual Environments 2 | 3 | ## Introduction 4 | 5 | Imagine a folder named "prediction-app" containing our ``inference.py`` and ``model.bin`` files. Now we want to add all the packages & libraries required to run the inference file inside this folder without any concerns regarding compatibility or messing with our machine's libraries. Virtual environments enable us to achieve exactly this in a very hassle-free fashion, without compromising other packages in our system and avoiding __dependency hell__ complications. 6 | 7 | ## Prerequisites 8 | 9 | You would need to have __*python3*__, __*pip*__ and __*pipenv*__ already installed on your system. You may like to do manual install via official guides, or use an existing platform like [Anaconda](https://www.anaconda.com) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html) for convenient library installs. 
As the name suggests, __*pipenv*__ builds on the _pip_ package-management system, which finds & installs libraries from [PyPI](https://pypi.org). 10 | 11 | ## Creating Virtual Environments 12 | 13 | ### Method (A): Using ``requirements.txt`` File 14 | 15 | Creating a virtual environment is easy and straightforward. We'll use __*pipenv*__ to manage our virtual environment. There are other good options too, but they are a story for another time. 16 | 17 | First of all, prepare the folder and put your files inside it (just like the inference example mentioned in the introduction section above). Create an additional file named ``requirements.txt``. This file holds the names of the libraries we'd like our virtual environment to have. Inside ``requirements.txt``, list each required library on its own line, using the name you would pass to pip when installing it (for instance pip install __scikit-learn__, which is the package name, not the import name ``sklearn``). This is what a typical ``requirements.txt`` file looks like: 18 | 19 | ⬇️ __requirements.txt__ 20 | 21 | ```text 22 | numpy 23 | pandas 24 | scikit-learn 25 | xgboost 26 | flask 27 | waitress 28 | ``` 29 | 30 | It is good practice to include the version number you're using for each library. Pinning versions makes the environment deterministic, so you won't risk a conflict between your older code and breaking changes in future versions of the libraries. 31 | 32 | ```text 33 | ... 34 | scikit-learn==1.0.1 35 | ... 36 | ``` 37 | 38 | Now that you have your ``requirements.txt`` file ready, open up a terminal (Linux) or command prompt (Windows) or Anaconda prompt (if you're using Anaconda or Miniconda). 39 | Navigate to this folder in your terminal and run the following command: 40 | 41 | ```shell 42 | pipenv install -r requirements.txt 43 | ``` 44 | 45 | ### Method (B): Using ``Pipfile`` & ``Pipfile.lock`` Files 46 | 47 | These two files are created and filled in automatically when method (A) runs successfully.
Having only these two files lets you replicate the environment on any other machine without needing the ``requirements.txt`` file. The good thing about this method is that ``Pipfile.lock`` records the exact version (and hashes) of every library, so version incompatibilities become far less likely. If you don't already have these two files though, you've got no choice but to go with method (A). 48 | 49 | With these two files inside the folder where you'd like the new environment, run the following command in your terminal (Linux) or command prompt (Windows) or Anaconda prompt (if you're using Anaconda or Miniconda): 50 | 51 | ```shell 52 | pipenv install 53 | ``` 54 | 55 | That's all! Your new virtual environment is ready. 56 | 57 | ## Activating the Virtual Environment 58 | 59 | Both methods (A) and (B) take care of creating a virtual environment (for your desired folder) and installing the dependencies for you. With the virtual environment set up, all you need to do is activate it and run your commands from its shell. 60 | To activate the folder's virtual shell, run the following command inside the folder: 61 | 62 | ```shell 63 | pipenv shell 64 | ``` 65 | 66 | That command launches a __subshell__ inside our virtual environment, and any command we run from it uses that environment's packages. For example, if we want to run our Flask app and expose an API endpoint, we can start it by running the line below from the subshell: 67 | 68 | ```shell 69 | python inference.py 70 | ``` 71 | 72 | or 73 | 74 | ```shell 75 | python -m inference 76 | ``` 77 | 78 | __Warning__: Don't forget to copy ``inference.py`` and ``model/model.bin`` to your virtual environment folder (these files are provided inside the __``scripts``__ folder of this repository). In addition, you must update the model path variable in ``inference.py`` accordingly. 
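As a rough sketch of one way to keep the model path from breaking after such a copy, the path can be resolved relative to the script's own folder rather than hard-coded. The helper names ``resolve_model_path`` and ``load_pipeline`` below are hypothetical and do not come from ``inference.py``; the ``model/model.bin`` layout and the use of ``pickle`` are assumed from the warning above and from ``train.py``.

```python
import pickle
from pathlib import Path


def resolve_model_path(base_dir, relative_path="model/model.bin"):
    """Return the absolute model path under base_dir, or raise if it is missing."""
    model_path = Path(base_dir).resolve() / relative_path
    if not model_path.is_file():
        raise FileNotFoundError(
            f"Model file not found at {model_path}; copy it from the scripts folder first."
        )
    return model_path


def load_pipeline(base_dir, relative_path="model/model.bin"):
    """Unpickle the fitted pipeline (assumed to be saved with pickle, as in train.py)."""
    with open(resolve_model_path(base_dir, relative_path), "rb") as input_file:
        return pickle.load(input_file)
```

Inside a script, this could be called as ``load_pipeline(Path(__file__).parent)``, so the lookup follows the script wherever the folder is moved. Note that unpickling still requires the libraries used at training time to be installed in the active environment.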
79 | 80 | I've already provided ``requirements.txt``, ``Pipfile`` and ``Pipfile.lock`` files inside this folder, and you can try either method as an exercise. As a side note, you can even try method (A) for this entire repo using the ``requirements.txt`` file in the repository's root. 81 | 82 | ## Bonus: Generating a ``requirements.txt`` file from Pipfile 83 | 84 | It is possible to regenerate a ``requirements.txt`` file from the ``Pipfile`` and ``Pipfile.lock`` files. All you need to do is run the following command inside a folder containing the Pipfiles (recent pipenv releases have replaced it with ``pipenv requirements > requirements.txt``): 85 | 86 | ```shell 87 | pipenv lock -r > requirements.txt 88 | ``` 89 | 90 | Last but not least, make sure to check out some very useful advanced tips on ``pipenv`` [here](https://github.com/pypa/pipenv/blob/main/docs/advanced.rst). 91 | -------------------------------------------------------------------------------- /virtual-env/requirements.txt: -------------------------------------------------------------------------------- 1 | xgboost<1.6.0 2 | pandas<1.4.0 3 | sklearn 4 | numpy 5 | flask 6 | waitress --------------------------------------------------------------------------------