├── .gitattributes ├── 01-Intro ├── Homework_intro.ipynb ├── README.md └── data.csv ├── 02-regression ├── Homework-regression.ipynb ├── README.md ├── Regularization in Linear Regression.ipynb ├── car_price_data.csv └── housing.csv ├── 03-classification ├── Homework-classification.ipynb └── README.md ├── 04-evaluation ├── AER_credit_card_data.csv ├── Homework-evaluation.ipynb └── README.md ├── 05-deployment ├── External_client.ipynb ├── README.md └── app │ ├── Dockerfile │ ├── Pipfile │ ├── Pipfile.lock │ ├── customer_1.json │ ├── customer_2.json │ ├── dv.bin │ ├── eb_cloud_service.png │ ├── model1.bin │ ├── predict-test.py │ └── predict.py ├── 06-trees ├── Homework-trees.ipynb └── README.md ├── 07-bento-production ├── README.md ├── coolmodel │ └── service.py ├── credit_risk_service │ └── service.py ├── dragon.jpeg ├── locustfile.py ├── models │ ├── credit_risk_model │ │ ├── dtlts7cv4s2nbhht │ │ │ ├── custom_objects.pkl │ │ │ ├── model.yaml │ │ │ └── saved_model.ubj │ │ └── latest │ └── mlzoomcamp_homework │ │ ├── jsi67fslz6txydu5 │ │ ├── model.yaml │ │ └── saved_model.pkl │ │ ├── latest │ │ └── qtzdz3slg6mwwdu5 │ │ ├── model.yaml │ │ └── saved_model.pkl ├── requirements.txt ├── setting_up_bentoML.sh └── train.ipynb ├── README.md ├── midterm_project ├── README.md ├── app │ └── README.md ├── data │ └── data.csv.gz └── notebooks │ └── 01-EDA.ipynb └── requirements.txt /.gitattributes: -------------------------------------------------------------------------------- 1 | *.csv filter=lfs diff=lfs merge=lfs -text 2 | *.csv.gz filter=lfs diff=lfs merge=lfs -text 3 | *.gz filter=lfs diff=lfs merge=lfs -text 4 | -------------------------------------------------------------------------------- /01-Intro/README.md: -------------------------------------------------------------------------------- 1 | # Intro Session : Homework & Dataset 2 | 3 | Session Overview : 4 | 5 | * 1.1 Introduction to Machine Learning 6 | * 1.2 ML vs Rule-Based Systems 7 | * 1.3 Supervised Machine Learning 8 | * 1.4 CRISP-DM 9 | * 1.5 The Modelling Step (Model Selection Process) 10 | * 1.6 Setting up the Environment 11 | * 1.7 Introduction to NumPy 12 | * 1.8 Linear Algebra Refresher 13 | * 1.9 Introduction to Pandas 14 | 15 | --------- 16 | 17 | ## Session #1 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/01-intro/homework.md) 18 | 19 | ### Set up the environment 20 | 21 | You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. 22 | 23 | ### Question 1 24 | 25 | What's the version of NumPy that you installed? 26 | 27 | You can get the version information using the `__version__` field: 28 | 29 | ```python 30 | np.__version__ 31 | ``` 32 | 33 | ### Getting the data 34 | 35 | For this homework, we'll use the Car price dataset. Download it from 36 | [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv). 37 | 38 | You can do it with wget: 39 | 40 | ```bash 41 | wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv 42 | ``` 43 | 44 | Or just open it with your browser and click "Save as...". 45 | 46 | Now read it with Pandas. 47 | 48 | ### Question 2 49 | 50 | How many records are in the dataset? 51 | 52 | Here you need to specify the number of rows. 53 | 54 | - 16 55 | - 6572 56 | - 11914 57 | - 18990 58 | 59 | ### Question 3 60 | 61 | Who are the most frequent car manufacturers (top-3) according to the dataset? 
62 | 63 | - Chevrolet, Volkswagen, Toyota 64 | - Chevrolet, Ford, Toyota 65 | - Ford, Volkswagen, Toyota 66 | - Chevrolet, Ford, Volkswagen 67 | 68 | > **Note**: You should rely on "Make" column in this question. 69 | 70 | ### Question 4 71 | 72 | What's the number of unique Audi car models in the dataset? 73 | 74 | - 3 75 | - 16 76 | - 26 77 | - 34 78 | 79 | ### Question 5 80 | 81 | How many columns in the dataset have missing values? 82 | 83 | - 5 84 | - 6 85 | - 7 86 | - 8 87 | 88 | ### Question 6 89 | 90 | 1. Find the median value of "Engine Cylinders" column in the dataset. 91 | 2. Next, calculate the most frequent value of the same "Engine Cylinders". 92 | 3. Use the `fillna` method to fill the missing values in "Engine Cylinders" with the most frequent value from the previous step. 93 | 4. Now, calculate the median value of "Engine Cylinders" once again. 94 | 95 | Has it changed? 96 | 97 | > Hint: refer to existing `mode` and `median` functions to complete the task. 98 | 99 | - Yes 100 | - No 101 | 102 | ### Question 7 103 | 104 | 1. Select all the "Lotus" cars from the dataset. 105 | 2. Select only columns "Engine HP", "Engine Cylinders". 106 | 3. Now drop all duplicated rows using `drop_duplicates` method (you should get a dataframe with 9 rows). 107 | 4. Get the underlying NumPy array. Let's call it `X`. 108 | 5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`. 109 | 6. Invert `XTX`. 110 | 7. Create an array `y` with values `[1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800]`. 111 | 8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`. 112 | 9. What's the value of the first element of `w`? 113 | 114 | > **Note**: You just implemented linear regression. We'll talk about it in the next lesson. 115 | 116 | - -0.0723 117 | - 4.5949 118 | - 31.6537 119 | - 63.5643 120 | 121 | 122 | ## Submit the results 123 | 124 | Submit your results here: https://forms.gle/vLp3mvtnrjJxCZx66 125 | 126 | If your answer doesn't match options exactly, select the closest one. 127 | 128 | 129 | ## Deadline 130 | 131 | The deadline for submitting is 12 September 2022 (Monday), 23:00 CEST (Berlin time). 132 | 133 | After that, the form will be closed. 
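As a reference for Question 7 above, the whole sequence of steps can be sketched in a few lines of NumPy and Pandas. This is only a sketch, assuming the dataset was saved locally as `data.csv` and using the "Make", "Engine HP" and "Engine Cylinders" columns mentioned in the questions:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')

# Select the Lotus cars and the two engine columns, then drop duplicated rows
lotus = df[df['Make'] == 'Lotus'][['Engine HP', 'Engine Cylinders']].drop_duplicates()

X = lotus.values              # underlying NumPy array
XTX = X.T @ X                 # matrix-matrix multiplication of X transposed and X
XTX_inv = np.linalg.inv(XTX)  # invert XTX

y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800])

w = XTX_inv @ X.T @ y         # multiply the inverse of XTX by X.T, then by y
print(w[0])                   # first element of w
```

As the note in Question 7 says, this is linear regression: the computation above is the normal equation, which the next session covers in detail.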
134 | -------------------------------------------------------------------------------- /01-Intro/data.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:26e39d3e902246d01a93ae390f51129a288079aefad2cb3292751a262ffd62d8 3 | size 1475504 4 | -------------------------------------------------------------------------------- /02-regression/README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning for Regression : Homework & Dataset 2 | 3 | Session Overview : 4 | 5 | * 2.1 Car price prediction project 6 | * 2.2 Data preparation 7 | * 2.3 Exploratory data analysis 8 | * 2.4 Setting up the validation framework 9 | * 2.5 Linear regression 10 | * 2.6 Linear regression: vector form 11 | * 2.7 Training linear regression: Normal equation 12 | * 2.8 Baseline model for car price prediction project 13 | * 2.9 Root mean squared error 14 | * 2.10 Using RMSE on validation data 15 | * 2.11 Feature engineering 16 | * 2.12 Categorical variables 17 | * 2.13 Regularization 18 | * 2.14 Tuning the model 19 | * 2.15 Using the model 20 | * 2.16 Car price prediction project summary 21 | 22 | --------- 23 | 24 | ## Session #2 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/02-regression/homework.md) 25 | 26 | ### Dataset 27 | 28 | In this homework, we will use the California Housing Prices. You can take it from 29 | [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices). 30 | 31 | The goal of this homework is to create a regression model for predicting housing prices (column `'median_house_value'`). 32 | 33 | ### EDA 34 | 35 | * Load the data. 36 | * Look at the `median_house_value` variable. Does it have a long tail? 37 | 38 | ### Features 39 | 40 | For the rest of the homework, you'll need to use only these columns: 41 | 42 | * `'latitude'`, 43 | * `'longitude'`, 44 | * `'housing_median_age'`, 45 | * `'total_rooms'`, 46 | * `'total_bedrooms'`, 47 | * `'population'`, 48 | * `'households'`, 49 | * `'median_income'`, 50 | * `'median_house_value'` 51 | 52 | Select only them. 53 | 54 | ### Question 1 55 | 56 | Find a feature with missing values. How many missing values does it have? 57 | - 207 58 | - 208 59 | - 307 60 | - 308 61 | 62 | ### Question 2 63 | 64 | What's the median (50% percentile) for variable 'population'? 65 | - 1133 66 | - 1122 67 | - 1166 68 | - 1188 69 | 70 | ### Split the data 71 | 72 | * Shuffle the initial dataset, use seed `42`. 73 | * Split your data in train/val/test sets, with 60%/20%/20% distribution. 74 | * Make sure that the target value ('median_house_value') is not in your dataframe. 75 | * Apply the log transformation to the median_house_value variable using the `np.log1p()` function. 76 | 77 | ### Question 3 78 | 79 | * We need to deal with missing values for the column from Q1. 80 | * We have two options: fill it with 0 or with the mean of this variable. 81 | * Try both options. For each, train a linear regression model without regularization using the code from the lessons. 82 | * For computing the mean, use the training only! 83 | * Use the validation dataset to evaluate the models and compare the RMSE of each option. 84 | * Round the RMSE scores to 2 decimal digits using `round(score, 2)` 85 | * Which option gives better RMSE? 
86 | 87 | Options: 88 | - With 0 89 | - With mean 90 | - With median 91 | - Both are equally good 92 | 93 | ### Question 4 94 | 95 | * Now let's train a regularized linear regression. 96 | * For this question, fill the NAs with 0. 97 | * Try different values of `r` from this list: `[0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]`. 98 | * Use RMSE to evaluate the model on the validation dataset. 99 | * Round the RMSE scores to 2 decimal digits. 100 | * Which `r` gives the best RMSE? 101 | 102 | If there are multiple options, select the smallest `r`. 103 | 104 | Options: 105 | - 0 106 | - 0.000001 107 | - 0.001 108 | - 0.01 109 | 110 | ### Question 5 111 | 112 | * We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score. 113 | * Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`. 114 | * For each seed, do the train/validation/test split with 60%/20%/20% distribution. 115 | * Fill the missing values with 0 and train a model without regularization. 116 | * For each seed, evaluate the model on the validation dataset and collect the RMSE scores. 117 | * What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`. 118 | * Round the result to 3 decimal digits (`round(std, 3)`) 119 | 120 | > Note: Standard deviation shows how different the values are. 121 | > If it's low, then all values are approximately the same. 122 | > If it's high, the values are different. 123 | > If standard deviation of scores is low, then our model is *stable*. 124 | 125 | Options: 126 | - 0.5 127 | - 0.05 128 | - 0.005 129 | - 0.0005 130 | 131 | ### Question 6 132 | 133 | * Split the dataset like previously, use seed 9. 134 | * Combine train and validation datasets. 135 | * Fill the missing values with 0 and train a model with `r=0.001`. 136 | * What's the RMSE on the test dataset? 
137 | 138 | Options: 139 | - 0.35 140 | - 0.035 141 | - 0.45 142 | - 0.045 143 | -------------------------------------------------------------------------------- /02-regression/car_price_data.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:26e39d3e902246d01a93ae390f51129a288079aefad2cb3292751a262ffd62d8 3 | size 1475504 4 | -------------------------------------------------------------------------------- /02-regression/housing.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:8a3727f4cf54ac1a327f69b1d5b4db54c5834ea81c6e4efc0d163300022a685e 3 | size 1423529 4 | -------------------------------------------------------------------------------- /03-classification/Homework-classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "8c1b50fd", 6 | "metadata": {}, 7 | "source": [ 8 | "# Machine Learning for Classification\n", 9 | "House price prediction\n", 10 | "\n", 11 | "## Dataset\n", 12 | "\n", 13 | "Dataset is the California Housing Prices from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).\n", 14 | "\n", 15 | "Here's a wget-able [link](https://github.com/Ksyula/ML_Engineering/blob/master/02-regression/housing.csv):\n", 16 | "\n", 17 | "```bash\n", 18 | "wget https://raw.githubusercontent.com/Ksyula/ML_Engineering/master/02-regression/housing.csv\n", 19 | "```\n", 20 | "\n", 21 | "The goal is to create a regression model for predicting housing prices (column `'median_house_value'`)." 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "id": "b9577f8e", 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "from sklearn.model_selection import train_test_split\n", 32 | "from sklearn.metrics import mutual_info_score, mean_squared_error\n", 33 | "from sklearn.feature_extraction import DictVectorizer\n", 34 | "from sklearn.linear_model import LogisticRegression, Ridge" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 2, 40 | "id": "b31f1df7", 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "data": { 45 | "text/plain": [ 46 | "('1.21.5', '1.4.3')" 47 | ] 48 | }, 49 | "execution_count": 2, 50 | "metadata": {}, 51 | "output_type": "execute_result" 52 | } 53 | ], 54 | "source": [ 55 | "import pandas as pd\n", 56 | "import numpy as np\n", 57 | "import seaborn as sns\n", 58 | "import matplotlib.pyplot as plt\n", 59 | "\n", 60 | "%matplotlib inline\n", 61 | "\n", 62 | "np.__version__, pd.__version__" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "id": "143b2493", 68 | "metadata": {}, 69 | "source": [ 70 | "## Data preparation" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 28, 76 | "id": "83b4d6b1", 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "(20640, 10)" 83 | ] 84 | }, 85 | "execution_count": 28, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "data = pd.read_csv('../02-regression/housing.csv')\n", 92 | "data.shape" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 29, 98 | "id": "cbd291da", 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "data": { 103 | "text/html": [ 104 | "
\n", 105 | "\n", 118 | "\n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | "
01234
latitude37.8837.8637.8537.8537.85
longitude-122.23-122.22-122.24-122.25-122.25
housing_median_age41.021.052.052.052.0
total_rooms880.07099.01467.01274.01627.0
total_bedrooms129.01106.0190.0235.0280.0
population322.02401.0496.0558.0565.0
households126.01138.0177.0219.0259.0
median_income8.32528.30147.25745.64313.8462
median_house_value452600.0358500.0352100.0341300.0342200.0
ocean_proximityNEAR BAYNEAR BAYNEAR BAYNEAR BAYNEAR BAY
\n", 212 | "
" 213 | ], 214 | "text/plain": [ 215 | " 0 1 2 3 4\n", 216 | "latitude 37.88 37.86 37.85 37.85 37.85\n", 217 | "longitude -122.23 -122.22 -122.24 -122.25 -122.25\n", 218 | "housing_median_age 41.0 21.0 52.0 52.0 52.0\n", 219 | "total_rooms 880.0 7099.0 1467.0 1274.0 1627.0\n", 220 | "total_bedrooms 129.0 1106.0 190.0 235.0 280.0\n", 221 | "population 322.0 2401.0 496.0 558.0 565.0\n", 222 | "households 126.0 1138.0 177.0 219.0 259.0\n", 223 | "median_income 8.3252 8.3014 7.2574 5.6431 3.8462\n", 224 | "median_house_value 452600.0 358500.0 352100.0 341300.0 342200.0\n", 225 | "ocean_proximity NEAR BAY NEAR BAY NEAR BAY NEAR BAY NEAR BAY" 226 | ] 227 | }, 228 | "execution_count": 29, 229 | "metadata": {}, 230 | "output_type": "execute_result" 231 | } 232 | ], 233 | "source": [ 234 | "variables = ['latitude','longitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income','median_house_value','ocean_proximity']\n", 235 | "data = data[variables].fillna(0)\n", 236 | "data.head().T" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 30, 242 | "id": "e296b68e", 243 | "metadata": {}, 244 | "outputs": [ 245 | { 246 | "data": { 247 | "text/plain": [ 248 | "latitude float64\n", 249 | "longitude float64\n", 250 | "housing_median_age float64\n", 251 | "total_rooms float64\n", 252 | "total_bedrooms float64\n", 253 | "population float64\n", 254 | "households float64\n", 255 | "median_income float64\n", 256 | "median_house_value float64\n", 257 | "ocean_proximity object\n", 258 | "dtype: object" 259 | ] 260 | }, 261 | "execution_count": 30, 262 | "metadata": {}, 263 | "output_type": "execute_result" 264 | } 265 | ], 266 | "source": [ 267 | "data.dtypes" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "id": "71ec4d78", 273 | "metadata": {}, 274 | "source": [ 275 | "## Feature engineering" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 31, 281 | "id": "07aa89ea", 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "data['rooms_per_household'] = data['total_rooms'] / data['households']\n", 286 | "data['bedrooms_per_room'] = data['total_bedrooms'] / data['total_rooms']\n", 287 | "data['population_per_household'] = data['population'] / data['households']" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "id": "a321082e", 293 | "metadata": {}, 294 | "source": [ 295 | "### Question 1\n", 296 | "\n", 297 | "What is the most frequent observation (mode) for the column `ocean_proximity`?\n", 298 | "\n", 299 | "Options:\n", 300 | "* NEAR BAY\n", 301 | "* **<1H OCEAN**\n", 302 | "* INLAND\n", 303 | "* NEAR OCEAN" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 32, 309 | "id": "0af14577", 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "data": { 314 | "text/plain": [ 315 | "0 <1H OCEAN\n", 316 | "Name: ocean_proximity, dtype: object" 317 | ] 318 | }, 319 | "execution_count": 32, 320 | "metadata": {}, 321 | "output_type": "execute_result" 322 | } 323 | ], 324 | "source": [ 325 | "data['ocean_proximity'].mode()" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "id": "c9a4e452", 331 | "metadata": {}, 332 | "source": [ 333 | "## Set up validation framework" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 33, 339 | "id": "8417e15f", 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "# Split the data in train/val/test sets, with 60%/20%/20% distribution\n", 344 | "def set_up_val_framework(val = 
0.25, test = 0.2):\n", 345 | " \n", 346 | " df_full_train, df_test = train_test_split(data, test_size = test, random_state=42)\n", 347 | " df_train, df_val = train_test_split(df_full_train, test_size = val, random_state=42)\n", 348 | " \n", 349 | " df_train = df_train.reset_index(drop = True)\n", 350 | " df_val = df_val.reset_index(drop = True)\n", 351 | " df_test = df_test.reset_index(drop = True)\n", 352 | " \n", 353 | " y_train = df_train.median_house_value.values\n", 354 | " y_val = df_val.median_house_value.values\n", 355 | " y_test = df_test.median_house_value.values\n", 356 | "\n", 357 | " del df_train['median_house_value']\n", 358 | " del df_val['median_house_value']\n", 359 | " del df_test['median_house_value']\n", 360 | " \n", 361 | " return df_train, df_val, df_test, y_train, y_val, y_test" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": 34, 367 | "id": "8500efba", 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "df_train, df_val, df_test, y_train, y_val, y_test = set_up_val_framework()" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "id": "046cbc63", 377 | "metadata": {}, 378 | "source": [ 379 | "## EDA" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 35, 385 | "id": "e300f992", 386 | "metadata": {}, 387 | "outputs": [ 388 | { 389 | "data": { 390 | "text/plain": [ 391 | "latitude 0\n", 392 | "longitude 0\n", 393 | "housing_median_age 0\n", 394 | "total_rooms 0\n", 395 | "total_bedrooms 0\n", 396 | "population 0\n", 397 | "households 0\n", 398 | "median_income 0\n", 399 | "median_house_value 0\n", 400 | "ocean_proximity 0\n", 401 | "rooms_per_household 0\n", 402 | "bedrooms_per_room 0\n", 403 | "population_per_household 0\n", 404 | "dtype: int64" 405 | ] 406 | }, 407 | "execution_count": 35, 408 | "metadata": {}, 409 | "output_type": "execute_result" 410 | } 411 | ], 412 | "source": [ 413 | "data.isna().sum()" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "id": "92f029d5", 419 | "metadata": {}, 420 | "source": [ 421 | "### Feature importance: Correlation" 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "id": "9f1b88e5", 427 | "metadata": {}, 428 | "source": [ 429 | "#### Question 2\n", 430 | "\n", 431 | "* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.\n", 432 | " - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.\n", 433 | "* What are the two features that have the biggest correlation in this dataset?\n", 434 | "\n", 435 | "Options:\n", 436 | "* **`total_bedrooms` and `households`**\n", 437 | "* `total_bedrooms` and `total_rooms`\n", 438 | "* `population` and `households`\n", 439 | "* `population_per_household` and `total_rooms`" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 36, 445 | "id": "ea04f6c6", 446 | "metadata": {}, 447 | "outputs": [ 448 | { 449 | "data": { 450 | "text/html": [ 451 | "
\n", 452 | "\n", 465 | "\n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | "
level_0level_1coef
35total_bedroomshouseholds0.979399
27total_roomstotal_bedrooms0.931546
0latitudelongitude-0.925005
\n", 495 | "
" 496 | ], 497 | "text/plain": [ 498 | " level_0 level_1 coef\n", 499 | "35 total_bedrooms households 0.979399\n", 500 | "27 total_rooms total_bedrooms 0.931546\n", 501 | "0 latitude longitude -0.925005" 502 | ] 503 | }, 504 | "execution_count": 36, 505 | "metadata": {}, 506 | "output_type": "execute_result" 507 | } 508 | ], 509 | "source": [ 510 | "cor_mat = df_train.corr(method='pearson')\n", 511 | "cor_coefs = cor_mat.where(np.triu(np.ones(cor_mat.shape), k=1).astype(bool)).stack().reset_index().rename(columns={0: \"coef\"})\n", 512 | "cor_coefs.sort_values(by = \"coef\", ascending = False, key=abs).head(3)\n" 513 | ] 514 | }, 515 | { 516 | "cell_type": "markdown", 517 | "id": "d1d2acd3", 518 | "metadata": {}, 519 | "source": [ 520 | "### Target variable\n", 521 | "Make `median_house_value` binary\n", 522 | "\n", 523 | "* We need to turn the `median_house_value` variable from numeric into binary.\n", 524 | "* Let's create a variable `above_average` which is `1` if the `median_house_value` is above its mean value and `0` otherwise." 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": 37, 530 | "id": "2f17d826", 531 | "metadata": {}, 532 | "outputs": [], 533 | "source": [ 534 | "median_house_value_mean = round(data.median_house_value.mean(), 2)" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": 38, 540 | "id": "84f53aa7", 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "y_train = [1 if y > median_house_value_mean else 0 for y in y_train]\n", 545 | "y_val = [1 if y > median_house_value_mean else 0 for y in y_val]\n", 546 | "y_test = [1 if y > median_house_value_mean else 0 for y in y_test]" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "id": "66a75106", 552 | "metadata": {}, 553 | "source": [ 554 | "### Feature importance: Mutual Information\n", 555 | "#### Question 3\n", 556 | "\n", 557 | "* Calculate the mutual information score with the (binarized) price for the categorical variable that we have. 
Use the training set only.\n", 558 | "* What is the value of mutual information?\n", 559 | "* Round it to 2 decimal digits using `round(score, 2)`\n", 560 | "\n", 561 | "Options:\n", 562 | "- 0.263\n", 563 | "- 0.00001\n", 564 | "- **0.101**\n", 565 | "- 0.15555" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": 39, 571 | "id": "dd995dc5", 572 | "metadata": {}, 573 | "outputs": [ 574 | { 575 | "data": { 576 | "text/plain": [ 577 | "0.1" 578 | ] 579 | }, 580 | "execution_count": 39, 581 | "metadata": {}, 582 | "output_type": "execute_result" 583 | } 584 | ], 585 | "source": [ 586 | "round(mutual_info_score(df_train.ocean_proximity, y_train), 2)" 587 | ] 588 | }, 589 | { 590 | "cell_type": "markdown", 591 | "id": "173301cc", 592 | "metadata": {}, 593 | "source": [ 594 | "## One-hot Encoding" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": 40, 600 | "id": "aa5e20c8", 601 | "metadata": {}, 602 | "outputs": [], 603 | "source": [ 604 | "def one_hot_encoding(data):\n", 605 | " \n", 606 | " train_dicts = data.to_dict(orient='records')\n", 607 | " dv = DictVectorizer(sparse = False)\n", 608 | " one_hot_encoded_data = dv.fit_transform(train_dicts)\n", 609 | " return one_hot_encoded_data, dv.get_feature_names_out().tolist()" 610 | ] 611 | }, 612 | { 613 | "cell_type": "code", 614 | "execution_count": 42, 615 | "id": "50852ecc", 616 | "metadata": {}, 617 | "outputs": [ 618 | { 619 | "data": { 620 | "text/plain": [ 621 | "['bedrooms_per_room',\n", 622 | " 'households',\n", 623 | " 'housing_median_age',\n", 624 | " 'latitude',\n", 625 | " 'longitude',\n", 626 | " 'median_income',\n", 627 | " 'ocean_proximity=<1H OCEAN',\n", 628 | " 'ocean_proximity=INLAND',\n", 629 | " 'ocean_proximity=ISLAND',\n", 630 | " 'ocean_proximity=NEAR BAY',\n", 631 | " 'ocean_proximity=NEAR OCEAN',\n", 632 | " 'population',\n", 633 | " 'population_per_household',\n", 634 | " 'rooms_per_household',\n", 635 | " 'total_bedrooms',\n", 636 | " 'total_rooms']" 637 | ] 638 | }, 639 | "execution_count": 42, 640 | "metadata": {}, 641 | "output_type": "execute_result" 642 | } 643 | ], 644 | "source": [ 645 | "X_train, feature_names = one_hot_encoding(df_train)\n", 646 | "feature_names" 647 | ] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "execution_count": 44, 652 | "id": "3486c001", 653 | "metadata": {}, 654 | "outputs": [], 655 | "source": [ 656 | "X_val = one_hot_encoding(df_val)[0]\n", 657 | "X_test = one_hot_encoding(df_test)[0]" 658 | ] 659 | }, 660 | { 661 | "cell_type": "markdown", 662 | "id": "821a8a8a", 663 | "metadata": {}, 664 | "source": [ 665 | "## Logistic Regression\n", 666 | "### Question 4\n", 667 | "\n", 668 | "* Now let's train a logistic regression\n", 669 | "* Remember that we have one categorical variable `ocean_proximity` in the data. Include it using one-hot encoding.\n", 670 | "* Fit the model on the training dataset.\n", 671 | " - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:\n", 672 | " - `model = LogisticRegression(solver=\"liblinear\", C=1.0, max_iter=1000, random_state=42)`\n", 673 | "* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.\n", 674 | "\n", 675 | "Options:\n", 676 | "- 0.60\n", 677 | "- 0.72\n", 678 | "- **0.84**\n", 679 | "- 0.95" 680 | ] 681 | }, 682 | { 683 | "cell_type": "code", 684 | "execution_count": 18, 685 | "id": "61c0d09a", 686 | "metadata": {}, 687 | "outputs": [ 688 | { 689 | "data": { 690 | "text/html": [ 691 | "
LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')
" 692 | ], 693 | "text/plain": [ 694 | "LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')" 695 | ] 696 | }, 697 | "execution_count": 18, 698 | "metadata": {}, 699 | "output_type": "execute_result" 700 | } 701 | ], 702 | "source": [ 703 | "model = LogisticRegression(solver=\"liblinear\", C=1.0, max_iter=1000, random_state=42)\n", 704 | "model.fit(X_train, y_train)" 705 | ] 706 | }, 707 | { 708 | "cell_type": "code", 709 | "execution_count": 19, 710 | "id": "8da5de07", 711 | "metadata": {}, 712 | "outputs": [ 713 | { 714 | "data": { 715 | "text/plain": [ 716 | "-0.08087359748906243" 717 | ] 718 | }, 719 | "execution_count": 19, 720 | "metadata": {}, 721 | "output_type": "execute_result" 722 | } 723 | ], 724 | "source": [ 725 | "model.intercept_[0]" 726 | ] 727 | }, 728 | { 729 | "cell_type": "code", 730 | "execution_count": 20, 731 | "id": "24a587a2", 732 | "metadata": {}, 733 | "outputs": [ 734 | { 735 | "data": { 736 | "text/plain": [ 737 | "array([ 0.171, 0.004, 0.036, 0.116, 0.087, 1.209, 0.471, -1.702,\n", 738 | " 0.018, 0.295, 0.838, -0.002, 0.01 , -0.014, 0.002, -0. ])" 739 | ] 740 | }, 741 | "execution_count": 20, 742 | "metadata": {}, 743 | "output_type": "execute_result" 744 | } 745 | ], 746 | "source": [ 747 | "model.coef_[0].round(3)" 748 | ] 749 | }, 750 | { 751 | "cell_type": "code", 752 | "execution_count": 21, 753 | "id": "d565c669", 754 | "metadata": {}, 755 | "outputs": [ 756 | { 757 | "data": { 758 | "text/plain": [ 759 | "0.84" 760 | ] 761 | }, 762 | "execution_count": 21, 763 | "metadata": {}, 764 | "output_type": "execute_result" 765 | } 766 | ], 767 | "source": [ 768 | "y_pred = model.predict(X_val)\n", 769 | "global_acc = (y_val == y_pred).mean()\n", 770 | "global_acc.round(2)" 771 | ] 772 | }, 773 | { 774 | "cell_type": "markdown", 775 | "id": "8d8dd61b", 776 | "metadata": {}, 777 | "source": [ 778 | "## Feature elimination\n", 779 | "### Question 5 \n", 780 | "\n", 781 | "* Let's find the least useful feature using the *feature elimination* technique.\n", 782 | "* Train a model with all these features (using the same parameters as in Q4).\n", 783 | "* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.\n", 784 | "* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. \n", 785 | "* Which of following feature has the smallest difference? 
\n", 786 | " * `total_rooms`\n", 787 | " * **`total_bedrooms`**\n", 788 | " * `population`\n", 789 | " * `households`\n", 790 | "\n", 791 | "> **note**: the difference doesn't have to be positive" 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "execution_count": 22, 797 | "id": "90866ccb", 798 | "metadata": {}, 799 | "outputs": [ 800 | { 801 | "data": { 802 | "text/plain": [ 803 | "{'bedrooms_per_room': 0.171,\n", 804 | " 'households': 0.004,\n", 805 | " 'housing_median_age': 0.036,\n", 806 | " 'latitude': 0.116,\n", 807 | " 'longitude': 0.087,\n", 808 | " 'median_income': 1.209,\n", 809 | " 'ocean_proximity=<1H OCEAN': 0.471,\n", 810 | " 'ocean_proximity=INLAND': -1.702,\n", 811 | " 'ocean_proximity=ISLAND': 0.018,\n", 812 | " 'ocean_proximity=NEAR BAY': 0.295,\n", 813 | " 'ocean_proximity=NEAR OCEAN': 0.838,\n", 814 | " 'population': -0.002,\n", 815 | " 'population_per_household': 0.01,\n", 816 | " 'rooms_per_household': -0.014,\n", 817 | " 'total_bedrooms': 0.002,\n", 818 | " 'total_rooms': -0.0}" 819 | ] 820 | }, 821 | "execution_count": 22, 822 | "metadata": {}, 823 | "output_type": "execute_result" 824 | } 825 | ], 826 | "source": [ 827 | "dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))" 828 | ] 829 | }, 830 | { 831 | "cell_type": "code", 832 | "execution_count": 46, 833 | "id": "a11101b3", 834 | "metadata": {}, 835 | "outputs": [ 836 | { 837 | "name": "stdout", 838 | "output_type": "stream", 839 | "text": [ 840 | "0.0014534883720930258\n", 841 | "0.00024224806201555982\n", 842 | "0.009689922480620172\n", 843 | "0.002664728682170492\n" 844 | ] 845 | } 846 | ], 847 | "source": [ 848 | "for f in ['total_rooms', 'total_bedrooms', 'population', 'households']:\n", 849 | " features = df_train.columns.tolist()\n", 850 | " features.remove(f)\n", 851 | " # Prepare\n", 852 | " x_train = one_hot_encoding(df_train[features])[0]\n", 853 | " # Train \n", 854 | " model = LogisticRegression(solver=\"liblinear\", C=1.0, max_iter=1000, random_state=42)\n", 855 | " model.fit(x_train, y_train)\n", 856 | " # Validate\n", 857 | " x_val = one_hot_encoding(df_val[features])[0]\n", 858 | "\n", 859 | " y_pred = model.predict(x_val)\n", 860 | " acc = (y_val == y_pred).mean()\n", 861 | " print(abs(global_acc - acc))" 862 | ] 863 | }, 864 | { 865 | "cell_type": "markdown", 866 | "id": "87e76171", 867 | "metadata": {}, 868 | "source": [ 869 | "### Question 6\n", 870 | "\n", 871 | "* For this question, we'll see how to use a linear regression model from Scikit-Learn\n", 872 | "* We'll need to use the original column `'median_house_value'`. Apply the logarithmic transformation to this column.\n", 873 | "* Fit the Ridge regression model (`model = Ridge(alpha=a, solver=\"sag\", random_state=42)`) on the training data.\n", 874 | "* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`\n", 875 | "* Which of these alphas leads to the best RMSE on the validation set? 
Round your RMSE scores to 3 decimal digits.\n", 876 | "\n", 877 | "If there are multiple options, select the smallest `alpha`.\n", 878 | "\n", 879 | "Options:\n", 880 | "- **0**\n", 881 | "- 0.01\n", 882 | "- 0.1\n", 883 | "- 1\n", 884 | "- 10" 885 | ] 886 | }, 887 | { 888 | "cell_type": "code", 889 | "execution_count": 56, 890 | "id": "b2e79bce", 891 | "metadata": {}, 892 | "outputs": [], 893 | "source": [ 894 | "df_train, df_val, df_test, y_train, y_val, y_test = set_up_val_framework()" 895 | ] 896 | }, 897 | { 898 | "cell_type": "code", 899 | "execution_count": 57, 900 | "id": "2da4569e", 901 | "metadata": {}, 902 | "outputs": [], 903 | "source": [ 904 | "y_train = np.log(y_train)\n", 905 | "y_val = np.log(y_val)\n", 906 | "y_test = np.log(y_test)" 907 | ] 908 | }, 909 | { 910 | "cell_type": "code", 911 | "execution_count": 58, 912 | "id": "f50158a9", 913 | "metadata": {}, 914 | "outputs": [], 915 | "source": [ 916 | "def rmse(predictions, targets):\n", 917 | " return np.sqrt(((predictions - targets) ** 2).mean())" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": 62, 923 | "id": "99c543e9", 924 | "metadata": {}, 925 | "outputs": [ 926 | { 927 | "name": "stdout", 928 | "output_type": "stream", 929 | "text": [ 930 | "0.524\n", 931 | "0.524\n", 932 | "0.524\n", 933 | "0.524\n", 934 | "0.524\n" 935 | ] 936 | } 937 | ], 938 | "source": [ 939 | "for a in [0, 0.01, 0.1, 1, 10]:\n", 940 | " model = Ridge(alpha=a, solver=\"sag\", random_state=42)\n", 941 | " \n", 942 | " x_train = one_hot_encoding(df_train)[0]\n", 943 | " \n", 944 | " model.fit(x_train, y_train)\n", 945 | " \n", 946 | " x_val = one_hot_encoding(df_val)[0]\n", 947 | " \n", 948 | " y_pred = model.predict(x_val)\n", 949 | " \n", 950 | " print(rmse(y_val, y_pred).round(3))" 951 | ] 952 | } 953 | ], 954 | "metadata": { 955 | "kernelspec": { 956 | "display_name": "Python 3 (ipykernel)", 957 | "language": "python", 958 | "name": "python3" 959 | }, 960 | "language_info": { 961 | "codemirror_mode": { 962 | "name": "ipython", 963 | "version": 3 964 | }, 965 | "file_extension": ".py", 966 | "mimetype": "text/x-python", 967 | "name": "python", 968 | "nbconvert_exporter": "python", 969 | "pygments_lexer": "ipython3", 970 | "version": "3.9.13" 971 | }, 972 | "toc": { 973 | "base_numbering": 1, 974 | "nav_menu": {}, 975 | "number_sections": true, 976 | "sideBar": true, 977 | "skip_h1_title": true, 978 | "title_cell": "Table of Contents", 979 | "title_sidebar": "Contents", 980 | "toc_cell": false, 981 | "toc_position": {}, 982 | "toc_section_display": true, 983 | "toc_window_display": true 984 | } 985 | }, 986 | "nbformat": 4, 987 | "nbformat_minor": 5 988 | } 989 | -------------------------------------------------------------------------------- /03-classification/README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning for Classification : Homework & Dataset 2 | 3 | Session Overview : 4 | 5 | * 3.1 Churn prediction project 6 | * 3.2 Data preparation 7 | * 3.3 Setting up the validation framework 8 | * 3.4 EDA 9 | * 3.5 Feature importance: Churn rate and risk ratio 10 | * 3.6 Feature importance: Mutual information 11 | * 3.7 Feature importance: Correlation 12 | * 3.8 One-hot encoding 13 | * 3.9 Logistic regression 14 | * 3.10 Training logistic regression with Scikit-Learn 15 | * 3.11 Model interpretation 16 | * 3.12 Using the model 17 | * 3.13 Summary 18 | 19 | --------- 20 | 21 | ## Session #3 Homework 
[link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/03-classification/homework.md) 22 | 23 | ### Dataset 24 | 25 | In this homework, we will use the California Housing Prices data from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices). 26 | 27 | We'll keep working with the `'median_house_value'` variable, and we'll transform it to a classification task. 28 | 29 | 30 | ### Features 31 | 32 | For the rest of the homework, you'll need to use only these columns: 33 | 34 | * `'latitude'`, 35 | * `'longitude'`, 36 | * `'housing_median_age'`, 37 | * `'total_rooms'`, 38 | * `'total_bedrooms'`, 39 | * `'population'`, 40 | * `'households'`, 41 | * `'median_income'`, 42 | * `'median_house_value'` 43 | * `'ocean_proximity'`, 44 | 45 | ### Data preparation 46 | 47 | * Select only the features from above and fill in the missing values with 0. 48 | * Create a new column `rooms_per_household` by dividing the column `total_rooms` by the column `households` from dataframe. 49 | * Create a new column `bedrooms_per_room` by dividing the column `total_bedrooms` by the column `total_rooms` from dataframe. 50 | * Create a new column `population_per_household` by dividing the column `population` by the column `households` from dataframe. 51 | 52 | 53 | ### Question 1 54 | 55 | What is the most frequent observation (mode) for the column `ocean_proximity`? 56 | 57 | Options: 58 | * `NEAR BAY` 59 | * `<1H OCEAN` 60 | * `INLAND` 61 | * `NEAR OCEAN` 62 | 63 | 64 | ## Split the data 65 | 66 | * Split your data in train/val/test sets, with 60%/20%/20% distribution. 67 | * Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42. 68 | * Make sure that the target value (`median_house_value`) is not in your dataframe. 69 | 70 | ### Question 2 71 | 72 | * Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset. 73 | - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset. 74 | * What are the two features that have the biggest correlation in this dataset? 75 | 76 | Options: 77 | * `total_bedrooms` and `households` 78 | * `total_bedrooms` and `total_rooms` 79 | * `population` and `households` 80 | * `population_per_household` and `total_rooms` 81 | 82 | 83 | ### Make `median_house_value` binary 84 | 85 | * We need to turn the `median_house_value` variable from numeric into binary. 86 | * Let's create a variable `above_average` which is `1` if the `median_house_value` is above its mean value and `0` otherwise. 87 | 88 | ### Question 3 89 | 90 | * Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only. 91 | * What is the value of mutual information? 92 | * Round it to 2 decimal digits using `round(score, 2)` 93 | 94 | Options: 95 | - 0.263 96 | - 0.00001 97 | - 0.101 98 | - 0.15555 99 | 100 | 101 | ### Question 4 102 | 103 | * Now let's train a logistic regression 104 | * Remember that we have one categorical variable `ocean_proximity` in the data. Include it using one-hot encoding. 105 | * Fit the model on the training dataset. 
106 | - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters: 107 | - `model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)` 108 | * Calculate the accuracy on the validation dataset and round it to 2 decimal digits. 109 | 110 | Options: 111 | - 0.60 112 | - 0.72 113 | - 0.84 114 | - 0.95 115 | 116 | 117 | ### Question 5 118 | 119 | * Let's find the least useful feature using the *feature elimination* technique. 120 | * Train a model with all these features (using the same parameters as in Q4). 121 | * Now exclude each feature from this set and train a model without it. Record the accuracy for each model. 122 | * For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 123 | * Which of following feature has the smallest difference? 124 | * `total_rooms` 125 | * `total_bedrooms` 126 | * `population` 127 | * `households` 128 | 129 | > **note**: the difference doesn't have to be positive 130 | 131 | 132 | ### Question 6 133 | 134 | * For this question, we'll see how to use a linear regression model from Scikit-Learn 135 | * We'll need to use the original column `'median_house_value'`. Apply the logarithmic transformation to this column. 136 | * Fit the Ridge regression model (`model = Ridge(alpha=a, solver="sag", random_state=42)`) on the training data. 137 | * This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]` 138 | * Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits. 139 | 140 | If there are multiple options, select the smallest `alpha`. 141 | 142 | Options: 143 | - 0 144 | - 0.01 145 | - 0.1 146 | - 1 147 | - 10 148 | -------------------------------------------------------------------------------- /04-evaluation/AER_credit_card_data.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:93a46ce44e0542e5094e9675440fcc36e9a018c2ec879e868726f717ebd47c21 3 | size 73250 4 | -------------------------------------------------------------------------------- /04-evaluation/README.md: -------------------------------------------------------------------------------- 1 | # Evaluation Metrics for Classification : Homework & Dataset 2 | 3 | Session Overview : 4 | 5 | * 4.1 Evaluation metrics: session overview 6 | * 4.2 Accuracy and dummy model 7 | * 4.3 Confusion table 8 | * 4.4 Precision and Recall 9 | * 4.5 ROC Curves 10 | * 4.6 ROC AUC 11 | * 4.7 Cross-Validation 12 | * 4.8 Summary 13 | * 4.9 Explore more 14 | * 4.10 Homework 15 | 16 | --------- 17 | 18 | ## Session #4 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/04-evaluation/homework.md) 19 | 20 | ### Dataset 21 | 22 | In this homework, we will use Credit Card Data from book "Econometric Analysis" - [Link](https://raw.githubusercontent.com/Ksyula/ML_Engineering/master/04-evaluation/AER_credit_card_data.csv) 23 | 24 | The goal of this homework is to inspect the output of different evaluation metrics by creating a classification model (target column `card`). 25 | 26 | ## Preparation 27 | 28 | * Create the target variable by mapping `yes` to 1 and `no` to 0. 29 | * Split the dataset into 3 parts: train/validation/test with 60%/20%/20% distribution. Use `train_test_split` funciton for that with `random_state=1`. 
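A minimal sketch of this preparation step, assuming the dataset was downloaded as `AER_credit_card_data.csv` and splitting in two steps so that 25% of the remaining 80% yields the 20% validation part:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('AER_credit_card_data.csv')

# Target variable: map "yes" to 1 and "no" to 0
df['card'] = (df['card'] == 'yes').astype(int)

# 60%/20%/20% split: hold out 20% for test first,
# then 25% of the remaining 80% for validation
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
```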
30 | 31 | 32 | ## Question 1 33 | 34 | ROC AUC could also be used to evaluate feature importance of numerical variables. 35 | 36 | Let's do that 37 | 38 | * For each numerical variable, use it as score and compute AUC with the `card` variable. 39 | * Use the training dataset for that. 40 | 41 | If your AUC is < 0.5, invert this variable by putting "-" in front 42 | 43 | (e.g. `-df_train['expenditure']`) 44 | 45 | AUC can go below 0.5 if the variable is negatively correlated with the target varialble. You can change the direction of the correlation by negating this variable - then negative correlation becomes positive. 46 | 47 | Which numerical variable (among the following 4) has the highest AUC? 48 | 49 | - `reports` 50 | - `dependents` 51 | - `active` 52 | - `share` 53 | 54 | 55 | ## Training the model 56 | 57 | From now on, use these columns only: 58 | 59 | ``` 60 | ["reports", "age", "income", "share", "expenditure", "dependents", "months", "majorcards", "active", "owner", "selfemp"] 61 | ``` 62 | 63 | Apply one-hot-encoding using `DictVectorizer` and train the logistic regression with these parameters: 64 | 65 | ``` 66 | LogisticRegression(solver='liblinear', C=1.0, max_iter=1000) 67 | ``` 68 | 69 | 70 | ## Question 2 71 | 72 | What's the AUC of this model on the validation dataset? (round to 3 digits) 73 | 74 | - 0.615 75 | - 0.515 76 | - 0.715 77 | - 0.995 78 | 79 | 80 | ## Question 3 81 | 82 | Now let's compute precision and recall for our model. 83 | 84 | * Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01 85 | * For each threshold, compute precision and recall 86 | * Plot them 87 | 88 | 89 | At which threshold precision and recall curves intersect? 90 | 91 | * 0.1 92 | * 0.3 93 | * 0.6 94 | * 0.8 95 | 96 | 97 | ## Question 4 98 | 99 | Precision and recall are conflicting - when one grows, the other goes down. That's why they are often combined into the F1 score - a metrics that takes into account both 100 | 101 | This is the formula for computing F1: 102 | 103 | F1 = 2 * P * R / (P + R) 104 | 105 | Where P is precision and R is recall. 106 | 107 | Let's compute F1 for all thresholds from 0.0 to 1.0 with increment 0.01 108 | 109 | At which threshold F1 is maximal? 110 | 111 | - 0.1 112 | - 0.4 113 | - 0.6 114 | - 0.7 115 | 116 | 117 | ## Question 5 118 | 119 | Use the `KFold` class from Scikit-Learn to evaluate our model on 5 different folds: 120 | 121 | ``` 122 | KFold(n_splits=5, shuffle=True, random_state=1) 123 | ``` 124 | 125 | * Iterate over different folds of `df_full_train` 126 | * Split the data into train and validation 127 | * Train the model on train with these parameters: `LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)` 128 | * Use AUC to evaluate the model on validation 129 | 130 | 131 | How large is standard devidation of the AUC scores across different folds? 132 | 133 | - 0.003 134 | - 0.014 135 | - 0.09 136 | - 0.24 137 | 138 | 139 | ## Question 6 140 | 141 | Now let's use 5-Fold cross-validation to find the best parameter C 142 | 143 | * Iterate over the following C values: `[0.01, 0.1, 1, 10]` 144 | * Initialize `KFold` with the same parameters as previously 145 | * Use these parametes for the model: `LogisticRegression(solver='liblinear', C=C, max_iter=1000)` 146 | * Compute the mean score as well as the std (round the mean and std to 3 decimal digits) 147 | 148 | 149 | Which C leads to the best mean score? 150 | 151 | - 0.01 152 | - 0.1 153 | - 1 154 | - 10 155 | 156 | If you have ties, select the score with the lowest std. 
If you still have ties, select the smallest C 157 | -------------------------------------------------------------------------------- /05-deployment/External_client.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "2df99f83", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import requests" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "id": "4bd6c98a", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "url = \"http://0.0.0.0:9696/predict\"" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 3, 26 | "id": "613d4f2f", 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "customer = {\n", 31 | " \"reports\": 0,\n", 32 | " \"share\": 0.245, \n", 33 | " \"expenditure\": 3.438, \n", 34 | " \"owner\": \"yes\"\n", 35 | "}" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 4, 41 | "id": "058ae893", 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "data": { 46 | "text/plain": [ 47 | "" 48 | ] 49 | }, 50 | "execution_count": 4, 51 | "metadata": {}, 52 | "output_type": "execute_result" 53 | } 54 | ], 55 | "source": [ 56 | "requests.post(url, json=customer)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 5, 62 | "id": "42bb8c4f", 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "data": { 67 | "text/plain": [ 68 | "{'card': True, 'probability': 0.7692649226628628}" 69 | ] 70 | }, 71 | "execution_count": 5, 72 | "metadata": {}, 73 | "output_type": "execute_result" 74 | } 75 | ], 76 | "source": [ 77 | "requests.post(url, json=customer).json()" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "id": "c9e4151d", 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [] 87 | } 88 | ], 89 | "metadata": { 90 | "kernelspec": { 91 | "display_name": "Python 3 (ipykernel)", 92 | "language": "python", 93 | "name": "python3" 94 | }, 95 | "language_info": { 96 | "codemirror_mode": { 97 | "name": "ipython", 98 | "version": 3 99 | }, 100 | "file_extension": ".py", 101 | "mimetype": "text/x-python", 102 | "name": "python", 103 | "nbconvert_exporter": "python", 104 | "pygments_lexer": "ipython3", 105 | "version": "3.9.13" 106 | }, 107 | "toc": { 108 | "base_numbering": 1, 109 | "nav_menu": {}, 110 | "number_sections": true, 111 | "sideBar": true, 112 | "skip_h1_title": true, 113 | "title_cell": "Table of Contents", 114 | "title_sidebar": "Contents", 115 | "toc_cell": false, 116 | "toc_position": {}, 117 | "toc_section_display": true, 118 | "toc_window_display": false 119 | } 120 | }, 121 | "nbformat": 4, 122 | "nbformat_minor": 5 123 | } 124 | -------------------------------------------------------------------------------- /05-deployment/README.md: -------------------------------------------------------------------------------- 1 | # Deploying Machine Learning Models 2 | 3 | Session Overview : 4 | 5 | * Intro / Session overview 6 | * Saving and loading the model 7 | * Web services: introduction to Flask 8 | * Serving the churn model with Flask 9 | * Python virtual environment: Pipenv 10 | * Environment management: Docker 11 | * Deployment to the cloud: AWS Elastic Beanstalk (optional) 12 | * Summary 13 | * Explore more 14 | * Homework 15 | 16 | --------- 17 | 18 | ## Session #5 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/05-deployment/homework.md) 19 | 20 | ### 
Dataset 21 | 22 | In this homework, we will use Credit Card Data from book "Econometric Analysis" - [Link](https://raw.githubusercontent.com/Ksyula/ML_Engineering/master/04-evaluation/AER_credit_card_data.csv) 23 | 24 | ## Question 1 25 | 26 | * Install Pipenv 27 | * What's the version of pipenv you installed? 28 | * Use `--version` to find out 29 | 30 | **2022.10.4** 31 | 32 | ## Question 2 33 | 34 | * Use Pipenv to install Scikit-Learn version 1.0.2 35 | * What's the first hash for scikit-learn you get in Pipfile.lock? 36 | 37 | Note: you should create an empty folder for homework 38 | and do it there. 39 | 40 | **sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b** 41 | 42 | 43 | ## Models 44 | 45 | We've prepared a dictionary vectorizer and a model. 46 | 47 | They were trained (roughly) using this code: 48 | 49 | ```python 50 | features = ['reports', 'share', 'expenditure', 'owner'] 51 | dicts = df[features].to_dict(orient='records') 52 | 53 | dv = DictVectorizer(sparse=False) 54 | X = dv.fit_transform(dicts) 55 | 56 | model = LogisticRegression(solver='liblinear').fit(X, y) 57 | ``` 58 | 59 | > **Note**: You don't need to train the model. This code is just for your reference. 60 | 61 | And then saved with Pickle. Download them: 62 | 63 | * [DictVectorizer](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/05-deployment/homework/dv.bin?raw=true) 64 | * [LogisticRegression](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/05-deployment/homework/model1.bin?raw=true) 65 | 66 | With `wget`: 67 | 68 | ```bash 69 | PREFIX=https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/course-zoomcamp/cohorts/2022/05-deployment/homework 70 | wget $PREFIX/model1.bin 71 | wget $PREFIX/dv.bin 72 | ``` 73 | 74 | 75 | ## Question 3 76 | 77 | Let's use these models! 78 | 79 | * Write a script for loading these models with pickle 80 | * Score this client: 81 | 82 | ```json 83 | {"reports": 0, "share": 0.001694, "expenditure": 0.12, "owner": "yes"} 84 | ``` 85 | 86 | What's the probability that this client will get a credit card? 87 | 88 | * **0.148** 89 | * 0.391 90 | * 0.601 91 | * 0.993 92 | 93 | If you're getting errors when unpickling the files, check their checksum: 94 | 95 | ```bash 96 | $ md5sum model1.bin dv.bin 97 | 3f57f3ebfdf57a9e1368dcd0f28a4a14 model1.bin 98 | 6b7cded86a52af7e81859647fa3a5c2e dv.bin 99 | ``` 100 | 101 | 102 | ## Question 4 103 | 104 | Now let's serve this model as a web service 105 | 106 | * Install Flask and gunicorn (or waitress, if you're on Windows) 107 | * Write Flask code for serving the model 108 | * Now score this client using `requests`: 109 | 110 | ```python 111 | url = "YOUR_URL" 112 | client = {"reports": 0, "share": 0.245, "expenditure": 3.438, "owner": "yes"} 113 | requests.post(url, json=client).json() 114 | ``` 115 | 116 | What's the probability that this client will get a credit card? 117 | 118 | * 0.274 119 | * 0.484 120 | * 0.698 121 | * **0.928** 122 | 123 | 124 | ## Docker 125 | 126 | Install [Docker](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/05-deployment/06-docker.md). We will use it for the next two questions. 127 | 128 | For these questions, we prepared a base image: `svizor/zoomcamp-model:3.9.12-slim`. 129 | You'll need to use it (see Question 5 for an example). 
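For example, you can pull the image and check its size (which Question 5 asks about) like this:

```bash
docker pull svizor/zoomcamp-model:3.9.12-slim
docker images   # the size is reported in the "SIZE" column
```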
130 | 131 | This image is based on `python:3.9.12-slim` and has a logistic regression model 132 | (a different one) as well a dictionary vectorizer inside. 133 | 134 | This is how the Dockerfile for this image looks like: 135 | 136 | ```docker 137 | FROM python:3.9.12-slim 138 | WORKDIR /app 139 | COPY ["model2.bin", "dv.bin", "./"] 140 | ``` 141 | 142 | We already built it and then pushed it to [`svizor/zoomcamp-model:3.9.12-slim`](https://hub.docker.com/r/svizor/zoomcamp-model). 143 | 144 | > **Note**: You don't need to build this docker image, it's just for your reference. 145 | 146 | 147 | ## Question 5 148 | 149 | Download the base image `svizor/zoomcamp-model:3.9.12-slim`. You can easily make it by using [docker pull](https://docs.docker.com/engine/reference/commandline/pull/) command. 150 | 151 | So what's the size of this base image? 152 | 153 | * 15 Mb 154 | * **125 Mb** 155 | * 275 Mb 156 | * 415 Mb 157 | 158 | You can get this information when running `docker images` - it'll be in the "SIZE" column. 159 | 160 | 161 | ## Dockerfile 162 | 163 | Now create your own Dockerfile based on the image we prepared. 164 | 165 | It should start like that: 166 | 167 | ```docker 168 | FROM svizor/zoomcamp-model:3.9.12-slim 169 | # add your stuff here 170 | ``` 171 | 172 | Now complete it: 173 | 174 | * Install all the dependencies form the Pipenv file 175 | * Copy your Flask script 176 | * Run it with Gunicorn 177 | 178 | After that, you can build your docker image. 179 | 180 | 181 | ## Question 6 182 | 183 | Let's run your docker container! 184 | 185 | After running it, score this client once again: 186 | 187 | ```python 188 | url = "YOUR_URL" 189 | client = {"reports": 0, "share": 0.245, "expenditure": 3.438, "owner": "yes"} 190 | requests.post(url, json=client).json() 191 | ``` 192 | 193 | What's the probability that this client will get a credit card now? 
194 | 195 | * 0.289 196 | * 0.502 197 | * **0.769** 198 | * 0.972 -------------------------------------------------------------------------------- /05-deployment/app/Dockerfile: -------------------------------------------------------------------------------- 1 | # First install the python 3.8, the slim version have less size 2 | FROM svizor/zoomcamp-model:3.9.12-slim 3 | 4 | # Install pipenv library in Docker 5 | RUN pip install pipenv 6 | 7 | # we have created a directory in Docker named app and we're using it as work directory 8 | WORKDIR /app 9 | 10 | # Copy the Pip files into our working derectory 11 | COPY ["Pipfile", "Pipfile.lock", "./"] 12 | 13 | # install the pipenv dependecies we had from the project and deploy them 14 | RUN pipenv install --deploy --system 15 | 16 | # Copy any python files and the model we had to the working directory of Docker 17 | COPY ["predict.py", "./"] 18 | 19 | # We need to expose the 9696 port because we're not able to communicate with Docker outside it 20 | EXPOSE 9696 21 | 22 | # If we run the Docker image, we want our churn app to be running 23 | ENTRYPOINT ["gunicorn", "--bind", "0.0.0.0:9696", "predict:app"] -------------------------------------------------------------------------------- /05-deployment/app/Pipfile: -------------------------------------------------------------------------------- 1 | [[source]] 2 | url = "https://pypi.org/simple" 3 | verify_ssl = true 4 | name = "pypi" 5 | 6 | [packages] 7 | numpy = "==1.23.3" 8 | scikit-learn = "==1.0.2" 9 | flask = "==2.2.2" 10 | gunicorn = "==20.1.0" 11 | 12 | [dev-packages] 13 | awsebcli = "==3.20.3" 14 | 15 | 16 | [requires] 17 | python_version = "3.9" 18 | -------------------------------------------------------------------------------- /05-deployment/app/Pipfile.lock: -------------------------------------------------------------------------------- 1 | { 2 | "_meta": { 3 | "hash": { 4 | "sha256": "24181752269acff8f5d7c23ac1c6964e70b0fce51dbe74392cb23deed544d693" 5 | }, 6 | "pipfile-spec": 6, 7 | "requires": { 8 | "python_version": "3.9" 9 | }, 10 | "sources": [ 11 | { 12 | "name": "pypi", 13 | "url": "https://pypi.org/simple", 14 | "verify_ssl": true 15 | } 16 | ] 17 | }, 18 | "default": { 19 | "click": { 20 | "hashes": [ 21 | "sha256:7682dc8afb30297001674575ea00d1814d808d6a36af415a82bd481d37ba7b8e", 22 | "sha256:bb4d8133cb15a609f44e8213d9b391b0809795062913b383c62be0ee95b1db48" 23 | ], 24 | "markers": "python_version >= '3.7'", 25 | "version": "==8.1.3" 26 | }, 27 | "flask": { 28 | "hashes": [ 29 | "sha256:642c450d19c4ad482f96729bd2a8f6d32554aa1e231f4f6b4e7e5264b16cca2b", 30 | "sha256:b9c46cc36662a7949f34b52d8ec7bb59c0d74ba08ba6cb9ce9adc1d8676d9526" 31 | ], 32 | "index": "pypi", 33 | "version": "==2.2.2" 34 | }, 35 | "gunicorn": { 36 | "hashes": [ 37 | "sha256:9dcc4547dbb1cb284accfb15ab5667a0e5d1881cc443e0677b4882a4067a807e", 38 | "sha256:e0a968b5ba15f8a328fdfd7ab1fcb5af4470c28aaf7e55df02a99bc13138e6e8" 39 | ], 40 | "index": "pypi", 41 | "version": "==20.1.0" 42 | }, 43 | "importlib-metadata": { 44 | "hashes": [ 45 | "sha256:da31db32b304314d044d3c12c79bd59e307889b287ad12ff387b3500835fc2ab", 46 | "sha256:ddb0e35065e8938f867ed4928d0ae5bf2a53b7773871bfe6bcc7e4fcdc7dea43" 47 | ], 48 | "markers": "python_version < '3.10'", 49 | "version": "==5.0.0" 50 | }, 51 | "itsdangerous": { 52 | "hashes": [ 53 | "sha256:2c2349112351b88699d8d4b6b075022c0808887cb7ad10069318a8b0bc88db44", 54 | "sha256:5dbbc68b317e5e42f327f9021763545dc3fc3bfe22e6deb96aaf1fc38874156a" 55 | ], 56 | "markers": "python_version >= '3.7'", 
57 | "version": "==2.1.2" 58 | }, 59 | "jinja2": { 60 | "hashes": [ 61 | "sha256:31351a702a408a9e7595a8fc6150fc3f43bb6bf7e319770cbc0db9df9437e852", 62 | "sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61" 63 | ], 64 | "markers": "python_version >= '3.7'", 65 | "version": "==3.1.2" 66 | }, 67 | "joblib": { 68 | "hashes": [ 69 | "sha256:091138ed78f800342968c523bdde947e7a305b8594b910a0fea2ab83c3c6d385", 70 | "sha256:e1cee4a79e4af22881164f218d4311f60074197fb707e082e803b61f6d137018" 71 | ], 72 | "markers": "python_version >= '3.7'", 73 | "version": "==1.2.0" 74 | }, 75 | "markupsafe": { 76 | "hashes": [ 77 | "sha256:0212a68688482dc52b2d45013df70d169f542b7394fc744c02a57374a4207003", 78 | "sha256:089cf3dbf0cd6c100f02945abeb18484bd1ee57a079aefd52cffd17fba910b88", 79 | "sha256:10c1bfff05d95783da83491be968e8fe789263689c02724e0c691933c52994f5", 80 | "sha256:33b74d289bd2f5e527beadcaa3f401e0df0a89927c1559c8566c066fa4248ab7", 81 | "sha256:3799351e2336dc91ea70b034983ee71cf2f9533cdff7c14c90ea126bfd95d65a", 82 | "sha256:3ce11ee3f23f79dbd06fb3d63e2f6af7b12db1d46932fe7bd8afa259a5996603", 83 | "sha256:421be9fbf0ffe9ffd7a378aafebbf6f4602d564d34be190fc19a193232fd12b1", 84 | "sha256:43093fb83d8343aac0b1baa75516da6092f58f41200907ef92448ecab8825135", 85 | "sha256:46d00d6cfecdde84d40e572d63735ef81423ad31184100411e6e3388d405e247", 86 | "sha256:4a33dea2b688b3190ee12bd7cfa29d39c9ed176bda40bfa11099a3ce5d3a7ac6", 87 | "sha256:4b9fe39a2ccc108a4accc2676e77da025ce383c108593d65cc909add5c3bd601", 88 | "sha256:56442863ed2b06d19c37f94d999035e15ee982988920e12a5b4ba29b62ad1f77", 89 | "sha256:671cd1187ed5e62818414afe79ed29da836dde67166a9fac6d435873c44fdd02", 90 | "sha256:694deca8d702d5db21ec83983ce0bb4b26a578e71fbdbd4fdcd387daa90e4d5e", 91 | "sha256:6a074d34ee7a5ce3effbc526b7083ec9731bb3cbf921bbe1d3005d4d2bdb3a63", 92 | "sha256:6d0072fea50feec76a4c418096652f2c3238eaa014b2f94aeb1d56a66b41403f", 93 | "sha256:6fbf47b5d3728c6aea2abb0589b5d30459e369baa772e0f37a0320185e87c980", 94 | "sha256:7f91197cc9e48f989d12e4e6fbc46495c446636dfc81b9ccf50bb0ec74b91d4b", 95 | "sha256:86b1f75c4e7c2ac2ccdaec2b9022845dbb81880ca318bb7a0a01fbf7813e3812", 96 | "sha256:8dc1c72a69aa7e082593c4a203dcf94ddb74bb5c8a731e4e1eb68d031e8498ff", 97 | "sha256:8e3dcf21f367459434c18e71b2a9532d96547aef8a871872a5bd69a715c15f96", 98 | "sha256:8e576a51ad59e4bfaac456023a78f6b5e6e7651dcd383bcc3e18d06f9b55d6d1", 99 | "sha256:96e37a3dc86e80bf81758c152fe66dbf60ed5eca3d26305edf01892257049925", 100 | "sha256:97a68e6ada378df82bc9f16b800ab77cbf4b2fada0081794318520138c088e4a", 101 | "sha256:99a2a507ed3ac881b975a2976d59f38c19386d128e7a9a18b7df6fff1fd4c1d6", 102 | "sha256:a49907dd8420c5685cfa064a1335b6754b74541bbb3706c259c02ed65b644b3e", 103 | "sha256:b09bf97215625a311f669476f44b8b318b075847b49316d3e28c08e41a7a573f", 104 | "sha256:b7bd98b796e2b6553da7225aeb61f447f80a1ca64f41d83612e6139ca5213aa4", 105 | "sha256:b87db4360013327109564f0e591bd2a3b318547bcef31b468a92ee504d07ae4f", 106 | "sha256:bcb3ed405ed3222f9904899563d6fc492ff75cce56cba05e32eff40e6acbeaa3", 107 | "sha256:d4306c36ca495956b6d568d276ac11fdd9c30a36f1b6eb928070dc5360b22e1c", 108 | "sha256:d5ee4f386140395a2c818d149221149c54849dfcfcb9f1debfe07a8b8bd63f9a", 109 | "sha256:dda30ba7e87fbbb7eab1ec9f58678558fd9a6b8b853530e176eabd064da81417", 110 | "sha256:e04e26803c9c3851c931eac40c695602c6295b8d432cbe78609649ad9bd2da8a", 111 | "sha256:e1c0b87e09fa55a220f058d1d49d3fb8df88fbfab58558f1198e08c1e1de842a", 112 | "sha256:e72591e9ecd94d7feb70c1cbd7be7b3ebea3f548870aa91e2732960fa4d57a37", 113 | 
"sha256:e8c843bbcda3a2f1e3c2ab25913c80a3c5376cd00c6e8c4a86a89a28c8dc5452", 114 | "sha256:efc1913fd2ca4f334418481c7e595c00aad186563bbc1ec76067848c7ca0a933", 115 | "sha256:f121a1420d4e173a5d96e47e9a0c0dcff965afdf1626d28de1460815f7c4ee7a", 116 | "sha256:fc7b548b17d238737688817ab67deebb30e8073c95749d55538ed473130ec0c7" 117 | ], 118 | "markers": "python_version >= '3.7'", 119 | "version": "==2.1.1" 120 | }, 121 | "numpy": { 122 | "hashes": [ 123 | "sha256:004f0efcb2fe1c0bd6ae1fcfc69cc8b6bf2407e0f18be308612007a0762b4089", 124 | "sha256:09f6b7bdffe57fc61d869a22f506049825d707b288039d30f26a0d0d8ea05164", 125 | "sha256:0ea3f98a0ffce3f8f57675eb9119f3f4edb81888b6874bc1953f91e0b1d4f440", 126 | "sha256:17c0e467ade9bda685d5ac7f5fa729d8d3e76b23195471adae2d6a6941bd2c18", 127 | "sha256:1f27b5322ac4067e67c8f9378b41c746d8feac8bdd0e0ffede5324667b8a075c", 128 | "sha256:22d43376ee0acd547f3149b9ec12eec2f0ca4a6ab2f61753c5b29bb3e795ac4d", 129 | "sha256:2ad3ec9a748a8943e6eb4358201f7e1c12ede35f510b1a2221b70af4bb64295c", 130 | "sha256:301c00cf5e60e08e04d842fc47df641d4a181e651c7135c50dc2762ffe293dbd", 131 | "sha256:39a664e3d26ea854211867d20ebcc8023257c1800ae89773cbba9f9e97bae036", 132 | "sha256:51bf49c0cd1d52be0a240aa66f3458afc4b95d8993d2d04f0d91fa60c10af6cd", 133 | "sha256:78a63d2df1d947bd9d1b11d35564c2f9e4b57898aae4626638056ec1a231c40c", 134 | "sha256:7cd1328e5bdf0dee621912f5833648e2daca72e3839ec1d6695e91089625f0b4", 135 | "sha256:8355fc10fd33a5a70981a5b8a0de51d10af3688d7a9e4a34fcc8fa0d7467bb7f", 136 | "sha256:8c79d7cf86d049d0c5089231a5bcd31edb03555bd93d81a16870aa98c6cfb79d", 137 | "sha256:91b8d6768a75247026e951dce3b2aac79dc7e78622fc148329135ba189813584", 138 | "sha256:94c15ca4e52671a59219146ff584488907b1f9b3fc232622b47e2cf832e94fb8", 139 | "sha256:98dcbc02e39b1658dc4b4508442a560fe3ca5ca0d989f0df062534e5ca3a5c1a", 140 | "sha256:a64403f634e5ffdcd85e0b12c08f04b3080d3e840aef118721021f9b48fc1460", 141 | "sha256:bc6e8da415f359b578b00bcfb1d08411c96e9a97f9e6c7adada554a0812a6cc6", 142 | "sha256:bdc9febce3e68b697d931941b263c59e0c74e8f18861f4064c1f712562903411", 143 | "sha256:c1ba66c48b19cc9c2975c0d354f24058888cdc674bebadceb3cdc9ec403fb5d1", 144 | "sha256:c9f707b5bb73bf277d812ded9896f9512a43edff72712f31667d0a8c2f8e71ee", 145 | "sha256:d5422d6a1ea9b15577a9432e26608c73a78faf0b9039437b075cf322c92e98e7", 146 | "sha256:e5d5420053bbb3dd64c30e58f9363d7a9c27444c3648e61460c1237f9ec3fa14", 147 | "sha256:e868b0389c5ccfc092031a861d4e158ea164d8b7fdbb10e3b5689b4fc6498df6", 148 | "sha256:efd9d3abe5774404becdb0748178b48a218f1d8c44e0375475732211ea47c67e", 149 | "sha256:f8c02ec3c4c4fcb718fdf89a6c6f709b14949408e8cf2a2be5bfa9c49548fd85", 150 | "sha256:ffcf105ecdd9396e05a8e58e81faaaf34d3f9875f137c7372450baa5d77c9a54" 151 | ], 152 | "index": "pypi", 153 | "version": "==1.23.3" 154 | }, 155 | "scikit-learn": { 156 | "hashes": [ 157 | "sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b", 158 | "sha256:158faf30684c92a78e12da19c73feff9641a928a8024b4fa5ec11d583f3d8a87", 159 | "sha256:16455ace947d8d9e5391435c2977178d0ff03a261571e67f627c8fee0f9d431a", 160 | "sha256:245c9b5a67445f6f044411e16a93a554edc1efdcce94d3fc0bc6a4b9ac30b752", 161 | "sha256:285db0352e635b9e3392b0b426bc48c3b485512d3b4ac3c7a44ec2a2ba061e66", 162 | "sha256:2f3b453e0b149898577e301d27e098dfe1a36943f7bb0ad704d1e548efc3b448", 163 | "sha256:46f431ec59dead665e1370314dbebc99ead05e1c0a9df42f22d6a0e00044820f", 164 | "sha256:55f2f3a8414e14fbee03782f9fe16cca0f141d639d2b1c1a36779fa069e1db57", 165 | "sha256:5cb33fe1dc6f73dc19e67b264dbb5dde2a0539b986435fdd78ed978c14654830", 166 | 
"sha256:75307d9ea39236cad7eea87143155eea24d48f93f3a2f9389c817f7019f00705", 167 | "sha256:7626a34eabbf370a638f32d1a3ad50526844ba58d63e3ab81ba91e2a7c6d037e", 168 | "sha256:7a93c1292799620df90348800d5ac06f3794c1316ca247525fa31169f6d25855", 169 | "sha256:7d6b2475f1c23a698b48515217eb26b45a6598c7b1840ba23b3c5acece658dbb", 170 | "sha256:80095a1e4b93bd33261ef03b9bc86d6db649f988ea4dbcf7110d0cded8d7213d", 171 | "sha256:85260fb430b795d806251dd3bb05e6f48cdc777ac31f2bcf2bc8bbed3270a8f5", 172 | "sha256:9369b030e155f8188743eb4893ac17a27f81d28a884af460870c7c072f114243", 173 | "sha256:a053a6a527c87c5c4fa7bf1ab2556fa16d8345cf99b6c5a19030a4a7cd8fd2c0", 174 | "sha256:a90b60048f9ffdd962d2ad2fb16367a87ac34d76e02550968719eb7b5716fd10", 175 | "sha256:a999c9f02ff9570c783069f1074f06fe7386ec65b84c983db5aeb8144356a355", 176 | "sha256:b1391d1a6e2268485a63c3073111fe3ba6ec5145fc957481cfd0652be571226d", 177 | "sha256:b54a62c6e318ddbfa7d22c383466d38d2ee770ebdb5ddb668d56a099f6eaf75f", 178 | "sha256:b5870959a5484b614f26d31ca4c17524b1b0317522199dc985c3b4256e030767", 179 | "sha256:bc3744dabc56b50bec73624aeca02e0def06b03cb287de26836e730659c5d29c", 180 | "sha256:d93d4c28370aea8a7cbf6015e8a669cd5d69f856cc2aa44e7a590fb805bb5583", 181 | "sha256:d9aac97e57c196206179f674f09bc6bffcd0284e2ba95b7fe0b402ac3f986023", 182 | "sha256:da3c84694ff693b5b3194d8752ccf935a665b8b5edc33a283122f4273ca3e687", 183 | "sha256:e174242caecb11e4abf169342641778f68e1bfaba80cd18acd6bc84286b9a534", 184 | "sha256:eabceab574f471de0b0eb3f2ecf2eee9f10b3106570481d007ed1c84ebf6d6a1", 185 | "sha256:f14517e174bd7332f1cca2c959e704696a5e0ba246eb8763e6c24876d8710049", 186 | "sha256:fa38a1b9b38ae1fad2863eff5e0d69608567453fdfc850c992e6e47eb764e846", 187 | "sha256:ff3fa8ea0e09e38677762afc6e14cad77b5e125b0ea70c9bba1992f02c93b028", 188 | "sha256:ff746a69ff2ef25f62b36338c615dd15954ddc3ab8e73530237dd73235e76d62" 189 | ], 190 | "index": "pypi", 191 | "version": "==1.0.2" 192 | }, 193 | "scipy": { 194 | "hashes": [ 195 | "sha256:0e9c83dccac06f3b9aa02df69577f239758d5d0d0c069673fb0b47ecb971983d", 196 | "sha256:148cb6f53d9d10dafde848e9aeb1226bf2809d16dc3221b2fa568130b6f2e586", 197 | "sha256:17be1a7c68ec4c49d8cd4eb1655d55d14a54ab63012296bdd5921c92dc485acd", 198 | "sha256:1e3b23a82867018cd26255dc951789a7c567921622073e1113755866f1eae928", 199 | "sha256:22380e076a162e81b659d53d75b02e9c75ad14ea2d53d9c645a12543414e2150", 200 | "sha256:4012dbe540732311b8f4388b7e1482eb43a7cc0435bbf2b9916b3d6c38fb8d01", 201 | "sha256:5994a8232cc6510a8e85899661df2d11198bf362f0ffe6fbd5c0aca17ab46ce3", 202 | "sha256:61b95283529712101bfb7c87faf94cb86ed9e64de079509edfe107e5cfa55733", 203 | "sha256:658fd31c6ad4eb9fa3fd460fcac779f70a6bc7480288a211b7658a25891cf01d", 204 | "sha256:7b2608b3141c257d01ae772e23b3de9e04d27344e6b68a890883795229cb7191", 205 | "sha256:82e8bfb352aa9dce9a0ffe81f4c369a2c87c85533519441686f59f21d8c09697", 206 | "sha256:885b7ac56d7460544b2ef89ab9feafa30f4264c9825d975ef690608d07e6cc55", 207 | "sha256:8c8c29703202c39d699b0d6b164bde5501c212005f20abf46ae322b9307c8a41", 208 | "sha256:92c5e627a0635ca02e6494bbbdb74f98d93ac8730416209d61de3b70c8a821be", 209 | "sha256:99e7720caefb8bca6ebf05c7d96078ed202881f61e0c68bd9e0f3e8097d6f794", 210 | "sha256:a72297eb9702576bd8f626bb488fd32bb35349d3120fc4a5e733db137f06c9a6", 211 | "sha256:aa270cc6080c987929335c4cb94e8054fee9a6058cecff22276fa5dbab9856fc", 212 | "sha256:b6194da32e0ce9200b2eda4eb4edb89c5cb8b83d6deaf7c35f8ad3d5d7627d5c", 213 | "sha256:bbed414fc25d64bd6d1613dc0286fbf91902219b8be63ad254525162235b67e9", 214 | 
"sha256:d6cb1f92ded3fc48f7dbe94d20d7b9887e13b874e79043907de541c841563b4c", 215 | "sha256:ee4ceed204f269da19f67f0115a85d3a2cd8547185037ad99a4025f9c61d02e9" 216 | ], 217 | "markers": "python_version >= '3.8'", 218 | "version": "==1.9.2" 219 | }, 220 | "setuptools": { 221 | "hashes": [ 222 | "sha256:1b6bdc6161661409c5f21508763dc63ab20a9ac2f8ba20029aaaa7fdb9118012", 223 | "sha256:3050e338e5871e70c72983072fe34f6032ae1cdeeeb67338199c2f74e083a80e" 224 | ], 225 | "markers": "python_version >= '3.7'", 226 | "version": "==65.4.1" 227 | }, 228 | "threadpoolctl": { 229 | "hashes": [ 230 | "sha256:8b99adda265feb6773280df41eece7b2e6561b772d21ffd52e372f999024907b", 231 | "sha256:a335baacfaa4400ae1f0d8e3a58d6674d2f8828e3716bb2802c44955ad391380" 232 | ], 233 | "markers": "python_version >= '3.6'", 234 | "version": "==3.1.0" 235 | }, 236 | "werkzeug": { 237 | "hashes": [ 238 | "sha256:7ea2d48322cc7c0f8b3a215ed73eabd7b5d75d0b50e31ab006286ccff9e00b8f", 239 | "sha256:f979ab81f58d7318e064e99c4506445d60135ac5cd2e177a2de0089bfd4c9bd5" 240 | ], 241 | "markers": "python_version >= '3.7'", 242 | "version": "==2.2.2" 243 | }, 244 | "zipp": { 245 | "hashes": [ 246 | "sha256:3a7af91c3db40ec72dd9d154ae18e008c69efe8ca88dde4f9a731bb82fe2f9eb", 247 | "sha256:972cfa31bc2fedd3fa838a51e9bc7e64b7fb725a8c00e7431554311f180e9980" 248 | ], 249 | "markers": "python_version >= '3.7'", 250 | "version": "==3.9.0" 251 | } 252 | }, 253 | "develop": { 254 | "attrs": { 255 | "hashes": [ 256 | "sha256:29adc2665447e5191d0e7c568fde78b21f9672d344281d0c6e1ab085429b22b6", 257 | "sha256:86efa402f67bf2df34f51a335487cf46b1ec130d02b8d39fd248abfd30da551c" 258 | ], 259 | "markers": "python_version >= '3.5'", 260 | "version": "==22.1.0" 261 | }, 262 | "awsebcli": { 263 | "hashes": [ 264 | "sha256:5b79d45cf017a2270340d5e2824b51d7e6fff56eaabe2f747ee8eb5c07a4c535" 265 | ], 266 | "index": "pypi", 267 | "version": "==3.20.3" 268 | }, 269 | "bcrypt": { 270 | "hashes": [ 271 | "sha256:089098effa1bc35dc055366740a067a2fc76987e8ec75349eb9484061c54f535", 272 | "sha256:08d2947c490093a11416df18043c27abe3921558d2c03e2076ccb28a116cb6d0", 273 | "sha256:0eaa47d4661c326bfc9d08d16debbc4edf78778e6aaba29c1bc7ce67214d4410", 274 | "sha256:27d375903ac8261cfe4047f6709d16f7d18d39b1ec92aaf72af989552a650ebd", 275 | "sha256:2b3ac11cf45161628f1f3733263e63194f22664bf4d0c0f3ab34099c02134665", 276 | "sha256:2caffdae059e06ac23fce178d31b4a702f2a3264c20bfb5ff541b338194d8fab", 277 | "sha256:3100851841186c25f127731b9fa11909ab7b1df6fc4b9f8353f4f1fd952fbf71", 278 | "sha256:5ad4d32a28b80c5fa6671ccfb43676e8c1cc232887759d1cd7b6f56ea4355215", 279 | "sha256:67a97e1c405b24f19d08890e7ae0c4f7ce1e56a712a016746c8b2d7732d65d4b", 280 | "sha256:705b2cea8a9ed3d55b4491887ceadb0106acf7c6387699fca771af56b1cdeeda", 281 | "sha256:8a68f4341daf7522fe8d73874de8906f3a339048ba406be6ddc1b3ccb16fc0d9", 282 | "sha256:a522427293d77e1c29e303fc282e2d71864579527a04ddcfda6d4f8396c6c36a", 283 | "sha256:ae88eca3024bb34bb3430f964beab71226e761f51b912de5133470b649d82344", 284 | "sha256:b1023030aec778185a6c16cf70f359cbb6e0c289fd564a7cfa29e727a1c38f8f", 285 | "sha256:b3b85202d95dd568efcb35b53936c5e3b3600c7cdcc6115ba461df3a8e89f38d", 286 | "sha256:b57adba8a1444faf784394de3436233728a1ecaeb6e07e8c22c8848f179b893c", 287 | "sha256:bf4fa8b2ca74381bb5442c089350f09a3f17797829d958fad058d6e44d9eb83c", 288 | "sha256:ca3204d00d3cb2dfed07f2d74a25f12fc12f73e606fcaa6975d1f7ae69cacbb2", 289 | "sha256:cbb03eec97496166b704ed663a53680ab57c5084b2fc98ef23291987b525cb7d", 290 | 
"sha256:e9a51bbfe7e9802b5f3508687758b564069ba937748ad7b9e890086290d2f79e", 291 | "sha256:fbdaec13c5105f0c4e5c52614d04f0bca5f5af007910daa8b6b12095edaa67b3" 292 | ], 293 | "markers": "python_version >= '3.6'", 294 | "version": "==4.0.1" 295 | }, 296 | "blessed": { 297 | "hashes": [ 298 | "sha256:63b8554ae2e0e7f43749b6715c734cc8f3883010a809bf16790102563e6cf25b", 299 | "sha256:9a0d099695bf621d4680dd6c73f6ad547f6a3442fbdbe80c4b1daa1edbc492fc" 300 | ], 301 | "markers": "python_version >= '2.7'", 302 | "version": "==1.19.1" 303 | }, 304 | "botocore": { 305 | "hashes": [ 306 | "sha256:06ae8076c4dcf3d72bec4d37e5f2dce4a92a18a8cdaa3bfaa6e3b7b5e30a8d7e", 307 | "sha256:4bb9ba16cccee5f5a2602049bc3e2db6865346b2550667f3013bdf33b0a01ceb" 308 | ], 309 | "markers": "python_version >= '3.6'", 310 | "version": "==1.23.54" 311 | }, 312 | "cached-property": { 313 | "hashes": [ 314 | "sha256:9fa5755838eecbb2d234c3aa390bd80fbd3ac6b6869109bfc1b499f7bd89a130", 315 | "sha256:df4f613cf7ad9a588cc381aaf4a512d26265ecebd5eb9e1ba12f1319eb85a6a0" 316 | ], 317 | "version": "==1.5.2" 318 | }, 319 | "cement": { 320 | "hashes": [ 321 | "sha256:8765ed052c061d74e4d0189addc33d268de544ca219b259d797741f725e422d2" 322 | ], 323 | "version": "==2.8.2" 324 | }, 325 | "certifi": { 326 | "hashes": [ 327 | "sha256:0d9c601124e5a6ba9712dbc60d9c53c21e34f5f641fe83002317394311bdce14", 328 | "sha256:90c1a32f1d68f940488354e36370f6cca89f0f106db09518524c88d6ed83f382" 329 | ], 330 | "markers": "python_version >= '3.6'", 331 | "version": "==2022.9.24" 332 | }, 333 | "cffi": { 334 | "hashes": [ 335 | "sha256:00a9ed42e88df81ffae7a8ab6d9356b371399b91dbdf0c3cb1e84c03a13aceb5", 336 | "sha256:03425bdae262c76aad70202debd780501fabeaca237cdfddc008987c0e0f59ef", 337 | "sha256:04ed324bda3cda42b9b695d51bb7d54b680b9719cfab04227cdd1e04e5de3104", 338 | "sha256:0e2642fe3142e4cc4af0799748233ad6da94c62a8bec3a6648bf8ee68b1c7426", 339 | "sha256:173379135477dc8cac4bc58f45db08ab45d228b3363adb7af79436135d028405", 340 | "sha256:198caafb44239b60e252492445da556afafc7d1e3ab7a1fb3f0584ef6d742375", 341 | "sha256:1e74c6b51a9ed6589199c787bf5f9875612ca4a8a0785fb2d4a84429badaf22a", 342 | "sha256:2012c72d854c2d03e45d06ae57f40d78e5770d252f195b93f581acf3ba44496e", 343 | "sha256:21157295583fe8943475029ed5abdcf71eb3911894724e360acff1d61c1d54bc", 344 | "sha256:2470043b93ff09bf8fb1d46d1cb756ce6132c54826661a32d4e4d132e1977adf", 345 | "sha256:285d29981935eb726a4399badae8f0ffdff4f5050eaa6d0cfc3f64b857b77185", 346 | "sha256:30d78fbc8ebf9c92c9b7823ee18eb92f2e6ef79b45ac84db507f52fbe3ec4497", 347 | "sha256:320dab6e7cb2eacdf0e658569d2575c4dad258c0fcc794f46215e1e39f90f2c3", 348 | "sha256:33ab79603146aace82c2427da5ca6e58f2b3f2fb5da893ceac0c42218a40be35", 349 | "sha256:3548db281cd7d2561c9ad9984681c95f7b0e38881201e157833a2342c30d5e8c", 350 | "sha256:3799aecf2e17cf585d977b780ce79ff0dc9b78d799fc694221ce814c2c19db83", 351 | "sha256:39d39875251ca8f612b6f33e6b1195af86d1b3e60086068be9cc053aa4376e21", 352 | "sha256:3b926aa83d1edb5aa5b427b4053dc420ec295a08e40911296b9eb1b6170f6cca", 353 | "sha256:3bcde07039e586f91b45c88f8583ea7cf7a0770df3a1649627bf598332cb6984", 354 | "sha256:3d08afd128ddaa624a48cf2b859afef385b720bb4b43df214f85616922e6a5ac", 355 | "sha256:3eb6971dcff08619f8d91607cfc726518b6fa2a9eba42856be181c6d0d9515fd", 356 | "sha256:40f4774f5a9d4f5e344f31a32b5096977b5d48560c5592e2f3d2c4374bd543ee", 357 | "sha256:4289fc34b2f5316fbb762d75362931e351941fa95fa18789191b33fc4cf9504a", 358 | "sha256:470c103ae716238bbe698d67ad020e1db9d9dba34fa5a899b5e21577e6d52ed2", 359 | 
"sha256:4f2c9f67e9821cad2e5f480bc8d83b8742896f1242dba247911072d4fa94c192", 360 | "sha256:50a74364d85fd319352182ef59c5c790484a336f6db772c1a9231f1c3ed0cbd7", 361 | "sha256:54a2db7b78338edd780e7ef7f9f6c442500fb0d41a5a4ea24fff1c929d5af585", 362 | "sha256:5635bd9cb9731e6d4a1132a498dd34f764034a8ce60cef4f5319c0541159392f", 363 | "sha256:59c0b02d0a6c384d453fece7566d1c7e6b7bae4fc5874ef2ef46d56776d61c9e", 364 | "sha256:5d598b938678ebf3c67377cdd45e09d431369c3b1a5b331058c338e201f12b27", 365 | "sha256:5df2768244d19ab7f60546d0c7c63ce1581f7af8b5de3eb3004b9b6fc8a9f84b", 366 | "sha256:5ef34d190326c3b1f822a5b7a45f6c4535e2f47ed06fec77d3d799c450b2651e", 367 | "sha256:6975a3fac6bc83c4a65c9f9fcab9e47019a11d3d2cf7f3c0d03431bf145a941e", 368 | "sha256:6c9a799e985904922a4d207a94eae35c78ebae90e128f0c4e521ce339396be9d", 369 | "sha256:70df4e3b545a17496c9b3f41f5115e69a4f2e77e94e1d2a8e1070bc0c38c8a3c", 370 | "sha256:7473e861101c9e72452f9bf8acb984947aa1661a7704553a9f6e4baa5ba64415", 371 | "sha256:8102eaf27e1e448db915d08afa8b41d6c7ca7a04b7d73af6514df10a3e74bd82", 372 | "sha256:87c450779d0914f2861b8526e035c5e6da0a3199d8f1add1a665e1cbc6fc6d02", 373 | "sha256:8b7ee99e510d7b66cdb6c593f21c043c248537a32e0bedf02e01e9553a172314", 374 | "sha256:91fc98adde3d7881af9b59ed0294046f3806221863722ba7d8d120c575314325", 375 | "sha256:94411f22c3985acaec6f83c6df553f2dbe17b698cc7f8ae751ff2237d96b9e3c", 376 | "sha256:98d85c6a2bef81588d9227dde12db8a7f47f639f4a17c9ae08e773aa9c697bf3", 377 | "sha256:9ad5db27f9cabae298d151c85cf2bad1d359a1b9c686a275df03385758e2f914", 378 | "sha256:a0b71b1b8fbf2b96e41c4d990244165e2c9be83d54962a9a1d118fd8657d2045", 379 | "sha256:a0f100c8912c114ff53e1202d0078b425bee3649ae34d7b070e9697f93c5d52d", 380 | "sha256:a591fe9e525846e4d154205572a029f653ada1a78b93697f3b5a8f1f2bc055b9", 381 | "sha256:a5c84c68147988265e60416b57fc83425a78058853509c1b0629c180094904a5", 382 | "sha256:a66d3508133af6e8548451b25058d5812812ec3798c886bf38ed24a98216fab2", 383 | "sha256:a8c4917bd7ad33e8eb21e9a5bbba979b49d9a97acb3a803092cbc1133e20343c", 384 | "sha256:b3bbeb01c2b273cca1e1e0c5df57f12dce9a4dd331b4fa1635b8bec26350bde3", 385 | "sha256:cba9d6b9a7d64d4bd46167096fc9d2f835e25d7e4c121fb2ddfc6528fb0413b2", 386 | "sha256:cc4d65aeeaa04136a12677d3dd0b1c0c94dc43abac5860ab33cceb42b801c1e8", 387 | "sha256:ce4bcc037df4fc5e3d184794f27bdaab018943698f4ca31630bc7f84a7b69c6d", 388 | "sha256:cec7d9412a9102bdc577382c3929b337320c4c4c4849f2c5cdd14d7368c5562d", 389 | "sha256:d400bfb9a37b1351253cb402671cea7e89bdecc294e8016a707f6d1d8ac934f9", 390 | "sha256:d61f4695e6c866a23a21acab0509af1cdfd2c013cf256bbf5b6b5e2695827162", 391 | "sha256:db0fbb9c62743ce59a9ff687eb5f4afbe77e5e8403d6697f7446e5f609976f76", 392 | "sha256:dd86c085fae2efd48ac91dd7ccffcfc0571387fe1193d33b6394db7ef31fe2a4", 393 | "sha256:e00b098126fd45523dd056d2efba6c5a63b71ffe9f2bbe1a4fe1716e1d0c331e", 394 | "sha256:e229a521186c75c8ad9490854fd8bbdd9a0c9aa3a524326b55be83b54d4e0ad9", 395 | "sha256:e263d77ee3dd201c3a142934a086a4450861778baaeeb45db4591ef65550b0a6", 396 | "sha256:ed9cb427ba5504c1dc15ede7d516b84757c3e3d7868ccc85121d9310d27eed0b", 397 | "sha256:fa6693661a4c91757f4412306191b6dc88c1703f780c8234035eac011922bc01", 398 | "sha256:fcd131dd944808b5bdb38e6f5b53013c5aa4f334c5cad0c72742f6eba4b73db0" 399 | ], 400 | "version": "==1.15.1" 401 | }, 402 | "charset-normalizer": { 403 | "hashes": [ 404 | "sha256:2857e29ff0d34db842cd7ca3230549d1a697f96ee6d3fb071cfa6c7393832597", 405 | "sha256:6881edbebdb17b39b4eaaa821b438bf6eddffb4468cf344f09f89def34a8b1df" 406 | ], 407 | "markers": "python_version >= '3'", 408 | 
"version": "==2.0.12" 409 | }, 410 | "colorama": { 411 | "hashes": [ 412 | "sha256:7d73d2a99753107a36ac6b455ee49046802e59d9d076ef8e47b61499fa29afff", 413 | "sha256:e96da0d330793e2cb9485e9ddfd918d456036c7149416295932478192f4436a1" 414 | ], 415 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'", 416 | "version": "==0.4.3" 417 | }, 418 | "cryptography": { 419 | "hashes": [ 420 | "sha256:0297ffc478bdd237f5ca3a7dc96fc0d315670bfa099c04dc3a4a2172008a405a", 421 | "sha256:10d1f29d6292fc95acb597bacefd5b9e812099d75a6469004fd38ba5471a977f", 422 | "sha256:16fa61e7481f4b77ef53991075de29fc5bacb582a1244046d2e8b4bb72ef66d0", 423 | "sha256:194044c6b89a2f9f169df475cc167f6157eb9151cc69af8a2a163481d45cc407", 424 | "sha256:1db3d807a14931fa317f96435695d9ec386be7b84b618cc61cfa5d08b0ae33d7", 425 | "sha256:3261725c0ef84e7592597606f6583385fed2a5ec3909f43bc475ade9729a41d6", 426 | "sha256:3b72c360427889b40f36dc214630e688c2fe03e16c162ef0aa41da7ab1455153", 427 | "sha256:3e3a2599e640927089f932295a9a247fc40a5bdf69b0484532f530471a382750", 428 | "sha256:3fc26e22840b77326a764ceb5f02ca2d342305fba08f002a8c1f139540cdfaad", 429 | "sha256:5067ee7f2bce36b11d0e334abcd1ccf8c541fc0bbdaf57cdd511fdee53e879b6", 430 | "sha256:52e7bee800ec869b4031093875279f1ff2ed12c1e2f74923e8f49c916afd1d3b", 431 | "sha256:64760ba5331e3f1794d0bcaabc0d0c39e8c60bf67d09c93dc0e54189dfd7cfe5", 432 | "sha256:765fa194a0f3372d83005ab83ab35d7c5526c4e22951e46059b8ac678b44fa5a", 433 | "sha256:79473cf8a5cbc471979bd9378c9f425384980fcf2ab6534b18ed7d0d9843987d", 434 | "sha256:896dd3a66959d3a5ddcfc140a53391f69ff1e8f25d93f0e2e7830c6de90ceb9d", 435 | "sha256:89ed49784ba88c221756ff4d4755dbc03b3c8d2c5103f6d6b4f83a0fb1e85294", 436 | "sha256:ac7e48f7e7261207d750fa7e55eac2d45f720027d5703cd9007e9b37bbb59ac0", 437 | "sha256:ad7353f6ddf285aeadfaf79e5a6829110106ff8189391704c1d8801aa0bae45a", 438 | "sha256:b0163a849b6f315bf52815e238bc2b2346604413fa7c1601eea84bcddb5fb9ac", 439 | "sha256:b6c9b706316d7b5a137c35e14f4103e2115b088c412140fdbd5f87c73284df61", 440 | "sha256:c2e5856248a416767322c8668ef1845ad46ee62629266f84a8f007a317141013", 441 | "sha256:ca9f6784ea96b55ff41708b92c3f6aeaebde4c560308e5fbbd3173fbc466e94e", 442 | "sha256:d1a5bd52d684e49a36582193e0b89ff267704cd4025abefb9e26803adeb3e5fb", 443 | "sha256:d3971e2749a723e9084dd507584e2a2761f78ad2c638aa31e80bc7a15c9db4f9", 444 | "sha256:d4ef6cc305394ed669d4d9eebf10d3a101059bdcf2669c366ec1d14e4fb227bd", 445 | "sha256:d9e69ae01f99abe6ad646947bba8941e896cb3aa805be2597a0400e0764b5818" 446 | ], 447 | "markers": "python_version >= '3.6'", 448 | "version": "==38.0.1" 449 | }, 450 | "docker": { 451 | "extras": [ 452 | "ssh" 453 | ], 454 | "hashes": [ 455 | "sha256:d3393c878f575d3a9ca3b94471a3c89a6d960b35feb92f033c0de36cc9d934db", 456 | "sha256:f3607d5695be025fa405a12aca2e5df702a57db63790c73b927eb6a94aac60af" 457 | ], 458 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'", 459 | "version": "==4.4.4" 460 | }, 461 | "docker-compose": { 462 | "hashes": [ 463 | "sha256:7a2eb6d8173fdf408e505e6f7d497ac0b777388719542be9e49a0efd477a50c6", 464 | "sha256:9d33520ae976f524968a64226516ec631dce09fba0974ce5366ad403e203eb5d" 465 | ], 466 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'", 467 | "version": "==1.25.5" 468 | }, 469 | "dockerpty": { 470 | "hashes": [ 471 | "sha256:69a9d69d573a0daa31bcd1c0774eeed5c15c295fe719c61aca550ed1393156ce" 472 | ], 473 | "version": "==0.4.1" 474 | }, 475 | "docopt": { 476 | "hashes": [ 477 | 
"sha256:49b3a825280bd66b3aa83585ef59c4a8c82f2c8a522dbe754a8bc8d08c85c491" 478 | ], 479 | "version": "==0.6.2" 480 | }, 481 | "future": { 482 | "hashes": [ 483 | "sha256:e39ced1ab767b5936646cedba8bcce582398233d6a627067d4c6a454c90cfedb" 484 | ], 485 | "version": "==0.16.0" 486 | }, 487 | "idna": { 488 | "hashes": [ 489 | "sha256:814f528e8dead7d329833b91c5faa87d60bf71824cd12a7530b5526063d02cb4", 490 | "sha256:90b77e79eaa3eba6de819a0c442c0b4ceefc341a7a2ab77d7562bf49f425c5c2" 491 | ], 492 | "markers": "python_version >= '3'", 493 | "version": "==3.4" 494 | }, 495 | "jmespath": { 496 | "hashes": [ 497 | "sha256:b85d0567b8666149a93172712e68920734333c0ce7e89b78b3e987f71e5ed4f9", 498 | "sha256:cdf6525904cc597730141d61b36f2e4b8ecc257c420fa2f4549bac2c2d0cb72f" 499 | ], 500 | "markers": "python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'", 501 | "version": "==0.10.0" 502 | }, 503 | "jsonschema": { 504 | "hashes": [ 505 | "sha256:4e5b3cf8216f577bee9ce139cbe72eca3ea4f292ec60928ff24758ce626cd163", 506 | "sha256:c8a85b28d377cc7737e46e2d9f2b4f44ee3c0e1deac6bf46ddefc7187d30797a" 507 | ], 508 | "version": "==3.2.0" 509 | }, 510 | "paramiko": { 511 | "hashes": [ 512 | "sha256:003e6bee7c034c21fbb051bf83dc0a9ee4106204dd3c53054c71452cc4ec3938", 513 | "sha256:655f25dc8baf763277b933dfcea101d636581df8d6b9774d1fb653426b72c270" 514 | ], 515 | "version": "==2.11.0" 516 | }, 517 | "pathspec": { 518 | "hashes": [ 519 | "sha256:7d15c4ddb0b5c802d161efc417ec1a2558ea2653c2e8ad9c19098201dc1c993a", 520 | "sha256:e564499435a2673d586f6b2130bb5b95f04a3ba06f81b8f895b651a3c76aabb1" 521 | ], 522 | "version": "==0.9.0" 523 | }, 524 | "pycparser": { 525 | "hashes": [ 526 | "sha256:8ee45429555515e1f6b185e78100aea234072576aa43ab53aefcae078162fca9", 527 | "sha256:e644fdec12f7872f86c58ff790da456218b10f863970249516d60a5eaca77206" 528 | ], 529 | "version": "==2.21" 530 | }, 531 | "pynacl": { 532 | "hashes": [ 533 | "sha256:06b8f6fa7f5de8d5d2f7573fe8c863c051225a27b61e6860fd047b1775807858", 534 | "sha256:0c84947a22519e013607c9be43706dd42513f9e6ae5d39d3613ca1e142fba44d", 535 | "sha256:20f42270d27e1b6a29f54032090b972d97f0a1b0948cc52392041ef7831fee93", 536 | "sha256:401002a4aaa07c9414132aaed7f6836ff98f59277a234704ff66878c2ee4a0d1", 537 | "sha256:52cb72a79269189d4e0dc537556f4740f7f0a9ec41c1322598799b0bdad4ef92", 538 | "sha256:61f642bf2378713e2c2e1de73444a3778e5f0a38be6fee0fe532fe30060282ff", 539 | "sha256:8ac7448f09ab85811607bdd21ec2464495ac8b7c66d146bf545b0f08fb9220ba", 540 | "sha256:a36d4a9dda1f19ce6e03c9a784a2921a4b726b02e1c736600ca9c22029474394", 541 | "sha256:a422368fc821589c228f4c49438a368831cb5bbc0eab5ebe1d7fac9dded6567b", 542 | "sha256:e46dae94e34b085175f8abb3b0aaa7da40767865ac82c928eeb9e57e1ea8a543" 543 | ], 544 | "markers": "python_version >= '3.6'", 545 | "version": "==1.5.0" 546 | }, 547 | "pyrsistent": { 548 | "hashes": [ 549 | "sha256:0e3e1fcc45199df76053026a51cc59ab2ea3fc7c094c6627e93b7b44cdae2c8c", 550 | "sha256:1b34eedd6812bf4d33814fca1b66005805d3640ce53140ab8bbb1e2651b0d9bc", 551 | "sha256:4ed6784ceac462a7d6fcb7e9b663e93b9a6fb373b7f43594f9ff68875788e01e", 552 | "sha256:5d45866ececf4a5fff8742c25722da6d4c9e180daa7b405dc0a2a2790d668c26", 553 | "sha256:636ce2dc235046ccd3d8c56a7ad54e99d5c1cd0ef07d9ae847306c91d11b5fec", 554 | "sha256:6455fc599df93d1f60e1c5c4fe471499f08d190d57eca040c0ea182301321286", 555 | "sha256:6bc66318fb7ee012071b2792024564973ecc80e9522842eb4e17743604b5e045", 556 | "sha256:7bfe2388663fd18bd8ce7db2c91c7400bf3e1a9e8bd7d63bf7e77d39051b85ec", 557 | 
"sha256:7ec335fc998faa4febe75cc5268a9eac0478b3f681602c1f27befaf2a1abe1d8", 558 | "sha256:914474c9f1d93080338ace89cb2acee74f4f666fb0424896fcfb8d86058bf17c", 559 | "sha256:b568f35ad53a7b07ed9b1b2bae09eb15cdd671a5ba5d2c66caee40dbf91c68ca", 560 | "sha256:cdfd2c361b8a8e5d9499b9082b501c452ade8bbf42aef97ea04854f4a3f43b22", 561 | "sha256:d1b96547410f76078eaf66d282ddca2e4baae8964364abb4f4dcdde855cd123a", 562 | "sha256:d4d61f8b993a7255ba714df3aca52700f8125289f84f704cf80916517c46eb96", 563 | "sha256:d7a096646eab884bf8bed965bad63ea327e0d0c38989fc83c5ea7b8a87037bfc", 564 | "sha256:df46c854f490f81210870e509818b729db4488e1f30f2a1ce1698b2295a878d1", 565 | "sha256:e24a828f57e0c337c8d8bb9f6b12f09dfdf0273da25fda9e314f0b684b415a07", 566 | "sha256:e4f3149fd5eb9b285d6bfb54d2e5173f6a116fe19172686797c056672689daf6", 567 | "sha256:e92a52c166426efbe0d1ec1332ee9119b6d32fc1f0bbfd55d5c1088070e7fc1b", 568 | "sha256:f87cc2863ef33c709e237d4b5f4502a62a00fab450c9e020892e8e2ede5847f5", 569 | "sha256:fd8da6d0124efa2f67d86fa70c851022f87c98e205f0594e1fae044e7119a5a6" 570 | ], 571 | "markers": "python_version >= '3.7'", 572 | "version": "==0.18.1" 573 | }, 574 | "python-dateutil": { 575 | "hashes": [ 576 | "sha256:0123cacc1627ae19ddf3c27a5de5bd67ee4586fbdd6440d9748f8abb483d3e86", 577 | "sha256:961d03dc3453ebbc59dbdea9e4e11c5651520a876d0f4db161e8674aae935da9" 578 | ], 579 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'", 580 | "version": "==2.8.2" 581 | }, 582 | "pyyaml": { 583 | "hashes": [ 584 | "sha256:08682f6b72c722394747bddaf0aa62277e02557c0fd1c42cb853016a38f8dedf", 585 | "sha256:0f5f5786c0e09baddcd8b4b45f20a7b5d61a7e7e99846e3c799b05c7c53fa696", 586 | "sha256:129def1b7c1bf22faffd67b8f3724645203b79d8f4cc81f674654d9902cb4393", 587 | "sha256:294db365efa064d00b8d1ef65d8ea2c3426ac366c0c4368d930bf1c5fb497f77", 588 | "sha256:3b2b1824fe7112845700f815ff6a489360226a5609b96ec2190a45e62a9fc922", 589 | "sha256:3bd0e463264cf257d1ffd2e40223b197271046d09dadf73a0fe82b9c1fc385a5", 590 | "sha256:4465124ef1b18d9ace298060f4eccc64b0850899ac4ac53294547536533800c8", 591 | "sha256:49d4cdd9065b9b6e206d0595fee27a96b5dd22618e7520c33204a4a3239d5b10", 592 | "sha256:4e0583d24c881e14342eaf4ec5fbc97f934b999a6828693a99157fde912540cc", 593 | "sha256:5accb17103e43963b80e6f837831f38d314a0495500067cb25afab2e8d7a4018", 594 | "sha256:607774cbba28732bfa802b54baa7484215f530991055bb562efbed5b2f20a45e", 595 | "sha256:6c78645d400265a062508ae399b60b8c167bf003db364ecb26dcab2bda048253", 596 | "sha256:72a01f726a9c7851ca9bfad6fd09ca4e090a023c00945ea05ba1638c09dc3347", 597 | "sha256:74c1485f7707cf707a7aef42ef6322b8f97921bd89be2ab6317fd782c2d53183", 598 | "sha256:895f61ef02e8fed38159bb70f7e100e00f471eae2bc838cd0f4ebb21e28f8541", 599 | "sha256:8c1be557ee92a20f184922c7b6424e8ab6691788e6d86137c5d93c1a6ec1b8fb", 600 | "sha256:bb4191dfc9306777bc594117aee052446b3fa88737cd13b7188d0e7aa8162185", 601 | "sha256:bfb51918d4ff3d77c1c856a9699f8492c612cde32fd3bcd344af9be34999bfdc", 602 | "sha256:c20cfa2d49991c8b4147af39859b167664f2ad4561704ee74c1de03318e898db", 603 | "sha256:cb333c16912324fd5f769fff6bc5de372e9e7a202247b48870bc251ed40239aa", 604 | "sha256:d2d9808ea7b4af864f35ea216be506ecec180628aced0704e34aca0b040ffe46", 605 | "sha256:d483ad4e639292c90170eb6f7783ad19490e7a8defb3e46f97dfe4bacae89122", 606 | "sha256:dd5de0646207f053eb0d6c74ae45ba98c3395a571a2891858e87df7c9b9bd51b", 607 | "sha256:e1d4970ea66be07ae37a3c2e48b5ec63f7ba6804bdddfdbd3cfd954d25a82e63", 608 | "sha256:e4fac90784481d221a8e4b1162afa7c47ed953be40d31ab4629ae917510051df", 609 | 
"sha256:fa5ae20527d8e831e8230cbffd9f8fe952815b2b7dae6ffec25318803a7528fc", 610 | "sha256:fd7f6999a8070df521b6384004ef42833b9bd62cfee11a09bda1079b4b704247", 611 | "sha256:fdc842473cd33f45ff6bce46aea678a54e3d21f1b61a7750ce3c498eedfe25d6", 612 | "sha256:fe69978f3f768926cfa37b867e3843918e012cf83f680806599ddce33c2c68b0" 613 | ], 614 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4, 3.5'", 615 | "version": "==5.4.1" 616 | }, 617 | "requests": { 618 | "hashes": [ 619 | "sha256:6c1246513ecd5ecd4528a0906f910e8f0f9c6b8ec72030dc9fd154dc1a6efd24", 620 | "sha256:b8aa58f8cf793ffd8782d3d8cb19e66ef36f7aba4353eec859e74678b01b07a7" 621 | ], 622 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4, 3.5'", 623 | "version": "==2.26.0" 624 | }, 625 | "semantic-version": { 626 | "hashes": [ 627 | "sha256:45e4b32ee9d6d70ba5f440ec8cc5221074c7f4b0e8918bdab748cc37912440a9", 628 | "sha256:d2cb2de0558762934679b9a104e82eca7af448c9f4974d1f3eeccff651df8a54" 629 | ], 630 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'", 631 | "version": "==2.8.5" 632 | }, 633 | "setuptools": { 634 | "hashes": [ 635 | "sha256:1b6bdc6161661409c5f21508763dc63ab20a9ac2f8ba20029aaaa7fdb9118012", 636 | "sha256:3050e338e5871e70c72983072fe34f6032ae1cdeeeb67338199c2f74e083a80e" 637 | ], 638 | "markers": "python_version >= '3.7'", 639 | "version": "==65.4.1" 640 | }, 641 | "six": { 642 | "hashes": [ 643 | "sha256:236bdbdce46e6e6a3d61a337c0f8b763ca1e8717c03b369e87a7ec7ce1319c0a", 644 | "sha256:8f3cd2e254d8f793e7f3d6d9df77b92252b52637291d0f0da013c76ea2724b6c" 645 | ], 646 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'", 647 | "version": "==1.14.0" 648 | }, 649 | "termcolor": { 650 | "hashes": [ 651 | "sha256:1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b" 652 | ], 653 | "version": "==1.1.0" 654 | }, 655 | "texttable": { 656 | "hashes": [ 657 | "sha256:42ee7b9e15f7b225747c3fa08f43c5d6c83bc899f80ff9bae9319334824076e9", 658 | "sha256:dd2b0eaebb2a9e167d1cefedab4700e5dcbdb076114eed30b58b97ed6b37d6f2" 659 | ], 660 | "version": "==1.6.4" 661 | }, 662 | "urllib3": { 663 | "hashes": [ 664 | "sha256:3fa96cf423e6987997fc326ae8df396db2a8b7c667747d47ddd8ecba91f4a74e", 665 | "sha256:b930dd878d5a8afb066a637fbb35144fe7901e3b209d1cd4f524bd0e9deee997" 666 | ], 667 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4, 3.5' and python_version < '4'", 668 | "version": "==1.26.12" 669 | }, 670 | "wcwidth": { 671 | "hashes": [ 672 | "sha256:cafe2186b3c009a04067022ce1dcd79cb38d8d65ee4f4791b8888d6599d1bbe1", 673 | "sha256:ee73862862a156bf77ff92b09034fc4825dd3af9cf81bc5b360668d425f3c5f1" 674 | ], 675 | "version": "==0.1.9" 676 | }, 677 | "websocket-client": { 678 | "hashes": [ 679 | "sha256:2e50d26ca593f70aba7b13a489435ef88b8fc3b5c5643c1ce8808ff9b40f0b32", 680 | "sha256:d376bd60eace9d437ab6d7ee16f4ab4e821c9dae591e1b783c58ebd8aaf80c5c" 681 | ], 682 | "markers": "python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'", 683 | "version": "==0.59.0" 684 | } 685 | } 686 | } 687 | -------------------------------------------------------------------------------- /05-deployment/app/customer_1.json: -------------------------------------------------------------------------------- 1 | { 2 | "reports": 0, 3 | "share": 0.001694, 4 | "expenditure": 0.12, 5 | "owner": "yes" 6 | } -------------------------------------------------------------------------------- 
/05-deployment/app/customer_2.json: -------------------------------------------------------------------------------- 1 | { 2 | "reports": 0, 3 | "share": 0.245, 4 | "expenditure": 3.438, 5 | "owner": "yes" 6 | } -------------------------------------------------------------------------------- /05-deployment/app/dv.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/05-deployment/app/dv.bin -------------------------------------------------------------------------------- /05-deployment/app/eb_cloud_service.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/05-deployment/app/eb_cloud_service.png -------------------------------------------------------------------------------- /05-deployment/app/model1.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/05-deployment/app/model1.bin -------------------------------------------------------------------------------- /05-deployment/app/predict-test.py: -------------------------------------------------------------------------------- 1 | from urllib import response 2 | import requests 3 | 4 | host = "card-serving-env.eba-d4hrhyjh.eu-west-1.elasticbeanstalk.com" 5 | remote_url = f"http://{host}/predict" 6 | url = "http://localhost:9696/predict" 7 | 8 | customer = { 9 | "reports": 0, 10 | "share": 0.245, 11 | "expenditure": 3.438, 12 | "owner": "yes" 13 | } 14 | 15 | response = requests.post(remote_url, json = customer).json() 16 | print(response) -------------------------------------------------------------------------------- /05-deployment/app/predict.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | 3 | from flask import Flask 4 | from flask import request 5 | from flask import jsonify 6 | 7 | 8 | model_file = 'model2.bin' 9 | vectorizer_file = 'dv.bin' 10 | app = Flask('card_prediction') 11 | 12 | with open(model_file, 'rb') as model: 13 | model = pickle.load(model) 14 | 15 | with open(vectorizer_file, 'rb') as vectorizer: 16 | dv = pickle.load(vectorizer) 17 | 18 | 19 | @app.route('/predict', methods=['POST']) 20 | def predict(): 21 | customer = request.get_json() 22 | 23 | X = dv.transform([customer]) 24 | y_pred = model.predict_proba(X)[0, 1] 25 | card = y_pred >= 0.5 26 | 27 | result = { 28 | 'probability': float(y_pred), 29 | 'card': bool(card) 30 | } 31 | 32 | return jsonify(result) 33 | 34 | 35 | if __name__ == "__main__": 36 | app.run(debug=True, host='0.0.0.0', port=9696) -------------------------------------------------------------------------------- /06-trees/README.md: -------------------------------------------------------------------------------- 1 | # Decision Trees and Ensemble Learning 2 | 3 | * Credit risk scoring project 4 | * Data cleaning and preparation 5 | * Decision trees 6 | * Decision tree learning algorithm 7 | * Decision trees parameter tuning 8 | * Ensemble learning and random forest 9 | * Gradient boosting and XGBoost 10 | * XGBoost parameter tuning 11 | * Selecting the best model 12 | 13 | --------- 14 | 15 | ## Session #6 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/06-trees/homework.md) 16 | 17 | The goal is to 
create a regression model for predicting housing prices (column 'median_house_value'). 18 | 19 | ### Dataset 20 | 21 | In this homework we'll again use the California Housing Prices dataset - [Link](https://github.com/Ksyula/ML_Engineering/blob/master/02-regression/housing.csv) 22 | 23 | You can take it also from Kaggle - [Link](https://www.kaggle.com/datasets/camnugent/california-housing-prices) 24 | 25 | ## Loading the data 26 | 27 | Use only the following columns: 28 | * `'latitude'`, 29 | * `'longitude'`, 30 | * `'housing_median_age'`, 31 | * `'total_rooms'`, 32 | * `'total_bedrooms'`, 33 | * `'population'`, 34 | * `'households'`, 35 | * `'median_income'`, 36 | * `'median_house_value'`, 37 | * `'ocean_proximity'` 38 | 39 | * Fill NAs with 0. 40 | * Apply the log transform to `median_house_value`. 41 | * Do train/validation/test split with 60%/20%/20% distribution. 42 | * Use the `train_test_split` function and set the `random_state` parameter to 1. 43 | * Use `DictVectorizer` to turn the dataframe into matrices. 44 | 45 | 46 | ## Question 1 47 | 48 | Let's train a decision tree regressor to predict the `median_house_value` variable. 49 | 50 | * Train a model with `max_depth=1`. 51 | 52 | 53 | Which feature is used for splitting the data? 54 | 55 | * `ocean_proximity=INLAND` 56 | * `total_rooms` 57 | * `latitude` 58 | * `population` 59 | 60 | 61 | ## Question 2 62 | 63 | Train a random forest model with these parameters: 64 | 65 | * `n_estimators=10` 66 | * `random_state=1` 67 | * `n_jobs=-1` (optional - to make training faster) 68 | 69 | 70 | What's the RMSE of this model on validation? 71 | 72 | * 0.05 73 | * 0.25 74 | * 0.55 75 | * 0.85 76 | 77 | 78 | ## Question 3 79 | 80 | Now let's experiment with the `n_estimators` parameter. 81 | 82 | * Try different values of this parameter from 10 to 200 with step 10. 83 | * Set `random_state` to `1`. 84 | * Evaluate the model on the validation dataset. 85 | 86 | 87 | After which value of `n_estimators` does RMSE stop improving? 88 | 89 | - 10 90 | - 50 91 | - 70 92 | - 150 93 | 94 | 95 | ## Question 4 96 | 97 | Let's select the best `max_depth`: 98 | 99 | * Try different values of `max_depth`: `[10, 15, 20, 25]` 100 | * For each of these values, try different values of `n_estimators` from 10 to 200 (with step 10) 101 | * Fix the random seed: `random_state=1` 102 | 103 | 104 | What's the best `max_depth`? 105 | 106 | * 10 107 | * 15 108 | * 20 109 | * 25 110 | 111 | 112 | ## Question 5 113 | 114 | We can extract feature importance information from tree-based models. 115 | 116 | At each step of the decision tree learning algorithm, it finds the best split. 117 | When doing this, we can calculate the "gain" - the reduction in impurity before and after the split. 118 | This gain is quite useful in understanding which features are important 119 | for tree-based models. 120 | 121 | In Scikit-Learn, tree-based models contain this information in the 122 | [`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_) 123 | field. 124 | 125 | For this homework question, we'll find the most important feature: 126 | 127 | * Train the model with these parameters: 128 | * `n_estimators=10`, 129 | * `max_depth=20`, 130 | * `random_state=1`, 131 | * `n_jobs=-1` (optional) 132 | * Get the feature importance information from this model (see the sketch below) 133 | 134 | 135 | What's the most important feature? 136 | 137 | * `total_rooms` 138 | * `median_income` 139 | * `total_bedrooms` 140 | * `longitude`
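A minimal sketch of the training-and-inspection step for Question 5 (assuming `dv` is the fitted `DictVectorizer` and `X_train`, `y_train` come from the data preparation described above):

```python
from sklearn.ensemble import RandomForestRegressor

# train the forest with the parameters from Question 5
rf = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# pair every feature name produced by the DictVectorizer with its importance
importances = sorted(
    zip(dv.get_feature_names_out(), rf.feature_importances_),
    key=lambda item: item[1],
    reverse=True,
)
print(importances[:5])  # the most important feature comes first
```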
141 | 142 | 143 | ## Question 6 144 | 145 | Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter: 146 | 147 | * Install XGBoost 148 | * Create DMatrix for train and validation 149 | * Create a watchlist 150 | * Train a model with these parameters for 100 rounds: 151 | 152 | ``` 153 | xgb_params = { 154 | 'eta': 0.3, 155 | 'max_depth': 6, 156 | 'min_child_weight': 1, 157 | 158 | 'objective': 'reg:squarederror', 159 | 'nthread': 8, 160 | 161 | 'seed': 1, 162 | 'verbosity': 1, 163 | } 164 | ``` 165 | 166 | Now change `eta` from `0.3` to `0.1`. 167 | 168 | Which `eta` leads to the best RMSE score on the validation dataset? 169 | 170 | * 0.3 171 | * 0.1 172 | * Both give the same -------------------------------------------------------------------------------- /07-bento-production/README.md: -------------------------------------------------------------------------------- 1 | # Production-Ready Machine Learning (Bento ML) 2 | 3 | * Intro/Session Overview 4 | * Building Your Prediction Service with BentoML 5 | * Deploying Your Prediction Service 6 | * Sending, Receiving and Validating Data 7 | * High-Performance Serving 8 | * Bento Production Deployment 9 | * Advanced Example: Deploying Stable Diffusion Model 10 | * Summary 11 | 12 | --------- 13 | 14 | ## Session #7 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/07-bento-production/homework.md) 15 | 16 | The goal is to familiarize you with BentoML and how to build and test an ML production service. 17 | 18 | ## Background 19 | 20 | You are a new recruit at ACME corp. Your manager is emailing you about your first assignment. 21 | 22 | ## Email from your manager 23 | 24 | Good morning recruit! It's good to have you here! I have an assignment for you. I have a data scientist that's built a credit risk model in a jupyter notebook. I need you to run the notebook and save the model with BentoML and see how big the model is. If it's greater than a certain size, I'm going to have to request additional resources from our infra team. Please let me know how big it is. 25 | 26 | Thanks, 27 | 28 | Mr McManager 29 | 30 | 31 | ## Question 1 32 | 33 | * Install BentoML 34 | * What's the version of BentoML you installed? 35 | * Use `--version` to find out 36 | 37 | Answer: **1.0.7** 38 | 39 | ## Question 2 40 | 41 | Run the notebook which contains the xgboost model from module 6 (i.e. the previous module) and save the xgboost model with BentoML. To make it easier for you, we have prepared this [notebook](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/07-bentoml-production/code/train.ipynb). 42 | 43 | 44 | Approximately how big is the saved BentoML model? The size can vary slightly depending on your local development environment. 45 | Choose the size closest to your model. 46 | 47 | * 924kb 48 | * 724kb 49 | * **114kb** 50 | * 8kb 51 | 52 | ## Another email from your manager 53 | 54 | Great job recruit! Looks like I won't have to go back to the procurement team. Thanks for the information. 55 | 56 | However, I just got word from one of the teams that's using one of our ML services and they're saying our service is "broken" and they're trying to blame our model. I looked at the data they're sending and it's completely bogus. I don't want them to send bad data to us and blame us for our models.
Could you write a pydantic schema for the data that they should be sending? 57 | That way next time it will tell them it's their data that's bad and not our model. 58 | 59 | Thanks, 60 | 61 | Mr McManager 62 | 63 | ## Question 3 64 | 65 | Say you have the following data that you're sending to your service: 66 | 67 | ```json 68 | { 69 | "name": "Tim", 70 | "age": 37, 71 | "country": "US", 72 | "rating": 3.14 73 | } 74 | ``` 75 | 76 | What would the pydantic class look like? You can name the class `UserProfile`. 77 | 78 | Answer: 79 | ```python 80 | class UserProfile(BaseModel): 81 |     name: str 82 |     age: int 83 |     country: str 84 |     rating: float 85 | ``` 95 | 96 | ## Email from your CEO 97 | 98 | Good morning! I hear you're the one to go to if I need something done well! We've got a new model that a big client needs deployed ASAP. I need you to build a service with it and test it against the old model and make sure that it performs better, otherwise we're going to lose this client. All our hopes are with you! 99 | 100 | Thanks, 101 | 102 | CEO of Acme Corp 103 | 104 | ## Question 4 105 | 106 | We've prepared a model for you that you can import using: 107 | 108 | ```bash 109 | curl -O https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel.bentomodel 110 | bentoml models import coolmodel.bentomodel 111 | ``` 112 | 113 | What version of scikit-learn was this model trained with? 114 | 115 | * **1.1.1** 116 | * 1.1.2 117 | * 1.1.3 118 | * 1.1.4 119 | * 1.1.5 120 | 121 | ## Question 5 122 | 123 | Create a bento out of this scikit-learn model. The output type for this endpoint should be `NumpyNdarray()`. 124 | 125 | Send this array to the Bento: 126 | 127 | ``` 128 | [[6.4,3.5,4.5,1.2]] 129 | ``` 130 | 131 | You can use curl or the Swagger UI. What value does it return? 132 | 133 | * 0 134 | * **1** 135 | * 2 136 | * 3 137 | 138 | (Make sure your environment has Scikit-Learn installed) 139 | 140 | 141 | ## Question 6 142 | 143 | Make sure to serve your bento with `--production` for this question. 144 | 145 | Install locust using: 146 | 147 | ```bash 148 | pip install locust 149 | ``` 150 | 151 | Use the following locust file: [locustfile.py](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/07-bento-production/locustfile.py) 152 | 153 | Make sure it points at your bento's endpoint (in case you didn't name your endpoint "classify", update the locustfile accordingly). 154 | 155 | 156 | 157 | Configure 100 users with a ramp-up rate of 10 users per second. Click "Start Swarming" and ensure that it is working. 158 | 159 | Now download a second model with this command: 160 | 161 | ```bash 162 | curl -O https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel2.bentomodel 163 | ``` 164 | 165 | Or you can download with this link as well: 166 | [https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel2.bentomodel](https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel2.bentomodel) 167 | 168 | Now import the model: 169 | 170 | ```bash 171 | bentoml models import coolmodel2.bentomodel 172 | ``` 173 | 174 | Update your bento's runner tag and test with both models. Which model allows more traffic (more throughput) as you ramp up the traffic? 175 | 176 | **Hint 1**: Remember to stop and restart your bento service when changing the model tag. Use Ctrl-C to close the service in between trials.
177 | 178 | **Hint 2**: Increase the number of concurrent users to see which one has higher throughput 179 | 180 | Which model has better performance at higher volumes? 181 | 182 | * **The first model** 183 | * The second model 184 | 185 | 186 | ## Email from marketing 187 | 188 | Hello ML person! I hope this email finds you well. I've heard there's this cool new ML model called Stable Diffusion. I hear if you give it a description of a picture it will generate an image. We need a new company logo and I want it to be fierce but also cool, think you could help out? 189 | 190 | Thanks, 191 | 192 | Mike Marketer 193 | 194 | 195 | ## Question 7 (optional) 196 | 197 | Go to this Bento deployment of Stable Diffusion: http://54.176.205.174/ (or deploy it yourself) 198 | 199 | Use the txt2image endpoint and update the prompt to: "A cartoon dragon with sunglasses". 200 | Don't change the seed, it should be 0 by default 201 | 202 | What is the resulting image? 203 | 204 | -------------------------------------------------------------------------------- /07-bento-production/coolmodel/service.py: -------------------------------------------------------------------------------- 1 | import bentoml 2 | from bentoml.io import JSON, NumpyNdarray 3 | 4 | model_ref = bentoml.sklearn.get("mlzoomcamp_homework:qtzdz3slg6mwwdu5") 5 | model_runner = model_ref.to_runner() 6 | 7 | svc = bentoml.Service("mlzoomcamp_homework", runners = [model_runner]) 8 | 9 | @svc.api(input=NumpyNdarray(shape=(-1, 4), enforce_shape=True), output=NumpyNdarray()) 10 | def classify(vector): 11 | 12 | predictions = model_runner.predict.run(vector) 13 | print(predictions) 14 | return predictions 15 | -------------------------------------------------------------------------------- /07-bento-production/credit_risk_service/service.py: -------------------------------------------------------------------------------- 1 | import bentoml 2 | from bentoml.io import JSON 3 | from pydantic import BaseModel 4 | 5 | class UserProfile(BaseModel): 6 | seniority: int 7 | home: str 8 | time: int 9 | age: int 10 | marital: str 11 | records: str 12 | job: str 13 | expenses: int 14 | income: float 15 | assets: float 16 | debt: float 17 | amount: int 18 | price: int 19 | 20 | model_ref = bentoml.xgboost.get("credit_risk_model:dtlts7cv4s2nbhht") 21 | dv = model_ref.custom_objects['dictVectorizer'] 22 | model_runner = model_ref.to_runner() 23 | 24 | svc = bentoml.Service("credit_risk_classifier", runners = [model_runner]) 25 | 26 | @svc.api(input=JSON(pydantic_model=UserProfile), output=JSON()) 27 | def classify(user_profile): 28 | aplication_data = user_profile.dict() 29 | vector = dv.transform(aplication_data) 30 | predictions = model_runner.predict.run(vector) 31 | print(predictions) 32 | return { "pred" : predictions[0] } 33 | -------------------------------------------------------------------------------- /07-bento-production/dragon.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/07-bento-production/dragon.jpeg -------------------------------------------------------------------------------- /07-bento-production/locustfile.py: -------------------------------------------------------------------------------- 1 | from locust import task 2 | from locust import between 3 | from locust import HttpUser 4 | 5 | sample = [[6.4,3.5,4.5,1.2]] 6 | 7 | class MLZoomUser(HttpUser): 8 | """ 9 | Usage: 10 | Start locust load testing client 
with: 11 | 12 | locust -H http://localhost:3000 13 | 14 | Open browser at http://0.0.0.0:8089, adjust desired number of users and spawn 15 | rate for the load test from the Web UI and start swarming. 16 | """ 17 | 18 | @task 19 | def classify(self): 20 | self.client.post("/classify", json=sample) 21 | 22 | wait_time = between(0.01, 2) -------------------------------------------------------------------------------- /07-bento-production/models/credit_risk_model/dtlts7cv4s2nbhht/custom_objects.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/07-bento-production/models/credit_risk_model/dtlts7cv4s2nbhht/custom_objects.pkl -------------------------------------------------------------------------------- /07-bento-production/models/credit_risk_model/dtlts7cv4s2nbhht/model.yaml: -------------------------------------------------------------------------------- 1 | name: credit_risk_model 2 | version: dtlts7cv4s2nbhht 3 | module: bentoml.xgboost 4 | labels: {} 5 | options: 6 | model_class: Booster 7 | metadata: {} 8 | context: 9 | framework_name: xgboost 10 | framework_versions: 11 | xgboost: 1.6.2 12 | bentoml_version: 1.0.7 13 | python_version: 3.9.13 14 | signatures: 15 | predict: 16 | batchable: false 17 | api_version: v2 18 | creation_time: '2022-10-27T10:42:54.314457+00:00' 19 | -------------------------------------------------------------------------------- /07-bento-production/models/credit_risk_model/dtlts7cv4s2nbhht/saved_model.ubj: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/07-bento-production/models/credit_risk_model/dtlts7cv4s2nbhht/saved_model.ubj -------------------------------------------------------------------------------- /07-bento-production/models/credit_risk_model/latest: -------------------------------------------------------------------------------- 1 | dtlts7cv4s2nbhht -------------------------------------------------------------------------------- /07-bento-production/models/mlzoomcamp_homework/jsi67fslz6txydu5/model.yaml: -------------------------------------------------------------------------------- 1 | name: mlzoomcamp_homework 2 | version: jsi67fslz6txydu5 3 | module: bentoml.sklearn 4 | labels: {} 5 | options: {} 6 | metadata: {} 7 | context: 8 | framework_name: sklearn 9 | framework_versions: 10 | scikit-learn: 1.1.1 11 | bentoml_version: 1.0.7 12 | python_version: 3.9.12 13 | signatures: 14 | predict: 15 | batchable: true 16 | batch_dim: 17 | - 0 18 | - 0 19 | api_version: v1 20 | creation_time: '2022-10-14T14:48:43.330446+00:00' 21 | -------------------------------------------------------------------------------- /07-bento-production/models/mlzoomcamp_homework/jsi67fslz6txydu5/saved_model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/07-bento-production/models/mlzoomcamp_homework/jsi67fslz6txydu5/saved_model.pkl -------------------------------------------------------------------------------- /07-bento-production/models/mlzoomcamp_homework/latest: -------------------------------------------------------------------------------- 1 | jsi67fslz6txydu5 -------------------------------------------------------------------------------- 
/07-bento-production/models/mlzoomcamp_homework/qtzdz3slg6mwwdu5/model.yaml: -------------------------------------------------------------------------------- 1 | name: mlzoomcamp_homework 2 | version: qtzdz3slg6mwwdu5 3 | module: bentoml.sklearn 4 | labels: {} 5 | options: {} 6 | metadata: {} 7 | context: 8 | framework_name: sklearn 9 | framework_versions: 10 | scikit-learn: 1.1.1 11 | bentoml_version: 1.0.7 12 | python_version: 3.9.12 13 | signatures: 14 | predict: 15 | batchable: false 16 | api_version: v1 17 | creation_time: '2022-10-13T20:42:14.411084+00:00' 18 | -------------------------------------------------------------------------------- /07-bento-production/models/mlzoomcamp_homework/qtzdz3slg6mwwdu5/saved_model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/07-bento-production/models/mlzoomcamp_homework/qtzdz3slg6mwwdu5/saved_model.pkl -------------------------------------------------------------------------------- /07-bento-production/requirements.txt: -------------------------------------------------------------------------------- 1 | pydantic==1.10.2 2 | bentoml==1.0.7 3 | scikit-learn==1.1.1 4 | gevent==20.12.1 5 | locust==2.12.2 -------------------------------------------------------------------------------- /07-bento-production/setting_up_bentoML.sh: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | 3 | # BentoML: build, package, deploy a model for an at scale, production ready service 4 | # Bento packages everything related to ML code: 5 | 6 | # static dataset 7 | # model(s) 8 | # code 9 | # configuration files 10 | # dependencies 11 | # deployment logic 12 | 13 | # Workflow: 14 | 15 | # get jupyter notebook from week 7 of mlbookcamp-code repo 16 | # first create an anconda env and install xgboost and bentoML via pip 17 | # save model via bento ML and serve it locally and test with Swagger UI 18 | # containerize the bentoml (underyling using docker) 19 | # serve it locally with docker and test from Swagger UI or the terminal 20 | 21 | # ignore the next block: 22 | #### Preventing autoamatic execution of this script if someone runs it: 23 | echo "This is not meant to be an executable shell script. Please open this file in a text editor or VSCode and run the commands one by one" 24 | echo "Press any key to exit" 25 | read -n 1 "key" 26 | exit 27 | 28 | 29 | # Start from here 30 | #### git setup 31 | 32 | # fork the mlbookcamp-code repo to your own github account, and copy the HTTPS URL and clone it to local: 33 | git clone "URL of your forked mlbookcamp" 34 | # in case you already cloned it before and need to pull new changes: 35 | cd mlbookcamp-code 36 | git switch master 37 | # add the original bookcamp code repo to remote. I call it upstream. 38 | git remote add upstream https://github.com/alexeygrigorev/mlbookcamp-code.git 39 | # if you've already added, it will list all remotes: 40 | git remote -v 41 | git pull upstream master 42 | # add changes back to your own fork of mlbookcamp-code 43 | git push 44 | 45 | # add files from the newly cloned directory: 46 | mkdir week7 47 | cd week7/ 48 | cp -r /mnt/d/5-Courses/mlbookcamp-code/course-zoomcamp/07-bentoml-production/code/* . 
49 |  50 |  51 | #### conda environment setup 52 |  53 | # if base is already activated, run the following to create an environment called bento: 54 | conda create -n bento jupyter numpy pandas scikit-learn 55 |  56 | # It's better to install bentoml via pip because the conda repo is not actively or officially maintained, however in this case, we need 57 | # to install xgboost with pip too, and an additional package pydantic 58 | conda activate bento 59 | pip install bentoml pydantic xgboost 60 |  61 |  62 | #### Training and saving the model with bentoML: 63 |  64 | # start jupyter notebook 65 | jupyter notebook 66 | # run train.ipynb from week 6 using the XGBoost model trained on data without feature names 67 | # details in video 7.2 and 7.5 68 | # sample train.ipynb file can be found here: 69 | # https://github.com/MemoonaTahira/MLZoomcamp2022/blob/main/Notes/Week_7-production_ready_deployment_bentoML/train.ipynb 70 | # save the model using bentoml in jupyter notebook 71 | # copy the string inside the quotes in the tag parameter 72 | # install vim editor to write service.py: 73 |  74 |  75 | #### Starting the bentoML service from terminal: 76 | sudo apt update 77 | sudo apt search vim 78 | sudo apt install vim 79 | vim --version 80 |  81 | # or just use vscode with the bento conda environment, it will be less painful 82 | # vim is useful when you can only work within the terminal with no GUI 83 |  84 | # create a new file called service.py and write the code from videos 7.2 + 7.4 + 7.5 in it 85 | # Video 7.4 - for more info on pydantic: https://pydantic-docs.helpmanual.io/ for defining our validation schema 86 | # Video 7.4 - for input/output type of the model: https://docs.bentoml.org/en/latest/reference/api_io_descriptors.html 87 | # sample service.py can be found here: 88 | # https://github.com/MemoonaTahira/MLZoomcamp2022/blob/main/Notes/Week_7-production_ready_deployment_bentoML/service.py 89 | # save service.py and run from terminal: 90 |  91 | bentoml serve service.py:svc --reload 92 |  93 | # this opens Swagger UI, which is based on OpenAPI; change the URL to 127.0.0.1:3000 instead of 94 | # the default 0.0.0.0:3000 shown in the terminal 95 | # opens correctly if the link is CTRL+clicked from vscode, which translates 0.0.0.0:3000 to localhost 96 |  97 | # click on the 'POST /classify' button, then click on 'try it' and paste the test user information json from train.ipynb 98 | # It should show you a 200 Successful Response with the final decision of the model: 99 | # Either the loan is approved, maybe or denied 100 |  101 |  102 | #### Try different information commands in bentoml: 103 |  104 | # Get list of models stored in the bentoml models directory: 105 | bentoml models list 106 | # Look at the metadata of a particular model: 107 | # this was my model name and tag: 108 | bentoml models get credit_risk_model:loqp7isqtki4aaav 109 | # change the model name and tag to yours like this: 110 | # bentoml models get <model_name>:<tag> 111 | # it returns something like this: 112 |  113 | ` 114 | name: credit_risk_model 115 | version: loqp7isqtki4aaav 116 | module: bentoml.xgboost 117 | labels: {} 118 | options: 119 |   model_class: Booster 120 | metadata: {} 121 | context: 122 |   framework_name: xgboost 123 |   framework_versions: 124 |     xgboost: 1.6.2 125 |   bentoml_version: 1.0.7 126 |   python_version: 3.10.6 127 | signatures: 128 |   predict: 129 |     batchable: false 130 | api_version: v2 131 | creation_time: '2022-10-20T17:12:21.082672+00:00' 132 |  133 | ` 134 | # to delete a particular model: 135 | bentoml models delete credit_risk_model:<tag> 136 |  137 | 
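# Before the build step in the next section, service.py needs a bentofile.yaml sitting next to it.
# The block below is only a minimal sketch based on this repo's service.py and requirements.txt --
# the labels and the exact package list are assumptions; use the sample bentofile.yaml linked in
# the next section as the reference version.
cat > bentofile.yaml <<'EOF'
service: "service.py:svc"
labels:
  owner: mlzoomcamp-student
  project: credit-risk
include:
  - "*.py"
python:
  packages:
    - xgboost
    - scikit-learn
    - pydantic
EOF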
#### Bento build for a single unit/package of the model: 138 | 139 | # create a newfile and name it bentofile.yaml from vscode, initially populated with standard content 140 | # sample bentofile.yaml can be found here: 141 | # https://github.com/MemoonaTahira/MLZoomcamp2022/blob/main/Notes/Week_7-production_ready_deployment_bentoML/bentofile.yaml 142 | # here we add info on environment, packages and docker etc. 143 | # Use video 7.3 for adding content 144 | # go here for more: https://docs.bentoml.org/en/0.13-lts/concepts.html 145 | 146 | # edit and save, and then build bento: 147 | # P.S. run this command from the folder where bentofile.yaml is 148 | bentoml build 149 | # successful build will display the bentoml logo 150 | 151 | # next, cd to where the bento is built and look inside: 152 | # default location for built services: /home/mona/bentoml/bentos 153 | cd ~/bentoml/bentos/credit_risk_classifier/latest/ 154 | # you can replace qlkhqrsqv6kpeaavt with any other bento tag 155 | # install and run tree to see what's inside: 156 | sudo apt install tree 157 | tree 158 | # it will show something like this: 159 | ` 160 | . 161 | ├── README.md 162 | ├── apis 163 | │ └── openapi.yaml 164 | ├── bento.yaml 165 | ├── env 166 | │ ├── docker 167 | │ │ ├── Dockerfile 168 | │ │ └── entrypoint.sh 169 | │ └── python 170 | │ ├── install.sh 171 | │ ├── requirements.lock.txt 172 | │ ├── requirements.txt 173 | │ └── version.txt 174 | ├── models 175 | │ └── credit_risk_model 176 | │ ├── latest 177 | │ └── loqp7isqtki4aaav 178 | │ ├── custom_objects.pkl 179 | │ ├── model.yaml 180 | │ └── saved_model.ubj 181 | └── src 182 | ├── locustfile.py 183 | └── service.py 184 | 185 | ` 186 | 187 | # wohoo, it created a dockerfile for us on its own besides a neatly structured model 188 | # environment and some src files to serve the bento. 189 | 190 | #### Load testing with locust for high performance optimization: 191 | 192 | # you don't need to manually create locust.py if you've run "bento build" already. 193 | # If not then, you might need to create locustfile.py manually (sample added) 194 | # make sure you have bentoml serve --production running in one tab in the folder from where you ran the bentoml build 195 | # comamands (which bentofile.yaml to build the model, which in turn used service.py) 196 | 197 | bentoml serve --production 198 | # open another tab and run: 199 | locust -H http://localhost:3000 200 | # do load testing. 201 | # make sure you have modified the train.ipynb to work to accept microbatching in async mode to create a more robust service if you haven't already (video 7.5) 202 | 203 | 204 | #### OPTIONAL - Speed up docker build: Only if you have an NVIDIA GPU + Windows 11 + WSL ready NVIDIA drivers 205 | # set up nvidia container runtime for building docker containers with GPU: 206 | # https://github.com/NVIDIA/nvidia-docker 207 | 208 | distribution=$(. 
/etc/os-release;echo $ID$VERSION_ID) \ 209 | && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \ 210 | && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \ 211 | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \ 212 | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list 213 | 214 | 215 | sudo apt-get update 216 | sudo apt-get install -y nvidia-docker2 217 | # sudo systemctl restart docker 218 | # systemctl doesn't work with WSL, just start the docker srevice: 219 | sudo dockerd & 220 | # hit enter once to exit the INFO message, the docker service will keep running in the background 221 | sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi 222 | # once done, run docker images to get image ID and then run: 223 | docker rmi 224 | 225 | 226 | #### Build the docker image: 227 | cd 228 | cd week7 229 | # start docker service 230 | sudo dockerd & 231 | # for some reason, the command above hangs up sometimes, try pressing ENTER if nohting happens for a while 232 | # if it fails to start successfully, first run: 233 | sudo dockerd 234 | # then enter CTRL+C to exit, and then run 235 | sudo dockerd & 236 | # As far as I understand, the sudo dockerd& allows you to keep working in the same terminal by returning you to terminal prompt if you 237 | # hit enter once all the INFO messages are displayed to get back to terminal prompt 238 | # next, start containerizing bentoml 239 | bentoml containerize credit_risk_classifier:qlkhqrsqv6kpeaav 240 | 241 | 242 | #### Only do this portion if it gives you an error about missing docker plugins, run this: 243 | # only trying to install docker compose plugins didnt work:https://docs.docker.com/compose/install/linux/ 244 | # more info: https://docs.docker.com/engine/install/ubuntu/#set-up-the-repository 245 | 246 | sudo apt-get update 247 | 248 | sudo apt-get install \ 249 | ca-certificates \ 250 | curl \ 251 | gnupg \ 252 | lsb-release 253 | 254 | curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg 255 | 256 | echo \ 257 | "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \ 258 | $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null 259 | 260 | sudo apt-get update 261 | # it will upgrade docker compose (in case it is upto date, it won't change it) and install missing components 262 | sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin 263 | 264 | # you are good to go. Run the build command again. 265 | bentoml containerize credit_risk_classifier:qlkhqrsqv6kpeaav 266 | 267 | #### Run with docker to serve the bento and make predictions via Swagger UI! 
268 | docker run -it --rm -p 3000:3000 credit_risk_classifier:qlkhqrsqv6kpeaav 269 | 270 | # it says it will start service on http://0.0.0.0:3000 but change to 127.0.0.1:3000 271 | # opens correctly if link CTRL+clicked from vscode, translates 0.0.0.0:3000 to localhost 272 | 273 | # It will open the swagger UI once again: 274 | # Repeat steps from before: 275 | 276 | # click on the 'POST /classify' button, and then click on 'try it' and paste the test user information json from the train.ipynb 277 | # sample user data: 278 | ` 279 | { 280 | "seniority": 3, 281 | "home": "owner", 282 | "time": 36, 283 | "age": 26, 284 | "marital": "single", 285 | "records": "no", 286 | "job": "freelance", 287 | "expenses": 35, 288 | "income": 0.0, 289 | "assets": 60000.0, 290 | "debt": 3000.0, 291 | "amount": 800, 292 | "price": 1000 293 | } 294 | ` 295 | # It should show you 200 Successful Response with the final decision of the model: 296 | ` 297 | { 298 | "status": "MAYBE" 299 | } 300 | ` 301 | # With any other test user, either the response is APPROVED, MAYBE or DECLINED 302 | 303 | 304 | #### Run with docker to serve the bento and make predictions via terminal! 305 | 306 | # instead of opening the 127.0.0.1:3000 URL,open another terminal and use curl: 307 | 308 | cd week7 309 | conda activate bento 310 | 311 | # then paste this: 312 | 313 | curl -X 'POST' \ 314 | 'http://127.0.0.1:3000/classify' \ 315 | -H 'accept: application/json' \ 316 | -H 'Content-Type: application/json' \ 317 | -d '{ 318 | "seniority": 3, 319 | "home": "owner", 320 | "time": 36, 321 | "age": 26, 322 | "marital": "single", 323 | "records": "no", 324 | "job": "freelance", 325 | "expenses": 35, 326 | "income": 0.0, 327 | "assets": 60000.0, 328 | "debt": 3000.0, 329 | "amount": 800, 330 | "price": 1000 331 | }' 332 | 333 | #It should return the prediction from the model as either APPROVED, MAYBE or DECLINED like this: 334 | ` 335 | {"status":"MAYBE"} 336 | ` 337 | 338 | 339 | 340 | 341 | -------------------------------------------------------------------------------- /07-bento-production/train.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "d04e7bea", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import pandas as pd\n", 11 | "import numpy as np\n", 12 | "\n", 13 | "from sklearn.model_selection import train_test_split\n", 14 | "from sklearn.feature_extraction import DictVectorizer\n", 15 | "\n", 16 | "from sklearn.ensemble import RandomForestClassifier\n", 17 | "\n", 18 | "import xgboost as xgb" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "id": "d37324f2", 24 | "metadata": {}, 25 | "source": [ 26 | "### Data preparation" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "id": "49807b73", 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-06-trees/CreditScoring.csv'\n", 37 | "df = pd.read_csv(data)" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "id": "28b3fadb", 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "df.columns = df.columns.str.lower()\n", 48 | "\n", 49 | "status_values = {\n", 50 | " 1: 'ok',\n", 51 | " 2: 'default',\n", 52 | " 0: 'unk'\n", 53 | "}\n", 54 | "\n", 55 | "df.status = df.status.map(status_values)\n", 56 | "\n", 57 | "home_values = {\n", 58 | " 1: 'rent',\n", 59 | " 2: 'owner',\n", 
60 | " 3: 'private',\n", 61 | " 4: 'ignore',\n", 62 | " 5: 'parents',\n", 63 | " 6: 'other',\n", 64 | " 0: 'unk'\n", 65 | "}\n", 66 | "\n", 67 | "df.home = df.home.map(home_values)\n", 68 | "\n", 69 | "marital_values = {\n", 70 | " 1: 'single',\n", 71 | " 2: 'married',\n", 72 | " 3: 'widow',\n", 73 | " 4: 'separated',\n", 74 | " 5: 'divorced',\n", 75 | " 0: 'unk'\n", 76 | "}\n", 77 | "\n", 78 | "df.marital = df.marital.map(marital_values)\n", 79 | "\n", 80 | "records_values = {\n", 81 | " 1: 'no',\n", 82 | " 2: 'yes',\n", 83 | " 0: 'unk'\n", 84 | "}\n", 85 | "\n", 86 | "df.records = df.records.map(records_values)\n", 87 | "\n", 88 | "job_values = {\n", 89 | " 1: 'fixed',\n", 90 | " 2: 'partime',\n", 91 | " 3: 'freelance',\n", 92 | " 4: 'others',\n", 93 | " 0: 'unk'\n", 94 | "}\n", 95 | "\n", 96 | "df.job = df.job.map(job_values)\n", 97 | "\n", 98 | "for c in ['income', 'assets', 'debt']:\n", 99 | " df[c] = df[c].replace(to_replace=99999999, value=np.nan)\n", 100 | "\n", 101 | "df = df[df.status != 'unk'].reset_index(drop=True)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 4, 107 | "id": "4fd52ad9", 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "df_train, df_test = train_test_split(df, test_size=0.2, random_state=11)\n", 112 | "\n", 113 | "df_train = df_train.reset_index(drop=True)\n", 114 | "df_test = df_test.reset_index(drop=True)\n", 115 | "\n", 116 | "y_train = (df_train.status == 'default').astype('int').values\n", 117 | "y_test = (df_test.status == 'default').astype('int').values\n", 118 | "\n", 119 | "del df_train['status']\n", 120 | "del df_test['status']" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 5, 126 | "id": "5fe56815", 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "dv = DictVectorizer(sparse=False)\n", 131 | "\n", 132 | "train_dicts = df_train.fillna(0).to_dict(orient='records')\n", 133 | "X_train = dv.fit_transform(train_dicts)\n", 134 | "\n", 135 | "test_dicts = df_test.fillna(0).to_dict(orient='records')\n", 136 | "X_test = dv.transform(test_dicts)" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "id": "1fb68649", 142 | "metadata": {}, 143 | "source": [ 144 | "### Random forest" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 6, 150 | "id": "a84fa9d2", 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "data": { 155 | "text/html": [ 156 | "
" 159 | ], 160 | "text/plain": [ 161 | "RandomForestClassifier(max_depth=10, min_samples_leaf=3, n_estimators=200,\n", 162 | " random_state=1)" 163 | ] 164 | }, 165 | "execution_count": 6, 166 | "metadata": {}, 167 | "output_type": "execute_result" 168 | } 169 | ], 170 | "source": [ 171 | "rf = RandomForestClassifier(n_estimators=200,\n", 172 | " max_depth=10,\n", 173 | " min_samples_leaf=3,\n", 174 | " random_state=1)\n", 175 | "rf.fit(X_train, y_train)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "id": "05f1bb34", 181 | "metadata": {}, 182 | "source": [ 183 | "### XGBoost\n", 184 | "\n", 185 | "Note:\n", 186 | "\n", 187 | "We removed feature names\n", 188 | "\n", 189 | "It was \n", 190 | "\n", 191 | "```python\n", 192 | "features = dv.get_feature_names_out()\n", 193 | "dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)\n", 194 | "```\n", 195 | "\n", 196 | "Now it's\n", 197 | "\n", 198 | "```python\n", 199 | "dtrain = xgb.DMatrix(X_train, label=y_train)\n", 200 | "```" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 10, 206 | "id": "63185f7a", 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "dtrain = xgb.DMatrix(X_train, label=y_train)" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 11, 216 | "id": "d1e284f4", 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "xgb_params = {\n", 221 | " 'eta': 0.1, \n", 222 | " 'max_depth': 3,\n", 223 | " 'min_child_weight': 1,\n", 224 | "\n", 225 | " 'objective': 'binary:logistic',\n", 226 | " 'eval_metric': 'auc',\n", 227 | "\n", 228 | " 'nthread': 8,\n", 229 | " 'seed': 1,\n", 230 | " 'verbosity': 1,\n", 231 | "}\n", 232 | "\n", 233 | "model = xgb.train(xgb_params, dtrain, num_boost_round=175)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "id": "23ae12d0", 239 | "metadata": {}, 240 | "source": [ 241 | "### BentoML" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 12, 247 | "id": "7a230459", 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "import bentoml" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 13, 257 | "id": "0ea2ca87", 258 | "metadata": {}, 259 | "outputs": [ 260 | { 261 | "data": { 262 | "text/plain": [ 263 | "Model(tag=\"credit_risk_model:dtlts7cv4s2nbhht\", path=\"/Users/ksenia/bentoml/models/credit_risk_model/dtlts7cv4s2nbhht/\")" 264 | ] 265 | }, 266 | "execution_count": 13, 267 | "metadata": {}, 268 | "output_type": "execute_result" 269 | } 270 | ], 271 | "source": [ 272 | "bentoml.xgboost.save_model(\n", 273 | " 'credit_risk_model',\n", 274 | " model,\n", 275 | " custom_objects={\n", 276 | " 'dictVectorizer': dv\n", 277 | " })" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "id": "a5151ea7", 283 | "metadata": {}, 284 | "source": [ 285 | "Test" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 11, 291 | "id": "492f90ec", 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "import json" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 12, 301 | "id": "ed5efc41", 302 | "metadata": {}, 303 | "outputs": [ 304 | { 305 | "name": "stdout", 306 | "output_type": "stream", 307 | "text": [ 308 | "{\n", 309 | " \"seniority\": 3,\n", 310 | " \"home\": \"owner\",\n", 311 | " \"time\": 36,\n", 312 | " \"age\": 26,\n", 313 | " \"marital\": \"single\",\n", 314 | " \"records\": \"no\",\n", 315 | " \"job\": \"freelance\",\n", 316 | " \"expenses\": 
35,\n", 317 | " \"income\": 0.0,\n", 318 | " \"assets\": 60000.0,\n", 319 | " \"debt\": 3000.0,\n", 320 | " \"amount\": 800,\n", 321 | " \"price\": 1000\n", 322 | "}\n" 323 | ] 324 | } 325 | ], 326 | "source": [ 327 | "request = df_test.iloc[0].to_dict()\n", 328 | "print(json.dumps(request, indent=2))" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "id": "ec2744ae", 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [] 338 | } 339 | ], 340 | "metadata": { 341 | "kernelspec": { 342 | "display_name": "Python 3 (ipykernel)", 343 | "language": "python", 344 | "name": "python3" 345 | }, 346 | "language_info": { 347 | "codemirror_mode": { 348 | "name": "ipython", 349 | "version": 3 350 | }, 351 | "file_extension": ".py", 352 | "mimetype": "text/x-python", 353 | "name": "python", 354 | "nbconvert_exporter": "python", 355 | "pygments_lexer": "ipython3", 356 | "version": "3.9.13" 357 | }, 358 | "toc": { 359 | "base_numbering": 1, 360 | "nav_menu": {}, 361 | "number_sections": true, 362 | "sideBar": true, 363 | "skip_h1_title": true, 364 | "title_cell": "Table of Contents", 365 | "title_sidebar": "Contents", 366 | "toc_cell": false, 367 | "toc_position": {}, 368 | "toc_section_display": true, 369 | "toc_window_display": false 370 | }, 371 | "vscode": { 372 | "interpreter": { 373 | "hash": "534cd37feadae63edbd1d76ee7605c25f8931687006409bb4f94689dd24a9518" 374 | } 375 | } 376 | }, 377 | "nbformat": 4, 378 | "nbformat_minor": 5 379 | } 380 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning Engineering Zoomcamp 2022 2 | 3 | ![ML ZoomCamp](https://github.com/alexeygrigorev/mlbookcamp-code/raw/master/images/zoomcamp.jpg) 4 | 5 | ## Homeworks, Midterm, & Capstone Project 6 | ### Progress: 7 | | Id | Module Session | Progress | Deadline | Link | 8 | |----|-----------------------------------------------|----------|--------------|--------------------| 9 | |01 | Introduction to Machine Learning | :white_check_mark: | 12/09/2022 | [Intro] | 10 | |02 | Machine Learning for Regression | :white_check_mark: | 19/09/2022 | [Regression]| 11 | |03 | Machine Learning for Classification | :white_check_mark: | 26/09/2022 | [Classification]| 12 | |04 | Evaluation Metrics for Classification | :white_check_mark: | 03/10/2022 | [Evaluation]| 13 | |05 | Deploying Machine Learning Models | :white_check_mark: | 10/10/2022 | [Deployment]| 14 | |06 | Decision Trees and Ensemble Learning | :white_check_mark: | 17/10/2022 | [Trees]| 15 | |07 | Bento ML | :white_check_mark: | 24/10/2022 | [BentoML]| 16 | 17 | On Kaggle - [link](https://www.kaggle.com/ksyuleg) 18 | 19 | The original course contents is [here](https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp), courtesy of [Alexey Grigorev](https://github.com/alexeygrigorev). 20 | -------------------------------------------------------------------------------- /midterm_project/README.md: -------------------------------------------------------------------------------- 1 | # Fraud Detection end-to-end Machine Learning Project 2 | 3 | The goal of the project is to build a fraud detection API that can be called to predict whether or not a transaction is fraudulent. 4 | 5 | The project follows an open standard guide CRISP-DM (Cross-Industry Standard Process for Data Mining), which describes the main phases of data mining lifecycle and widly used by data mining experts. 
6 | 7 | ### Buissness Understanding 8 | 1. Determine business objectives 9 | 2. Determine data mining goals 10 | 3. Produce project plan 11 | ### Data Understanding 12 | 1. Collect data 13 | 2. Describe data 14 | 3. Explore data 15 | ### Data Preparation 16 | 1. Data selection 17 | 2. Data preprocessing 18 | 3. Feature engineering 19 | ### Data Modeling 20 | 1. Select modeling technique Select technique 21 | 2. Generate test design 22 | 3. Build model 23 | 4. Assess model 24 | ### Data Evaluation 25 | 1. Evaluate Result 26 | 2. Review Process 27 | 3. Determine next steps 28 | 29 | 30 | -------------------------------------------------------------------------------- /midterm_project/app/README.md: -------------------------------------------------------------------------------- 1 | # Fraud Detection API 2 | 3 | Build a REST API that predicts whether a given transaction is fraudulent or not. All API calls should be stored in order to engineer features relevant to finding fraud. Every API call includes a time step of the transaction. -------------------------------------------------------------------------------- /midterm_project/data/data.csv.gz: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:e26bbc07745ce8ad047753c41d7b261797895757603f9ccd58427da11522f847 3 | size 181894042 4 | -------------------------------------------------------------------------------- /midterm_project/notebooks/01-EDA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "18e71fd3", 6 | "metadata": {}, 7 | "source": [ 8 | "This is a synthetic dataset for a mobile payments application. Each transaction has a sender and recipient and are taggedare is tagged as fraud or not fraud.\n", 9 | "\n", 10 | "The task is to build a fraud detection API that can be called to predict whether or not a transaction is fraudulent. " 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "c6e4c45a", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "id": "3d81c9da", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import pandas as pd" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "id": "4341cbac", 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 8, 42 | "id": "ffb1de62", 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "data": { 47 | "text/html": [ 48 | "
\n", 49 | "\n", 62 | "\n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | "
steptypeamountnameOrigoldbalanceOrignewbalanceOrignameDestoldbalanceDestnewbalanceDestisFraud
01PAYMENT9839.64C1231006815170136.0160296.36M19797871550.00.00
11PAYMENT1864.28C166654429521249.019384.72M20442822250.00.00
21TRANSFER181.00C1305486145181.00.00C5532640650.00.01
31CASH_OUT181.00C840083671181.00.00C3899701021182.00.01
41PAYMENT11668.14C204853772041554.029885.86M12307017030.00.00
\n", 146 | "
" 147 | ], 148 | "text/plain": [ 149 | " step type amount nameOrig oldbalanceOrig newbalanceOrig \\\n", 150 | "0 1 PAYMENT 9839.64 C1231006815 170136.0 160296.36 \n", 151 | "1 1 PAYMENT 1864.28 C1666544295 21249.0 19384.72 \n", 152 | "2 1 TRANSFER 181.00 C1305486145 181.0 0.00 \n", 153 | "3 1 CASH_OUT 181.00 C840083671 181.0 0.00 \n", 154 | "4 1 PAYMENT 11668.14 C2048537720 41554.0 29885.86 \n", 155 | "\n", 156 | " nameDest oldbalanceDest newbalanceDest isFraud \n", 157 | "0 M1979787155 0.0 0.0 0 \n", 158 | "1 M2044282225 0.0 0.0 0 \n", 159 | "2 C553264065 0.0 0.0 1 \n", 160 | "3 C38997010 21182.0 0.0 1 \n", 161 | "4 M1230701703 0.0 0.0 0 " 162 | ] 163 | }, 164 | "execution_count": 8, 165 | "metadata": {}, 166 | "output_type": "execute_result" 167 | } 168 | ], 169 | "source": [ 170 | "data = pd.read_csv(\"../data/data.csv.gz\", compression = \"gzip\")\n", 171 | "data.head()" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": null, 177 | "id": "c1992bb0", 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [] 181 | } 182 | ], 183 | "metadata": { 184 | "kernelspec": { 185 | "display_name": "Python 3 (ipykernel)", 186 | "language": "python", 187 | "name": "python3" 188 | }, 189 | "language_info": { 190 | "codemirror_mode": { 191 | "name": "ipython", 192 | "version": 3 193 | }, 194 | "file_extension": ".py", 195 | "mimetype": "text/x-python", 196 | "name": "python", 197 | "nbconvert_exporter": "python", 198 | "pygments_lexer": "ipython3", 199 | "version": "3.9.13" 200 | }, 201 | "toc": { 202 | "base_numbering": 1, 203 | "nav_menu": {}, 204 | "number_sections": true, 205 | "sideBar": true, 206 | "skip_h1_title": true, 207 | "title_cell": "Table of Contents", 208 | "title_sidebar": "Contents", 209 | "toc_cell": false, 210 | "toc_position": {}, 211 | "toc_section_display": true, 212 | "toc_window_display": false 213 | } 214 | }, 215 | "nbformat": 4, 216 | "nbformat_minor": 5 217 | } 218 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | ###### Requirements with Version Specifiers ###### 2 | jupyter == 6.4.12 3 | numpy == 1.21.5 4 | pandas == 1.4.3 5 | scikit-learn == 1.1.1 6 | seaborn == 0.11.2 7 | matplotlib == 3.5.2 --------------------------------------------------------------------------------