├── .gitattributes
├── 01-Intro
│   ├── Homework_intro.ipynb
│   ├── README.md
│   └── data.csv
├── 02-regression
│   ├── Homework-regression.ipynb
│   ├── README.md
│   ├── Regularization in Linear Regression.ipynb
│   ├── car_price_data.csv
│   └── housing.csv
├── 03-classification
│   ├── Homework-classification.ipynb
│   └── README.md
├── 04-evaluation
│   ├── AER_credit_card_data.csv
│   ├── Homework-evaluation.ipynb
│   └── README.md
├── 05-deployment
│   ├── External_client.ipynb
│   ├── README.md
│   └── app
│       ├── Dockerfile
│       ├── Pipfile
│       ├── Pipfile.lock
│       ├── customer_1.json
│       ├── customer_2.json
│       ├── dv.bin
│       ├── eb_cloud_service.png
│       ├── model1.bin
│       ├── predict-test.py
│       └── predict.py
├── 06-trees
│   ├── Homework-trees.ipynb
│   └── README.md
├── 07-bento-production
│   ├── README.md
│   ├── coolmodel
│   │   └── service.py
│   ├── credit_risk_service
│   │   └── service.py
│   ├── dragon.jpeg
│   ├── locustfile.py
│   ├── models
│   │   ├── credit_risk_model
│   │   │   ├── dtlts7cv4s2nbhht
│   │   │   │   ├── custom_objects.pkl
│   │   │   │   ├── model.yaml
│   │   │   │   └── saved_model.ubj
│   │   │   └── latest
│   │   └── mlzoomcamp_homework
│   │       ├── jsi67fslz6txydu5
│   │       │   ├── model.yaml
│   │       │   └── saved_model.pkl
│   │       ├── latest
│   │       └── qtzdz3slg6mwwdu5
│   │           ├── model.yaml
│   │           └── saved_model.pkl
│   ├── requirements.txt
│   ├── setting_up_bentoML.sh
│   └── train.ipynb
├── README.md
├── midterm_project
│   ├── README.md
│   ├── app
│   │   └── README.md
│   ├── data
│   │   └── data.csv.gz
│   └── notebooks
│       └── 01-EDA.ipynb
└── requirements.txt
/.gitattributes:
--------------------------------------------------------------------------------
1 | *.csv filter=lfs diff=lfs merge=lfs -text
2 | *.csv.gz filter=lfs diff=lfs merge=lfs -text
3 | *.gz filter=lfs diff=lfs merge=lfs -text
4 |
--------------------------------------------------------------------------------
/01-Intro/README.md:
--------------------------------------------------------------------------------
1 | # Intro Session : Homework & Dataset
2 |
3 | Session Overview :
4 |
5 | * 1.1 Introduction to Machine Learning
6 | * 1.2 ML vs Rule-Based Systems
7 | * 1.3 Supervised Machine Learning
8 | * 1.4 CRISP-DM
9 | * 1.5 The Modelling Step (Model Selection Process)
10 | * 1.6 Setting up the Environment
11 | * 1.7 Introduction to NumPy
12 | * 1.8 Linear Algebra Refresher
13 | * 1.9 Introduction to Pandas
14 |
15 | ---------
16 |
17 | ## Session #1 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/01-intro/homework.md)
18 |
19 | ### Set up the environment
20 |
21 | You need to install Python, NumPy, Pandas, Matplotlib and Seaborn.
22 |
23 | ### Question 1
24 |
25 | What's the version of NumPy that you installed?
26 |
27 | You can get the version information using the `__version__` field:
28 |
29 | ```python
30 | np.__version__
31 | ```
32 |
33 | ### Getting the data
34 |
35 | For this homework, we'll use the Car price dataset. Download it from
36 | [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).
37 |
38 | You can do it with wget:
39 |
40 | ```bash
41 | wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
42 | ```
43 |
44 | Or just open it with your browser and click "Save as...".
45 |
46 | Now read it with Pandas.
47 |
48 | ### Question 2
49 |
50 | How many records are in the dataset?
51 |
52 | Here you need to specify the number of rows.
53 |
54 | - 16
55 | - 6572
56 | - 11914
57 | - 18990
58 |
59 | ### Question 3
60 |
61 | Who are the most frequent car manufacturers (top-3) according to the dataset?
62 |
63 | - Chevrolet, Volkswagen, Toyota
64 | - Chevrolet, Ford, Toyota
65 | - Ford, Volkswagen, Toyota
66 | - Chevrolet, Ford, Volkswagen
67 |
68 | > **Note**: You should rely on the "Make" column in this question.
69 |
70 | ### Question 4
71 |
72 | What's the number of unique Audi car models in the dataset?
73 |
74 | - 3
75 | - 16
76 | - 26
77 | - 34
78 |
79 | ### Question 5
80 |
81 | How many columns in the dataset have missing values?
82 |
83 | - 5
84 | - 6
85 | - 7
86 | - 8
87 |
88 | ### Question 6
89 |
90 | 1. Find the median value of "Engine Cylinders" column in the dataset.
91 | 2. Next, calculate the most frequent value of the same "Engine Cylinders".
92 | 3. Use the `fillna` method to fill the missing values in "Engine Cylinders" with the most frequent value from the previous step.
93 | 4. Now, calculate the median value of "Engine Cylinders" once again.
94 |
95 | Has it changed?
96 |
97 | > Hint: refer to the existing `mode` and `median` functions to complete the task (a sketch follows after the options).
98 |
99 | - Yes
100 | - No
101 |
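A minimal sketch of these steps, assuming the dataset downloaded above has been read into a DataFrame called `df` (the variable name is illustrative):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # Car price dataset from the wget step above

median_before = df["Engine Cylinders"].median()         # 1. median with missing values present
most_frequent = df["Engine Cylinders"].mode()[0]        # 2. most frequent value (mode)
filled = df["Engine Cylinders"].fillna(most_frequent)   # 3. fill the missing values with the mode
median_after = filled.median()                          # 4. median after filling

print(median_before, median_after)  # compare the two medians to answer the question
```
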
102 | ### Question 7
103 |
104 | 1. Select all the "Lotus" cars from the dataset.
105 | 2. Select only columns "Engine HP", "Engine Cylinders".
106 | 3. Now drop all duplicated rows using `drop_duplicates` method (you should get a dataframe with 9 rows).
107 | 4. Get the underlying NumPy array. Let's call it `X`.
108 | 5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
109 | 6. Invert `XTX`.
110 | 7. Create an array `y` with values `[1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800]`.
111 | 8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
112 | 9. What's the value of the first element of `w`?
113 |
114 | > **Note**: You just implemented linear regression (see the sketch after the options). We'll talk about it in the next lesson.
115 |
116 | - -0.0723
117 | - 4.5949
118 | - 31.6537
119 | - 63.5643
120 |
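A sketch of the steps above, continuing with the same `df` and using plain NumPy for the linear algebra:

```python
import numpy as np

lotus = df[df["Make"] == "Lotus"]                   # 1. select the Lotus cars
lotus = lotus[["Engine HP", "Engine Cylinders"]]    # 2. keep only the two columns
lotus = lotus.drop_duplicates()                     # 3. drop duplicated rows (9 rows remain)

X = lotus.values                                    # 4. underlying NumPy array
XTX = X.T @ X                                       # 5. X^T X
XTX_inv = np.linalg.inv(XTX)                        # 6. invert XTX
y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800])  # 7. target array
w = XTX_inv @ X.T @ y                               # 8. (X^T X)^-1 X^T y
print(w[0])                                         # 9. first element of w
```
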
121 |
122 | ## Submit the results
123 |
124 | Submit your results here: https://forms.gle/vLp3mvtnrjJxCZx66
125 |
126 | If your answer doesn't match the options exactly, select the closest one.
127 |
128 |
129 | ## Deadline
130 |
131 | The deadline for submitting is 12 September 2022 (Monday), 23:00 CEST (Berlin time).
132 |
133 | After that, the form will be closed.
134 |
--------------------------------------------------------------------------------
/01-Intro/data.csv:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:26e39d3e902246d01a93ae390f51129a288079aefad2cb3292751a262ffd62d8
3 | size 1475504
4 |
--------------------------------------------------------------------------------
/02-regression/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning for Regression : Homework & Dataset
2 |
3 | Session Overview :
4 |
5 | * 2.1 Car price prediction project
6 | * 2.2 Data preparation
7 | * 2.3 Exploratory data analysis
8 | * 2.4 Setting up the validation framework
9 | * 2.5 Linear regression
10 | * 2.6 Linear regression: vector form
11 | * 2.7 Training linear regression: Normal equation
12 | * 2.8 Baseline model for car price prediction project
13 | * 2.9 Root mean squared error
14 | * 2.10 Using RMSE on validation data
15 | * 2.11 Feature engineering
16 | * 2.12 Categorical variables
17 | * 2.13 Regularization
18 | * 2.14 Tuning the model
19 | * 2.15 Using the model
20 | * 2.16 Car price prediction project summary
21 |
22 | ---------
23 |
24 | ## Session #2 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/02-regression/homework.md)
25 |
26 | ### Dataset
27 |
28 | In this homework, we will use the California Housing Prices dataset. You can take it from
29 | [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).
30 |
31 | The goal of this homework is to create a regression model for predicting housing prices (column `'median_house_value'`).
32 |
33 | ### EDA
34 |
35 | * Load the data.
36 | * Look at the `median_house_value` variable. Does it have a long tail?
37 |
38 | ### Features
39 |
40 | For the rest of the homework, you'll need to use only these columns:
41 |
42 | * `'latitude'`,
43 | * `'longitude'`,
44 | * `'housing_median_age'`,
45 | * `'total_rooms'`,
46 | * `'total_bedrooms'`,
47 | * `'population'`,
48 | * `'households'`,
49 | * `'median_income'`,
50 | * `'median_house_value'`
51 |
52 | Select only them.
53 |
54 | ### Question 1
55 |
56 | Find a feature with missing values. How many missing values does it have?
57 | - 207
58 | - 208
59 | - 307
60 | - 308
61 |
62 | ### Question 2
63 |
64 | What's the median (the 50th percentile) of the variable 'population'?
65 | - 1133
66 | - 1122
67 | - 1166
68 | - 1188
69 |
70 | ### Split the data
71 |
72 | * Shuffle the initial dataset, use seed `42`.
73 | * Split your data in train/val/test sets, with 60%/20%/20% distribution.
74 | * Make sure that the target value ('median_house_value') is not in your dataframe.
75 | * Apply the log transformation to the median_house_value variable using the `np.log1p()` function (see the sketch below).
76 |
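One way to implement this split with plain NumPy, in the spirit of the lessons (the `df` name and the exact structure are illustrative, not the only correct solution):

```python
import numpy as np

n = len(df)
n_val, n_test = int(n * 0.2), int(n * 0.2)
n_train = n - n_val - n_test

idx = np.arange(n)
np.random.seed(42)        # shuffle with seed 42
np.random.shuffle(idx)

df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
df_val = df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
df_test = df.iloc[idx[n_train + n_val:]].reset_index(drop=True)

# log-transform the target and remove it from the feature frames
y_train = np.log1p(df_train.pop("median_house_value").values)
y_val = np.log1p(df_val.pop("median_house_value").values)
y_test = np.log1p(df_test.pop("median_house_value").values)
```
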
77 | ### Question 3
78 |
79 | * We need to deal with missing values for the column from Q1.
80 | * We have two options: fill it with 0 or with the mean of this variable.
81 | * Try both options. For each, train a linear regression model without regularization using the code from the lessons.
82 | * For computing the mean, use the training set only!
83 | * Use the validation dataset to evaluate the models and compare the RMSE of each option.
84 | * Round the RMSE scores to 2 decimal digits using `round(score, 2)`
85 | * Which option gives better RMSE?
86 |
87 | Options:
88 | - With 0
89 | - With mean
90 | - With median
91 | - Both are equally good
92 |
93 | ### Question 4
94 |
95 | * Now let's train a regularized linear regression (a sketch follows after the options).
96 | * For this question, fill the NAs with 0.
97 | * Try different values of `r` from this list: `[0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]`.
98 | * Use RMSE to evaluate the model on the validation dataset.
99 | * Round the RMSE scores to 2 decimal digits.
100 | * Which `r` gives the best RMSE?
101 |
102 | If there are multiple options, select the smallest `r`.
103 |
104 | Options:
105 | - 0
106 | - 0.000001
107 | - 0.001
108 | - 0.01
109 |
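A sketch of the regularized training loop, building on the split above; `train_linear_regression_reg` mirrors the lesson's normal-equation solution with an `r * I` term added:

```python
import numpy as np

def train_linear_regression_reg(X, y, r):
    # normal equation with regularization: w = (X^T X + r*I)^-1 X^T y
    X = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X.T @ X + r * np.eye(X.shape[1])
    w = np.linalg.inv(XTX) @ (X.T @ y)
    return w[0], w[1:]

def rmse(y, y_pred):
    return np.sqrt(((y - y_pred) ** 2).mean())

X_train = df_train.fillna(0).values   # fill the NAs with 0, as required
X_val = df_val.fillna(0).values

for r in [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]:
    w0, w = train_linear_regression_reg(X_train, y_train, r)
    print(r, round(rmse(y_val, w0 + X_val @ w), 2))
```
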
110 | ### Question 5
111 |
112 | * We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
113 | * Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
114 | * For each seed, do the train/validation/test split with 60%/20%/20% distribution.
115 | * Fill the missing values with 0 and train a model without regularization.
116 | * For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
117 | * What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
118 | * Round the result to 3 decimal digits (`round(std, 3)`); a sketch follows after the options.
119 |
120 | > Note: Standard deviation shows how different the values are.
121 | > If it's low, then all values are approximately the same.
122 | > If it's high, the values are different.
123 | > If standard deviation of scores is low, then our model is *stable*.
124 |
125 | Options:
126 | - 0.5
127 | - 0.05
128 | - 0.005
129 | - 0.0005
130 |
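A compact sketch of the seed loop, reusing the split sizes and the helpers from the sketches above:

```python
scores = []
for seed in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:
    idx = np.arange(n)
    np.random.seed(seed)
    np.random.shuffle(idx)
    tr = df.iloc[idx[:n_train]].fillna(0)
    va = df.iloc[idx[n_train:n_train + n_val]].fillna(0)
    y_tr = np.log1p(tr.pop("median_house_value").values)
    y_va = np.log1p(va.pop("median_house_value").values)
    w0, w = train_linear_regression_reg(tr.values, y_tr, r=0)   # no regularization
    scores.append(rmse(y_va, w0 + va.values @ w))

print(round(np.std(scores), 3))
```
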
131 | ### Question 6
132 |
133 | * Split the dataset like previously, use seed 9.
134 | * Combine train and validation datasets.
135 | * Fill the missing values with 0 and train a model with `r=0.001`.
136 | * What's the RMSE on the test dataset?
137 |
138 | Options:
139 | - 0.35
140 | - 0.035
141 | - 0.45
142 | - 0.045
143 |
--------------------------------------------------------------------------------
/02-regression/car_price_data.csv:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:26e39d3e902246d01a93ae390f51129a288079aefad2cb3292751a262ffd62d8
3 | size 1475504
4 |
--------------------------------------------------------------------------------
/02-regression/housing.csv:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:8a3727f4cf54ac1a327f69b1d5b4db54c5834ea81c6e4efc0d163300022a685e
3 | size 1423529
4 |
--------------------------------------------------------------------------------
/03-classification/Homework-classification.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "8c1b50fd",
6 | "metadata": {},
7 | "source": [
8 | "# Machine Learning for Classification\n",
9 | "House price prediction\n",
10 | "\n",
11 | "## Dataset\n",
12 | "\n",
13 | "Dataset is the California Housing Prices from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).\n",
14 | "\n",
15 | "Here's a wget-able [link](https://github.com/Ksyula/ML_Engineering/blob/master/02-regression/housing.csv):\n",
16 | "\n",
17 | "```bash\n",
18 | "wget https://raw.githubusercontent.com/Ksyula/ML_Engineering/master/02-regression/housing.csv\n",
19 | "```\n",
20 | "\n",
21 | "The goal is to create a regression model for predicting housing prices (column `'median_house_value'`)."
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "id": "b9577f8e",
28 | "metadata": {},
29 | "outputs": [],
30 | "source": [
31 | "from sklearn.model_selection import train_test_split\n",
32 | "from sklearn.metrics import mutual_info_score, mean_squared_error\n",
33 | "from sklearn.feature_extraction import DictVectorizer\n",
34 | "from sklearn.linear_model import LogisticRegression, Ridge"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 2,
40 | "id": "b31f1df7",
41 | "metadata": {},
42 | "outputs": [
43 | {
44 | "data": {
45 | "text/plain": [
46 | "('1.21.5', '1.4.3')"
47 | ]
48 | },
49 | "execution_count": 2,
50 | "metadata": {},
51 | "output_type": "execute_result"
52 | }
53 | ],
54 | "source": [
55 | "import pandas as pd\n",
56 | "import numpy as np\n",
57 | "import seaborn as sns\n",
58 | "import matplotlib.pyplot as plt\n",
59 | "\n",
60 | "%matplotlib inline\n",
61 | "\n",
62 | "np.__version__, pd.__version__"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "id": "143b2493",
68 | "metadata": {},
69 | "source": [
70 | "## Data preparation"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 28,
76 | "id": "83b4d6b1",
77 | "metadata": {},
78 | "outputs": [
79 | {
80 | "data": {
81 | "text/plain": [
82 | "(20640, 10)"
83 | ]
84 | },
85 | "execution_count": 28,
86 | "metadata": {},
87 | "output_type": "execute_result"
88 | }
89 | ],
90 | "source": [
91 | "data = pd.read_csv('../02-regression/housing.csv')\n",
92 | "data.shape"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 29,
98 | "id": "cbd291da",
99 | "metadata": {},
100 | "outputs": [
101 | {
102 | "data": {
103 | "text/html": [
104 | "
\n",
105 | "\n",
118 | "
\n",
119 | " \n",
120 | " \n",
121 | " | \n",
122 | " 0 | \n",
123 | " 1 | \n",
124 | " 2 | \n",
125 | " 3 | \n",
126 | " 4 | \n",
127 | "
\n",
128 | " \n",
129 | " \n",
130 | " \n",
131 | " latitude | \n",
132 | " 37.88 | \n",
133 | " 37.86 | \n",
134 | " 37.85 | \n",
135 | " 37.85 | \n",
136 | " 37.85 | \n",
137 | "
\n",
138 | " \n",
139 | " longitude | \n",
140 | " -122.23 | \n",
141 | " -122.22 | \n",
142 | " -122.24 | \n",
143 | " -122.25 | \n",
144 | " -122.25 | \n",
145 | "
\n",
146 | " \n",
147 | " housing_median_age | \n",
148 | " 41.0 | \n",
149 | " 21.0 | \n",
150 | " 52.0 | \n",
151 | " 52.0 | \n",
152 | " 52.0 | \n",
153 | "
\n",
154 | " \n",
155 | " total_rooms | \n",
156 | " 880.0 | \n",
157 | " 7099.0 | \n",
158 | " 1467.0 | \n",
159 | " 1274.0 | \n",
160 | " 1627.0 | \n",
161 | "
\n",
162 | " \n",
163 | " total_bedrooms | \n",
164 | " 129.0 | \n",
165 | " 1106.0 | \n",
166 | " 190.0 | \n",
167 | " 235.0 | \n",
168 | " 280.0 | \n",
169 | "
\n",
170 | " \n",
171 | " population | \n",
172 | " 322.0 | \n",
173 | " 2401.0 | \n",
174 | " 496.0 | \n",
175 | " 558.0 | \n",
176 | " 565.0 | \n",
177 | "
\n",
178 | " \n",
179 | " households | \n",
180 | " 126.0 | \n",
181 | " 1138.0 | \n",
182 | " 177.0 | \n",
183 | " 219.0 | \n",
184 | " 259.0 | \n",
185 | "
\n",
186 | " \n",
187 | " median_income | \n",
188 | " 8.3252 | \n",
189 | " 8.3014 | \n",
190 | " 7.2574 | \n",
191 | " 5.6431 | \n",
192 | " 3.8462 | \n",
193 | "
\n",
194 | " \n",
195 | " median_house_value | \n",
196 | " 452600.0 | \n",
197 | " 358500.0 | \n",
198 | " 352100.0 | \n",
199 | " 341300.0 | \n",
200 | " 342200.0 | \n",
201 | "
\n",
202 | " \n",
203 | " ocean_proximity | \n",
204 | " NEAR BAY | \n",
205 | " NEAR BAY | \n",
206 | " NEAR BAY | \n",
207 | " NEAR BAY | \n",
208 | " NEAR BAY | \n",
209 | "
\n",
210 | " \n",
211 | "
\n",
212 | "
"
213 | ],
214 | "text/plain": [
215 | " 0 1 2 3 4\n",
216 | "latitude 37.88 37.86 37.85 37.85 37.85\n",
217 | "longitude -122.23 -122.22 -122.24 -122.25 -122.25\n",
218 | "housing_median_age 41.0 21.0 52.0 52.0 52.0\n",
219 | "total_rooms 880.0 7099.0 1467.0 1274.0 1627.0\n",
220 | "total_bedrooms 129.0 1106.0 190.0 235.0 280.0\n",
221 | "population 322.0 2401.0 496.0 558.0 565.0\n",
222 | "households 126.0 1138.0 177.0 219.0 259.0\n",
223 | "median_income 8.3252 8.3014 7.2574 5.6431 3.8462\n",
224 | "median_house_value 452600.0 358500.0 352100.0 341300.0 342200.0\n",
225 | "ocean_proximity NEAR BAY NEAR BAY NEAR BAY NEAR BAY NEAR BAY"
226 | ]
227 | },
228 | "execution_count": 29,
229 | "metadata": {},
230 | "output_type": "execute_result"
231 | }
232 | ],
233 | "source": [
234 | "variables = ['latitude','longitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income','median_house_value','ocean_proximity']\n",
235 | "data = data[variables].fillna(0)\n",
236 | "data.head().T"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 30,
242 | "id": "e296b68e",
243 | "metadata": {},
244 | "outputs": [
245 | {
246 | "data": {
247 | "text/plain": [
248 | "latitude float64\n",
249 | "longitude float64\n",
250 | "housing_median_age float64\n",
251 | "total_rooms float64\n",
252 | "total_bedrooms float64\n",
253 | "population float64\n",
254 | "households float64\n",
255 | "median_income float64\n",
256 | "median_house_value float64\n",
257 | "ocean_proximity object\n",
258 | "dtype: object"
259 | ]
260 | },
261 | "execution_count": 30,
262 | "metadata": {},
263 | "output_type": "execute_result"
264 | }
265 | ],
266 | "source": [
267 | "data.dtypes"
268 | ]
269 | },
270 | {
271 | "cell_type": "markdown",
272 | "id": "71ec4d78",
273 | "metadata": {},
274 | "source": [
275 | "## Feature engineering"
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": 31,
281 | "id": "07aa89ea",
282 | "metadata": {},
283 | "outputs": [],
284 | "source": [
285 | "data['rooms_per_household'] = data['total_rooms'] / data['households']\n",
286 | "data['bedrooms_per_room'] = data['total_bedrooms'] / data['total_rooms']\n",
287 | "data['population_per_household'] = data['population'] / data['households']"
288 | ]
289 | },
290 | {
291 | "cell_type": "markdown",
292 | "id": "a321082e",
293 | "metadata": {},
294 | "source": [
295 | "### Question 1\n",
296 | "\n",
297 | "What is the most frequent observation (mode) for the column `ocean_proximity`?\n",
298 | "\n",
299 | "Options:\n",
300 | "* NEAR BAY\n",
301 | "* **<1H OCEAN**\n",
302 | "* INLAND\n",
303 | "* NEAR OCEAN"
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": 32,
309 | "id": "0af14577",
310 | "metadata": {},
311 | "outputs": [
312 | {
313 | "data": {
314 | "text/plain": [
315 | "0 <1H OCEAN\n",
316 | "Name: ocean_proximity, dtype: object"
317 | ]
318 | },
319 | "execution_count": 32,
320 | "metadata": {},
321 | "output_type": "execute_result"
322 | }
323 | ],
324 | "source": [
325 | "data['ocean_proximity'].mode()"
326 | ]
327 | },
328 | {
329 | "cell_type": "markdown",
330 | "id": "c9a4e452",
331 | "metadata": {},
332 | "source": [
333 | "## Set up validation framework"
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": 33,
339 | "id": "8417e15f",
340 | "metadata": {},
341 | "outputs": [],
342 | "source": [
343 | "# Split the data in train/val/test sets, with 60%/20%/20% distribution\n",
344 | "def set_up_val_framework(val = 0.25, test = 0.2):\n",
345 | " \n",
346 | " df_full_train, df_test = train_test_split(data, test_size = test, random_state=42)\n",
347 | " df_train, df_val = train_test_split(df_full_train, test_size = val, random_state=42)\n",
348 | " \n",
349 | " df_train = df_train.reset_index(drop = True)\n",
350 | " df_val = df_val.reset_index(drop = True)\n",
351 | " df_test = df_test.reset_index(drop = True)\n",
352 | " \n",
353 | " y_train = df_train.median_house_value.values\n",
354 | " y_val = df_val.median_house_value.values\n",
355 | " y_test = df_test.median_house_value.values\n",
356 | "\n",
357 | " del df_train['median_house_value']\n",
358 | " del df_val['median_house_value']\n",
359 | " del df_test['median_house_value']\n",
360 | " \n",
361 | " return df_train, df_val, df_test, y_train, y_val, y_test"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": 34,
367 | "id": "8500efba",
368 | "metadata": {},
369 | "outputs": [],
370 | "source": [
371 | "df_train, df_val, df_test, y_train, y_val, y_test = set_up_val_framework()"
372 | ]
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "id": "046cbc63",
377 | "metadata": {},
378 | "source": [
379 | "## EDA"
380 | ]
381 | },
382 | {
383 | "cell_type": "code",
384 | "execution_count": 35,
385 | "id": "e300f992",
386 | "metadata": {},
387 | "outputs": [
388 | {
389 | "data": {
390 | "text/plain": [
391 | "latitude 0\n",
392 | "longitude 0\n",
393 | "housing_median_age 0\n",
394 | "total_rooms 0\n",
395 | "total_bedrooms 0\n",
396 | "population 0\n",
397 | "households 0\n",
398 | "median_income 0\n",
399 | "median_house_value 0\n",
400 | "ocean_proximity 0\n",
401 | "rooms_per_household 0\n",
402 | "bedrooms_per_room 0\n",
403 | "population_per_household 0\n",
404 | "dtype: int64"
405 | ]
406 | },
407 | "execution_count": 35,
408 | "metadata": {},
409 | "output_type": "execute_result"
410 | }
411 | ],
412 | "source": [
413 | "data.isna().sum()"
414 | ]
415 | },
416 | {
417 | "cell_type": "markdown",
418 | "id": "92f029d5",
419 | "metadata": {},
420 | "source": [
421 | "### Feature importance: Correlation"
422 | ]
423 | },
424 | {
425 | "cell_type": "markdown",
426 | "id": "9f1b88e5",
427 | "metadata": {},
428 | "source": [
429 | "#### Question 2\n",
430 | "\n",
431 | "* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.\n",
432 | " - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.\n",
433 | "* What are the two features that have the biggest correlation in this dataset?\n",
434 | "\n",
435 | "Options:\n",
436 | "* **`total_bedrooms` and `households`**\n",
437 | "* `total_bedrooms` and `total_rooms`\n",
438 | "* `population` and `households`\n",
439 | "* `population_per_household` and `total_rooms`"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": 36,
445 | "id": "ea04f6c6",
446 | "metadata": {},
447 | "outputs": [
448 | {
449 | "data": {
450 | "text/html": [
451 | "\n",
452 | "\n",
465 | "
\n",
466 | " \n",
467 | " \n",
468 | " | \n",
469 | " level_0 | \n",
470 | " level_1 | \n",
471 | " coef | \n",
472 | "
\n",
473 | " \n",
474 | " \n",
475 | " \n",
476 | " 35 | \n",
477 | " total_bedrooms | \n",
478 | " households | \n",
479 | " 0.979399 | \n",
480 | "
\n",
481 | " \n",
482 | " 27 | \n",
483 | " total_rooms | \n",
484 | " total_bedrooms | \n",
485 | " 0.931546 | \n",
486 | "
\n",
487 | " \n",
488 | " 0 | \n",
489 | " latitude | \n",
490 | " longitude | \n",
491 | " -0.925005 | \n",
492 | "
\n",
493 | " \n",
494 | "
\n",
495 | "
"
496 | ],
497 | "text/plain": [
498 | " level_0 level_1 coef\n",
499 | "35 total_bedrooms households 0.979399\n",
500 | "27 total_rooms total_bedrooms 0.931546\n",
501 | "0 latitude longitude -0.925005"
502 | ]
503 | },
504 | "execution_count": 36,
505 | "metadata": {},
506 | "output_type": "execute_result"
507 | }
508 | ],
509 | "source": [
510 | "cor_mat = df_train.corr(method='pearson')\n",
511 | "cor_coefs = cor_mat.where(np.triu(np.ones(cor_mat.shape), k=1).astype(bool)).stack().reset_index().rename(columns={0: \"coef\"})\n",
512 | "cor_coefs.sort_values(by = \"coef\", ascending = False, key=abs).head(3)\n"
513 | ]
514 | },
515 | {
516 | "cell_type": "markdown",
517 | "id": "d1d2acd3",
518 | "metadata": {},
519 | "source": [
520 | "### Target variable\n",
521 | "Make `median_house_value` binary\n",
522 | "\n",
523 | "* We need to turn the `median_house_value` variable from numeric into binary.\n",
524 | "* Let's create a variable `above_average` which is `1` if the `median_house_value` is above its mean value and `0` otherwise."
525 | ]
526 | },
527 | {
528 | "cell_type": "code",
529 | "execution_count": 37,
530 | "id": "2f17d826",
531 | "metadata": {},
532 | "outputs": [],
533 | "source": [
534 | "median_house_value_mean = round(data.median_house_value.mean(), 2)"
535 | ]
536 | },
537 | {
538 | "cell_type": "code",
539 | "execution_count": 38,
540 | "id": "84f53aa7",
541 | "metadata": {},
542 | "outputs": [],
543 | "source": [
544 | "y_train = [1 if y > median_house_value_mean else 0 for y in y_train]\n",
545 | "y_val = [1 if y > median_house_value_mean else 0 for y in y_val]\n",
546 | "y_test = [1 if y > median_house_value_mean else 0 for y in y_test]"
547 | ]
548 | },
549 | {
550 | "cell_type": "markdown",
551 | "id": "66a75106",
552 | "metadata": {},
553 | "source": [
554 | "### Feature importance: Mutual Information\n",
555 | "#### Question 3\n",
556 | "\n",
557 | "* Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.\n",
558 | "* What is the value of mutual information?\n",
559 | "* Round it to 2 decimal digits using `round(score, 2)`\n",
560 | "\n",
561 | "Options:\n",
562 | "- 0.263\n",
563 | "- 0.00001\n",
564 | "- **0.101**\n",
565 | "- 0.15555"
566 | ]
567 | },
568 | {
569 | "cell_type": "code",
570 | "execution_count": 39,
571 | "id": "dd995dc5",
572 | "metadata": {},
573 | "outputs": [
574 | {
575 | "data": {
576 | "text/plain": [
577 | "0.1"
578 | ]
579 | },
580 | "execution_count": 39,
581 | "metadata": {},
582 | "output_type": "execute_result"
583 | }
584 | ],
585 | "source": [
586 | "round(mutual_info_score(df_train.ocean_proximity, y_train), 2)"
587 | ]
588 | },
589 | {
590 | "cell_type": "markdown",
591 | "id": "173301cc",
592 | "metadata": {},
593 | "source": [
594 | "## One-hot Encoding"
595 | ]
596 | },
597 | {
598 | "cell_type": "code",
599 | "execution_count": 40,
600 | "id": "aa5e20c8",
601 | "metadata": {},
602 | "outputs": [],
603 | "source": [
604 | "def one_hot_encoding(data):\n",
605 | " \n",
606 | " train_dicts = data.to_dict(orient='records')\n",
607 | " dv = DictVectorizer(sparse = False)\n",
608 | " one_hot_encoded_data = dv.fit_transform(train_dicts)\n",
609 | " return one_hot_encoded_data, dv.get_feature_names_out().tolist()"
610 | ]
611 | },
612 | {
613 | "cell_type": "code",
614 | "execution_count": 42,
615 | "id": "50852ecc",
616 | "metadata": {},
617 | "outputs": [
618 | {
619 | "data": {
620 | "text/plain": [
621 | "['bedrooms_per_room',\n",
622 | " 'households',\n",
623 | " 'housing_median_age',\n",
624 | " 'latitude',\n",
625 | " 'longitude',\n",
626 | " 'median_income',\n",
627 | " 'ocean_proximity=<1H OCEAN',\n",
628 | " 'ocean_proximity=INLAND',\n",
629 | " 'ocean_proximity=ISLAND',\n",
630 | " 'ocean_proximity=NEAR BAY',\n",
631 | " 'ocean_proximity=NEAR OCEAN',\n",
632 | " 'population',\n",
633 | " 'population_per_household',\n",
634 | " 'rooms_per_household',\n",
635 | " 'total_bedrooms',\n",
636 | " 'total_rooms']"
637 | ]
638 | },
639 | "execution_count": 42,
640 | "metadata": {},
641 | "output_type": "execute_result"
642 | }
643 | ],
644 | "source": [
645 | "X_train, feature_names = one_hot_encoding(df_train)\n",
646 | "feature_names"
647 | ]
648 | },
649 | {
650 | "cell_type": "code",
651 | "execution_count": 44,
652 | "id": "3486c001",
653 | "metadata": {},
654 | "outputs": [],
655 | "source": [
656 | "X_val = one_hot_encoding(df_val)[0]\n",
657 | "X_test = one_hot_encoding(df_test)[0]"
658 | ]
659 | },
660 | {
661 | "cell_type": "markdown",
662 | "id": "821a8a8a",
663 | "metadata": {},
664 | "source": [
665 | "## Logistic Regression\n",
666 | "### Question 4\n",
667 | "\n",
668 | "* Now let's train a logistic regression\n",
669 | "* Remember that we have one categorical variable `ocean_proximity` in the data. Include it using one-hot encoding.\n",
670 | "* Fit the model on the training dataset.\n",
671 | " - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:\n",
672 | " - `model = LogisticRegression(solver=\"liblinear\", C=1.0, max_iter=1000, random_state=42)`\n",
673 | "* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.\n",
674 | "\n",
675 | "Options:\n",
676 | "- 0.60\n",
677 | "- 0.72\n",
678 | "- **0.84**\n",
679 | "- 0.95"
680 | ]
681 | },
682 | {
683 | "cell_type": "code",
684 | "execution_count": 18,
685 | "id": "61c0d09a",
686 | "metadata": {},
687 | "outputs": [
688 | {
689 | "data": {
690 | "text/html": [
691 | "LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. "
692 | ],
693 | "text/plain": [
694 | "LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')"
695 | ]
696 | },
697 | "execution_count": 18,
698 | "metadata": {},
699 | "output_type": "execute_result"
700 | }
701 | ],
702 | "source": [
703 | "model = LogisticRegression(solver=\"liblinear\", C=1.0, max_iter=1000, random_state=42)\n",
704 | "model.fit(X_train, y_train)"
705 | ]
706 | },
707 | {
708 | "cell_type": "code",
709 | "execution_count": 19,
710 | "id": "8da5de07",
711 | "metadata": {},
712 | "outputs": [
713 | {
714 | "data": {
715 | "text/plain": [
716 | "-0.08087359748906243"
717 | ]
718 | },
719 | "execution_count": 19,
720 | "metadata": {},
721 | "output_type": "execute_result"
722 | }
723 | ],
724 | "source": [
725 | "model.intercept_[0]"
726 | ]
727 | },
728 | {
729 | "cell_type": "code",
730 | "execution_count": 20,
731 | "id": "24a587a2",
732 | "metadata": {},
733 | "outputs": [
734 | {
735 | "data": {
736 | "text/plain": [
737 | "array([ 0.171, 0.004, 0.036, 0.116, 0.087, 1.209, 0.471, -1.702,\n",
738 | " 0.018, 0.295, 0.838, -0.002, 0.01 , -0.014, 0.002, -0. ])"
739 | ]
740 | },
741 | "execution_count": 20,
742 | "metadata": {},
743 | "output_type": "execute_result"
744 | }
745 | ],
746 | "source": [
747 | "model.coef_[0].round(3)"
748 | ]
749 | },
750 | {
751 | "cell_type": "code",
752 | "execution_count": 21,
753 | "id": "d565c669",
754 | "metadata": {},
755 | "outputs": [
756 | {
757 | "data": {
758 | "text/plain": [
759 | "0.84"
760 | ]
761 | },
762 | "execution_count": 21,
763 | "metadata": {},
764 | "output_type": "execute_result"
765 | }
766 | ],
767 | "source": [
768 | "y_pred = model.predict(X_val)\n",
769 | "global_acc = (y_val == y_pred).mean()\n",
770 | "global_acc.round(2)"
771 | ]
772 | },
773 | {
774 | "cell_type": "markdown",
775 | "id": "8d8dd61b",
776 | "metadata": {},
777 | "source": [
778 | "## Feature elimination\n",
779 | "### Question 5 \n",
780 | "\n",
781 | "* Let's find the least useful feature using the *feature elimination* technique.\n",
782 | "* Train a model with all these features (using the same parameters as in Q4).\n",
783 | "* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.\n",
784 | "* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. \n",
785 | "* Which of following feature has the smallest difference? \n",
786 | " * `total_rooms`\n",
787 | " * **`total_bedrooms`**\n",
788 | " * `population`\n",
789 | " * `households`\n",
790 | "\n",
791 | "> **note**: the difference doesn't have to be positive"
792 | ]
793 | },
794 | {
795 | "cell_type": "code",
796 | "execution_count": 22,
797 | "id": "90866ccb",
798 | "metadata": {},
799 | "outputs": [
800 | {
801 | "data": {
802 | "text/plain": [
803 | "{'bedrooms_per_room': 0.171,\n",
804 | " 'households': 0.004,\n",
805 | " 'housing_median_age': 0.036,\n",
806 | " 'latitude': 0.116,\n",
807 | " 'longitude': 0.087,\n",
808 | " 'median_income': 1.209,\n",
809 | " 'ocean_proximity=<1H OCEAN': 0.471,\n",
810 | " 'ocean_proximity=INLAND': -1.702,\n",
811 | " 'ocean_proximity=ISLAND': 0.018,\n",
812 | " 'ocean_proximity=NEAR BAY': 0.295,\n",
813 | " 'ocean_proximity=NEAR OCEAN': 0.838,\n",
814 | " 'population': -0.002,\n",
815 | " 'population_per_household': 0.01,\n",
816 | " 'rooms_per_household': -0.014,\n",
817 | " 'total_bedrooms': 0.002,\n",
818 | " 'total_rooms': -0.0}"
819 | ]
820 | },
821 | "execution_count": 22,
822 | "metadata": {},
823 | "output_type": "execute_result"
824 | }
825 | ],
826 | "source": [
827 | "dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))"
828 | ]
829 | },
830 | {
831 | "cell_type": "code",
832 | "execution_count": 46,
833 | "id": "a11101b3",
834 | "metadata": {},
835 | "outputs": [
836 | {
837 | "name": "stdout",
838 | "output_type": "stream",
839 | "text": [
840 | "0.0014534883720930258\n",
841 | "0.00024224806201555982\n",
842 | "0.009689922480620172\n",
843 | "0.002664728682170492\n"
844 | ]
845 | }
846 | ],
847 | "source": [
848 | "for f in ['total_rooms', 'total_bedrooms', 'population', 'households']:\n",
849 | " features = df_train.columns.tolist()\n",
850 | " features.remove(f)\n",
851 | " # Prepare\n",
852 | " x_train = one_hot_encoding(df_train[features])[0]\n",
853 | " # Train \n",
854 | " model = LogisticRegression(solver=\"liblinear\", C=1.0, max_iter=1000, random_state=42)\n",
855 | " model.fit(x_train, y_train)\n",
856 | " # Validate\n",
857 | " x_val = one_hot_encoding(df_val[features])[0]\n",
858 | "\n",
859 | " y_pred = model.predict(x_val)\n",
860 | " acc = (y_val == y_pred).mean()\n",
861 | " print(abs(global_acc - acc))"
862 | ]
863 | },
864 | {
865 | "cell_type": "markdown",
866 | "id": "87e76171",
867 | "metadata": {},
868 | "source": [
869 | "### Question 6\n",
870 | "\n",
871 | "* For this question, we'll see how to use a linear regression model from Scikit-Learn\n",
872 | "* We'll need to use the original column `'median_house_value'`. Apply the logarithmic transformation to this column.\n",
873 | "* Fit the Ridge regression model (`model = Ridge(alpha=a, solver=\"sag\", random_state=42)`) on the training data.\n",
874 | "* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`\n",
875 | "* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.\n",
876 | "\n",
877 | "If there are multiple options, select the smallest `alpha`.\n",
878 | "\n",
879 | "Options:\n",
880 | "- **0**\n",
881 | "- 0.01\n",
882 | "- 0.1\n",
883 | "- 1\n",
884 | "- 10"
885 | ]
886 | },
887 | {
888 | "cell_type": "code",
889 | "execution_count": 56,
890 | "id": "b2e79bce",
891 | "metadata": {},
892 | "outputs": [],
893 | "source": [
894 | "df_train, df_val, df_test, y_train, y_val, y_test = set_up_val_framework()"
895 | ]
896 | },
897 | {
898 | "cell_type": "code",
899 | "execution_count": 57,
900 | "id": "2da4569e",
901 | "metadata": {},
902 | "outputs": [],
903 | "source": [
904 | "y_train = np.log(y_train)\n",
905 | "y_val = np.log(y_val)\n",
906 | "y_test = np.log(y_test)"
907 | ]
908 | },
909 | {
910 | "cell_type": "code",
911 | "execution_count": 58,
912 | "id": "f50158a9",
913 | "metadata": {},
914 | "outputs": [],
915 | "source": [
916 | "def rmse(predictions, targets):\n",
917 | " return np.sqrt(((predictions - targets) ** 2).mean())"
918 | ]
919 | },
920 | {
921 | "cell_type": "code",
922 | "execution_count": 62,
923 | "id": "99c543e9",
924 | "metadata": {},
925 | "outputs": [
926 | {
927 | "name": "stdout",
928 | "output_type": "stream",
929 | "text": [
930 | "0.524\n",
931 | "0.524\n",
932 | "0.524\n",
933 | "0.524\n",
934 | "0.524\n"
935 | ]
936 | }
937 | ],
938 | "source": [
939 | "for a in [0, 0.01, 0.1, 1, 10]:\n",
940 | " model = Ridge(alpha=a, solver=\"sag\", random_state=42)\n",
941 | " \n",
942 | " x_train = one_hot_encoding(df_train)[0]\n",
943 | " \n",
944 | " model.fit(x_train, y_train)\n",
945 | " \n",
946 | " x_val = one_hot_encoding(df_val)[0]\n",
947 | " \n",
948 | " y_pred = model.predict(x_val)\n",
949 | " \n",
950 | " print(rmse(y_val, y_pred).round(3))"
951 | ]
952 | }
953 | ],
954 | "metadata": {
955 | "kernelspec": {
956 | "display_name": "Python 3 (ipykernel)",
957 | "language": "python",
958 | "name": "python3"
959 | },
960 | "language_info": {
961 | "codemirror_mode": {
962 | "name": "ipython",
963 | "version": 3
964 | },
965 | "file_extension": ".py",
966 | "mimetype": "text/x-python",
967 | "name": "python",
968 | "nbconvert_exporter": "python",
969 | "pygments_lexer": "ipython3",
970 | "version": "3.9.13"
971 | },
972 | "toc": {
973 | "base_numbering": 1,
974 | "nav_menu": {},
975 | "number_sections": true,
976 | "sideBar": true,
977 | "skip_h1_title": true,
978 | "title_cell": "Table of Contents",
979 | "title_sidebar": "Contents",
980 | "toc_cell": false,
981 | "toc_position": {},
982 | "toc_section_display": true,
983 | "toc_window_display": true
984 | }
985 | },
986 | "nbformat": 4,
987 | "nbformat_minor": 5
988 | }
989 |
--------------------------------------------------------------------------------
/03-classification/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning for Classification : Homework & Dataset
2 |
3 | Session Overview :
4 |
5 | * 3.1 Churn prediction project
6 | * 3.2 Data preparation
7 | * 3.3 Setting up the validation framework
8 | * 3.4 EDA
9 | * 3.5 Feature importance: Churn rate and risk ratio
10 | * 3.6 Feature importance: Mutual information
11 | * 3.7 Feature importance: Correlation
12 | * 3.8 One-hot encoding
13 | * 3.9 Logistic regression
14 | * 3.10 Training logistic regression with Scikit-Learn
15 | * 3.11 Model interpretation
16 | * 3.12 Using the model
17 | * 3.13 Summary
18 |
19 | ---------
20 |
21 | ## Session #3 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/03-classification/homework.md)
22 |
23 | ### Dataset
24 |
25 | In this homework, we will use the California Housing Prices data from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).
26 |
27 | We'll keep working with the `'median_house_value'` variable, and we'll transform it to a classification task.
28 |
29 |
30 | ### Features
31 |
32 | For the rest of the homework, you'll need to use only these columns:
33 |
34 | * `'latitude'`,
35 | * `'longitude'`,
36 | * `'housing_median_age'`,
37 | * `'total_rooms'`,
38 | * `'total_bedrooms'`,
39 | * `'population'`,
40 | * `'households'`,
41 | * `'median_income'`,
42 | * `'median_house_value'`
43 | * `'ocean_proximity'`,
44 |
45 | ### Data preparation
46 |
47 | * Select only the features from above and fill in the missing values with 0.
48 | * Create a new column `rooms_per_household` by dividing the column `total_rooms` by the column `households`.
49 | * Create a new column `bedrooms_per_room` by dividing the column `total_bedrooms` by the column `total_rooms`.
50 | * Create a new column `population_per_household` by dividing the column `population` by the column `households`.
51 |
52 |
53 | ### Question 1
54 |
55 | What is the most frequent observation (mode) for the column `ocean_proximity`?
56 |
57 | Options:
58 | * `NEAR BAY`
59 | * `<1H OCEAN`
60 | * `INLAND`
61 | * `NEAR OCEAN`
62 |
63 |
64 | ## Split the data
65 |
66 | * Split your data in train/val/test sets, with 60%/20%/20% distribution.
67 | * Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
68 | * Make sure that the target value (`median_house_value`) is not in your dataframe.
69 |
70 | ### Question 2
71 |
72 | * Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
73 | - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
74 | * What are the two features that have the biggest correlation in this dataset?
75 |
76 | Options:
77 | * `total_bedrooms` and `households`
78 | * `total_bedrooms` and `total_rooms`
79 | * `population` and `households`
80 | * `population_per_household` and `total_rooms`
81 |
82 |
83 | ### Make `median_house_value` binary
84 |
85 | * We need to turn the `median_house_value` variable from numeric into binary.
86 | * Let's create a variable `above_average` which is `1` if the `median_house_value` is above its mean value and `0` otherwise.
87 |
88 | ### Question 3
89 |
90 | * Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.
91 | * What is the value of mutual information?
92 | * Round it to 2 decimal digits using `round(score, 2)`
93 |
94 | Options:
95 | - 0.263
96 | - 0.00001
97 | - 0.101
98 | - 0.15555
99 |
100 |
101 | ### Question 4
102 |
103 | * Now let's train a logistic regression
104 | * Remember that we have one categorical variable `ocean_proximity` in the data. Include it using one-hot encoding.
105 | * Fit the model on the training dataset.
106 | - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
107 | - `model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)`
108 | * Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
109 |
110 | Options:
111 | - 0.60
112 | - 0.72
113 | - 0.84
114 | - 0.95
115 |
116 |
117 | ### Question 5
118 |
119 | * Let's find the least useful feature using the *feature elimination* technique.
120 | * Train a model with all these features (using the same parameters as in Q4).
121 | * Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
122 | * For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
123 | * Which of the following features has the smallest difference?
124 | * `total_rooms`
125 | * `total_bedrooms`
126 | * `population`
127 | * `households`
128 |
129 | > **note**: the difference doesn't have to be positive
130 |
131 |
132 | ### Question 6
133 |
134 | * For this question, we'll see how to use a linear regression model from Scikit-Learn
135 | * We'll need to use the original column `'median_house_value'`. Apply the logarithmic transformation to this column.
136 | * Fit the Ridge regression model (`model = Ridge(alpha=a, solver="sag", random_state=42)`) on the training data.
137 | * This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
138 | * Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.
139 |
140 | If there are multiple options, select the smallest `alpha`.
141 |
142 | Options:
143 | - 0
144 | - 0.01
145 | - 0.1
146 | - 1
147 | - 10
148 |
--------------------------------------------------------------------------------
/04-evaluation/AER_credit_card_data.csv:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:93a46ce44e0542e5094e9675440fcc36e9a018c2ec879e868726f717ebd47c21
3 | size 73250
4 |
--------------------------------------------------------------------------------
/04-evaluation/README.md:
--------------------------------------------------------------------------------
1 | # Evaluation Metrics for Classification : Homework & Dataset
2 |
3 | Session Overview :
4 |
5 | * 4.1 Evaluation metrics: session overview
6 | * 4.2 Accuracy and dummy model
7 | * 4.3 Confusion table
8 | * 4.4 Precision and Recall
9 | * 4.5 ROC Curves
10 | * 4.6 ROC AUC
11 | * 4.7 Cross-Validation
12 | * 4.8 Summary
13 | * 4.9 Explore more
14 | * 4.10 Homework
15 |
16 | ---------
17 |
18 | ## Session #4 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/04-evaluation/homework.md)
19 |
20 | ### Dataset
21 |
22 | In this homework, we will use Credit Card Data from the book "Econometric Analysis" - [Link](https://raw.githubusercontent.com/Ksyula/ML_Engineering/master/04-evaluation/AER_credit_card_data.csv)
23 |
24 | The goal of this homework is to inspect the output of different evaluation metrics by creating a classification model (target column `card`).
25 |
26 | ## Preparation
27 |
28 | * Create the target variable by mapping `yes` to 1 and `no` to 0.
29 | * Split the dataset into 3 parts: train/validation/test with 60%/20%/20% distribution. Use the `train_test_split` function for that with `random_state=1`.
30 |
31 |
32 | ## Question 1
33 |
34 | ROC AUC could also be used to evaluate feature importance of numerical variables.
35 |
36 | Let's do that
37 |
38 | * For each numerical variable, use it as a score and compute the AUC against the `card` variable (see the sketch after the options).
39 | * Use the training dataset for that.
40 |
41 | If your AUC is < 0.5, invert this variable by putting "-" in front
42 |
43 | (e.g. `-df_train['expenditure']`)
44 |
45 | AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating this variable - then the negative correlation becomes positive.
46 |
47 | Which numerical variable (among the following 4) has the highest AUC?
48 |
49 | - `reports`
50 | - `dependents`
51 | - `active`
52 | - `share`
53 |
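A sketch of this check, assuming `df_train` comes from the preparation step above with `card` already mapped to 1/0:

```python
from sklearn.metrics import roc_auc_score

for col in ["reports", "dependents", "active", "share"]:
    auc = roc_auc_score(df_train.card, df_train[col])
    if auc < 0.5:                                   # negatively correlated: flip the sign
        auc = roc_auc_score(df_train.card, -df_train[col])
    print(col, round(auc, 3))
```
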
54 |
55 | ## Training the model
56 |
57 | From now on, use these columns only:
58 |
59 | ```
60 | ["reports", "age", "income", "share", "expenditure", "dependents", "months", "majorcards", "active", "owner", "selfemp"]
61 | ```
62 |
63 | Apply one-hot-encoding using `DictVectorizer` and train the logistic regression with these parameters:
64 |
65 | ```
66 | LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
67 | ```
68 |
69 |
70 | ## Question 2
71 |
72 | What's the AUC of this model on the validation dataset? (round to 3 digits)
73 |
74 | - 0.615
75 | - 0.515
76 | - 0.715
77 | - 0.995
78 |
79 |
80 | ## Question 3
81 |
82 | Now let's compute precision and recall for our model.
83 |
84 | * Evaluate the model on all thresholds from 0.0 to 1.0 with step 0.01
85 | * For each threshold, compute precision and recall
86 | * Plot them
87 |
88 |
89 | At which threshold do the precision and recall curves intersect?
90 |
91 | * 0.1
92 | * 0.3
93 | * 0.6
94 | * 0.8
95 |
96 |
97 | ## Question 4
98 |
99 | Precision and recall are conflicting - when one grows, the other goes down. That's why they are often combined into the F1 score - a metric that takes both into account.
100 |
101 | This is the formula for computing F1:
102 |
103 | F1 = 2 * P * R / (P + R)
104 |
105 | Where P is precision and R is recall.
106 |
107 | Let's compute F1 for all thresholds from 0.0 to 1.0 with increment 0.01 (a sketch follows after the options).
108 |
109 | At which threshold is F1 maximal?
110 |
111 | - 0.1
112 | - 0.4
113 | - 0.6
114 | - 0.7
115 |
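A sketch covering Questions 3 and 4 together, assuming `model`, `X_val` and `y_val` come from the training step above (the names are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_pred_proba = model.predict_proba(X_val)[:, 1]
thresholds = np.arange(0.0, 1.01, 0.01)

f1_scores = []
for t in thresholds:
    pred = (y_pred_proba >= t).astype(int)
    p = precision_score(y_val, pred, zero_division=0)
    r = recall_score(y_val, pred, zero_division=0)
    f1_scores.append(2 * p * r / (p + r) if (p + r) > 0 else 0.0)

print(round(thresholds[int(np.argmax(f1_scores))], 2))   # threshold where F1 is maximal
```

Plotting the collected precision and recall values against `thresholds` shows where the two curves intersect (Question 3).
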
116 |
117 | ## Question 5
118 |
119 | Use the `KFold` class from Scikit-Learn to evaluate our model on 5 different folds:
120 |
121 | ```
122 | KFold(n_splits=5, shuffle=True, random_state=1)
123 | ```
124 |
125 | * Iterate over different folds of `df_full_train`
126 | * Split the data into train and validation
127 | * Train the model on train with these parameters: `LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`
128 | * Use AUC to evaluate the model on validation (a sketch follows after the options)
129 |
130 |
131 | How large is the standard deviation of the AUC scores across the different folds?
132 |
133 | - 0.003
134 | - 0.014
135 | - 0.09
136 | - 0.24
137 |
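A sketch of the cross-validation loop; `df_full_train` and the `columns` list from the training section above are assumed:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
aucs = []
for train_idx, val_idx in kfold.split(df_full_train):
    df_tr, df_va = df_full_train.iloc[train_idx], df_full_train.iloc[val_idx]

    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr[columns].to_dict(orient="records"))
    X_va = dv.transform(df_va[columns].to_dict(orient="records"))

    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000)
    model.fit(X_tr, df_tr.card.values)
    aucs.append(roc_auc_score(df_va.card.values, model.predict_proba(X_va)[:, 1]))

print(round(np.mean(aucs), 3), round(np.std(aucs), 3))
```

The same loop, with `C` taken from `[0.01, 0.1, 1, 10]`, answers Question 6.
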
138 |
139 | ## Question 6
140 |
141 | Now let's use 5-Fold cross-validation to find the best parameter C
142 |
143 | * Iterate over the following C values: `[0.01, 0.1, 1, 10]`
144 | * Initialize `KFold` with the same parameters as previously
145 | * Use these parametes for the model: `LogisticRegression(solver='liblinear', C=C, max_iter=1000)`
146 | * Compute the mean score as well as the std (round the mean and std to 3 decimal digits)
147 |
148 |
149 | Which C leads to the best mean score?
150 |
151 | - 0.01
152 | - 0.1
153 | - 1
154 | - 10
155 |
156 | If you have ties, select the score with the lowest std. If you still have ties, select the smallest C
157 |
--------------------------------------------------------------------------------
/05-deployment/External_client.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "2df99f83",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import requests"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 2,
16 | "id": "4bd6c98a",
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "url = \"http://0.0.0.0:9696/predict\""
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 3,
26 | "id": "613d4f2f",
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "customer = {\n",
31 | " \"reports\": 0,\n",
32 | " \"share\": 0.245, \n",
33 | " \"expenditure\": 3.438, \n",
34 | " \"owner\": \"yes\"\n",
35 | "}"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 4,
41 | "id": "058ae893",
42 | "metadata": {},
43 | "outputs": [
44 | {
45 | "data": {
46 | "text/plain": [
47 | ""
48 | ]
49 | },
50 | "execution_count": 4,
51 | "metadata": {},
52 | "output_type": "execute_result"
53 | }
54 | ],
55 | "source": [
56 | "requests.post(url, json=customer)"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 5,
62 | "id": "42bb8c4f",
63 | "metadata": {},
64 | "outputs": [
65 | {
66 | "data": {
67 | "text/plain": [
68 | "{'card': True, 'probability': 0.7692649226628628}"
69 | ]
70 | },
71 | "execution_count": 5,
72 | "metadata": {},
73 | "output_type": "execute_result"
74 | }
75 | ],
76 | "source": [
77 | "requests.post(url, json=customer).json()"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": null,
83 | "id": "c9e4151d",
84 | "metadata": {},
85 | "outputs": [],
86 | "source": []
87 | }
88 | ],
89 | "metadata": {
90 | "kernelspec": {
91 | "display_name": "Python 3 (ipykernel)",
92 | "language": "python",
93 | "name": "python3"
94 | },
95 | "language_info": {
96 | "codemirror_mode": {
97 | "name": "ipython",
98 | "version": 3
99 | },
100 | "file_extension": ".py",
101 | "mimetype": "text/x-python",
102 | "name": "python",
103 | "nbconvert_exporter": "python",
104 | "pygments_lexer": "ipython3",
105 | "version": "3.9.13"
106 | },
107 | "toc": {
108 | "base_numbering": 1,
109 | "nav_menu": {},
110 | "number_sections": true,
111 | "sideBar": true,
112 | "skip_h1_title": true,
113 | "title_cell": "Table of Contents",
114 | "title_sidebar": "Contents",
115 | "toc_cell": false,
116 | "toc_position": {},
117 | "toc_section_display": true,
118 | "toc_window_display": false
119 | }
120 | },
121 | "nbformat": 4,
122 | "nbformat_minor": 5
123 | }
124 |
--------------------------------------------------------------------------------
/05-deployment/README.md:
--------------------------------------------------------------------------------
1 | # Deploying Machine Learning Models
2 |
3 | Session Overview :
4 |
5 | * Intro / Session overview
6 | * Saving and loading the model
7 | * Web services: introduction to Flask
8 | * Serving the churn model with Flask
9 | * Python virtual environment: Pipenv
10 | * Environment management: Docker
11 | * Deployment to the cloud: AWS Elastic Beanstalk (optional)
12 | * Summary
13 | * Explore more
14 | * Homework
15 |
16 | ---------
17 |
18 | ## Session #5 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/05-deployment/homework.md)
19 |
20 | ### Dataset
21 |
22 | In this homework, we will use Credit Card Data from the book "Econometric Analysis" - [Link](https://raw.githubusercontent.com/Ksyula/ML_Engineering/master/04-evaluation/AER_credit_card_data.csv)
23 |
24 | ## Question 1
25 |
26 | * Install Pipenv
27 | * What's the version of pipenv you installed?
28 | * Use `--version` to find out
29 |
30 | **2022.10.4**
31 |
32 | ## Question 2
33 |
34 | * Use Pipenv to install Scikit-Learn version 1.0.2
35 | * What's the first hash for scikit-learn you get in Pipfile.lock?
36 |
37 | Note: you should create an empty folder for the homework
38 | and do it there.
39 |
40 | **sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b**
41 |
42 |
43 | ## Models
44 |
45 | We've prepared a dictionary vectorizer and a model.
46 |
47 | They were trained (roughly) using this code:
48 |
49 | ```python
50 | features = ['reports', 'share', 'expenditure', 'owner']
51 | dicts = df[features].to_dict(orient='records')
52 |
53 | dv = DictVectorizer(sparse=False)
54 | X = dv.fit_transform(dicts)
55 |
56 | model = LogisticRegression(solver='liblinear').fit(X, y)
57 | ```
58 |
59 | > **Note**: You don't need to train the model. This code is just for your reference.
60 |
61 | And then saved with Pickle. Download them:
62 |
63 | * [DictVectorizer](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/05-deployment/homework/dv.bin?raw=true)
64 | * [LogisticRegression](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/05-deployment/homework/model1.bin?raw=true)
65 |
66 | With `wget`:
67 |
68 | ```bash
69 | PREFIX=https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/course-zoomcamp/cohorts/2022/05-deployment/homework
70 | wget $PREFIX/model1.bin
71 | wget $PREFIX/dv.bin
72 | ```
73 |
74 |
75 | ## Question 3
76 |
77 | Let's use these models!
78 |
79 | * Write a script for loading these models with pickle (a sketch is shown below)
80 | * Score this client:
81 |
82 | ```json
83 | {"reports": 0, "share": 0.001694, "expenditure": 0.12, "owner": "yes"}
84 | ```
85 |
86 | What's the probability that this client will get a credit card?
87 |
88 | * **0.148**
89 | * 0.391
90 | * 0.601
91 | * 0.993
92 |
93 | If you're getting errors when unpickling the files, check their checksum:
94 |
95 | ```bash
96 | $ md5sum model1.bin dv.bin
97 | 3f57f3ebfdf57a9e1368dcd0f28a4a14 model1.bin
98 | 6b7cded86a52af7e81859647fa3a5c2e dv.bin
99 | ```
100 |
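A minimal loading-and-scoring sketch, assuming the downloaded files sit in the current directory:

```python
import pickle

with open("dv.bin", "rb") as f_in:       # DictVectorizer
    dv = pickle.load(f_in)
with open("model1.bin", "rb") as f_in:   # LogisticRegression model
    model = pickle.load(f_in)

client = {"reports": 0, "share": 0.001694, "expenditure": 0.12, "owner": "yes"}
X = dv.transform([client])
print(round(model.predict_proba(X)[0, 1], 3))   # probability of getting a credit card
```
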
101 |
102 | ## Question 4
103 |
104 | Now let's serve this model as a web service
105 |
106 | * Install Flask and gunicorn (or waitress, if you're on Windows)
107 | * Write Flask code for serving the model (a sketch follows after the options)
108 | * Now score this client using `requests`:
109 |
110 | ```python
111 | url = "YOUR_URL"
112 | client = {"reports": 0, "share": 0.245, "expenditure": 3.438, "owner": "yes"}
113 | requests.post(url, json=client).json()
114 | ```
115 |
116 | What's the probability that this client will get a credit card?
117 |
118 | * 0.274
119 | * 0.484
120 | * 0.698
121 | * **0.928**
122 |
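A minimal sketch of such a Flask service (the actual `predict.py` in `05-deployment/app` may differ in details):

```python
import pickle
from flask import Flask, jsonify, request

with open("dv.bin", "rb") as f_in:
    dv = pickle.load(f_in)
with open("model1.bin", "rb") as f_in:
    model = pickle.load(f_in)

app = Flask("credit-card")

@app.route("/predict", methods=["POST"])
def predict():
    client = request.get_json()
    X = dv.transform([client])
    probability = model.predict_proba(X)[0, 1]
    return jsonify({"card": bool(probability >= 0.5), "probability": float(probability)})

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=9696)
```

Run it with gunicorn (or waitress on Windows) bound to port 9696, then score the client with the `requests` snippet above.
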
123 |
124 | ## Docker
125 |
126 | Install [Docker](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/05-deployment/06-docker.md). We will use it for the next two questions.
127 |
128 | For these questions, we prepared a base image: `svizor/zoomcamp-model:3.9.12-slim`.
129 | You'll need to use it (see Question 5 for an example).
130 |
131 | This image is based on `python:3.9.12-slim` and has a logistic regression model
132 | (a different one) as well as a dictionary vectorizer inside.
133 |
134 | This is what the Dockerfile for this image looks like:
135 |
136 | ```docker
137 | FROM python:3.9.12-slim
138 | WORKDIR /app
139 | COPY ["model2.bin", "dv.bin", "./"]
140 | ```
141 |
142 | We already built it and then pushed it to [`svizor/zoomcamp-model:3.9.12-slim`](https://hub.docker.com/r/svizor/zoomcamp-model).
143 |
144 | > **Note**: You don't need to build this docker image, it's just for your reference.
145 |
146 |
147 | ## Question 5
148 |
149 | Download the base image `svizor/zoomcamp-model:3.9.12-slim`. You can do that with the [docker pull](https://docs.docker.com/engine/reference/commandline/pull/) command.
150 |
151 | So what's the size of this base image?
152 |
153 | * 15 Mb
154 | * **125 Mb**
155 | * 275 Mb
156 | * 415 Mb
157 |
158 | You can get this information when running `docker images` - it'll be in the "SIZE" column.
159 |
160 |
161 | ## Dockerfile
162 |
163 | Now create your own Dockerfile based on the image we prepared.
164 |
165 | It should start like this:
166 |
167 | ```docker
168 | FROM svizor/zoomcamp-model:3.9.12-slim
169 | # add your stuff here
170 | ```
171 |
172 | Now complete it:
173 |
174 | * Install all the dependencies from the Pipenv file
175 | * Copy your Flask script
176 | * Run it with Gunicorn
177 |
178 | After that, you can build your docker image.
179 |
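For example, from the directory containing your Dockerfile (the image tag `card-prediction` is just a placeholder):

```bash
docker build -t card-prediction .
```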
180 |
181 | ## Question 6
182 |
183 | Let's run your docker container!
184 |
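For example, publishing the port the service listens on (again using the placeholder tag `card-prediction` from the build step):

```bash
docker run -it --rm -p 9696:9696 card-prediction
```
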
185 | After running it, score this client once again:
186 |
187 | ```python
188 | url = "YOUR_URL"
189 | client = {"reports": 0, "share": 0.245, "expenditure": 3.438, "owner": "yes"}
190 | requests.post(url, json=client).json()
191 | ```
192 |
193 | What's the probability that this client will get a credit card now?
194 |
195 | * 0.289
196 | * 0.502
197 | * **0.769**
198 | * 0.972
--------------------------------------------------------------------------------
/05-deployment/app/Dockerfile:
--------------------------------------------------------------------------------
1 | # Start from the prepared base image (built on python:3.9.12-slim, so it stays small)
2 | FROM svizor/zoomcamp-model:3.9.12-slim
3 |
4 | # Install pipenv library in Docker
5 | RUN pip install pipenv
6 |
7 | # we have created a directory in Docker named app and we're using it as work directory
8 | WORKDIR /app
9 |
10 | # Copy the Pipenv files into our working directory
11 | COPY ["Pipfile", "Pipfile.lock", "./"]
12 |
13 | # Install the locked pipenv dependencies into the system Python (--deploy fails if Pipfile.lock is out of date)
14 | RUN pipenv install --deploy --system
15 |
16 | # Copy the prediction script into the working directory (the model and vectorizer are already in the base image)
17 | COPY ["predict.py", "./"]
18 |
19 | # Expose port 9696 so the service is reachable from outside the container
20 | EXPOSE 9696
21 |
22 | # When the container starts, serve the prediction app with gunicorn
23 | ENTRYPOINT ["gunicorn", "--bind", "0.0.0.0:9696", "predict:app"]
--------------------------------------------------------------------------------
/05-deployment/app/Pipfile:
--------------------------------------------------------------------------------
1 | [[source]]
2 | url = "https://pypi.org/simple"
3 | verify_ssl = true
4 | name = "pypi"
5 |
6 | [packages]
7 | numpy = "==1.23.3"
8 | scikit-learn = "==1.0.2"
9 | flask = "==2.2.2"
10 | gunicorn = "==20.1.0"
11 |
12 | [dev-packages]
13 | awsebcli = "==3.20.3"
14 |
15 |
16 | [requires]
17 | python_version = "3.9"
18 |
--------------------------------------------------------------------------------
/05-deployment/app/Pipfile.lock:
--------------------------------------------------------------------------------
1 | {
2 | "_meta": {
3 | "hash": {
4 | "sha256": "24181752269acff8f5d7c23ac1c6964e70b0fce51dbe74392cb23deed544d693"
5 | },
6 | "pipfile-spec": 6,
7 | "requires": {
8 | "python_version": "3.9"
9 | },
10 | "sources": [
11 | {
12 | "name": "pypi",
13 | "url": "https://pypi.org/simple",
14 | "verify_ssl": true
15 | }
16 | ]
17 | },
18 | "default": {
19 | "click": {
20 | "hashes": [
21 | "sha256:7682dc8afb30297001674575ea00d1814d808d6a36af415a82bd481d37ba7b8e",
22 | "sha256:bb4d8133cb15a609f44e8213d9b391b0809795062913b383c62be0ee95b1db48"
23 | ],
24 | "markers": "python_version >= '3.7'",
25 | "version": "==8.1.3"
26 | },
27 | "flask": {
28 | "hashes": [
29 | "sha256:642c450d19c4ad482f96729bd2a8f6d32554aa1e231f4f6b4e7e5264b16cca2b",
30 | "sha256:b9c46cc36662a7949f34b52d8ec7bb59c0d74ba08ba6cb9ce9adc1d8676d9526"
31 | ],
32 | "index": "pypi",
33 | "version": "==2.2.2"
34 | },
35 | "gunicorn": {
36 | "hashes": [
37 | "sha256:9dcc4547dbb1cb284accfb15ab5667a0e5d1881cc443e0677b4882a4067a807e",
38 | "sha256:e0a968b5ba15f8a328fdfd7ab1fcb5af4470c28aaf7e55df02a99bc13138e6e8"
39 | ],
40 | "index": "pypi",
41 | "version": "==20.1.0"
42 | },
43 | "importlib-metadata": {
44 | "hashes": [
45 | "sha256:da31db32b304314d044d3c12c79bd59e307889b287ad12ff387b3500835fc2ab",
46 | "sha256:ddb0e35065e8938f867ed4928d0ae5bf2a53b7773871bfe6bcc7e4fcdc7dea43"
47 | ],
48 | "markers": "python_version < '3.10'",
49 | "version": "==5.0.0"
50 | },
51 | "itsdangerous": {
52 | "hashes": [
53 | "sha256:2c2349112351b88699d8d4b6b075022c0808887cb7ad10069318a8b0bc88db44",
54 | "sha256:5dbbc68b317e5e42f327f9021763545dc3fc3bfe22e6deb96aaf1fc38874156a"
55 | ],
56 | "markers": "python_version >= '3.7'",
57 | "version": "==2.1.2"
58 | },
59 | "jinja2": {
60 | "hashes": [
61 | "sha256:31351a702a408a9e7595a8fc6150fc3f43bb6bf7e319770cbc0db9df9437e852",
62 | "sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61"
63 | ],
64 | "markers": "python_version >= '3.7'",
65 | "version": "==3.1.2"
66 | },
67 | "joblib": {
68 | "hashes": [
69 | "sha256:091138ed78f800342968c523bdde947e7a305b8594b910a0fea2ab83c3c6d385",
70 | "sha256:e1cee4a79e4af22881164f218d4311f60074197fb707e082e803b61f6d137018"
71 | ],
72 | "markers": "python_version >= '3.7'",
73 | "version": "==1.2.0"
74 | },
75 | "markupsafe": {
76 | "hashes": [
77 | "sha256:0212a68688482dc52b2d45013df70d169f542b7394fc744c02a57374a4207003",
78 | "sha256:089cf3dbf0cd6c100f02945abeb18484bd1ee57a079aefd52cffd17fba910b88",
79 | "sha256:10c1bfff05d95783da83491be968e8fe789263689c02724e0c691933c52994f5",
80 | "sha256:33b74d289bd2f5e527beadcaa3f401e0df0a89927c1559c8566c066fa4248ab7",
81 | "sha256:3799351e2336dc91ea70b034983ee71cf2f9533cdff7c14c90ea126bfd95d65a",
82 | "sha256:3ce11ee3f23f79dbd06fb3d63e2f6af7b12db1d46932fe7bd8afa259a5996603",
83 | "sha256:421be9fbf0ffe9ffd7a378aafebbf6f4602d564d34be190fc19a193232fd12b1",
84 | "sha256:43093fb83d8343aac0b1baa75516da6092f58f41200907ef92448ecab8825135",
85 | "sha256:46d00d6cfecdde84d40e572d63735ef81423ad31184100411e6e3388d405e247",
86 | "sha256:4a33dea2b688b3190ee12bd7cfa29d39c9ed176bda40bfa11099a3ce5d3a7ac6",
87 | "sha256:4b9fe39a2ccc108a4accc2676e77da025ce383c108593d65cc909add5c3bd601",
88 | "sha256:56442863ed2b06d19c37f94d999035e15ee982988920e12a5b4ba29b62ad1f77",
89 | "sha256:671cd1187ed5e62818414afe79ed29da836dde67166a9fac6d435873c44fdd02",
90 | "sha256:694deca8d702d5db21ec83983ce0bb4b26a578e71fbdbd4fdcd387daa90e4d5e",
91 | "sha256:6a074d34ee7a5ce3effbc526b7083ec9731bb3cbf921bbe1d3005d4d2bdb3a63",
92 | "sha256:6d0072fea50feec76a4c418096652f2c3238eaa014b2f94aeb1d56a66b41403f",
93 | "sha256:6fbf47b5d3728c6aea2abb0589b5d30459e369baa772e0f37a0320185e87c980",
94 | "sha256:7f91197cc9e48f989d12e4e6fbc46495c446636dfc81b9ccf50bb0ec74b91d4b",
95 | "sha256:86b1f75c4e7c2ac2ccdaec2b9022845dbb81880ca318bb7a0a01fbf7813e3812",
96 | "sha256:8dc1c72a69aa7e082593c4a203dcf94ddb74bb5c8a731e4e1eb68d031e8498ff",
97 | "sha256:8e3dcf21f367459434c18e71b2a9532d96547aef8a871872a5bd69a715c15f96",
98 | "sha256:8e576a51ad59e4bfaac456023a78f6b5e6e7651dcd383bcc3e18d06f9b55d6d1",
99 | "sha256:96e37a3dc86e80bf81758c152fe66dbf60ed5eca3d26305edf01892257049925",
100 | "sha256:97a68e6ada378df82bc9f16b800ab77cbf4b2fada0081794318520138c088e4a",
101 | "sha256:99a2a507ed3ac881b975a2976d59f38c19386d128e7a9a18b7df6fff1fd4c1d6",
102 | "sha256:a49907dd8420c5685cfa064a1335b6754b74541bbb3706c259c02ed65b644b3e",
103 | "sha256:b09bf97215625a311f669476f44b8b318b075847b49316d3e28c08e41a7a573f",
104 | "sha256:b7bd98b796e2b6553da7225aeb61f447f80a1ca64f41d83612e6139ca5213aa4",
105 | "sha256:b87db4360013327109564f0e591bd2a3b318547bcef31b468a92ee504d07ae4f",
106 | "sha256:bcb3ed405ed3222f9904899563d6fc492ff75cce56cba05e32eff40e6acbeaa3",
107 | "sha256:d4306c36ca495956b6d568d276ac11fdd9c30a36f1b6eb928070dc5360b22e1c",
108 | "sha256:d5ee4f386140395a2c818d149221149c54849dfcfcb9f1debfe07a8b8bd63f9a",
109 | "sha256:dda30ba7e87fbbb7eab1ec9f58678558fd9a6b8b853530e176eabd064da81417",
110 | "sha256:e04e26803c9c3851c931eac40c695602c6295b8d432cbe78609649ad9bd2da8a",
111 | "sha256:e1c0b87e09fa55a220f058d1d49d3fb8df88fbfab58558f1198e08c1e1de842a",
112 | "sha256:e72591e9ecd94d7feb70c1cbd7be7b3ebea3f548870aa91e2732960fa4d57a37",
113 | "sha256:e8c843bbcda3a2f1e3c2ab25913c80a3c5376cd00c6e8c4a86a89a28c8dc5452",
114 | "sha256:efc1913fd2ca4f334418481c7e595c00aad186563bbc1ec76067848c7ca0a933",
115 | "sha256:f121a1420d4e173a5d96e47e9a0c0dcff965afdf1626d28de1460815f7c4ee7a",
116 | "sha256:fc7b548b17d238737688817ab67deebb30e8073c95749d55538ed473130ec0c7"
117 | ],
118 | "markers": "python_version >= '3.7'",
119 | "version": "==2.1.1"
120 | },
121 | "numpy": {
122 | "hashes": [
123 | "sha256:004f0efcb2fe1c0bd6ae1fcfc69cc8b6bf2407e0f18be308612007a0762b4089",
124 | "sha256:09f6b7bdffe57fc61d869a22f506049825d707b288039d30f26a0d0d8ea05164",
125 | "sha256:0ea3f98a0ffce3f8f57675eb9119f3f4edb81888b6874bc1953f91e0b1d4f440",
126 | "sha256:17c0e467ade9bda685d5ac7f5fa729d8d3e76b23195471adae2d6a6941bd2c18",
127 | "sha256:1f27b5322ac4067e67c8f9378b41c746d8feac8bdd0e0ffede5324667b8a075c",
128 | "sha256:22d43376ee0acd547f3149b9ec12eec2f0ca4a6ab2f61753c5b29bb3e795ac4d",
129 | "sha256:2ad3ec9a748a8943e6eb4358201f7e1c12ede35f510b1a2221b70af4bb64295c",
130 | "sha256:301c00cf5e60e08e04d842fc47df641d4a181e651c7135c50dc2762ffe293dbd",
131 | "sha256:39a664e3d26ea854211867d20ebcc8023257c1800ae89773cbba9f9e97bae036",
132 | "sha256:51bf49c0cd1d52be0a240aa66f3458afc4b95d8993d2d04f0d91fa60c10af6cd",
133 | "sha256:78a63d2df1d947bd9d1b11d35564c2f9e4b57898aae4626638056ec1a231c40c",
134 | "sha256:7cd1328e5bdf0dee621912f5833648e2daca72e3839ec1d6695e91089625f0b4",
135 | "sha256:8355fc10fd33a5a70981a5b8a0de51d10af3688d7a9e4a34fcc8fa0d7467bb7f",
136 | "sha256:8c79d7cf86d049d0c5089231a5bcd31edb03555bd93d81a16870aa98c6cfb79d",
137 | "sha256:91b8d6768a75247026e951dce3b2aac79dc7e78622fc148329135ba189813584",
138 | "sha256:94c15ca4e52671a59219146ff584488907b1f9b3fc232622b47e2cf832e94fb8",
139 | "sha256:98dcbc02e39b1658dc4b4508442a560fe3ca5ca0d989f0df062534e5ca3a5c1a",
140 | "sha256:a64403f634e5ffdcd85e0b12c08f04b3080d3e840aef118721021f9b48fc1460",
141 | "sha256:bc6e8da415f359b578b00bcfb1d08411c96e9a97f9e6c7adada554a0812a6cc6",
142 | "sha256:bdc9febce3e68b697d931941b263c59e0c74e8f18861f4064c1f712562903411",
143 | "sha256:c1ba66c48b19cc9c2975c0d354f24058888cdc674bebadceb3cdc9ec403fb5d1",
144 | "sha256:c9f707b5bb73bf277d812ded9896f9512a43edff72712f31667d0a8c2f8e71ee",
145 | "sha256:d5422d6a1ea9b15577a9432e26608c73a78faf0b9039437b075cf322c92e98e7",
146 | "sha256:e5d5420053bbb3dd64c30e58f9363d7a9c27444c3648e61460c1237f9ec3fa14",
147 | "sha256:e868b0389c5ccfc092031a861d4e158ea164d8b7fdbb10e3b5689b4fc6498df6",
148 | "sha256:efd9d3abe5774404becdb0748178b48a218f1d8c44e0375475732211ea47c67e",
149 | "sha256:f8c02ec3c4c4fcb718fdf89a6c6f709b14949408e8cf2a2be5bfa9c49548fd85",
150 | "sha256:ffcf105ecdd9396e05a8e58e81faaaf34d3f9875f137c7372450baa5d77c9a54"
151 | ],
152 | "index": "pypi",
153 | "version": "==1.23.3"
154 | },
155 | "scikit-learn": {
156 | "hashes": [
157 | "sha256:08ef968f6b72033c16c479c966bf37ccd49b06ea91b765e1cc27afefe723920b",
158 | "sha256:158faf30684c92a78e12da19c73feff9641a928a8024b4fa5ec11d583f3d8a87",
159 | "sha256:16455ace947d8d9e5391435c2977178d0ff03a261571e67f627c8fee0f9d431a",
160 | "sha256:245c9b5a67445f6f044411e16a93a554edc1efdcce94d3fc0bc6a4b9ac30b752",
161 | "sha256:285db0352e635b9e3392b0b426bc48c3b485512d3b4ac3c7a44ec2a2ba061e66",
162 | "sha256:2f3b453e0b149898577e301d27e098dfe1a36943f7bb0ad704d1e548efc3b448",
163 | "sha256:46f431ec59dead665e1370314dbebc99ead05e1c0a9df42f22d6a0e00044820f",
164 | "sha256:55f2f3a8414e14fbee03782f9fe16cca0f141d639d2b1c1a36779fa069e1db57",
165 | "sha256:5cb33fe1dc6f73dc19e67b264dbb5dde2a0539b986435fdd78ed978c14654830",
166 | "sha256:75307d9ea39236cad7eea87143155eea24d48f93f3a2f9389c817f7019f00705",
167 | "sha256:7626a34eabbf370a638f32d1a3ad50526844ba58d63e3ab81ba91e2a7c6d037e",
168 | "sha256:7a93c1292799620df90348800d5ac06f3794c1316ca247525fa31169f6d25855",
169 | "sha256:7d6b2475f1c23a698b48515217eb26b45a6598c7b1840ba23b3c5acece658dbb",
170 | "sha256:80095a1e4b93bd33261ef03b9bc86d6db649f988ea4dbcf7110d0cded8d7213d",
171 | "sha256:85260fb430b795d806251dd3bb05e6f48cdc777ac31f2bcf2bc8bbed3270a8f5",
172 | "sha256:9369b030e155f8188743eb4893ac17a27f81d28a884af460870c7c072f114243",
173 | "sha256:a053a6a527c87c5c4fa7bf1ab2556fa16d8345cf99b6c5a19030a4a7cd8fd2c0",
174 | "sha256:a90b60048f9ffdd962d2ad2fb16367a87ac34d76e02550968719eb7b5716fd10",
175 | "sha256:a999c9f02ff9570c783069f1074f06fe7386ec65b84c983db5aeb8144356a355",
176 | "sha256:b1391d1a6e2268485a63c3073111fe3ba6ec5145fc957481cfd0652be571226d",
177 | "sha256:b54a62c6e318ddbfa7d22c383466d38d2ee770ebdb5ddb668d56a099f6eaf75f",
178 | "sha256:b5870959a5484b614f26d31ca4c17524b1b0317522199dc985c3b4256e030767",
179 | "sha256:bc3744dabc56b50bec73624aeca02e0def06b03cb287de26836e730659c5d29c",
180 | "sha256:d93d4c28370aea8a7cbf6015e8a669cd5d69f856cc2aa44e7a590fb805bb5583",
181 | "sha256:d9aac97e57c196206179f674f09bc6bffcd0284e2ba95b7fe0b402ac3f986023",
182 | "sha256:da3c84694ff693b5b3194d8752ccf935a665b8b5edc33a283122f4273ca3e687",
183 | "sha256:e174242caecb11e4abf169342641778f68e1bfaba80cd18acd6bc84286b9a534",
184 | "sha256:eabceab574f471de0b0eb3f2ecf2eee9f10b3106570481d007ed1c84ebf6d6a1",
185 | "sha256:f14517e174bd7332f1cca2c959e704696a5e0ba246eb8763e6c24876d8710049",
186 | "sha256:fa38a1b9b38ae1fad2863eff5e0d69608567453fdfc850c992e6e47eb764e846",
187 | "sha256:ff3fa8ea0e09e38677762afc6e14cad77b5e125b0ea70c9bba1992f02c93b028",
188 | "sha256:ff746a69ff2ef25f62b36338c615dd15954ddc3ab8e73530237dd73235e76d62"
189 | ],
190 | "index": "pypi",
191 | "version": "==1.0.2"
192 | },
193 | "scipy": {
194 | "hashes": [
195 | "sha256:0e9c83dccac06f3b9aa02df69577f239758d5d0d0c069673fb0b47ecb971983d",
196 | "sha256:148cb6f53d9d10dafde848e9aeb1226bf2809d16dc3221b2fa568130b6f2e586",
197 | "sha256:17be1a7c68ec4c49d8cd4eb1655d55d14a54ab63012296bdd5921c92dc485acd",
198 | "sha256:1e3b23a82867018cd26255dc951789a7c567921622073e1113755866f1eae928",
199 | "sha256:22380e076a162e81b659d53d75b02e9c75ad14ea2d53d9c645a12543414e2150",
200 | "sha256:4012dbe540732311b8f4388b7e1482eb43a7cc0435bbf2b9916b3d6c38fb8d01",
201 | "sha256:5994a8232cc6510a8e85899661df2d11198bf362f0ffe6fbd5c0aca17ab46ce3",
202 | "sha256:61b95283529712101bfb7c87faf94cb86ed9e64de079509edfe107e5cfa55733",
203 | "sha256:658fd31c6ad4eb9fa3fd460fcac779f70a6bc7480288a211b7658a25891cf01d",
204 | "sha256:7b2608b3141c257d01ae772e23b3de9e04d27344e6b68a890883795229cb7191",
205 | "sha256:82e8bfb352aa9dce9a0ffe81f4c369a2c87c85533519441686f59f21d8c09697",
206 | "sha256:885b7ac56d7460544b2ef89ab9feafa30f4264c9825d975ef690608d07e6cc55",
207 | "sha256:8c8c29703202c39d699b0d6b164bde5501c212005f20abf46ae322b9307c8a41",
208 | "sha256:92c5e627a0635ca02e6494bbbdb74f98d93ac8730416209d61de3b70c8a821be",
209 | "sha256:99e7720caefb8bca6ebf05c7d96078ed202881f61e0c68bd9e0f3e8097d6f794",
210 | "sha256:a72297eb9702576bd8f626bb488fd32bb35349d3120fc4a5e733db137f06c9a6",
211 | "sha256:aa270cc6080c987929335c4cb94e8054fee9a6058cecff22276fa5dbab9856fc",
212 | "sha256:b6194da32e0ce9200b2eda4eb4edb89c5cb8b83d6deaf7c35f8ad3d5d7627d5c",
213 | "sha256:bbed414fc25d64bd6d1613dc0286fbf91902219b8be63ad254525162235b67e9",
214 | "sha256:d6cb1f92ded3fc48f7dbe94d20d7b9887e13b874e79043907de541c841563b4c",
215 | "sha256:ee4ceed204f269da19f67f0115a85d3a2cd8547185037ad99a4025f9c61d02e9"
216 | ],
217 | "markers": "python_version >= '3.8'",
218 | "version": "==1.9.2"
219 | },
220 | "setuptools": {
221 | "hashes": [
222 | "sha256:1b6bdc6161661409c5f21508763dc63ab20a9ac2f8ba20029aaaa7fdb9118012",
223 | "sha256:3050e338e5871e70c72983072fe34f6032ae1cdeeeb67338199c2f74e083a80e"
224 | ],
225 | "markers": "python_version >= '3.7'",
226 | "version": "==65.4.1"
227 | },
228 | "threadpoolctl": {
229 | "hashes": [
230 | "sha256:8b99adda265feb6773280df41eece7b2e6561b772d21ffd52e372f999024907b",
231 | "sha256:a335baacfaa4400ae1f0d8e3a58d6674d2f8828e3716bb2802c44955ad391380"
232 | ],
233 | "markers": "python_version >= '3.6'",
234 | "version": "==3.1.0"
235 | },
236 | "werkzeug": {
237 | "hashes": [
238 | "sha256:7ea2d48322cc7c0f8b3a215ed73eabd7b5d75d0b50e31ab006286ccff9e00b8f",
239 | "sha256:f979ab81f58d7318e064e99c4506445d60135ac5cd2e177a2de0089bfd4c9bd5"
240 | ],
241 | "markers": "python_version >= '3.7'",
242 | "version": "==2.2.2"
243 | },
244 | "zipp": {
245 | "hashes": [
246 | "sha256:3a7af91c3db40ec72dd9d154ae18e008c69efe8ca88dde4f9a731bb82fe2f9eb",
247 | "sha256:972cfa31bc2fedd3fa838a51e9bc7e64b7fb725a8c00e7431554311f180e9980"
248 | ],
249 | "markers": "python_version >= '3.7'",
250 | "version": "==3.9.0"
251 | }
252 | },
253 | "develop": {
254 | "attrs": {
255 | "hashes": [
256 | "sha256:29adc2665447e5191d0e7c568fde78b21f9672d344281d0c6e1ab085429b22b6",
257 | "sha256:86efa402f67bf2df34f51a335487cf46b1ec130d02b8d39fd248abfd30da551c"
258 | ],
259 | "markers": "python_version >= '3.5'",
260 | "version": "==22.1.0"
261 | },
262 | "awsebcli": {
263 | "hashes": [
264 | "sha256:5b79d45cf017a2270340d5e2824b51d7e6fff56eaabe2f747ee8eb5c07a4c535"
265 | ],
266 | "index": "pypi",
267 | "version": "==3.20.3"
268 | },
269 | "bcrypt": {
270 | "hashes": [
271 | "sha256:089098effa1bc35dc055366740a067a2fc76987e8ec75349eb9484061c54f535",
272 | "sha256:08d2947c490093a11416df18043c27abe3921558d2c03e2076ccb28a116cb6d0",
273 | "sha256:0eaa47d4661c326bfc9d08d16debbc4edf78778e6aaba29c1bc7ce67214d4410",
274 | "sha256:27d375903ac8261cfe4047f6709d16f7d18d39b1ec92aaf72af989552a650ebd",
275 | "sha256:2b3ac11cf45161628f1f3733263e63194f22664bf4d0c0f3ab34099c02134665",
276 | "sha256:2caffdae059e06ac23fce178d31b4a702f2a3264c20bfb5ff541b338194d8fab",
277 | "sha256:3100851841186c25f127731b9fa11909ab7b1df6fc4b9f8353f4f1fd952fbf71",
278 | "sha256:5ad4d32a28b80c5fa6671ccfb43676e8c1cc232887759d1cd7b6f56ea4355215",
279 | "sha256:67a97e1c405b24f19d08890e7ae0c4f7ce1e56a712a016746c8b2d7732d65d4b",
280 | "sha256:705b2cea8a9ed3d55b4491887ceadb0106acf7c6387699fca771af56b1cdeeda",
281 | "sha256:8a68f4341daf7522fe8d73874de8906f3a339048ba406be6ddc1b3ccb16fc0d9",
282 | "sha256:a522427293d77e1c29e303fc282e2d71864579527a04ddcfda6d4f8396c6c36a",
283 | "sha256:ae88eca3024bb34bb3430f964beab71226e761f51b912de5133470b649d82344",
284 | "sha256:b1023030aec778185a6c16cf70f359cbb6e0c289fd564a7cfa29e727a1c38f8f",
285 | "sha256:b3b85202d95dd568efcb35b53936c5e3b3600c7cdcc6115ba461df3a8e89f38d",
286 | "sha256:b57adba8a1444faf784394de3436233728a1ecaeb6e07e8c22c8848f179b893c",
287 | "sha256:bf4fa8b2ca74381bb5442c089350f09a3f17797829d958fad058d6e44d9eb83c",
288 | "sha256:ca3204d00d3cb2dfed07f2d74a25f12fc12f73e606fcaa6975d1f7ae69cacbb2",
289 | "sha256:cbb03eec97496166b704ed663a53680ab57c5084b2fc98ef23291987b525cb7d",
290 | "sha256:e9a51bbfe7e9802b5f3508687758b564069ba937748ad7b9e890086290d2f79e",
291 | "sha256:fbdaec13c5105f0c4e5c52614d04f0bca5f5af007910daa8b6b12095edaa67b3"
292 | ],
293 | "markers": "python_version >= '3.6'",
294 | "version": "==4.0.1"
295 | },
296 | "blessed": {
297 | "hashes": [
298 | "sha256:63b8554ae2e0e7f43749b6715c734cc8f3883010a809bf16790102563e6cf25b",
299 | "sha256:9a0d099695bf621d4680dd6c73f6ad547f6a3442fbdbe80c4b1daa1edbc492fc"
300 | ],
301 | "markers": "python_version >= '2.7'",
302 | "version": "==1.19.1"
303 | },
304 | "botocore": {
305 | "hashes": [
306 | "sha256:06ae8076c4dcf3d72bec4d37e5f2dce4a92a18a8cdaa3bfaa6e3b7b5e30a8d7e",
307 | "sha256:4bb9ba16cccee5f5a2602049bc3e2db6865346b2550667f3013bdf33b0a01ceb"
308 | ],
309 | "markers": "python_version >= '3.6'",
310 | "version": "==1.23.54"
311 | },
312 | "cached-property": {
313 | "hashes": [
314 | "sha256:9fa5755838eecbb2d234c3aa390bd80fbd3ac6b6869109bfc1b499f7bd89a130",
315 | "sha256:df4f613cf7ad9a588cc381aaf4a512d26265ecebd5eb9e1ba12f1319eb85a6a0"
316 | ],
317 | "version": "==1.5.2"
318 | },
319 | "cement": {
320 | "hashes": [
321 | "sha256:8765ed052c061d74e4d0189addc33d268de544ca219b259d797741f725e422d2"
322 | ],
323 | "version": "==2.8.2"
324 | },
325 | "certifi": {
326 | "hashes": [
327 | "sha256:0d9c601124e5a6ba9712dbc60d9c53c21e34f5f641fe83002317394311bdce14",
328 | "sha256:90c1a32f1d68f940488354e36370f6cca89f0f106db09518524c88d6ed83f382"
329 | ],
330 | "markers": "python_version >= '3.6'",
331 | "version": "==2022.9.24"
332 | },
333 | "cffi": {
334 | "hashes": [
335 | "sha256:00a9ed42e88df81ffae7a8ab6d9356b371399b91dbdf0c3cb1e84c03a13aceb5",
336 | "sha256:03425bdae262c76aad70202debd780501fabeaca237cdfddc008987c0e0f59ef",
337 | "sha256:04ed324bda3cda42b9b695d51bb7d54b680b9719cfab04227cdd1e04e5de3104",
338 | "sha256:0e2642fe3142e4cc4af0799748233ad6da94c62a8bec3a6648bf8ee68b1c7426",
339 | "sha256:173379135477dc8cac4bc58f45db08ab45d228b3363adb7af79436135d028405",
340 | "sha256:198caafb44239b60e252492445da556afafc7d1e3ab7a1fb3f0584ef6d742375",
341 | "sha256:1e74c6b51a9ed6589199c787bf5f9875612ca4a8a0785fb2d4a84429badaf22a",
342 | "sha256:2012c72d854c2d03e45d06ae57f40d78e5770d252f195b93f581acf3ba44496e",
343 | "sha256:21157295583fe8943475029ed5abdcf71eb3911894724e360acff1d61c1d54bc",
344 | "sha256:2470043b93ff09bf8fb1d46d1cb756ce6132c54826661a32d4e4d132e1977adf",
345 | "sha256:285d29981935eb726a4399badae8f0ffdff4f5050eaa6d0cfc3f64b857b77185",
346 | "sha256:30d78fbc8ebf9c92c9b7823ee18eb92f2e6ef79b45ac84db507f52fbe3ec4497",
347 | "sha256:320dab6e7cb2eacdf0e658569d2575c4dad258c0fcc794f46215e1e39f90f2c3",
348 | "sha256:33ab79603146aace82c2427da5ca6e58f2b3f2fb5da893ceac0c42218a40be35",
349 | "sha256:3548db281cd7d2561c9ad9984681c95f7b0e38881201e157833a2342c30d5e8c",
350 | "sha256:3799aecf2e17cf585d977b780ce79ff0dc9b78d799fc694221ce814c2c19db83",
351 | "sha256:39d39875251ca8f612b6f33e6b1195af86d1b3e60086068be9cc053aa4376e21",
352 | "sha256:3b926aa83d1edb5aa5b427b4053dc420ec295a08e40911296b9eb1b6170f6cca",
353 | "sha256:3bcde07039e586f91b45c88f8583ea7cf7a0770df3a1649627bf598332cb6984",
354 | "sha256:3d08afd128ddaa624a48cf2b859afef385b720bb4b43df214f85616922e6a5ac",
355 | "sha256:3eb6971dcff08619f8d91607cfc726518b6fa2a9eba42856be181c6d0d9515fd",
356 | "sha256:40f4774f5a9d4f5e344f31a32b5096977b5d48560c5592e2f3d2c4374bd543ee",
357 | "sha256:4289fc34b2f5316fbb762d75362931e351941fa95fa18789191b33fc4cf9504a",
358 | "sha256:470c103ae716238bbe698d67ad020e1db9d9dba34fa5a899b5e21577e6d52ed2",
359 | "sha256:4f2c9f67e9821cad2e5f480bc8d83b8742896f1242dba247911072d4fa94c192",
360 | "sha256:50a74364d85fd319352182ef59c5c790484a336f6db772c1a9231f1c3ed0cbd7",
361 | "sha256:54a2db7b78338edd780e7ef7f9f6c442500fb0d41a5a4ea24fff1c929d5af585",
362 | "sha256:5635bd9cb9731e6d4a1132a498dd34f764034a8ce60cef4f5319c0541159392f",
363 | "sha256:59c0b02d0a6c384d453fece7566d1c7e6b7bae4fc5874ef2ef46d56776d61c9e",
364 | "sha256:5d598b938678ebf3c67377cdd45e09d431369c3b1a5b331058c338e201f12b27",
365 | "sha256:5df2768244d19ab7f60546d0c7c63ce1581f7af8b5de3eb3004b9b6fc8a9f84b",
366 | "sha256:5ef34d190326c3b1f822a5b7a45f6c4535e2f47ed06fec77d3d799c450b2651e",
367 | "sha256:6975a3fac6bc83c4a65c9f9fcab9e47019a11d3d2cf7f3c0d03431bf145a941e",
368 | "sha256:6c9a799e985904922a4d207a94eae35c78ebae90e128f0c4e521ce339396be9d",
369 | "sha256:70df4e3b545a17496c9b3f41f5115e69a4f2e77e94e1d2a8e1070bc0c38c8a3c",
370 | "sha256:7473e861101c9e72452f9bf8acb984947aa1661a7704553a9f6e4baa5ba64415",
371 | "sha256:8102eaf27e1e448db915d08afa8b41d6c7ca7a04b7d73af6514df10a3e74bd82",
372 | "sha256:87c450779d0914f2861b8526e035c5e6da0a3199d8f1add1a665e1cbc6fc6d02",
373 | "sha256:8b7ee99e510d7b66cdb6c593f21c043c248537a32e0bedf02e01e9553a172314",
374 | "sha256:91fc98adde3d7881af9b59ed0294046f3806221863722ba7d8d120c575314325",
375 | "sha256:94411f22c3985acaec6f83c6df553f2dbe17b698cc7f8ae751ff2237d96b9e3c",
376 | "sha256:98d85c6a2bef81588d9227dde12db8a7f47f639f4a17c9ae08e773aa9c697bf3",
377 | "sha256:9ad5db27f9cabae298d151c85cf2bad1d359a1b9c686a275df03385758e2f914",
378 | "sha256:a0b71b1b8fbf2b96e41c4d990244165e2c9be83d54962a9a1d118fd8657d2045",
379 | "sha256:a0f100c8912c114ff53e1202d0078b425bee3649ae34d7b070e9697f93c5d52d",
380 | "sha256:a591fe9e525846e4d154205572a029f653ada1a78b93697f3b5a8f1f2bc055b9",
381 | "sha256:a5c84c68147988265e60416b57fc83425a78058853509c1b0629c180094904a5",
382 | "sha256:a66d3508133af6e8548451b25058d5812812ec3798c886bf38ed24a98216fab2",
383 | "sha256:a8c4917bd7ad33e8eb21e9a5bbba979b49d9a97acb3a803092cbc1133e20343c",
384 | "sha256:b3bbeb01c2b273cca1e1e0c5df57f12dce9a4dd331b4fa1635b8bec26350bde3",
385 | "sha256:cba9d6b9a7d64d4bd46167096fc9d2f835e25d7e4c121fb2ddfc6528fb0413b2",
386 | "sha256:cc4d65aeeaa04136a12677d3dd0b1c0c94dc43abac5860ab33cceb42b801c1e8",
387 | "sha256:ce4bcc037df4fc5e3d184794f27bdaab018943698f4ca31630bc7f84a7b69c6d",
388 | "sha256:cec7d9412a9102bdc577382c3929b337320c4c4c4849f2c5cdd14d7368c5562d",
389 | "sha256:d400bfb9a37b1351253cb402671cea7e89bdecc294e8016a707f6d1d8ac934f9",
390 | "sha256:d61f4695e6c866a23a21acab0509af1cdfd2c013cf256bbf5b6b5e2695827162",
391 | "sha256:db0fbb9c62743ce59a9ff687eb5f4afbe77e5e8403d6697f7446e5f609976f76",
392 | "sha256:dd86c085fae2efd48ac91dd7ccffcfc0571387fe1193d33b6394db7ef31fe2a4",
393 | "sha256:e00b098126fd45523dd056d2efba6c5a63b71ffe9f2bbe1a4fe1716e1d0c331e",
394 | "sha256:e229a521186c75c8ad9490854fd8bbdd9a0c9aa3a524326b55be83b54d4e0ad9",
395 | "sha256:e263d77ee3dd201c3a142934a086a4450861778baaeeb45db4591ef65550b0a6",
396 | "sha256:ed9cb427ba5504c1dc15ede7d516b84757c3e3d7868ccc85121d9310d27eed0b",
397 | "sha256:fa6693661a4c91757f4412306191b6dc88c1703f780c8234035eac011922bc01",
398 | "sha256:fcd131dd944808b5bdb38e6f5b53013c5aa4f334c5cad0c72742f6eba4b73db0"
399 | ],
400 | "version": "==1.15.1"
401 | },
402 | "charset-normalizer": {
403 | "hashes": [
404 | "sha256:2857e29ff0d34db842cd7ca3230549d1a697f96ee6d3fb071cfa6c7393832597",
405 | "sha256:6881edbebdb17b39b4eaaa821b438bf6eddffb4468cf344f09f89def34a8b1df"
406 | ],
407 | "markers": "python_version >= '3'",
408 | "version": "==2.0.12"
409 | },
410 | "colorama": {
411 | "hashes": [
412 | "sha256:7d73d2a99753107a36ac6b455ee49046802e59d9d076ef8e47b61499fa29afff",
413 | "sha256:e96da0d330793e2cb9485e9ddfd918d456036c7149416295932478192f4436a1"
414 | ],
415 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'",
416 | "version": "==0.4.3"
417 | },
418 | "cryptography": {
419 | "hashes": [
420 | "sha256:0297ffc478bdd237f5ca3a7dc96fc0d315670bfa099c04dc3a4a2172008a405a",
421 | "sha256:10d1f29d6292fc95acb597bacefd5b9e812099d75a6469004fd38ba5471a977f",
422 | "sha256:16fa61e7481f4b77ef53991075de29fc5bacb582a1244046d2e8b4bb72ef66d0",
423 | "sha256:194044c6b89a2f9f169df475cc167f6157eb9151cc69af8a2a163481d45cc407",
424 | "sha256:1db3d807a14931fa317f96435695d9ec386be7b84b618cc61cfa5d08b0ae33d7",
425 | "sha256:3261725c0ef84e7592597606f6583385fed2a5ec3909f43bc475ade9729a41d6",
426 | "sha256:3b72c360427889b40f36dc214630e688c2fe03e16c162ef0aa41da7ab1455153",
427 | "sha256:3e3a2599e640927089f932295a9a247fc40a5bdf69b0484532f530471a382750",
428 | "sha256:3fc26e22840b77326a764ceb5f02ca2d342305fba08f002a8c1f139540cdfaad",
429 | "sha256:5067ee7f2bce36b11d0e334abcd1ccf8c541fc0bbdaf57cdd511fdee53e879b6",
430 | "sha256:52e7bee800ec869b4031093875279f1ff2ed12c1e2f74923e8f49c916afd1d3b",
431 | "sha256:64760ba5331e3f1794d0bcaabc0d0c39e8c60bf67d09c93dc0e54189dfd7cfe5",
432 | "sha256:765fa194a0f3372d83005ab83ab35d7c5526c4e22951e46059b8ac678b44fa5a",
433 | "sha256:79473cf8a5cbc471979bd9378c9f425384980fcf2ab6534b18ed7d0d9843987d",
434 | "sha256:896dd3a66959d3a5ddcfc140a53391f69ff1e8f25d93f0e2e7830c6de90ceb9d",
435 | "sha256:89ed49784ba88c221756ff4d4755dbc03b3c8d2c5103f6d6b4f83a0fb1e85294",
436 | "sha256:ac7e48f7e7261207d750fa7e55eac2d45f720027d5703cd9007e9b37bbb59ac0",
437 | "sha256:ad7353f6ddf285aeadfaf79e5a6829110106ff8189391704c1d8801aa0bae45a",
438 | "sha256:b0163a849b6f315bf52815e238bc2b2346604413fa7c1601eea84bcddb5fb9ac",
439 | "sha256:b6c9b706316d7b5a137c35e14f4103e2115b088c412140fdbd5f87c73284df61",
440 | "sha256:c2e5856248a416767322c8668ef1845ad46ee62629266f84a8f007a317141013",
441 | "sha256:ca9f6784ea96b55ff41708b92c3f6aeaebde4c560308e5fbbd3173fbc466e94e",
442 | "sha256:d1a5bd52d684e49a36582193e0b89ff267704cd4025abefb9e26803adeb3e5fb",
443 | "sha256:d3971e2749a723e9084dd507584e2a2761f78ad2c638aa31e80bc7a15c9db4f9",
444 | "sha256:d4ef6cc305394ed669d4d9eebf10d3a101059bdcf2669c366ec1d14e4fb227bd",
445 | "sha256:d9e69ae01f99abe6ad646947bba8941e896cb3aa805be2597a0400e0764b5818"
446 | ],
447 | "markers": "python_version >= '3.6'",
448 | "version": "==38.0.1"
449 | },
450 | "docker": {
451 | "extras": [
452 | "ssh"
453 | ],
454 | "hashes": [
455 | "sha256:d3393c878f575d3a9ca3b94471a3c89a6d960b35feb92f033c0de36cc9d934db",
456 | "sha256:f3607d5695be025fa405a12aca2e5df702a57db63790c73b927eb6a94aac60af"
457 | ],
458 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'",
459 | "version": "==4.4.4"
460 | },
461 | "docker-compose": {
462 | "hashes": [
463 | "sha256:7a2eb6d8173fdf408e505e6f7d497ac0b777388719542be9e49a0efd477a50c6",
464 | "sha256:9d33520ae976f524968a64226516ec631dce09fba0974ce5366ad403e203eb5d"
465 | ],
466 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
467 | "version": "==1.25.5"
468 | },
469 | "dockerpty": {
470 | "hashes": [
471 | "sha256:69a9d69d573a0daa31bcd1c0774eeed5c15c295fe719c61aca550ed1393156ce"
472 | ],
473 | "version": "==0.4.1"
474 | },
475 | "docopt": {
476 | "hashes": [
477 | "sha256:49b3a825280bd66b3aa83585ef59c4a8c82f2c8a522dbe754a8bc8d08c85c491"
478 | ],
479 | "version": "==0.6.2"
480 | },
481 | "future": {
482 | "hashes": [
483 | "sha256:e39ced1ab767b5936646cedba8bcce582398233d6a627067d4c6a454c90cfedb"
484 | ],
485 | "version": "==0.16.0"
486 | },
487 | "idna": {
488 | "hashes": [
489 | "sha256:814f528e8dead7d329833b91c5faa87d60bf71824cd12a7530b5526063d02cb4",
490 | "sha256:90b77e79eaa3eba6de819a0c442c0b4ceefc341a7a2ab77d7562bf49f425c5c2"
491 | ],
492 | "markers": "python_version >= '3'",
493 | "version": "==3.4"
494 | },
495 | "jmespath": {
496 | "hashes": [
497 | "sha256:b85d0567b8666149a93172712e68920734333c0ce7e89b78b3e987f71e5ed4f9",
498 | "sha256:cdf6525904cc597730141d61b36f2e4b8ecc257c420fa2f4549bac2c2d0cb72f"
499 | ],
500 | "markers": "python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'",
501 | "version": "==0.10.0"
502 | },
503 | "jsonschema": {
504 | "hashes": [
505 | "sha256:4e5b3cf8216f577bee9ce139cbe72eca3ea4f292ec60928ff24758ce626cd163",
506 | "sha256:c8a85b28d377cc7737e46e2d9f2b4f44ee3c0e1deac6bf46ddefc7187d30797a"
507 | ],
508 | "version": "==3.2.0"
509 | },
510 | "paramiko": {
511 | "hashes": [
512 | "sha256:003e6bee7c034c21fbb051bf83dc0a9ee4106204dd3c53054c71452cc4ec3938",
513 | "sha256:655f25dc8baf763277b933dfcea101d636581df8d6b9774d1fb653426b72c270"
514 | ],
515 | "version": "==2.11.0"
516 | },
517 | "pathspec": {
518 | "hashes": [
519 | "sha256:7d15c4ddb0b5c802d161efc417ec1a2558ea2653c2e8ad9c19098201dc1c993a",
520 | "sha256:e564499435a2673d586f6b2130bb5b95f04a3ba06f81b8f895b651a3c76aabb1"
521 | ],
522 | "version": "==0.9.0"
523 | },
524 | "pycparser": {
525 | "hashes": [
526 | "sha256:8ee45429555515e1f6b185e78100aea234072576aa43ab53aefcae078162fca9",
527 | "sha256:e644fdec12f7872f86c58ff790da456218b10f863970249516d60a5eaca77206"
528 | ],
529 | "version": "==2.21"
530 | },
531 | "pynacl": {
532 | "hashes": [
533 | "sha256:06b8f6fa7f5de8d5d2f7573fe8c863c051225a27b61e6860fd047b1775807858",
534 | "sha256:0c84947a22519e013607c9be43706dd42513f9e6ae5d39d3613ca1e142fba44d",
535 | "sha256:20f42270d27e1b6a29f54032090b972d97f0a1b0948cc52392041ef7831fee93",
536 | "sha256:401002a4aaa07c9414132aaed7f6836ff98f59277a234704ff66878c2ee4a0d1",
537 | "sha256:52cb72a79269189d4e0dc537556f4740f7f0a9ec41c1322598799b0bdad4ef92",
538 | "sha256:61f642bf2378713e2c2e1de73444a3778e5f0a38be6fee0fe532fe30060282ff",
539 | "sha256:8ac7448f09ab85811607bdd21ec2464495ac8b7c66d146bf545b0f08fb9220ba",
540 | "sha256:a36d4a9dda1f19ce6e03c9a784a2921a4b726b02e1c736600ca9c22029474394",
541 | "sha256:a422368fc821589c228f4c49438a368831cb5bbc0eab5ebe1d7fac9dded6567b",
542 | "sha256:e46dae94e34b085175f8abb3b0aaa7da40767865ac82c928eeb9e57e1ea8a543"
543 | ],
544 | "markers": "python_version >= '3.6'",
545 | "version": "==1.5.0"
546 | },
547 | "pyrsistent": {
548 | "hashes": [
549 | "sha256:0e3e1fcc45199df76053026a51cc59ab2ea3fc7c094c6627e93b7b44cdae2c8c",
550 | "sha256:1b34eedd6812bf4d33814fca1b66005805d3640ce53140ab8bbb1e2651b0d9bc",
551 | "sha256:4ed6784ceac462a7d6fcb7e9b663e93b9a6fb373b7f43594f9ff68875788e01e",
552 | "sha256:5d45866ececf4a5fff8742c25722da6d4c9e180daa7b405dc0a2a2790d668c26",
553 | "sha256:636ce2dc235046ccd3d8c56a7ad54e99d5c1cd0ef07d9ae847306c91d11b5fec",
554 | "sha256:6455fc599df93d1f60e1c5c4fe471499f08d190d57eca040c0ea182301321286",
555 | "sha256:6bc66318fb7ee012071b2792024564973ecc80e9522842eb4e17743604b5e045",
556 | "sha256:7bfe2388663fd18bd8ce7db2c91c7400bf3e1a9e8bd7d63bf7e77d39051b85ec",
557 | "sha256:7ec335fc998faa4febe75cc5268a9eac0478b3f681602c1f27befaf2a1abe1d8",
558 | "sha256:914474c9f1d93080338ace89cb2acee74f4f666fb0424896fcfb8d86058bf17c",
559 | "sha256:b568f35ad53a7b07ed9b1b2bae09eb15cdd671a5ba5d2c66caee40dbf91c68ca",
560 | "sha256:cdfd2c361b8a8e5d9499b9082b501c452ade8bbf42aef97ea04854f4a3f43b22",
561 | "sha256:d1b96547410f76078eaf66d282ddca2e4baae8964364abb4f4dcdde855cd123a",
562 | "sha256:d4d61f8b993a7255ba714df3aca52700f8125289f84f704cf80916517c46eb96",
563 | "sha256:d7a096646eab884bf8bed965bad63ea327e0d0c38989fc83c5ea7b8a87037bfc",
564 | "sha256:df46c854f490f81210870e509818b729db4488e1f30f2a1ce1698b2295a878d1",
565 | "sha256:e24a828f57e0c337c8d8bb9f6b12f09dfdf0273da25fda9e314f0b684b415a07",
566 | "sha256:e4f3149fd5eb9b285d6bfb54d2e5173f6a116fe19172686797c056672689daf6",
567 | "sha256:e92a52c166426efbe0d1ec1332ee9119b6d32fc1f0bbfd55d5c1088070e7fc1b",
568 | "sha256:f87cc2863ef33c709e237d4b5f4502a62a00fab450c9e020892e8e2ede5847f5",
569 | "sha256:fd8da6d0124efa2f67d86fa70c851022f87c98e205f0594e1fae044e7119a5a6"
570 | ],
571 | "markers": "python_version >= '3.7'",
572 | "version": "==0.18.1"
573 | },
574 | "python-dateutil": {
575 | "hashes": [
576 | "sha256:0123cacc1627ae19ddf3c27a5de5bd67ee4586fbdd6440d9748f8abb483d3e86",
577 | "sha256:961d03dc3453ebbc59dbdea9e4e11c5651520a876d0f4db161e8674aae935da9"
578 | ],
579 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
580 | "version": "==2.8.2"
581 | },
582 | "pyyaml": {
583 | "hashes": [
584 | "sha256:08682f6b72c722394747bddaf0aa62277e02557c0fd1c42cb853016a38f8dedf",
585 | "sha256:0f5f5786c0e09baddcd8b4b45f20a7b5d61a7e7e99846e3c799b05c7c53fa696",
586 | "sha256:129def1b7c1bf22faffd67b8f3724645203b79d8f4cc81f674654d9902cb4393",
587 | "sha256:294db365efa064d00b8d1ef65d8ea2c3426ac366c0c4368d930bf1c5fb497f77",
588 | "sha256:3b2b1824fe7112845700f815ff6a489360226a5609b96ec2190a45e62a9fc922",
589 | "sha256:3bd0e463264cf257d1ffd2e40223b197271046d09dadf73a0fe82b9c1fc385a5",
590 | "sha256:4465124ef1b18d9ace298060f4eccc64b0850899ac4ac53294547536533800c8",
591 | "sha256:49d4cdd9065b9b6e206d0595fee27a96b5dd22618e7520c33204a4a3239d5b10",
592 | "sha256:4e0583d24c881e14342eaf4ec5fbc97f934b999a6828693a99157fde912540cc",
593 | "sha256:5accb17103e43963b80e6f837831f38d314a0495500067cb25afab2e8d7a4018",
594 | "sha256:607774cbba28732bfa802b54baa7484215f530991055bb562efbed5b2f20a45e",
595 | "sha256:6c78645d400265a062508ae399b60b8c167bf003db364ecb26dcab2bda048253",
596 | "sha256:72a01f726a9c7851ca9bfad6fd09ca4e090a023c00945ea05ba1638c09dc3347",
597 | "sha256:74c1485f7707cf707a7aef42ef6322b8f97921bd89be2ab6317fd782c2d53183",
598 | "sha256:895f61ef02e8fed38159bb70f7e100e00f471eae2bc838cd0f4ebb21e28f8541",
599 | "sha256:8c1be557ee92a20f184922c7b6424e8ab6691788e6d86137c5d93c1a6ec1b8fb",
600 | "sha256:bb4191dfc9306777bc594117aee052446b3fa88737cd13b7188d0e7aa8162185",
601 | "sha256:bfb51918d4ff3d77c1c856a9699f8492c612cde32fd3bcd344af9be34999bfdc",
602 | "sha256:c20cfa2d49991c8b4147af39859b167664f2ad4561704ee74c1de03318e898db",
603 | "sha256:cb333c16912324fd5f769fff6bc5de372e9e7a202247b48870bc251ed40239aa",
604 | "sha256:d2d9808ea7b4af864f35ea216be506ecec180628aced0704e34aca0b040ffe46",
605 | "sha256:d483ad4e639292c90170eb6f7783ad19490e7a8defb3e46f97dfe4bacae89122",
606 | "sha256:dd5de0646207f053eb0d6c74ae45ba98c3395a571a2891858e87df7c9b9bd51b",
607 | "sha256:e1d4970ea66be07ae37a3c2e48b5ec63f7ba6804bdddfdbd3cfd954d25a82e63",
608 | "sha256:e4fac90784481d221a8e4b1162afa7c47ed953be40d31ab4629ae917510051df",
609 | "sha256:fa5ae20527d8e831e8230cbffd9f8fe952815b2b7dae6ffec25318803a7528fc",
610 | "sha256:fd7f6999a8070df521b6384004ef42833b9bd62cfee11a09bda1079b4b704247",
611 | "sha256:fdc842473cd33f45ff6bce46aea678a54e3d21f1b61a7750ce3c498eedfe25d6",
612 | "sha256:fe69978f3f768926cfa37b867e3843918e012cf83f680806599ddce33c2c68b0"
613 | ],
614 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4, 3.5'",
615 | "version": "==5.4.1"
616 | },
617 | "requests": {
618 | "hashes": [
619 | "sha256:6c1246513ecd5ecd4528a0906f910e8f0f9c6b8ec72030dc9fd154dc1a6efd24",
620 | "sha256:b8aa58f8cf793ffd8782d3d8cb19e66ef36f7aba4353eec859e74678b01b07a7"
621 | ],
622 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4, 3.5'",
623 | "version": "==2.26.0"
624 | },
625 | "semantic-version": {
626 | "hashes": [
627 | "sha256:45e4b32ee9d6d70ba5f440ec8cc5221074c7f4b0e8918bdab748cc37912440a9",
628 | "sha256:d2cb2de0558762934679b9a104e82eca7af448c9f4974d1f3eeccff651df8a54"
629 | ],
630 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
631 | "version": "==2.8.5"
632 | },
633 | "setuptools": {
634 | "hashes": [
635 | "sha256:1b6bdc6161661409c5f21508763dc63ab20a9ac2f8ba20029aaaa7fdb9118012",
636 | "sha256:3050e338e5871e70c72983072fe34f6032ae1cdeeeb67338199c2f74e083a80e"
637 | ],
638 | "markers": "python_version >= '3.7'",
639 | "version": "==65.4.1"
640 | },
641 | "six": {
642 | "hashes": [
643 | "sha256:236bdbdce46e6e6a3d61a337c0f8b763ca1e8717c03b369e87a7ec7ce1319c0a",
644 | "sha256:8f3cd2e254d8f793e7f3d6d9df77b92252b52637291d0f0da013c76ea2724b6c"
645 | ],
646 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'",
647 | "version": "==1.14.0"
648 | },
649 | "termcolor": {
650 | "hashes": [
651 | "sha256:1d6d69ce66211143803fbc56652b41d73b4a400a2891d7bf7a1cdf4c02de613b"
652 | ],
653 | "version": "==1.1.0"
654 | },
655 | "texttable": {
656 | "hashes": [
657 | "sha256:42ee7b9e15f7b225747c3fa08f43c5d6c83bc899f80ff9bae9319334824076e9",
658 | "sha256:dd2b0eaebb2a9e167d1cefedab4700e5dcbdb076114eed30b58b97ed6b37d6f2"
659 | ],
660 | "version": "==1.6.4"
661 | },
662 | "urllib3": {
663 | "hashes": [
664 | "sha256:3fa96cf423e6987997fc326ae8df396db2a8b7c667747d47ddd8ecba91f4a74e",
665 | "sha256:b930dd878d5a8afb066a637fbb35144fe7901e3b209d1cd4f524bd0e9deee997"
666 | ],
667 | "markers": "python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4, 3.5' and python_version < '4'",
668 | "version": "==1.26.12"
669 | },
670 | "wcwidth": {
671 | "hashes": [
672 | "sha256:cafe2186b3c009a04067022ce1dcd79cb38d8d65ee4f4791b8888d6599d1bbe1",
673 | "sha256:ee73862862a156bf77ff92b09034fc4825dd3af9cf81bc5b360668d425f3c5f1"
674 | ],
675 | "version": "==0.1.9"
676 | },
677 | "websocket-client": {
678 | "hashes": [
679 | "sha256:2e50d26ca593f70aba7b13a489435ef88b8fc3b5c5643c1ce8808ff9b40f0b32",
680 | "sha256:d376bd60eace9d437ab6d7ee16f4ab4e821c9dae591e1b783c58ebd8aaf80c5c"
681 | ],
682 | "markers": "python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'",
683 | "version": "==0.59.0"
684 | }
685 | }
686 | }
687 |
--------------------------------------------------------------------------------
/05-deployment/app/customer_1.json:
--------------------------------------------------------------------------------
1 | {
2 | "reports": 0,
3 | "share": 0.001694,
4 | "expenditure": 0.12,
5 | "owner": "yes"
6 | }
--------------------------------------------------------------------------------
/05-deployment/app/customer_2.json:
--------------------------------------------------------------------------------
1 | {
2 | "reports": 0,
3 | "share": 0.245,
4 | "expenditure": 3.438,
5 | "owner": "yes"
6 | }
--------------------------------------------------------------------------------
/05-deployment/app/dv.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/05-deployment/app/dv.bin
--------------------------------------------------------------------------------
/05-deployment/app/eb_cloud_service.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/05-deployment/app/eb_cloud_service.png
--------------------------------------------------------------------------------
/05-deployment/app/model1.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/05-deployment/app/model1.bin
--------------------------------------------------------------------------------
/05-deployment/app/predict-test.py:
--------------------------------------------------------------------------------
1 | import requests
3 |
4 | host = "card-serving-env.eba-d4hrhyjh.eu-west-1.elasticbeanstalk.com"
5 | remote_url = f"http://{host}/predict"
6 | url = "http://localhost:9696/predict"
7 |
8 | customer = {
9 | "reports": 0,
10 | "share": 0.245,
11 | "expenditure": 3.438,
12 | "owner": "yes"
13 | }
14 |
15 | response = requests.post(remote_url, json=customer).json()
16 | print(response)
--------------------------------------------------------------------------------
/05-deployment/app/predict.py:
--------------------------------------------------------------------------------
1 | import pickle
2 |
3 | from flask import Flask
4 | from flask import request
5 | from flask import jsonify
6 |
7 |
8 | model_file = 'model2.bin'
9 | vectorizer_file = 'dv.bin'
10 | app = Flask('card_prediction')
11 |
12 | with open(model_file, 'rb') as f_in:
13 |     model = pickle.load(f_in)
14 |
15 | with open(vectorizer_file, 'rb') as f_in:
16 |     dv = pickle.load(f_in)
17 |
18 |
19 | @app.route('/predict', methods=['POST'])
20 | def predict():
21 | customer = request.get_json()
22 |
23 | X = dv.transform([customer])
24 | y_pred = model.predict_proba(X)[0, 1]
25 | card = y_pred >= 0.5
26 |
27 | result = {
28 | 'probability': float(y_pred),
29 | 'card': bool(card)
30 | }
31 |
32 | return jsonify(result)
33 |
34 |
35 | if __name__ == "__main__":
36 | app.run(debug=True, host='0.0.0.0', port=9696)
--------------------------------------------------------------------------------
/06-trees/README.md:
--------------------------------------------------------------------------------
1 | # Decision Trees and Ensemble Learning
2 |
3 | * Credit risk scoring project
4 | * Data cleaning and preparation
5 | * Decision trees
6 | * Decision tree learning algorithm
7 | * Decision trees parameter tuning
8 | * Ensemble learning and random forest
9 | * Gradient boosting and XGBoost
10 | * XGBoost parameter tuning
11 | * Selecting the best model
12 |
13 | ---------
14 |
15 | ## Session #6 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/06-trees/homework.md)
16 |
17 | The goal is to create a regression model for predicting housing prices (column 'median_house_value').
18 |
19 | ### Dataset
20 |
21 | In this homework we'll again use the California Housing Prices dataset - the same `housing.csv` we used in the regression session.
22 |
23 | You can also take it from Kaggle - [Link](https://www.kaggle.com/datasets/camnugent/california-housing-prices)
24 |
25 | ## Loading the data
26 |
27 | Use only the following columns:
28 | * `'latitude'`,
29 | * `'longitude'`,
30 | * `'housing_median_age'`,
31 | * `'total_rooms'`,
32 | * `'total_bedrooms'`,
33 | * `'population'`,
34 | * `'households'`,
35 | * `'median_income'`,
36 | * `'median_house_value'`,
37 | * `'ocean_proximity'`
38 |
39 | * Fill NAs with 0.
40 | * Apply the log transform to `median_house_value`.
41 | * Do train/validation/test split with 60%/20%/20% distribution.
42 | * Use the `train_test_split` function and set the `random_state` parameter to 1.
43 | * Use `DictVectorizer` to turn the dataframe into matrices.
44 |
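A possible sketch of the preparation steps above (assuming the dataset is saved locally as `housing.csv` and that "log transform" means `np.log1p`, as used in the course):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

columns = [
    'latitude', 'longitude', 'housing_median_age', 'total_rooms',
    'total_bedrooms', 'population', 'households', 'median_income',
    'median_house_value', 'ocean_proximity',
]

df = pd.read_csv('housing.csv')[columns].fillna(0)
df['median_house_value'] = np.log1p(df['median_house_value'])

# 60% train / 20% validation / 20% test
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

y_train = df_train.pop('median_house_value').values
y_val = df_val.pop('median_house_value').values

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.to_dict(orient='records'))
X_val = dv.transform(df_val.to_dict(orient='records'))
```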
45 |
46 | ## Question 1
47 |
48 | Let's train a decision tree regressor to predict the `median_house_value` variable.
49 |
50 | * Train a model with `max_depth=1`.
51 |
52 |
53 | Which feature is used for splitting the data?
54 |
55 | * `ocean_proximity=INLAND`
56 | * `total_rooms`
57 | * `latitude`
58 | * `population`
59 |
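One way to inspect the split, continuing from the matrices prepared above (`export_text` prints the tree using the `DictVectorizer` feature names):

```python
from sklearn.tree import DecisionTreeRegressor, export_text

dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

# the single split of a depth-1 tree answers the question
print(export_text(dt, feature_names=dv.feature_names_))
```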
60 |
61 | ## Question 2
62 |
63 | Train a random forest model with these parameters:
64 |
65 | * `n_estimators=10`
66 | * `random_state=1`
67 | * `n_jobs=-1` (optional - to make training faster)
68 |
69 |
70 | What's the RMSE of this model on validation?
71 |
72 | * 0.05
73 | * 0.25
74 | * 0.55
75 | * 0.85
76 |
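A possible way to compute it, continuing from the matrices prepared above:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

rmse = mean_squared_error(y_val, rf.predict(X_val), squared=False)
print(round(rmse, 2))
```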
77 |
78 | ## Question 3
79 |
80 | Now let's experiment with the `n_estimators` parameter
81 |
82 | * Try different values of this parameter from 10 to 200 with step 10.
83 | * Set `random_state` to `1`.
84 | * Evaluate the model on the validation dataset.
85 |
86 |
87 | After which value of `n_estimators` does RMSE stop improving?
88 |
89 | - 10
90 | - 50
91 | - 70
92 | - 150
93 |
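A sketch of the sweep (reusing the imports from the previous question):

```python
for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    rmse = mean_squared_error(y_val, rf.predict(X_val), squared=False)
    print(n, round(rmse, 3))
```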
94 |
95 | ## Question 4
96 |
97 | Let's select the best `max_depth`:
98 |
99 | * Try different values of `max_depth`: `[10, 15, 20, 25]`
100 | * For each of these values, try different values of `n_estimators` from 10 till 200 (with step 10)
101 | * Fix the random seed: `random_state=1`
102 |
103 |
104 | What's the best `max_depth`:
105 |
106 | * 10
107 | * 15
108 | * 20
109 | * 25
110 |
111 |
112 | ## Question 5
113 |
114 | We can extract feature importance information from tree-based models.
115 |
116 | At each step of the decision tree learning algorithm, it finds the best split.
117 | When doing it, we can calculate "gain" - the reduction in impurity before and after the split.
118 | This gain is quite useful for understanding which features are important
119 | for tree-based models.
120 |
121 | In Scikit-Learn, tree-based models contain this information in the
122 | [`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_)
123 | field.
124 |
125 | For this homework question, we'll find the most important feature:
126 |
127 | * Train the model with these parameters:
128 | * `n_estimators=10`,
129 | * `max_depth=20`,
130 | * `random_state=1`,
131 | * `n_jobs=-1` (optional)
132 | * Get the feature importance information from this model
133 |
134 |
135 | What's the most important feature?
136 |
137 | * `total_rooms`
138 | * `median_income`
139 | * `total_bedrooms`
140 | * `longitude`
141 |
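A possible way to get it, continuing from the matrices prepared above:

```python
rf = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# pair each DictVectorizer feature with its importance and sort
importances = sorted(
    zip(dv.feature_names_, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print(importances[:5])
```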
142 |
143 | ## Question 6
144 |
145 | Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:
146 |
147 | * Install XGBoost
148 | * Create DMatrix for train and validation
149 | * Create a watchlist
150 | * Train a model with these parameters for 100 rounds:
151 |
152 | ```python
153 | xgb_params = {
154 | 'eta': 0.3,
155 | 'max_depth': 6,
156 | 'min_child_weight': 1,
157 |
158 | 'objective': 'reg:squarederror',
159 | 'nthread': 8,
160 |
161 | 'seed': 1,
162 | 'verbosity': 1,
163 | }
164 | ```
165 |
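One way to set this up, continuing from the matrices prepared above (`xgb_params` is the dictionary shown here):

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

watchlist = [(dtrain, 'train'), (dval, 'val')]
model = xgb.train(xgb_params, dtrain, num_boost_round=100,
                  evals=watchlist, verbose_eval=10)
```
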
166 | Now change `eta` from `0.3` to `0.1`.
167 |
168 | Which eta leads to the best RMSE score on the validation dataset?
169 |
170 | * 0.3
171 | * 0.1
172 | * Both give the same
--------------------------------------------------------------------------------
/07-bento-production/README.md:
--------------------------------------------------------------------------------
1 | # Production-Ready Machine Learning (Bento ML)
2 |
3 | * Intro/Session Overview
4 | * Building Your Prediction Service with BentoML
5 | * Deploying Your Prediction Service
6 | * Sending, Receiving and Validating Data
7 | * High-Performance Serving
8 | * Bento Production Deployment
9 | * Advanced Example: Deploying Stable Diffusion Model
10 | * Summary
11 |
12 | ---------
13 |
14 | ## Session #7 Homework [link](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/07-bento-production/homework.md)
15 |
16 | The goal is to familiarize you with BentoML and how to build and test an ML production service.
17 |
18 | ## Background
19 |
20 | You are a new recruit at ACME corp. Your manager is emailing you about your first assignment.
21 |
22 | ## Email from your manager
23 |
24 | Good morning recruit! It's good to have you here! I have an assignment for you. I have a data scientist who's built a credit risk model in a Jupyter notebook. I need you to run the notebook, save the model with BentoML and see how big the model is. If it's greater than a certain size, I'm going to have to request additional resources from our infra team. Please let me know how big it is.
25 |
26 | Thanks,
27 |
28 | Mr McManager
29 |
30 |
31 | ## Question 1
32 |
33 | * Install BentoML
34 | * What's the version of BentoML you installed?
35 | * Use `--version` to find out
36 |
37 | Answer: **1.0.7**
38 |
39 | ## Question 2
40 |
41 | Run the notebook which contains the xgboost model from module 6 (the previous module) and save the xgboost model with BentoML. To make it easier for you, we have prepared this [notebook](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/07-bentoml-production/code/train.ipynb).
42 |
43 |
44 | Approximately how big is the saved BentoML model? The size can vary slightly depending on your local development environment.
45 | Choose the size closest to your model.
46 |
47 | * 924kb
48 | * 724kb
49 | * **114kb**
50 | * 8kb
51 |
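A minimal sketch of the saving step (assuming `model` is the trained Booster and `dv` the DictVectorizer from the notebook; the model name is up to you):

```python
import bentoml

bentoml.xgboost.save_model(
    "credit_risk_model",
    model,
    custom_objects={"dictVectorizer": dv},
)
```

The size then shows up in the output of `bentoml models list`.
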
52 | ## Another email from your manager
53 |
54 | Great job recruit! Looks like I won't be having to go back to the procurement team. Thanks for the information.
55 |
56 | However, I just got word from one of the teams that's using one of our ML services and they're saying our service is "broken" and they're trying to blame our model. I looked at the data they're sending and it's completely bogus. I don't want them to send bad data to us and blame us for our models. Could you write a pydantic schema for the data that they should be sending?
57 | That way next time it will tell them it's their data that's bad and not our model.
58 |
59 | Thanks,
60 |
61 | Mr McManager
62 |
63 | ## Question 3
64 |
65 | Say you have the following data that you're sending to your service:
66 |
67 | ```json
68 | {
69 | "name": "Tim",
70 | "age": 37,
71 | "country": "US",
72 | "rating": 3.14
73 | }
74 | ```
75 |
76 | What would the pydantic class look like? You can name the class `UserProfile`.
77 |
78 | Answer:
79 | ```python
80 | from pydantic import BaseModel
81 |
82 | class UserProfile(BaseModel):
83 |     name: str
84 |     age: int
85 |     country: str
86 |     rating: float
87 | ```
95 |
96 | ## Email from your CEO
97 |
98 | Good morning! I hear you're the one to go to if I need something done well! We've got a new model that a big client needs deployed ASAP. I need you to build a service with it and test it against the old model and make sure that it performs better, otherwise we're going to lose this client. All our hopes are with you!
99 |
100 | Thanks,
101 |
102 | CEO of Acme Corp
103 |
104 | ## Question 4
105 |
106 | We've prepared a model for you that you can import using:
107 |
108 | ```bash
109 | curl -O https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel.bentomodel
110 | bentoml models import coolmodel.bentomodel
111 | ```
112 |
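One way to check is to print the imported model's metadata, which includes the framework version:

```bash
bentoml models get mlzoomcamp_homework:latest
```
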
113 | What version of scikit-learn was this model trained with?
114 |
115 | * **1.1.1**
116 | * 1.1.2
117 | * 1.1.3
118 | * 1.1.4
119 | * 1.1.5
120 |
121 | ## Question 5
122 |
123 | Create a bento out of this scikit-learn model. The output type for this endpoint should be `NumpyNdarray()`
124 |
125 | Send this array to the Bento:
126 |
127 | ```
128 | [[6.4,3.5,4.5,1.2]]
129 | ```
130 |
131 | You can use curl or the Swagger UI. What value does it return?
132 |
133 | * 0
134 | * **1**
135 | * 2
136 | * 3
137 |
138 | (Make sure your environment has Scikit-Learn installed)
139 |
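A possible curl call (assuming the bento is served locally on port 3000 and the endpoint is named `classify`):

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d "[[6.4,3.5,4.5,1.2]]" \
  http://localhost:3000/classify
```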
140 |
141 | ## Question 6
142 |
143 | Make sure to serve your bento with `--production` for this question.
144 |
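For example, serving straight from the service file (assuming it is called `service.py` and exposes a `Service` object named `svc`, as in `coolmodel/service.py`):

```bash
bentoml serve service.py:svc --production
```
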
145 | Install locust using:
146 |
147 | ```bash
148 | pip install locust
149 | ```
150 |
151 | Use the following locust file: [locustfile.py](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/07-bento-production/locustfile.py)
152 |
153 | Make sure it is pointed at your bento's endpoint (update the path in case you didn't name your endpoint "classify").
154 |
155 |
156 |
157 | Configure 100 users with ramp time of 10 users per second. Click "Start Swarming" and ensure that it is working.
158 |
159 | Now download a second model with this command:
160 |
161 | ```bash
162 | curl -O https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel2.bentomodel
163 | ```
164 |
165 | Or you can download with this link as well:
166 | [https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel2.bentomodel](https://s3.us-west-2.amazonaws.com/bentoml.com/mlzoomcamp/coolmodel2.bentomodel)
167 |
168 | Now import the model:
169 |
170 | ```bash
171 | bentoml models import coolmodel2.bentomodel
172 | ```
173 |
174 | Update your bento's runner tag and test with both models. Which model allows more traffic (more throughput) as you ramp up the traffic?
175 |
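Swapping models only means changing the tag passed to `bentoml.sklearn.get` in the service file, e.g.:

```python
# the tag shown here is the one imported in this repo; use the tag from `bentoml models list`
model_ref = bentoml.sklearn.get("mlzoomcamp_homework:jsi67fslz6txydu5")
```
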
176 | **Hint 1**: Remember to restart your bento service after changing the model tag. Use Ctrl-C to stop the service between trials.
177 |
178 | **Hint 2**: Increase the number of concurrent users to see which one has higher throughput
179 |
180 | Which model has better performance at higher volumes?
181 |
182 | * **The first model**
183 | * The second model
184 |
185 |
186 | ## Email from marketing
187 |
188 | Hello ML person! I hope this email finds you well. I've heard there's this cool new ML model called Stable Diffusion. I hear if you give it a description of a picture it will generate an image. We need a new company logo and I want it to be fierce but also cool, think you could help out?
189 |
190 | Thanks,
191 |
192 | Mike Marketer
193 |
194 |
195 | ## Question 7 (optional)
196 |
197 | Go to this Bento deployment of Stable Diffusion: http://54.176.205.174/ (or deploy it yourself)
198 |
199 | Use the txt2image endpoint and update the prompt to: "A cartoon dragon with sunglasses".
200 | Don't change the seed; it should be 0 by default.
201 |
202 | What is the resulting image?
203 |
204 |
--------------------------------------------------------------------------------
/07-bento-production/coolmodel/service.py:
--------------------------------------------------------------------------------
1 | import bentoml
2 | from bentoml.io import JSON, NumpyNdarray
3 |
4 | model_ref = bentoml.sklearn.get("mlzoomcamp_homework:qtzdz3slg6mwwdu5")
5 | model_runner = model_ref.to_runner()
6 |
7 | svc = bentoml.Service("mlzoomcamp_homework", runners = [model_runner])
8 |
9 | @svc.api(input=NumpyNdarray(shape=(-1, 4), enforce_shape=True), output=NumpyNdarray())
10 | def classify(vector):
11 |     predictions = model_runner.predict.run(vector)
12 |     print(predictions)
13 |     return predictions
15 |
--------------------------------------------------------------------------------
/07-bento-production/credit_risk_service/service.py:
--------------------------------------------------------------------------------
1 | import bentoml
2 | from bentoml.io import JSON
3 | from pydantic import BaseModel
4 |
5 | class UserProfile(BaseModel):
6 | seniority: int
7 | home: str
8 | time: int
9 | age: int
10 | marital: str
11 | records: str
12 | job: str
13 | expenses: int
14 | income: float
15 | assets: float
16 | debt: float
17 | amount: int
18 | price: int
19 |
20 | model_ref = bentoml.xgboost.get("credit_risk_model:dtlts7cv4s2nbhht")
21 | dv = model_ref.custom_objects['dictVectorizer']
22 | model_runner = model_ref.to_runner()
23 |
24 | svc = bentoml.Service("credit_risk_classifier", runners = [model_runner])
25 |
26 | @svc.api(input=JSON(pydantic_model=UserProfile), output=JSON())
27 | def classify(user_profile):
28 |     application_data = user_profile.dict()
29 |     vector = dv.transform(application_data)
30 | predictions = model_runner.predict.run(vector)
31 | print(predictions)
32 | return { "pred" : predictions[0] }
33 |
--------------------------------------------------------------------------------
/07-bento-production/dragon.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/07-bento-production/dragon.jpeg
--------------------------------------------------------------------------------
/07-bento-production/locustfile.py:
--------------------------------------------------------------------------------
1 | from locust import task
2 | from locust import between
3 | from locust import HttpUser
4 |
5 | sample = [[6.4,3.5,4.5,1.2]]
6 |
7 | class MLZoomUser(HttpUser):
8 | """
9 | Usage:
10 | Start locust load testing client with:
11 |
12 | locust -H http://localhost:3000
13 |
14 | Open browser at http://0.0.0.0:8089, adjust desired number of users and spawn
15 | rate for the load test from the Web UI and start swarming.
16 | """
17 |
18 | @task
19 | def classify(self):
20 | self.client.post("/classify", json=sample)
21 |
22 | wait_time = between(0.01, 2)
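# For a fully scripted (non-interactive) run, locust can also be started headless, e.g.:
#
#   locust -H http://localhost:3000 --headless -u 50 -r 10 --run-time 1m
#
# (flag names assume locust 2.x, as pinned in requirements.txt; adjust users and spawn rate as needed)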
--------------------------------------------------------------------------------
/07-bento-production/models/credit_risk_model/dtlts7cv4s2nbhht/custom_objects.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/07-bento-production/models/credit_risk_model/dtlts7cv4s2nbhht/custom_objects.pkl
--------------------------------------------------------------------------------
/07-bento-production/models/credit_risk_model/dtlts7cv4s2nbhht/model.yaml:
--------------------------------------------------------------------------------
1 | name: credit_risk_model
2 | version: dtlts7cv4s2nbhht
3 | module: bentoml.xgboost
4 | labels: {}
5 | options:
6 | model_class: Booster
7 | metadata: {}
8 | context:
9 | framework_name: xgboost
10 | framework_versions:
11 | xgboost: 1.6.2
12 | bentoml_version: 1.0.7
13 | python_version: 3.9.13
14 | signatures:
15 | predict:
16 | batchable: false
17 | api_version: v2
18 | creation_time: '2022-10-27T10:42:54.314457+00:00'
19 |
--------------------------------------------------------------------------------
/07-bento-production/models/credit_risk_model/dtlts7cv4s2nbhht/saved_model.ubj:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/07-bento-production/models/credit_risk_model/dtlts7cv4s2nbhht/saved_model.ubj
--------------------------------------------------------------------------------
/07-bento-production/models/credit_risk_model/latest:
--------------------------------------------------------------------------------
1 | dtlts7cv4s2nbhht
--------------------------------------------------------------------------------
/07-bento-production/models/mlzoomcamp_homework/jsi67fslz6txydu5/model.yaml:
--------------------------------------------------------------------------------
1 | name: mlzoomcamp_homework
2 | version: jsi67fslz6txydu5
3 | module: bentoml.sklearn
4 | labels: {}
5 | options: {}
6 | metadata: {}
7 | context:
8 | framework_name: sklearn
9 | framework_versions:
10 | scikit-learn: 1.1.1
11 | bentoml_version: 1.0.7
12 | python_version: 3.9.12
13 | signatures:
14 | predict:
15 | batchable: true
16 | batch_dim:
17 | - 0
18 | - 0
19 | api_version: v1
20 | creation_time: '2022-10-14T14:48:43.330446+00:00'
21 |
--------------------------------------------------------------------------------
/07-bento-production/models/mlzoomcamp_homework/jsi67fslz6txydu5/saved_model.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/07-bento-production/models/mlzoomcamp_homework/jsi67fslz6txydu5/saved_model.pkl
--------------------------------------------------------------------------------
/07-bento-production/models/mlzoomcamp_homework/latest:
--------------------------------------------------------------------------------
1 | jsi67fslz6txydu5
--------------------------------------------------------------------------------
/07-bento-production/models/mlzoomcamp_homework/qtzdz3slg6mwwdu5/model.yaml:
--------------------------------------------------------------------------------
1 | name: mlzoomcamp_homework
2 | version: qtzdz3slg6mwwdu5
3 | module: bentoml.sklearn
4 | labels: {}
5 | options: {}
6 | metadata: {}
7 | context:
8 | framework_name: sklearn
9 | framework_versions:
10 | scikit-learn: 1.1.1
11 | bentoml_version: 1.0.7
12 | python_version: 3.9.12
13 | signatures:
14 | predict:
15 | batchable: false
16 | api_version: v1
17 | creation_time: '2022-10-13T20:42:14.411084+00:00'
18 |
--------------------------------------------------------------------------------
/07-bento-production/models/mlzoomcamp_homework/qtzdz3slg6mwwdu5/saved_model.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ksyula/ML_Engineering/23ef14a5d0496b69e7581879ea1e39d62bf81348/07-bento-production/models/mlzoomcamp_homework/qtzdz3slg6mwwdu5/saved_model.pkl
--------------------------------------------------------------------------------
/07-bento-production/requirements.txt:
--------------------------------------------------------------------------------
1 | pydantic==1.10.2
2 | bentoml==1.0.7
3 | scikit-learn==1.1.1
4 | gevent==20.12.1
5 | locust==2.12.2
--------------------------------------------------------------------------------
/07-bento-production/setting_up_bentoML.sh:
--------------------------------------------------------------------------------
1 | #! /bin/bash
2 |
3 | # BentoML: build, package and deploy a model as an at-scale, production-ready service
4 | # A bento packages everything related to the ML code:
5 |
6 | # static dataset
7 | # model(s)
8 | # code
9 | # configuration files
10 | # dependencies
11 | # deployment logic
12 |
13 | # Workflow:
14 |
15 | # get jupyter notebook from week 7 of mlbookcamp-code repo
16 | # first create an anaconda env and install xgboost and bentoml via pip
17 | # save the model via BentoML, serve it locally and test with Swagger UI
18 | # containerize the bento (docker is used under the hood)
19 | # serve it locally with docker and test from Swagger UI or the terminal
20 |
21 | # ignore the next block:
22 | #### Preventing automatic execution of this script if someone runs it:
23 | echo "This is not meant to be an executable shell script. Please open this file in a text editor or VSCode and run the commands one by one"
24 | echo "Press any key to exit"
25 | read -n 1 "key"
26 | exit
27 |
28 |
29 | # Start from here
30 | #### git setup
31 |
32 | # fork the mlbookcamp-code repo to your own github account, and copy the HTTPS URL and clone it to local:
33 | git clone "URL of your forked mlbookcamp"
34 | # in case you already cloned it before and need to pull new changes:
35 | cd mlbookcamp-code
36 | git switch master
37 | # add the original bookcamp code repo to remote. I call it upstream.
38 | git remote add upstream https://github.com/alexeygrigorev/mlbookcamp-code.git
39 | # if you've already added, it will list all remotes:
40 | git remote -v
41 | git pull upstream master
42 | # add changes back to your own fork of mlbookcamp-code
43 | git push
44 |
45 | # add files from the newly cloned directory:
46 | mkdir week7
47 | cd week7/
48 | cp -r /mnt/d/5-Courses/mlbookcamp-code/course-zoomcamp/07-bentoml-production/code/* .
49 |
50 |
51 | #### conda environment setup
52 |
53 | # if base is already activated, run the following to create an environment called bento:
54 | conda create -n bento jupyter numpy pandas scikit-learn
55 |
56 | # It's better to install bentoml via pip because the conda package is not actively or officially maintained; in this case, we need
57 | # to install xgboost with pip too, along with an additional package, pydantic
58 | conda activate bento
59 | pip install bentoml pydantic xgboost
60 |
61 |
62 | #### Training and saving the model with bentoML:
63 |
64 | # start jupyter notebook
65 | jupyter notebook
66 | # run train.ipynb from week 6 using the XGBoost model trained on data without feature names
67 | # details in video 7.2 and 7.5
68 | # sample train.ipynb file can be found here:
69 | # https://github.com/MemoonaTahira/MLZoomcamp2022/blob/main/Notes/Week_7-production_ready_deployment_bentoML/train.ipynb
70 | # save the model using bentoml in jupyter notebook
71 | # copy the string inside the quotes in the tag parameter
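# for example, in the notebook (a sketch mirroring train.ipynb in this folder):
#
#   import bentoml
#   saved_model = bentoml.xgboost.save_model("credit_risk_model", model,
#                                            custom_objects={"dictVectorizer": dv})
#   print(saved_model.tag)   # e.g. credit_risk_model:loqp7isqtki4aaav -- paste this tag into service.py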
72 | # install vim editor to write service.py:
73 |
74 |
75 | #### Starting the bentoML service from terminal:
76 | sudo apt update
77 | sudo apt search vim
78 | sudo apt install vim
79 | vim --version
80 |
81 | # or just use vscode with the bento conda environment, it will be less painful
82 | # vim is useful when you can only work within the terminal with no GUI
83 |
84 | # create new file called service.py and write the code from video 7.2 + 7.4 + 7.5 in it
85 | # Video 7.4 - for more info on pydantic: https://pydantic-docs.helpmanual.io/ for defining our validation schema
86 | # Vide 7.4 - for input/output type of the model: https://docs.bentoml.org/en/latest/reference/api_io_descriptors.html
87 | # sample service.py can be found here:
88 | # https://github.com/MemoonaTahira/MLZoomcamp2022/blob/main/Notes/Week_7-production_ready_deployment_bentoML/service.py
89 | # save service.py and run from terminal:
90 |
91 | bentoml serve service.py:svc --reload
92 |
93 | # this opens the Swagger UI, which is based on OpenAPI; change the URL to 127.0.0.1:3000 instead of
94 | # the default 0.0.0.0:3000 it shows in the terminal
95 | # the link opens correctly if CTRL+clicked from vscode, which translates 0.0.0.0:3000 to localhost
96 |
97 | # click on the 'POST /classify' button, and then click on 'try it' and paste the test user information json from the train.ipynb
98 | # It should show you 200 Successful Response with the final decision of the model:
99 | # i.e. the loan status is APPROVED, MAYBE or DECLINED
100 |
101 |
102 | #### Try different information commands in bentoml:
103 |
104 | # Get list of models stored in the bentoml models directory:
105 | bentoml models list
106 | # Look at the metadata of a particular model:
107 | # this was my model name and tag:
108 | bentoml models get credit_risk_model:loqp7isqtki4aaav
109 | # change the model name and tag to yours like this:
110 | # bentoml models get <model_name>:<tag>
111 | # it returns something like this:
112 |
113 | `
114 | name: credit_risk_model
115 | version: loqp7isqtki4aaav
116 | module: bentoml.xgboost
117 | labels: {}
118 | options:
119 | model_class: Booster
120 | metadata: {}
121 | context:
122 | framework_name: xgboost
123 | framework_versions:
124 | xgboost: 1.6.2
125 | bentoml_version: 1.0.7
126 | python_version: 3.10.6
127 | signatures:
128 | predict:
129 | batchable: false
130 | api_version: v2
131 | creation_time: '2022-10-20T17:12:21.082672+00:00'
132 |
133 | `
134 | # to delete a particular model:
135 | bentoml models delete credit_risk_model:<tag>
136 |
137 | #### Bento build for a single unit/package of the model:
138 |
139 | # create a new file named bentofile.yaml from vscode and populate it with the standard content
140 | # sample bentofile.yaml can be found here:
141 | # https://github.com/MemoonaTahira/MLZoomcamp2022/blob/main/Notes/Week_7-production_ready_deployment_bentoML/bentofile.yaml
142 | # here we add info on environment, packages and docker etc.
143 | # Use video 7.3 for adding content
144 | # go here for more: https://docs.bentoml.org/en/0.13-lts/concepts.html
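# a minimal bentofile.yaml sketch (field names follow the BentoML 1.0 docs; the package list is an
# assumption based on this week's service):
#
#   service: "service.py:svc"
#   include:
#     - "*.py"
#   python:
#     packages:
#       - xgboost
#       - scikit-learn
#       - pydantic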
145 |
146 | # edit and save, and then build bento:
147 | # P.S. run this command from the folder where bentofile.yaml is
148 | bentoml build
149 | # successful build will display the bentoml logo
150 |
151 | # next, cd to where the bento is built and look inside:
152 | # default location for built services: /home/mona/bentoml/bentos
153 | cd ~/bentoml/bentos/credit_risk_classifier/latest/
154 | # you can replace "latest" with any specific bento tag, e.g. qlkhqrsqv6kpeaavt
155 | # install and run tree to see what's inside:
156 | sudo apt install tree
157 | tree
158 | # it will show something like this:
159 | `
160 | .
161 | ├── README.md
162 | ├── apis
163 | │ └── openapi.yaml
164 | ├── bento.yaml
165 | ├── env
166 | │ ├── docker
167 | │ │ ├── Dockerfile
168 | │ │ └── entrypoint.sh
169 | │ └── python
170 | │ ├── install.sh
171 | │ ├── requirements.lock.txt
172 | │ ├── requirements.txt
173 | │ └── version.txt
174 | ├── models
175 | │ └── credit_risk_model
176 | │ ├── latest
177 | │ └── loqp7isqtki4aaav
178 | │ ├── custom_objects.pkl
179 | │ ├── model.yaml
180 | │ └── saved_model.ubj
181 | └── src
182 | ├── locustfile.py
183 | └── service.py
184 |
185 | `
186 |
187 | # woohoo, it created a Dockerfile for us on its own, besides a neatly structured model
188 | # environment and some src files to serve the bento.
189 |
190 | #### Load testing with locust for high performance optimization:
191 |
192 | # you don't need to manually create locustfile.py if you've run "bentoml build" already.
193 | # If not, you might need to create locustfile.py manually (sample added)
194 | # make sure you have bentoml serve --production running in one tab, in the folder from where you ran the bentoml build
195 | # command (which used bentofile.yaml to build the bento, which in turn used service.py)
196 |
197 | bentoml serve --production
198 | # open another tab and run:
199 | locust -H http://localhost:3000
200 | # do load testing.
201 | # make sure you have modified train.ipynb (and service.py) to support micro-batching in async mode to create a more robust service, if you haven't already (video 7.5)
202 |
203 |
204 | #### OPTIONAL - Speed up docker build: Only if you have an NVIDIA GPU + Windows 11 + WSL ready NVIDIA drivers
205 | # set up nvidia container runtime for building docker containers with GPU:
206 | # https://github.com/NVIDIA/nvidia-docker
207 |
208 | distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
209 | && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
210 | && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
211 | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
212 | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
213 |
214 |
215 | sudo apt-get update
216 | sudo apt-get install -y nvidia-docker2
217 | # sudo systemctl restart docker
218 | # systemctl doesn't work with WSL, just start the docker service:
219 | sudo dockerd &
220 | # hit enter once to exit the INFO message, the docker service will keep running in the background
221 | sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
222 | # once done, run docker images to get image ID and then run:
223 | docker rmi <image_id>
224 |
225 |
226 | #### Build the docker image:
227 | cd
228 | cd week7
229 | # start docker service
230 | sudo dockerd &
231 | # for some reason, the command above hangs sometimes, try pressing ENTER if nothing happens for a while
232 | # if it fails to start successfully, first run:
233 | sudo dockerd
234 | # then enter CTRL+C to exit, and then run
235 | sudo dockerd &
236 | # As far as I understand, "sudo dockerd &" lets you keep working in the same terminal: once all the INFO messages
237 | # are displayed, hit enter once to get back to the terminal prompt
238 | # next, start containerizing bentoml
239 | bentoml containerize credit_risk_classifier:qlkhqrsqv6kpeaav
240 |
241 |
242 | #### Only do this portion if it gives you an error about missing docker plugins, run this:
243 | # only trying to install the docker compose plugin didn't work: https://docs.docker.com/compose/install/linux/
244 | # more info: https://docs.docker.com/engine/install/ubuntu/#set-up-the-repository
245 |
246 | sudo apt-get update
247 |
248 | sudo apt-get install \
249 | ca-certificates \
250 | curl \
251 | gnupg \
252 | lsb-release
253 |
254 | curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
255 |
256 | echo \
257 | "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
258 | $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
259 |
260 | sudo apt-get update
261 | # it will upgrade docker compose (if it is already up to date, it won't change it) and install missing components
262 | sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
263 |
264 | # you are good to go. Run the build command again.
265 | bentoml containerize credit_risk_classifier:qlkhqrsqv6kpeaav
266 |
267 | #### Run with docker to serve the bento and make predictions via Swagger UI!
268 | docker run -it --rm -p 3000:3000 credit_risk_classifier:qlkhqrsqv6kpeaav
269 |
270 | # it says it will start service on http://0.0.0.0:3000 but change to 127.0.0.1:3000
271 | # opens correctly if link CTRL+clicked from vscode, translates 0.0.0.0:3000 to localhost
272 |
273 | # It will open the swagger UI once again:
274 | # Repeat steps from before:
275 |
276 | # click on the 'POST /classify' button, and then click on 'try it' and paste the test user information json from the train.ipynb
277 | # sample user data:
278 | `
279 | {
280 | "seniority": 3,
281 | "home": "owner",
282 | "time": 36,
283 | "age": 26,
284 | "marital": "single",
285 | "records": "no",
286 | "job": "freelance",
287 | "expenses": 35,
288 | "income": 0.0,
289 | "assets": 60000.0,
290 | "debt": 3000.0,
291 | "amount": 800,
292 | "price": 1000
293 | }
294 | `
295 | # It should show you 200 Successful Response with the final decision of the model:
296 | `
297 | {
298 | "status": "MAYBE"
299 | }
300 | `
301 | # For any other test user, the response is either APPROVED, MAYBE or DECLINED
302 |
303 |
304 | #### Run with docker to serve the bento and make predictions via terminal!
305 |
306 | # instead of opening the 127.0.0.1:3000 URL, open another terminal and use curl:
307 |
308 | cd week7
309 | conda activate bento
310 |
311 | # then paste this:
312 |
313 | curl -X 'POST' \
314 | 'http://127.0.0.1:3000/classify' \
315 | -H 'accept: application/json' \
316 | -H 'Content-Type: application/json' \
317 | -d '{
318 | "seniority": 3,
319 | "home": "owner",
320 | "time": 36,
321 | "age": 26,
322 | "marital": "single",
323 | "records": "no",
324 | "job": "freelance",
325 | "expenses": 35,
326 | "income": 0.0,
327 | "assets": 60000.0,
328 | "debt": 3000.0,
329 | "amount": 800,
330 | "price": 1000
331 | }'
332 |
333 | #It should return the prediction from the model as either APPROVED, MAYBE or DECLINED like this:
334 | `
335 | {"status":"MAYBE"}
336 | `
337 |
338 |
339 |
340 |
341 |
--------------------------------------------------------------------------------
/07-bento-production/train.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "d04e7bea",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import pandas as pd\n",
11 | "import numpy as np\n",
12 | "\n",
13 | "from sklearn.model_selection import train_test_split\n",
14 | "from sklearn.feature_extraction import DictVectorizer\n",
15 | "\n",
16 | "from sklearn.ensemble import RandomForestClassifier\n",
17 | "\n",
18 | "import xgboost as xgb"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "id": "d37324f2",
24 | "metadata": {},
25 | "source": [
26 | "### Data preparation"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": 2,
32 | "id": "49807b73",
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-06-trees/CreditScoring.csv'\n",
37 | "df = pd.read_csv(data)"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 3,
43 | "id": "28b3fadb",
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "df.columns = df.columns.str.lower()\n",
48 | "\n",
49 | "status_values = {\n",
50 | " 1: 'ok',\n",
51 | " 2: 'default',\n",
52 | " 0: 'unk'\n",
53 | "}\n",
54 | "\n",
55 | "df.status = df.status.map(status_values)\n",
56 | "\n",
57 | "home_values = {\n",
58 | " 1: 'rent',\n",
59 | " 2: 'owner',\n",
60 | " 3: 'private',\n",
61 | " 4: 'ignore',\n",
62 | " 5: 'parents',\n",
63 | " 6: 'other',\n",
64 | " 0: 'unk'\n",
65 | "}\n",
66 | "\n",
67 | "df.home = df.home.map(home_values)\n",
68 | "\n",
69 | "marital_values = {\n",
70 | " 1: 'single',\n",
71 | " 2: 'married',\n",
72 | " 3: 'widow',\n",
73 | " 4: 'separated',\n",
74 | " 5: 'divorced',\n",
75 | " 0: 'unk'\n",
76 | "}\n",
77 | "\n",
78 | "df.marital = df.marital.map(marital_values)\n",
79 | "\n",
80 | "records_values = {\n",
81 | " 1: 'no',\n",
82 | " 2: 'yes',\n",
83 | " 0: 'unk'\n",
84 | "}\n",
85 | "\n",
86 | "df.records = df.records.map(records_values)\n",
87 | "\n",
88 | "job_values = {\n",
89 | " 1: 'fixed',\n",
90 | " 2: 'partime',\n",
91 | " 3: 'freelance',\n",
92 | " 4: 'others',\n",
93 | " 0: 'unk'\n",
94 | "}\n",
95 | "\n",
96 | "df.job = df.job.map(job_values)\n",
97 | "\n",
98 | "for c in ['income', 'assets', 'debt']:\n",
99 | " df[c] = df[c].replace(to_replace=99999999, value=np.nan)\n",
100 | "\n",
101 | "df = df[df.status != 'unk'].reset_index(drop=True)"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 4,
107 | "id": "4fd52ad9",
108 | "metadata": {},
109 | "outputs": [],
110 | "source": [
111 | "df_train, df_test = train_test_split(df, test_size=0.2, random_state=11)\n",
112 | "\n",
113 | "df_train = df_train.reset_index(drop=True)\n",
114 | "df_test = df_test.reset_index(drop=True)\n",
115 | "\n",
116 | "y_train = (df_train.status == 'default').astype('int').values\n",
117 | "y_test = (df_test.status == 'default').astype('int').values\n",
118 | "\n",
119 | "del df_train['status']\n",
120 | "del df_test['status']"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 5,
126 | "id": "5fe56815",
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "dv = DictVectorizer(sparse=False)\n",
131 | "\n",
132 | "train_dicts = df_train.fillna(0).to_dict(orient='records')\n",
133 | "X_train = dv.fit_transform(train_dicts)\n",
134 | "\n",
135 | "test_dicts = df_test.fillna(0).to_dict(orient='records')\n",
136 | "X_test = dv.transform(test_dicts)"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "id": "1fb68649",
142 | "metadata": {},
143 | "source": [
144 | "### Random forest"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": 6,
150 | "id": "a84fa9d2",
151 | "metadata": {},
152 | "outputs": [
153 | {
154 | "data": {
155 | "text/html": [
156 | "RandomForestClassifier(max_depth=10, min_samples_leaf=3, n_estimators=200,\n",
157 | "              random_state=1)"
159 | ],
160 | "text/plain": [
161 | "RandomForestClassifier(max_depth=10, min_samples_leaf=3, n_estimators=200,\n",
162 | " random_state=1)"
163 | ]
164 | },
165 | "execution_count": 6,
166 | "metadata": {},
167 | "output_type": "execute_result"
168 | }
169 | ],
170 | "source": [
171 | "rf = RandomForestClassifier(n_estimators=200,\n",
172 | " max_depth=10,\n",
173 | " min_samples_leaf=3,\n",
174 | " random_state=1)\n",
175 | "rf.fit(X_train, y_train)"
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "id": "05f1bb34",
181 | "metadata": {},
182 | "source": [
183 | "### XGBoost\n",
184 | "\n",
185 | "Note:\n",
186 | "\n",
187 | "We removed feature names\n",
188 | "\n",
189 | "It was \n",
190 | "\n",
191 | "```python\n",
192 | "features = dv.get_feature_names_out()\n",
193 | "dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)\n",
194 | "```\n",
195 | "\n",
196 | "Now it's\n",
197 | "\n",
198 | "```python\n",
199 | "dtrain = xgb.DMatrix(X_train, label=y_train)\n",
200 | "```"
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": 10,
206 | "id": "63185f7a",
207 | "metadata": {},
208 | "outputs": [],
209 | "source": [
210 | "dtrain = xgb.DMatrix(X_train, label=y_train)"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": 11,
216 | "id": "d1e284f4",
217 | "metadata": {},
218 | "outputs": [],
219 | "source": [
220 | "xgb_params = {\n",
221 | " 'eta': 0.1, \n",
222 | " 'max_depth': 3,\n",
223 | " 'min_child_weight': 1,\n",
224 | "\n",
225 | " 'objective': 'binary:logistic',\n",
226 | " 'eval_metric': 'auc',\n",
227 | "\n",
228 | " 'nthread': 8,\n",
229 | " 'seed': 1,\n",
230 | " 'verbosity': 1,\n",
231 | "}\n",
232 | "\n",
233 | "model = xgb.train(xgb_params, dtrain, num_boost_round=175)"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "id": "23ae12d0",
239 | "metadata": {},
240 | "source": [
241 | "### BentoML"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 12,
247 | "id": "7a230459",
248 | "metadata": {},
249 | "outputs": [],
250 | "source": [
251 | "import bentoml"
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": 13,
257 | "id": "0ea2ca87",
258 | "metadata": {},
259 | "outputs": [
260 | {
261 | "data": {
262 | "text/plain": [
263 | "Model(tag=\"credit_risk_model:dtlts7cv4s2nbhht\", path=\"/Users/ksenia/bentoml/models/credit_risk_model/dtlts7cv4s2nbhht/\")"
264 | ]
265 | },
266 | "execution_count": 13,
267 | "metadata": {},
268 | "output_type": "execute_result"
269 | }
270 | ],
271 | "source": [
272 | "bentoml.xgboost.save_model(\n",
273 | " 'credit_risk_model',\n",
274 | " model,\n",
275 | " custom_objects={\n",
276 | " 'dictVectorizer': dv\n",
277 | " })"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "id": "a5151ea7",
283 | "metadata": {},
284 | "source": [
285 | "Test"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 11,
291 | "id": "492f90ec",
292 | "metadata": {},
293 | "outputs": [],
294 | "source": [
295 | "import json"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": 12,
301 | "id": "ed5efc41",
302 | "metadata": {},
303 | "outputs": [
304 | {
305 | "name": "stdout",
306 | "output_type": "stream",
307 | "text": [
308 | "{\n",
309 | " \"seniority\": 3,\n",
310 | " \"home\": \"owner\",\n",
311 | " \"time\": 36,\n",
312 | " \"age\": 26,\n",
313 | " \"marital\": \"single\",\n",
314 | " \"records\": \"no\",\n",
315 | " \"job\": \"freelance\",\n",
316 | " \"expenses\": 35,\n",
317 | " \"income\": 0.0,\n",
318 | " \"assets\": 60000.0,\n",
319 | " \"debt\": 3000.0,\n",
320 | " \"amount\": 800,\n",
321 | " \"price\": 1000\n",
322 | "}\n"
323 | ]
324 | }
325 | ],
326 | "source": [
327 | "request = df_test.iloc[0].to_dict()\n",
328 | "print(json.dumps(request, indent=2))"
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": null,
334 | "id": "ec2744ae",
335 | "metadata": {},
336 | "outputs": [],
337 | "source": []
338 | }
339 | ],
340 | "metadata": {
341 | "kernelspec": {
342 | "display_name": "Python 3 (ipykernel)",
343 | "language": "python",
344 | "name": "python3"
345 | },
346 | "language_info": {
347 | "codemirror_mode": {
348 | "name": "ipython",
349 | "version": 3
350 | },
351 | "file_extension": ".py",
352 | "mimetype": "text/x-python",
353 | "name": "python",
354 | "nbconvert_exporter": "python",
355 | "pygments_lexer": "ipython3",
356 | "version": "3.9.13"
357 | },
358 | "toc": {
359 | "base_numbering": 1,
360 | "nav_menu": {},
361 | "number_sections": true,
362 | "sideBar": true,
363 | "skip_h1_title": true,
364 | "title_cell": "Table of Contents",
365 | "title_sidebar": "Contents",
366 | "toc_cell": false,
367 | "toc_position": {},
368 | "toc_section_display": true,
369 | "toc_window_display": false
370 | },
371 | "vscode": {
372 | "interpreter": {
373 | "hash": "534cd37feadae63edbd1d76ee7605c25f8931687006409bb4f94689dd24a9518"
374 | }
375 | }
376 | },
377 | "nbformat": 4,
378 | "nbformat_minor": 5
379 | }
380 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning Engineering Zoomcamp 2022
2 |
3 | 
4 |
5 | ## Homeworks, Midterm, & Capstone Project
6 | ### Progress:
7 | | Id | Module Session | Progress | Deadline | Link |
8 | |----|-----------------------------------------------|----------|--------------|--------------------|
9 | |01 | Introduction to Machine Learning | :white_check_mark: | 12/09/2022 | [Intro] |
10 | |02 | Machine Learning for Regression | :white_check_mark: | 19/09/2022 | [Regression]|
11 | |03 | Machine Learning for Classification | :white_check_mark: | 26/09/2022 | [Classification]|
12 | |04 | Evaluation Metrics for Classification | :white_check_mark: | 03/10/2022 | [Evaluation]|
13 | |05 | Deploying Machine Learning Models | :white_check_mark: | 10/10/2022 | [Deployment]|
14 | |06 | Decision Trees and Ensemble Learning | :white_check_mark: | 17/10/2022 | [Trees]|
15 | |07 | Bento ML | :white_check_mark: | 24/10/2022 | [BentoML]|
16 |
17 | On Kaggle - [link](https://www.kaggle.com/ksyuleg)
18 |
19 | The original course content is [here](https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp), courtesy of [Alexey Grigorev](https://github.com/alexeygrigorev).
20 |
--------------------------------------------------------------------------------
/midterm_project/README.md:
--------------------------------------------------------------------------------
1 | # Fraud Detection end-to-end Machine Learning Project
2 |
3 | The goal of the project is to build a fraud detection API that can be called to predict whether or not a transaction is fraudulent.
4 |
5 | The project follows CRISP-DM (Cross-Industry Standard Process for Data Mining), an open standard guide that describes the main phases of the data mining lifecycle and is widely used by data mining experts.
6 |
7 | ### Business Understanding
8 | 1. Determine business objectives
9 | 2. Determine data mining goals
10 | 3. Produce project plan
11 | ### Data Understanding
12 | 1. Collect data
13 | 2. Describe data
14 | 3. Explore data
15 | ### Data Preparation
16 | 1. Data selection
17 | 2. Data preprocessing
18 | 3. Feature engineering
19 | ### Data Modeling
20 | 1. Select modeling technique
21 | 2. Generate test design
22 | 3. Build model
23 | 4. Assess model
24 | ### Data Evaluation
25 | 1. Evaluate Result
26 | 2. Review Process
27 | 3. Determine next steps
28 |
29 |
30 |
--------------------------------------------------------------------------------
/midterm_project/app/README.md:
--------------------------------------------------------------------------------
1 | # Fraud Detection API
2 |
3 | Build a REST API that predicts whether a given transaction is fraudulent or not. All API calls should be stored in order to engineer features relevant to finding fraud. Every API call includes a time step of the transaction.
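
A minimal sketch of the planned endpoint (Flask and the field names are assumptions; model loading, feature engineering and persistent storage of calls are left as placeholders):

```python
from flask import Flask, request, jsonify

app = Flask("fraud-detection")
transaction_log = []  # placeholder: API calls should be persisted for feature engineering

@app.route("/predict", methods=["POST"])
def predict():
    transaction = request.get_json()
    transaction_log.append(transaction)  # store the call, including its time step
    is_fraud = False                     # placeholder for the trained model's prediction
    return jsonify({"transaction_is_fraud": is_fraud})

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=9696)
```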
--------------------------------------------------------------------------------
/midterm_project/data/data.csv.gz:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:e26bbc07745ce8ad047753c41d7b261797895757603f9ccd58427da11522f847
3 | size 181894042
4 |
--------------------------------------------------------------------------------
/midterm_project/notebooks/01-EDA.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "18e71fd3",
6 | "metadata": {},
7 | "source": [
8 | "This is a synthetic dataset for a mobile payments application. Each transaction has a sender and a recipient and is tagged as fraud or not fraud.\n",
9 | "\n",
10 | "The task is to build a fraud detection API that can be called to predict whether or not a transaction is fraudulent. "
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": null,
16 | "id": "c6e4c45a",
17 | "metadata": {},
18 | "outputs": [],
19 | "source": []
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": null,
24 | "id": "3d81c9da",
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "import pandas as pd"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": null,
34 | "id": "4341cbac",
35 | "metadata": {},
36 | "outputs": [],
37 | "source": []
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 8,
42 | "id": "ffb1de62",
43 | "metadata": {},
44 | "outputs": [
45 | {
46 | "data": {
47 | "text/html": [
48 | "DataFrame head (HTML table output; same content as the text/plain representation below)"
147 | ],
148 | "text/plain": [
149 | " step type amount nameOrig oldbalanceOrig newbalanceOrig \\\n",
150 | "0 1 PAYMENT 9839.64 C1231006815 170136.0 160296.36 \n",
151 | "1 1 PAYMENT 1864.28 C1666544295 21249.0 19384.72 \n",
152 | "2 1 TRANSFER 181.00 C1305486145 181.0 0.00 \n",
153 | "3 1 CASH_OUT 181.00 C840083671 181.0 0.00 \n",
154 | "4 1 PAYMENT 11668.14 C2048537720 41554.0 29885.86 \n",
155 | "\n",
156 | " nameDest oldbalanceDest newbalanceDest isFraud \n",
157 | "0 M1979787155 0.0 0.0 0 \n",
158 | "1 M2044282225 0.0 0.0 0 \n",
159 | "2 C553264065 0.0 0.0 1 \n",
160 | "3 C38997010 21182.0 0.0 1 \n",
161 | "4 M1230701703 0.0 0.0 0 "
162 | ]
163 | },
164 | "execution_count": 8,
165 | "metadata": {},
166 | "output_type": "execute_result"
167 | }
168 | ],
169 | "source": [
170 | "data = pd.read_csv(\"../data/data.csv.gz\", compression = \"gzip\")\n",
171 | "data.head()"
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": null,
177 | "id": "c1992bb0",
178 | "metadata": {},
179 | "outputs": [],
180 | "source": []
181 | }
182 | ],
183 | "metadata": {
184 | "kernelspec": {
185 | "display_name": "Python 3 (ipykernel)",
186 | "language": "python",
187 | "name": "python3"
188 | },
189 | "language_info": {
190 | "codemirror_mode": {
191 | "name": "ipython",
192 | "version": 3
193 | },
194 | "file_extension": ".py",
195 | "mimetype": "text/x-python",
196 | "name": "python",
197 | "nbconvert_exporter": "python",
198 | "pygments_lexer": "ipython3",
199 | "version": "3.9.13"
200 | },
201 | "toc": {
202 | "base_numbering": 1,
203 | "nav_menu": {},
204 | "number_sections": true,
205 | "sideBar": true,
206 | "skip_h1_title": true,
207 | "title_cell": "Table of Contents",
208 | "title_sidebar": "Contents",
209 | "toc_cell": false,
210 | "toc_position": {},
211 | "toc_section_display": true,
212 | "toc_window_display": false
213 | }
214 | },
215 | "nbformat": 4,
216 | "nbformat_minor": 5
217 | }
218 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | ###### Requirements with Version Specifiers ######
2 | jupyter == 6.4.12
3 | numpy == 1.21.5
4 | pandas == 1.4.3
5 | scikit-learn == 1.1.1
6 | seaborn == 0.11.2
7 | matplotlib == 3.5.2
--------------------------------------------------------------------------------