├── Programming_assignment_week_1_Pandas_basics
│   ├── PandasBasics.ipynb
│   └── grader.py
├── Programming_assignment_week_2_Data_leakages
│   └── Data_leakages.ipynb
├── Programming_assignment_week_3_Mean_encodings
│   ├── Programming_assignment_week_3.ipynb
│   └── grader.py
├── Programming_assignment_week_4_Catboost
│   ├── catboost_notebook_v2.ipynb
│   └── grader_v2.py
├── Programming_assignment_week_4_Ensembles
│   ├── Programming_assignment_week_4.ipynb
│   └── grader.py
├── Programming_assignment_week_4_KNN_features
│   ├── compute_KNN_features.ipynb
│   └── grader.py
├── README.md
├── Reading_materials
│   ├── EDA_Springleaf_screencast.ipynb
│   ├── EDA_video2.ipynb
│   ├── EDA_video3_screencast.ipynb
│   ├── GBM_drop_tree.ipynb
│   ├── Hyperparameters_tuning_video2_RF_n_estimators.ipynb
│   ├── Macros.ipynb
│   ├── Metrics_video2_constants_for_MSE_and_MAE.ipynb
│   ├── Metrics_video3_weighted_median.ipynb
│   └── Metrics_video8_soft_kappa_xgboost.ipynb
└── readonly
    ├── KNN_features_data
    │   ├── X.npz
    │   ├── X_test.npz
    │   ├── Y.npy
    │   ├── Y_test.npy
    │   └── knn_feats_test_first50.npy
    ├── data_leakages_data
    │   └── test_pairs.csv
    └── final_project_data
        ├── item_categories.csv
        ├── items.csv
        ├── sales_train.csv.gz
        ├── sample_submission.csv.gz
        ├── shops.csv
        └── test.csv.gz
/Programming_assignment_week_1_Pandas_basics/PandasBasics.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Version 1.0.3"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Pandas basics "
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "Hi! In this programming assignment you need to refresh your `pandas` knowledge. You will need to do several [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)s and [`join`]()`s to solve the task. "
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "metadata": {
28 | "collapsed": true
29 | },
30 | "outputs": [],
31 | "source": [
32 | "import pandas as pd\n",
33 | "import numpy as np\n",
34 | "import os\n",
35 | "import matplotlib.pyplot as plt\n",
36 | "%matplotlib inline \n",
37 | "\n",
38 | "from grader import Grader"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {
45 | "collapsed": true
46 | },
47 | "outputs": [],
48 | "source": [
49 | "DATA_FOLDER = '../readonly/final_project_data/'\n",
50 | "\n",
51 | "transactions = pd.read_csv(os.path.join(DATA_FOLDER, 'sales_train.csv.gz'))\n",
52 | "items = pd.read_csv(os.path.join(DATA_FOLDER, 'items.csv'))\n",
53 | "item_categories = pd.read_csv(os.path.join(DATA_FOLDER, 'item_categories.csv'))\n",
54 | "shops = pd.read_csv(os.path.join(DATA_FOLDER, 'shops.csv'))"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "The dataset we are going to use is taken from the competition, that serves as the final project for this course. You can find complete data description at the [competition web page](https://www.kaggle.com/c/competitive-data-science-final-project/data). To join the competition use [this link](https://www.kaggle.com/t/1ea93815dca248e99221df42ebde3540)."
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "## Grading"
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "We will create a grader instace below and use it to collect your answers. When function `submit_tag` is called, grader will store your answer *locally*. The answers will *not* be submited to the platform immediately so you can call `submit_tag` function as many times as you need. \n",
76 | "\n",
77 | "When you are ready to push your answers to the platform you should fill your credentials and run `submit` function in the last paragraph of the assignment."
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": null,
83 | "metadata": {
84 | "collapsed": true
85 | },
86 | "outputs": [],
87 | "source": [
88 | "grader = Grader()"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "# Task"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "Let's start with a simple task. \n",
103 | "\n",
104 | "
\n",
105 | " - Print the shape of the loaded dataframes and use [`df.head`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) function to print several rows. Examine the features you are given.
\n",
106 | "
"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": null,
112 | "metadata": {
113 | "collapsed": true
114 | },
115 | "outputs": [],
116 | "source": [
117 | "# YOUR CODE GOES HERE"
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "Now use your `pandas` skills to get answers for the following questions. \n",
125 | "The first question is:\n",
126 | "\n",
127 | "1. ** What was the maximum total revenue among all the shops in September, 2014?** \n",
128 | "\n",
129 | "\n",
130 | "* Hereinafter *revenue* refers to total sales minus value of goods returned.\n",
131 | "\n",
132 | "*Hints:*\n",
133 | "\n",
134 | "* Sometimes items are returned, find such examples in the dataset. \n",
135 | "* It is handy to split `date` field into [`day`, `month`, `year`] components and use `df.year == 14` and `df.month == 9` in order to select target subset of dates.\n",
136 | "* You may work with `date` feature as with strings, or you may first convert it to `pd.datetime` type with `pd.to_datetime` function, but do not forget to set correct `format` argument."
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": null,
142 | "metadata": {
143 | "collapsed": true
144 | },
145 | "outputs": [],
146 | "source": [
147 | "# YOUR CODE GOES HERE\n",
148 | "\n",
149 | "max_revenue = # PUT YOUR ANSWER IN THIS VARIABLE\n",
150 | "grader.submit_tag('max_revenue', max_revenue)"
151 | ]
152 | },
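A minimal sketch of one possible approach (not the graded solution) — it assumes the `transactions` dataframe loaded above, the standard `1C` columns (`date`, `shop_id`, `item_price`, `item_cnt_day`), and the `dd.mm.yyyy` date format of `sales_train.csv`:

```python
# Hedged sketch: revenue per shop in September 2014.
transactions['date'] = pd.to_datetime(transactions['date'], format='%d.%m.%Y')

sept14 = transactions[(transactions['date'].dt.year == 2014) &
                      (transactions['date'].dt.month == 9)]

# Revenue per row is price * count; returns carry a negative `item_cnt_day`,
# so the sum already subtracts the value of returned goods.
shop_revenue = (sept14['item_price'] * sept14['item_cnt_day']).groupby(sept14['shop_id']).sum()
max_revenue = shop_revenue.max()
```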
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "Great! Let's move on and answer another question:\n",
158 | "\n",
159 | "\n",
160 | " - What item category generated the highest revenue in summer 2014?
\n",
161 | "
\n",
162 | "\n",
163 | "* Submit `id` of the category found.\n",
164 | " \n",
165 | "* Here we call \"summer\" the period from June to August.\n",
166 | "\n",
167 | "*Hints:*\n",
168 | "\n",
169 | "* Note, that for an object `x` of type `pd.Series`: `x.argmax()` returns **index** of the maximum element. `pd.Series` can have non-trivial index (not `[1, 2, 3, ... ]`)."
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": null,
175 | "metadata": {
176 | "collapsed": true
177 | },
178 | "outputs": [],
179 | "source": [
180 | "# YOUR CODE GOES HERE\n",
181 | "\n",
182 | "category_id_with_max_revenue = # PUT YOUR ANSWER IN THIS VARIABLE\n",
183 | "grader.submit_tag('category_id_with_max_revenue', category_id_with_max_revenue)"
184 | ]
185 | },
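A sketch along the same lines, additionally assuming the `items` dataframe (which maps `item_id` to `item_category_id`) and the parsed `date` column from the previous sketch:

```python
# Hedged sketch: revenue per item category in June-August 2014.
summer = transactions[(transactions['date'].dt.year == 2014) &
                      (transactions['date'].dt.month.isin([6, 7, 8]))]
summer = summer.merge(items[['item_id', 'item_category_id']], on='item_id', how='left')

cat_revenue = (summer['item_price'] * summer['item_cnt_day']).groupby(summer['item_category_id']).sum()
# idxmax returns the index label (here: the category id) of the maximum value.
category_id_with_max_revenue = cat_revenue.idxmax()
```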
186 | {
187 | "cell_type": "markdown",
188 | "metadata": {},
189 | "source": [
190 | "\n",
191 | " - How many items are there, such that their price stays constant (to the best of our knowledge) during the whole period of time?
\n",
192 | "
\n",
193 | "\n",
194 | "* Let's assume, that the items are returned for the same price as they had been sold."
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": null,
200 | "metadata": {
201 | "collapsed": true
202 | },
203 | "outputs": [],
204 | "source": [
205 | "# YOUR CODE GOES HERE\n",
206 | "\n",
207 | "num_items_constant_price = # PUT YOUR ANSWER IN THIS VARIABLE\n",
208 | "grader.submit_tag('num_items_constant_price', num_items_constant_price)"
209 | ]
210 | },
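One way to read "price stays constant" is that an item has exactly one unique `item_price` value across all of its transactions; a sketch under that assumption:

```python
# Hedged sketch: count items with a single unique price in the data.
num_items_constant_price = (transactions.groupby('item_id')['item_price']
                                        .nunique() == 1).sum()
```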
211 | {
212 | "cell_type": "markdown",
213 | "metadata": {},
214 | "source": [
215 | "Remember, the data can sometimes be noisy."
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "\n",
223 | " - What was the variance of the number of sold items per day sequence for the shop with `shop_id = 25` in December, 2014? Do not count the items, that were sold but returned back later.
\n",
224 | "
\n",
225 | "\n",
226 | "* Fill `total_num_items_sold` and `days` arrays, and plot the sequence with the code below.\n",
227 | "* Then compute variance. Remember, there can be differences in how you normalize variance (biased or unbiased estimate, see [link](https://math.stackexchange.com/questions/496627/the-difference-between-unbiased-biased-estimator-variance)). Compute ***unbiased*** estimate (use the right value for `ddof` argument in `pd.var` or `np.var`). \n",
228 | "* If there were no sales at a given day, ***do not*** impute missing value with zero, just ignore that day"
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": null,
234 | "metadata": {
235 | "collapsed": true
236 | },
237 | "outputs": [],
238 | "source": [
239 | "shop_id = 25\n",
240 | "\n",
241 | "total_num_items_sold = # YOUR CODE GOES HERE\n",
242 | "days = # YOUR CODE GOES HERE\n",
243 | "\n",
244 | "# Plot it\n",
245 | "plt.plot(days, total_num_items_sold)\n",
246 | "plt.ylabel('Num items')\n",
247 | "plt.xlabel('Day')\n",
248 | "plt.title(\"Daily revenue for shop_id = 25\")\n",
249 | "plt.show()\n",
250 | "\n",
251 | "total_num_items_sold_var = # PUT YOUR ANSWER IN THIS VARIABLE\n",
252 | "grader.submit_tag('total_num_items_sold_var', total_num_items_sold_var)"
253 | ]
254 | },
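A sketch, assuming the parsed `date` column from earlier; summing `item_cnt_day` (returns are negative) is one way to read "do not count items that were sold but returned", and days without sales simply never appear in the groupby result:

```python
# Hedged sketch: daily totals for shop 25 in December 2014, then the variance.
dec = transactions[(transactions['shop_id'] == shop_id) &
                   (transactions['date'].dt.year == 2014) &
                   (transactions['date'].dt.month == 12)]
daily = dec.groupby(dec['date'].dt.day)['item_cnt_day'].sum()

days = daily.index.values
total_num_items_sold = daily.values
total_num_items_sold_var = daily.var(ddof=1)  # ddof=1 gives the unbiased estimate
```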
255 | {
256 | "cell_type": "markdown",
257 | "metadata": {},
258 | "source": [
259 | "## Authorization & Submission\n",
260 | "To submit assignment to Cousera platform, please, enter your e-mail and token into the variables below. You can generate token on the programming assignment page. *Note:* Token expires 30 minutes after generation."
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": null,
266 | "metadata": {
267 | "collapsed": true
268 | },
269 | "outputs": [],
270 | "source": [
271 | "STUDENT_EMAIL = # EMAIL HERE\n",
272 | "STUDENT_TOKEN = # TOKEN HERE\n",
273 | "grader.status()"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {
280 | "collapsed": true
281 | },
282 | "outputs": [],
283 | "source": [
284 | "grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)"
285 | ]
286 | },
287 | {
288 | "cell_type": "markdown",
289 | "metadata": {},
290 | "source": [
291 | "Well done! :)"
292 | ]
293 | }
294 | ],
295 | "metadata": {
296 | "hw_version": "1.0.0",
297 | "kernelspec": {
298 | "display_name": "Python 3",
299 | "language": "python",
300 | "name": "python3"
301 | },
302 | "language_info": {
303 | "codemirror_mode": {
304 | "name": "ipython",
305 | "version": 3
306 | },
307 | "file_extension": ".py",
308 | "mimetype": "text/x-python",
309 | "name": "python",
310 | "nbconvert_exporter": "python",
311 | "pygments_lexer": "ipython3",
312 | "version": "3.6.2"
313 | }
314 | },
315 | "nbformat": 4,
316 | "nbformat_minor": 2
317 | }
318 |
--------------------------------------------------------------------------------
/Programming_assignment_week_1_Pandas_basics/grader.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 | import numpy as np
4 | from collections import OrderedDict
5 |
6 | def array_to_hash(x):
7 | x_tupled = None
8 | if type(x) == list:
9 | x_tupled = tuple(x)
10 | elif type(x) == np.ndarray:
11 | x_tupled = tuple(list(x.flatten()))
12 | elif type(x) == tuple:
13 | x_tupled = x
14 | else:
15 | raise RuntimeError('unexpected type of input: {}'.format(type(x)))
16 | return hash(tuple(map(float, x_tupled)))
17 |
18 | def almostEqual(x, y):
19 | return abs(x - y) < 1e-3
20 |
21 |
22 | class Grader(object):
23 | def __init__(self):
24 | self.submission_page = 'https://hub.coursera-apps.org/api/onDemandProgrammingScriptSubmissions.v1'
25 | self.assignment_key = 'S1UqVXp-EeelpgpYPAO2Og'
26 | self.parts = OrderedDict([
27 | ('edAEq', 'max_revenue'),
28 | ('Xn0Ec', 'category_id_with_max_revenue'),
29 | ('CZDVZ', 'num_items_constant_price'),
30 | ('HlAjc', 'total_num_items_sold_var')])
31 | self.answers = {key: None for key in self.parts}
32 |
33 | @staticmethod
34 | def ravel_output(output):
35 | '''
 36 | If a student accidentally submitted an np.array with one
 37 | element instead of a number, this function will submit
 38 | that number instead.
39 | '''
40 | if isinstance(output, np.ndarray) and output.size == 1:
41 | output = output.item(0)
42 | return output
43 |
44 | def submit(self, email, token):
45 | submission = {
46 | "assignmentKey": self.assignment_key,
47 | "submitterEmail": email,
48 | "secret": token,
49 | "parts": {}
50 | }
51 | for part, output in self.answers.items():
52 | if output is not None:
53 | submission["parts"][part] = {"output": output}
54 | else:
55 | submission["parts"][part] = dict()
56 | request = requests.post(self.submission_page, data=json.dumps(submission))
57 | response = request.json()
58 | if request.status_code == 201:
59 | print('Submitted to Coursera platform. See results on assignment page!')
60 | elif u'details' in response and u'learnerMessage' in response[u'details']:
61 | print(response[u'details'][u'learnerMessage'])
62 | else:
63 | print("Unknown response from Coursera: {}".format(request.status_code))
64 | print(response)
65 |
66 | def status(self):
67 | print("You want to submit these numbers:")
68 | for part_id, part_name in self.parts.items():
69 | answer = self.answers[part_id]
70 | if answer is None:
71 | answer = '-'*10
72 | print("Task {}: {}".format(part_name, answer))
73 |
74 | def submit_part(self, part, output):
75 | self.answers[part] = output
76 | print("Current answer for task {} is: {}".format(self.parts[part], output))
77 |
78 | def submit_tag(self, tag, output):
79 | part_id = [k for k, v in self.parts.items() if v == tag]
80 | if len(part_id)!=1:
81 | raise RuntimeError('cannot match tag with part_id: found {} matches'.format(len(part_id)))
82 | part_id = part_id[0]
83 | self.submit_part(part_id, str(self.ravel_output(output)))
--------------------------------------------------------------------------------
/Programming_assignment_week_2_Data_leakages/Data_leakages.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Version 1.0.0"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Introduction"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "In this programming assignment we will illustrate a very severe data leakage, that can often be found in competitions, where the pairs of object should be scored, e.g. predict $1$ if two objects belong to the same class and $0$ otherwise. \n",
22 | "\n",
23 | "The data in this assignment is taken from a real competition, and the funniest thing is that *we will not use training set at all* and achieve almost 100% accuracy score! We will just exploit the leakage.\n",
24 | "\n",
25 | "Now go through the notebook and complete the assignment."
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": null,
31 | "metadata": {
32 | "collapsed": true
33 | },
34 | "outputs": [],
35 | "source": [
36 | "import numpy as np\n",
37 | "import pandas as pd \n",
38 | "import scipy.sparse"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "# Load the data"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "Let's load the test data. Note, that we don't have any training data here, just test data. Moreover, *we will not even use any features* of test objects. All we need to solve this task is the file with the indices for the pairs, that we need to compare."
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "Let's load the data with test indices."
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": null,
65 | "metadata": {
66 | "collapsed": true
67 | },
68 | "outputs": [],
69 | "source": [
70 | "test = pd.read_csv('../readonly/data_leakages_data/test_pairs.csv')\n",
71 | "test.head(10)"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "For example, we can think that there is a test dataset of images, and each image is assigned a unique `Id` from $0$ to $N-1$ (N -- is the number of images). In the dataframe from above `FirstId` and `SecondId` point to these `Id`'s and define pairs, that we should compare: e.g. do both images in the pair belong to the same class or not. So, for example for the first row: if images with `Id=1427` and `Id=8053` belong to the same class, we should predict $1$, and $0$ otherwise. \n",
79 | "\n",
80 | "But in our case we don't really care about the images, and how exactly we compare the images (as long as comparator is binary). "
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "**We suggest you to try to solve the puzzle yourself first.** You need to submit a `.csv` file with columns `pairId` and `Prediction` to the grader. The number of submissions allowed is made pretty huge to let you explore the data without worries. The returned score should be very close to $1$."
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "**If you do not want to think much** -- scroll down and follow the instructions below."
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": null,
100 | "metadata": {
101 | "collapsed": true
102 | },
103 | "outputs": [],
104 | "source": []
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "metadata": {
110 | "collapsed": true
111 | },
112 | "outputs": [],
113 | "source": []
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": null,
118 | "metadata": {
119 | "collapsed": true
120 | },
121 | "outputs": [],
122 | "source": []
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {
128 | "collapsed": true
129 | },
130 | "outputs": [],
131 | "source": []
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": null,
136 | "metadata": {
137 | "collapsed": true
138 | },
139 | "outputs": [],
140 | "source": []
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "# EDA and leakage intuition"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "As we already know, the key to discover data leakages is careful EDA. So let's start our work with some basic data exploration and build an intuition about the leakage."
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "metadata": {},
159 | "source": [
160 | "First, check, how many different `id`s are there: concatenate `FirstId` and `SecondId` and print the number of unique elements. Also print minimum and maximum value for that vector."
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": null,
166 | "metadata": {
167 | "collapsed": true
168 | },
169 | "outputs": [],
170 | "source": [
171 | "# YOUR CODE GOES HERE"
172 | ]
173 | },
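A sketch of this check, assuming only the `test` dataframe loaded above:

```python
# Hedged sketch: pool both id columns and inspect the unique values.
ids = pd.concat([test['FirstId'], test['SecondId']])
print(ids.nunique(), ids.min(), ids.max())
```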
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "and then print how many pairs we need to classify (it is basically the number of rows in the test set)"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": null,
184 | "metadata": {
185 | "collapsed": true
186 | },
187 | "outputs": [],
188 | "source": [
189 | "# YOUR CODE GOES HERE"
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "Now print, how many distinct pairs it would be possible to create out of all \"images\" in the dataset? "
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "metadata": {
203 | "collapsed": true
204 | },
205 | "outputs": [],
206 | "source": [
207 | "# YOUR CODE GOES HERE"
208 | ]
209 | },
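With `Id`s running from $0$ to $N-1$, the number of distinct unordered pairs is $\frac{N(N-1)}{2}$; a sketch:

```python
# Hedged sketch: total number of distinct unordered pairs.
N = max(test['FirstId'].max(), test['SecondId'].max()) + 1
print(N * (N - 1) // 2)
```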
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "So the number of pairs we are given to classify is very very small compared to the total number of pairs. \n",
215 | "\n",
216 | "To exploit the leak we need to **assume (or prove)**, that the total number of positive pairs is small, compared to the total number of pairs. For example: think about an image dataset with $1000$ classes, $N$ images per class. Then if the task was to tell whether a pair of images belongs to the same class or not, we would have $1000\\frac{N(N-1)}{2}$ positive pairs, while total number of pairs was $\\frac{1000N(1000N - 1)}{2}$.\n",
217 | "\n",
218 | "Another example: in [Quora competitition](https://www.kaggle.com/c/quora-question-pairs) the task was to classify whether a pair of qustions are duplicates of each other or not. Of course, total number of question pairs is very huge, while number of duplicates (positive pairs) is much much smaller."
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {
224 | "collapsed": true
225 | },
226 | "source": [
227 | "Finally, let's get a fraction of pairs of class `1`. We just need to submit a constant prediction \"all ones\" and check the returned accuracy. Create a dataframe with columns `pairId` and `Prediction`, fill it and export it to `.csv` file. Then submit to grader and examine grader's output. "
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": null,
233 | "metadata": {
234 | "collapsed": true
235 | },
236 | "outputs": [],
237 | "source": [
238 | "# YOUR CODE GOES HERE"
239 | ]
240 | },
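A sketch of the constant submission (the file name here is illustrative):

```python
# Hedged sketch: a constant "all ones" submission. The accuracy it gets back
# equals the fraction of positive pairs in the test set.
const = test[['pairId']].copy()
const['Prediction'] = 1
const.to_csv('constant_ones.csv', index=False)
```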
241 | {
242 | "cell_type": "markdown",
243 | "metadata": {},
244 | "source": [
245 | "So, we assumed the total number of pairs is much higher than the number of positive pairs, but it is not the case for the test set. It means that the test set is constructed not by sampling random pairs, but with a specific sampling algorithm. Pairs of class `1` are oversampled.\n",
246 | "\n",
247 | "Now think, how we can exploit this fact? What is the leak here? If you get it now, you may try to get to the final answer yourself, othewise you can follow the instructions below. "
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": null,
253 | "metadata": {
254 | "collapsed": true
255 | },
256 | "outputs": [],
257 | "source": []
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "metadata": {
263 | "collapsed": true
264 | },
265 | "outputs": [],
266 | "source": []
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": null,
271 | "metadata": {
272 | "collapsed": true
273 | },
274 | "outputs": [],
275 | "source": []
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": null,
280 | "metadata": {
281 | "collapsed": true
282 | },
283 | "outputs": [],
284 | "source": []
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "# Building a magic feature"
291 | ]
292 | },
293 | {
294 | "cell_type": "markdown",
295 | "metadata": {},
296 | "source": [
297 | "In this section we will build a magic feature, that will solve the problem almost perfectly. The instructions will lead you to the correct solution, but please, try to explain the purpose of the steps we do to yourself -- it is very important."
298 | ]
299 | },
300 | {
301 | "cell_type": "markdown",
302 | "metadata": {},
303 | "source": [
304 | "## Incidence matrix"
305 | ]
306 | },
307 | {
308 | "cell_type": "markdown",
309 | "metadata": {},
310 | "source": [
311 | "First, we need to build an [incidence matrix](https://en.wikipedia.org/wiki/Incidence_matrix). You can think of pairs `(FirstId, SecondId)` as of edges in an undirected graph. \n",
312 | "\n",
313 | "The incidence matrix is a matrix of size `(maxId + 1, maxId + 1)`, where each row (column) `i` corresponds `i-th` `Id`. In this matrix we put the value `1` to the position `[i, j]`, if and only if a pair `(i, j)` or `(j, i)` is present in a given set of pais `(FirstId, SecondId)`. All the other elements in the incidence matrix are zeros. \n",
314 | "\n",
315 | "**Important!** The incidence matrices are typically very very sparse (small number of non-zero values). At the same time incidence matrices are usually huge in terms of total number of elements, and it is **impossible to store them in memory in dense format**. But due to their sparsity incidence matrices **can be easily represented as sparse matrices**. If you are not familiar with sparse matrices, please see [wiki](https://en.wikipedia.org/wiki/Sparse_matrix) and [scipy.sparse reference](https://docs.scipy.org/doc/scipy/reference/sparse.html). Please, use any of `scipy.sparse` constructors to build incidence matrix. \n",
316 | "\n",
317 | "For example, you can use this constructor: `scipy.sparse.coo_matrix((data, (i, j)))`. We highly recommend to learn to use different `scipy.sparse` constuctors, and matrices types, but if you feel you don't want to use them, you can always build this matrix with a simple `for` loop. You will need first to create a matrix using `scipy.sparse.coo_matrix((M, N), [dtype])` with an appropriate shape `(M, N)` and then iterate through `(FirstId, SecondId)` pairs and fill corresponding elements in matrix with ones. \n",
318 | "\n",
319 | "**Note**, that the matrix should be symmetric and consist only of zeros and ones. It is a way to check yourself."
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": null,
325 | "metadata": {
326 | "collapsed": true
327 | },
328 | "outputs": [],
329 | "source": [
330 | "inc_mat = # YOUR CODE GOES HERE (but probably you will need to write few more lines before)\n",
331 | "\n",
332 | "# Sanity checks\n",
333 | "assert inc_mat.max() == 1\n",
334 | "assert inc_mat.sum() == 736872"
335 | ]
336 | },
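One possible way to fill in the cell above (the asserts remain the ground truth): mirror the pairs so the matrix comes out symmetric, drop duplicate coordinates, and put ones at the remaining positions:

```python
# Hedged sketch of the incidence matrix construction.
pairs = np.vstack([test[['FirstId', 'SecondId']].values,
                   test[['SecondId', 'FirstId']].values])
pairs = np.unique(pairs, axis=0)  # keep each (i, j) coordinate only once

n = pairs.max() + 1
inc_mat = scipy.sparse.coo_matrix(
    (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n))
```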
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "It is convenient to have matrix in `csr` format eventually."
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": null,
347 | "metadata": {
348 | "collapsed": true
349 | },
350 | "outputs": [],
351 | "source": [
352 | "inc_mat = inc_mat.tocsr()"
353 | ]
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "metadata": {},
358 | "source": [
359 | "## Now build the magic feature"
360 | ]
361 | },
362 | {
363 | "cell_type": "markdown",
364 | "metadata": {},
365 | "source": [
366 | "Why did we build the incidence matrix? We can think of the rows in this matix as of representations for the objects. `i-th` row is a representation for an object with `Id = i`. Then, to measure similarity between two objects we can measure similarity between their representations. And we will see, that such representations are very good."
367 | ]
368 | },
369 | {
370 | "cell_type": "markdown",
371 | "metadata": {},
372 | "source": [
373 | "Now select the rows from the incidence matrix, that correspond to `test.FirstId`'s, and `test.SecondId`'s."
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": null,
379 | "metadata": {
380 | "collapsed": true
381 | },
382 | "outputs": [],
383 | "source": [
384 | "# Note, scipy goes crazy if a matrix is indexed with pandas' series. \n",
385 | "# So do not forget to convert `pd.series` to `np.array`\n",
386 | "# These lines should normally run very quickly \n",
387 | "\n",
388 | "rows_FirstId = # YOUR CODE GOES HERE\n",
389 | "rows_SecondId = # YOUR CODE GOES HERE"
390 | ]
391 | },
392 | {
393 | "cell_type": "markdown",
394 | "metadata": {},
395 | "source": [
396 | "Our magic feature will be the *dot product* between representations of a pair of objects. Dot product can be regarded as similarity measure -- for our non-negative representations the dot product is close to 0 when the representations are different, and is huge, when representations are similar. \n",
397 | "\n",
398 | "Now compute dot product between corresponding rows in `rows_FirstId` and `rows_SecondId` matrices."
399 | ]
400 | },
401 | {
402 | "cell_type": "code",
403 | "execution_count": null,
404 | "metadata": {
405 | "collapsed": true
406 | },
407 | "outputs": [],
408 | "source": [
409 | "# Note, that in order to do pointwise multiplication in scipy.sparse you need to use function `multiply`\n",
410 | "# regular `*` corresponds to matrix-matrix multiplication\n",
411 | "\n",
412 | "f = # YOUR CODE GOES HERE\n",
413 | "\n",
414 | "# Sanity check\n",
415 | "assert f.shape == (368550, )"
416 | ]
417 | },
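A sketch of these two cells combined, following the comments above (index with `np.array`s, use `multiply` for the elementwise product):

```python
# Hedged sketch: row representations and their per-pair dot products.
rows_FirstId = inc_mat[test['FirstId'].values]
rows_SecondId = inc_mat[test['SecondId'].values]

# Elementwise product, then sum along each row -> dot product per pair.
f = np.asarray(rows_FirstId.multiply(rows_SecondId).sum(axis=1)).squeeze()
```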
418 | {
419 | "cell_type": "markdown",
420 | "metadata": {},
421 | "source": [
422 | "That is it! **We've built our magic feature.** "
423 | ]
424 | },
425 | {
426 | "cell_type": "markdown",
427 | "metadata": {},
428 | "source": [
429 | "# From magic feature to binary predictions"
430 | ]
431 | },
432 | {
433 | "cell_type": "markdown",
434 | "metadata": {},
435 | "source": [
436 | "But how do we convert this feature into binary predictions? We do not have a train set to learn a model, but we have a piece of information about test set: the baseline accuracy score that you got, when submitting constant. And we also have a very strong considerations about the data generative process, so probably we will be fine even without a training set. "
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {},
442 | "source": [
443 | "We may try to choose a thresold, and set the predictions to 1, if the feature value `f` is higer than the threshold, and 0 otherwise. What threshold would you choose? "
444 | ]
445 | },
446 | {
447 | "cell_type": "markdown",
448 | "metadata": {},
449 | "source": [
450 | "How do we find a right threshold? Let's first examine this feature: print frequencies (or counts) of each value in the feature `f`."
451 | ]
452 | },
453 | {
454 | "cell_type": "code",
455 | "execution_count": null,
456 | "metadata": {
457 | "collapsed": true
458 | },
459 | "outputs": [],
460 | "source": [
461 | "# For example use `np.unique` function, check for flags\n",
462 | "\n",
463 | "print # YOUR CODE GOES HERE"
464 | ]
465 | },
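A sketch using the `return_counts` flag of `np.unique`:

```python
# Hedged sketch: frequency of each value of the feature f.
values, counts = np.unique(f, return_counts=True)
for v, c in zip(values, counts):
    print(v, c)
```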
466 | {
467 | "cell_type": "markdown",
468 | "metadata": {},
469 | "source": [
470 | "Do you see how this feature clusters the pairs? Maybe you can guess a good threshold by looking at the values? \n",
471 | "\n",
472 | "In fact, in other situations it can be not that obvious, but in general to pick a threshold you only need to remember the score of your baseline submission and use this information. Do you understand why and how? "
473 | ]
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "metadata": {},
478 | "source": [
479 | "Choose a threshold below: "
480 | ]
481 | },
482 | {
483 | "cell_type": "code",
484 | "execution_count": null,
485 | "metadata": {
486 | "collapsed": true
487 | },
488 | "outputs": [],
489 | "source": [
490 | "pred = f > # SET THRESHOLD HERE"
491 | ]
492 | },
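A sketch of the reasoning hinted at above: if the constant "all ones" submission scored $r$, then a fraction $r$ of the test pairs is positive, so a sensible threshold $t$ is one for which the fraction of values above $t$ is approximately $r$:

```python
# Hedged sketch: fraction of predicted positives for each candidate threshold.
for t in np.unique(f):
    print(t, (f > t).mean())
```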
493 | {
494 | "cell_type": "markdown",
495 | "metadata": {},
496 | "source": [
497 | "# Finally, let's create a submission"
498 | ]
499 | },
500 | {
501 | "cell_type": "code",
502 | "execution_count": null,
503 | "metadata": {
504 | "collapsed": true
505 | },
506 | "outputs": [],
507 | "source": [
508 | "submission = test.loc[:,['pairId']]\n",
509 | "submission['Prediction'] = pred.astype(int)\n",
510 | "\n",
511 | "submission.to_csv('submission.csv', index=False)"
512 | ]
513 | },
514 | {
515 | "cell_type": "markdown",
516 | "metadata": {},
517 | "source": [
518 | "Now submit it to the grader! It is not possible to submit directly from this notebook, as we need to submit a `csv` file, not a single number (limitation of Coursera platform). \n",
519 | "\n",
520 | "To download `submission.csv` file that you've just produced click here (if the link opens in browser, right-click on it and shoose \"Save link as\"). Then go to [assignment page](https://www.coursera.org/learn/competitive-data-science/programming/KsASv/data-leakages/submission) and submit your `.csv` file in 'My submission' tab.\n",
521 | "\n",
522 | "\n",
523 | "If you did everything right, the score should be very high."
524 | ]
525 | },
526 | {
527 | "cell_type": "markdown",
528 | "metadata": {},
529 | "source": [
530 | "**Finally:** try to explain to yourself, why the whole thing worked out. In fact, there is no magic in this feature, and the idea to use rows in the incidence matrix can be intuitively justified."
531 | ]
532 | },
533 | {
534 | "cell_type": "markdown",
535 | "metadata": {},
536 | "source": [
537 | "# Bonus"
538 | ]
539 | },
540 | {
541 | "cell_type": "markdown",
542 | "metadata": {},
543 | "source": [
544 | "Interestingly, it is not the only leak in this dataset. There is another totally different way to get almost 100% accuracy. Try to find it!"
545 | ]
546 | }
547 | ],
548 | "metadata": {
549 | "kernelspec": {
550 | "display_name": "Python 3",
551 | "language": "python",
552 | "name": "python3"
553 | },
554 | "language_info": {
555 | "codemirror_mode": {
556 | "name": "ipython",
557 | "version": 3
558 | },
559 | "file_extension": ".py",
560 | "mimetype": "text/x-python",
561 | "name": "python",
562 | "nbconvert_exporter": "python",
563 | "pygments_lexer": "ipython3",
564 | "version": "3.6.0"
565 | }
566 | },
567 | "nbformat": 4,
568 | "nbformat_minor": 2
569 | }
570 |
--------------------------------------------------------------------------------
/Programming_assignment_week_3_Mean_encodings/Programming_assignment_week_3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Version 1.1.0"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Mean encodings"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "In this programming assignment you will be working with `1C` dataset from the final competition. You are asked to encode `item_id` in 4 different ways:\n",
22 | "\n",
23 | " 1) Via KFold scheme; \n",
24 | " 2) Via Leave-one-out scheme;\n",
25 | " 3) Via smoothing scheme;\n",
26 | " 4) Via expanding mean scheme.\n",
27 | "\n",
28 | "**You will need to submit** the correlation coefficient between resulting encoding and target variable up to 4 decimal places.\n",
29 | "\n",
30 | "### General tips\n",
31 | "\n",
32 | "* Fill NANs in the encoding with `0.3343`.\n",
33 | "* Some encoding schemes depend on sorting order, so in order to avoid confusion, please use the following code snippet to construct the data frame. This snippet also implements mean encoding without regularization."
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": null,
39 | "metadata": {
40 | "collapsed": true
41 | },
42 | "outputs": [],
43 | "source": [
44 | "import pandas as pd\n",
45 | "import numpy as np\n",
46 | "from itertools import product\n",
47 | "from grader import Grader"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "# Read data"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {
61 | "collapsed": true
62 | },
63 | "outputs": [],
64 | "source": [
65 | "sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "# Aggregate data"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "Since the competition task is to make a monthly prediction, we need to aggregate the data to montly level before doing any encodings. The following code-cell serves just that purpose."
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "metadata": {
86 | "collapsed": true
87 | },
88 | "outputs": [],
89 | "source": [
90 | "index_cols = ['shop_id', 'item_id', 'date_block_num']\n",
91 | "\n",
92 | "# For every month we create a grid from all shops/items combinations from that month\n",
93 | "grid = [] \n",
94 | "for block_num in sales['date_block_num'].unique():\n",
95 | " cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()\n",
96 | " cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()\n",
97 | " grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))\n",
98 | "\n",
99 | "#turn the grid into pandas dataframe\n",
100 | "grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)\n",
101 | "\n",
102 | "#get aggregated values for (shop_id, item_id, month)\n",
103 | "gb = sales.groupby(index_cols,as_index=False).agg({'item_cnt_day':{'target':'sum'}})\n",
104 | "\n",
105 | "#fix column names\n",
106 | "gb.columns = [col[0] if col[-1]=='' else col[-1] for col in gb.columns.values]\n",
107 | "#join aggregated data to the grid\n",
108 | "all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)\n",
109 | "#sort the data\n",
110 | "all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True)"
111 | ]
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "metadata": {},
116 | "source": [
117 | "# Mean encodings without regularization"
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "After we did the techinical work, we are ready to actually *mean encode* the desired `item_id` variable. \n",
125 | "\n",
126 | "Here are two ways to implement mean encoding features *without* any regularization. You can use this code as a starting point to implement regularized techniques. "
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "#### Method 1"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {
140 | "collapsed": true,
141 | "scrolled": true
142 | },
143 | "outputs": [],
144 | "source": [
145 | "# Calculate a mapping: {item_id: target_mean}\n",
146 | "item_id_target_mean = all_data.groupby('item_id').target.mean()\n",
147 | "\n",
148 | "# In our non-regularized case we just *map* the computed means to the `item_id`'s\n",
149 | "all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)\n",
150 | "\n",
151 | "# Fill NaNs\n",
152 | "all_data['item_target_enc'].fillna(0.3343, inplace=True) \n",
153 | "\n",
154 | "# Print correlation\n",
155 | "encoded_feature = all_data['item_target_enc'].values\n",
156 | "print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "#### Method 2"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {
170 | "collapsed": true
171 | },
172 | "outputs": [],
173 | "source": [
174 | "'''\n",
175 | " Differently to `.target.mean()` function `transform` \n",
176 | " will return a dataframe with an index like in `all_data`.\n",
177 | " Basically this single line of code is equivalent to the first two lines from of Method 1.\n",
178 | "'''\n",
179 | "all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')\n",
180 | "\n",
181 | "# Fill NaNs\n",
182 | "all_data['item_target_enc'].fillna(0.3343, inplace=True) \n",
183 | "\n",
184 | "# Print correlation\n",
185 | "encoded_feature = all_data['item_target_enc'].values\n",
186 | "print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])"
187 | ]
188 | },
189 | {
190 | "cell_type": "markdown",
191 | "metadata": {},
192 | "source": [
193 | "See the printed value? It is the correlation coefficient between the target variable and your new encoded feature. You need to **compute correlation coefficient** between the encodings, that you will implement and **submit those to coursera**."
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": null,
199 | "metadata": {
200 | "collapsed": true
201 | },
202 | "outputs": [],
203 | "source": [
204 | "grader = Grader()"
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "metadata": {},
210 | "source": [
211 | "# 1. KFold scheme"
212 | ]
213 | },
214 | {
215 | "cell_type": "markdown",
216 | "metadata": {},
217 | "source": [
218 | "Explained starting at 41 sec of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization)."
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {},
224 | "source": [
225 | "**Now it's your turn to write the code!** \n",
226 | "\n",
227 | "You may use 'Regularization' video as a reference for all further tasks.\n",
228 | "\n",
229 | "First, implement KFold scheme with five folds. Use KFold(5) from sklearn.model_selection. \n",
230 | "\n",
231 | "1. Split your data in 5 folds with `sklearn.model_selection.KFold` with `shuffle=False` argument.\n",
232 | "2. Iterate through folds: use all but the current fold to calculate mean target for each level `item_id`, and fill the current fold.\n",
233 | "\n",
234 | " * See the **Method 1** from the example implementation. In particular learn what `map` and pd.Series.map functions do. They are pretty handy in many situations."
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": null,
240 | "metadata": {
241 | "collapsed": true
242 | },
243 | "outputs": [],
244 | "source": [
245 | "# YOUR CODE GOES HERE\n",
246 | "\n",
247 | "# You will need to compute correlation like that\n",
248 | "corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]\n",
249 | "print(corr)\n",
250 | "grader.submit_tag('KFold_scheme', corr)"
251 | ]
252 | },
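A sketch of the KFold scheme described above (`KFold` yields positional indices, hence the `iloc`/index juggling):

```python
# Hedged sketch: mean encoding with 5-fold out-of-fold means.
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)
all_data['item_target_enc'] = np.nan
for train_idx, val_idx in kf.split(all_data):
    train = all_data.iloc[train_idx]
    means = train.groupby('item_id').target.mean()
    encs = all_data.iloc[val_idx]['item_id'].map(means)
    all_data.loc[all_data.index[val_idx], 'item_target_enc'] = encs.values

all_data['item_target_enc'].fillna(0.3343, inplace=True)
encoded_feature = all_data['item_target_enc'].values
```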
253 | {
254 | "cell_type": "markdown",
255 | "metadata": {},
256 | "source": [
257 | "# 2. Leave-one-out scheme"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {
263 | "collapsed": true
264 | },
265 | "source": [
266 | "Now, implement leave-one-out scheme. Note that if you just simply set the number of folds to the number of samples and run the code from the **KFold scheme**, you will probably wait for a very long time. \n",
267 | "\n",
268 | "To implement a faster version, note, that to calculate mean target value using all the objects but one *given object*, you can:\n",
269 | "\n",
270 | "1. Calculate sum of the target values using all the objects.\n",
271 | "2. Then subtract the target of the *given object* and divide the resulting value by `n_objects - 1`. \n",
272 | "\n",
273 | "Note that you do not need to perform `1.` for every object. And `2.` can be implemented without any `for` loop.\n",
274 | "\n",
275 | "It is the most convenient to use `.transform` function as in **Method 2**."
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": null,
281 | "metadata": {
282 | "collapsed": true
283 | },
284 | "outputs": [],
285 | "source": [
286 | "# YOUR CODE GOES HERE\n",
287 | "\n",
288 | "corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]\n",
289 | "print(corr)\n",
290 | "grader.submit_tag('Leave-one-out_scheme', corr)"
291 | ]
292 | },
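A sketch of the faster leave-one-out computation via group sums:

```python
# Hedged sketch: subtract each row's own target from its group sum and divide
# by the remaining count; 0/0 yields NaN for items that occur only once.
grp = all_data.groupby('item_id')['target']
loo_sum = grp.transform('sum') - all_data['target']
loo_cnt = grp.transform('count') - 1
all_data['item_target_enc'] = loo_sum / loo_cnt

all_data['item_target_enc'].fillna(0.3343, inplace=True)
encoded_feature = all_data['item_target_enc'].values
```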
293 | {
294 | "cell_type": "markdown",
295 | "metadata": {},
296 | "source": [
297 | "# 3. Smoothing"
298 | ]
299 | },
300 | {
301 | "cell_type": "markdown",
302 | "metadata": {},
303 | "source": [
304 | "Explained starting at 4:03 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization)."
305 | ]
306 | },
307 | {
308 | "cell_type": "markdown",
309 | "metadata": {},
310 | "source": [
311 | "Next, implement smoothing scheme with $\\alpha = 100$. Use the formula from the first slide in the video and $0.3343$ as `globalmean`. Note that `nrows` is the number of objects that belong to a certain category (not the number of rows in the dataset)."
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": null,
317 | "metadata": {
318 | "collapsed": true
319 | },
320 | "outputs": [],
321 | "source": [
322 | "# YOUR CODE GOES HERE\n",
323 | "\n",
324 | "corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]\n",
325 | "print(corr)\n",
326 | "grader.submit_tag('Smoothing_scheme', corr)"
327 | ]
328 | },
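A sketch of the smoothing formula from the video, under the stated $\alpha$ and `globalmean`:

```python
# Hedged sketch:
#   enc = (mean_by_category * nrows + globalmean * alpha) / (nrows + alpha)
alpha = 100
globalmean = 0.3343

grp = all_data.groupby('item_id')['target']
cat_mean = grp.transform('mean')
nrows = grp.transform('count')
all_data['item_target_enc'] = (cat_mean * nrows + globalmean * alpha) / (nrows + alpha)

all_data['item_target_enc'].fillna(0.3343, inplace=True)
encoded_feature = all_data['item_target_enc'].values
```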
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "# 4. Expanding mean scheme"
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "Explained starting at 5:50 of [Regularization video](https://www.coursera.org/learn/competitive-data-science/lecture/LGYQ2/regularization)."
341 | ]
342 | },
343 | {
344 | "cell_type": "markdown",
345 | "metadata": {
346 | "collapsed": true
347 | },
348 | "source": [
349 | "Finally, implement the *expanding mean* scheme. It is basically already implemented for you in the video, but you can challenge yourself and try to implement it yourself. You will need [`cumsum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.cumsum.html) and [`cumcount`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html) functions from pandas."
350 | ]
351 | },
352 | {
353 | "cell_type": "code",
354 | "execution_count": null,
355 | "metadata": {
356 | "collapsed": true
357 | },
358 | "outputs": [],
359 | "source": [
360 | "# YOUR CODE GOES HERE\n",
361 | "\n",
362 | "corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]\n",
363 | "print(corr)\n",
364 | "grader.submit_tag('Expanding_mean_scheme', corr)"
365 | ]
366 | },
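A sketch of the expanding mean via `cumsum`/`cumcount`; it relies on the sort order established when `all_data` was built:

```python
# Hedged sketch: for each row, the mean target over all *previous* occurrences
# of its item_id. cumsum includes the current row, so subtract it back out;
# cumcount counts prior rows. First occurrences give NaN and get filled below.
cumsum = all_data.groupby('item_id')['target'].cumsum() - all_data['target']
cumcnt = all_data.groupby('item_id').cumcount()
all_data['item_target_enc'] = cumsum / cumcnt

all_data['item_target_enc'].fillna(0.3343, inplace=True)
encoded_feature = all_data['item_target_enc'].values
```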
367 | {
368 | "cell_type": "markdown",
369 | "metadata": {},
370 | "source": [
371 | "## Authorization & Submission\n",
372 | "To submit assignment parts to Cousera platform, please, enter your e-mail and token into variables below. You can generate token on this programming assignment page. Note: Token expires 30 minutes after generation."
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": null,
378 | "metadata": {
379 | "collapsed": true
380 | },
381 | "outputs": [],
382 | "source": [
383 | "STUDENT_EMAIL = # EMAIL HERE\n",
384 | "STUDENT_TOKEN = # TOKEN HERE\n",
385 | "grader.status()"
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": null,
391 | "metadata": {
392 | "collapsed": true
393 | },
394 | "outputs": [],
395 | "source": [
396 | "grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)"
397 | ]
398 | }
399 | ],
400 | "metadata": {
401 | "kernelspec": {
402 | "display_name": "Python 3",
403 | "language": "python",
404 | "name": "python3"
405 | },
406 | "language_info": {
407 | "codemirror_mode": {
408 | "name": "ipython",
409 | "version": 3
410 | },
411 | "file_extension": ".py",
412 | "mimetype": "text/x-python",
413 | "name": "python",
414 | "nbconvert_exporter": "python",
415 | "pygments_lexer": "ipython3",
416 | "version": "3.6.0"
417 | }
418 | },
419 | "nbformat": 4,
420 | "nbformat_minor": 2
421 | }
422 |
--------------------------------------------------------------------------------
/Programming_assignment_week_3_Mean_encodings/grader.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 | import numpy as np
4 | from collections import OrderedDict
5 |
6 | def array_to_hash(x):
7 | x_tupled = None
8 | if type(x) == list:
9 | x_tupled = tuple(x)
10 | elif type(x) == np.ndarray:
11 | x_tupled = tuple(list(x.flatten()))
12 | elif type(x) == tuple:
13 | x_tupled = x
14 | else:
15 | raise RuntimeError('unexpected type of input: {}'.format(type(x)))
16 | return hash(tuple(map(float, x_tupled)))
17 |
18 | def almostEqual(x, y):
19 | return abs(x - y) < 1e-5
20 |
21 |
22 | class Grader(object):
23 | def __init__(self):
24 | self.submission_page = 'https://hub.coursera-apps.org/api/onDemandProgrammingScriptSubmissions.v1'
25 | self.assignment_key = 'JVyZjZIaEeeXtQpjLCk-0A'
26 | self.parts = OrderedDict([
27 | ('9zPRY', 'KFold_scheme'),
28 | ('xEf0Q', 'Leave-one-out_scheme'),
29 | ('zuMqo', 'Smoothing_scheme'),
30 | ('RNfnI', 'Expanding_mean_scheme')])
31 | self.answers = {key: None for key in self.parts}
32 |
33 | @staticmethod
34 | def ravel_output(output):
35 | '''
 36 | If a student accidentally submitted an np.array with one
 37 | element instead of a number, this function will submit
 38 | that number instead.
39 | '''
40 | if isinstance(output, np.ndarray) and output.size == 1:
41 | output = output.item(0)
42 | return output
43 |
44 | def submit(self, email, token):
45 | submission = {
46 | "assignmentKey": self.assignment_key,
47 | "submitterEmail": email,
48 | "secret": token,
49 | "parts": {}
50 | }
51 | for part, output in self.answers.items():
52 | if output is not None:
53 | submission["parts"][part] = {"output": output}
54 | else:
55 | submission["parts"][part] = dict()
56 | request = requests.post(self.submission_page, data=json.dumps(submission))
57 | response = request.json()
58 | if request.status_code == 201:
59 | print('Submitted to Coursera platform. See results on assignment page!')
60 | elif u'details' in response and u'learnerMessage' in response[u'details']:
61 | print(response[u'details'][u'learnerMessage'])
62 | else:
63 | print("Unknown response from Coursera: {}".format(request.status_code))
64 | print(response)
65 |
66 | def status(self):
67 | print("You want to submit these numbers:")
68 | for part_id, part_name in self.parts.items():
69 | answer = self.answers[part_id]
70 | if answer is None:
71 | answer = '-'*10
72 | print("Task {}: {}".format(part_name, answer))
73 |
74 | def submit_part(self, part, output):
75 | self.answers[part] = output
76 | print("Current answer for task {} is: {}".format(self.parts[part], output))
77 |
78 | def submit_tag(self, tag, output):
79 | part_id = [k for k, v in self.parts.items() if v == tag]
80 | if len(part_id)!=1:
81 | raise RuntimeError('cannot match tag with part_id: found {} matches'.format(len(part_id)))
82 | part_id = part_id[0]
83 | self.submit_part(part_id, str(self.ravel_output(output)))
--------------------------------------------------------------------------------
/Programming_assignment_week_4_Catboost/grader_v2.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 | import numpy as np
4 | from collections import OrderedDict
5 |
6 | def array_to_hash(x):
7 | x_tupled = None
8 | if type(x) == list:
9 | x_tupled = tuple(x)
10 | elif type(x) == np.ndarray:
11 | x_tupled = tuple(list(x.flatten()))
12 | elif type(x) == tuple:
13 | x_tupled = x
14 | else:
15 | raise RuntimeError('unexpected type of input: {}'.format(type(x)))
16 | return hash(tuple(map(float, x_tupled)))
17 |
18 | def almostEqual(x, y):
19 | return abs(x - y) < 1e-3
20 |
21 |
22 | class Grader(object):
23 | def __init__(self):
24 | self.submission_page = 'https://www.coursera.org/api/onDemandProgrammingScriptSubmissions.v1'
25 | self.assignment_key = '2ksCns1AEeiQGAocUzg3rg'
26 | self.parts = OrderedDict([
27 | ('6IBOp', 'negative_samples'),
28 | ('KFgw6', 'positive_samples'),
29 | ('AdVS6', 'resource_unique_values'),
30 | ('Qmiy0', 'logloss_mean'),
31 | ('5UJeq', 'logloss_std'),
32 | ('3JTkU', 'accuracy_6'),
33 | ('N0VEy', 'best_model_name'),
34 | ('xmS1J', 'num_trees'),
35 | ('ztywb', 'mean_logloss_cv'),
36 | ('FaDLS', 'logloss_std_1'),
37 | ('jFOSe', 'iterations_overfitting'),
38 | ('inxm1', 'auc_550'),
39 | ('QRox8', 'feature_importance_top3'),
40 | ('4t0CV', 'most_important'),
41 | ('C8JOy', 'shap_influence'),
42 | ('R50wr', 'speedup'),
43 | ('eA8X5', 'final_auc')])
44 | self.answers = {key: None for key in self.parts}
45 |
46 | @staticmethod
47 | def ravel_output(output):
48 | '''
 49 | If a student accidentally submitted an np.array with one
 50 | element instead of a number, this function will submit
 51 | that number instead.
52 | '''
53 | if isinstance(output, np.ndarray) and output.size == 1:
54 | output = output.item(0)
55 | return output
56 |
57 | def submit(self, email, token):
58 | submission = {
59 | "assignmentKey": self.assignment_key,
60 | "submitterEmail": email,
61 | "secret": token,
62 | "parts": {}
63 | }
64 | for part, output in self.answers.items():
65 | if output is not None:
66 | submission["parts"][part] = {"output": output}
67 | else:
68 | submission["parts"][part] = dict()
69 | request = requests.post(self.submission_page, data=json.dumps(submission))
70 | response = request.json()
71 | if request.status_code == 201:
72 | print('Submitted to Coursera platform. See results on assignment page!')
73 | elif u'details' in response and u'learnerMessage' in response[u'details']:
74 | print(response[u'details'][u'learnerMessage'])
75 | else:
76 | print("Unknown response from Coursera: {}".format(request.status_code))
77 | print(response)
78 |
79 | def status(self):
80 | print("You want to submit these numbers:")
81 | for part_id, part_name in self.parts.items():
82 | answer = self.answers[part_id]
83 | if answer is None:
84 | answer = '-'*10
85 | print("Task {}: {}".format(part_name, answer))
86 |
87 | def submit_part(self, part, output):
88 | self.answers[part] = output
89 | print("Current answer for task {} is: {}".format(self.parts[part], output))
90 |
91 | def submit_tag(self, tag, output):
92 | part_id = [k for k, v in self.parts.items() if v == tag]
93 | if len(part_id)!=1:
94 | raise RuntimeError('cannot match tag with part_id: found {} matches'.format(len(part_id)))
95 | part_id = part_id[0]
96 | self.submit_part(part_id, str(self.ravel_output(output)))
97 |
--------------------------------------------------------------------------------
/Programming_assignment_week_4_Ensembles/Programming_assignment_week_4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Version 1.0.1"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Check your versions"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 3,
20 | "metadata": {},
21 | "outputs": [
22 | {
23 | "name": "stdout",
24 | "output_type": "stream",
25 | "text": [
26 | "numpy 1.13.1\n",
27 | "pandas 0.20.3\n",
28 | "scipy 0.19.1\n",
29 | "sklearn 0.19.0\n",
30 | "lightgbm 2.0.6\n"
31 | ]
32 | }
33 | ],
34 | "source": [
35 | "import numpy as np\n",
36 | "import pandas as pd \n",
37 | "import sklearn\n",
38 | "import scipy.sparse \n",
39 | "import lightgbm \n",
40 | "\n",
41 | "for p in [np, pd, scipy, sklearn, lightgbm]:\n",
42 | " print (p.__name__, p.__version__)"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "**Important!** There is a huge chance that the assignment will be impossible to pass if the versions of `lighgbm` and `scikit-learn` are wrong. The versions being tested:\n",
50 | "\n",
51 | " numpy 1.13.1\n",
52 | " pandas 0.20.3\n",
53 | " scipy 0.19.1\n",
54 | " sklearn 0.19.0\n",
55 | " ligthgbm 2.0.6\n",
56 | " \n",
57 | "\n",
58 | "To install an older version of `lighgbm` you may use the following command:\n",
59 | "```\n",
60 | "pip uninstall lightgbm\n",
61 | "pip install lightgbm==2.0.6\n",
62 | "```"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "# Ensembling"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "In this programming assignment you are asked to implement two ensembling schemes: simple linear mix and stacking.\n",
77 | "\n",
78 | "We will spend several cells to load data and create feature matrix, you can scroll down this part or try to understand what's happening."
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {
85 | "collapsed": true
86 | },
87 | "outputs": [],
88 | "source": [
89 | "import pandas as pd\n",
90 | "import numpy as np\n",
91 | "import gc\n",
92 | "import matplotlib.pyplot as plt\n",
93 | "%matplotlib inline \n",
94 | "\n",
95 | "pd.set_option('display.max_rows', 600)\n",
96 | "pd.set_option('display.max_columns', 50)\n",
97 | "\n",
98 | "import lightgbm as lgb\n",
99 | "from sklearn.linear_model import LinearRegression\n",
100 | "from sklearn.metrics import r2_score\n",
101 | "from tqdm import tqdm_notebook\n",
102 | "\n",
103 | "from itertools import product\n",
104 | "\n",
105 | "\n",
106 | "def downcast_dtypes(df):\n",
107 | " '''\n",
108 | " Changes column types in the dataframe: \n",
109 | " \n",
110 | " `float64` type to `float32`\n",
111 | " `int64` type to `int32`\n",
112 | " '''\n",
113 | " \n",
114 | " # Select columns to downcast\n",
115 | " float_cols = [c for c in df if df[c].dtype == \"float64\"]\n",
116 | " int_cols = [c for c in df if df[c].dtype == \"int64\"]\n",
117 | " \n",
118 | " # Downcast\n",
119 | " df[float_cols] = df[float_cols].astype(np.float32)\n",
120 | " df[int_cols] = df[int_cols].astype(np.int32)\n",
121 | " \n",
122 | " return df"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "# Load data subset"
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "Let's load the data from the hard drive first."
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": null,
142 | "metadata": {
143 | "collapsed": true
144 | },
145 | "outputs": [],
146 | "source": [
147 | "sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')\n",
148 | "shops = pd.read_csv('../readonly/final_project_data/shops.csv')\n",
149 | "items = pd.read_csv('../readonly/final_project_data/items.csv')\n",
150 | "item_cats = pd.read_csv('../readonly/final_project_data/item_categories.csv')"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "And use only 3 shops for simplicity."
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": null,
163 | "metadata": {
164 | "collapsed": true
165 | },
166 | "outputs": [],
167 | "source": [
168 | "sales = sales[sales['shop_id'].isin([26, 27, 28])]"
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "# Get a feature matrix"
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "We now need to prepare the features. This part is all implemented for you."
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": null,
188 | "metadata": {
189 | "collapsed": true
190 | },
191 | "outputs": [],
192 | "source": [
193 | "# Create \"grid\" with columns\n",
194 | "index_cols = ['shop_id', 'item_id', 'date_block_num']\n",
195 | "\n",
196 | "# For every month we create a grid from all shops/items combinations from that month\n",
197 | "grid = [] \n",
198 | "for block_num in sales['date_block_num'].unique():\n",
199 | " cur_shops = sales.loc[sales['date_block_num'] == block_num, 'shop_id'].unique()\n",
200 | " cur_items = sales.loc[sales['date_block_num'] == block_num, 'item_id'].unique()\n",
201 | " grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))\n",
202 | "\n",
203 | "# Turn the grid into a dataframe\n",
204 | "grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)\n",
205 | "\n",
206 | "# Groupby data to get shop-item-month aggregates\n",
207 | "gb = sales.groupby(index_cols,as_index=False).agg({'item_cnt_day':{'target':'sum'}})\n",
208 | "# Fix column names\n",
209 | "gb.columns = [col[0] if col[-1]=='' else col[-1] for col in gb.columns.values] \n",
210 | "# Join it to the grid\n",
211 | "all_data = pd.merge(grid, gb, how='left', on=index_cols).fillna(0)\n",
212 | "\n",
213 | "# Same as above but with shop-month aggregates\n",
214 | "gb = sales.groupby(['shop_id', 'date_block_num'],as_index=False).agg({'item_cnt_day':{'target_shop':'sum'}})\n",
215 | "gb.columns = [col[0] if col[-1]=='' else col[-1] for col in gb.columns.values]\n",
216 | "all_data = pd.merge(all_data, gb, how='left', on=['shop_id', 'date_block_num']).fillna(0)\n",
217 | "\n",
218 | "# Same as above but with item-month aggregates\n",
219 | "gb = sales.groupby(['item_id', 'date_block_num'],as_index=False).agg({'item_cnt_day':{'target_item':'sum'}})\n",
220 | "gb.columns = [col[0] if col[-1] == '' else col[-1] for col in gb.columns.values]\n",
221 | "all_data = pd.merge(all_data, gb, how='left', on=['item_id', 'date_block_num']).fillna(0)\n",
222 | "\n",
223 | "# Downcast dtypes from 64 to 32 bit to save memory\n",
224 | "all_data = downcast_dtypes(all_data)\n",
225 | "del grid, gb \n",
226 | "gc.collect();"
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "After creating a grid, we can calculate some features. We will use lags from [1, 2, 3, 4, 5, 12] months ago."
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": null,
239 | "metadata": {
240 | "collapsed": true
241 | },
242 | "outputs": [],
243 | "source": [
244 | "# List of columns that we will use to create lags\n",
245 | "cols_to_rename = list(all_data.columns.difference(index_cols)) \n",
246 | "\n",
247 | "shift_range = [1, 2, 3, 4, 5, 12]\n",
248 | "\n",
249 | "for month_shift in tqdm_notebook(shift_range):\n",
250 | " train_shift = all_data[index_cols + cols_to_rename].copy()\n",
251 | " \n",
252 | " train_shift['date_block_num'] = train_shift['date_block_num'] + month_shift\n",
253 | " \n",
254 | " foo = lambda x: '{}_lag_{}'.format(x, month_shift) if x in cols_to_rename else x\n",
255 | " train_shift = train_shift.rename(columns=foo)\n",
256 | "\n",
257 | " all_data = pd.merge(all_data, train_shift, on=index_cols, how='left').fillna(0)\n",
258 | "\n",
259 | "del train_shift\n",
260 | "\n",
261 | "# Don't use old data from year 2013\n",
262 | "all_data = all_data[all_data['date_block_num'] >= 12] \n",
263 | "\n",
264 | "# List of all lagged features\n",
265 | "fit_cols = [col for col in all_data.columns if col[-1] in [str(item) for item in shift_range]] \n",
266 | "# We will drop these at fitting stage\n",
267 | "to_drop_cols = list(set(list(all_data.columns)) - (set(fit_cols)|set(index_cols))) + ['date_block_num'] \n",
268 | "\n",
269 | "# Category for each item\n",
270 | "item_category_mapping = items[['item_id','item_category_id']].drop_duplicates()\n",
271 | "\n",
272 | "all_data = pd.merge(all_data, item_category_mapping, how='left', on='item_id')\n",
273 | "all_data = downcast_dtypes(all_data)\n",
274 | "gc.collect();"
275 | ]
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "To this end, we've created a feature matrix. It is stored in `all_data` variable. Take a look:"
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": null,
287 | "metadata": {
288 | "collapsed": true
289 | },
290 | "outputs": [],
291 | "source": [
292 | "all_data.head(5)"
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "# Train/test split"
300 | ]
301 | },
302 | {
303 | "cell_type": "markdown",
304 | "metadata": {},
305 | "source": [
306 | "For a sake of the programming assignment, let's artificially split the data into train and test. We will treat last month data as the test set."
307 | ]
308 | },
309 | {
310 | "cell_type": "code",
311 | "execution_count": null,
312 | "metadata": {
313 | "collapsed": true
314 | },
315 | "outputs": [],
316 | "source": [
317 | "# Save `date_block_num`, as we can't use them as features, but will need them to split the dataset into parts \n",
318 | "dates = all_data['date_block_num']\n",
319 | "\n",
320 | "last_block = dates.max()\n",
321 | "print('Test `date_block_num` is %d' % last_block)"
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "execution_count": null,
327 | "metadata": {
328 | "collapsed": true
329 | },
330 | "outputs": [],
331 | "source": [
332 | "dates_train = dates[dates < last_block]\n",
333 | "dates_test = dates[dates == last_block]\n",
334 | "\n",
335 | "X_train = all_data.loc[dates < last_block].drop(to_drop_cols, axis=1)\n",
336 | "X_test = all_data.loc[dates == last_block].drop(to_drop_cols, axis=1)\n",
337 | "\n",
338 | "y_train = all_data.loc[dates < last_block, 'target'].values\n",
339 | "y_test = all_data.loc[dates == last_block, 'target'].values"
340 | ]
341 | },
342 | {
343 | "cell_type": "markdown",
344 | "metadata": {},
345 | "source": [
346 | "# First level models "
347 | ]
348 | },
349 | {
350 | "cell_type": "markdown",
351 | "metadata": {},
352 | "source": [
353 | "You need to implement a basic stacking scheme. We have a time component here, so we will use ***scheme f)*** from the reading material. Recall, that we always use first level models to build two datasets: test meta-features and 2-nd level train-metafetures. Let's see how we get test meta-features first. "
354 | ]
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "metadata": {},
359 | "source": [
360 | "### Test meta-features"
361 | ]
362 | },
363 | {
364 | "cell_type": "markdown",
365 | "metadata": {},
366 | "source": [
367 | "Firts, we will run *linear regression* on numeric columns and get predictions for the last month."
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": null,
373 | "metadata": {
374 | "collapsed": true
375 | },
376 | "outputs": [],
377 | "source": [
378 | "lr = LinearRegression()\n",
379 | "lr.fit(X_train.values, y_train)\n",
380 | "pred_lr = lr.predict(X_test.values)\n",
381 | "\n",
382 | "print('Test R-squared for linreg is %f' % r2_score(y_test, pred_lr))"
383 | ]
384 | },
385 | {
386 | "cell_type": "markdown",
387 | "metadata": {},
388 | "source": [
389 | "And the we run *LightGBM*."
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": null,
395 | "metadata": {
396 | "collapsed": true
397 | },
398 | "outputs": [],
399 | "source": [
400 | "lgb_params = {\n",
401 | " 'feature_fraction': 0.75,\n",
402 | " 'metric': 'rmse',\n",
403 | " 'nthread':1, \n",
404 | " 'min_data_in_leaf': 2**7, \n",
405 | " 'bagging_fraction': 0.75, \n",
406 | " 'learning_rate': 0.03, \n",
407 | " 'objective': 'mse', \n",
408 | " 'bagging_seed': 2**7, \n",
409 | " 'num_leaves': 2**7,\n",
410 | " 'bagging_freq':1,\n",
411 | " 'verbose':0 \n",
412 | " }\n",
413 | "\n",
414 | "model = lgb.train(lgb_params, lgb.Dataset(X_train, label=y_train), 100)\n",
415 | "pred_lgb = model.predict(X_test)\n",
416 | "\n",
417 | "print('Test R-squared for LightGBM is %f' % r2_score(y_test, pred_lgb))"
418 | ]
419 | },
420 | {
421 | "cell_type": "markdown",
422 | "metadata": {},
423 | "source": [
424 | "Finally, concatenate test predictions to get test meta-features."
425 | ]
426 | },
427 | {
428 | "cell_type": "code",
429 | "execution_count": null,
430 | "metadata": {
431 | "collapsed": true
432 | },
433 | "outputs": [],
434 | "source": [
435 | "X_test_level2 = np.c_[pred_lr, pred_lgb] "
436 | ]
437 | },
438 | {
439 | "cell_type": "markdown",
440 | "metadata": {},
441 | "source": [
442 | "### Train meta-features"
443 | ]
444 | },
445 | {
446 | "cell_type": "markdown",
447 | "metadata": {},
448 | "source": [
449 | "**Now it is your turn to write the code**. You need to implement ***scheme f)*** from the reading material. Here, we will use duration **T** equal to month and **M=15**. \n",
450 | "\n",
451 | "That is, you need to get predictions (meta-features) from *linear regression* and *LightGBM* for months 27, 28, 29, 30, 31, 32. Use the same parameters as in above models."
452 | ]
453 | },
454 | {
455 | "cell_type": "code",
456 | "execution_count": null,
457 | "metadata": {
458 | "collapsed": true
459 | },
460 | "outputs": [],
461 | "source": [
462 | "dates_train_level2 = dates_train[dates_train.isin([27, 28, 29, 30, 31, 32])]\n",
463 | "\n",
464 | "# That is how we get target for the 2nd level dataset\n",
465 | "y_train_level2 = y_train[dates_train.isin([27, 28, 29, 30, 31, 32])]"
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": null,
471 | "metadata": {
472 | "collapsed": true
473 | },
474 | "outputs": [],
475 | "source": [
476 | "# And here we create 2nd level feeature matrix, init it with zeros first\n",
477 | "X_train_level2 = np.zeros([y_train_level2.shape[0], 2])\n",
478 | "\n",
479 | "# Now fill `X_train_level2` with metafeatures\n",
480 | "for cur_block_num in [27, 28, 29, 30, 31, 32]:\n",
481 | " \n",
482 | " print(cur_block_num)\n",
483 | " \n",
484 | " '''\n",
485 | " 1. Split `X_train` into parts\n",
486 | " Remember, that corresponding dates are stored in `dates_train` \n",
487 | " 2. Fit linear regression \n",
488 | " 3. Fit LightGBM and put predictions \n",
489 | " 4. Store predictions from 2. and 3. in the right place of `X_train_level2`. \n",
490 | " You can use `dates_train_level2` for it\n",
491 | " Make sure the order of the meta-features is the same as in `X_test_level2`\n",
492 | " ''' \n",
493 | " \n",
494 | " # YOUR CODE GOES HERE\n",
495 | " \n",
496 | " \n",
497 | "# Sanity check\n",
498 | "assert np.all(np.isclose(X_train_level2.mean(axis=0), [ 1.50148988, 1.38811989]))"
499 | ]
500 | },
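For reference, here is one possible way to fill in the loop above. This is only a hedged sketch, not the official solution: it reuses `lr` and `lgb_params` from the cells above, and the sanity-check means may only be reproduced with the exact library versions listed at the top of the notebook.

```python
# Sketch (assumption) of the loop body for filling `X_train_level2`
for cur_block_num in [27, 28, 29, 30, 31, 32]:

    # 1. Split: fit on all months strictly before `cur_block_num`,
    #    predict for `cur_block_num` itself
    train_idx = dates_train < cur_block_num
    val_idx = dates_train == cur_block_num
    X_tr, y_tr = X_train[train_idx], y_train[train_idx.values]
    X_val = X_train[val_idx]

    # 2. Fit linear regression
    lr.fit(X_tr.values, y_tr)
    pred_lr_val = lr.predict(X_val.values)

    # 3. Fit LightGBM with the same parameters as before
    model = lgb.train(lgb_params, lgb.Dataset(X_tr, label=y_tr), 100)
    pred_lgb_val = model.predict(X_val)

    # 4. Store predictions, keeping the same column order as `X_test_level2`
    #    (linreg in column 0, LightGBM in column 1)
    mask = (dates_train_level2 == cur_block_num).values
    X_train_level2[mask, 0] = pred_lr_val
    X_train_level2[mask, 1] = pred_lgb_val
```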
501 | {
502 | "cell_type": "markdown",
503 | "metadata": {},
504 | "source": [
505 | "Remember, the ensembles work best, when first level models are diverse. We can qualitatively analyze the diversity by examinig *scatter plot* between the two metafeatures. Plot the scatter plot below. "
506 | ]
507 | },
508 | {
509 | "cell_type": "code",
510 | "execution_count": null,
511 | "metadata": {
512 | "collapsed": true
513 | },
514 | "outputs": [],
515 | "source": [
516 | "# YOUR CODE GOES HERE"
517 | ]
518 | },
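A minimal sketch of such a plot, assuming `matplotlib` was imported as at the top of the notebook:

```python
# Scatter the two train meta-features against each other
plt.scatter(X_train_level2[:, 0], X_train_level2[:, 1], s=2, alpha=0.3)
plt.xlabel('linreg meta-feature')
plt.ylabel('LightGBM meta-feature');
```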
519 | {
520 | "cell_type": "markdown",
521 | "metadata": {},
522 | "source": [
523 | "# Ensembling"
524 | ]
525 | },
526 | {
527 | "cell_type": "markdown",
528 | "metadata": {},
529 | "source": [
530 | "Now, when the meta-features are created, we can ensemble our first level models."
531 | ]
532 | },
533 | {
534 | "cell_type": "markdown",
535 | "metadata": {},
536 | "source": [
537 | "### Simple convex mix"
538 | ]
539 | },
540 | {
541 | "cell_type": "markdown",
542 | "metadata": {},
543 | "source": [
544 | "Let's start with simple linear convex mix:\n",
545 | "\n",
546 | "$$\n",
547 | "mix= \\alpha\\cdot\\text{linreg_prediction}+(1-\\alpha)\\cdot\\text{lgb_prediction}\n",
548 | "$$\n",
549 | "\n",
550 | "We need to find an optimal $\\alpha$. And it is very easy, as it is feasible to do grid search. Next, find the optimal $\\alpha$ out of `alphas_to_try` array. Remember, that you need to use train meta-features (not test) when searching for $\\alpha$. "
551 | ]
552 | },
553 | {
554 | "cell_type": "code",
555 | "execution_count": null,
556 | "metadata": {
557 | "collapsed": true
558 | },
559 | "outputs": [],
560 | "source": [
561 | "alphas_to_try = np.linspace(0, 1, 1001)\n",
562 | "\n",
563 | "# YOUR CODE GOES HERE\n",
564 | "best_alpha = # YOUR CODE GOES HERE\n",
565 | "r2_train_simple_mix = # YOUR CODE GOES HERE\n",
566 | "\n",
567 | "print('Best alpha: %f; Corresponding r2 score on train: %f' % (best_alpha, r2_train_simple_mix))"
568 | ]
569 | },
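One possible grid search, shown as a hedged sketch: it applies the convex-mix formula above for every candidate alpha and scores it on the train meta-features with `r2_score`. The name `mix_scores` is hypothetical.

```python
# Sketch (assumption): evaluate every alpha on the train meta-features
mix_scores = [r2_score(y_train_level2,
                       a * X_train_level2[:, 0] + (1 - a) * X_train_level2[:, 1])
              for a in alphas_to_try]

best_alpha = alphas_to_try[np.argmax(mix_scores)]
r2_train_simple_mix = np.max(mix_scores)
```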
570 | {
571 | "cell_type": "markdown",
572 | "metadata": {},
573 | "source": [
574 | "Now use the $\\alpha$ you've found to compute predictions for the test set "
575 | ]
576 | },
577 | {
578 | "cell_type": "code",
579 | "execution_count": null,
580 | "metadata": {
581 | "collapsed": true
582 | },
583 | "outputs": [],
584 | "source": [
585 | "test_preds = # YOUR CODE GOES HERE\n",
586 | "r2_test_simple_mix = # YOUR CODE GOES HERE\n",
587 | "\n",
588 | "print('Test R-squared for simple mix is %f' % r2_test_simple_mix)"
589 | ]
590 | },
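A sketch of the corresponding test-set computation, assuming the same mix formula applied to `X_test_level2`:

```python
# Sketch (assumption): the convex mix with the alpha found on train
test_preds = best_alpha * X_test_level2[:, 0] + (1 - best_alpha) * X_test_level2[:, 1]
r2_test_simple_mix = r2_score(y_test, test_preds)
```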
591 | {
592 | "cell_type": "markdown",
593 | "metadata": {},
594 | "source": [
595 | "### Stacking"
596 | ]
597 | },
598 | {
599 | "cell_type": "markdown",
600 | "metadata": {},
601 | "source": [
602 | "Now, we will try a more advanced ensembling technique. Fit a linear regression model to the meta-features. Use the same parameters as in the model above."
603 | ]
604 | },
605 | {
606 | "cell_type": "code",
607 | "execution_count": null,
608 | "metadata": {
609 | "collapsed": true
610 | },
611 | "outputs": [],
612 | "source": [
613 | "# YOUR CODE GOES HERE"
614 | ]
615 | },
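A minimal sketch of the stacking fit; the name `lr2` is hypothetical, and a plain `LinearRegression()` with default parameters (as for the first-level model) is assumed:

```python
# Second-level model fitted on the train meta-features
lr2 = LinearRegression()
lr2.fit(X_train_level2, y_train_level2)
```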
616 | {
617 | "cell_type": "markdown",
618 | "metadata": {},
619 | "source": [
620 | "Compute R-squared on the train and test sets."
621 | ]
622 | },
623 | {
624 | "cell_type": "code",
625 | "execution_count": null,
626 | "metadata": {
627 | "collapsed": true
628 | },
629 | "outputs": [],
630 | "source": [
631 | "train_preds = # YOUR CODE GOES HERE\n",
632 | "r2_train_stacking = # YOUR CODE GOES HERE\n",
633 | "\n",
634 | "test_preds = # YOUR CODE GOES HERE\n",
635 | "r2_test_stacking = # YOUR CODE GOES HERE\n",
636 | "\n",
637 | "print('Train R-squared for stacking is %f' % r2_train_stacking)\n",
638 | "print('Test R-squared for stacking is %f' % r2_test_stacking)"
639 | ]
640 | },
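And a sketch of the corresponding scores, reusing the hypothetical `lr2` from the sketch above:

```python
# Sketch (assumption): R-squared of the stacker on train and test
train_preds = lr2.predict(X_train_level2)
r2_train_stacking = r2_score(y_train_level2, train_preds)

test_preds = lr2.predict(X_test_level2)
r2_test_stacking = r2_score(y_test, test_preds)
```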
641 | {
642 | "cell_type": "markdown",
643 | "metadata": {},
644 | "source": [
645 | "Interesting, that the score turned out to be lower than in previous method. Although the model is very simple (just 3 parameters) and, in fact, mixes predictions linearly, it looks like it managed to overfit. **Examine and compare** train and test scores for the two methods. \n",
646 | "\n",
647 | "And of course this particular case does not mean simple mix is always better than stacking."
648 | ]
649 | },
650 | {
651 | "cell_type": "markdown",
652 | "metadata": {},
653 | "source": [
654 | "We all done! Submit everything we need to the grader now."
655 | ]
656 | },
657 | {
658 | "cell_type": "code",
659 | "execution_count": null,
660 | "metadata": {
661 | "collapsed": true
662 | },
663 | "outputs": [],
664 | "source": [
665 | "from grader import Grader\n",
666 | "grader = Grader()\n",
667 | "\n",
668 | "grader.submit_tag('best_alpha', best_alpha)\n",
669 | "\n",
670 | "grader.submit_tag('r2_train_simple_mix', r2_train_simple_mix)\n",
671 | "grader.submit_tag('r2_test_simple_mix', r2_test_simple_mix)\n",
672 | "\n",
673 | "grader.submit_tag('r2_train_stacking', r2_train_stacking)\n",
674 | "grader.submit_tag('r2_test_stacking', r2_test_stacking)"
675 | ]
676 | },
677 | {
678 | "cell_type": "code",
679 | "execution_count": null,
680 | "metadata": {
681 | "collapsed": true
682 | },
683 | "outputs": [],
684 | "source": [
685 | "STUDENT_EMAIL = # EMAIL HERE\n",
686 | "STUDENT_TOKEN = # TOKEN HERE\n",
687 | "grader.status()"
688 | ]
689 | },
690 | {
691 | "cell_type": "code",
692 | "execution_count": null,
693 | "metadata": {
694 | "collapsed": true
695 | },
696 | "outputs": [],
697 | "source": [
698 | "grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)"
699 | ]
700 | }
701 | ],
702 | "metadata": {
703 | "kernelspec": {
704 | "display_name": "Python 3",
705 | "language": "python",
706 | "name": "python3"
707 | },
708 | "language_info": {
709 | "codemirror_mode": {
710 | "name": "ipython",
711 | "version": 3
712 | },
713 | "file_extension": ".py",
714 | "mimetype": "text/x-python",
715 | "name": "python",
716 | "nbconvert_exporter": "python",
717 | "pygments_lexer": "ipython3",
718 | "version": "3.6.0"
719 | }
720 | },
721 | "nbformat": 4,
722 | "nbformat_minor": 2
723 | }
724 |
--------------------------------------------------------------------------------
/Programming_assignment_week_4_Ensembles/grader.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 | import numpy as np
4 | from collections import OrderedDict
5 |
6 | def array_to_hash(x):
7 | x_tupled = None
8 | if type(x) == list:
9 | x_tupled = tuple(x)
10 | elif type(x) == np.ndarray:
11 | x_tupled = tuple(list(x.flatten()))
12 | elif type(x) == tuple:
13 | x_tupled = x
14 | else:
15 | raise RuntimeError('unexpected type of input: {}'.format(type(x)))
16 | return hash(tuple(map(float, x_tupled)))
17 |
18 | def almostEqual(x, y):
19 | return abs(x - y) < 1e-5
20 |
21 |
22 | class Grader(object):
23 | def __init__(self):
24 | self.submission_page = 'https://hub.coursera-apps.org/api/onDemandProgrammingScriptSubmissions.v1'
25 | self.assignment_key = 'Lhay-55JEeet3xIBvGMumA'
26 | self.parts = OrderedDict([
27 | ('EyiFH', 'best_alpha'),
28 | ('XH82R', 'r2_train_simple_mix'),
29 | ('BHeRs', 'r2_test_simple_mix'),
30 | ('MkwCS', 'r2_train_stacking'),
31 | ('j4Adb', 'r2_test_stacking'),
32 | ])
33 | self.answers = {key: None for key in self.parts}
34 |
35 | @staticmethod
36 | def ravel_output(output):
37 | '''
38 |         If a student accidentally submitted an np.array with one
39 |         element instead of a number, this function will submit
40 |         that number instead
41 | '''
42 | if isinstance(output, np.ndarray) and output.size == 1:
43 | output = output.item(0)
44 | return output
45 |
46 | def submit(self, email, token):
47 | submission = {
48 | "assignmentKey": self.assignment_key,
49 | "submitterEmail": email,
50 | "secret": token,
51 | "parts": {}
52 | }
53 | for part, output in self.answers.items():
54 | if output is not None:
55 | submission["parts"][part] = {"output": output}
56 | else:
57 | submission["parts"][part] = dict()
58 | request = requests.post(self.submission_page, data=json.dumps(submission))
59 | response = request.json()
60 | if request.status_code == 201:
61 | print('Submitted to Coursera platform. See results on assignment page!')
62 | elif u'details' in response and u'learnerMessage' in response[u'details']:
63 | print(response[u'details'][u'learnerMessage'])
64 | else:
65 | print("Unknown response from Coursera: {}".format(request.status_code))
66 | print(response)
67 |
68 | def status(self):
69 | print("You want to submit these numbers:")
70 | for part_id, part_name in self.parts.items():
71 | answer = self.answers[part_id]
72 | if answer is None:
73 | answer = '-'*10
74 | print("Task {}: {}".format(part_name, answer))
75 |
76 | def submit_part(self, part, output):
77 | self.answers[part] = output
78 | print("Current answer for task {} is: {}".format(self.parts[part], output))
79 |
80 | def submit_tag(self, tag, output):
81 | part_id = [k for k, v in self.parts.items() if v == tag]
82 |         if len(part_id) != 1:
83 | raise RuntimeError('cannot match tag with part_id: found {} matches'.format(len(part_id)))
84 | part_id = part_id[0]
85 | self.submit_part(part_id, str(self.ravel_output(output)))
--------------------------------------------------------------------------------
/Programming_assignment_week_4_KNN_features/compute_KNN_features.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Version 1.1.1"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# The task"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "In this assignment you will need to implement features, based on nearest neighbours. \n",
22 | "\n",
23 | "KNN classifier (regressor) is a very powerful model, when the features are homogeneous and it is a very common practice to use KNN as first level model. In this homework we will extend KNN model and compute more features, based on nearest neighbors and their distances. \n",
24 | "\n",
25 | "You will need to implement a number of features, that were one of the key features, that leaded the instructors to prizes in [Otto](https://www.kaggle.com/c/otto-group-product-classification-challenge) and [Springleaf](https://www.kaggle.com/c/springleaf-marketing-response) competitions. Of course, the list of features you will need to implement can be extended, in fact in competitions the list was at least 3 times larger. So when solving a real competition do not hesitate to make up your own features. \n",
26 | "\n",
27 | "You can optionally implement multicore feature computation. Nearest neighbours are hard to compute so it is preferable to have a parallel version of the algorithm. In fact, it is really a cool skill to know how to use `multiprocessing`, `joblib` and etc. In this homework you will have a chance to see the benefits of parallel algorithm. "
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "# Check your versions"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "Some functions we use here are not present in old versions of the libraries, so make sure you have up-to-date software. "
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": null,
47 | "metadata": {
48 | "collapsed": true
49 | },
50 | "outputs": [],
51 | "source": [
52 | "import numpy as np\n",
53 | "import pandas as pd \n",
54 | "import sklearn\n",
55 | "import scipy.sparse \n",
56 | "\n",
57 | "for p in [np, pd, sklearn, scipy]:\n",
58 | " print (p.__name__, p.__version__)"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "The versions should be not less than:\n",
66 | "\n",
67 | " numpy 1.13.1\n",
68 | " pandas 0.20.3\n",
69 | " sklearn 0.19.0\n",
70 | " scipy 0.19.1\n",
71 | " \n",
72 | "**IMPORTANT!** The results with `scipy=1.0.0` will be different! Make sure you use _exactly_ version `0.19.1`."
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "# Load data"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "metadata": {},
85 | "source": [
86 | "Learn features and labels. These features are actually OOF predictions of linear models."
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": null,
92 | "metadata": {
93 | "collapsed": true
94 | },
95 | "outputs": [],
96 | "source": [
97 | "train_path = '../readonly/KNN_features_data/X.npz'\n",
98 | "train_labels = '../readonly/KNN_features_data/Y.npy'\n",
99 | "\n",
100 | "test_path = '../readonly/KNN_features_data/X_test.npz'\n",
101 | "test_labels = '../readonly/KNN_features_data/Y_test.npy'\n",
102 | "\n",
103 | "# Train data\n",
104 | "X = scipy.sparse.load_npz(train_path)\n",
105 | "Y = np.load(train_labels)\n",
106 | "\n",
107 | "# Test data\n",
108 | "X_test = scipy.sparse.load_npz(test_path)\n",
109 | "Y_test = np.load(test_labels)\n",
110 | "\n",
111 | "# Out-of-fold features we loaded above were generated with n_splits=4 and skf seed 123\n",
112 | "# So it is better to use seed 123 for generating KNN features as well \n",
113 | "skf_seed = 123\n",
114 | "n_splits = 4"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "Below you need to implement features, based on nearest neighbors."
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {
128 | "collapsed": true
129 | },
130 | "outputs": [],
131 | "source": [
132 | "from sklearn.base import BaseEstimator, ClassifierMixin\n",
133 | "from sklearn.neighbors import NearestNeighbors\n",
134 | "from multiprocessing import Pool\n",
135 | "\n",
136 | "import numpy as np\n",
137 | "\n",
138 | "\n",
139 | "class NearestNeighborsFeats(BaseEstimator, ClassifierMixin):\n",
140 | " '''\n",
141 | " This class should implement KNN features extraction \n",
142 | " '''\n",
143 | " def __init__(self, n_jobs, k_list, metric, n_classes=None, n_neighbors=None, eps=1e-6):\n",
144 | " self.n_jobs = n_jobs\n",
145 | " self.k_list = k_list\n",
146 | " self.metric = metric\n",
147 | " \n",
148 | " if n_neighbors is None:\n",
149 | " self.n_neighbors = max(k_list) \n",
150 | " else:\n",
151 | " self.n_neighbors = n_neighbors\n",
152 | " \n",
153 | " self.eps = eps \n",
154 | " self.n_classes_ = n_classes\n",
155 | " \n",
156 | " def fit(self, X, y):\n",
157 | " '''\n",
158 | " Set's up the train set and self.NN object\n",
159 | " '''\n",
160 | " # Create a NearestNeighbors (NN) object. We will use it in `predict` function \n",
161 | " self.NN = NearestNeighbors(n_neighbors=max(self.k_list), \n",
162 | " metric=self.metric, \n",
163 | " n_jobs=1, \n",
164 | " algorithm='brute' if self.metric=='cosine' else 'auto')\n",
165 | " self.NN.fit(X)\n",
166 | " \n",
167 | " # Store labels \n",
168 | " self.y_train = y\n",
169 | " \n",
170 | " # Save how many classes we have\n",
171 | " self.n_classes = np.unique(y).shape[0] if self.n_classes_ is None else self.n_classes_\n",
172 | " \n",
173 | " \n",
174 | " def predict(self, X): \n",
175 | " '''\n",
176 | " Produces KNN features for every object of a dataset X\n",
177 | " '''\n",
178 | " if self.n_jobs == 1:\n",
179 | " test_feats = []\n",
180 | " for i in range(X.shape[0]):\n",
181 | " test_feats.append(self.get_features_for_one(X[i:i+1]))\n",
182 | " else:\n",
183 | " '''\n",
184 | " *Make it parallel*\n",
185 | " Number of threads should be controlled by `self.n_jobs` \n",
186 | " \n",
187 | " \n",
188 | " You can use whatever you want to do it\n",
189 | " For Python 3 the simplest option would be to use \n",
190 | " `multiprocessing.Pool` (but don't use `multiprocessing.dummy.Pool` here)\n",
191 |     "        You may try to use `joblib` but you will most likely encounter an error, \n",
192 | " that you will need to google up (and eventually it will work slowly)\n",
193 | " \n",
194 | " For Python 2 I also suggest using `multiprocessing.Pool` \n",
195 | " You will need to use a hint from this blog \n",
196 | " http://qingkaikong.blogspot.ru/2016/12/python-parallel-method-in-class.html\n",
197 | " I could not get `joblib` working at all for this code \n",
198 | " (but in general `joblib` is very convenient)\n",
199 | " \n",
200 | " '''\n",
201 | " \n",
202 | " # YOUR CODE GOES HERE\n",
203 | " # test_feats = # YOUR CODE GOES HERE\n",
204 | " # YOUR CODE GOES HERE\n",
205 | " \n",
206 | " # Comment out this line once you implement the code\n",
207 | " assert False, 'You need to implement it for n_jobs > 1'\n",
208 | " \n",
209 | " \n",
210 | " \n",
211 | " return np.vstack(test_feats)\n",
212 | " \n",
213 | " \n",
214 | " def get_features_for_one(self, x):\n",
215 | " '''\n",
216 | " Computes KNN features for a single object `x`\n",
217 | " '''\n",
218 | "\n",
219 | " NN_output = self.NN.kneighbors(x)\n",
220 | " \n",
221 | " # Vector of size `n_neighbors`\n",
222 | " # Stores indices of the neighbors\n",
223 | " neighs = NN_output[1][0]\n",
224 | " \n",
225 | " # Vector of size `n_neighbors`\n",
226 | " # Stores distances to corresponding neighbors\n",
227 | " neighs_dist = NN_output[0][0] \n",
228 | "\n",
229 | " # Vector of size `n_neighbors`\n",
230 | " # Stores labels of corresponding neighbors\n",
231 | " neighs_y = self.y_train[neighs] \n",
232 | " \n",
233 | " ## ========================================== ##\n",
234 | " ## YOUR CODE BELOW\n",
235 | " ## ========================================== ##\n",
236 | " \n",
237 | " # We will accumulate the computed features here\n",
238 | " # Eventually it will be a list of lists or np.arrays\n",
239 | " # and we will use np.hstack to concatenate those\n",
240 | " return_list = [] \n",
241 | " \n",
242 | " \n",
243 | " ''' \n",
244 | " 1. Fraction of objects of every class.\n",
245 |     "           These are basically a KNN classifier's predictions.\n",
246 | "\n",
247 | " Take a look at `np.bincount` function, it can be very helpful\n",
248 | " Note that the values should sum up to one\n",
249 | " '''\n",
250 | " for k in self.k_list:\n",
251 | " # YOUR CODE GOES HERE\n",
252 | " \n",
253 | " assert len(feats) == self.n_classes\n",
254 | " return_list += [feats]\n",
255 | " \n",
256 | " \n",
257 | " '''\n",
258 | " 2. Same label streak: the largest number N, \n",
259 | " such that N nearest neighbors have the same label.\n",
260 | " \n",
261 | " What can help you: `np.where`\n",
262 | " '''\n",
263 | " \n",
264 | " feats = # YOUR CODE GOES HERE\n",
265 | " \n",
266 | " assert len(feats) == 1\n",
267 | " return_list += [feats]\n",
268 | " \n",
269 | " '''\n",
270 | " 3. Minimum distance to objects of each class\n",
271 | " Find the first instance of a class and take its distance as features.\n",
272 | " \n",
273 |     "           If there are no neighboring objects of some class, \n",
274 |     "           then set the distance to that class to 999.\n",
275 | "\n",
276 | " `np.where` might be helpful\n",
277 | " '''\n",
278 | " feats = []\n",
279 | " for c in range(self.n_classes):\n",
280 | " # YOUR CODE GOES HERE\n",
281 | " \n",
282 | " assert len(feats) == self.n_classes\n",
283 | " return_list += [feats]\n",
284 | " \n",
285 | " '''\n",
286 | " 4. Minimum *normalized* distance to objects of each class\n",
287 | " As 3. but we normalize (divide) the distances\n",
288 | " by the distance to the closest neighbor.\n",
289 | " \n",
290 |     "           If there are no neighboring objects of some class, \n",
291 |     "           then set the distance to that class to 999.\n",
292 | " \n",
293 | " Do not forget to add self.eps to denominator.\n",
294 | " '''\n",
295 | " feats = []\n",
296 | " for c in range(self.n_classes):\n",
297 | " # YOUR CODE GOES HERE\n",
298 | " \n",
299 | " assert len(feats) == self.n_classes\n",
300 | " return_list += [feats]\n",
301 | " \n",
302 | " '''\n",
303 | " 5. \n",
304 | " 5.1 Distance to Kth neighbor\n",
305 |     "               Think of this as quantiles of a distribution\n",
306 | " 5.2 Distance to Kth neighbor normalized by \n",
307 | " distance to the first neighbor\n",
308 | " \n",
309 | " feat_51, feat_52 are answers to 5.1. and 5.2.\n",
310 | " should be scalars\n",
311 | " \n",
312 | " Do not forget to add self.eps to denominator.\n",
313 | " '''\n",
314 | " for k in self.k_list:\n",
315 | " \n",
316 | " feat_51 = # YOUR CODE GOES HERE\n",
317 | " feat_52 = # YOUR CODE GOES HERE\n",
318 | " \n",
319 | " return_list += [[feat_51, feat_52]]\n",
320 | " \n",
321 | " '''\n",
322 | " 6. Mean distance to neighbors of each class for each K from `k_list` \n",
323 | " For each class select the neighbors of that class among K nearest neighbors \n",
324 | " and compute the average distance to those objects\n",
325 | " \n",
326 | " If there are no objects of a certain class among K neighbors, set mean distance to 999\n",
327 | " \n",
328 | " You can use `np.bincount` with appropriate weights\n",
329 |     "           Don't forget that if you divide by something, \n",
330 |     "           you need to add `self.eps` to the denominator.\n",
331 | " '''\n",
332 | " for k in self.k_list:\n",
333 | " \n",
334 | " # YOUR CODE GOES IN HERE\n",
335 | " \n",
336 | " assert len(feats) == self.n_classes\n",
337 | " return_list += [feats]\n",
338 | " \n",
339 | " \n",
340 | " # merge\n",
341 | " knn_feats = np.hstack(return_list)\n",
342 | " \n",
343 | " assert knn_feats.shape == (239,) or knn_feats.shape == (239, 1)\n",
344 | " return knn_feats"
345 | ]
346 | },
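To make two of the gaps above concrete, here is a small self-contained sketch under stated assumptions: `class_fractions` illustrates section 1 (class fractions via `np.bincount`), and the `Pool` example mirrors the parallel branch of `predict`. The function names and the toy data are hypothetical, not part of the assignment.

```python
import numpy as np
from multiprocessing import Pool

def class_fractions(neighs_y, k, n_classes):
    '''Sketch of section 1: fraction of each class among the k nearest
    neighbors; the values sum to one.'''
    counts = np.bincount(neighs_y[:k], minlength=n_classes)
    return counts / counts.sum()

def features_for_one(x):
    '''Hypothetical stand-in for NearestNeighborsFeats.get_features_for_one.'''
    return np.asarray(x).sum(axis=1)

if __name__ == '__main__':
    # Section 1 on toy labels: 3 of 5 neighbors belong to class 0
    print(class_fractions(np.array([0, 2, 0, 1, 0]), k=5, n_classes=3))  # [0.6 0.2 0.2]

    # Parallel branch of `predict` (sketch): Pool.map over the rows of X,
    # with the number of processes controlled by n_jobs
    X = np.random.rand(10, 4)
    with Pool(processes=2) as pool:
        feats = pool.map(features_for_one, [X[i:i + 1] for i in range(X.shape[0])])
    print(np.vstack(feats).shape)  # (10, 1)
```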
347 | {
348 | "cell_type": "markdown",
349 | "metadata": {},
350 | "source": [
351 | "## Sanity check"
352 | ]
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {},
357 | "source": [
358 | "To make sure you've implemented everything correctly we provide you the correct features for the first 50 objects."
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": null,
364 | "metadata": {
365 | "collapsed": true
366 | },
367 | "outputs": [],
368 | "source": [
369 | "# a list of K in KNN, starts with one \n",
370 | "k_list = [3, 8, 32]\n",
371 | "\n",
372 | "# Load correct features\n",
373 | "true_knn_feats_first50 = np.load('../readonly/KNN_features_data/knn_feats_test_first50.npy')\n",
374 | "\n",
375 | "# Create instance of our KNN feature extractor\n",
376 | "NNF = NearestNeighborsFeats(n_jobs=1, k_list=k_list, metric='minkowski')\n",
377 | "\n",
378 | "# Fit on train set\n",
379 | "NNF.fit(X, Y)\n",
380 | "\n",
381 | "# Get features for test\n",
382 | "test_knn_feats = NNF.predict(X_test[:50])\n",
383 | "\n",
384 | "# This should be zero\n",
385 | "print ('Deviation from ground thruth features: %f' % np.abs(test_knn_feats - true_knn_feats_first50).sum())\n",
386 | "\n",
387 | "deviation =np.abs(test_knn_feats - true_knn_feats_first50).sum(0)\n",
388 | "for m in np.where(deviation > 1e-3)[0]: \n",
389 | " p = np.where(np.array([87, 88, 117, 146, 152, 239]) > m)[0][0]\n",
390 | " print ('There is a problem in feature %d, which is a part of section %d.' % (m, p + 1))"
391 | ]
392 | },
393 | {
394 | "cell_type": "markdown",
395 | "metadata": {},
396 | "source": [
397 | "Now implement parallel computations and compute features for the train and test sets. "
398 | ]
399 | },
400 | {
401 | "cell_type": "markdown",
402 | "metadata": {},
403 | "source": [
404 | "## Get features for test"
405 | ]
406 | },
407 | {
408 | "cell_type": "markdown",
409 | "metadata": {},
410 | "source": [
411 | "Now compute features for the whole test set."
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": null,
417 | "metadata": {
418 | "collapsed": true
419 | },
420 | "outputs": [],
421 | "source": [
422 | "for metric in ['minkowski', 'cosine']:\n",
423 | " print (metric)\n",
424 | " \n",
425 | " # Create instance of our KNN feature extractor\n",
426 | " NNF = NearestNeighborsFeats(n_jobs=4, k_list=k_list, metric=metric)\n",
427 | " \n",
428 | " # Fit on train set\n",
429 | " NNF.fit(X, Y)\n",
430 | "\n",
431 | " # Get features for test\n",
432 | " test_knn_feats = NNF.predict(X_test)\n",
433 | " \n",
434 | " # Dump the features to disk\n",
435 | " np.save('data/knn_feats_%s_test.npy' % metric , test_knn_feats)"
436 | ]
437 | },
438 | {
439 | "cell_type": "markdown",
440 | "metadata": {},
441 | "source": [
442 | "## Get features for train"
443 | ]
444 | },
445 | {
446 | "cell_type": "markdown",
447 | "metadata": {},
448 | "source": [
449 | "Compute features for train, using out-of-fold strategy."
450 | ]
451 | },
452 | {
453 | "cell_type": "code",
454 | "execution_count": null,
455 | "metadata": {
456 | "collapsed": true
457 | },
458 | "outputs": [],
459 | "source": [
460 | "# Differently from other homework we will not implement OOF predictions ourselves\n",
461 | "# but use sklearn's `cross_val_predict`\n",
462 | "from sklearn.model_selection import cross_val_predict\n",
463 | "from sklearn.model_selection import StratifiedKFold\n",
464 | "\n",
465 | "# We will use two metrics for KNN\n",
466 | "for metric in ['minkowski', 'cosine']:\n",
467 | " print (metric)\n",
468 | " \n",
469 | " # Set up splitting scheme, use StratifiedKFold\n",
470 | " # use skf_seed and n_splits defined above with shuffle=True\n",
471 | " skf = # YOUR CODE GOES HERE\n",
472 | " \n",
473 | " # Create instance of our KNN feature extractor\n",
474 | " # n_jobs can be larger than the number of cores\n",
475 | " NNF = NearestNeighborsFeats(n_jobs=4, k_list=k_list, metric=metric)\n",
476 | " \n",
477 |     "    # Get KNN features using OOF; use cross_val_predict with the right parameters\n",
478 | " preds = # YOUR CODE GOES HERE\n",
479 | " \n",
480 | " # Save the features\n",
481 | " np.save('data/knn_feats_%s_train.npy' % metric, preds)"
482 | ]
483 | },
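A hedged sketch of the two gaps in the loop above; it assumes `cross_val_predict` simply calls `NNF.predict` on each held-out fold, so the OOF output is the KNN feature matrix:

```python
# Sketch (assumption): splitting scheme with the seed and fold count defined earlier
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=skf_seed)

# OOF KNN features for the train set
preds = cross_val_predict(NNF, X, Y, cv=skf)
```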
484 | {
485 | "cell_type": "markdown",
486 | "metadata": {},
487 | "source": [
488 | "# Submit"
489 | ]
490 | },
491 | {
492 | "cell_type": "markdown",
493 | "metadata": {},
494 | "source": [
495 | "If you made the above cells work, just run the following cell to produce a number to submit."
496 | ]
497 | },
498 | {
499 | "cell_type": "code",
500 | "execution_count": null,
501 | "metadata": {
502 | "collapsed": true
503 | },
504 | "outputs": [],
505 | "source": [
506 | "s = 0\n",
507 | "for metric in ['minkowski', 'cosine']:\n",
508 | " knn_feats_train = np.load('data/knn_feats_%s_train.npy' % metric)\n",
509 | " knn_feats_test = np.load('data/knn_feats_%s_test.npy' % metric)\n",
510 | " \n",
511 | " s += knn_feats_train.mean() + knn_feats_test.mean()\n",
512 | " \n",
513 | "answer = np.floor(s)\n",
514 | "print (answer)"
515 | ]
516 | },
517 | {
518 | "cell_type": "markdown",
519 | "metadata": {},
520 | "source": [
521 | "Submit!"
522 | ]
523 | },
524 | {
525 | "cell_type": "code",
526 | "execution_count": null,
527 | "metadata": {
528 | "collapsed": true
529 | },
530 | "outputs": [],
531 | "source": [
532 | "from grader import Grader\n",
533 | "\n",
534 | "grader = Grader()\n",
535 | "\n",
536 | "grader.submit_tag('statistic', answer)\n",
537 | "\n",
538 | "STUDENT_EMAIL = # EMAIL HERE\n",
539 | "STUDENT_TOKEN = # TOKEN HERE\n",
540 | "grader.status()\n",
541 | "\n",
542 | "grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)"
543 | ]
544 | }
545 | ],
546 | "metadata": {
547 | "anaconda-cloud": {},
548 | "kernelspec": {
549 | "display_name": "Python 3",
550 | "language": "python",
551 | "name": "python3"
552 | },
553 | "language_info": {
554 | "codemirror_mode": {
555 | "name": "ipython",
556 | "version": 3
557 | },
558 | "file_extension": ".py",
559 | "mimetype": "text/x-python",
560 | "name": "python",
561 | "nbconvert_exporter": "python",
562 | "pygments_lexer": "ipython3",
563 | "version": "3.6.0"
564 | }
565 | },
566 | "nbformat": 4,
567 | "nbformat_minor": 2
568 | }
569 |
--------------------------------------------------------------------------------
/Programming_assignment_week_4_KNN_features/grader.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 | import numpy as np
4 | from collections import OrderedDict
5 |
6 | def array_to_hash(x):
7 | x_tupled = None
8 | if type(x) == list:
9 | x_tupled = tuple(x)
10 | elif type(x) == np.ndarray:
11 | x_tupled = tuple(list(x.flatten()))
12 | elif type(x) == tuple:
13 | x_tupled = x
14 | else:
15 | raise RuntimeError('unexpected type of input: {}'.format(type(x)))
16 | return hash(tuple(map(float, x_tupled)))
17 |
18 | def almostEqual(x, y):
19 | return abs(x - y) < 1e-3
20 |
21 |
22 | class Grader(object):
23 | def __init__(self):
24 | self.submission_page = 'https://hub.coursera-apps.org/api/onDemandProgrammingScriptSubmissions.v1'
25 | self.assignment_key = 'r2N4iqFlEeeRFQqEddeEzg'
26 | self.parts = OrderedDict([
27 | ('1O8kU', 'statistic')])
28 | self.answers = {key: None for key in self.parts}
29 |
30 | @staticmethod
31 | def ravel_output(output):
32 | '''
33 |         If a student accidentally submitted an np.array with one
34 |         element instead of a number, this function will submit
35 |         that number instead
36 | '''
37 | if isinstance(output, np.ndarray) and output.size == 1:
38 | output = output.item(0)
39 | return output
40 |
41 | def submit(self, email, token):
42 | submission = {
43 | "assignmentKey": self.assignment_key,
44 | "submitterEmail": email,
45 | "secret": token,
46 | "parts": {}
47 | }
48 | for part, output in self.answers.items():
49 | if output is not None:
50 | submission["parts"][part] = {"output": output}
51 | else:
52 | submission["parts"][part] = dict()
53 | request = requests.post(self.submission_page, data=json.dumps(submission))
54 | response = request.json()
55 | if request.status_code == 201:
56 | print('Submitted to Coursera platform. See results on assignment page!')
57 | elif u'details' in response and u'learnerMessage' in response[u'details']:
58 | print(response[u'details'][u'learnerMessage'])
59 | else:
60 | print("Unknown response from Coursera: {}".format(request.status_code))
61 | print(response)
62 |
63 | def status(self):
64 | print("You want to submit these numbers:")
65 | for part_id, part_name in self.parts.items():
66 | answer = self.answers[part_id]
67 | if answer is None:
68 | answer = '-'*10
69 | print("Task {}: {}".format(part_name, answer))
70 |
71 | def submit_part(self, part, output):
72 | self.answers[part] = output
73 | print("Current answer for task {} is: {}".format(self.parts[part], output))
74 |
75 | def submit_tag(self, tag, output):
76 | part_id = [k for k, v in self.parts.items() if v == tag]
77 |         if len(part_id) != 1:
78 | raise RuntimeError('cannot match tag with part_id: found {} matches'.format(len(part_id)))
79 | part_id = part_id[0]
80 | self.submit_part(part_id, str(self.ravel_output(output)))
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Materials for "How to Win a Data Science Competition: Learn from Top Kagglers" course
2 |
3 | This repository contains the programming assignment notebooks for the ML [course](https://www.coursera.org/learn/competitive-data-science/home/welcome) about competitive data science.
4 |
--------------------------------------------------------------------------------
/Reading_materials/GBM_drop_tree.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Hi! In this notebook we will do a little \"how *Gradient Boosting* works\" and find out answer for the question:\n",
8 | "## \"Will performance of GBDT model drop dramatically if we remove the first tree?\""
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "metadata": {},
15 | "outputs": [],
16 | "source": [
17 | "import numpy as np\n",
18 | "import matplotlib.pyplot as plt\n",
19 | "import seaborn as sns\n",
20 | "%matplotlib inline \n",
21 | "\n",
22 | "from sklearn.metrics import log_loss\n",
23 | "from sklearn.tree import DecisionTreeClassifier\n",
24 | "from sklearn.ensemble import GradientBoostingClassifier\n",
25 | "from sklearn.datasets import make_hastie_10_2\n",
26 | "from sklearn.model_selection import train_test_split\n",
27 | "\n",
28 | "def sigmoid(x):\n",
29 | " return 1 / (1 + np.exp(-x))"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "# Make dataset\n",
37 | "We will use a very simple dataset: objects will come from 1D normal distribution, we will need to predict class $1$ if the object is positive and 0 otherwise."
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 2,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "X_all = np.random.randn(5000, 1)\n",
47 | "y_all = (X_all[:, 0] > 0)*2 - 1\n",
48 | "\n",
49 | "X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.5, random_state=42)"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "The datast is really simple and can be solved with a single decision stump."
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 3,
62 | "metadata": {},
63 | "outputs": [
64 | {
65 | "name": "stdout",
66 | "output_type": "stream",
67 | "text": [
68 | "Accuracy for a single decision stump: 1.0\n"
69 | ]
70 | }
71 | ],
72 | "source": [
73 | "clf = DecisionTreeClassifier(max_depth=1)\n",
74 | "clf.fit(X_train, y_train)\n",
75 | "\n",
76 | "print ('Accuracy for a single decision stump: {}'.format(clf.score(X_test, y_test)))"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "# Learn GBM"
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "But we will need 800 trees in GBM to classify it correctly."
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 4,
96 | "metadata": {},
97 | "outputs": [
98 | {
99 | "name": "stdout",
100 | "output_type": "stream",
101 | "text": [
102 | "Test logloss: 0.0003135802484425486\n"
103 | ]
104 | }
105 | ],
106 | "source": [
107 | "# For convenience we will use sklearn's GBM, the situation will be similar with XGBoost and others\n",
108 | "clf = GradientBoostingClassifier(n_estimators=5000, learning_rate=0.01, max_depth=3, random_state=0)\n",
109 | "clf.fit(X_train, y_train)\n",
110 | "\n",
111 | "y_pred = clf.predict_proba(X_test)[:, 1]\n",
112 | "print(\"Test logloss: {}\".format(log_loss(y_test, y_pred)))"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": 5,
118 | "metadata": {},
119 | "outputs": [
120 | {
121 | "name": "stdout",
122 | "output_type": "stream",
123 | "text": [
124 | "Logloss using all trees: 0.0003135802484425486\n",
125 | "Logloss using all trees but last: 0.00031358024844265755\n",
126 | "Logloss using all trees but first: 0.00032053682522239753\n"
127 | ]
128 | }
129 | ],
130 | "source": [
131 | "def compute_loss(y_true, scores_pred):\n",
132 | " '''\n",
133 | " Since we use raw scores we will wrap log_loss \n",
134 | " and apply sigmoid to our predictions before computing log_loss itself\n",
135 | " '''\n",
136 | " return log_loss(y_true, sigmoid(scores_pred))\n",
137 | " \n",
138 | "\n",
139 | "'''\n",
140 |     "  Get the cumulative sum of the *decision function* over trees: the i-th element is the sum of trees 0...i-1.\n",
141 |     "  We cannot use staged_predict_proba, since we want to manipulate the raw scores\n",
142 |     "  (not probabilities), and only at the end convert the scores to probabilities using the sigmoid\n",
143 | "'''\n",
144 | "cum_preds = np.array([x for x in clf.staged_decision_function(X_test)])[:, :, 0] \n",
145 | "\n",
146 | "print (\"Logloss using all trees: {}\".format(compute_loss(y_test, cum_preds[-1, :])))\n",
147 | "print (\"Logloss using all trees but last: {}\".format(compute_loss(y_test, cum_preds[-2, :])))\n",
148 | "print (\"Logloss using all trees but first: {}\".format(compute_loss(y_test, cum_preds[-1, :] - cum_preds[0, :])))"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "You can see that there is a difference, but not as huge as one could expect! Moreover, if we get rid of the first tree — overall model still works! \n",
156 | "\n",
157 | "If this is supprising for you — take a look at the plot of cummulative decision function depending on the number of trees."
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 6,
163 | "metadata": {},
164 | "outputs": [
165 | {
166 | "data": {
167 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEGCAYAAACevtWaAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHiZJREFUeJzt3XuUHHWd9/H3JDPJZJJJMkk69wuGyxeBlUcQBBUJK6tR\nQJ7FoLsb5SLKwx5leVxhj8cLorsefdhHWMVnd72w3pCz6B51UVZBWEAXlJsscv3mQkgyMySZkJnM\nJJNMMjP9/FHV0Alzqenu6q6u+rzOyaG7pqvq+83l28W3fvX7NeTzeUREJF0m1ToAERGpPBV3EZEU\nUnEXEUkhFXcRkRRScRcRSaHGWgdQ0NXVV/Kwnba2Frq7+ysZTuIp52xQzulXbr65XGvDSNtTceXe\n2Di51iFUnXLOBuWcfnHlm4riLiIih1JxFxFJIRV3EZEUUnEXEUkhFXcRkRSKbSikmc0Avge0AVOB\nz7n7nXGdT0REXhHnlfslgLv7WcAa4CsxnktERIrE+RDTTuB14eu28L0A9/6+nQ0dvWUdo7m5kf37\nBysUUXwaRny8ojTNzU3s33+wcgeskAqm+CpTm5sYSGDOccpazvPmTGf1KUuZ2lTZ8e6xFXd3/1cz\nu8TMNhAU93PiOlc96d9/kFvuWodm0RcRgEmTGjjV5rFo7vSKHjfOnvv7gS3uvtrMTgRuBt4w2ufb\n2lrKelIrl2sted9qevjpbeSB/3nmkZx3xspahxOvDHyDZSBFiVlLcyOtLVMqftw42zJvBu4EcPcn\nzGyxmU1296GRPlzm3Ap0dfWVvH81PfRkJwDHLJ5Jw+CIvxWR1FPOlZLEnONsyUAyc45b1nJubWkp\nK9/RLmzjvKG6AXgjgJmtAPaMVtizxLf00Dh5EkcumVnrUEQkxeK8cv868C9mdn94nitiPFdd6N9/\nkC3b+zhm2WyaMjY5kohUV5w3VPcA743r+PVo3dbd5AFbPrvWoYhIyukJ1Sp6bks3AMcub6txJCKS\ndiruVaR+u4hUi4p7lRT67Ucunql+u4jETsW9StRvF5FqUnGvEvXbRaSaVNyrRP12EakmFfcqUL9d\nRKpNxb0K1G8XkWpTca8C9dtFpNpU3KtA/XYRqTYV95ip3y4itaDiHjP120WkFlTcY6Z+u4jUgop7\nzNRvF5FaUHGPkfrtIlIrKu4xUr9dRGolzgWyLwM+ULTpDe4+I67zJZH67SJSK3GuxHQzcDOAmZ1J\nBldlUr9dRGolzjVUi10LrK3SuRJB66WKSC3FXtzN7BRgq7tvG+tzbW0tNJZRBHO51pL3jcPDT28j\nD5z02gWxxZa0nKtBOWdD1nKOI99qXLl/CPjOeB/q7u4v+QS5XCtdXX0l7x+Hh57sBGDZ3JZYYkti\nznFTztmQtZzLzXe0L4ZqjJZZBTxYhfMkivrtIlJLsRZ3M1sM7HH3A3GeJ2k0vl1Eai3uK/dFwI6Y\nz5E4Gt8uIrUWa8/d3R8D3hnnOZJI49tFpNb0hGoM1G8XkVpTca8w9dtFJAlU3CtM/XYRSQIV9wpT\nv11EkkDFvcLUbxeRJIhU3M1srpm9IXytL4RRqN8uIkkxbqE2sz8HfscrUwjcFE7nK4dRv11EkiLK\nVfhfAycCXeH7q4HLY4uojqnfLiJJEaW473b3l2f1cvd9QKamE4hK/XYRSYooT6juNLOLgWlmdhLw\nPl65ipeQ5m8XkSSJcuV+BXAK0Ap8C2gmmMZXiqjfLiJJEuXK/XR3/2jskdQ59dtFJEki3VA1s2ot\nx1e31G8XkSSJUrR7gGfM7PcU3Uh194tii6rOqN8uIkkTpbj/PPwlo1C/XUSSZtzi7u7fNbMjgJOA\nPPCYu2+JcnAzWwv8DTAIXOvud5QRa2Kp3y4iSRPlCdUrgHuBPwPWAveFQyPH228u8FngLcC5wPnl\nhZpc6reLSNJEact8AHitu+8HMLPpwN3Ad8fZ72zgbnfvA/pI6VOt6reLSBJFKe6DhcIO4O57zSzK\nE6pHAC1mdjvQBlzn7veM9uG2thYayyiOuVxryfuW4+Gnt5EHTnrtgqrHUKuca0k5Z0PWco4j3yjF\nfauZ3QT8Kny/GojSc28A5gJ/CqwA7jWzFe6eH+nD3d39I22OJJdrpaurr+T9y/HQk50ALJvbUtUY\naplzrSjnbMhazuXmO9oXQ5Rx7pcDHcClwCXAJqK1WLYDD7r7oLtvJGjN5KIEW0/UbxeRJIpS3PcD\nD7j7n7r7BcBzwECE/e4C/tjMJoU3V2cAO0sPNXk0f7uIJFWU4v514F1F71cBN4+3k7t3AP9GMBf8\nL4Ar3X24hBgTS+PbRSSpovTcj3H3DxfeuPvHzey+KAd3968TfDmkksa3i0hSRblyn2ZmcwpvzGwx\nwcyQmad+u4gkVZQr988DT5vZFmAysBjI/DJ7Gt8uIkkWZfqBn5vZSuA4gukHnitemSmr1G8XkSSL\nMv3AycDb3P0x4D3Az8zsjNgjSzj120UkyaL03L8KeFjQTwGuBD4Xa1R1QP12EUmySOPc3X098G7g\nG+7+DJCqIY0TpfHtIpJ0UYr7dDO7kGAagbvCkTOZ7kWo3y4iSReluH+CYKrfT7p7L/BXwA2xRpVw\n6reLSNJFGS1zH3Bf0fvr4gunPqjfLiJJF+XKXYoU+u0r1W8XkQRTcZ+gde1hv32Z+u0iklxRnlAF\nwMwaCOZoByBtk4BF5WG/XTdTRSTJxi3uZnYN8CmgMCN8A8GTqpnsSfiWHiZPauDIJbNqHYqIyKii\nXLl/EHidu0dZfSnV9g0Msnl7H0cunsXUpkx+t4lInYjSc1+vwh5Y376bfF4tGRFJvihX7k+a2a0E\nwyEHCxvd/V/iCiqpfKv67SJSH6IU98UEy+qdXrQtD4xZ3M1sFfAj4Olw05PufmUJMSbGui09TGpo\n4Cj120Uk4aI8xHQpQDjtQN7duydw/PvdfU2pwSXJwIEhXtjWxxGLWmmeEnmQkYhITUQZLfMm4PsE\no2UazOwl4P3u/mjcwSXJho7dDA3nNb5dROpClEvQLwHnu/tTAGb2euArwFsj7Hucmd0OzAE+5+6/\nGu2DbW0tNJbxxGcu1zr+h8qw5ZGtAJz6R4tjP1dUSYmjmpRzNmQt5zjyjVLchwqFHcDdHzezwbF2\nCK0nmPf9h8BK4F4zO8rdD4z04e7u0hd3yuVa6erqK3n/KB73HTQ0wPzWKbGfK4pq5Jw0yjkbspZz\nufmO9sUQpbgPm9kFwN3h+9XA0Hg7uXsHcFv4dqOZbQOWAJsinDNRBg4OsamzlxULWpk2Vf12EUm+\nKOPcrwAuBzYDLwAXh9vGZGZrzezq8PVCYAHQUXKkNfR8od+uIZAiUieijJZZT3C1PlG3A7ea2fnA\nFOAvR2vJJJ1v7QHAlmn+dhGpD6MWdzP7irtfZWa/IRjXfgh3H/OGqrv3AeeVH2Lt+ZYeGoBjlml8\nu4jUh7Gu3AsPKX26GoEk1cHBI
TZ29rJs/gxamptqHY6ISCSj9tzd/Ynw5R+Ane5+P9BMMATSqxBb\nIjzf2cvg0DDHqN8uInUkyg3VW4DFZnY08GXgJeDmWKNKEN+ifruI1J8oxb0lfPjoQuBr7v6PBDdI\nM6FwM1X9dhGpJ1GK+3QzywFrgDvCFZkycRk7ODTMxo7dLM1Np7UlM99nIpICUYr7DwieNv1Pd98K\nXEsw/W/qbXqxlwODw2rJiEjdiTLO/SsEc8kU/IO7744vpOR4ud+um6kiUmdKGuduZuOOc0+DV/rt\nKu4iUl80zn0Ug0PDbGjfzaK5Lcycrn67iNSXKOPc1wEnuvv94Vj3PwE2VCO4Wtq8vY+Bg0PYcvXb\nRaT+RLmh+m1gW9H7Jxlnib00eGV8u1oyIlJ/ohT3Znf/YeGNu99GBsa562aqiNSzKJOT581sNXA/\nwZfBakaYSCxNhoaHWd/ew4I5LcyeMbXW4YiITFiUK/cPA1cDO4BO4EME87un1pbte9h/YEgtGRGp\nW1HGuW8AzjazBndP9RV7gVoyIlLvxr1yN7MTzexR4Nnw/WfM7I1RDm5m08xso5ldUl6Y1bVuq26m\nikh9i9KW+RrwQeDF8P1twA0Rj/9pYFcJcdXM8HCedVt7yM1uZs7M5lqHIyJSkijF/aC7/6Hwxt3X\nAYPj7WRmxwLHAXeUHl71tXftoX9gUPPJiEhdizJaZtDMXkM4QsbM3gk0RNjvy8BHCRbUHldbWwuN\njZOjfHREuVxryfsWe/DZHQCccsLCih0zLkmPLw7KORuylnMc+UYp7lcD/w6Yme0GXmCcgm1mFwG/\ndfdNZhYpkO7u/kifG0ku10pXV1/J+xd77Jngea1Fs5srdsw4VDLneqGcsyFrOZeb72hfDFFGy/wB\neF04p/uAu/dGON85wEozOxdYCgyYWbu73z2BmKtuOB/02+fObGberGm1DkdEpGRjzQr5bUZ4WKlw\nJe7uHxxtX3d/X9HnrwNeSHphB+js2sve/YOceNS8WociIlKWsW6o/hfwADAMzAGeAJ4CFgCl91AS\nzDUEUkRSYtQrd3e/GcDMLnD3cwrbzexG4CdRT+Du15UTYDX5lm5ADy+JSP2LMhRyuZkVV7tWYGVM\n8dRMPuy3t7VOJTdb/XYRqW9RRsv8E7DBzDYR9OBfA3wh1qhqYNuufnr7D/LG4xbQ0BBlpKeISHJF\nGS3zj2Z2C3AUwfj2je7eE3tkVba+PVgW9pils2ociYhI+aJcuRMOf/x9zLHUVGE+maN1M1VEUiBK\nzz0T1m3tYXpzI4vnTa91KCIiZVNxB3b17mfn7v0cvXQ2k9RvF5EUGLctY2ZtwKeAhe7+fjM7D/id\nu3fFHl2VFFoyx6glIyIpEeXK/VvAFoJRMgBTge/GFlENrCvcTFVxF5GUiFLcc+7+VeAAgLv/G9AS\na1RVtm5rD1ObJrN8wYxahyIiUhGReu5m1sQrU/4uAFJz13HPvoN07tzLkUtm0jhZtyBEJB2iDIX8\nGvAIsMjMbgdOBa6KNaoqWl/oty9VS0ZE0iNKcf8R8CBwOjAA/C93f3HsXeqH62aqiKRQlOK+FbgV\nuKV4ub20WN/ew+RJDaxcPLPWoYiIVEyU4n4a8F7gm2Y2FbgFuNXdO2ONrAr2Hxhk87Y9rFw8kylN\npS/xJyKSNFHmlmkHbgBuMLMjgGuA54HmeEOL38aOXobzeY5epvlkRCRdIs0tY2YnAGuAC4CXCBa+\nrntanENE0irKE6rPEay8dCvwTnfviHJgM2sBvkOwclMz8Lfu/vPSQ6289Vt7aACOWqIrdxFJlyhX\n7he4+zMlHPs84FF3v97MVgC/AhJT3A8ODrOxs5dl82fQ0txU63BERCpqrAWybwsXur7TzIoXym4A\n8u6+fKwDu/ttRW+XAe1lRVphL2zrZXBoWFP8ikgqjXXl/lfhf98yws8iP6FqZg8CS4Fzx/pcW1sL\njY2lj1jJ5Von9Pn7/hAM1X/D8QsnvG9S1Gvc5VDO2ZC1nOPId6wFsreHL7/u7quLf2ZmjwCnRDmB\nu7/JzP4HcIuZneju+ZE+193dHzHkV8vlWunq6pvQPo8/twOAhTOnTnjfJCgl53qnnLMhazmXm+9o\nXwxjtWXWAtcCK8xsS9GPmoDtI+91yP4nAzvcfau7/7eZNQI5YMdEAo/D8HCeDR09LGibxqwZU2sd\njohIxY06U5a7/wA4DvhX4IyiX6cCJ0U49luBj8PLk43NAHaWGW9FbN2xh30DQ5pyQERSa8xpEN19\nyN0vIRjbng9/NQO/i3Dsfwbmm9lvgDuAj7j7cHnhVsb6ds0nIyLpFmWc+zUEKzFNBfYA04AfjLef\nu+8D/qLcAOOwoSNYnOOopRrfLiLpFGUC8wuB+QRL6+UICvZTsUYVs/Xtu5nZ0sT82dNqHYqISCyi\nFPc+dz8ATAFw99uB82ONKkYv7d5Pd98ARy2dTYMWwxaRlIryhGp3OHLmKTP7NvAMsDjesOKzviPo\nt2vKARFJsyhX7hcBDwAfA9YTPJD053EGFaeN7b2A+u0ikm5jjXNfedimhQTDIuva+o4eGidPYsWC\nbD0BJyLZMlZb5h6CoY8jNabzwOHFP/H2DQyydccejloyi6ZGLYYtIuk11vQDr6lmINXw/Iu95PNq\nyYhI+kUZ5/69kba7+0WVDydeG9rD8e26mSoiKRdltMw9Ra+nAGcBm+IJJ14vP7yk4i4iKRdlDdXv\nHrbpm2aWmEU3ohoezrOxYzcL57TQ2jKl1uGIiMQqSlvm8DuPy4Cj4wknPu1de9h/YEj9dhHJhCht\nmUEOHTWzG/g/sUUUk0JL5mi1ZEQkA6K0ZVIxZlCThYlIlkRpyywG1gCzKBrz7u6fjzGuitvQvpsZ\n05pYOKel1qGIiMQuylX5L4DXE4yUaSr6VTe6+wbYuXs/Ry2ZpcnCRCQTovTcX3L3S0s5uJldT7B6\nUyPwRXf/cSnHKdfGsCVz5JKZtTi9iEjVRSnuPwlnhfwtwc1VANx9y+i7gJmdBZzg7qeb2VzgcaAm\nxf35F4PJwlYuVr9dRLIhSnF/HbCWYKm9gjywfJz9fg08HL7uAaab2WR3H5pwlGV6vrOXBuCIhZos\nTESyIUpxPw1oc/eBiRw4LOJ7w7eXAf8xVmFva2uhsXHyRE5xiFxu5MI9NDTM5u19LF/YyvKlbSUf\nP4lGyznNlHM2ZC3nOPKNUtwfIVgUe0LFvcDMzico7m8f63Pd3f2lHB4IfmO6uvpG/NmW7X0MHBhi\n+fwZo36mHo2Vc1op52zIWs7l5jvaF0OU4r4UeMHMnuXQnvtbx9vRzN5BsLj2anffHS3Uynql366b\nqSKSHVGK+xdKObCZzQL+Hjjb3XeVcoxKeL5TN1NFJHuiFPdSG+HvA+YBPzSzwraLxhtlU2mbOnuZ\n2jSZJfOmV/O0IiI1FaW4f6bo9RTgeII1Vf9zrJ3c/RvAN0oPrXz7Bgbp3LmXY5bNZtIkPb
wkItkR\nZW6Zs4rfm9l84IuxRVRBL7zYG6wHqH67iGTMhCcFc/cdwGtjiKXidDNVRLIqysRh3yd4aKlgGVD1\nB5FKoZupIpJVUXrudxe9zgO9wF3xhFM5+Xye5zt7aWudSlvr1FqHIyJSVWMWdzN7TfEye2bWAixx\n99KfOKqS7r4Bdu89wEnH5GodiohI1Y3aczeztwEPhOPVC1YCvzSzk2OPrEybtwVPfGk+GRHJorFu\nqH4WeHvxk6Xu/hTwbuDv4g6sXC+ouItIho1V3BvCYn4Id3+aYK6ZRNu8PSjuy1XcRSSDxiruM8b4\n2dxKB1Jpm7f1MWfmVGa2TKl1KCIiVTdWcX/KzK44fKOZ/Q3wUHwhla9wM3XFAl21i0g2jTVa5hrg\np2Z2EcG0v5OBNxMMhTynCrGVrNCSWaGWjIhk1KjF3d23AaeFo2aOJ3hw6Yfu/utqBVeqwkgZXbmL\nSFZFmVvmHuCeKsRSMRoGKSJZN+G5ZerB5u19zJoxhVkz9GSqiGRT6or77r0H6O4b4Ai1ZEQkw2It\n7mZ2gpltNLOPxnmeYi/329WSEZEMi624m9l04Caq3K9v79oDwLL5Ku4ikl1xXrkPAO8COmM8x6sU\nivvS+VpWT0SyK8qUvyVx90FgsGj91Kpo37GHKU2TyM2eVtXziogkSWzFfaLa2lpobCx1LW7I5VoZ\nHBpm265+Vi6ZxYL56V99KZfLXutJOWdD1nKOI9/EFPfu7tKniM/lWunq6qO9aw+DQ3kWzJ5GV1df\nBaNLnkLOWaKcsyFrOZeb72hfDKkaCvlyvz031pxnIiLpF9uVe7igx5eBI4CDZrYGuMDdd8V1zo6u\nvQAsna/iLiLZFucN1ceAVXEdfyRbdxSu3DVSRkSyLVVtmY6uPcyaPoVWzeEuIhmXmuLev3+Ql3oH\ndNUuIkKKivuLLwX99iW6mSoikqbiHgylXDi3pcaRiIjUXmqK+7ZdQXFfNEfFXUQkNcW90JZZOFc9\ndxGR1BT3bbv6mTa1kZktTbUORUSk5lJR3IeGhtnRvY9Fc1toaGiodTgiIjWXiuK+fVc/Q8N5Fqrf\nLiICpKS4t4dPpi7SSBkRESA1xT2YUU1X7iIigVQU98KEYSruIiKBVBT3beEwSK2+JCISSEdx39XP\nrBlTmNJU+kpOIiJpUvfFfXBomJ3d/bpqFxEpUvfFfVffAMN5yM1ScRcRKYh1DVUzuxE4DcgDV7n7\nI5U+R1fPPgBys5srfWgRkboV25W7mZ0JHO3upwOXAV+N4zw7Xy7uunIXESmIsy3zNuCnAO7+LNBm\nZjMrfZKunv0AzJulK3cRkYI42zILgceK3neF23pH+nBbWwuNjRMf7dK3fxCAY4/MMS9jV++5XGut\nQ6g65ZwNWcs5jnxj7bkfZswZvbq7+0s6aMeOPhonT2LowEG6ugZLOkY9yuVa6erqq3UYVaWcsyFr\nOZeb72hfDHEW906CK/WCxcCLlT7JqcfO59TjFzFJs0GKiLwszp77XcAaADM7Ceh094p/Hb/91OWs\nXX1spQ8rIlLXYivu7v4g8JiZPUgwUuYjcZ1LREQOFWvP3d0/EefxRURkZHX/hKqIiLyairuISAqp\nuIuIpJCKu4hICqm4i4ikkIq7iEgKNeTz+VrHICIiFaYrdxGRFFJxFxFJIRV3EZEUUnEXEUkhFXcR\nkRRScRcRSSEVdxGRFKrmMnuxMLMbgdOAPHCVuz9S45DKYmYnAP8O3OjuXzOzZcD3gckEK1l9wN0H\nzGwt8L+BYeAb7n6zmTUB3wFWAEPApe7+fC3ymAgzux44g+Dv4xeBR0hxzmbWQhDzAqAZ+FvgCVKc\nc4GZTQOeIsj5HlKcs5mtAn4EPB1uehK4nirlXNdX7mZ2JnC0u58OXEawKEjdMrPpwE0Ef+kLPg/8\nP3c/A9gAfDD83LXA2cAq4GNmNgf4C6DH3d8CfIGgUCaamZ0FnBD+Ga4G/oGU5wycBzzq7mcC7wVu\nIP05F3wa2BW+zkLO97v7qvDXlVQx57ou7sDbgJ8CuPuzQJuZzaxtSGUZAN5FsP5swSrg9vD1zwj+\nArwReMTdd7v7PuAB4M0Evx8/CT97d7gt6X4NXBi+7gGmk/Kc3f02d78+fLsMaCflOQOY2bHAccAd\n4aZVpDznEayiSjnXe3FfCHQVve/i0EW564q7D4Z/uMWmu/tA+HoHsIhX5/2q7e4+DOTNbEq8UZfH\n3YfcfW/49jLgP0h5zgXhEpS3EvzveBZy/jLw10Xvs5DzcWZ2u5n9l5n9CVXMud6L++Eaah1AzEbL\nb6LbE8fMzico7h897Eepzdnd3wS8G7iFQ+NOXc5mdhHwW3ffNMpHUpczsB74HHA+cDFwM4fe54w1\n53ov7p0ceqW+mOAmRZrsCW9CASwhyPnwvF+1PbwZ0+DuB6oYa0nM7B3Ap4B3uvtuUp6zmZ0c3ijH\n3f+b4B98X5pzBs4Bzjez3wEfAj5Dyv+c3b0jbMHl3X0jsI2gdVyVnOu9uN8FrAEws5OATnfvq21I\nFXc38J7w9XuAXwIPAaeY2Wwzm0HQi/sNwe9HoX99HnBvlWOdMDObBfw9cK67F260pTpn4K3AxwHM\nbAEwg5Tn7O7vc/dT3P004FsEo2VSnbOZrTWzq8PXCwlGR32bKuVc91P+mtmXCP6xDAMfcfcnahxS\nyczsZIK+5BHAQaADWEswHKoZ2EwwHOqgma0BriEYAnqTu//AzCYT/MM5muDm7CXuvrXaeUyEmV0O\nXAesK9p8MUEeac15GsH/oi8DphH8r/ujwPdIac7FzOw64AXgTlKcs5m1EtxTmQ1MIfhzfpwq5Vz3\nxV1ERF6t3tsyIiIyAhV3EZEUUnEXEUkhFXcRkRRScRcRSSEVd5EiZvaucNImkbqm4i5yqI8BKu5S\n9zTOXVItnFP7EwQzLx5P8HDYanfvH+GzfwncSDC3+qUEk5jdBqx09wvN7L3AlQRzfHQBH3L3l8Jp\niz8bbj8IfNjdN4UP2P0xwQMoHcDFRZNGicRKV+6SBacDnwznjB8C3jHSh9z9nwjm/1jr7s+Em9eH\nhX0Zwfw3Z4fza98HfDJceOOfgQvC+dlvAv6vmbUBHwFOD+fu/jHB4+ciVVH3KzGJRPCsu+8IX29m\nYm2XB8P/nk4wDeudZgYwFdgEnBBu/3G4fTKQd/duM7sTuN/MfgLc5u7tZWciEpGKu2TB4GHvJzJ1\namEWvgHgYXc/t/iHZnYisMXdVx2+o7uvCReoOIegyL8nnAVSJHYq7iKHGgaaRtj+CPBNM1vo7tvM\n7EKCwn8XMM/MTnD3p8zsrcCxBDMenu/uNwLPhbMCngiouEtVqLiLHOpO4Gfh4hIvc/dOM7sK+LmZ\n9QP9BDdI95nZ+4GbzWx/+PHLCW7gvt7MHgb6gG6CWQFFqkKjZUREUkhX7pIp4Vzqvxjlx19y919W\nMx6RuOjKXUQkhTTOXUQkhVTcRURSSMVdRCSFV
NxFRFJIxV1EJIX+P0d/ELKKqNDkAAAAAElFTkSu\nQmCC\n",
168 | "text/plain": [
169 | ""
170 | ]
171 | },
172 | "metadata": {},
173 | "output_type": "display_data"
174 | }
175 | ],
176 | "source": [
177 | "# Pick an object of class 1 for visualisation\n",
178 | "plt.plot(cum_preds[:, y_test == 1][:, 0])\n",
179 | "\n",
180 | "plt.xlabel('n_trees')\n",
181 | "plt.ylabel('Cumulative decision score');"
182 | ]
183 | },
184 | {
185 | "cell_type": "markdown",
186 | "metadata": {
187 | "collapsed": true
188 | },
189 | "source": [
190 | "See, the decision function improves almost linearly untill about 800 iteration and then stops. And the slope of this line is connected with the learning rate, that we have set in GBM! \n",
191 | "\n",
192 | "If you remember the main formula of boosting, you can write something like:\n",
193 | " $$ F(x) = const + \\sum\\limits_{i=1}^{n}\\gamma_i h_i(x) $$\n",
194 | "\n",
195 | "In our case, $\\gamma_i$ are constant and equal to learning rate $\\eta = 0.01$. And look, it takes about $800$ iterations to get the score $8$, which means at every iteration score goes up for about $0.01$. It means that first 800 terms are approximately equal to $0.01$, and the following are almost $0$. \n",
196 | "\n",
197 | "We see, that if we drop the last tree, we lower $F(x)$ by $0$ and if we drop the first tree we lower $F(x)$ by $0.01$, which results in a very very little performance drop. \n",
198 | "\n",
199 | "So, even in the case of simple dataset which can be solved with single decision stump, in GBM we need to sum a lot of trees (roughly $\\frac{1}{\\eta}$) to approximate this golden single decision stump."
200 | ]
201 | },
202 | {
203 | "cell_type": "markdown",
204 | "metadata": {},
205 | "source": [
206 | "**To prove the point**, let's try a larger learning rate of $8$."
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 7,
212 | "metadata": {},
213 | "outputs": [
214 | {
215 | "name": "stdout",
216 | "output_type": "stream",
217 | "text": [
218 | "Test logloss: 3.03310165292726e-06\n"
219 | ]
220 | }
221 | ],
222 | "source": [
223 | "clf = GradientBoostingClassifier(n_estimators=5000, learning_rate=8, max_depth=3, random_state=0)\n",
224 | "clf.fit(X_train, y_train)\n",
225 | "\n",
226 | "y_pred = clf.predict_proba(X_test)[:, 1]\n",
227 | "print(\"Test logloss: {}\".format(log_loss(y_test, y_pred)))"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 8,
233 | "metadata": {},
234 | "outputs": [
235 | {
236 | "name": "stdout",
237 | "output_type": "stream",
238 | "text": [
239 | "Logloss using all trees: 3.03310165292726e-06\n",
240 | "Logloss using all trees but last: 2.846209929270204e-06\n",
241 | "Logloss using all trees but first: 2.3463091271266125\n"
242 | ]
243 | }
244 | ],
245 | "source": [
246 | "cum_preds = np.array([x for x in clf.staged_decision_function(X_test)])[:, :, 0] \n",
247 | "\n",
248 | "print (\"Logloss using all trees: {}\".format(compute_loss(y_test, cum_preds[-1, :])))\n",
249 | "print (\"Logloss using all trees but last: {}\".format(compute_loss(y_test, cum_preds[-2, :])))\n",
250 | "print (\"Logloss using all trees but first: {}\".format(compute_loss(y_test, cum_preds[-1, :] - cum_preds[0, :])))"
251 | ]
252 | },
253 | {
254 | "cell_type": "markdown",
255 | "metadata": {},
256 | "source": [
257 | "That is it! Now we see, that it is crucial to have the first tree in the ensemble!"
258 | ]
259 | },
260 | {
261 | "cell_type": "markdown",
262 | "metadata": {},
263 | "source": [
264 | "Even though the dataset is synthetic, the similar intuition will work with the real data, except GBM can diverge with high learning rates for a more complex dataset. If you want to play with a little bit more realistic dataset, you can generate it in this notebook with the following code:\n",
265 | "\n",
266 | "`X_all, y_all = make_hastie_10_2(random_state=0)` \n",
267 | "\n",
268 | "and run the code starting from \"Learn GBM\"."
269 | ]
270 | }
271 | ],
272 | "metadata": {
273 | "kernelspec": {
274 | "display_name": "Python 3",
275 | "language": "python",
276 | "name": "python3"
277 | },
278 | "language_info": {
279 | "codemirror_mode": {
280 | "name": "ipython",
281 | "version": 3
282 | },
283 | "file_extension": ".py",
284 | "mimetype": "text/x-python",
285 | "name": "python",
286 | "nbconvert_exporter": "python",
287 | "pygments_lexer": "ipython3",
288 | "version": "3.6.0"
289 | }
290 | },
291 | "nbformat": 4,
292 | "nbformat_minor": 2
293 | }
294 |
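For reference, below is a self-contained sketch of the drop-a-tree experiment from this notebook, using the `make_hastie_10_2` dataset that the last cell suggests. The `compute_loss` helper is a reconstruction (its definition is not shown in this excerpt): it is assumed to map the raw additive scores through a sigmoid before computing logloss.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

def compute_loss(y_true, scores_pred):
    # Raw GBM scores -> probabilities via a sigmoid, then logloss
    # (assumed to match the notebook's helper, which is not shown here).
    return log_loss(y_true, 1.0 / (1.0 + np.exp(-scores_pred)))

X, y = make_hastie_10_2(random_state=0)
y = (y > 0).astype(int)  # map the {-1, +1} labels to {0, 1}
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.01,
                                 max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Scores accumulated after each boosting stage: shape (n_stages, n_test_objects).
# np.squeeze handles both the (n, 1) and (n,) shapes that different
# scikit-learn versions return from staged_decision_function.
cum_preds = np.squeeze(np.array(list(clf.staged_decision_function(X_test))))

print("Logloss using all trees:          ", compute_loss(y_test, cum_preds[-1]))
print("Logloss using all trees but last: ", compute_loss(y_test, cum_preds[-2]))
print("Logloss using all trees but first:", compute_loss(y_test, cum_preds[-1] - cum_preds[0]))
```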
--------------------------------------------------------------------------------
/Reading_materials/Hyperparameters_tuning_video2_RF_n_estimators.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This notebook shows, how to compute RandomForest's accuracy scores for each value of `n_estimators` without retraining the model. No rocket science involved, but still useful."
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Load some data"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 1,
20 | "metadata": {
21 | "collapsed": true
22 | },
23 | "outputs": [],
24 | "source": [
25 | "import sklearn.datasets\n",
26 | "from sklearn.model_selection import train_test_split\n",
27 | "\n",
28 | "X, y = sklearn.datasets.load_digits(10,True)\n",
29 | "X_train, X_val, y_train, y_val = train_test_split(X, y)"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 3,
35 | "metadata": {
36 | "collapsed": true
37 | },
38 | "outputs": [],
39 | "source": [
40 | "import numpy as np\n",
41 | "import matplotlib.pyplot as plt\n",
42 | "%matplotlib inline\n",
43 | "from sklearn.ensemble import RandomForestClassifier\n",
44 | "from sklearn.metrics import accuracy_score"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {},
50 | "source": [
51 | "**Step 1:** first fit a Random Forest to the data. Set `n_estimators` to a high value."
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 4,
57 | "metadata": {},
58 | "outputs": [
59 | {
60 | "data": {
61 | "text/plain": [
62 | "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
63 | " max_depth=4, max_features='auto', max_leaf_nodes=None,\n",
64 | " min_impurity_decrease=0.0, min_impurity_split=None,\n",
65 | " min_samples_leaf=1, min_samples_split=2,\n",
66 | " min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,\n",
67 | " oob_score=False, random_state=None, verbose=0,\n",
68 | " warm_start=False)"
69 | ]
70 | },
71 | "execution_count": 4,
72 | "metadata": {},
73 | "output_type": "execute_result"
74 | }
75 | ],
76 | "source": [
77 | "rf = RandomForestClassifier(n_estimators=500, max_depth=4, n_jobs=-1)\n",
78 | "rf.fit(X_train, y_train)"
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "**Step 2:** Get predictions for each tree in Random Forest separately."
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 5,
91 | "metadata": {
92 | "collapsed": true
93 | },
94 | "outputs": [],
95 | "source": [
96 | "predictions = []\n",
97 | "for tree in rf.estimators_:\n",
98 | " predictions.append(tree.predict_proba(X_val)[None, :])"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "**Step 3:** Concatenate the predictions to a tensor of size `(number of trees, number of objects, number of classes)`."
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 6,
111 | "metadata": {
112 | "collapsed": true
113 | },
114 | "outputs": [],
115 | "source": [
116 | "predictions = np.vstack(predictions)"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "**Step 4:** Сompute cumulative average of the predictions. That will be a tensor, that will contain predictions of the random forests for each `n_estimators`."
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 7,
129 | "metadata": {
130 | "collapsed": true
131 | },
132 | "outputs": [],
133 | "source": [
134 | "cum_mean = np.cumsum(predictions, axis=0)/np.arange(1, predictions.shape[0] + 1)[:, None, None]"
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "**Step 5:** Get accuracy scores for each `n_estimators` value"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": 8,
147 | "metadata": {
148 | "collapsed": true
149 | },
150 | "outputs": [],
151 | "source": [
152 | "scores = []\n",
153 | "for pred in cum_mean:\n",
154 | " scores.append(accuracy_score(y_val, np.argmax(pred, axis=1)))"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "metadata": {},
160 | "source": [
161 | "**That is it!** Plot the resulting scores to obtain similar plot to one that appeared on the slides."
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 9,
167 | "metadata": {},
168 | "outputs": [
169 | {
170 | "data": {
171 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmsAAAF4CAYAAAAL5r5MAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XmclXXd//H3Z/aFgRmYYZFhF0QURZ1QAVHBhdI0rczS\nbrXF6hYrzUqr28zufq1mdqelmWmrmZVSkYorCqIMIgooO8KwDjADwzLr+fz+OIfhHGbhmmHOAvN6\nPh7zmHNt53xnLjjzPt/V3F0AAABITWnJLgAAAADaRlgDAABIYYQ1AACAFEZYAwAASGGENQAAgBRG\nWAMAAEhhcQ1rZjbNzJaZ2Uozu7WV40PM7Dkze8vMXjSz0qhjTWb2ZuRrRjzLCQAAkKosXvOsmVm6\npOWSzpdUIWm+pI+7+9Koc/4q6V/u/oiZTZF0nbt/MnJst7v3iEvhAAAAjhDxrFkbL2mlu69293pJ\nj0q69KBzxkh6PvL4hVaOAwAAdGvxDGsDJa2P2q6I7Iu2SNLlkceXSSowsz6R7RwzKzezeWb2oTiW\nEwAAIGVlJPn1b5H0CzO7VtJsSRskNUWODXH3DWY2XNLzZva2u6+KvtjMrpd0vSTl5+efNnr06MSV\nHAAAoJMWLFiwzd1Lgpwbz7C2QdKgqO3SyL5m7r5RkZo1M+sh6cPuXh05tiHyfbWZvSjpFEmrDrr+\nAUkPSFJZWZmXl5fH5QcBAADoSmb2XtBz49kMOl/SSDMbZmZZkq6UFDOq08yKzWx/GW6T9FBkf5GZ\nZe8/R9JESUsFAADQzcQtrLl7o6Tpkp6W9I6kx9x9iZndaWaXRE47R9IyM1suqZ+k70X2Hy+p3MwW\nKTzw4AfRo0gBAAC6i7hN3ZFoNIMCAIAjhZktcPeyIOeyggEAAEAKI6wBAACkMMIaAABACiOsAQAA\npDDCGgAAQAojrAEAAKQwwhoAAEAKI6wBAACkMMIa0EVeX7NDc1ZuS3YxAABHGcIa0AVmvr1JV9z/\nqq568DU9+vq6ZBcHAHAUIawBh6mxKaQfPfVu8/Zds5artqEpiSUCABxNCGvAYfrHwg1au31v83Zl\nTZ3+MO+9JJYIAHA0yUh2AYAj1ZyV27Rm2x7dP3tVi2O/emm1cjLT272+T36WzhvTT5nph/7MtGVX\nrRa8V6WzRharICez3XN31TbouXe2qL4xpLNGluiYwtxDPn9Q5Wt3KDM9TScPKozZ/+b6ai3esFOl\nRbnqW5CjnfsadOaIPoGfd932vZqzapuaQt6pcmWlp+ns40qUn52h597ZopraRklSeppp4ohiDe6T\n16nnBYBUQFgDOuGx8vX62uNvxewrzMtUTka6Nu+q1bbddfrWE4sP+TzXThiqOy45od1zdtc16vL7\n5mpD9T5NHlWi331qfJvnurs++0i5XluzQ5LUtyBbs24+W71y2w94QTy+oEK3/HWRzKT7rz5NF5zQ\nX5I0f+0OffRXr7Y4/1sXHa/PnDX8kM+7taZWl9z7iqr3NhxW+QYW5mpArxyVv1cVs79XbqZm3TRZ\nfXvmHNbzA0CyENaAANxdL6/Ypo3V+yRJP3t2RYtzPjd5hHrlZuob/3g78PP+Yd57Ou/4ftpYvU8h\nP1Cr1DM3U+ce11c79zXovx56TRsirzt7eaV+88oalRRk65zjStQzqpZtd12jbn9ycXNQk6StNXW6\nY8YSTRjRR2eO6KOSgmy98O7WTgWj22cskSS5S996YrGmHt9P72zape/8c0mr5//vv99RfnaG+vXM\n1pA++dqys1ZnjugjM9PabXv0+podCrnrpeWVhx3UJGlD9b7m31O0nfsadPuTS3TOcSWH/RrH9u2h\nsqG99ca6Ki3fXKPBvfM04djiw35eHHmWba7R1ppaTRxRrLQ0S3ZxcJQz9841O6SasrIyLy8vT3Yx\ncJR66JU1uvNfS1vs752fpWkn9teIkh66dsJQpZn0p9fXacnGXe0+32urt2tV5Z52zzlrZLGWb6nR\nll11rR4/fVhv/eVzZzZvf/I3r+nlFW1PHdI7P0vjBhXq+Xe3tvu6QX3w5GP077c2qiMtlzdOOVZX\nlA3S++95WbvrGlscv/ikAerZwVrArbtq9ew7sT/TKYMLNbAwV/96a1OHniuIy08dqL+/saF5+/uX\nj9XHxw/u8tdB6npzfbU++qu5amjywDXIwMHMbIG7lwU6l7CGI9nOfQ16eUWlahtCys5I0+SRJeqV\nF+yPfX1jSLOXV2pXbYPeN7S3BvUO92tqCrleW71dG3fWSpKGl+TrUw/Pb7X259sfHKPrJg7rcLnn\nrtqmT/z6tQ5fd7BZN03Wpp21GtQ7T+f+5MXm/VkZacrJSNOu2paBKNkmjyrR7OWVLfafPKhQT/z3\nBJl1rJaivjGk8376ktbtCA/yyMpI0/NfOVsDC3N12X1z9eb66i4pd1tKCrL18tfOVU5mulZV7lbV\nnnqVDe0d19dE4i3ZuFPvbKqRJP346XebP0T1LcjW6988L9BzVNbU6d3Nu3TG8D4xfVWr9tTr5ZXb\nVN8YkiRlZ6TprJHFKszL6uKf4ujS2BTSnFXbNbw4v/n9+0jSkbBGMyiOWKGQ66oH52nxhgO1WMcP\n6Kl/Tp+ojACd9m9/crEenb9eklSQnaH/fPkslRbl6cGXV+v7/3m31Wv69czWOaP6SpJGDyjQNWcO\n7VTZJ4wo1vihvfX62gNNlpecfIxyM9O1eOPOQ9bM7Xf+3bNb3f+nz5yuvgU5euTVtdpQtU9PLdkc\nc3x0/wKdXFrY6rXtGViUqwdmr261VmxgYa5uOn+UduypU352ht5av1MvLt/aomYwOqhdOu4Y5WSk\nq1depq6bOLTDQU0Kh7OHr3uf/jBvneoam/TBk49RaVH4jfuXV5+q385Zq51d0Mw6c/Gm5oEL0Spr\n6vTH19bpfUOLdMX9r6q2IURty1Fm7qptuvrB11qtRd5aU6ddtQ0xXRJas3Nvgz507xxtqN6nD4zt\nr/uuOk2SVNfYpI/8am6LmvbR/Qv0rxsnBXov665++NS7+vXLa9QzJ0Mzv3RW8//7oxE1a0e5ffVN\nKn9vh04qLeySTuYdsWJLjfY1NGnswF6B/ghv3lmr1dt2a/zQ3oHeoJZtrtGFP2sZVs4f0093XXFy\nu2+eK7bU6IKfzVb0P/+ivEzNmD5Jn3mkXMu21LR63Y8+cpKuKBt0yLIF8fqaHbri/nDH/EnHFusP\nnzldkrRy625N+9lsNUb+MowoydcTN0zUxB88H6im7ItTjtXNFxzXvO3uuuQXc/T2hp2SJDPpn9Mn\n6cSBvTpV7nueXaG7n13eYv83PjBa108eEbNv1tIt+uzvWv9/Gf0zHwl+8fwK/eSZAz93r9xM7dwX\nDoHFPbLVv1d2zAeHX151qqad2L/53/6mnfu0fsc+lRblav7aHeqZk6mJxxYrK4M/xsng7pq/tkob\nq/fpxIG9dGzfHs3H3qqo1uqo8
PTA7NVauqntD1DfueQEnTiwl/oWZOuNdVXN7yu98jI1sm8PvbGu\nWi8tq9Tf3qhovuZzZw/Xl6eO0uNvVOh/2hiMNGV0X/30ipMPu4Zt4boq9c7P0qadtdocaTE4+Gdu\ny7LNNXon8rP375Wj04f1lplp/Y692lpTp1MHF3bqQ9bhcncNu21m8/b4Yb31iUh3BDPppNJCDSvO\nT3i5OoJmUDS77rev64VllRrdv0D//uJZSk9QR9g31lXpw7+cK/fwH633jx3Q7vnbd9dpyl0vaee+\nBv33OSP0tWmjD/kaD89Zozv+Ge5HlpWepvqmUPOxsQN76ckbJrbZ8Xf6n944ZH+m9DSLmUqipCBb\nr946pUs/6T6zZLMWb9ipT581PCZMv7hsq2a+vUm5mem6buIwDS3OV/naHXpq8WaN6NtDt/297UEM\nf/rM6S06vVdU7dVvXlmj3bWNOn9Mv+aRnJ3R2BTS/bNXa1Xlbg0vzld2Rrrqm0L69KRhrU5XMmPR\nRq2p3KOSgmwteK9KLlfvvCx97uwRKinI7nQ5Eq2xKaRfv7xGK7bWaHhxvq6bOExT73pJm3fVtnnN\nV84fpRunjlRF1V5d8os52rGnPub4R08r1Y8/enK8i45W/P7VtfqfJ8ODY9JMmhH5APPU4k36/B/e\naPWarIw0XTx2gGSK6bfYWROP7aPVlXu0KRKgJozoE+6GETVIaGTfHvr3F8/qdKj/9ezV+t7Md1rs\nT08zPXnDxHY/tL2+Zoc+/ut5Me+D/3PxGE0eWazL75urmrpG3XLBKE2fMrJTZTscK7fW6Lyftt6y\nIEk5mWl68oZJOq5/QQJL1TGENUiStu2uU9n/Ptu8ffvFY/ThU0uVk5WmBWurdFz/Ai3fslvH9u3R\nJX80V2yp0bubwzVS3/j726qJaiqb+cWzNOaYnlqzbY8WR2p49uvTI0tvV+yMaXpc8/0PtPi0tmNP\nvV5fs10NTeF/s79/9b3mZsSvXnicHnx5taqimrueuGGiduypU0OTa8KIPs3zky3fEq6R2/9Pv0d2\nRqvNeqcNKdLQPvnNn4bv/tjJuuyU0k79brpSfWNIE37wnLbtrm9xrDAvU/Num3rIOd7QNf4w7712\np2jJy0rX/7tsrP711iY9+86WFsfNpP/90IkxtcD52emaMKI45h7urW/U4g27NLwkX8s212jcoELl\nZ8f2YmlsCun1tTs0ql+BinscOSG4q7m73lhXrYGFuVpduVvbIwHZTDq5tFDVexuUlZGmj/96Xkx4\n/sykYbr1/aN13k9fipnkOtqnJg7T7R8cIyn8gera387vsnIX98jS7K+dq6aQ6+wfvxhTtrNHleie\nK8e1WsMWCrkWrq/SoN556lsQOz1N1Z56nfLdWW2+5unDeuvqM4a0efzXL6/WWxWx79e9cjM1un9B\nTKB8/ZtTW7z24ahrbNKi9Ts1ekCBstLT9Oqq7S3eo19bs11/mNf+0n5Z6Wl65qbJGpqiNWyENUiS\n/v3WJt3wp9hPiCP79tDQ4nzNWnrgD0e/ntl66kuTVZTf+ar2IB3mr5s4VH98bV1zJ9r2vHDLOTFV\n2HvqGjXtntlav6Pl1AyS9MxNk7WvvkmX3jun1ePH9u2hp750ljLS0/TlRxfqiTc3SpKmju6r718+\nVl/44xtacND8XNPPPVbTpxyrX720SgN65eiKskFJqe5vzcJ1VfrL/PVyl04c2FNLNu5SyF1XlA2i\nc3sCubt+88qa5ibm/r1y1Dsvq80+j0FNGd1Xv7mmTGbh2t2P3f9qzPxxJw7sqX9OnxTz7/Frjy/S\nY+UVOqZXjv7z5ckJ7/aQKv7fzHf0wOzVHb7uxIE9de2EYbrlr4skhfuxTjm+b/Pxwb3zdMO5xzaH\n6FDINeWuF1sNdh8Y21/vbqrR6m0HmlKnju6rAYU5GtWvQLc/2XK6m+h+jm9VVOuy++bG1GgNL8nX\nzC+e1eKD2B0zlujhuWvVr2e2/nnjpJjQ9KOn3tV9L7actLujMtNNRXlZ2lrT+sj0z509XLe9//jD\nfh0p/H/q04+U6/l3t2p0/wIN6JWjF5a1HJB0sEvHHSMpPEAsutUkzaR/3jhJJxzTuW4f8URYgyTp\nf55YrN8HXPbo2glDddFJAzS8OF99emTL3fVWxU5t2lmrsaW9VFlTp5KCbA1sYzb8b/7jbf3xta5b\nwPy6iUN1/ph+OrZvDy1av1Ozl1e2+bOUFuXq5a+dKzPT715d2+oboSTdeekJumjsAJ3zkxebO4o/\necPE5tn4D5764m9fmKDThhR12c+E7uOpxZv1+T8saPN4dka42b69t99vXXS8Soty9c6mGt3zXMt5\n/Z65abKGF+drUcVODe2Tp9OiatFH9y/Q7z41Xqsq92hXbYPKhhSpT4rVtlXtqVf5e1VqCrX94S07\nM11jB/bS4g07VdvQpOyMdJ0+vLfyssK1ik0h16KKag3unaflm2u0eVetbn5sUafKYyYd0yu3ea6+\nm84bpS+d137z3n/e3qQv/DH2A/HZo0r0yKfGa+G6Kl0e6QoydmAvzZg+sTlcV1Tt1ZSfvNTcdaNv\nQbZmR0YU77e3vlGTf/Situ0+EJA+NXGYxg8Lvyelp6WpMC8zZkLqD558jC4aG+7i0Bhyff3xt7Sn\nPnad4u9ccoK+PaP198i2/NeZQ3RyaaG+8tfWf7cjSvL13FfOad7etrtOb7xXFTN3pCTlZ2fo+AE9\ntWh9tRqaWr/v723f26EPO2kmzbr5bI0oOdD/7r//uEAz3z4wqOqM4b117YShgZ9zv9KivE737Q2C\nsAa5u6be9VLMJ7sgeudn6YWvnKO/Lliv//13bD+HrIw0PXvT2a0u3XPFr16NGdnYnjST3n9iuN/H\nm+uqW53ItD0TRvRprgXskZWhT545pPk/1PItNbqgjRGSB+vXM1vzbpva/AZaWVOnXzy/Qtv31Gvy\nqJIuG0iA7umPr72nV1dt1/532D75WRpenK8N1fv0ubNHaP6aHfrP4s1qinoPXlO5p92O7NG+c8kJ\nen3tDv07wFxyffKz9NxXzk6ZqSB21zVq2s9mq6KqY//3pXAN2JM3TFJ6mumrf12kvy6oaPf8gYW5\nGjuwV4sR0VK46bFvQU6L33nPnAy9cuuUQ47wlMKrmazZtkfHFOaqomqv/vucY5trNZ9eslnz1+zQ\n9WcPb9FMOHt5pf6xcIMy0kzXTBjaaihYvGGnLr9vbkx/3M4Y1DtXHzwpPEr64+MH6cRvPx0T4i46\nqe0+xaVFufry1FHKyUzTw3PXNtfw9szJ1J9fP/AB/fVvTFXfnjnavrtO0+55WZVt1MJ11uj+BRpx\n0ICIdDNdeEL/FuWv3luvL//lTb0YoEauPVefMVj/+6Gxh/Uc7WHqjm6qrrFJ89dUyUyqqW3ocFCT\nwv3Cnl+2Rf9s5Q9AfWNI/1i4QVOP76tNO2t1XL8CDe6TJ3fX8q0HRk9edfrgdmvZLh03UH
d/bJwk\n6bl3tujTjwQP2YN65+qRT41vcz3NkX17qLQoN9AfgTOG94lpRiopyNZ3Lj0xcFmA9lx1+hBddXrb\n/YHeP3ZAi4E363fs1dS7Xgr0x/mhOWv0Xht9qw62fU+9nn93qy4/NVify5Vba5SZnqYhffK1unK3\nJGl4SfsjBzft3KclG3bJFQ5B4wYVaseeem3eVav+PXP05vrq5qkvXlq+tVNBTZIWb9ildzaFm/0P\nFdQk6YcfPkmTRhbr+t+V65mlsf0GvzZttDZW72sR1q6fPDxQUJPU7oe6C0/orwvbGMwzeVSJJo9q\nf1WNEwf20lt3XKBzfvxiuwNZDuUb7z8+5t/ap88arp9Hams7Ms3MdROHxcwruapyt16P9F175NW1\nGjeoSE8t3tzlQS0jzfSrq08L3PesMC9LD183Xl/880LNWLSxS8uSLIS1o4S765qHXte81S1rt04f\n1lvDS3poaJ88LdtSoy27ajVn5fY2n2veqh1a2cbUFXc/u7x52oas9DQ99vkzdUxhTvOEsT2yM3TH\nJSeoX88c7alvVEaaqbHJlZ2ZrmWbd2lAr1zdfMGo5uebenw//b/Lxmr28kqVFGSrtChXiyqqtXTj\nrpi+IBNG9NGAXrn67ORh7S58bma676pT9fDctapvDCkzPU3/WNj6qK0zhwdfaBxIhEG98/TAf52m\nxxdUxDQT5WSma9ygQj29ZHPz//GgQW2/eau3Bwprz7+7RZ96uFyZ6abp547U3c8uV5pJv7n2fTr3\nuL6tXrNya40u+cUc7Y2qrfnkGUP0zNLNba7Asd/pw3qrsJWJrDftrI3p3J6RZs3T2cxbvV1zV7V8\nDxtenK/j+hfotCFF2lhdq5NKe2nSyPDI6LuuOFn/9/xK9S3I1rbd9TqmMEcfPa1U+xqatK+hSWsj\nH26P699Tnz97RIvnTpaczHQ9eE2ZHpqzRnsinexrahtjfv6RfXvotCFFqtobO+jIZDpzRB9NOzE2\nMF4/ebjqGpvUMydTn+rEpN77nTm8T3NYu/eFln3jJh1brPzscNPumm17tHzL7uZj4ab51mt6szLS\nNWZAT63YUqO6ppAuP2VgpwYJfPdDJ0amLOncB4MxA1KnnxvNoEeBusYmzV6+rdX5rDLSTHNvm9Ki\nCv6y++Zo4bpqpaeZeuZkxIyizEy35hGXh3Lm8D76wNj+zUPgxw0q1BM3TDyMnyZs0frq5sECI/v2\n0FNfntypaUcWb9ipi//vlRb7M9JML9xyzhE56zW6L3fXpB++0GbXgaz0NM3+2rm69revN4/Mjlb+\nrfMOOVK0vWXLfnvt+1rsG1CYo58/tyKmj1BQQ/rk6dmbz271A1j13nqd9aMXVFPbqLysdH3h7BG6\na1b4g+LQPnktOvYf7vyBR5KmkOvCn83Wyq3h8BPd9zaRFq6r0mX3zW312JgBPfWvGyc1T5+0dtse\nnffTl9QYcvXrma2Xvnputx+1Tp+1biQUcn34V3O1cF3rS+qM6tdDz9x0dov9K7fW6DevrNFZI0t0\nyuBC/WzWCv2lfH2L80b3L9CEEcV6aM6aQOW5oqxUP/pI18wbNfPtTZq7aps+f/aITs9MXdvQpNH/\n81TMvgtP6KdLTh7Ybj8NIFW9ub5av43UsmSmp2nMgJ5aVblbtQ0hXXbqQF14Qn9trN6nX764SqcO\nKdQ3/r5Y+xrCNV5pJr3y9Sk6po2BQqGQa/g3ZrZ6LKiSguw2m8EmjOijvKzwH+iCnEx9/uwR7c6D\n9eb6aj1Wvl6XnzJQfXpkxyyptt/Jgwp14jE9ddbIYk07sfv8n16/Y6/un71KE0cUH3Iey3j602vr\n9MKyrYrOEr1ys3TjlGNb1Ia9tLxSTy/ZrOsmDNXIfqk7/1miENa6kXc27dL773m5zeMXnTRA937i\n1EDP1don6v0dLL/1xNuHnNNGkn5w+VhdmWKLWl/8fy9r8YZdyspI05yvTzmiJmIFDtdnf1ceM1XP\nxScN0Mfe13o/q807a/XVx9/q9GudP6afPnxqaasjYQ8eEdlR7q5zfxI7VUaaSU9/eTJ/+HFEYoBB\nN/JqK/02oo3qG/xN7LqJQ1uEtVGRN8Gbzw8vX1RRtU+nDi7SJScfo3MO+pR77YSh+vBpyZ809mA/\n/PBJ+u2ctTrv+H4ENXQ7t188Rmu37dGKSJPZv97adMjVO1pzznEtO8O/umq76qLmTfzyeSM1ZkBP\n3X7xGL28olJF+Vka3DtPW3bV6YtTjz2seQrNTPdceYoefGWNamoblJGWpstOGUhQQ7dAzdoRrrUR\nTtGCLPW0n7vrQ/fN1aL1B5pU/zl9ksaWtt4H5Na/vdW8EPr+ZXUApB5318funxd4eh0pPCns/lVI\n/ufiMfr0pJYd0aNXcLjwhH66/5OBKgkAiJq1bqOhKRSz5EdrOvKp08z08yvH6a5nlqtqb73OH9Ov\nzaAmSbd/cIzysjLUIztdXzgndUZPAYhlZvrpx07Wj59epu2tLFN2sLGlvTT93GP1k2eWSQpPx9Oa\nT4wfrF21DdpUXatbLjyuS8sM4ABq1o5gf5m/Tl//W+yC3n0LstUUcm3fU6+ivEzN/+Z5XbrwOAAA\nOHzUrHUD9Y0h/d/zK5u3TxtSpP69cnT16UMUctefX1+nK8oGEdQAADjCEdaOME0h19KNu/TisgMz\ngBflZeqRT41Xj+wDt3PiscXJKiIAAOhChLUjSFPIdfl9c7QoalZvSfrs5OExQQ0AABw9aCM7gizd\nuKtFUOudn6VrzhyanAIBAIC4i2tYM7NpZrbMzFaa2a2tHB9iZs+Z2Vtm9qKZlUYdu8bMVkS+roln\nOY8Uy6PW6+ydn6VJxxbr3k+cqnxq1QAAOGrF7a+8maVLulfS+ZIqJM03sxnuvjTqtJ9I+p27P2Jm\nUyR9X9Inzay3pG9LKpPkkhZErq2KV3mPBMu3HghrV58+WDdfwFB5AACOdvGsWRsvaaW7r3b3ekmP\nSrr0oHPGSHo+8viFqOMXSprl7jsiAW2WpGlxLOsRYcWW3c2PmbUbAIDuIZ5hbaCk6JXBKyL7oi2S\ndHnk8WWSCsysT8BrZWbXm1m5mZVXVlZ2WcFTTVPItW13XUwz6CjCGgAA3UKyOzvdIukXZnatpNmS\nNkhqCnqxuz8g6QEpPCluPAqYbHvrG/WRX76qpZt2Ne/LSDMNK85PYqkAAECixLNmbYOkQVHbpZF9\nzdx9o7tf7u6nSPpmZF91kGuPJnWNTdodWYMvFHLV1DY0H3vw5TUxQU2ShhXnKyuDgbwAAHQH8fyL\nP1/SSDMbZmZZkq6UNCP6BDMrNrP9ZbhN0kORx09LusDMisysSNIFkX1HnZ37GjT1rpc09o6n9fc3\nKnThz2br1O/O0j8XbdS++iY9NGdNi2veN6x3EkoKAACSIW5hzd0bJU1XOGS9I+kxd19iZnea2SWR\n086RtMzMlkvqJ+l7kWt3SPquwoFvvqQ7I/uOO
s8u3aKKqn1yl25+bJFWbN2thibXjX9eqFnvbFH1\n3oYW15wxvE8SSgoAAJIhrn3W3H2mpJkH7bs96vHjkh5v49qHdKCm7aj19oadbR774p8Xtrr/jOHU\nrAEA0F3Q8SnJ3tu+p0Pnj+rXQ30LcuJUGgAAkGqSPRq021seNXdaez571jAteK9KX5s2Os4lAgAA\nqYSwlkS76xq1oXrfIc8bN6hQ37xoTAJKBAAAUg3NoEm0ImqS27aM6tdD37ro+ASUBgAApCJq1pJo\ndWX7/dWGFefrmZvOTlBpAABAKqJmLYkqqtpuAj25tJfu/ti4BJYGAACkIsJaElVU7W1+fGbU3Gkn\nDuypJ6dP0rhBhckoFgAASCE0gyZRdM3a9ZOHq7QoV8u21Oh7HxqbxFIBAIBUQlhLoorqAzVrg3rn\n6ccfPTmJpQEAAKmIZtAkaWwKaVN1bfN2aVFuEksDAABSFWEtSdZX7VNjyCVJJQXZyslMT3KJAABA\nKqIZNMEqa+r0yd+8pnc3H5hjjVo1AADQFmrWEuye55bHBDUpPJ8aAABAa6hZS6CN1fv0l/nrm7ez\n0tM0pE+ePnvW8CSWCgAApDLCWgI9984WNTSF+6mdNqRIj3/+TJlZkksFAABSGc2gXWR15W5NvetF\nXfKLV7Rr9Dk9AAAe7UlEQVR1V22r56zbcWCqjimj+xLUAADAIRHWushj5RVaVblHb1Xs1Of+sKDV\nc9bvODAJLoMKAABAEIS1LjJ31bbmxwvXVeuE25/SRT9/WZt2Hgho0ZPgEtYAAEAQhLUu0r9nTsz2\nnvomLdm4S38tr2jeF7281KCivISVDQAAHLkIa11kV21Dq/tfWRmucaupbVD13vA5WRlpKu6RnbCy\nAQCAIxdhrYvs3NfY7vEN1VH91QpzlZbG4AIAAHBohLUusmtf6zVrFZERoNGDCwbSXw0AAAREWOsi\n0c2gr39zqvbPyrF5V63qG0PaGF2zRn81AAAQEJPidoFQyLW77kAzaJ/8bPUryNHmXbUKuXTB3S9p\n7fYDI0FLCuivBgAAgqFmrQvU1DXKwwsTqCA7Q+lpFjM1R3RQk6TiHlmJLB4AADiCEda6QHR/tZ65\nmZLan0etdz5hDQAABENY6wLR/dUKcsIty4N6t90vrU8+zaAAACAYwloX2BU1bcf+mrWTSgvbPL8P\nzaAAACAgwloX2BndDJoTDmvjh/Zu83yaQQEAQFCEtS4Q3QzaMzfcDNorL7PN84vyCGsAACAYwloX\n2NVKzZokTRndt9Xz01m9AAAABERY6wK7ag/0WeuVeyCs3fHBE9SHJk8AAHAYCGtdoLKmtvlxz6iw\nNrhPnl77xlTlZPJrBgAAnUOK6ALz11Y1Px7dvyDmWEZ6mvKzWCgCAAB0DmHtMG2tqdXKrbslSVnp\naTptSFGLc94XNTJ0eEl+wsoGAACOfIS1wzRv9Y7mx+MGFyonM73FOd++ZIx652epR3aGfvaxcYks\nHgAAOMLFtX3OzKZJukdSuqQH3f0HBx0fLOkRSYWRc25195lmNlTSO5KWRU6d5+6fj2dZO+uN9w40\ngZ4xrPW51Qb0ytW826aqMRRSHk2iAACgA+KWHMwsXdK9ks6XVCFpvpnNcPelUad9S9Jj7v5LMxsj\naaakoZFjq9w95auh1u04sEj7mGN6tnleVkaasqjIBAAAHRTP9DBe0kp3X+3u9ZIelXTpQee4pP0J\np5ekjXEsT1ysjwprpUVtrwcKAADQGfEMawMlrY/arojsi3aHpKvNrELhWrUbo44NM7OFZvaSmZ0V\nx3J2mrurompf8/YgwhoAAOhiyW6X+7ikh929VNIHJP3ezNIkbZI02N1PkXSzpD+ZWYs2RjO73szK\nzay8srIyoQWXpB176rWvoUmSVJCd0bzUFAAAQFeJZ1jbIGlQ1HZpZF+0T0t6TJLc/VVJOZKK3b3O\n3bdH9i+QtErSqINfwN0fcPcydy8rKSmJw4/QvuhatYFFuTJjGSkAANC14hnW5ksaaWbDzCxL0pWS\nZhx0zjpJUyXJzI5XOKxVmllJZICCzGy4pJGSVsexrJ2yvupAf7VBvWkCBQAAXS9u7Xbu3mhm0yU9\nrfC0HA+5+xIzu1NSubvPkPQVSb82s5sUHmxwrbu7mU2WdKeZNUgKSfq8u+9o46WSJrpmrbQoN4kl\nAQAAR6u4drJy95kKDxyI3nd71OOlkia2ct3fJP0tnmXrChuim0ELCWsAAKDrJXuAwRGtprah+XHv\n/KwklgQAABytCGuHYW99U/PjvKyWy0wBAAAcLsLaYdg/bYcklpECAABxQVg7DHvqGpsfU7MGAADi\ngbB2GKKbQXMJawAAIA4Ia4eBZlAAABBvhLXDwAADAAAQb4S1w7CPsAYAAOKMsNZJ7q499dEDDGgG\nBQAAXY+w1kl1jSG5hx9nZaQpPY1F3AEAQNcjrHUS/dUAAEAiENY6aW90E2gmYQ0AAMQHYa2TYmrW\nsumvBgAA4oOw1kk0gwIAgEQgrHVSdDNoLs2gAAAgTghrncQcawAAIBEIa520p56lpgAAQPwR1jpp\nX8yEuNSsAQCA+CCsdRIDDAAAQCIQ1jopOqzl0gwKAADihLDWSXtpBgUAAAlAWOskmkEBAEAiENY6\naR+jQQEAQAIQ1jqJmjUAAJAIhLVOih1gQFgDAADxQVjrJAYYAACARCCsdRLNoAAAIBEIa53EAAMA\nAJAIhLVO2ttAMygAAIg/wlon7a1jgAEAAIg/wlon7aUZFAAAJABhrRNCIde+hqiatUxq1gAAQHwQ\n1jqhtvFAUMvJTFN6miWxNAAA4GhGWOuEPXU0gQIAgMQgrHVC9LQdNIECAIB4Iqx1AtN2AACARIlr\nWDOzaWa2zMxWmtmtrRwfbGYvmNlCM3vLzD4Qdey2yHXLzOzCeJazo2JGgmbTDAoAAOInbknDzNIl\n3SvpfEkVkuab2Qx3Xxp12rckPebuvzSzMZJmShoaeXylpBMkHSPpWTMb5e5NSgHRc6zl0QwKAADi\nKJ41a+MlrXT31e5eL+lRSZcedI5L6hl53EvSxsjjSyU96u517r5G0srI86UEFnEHAACJEs+wNlDS\n+qjtisi+aHdIutrMKhSuVbuxA9cmTcwca4Q1AAAQR8keYPBxSQ+7e6mkD0j6vZkFLpOZXW9m5WZW\nXllZGbdCHix29QLCGgAAiJ94hrUNkgZFbZdG9kX7tKTHJMndX5WUI6k44LVy9wfcvczdy0pKSrqw\n6O1jqSkAAJAo8Qxr8yWNNLNhZpal8ICBGQeds07SVEkys+MVDmuVkfOuNLNsMxsmaaSk1+NY1g7Z\nW0efNQAAkBhxqxZy90Yzmy7paUnpkh5y9yVmdqekcnefIekrkn5tZjcpPNjgWnd3SUvM7DFJSyU1\nSrohVUaCStLeBppBAQBAYsS1Dc/dZyo8cCB63+1Rj5dKmtjGtd+T9L14lq+zYlYwoBkUAADEUbIH\nGByR
mLoDAAAkCmGtE/Y1hJofszYoAACIp0Bhzcz+bmYXdWRajaNZXVSftZxMfiUAACB+giaN+yR9\nQtIKM/uBmR0XxzKlvLrGAzVr2RnUrAEAgPgJFNbc/Vl3v0rSqZLWKrxW51wzu87MMuNZwFRU13ig\nZi07g5o1AAAQPx1ZLaCPpGslfUbSQkn3KBzeZsWlZCkspmaNZlAAABBHgeadMLN/SDpO0u8lfdDd\nN0UO/cXMyuNVuFRV10AzKAAASIygk4T93N1faO2Au5d1YXmOCDSDAgCARAmaNMaYWeH+DTMrMrP/\njlOZUh4DDAAAQKIEDWufdffq/RvuXiXps/EpUuqLDmtM3QEAAOIpaNJINzPbv2Fm6ZKy4lOk1Ffb\nEN0MSs0aAACIn6B91p5SeDDB/ZHtz0X2dUuMBgUAAIkSNKx9XeGA9oXI9ixJD8alRCnO3VUfFday\n0glrAAAgfgKFNXcPSfpl5KtbqzsoqKWlWTtnAwAAHJ6g86yNlPR9SWMk5ezf7+7D41SulBU7EpRa\nNQAAEF9B08ZvFa5Va5R0rqTfSfpDvAqVymLmWKO/GgAAiLOgaSPX3Z+TZO7+nrvfIemi+BUrdbF6\nAQAASKSgAwzqzCxN0gozmy5pg6Qe8StW6qIZFAAAJFLQtPElSXmSvijpNElXS7omXoVKZdHNoFmE\nNQAAEGeHrFmLTID7MXe/RdJuSdfFvVQpLHaONZpBAQBAfB2yasjdmyRNSkBZjgixfdaoWQMAAPEV\ntM/aQjObIemvkvbs3+nuf49LqVJYzGhQwhoAAIizoGEtR9J2SVOi9rmkbhjWohdxpxkUAADEV9AV\nDLp1P7VojAYFAACJFHQFg98qXJMWw90/1eUlSnF1DdHNoNSsAQCA+AraDPqvqMc5ki6TtLHri5P6\namNGg1KzBgAA4itoM+jforfN7M+SXolLiVJcbM0aYQ0AAMRXZ9PGSEl9u7IgR4rYPms0gwIAgPgK\n2metRrF91jZL+npcSpTiGGAAAAASKWgzaEG8C3KkiJlnjT5rAAAgzgKlDTO7zMx6RW0XmtmH4les\n1BW7ggHNoAAAIL6CVg1929137t9w92pJ345PkVIbzaAAACCRgqaN1s4LOu3HUYXlpgAAQCIFTRvl\nZvZTMxsR+fqppAXxLFiq2lt3IKzlZtEMCgAA4itoWLtRUr2kv0h6VFKtpBviVahUtrWmtvlxSY/s\nJJYEAAB0B0FHg+6RdGucy3JE2LKrrvlxv545SSwJAADoDoKOBp1lZoVR20Vm9nT8ipWa3D2mZq1v\nT2rWAABAfAVtBi2OjACVJLl7lQKsYGBm08xsmZmtNLMWNXNmdreZvRn5Wm5m1VHHmqKOzQhYzriq\n2tughqbw3MAFORnKy+qWYywAAEACBU0bITMb7O7rJMnMhip2RYMWzCxd0r2SzpdUIWm+mc1w96X7\nz3H3m6LOv1HSKVFPsc/dxwUsX0Js2RVVq1ZArRoAAIi/oGHtm5JeMbOXJJmksyRdf4hrxkta6e6r\nJcnMHpV0qaSlbZz/caX43G1ba+ivBgAAEitQM6i7PyWpTNIySX+W9BVJ+w5x2UBJ66O2KyL7WjCz\nIZKGSXo+aneOmZWb2by2Vksws+sj55RXVlYG+VEOS3TNGmENAAAkQtCF3D8j6UuSSiW9KekMSa9K\nmtJF5bhS0uPu3hS1b4i7bzCz4ZKeN7O33X1V9EXu/oCkBySprKys3WbZrrB1F4MLAABAYgUdYPAl\nSe+T9J67n6tw37Lq9i/RBkmDorZLI/tac6XCNXbN3H1D5PtqSS8qtj9bUsRM21FAzRoAAIi/oGGt\n1t1rJcnMst39XUnHHeKa+ZJGmtkwM8tSOJC1GNVpZqMlFSlcU7d/X5GZZUceF0uaqLb7uiXMjj31\nzY+LGWAAAAASIOgAg4rIPGtPSJplZlWS3mvvAndvNLPpkp6WlC7pIXdfYmZ3Sip39/3B7UpJj7p7\ndDPm8ZLuN7OQwoHyB9GjSJOlMXRgEfesdEtiSQAAQHcRdAWDyyIP7zCzFyT1kvRUgOtmSpp50L7b\nD9q+o5Xr5koaG6RsiRSKipNmhDUAABB/HZ7V1d1fikdBjgTRlX9phDUAAJAAQfusQbE1a+n85gAA\nQAIQOTogFFWzRjMoAABIBMJaB0TXrNEMCgAAEoGw1gGxfdaSWBAAANBtENY6IMQAAwAAkGCEtQ6I\nmmZNZDUAAJAIhLUOoGYNAAAkGmGtA5wBBgAAIMEIax0QYoABAABIMMJaBzDPGgAASDTCWgfEzrOW\nvHIAAIDug7DWAawNCgAAEo2w1gGsYAAAABKNsNYBsX3WklgQAADQbRDWOqApRDMoAABILMJaB8TM\ns8ZvDgAAJACRowNYwQAAACQaYa0DmBQXAAAkGmGtA6KbQZkUFwAAJAJhrQNoBgUAAIlGWOsAVjAA\nAACJRljrAGrWAABAohHWOiB26g7CGgAAiD/CWgcwGhQAACQaYa0DaAYFAACJRljrgFDM1B3JKwcA\nAOg+CGsd4NSsAQCABCOsdUDs1B2ENQAAEH+EtQ5ggAEAAEg0wloHhKKq1lhuCgAAJAJhrQOcFQwA\nAECCEdY6gKk7AABAohHWOoABBgAAINEIax3Q5NF91pJYEAAA0G0Q1jqAedYAAECixTWsmdk0M1tm\nZivN7NZWjt9tZm9GvpabWXXUsWvMbEXk65p4ljOoEAMMAABAgmXE64nNLF3SvZLOl1Qhab6ZzXD3\npfvPcfebos6/UdIpkce9JX1bUpkkl7Qgcm1VvMobBAMMAABAosWzZm28pJXuvtrd6yU9KunSds7/\nuKQ/Rx5fKGmWu++IBLRZkqbFsayH5O4xU3eQ1QAAQCLEM6wNlLQ+arsisq8FMxsiaZik5zt6baIc\nHNSYFBcAACRCqgwwuFLS4+7e1JGLzOx6Mys3s/LKyso4FS2MJlAAAJAM8QxrGyQNitoujexrzZU6\n0AQa+Fp3f8Ddy9y9rKSk5DCL2z4GFwAAgGSIZ1ibL2mkmQ0zsyyFA9mMg08ys9GSiiS9GrX7aUkX\nmFmRmRVJuiCyL2lCzrqgAAAg8eI2GtTdG81susIhK13SQ+6+xMzulFTu7vuD25WSHvWoSczcfYeZ\nfVfhwCdJd7r7jniVNQjWBQUAAMkQt7AmSe4+U9LMg/bdftD2HW1c+5Ckh+JWuA6KrllLp2YNAAAk\nSKoMMEh5DDAAAADJQFgLKMQcawAAIAkIawHFrAtKpzUAAJAghLWAYqfuIKwBAIDEIKwFFNtnLYkF\nAQAA3QphLSDmWQMAAMlAWAuIedYAAEAyENYCYuoOAACQDIS1gJpChDUAAJB4hLWAnHnWAABAEhDW\nAqIZFAAAJANhLaAQAwwAAEASENYComYNAAAkA2EtII+ZZy2JBQEAAN0KYS0glpsCAADJQFgLiGZQ\nAACQDIS1gEKhA4/JagAAIFEIawFRswYAAJKBsBZQzNqg/NYAA
ECCEDsComYNAAAkA2EtoFDM1B2E\nNQAAkBiEtYCip+5IJ6sBAIAEIawF5DSDAgCAJCCsBcSkuAAAIBkIawGFWG4KAAAkAWEtIEaDAgCA\nZCCsBRS9ggHzrAEAgEQhdgREzRoAAEgGwlpAzLMGAACSgbAWUMxyU2Q1AACQIIS1gGgGBQAAyUBY\nCyhEzRoAAEgCwlpA9FkDAADJQFgLKHa5qSQWBAAAdCuEtYBYbgoAACQDYS0gBhgAAIBkIKwFFF2z\nRlYDAACJEtewZmbTzGyZma00s1vbOOcKM1tqZkvM7E9R+5vM7M3I14x4ljMIp2YNAAAkQUa8ntjM\n0iXdK+l8SRWS5pvZDHdfGnXOSEm3SZro7lVm1jfqKfa5+7h4la+jQgwwAAAASRDPmrXxkla6+2p3\nr5f0qKRLDzrns5LudfcqSXL3rXEsz2GJWcidmjUAAJAg8QxrAyWtj9quiOyLNkrSKDObY2bzzGxa\n1LEcMyuP7P9Qay9gZtdHzimvrKzs2tIfhHnWAABAMsStGbQDrz9S0jmSSiXNNrOx7l4taYi7bzCz\n4ZKeN7O33X1V9MXu/oCkBySprKzMFUesDQoAAJIhnjVrGyQNitoujeyLViFphrs3uPsaScsVDm9y\n9w2R76slvSjplDiW9ZCia9bSSWsAACBB4hnW5ksaaWbDzCxL0pWSDh7V+YTCtWoys2KFm0VXm1mR\nmWVH7Z8oaamSKHbqDsIaAABIjLg1g7p7o5lNl/S0pHRJD7n7EjO7U1K5u8+IHLvAzJZKapL0VXff\nbmYTJN1vZiGFA+UPokeRJgOjQQEAQDLEtc+au8+UNPOgfbdHPXZJN0e+os+ZK2lsPMvWUaxgAAAA\nkoEVDAIKhahZAwAAiUdYC4g+awAAIBkIawHRDAoAAJKBsBYQ86wBAIBkIKwFFFOzRloDAAAJQlgL\nKLbPWvLKAQAAuhfCWkD0WQMAAMlAWAvImRQXAAAkAWEtoFDMAAPSGgAASAzCWkDRzaDMswYAABKF\nsBZQiKk7AABAEhDWAnIGGAAAgCQgrAUUYoABAABIAsJaQKwNCgAAkoGwFhDzrAEAgGQgrAXE2qAA\nACAZCGsBhULUrAEAgMQjrAXE2qAAACAZCGsBRfdZS6cdFAAAJAhhLSAGGAAAgGQgrAXEPGsAACAZ\nCGsBMc8aAABIBsJaQCw3BQAAkoGwFlAodOAxzaAAACBRCGsBMcAAAAAkA2EtIOZZAwAAyUBYC4g+\nawAAIBkIawHFNIPyWwMAAAlC7AgoFLOQOzVrAAAgMQhrAUXXrDHPGgAASBTCWkAeU7OWvHIAAIDu\nhbAWEFN3AACAZCCsBcTaoAAAIBkIawGxNigAAEgGwlpAzLMGAACSISPZBThS/Pzjp6ihyRUKufKy\n05NdHAAA0E3EtWbNzKaZ2TIzW2lmt7ZxzhVmttTMlpjZn6L2X2NmKyJf18SznEHkZWWoV26mivKz\nlJ1BWAMAAIkRt5o1M0uXdK+k8yVVSJpvZjPcfWnUOSMl3SZportXmVnfyP7ekr4tqUySS1oQubYq\nXuUFAABIRfGsWRsvaaW7r3b3ekmPSrr0oHM+K+ne/SHM3bdG9l8oaZa774gcmyVpWhzLCgAAkJLi\nGdYGSloftV0R2RdtlKRRZjbHzOaZ2bQOXAsAAHDUS/YAgwxJIyWdI6lU0mwzGxv0YjO7XtL1kjR4\n8OB4lA8AACCp4lmztkHSoKjt0si+aBWSZrh7g7uvkbRc4fAW5Fq5+wPuXubuZSUlJV1aeAAAgFQQ\nz7A2X9JIMxtmZlmSrpQ046BznlC4Vk1mVqxws+hqSU9LusDMisysSNIFkX0AAADdStyaQd290cym\nKxyy0iU95O5LzOxOSeXuPkMHQtlSSU2Svuru2yXJzL6rcOCTpDvdfUe8ygoAAJCqLHpm/iNZWVmZ\nl5eXJ7sYAAAAh2RmC9y9LMi5LDcFAACQwghrAAAAKYywBgAAkMKOmj5rZlYp6b0EvFSxpG0JeB0E\nxz1JTdyX1MR9ST3ck9QU7/syxN0DzTt21IS1RDGz8qAdApEY3JPUxH1JTdyX1MM9SU2pdF9oBgUA\nAEhhhDUAAIAURljruAeSXQC0wD1JTdyX1MR9ST3ck9SUMveFPmsAAAApjJo1AACAFEZYAwAASGGE\ntYDMbJqZLTOzlWZ2a7LL052Y2UNmttXMFkft621ms8xsReR7UWS/mdnPI/fpLTM7NXklP3qZ2SAz\ne8HMlprZEjP7UmQ/9yWJzCzHzF43s0WR+/KdyP5hZvZa5Pf/FzPLiuzPjmyvjBwfmszyH+3MLN3M\nFprZvyLb3JckMrO1Zva2mb1pZuWRfSn5HkZYC8DM0iXdK+n9ksZI+riZjUluqbqVhyVNO2jfrZKe\nc/eRkp6LbEvhezQy8nW9pF8mqIzdTaOkr7j7GElnSLoh8n+C+5JcdZKmuPvJksZJmmZmZ0j6oaS7\n3f1YSVWSPh05/9OSqiL7746ch/j5kqR3ora5L8l3rruPi5pPLSXfwwhrwYyXtNLdV7t7vaRHJV2a\n5DJ1G+4+W9KOg3ZfKumRyONHJH0oav/vPGyepEIzG5CYknYf7r7J3d+IPK5R+A/QQHFfkiry+90d\n2cyMfLmkKZIej+w/+L7sv1+PS5pqZpag4nYrZlYq6SJJD0a2TdyXVJSS72GEtWAGSloftV0R2Yfk\n6efumyKPN0vqF3nMvUqwSBPNKZJeE/cl6SJNbW9K2ipplqRVkqrdvTFySvTvvvm+RI7vlNQnsSXu\nNn4m6WuSQpHtPuK+JJtLesbMFpjZ9ZF9KfkelpGoFwLixd3dzJiDJgnMrIekv0n6srvviv7wz31J\nDndvkjTOzAol/UPS6CQXqdszs4slbXX3BWZ2TrLLg2aT3H2DmfWVNMvM3o0+mErvYdSsBbNB0qCo\n7dLIPiTPlv1V0JHvWyP7uVcJYmaZCge1P7r73yO7uS8pwt2rJb0g6UyFm2z2fziP/t0335fI8V6S\ntie4qN3BREmXmNlahbvRTJF0j7gvSeXuGyLftyr8wWa8UvQ9jLAWzHxJIyMjd7IkXSlpRpLL1N3N\nkHRN5PE1kp6M2v9fkZE7Z0jaGVWljS4S6T/zG0nvuPtPow5xX5LIzEoiNWoys1xJ5yvcn/AFSR+J\nnHbwfdl/vz4i6XlnpvQu5+63uXupuw9V+O/H8+5+lbgvSWNm+WZWsP+xpAskLVaKvoexgkFAZvYB\nhfscpEt6yN2/l+QidRtm9mdJ50gqlrRF0rclPSHpMUmDJb0n6Qp33xEJEb9QePToXknXuXt5Msp9\nNDOzSZJelvS2DvTB+YbC/da4L0liZicp3Ck6XeEP44+5+51mNlzhGp3ekhZKutrd68wsR9LvFe5z\nuEPSle6+Ojml7x4izaC3uPvF3Jfkifzu/xHZzJD0J3f/npn1UQq+hxHWAAAAUhjNoAAAACmMsAYA\nAJDCCGsAAAApjLAG
AACQwghrABCAmV1rZsckuxwAuh/CGgAEc62kVsOamaUntigAuhPCGoAjlpkN\nNbN3zOzXZrbEzJ4xs1wze9HMyiLnFEdmjt9fO/aEmc0ys7VmNt3MbjazhWY2z8x6t/E6H5FUJumP\nZvZm5DXWmtkPzewNSR81sxFm9lRkncGXzWx05NoSM/ubmc2PfE2M7D878lxvRl6/IBG/MwBHHsIa\ngCPdSEn3uvsJkqolffgQ558o6XJJ75P0PUl73f0USa9K+q/WLnD3xyWVS7rK3ce5+77Ioe3ufqq7\nPyrpAUk3uvtpkm6RdF/knHsk3e3u74uU7cHI/lsk3eDu4ySdJWn/cwJADBZyB3CkW+Pub0YeL5A0\n9BDnv+DuNZJqzGynpH9G9r8t6aQOvvZfpOYF7SdI+mvUYvbZke/nSRoTtb9n5Pw5kn5qZn+U9Hd3\nr+jgawPoJghrAI50dVGPmyTlSmrUgZaDnHbOD0Vth9Tx98Q9ke9pkqojtWQHS5N0hrvXHrT/B2b2\nb0kfkDTHzC5093c7+PoAugGaQQEcjdZKOi3y+CPtnNcRNZJa7Vfm7rskrTGzj0rhhe7N7OTI4Wck\n3bj/XDMbF/k+wt3fdvcfSpovaXQXlRPAUYawBuBo9BNJXzCzhZKKu+g5H5b0q/0DDFo5fpWkT5vZ\nIklLJF0a2f9FSWVm9paZLZX0+cj+L5vZYjN7S1KDpP90UTkBHGVYyB0AACCFUbMGAACQwhhgAABR\nzOxeSRMP2n2Pu/82GeUBAJpBAQAAUhjNoAAAACmMsAYAAJDCCGsAAAApjLAGAACQwghrAAAAKYyw\nBgAAkML+P4FpW+vV0KFuAAAAAElFTkSuQmCC\n",
172 | "text/plain": [
173 | ""
174 | ]
175 | },
176 | "metadata": {},
177 | "output_type": "display_data"
178 | }
179 | ],
180 | "source": [
181 | "plt.figure(figsize=(10, 6))\n",
182 | "plt.plot(scores, linewidth=3)\n",
183 | "plt.xlabel('num_trees')\n",
184 | "plt.ylabel('accuracy');"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "We see, that 150 trees are already sufficient to have stable result."
192 | ]
193 | }
194 | ],
195 | "metadata": {
196 | "kernelspec": {
197 | "display_name": "Python 3",
198 | "language": "python",
199 | "name": "python3"
200 | },
201 | "language_info": {
202 | "codemirror_mode": {
203 | "name": "ipython",
204 | "version": 3
205 | },
206 | "file_extension": ".py",
207 | "mimetype": "text/x-python",
208 | "name": "python",
209 | "nbconvert_exporter": "python",
210 | "pygments_lexer": "ipython3",
211 | "version": "3.6.0"
212 | }
213 | },
214 | "nbformat": 4,
215 | "nbformat_minor": 2
216 | }
217 |
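As a reference, steps 1-5 above can be wrapped into a single helper; this is a sketch, and the function name is ours. A common alternative is refitting with `warm_start=True`, which lets the same forest grow extra trees incrementally, but the averaging trick below needs no refitting at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def accuracy_per_n_estimators(rf, X_val, y_val):
    """Validation accuracy of the first k trees, k = 1..n_estimators,
    computed from one fitted forest without any retraining."""
    # (n_trees, n_objects, n_classes) tensor of per-tree class probabilities
    preds = np.stack([tree.predict_proba(X_val) for tree in rf.estimators_])
    # Cumulative average over the tree axis = forest prediction with k trees
    cum_mean = np.cumsum(preds, axis=0) / np.arange(1, len(preds) + 1)[:, None, None]
    return [accuracy_score(y_val, rf.classes_[p.argmax(axis=1)]) for p in cum_mean]

# Usage with the objects defined above:
# scores = accuracy_per_n_estimators(rf, X_val, y_val)
```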
--------------------------------------------------------------------------------
/Reading_materials/Macros.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Macros"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "This notebook shows how to use *macros* commands in Jupyter.\n",
15 | "\n",
16 | "What is *macro*? It is just a named code snippet. Similarly to functions, we can use macros to wrap frequently used code. For example, we can define a macro, that will load all the libraries for us.\n",
17 | "\n",
18 | "### Step 1: Define macro \n",
19 | "\n",
20 | "To save some code as a macro we need to put that code in a cell and run it. "
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 1,
26 | "metadata": {},
27 | "outputs": [
28 | {
29 | "name": "stdout",
30 | "output_type": "stream",
31 | "text": [
32 | "The libraries have been loaded!\n"
33 | ]
34 | }
35 | ],
36 | "source": [
37 | "import numpy as np\n",
38 | "import pandas as pd \n",
39 | "from tqdm import tqdm_notebook\n",
40 | "import os\n",
41 | "import sys\n",
42 | "import os.path\n",
43 | "\n",
44 | "import matplotlib.pyplot as plt\n",
45 | "import matplotlib as mpl\n",
46 | "from matplotlib import rc\n",
47 | "from cycler import cycler\n",
48 | "%matplotlib inline\n",
49 | "\n",
50 | " \n",
51 | "mpl.rcParams['axes.prop_cycle'] = cycler('color', ['#ff0000', '#0000ff', '#00ffff','#ffA300', '#00ff00', \n",
52 | " '#ff00ff', '#990000', '#009999', '#999900', '#009900', '#009999'])\n",
53 | "\n",
54 | "rc('font', size=16)\n",
55 | "rc('font',**{'family':'serif','serif':['Computer Modern']})\n",
56 | "rc('text', usetex=False)\n",
57 | "rc('figure', figsize=(12, 10))\n",
58 | "rc('axes', linewidth=.5)\n",
59 | "rc('lines', linewidth=1.75)\n",
60 | "\n",
61 | "print('The libraries have been loaded!')"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "Now you need to remember the number inside squre brackets of `In []`. Now, to save the code, in that cell you need to use macro magic:\n",
69 | "\n",
70 | "```\n",
71 | "%macro __imp \n",
72 | "```"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 2,
78 | "metadata": {
79 | "collapsed": true
80 | },
81 | "outputs": [],
82 | "source": [
83 | "%macro -q __imp 1"
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "Now try it!"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 3,
96 | "metadata": {},
97 | "outputs": [
98 | {
99 | "name": "stdout",
100 | "output_type": "stream",
101 | "text": [
102 | "The libraries have been loaded!\n"
103 | ]
104 | }
105 | ],
106 | "source": [
107 | "__imp"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "### Step 2: save macro\n",
115 | "\n",
116 | "To this end we've only created a macro, but it will be lost, when the kernel is restarted. We need to somehow store it, so than we can load it easily later. In can be done with `%store` macro."
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": 4,
122 | "metadata": {},
123 | "outputs": [
124 | {
125 | "name": "stdout",
126 | "output_type": "stream",
127 | "text": [
128 | "Stored '__imp' (Macro)\n"
129 | ]
130 | }
131 | ],
132 | "source": [
133 | "%store __imp"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "Now `__imp` is saved in a kind of Jupyter's global memory. You can list all the stored variables like that:"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 5,
146 | "metadata": {},
147 | "outputs": [
148 | {
149 | "name": "stdout",
150 | "output_type": "stream",
151 | "text": [
152 | "Stored variables and their in-db values:\n",
153 | "__imp -> IPython.macro.Macro(\"import numpy as np\\nimport pa\n"
154 | ]
155 | }
156 | ],
157 | "source": [
158 | "%store"
159 | ]
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "Now **restart the kernel** and get back to this cell without running the previous ones. To run the stored macro you need to retrieve the macro first with the following line: "
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": 1,
171 | "metadata": {
172 | "collapsed": true
173 | },
174 | "outputs": [],
175 | "source": [
176 | "%store -r __imp"
177 | ]
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "And only then call the macro:"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 2,
189 | "metadata": {},
190 | "outputs": [
191 | {
192 | "name": "stdout",
193 | "output_type": "stream",
194 | "text": [
195 | "The libraries have been loaded!\n"
196 | ]
197 | }
198 | ],
199 | "source": [
200 | "__imp"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "### Step 3: auto restore macro"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "So you need to use as many as 2 cells! But, fortunately, Jupyer can load the stored variables (and macros) automatically. To enable it you need to update you `.ipython_profile` [config](http://ipython.readthedocs.io/en/stable/development/config.html). If you've never heared of it, then it is not yet created, otherwise you should know where it lives. \n",
215 | "\n",
216 | "On Coursera's notebooks we will create it here: `~/.ipython/profile_default/ipython_profile.py` and notify the ipython, that we want it to automatically restore stored variables.\n",
217 | "\n",
218 | "```\n",
219 | "c.StoreMagics.autorestore = True\n",
220 | "```"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 4,
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "name": "stdout",
230 | "output_type": "stream",
231 | "text": [
232 | "c = get_config()\r\n",
233 | "c.StoreMagics.autorestore = True\r\n"
234 | ]
235 | }
236 | ],
237 | "source": [
238 | "!echo \"c = get_config()\\nc.StoreMagics.autorestore = True\" > ~/.ipython/profile_default/ipython_config.py\n",
239 | "!cat ~/.ipython/profile_default/ipython_config.py"
240 | ]
241 | },
242 | {
243 | "cell_type": "markdown",
244 | "metadata": {},
245 | "source": [
246 | "That's it! Now **restart your notebook (kernel)** and **define and store macro** again (step 1 and first code cell from step 2). And finally, to test it, **restart the kernel** again. Now you can immediately access `__imp` macro, so that all the libraries are loaded with a 5 char line of code."
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": 1,
252 | "metadata": {},
253 | "outputs": [
254 | {
255 | "name": "stdout",
256 | "output_type": "stream",
257 | "text": [
258 | "The libraries have been loaded!\n"
259 | ]
260 | }
261 | ],
262 | "source": [
263 | "__imp"
264 | ]
265 | }
266 | ],
267 | "metadata": {
268 | "kernelspec": {
269 | "display_name": "Python 3",
270 | "language": "python",
271 | "name": "python3"
272 | },
273 | "language_info": {
274 | "codemirror_mode": {
275 | "name": "ipython",
276 | "version": 3
277 | },
278 | "file_extension": ".py",
279 | "mimetype": "text/x-python",
280 | "name": "python",
281 | "nbconvert_exporter": "python",
282 | "pygments_lexer": "ipython3",
283 | "version": "3.6.0"
284 | }
285 | },
286 | "nbformat": 4,
287 | "nbformat_minor": 1
288 | }
289 |
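A related trick worth knowing (not covered in this notebook, and assuming your environment uses the default IPython profile): IPython executes every `.py` file in the profile's `startup/` directory when a kernel starts, so frequently used imports can be made automatic without macros at all. The file name below is just an example.

```python
# ~/.ipython/profile_default/startup/00-imports.py  (example file name)
# Every .py file in this directory runs at kernel startup, in lexicographic
# order. Keep it small, since whatever it does delays every kernel start.
import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print('The libraries have been loaded!')
```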
--------------------------------------------------------------------------------
/Reading_materials/Metrics_video2_constants_for_MSE_and_MAE.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "This document briefly explains why target mean value minimizes MSE error and why target median minimizes MAE."
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Suppose we have a dataset \n",
15 | "$$\n",
16 | "\\{(x_i,y_i)\\}_{i=1}^N\n",
17 | "$$ \n",
18 | "\n",
19 | "Basically, we are given pairs: features $x_i$ and corresponding target value $y_i \\in \\mathbb{R}$. \n",
20 | "\n",
21 | "We will denote vector of targets as $y \\in \\mathbb{R}^N$, such that $y_i$ is target for object $x_i$. Similarly, $\\hat y \\in \\mathbb{R}$ denotes predictions for the objects: $\\hat y_i$ for object $x_i$. "
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "# MSE"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "Let's start with MSE loss. It is defined as follows: \n",
36 | "\n",
37 | "$$ \n",
38 | "MSE(y, \\hat y) = \\frac{1}{N} \\sum_{i=1}^N (\\hat y_i - y_i)^2\n",
39 | "$$\n",
40 | "\n",
41 | "Now, the question is: if predictions for all the objects were the same and equal to $\\alpha$: $\\hat y_i = \\alpha$, what value of $\\alpha$ would minimize MSE error? \n",
42 | "\n",
43 | "$$ \n",
44 | "\\min_{\\alpha} f(\\alpha) = \\frac{1}{N} \\sum_{i=1}^N (\\alpha - y_i)^2\n",
45 | "$$\n",
46 | "\n",
47 | "The function $f(\\alpha)$, that we want to minimize is smooth with respect to $\\alpha$. A required condition for $\\alpha^*$ to be a local optima is \n",
48 | "$$\n",
49 | "\\frac{d f}{d \\alpha}\\bigg|_{\\alpha=\\alpha^*} = 0\\, .\n",
50 | "$$"
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "Let's find the points, that satisfy the condition:\n",
58 | "\n",
59 | "$$\n",
60 | "\\frac{d f}{d \\alpha}\\bigg|_{\\alpha=\\alpha^*} = \\frac{2}{N} \\sum_{i=1}^N (\\alpha^* - y_i) = 0\n",
61 | "$$\n",
62 | "\n",
63 | "$$\n",
64 | "\\frac{2}{N} \\sum_{i=1}^N \\alpha^* - \\frac{2}{N} \\sum_{i=1}^N y_i = 0\n",
65 | "$$\n",
66 | "\n",
67 | "$$\n",
68 | " \\alpha^* - \\frac{1}{N} \\sum_{i=1}^N y_i = 0\n",
69 | "$$\n",
70 | "\n",
71 | "And finally:\n",
72 | "$$\n",
73 | " \\alpha^* = \\frac{1}{N} \\sum_{i=1}^N y_i\n",
74 | "$$\n",
75 | "\n",
76 | "Since second derivative $\\frac{d^2 f}{d \\alpha^2}$ is positive at point $\\alpha^*$, then what we found is local minima.\n",
77 | "\n",
78 | "So, that is how it is possible to find, that optial constan for MSE metric is target mean value."
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {},
84 | "source": [
85 | "# MAE"
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "Similarly to the way we found optimal constant for MSE loss, we can find it for MAE."
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "$$ \n",
100 | "MAE(y, \\hat y) = \\frac{1}{N} \\sum_{i=1}^N |\\hat y_i - y_i|\n",
101 | "$$"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {},
107 | "source": [
108 | "$$ \n",
109 | "\\min_{\\alpha} f(\\alpha) = \\frac{1}{N} \\sum_{i=1}^N |\\alpha - y_i|\n",
110 | "$$"
111 | ]
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "metadata": {},
116 | "source": [
117 | "Recall that $ \\frac{\\partial |x|}{dx} = sign(x)$, where $sign$ stands for [signum function](https://en.wikipedia.org/wiki/Sign_function) . Thus\n",
118 | "\n",
119 | "\n",
120 | "$$\n",
121 | "\\frac{d f}{d \\alpha}\\bigg|_{\\alpha=\\alpha^*} = \\frac{1}{N} \\sum_{i=1}^N sign(\\alpha^* - y_i) = 0\n",
122 | "$$"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "So we need to find such $\\alpha^*$ that\n",
130 | "\n",
131 | "$$\n",
132 | "g(\\alpha^*) = \\sum_{i=1}^N sign(\\alpha^* - y_i) = 0\n",
133 | "$$\n",
134 | "\n",
135 | "Note that $g(\\alpha^*)$ is piecewise-constant non-decreasing function. $g(\\alpha^*)=-1$ for all calues of $\\alpha$ less then mimimum $y_i$ and $g(\\alpha^*)=1$ for $\\alpha > \\max_i y_i$. The function \"jumps\" by $\\frac{2}{N}$ at every point $y_i$. Here is an example, how this function looks like for $y = [-0.5, 0, 1, 3, 3.4]$:"
136 | ]
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
142 | "
"
143 | ]
144 | },
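Since the figure is not embedded in this copy, here is a short sketch (not part of the original notebook) that reproduces the step plot of $g(\alpha)$ for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([-0.5, 0.0, 1.0, 3.0, 3.4])
alphas = np.linspace(-1.5, 4.5, 1001)

# g(alpha) = (1/N) * sum_i sign(alpha - y_i): equals -1 below min(y),
# +1 above max(y), and jumps by 2/N at every y_i.
g = np.sign(alphas[:, None] - y[None, :]).mean(axis=1)

plt.step(alphas, g)
plt.axhline(0.0, linestyle='--', color='gray')
plt.xlabel('alpha')
plt.ylabel('g(alpha)')
plt.show()
```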
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "Basically there are $N$ jumps of the same size, starting from $-1$ and ending at $1$. It is clear, that you need to do about $\\frac{N}{2}$ jumps to hit zero. And that happens exactly at median value of the target vector $g(median(y))=0$. We should be careful and separate two cases: when there are even number of points and odd, but the intuition remains the same. "
150 | ]
151 | }
152 | ],
153 | "metadata": {
154 | "kernelspec": {
155 | "display_name": "Python 3",
156 | "language": "python",
157 | "name": "python3"
158 | },
159 | "language_info": {
160 | "codemirror_mode": {
161 | "name": "ipython",
162 | "version": 3
163 | },
164 | "file_extension": ".py",
165 | "mimetype": "text/x-python",
166 | "name": "python",
167 | "nbconvert_exporter": "python",
168 | "pygments_lexer": "ipython3",
169 | "version": "3.6.0"
170 | }
171 | },
172 | "nbformat": 4,
173 | "nbformat_minor": 2
174 | }
175 |
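A quick numeric sanity check of both claims (a sketch, not part of the original notebook): minimize each loss over a dense grid of constants and compare the minimizers with the mean and the median.

```python
import numpy as np

y = np.array([-0.5, 0.0, 1.0, 3.0, 3.4])
alphas = np.linspace(-1.0, 4.0, 5001)  # grid of candidate constants

# Mean loss of each constant prediction over the targets
mse = ((alphas[:, None] - y[None, :]) ** 2).mean(axis=1)
mae = np.abs(alphas[:, None] - y[None, :]).mean(axis=1)

print('argmin MSE:', alphas[mse.argmin()], '  mean:  ', y.mean())      # 1.38
print('argmin MAE:', alphas[mae.argmin()], '  median:', np.median(y))  # 1.0
```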
--------------------------------------------------------------------------------
/Reading_materials/Metrics_video3_weighted_median.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 2,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import numpy as np\n",
12 | "import matplotlib.pyplot as plt\n",
13 | "%matplotlib inline"
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "# Weighted median"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "In the video we have discussed that for MAPE metric the best constant prediction is [weighted median](https://en.wikipedia.org/wiki/Weighted_median) with weights\n",
28 | "\n",
29 | "$$w_i = \\frac{\\sum_{j=1}^N \\frac{1}{x_j}}{x_i}$$\n",
30 | "\n",
31 | "for each object $x_i$.\n",
32 | "\n",
33 | "This notebook exlpains how to compute weighted median. Let's generate some data first, and then find it's weighted median."
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 3,
39 | "metadata": {},
40 | "outputs": [
41 | {
42 | "data": {
43 | "text/plain": [
44 | "array([17, 91, 35, 73, 51])"
45 | ]
46 | },
47 | "execution_count": 3,
48 | "metadata": {},
49 | "output_type": "execute_result"
50 | }
51 | ],
52 | "source": [
53 | "N = 5\n",
54 | "x = np.random.randint(low=1, high=100, size=N)\n",
55 | "x"
56 | ]
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "metadata": {},
61 | "source": [
62 | "1) Compute *normalized* weights:"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 4,
68 | "metadata": {},
69 | "outputs": [
70 | {
71 | "data": {
72 | "text/plain": [
73 | "array([ 0.05882353, 0.01098901, 0.02857143, 0.01369863, 0.01960784])"
74 | ]
75 | },
76 | "execution_count": 4,
77 | "metadata": {},
78 | "output_type": "execute_result"
79 | }
80 | ],
81 | "source": [
82 | "inv_x = 1.0/x\n",
83 | "inv_x"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": 5,
89 | "metadata": {},
90 | "outputs": [
91 | {
92 | "data": {
93 | "text/plain": [
94 | "array([ 0.44668032, 0.08344577, 0.21695901, 0.10402145, 0.14889344])"
95 | ]
96 | },
97 | "execution_count": 5,
98 | "metadata": {},
99 | "output_type": "execute_result"
100 | }
101 | ],
102 | "source": [
103 | "w = inv_x/sum(inv_x)\n",
104 | "w"
105 | ]
106 | },
107 | {
108 | "cell_type": "markdown",
109 | "metadata": {},
110 | "source": [
111 | "2) Now sort the normalized weights. We will use `argsort` (and not just `sort`) since we will need indices later."
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": 6,
117 | "metadata": {
118 | "scrolled": true
119 | },
120 | "outputs": [
121 | {
122 | "data": {
123 | "text/plain": [
124 | "array([ 0.08344577, 0.10402145, 0.14889344, 0.21695901, 0.44668032])"
125 | ]
126 | },
127 | "execution_count": 6,
128 | "metadata": {},
129 | "output_type": "execute_result"
130 | }
131 | ],
132 | "source": [
133 | "idxs = np.argsort(w)\n",
134 | "sorted_w = w[idxs]\n",
135 | "sorted_w"
136 | ]
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
142 | "3) Compute [cumulitive sum](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.cumsum.html) of sorted weights"
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": 7,
148 | "metadata": {},
149 | "outputs": [
150 | {
151 | "data": {
152 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD8CAYAAACMwORRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl4VOX99/H3lwTCEghLEraQBCQQEGSLbG4IWBFbwNpa\n3G2rqH2srbb20Wpti3bR9uevi9bl18efAuJaRVSUWkFtNSCERcISDJiEBEgCgQQIIcvczx8ZbRrB\nBGaSMzP5vK4r15WZuc35XEfmk5Nz7jm3OecQEZHI0s7rACIiEnwqdxGRCKRyFxGJQCp3EZEIpHIX\nEYlAKncRkQikchcRiUAqdxGRCKRyFxGJQNFebTg+Pt6lpqZ6tXkRkbCUlZW1zzmX0NQ4z8o9NTWV\ntWvXerV5EZGwZGb5zRmn0zIiIhFI5S4iEoFU7iIiEUjlLiISgVTuIiIRqMlyN7MnzazEzLJP8LqZ\n2Z/MLNfMPjazscGPKSIiJ6M5R+5PATO+5PWLgDT/1zzg0cBjiYhIIJosd+fc+0DZlwyZDSxw9VYB\n3c2sb7ACiohECp/P8as3trCrrLLFtxWMc+79gV0NHhf6n/sCM5tnZmvNbG1paWkQNi0iEj7+8M4n\n/M8/P+Wfn+xr8W216gVV59wTzrkM51xGQkKTn54VEYkY/9hSzJ/e+YRvjEvi8vEDWnx7wSj3IqBh\n0iT/cyIiAuwsPcxtz29gRP9u3D9nBGbW4tsMRrkvBa7xz5qZCJQ75/YE4eeKiIS9I8dquXFhFtFR\nxmNXjaNj+6hW2W6TNw4zs2eBKUC8mRUCPwfaAzjnHgOWATOBXKAS+HZLhRURCSfOOX7y0sfsKD3M\ngu9MIKlH51bbdpPl7py7vInXHfB/gpZIRCRC/M8/d/LGpj3ceVE6Z6fFt+q29QlVEZEW8GHuPn77\n5jYuGtGHG88d1OrbV7mLiARZ0cGj3PLsek5LiOV33xzVKhdQG1O5i4gEUVVNHTctzKKm1sdjV48j\nNsabNZE8W4lJRCTSOOf42ZJsNhWV88TV4zgtIdazLDpyFxEJksUfFfBiViHfnzqYr5zex9MsKncR\nkSBYV3CAXyzdzHlDEvjh9CFex1G5i4gEquRQFTcvyqJvXCf+OHc0Ue1a/wJqYzrnLiISgJo6H7c8\ns57yozW8fPN4unfu4HUkQOUuIhKQXy/bykd5ZfzhW6MZ3q+b13E+p9MyIiKnaMn6Iv73gzy+fVYq\nc8Yc907nnlG5i4icgs27y7nz5Y8Zn9qTn84c5nWcL1C5i4icpIOV1dy0KIu4Tu15+MoxtI8KvSrV\nOXcRkZNQ53Pc+twG9pZX8fyNk0js2tHrSMelchcROQl/+Md23t9eyq8uGcHY5B5exzmh0PtbQkQk\nRP19817+vCKXyzKSuGJ8stdxvpTKXUSkGXaUHub2FzZyRlIc82e3zlJ5gVC5i4g04bB/qbwO0e14\ntBWXyguEyl1E5Es457jjxY3sLD3Mw5ePoX/3Tl5HahaVu4jIl3j8/Z28mb2XOy9KZ/Lg1l0qLxAq\ndxGRE/jXJ/t48K1tXDyyLzec0/pL5QVC5S4ichy7yir5/rPrGJwYy4PfOCPkL6A2pnIXEWmkqqaO\nm5/JorbO8fjVGXTxaKm8QIRfYhGRFuSc4+5XsskuquCv12QwML6L15FOiY7cRUQaWLS6gL+tK+TW\naWlMH97b6zinTOUuIuKXlV/G/Nc2c/7QBH44Lc3rOAFRuYuIACUVVdy8aB39unfiD98aQ7sQWCov\nECp3EWnzqmt9fO+ZdRyqquWxq8YR17m915ECpguqItLm/XrZVtbmH+CPc0czrG/oLJUXCB25i0ib\n9vK6Qp76MI/vnj2Q2aNDa6m8QKjcRaTNyi4q566XNzFhYE/uvCjd6zhBpXIXkTbpwJH6pfJ6dO7A\nw1eMDcml8gKhc+4i0ubUL5W3npKKYzx/40QSusZ4HSnoVO4i0uY89HYO//xkH7/5+kjGhPBSeYGI\nrL9DRESa8Fb2Xh5ZuYO5Zw7g8hBfKi8QzSp3M5thZjlmlmtmdx7n9WQzW2lm683sYzObGfyoIiKB\nyS05zI9f3MiopDh+Met0r+O0qCbL3cyigEeAi4DhwOVmNrzRsHuAF5xzY4C5wF+CHVREJBCHqmq4\nceFaYsJoqbxANOfIfTyQ65zb6ZyrBp4DZjca44DPZv7HAbuDF1FEJDDOOX784kby9lfy5yvG0C9M\nlsoLRHPKvT+wq8HjQv9zDf0CuMrMCoFlwPeP94PMbJ6ZrTWztaWlpacQV0Tk5D363g6Wby7mrovS\nmXxa+CyVF4hgXVC9HHjKOZcEzAQWmtkXfrZz7gnnXIZzLiMhISFImxYRObH3t5fy++U5fPWMvnz3\n7IFex2k1zSn3ImBAg8dJ/uca+i7wAoBzLhPoCLSNX48iErJ2lVVy63PrSUvsGpZL5QWiOeW+Bkgz\ns4Fm1oH6C6ZLG40pAKYBmNkw6std511ExDNHq+u4cWEWdT7H41ePo3OHtvWxnibL3TlXC9wCLAe2\nUj8rZrOZzTezWf5hPwJuMLONwLPAdc4511KhRUS+TP1SeZvYsqeCP84dTWqYLpUXiGb9KnPOLaP+\nQmnD5+5t8P0W4KzgRhMROTULV+Xz8voifjg9janp4btUXiD0CVURiShr88qY/9oWpqUncuvU8F4q\nLxAqdxGJGMUVVdz8zDr69+jEQ98aHfZL5QWibV1hEJGI9dlSeYeraln03QnEdQr/pfICoXIXkYhw\n/xtbyMo/wJ8vH8PQPl29juM5nZYRkbD3UlYhCzLzueGcgXxtVD+v44QElbuIhLXsonLufmUTEwf1\n5P/OiKyl8gKhcheRsFV2pJobF2bRs0v9UnnREbZUXiB0zl1EwlKdz3Hrs+spPXSMF26aRHxs5C2V\nFwiVu4iEpd//PYd/5e7jgUtHMnpAd6/jhBz9DSMiYefNTXt49N0dXD4+mW+dGblL5QVC5S4iYSW3\n5FD9UnkDuvOLWY0XhZPPqNxFJGwcqqph3sIsOraP4rGrxhITHdlL5QVC59xFJCz4fI7bX9hI/v5K\nnrl+An3jIn+pvEDoyF1EwsKj7+3g7S3F/HTmMCYO6uV1nJCncheRkPfe9lJ+//ccZo3qx3fOSvU6\nTlhQuYtISNtVVsmtz65naO+u/PbSkW1qqbxAqNxFJGQdra5j3sIsnGubS+UFQntKREKSc467Xv6Y\nbXsrePLaM0np1faWyguEjtxFJCQ9/WEeSzbs5rbpQzg/PdHrOGFH5S4iIeejT8u4/42tTB+WyC3n\nD/Y6TlhSuYtISNlbXsX3nlnHgJ6d2/xSeYHQOXcRCRnHauu4+ZksKqtrWXzDBLp1bNtL5QVC5S4i\nIeO+17ewvuAgj1wxliG9tVReIHRaRkRCwgtrd7FoVQE3njuIi8/o63WcsKdyFxHPbSos554l2Uw+\nrRd3XDjU6zgRQeUuIp7af/gYN
y3KIr5LB/58+RgtlRckOucuIp6prfNx63PrKT18jJdumkQvLZUX\nNPoVKSKe+d3fc/ggdz/3zxnBGUlaKi+YVO4i4ok3Pt7D4+/t5MoJyVyWMcDrOBFH5S4ire6T4kPc\n8dJGxiR3596vaam8lqByF5FWVeFfKq9zhygevXKclsprIbqgKiKtxudz3P78RnaV1S+V1yeuo9eR\nIpaO3EWk1TyyMpd/bC3m7ouHMUFL5bUolbuItIqVOSU89I/tzBndj+smp3odJ+I1q9zNbIaZ5ZhZ\nrpndeYIxl5nZFjPbbGaLgxtTRMJZ/v4j/ODZ9aT36cZvvn6GlsprBU2eczezKOAR4AKgEFhjZkud\nc1sajEkD7gLOcs4dMDPdWV9EAKisruXGhVkAPH7VODp10AXU1tCcI/fxQK5zbqdzrhp4DpjdaMwN\nwCPOuQMAzrmS4MYUkXBUv1TeJnKKD/Gny8eQ3Kuz15HajOaUe39gV4PHhf7nGhoCDDGzD8xslZnN\nCFZAEQlf//tBHq9u2M2PLhjClKH6g741BWsqZDSQBkwBkoD3zWykc+5gw0FmNg+YB5CcnBykTYtI\nKFq1cz+/WraVC4b35ntTtFRea2vOkXsR0PCzwUn+5xoqBJY652qcc58C26kv+//gnHvCOZfhnMtI\nSEg41cwiEuL2lldxy+J1pPTszH9dNkpL5XmgOeW+Bkgzs4Fm1gGYCyxtNGYJ9UftmFk89adpdgYx\np4iEiWO1ddy0KIvK6joev3qclsrzSJPl7pyrBW4BlgNbgRecc5vNbL6ZzfIPWw7sN7MtwErgDufc\n/pYKLSKh65evbWHDroP8/pujSNNSeZ5p1jl359wyYFmj5+5t8L0Dbvd/iUgb9fyaAhavLuCm805j\n5kgtleclfUJVRIJi466D/OzVzZw9OJ4ff2WI13HaPJW7iARs3+Fj3Lwoi4TYGP6kpfJCgu4KKSIB\nqa3z8f3F69l3pJq/3TSZnl06eB1J0JG7iAToweU5ZO7cz6/mjGBkUpzXccRPR+4ickpKKqq4742t\nvLZxN1dPTOGbWiovpKjcReSk1Pkci1bl8/vlORyr9fGDaWncMlWfQA01KncRabaPCw9y9yvZbCoq\n55y0eObPHsHA+C5ex5LjULmLSJMqqmr4/fIcFq7KJ94/I+ZrZ/TVfdlDmMpdRE7IOcfSjbu5/42t\n7D98jGsmpvCjC4fqlgJhQOUuIse1s/QwP3s1mw9y93NGUhxPXnumZsOEEZW7iPyHqpo6/vLuDh57\ndwcx0e2YP/t0rpyQQpTu7BhWVO4i8rn3t5dy76vZ5O2vZNaoftxz8TASu3X0OpacApW7iFBcUcX8\n17fwxsd7GBjfhUXfncDZafFex5IAqNxF2rA6n2NBZh7/9fftVNf5uG36EG48bxAd22sR63Cnchdp\nozbuOsjdSzaRXVTBOWnx3Dd7BKmasx4xVO4ibUz50fo564tW55MQG8PDV4zh4pGasx5pVO4ibYRz\njlc31M9ZLztyjGsnpfKjrwyhq+asRySVu0gbsKP0MD9bks2HO/YzKimOp759JiP6a856JFO5i0Sw\nqpo6/rIyl8fe20lM+3bcN2cEV4xP1pz1NkDlLhKh3s0p4edLN5O/v5I5o/vx04uHkdhVc9bbCpW7\nSITZW17Ffa9v4Y1NexgU34Vnrp/AWYM1Z72tUbmLRIjaOh8LMvN56O36Oeu3X1A/Zz0mWnPW2yKV\nu0gE2LDrIHe/sonNuys4b0gC82efTkovzVlvy1TuImGsvLKGB5dvY/FHBSR2jeGRK8Yyc2QfzVkX\nlbtIOHLOsWRDEb96YytlR6q5bnIqt1+gOevybyp3kTCTW1I/Zz1z535GDejOU98erznr8gUqd5Ew\nUVVTx8Mrcnn8/R10ah/F/XNGcLnmrMsJqNxFwsDKnBJ+/upmCsoquWRMf346cxgJXWO8jiUhTOUu\nEsL2llcx//XNLNu0l0EJXVh8wwQmn6Y569I0lbtICKqt8/F0Zj4P/T2HWp/jx18Zwg3nas66NJ/K\nXSTErCs4wD2vZLNlTwVThiYwf9YIknt19jqWhBmVu0iIKK+s4YHl23j2owJ6d+3Io1eOZcYIzVmX\nU6NyF/GYc45X1tfPWT94tIbvnDWQ2y4YQmyM3p5y6vSvR8RDuSWHuWfJJlbtLGNMcncWzBnB6f00\nZ10C16xyN7MZwB+BKOCvzrnfnmDcpcBLwJnOubVBSykSYY5W1/Hwyk944v2ddO4Qza8vGcncMwfQ\nTnPWJUiaLHcziwIeAS4ACoE1ZrbUObel0biuwA+A1S0RVCRSrNxWwr1Ls9lVdpSvj62fsx4fqznr\nElzNOXIfD+Q653YCmNlzwGxgS6Nx9wEPAHcENaFIhNhTfpT5r23hzey9DE6M5dkbJjLptF5ex5II\n1Zxy7w/savC4EJjQcICZjQUGOOfeMDOVu0gDtXU+nvowj/9+ezu1PscdFw7lhnMG0SG6ndfRJIIF\nfEHVzNoBDwHXNWPsPGAeQHJycqCbFgl5WfkHuGdJNlv3VHD+0ATmzx7BgJ6asy4trznlXgQMaPA4\nyf/cZ7oCI4B3/fNx+wBLzWxW44uqzrkngCcAMjIyXAC5RULawcpqHngrh2c/KqBvXEceu2osF56u\nOevSeppT7muANDMbSH2pzwWu+OxF51w58PnNLszsXeDHmi0jbZFzjpfXFfHrZfVz1q8/eyA/1Jx1\n8UCT/+Kcc7VmdguwnPqpkE865zab2XxgrXNuaUuHFAkHnxQf4p4l2az+tIyxyd1ZdMlIhvXt5nUs\naaOadTjhnFsGLGv03L0nGDsl8Fgi4eNodR1/XlE/Z71LTDS/+fpIvpWhOeviLf2tKBKAFduKuffV\nzRQeOMqlY5P46cx0emnOuoQAlbvIKdh98Ci/fG0zyzcXk5YYy/PzJjJhkOasS+hQuYuchJo6H099\nkMd//2M7Puf4yYyhXH+25qxL6FG5izRTVn4Zd7+Szba9h5iWnsgvZp2uOesSslTuIk04cKSaB97a\nxnNrdtEvriOPXz2OrwzvrTnrEtJU7iIn4JzjpaxCfvPmNsqP1jDv3EH8YFoaXTRnXcKA/pWKHMf2\n4kPc80o2H+WVMS6lB/fPGaE56xJWVO4iDRw4Us3j7+/kr//cSWzHaB64dCTfHKc56xJ+VO4iQHZR\nOQsy83h1w26O1fr45rgk7po5jJ5dOngdTeSUqNylzaqu9fFm9h4WZOaTlX+ATu2juHRcEtdMSiG9\nj07BSHhTuUubU1xRxTOrC1i8uoB9h4+R2qszP/vqcL4xLom4Tu29jicSFCp3aROcc6zJO8DTmXks\nz95LnXOcPzSRayalcG5ags6pS8RRuUtEO1pdx6sbing6M5+teyro1jGab5+VylUTU0jp1cXreCIt\nRuUuESl//xEWrcrn+TW7qKiqJb1PV3779ZHMHt2fTh2ivI4n0uJU7hIxfD7H+5+UsiAzn5
U5JUSZ\nceGIPlw3OZWMlB76RKm0KSp3CXvlR2t4KauQhZl55O2vJD42hu9PTeOK8cn0ievodTwRT6jcJWxt\n21vBgsx8XllXxNGaOjJSenDbBUO4aERf3aVR2jyVu4SVmjofb28p5ukP81j9aRkx0e2YPbof10xK\nZUT/OK/jiYQMlbuEhdJDx3juowKeWV3A3ooqknp04q6L0rksYwA99ClSkS9QuUvIcs6xftdBFnyY\nxxub9lBT5zgnLZ7754zg/PREojQ3XeSEVO4Scqpq6nht424WZOazqaic2JhorpyQwtWTUjgtIdbr\neCJhQeUuIaPwQCWLVhXw/JoCDlTWkJYYy31zRnDJmP7E6h7qIidF7xjxlHOOD3L383RmHu9sLQbg\nK8P7cM3kFCYN6qW56SKnSOUunjhUVcPL64pYkJnHjtIj9OzSgZunnMYVE1Lo372T1/FEwp7KXVpV\nbslhFmTm8besQo5U1zFqQHceumwUM0f2pWN73RZAJFhU7tLi6nyOd7YWsyAzn3/l7qNDVDu+Oqov\n10xKZfSA7l7HE4lIKndpMWVHqnl+zS4Wrcqn6OBR+sV15I4LhzL3zAH0io3xOp5IRFO5S9BtKizn\n6cw8lm7cTXWtj8mn9eJnXx3O9GGJREfptgAirUHlLkFxrLaONzft5enMPNYXHKRzhyguy0jimkmp\nDOnd1et4Im2Oyl0Csqf8KItXF/DsRwXsO1zNoPgu/Pxrw7l0XBLdOmrJOhGvqNzlpDnnWP1pGQsy\n81i+uRifc0xL7801k1I4e3C8lqwTCQEqd2m2I8dqWbKhiAUf5pNTfIi4Tu25/uyBXDUxhQE9O3sd\nT0QaULlLkz7dd4SFmfm8mLWLQ1W1DO/bjQcvPYOvjeqnJetEQpTKXY7L53O8u72Epz/M573tpUS3\nM2aO7Mu1k1MYm6wl60RCXbPK3cxmAH8EooC/Oud+2+j124HrgVqgFPiOcy4/yFmlFRysrObFtYUs\nXJVPQVkliV1juG36EC4fP4DEblqyTiRcNFnuZhYFPAJcABQCa8xsqXNuS4Nh64EM51ylmd0MPAh8\nqyUCS8vYsruCBZl5LNlQRFWNj/GpPfnJjKFceHof2mtuukjYac6R+3gg1zm3E8DMngNmA5+Xu3Nu\nZYPxq4CrghlSWkZNnY+3sveyIDOPNXkH6Ni+HZeM6c/VE1MZ3q+b1/FEJADNKff+wK4GjwuBCV8y\n/rvAm4GEkpZVUlHF4o8KWLy6gJJDx0ju2Zl7Lh7GN8cNIK6z5qaLRIKgXlA1s6uADOC8E7w+D5gH\nkJycHMxNSxOcc6wrOMBTH+bz5qY91PocU4Ym8MCkVM4bkqC56SIRpjnlXgQMaPA4yf/cfzCz6cDd\nwHnOuWPH+0HOuSeAJwAyMjLcSaeVk1Z+tIa3svewIDOfzbsr6Noxmmsnp3LVxBQGxnfxOp6ItJDm\nlPsaIM3MBlJf6nOBKxoOMLMxwOPADOdcSdBTSrM559hRepgV20p4Z2sJa/MPUOdzDO3dlV9fMpI5\nY/rRuYNmwIpEuibf5c65WjO7BVhO/VTIJ51zm81sPrDWObcU+B0QC7zon/9c4Jyb1YK5pYFjtXWs\n3lnGim0lrNhWQkFZJQDpfboy79xBTB/Wm7HJ3TU3XaQNadYhnHNuGbCs0XP3Nvh+epBzSROKK6pY\n6S/zf+Xuo7K6jpjodpw1OJ4bzh3E1PRELVcn0obp7/Mw4fM5Pi4q9x+dF5NdVAFAv7iOXDKmP9OG\nJTJpULxuByAigMo9pB2qquFfn+zjnW0lvJtTwr7D1bQzGJPcgzsuHMrU9ETS+3TV6RYR+QKVe4j5\ndN8R3tlazMqcEj76tIyaOke3jtGcNzSRaemJnDckgR5dOngdU0RCnMrdY9W1Ptbk/fti6Kf7jgCQ\nlhjLd84eyNShiYxL6aHl6UTkpKjcPVB66Bgrc0pYua2Ef36yj8PHaukQ3Y5Jg3px3eRUpqYn6v7o\nIhIQlXsr8Pkcm3dXfH4xdGNhOQC9u8XwtVF9mZrem7MG99L8cxEJGrVJCzlyrJZ/5e5jxdYSVuaU\nUHLoGGYwKqk7P7pgCOenJ3J6v266GCoiLULlHkQF+yt5Z1sxK7aVsHpnGdV1PrrGRHPukASmpidy\n3tAE4mNjvI4pIm2Ayj0ANXU+1uYdYGVOCe9sLWZHaf3F0EEJXbh2cgrnpydyZmpP3Q9dRFqdyv0k\nlR2p5t2cEt7ZVsL720s5VFVL+yhj4qBeXDkhhanpiaTqhlwi4jGVexOcc2zdc4gV/tMt63cdxDlI\n6BrDRSP6MDW9N2enxRMbo10pIqFDjXQcR6vr+CB3Hyv80xX3lFcBcEZSHD+YlsbU9ERG9IvTPdBF\nJGSp3P0KD1Syclv96ZbMHfs5VuujS4cozklL4LbpiUxJTyCxqxaIFpHw0GbLvbbOx/pdB3lna/3R\neU7xIQBSe3X+/Nz5mQN7EBOtG3GJSPhpU+V+sLKa97aXsmJbCe/mlFJ+tIbodsb4gT25J2MYU9MT\nGZQQ63VMEZGARXS5O+fYXnyYd7YVs3JbCVn5B/A56NWlA9OH9WbasETOTounW0ctCi0ikSXiyr2q\npo7MHfs/vxFX0cGjAJzerxu3nD+Y89MTGZXUXRdDRSSiRUS57yk/Wl/mW0v4YMc+qmp8dO4QxVmD\n47ll6mDOH5pInzhdDBWRtiMsy73O59iw66B/7nkpW/fUr0o0oGcn5p6ZzPnpiUwY2JOO7XUxVETa\nprAr9+fXFPDAWzmUHakmqp2RkdKDuy5KZ9qwRE5LiNWNuERECMNy792tI+f5b8R1bloCcZ11MVRE\npLGwK/cpQxOZMjTR6xgiIiFNtysUEYlAKncRkQikchcRiUAqdxGRCKRyFxGJQCp3EZEIpHIXEYlA\nKncRkQhkzjlvNmxWCuSf4n8eD+wLYpxgUa6To1wnL1SzKdfJCSRXinMuoalBnpV7IMxsrXMuw+sc\njSnXyVGukxeq2ZTr5LRGLp2WERGJQCp3EZEIFK7l/oTXAU5AuU6Ocp28UM2mXCenxXOF5Tl3ERH5\ncuF65C4iIl8ipMvdzGaYWY6Z5ZrZncd5PcbMnve/vtrMUkMk13VmVmpmG/xf17dSrifNrMTMsk/w\nupnZn/y5PzazsSGSa4qZlTfYX/e2QqYBZrbSzLaY2WYz+8FxxrT6/mpmLi/2V0cz+8jMNvpz/fI4\nY1r9/djMXJ68H/3bjjKz9Wb2+nFea9n95ZwLyS8gCtgBDAI6ABuB4Y3GfA94zP/9XOD5EMl1HfCw\nB/vsXGAskH2C12cCbwIGTARWh0iuKcDrrbyv+gJj/d93BbYf5/9jq++vZubyYn8ZEOv/vj2wGpjY\naIwX78fm5PLk/ejf9u3A4uP9/2rp/RXKR+7jgVzn3E7nXDXwHDC70ZjZwNP+718CplnLL6LanFye\ncM69D5R9yZDZwAJXbxXQ3cz6hkCuVuec2+OcW
+f//hCwFejfaFir769m5mp1/n1w2P+wvf+r8QW7\nVn8/NjOXJ8wsCbgY+OsJhrTo/grlcu8P7GrwuJAv/iP/fIxzrhYoB3qFQC6AS/1/yr9kZgNaOFNz\nNTe7Fyb5/7R+08xOb80N+/8cHkP9UV9Dnu6vL8kFHuwv/ymGDUAJ8LZz7oT7qxXfj83JBd68H/8A\n/ATwneD1Ft1foVzu4ew1INU5dwbwNv/+7SzHt476j1SPAv4MLGmtDZtZLPA34IfOuYrW2m5Tmsjl\nyf5yztU550YDScB4MxvRGtttSjNytfr70cy+CpQ457JaelsnEsrlXgQ0/A2b5H/uuGPMLBqIA/Z7\nncs5t985d8z/8K/AuBbO1FzN2aetzjlX8dmf1s65ZUB7M4tv6e2aWXvqC/QZ59zLxxniyf5qKpdX\n+6vB9g8CK4EZjV7y4v3YZC6P3o9nAbPMLI/6U7dTzWxRozEtur9CudzXAGlmNtDMOlB/wWFpozFL\ngWv9338DWOH8Vye8zNXovOws6s+bhoKlwDX+WSATgXLn3B6vQ5lZn8/ONZrZeOr/XbZoKfi39/+A\nrc65h04wrNX3V3NyebS/Esysu//7TsAFwLZGw1r9/dicXF68H51zdznnkpxzqdR3xArn3FWNhrXo\n/ooO1g9FA7IgAAAAzElEQVQKNudcrZndAiynfobKk865zWY2H1jrnFtK/ZtgoZnlUn/Bbm6I5LrV\nzGYBtf5c17V0LgAze5b6mRTxZlYI/Jz6C0w45x4DllE/AyQXqAS+HSK5vgHcbGa1wFFgbiv8kj4L\nuBrY5D9fC/BTILlBLi/2V3NyebG/+gJPm1kU9b9MXnDOve71+7GZuTx5Px5Pa+4vfUJVRCQChfJp\nGREROUUqdxGRCKRyFxGJQCp3EZEIpHIXEYlAKncRkQikchcRiUAqdxGRCPT/Aav+BFWloAfPAAAA\nAElFTkSuQmCC\n",
153 | "text/plain": [
154 | ""
155 | ]
156 | },
157 | "metadata": {},
158 | "output_type": "display_data"
159 | },
160 | {
161 | "name": "stdout",
162 | "output_type": "stream",
163 | "text": [
164 | "sorted_w_cumsum: [ 0.08344577 0.18746722 0.33636066 0.55331968 1. ]\n"
165 | ]
166 | }
167 | ],
168 | "source": [
169 | "sorted_w_cumsum = np.cumsum(sorted_w)\n",
170 | "plt.plot(sorted_w_cumsum); plt.show()\n",
171 | "print ('sorted_w_cumsum: ', sorted_w_cumsum)"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "4) Now find the index when cumsum hits 0.5:"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 8,
184 | "metadata": {},
185 | "outputs": [
186 | {
187 | "data": {
188 | "text/plain": [
189 | "3"
190 | ]
191 | },
192 | "execution_count": 8,
193 | "metadata": {},
194 | "output_type": "execute_result"
195 | }
196 | ],
197 | "source": [
198 | "idx = np.where(sorted_w_cumsum>0.5)[0][0]\n",
199 | "idx"
200 | ]
201 | },
202 | {
203 | "cell_type": "markdown",
204 | "metadata": {},
205 | "source": [
206 | "5) Finally, your answer is sample at that position:"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 9,
212 | "metadata": {},
213 | "outputs": [
214 | {
215 | "data": {
216 | "text/plain": [
217 | "35"
218 | ]
219 | },
220 | "execution_count": 9,
221 | "metadata": {},
222 | "output_type": "execute_result"
223 | }
224 | ],
225 | "source": [
226 | "pos = idxs[idx]\n",
227 | "x[pos]"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 10,
233 | "metadata": {},
234 | "outputs": [
235 | {
236 | "name": "stdout",
237 | "output_type": "stream",
238 | "text": [
239 | "Data: [17 91 35 73 51]\n",
240 | "Sorted data: [17 35 51 73 91]\n",
241 | "Weighted median: 35, Median: 51\n"
242 | ]
243 | }
244 | ],
245 | "source": [
246 | "print('Data: ', x)\n",
247 | "print('Sorted data: ', np.sort(x))\n",
248 | "print('Weighted median: %d, Median: %d' %(x[pos], np.median(x)))"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "Thats it! "
256 | ]
257 | },
258 | {
259 | "cell_type": "markdown",
260 | "metadata": {},
261 | "source": [
262 | "If the procedure looks surprising for you, try to do steps 2--5 assuming the weights are $w_i=\\frac{1}{N}$. That way you will find a simple median (not weighted) of the data. "
263 | ]
264 | }
265 | ],
266 | "metadata": {
267 | "anaconda-cloud": {},
268 | "kernelspec": {
269 | "display_name": "Python 3",
270 | "language": "python",
271 | "name": "python3"
272 | },
273 | "language_info": {
274 | "codemirror_mode": {
275 | "name": "ipython",
276 | "version": 3
277 | },
278 | "file_extension": ".py",
279 | "mimetype": "text/x-python",
280 | "name": "python",
281 | "nbconvert_exporter": "python",
282 | "pygments_lexer": "ipython3",
283 | "version": "3.6.0"
284 | }
285 | },
286 | "nbformat": 4,
287 | "nbformat_minor": 2
288 | }
289 |
--------------------------------------------------------------------------------
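Taken together, the steps above amount to one short helper. The following is a minimal self-contained sketch, not a cell from the notebook itself: it sorts the samples by value, accumulates the normalized weights in that order, and returns the first sample whose cumulative weight exceeds 0.5. With uniform weights it reduces to the plain median, as the notebook's closing remark suggests.

    import numpy as np

    def weighted_median(x, w):
        # 1) order the samples by value
        idxs = np.argsort(x)
        # 2-3) normalize the weights and accumulate them in sorted order
        sorted_w_cumsum = np.cumsum(w[idxs]) / np.sum(w)
        # 4) first position where the cumulative weight exceeds 0.5
        idx = np.where(sorted_w_cumsum > 0.5)[0][0]
        # 5) the answer is the sample at that position
        return x[idxs[idx]]

    x = np.array([17, 91, 35, 73, 51])
    print(weighted_median(x, np.ones_like(x)))  # uniform weights recover the plain median: 51
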
/Reading_materials/Metrics_video8_soft_kappa_xgboost.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Soft Kappa objective"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "In this notebook you can find an implementation for \"soft kappa\" loss and objective from [this paper](https://arxiv.org/abs/1509.07107). "
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 1,
20 | "metadata": {
21 | "collapsed": true
22 | },
23 | "outputs": [],
24 | "source": [
25 | "def soft_kappa_grad_hess(y, p):\n",
26 | " '''\n",
27 | " Returns first and second derivatives of the objective with respect to predictions `p`. \n",
28 | " `y` is a vector of corresponding target labels. \n",
29 | " '''\n",
30 | " norm = p.dot(p) + y.dot(y)\n",
31 | " \n",
32 | " grad = -2 * y / norm + 4 * p * np.dot(y, p) / (norm ** 2)\n",
33 | " hess = 8 * p * y / (norm ** 2) + 4 * np.dot(y, p) / (norm ** 2) - (16 * p ** 2 * np.dot(y, p)) / (norm ** 3)\n",
34 | " return grad, hess"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 2,
40 | "metadata": {
41 | "collapsed": true
42 | },
43 | "outputs": [],
44 | "source": [
45 | "def soft_kappa(preds, dtrain):\n",
46 | " '''\n",
47 | " Having predictions `preds` and targets `dtrain.get_label()` this function coumputes soft kappa loss.\n",
48 | " NOTE, that it assumes `mean(target) = 0`.\n",
49 | " \n",
50 | " '''\n",
51 | " target = dtrain.get_label()\n",
52 | " return 'kappa' , -2 * target.dot(preds) / (target.dot(target) + preds.dot(preds))"
53 | ]
54 | }
55 | ],
56 | "metadata": {
57 | "kernelspec": {
58 | "display_name": "Python 3",
59 | "language": "python",
60 | "name": "python3"
61 | },
62 | "language_info": {
63 | "codemirror_mode": {
64 | "name": "ipython",
65 | "version": 3
66 | },
67 | "file_extension": ".py",
68 | "mimetype": "text/x-python",
69 | "name": "python",
70 | "nbconvert_exporter": "python",
71 | "pygments_lexer": "ipython3",
72 | "version": "3.6.0"
73 | }
74 | },
75 | "nbformat": 4,
76 | "nbformat_minor": 1
77 | }
78 |
--------------------------------------------------------------------------------
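To train with this loss, both functions can be plugged into xgboost's custom-objective hooks. The sketch below is an assumed usage, not code from the notebook: it reuses `soft_kappa_grad_hess` and `soft_kappa` as defined above, `soft_kappa_obj` merely adapts the first of them to the `(preds, dtrain)` signature that `xgb.train` expects, and the synthetic `X`, `y` data is a hypothetical stand-in (centered so that `mean(target) = 0`, as the loss assumes).

    import numpy as np
    import xgboost as xgb

    def soft_kappa_obj(preds, dtrain):
        # adapt soft_kappa_grad_hess to xgboost's (preds, dtrain) objective signature
        grad, hess = soft_kappa_grad_hess(dtrain.get_label(), preds)
        return grad, hess

    # hypothetical synthetic data with a centered target (assumption, not from the notebook)
    X = np.random.randn(200, 5)
    y = X[:, 0] + 0.1 * np.random.randn(200)
    dtrain = xgb.DMatrix(X, label=y - y.mean())

    bst = xgb.train({'eta': 0.1, 'base_score': 0.0}, dtrain,
                    num_boost_round=50,
                    evals=[(dtrain, 'train')],
                    obj=soft_kappa_obj,   # custom gradient and hessian
                    feval=soft_kappa)     # custom evaluation metric from above

Note that soft kappa is not convex, so the hessian returned by `soft_kappa_grad_hess` is not guaranteed to be positive; treat this as a starting point rather than a drop-in objective.
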
/readonly/KNN_features_data/X.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hse-aml/competitive-data-science/1d8feaa8c53206ba951e5db5357cf2bdbd1e0cc3/readonly/KNN_features_data/X.npz
--------------------------------------------------------------------------------
/readonly/KNN_features_data/X_test.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hse-aml/competitive-data-science/1d8feaa8c53206ba951e5db5357cf2bdbd1e0cc3/readonly/KNN_features_data/X_test.npz
--------------------------------------------------------------------------------
/readonly/KNN_features_data/Y.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hse-aml/competitive-data-science/1d8feaa8c53206ba951e5db5357cf2bdbd1e0cc3/readonly/KNN_features_data/Y.npy
--------------------------------------------------------------------------------
/readonly/KNN_features_data/Y_test.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hse-aml/competitive-data-science/1d8feaa8c53206ba951e5db5357cf2bdbd1e0cc3/readonly/KNN_features_data/Y_test.npy
--------------------------------------------------------------------------------
/readonly/KNN_features_data/knn_feats_test_first50.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hse-aml/competitive-data-science/1d8feaa8c53206ba951e5db5357cf2bdbd1e0cc3/readonly/KNN_features_data/knn_feats_test_first50.npy
--------------------------------------------------------------------------------
/readonly/final_project_data/item_categories.csv:
--------------------------------------------------------------------------------
1 | item_category_name,item_category_id
2 | PC - Гарнитуры/Наушники,0
3 | Аксессуары - PS2,1
4 | Аксессуары - PS3,2
5 | Аксессуары - PS4,3
6 | Аксессуары - PSP,4
7 | Аксессуары - PSVita,5
8 | Аксессуары - XBOX 360,6
9 | Аксессуары - XBOX ONE,7
10 | Билеты (Цифра),8
11 | Доставка товара,9
12 | Игровые консоли - PS2,10
13 | Игровые консоли - PS3,11
14 | Игровые консоли - PS4,12
15 | Игровые консоли - PSP,13
16 | Игровые консоли - PSVita,14
17 | Игровые консоли - XBOX 360,15
18 | Игровые консоли - XBOX ONE,16
19 | Игровые консоли - Прочие,17
20 | Игры - PS2,18
21 | Игры - PS3,19
22 | Игры - PS4,20
23 | Игры - PSP,21
24 | Игры - PSVita,22
25 | Игры - XBOX 360,23
26 | Игры - XBOX ONE,24
27 | Игры - Аксессуары для игр,25
28 | Игры Android - Цифра,26
29 | Игры MAC - Цифра,27
30 | Игры PC - Дополнительные издания,28
31 | Игры PC - Коллекционные издания,29
32 | Игры PC - Стандартные издания,30
33 | Игры PC - Цифра,31
34 | "Карты оплаты (Кино, Музыка, Игры)",32
35 | Карты оплаты - Live!,33
36 | Карты оплаты - Live! (Цифра),34
37 | Карты оплаты - PSN,35
38 | Карты оплаты - Windows (Цифра),36
39 | Кино - Blu-Ray,37
40 | Кино - Blu-Ray 3D,38
41 | Кино - Blu-Ray 4K,39
42 | Кино - DVD,40
43 | Кино - Коллекционное,41
44 | "Книги - Артбуки, энциклопедии",42
45 | Книги - Аудиокниги,43
46 | Книги - Аудиокниги (Цифра),44
47 | Книги - Аудиокниги 1С,45
48 | Книги - Бизнес литература,46
49 | "Книги - Комиксы, манга",47
50 | Книги - Компьютерная литература,48
51 | Книги - Методические материалы 1С,49
52 | Книги - Открытки,50
53 | Книги - Познавательная литература,51
54 | Книги - Путеводители,52
55 | Книги - Художественная литература,53
56 | Книги - Цифра,54
57 | Музыка - CD локального производства,55
58 | Музыка - CD фирменного производства,56
59 | Музыка - MP3,57
60 | Музыка - Винил,58
61 | Музыка - Музыкальное видео,59
62 | Музыка - Подарочные издания,60
63 | Подарки - Атрибутика,61
64 | "Подарки - Гаджеты, роботы, спорт",62
65 | Подарки - Мягкие игрушки,63
66 | Подарки - Настольные игры,64
67 | Подарки - Настольные игры (компактные),65
68 | "Подарки - Открытки, наклейки",66
69 | Подарки - Развитие,67
70 | "Подарки - Сертификаты, услуги",68
71 | Подарки - Сувениры,69
72 | Подарки - Сувениры (в навеску),70
73 | "Подарки - Сумки, Альбомы, Коврики д/мыши",71
74 | Подарки - Фигурки,72
75 | Программы - 1С:Предприятие 8,73
76 | Программы - MAC (Цифра),74
77 | Программы - Для дома и офиса,75
78 | Программы - Для дома и офиса (Цифра),76
79 | Программы - Обучающие,77
80 | Программы - Обучающие (Цифра),78
81 | Служебные,79
82 | Служебные - Билеты,80
83 | Чистые носители (шпиль),81
84 | Чистые носители (штучные),82
85 | Элементы питания,83
86 |
--------------------------------------------------------------------------------
/readonly/final_project_data/sales_train.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hse-aml/competitive-data-science/1d8feaa8c53206ba951e5db5357cf2bdbd1e0cc3/readonly/final_project_data/sales_train.csv.gz
--------------------------------------------------------------------------------
/readonly/final_project_data/sample_submission.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hse-aml/competitive-data-science/1d8feaa8c53206ba951e5db5357cf2bdbd1e0cc3/readonly/final_project_data/sample_submission.csv.gz
--------------------------------------------------------------------------------
/readonly/final_project_data/shops.csv:
--------------------------------------------------------------------------------
1 | shop_name,shop_id
2 | "!Якутск Орджоникидзе, 56 фран",0
3 | "!Якутск ТЦ ""Центральный"" фран",1
4 | "Адыгея ТЦ ""Мега""",2
5 | "Балашиха ТРК ""Октябрь-Киномир""",3
6 | "Волжский ТЦ ""Волга Молл""",4
7 | "Вологда ТРЦ ""Мармелад""",5
8 | "Воронеж (Плехановская, 13)",6
9 | "Воронеж ТРЦ ""Максимир""",7
10 | "Воронеж ТРЦ Сити-Парк ""Град""",8
11 | Выездная Торговля,9
12 | Жуковский ул. Чкалова 39м?,10
13 | Жуковский ул. Чкалова 39м²,11
14 | Интернет-магазин ЧС,12
15 | "Казань ТЦ ""Бехетле""",13
16 | "Казань ТЦ ""ПаркХаус"" II",14
17 | "Калуга ТРЦ ""XXI век""",15
18 | "Коломна ТЦ ""Рио""",16
19 | "Красноярск ТЦ ""Взлетка Плаза""",17
20 | "Красноярск ТЦ ""Июнь""",18
21 | "Курск ТЦ ""Пушкинский""",19
22 | "Москва ""Распродажа""",20
23 | "Москва МТРЦ ""Афи Молл""",21
24 | Москва Магазин С21,22
25 | "Москва ТК ""Буденовский"" (пав.А2)",23
26 | "Москва ТК ""Буденовский"" (пав.К7)",24
27 | "Москва ТРК ""Атриум""",25
28 | "Москва ТЦ ""Ареал"" (Беляево)",26
29 | "Москва ТЦ ""МЕГА Белая Дача II""",27
30 | "Москва ТЦ ""МЕГА Теплый Стан"" II",28
31 | "Москва ТЦ ""Новый век"" (Новокосино)",29
32 | "Москва ТЦ ""Перловский""",30
33 | "Москва ТЦ ""Семеновский""",31
34 | "Москва ТЦ ""Серебряный Дом""",32
35 | "Мытищи ТРК ""XL-3""",33
36 | "Н.Новгород ТРЦ ""РИО""",34
37 | "Н.Новгород ТРЦ ""Фантастика""",35
38 | "Новосибирск ТРЦ ""Галерея Новосибирск""",36
39 | "Новосибирск ТЦ ""Мега""",37
40 | "Омск ТЦ ""Мега""",38
41 | "РостовНаДону ТРК ""Мегацентр Горизонт""",39
42 | "РостовНаДону ТРК ""Мегацентр Горизонт"" Островной",40
43 | "РостовНаДону ТЦ ""Мега""",41
44 | "СПб ТК ""Невский Центр""",42
45 | "СПб ТК ""Сенная""",43
46 | "Самара ТЦ ""Мелодия""",44
47 | "Самара ТЦ ""ПаркХаус""",45
48 | "Сергиев Посад ТЦ ""7Я""",46
49 | "Сургут ТРЦ ""Сити Молл""",47
50 | "Томск ТРЦ ""Изумрудный Город""",48
51 | "Тюмень ТРЦ ""Кристалл""",49
52 | "Тюмень ТЦ ""Гудвин""",50
53 | "Тюмень ТЦ ""Зеленый Берег""",51
54 | "Уфа ТК ""Центральный""",52
55 | "Уфа ТЦ ""Семья"" 2",53
56 | "Химки ТЦ ""Мега""",54
57 | Цифровой склад 1С-Онлайн,55
58 | "Чехов ТРЦ ""Карнавал""",56
59 | "Якутск Орджоникидзе, 56",57
60 | "Якутск ТЦ ""Центральный""",58
61 | "Ярославль ТЦ ""Альтаир""",59
62 |
--------------------------------------------------------------------------------
/readonly/final_project_data/test.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hse-aml/competitive-data-science/1d8feaa8c53206ba951e5db5357cf2bdbd1e0cc3/readonly/final_project_data/test.csv.gz
--------------------------------------------------------------------------------